A Python package that automatically optimizes a Pandas DataFrame.

Photo by Matthew Waring on Unsplash

Reduce DataFrame Memory Consumption by ~70% and configure dtypes automatically for best precision with pd-helper.

I released a Python package in beta; this is the announcement.

The project is up on PyPi here: https://pypi.org/project/pd-helper/

Also on GitHub here: https://github.com/justinhchae/pd-helper


pip install pd-helper

Basic Usage:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df)

Better Usage With Multiprocessing:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df, enable_mp=True)

Specify Special Mappings:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    special_mappings = {'string': ['col_1', 'col_2'],
                        'category': ['col_3', 'col_4']}
    # special mappings will be applied
    df = optimize(df, special_mappings=special_mappings)

Stop waiting around for loops to end — speed up iterations by running tasks in parallel with multiprocessing.

Photo by Peggy Anke on Unsplash

How to approach program design with multiprocessing?

In a recent project, I stumbled across some clever ways to boost the speed of forecasting models such as ARIMA and Facebook Prophet and shared the results with Towards Data Science. However, while the performance results and insights are interesting, my prior article primarily focuses on the high-level concept of combining multiple types of forecasting with multiprocessing and does not focus on how to approach the design of such a program.

Why Streamlit and a few tips on deploying a data dashboard app from Python.

Photo by Mark Cruz on Unsplash

I recently built and deployed a data application from my Mac using Python and did not write one line of HTML, CSS, or JavaScript — all made possible with a nifty package called Streamlit.

This article contains a few notes on what worked and what did not work when deploying code with Streamlit.

I recently deployed an early alpha version of a data dashboard with Streamlit. Although the app is in its infancy, I’m happy to have it in some kind of production state and I am excited to continue improving on it. Based on my experience, this article contains some of the biggest gotchas and tips on breaking through a few final issues that you may encounter with…

How to transform a Pandas DataFrame to JSON with flare-like hierarchy to produce D3 Sunburst visualizations.

Photo by Jeremy Bishop on Unsplash

Data is king but colors are cool!

If your data and insights are worth telling, the right colors and design can make or break the crucial connection to your audiences. As a result, whether you are trying to win a competition or just trying to turn in a class assignment, chances are that colors can help make your case.

As I learned from personal experience, tricky data transformations on the back-end often put the desired visualization out of reach. For example, despite various libraries in Pandas, I was surprised to discover…
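To make the flare-like target concrete, here is one way to nest grouped counts into the `{"name": ..., "children": [...]}` shape that D3 sunburst charts expect. The column names and data are hypothetical, and this is only a sketch of the transformation, not the article's exact code:

```python
import pandas as pd

def df_to_flare(df, levels, value_col):
    """Recursively nest grouped sums into a flare-style dict:
    {"name": ..., "children": [...]}, as consumed by D3 sunbursts."""
    if not levels:
        return None
    children = []
    for key, group in df.groupby(levels[0]):
        node = {"name": key}
        child = df_to_flare(group, levels[1:], value_col)
        if child is None:
            # leaf level: attach the aggregated value
            node["value"] = int(group[value_col].sum())
        else:
            node["children"] = child["children"]
        children.append(node)
    return {"name": "flare", "children": children}

# hypothetical example data
df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "city": ["a", "b", "c"],
    "count": [10, 5, 7],
})
flare = df_to_flare(df, ["region", "city"], "count")
```

Serializing `flare` with `json.dump` then yields a file a D3 sunburst can read directly.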

Making Sense of Big Data

When and how to boost time series forecasting with ARIMA, Facebook Prophet, and PyTorch LSTM neural networks by pooling CPUs and computing in parallel with Python.

Photo by Thomas Kelley on Unsplash

When forecasting time series data, sometimes you need both speed and accuracy. When working from a laptop with Python, try multiprocessing to get the best of both worlds.

This Article May be For You If:

  • You are trying to optimize a multiprocessing problem in Python on your local machine
  • You are forecasting time series data with Statsmodels ARIMA, Facebook Prophet, or PyTorch LSTM
  • You are trying to decide whether multiprocessing is an optimal configuration with PyTorch LSTM if you don’t have access to a GPU

Have Data, Need Time

Most of the time you have plenty of data. Some of the time, you have plenty of time to analyze the data…

Making Sense of Big Data

Reduce Pandas DataFrame memory by 50% or more with this code and app.

Photo by Yulia Matvienko on Unsplash

I built a simple app that takes a csv file and returns a memory-optimized pickle file for use as a Pandas DataFrame — this story shares my experience building with Streamlit, describes the problem to be solved, and promotes a prototype of the app.

This article may be for you if you are working with large CSV files, Pandas DataFrames, and Python.

Edit on 7 April 2021: Since writing this article, I have created and deployed a Python package called pd-helper that runs an optimized version of the code discussed in this story. You can search for it as “pd-helper”…
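The idea behind that kind of optimization can be sketched in a few lines: downcast oversized numeric dtypes and convert low-cardinality strings to categories. The frame below is hypothetical, and this is only an illustration of the technique, not pd-helper's internals:

```python
import numpy as np
import pandas as pd

# hypothetical frame with oversized dtypes
df = pd.DataFrame({
    "small_ints": np.arange(1000, dtype="int64"),
    "labels": ["a", "b"] * 500,  # object dtype, only two unique values
})
before = df.memory_usage(deep=True).sum()

# downcast integers to the smallest dtype that fits the values
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="integer")
# low-cardinality strings are far cheaper as a category
df["labels"] = df["labels"].astype("category")

after = df.memory_usage(deep=True).sum()
```

Values 0–999 fit in `int16`, and the two-label object column collapses to a small category codes array, so `after` comes out well below `before`.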

The bool on when NA is not False, False is False, and NA is not Null.

Photo by Nerfee Mirandilla on Unsplash

When something is nothing, and nothing is something…for boolean data in Pandas, there is a crucial difference between NaN, Null, NA, and bools — a brief on when and how to use them.

Task: Clean a Pandas DataFrame comprising boolean (true/false) values to optimize memory. A constraint is to retain all null values as nulls, i.e. don’t turn null values to False because that is a meaningful change.

Action: Explicitly transform column dtypes, i.e. use float32 instead of float64 to conserve memory and bool instead of object.

Problem: When transforming selected columns to bool, all rows evaluate to either all True…

How to visualize area plots and trends lines over grouped time periods with interactive Plotly graph objects.

Photo by Hari Nandakumar on Unsplash

In Brief: Create time series plots with regression trend lines by leveraging Pandas Groupby(), for-loops, and Plotly Scatter Graph Objects in combination with Plotly Express Trend Lines.


  • Data: Counts of things or different groups of things by time.
  • Objective: Visualize a time series of data, by subgroup, on a daily, monthly, or yearly basis with a trend line.
  • Issues: Confusion over syntax for Plotly Express and Plotly Graph Objects and combining standard line charts with regression lines.
  • Environment: Python, Plotly, and Pandas
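The data-prep half of that recipe can be sketched without the plotting calls: group to monthly totals per subgroup, then fit the least-squares trend that Plotly Express computes under `trendline="ols"`. The data here is made up, and the `go.Scatter` traces that would render each line are only noted in a comment:

```python
import numpy as np
import pandas as pd

# hypothetical daily counts for two groups, one rising and one falling
rng = pd.date_range("2021-01-01", periods=120, freq="D")
df = pd.DataFrame({
    "date": list(rng) * 2,
    "group": ["a"] * 120 + ["b"] * 120,
    "count": list(range(120)) + list(range(120, 0, -1)),
})

# aggregate to monthly totals per subgroup, as you would before plotting
monthly = (df.groupby(["group", pd.Grouper(key="date", freq="MS")])["count"]
             .sum()
             .reset_index())

# fit a least-squares trend line per group
slopes = {}
for name, g in monthly.groupby("group"):
    x = np.arange(len(g))
    slope, intercept = np.polyfit(x, g["count"], 1)
    slopes[name] = slope
    # a plotly.graph_objects.Scatter trace of intercept + slope * x
    # would draw this trend line on top of the monthly series
```

Looping over `monthly.groupby("group")` like this is also how you would add one `go.Scatter` trace per subgroup to a single figure.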

Pitfalls to avoid when deploying a data app from Python with Streamlit.

Photo by Science in HD on Unsplash

Deploy Code — Crash App — Learn Lessons

What happens when you deploy a data app without coding for memory optimization? In my case, at least, the app crashed and I spent days painfully refactoring code. If you are luckier or smarter (or both), then you have nothing to worry about. Otherwise, consider lessons from my mistakes and some helpful resources to avoid your own special headaches.

How to avoid my optimization mistakes to deploy your app for the win.

Coding for the Web Vs. Coding for Me

I have always acknowledged the importance of writing optimized code but I did not fully appreciate what it meant until deploying a Web app. On my laptop, even…

Have wacky dates in your data? Instead of dropping or filtering them, impute or substitute them with a reasonable best guess.

Photo by Ramón Salinero on Unsplash

The easy choice is to drop missing or erroneous data, but at what cost?

Dealing with missing, null, or erroneous values is one of the most painful and common exercises that we encounter in data science and in machine learning. In some cases, it is acceptable or even preferred to drop such records — algorithms are fragile and can seize up on missing values. However, in other cases when the inclusion of every record is important, what then? The easy choice is to drop missing or erroneous data, but at what cost?

What if a single record represents a real person…
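One simple version of that best-guess approach: coerce unparseable dates to `NaT` instead of raising, then fill them with the median of the valid dates. The records below are hypothetical, and the median is just one reasonable imputation strategy among several:

```python
import pandas as pd

# hypothetical records, one with a wacky date
df = pd.DataFrame({"event_date": [
    "2021-01-05", "2021-01-07", "not a date", "2021-01-09",
]})

# coerce unparseable values to NaT instead of raising an error
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")

# impute missing dates with a reasonable best guess: the median valid date
median_date = df["event_date"].median()
df["event_date"] = df["event_date"].fillna(median_date)
```

Every record survives, and the imputed row lands in the middle of the observed range rather than being silently dropped.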

Justin Chae

@justinhchae MS AI/Law Northwestern. https://github.com/justinhchae | www.linkedin.com/in/justin-chae | Expressing my personal views. Seeking Sum ’21 Internship
