A Python package that automatically optimizes a Pandas DataFrame.
Reduce DataFrame memory consumption by ~70% and configure dtypes automatically for the best precision with pd-helper.
The project is available on PyPI here: https://pypi.org/project/pd-helper/
Also on GitHub here: https://github.com/justinhchae/pd-helper
pip install pd-helper
from pd_helper.helper import optimize
if __name__ == "__main__":
# some DataFrame, df
df = optimize(df)
from pd_helper.helper import optimize

if __name__ == "__main__":
# some DataFrame, df
df = optimize(df, enable_mp=True)
from pd_helper.helper import optimize

if __name__ == "__main__":
# some DataFrame, df
special_mappings = {'string': ['col_1', 'col_2'],
'category': ['col_3', 'col_4']} # special mappings will be applied…
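The snippet above is truncated, but the core idea of a special mapping — a dict of dtype to column list, applied to the DataFrame — can be sketched in plain pandas. This is a hedged sketch of the concept, not pd-helper's internals, and the column names are illustrative:

```python
import pandas as pd

# toy DataFrame with the illustrative column names from the mapping
df = pd.DataFrame({
    'col_1': ['a', 'b'], 'col_2': ['c', 'd'],
    'col_3': ['x', 'y'], 'col_4': ['x', 'y'],
})

# dtype -> list of columns, mirroring the special_mappings dict above
special_mappings = {'string': ['col_1', 'col_2'],
                    'category': ['col_3', 'col_4']}

# apply each dtype to its mapped columns
for dtype, cols in special_mappings.items():
    df[cols] = df[cols].astype(dtype)

print(df.dtypes.to_dict())
```

Casting low-cardinality object columns to `category` is where most of the memory savings typically come from.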
How to approach program design with multiprocessing?
In a recent project, I stumbled across some clever ways to boost the speed of forecasting models such as ARIMA and Facebook Prophet and shared the results with Towards Data Science. However, while the performance results and insights are interesting, my prior article primarily focuses on the high-level concept of combining multiple types of forecasting with multiprocessing and does not focus on how to approach the design of such a program. …
I recently built and deployed a data application from my Mac using Python and did not write one line of HTML, CSS, or JavaScript — all made possible with a nifty package called Streamlit.
I recently deployed an early alpha version of a data dashboard with Streamlit. Although the app is in its infancy, I’m happy to have it in some kind of production state and I am excited to continue improving it. Based on my experience, this article contains some of the biggest gotchas and tips on breaking through a few final issues that you may encounter with…
How to transform a Pandas DataFrame to JSON with flare-like hierarchy to produce D3 Sunburst visualizations.
If your data and insights are worth telling, the right colors and design can make or break the crucial connection to your audiences. As a result, whether you are trying to win a competition or just trying to turn in a class assignment, chances are that colors can help make your case.
As I learned from personal experience, tricky data transformations on the back-end often put the desired visualization out of reach. For example, despite various libraries in Pandas, I was surprised to discover…
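One way the DataFrame-to-flare transformation can be sketched: recursively group the frame by each hierarchy level, emitting `{'name', 'children'}` nodes and `{'name', 'value'}` leaves, which is the shape D3 sunburst charts expect. This is a hypothetical helper with made-up column names, not the article's code:

```python
import pandas as pd

def df_to_flare(df, levels, value_col):
    """Nest grouped rows into D3's flare format:
    {'name': ..., 'children': [...]} with {'name': ..., 'value': ...} leaves."""
    def build(frame, remaining):
        head, *rest = remaining
        children = []
        for name, group in frame.groupby(head, sort=False):
            if rest:
                children.append({'name': name, 'children': build(group, rest)})
            else:
                children.append({'name': name, 'value': int(group[value_col].sum())})
        return children
    return {'name': 'flare', 'children': build(df, levels)}

# illustrative data: region -> city hierarchy with a sales measure
df = pd.DataFrame({'region': ['East', 'East', 'West'],
                   'city': ['NYC', 'Boston', 'LA'],
                   'sales': [10, 5, 7]})
flare = df_to_flare(df, ['region', 'city'], 'sales')
```

Serializing `flare` with `json.dumps` then yields a file that D3's sunburst examples can load directly.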
When forecasting time series data, sometimes you need both speed and accuracy. When working from a laptop with Python, try multiprocessing to get the best of both worlds.
Most of the time you have plenty of data. Some of the time, you have plenty of time to analyze the data…
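The general pattern can be sketched as: split the data into independent series, then map a fit function across a worker pool. The sketch below uses a placeholder mean "forecast" and a `ThreadPool` so it stays self-contained; for CPU-bound models like ARIMA or Prophet you would swap in `multiprocessing.Pool`:

```python
from multiprocessing.pool import ThreadPool  # swap for multiprocessing.Pool for CPU-bound fits
import pandas as pd

def fit_one(item):
    # placeholder "model": a naive mean forecast per series
    name, series = item
    return name, series.mean()

# illustrative long-format data with one row per observation
df = pd.DataFrame({'series_id': ['a', 'a', 'b', 'b'],
                   'y': [1.0, 3.0, 10.0, 20.0]})

# partition the work: one (name, series) task per series_id
groups = [(name, g['y']) for name, g in df.groupby('series_id')]

with ThreadPool(processes=2) as pool:
    forecasts = dict(pool.map(fit_one, groups))

print(forecasts)  # {'a': 2.0, 'b': 15.0}
```

Because each series fits independently, the speedup scales roughly with the number of workers until you run out of cores.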
I built a simple app that takes a csv file and returns a memory-optimized pickle file for use as a Pandas DataFrame — this story shares my experience building with Streamlit, describes the problem to be solved, and promotes a prototype of the app.
This article may be for you if you are working with large CSV files, Pandas DataFrames, and Python.
Edit on 7 April 2021: Since writing this article, I have created and deployed a Python package called pd-helper that runs an optimized version of the code discussed in this story. You can search for it as “pd-helper”…
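The core CSV-to-optimized-pickle idea can be sketched in a few lines: read the CSV, downcast numeric dtypes, then pickle so the optimized dtypes survive a round trip (unlike re-reading a CSV). This is a hedged sketch of the concept, not the app's or pd-helper's actual code:

```python
import io
import pandas as pd

# stand-in for an uploaded CSV file
csv_bytes = b"id,flag,score\n1,True,0.5\n2,False,0.25\n"
df = pd.read_csv(io.BytesIO(csv_bytes))

# downcast numerics to the smallest dtype that fits the values
for col in df.select_dtypes('integer'):
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes('float'):
    df[col] = pd.to_numeric(df[col], downcast='float')

# pickle round trip preserves the optimized dtypes
buf = io.BytesIO()
df.to_pickle(buf)
buf.seek(0)
restored = pd.read_pickle(buf)
print(restored.dtypes.to_dict())
```

In a real app the buffer would be a file path, but the dtype-preserving behavior is the point.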
When something is nothing, and nothing is something…for boolean data in Pandas, there is a crucial difference between NaN, Null, NA, and bools — a brief on when and how to use them.
Task: Clean a Pandas DataFrame comprising boolean (true/false) values to optimize memory. A constraint is to retain all null values as nulls, i.e. don’t turn null values to False because that is a meaningful change.
Action: Explicitly transform column dtypes, i.e. use float32 instead of float64 to conserve memory and bool instead of object.
Problem: When transforming selected columns to bool, all rows evaluate to either all True…
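The problem and one fix can be illustrated in a few lines: a plain `bool` cast treats `NaN` as truthy, silently turning missing values into `True`, while pandas' nullable `'boolean'` dtype keeps them as `<NA>`. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

# an object column of true/false values with one missing entry
s = pd.Series([True, False, np.nan], dtype='object')

# naive cast: NaN is truthy, so the missing value becomes True
naive = s.astype(bool)
print(naive.tolist())  # [True, False, True]

# nullable 'boolean' dtype keeps the missing value as <NA>
safe = s.astype('boolean')
print(safe.tolist())
```

The nullable dtype still packs down far smaller than `object`, so you keep both the memory savings and the nulls.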
In Brief: Create time series plots with regression trend lines by leveraging Pandas Groupby(), for-loops, and Plotly Scatter Graph Objects in combination with Plotly Express Trend Lines.
What happens when you deploy a data app without coding for memory optimization? In my case, at least, the app crashed and I spent days painfully refactoring code. If you are luckier or smarter (or both), then you have nothing to worry about. Otherwise, consider lessons from my mistakes and some helpful resources to avoid your own special headaches.
How to avoid my optimization mistakes to deploy your app for the win.
I have always acknowledged the importance of writing optimized code but I did not fully appreciate what it meant until deploying a Web app. On my laptop, even…
The easy choice is to drop missing or erroneous data, but at what cost?
Dealing with missing, null, or erroneous values is one of the most painful and common exercises that we encounter in data science and in machine learning. In some cases, it is acceptable or even preferred to drop such records — algorithms are fragile and can seize up on missing values. However, in other cases, when the inclusion of every record is important, what then? The easy choice is to drop missing or erroneous data, but at what cost?
What if a single record represents a real person…
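The cost of the easy choice is concrete: with missing values scattered across columns, `dropna()` can discard most of a dataset even though each row is only partially missing. A small illustration with made-up data, contrasting dropping against one simple alternative (median imputation):

```python
import numpy as np
import pandas as pd

# four records, each missing at most one field
df = pd.DataFrame({'age': [34.0, np.nan, 51.0, np.nan],
                   'income': [50_000.0, 62_000.0, np.nan, 48_000.0]})

dropped = df.dropna()            # keeps only fully-complete rows
imputed = df.fillna(df.median()) # one alternative: fill with column medians

print(f"dropna keeps {len(dropped)} of {len(df)} records")
```

Here dropping keeps just one of four records, so three real people would vanish from the analysis.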
@justinhchae MS AI/Law Northwestern. https://github.com/justinhchae | www.linkedin.com/in/justin-chae | Expressing my personal views. Seeking Sum ’21 Internship