Introducing the pd-helper Package for Pandas DataFrames
A Python package to run auto optimization of a Pandas DataFrame.
Reduce DataFrame Memory Consumption by ~70% and configure dtypes automatically for best precision with pd-helper.
I released a Python package in beta, and this is the announcement.
The project is up on PyPI here: https://pypi.org/project/pd-helper/
Also on GitHub here: https://github.com/justinhchae/pd-helper
Install:
pip install pd-helper
Basic Usage:
from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df)
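To give a feel for where this kind of memory saving comes from, here is a minimal sketch that uses only plain pandas. It is not pd-helper's implementation: `downcast_sketch` is a hypothetical stand-in, and the 50% cardinality cutoff for converting text columns to `category` is an arbitrary assumption chosen for illustration. The example DataFrame and its column names are made up as well.

```python
import numpy as np
import pandas as pd


def downcast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in only, NOT pd-helper's actual logic."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Pick the smallest integer type that still holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 where pandas considers it safe.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_string_dtype(out[col]):
            # Assumed rule: text with few distinct values becomes category.
            if out[col].nunique() < 0.5 * len(out):
                out[col] = out[col].astype("category")
    return out


# Made-up data: an integer, a float, and a low-cardinality text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "count": rng.integers(0, 100, size=10_000),
    "price": rng.random(10_000),
    "city": rng.choice(["Chicago", "Boston", "Austin"], size=10_000),
})

before = df.memory_usage(deep=True).sum()
small = downcast_sketch(df)
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
print(small.dtypes)
```

Passing `deep=True` to `memory_usage` matters here, because without it pandas does not count the memory held by the Python strings in a text column. Downcasting floats trades precision for size, so whether it is appropriate depends on the data.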
Better Usage With Multiprocessing:
from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df, enable_mp=True)
Specify Special Mappings:
from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    special_mappings = {'string': ['col_1', 'col_2'],
                        'category': ['col_3', 'col_4']}

    # special mappings will be applied
    df = optimize(df,
                  enable_mp=True,
                  special_mappings=special_mappings)
Add pd-helper to your pipeline:
from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    special_mappings = {'string': ['col_1', 'col_2'],
                        'category': ['col_3', 'col_4']}
    exclude_cols = ['col_5']

    df = (df.pipe(some_other_function)
            .pipe(optimize,
                  special_mappings=special_mappings,
                  parse_col_names=True,
                  exclude_cols=exclude_cols,
                  enable_mp=True))
I wrote this package because I kept writing and copying the same code among different projects. I hope this works not only for me but for everyone else working with lots of data!
Prior versions of this idea and code are discussed in my earlier articles.
Originally I created this project as an application, but I think it has much more utility as a package for developers.
I started building multiprocessing into my functions as a way to speed things up. Currently, with multiprocessing enabled, the pd-helper optimize function can run its optimizations on many columns at the same time.
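The general idea of working on columns in parallel can be sketched with the standard library's `concurrent.futures`. This is only an illustration under my own assumptions, not how pd-helper does it internally: `shrink_column` and `shrink_frame` are hypothetical names, and the `enable_mp` flag here merely mirrors the spelling of the package's option.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def shrink_column(col: pd.Series) -> pd.Series:
    """Downcast one numeric column; return anything else unchanged."""
    if pd.api.types.is_integer_dtype(col):
        return pd.to_numeric(col, downcast="integer")
    if pd.api.types.is_float_dtype(col):
        return pd.to_numeric(col, downcast="float")
    return col


def shrink_frame(df: pd.DataFrame, enable_mp: bool = False,
                 max_workers=None) -> pd.DataFrame:
    """Hypothetical helper: apply shrink_column to each column, optionally in parallel."""
    # Select by position so duplicate column names still yield Series.
    columns = [df.iloc[:, i] for i in range(df.shape[1])]
    if not columns:
        return df.copy()
    if enable_mp:
        # Each column is pickled, sent to a worker process, and sent back.
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(shrink_column, columns))
    else:
        results = [shrink_column(col) for col in columns]
    # map() preserves order, so the columns come back in their original order.
    return pd.concat(results, axis=1)
```

As with the usage examples above, a call such as `shrink_frame(df, enable_mp=True)` belongs under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module when they start. Sending columns between processes has its own cost, so the parallel path tends to pay off only on larger DataFrames.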
Conclusion
I wrote a package to help with managing DataFrame data. It's called pd-helper, so check it out and let me know if it works for you!