Introducing the pd-helper Package for Pandas DataFrames

A Python package that automatically optimizes a Pandas DataFrame.

Reduce DataFrame Memory Consumption by ~70% and configure dtypes automatically for best precision with pd-helper.

I released a Python package in beta, and this is the announcement.

The project is up on PyPI here: https://pypi.org/project/pd-helper/

Also on GitHub here: https://github.com/justinhchae/pd-helper

Edit 15 April 2021: Stable release 1.0.0 is now available.

Install:

pip install pd-helper

Basic Usage:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df)
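
To give a feel for where memory savings of this kind come from, here is a minimal sketch in plain pandas. It is not pd-helper's actual code: the function name downcast_frame and the 50% uniqueness threshold for switching to category are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd


def downcast_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where pandas allows it.

    Hypothetical helper for illustration only, not part of pd-helper.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Goes down to float32 only if the values survive the conversion.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif (pd.api.types.is_object_dtype(out[col])
              or pd.api.types.is_string_dtype(out[col])):
            # Assumed heuristic: repeated text values are cheaper as category.
            if out[col].nunique() < 0.5 * len(out):
                out[col] = out[col].astype("category")
    return out


if __name__ == "__main__":
    n = 1000
    df = pd.DataFrame({
        "small_ints": np.arange(n) % 100,
        "halves": np.arange(n) * 0.5,
        "labels": ["a", "b", "c", "d"] * (n // 4),
    })
    smaller = downcast_frame(df)
    print(df.memory_usage(deep=True).sum(), "bytes before")
    print(smaller.memory_usage(deep=True).sum(), "bytes after")
    print(smaller.dtypes)
```

How much you save depends entirely on the data; columns with wide value ranges or mostly unique strings will shrink far less than the toy columns here.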

Better Usage With Multiprocessing:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    df = optimize(df, enable_mp=True)

Specify Special Mappings:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df
    special_mappings = {'string': ['col_1', 'col_2'],
                        'category': ['col_3', 'col_4']}
    # special mappings will be applied
    df = optimize(df,
                  enable_mp=True,
                  special_mappings=special_mappings)
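
If the shape of special_mappings is unfamiliar, it may help to see what a mapping like this amounts to in plain pandas. The sketch below assumes the dictionary maps a dtype name to the columns that should receive it, which is what the example above suggests; apply_mappings is a hypothetical name of mine and is not how pd-helper is implemented.

```python
import pandas as pd


def apply_mappings(df: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Cast each listed column to the dtype it is filed under.

    Hypothetical helper for illustration only, not part of pd-helper.
    """
    out = df.copy()
    for dtype_name, cols in mappings.items():
        for col in cols:
            out[col] = out[col].astype(dtype_name)
    return out


if __name__ == "__main__":
    df = pd.DataFrame({
        "col_1": ["x", "y", "z"],
        "col_3": ["red", "red", "blue"],
        "other": [1, 2, 3],
    })
    mapped = apply_mappings(df, {"string": ["col_1"], "category": ["col_3"]})
    # Columns that are not listed keep their original dtype.
    print(mapped.dtypes)
```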

Add pd-helper to your pipeline:

from pd_helper.helper import optimize

if __name__ == "__main__":
    # some DataFrame, df, and some earlier pipeline step, some_other_function
    special_mappings = {'string': ['col_1', 'col_2'],
                        'category': ['col_3', 'col_4']}
    exclude_cols = ['col_5']
    df = (df.pipe(some_other_function)
            .pipe(optimize,
                  special_mappings=special_mappings,
                  parse_col_names=True,
                  exclude_cols=exclude_cols,
                  enable_mp=True))
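
The pipeline form works because DataFrame.pipe(func, **kwargs) simply calls func(df, **kwargs) and returns the result. Here is a small self-contained sketch of that behavior; drop_empty_rows and to_category are hypothetical stand-ins I made up so the example runs without pd-helper installed.

```python
import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical earlier pipeline step: remove rows that are all missing."""
    return df.dropna(how="all")


def to_category(df: pd.DataFrame, exclude_cols=None) -> pd.DataFrame:
    """Hypothetical stand-in for optimize: cast non-excluded columns to category."""
    exclude = set(exclude_cols or [])
    out = df.copy()
    for col in out.columns:
        if col not in exclude:
            out[col] = out[col].astype("category")
    return out


if __name__ == "__main__":
    df = pd.DataFrame({"col_3": ["a", "b", None], "col_5": [1.0, 2.0, None]})

    # Keyword arguments given to pipe are forwarded to the function.
    piped = (df.pipe(drop_empty_rows)
               .pipe(to_category, exclude_cols=["col_5"]))

    # The same thing written as nested calls.
    direct = to_category(drop_empty_rows(df), exclude_cols=["col_5"])

    print(piped.equals(direct))
    print(piped.dtypes)
```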

I wrote this package because I kept writing and copying the same code among different projects. I hope this works not only for me but for everyone else working with lots of data!

Prior versions of this idea and code are discussed in earlier articles.

Originally I created this project as an application, but I think it has much more utility as a package for developers.

I started building multiprocessing into my functions as a way to speed things up. Currently, with multiprocessing enabled, pd-helper can optimize many columns at the same time, even when each column is being converted for a different reason.
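
The general idea of handling columns in parallel can be sketched with the standard library's multiprocessing.Pool. This is only an illustration under my own assumptions, not the code inside pd-helper: shrink_column and shrink_frame are hypothetical names, and the enable_mp flag here just mirrors the spirit of the option shown earlier.

```python
from multiprocessing import Pool

import pandas as pd


def shrink_column(col: pd.Series) -> pd.Series:
    """Downcast one numeric column; return anything else unchanged."""
    if pd.api.types.is_integer_dtype(col):
        return pd.to_numeric(col, downcast="integer")
    if pd.api.types.is_float_dtype(col):
        return pd.to_numeric(col, downcast="float")
    return col


def shrink_frame(df: pd.DataFrame, enable_mp: bool = False,
                 processes: int = 2) -> pd.DataFrame:
    """Shrink every column, optionally spreading the work over processes.

    Hypothetical helper for illustration only, not part of pd-helper.
    """
    columns = [df[name] for name in df.columns]
    if enable_mp:
        # Each column is pickled, sent to a worker, and sent back.
        with Pool(processes=processes) as pool:
            shrunk = pool.map(shrink_column, columns)
    else:
        shrunk = [shrink_column(col) for col in columns]
    # Every Series keeps its name and index, so they line up as columns again.
    return pd.concat(shrunk, axis=1)
```

Call shrink_frame(df, enable_mp=True) from under an if __name__ == "__main__": guard, as in the usage snippets above, because platforms that spawn worker processes re-import the main module. Sending columns between processes has a cost of its own, so for small DataFrames the serial path can easily be faster.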

Conclusion

I wrote a package to help with managing DataFrame data. It's called pd-helper. Check it out and let me know if it works for you!

I write about technology, programming, and general interest topics. https://justinhchae.github.io/
