I recently deployed an early alpha version of a data dashboard with Streamlit. Although the app is in its infancy, I’m happy to have it in some kind of production state and I am excited to continue improving on it. Based on my experience, this article contains some of the biggest gotchas and tips on breaking through a few final issues that you may encounter with a Streamlit deployment. …
How to transform a Pandas DataFrame to JSON with flare-like hierarchy to produce D3 Sunburst visualizations.
If your data and insights are worth telling, the right colors and design can make or break the crucial connection to your audiences. As a result, whether you are trying to win a competition or just trying to turn in a class assignment, chances are that colors can help make your case.
As I learned from personal experience, tricky data transformations on the back-end often put the desired visualization out of reach. For example, despite the many tools available in Pandas, I was surprised to discover that there is no clear-cut way to transform a DataFrame to the exact JSON format required to make D3 work. When short on time, instead of mucking around with code, a seemingly easy option is to click, copy, and paste data, but this type of manual work is error-prone and does not scale. …
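As a rough illustration of the kind of transformation described above, here is a minimal sketch that nests DataFrame rows into a flare-style hierarchy of `name`, `children`, and `value` keys. The function name `to_flare`, the column names, and the sample data are all hypothetical; the sketch assumes each path through the hierarchy appears only once and that the leaf values are integers. It is not necessarily the approach taken in the article.

```python
import json

import pandas as pd


def to_flare(df, levels, value_col, root_name="flare"):
    """Nest the rows of df into a flare-style dict (hypothetical helper)."""
    root = {"name": root_name, "children": []}
    for _, row in df.iterrows():
        node = root
        for level in levels:
            name = row[level]
            # Reuse the child with this name if it exists, otherwise create it.
            child = next((c for c in node["children"] if c["name"] == name), None)
            if child is None:
                child = {"name": name, "children": []}
                node["children"].append(child)
            node = child
        # The deepest node is a leaf: it carries a value instead of children.
        node.pop("children", None)
        node["value"] = int(row[value_col])
    return root


# Made-up sample data: two hierarchy levels and one numeric column.
df = pd.DataFrame(
    {
        "region": ["A", "A", "B"],
        "city": ["x", "y", "z"],
        "sales": [1, 2, 3],
    }
)
flare = to_flare(df, levels=["region", "city"], value_col="sales")
flare_json = json.dumps(flare, indent=2)
```

The resulting `flare_json` string can be written to a file and loaded by a D3 script that expects nested `children` arrays.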
When something is nothing, and nothing is something…for boolean data in Pandas, there is a crucial difference between NaN, Null, NA, and bools. A brief on when and how to use them.
Task: Clean a Pandas DataFrame comprising boolean (true/false) values to optimize memory. A constraint is to retain all null values as nulls, i.e. don’t turn null values to False because that is a meaningful change.
Action: Explicitly transform column dtypes, i.e. use float32 instead of float64 to conserve memory and bool instead of object.
Problem: When transforming selected columns to bool, all rows in a column evaluate to either all True or all False, which produces a bad headache. …
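To make the task, action, and problem above concrete, here is a small sketch with made-up data. It contrasts a plain `bool` cast, where a null is truthy and silently becomes True, with the nullable `"boolean"` dtype in pandas, which keeps the null as `<NA>`. The column names are hypothetical, and this is one possible way to satisfy the constraint, not necessarily the fix the article arrives at.

```python
import numpy as np
import pandas as pd

# Made-up data: an object column of booleans with one null, plus a float column.
df = pd.DataFrame(
    {
        "flag": pd.Series([True, False, np.nan, True], dtype="object"),
        "score": [0.5, 1.5, 2.5, 3.5],
    }
)

# Plain bool cast: NaN is truthy, so the null row becomes True.
naive = df["flag"].astype(bool)

# Nullable boolean dtype: the null row stays null (<NA>).
nullable = df["flag"].astype("boolean")

# Downcast float64 to float32 to halve the memory used by the values.
score32 = df["score"].astype("float32")
```

Comparing `naive` with `nullable` shows exactly which rows changed meaning during the cast.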
In Brief: Create time series plots with regression trend lines by leveraging Pandas groupby(), for-loops, and Plotly Scatter Graph Objects in combination with Plotly Express Trend Lines.
What happens when you deploy a data app without coding for memory optimization? In my case, at least, the app crashed and I spent days painfully refactoring code. If you are luckier or smarter (or both), then you have nothing to worry about. Otherwise, consider lessons from my mistakes and some helpful resources to avoid your own special headaches.
How to avoid my optimization mistakes to deploy your app for the win.
I have always acknowledged the importance of writing optimized code but I did not fully appreciate what it meant until deploying a Web app. On my laptop, even the most poorly written code will likely run, albeit slowly. However, the consequences on the Web are far more severe — memory leaks and inefficient code can cripple the experience. …
The easy choice is to drop missing or erroneous data, but at what cost?
Dealing with missing, null, or erroneous values is one of the most painful and common exercises that we encounter in data science and in machine learning. In some cases, it is acceptable or even preferred to drop such records, because algorithms are fragile and can seize up on missing values. However, in other cases when the inclusion of every record is important, what then? The easy choice is to drop missing or erroneous data, but at what cost?
What if a single record represents a real person and dropping this record means a person’s story is not counted? …
I like the kind of math that can be explained to me like I’m five years old.
My take on explaining, like I’m five years old, the math behind a key component of Gaussian Mixture Models (GMM) known as Expectation-Maximization (EM) and how to translate the concepts to Python. The focus of this story is on the M of EM, or M-Step.
Note: This is not a comprehensive explanation of the end-to-end GMM algorithm. For a deeper dive, check out this article from Towards Data Science, another one on GMM, documentation from scikit-learn, or Wikipedia.
Source: Based on my notes from studying machine learning; the source materials are derived from and credited to this university class. …
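For readers who want a preview of what the M-Step looks like in code, here is a minimal NumPy sketch of the standard textbook update: given the responsibilities from the E-Step, it recomputes the mixture weights, means, and covariances. The function name `m_step`, the variable names, and the toy data are my own assumptions for illustration, not taken from the class materials.

```python
import numpy as np


def m_step(X, resp):
    """Standard GMM M-Step (hypothetical helper).

    X    : (N, D) data points
    resp : (N, K) responsibilities from the E-Step; each row sums to 1
    """
    n_samples, n_features = X.shape
    n_components = resp.shape[1]

    # Effective number of points assigned to each component.
    n_k = resp.sum(axis=0)

    # Mixture weights: the share of the data each component explains.
    weights = n_k / n_samples

    # Means: responsibility-weighted average of the points.
    means = (resp.T @ X) / n_k[:, None]

    # Covariances: responsibility-weighted scatter around each mean.
    covs = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        covs[k] = (resp[:, k, None] * diff).T @ diff / n_k[k]

    return weights, means, covs


# Toy example with hard assignments: two points per component.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
weights, means, covs = m_step(X, resp)
```

With hard assignments like these, the means reduce to ordinary per-cluster averages, which is a handy sanity check.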
In this article we will learn how to randomly select and manage data in NumPy arrays for machine learning without scikit-learn or Pandas.
In machine learning, a common way to think about data structures is to have features and targets. In a simple case, let’s say we have data about animals that are either dogs or cats. …
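As a taste of the idea, here is a small sketch of randomly splitting features and targets into training and test sets using only NumPy. The array contents, the 80/20 split, and the 0 = cat / 1 = dog encoding are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up data: 10 animals with 2 features each, and a 0/1 target (0 = cat, 1 = dog).
features = np.arange(20).reshape(10, 2)
targets = np.array([0, 1] * 5)

# Shuffle the row indices once, then slice them into train and test sets.
idx = rng.permutation(len(features))
split = int(0.8 * len(features))
train_idx, test_idx = idx[:split], idx[split:]

# Index features and targets with the same indices so the rows stay aligned.
X_train, y_train = features[train_idx], targets[train_idx]
X_test, y_test = features[test_idx], targets[test_idx]
```

Shuffling indices rather than the arrays themselves keeps each feature row paired with its target.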
Troubleshooting GeoPandas installation in an Anaconda (Conda) environment with PyCharm on macOS.
GeoPandas is a great tool to analyze geospatial data in Python, but getting it to work can be tricky. Unlike other libraries, GeoPandas has a web of intricate dependencies and, as I recently discovered, the smallest environmental issue can be enough to derail the whole project. While troubleshooting, I found plenty of related support articles but nothing that specifically resolved my situation.
How to install GeoPandas for Python in a conda virtual environment with PyCharm and macOS.
The recommended installation method, based on the documentation, is to use conda to install GeoPandas, because conda manages all of its dependencies. But, depending on your base environment and other imports, this may fail. …
Learning to code with DrRacket? Here’s an unofficial starter guide to Beginning Student Language (BSL), Intermediate Student Language (ISL), ISL with Lambdas (ISL+), and Racket.
There’s no love, that is, for Racket the programming and teaching language. At the very least, that’s the vibe I get from a recent Medium.com search. For example, the top stories for Racket are mostly about corruption schemes or extortion. However, somewhere in the top five search results for Racket, there ought to be something about the basics of program design or functional programming.
My friends, it is time to lift DrRacket out of the basement of programming blogs with more content on BSL, ISL, ISL+, and Racket. …