Pandera Brings Code Coverage Standards for Data Quality in AI
With Pandera reaching the milestone of 50 million downloads, we asked Niels Bantilan, Pandera’s creator, to reflect on the journey so far and what might come next.
For those who don’t know, Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Niels Bantilan created Pandera while working as a machine learning engineer (MLE) at a previous company, where he used pandas dataframes to clean data for machine learning models. Niels said:
“I was bitten more than once by invalid data, e.g. columns that are supposed to be dates but are strings instead. This happened enough that I decided to create a tool that would check that all of the types, columns names, and values in my dataframe are as I expected them to be.”
The problem Pandera addresses is the very old but still relevant “garbage in, garbage out” data problem. Suppose you’re a data scientist at Zillow trying to predict house prices from characteristics of a house, e.g. the number of bedrooms, square footage, etc.
If you train a model on bad data, the model will not accurately predict prices for new houses. An example of bad data would be a negative number of bedrooms (which doesn’t make any sense), but this kind of corruption makes it into datasets all the time due to the complex nature of processing raw data into a model-ready form.
This type of invalid data typically isn’t caught by data pipelines unless you have explicit checks in place. Pandera provides a validation tool that makes it easy to write these checks in Python code.
The value is twofold:
- It enforces data quality checks whenever datasets are created or consumed.
- It serves as documentation for a data team to see, at a glance, how valid data is defined.
Pandera is for anyone who works with dataframes in Python, including data scientists, data engineers, and ML engineers. It started off as a data validation library for pandas, a popular dataframe library in Python. Since then, it has expanded to include validation for PySpark, Dask, Modin, and Polars.
Niels’ ambition for Pandera is:
“to make Pandera ubiquitous in the data ecosystem so that no matter what your data stack looks like, you can use Pandera to validate data anywhere.”
Thank you to the amazing community for the support and feedback that got Pandera to this milestone!
Check out Pandera on GitHub: github.com/unionai-oss/pandera