Pandera 0.16: Going Beyond Pandas Data Validation
I’m super excited to announce the availability of Pandera 0.16! This release features a suite of improvements and bug fixes. The biggest advance: Pandera now supports Pyspark SQL DataFrames!
🐼 Beyond Pandas DataFrames
Before I get into this release’s highlights, I want to tell you how we got here: It’s been quite a journey.
🐣 Origins
I wrote the first commit to pandera at the end of 2018.
At that time I was an ML engineer at a previous company, and I was working with Pandas DataFrames every day cleaning, exploring, and modeling data. In my spare time, I created Pandera to try to answer the question:
Can one create type annotations for dataframes that validate data at runtime?
The short answer? Yes.
Pandera started off as a lightweight, expressive and flexible data validation toolkit for Pandas DataFrames. It’s a no-nonsense Python library that doesn’t create pretty reports or ship with an interactive UI; it’s built for data practitioners by a data practitioner to help them build, debug and maintain data and machine learning pipelines:
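If you haven't seen it before, here's a minimal sketch of the core API in action (the columns and checks below are purely illustrative, not from any particular project):

```python
import pandas as pd
import pandera as pa

# Declare what a valid dataframe looks like: column dtypes plus checks.
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "score": pa.Column(float, pa.Check.in_range(0.0, 1.0)),
})

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})
validated = schema.validate(df)  # raises a SchemaError if the data is invalid
```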
With Pandera, you can focus on code-native schemas that you can import and run anywhere, and you don’t have to contend with yaml/json config files or set up a special project structure to get your data validators going.
I first shared Pandera with the broader community at Scipy 2020, where I gave a talk and wrote a paper about its core design and functionality. Since then, the project has accrued more than 2.5K stars and 12M downloads as of this writing.
🐓 Evolution
Today, Pandera is still lightweight, expressive and flexible at its core, but it now provides richer functionality and serves a wider set of DataFrame libraries and tools in the Python ecosystem:
- Expose a class-based API via `DataFrameModel`
- Expose a Pandera-native type engine API to manage physical and logical data types
- Validate other DataFrame types: Dask, Modin, Pyspark Pandas (formerly Koalas), Geopandas
- Parallelize validation with Fugue
- Synthesize data with schema strategies and Hypothesis
- Reuse Pydantic models in your Pandera schemas
- Serialize Pandera schemas as yaml or json
`DataFrameModel` allows for a class-based schema-definition syntax that’s more familiar to Python folks who use `dataclass` and Pydantic `BaseModel`s.
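Here's roughly what the schema above looks like in the class-based style (same illustrative columns):

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series


class UserSchema(pa.DataFrameModel):
    # each attribute declares a column: its dtype plus field-level checks
    user_id: Series[int] = pa.Field(ge=0)
    score: Series[float] = pa.Field(ge=0.0, le=1.0)


UserSchema.validate(pd.DataFrame({"user_id": [1, 2], "score": [0.1, 0.9]}))
```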
Supporting the DataFrame types mentioned above was a fairly light lift, since many of those libraries somewhat follow the Pandas API. However, as time went on, it became clear to me and the community that a rewrite of Pandera’s internals was needed to decouple the schema specification itself from the validation logic, which up until this point relied completely on the Pandas API.
🦩 Revolution
After about half a year of grueling work, Pandera finally supports a “Bring Your Own Backend” extension model. As a Pandera contributor, you can:
- Define a schema specification for some DataFrame object, or any arbitrary Python object.
- Register a backend for that schema, which implements library-specific validation logic.
- Validate away!
At a high level, this is what the code would look like for implementing a schema specification and backend for a fictional DataFrame library called `sloth` 🦥.
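The sketch below approximates the pattern; the base-class import paths and method signatures are indicative of pandera's internal extension API rather than exact, and `sloth` itself is fictional:

```python
import sloth  # fictional DataFrame library

# NOTE: these import paths and signatures are approximations, for illustration.
from pandera.api.base.schema import BaseSchema
from pandera.backends.base import BaseSchemaBackend


class SlothSchema(BaseSchema):
    """Schema specification: declares what a valid sloth DataFrame looks like."""
    ...


class SlothSchemaBackend(BaseSchemaBackend):
    """Backend: implements sloth-specific validation logic."""

    def validate(self, check_obj: "sloth.DataFrame", schema: SlothSchema, **kwargs):
        # run sloth-native checks against the schema specification here
        ...


# Register the backend so that validating a sloth.DataFrame against a
# SlothSchema dispatches to SlothSchemaBackend.
SlothSchema.register_backend(sloth.DataFrame, SlothSchemaBackend)
```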
What better way to come full circle than to present this major development at Scipy 2023? If you’re curious to learn more about the development and organizational challenges that came up during this rewrite process, you can take a look at the accompanying paper, which is currently available in preview mode.
To prove out the extensibility of Pandera with the new schema specification and backend API, we collaborated with the QuantumBlack team at McKinsey to implement a schema and backend for Pyspark SQL … and we completed an MVP in a matter of a few months! So without further ado, let’s dive into the highlights of this release.
🎉 Highlights
⭐ Validate Pyspark SQL DataFrames
You can now write `DataFrameSchema`s and `DataFrameModel`s that will validate `pyspark.sql.DataFrame` objects:
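For example, a minimal sketch of a pyspark.sql schema might look like this (the columns and checks are illustrative):

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

from pandera.pyspark import DataFrameModel


class ProductSchema(DataFrameModel):
    # columns are annotated with pyspark data types, checks go in Field
    id: T.IntegerType() = pa.Field(gt=0)
    product_name: T.StringType() = pa.Field(str_startswith="B")
    price: T.DoubleType() = pa.Field(ge=0.0)
```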
Then you can validate data as usual:
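Continuing the (illustrative) example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_schema = T.StructType([
    T.StructField("id", T.IntegerType()),
    T.StructField("product_name", T.StringType()),
    T.StructField("price", T.DoubleType()),
])
df = spark.createDataFrame([(1, "Bread", 9.0), (2, "Butter", 15.0)], spark_schema)

validated_df = ProductSchema.validate(df)
# For the pyspark.sql API, validation errors are collected rather than raised,
# so you can inspect them on the returned DataFrame (e.g. via its pandera accessor).
```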
This unlocks the power and flexibility of Pandera to Pyspark users who want to validate their DataFrames in production!
🎛️ Control Validation Granularity
Control the validation depth of your schemas at a global level with the `PANDERA_VALIDATION_DEPTH` environment variable. The three acceptable values are:
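`SCHEMA_ONLY` (run only schema-level/metadata validations), `DATA_ONLY` (run only data-level checks), and `SCHEMA_AND_DATA` (run both); see the configuration docs for the authoritative definitions. As a minimal sketch, you can set it from Python before validation runs, or equivalently export it in your shell:

```python
import os

# Run both schema-level and data-level validations. Equivalent to
# `export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA` in the shell.
os.environ["PANDERA_VALIDATION_DEPTH"] = "SCHEMA_AND_DATA"
```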
Note: this feature is currently only supported in the pyspark.sql Pandera API
💡 Switch Validation On and Off
In some cases you may want to disable all Pandera validation calls — for example, in certain production applications that require saving on compute resources. All you need to do is define the `PANDERA_VALIDATION_ENABLED` environment variable before running the application.
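For example (a minimal sketch; in practice you'd typically set this in your shell or deployment config rather than in code):

```python
import os

# Turn off all pandera validation calls, e.g. to save compute in production.
# Equivalent to `export PANDERA_VALIDATION_ENABLED=False` in the shell.
os.environ["PANDERA_VALIDATION_ENABLED"] = "False"
```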
Note: this feature is currently only supported in the pyspark.sql Pandera API
ℹ️ Add Metadata to Fields
You can now add arbitrary metadata to the dataframe- and field-level components of your schema, which lets you embed additional information about the schema itself. This is useful, for example, if you need to write custom logic to select subsets of your schema for different DataFrames that have overlapping or common fields:
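For example (the metadata keys and values here are arbitrary, just to show where they hang off the schema):

```python
import pandera as pa
from pandera.typing import Series


class ProductSchema(pa.DataFrameModel):
    # field-level metadata
    id: Series[int] = pa.Field(metadata={"usecase": ["retail", "forecasting"]})
    price: Series[float] = pa.Field()

    class Config:
        # dataframe-level metadata
        metadata = {"category": "product-details"}
```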
You can get the metadata easily with the `get_metadata` method:
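(A short sketch of the expected usage; check the docs for the exact shape of the returned dict.)

```python
ProductSchema.get_metadata()
# -> a dict containing the dataframe-level and field-level metadata defined above
```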
Note: If you have any ideas for how to extend this functionality to make it more useful, please feel free to open up an issue.
🏛️ Add Missing Columns
When loading raw data into a form that’s ready for data processing, it’s often useful to have guarantees that the columns specified in the schema are present, even if they’re missing from the raw data. This is where it’s useful to specify `add_missing_columns=True` in your schema definition.
When you call `schema.validate(data)`, the schema adds any missing columns to the dataframe, filling them with the default value if one is supplied at the column level, or with NaN if the column is nullable.
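Here's a minimal sketch with the pandas API, assuming one column carries a `default` and another is nullable:

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "a": pa.Column(int),
        "b": pa.Column(int, default=0),        # filled with its default if missing
        "c": pa.Column(float, nullable=True),  # filled with NaN if missing
    },
    add_missing_columns=True,
)

df = pd.DataFrame({"a": [1, 2]})
validated = schema.validate(df)  # comes back with columns a, b, and c
```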
🚮 Drop Invalid Rows
If you wish to use the validation step to remove invalid data, you can pass the `drop_invalid_rows=True` argument to the schema definition. On `schema.validate(..., lazy=True)`, if a data-level check fails, then the row that caused the failure will be removed from the dataframe when it is returned.
Note: Make sure to specify `lazy=True` in `schema.validate` to enable this feature.
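A minimal sketch with the pandas API:

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.in_range(0.0, 1.0))},
    drop_invalid_rows=True,
)

df = pd.DataFrame({"score": [0.2, 1.5, 0.9]})  # 1.5 fails the range check
valid = schema.validate(df, lazy=True)  # the offending row is dropped
```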
📓 Full Changelog
This release shipped with many more improvements, bug fixes, and docs updates. To learn more about all of the enhancements and bug fixes in this release, check out the changelog here.
🛣️ What’s Next?
Hopefully this release gets you excited about Pandera! We’ve now opened the door to support the validation of any DataFrame library, and in fact, any Python object that you want (although at that point you should probably just use Pydantic 🙃).
What DataFrame library are we going for next? One word (of the ursine variety): Polars 🐻❄. There’s already a mini-roadmap that I’ll be converting into a proper roadmap over the next few weeks, but if you’re interested in contributing to this effort, head over to the issue and throw your hat into the contributor ring by posting a comment and saying hi!
Finally, if you’re new to Pandera, you can give it a try directly in your browser, courtesy of JupyterLite.
[▶️ Try Pandera](https://pandera.readthedocs.io/en/latest/try_pandera.html)