Pandera 0.19.0: Polars DataFrame Validation
The day is finally here! Pandera 0.19.0 ships support for Polars 🎉. I’m especially excited about this integration because, even though Pandera is still a Python project, it can now leverage the performance benefits of Rust.
This feature has been many years in the making, from the work of rewriting Pandera’s internals to decouple them from the pandas API, to the PySpark integration effort that added support for a non-pandas-like dataframe library.
Without further ado, here’s an example of how to validate `polars.DataFrame` and `polars.LazyFrame` objects:
You can also use the `check_types` decorator for functional validation:
And, of course, if you want to use the object-based API, you can define the equivalent `DataFrameSchema`:
Validating LazyFrames vs DataFrames
The main difference between validating `LazyFrame`s vs `DataFrame`s is that Pandera will only validate schema-level properties—e.g. the presence of columns and their data types—when validating `LazyFrame`s.
On the other hand, Pandera will examine both schema- and data-level properties when validating a `DataFrame`. For example, data-level properties would include any `Check`s that you specify in the schema definition, which require looking at the actual data values:
This behavior adheres to Pandera’s design philosophy of minimizing surprise for users of the underlying dataframe library. If I have a `LazyFrame` method chain, I don’t want to break the chain of lazy operations and the optimizations that Polars does under the hood:
If you want to check the actual data values, you need to materialize the data explicitly with a `collect` call, so that the materialization is apparent in the code:
There is a way of overriding the `LazyFrame` validation behavior by exporting the environment variable `PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA`, which will then cause Pandera to validate both schema- and data-level properties.
You can read more about this integration in the docs, but here’s a list of functionality that this initial integration with Polars provides:
- Validation with `DataFrameSchema` and `DataFrameModel`
- Functional validation with decorators
- Pandera’s built-in checks
- Custom checks
- Almost all of Polars’ built-in datatypes
- Custom datatypes
- Pandera configuration
For a comprehensive list of all supported and unsupported features, you can check out the new handy supported features table in the documentation.
Wrapping up
This has been a highly requested feature in the Pandera community for quite some time now, and I’m happy that we’ve been able to deliver initial support for Polars with the help of the community. If you want to get involved in Pandera, you can join the Discord community. There’s also a dedicated #pandera-polars channel if you want to discuss ideas relating to the Polars integration.
I wanted to give special shoutouts to @AndriiG13 and @FilipAisot for their contributions on the built-in checks and polars datatypes, respectively, and to @evanrasmussen9, @baldwinj30, @obiii, @Filimoa, @philiporlando, @r-bar, @alkment, @jjfantini, and @robertdj for their early feedback and bug reports during the 0.19.0 beta. Check out the full changelog for 0.19.0 here.
What’s next for Pandera? Besides the never-ending quest to fix bugs and improve developer experience, we’ve already set our sights on the next big thing: Ibis support 🦩.