Niels Bantilan

Pandera 0.20.0: Pyarrow Data Type Support

I’m excited to announce that Pandera v0.20.* now supports Pyarrow data types in the pandas validation engine 🚀. According to the pandas documentation, the main benefits of using pyarrow are:

  • More extensive data types compared to NumPy
  • Missing data support (NA) for all data types
  • Performant IO reader integration
  • Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

Beyond the comprehensive suite of data types that Pyarrow offers, such as lists, dictionaries, and structs, Pandera benefits greatly from native support for missing data, since nullability checks are among the core checks that the Pandera API provides. The last point about interoperability also plays nicely with Pandera’s recent integration with Polars.

With the object-based API, you can write schemas with raw pyarrow types, the pandas string alias, or the pandas.ArrowDtype type:

import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema({
    "pyarrow_dtype": pa.Column(pyarrow.float64()),  # pyarrow type
    "pandas_str_alias": pa.Column("float64[pyarrow]"),  # pandas string alias
    "pandas_dtype": pa.Column(pd.ArrowDtype(pyarrow.float64())),  # pandas arrow dtype
})

If you prefer the class-based API, pandas string aliases are not supported, but you can specify pyarrow types as follows:

from typing import Annotated

import pandas as pd
import pandera as pa
import pyarrow

class Model(pa.DataFrameModel):
    pyarrow_dtype: pyarrow.float64  # pyarrow type
    pandas_dtype: Annotated[pd.ArrowDtype, pyarrow.float64()]  # pandas arrow dtype
    pandas_dtype_kwargs: pd.ArrowDtype = pa.Field(  # pandas arrow dtype
        dtype_kwargs={"pyarrow_dtype": pyarrow.float64()}
    )

Release Highlights

I wanted to give a special shoutout to aaravind100 for driving the support for Pyarrow data types 🙏: as a first-time contributor this was a major effort! The 0.20.* release also comes with some other important changes, such as:

For the complete changelog, see here, here, and here.

What’s next?

On the Pyarrow front, Pandera currently does not support data synthesis strategies via hypothesis. If this is something you’re interested in, please chime in on this issue. I’m also happy to announce that Ibis support is well on its way, with the first major PR merged last week. If you want to keep up with progress, feel free to voice your support and watch the issue, and if you like the Pandera project, please give us a star ⭐!
