Pandera 0.20.0: Pyarrow Data Type Support

Niels Bantilan

I’m excited to announce that Pandera v0.20.* now supports Pyarrow data types in the pandas validation engine 🚀. According to the pandas documentation, the main benefits of using pyarrow are:

  • More extensive data types compared to NumPy
  • Missing data support (NA) for all data types
  • Performant IO reader integration
  • Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

Beyond the comprehensive suite of data types that Pyarrow offers, like lists, dictionaries, and structs, Pandera benefits greatly from native support for missing data, since nullability checks are one of the core checks provided by the Pandera API. The last point about interoperability also plays nicely with Pandera’s recent integration with Polars.

With the object-based API, you can write schemas with raw pyarrow types, the pandas string alias, or a pandas.ArrowDtype instance:

import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema({
    "pyarrow_dtype": pa.Column(pyarrow.float64()),  # pyarrow type
    "pandas_str_alias": pa.Column("float64[pyarrow]"),  # pandas string alias
    "pandas_dtype": pa.Column(pd.ArrowDtype(pyarrow.float64())),  # pandas arrow dtype
})

If you prefer the class-based API, pandas string aliases are not supported, but you can specify pyarrow types as follows:

from typing import Annotated

import pandas as pd
import pandera as pa
import pyarrow

class Model(pa.DataFrameModel):
    pyarrow_dtype: pyarrow.float64  # pyarrow type
    pandas_dtype: Annotated[pd.ArrowDtype, pyarrow.float64()]  # pandas arrow dtype
    pandas_dtype_kwargs: pd.ArrowDtype = pa.Field(  # pandas arrow dtype
        dtype_kwargs={"pyarrow_dtype": pyarrow.float64()}
    )

Release Highlights

I wanted to give a special shoutout to aaravind100 for driving the support for Pyarrow data types 🙏: as a first-time contributor this was a major effort! The 0.20.* release also comes with some other important changes, such as:

For the complete changelog, see here, here, and here.

What’s next?

On the Pyarrow front, Pandera does not yet support data synthesis strategies via hypothesis for these types: if this is something that you’re interested in, please chime in on this issue. I’m also happy to announce that Ibis support is well on its way, with the first major PR merged last week. If you want to keep up with progress, feel free to voice your support and watch the issue, and if you like the Pandera project, please give us a star ⭐!
