Pandera

Protect Your Data & ML Products from Low-Quality Data

The open-source framework for precision data testing for data scientists and ML engineers.

Install Pandera & get started

Quickstart guide
$ pip install pandera

Build confidence in the quality of your data by defining schemas for complex data objects

Pandera provides a simple, flexible, and extensible data-testing framework that validates not only your data but also the functions that produce it.

import pandas as pd
import pandera as pa

from pandera.typing import Series, DataFrame

# Define a schema
class Schema(pa.SchemaModel):
	item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
	price: Series[float] = pa.Field(gt=0, coerce=True)

# Validate at runtime
@pa.check_types(lazy=True)
def transform_data(data: DataFrame[Schema]):
	...

transform_data(
	pd.DataFrame.from_records([
		{"item": "applee", "price": 0.5}, 	# invalid item name
		{"item": "orange", "price": -1000}, 	# negative price
	])
)
Validating Real-World Data Output
import hypothesis
import pandera as pa

from pandera.typing import Series, DataFrame

# Define an input schema
class Schema(pa.SchemaModel):
	item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
	price: Series[float] = pa.Field(gt=0, coerce=True)

# Define an output schema
class OutputSchema(Schema):
	item: Series[str] = pa.Field(isin=["apple"])

# Implement a function that filters out oranges
@pa.check_types(lazy=True)
def transform_data(data: DataFrame[Schema]) -> DataFrame[OutputSchema]:
	return data.query("item == 'orange'")  # 🐛 Incorrect implementation: keeps oranges instead of removing them

# Test the function
@hypothesis.given(Schema.strategy(size=10))
def test_transform_data(data):
	transform_data(data)

# Run Unit Tests
test_transform_data()
Testing Data Transformation Functions in CI Output

Don’t take our word for it

“Pandera is a great data-validation toolkit! It's fast, extensible and easy to use. The community behind it is very helpful and responsive. Pandera is a must for data-intensive applications.”

Ayazhan Zhakhan
Co-Founder, Dropbase.io

“Before Pandera, I had trouble with validating the data that I was pulling in from various databases. Pandera has saved me numerous times from the consequences of using poor-quality data. When Pandera data checks determine that something is incorrect, I can react quickly to resolve the situation or send a note out to my internal customers. Thanks a lot, Niels and the Pandera team, for such a great tool!”

John Kang
Director, Cox Automotive

“Our data changes frequently, and we want a way to easily maintain and update our expectations about what counts as valid data. We use Pandera as a very fancy assertion statement to catch data errors in all the nodes of our production pipelines. Since we’ve adopted it, Pandera has helped us maintain the quality of our code during development and the quality of our models in production.”

Richard Decal
Senior ML Engineer, Dendra Systems

“On our team, we’re using Pandera in every project that touches pandas DataFrames. As a programming tool, it lets us automatically check our DataFrames at runtime and in unit tests. As a tool for thought, it forces clarity on the purpose of DataFrames that we use and make in our projects.”

Eric J. Ma
Principal Data Scientist, Moderna Therapeutics

Integrate seamlessly with the Python ecosystem

Supported data frameworks

Pandas, Dask, Modin, PySpark, GeoPandas, Fugue

Supported integrations

Pydantic, MyPy, FastAPI, Hypothesis, Frictionless

Supported orchestrators

Flyte, Dagster