From b6d8fb61421f9564894efa06026e2dcf7d697960 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 13:46:35 +0000
Subject: [PATCH 1/2] Initial plan


From 4de3bb56546e56f614deb1115cc2e96ba2b7a28c Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 13:54:32 +0000
Subject: [PATCH 2/2] docs: explain how schema validation works with
 Dataframely

Co-authored-by: PeterKretschmerQC <102249383+PeterKretschmerQC@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Quantco/dataframely/sessions/8961276f-2a08-4fea-a5f6-796c9df5923e
---
 docs/guides/features/index.md             |   1 +
 docs/guides/features/schema-validation.md | 155 ++++++++++++++++++++++
 2 files changed, 156 insertions(+)
 create mode 100644 docs/guides/features/schema-validation.md

diff --git a/docs/guides/features/index.md b/docs/guides/features/index.md
index f07fec42..5ea34d48 100644
--- a/docs/guides/features/index.md
+++ b/docs/guides/features/index.md
@@ -3,6 +3,7 @@
 ```{toctree}
 :maxdepth: 1
 
+schema-validation
 column-metadata
 data-generation
 primary-keys
diff --git a/docs/guides/features/schema-validation.md b/docs/guides/features/schema-validation.md
new file mode 100644
index 00000000..be30e3f7
--- /dev/null
+++ b/docs/guides/features/schema-validation.md
@@ -0,0 +1,155 @@
+# Schema Validation
+
+A {class}`~dataframely.Schema` class specifies the expected structure and content of a polars DataFrame.
+It defines:
+
+- **Columns**: the expected column names, data types, and per-column constraints
+- **Rules**: additional row-level or group-level validation expressions
+
+## Columns and column-level rules
+
+Each column in a schema is declared by assigning a {class}`~dataframely.Column` instance to a class attribute:
+
+```python
+import dataframely as dy
+
+
+class UserSchema(dy.Schema):
+    id = dy.String(primary_key=True)
+    age = dy.UInt8(nullable=False)
+    email = dy.String(nullable=True)
+```
+
+When validating a DataFrame against this schema, dataframely verifies that:
+
+1. **All expected columns are present** with the correct data types.
+2. **Column-level constraints** hold for every row. Common constraints include:
+   - `nullable=False` (the default): the column must not contain null values.
+   - `primary_key=True`: values in this column (or combination of columns) must be unique.
+     See [Primary Keys](primary-keys.md) for details.
+   - Type-specific constraints, e.g., `min_length`/`max_length`/`regex` for {class}`~dataframely.String`
+     or `min`/`max` for numeric types.
+
+```{note}
+Each column type exposes its own set of constraints. Refer to the
+{doc}`API reference </api/columns/index>` for a full list.
+```
+
+### The `check` parameter
+
+For one-off constraints that do not have a dedicated parameter, every column type accepts a `check`
+argument. It receives a polars expression and must return a boolean expression:
+
+```python
+class SalarySchema(dy.Schema):
+    # Only allow salaries that are a multiple of 500.
+    salary = dy.Float64(nullable=False, check=lambda col: col % 500 == 0)
+```
+
+Multiple checks can be provided as a list or a dictionary:
+
+```python
+class SalarySchema(dy.Schema):
+    salary = dy.Float64(
+        nullable=False,
+        check={
+            "multiple_of_500": lambda col: col % 500 == 0,
+            "at_least_minimum_wage": lambda col: col >= 1_000,
+        },
+    )
+```
+
+## Schema-level validation rules
+
+Column-level constraints only validate a single column in isolation. When you need to express
+constraints that span **multiple columns** or depend on **aggregated values**, use the
+{func}`@dy.rule() <dataframely.rule>` decorator:
+
+```python
+import polars as pl
+import dataframely as dy
+
+
+class InvoiceSchema(dy.Schema):
+    admission_date = dy.Date(nullable=False)
+    discharge_date = dy.Date(nullable=False)
+    amount = dy.Float64(nullable=False)
+
+    @dy.rule()
+    def discharge_after_admission(cls) -> pl.Expr:
+        return pl.col("discharge_date") >= pl.col("admission_date")
+```
+
+The decorated method receives the schema class as its first argument and must return a polars
+`Expr` that evaluates to a **boolean value for every row**. A row is considered valid when the
+expression evaluates to `True`.
+
+```{tip}
+You can reference a column by its name (e.g. `pl.col("discharge_date")`) or through the schema
+attribute (e.g. `InvoiceSchema.discharge_date.col`). The latter is refactoring-safe and allows
+IDEs to provide auto-completion.
+```
+
+### Group rules
+
+Rules can also be defined on **groups of rows** by passing a `group_by` argument to
+{func}`@dy.rule() <dataframely.rule>`. The expression is then evaluated per group and must return
+an **aggregated boolean** (one value per group):
+
+```python
+class HouseSchema(dy.Schema):
+    zip_code = dy.String(nullable=False)
+    price = dy.Float64(nullable=False)
+
+    @dy.rule(group_by=["zip_code"])
+    def minimum_zip_code_count(cls) -> pl.Expr:
+        # Require at least two houses per zip code.
+        return pl.len() >= 2
+```
+
+All rows belonging to a group that fails a group rule are marked as invalid.
+
+## Schema inheritance
+
+Schemas can be extended through standard Python inheritance. The child schema inherits all columns
+and rules from its parent:
+
+```python
+class BaseSchema(dy.Schema):
+    id = dy.String(primary_key=True)
+    created_at = dy.Datetime(nullable=False)
+
+
+class UserSchema(BaseSchema):
+    name = dy.String(nullable=False)
+    email = dy.String(nullable=True)
+```
+
+`UserSchema.column_names()` returns `["id", "created_at", "name", "email"]`. Inheritance can be
+arbitrarily deep and supports multiple inheritance, provided that the same column name is not
+defined differently in more than one branch.
+
+## Inspecting a schema
+
+You can inspect a schema by printing it or calling `repr()` on it. This shows all columns together
+with their constraints and any custom validation rules:
+
+```python
+>>> print(InvoiceSchema)
+[Schema "InvoiceSchema"]
+  Columns:
+    - "admission_date": Date()
+    - "discharge_date": Date()
+    - "amount": Float64()
+  Rules:
+    - "discharge_after_admission": [(col("discharge_date")) >= (col("admission_date"))]
+```
+
+## Validating data
+
+Once a schema is defined, use {meth}`Schema.validate() <dataframely.Schema.validate>` to check a
+DataFrame and raise an error on any violation, or {meth}`Schema.filter() <dataframely.Schema.filter>`
+for a "soft" validation that returns both the valid rows and a {class}`~dataframely.FailureInfo`
+object describing which rows failed and why.
+
+See the [Quickstart](../quickstart.md) for a step-by-step walkthrough.