| # Schema tracking and validation |
| |
| This examples showcases the `SchemaValidator()` adapter. At runtime, it will collect the schema of DataFrame-like objects and stored them as `pyarrow.Schema` objects. On subsequent runs, it will check if the schema of a noed output matches the stored reference schema. |
| |
| ## Content |
| ## Simple example |
| - `dataflow.py` defines a dataflow with 3 nodes returning pandas `DataFrame` objects. The `SchemaValidator` added to the `Builder` automatically tracks their schema. After running it once, next runs will check the schema against the schemas stored on disk. |
| - `/schemas` includes the stored `.schema` files. They are created using the standard IPC Arrow serialization format. Unfortunately, they are not human readable. We produce a readable `schemas.json` with all the original schema and metadata. We'll be improving it's readability over time. |
| |
| ### Multi library example |
| - `multi_dataflow.py` defines a dataflow with 3 nodes returning pyarrow, ibis, and pandas dataframes and use the `SchemaValidator()` to track their schema. After running it once, you can change the `Driver` config to use `version="2"` and see the schema validation fail. |
| - `/multi_schemas` includes similar content as `/schemas` |
| |
| You can play with the arguments `--version [1, 2, 3]` and `--no-check` to update the stored schema, trigger failing schema checks, and view the diffs. |
| |
| ## Output examples |
| ### Diff examples |
| ```python |
| {'col_b': 'added'} |
| {'col_a': {'type': 'bool -> double'}} |
| {'col_b': 'removed', 'col_a': {'type': 'bool -> double'}} |
| ``` |
| |
| ### Default Pyarrow schema display |
| ```python |
| {'pyarrow_table': item: string |
| value: decimal128(10, 2) |
| count: int32 |
| -- schema metadata -- |
| name: 'pyarrow_table' |
| documentation: 'Create a duckdb table in-memory and return it as a PyArro' + 7 |
| version: 'cc8d5aba6219976c719ef0c7ac78065aef8a6c7612ea1c3ff595d0892660346' + 1, 'ibis_rename': object: string |
| price: decimal128(10, 2) |
| number: int32 |
| -- schema metadata -- |
| name: 'ibis_rename' |
| documentation: 'Rename the columns' |
| version: '31a5e44b3b718f994865e589642a555fee5db44ba86eaa3bcc25bc9f4242389' + 1, 'pandas_new_col': object: string |
| price: decimal128(4, 2) |
| number: int32 |
| col_a: double |
| -- schema metadata -- |
| name: 'pandas_new_col' |
| documentation: 'Add the column `new_col` of type float' |
| version: '7c679d7dbe665b55b8157f3c0a3962ea26aa0a3f6edaea44120374e33c58acb' + 1} |
| ``` |
| |
| ### Human-readable and JSON-serializable schema |
| ```python |
| { |
| "pyarrow_table": { |
| "metadata": { |
| "name": "pyarrow_table", |
| "documentation": "Create a duckdb table in-memory and return it as a PyArrow table", |
| "version": "cc8d5aba6219976c719ef0c7ac78065aef8a6c7612ea1c3ff595d08926603467" |
| }, |
| "item": { |
| "name": "item", |
| "type": "string", |
| "nullable": true, |
| "metadata": null |
| }, |
| "value": { |
| "name": "value", |
| "type": "decimal128(10, 2)", |
| "nullable": true, |
| "metadata": null |
| }, |
| "count": { |
| "name": "count", |
| "type": "int32", |
| "nullable": true, |
| "metadata": null |
| } |
| }, |
| "ibis_rename": { |
| "metadata": { |
| "name": "ibis_rename", |
| "documentation": "Rename the columns", |
| "version": "31a5e44b3b718f994865e589642a555fee5db44ba86eaa3bcc25bc9f42423895" |
| }, |
| "object": { |
| "name": "object", |
| "type": "string", |
| "nullable": true, |
| "metadata": null |
| }, |
| "price": { |
| "name": "price", |
| "type": "decimal128(10, 2)", |
| "nullable": true, |
| "metadata": null |
| }, |
| "number": { |
| "name": "number", |
| "type": "int32", |
| "nullable": true, |
| "metadata": null |
| } |
| }, |
| "pandas_new_col": { |
| "metadata": { |
| "name": "pandas_new_col", |
| "documentation": "Add the column `new_col` of type float", |
| "version": "7c679d7dbe665b55b8157f3c0a3962ea26aa0a3f6edaea44120374e33c58acbd" |
| }, |
| "object": { |
| "name": "object", |
| "type": "string", |
| "nullable": true, |
| "metadata": null |
| }, |
| "price": { |
| "name": "price", |
| "type": "decimal128(4, 2)", |
| "nullable": true, |
| "metadata": null |
| }, |
| "number": { |
| "name": "number", |
| "type": "int32", |
| "nullable": true, |
| "metadata": null |
| }, |
| "col_a": { |
| "name": "col_a", |
| "type": "double", |
| "nullable": true, |
| "metadata": null |
| } |
| } |
| } |
| ``` |