Data Validation Engine

The Data Validation Engine (DVE) is a configuration-driven data validation library written in Python. It is built on Pydantic and a SQL backend, currently either DuckDB or Spark. The configuration for validating a dataset is defined in a JSON document, referred to here as the "dischema". The rules in the dischema are designed to run against all incoming data in a given submission. This allows the DVE to capture all possible issues with the data at once, so the submitter does not have to resubmit the same data repeatedly, which is burdensome and time-consuming for both the submitter and the receiver of the data. Additionally, rules can be configured with the following behaviours:
- File Rejection - The entire submission will be rejected if the given rule triggers one or more times.
- Row Rejection - The row that triggered the rule will be rejected. Rows that pass validation flow through into a validated entity.
- Warning - The rule will trigger and be listed as a feedback message, but the record will still flow through into the validated entity.
Certain scenarios prevent all validations from being executed. For more details, see the File Transformation section.
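The three behaviours can be illustrated with a small, self-contained sketch. This is not the DVE API or the dischema format: the `Severity`, `Rule` and `apply_rules` names are hypothetical, and returning an empty validated entity on file rejection is an assumption made for the example. The point is that every rule runs against every row, so all feedback is collected before any rejection is applied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    """Hypothetical names for the three rule behaviours."""
    FILE_REJECTION = "file_rejection"
    ROW_REJECTION = "row_rejection"
    WARNING = "warning"


@dataclass
class Rule:
    name: str
    severity: Severity
    fails: Callable[[dict], bool]  # returns True when the row violates the rule


def apply_rules(rows, rules):
    """Run every rule against every row and collect all feedback."""
    feedback = []
    rejected_rows = set()
    file_rejected = False
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.fails(row):
                continue
            feedback.append(
                {"row": index, "rule": rule.name, "severity": rule.severity.value}
            )
            if rule.severity is Severity.FILE_REJECTION:
                file_rejected = True
            elif rule.severity is Severity.ROW_REJECTION:
                rejected_rows.add(index)
            # Severity.WARNING only produces feedback; the row is kept.
    if file_rejected:
        validated = []  # assumption: a rejected file yields no validated rows
    else:
        validated = [r for i, r in enumerate(rows) if i not in rejected_rows]
    return validated, feedback, file_rejected


rows = [
    {"id": "1", "age": "34"},
    {"id": "2", "age": "-5"},
    {"id": "3", "age": "130"},
]
rules = [
    Rule("age_not_negative", Severity.ROW_REJECTION, lambda r: int(r["age"]) < 0),
    Rule("age_unusually_high", Severity.WARNING, lambda r: int(r["age"]) > 120),
]
validated, feedback, file_rejected = apply_rules(rows, rules)
```

In this example the second row is rejected, the third row produces a warning but is kept, and the file as a whole is not rejected.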
The DVE has three core components:
- File Transformation - Parses submitted files into a "stringified" parquet format, in which all fields are cast to strings.
- Data Contract - Validates submitted data against specified data types and casts successful records to those types. It also provides modelling of your data.
- Business Rules - Performs simple and complex validations, such as comparisons between fields, entities and/or lookups against reference data.
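As a rough illustration of how the three components relate, the sketch below walks one record through each stage using only the Python standard library. It is a simplification under stated assumptions: the DVE itself uses Pydantic and a SQL backend, and the `CONTRACT` mapping and the `stringify`, `apply_contract` and `check_dates` functions are hypothetical names invented for this example.

```python
from datetime import date

# Hypothetical contract: field name -> function that casts a string to the target type.
CONTRACT = {
    "id": int,
    "start_date": date.fromisoformat,
    "end_date": date.fromisoformat,
}


def stringify(record):
    """Stage 1 (File Transformation): every field becomes a string."""
    return {key: str(value) for key, value in record.items()}


def apply_contract(record):
    """Stage 2 (Data Contract): cast each field, collecting errors instead of raising."""
    typed, errors = {}, []
    for field, cast in CONTRACT.items():
        try:
            typed[field] = cast(record[field])
        except (KeyError, ValueError) as exc:
            errors.append(f"{field}: {exc}")
    return typed, errors


def check_dates(record):
    """Stage 3 (Business Rules): a cross-field comparison on typed values."""
    if record["end_date"] < record["start_date"]:
        return ["end_date is before start_date"]
    return []


raw = {"id": 7, "start_date": "2024-03-01", "end_date": "2024-02-01"}
typed, contract_errors = apply_contract(stringify(raw))
# Only run the business rule when the record was cast successfully.
rule_errors = check_dates(typed) if not contract_errors else []
```

Here the record passes the contract, because every field casts cleanly, but fails the business rule, because the end date precedes the start date. Casting before comparing is what makes the date comparison meaningful.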
For each component listed above, a feedback message is generated whenever a rule is violated. These feedback messages can be integrated directly into your system, provided you can consume JSONL files. Alternatively, we offer a fourth component called Error Reports. This component loads the feedback messages into an .xlsx (Excel) file, which can be sent back to the submitter of the data. The file is compatible with any application that can read spreadsheets, such as Microsoft Excel, Google Sheets or LibreOffice Calc.
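Consuming JSONL feedback needs little more than reading one JSON object per line. The sketch below assumes a hypothetical message shape: the `entity`, `rule` and `severity` field names are illustrative and may not match the fields the DVE actually writes. An in-memory stream stands in for a feedback file so the example is self-contained.

```python
import io
import json
from collections import Counter

# Stand-in for a feedback file; the field names are illustrative assumptions.
feedback_file = io.StringIO(
    '{"entity": "patients", "rule": "age_not_negative", "severity": "row_rejection"}\n'
    '{"entity": "patients", "rule": "age_unusually_high", "severity": "warning"}\n'
    '{"entity": "visits", "rule": "visit_date_in_future", "severity": "row_rejection"}\n'
)


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


messages = list(read_jsonl(feedback_file))
# Summarise the feedback, for example to decide how to route a submission.
counts_by_severity = Counter(message["severity"] for message in messages)
```

With a real file, `open(path, encoding="utf-8")` would replace the `io.StringIO` stand-in.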
The DVE currently comes with two supported backend implementations: DuckDB and Spark. If you need to write a custom backend implementation, see the Advanced User Guidance section.
Use the Table of Contents on the left-hand side of the page to navigate to sections of interest, or use the "Next" and "Previous" buttons at the bottom of each page to read through the pages in sequential order.
If you have questions or need additional support with the DVE, then please raise an issue on our GitHub page here.