This is the companion repository for the article “How to monitor Data Lake health status at scale”, published on the Towards Data Science blog.
Check out the project presentation at the Great Expectations Community Meetup.
In this repository you can find a complete guide to performing Data Quality checks
over an in-memory Spark dataframe using the Python package
Great Expectations.
In detail, cloning and browsing the repository will show you how to:
- Create a new Expectation Suite over an in-memory Spark dataframe (see the sketch after this list);
- Add Custom Expectations to your Expectation Suite;
- Edit the Custom Expectations output description and the validation Data Docs;
- Generate and update the Data Docs website;
- Execute a Validation run over an in-memory Spark dataframe.
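To give a concrete idea of the first point, here is a minimal sketch (not the repository's notebook) of how an Expectation Suite can be built over an in-memory Spark dataframe with the SparkDFDataset wrapper; the column names and the CSV path are illustrative:

```python
# Minimal sketch: build an Expectation Suite over an in-memory Spark dataframe.
# Column names and the CSV path are illustrative, not taken from the repository.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("ge-demo").getOrCreate()
df = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)

# Wrap the Spark dataframe so that Native Expectations can be called on it
ge_df = SparkDFDataset(df)
ge_df.expect_column_to_exist("id")
ge_df.expect_column_values_to_not_be_null("id")

# The expectations accumulated in the session form the suite; the repository
# saves its suites as JSON under expectation_suites/dataset_name/suite_name.json
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
print(suite)
```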
Folder structure:

- data quality: here is stored the core of the project. In this folder you can find:
  - a template to start developing a new Expectation Suite using Native Expectations and Custom Expectations;
  - Custom Expectation templates, one for each expectation type: single_column, pair_column and multicolumn (a sketch of a single_column Custom Expectation follows this list);
  - code to generate (and update) the Great Expectations Data Docs;
  - code to run Data Validation.
- expectation_suites: here is stored the Expectation Suite generated by the execution of the Jupyter notebook. The directory, auto-generated by Great Expectations, follows the naming convention expectation_suites/dataset_name/suite_name.json.
- data: here is stored the data used for this hands-on. The dataset sample_data.csv was used both to develop the Expectation Suite and to run the Data Validation.
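As a companion to the Custom Expectation templates mentioned above, here is one possible single_column Custom Expectation written against the SparkDFDataset pattern available in older Great Expectations releases; the class name and expectation name are illustrative, and the repository's own templates may differ in detail:

```python
# Illustrative single_column Custom Expectation following the SparkDFDataset
# pattern of older Great Expectations releases; names are hypothetical.
from great_expectations.dataset import MetaSparkDFDataset, SparkDFDataset


class CustomSparkDFDataset(SparkDFDataset):
    _data_asset_type = "CustomSparkDFDataset"

    @MetaSparkDFDataset.column_map_expectation
    def expect_column_values_to_be_positive(
        self,
        column,  # a single-column Spark dataframe injected by the decorator
        mostly=None,
        result_format=None,
        include_config=True,
        catch_exceptions=None,
        meta=None,
    ):
        # Rows where "__success" is False are counted as unexpected values
        return column.withColumn("__success", column[0] > 0)
```

Wrapping a Spark dataframe with CustomSparkDFDataset instead of SparkDFDataset makes the new expectation callable exactly like a Native one.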
The Makefile lists a set of commands that will help you browse and use this project.
The Expectation Suites dev environment is based on a Jupyter Notebook instance running in
a Docker container. The Docker image used to run the container is also
adopted as a Remote Python Interpreter with PyCharm Professional, to develop
Custom Expectations with the support of an IDE.
The development environment runs in a Docker container with:
- Jupyter Notebook
- Python 3.8
- Spark 3.1.1
Before you start: install Docker.
- Build the Docker image, which contains all that you need to start developing Great Expectations Suites, by running the command:
  make build
- Run a Docker container from the previously built image with the command:
  make run
- To reach Jupyter Notebook, click on the URL that you can find in the terminal.
To validate the data (available in the folder data) run the command:
make validate-data
This will generate the validations/ and site/ folders, which contain
respectively the results of the data quality validation run and the auto-generated
data documentation.
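Under the hood, this is simply a Great Expectations validation run over the Spark dataframe. A minimal sketch of what that looks like, assuming a suite saved under the naming convention above (the path is illustrative, and make validate-data wraps the repository's own script):

```python
# Minimal sketch of a validation run over an in-memory Spark dataframe.
# The suite path is illustrative; the repository's own scripts drive make validate-data.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)

ge_df = SparkDFDataset(df)

# In older Great Expectations releases, validate() accepts the path to a suite JSON file
results = ge_df.validate(expectation_suite="expectation_suites/dataset_name/suite_name.json")
print(results["success"])
```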
To locally generate only the Great Expectations data documentation run the
command:
make ge-doc
This will generate the site/ folder with the data documentation auto-generated by
Great Expectations.
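For reference, when a standard Great Expectations DataContext (great_expectations.yml) is configured, the same kind of documentation can be rebuilt directly from Python; this is a sketch of that generic approach, not of the repository's own generation code:

```python
# Generic sketch: rebuild Data Docs through a configured Great Expectations
# DataContext. The repository's make ge-doc target uses its own generation code.
import great_expectations as ge

context = ge.data_context.DataContext()

# Rebuild all configured Data Docs sites from stored suites and validation results
context.build_data_docs()
```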
- Davide Romano – Mediaset Business Digital
- Nicola Saraceni – Mediaset Business Digital