This is the companion repository for the article “How to monitor Data Lake health status at scale”, published on the Towards Data Science blog.
Check out the project presentation at the Great Expectations Community Meetup.
In this repository you can find a complete guide to performing Data Quality checks
over an in-memory Spark dataframe using the Python package
Great Expectations.
In detail, cloning and browsing the repository will show you how to:
- Create a new Expectation Suite over an in-memory Spark dataframe (see the sketch after this list);
- Add Custom Expectations to your Expectation Suite;
- Edit the Custom Expectations output description and the validation Data Docs;
- Generate and update the Data Docs website;
- Execute a Validation run over an in-memory Spark dataframe.
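To give a concrete idea of the first point, here is a minimal sketch (not the repository's notebook) of how an Expectation Suite can be built over an in-memory Spark dataframe with the SparkDFDataset wrapper; the column names and the CSV path are illustrative:

```python
# Minimal sketch: build an Expectation Suite over an in-memory Spark dataframe.
# Column names and the CSV path are illustrative, not taken from the repository.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("ge-demo").getOrCreate()
df = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)

# Wrap the Spark dataframe so that Native Expectations can be called on it
ge_df = SparkDFDataset(df)
ge_df.expect_column_to_exist("id")
ge_df.expect_column_values_to_not_be_null("id")

# The expectations accumulated in the session form the suite; the repository
# saves its suites as JSON under expectation_suites/dataset_name/suite_name.json
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
print(suite)
```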
Folder structure:

- data quality: here is stored the core of the project. In this folder you can find:
  - a template to start developing a new Expectation Suite using Native Expectations and Custom Expectations;
  - Custom Expectation templates, one for each expectation type: single_column, pair_column and multicolumn (a sketch of a single_column Custom Expectation follows this list);
  - code to generate (and update) the Great Expectations Data Docs;
  - code to run Data Validation.
- expectation_suites: here is stored the Expectation Suite generated by the execution of the Jupyter notebook. The directory, auto-generated by Great Expectations, follows the naming convention expectation_suites/dataset_name/suite_name.json.
- data: here is stored the data used for this hands-on. The dataset sample_data.csv was used both to develop the Expectation Suite and to run the Data Validation.
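As a companion to the Custom Expectation templates mentioned above, here is one possible single_column Custom Expectation written against the SparkDFDataset pattern available in older Great Expectations releases; the class name and expectation name are illustrative, and the repository's own templates may differ in detail:

```python
# Illustrative single_column Custom Expectation following the SparkDFDataset
# pattern of older Great Expectations releases; names are hypothetical.
from great_expectations.dataset import MetaSparkDFDataset, SparkDFDataset


class CustomSparkDFDataset(SparkDFDataset):
    _data_asset_type = "CustomSparkDFDataset"

    @MetaSparkDFDataset.column_map_expectation
    def expect_column_values_to_be_positive(
        self,
        column,  # a single-column Spark dataframe injected by the decorator
        mostly=None,
        result_format=None,
        include_config=True,
        catch_exceptions=None,
        meta=None,
    ):
        # Rows where "__success" is False are counted as unexpected values
        return column.withColumn("__success", column[0] > 0)
```

Wrapping a Spark dataframe with CustomSparkDFDataset instead of SparkDFDataset makes the new expectation callable exactly like a Native one.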
The Makefile lists a set of commands that will help you browse and use this project.
The Expectation Suites dev environment is based on a Jupyter Notebook instance running in
a Docker container. The Docker image used to run the container is also
adopted as a Remote Python Interpreter with PyCharm Professional, to develop
Custom Expectations with the support of an IDE.
The development environment runs in a Docker container with:
- Jupyter Notebook
- Python 3.8
- Spark 3.1.1
Before you start: install Docker.
- Build the Docker image, which contains all that you need to start developing Great Expectations Suites, by running the command:
  make build
- Run a Docker container from the previously built image with the command:
  make run
- To reach Jupyter Notebook, click on the URL that you can find in the terminal.
To validate the data (available in the folder data) run the command:
make validate-data
This will generate the validations/ and site/ folders, which contain
respectively the results of the data quality validation run and the auto-generated
data documentation.
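Under the hood, this is simply a Great Expectations validation run over the Spark dataframe. A minimal sketch of what that looks like, assuming a suite saved under the naming convention above (the path is illustrative, and make validate-data wraps the repository's own script):

```python
# Minimal sketch of a validation run over an in-memory Spark dataframe.
# The suite path is illustrative; the repository's own scripts drive make validate-data.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)

ge_df = SparkDFDataset(df)

# In older Great Expectations releases, validate() accepts the path to a suite JSON file
results = ge_df.validate(expectation_suite="expectation_suites/dataset_name/suite_name.json")
print(results["success"])
```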
To locally generate only the Great Expectations data documentation run the
command:
make ge-doc
This will generate the site/ folder with the data documentation auto-generated by
Great Expectations.
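For reference, when a standard Great Expectations DataContext (great_expectations.yml) is configured, the same kind of documentation can be rebuilt directly from Python; this is a sketch of that generic approach, not of the repository's own generation code:

```python
# Generic sketch: rebuild Data Docs through a configured Great Expectations
# DataContext. The repository's make ge-doc target uses its own generation code.
import great_expectations as ge

context = ge.data_context.DataContext()

# Rebuild all configured Data Docs sites from stored suites and validation results
context.build_data_docs()
```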
- Davide Romano – Mediaset Business Digital
- Nicola Saraceni – Mediaset Business Digital