EDA¶
This section introduces the Exploratory Data Analysis component of DataPrep.
Section Contents
- plot(): analyze distributions
- plot_correlation(): analyze correlations
- plot_missing(): analyze missing values
- plot_diff(): analyze differences between DataFrames
- create_report(): create a profile report
- How-to guide: customize your output
- Insight: automatic insight detection
- Case study: Titanic
- Case study: House Prices
Introduction to Exploratory Data Analysis and dataprep.eda¶
Exploratory Data Analysis (EDA) is the process of exploring a dataset and getting an understanding of its main characteristics. The dataprep.eda package simplifies this process by allowing the user to explore important characteristics with simple APIs. Each API allows the user to analyze the dataset from a high level to a low level, and from different perspectives. Specifically, dataprep.eda provides the following functionality:
- Analyze column distributions with plot(). The function plot() explores the column distributions and statistics of the dataset. It detects the column type and then outputs various plots and statistics appropriate for that type. The user can optionally pass one or two columns of interest as parameters: if one column is passed, its distribution is plotted in various ways and column statistics are computed; if two columns are passed, plots depicting the relationship between the two columns are generated.
- Analyze correlations with plot_correlation(). The function plot_correlation() explores the correlation between columns in various ways, using multiple correlation metrics. By default, it plots correlation matrices with various metrics. The user can optionally pass one or two columns of interest as parameters: if one column is passed, the correlation between this column and all other columns is computed and ranked; if two columns are passed, a scatter plot with a regression line is plotted.
- Analyze missing values with plot_missing(). The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it generates various plots that display the number of missing values in each column and any underlying patterns of missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter; plot_missing() then plots the distribution of each column with and without the missing values from the given column, enabling a thorough understanding of their impact.
- Analyze column differences with plot_diff(). The function plot_diff() explores the differences in column distributions and statistics across multiple datasets. It detects the column type and then outputs various plots and statistics appropriate for that type. The user can optionally set a baseline dataset against which the other datasets are compared.
The following sections give a simple demonstration of plot(), plot_correlation(), plot_missing(), and plot_diff() using an example dataset.
Analyze distributions with plot()¶
The function plot() explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.
- plot(df): plots the distribution of each column and calculates dataset statistics
- plot(df, x): plots the distribution of column x in various ways and calculates column statistics
- plot(df, x, y): generates plots depicting the relationship between columns x and y
The following shows an example of plot(df). It plots a histogram for each numerical column, a bar chart for each categorical column, and computes dataset statistics.
[1]:
from dataprep.eda import plot
from dataprep.datasets import load_dataset
import numpy as np
df = load_dataset('adult')
plot(df)
[1]:
| Number of Variables | 15 |
|---|---|
| Number of Rows | 48842 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 52 |
| Duplicate Rows (%) | 0.1% |
| Total Size in Memory | 30.2 MB |
| Average Row Size in Memory | 649.3 B |
| Variable Types | (chart output omitted) |
| Insight | Type |
|---|---|
| fnlwgt is skewed | Skewed |
| education-num is skewed | Skewed |
| capital-gain is skewed | Skewed |
| capital-loss is skewed | Skewed |
| hours-per-week is skewed | Skewed |
| capital-gain has 44807 (91.74%) zeros | Zeros |
| capital-loss has 46560 (95.33%) zeros | Zeros |
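The dataset-level statistics in the table above (variable and row counts, missing cells, duplicate rows) can be approximated with plain pandas. This is an illustrative sketch, not DataPrep's internal implementation, and the tiny frame here is made up rather than the 'adult' dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for the 'adult' dataset.
df = pd.DataFrame({
    "age": [25, 32, 25, None],
    "workclass": ["Private", "State-gov", "Private", "Private"],
})

n_rows, n_vars = df.shape                      # rows x variables
missing_cells = int(df.isna().sum().sum())     # total NaN cells
missing_pct = missing_cells / (n_rows * n_vars) * 100
duplicate_rows = int(df.duplicated().sum())    # exact duplicate rows

print(n_vars, n_rows, missing_cells, missing_pct, duplicate_rows)
```

plot(df) computes these (and more) lazily with Dask before rendering, but the quantities themselves are the ordinary pandas aggregates shown here.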
For more information about the function plot(), see here.
Analyze correlations with plot_correlation()¶
The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. The following describes the functionality of plot_correlation() for a given dataframe df.
- plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)
- plot_correlation(df, x): plots the columns most correlated with column x
- plot_correlation(df, x, y): plots the joint distribution of columns x and y and computes a regression line
The following shows an example of plot_correlation(). It generates correlation matrices using the Pearson, Spearman, and Kendall Tau correlation coefficients.
[2]:
from dataprep.eda import plot_correlation
from dataprep.datasets import load_dataset
df = load_dataset("wine-quality-red")
plot_correlation(df)
[2]:
| Pearson | Spearman | KendallTau | |
|---|---|---|---|
| Highest Positive Correlation | 0.672 | 0.79 | 0.607 |
| Highest Negative Correlation | -0.683 | -0.707 | -0.528 |
| Lowest Correlation | 0.002 | 0.001 | 0.0 |
| Mean Correlation | 0.019 | 0.028 | 0.021 |
- Most positive correlated: (fixed_acidity, citric_acid)
- Most negative correlated: (fixed_acidity, pH)
- Least correlated: (volatile_acidity, residual_sugar)
- Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
- Most negative correlated: (fixed_acidity, pH)
- Least correlated: (total_sulfur_...ide, sulphates)
- Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
- Most negative correlated: (fixed_acidity, pH)
- Least correlated: (total_sulfur_...ide, sulphates)
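Correlation matrices like the ones above, and the "most positive/negative correlated" insights, can be reproduced with pandas' DataFrame.corr. The sketch below uses a small made-up numeric frame rather than the wine-quality data:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame standing in for the wine-quality data.
df = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 6.9, 8.1, 7.2],
    "citric_acid":   [0.00, 0.10, 0.05, 0.30, 0.02],
    "pH":            [3.51, 3.40, 3.45, 3.20, 3.48],
})

# The three metrics shown by plot_correlation(df).
pearson  = df.corr(method="pearson")
spearman = df.corr(method="spearman")
kendall  = df.corr(method="kendall")

# Mask the diagonal, then rank column pairs as in the insights above.
off_diag = pearson.mask(np.eye(len(pearson), dtype=bool))
most_positive = off_diag.stack().idxmax()   # (column, column) tuple
most_negative = off_diag.stack().idxmin()
```

Because the matrix is symmetric, each pair appears twice in the stacked view; idxmax simply returns the first occurrence.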
For more information about the function plot_correlation(), see here.
Analyze missing values with plot_missing()¶
The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The following describes the functionality of plot_missing() for a given dataframe df.
- plot_missing(df): plots the amount and position of missing values, and their relationships between columns
- plot_missing(df, x): plots the impact of the missing values in column x on all other columns
- plot_missing(df, x, y): plots the impact of the missing values from column x on column y in various ways

The following shows an example of plot_missing(df).
[3]:
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
plot_missing(df)
[3]:
Missing Statistics
| Missing Cells | 866 |
|---|---|
| Missing Cells (%) | 8.1% |
| Missing Columns | 3 |
| Missing Rows | 708 |
| Avg Missing Cells per Column | 72.17 |
| Avg Missing Cells per Row | 0.97 |
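The missing-value statistics in the table above boil down to counts over a boolean mask. A minimal pandas sketch, using a hypothetical frame with gaps rather than the Titanic data:

```python
import pandas as pd

# Hypothetical frame with gaps, standing in for the Titanic data.
df = pd.DataFrame({
    "Age":   [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

missing_cells = int(df.isna().sum().sum())          # total missing cells
missing_cols  = int(df.isna().any().sum())          # columns with any gap
missing_rows  = int(df.isna().any(axis=1).sum())    # rows with any gap
avg_cells_per_column = missing_cells / df.shape[1]
avg_cells_per_row    = missing_cells / df.shape[0]
```

plot_missing(df) visualizes the same mask as bar charts, heatmaps, and a spectrum view instead of scalar summaries.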
For more information about the function plot_missing(), see here.
Analyze difference with plot_diff()¶
The function plot_diff() explores the differences in column distributions and statistics across multiple datasets. The following shows an example of plot_diff() for two given dataframes df1 and df2.
[4]:
from dataprep.eda import plot_diff
from dataprep.datasets import load_dataset
df1 = load_dataset("house_prices_test")
df2 = load_dataset("house_prices_train")
plot_diff([df1, df2])
[4]:
| | df1 | df2 |
|---|---|---|
| Number of Variables | 80 | 81 |
| Number of Rows | 1459 | 1460 |
| Missing Cells | 7000 | 6965 |
| Missing Cells (%) | 6.0% | 5.9% |
| Duplicate Rows | 0 | 0 |
| Duplicate Rows (%) | 0.0% | 0.0% |
| Total Size in Memory | 912.0 KB | 924.0 KB |
| Average Row Size in Memory | 910.6 KB | 922.6 KB |
| Variable Types | (chart output omitted) | (chart output omitted) |
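A rough equivalent of the side-by-side summary above can be assembled with pandas alone. This sketch compares two small, made-up frames rather than the house-prices splits:

```python
import pandas as pd

# Two small, made-up frames standing in for the test/train splits.
df1 = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})
df2 = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "z", "w"],
                    "c": [0.1, 0.2, None, 0.4]})

# One summary column per dataset, mirroring plot_diff's overview table.
summary = pd.DataFrame({
    name: {
        "Number of Variables": frame.shape[1],
        "Number of Rows": frame.shape[0],
        "Missing Cells": int(frame.isna().sum().sum()),
    }
    for name, frame in {"df1": df1, "df2": df2}.items()
})
print(summary)
```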
For more information about the function plot_diff(), see here.
Create a profile report with create_report()¶
The function create_report() generates a comprehensive profile report of the dataset. create_report() combines the individual components of the dataprep.eda package and outputs them into a nicely formatted HTML document. The document contains the following information:
- Overview: detect the types of columns in a dataframe
- Variables: variable type, unique values, distinct count, missing values
- Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Text analysis: length, sample, and letter composition
- Correlations: highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices
- Missing values: bar chart, heatmap, and spectrum of missing values
An example report can be downloaded here.
Get the intermediate data¶
DataPrep.EDA separates computation from rendering, so you can compute just the intermediate data and render it with other plotting libraries.
For each plot function there is a corresponding compute function, which returns the computed intermediates used for rendering. For example, for plot_correlation(df) you can get the intermediates using compute_correlation(df). The result is dict-like, and you can also save it to a JSON file.
[5]:
from dataprep.eda import compute_correlation
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
imdt = compute_correlation(df)
imdt.save("imdt.json")
imdt
Intermediate has been saved to imdt.json!
[5]:
{'data': {'Pearson': {'x': {1: 'PassengerId',
2: 'PassengerId',
3: 'PassengerId',
4: 'PassengerId',
5: 'PassengerId',
6: 'PassengerId',
9: 'Survived',
10: 'Survived',
11: 'Survived',
12: 'Survived',
13: 'Survived',
17: 'Pclass',
18: 'Pclass',
19: 'Pclass',
20: 'Pclass',
25: 'Age',
26: 'Age',
27: 'Age',
33: 'SibSp',
34: 'SibSp',
41: 'Parch'},
'y': {1: 'Survived',
2: 'Pclass',
3: 'Age',
4: 'SibSp',
5: 'Parch',
6: 'Fare',
9: 'Pclass',
10: 'Age',
11: 'SibSp',
12: 'Parch',
13: 'Fare',
17: 'Age',
18: 'SibSp',
19: 'Parch',
20: 'Fare',
25: 'SibSp',
26: 'Parch',
27: 'Fare',
33: 'Parch',
34: 'Fare',
41: 'Fare'},
'correlation': {1: -0.005006660767066476,
2: -0.03514399403037967,
3: 0.03684719786132784,
4: -0.057526833784441705,
5: -0.0016520124027188286,
6: 0.01265821928749123,
9: -0.33848103596101586,
10: -0.07722109457217737,
11: -0.03532249888573588,
12: 0.08162940708348222,
13: 0.2573065223849618,
17: -0.36922601531551574,
18: 0.0830813628456866,
19: 0.01844267131074835,
20: -0.5494996199439061,
25: -0.3082467589236574,
26: -0.18911926263203518,
27: 0.09606669176903881,
33: 0.41483769862015263,
34: 0.15965104324216103,
41: 0.21622494477076254}},
'Spearman': {'x': {1: 'PassengerId',
2: 'PassengerId',
3: 'PassengerId',
4: 'PassengerId',
5: 'PassengerId',
6: 'PassengerId',
9: 'Survived',
10: 'Survived',
11: 'Survived',
12: 'Survived',
13: 'Survived',
17: 'Pclass',
18: 'Pclass',
19: 'Pclass',
20: 'Pclass',
25: 'Age',
26: 'Age',
27: 'Age',
33: 'SibSp',
34: 'SibSp',
41: 'Parch'},
'y': {1: 'Survived',
2: 'Pclass',
3: 'Age',
4: 'SibSp',
5: 'Parch',
6: 'Fare',
9: 'Pclass',
10: 'Age',
11: 'SibSp',
12: 'Parch',
13: 'Fare',
17: 'Age',
18: 'SibSp',
19: 'Parch',
20: 'Fare',
25: 'SibSp',
26: 'Parch',
27: 'Fare',
33: 'Parch',
34: 'Fare',
41: 'Fare'},
'correlation': {1: -0.005006660767066498,
2: -0.03409135008914179,
3: 0.04100991613236293,
4: -0.06116076582604884,
5: 0.0012351780934194748,
6: -0.013975133780990471,
9: -0.3396679366500525,
10: -0.052565300044694487,
11: 0.08887948468090501,
12: 0.13826563286545587,
13: 0.3237361394448083,
17: -0.36166557503434504,
18: -0.04301876651204207,
19: -0.022801341928590464,
20: -0.6880316726256096,
25: -0.1820612589179174,
26: -0.2542121174301802,
27: 0.13505121773428777,
33: 0.45001397100861634,
34: 0.4471129882944581,
41: 0.4100738082761382}},
'KendallTau': {'x': {1: 'PassengerId',
2: 'PassengerId',
3: 'PassengerId',
4: 'PassengerId',
5: 'PassengerId',
6: 'PassengerId',
9: 'Survived',
10: 'Survived',
11: 'Survived',
12: 'Survived',
13: 'Survived',
17: 'Pclass',
18: 'Pclass',
19: 'Pclass',
20: 'Pclass',
25: 'Age',
26: 'Age',
27: 'Age',
33: 'SibSp',
34: 'SibSp',
41: 'Parch'},
'y': {1: 'Survived',
2: 'Pclass',
3: 'Age',
4: 'SibSp',
5: 'Parch',
6: 'Fare',
9: 'Pclass',
10: 'Age',
11: 'SibSp',
12: 'Parch',
13: 'Fare',
17: 'Age',
18: 'SibSp',
19: 'Parch',
20: 'Fare',
25: 'SibSp',
26: 'Parch',
27: 'Fare',
33: 'Parch',
34: 'Fare',
41: 'Fare'},
'correlation': {1: -0.004090214762393426,
2: -0.026824400986911346,
3: 0.02754181401332336,
4: -0.04839417859092306,
5: 0.0007978451239667482,
6: -0.008920866826633959,
9: -0.32353318439409545,
10: -0.043385054517253836,
11: 0.08591509091074537,
12: 0.13393261225325737,
13: 0.2662286416742869,
17: -0.2860814161328999,
18: -0.03955236574306877,
19: -0.021019471733083554,
20: -0.5735307309748154,
25: -0.14274551945143282,
26: -0.20011172214961384,
27: 0.0932489072038393,
33: 0.4252407973704515,
34: 0.35826215386190535,
41: 0.3303597642072928}}},
'axis_range': ['PassengerId',
'Survived',
'Pclass',
'Age',
'SibSp',
'Parch',
'Fare'],
'tabledata': {'Highest Positive Correlation': {'Pearson': 0.415,
'Spearman': 0.45,
'KendallTau': 0.425},
'Highest Negative Correlation': {'Pearson': -0.549,
'Spearman': -0.688,
'KendallTau': -0.574},
'Lowest Correlation': {'Pearson': 0.002,
'Spearman': 0.001,
'KendallTau': 0.001},
'Mean Correlation': {'Pearson': -0.024,
'Spearman': -0.001,
'KendallTau': 0.0}},
'insights': {'Pearson': ['Most positive correlated: (SibSp, Parch)',
'Most negative correlated: (Pclass, Fare)',
'Least correlated: (PassengerId, Parch)'],
'Spearman': ['Most positive correlated: (SibSp, Parch)',
'Most negative correlated: (Pclass, Fare)',
'Least correlated: (PassengerId, Parch)'],
'KendallTau': ['Most positive correlated: (SibSp, Parch)',
'Most negative correlated: (Pclass, Fare)',
'Least correlated: (PassengerId, Parch)']}}
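Because the intermediate is dict-like, saving it to JSON amounts to ordinary serialization, and the file can be read back by any JSON-aware tool. A stdlib sketch, using a trimmed, hypothetical stand-in for the structure above:

```python
import json

# Trimmed, hypothetical stand-in for a compute_correlation() intermediate.
imdt = {
    "tabledata": {
        "Highest Positive Correlation": {"Pearson": 0.415, "Spearman": 0.45},
    },
    "insights": {
        "Pearson": ["Most positive correlated: (SibSp, Parch)"],
    },
}

# Round-trip through JSON, as imdt.save("imdt.json") and a reader would.
serialized = json.dumps(imdt, indent=2)
restored = json.loads(serialized)
```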
Specifying colors¶
The supported colors of DataPrep.EDA match those of the Bokeh library. Color values can be provided in any of the following ways:
- any of the 147 named CSS colors, e.g., 'green', 'indigo'
- an RGB(A) hex value, e.g., '#FF0000', '#44444444'
- a 3-tuple of integers (r, g, b) between 0 and 255
- a 4-tuple (r, g, b, a), where r, g, b are integers between 0 and 255 and a is a floating-point value between 0 and 1
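The tuple forms above map onto hex strings in the usual way. The helper below is purely illustrative (its name and logic are not part of the DataPrep.EDA API); it shows how the four accepted forms relate:

```python
def to_hex(color):
    """Normalize a color spec to a string form Bokeh-style consumers accept.

    Illustrative helper; not part of the DataPrep.EDA API.
    """
    if isinstance(color, str):          # named CSS color or hex string
        return color
    if len(color) == 3:                 # (r, g, b), each an int in 0..255
        return "#{:02X}{:02X}{:02X}".format(*color)
    r, g, b, a = color                  # (r, g, b, a), a a float in 0..1
    return "#{:02X}{:02X}{:02X}{:02X}".format(r, g, b, round(a * 255))

print(to_hex("green"))
print(to_hex((255, 0, 0)))
print(to_hex((68, 68, 68, 68 / 255)))
```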