eda

The core module provides a set of functions for exploratory data analysis

Univariate analysis

In univariate analysis we explore each variable by itself

Ratio variables

For ratio variables the function plot_univariate_continuous generates a single plot showing all the basic information needed to describe it

source

plot_univariate_continuous

 plot_univariate_continuous (df:pandas.core.frame.DataFrame, var:str,
                             var_name:str, ax)

	Type	Details
df	DataFrame	Data
var	str	Variable to plot
var_name	str	Variable name
ax		Axes on which to draw the plot

To see the function in action we use the diamonds dataset provided with seaborn

diamonds = sns.load_dataset('diamonds')
diamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

This dataset only includes variables with two levels of measurement (ratio and ordinal), the variables can be classified as,

diamonds_ratio = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
diamonds_ordinal = ['cut', 'color', 'clarity']

The density plot of the carat feature is

sns.kdeplot(data=diamonds, x='carat');

We create an axis in the whitegrid style and call plot_univariate_continuous

with sns.axes_style('whitegrid'):
    fig, ax = plt.subplots(figsize=(8,4))
plot_univariate_continuous(diamonds, 'carat', 'Carat', ax);

findfont: Font family ['Century Gothic'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Century Gothic'] not found. Falling back to DejaVu Sans.

The resulting figure shows an histogram plot made using seaborn.histplot with stat='percent'. It includes a vertical line drawn at the mean of the data and uses colors to distiguish three groups: the first quartile, the fourth quartile and the second and third quartile (together)

Ordinal features

Each of the ordinal features have the following categories ordered from best to worst,

for feat in diamonds_ordinal:
    print(f'Categories in {feat}:')
    print(diamonds[feat].unique())
    print('\n')

Categories in cut:
['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']


Categories in color:
['E', 'I', 'J', 'H', 'F', 'G', 'D']
Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']


Categories in clarity:
['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF']
Categories (8, object): ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1']

Bivariate analysis

The goal is a fuction that can calculate a measure of dependence between all the features in a given dataset.

source

strength_of_assoc

 strength_of_assoc (df:pandas.core.frame.DataFrame, ratio_vars:list=None,
                    ordinal_vars:list=None, nominal_vars:list=None,
                    binary_vars:list=None)

	Type	Default	Details
df	DataFrame		Data
ratio_vars	list	None	Columns in `df` with ratio variables
ordinal_vars	list	None	Columns in `df` with ordinal variables
nominal_vars	list	None	Columns in `df` with nominal variables
binary_vars	list	None	Columns in `df` with binary variables

We say a variable has a ratio level of measurement if it is a variable for which ratios are meaningful. Ratio variables have all the properties of interval variables plus a real absolute zero.

For the diamonds dataset we have

diamonds_rcorr = strength_of_assoc(diamonds, diamonds_ratio)
diamonds_rcorr.head()

	feat_1	feat_2	value	metric	assoc_strength
20	y	z	0.952006	Pearson correlation coefficient	strong
2	carat	price	0.921591	Pearson correlation coefficient	strong
3	carat	x	0.975094	Pearson correlation coefficient	strong
4	carat	y	0.951722	Pearson correlation coefficient	strong
5	carat	z	0.953387	Pearson correlation coefficient	strong

strength_of_assoc returns the correlation between each of the n (n - 1) / 2 pair of variables where n is the number of ratio variables

assert len(diamonds_rcorr.index) == 21

source

soa_graph

 soa_graph (cdf:pandas.core.frame.DataFrame, min_strength:str='strong')

	Type	Default	Details
cdf	DataFrame		A dataframe as output by `ratio_corr`
min_strength	str	strong	Threshold for high correlation

diamonds_high_soa, diamonds_soa_graph = soa_graph(diamonds_rcorr)

nx.draw(diamonds_soa_graph, with_labels=True)

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[205534, 302607], [33395, 71466]])
table = np.array([[205534, 302607], [40896, 63965]])

res = chi2_contingency(table)

res

Chi2ContingencyResult(statistic=75.75518055236239, pvalue=3.2110752959071082e-18, dof=1, expected_freq=array([[204275.33128766, 303865.66871234],
       [ 42154.66871234,  62706.33128766]]))

res.statistic

AttributeError: 'tuple' object has no attribute 'statistic'