import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF

import numpy as np
import pandas as pd
import scipy.stats
To look at various normality tests, we will import some data of average wind speed sampled every 10 minutes:
data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/wind_speed_laurel_nebraska.csv')
df = data[0:10]

table = FF.create_table(df)
py.iplot(table, filename='wind-data-sample')
In statistical analysis, it is always important to be as precise as possible in our language. In general, for a normality test we are testing the
null hypothesis that our 1D data is sampled from a population that has a
Normal Distribution. We assume a significance level of $0.05$ (equivalently, a $95\%$ confidence level) for our tests unless otherwise stated.
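As a sketch of this decision rule, consider a synthetic sample that is normal by construction (the variable names below are illustrative, not part of the tutorial's data):

```python
import numpy as np
import scipy.stats

# Draw a sample from a known Normal Distribution.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)

# The null hypothesis is that the sample came from a normal population.
statistic, p_value = scipy.stats.shapiro(normal_sample)

# We reject the null hypothesis only when p < alpha; for a truly normal
# sample the p-value will usually land above 0.05.
alpha = 0.05
reject_normality = p_value < alpha
```

The decision is read off the p-value alone: at a significance level of 0.05, a p-value below 0.05 means rejecting normality, and a p-value above it means the test found no evidence against normality.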
For more information on the choice of 0.05 for a significance level, check out this page.
The Shapiro-Wilk normality test is reputedly better suited to smaller datasets.
x = data['10 Min Sampled Avg']

shapiro_results = scipy.stats.shapiro(x)

matrix_sw = [
    ['', 'DF', 'Test Statistic', 'p-value'],
    ['Sample Data', len(x) - 1, shapiro_results[0], shapiro_results[1]]
]

shapiro_table = FF.create_table(matrix_sw, index=True)
py.iplot(shapiro_table, filename='shapiro-table')
Since our p-value is much less than our significance level of 0.05, we have good evidence to reject the null hypothesis at the 0.05 significance level.
The Kolmogorov-Smirnov test can be applied more broadly than Shapiro-Wilk, since it compares any two distributions against each other, not necessarily one sample distribution to a normal one. The test can be one-sided or two-sided, but the two-sided version applies only if both distributions are continuous.
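To illustrate the two-distribution case, SciPy's `ks_2samp` compares two empirical samples directly, while `kstest` (used below) compares one sample against a named reference CDF. This is a sketch on synthetic data, not part of the tutorial's analysis:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
sample_a = rng.normal(0.0, 1.0, 300)        # normal sample
sample_b = rng.exponential(1.0, 300)        # deliberately non-normal sample

# Two-sided two-sample Kolmogorov-Smirnov test: the null hypothesis is
# that both samples were drawn from the same distribution.
result = scipy.stats.ks_2samp(sample_a, sample_b)
```

Because the two underlying distributions are very different, the p-value here comes out far below 0.05 and the test rejects the hypothesis that the samples share a distribution.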
ks_results = scipy.stats.kstest(x, cdf='norm')

matrix_ks = [
    ['', 'DF', 'Test Statistic', 'p-value'],
    ['Sample Data', len(x) - 1, ks_results[0], ks_results[1]]
]

ks_table = FF.create_table(matrix_ks, index=True)
py.iplot(ks_table, filename='ks-table')
Since our p-value is read as 0.0 (meaning it is "practically" 0 given the decimal accuracy of the test), we have strong evidence to reject the null hypothesis.
The Anderson-Darling test is derived from the Kolmogorov-Smirnov test and is used in a similar way: to test the null hypothesis that data is sampled from a population that follows a particular distribution.
anderson_results = scipy.stats.anderson(x)
print(anderson_results)
AndersonResult(statistic=2.653698947239036, critical_values=array([ 0.566, 0.645, 0.773, 0.902, 1.073]), significance_level=array([ 15. , 10. , 5. , 2.5, 1. ]))
Note that the Anderson-Darling test reports critical values rather than a p-value, so the table compares the test statistic against the critical value at the 5% significance level:

matrix_ad = [
    ['', 'DF', 'Test Statistic', 'Critical Value (5%)'],
    ['Sample Data', len(x) - 1, anderson_results[0], anderson_results[1][2]]
]

anderson_table = FF.create_table(matrix_ad, index=True)
py.iplot(anderson_table, filename='anderson-table')
Since the test statistic (about 2.65) exceeds the critical value at the 5% significance level (0.773), we once again have good evidence to reject the null hypothesis.
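The decision rule for Anderson-Darling can be read off programmatically: instead of comparing a p-value against alpha, we compare the statistic against the critical value for the chosen significance level. A sketch on synthetic data (names here are illustrative):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
sample = rng.exponential(1.0, 200)          # non-normal on purpose

result = scipy.stats.anderson(sample, dist='norm')

# significance_level is given in percent: [15., 10., 5., 2.5, 1.]
idx_5pct = list(result.significance_level).index(5.0)

# Reject normality at the 5% level when the statistic exceeds the
# corresponding critical value.
reject_at_5pct = result.statistic > result.critical_values[idx_5pct]
```

For a strongly skewed sample like this one, the statistic lands far above the 5% critical value, so normality is rejected.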
The D'Agostino-Pearson test combines measures of skewness and kurtosis (the asymmetry and the tailedness of the distribution, respectively) into a single omnibus statistic.
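The two ingredients of this test can also be inspected directly with `scipy.stats.skew` and `scipy.stats.kurtosis`; both should hover near 0 for a normal sample. A sketch on synthetic data, separate from the wind dataset:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, 1000)

# Sample skewness (~0 for a normal population).
skewness = scipy.stats.skew(sample)

# Excess kurtosis, i.e. relative to the normal value of 3 (~0 for normal).
kurt = scipy.stats.kurtosis(sample)

# normaltest folds both measures into one chi-squared-distributed statistic.
stat, p = scipy.stats.normaltest(sample)
```

Seeing both measures near 0 gives some intuition for why the combined test fails to reject normality on a genuinely normal sample.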
dagostino_results = scipy.stats.mstats.normaltest(x)

matrix_dp = [
    ['', 'DF', 'Test Statistic', 'p-value'],
    ['Sample Data', len(x) - 1, dagostino_results[0], dagostino_results[1]]
]

dagostino_table = FF.create_table(matrix_dp, index=True)
py.iplot(dagostino_table, filename='dagostino-table')
Our p-value is very close to 0 and well below our significance level of 0.05, so we have good evidence once again to reject the null hypothesis.
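As a wrap-up, the p-value-based tests above can be run side by side on a single sample. This sketch uses a deliberately skewed synthetic sample (lognormal), so all three tests should agree in rejecting normality; Anderson-Darling is omitted since it reports critical values rather than a p-value:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(4)
sample = rng.lognormal(0.0, 0.5, 400)       # skewed, so tests should reject

# Collect the p-value from each test (index 1 of each result tuple).
p_values = {
    'shapiro': scipy.stats.shapiro(sample)[1],
    'kstest': scipy.stats.kstest(sample, 'norm')[1],
    'normaltest': scipy.stats.normaltest(sample)[1],
}

# Apply the same 0.05 decision rule to every test.
rejections = {name: p < 0.05 for name, p in p_values.items()}
```

Running several tests this way is a useful sanity check: when the tests agree, the conclusion is much more robust than any single p-value on its own.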