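# Note: in plotly >= 4 the online plotting module lives in the separate
# chart_studio package (chart_studio.plotly); the imports below are the legacy form.
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF

import numpy as np
import pandas as pd
import scipy
import scipy.stats  # explicit import so that scipy.stats.* resolves on older SciPy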
Let us generate some random data from the normal distribution. We will sample 50 points from a normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$, and 50 points from another with mean $\mu = 2$ and variance $\sigma^2 = 1$.
data1 = np.random.normal(0, 1, size=50)
data2 = np.random.normal(2, 1, size=50)
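Because the samples are random, the test statistics and p-values reported later change on every run. The following is a minimal sketch of a reproducible variant using NumPy's `Generator` interface. The seed value `42` and the names `rng`, `rng_again`, and `data1_again` are illustrative assumptions, not part of the original setup.

```python
import numpy as np

# Assumption: the seed 42 is arbitrary and chosen only for illustration.
rng = np.random.default_rng(42)

# Same distributions as above: N(0, 1) and N(2, 1), 50 points each.
# Note that `scale` is the standard deviation, not the variance;
# the two coincide here because sigma^2 = 1.
data1 = rng.normal(loc=0, scale=1, size=50)
data2 = rng.normal(loc=2, scale=1, size=50)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
data1_again = rng_again.normal(loc=0, scale=1, size=50)
print(np.array_equal(data1, data1_again))
```

Fixing the seed is useful when the results need to be compared across runs; leaving it out gives a fresh sample each time.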
The two normal probability density functions (p.d.f.s), plotted on the same axes, look like this:
x = np.linspace(-4, 4, 160)
y1 = scipy.stats.norm.pdf(x)
y2 = scipy.stats.norm.pdf(x, loc=2)

trace1 = go.Scatter(
    x = x,
    y = y1,
    mode = 'lines+markers',
    name='Mean of 0'
)

trace2 = go.Scatter(
    x = x,
    y = y2,
    mode = 'lines+markers',
    name='Mean of 2'
)

data = [trace1, trace2]

py.iplot(data, filename='normal-dists-plot')
A One Sample T-Test is a statistical test used to evaluate the null hypothesis that the mean $m$ of a 1D sample dataset of independent observations is equal to the true mean $\mu$ of the population from which the data is sampled. In other words, our null hypothesis is that $m = \mu$.
For our T-test, we will be using a significance level of 0.05. As a matter of doing ethical science, it is good practice to always state the chosen significance level for a given test before actually conducting the test. This ensures that the analyst cannot modify the significance level afterward to achieve a desired outcome.
For more information on the choice of 0.05 for a significance level, check out this page.
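To make the test concrete, the one-sample statistic is $t = \frac{m - \mu}{s / \sqrt{n}}$, where $s$ is the sample standard deviation and $n$ the sample size, and the two-sided p-value comes from Student's t distribution with $n - 1$ degrees of freedom. The following is a minimal sketch of that computation, checked against `scipy.stats.ttest_1samp`. The helper name `one_sample_t` and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, popmean):
    """Return (t statistic, two-sided p-value) for H0: mean == popmean.

    Hypothetical helper written for illustration only.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    std_err = sample.std(ddof=1) / np.sqrt(n)
    t_stat = (sample.mean() - popmean) / std_err
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom.
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value


rng = np.random.default_rng(0)  # assumption: arbitrary illustrative seed
sample = rng.normal(loc=0, scale=1, size=50)

t_manual, p_manual = one_sample_t(sample, popmean=0)
t_scipy, p_scipy = stats.ttest_1samp(sample, 0)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed pairs should agree up to floating-point error, which is a quick way to confirm what the library call computes.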
true_mu = 0

onesample_results = scipy.stats.ttest_1samp(data1, true_mu)

# ttest_1samp returns (test statistic, p-value); index each field for the table
matrix_onesample = [
    ['', 'Test Statistic', 'p-value'],
    ['Sample Data', onesample_results[0], onesample_results[1]]
]

onesample_table = FF.create_table(matrix_onesample, index=True)
py.iplot(onesample_table, filename='onesample-table')
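The decision is made by comparing the p-value with the significance level, not with the test statistic. When the p-value is greater than $0.05$, as it is for most samples drawn this way, we fail to reject the null hypothesis at the $0.05$ significance level. This is our expected result because the data was sampled from a normal distribution whose mean equals the hypothesized value of $0$.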
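The decision rule for a test at a pre-stated significance level compares the p-value with that level. The following is a minimal sketch of the rule. The function name `decide` and the example p-values are hypothetical, and rejecting when the p-value equals the level exactly is one common convention rather than a universal one.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha.

    Hypothetical helper written for illustration only.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Convention assumed here: reject when p_value <= alpha.
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Illustrative p-values, not results from the data above.
print(decide(0.43))   # fail to reject the null hypothesis
print(decide(0.001))  # reject the null hypothesis
```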
If we have two independently sampled datasets (with equal variance) and are interested in exploring the question of whether the true means $\mu_1$ and $\mu_2$ are identical, that is, whether the data were sampled from the same population, we would use a Two Sample T-Test.
Typically, when a researcher in a field is interested in the effect of a given test variable between two populations, they will take one sample from each population and designate them as the experimental group and the control group. The experimental group is the sample that receives the variable being tested, while the control group does not.
An outcome (e.g. blood pressure) is then observed for all the subjects, and a two-sided t-test can be used to investigate whether the two groups of subjects were sampled from populations with the same true mean, i.e. "Does the drug have an effect?"
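Under the equal-variance assumption, the two-sample statistic divides the difference in sample means by a standard error built from the pooled variance, with $n_1 + n_2 - 2$ degrees of freedom. The following is a minimal sketch of that computation, checked against `scipy.stats.ttest_ind`. The helper name `pooled_two_sample_t`, the group names `control` and `treated`, and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def pooled_two_sample_t(a, b):
    """Return (t statistic, two-sided p-value) for H0: equal population means.

    Hypothetical helper; assumes both populations share the same variance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df
    std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t_stat = (a.mean() - b.mean()) / std_err
    p_value = 2 * stats.t.sf(abs(t_stat), df=df)
    return t_stat, p_value


rng = np.random.default_rng(1)  # assumption: arbitrary illustrative seed
control = rng.normal(loc=0, scale=1, size=50)
treated = rng.normal(loc=2, scale=1, size=50)

t_manual, p_manual = pooled_two_sample_t(control, treated)
# ttest_ind assumes equal variances by default (equal_var=True).
t_scipy, p_scipy = stats.ttest_ind(control, treated)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The sign of the statistic depends only on the order of the two arguments; the two-sided p-value is unaffected by swapping them.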
twosample_results = scipy.stats.ttest_ind(data1, data2)

# ttest_ind returns (test statistic, p-value); index each field for the table
matrix_twosample = [
    ['', 'Test Statistic', 'p-value'],
    ['Sample Data', twosample_results[0], twosample_results[1]]
]

twosample_table = FF.create_table(matrix_twosample, index=True)
py.iplot(twosample_table, filename='twosample-table')
Since our p-value is much less than our significance level of $0.05$, we have strong evidence to reject our null hypothesis of identical means. This is in alignment with our setup, since we sampled from two normal distributions with different means.