Summary
This notebook presents statistical analysis in Python. It covers bootstrapping, confidence intervals, test statistics (one-sided and two-sided tests, Type I and Type II errors), and A/B testing with the z-test, t-test, f-test, and chi-square test, as well as effect size. All of these approaches are implemented in Python. Two realistic datasets are used in this work: 1- the 2008 US swing state election results from Kaggle, 2- dimensions of finch beaks from datadryad.
The Python functions and data files required to run this notebook are on my GitHub page.
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import scipy.stats as st
import scipy.linalg
from scipy.stats import t
from scipy.stats import f
from scipy.stats import ttest_ind
import warnings
warnings.filterwarnings('ignore')
from functions import *  # import the required functions to run this notebook
Bootstrapping and Confidence Interval¶
# construct a synthetic population
np.random.seed(42)
size=100
synth = np.random.normal(loc=40,scale=10 , size=size)+np.random.randint(0,15 , size=size)
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(14,4), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
val=synth
EDA_plot.histplt (val,bins=15,title='Histogram of Artificial Gaussian Data',xlabl=None,
ylabl='Percentage',xlimt=(0,80),ylimt=(0,0.2),axt=ax1,
scale=1.25,loc=1,font=10,color='#8DC63F')
ax2=plt.subplot(1,2,2)
title='ECDF of Artificial Gaussian Data'
EDA_plot.CDF_plot(val,nvar=1,label=False,colors='#8DC63F',xlabel=None,title=title,ylabel='Cumulative Probability',bins=20,
xlim=(0,80),ylim=(0,1),linewidth=3.5,loc=4,axt=ax2,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=10)
plt.show()
# draw samples from population
sample = np.random.choice(synth, size=len(synth))
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(14,4), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
# Calculate distribution of the mean by bootstrapping (10,000 samples)
n_sample=10000
bs_replicates=np.empty(n_sample)
for _ in range(n_sample):
bs_replicates[_]=stats.bootstrap_1D_replicate(synth, np.mean,seed=_+100)
EDA_plot.histplt (bs_replicates,bins=25,title=f'Histogram of Mean for {n_sample} Bootstrappings',xlabl=None,
ylabl='Percentage',xlimt=(40,50),ylimt=(0,0.15),axt=ax1,
scale=1.25,loc=1,font=10,color='#8DC63F')
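The helper stats.bootstrap_1D_replicate comes from the functions module on the GitHub page. A minimal sketch of what such a helper presumably does (an assumption, not the exact implementation) is:
def bootstrap_1d_replicate_sketch(data, func, seed=None):
    # resample the data with replacement and apply a summary statistic
    rng = np.random.default_rng(seed)
    resample = rng.choice(data, size=len(data), replace=True)
    return func(resample)

# one bootstrap replicate of the mean of the synthetic data
print(bootstrap_1d_replicate_sketch(synth, np.mean, seed=100))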
The uncertainty in the sampled mean can also be calculated from theory. For example, what is the uncertainty in the mean for a 95% confidence interval? We already calculated the distribution of the sampled mean by bootstrapping, but we can obtain it without resampling. By the Central Limit Theorem, for a large enough sample size n the distribution of the sample mean approaches a normal distribution. So the distribution of the sample mean is Gaussian (normal), and its expected value equals the population mean: if a population has mean $\mu$, then the mean of the sampling distribution $\mu_{s}$ satisfies $\mu=\mu_{s}$. The variance of the sampled mean is calculated by:
$\large\sigma^{2}_{\mu_{s}}=\frac{\sigma^{2}}{n_{ind}}$ where $\sigma^{2}$ is the variance of data and $n_{ind}$ is number of independent data.
A confidence interval can then be used to quantify the uncertainty in the mean (see Figure below). For example:
- 68.0% of the sample means fall within 1 standard deviation: $\large m \pm \sqrt{\frac{\sigma^{2}}{n_{ind}}}$
- 95.0% of the sample means fall within 2 standard deviations: $\large m \pm 2\sqrt{\frac{\sigma^{2}}{n_{ind}}}$
- 99.7% of the sample means fall within 3 standard deviations: $\large m \pm 3\sqrt{\frac{\sigma^{2}}{n_{ind}}}$
See here for more information. The Figure below compares the uncertainty in the mean calculated from bootstrapping and from theory (confidence interval).
font = {'size' : 8}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(20,5), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
m=np.mean(sample)   # mean of the sample
vr=np.var(sample)   # variance of the sample
sd_m=np.sqrt(vr/(len(sample)))
low_lim_t=m-2*sd_m
up_lim_t=m+2*sd_m
EDA_plot.histplt (bs_replicates,bins=25,title=f'95% Confidence Interval of Mean from Resampling (Bootstrapping) \n Compared with Theory',xlabl=None,
ylabl='Percentage',xlimt=(40,50),ylimt=(0,0.20),axt=ax1,
scale=1.25,loc=1,font=10,color='#8DC63F')
plt.annotate(text='', xy=(low_lim_t-0.05,0.12), xytext=(up_lim_t+0.05,0.12),
arrowprops=dict(arrowstyle='<->, head_width=0.5',lw=3,color='r'))
plt.axvline(x=low_lim_t,linestyle='--',color='r')
plt.axvline(x=up_lim_t,linestyle='--',color='r')
txt= f'Lower Limit from Theory= {np.round(low_lim_t,1)} \nUpper Limit from Theory= {np.round(up_lim_t,1)} '
plt.text(low_lim_t+0.35, 0.1234, txt, color='r',fontsize=10.5)
low_lim_s,up_lim_s=np.percentile(bs_replicates,[2.5,97.5])
plt.annotate(text='', xy=(low_lim_s-0.05,0.15), xytext=(up_lim_s+0.05,0.15),
arrowprops=dict(arrowstyle='<->, head_width=0.5',lw=3,color='b'))
plt.axvline(x=low_lim_s,linestyle='--',color='b')
plt.axvline(x=up_lim_s,linestyle='--',color='b')
txt= f'Lower Limit from Resampling= {np.round(low_lim_s,1)} \nUpper Limit from Resampling= {np.round(up_lim_s,1)} '
plt.text(low_lim_s+0.35, 0.1534, txt, color='b',fontsize=10.5)
plt.show()
As you can see, the upper and lower limits from bootstrapping are very close to the theoretical calculation. Therefore, instead of going through resampling, the theoretical calculation can be applied, especially since resampling can be time-consuming and computationally expensive when the data size is large.
Pair Bootstrapping¶
If we want to quantify uncertainty in a bivariate distribution (x and y), pair bootstrapping can be applied. The data are the 2008 US swing state election results downloaded from https://www.kaggle.com/datasets/aman1py/swing-states?resource=download
df_swing=pd.read_csv('./Data/2008_swing_states.csv')
df_swing
 | state | county | total_votes | dem_votes | rep_votes | dem_share
---|---|---|---|---|---|---
0 | PA | Erie County | 127691 | 75775 | 50351 | 60.08 |
1 | PA | Bradford County | 25787 | 10306 | 15057 | 40.64 |
2 | PA | Tioga County | 17984 | 6390 | 11326 | 36.07 |
3 | PA | McKean County | 15947 | 6465 | 9224 | 41.21 |
4 | PA | Potter County | 7507 | 2300 | 5109 | 31.04 |
... | ... | ... | ... | ... | ... | ... |
217 | OH | Hamilton County | 425086 | 225213 | 195530 | 53.53 |
218 | OH | Highland County | 19186 | 6856 | 11907 | 36.54 |
219 | OH | Hocking County | 12961 | 6259 | 6364 | 49.58 |
220 | OH | Licking County | 82356 | 33932 | 46918 | 41.97 |
221 | OH | Madison County | 17454 | 6532 | 10606 | 38.11 |
222 rows × 6 columns
#font = {'size' : 12}
#plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(7, 5), dpi= 100, facecolor='w', edgecolor='k')
# 1000 pair bootstrappings by resampling from the bivariate distribution (x,y)
n_sample=1000
x=df_swing['total_votes']
y=df_swing['dem_share']
EDA_plot.CrossPlot(x,y,title=f'Fitted Lines of Pair Bootstrapping for {n_sample} Resamples',
xlabl='total votes',ylabl='percent of dem_vote',loc=1,xlimt=(np.nanmin(x),np.nanmax(x)),
ylimt=(np.nanmin(y),np.nanmax(y)),axt=ax1,scale=1.2,alpha=0.9,markersize=6,marker='bo',font=10)
a_pb=[]
b_pb=[]
for _ in range(n_sample):
bs_val1,bs_val2=stats.bootstrap_2D_replicate(x,y,seed=_+100)
Lfunc1=np.polyfit(bs_val1,bs_val2,1)
vEst=Lfunc1[0]*x+Lfunc1[1]
a_pb.append(Lfunc1[0])
b_pb.append(Lfunc1[1])
ax1.plot(x, vEst,'g-',linewidth=1, alpha=0.1)
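stats.bootstrap_2D_replicate is also defined in the functions module; pair bootstrapping presumably resamples row indices with replacement so that each (x, y) pair stays together. A minimal sketch (an assumption, not the exact implementation):
def bootstrap_pair_replicate_sketch(x, y, seed=None):
    # resample indices with replacement, keeping (x, y) pairs intact
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x), size=len(x))
    return np.asarray(x)[idx], np.asarray(y)[idx]

bs_x, bs_y = bootstrap_pair_replicate_sketch(x, y, seed=100)
slope, intercept = np.polyfit(bs_x, bs_y, 1)   # one bootstrapped fitted line
print(slope, intercept)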
LU Bootstrapping¶
Missing data occur when no value is stored for a variable in an observation. Missing data is one of the main challenges faced by machine learning. There are several approaches in practice to deal with missing values:
1- Drop column: a naive approach is to remove columns that have missing values. This is not reliable since the information in the non-missing entries of those columns is removed from the dataset.
2- Drop row: another approach is to remove rows that have missing data. This is also not recommended because the number of data points (rows) decreases, which leads to loss of information.
3- Replace with median or mean: a common practice is to replace missing values with the median (or mean) of that column. However, this may lead to unreliable predictions and artifacts, since a large portion of the data would have identical values.
The K-nearest neighbours (K-NN) algorithm applies feature similarity: missing data are imputed by finding the K closest neighbours. K-NN is sensitive to outliers, and it does not consider the uncertainty in missing values. Bootstrapping can be applied to quantify uncertainty in the distribution of each feature and then replace each missing value with a random sample from that distribution. Although this approach measures the uncertainty, the correlation between features is not reproduced after imputation; the correlations and statistics (mean and standard deviation) should be honored after imputation. Other techniques respect the correlation after imputation, such as multivariate imputation by chained equations (MICE) and deep learning, but they can be quite slow and computationally expensive for large datasets. An imputation technique that quantifies the uncertainty associated with missing data is developed in this section. We propose imputing missing data using LU simulation: (lower-upper) triangular decomposition of the correlation matrix. Many theoretical developments of LU simulation have taken place in geostatistics for geomodeling and spatial resampling. A modified LU conditional simulation (Davis 1987) is proposed here, conditioning on the non-missing values to impute the missing data for each well. This approach respects the correlation between features and quantifies the uncertainty in missing values; it is also very fast, requiring much less computational power for large datasets. The procedure is summarized by:
Normal Score Transformation. Quantile-quantile transformation is applied to convert distribution of each feature to Gaussian distribution.
Correlation Matrix of Features. Figure below shows a schematic illustration of correlation matrix $\boldsymbol{\rho}$ for $n$ features. The diagonal elements of this correlation matrix $\boldsymbol{\rho_{11}},\boldsymbol{\rho_{22}},....,\boldsymbol{\rho_{nn}}$ are 1: the correlation of each feature to itself.
Cholesky Decomposition. The correlation matrix can be decomposed by Cholesky decomposition as $\boldsymbol{\rho}=\textbf{LU}$. Figure below shows the lower triangular matrix $\textbf{L}$ obtained from Cholesky decomposition.
Modified LU Conditional Simulation. A vector of uncorrelated standard normal deviates $\mathbf{w}$ with mean 0 and standard deviation 1 is simulated for each feature. The length of $\mathbf{w}$ for each feature is the number of data points (here, the total number of wells in the training set). LU unconditional simulation is simply calculated by $\textbf{y}=\textbf{Lw}$. Figure below-b shows how to generate an LU unconditional simulation, achieving a correlated Gaussian realization $\textbf{y}$. For conditioning on non-missing features, each element of the $\mathbf{w}$ vector with a known value should be converted to $\mathbf{w^{c}}$, which is a function of the non-missing features. The main challenge of this modified LU simulation is the ordering of missing and non-missing features for each row of data: the conditioning is applied only if the non-missing values are placed first, followed by the missing data. So, for each instance (row of data), the non-missing values are placed first and the missing values after them. This requires reconstructing the correlation matrix for each instance to be consistent with the order of features. The ordering does not matter within the missing features or within the non-missing features.
Back-transform from Gaussian to Original Space. The simulated values in Gaussian space should be back-transformed to original space.
The last two steps are repeated for all data in the training set to impute all missing data. Due to random sampling from the distribution of each feature, this approach quantifies uncertainty in missing data while respecting the correlation between features. Moreover, the implementation of the steps above is straightforward (especially with the Python programming language). See the implementation below:
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(7, 5), dpi= 100, facecolor='w', edgecolor='k')
# 1000 LU bootstrappings by sampling from the bivariate distribution (x,y)
n_sample=1000
corr=np.zeros((2,2))
corr[0,0]=corr[1,1]=1.0
corr[0,1]=corr[1,0]=0.54
val=df_swing[['total_votes','dem_share']].values
Samp=stats.LUSim(value=val,corr=corr,nsample=val.shape[0],nsim=n_sample,seed=75)
x=df_swing['total_votes']
y=df_swing['dem_share']
EDA_plot.CrossPlot(x,y,title=f'Fitted Lines of LU Simulation for {n_sample} Resamples',
xlabl='total votes',ylabl='percent of dem_vote',loc=1,xlimt=(np.nanmin(x),np.nanmax(x)),
ylimt=(np.nanmin(y),np.nanmax(y)),axt=ax1,scale=1.2,alpha=0.9,markersize=6,marker='bo')
a_lu=[]
b_lu=[]
for i in range(n_sample):
Lfunc1=np.polyfit(Samp[i][0],Samp[i][1],1)
vEst=Lfunc1[0]*x+Lfunc1[1]
a_lu.append(Lfunc1[0])
b_lu.append(Lfunc1[1])
ax1.plot(x, vEst,'g-',linewidth=1, alpha=0.1)
The step-by-step procedure is (a minimal sketch follows this list):
- Calculate the linear correlation coefficient between u and Cm from all available data, and construct a correlation matrix $\boldsymbol{\rho}$.
- Compute the Cholesky decomposition of the correlation matrix as $\boldsymbol{\rho}=\textbf{L}\textbf{L}^{T}$, and simulate a vector of uncorrelated standard normal deviates $\mathbf{w}$.
- Generate the vector of correlated Gaussian realizations $\textbf{Y}=\textbf{Lw}$, and rank each vector.
- Rank the realizations of u and Cm, and order them based on $\textbf{Y}$.
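A minimal numpy sketch of the unconditional core of this procedure (Cholesky factor times uncorrelated normals), assuming the 2x2 correlation matrix with 0.54 correlation used above; the conditional version in stats.LUSim adds the conditioning and back-transformation steps:
rng = np.random.default_rng(75)
corr = np.array([[1.0, 0.54],
                 [0.54, 1.0]])               # correlation matrix rho
L = np.linalg.cholesky(corr)                 # rho = L @ L.T (lower triangular)
w = rng.standard_normal((2, len(df_swing)))  # uncorrelated standard normal deviates
y = L @ w                                    # correlated Gaussian realizations
print(np.corrcoef(y))                        # off-diagonal entry close to 0.54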
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
label=['Pair','LU']
data_var=len(label)*[None]
data_var[0]=[i*100000 for i in a_pb]
data_var[1]=[i*100000 for i in a_lu]
nvar=2
xs = np.linspace(0,7,1000)
colors = ['r','b']
title=''
title='Pair Bootstrapping versus LU Bootstrapping for a in y=a*x+b'
EDA_plot.KDE(xs,data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Probability',xlim=(1.5,7),ylim=(0,0.01),
LAMBA=0.14,linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
txt= f'LU Bootstrapping has higher Variance'
plt.text(4.8, 0.005, txt, color='g',fontsize=10)
ax1=plt.subplot(2,1,2)
label=['Pair','LU']
data_var=len(label)*[None]
data_var[0]=b_pb
data_var[1]=b_lu
nvar=2
xs = np.linspace(35,45,1000)
colors = ['r','b']
title=''
title='Pair Bootstrapping versus LU Bootstrapping for b in y=a*x+b'
EDA_plot.KDE(xs,data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Probability',xlim=(35,50),ylim=(0,0.006),
LAMBA=0.2,linewidth=2.5,loc=1,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
txt= f'LU Bootstrapping has higher Variance'
plt.text(43, 0.003, txt, color='g',fontsize=10)
plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=0.25)
plt.show()
What is a test statistic?¶
The test statistic is used to calculate the p-value. It contains information about the data that is relevant for deciding whether to reject the null hypothesis. The sampling distribution of the test statistic under the null hypothesis is called the null distribution. When the data show strong evidence against the assumptions in the null hypothesis, the magnitude of the test statistic becomes too large or too small depending on the alternative hypothesis. This causes the test's p-value to become small enough to reject the null hypothesis.
For example, the test statistic for a Z-test is the Z-statistic, which has the standard normal distribution under the null hypothesis. Suppose you perform a two-tailed Z-test with an α of 0.05, and obtain a Z-statistic (also called a Z-value) based on your data of 2.5. This Z-value corresponds to a p-value of 0.0124. Because this p-value is less than α, you declare statistical significance and reject the null hypothesis.
Hypothesis test | Test statistic |
---|---|
Z-test | Z-statistic |
t-tests | t-statistic |
ANOVA | F-statistic |
chi-square tests | Chi-square statistic |
In hypothesis testing, there are two ways to determine whether there is enough evidence from the sample to reject H0 or to fail to reject H0. The most common way is to compare the p-value with a pre-specified value of α, where α is the probability of rejecting H0 when H0 is true. However, you can also compare the calculated value of the test statistic with the critical value. In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis.
Information above retrieved from Minitab
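For instance, the Minitab example above (two-tailed Z-test, α = 0.05, Z = 2.5) can be checked in Python with both the p-value and the critical-value approach:
z_stat = 2.5
alpha = 0.05
p_value = 2*(1-st.norm.cdf(abs(z_stat)))    # two-tailed p-value, ~0.0124
z_critical = st.norm.ppf(1-alpha/2)         # critical value, ~1.96
print(p_value < alpha, abs(z_stat) > z_critical)   # both True -> reject H0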
Hypothesis Testing¶
- A hypothesis test is a rule that specifies whether to accept or reject a claim about a population, depending on the evidence provided by a sample of data.
- A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". The alternative hypothesis is the statement you want to be able to conclude is true based on evidence provided by the sample data.
- Based on the sample data, the test determines whether to reject the null hypothesis. You use a p-value, to make the determination. If the p-value is less than the significance level (denoted as α or alpha), then you can reject the null hypothesis.
A common misconception is that statistical hypothesis tests are designed to select the more likely of two hypotheses. However, in designing a hypothesis test, we set the null hypothesis up as what we want to disprove. Because we fix the significance level to be small before the analysis (usually, a value of 0.05 works well), when we reject the null hypothesis, we have statistical evidence that the alternative is true. Conversely, if we fail to reject the null hypothesis, we do not have statistical evidence that the null hypothesis is true. This is because we have not fixed the probability of falsely accepting the null hypothesis to be small.
Null hypothesis (H0)¶
The null hypothesis states that a population parameter (such as the mean, the standard deviation, and so on) is equal to a hypothesized value. The null hypothesis is often an initial claim that is based on previous analyses or specialized knowledge.
Alternative Hypothesis (H1)¶
The alternative hypothesis states that a population parameter is smaller, greater, or different than the hypothesized value in the null hypothesis. The alternative hypothesis is what you might believe to be true or hope to prove true.
Two-sided¶
Use a two-sided alternative hypothesis (also known as a nondirectional hypothesis) to determine whether the population parameter is either greater than or less than the hypothesized value. A two-sided test can detect when the population parameter differs in either direction, but has less power than a one-sided test.
For example, a researcher has results for a sample of students who took a national exam at a high school. The researcher wants to know if the scores at that school differ from the national average of 720. A two-sided alternative hypothesis (also known as a nondirectional hypothesis) is appropriate because the researcher is interested in determining whether the scores are either less than or greater than the national average. (H0: μ = 720 vs. H1: μ≠ 720)
One-sided¶
Use a one-sided alternative hypothesis (also known as a directional hypothesis) to determine whether the population parameter differs from the hypothesized value in a specific direction. You can specify the direction to be either greater than or less than the hypothesized value. A one-sided test has greater power than a two-sided test, but it cannot detect whether the population parameter differs in the opposite direction.
For example, a researcher has exam results for a sample of students who took a training course for a national exam. The researcher wants to know if trained students score above the national average of 720. A one-sided alternative hypothesis (also known as a directional hypothesis) can be used because the researcher is specifically hypothesizing that scores for trained students are greater than the national average. (H0: μ = 720 vs. H1: μ > 720)
For all two-tailed tests, α (significance level) should be divided by 2 before finding the right critical value.
Type I and Type II Errors¶
No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of making an incorrect conclusion. When you do a hypothesis test, two types of errors are possible: type I and type II. The risks of these two errors are inversely related and determined by the level of significance and the power for the test. Therefore, you should determine which error has more severe consequences for your situation before you define their risks.
Type I error¶
When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. To lower this risk, you must use a lower value for α. However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists.
Type II error¶
When the null hypothesis is false and you fail to reject it, you make a type II error. The probability of making a type II error is β, which depends on the power of the test. You can decrease your risk of committing a type II error by ensuring your test has enough power. You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.
Confidence Interval to Reject Null Hypothesis¶
Statistical significance itself doesn't imply that your results have practical consequence. If you use a test with very high power, you might conclude that a small difference from the hypothesized value is statistically significant. However, that small difference might be meaningless to your situation. You should use your specialized knowledge to determine whether the difference is practically significant.
For example, suppose you are testing whether the population mean (μ) for hours worked at a manufacturing plant is equal to 8. If μ is not equal to 8, the power of your test approaches 1 as the sample size increases, and the p-value approaches 0.
With enough observations, even trivial differences between hypothesized and actual parameter values are likely to become significant. Confidence intervals (if applicable) are often more useful than hypothesis tests because they provide a way to assess practical significance in addition to statistical significance. They help you determine what a parameter value is, instead of what it is not.
Suppose that you do a hypothesis test. Remember that the decision to reject the null hypothesis (H0) or fail to reject it can be based on the p-value and your chosen significance level (also called α). If the p-value is less than or equal to α, you reject H0; if it is greater than α, you fail to reject H0.
Your decision can also be based on the confidence interval (or bound) calculated using the same α. For example, the decision for a test at the 0.05 level of significance can be based on the 95% confidence interval:
If the reference value specified in H0 lies outside the interval (that is, is less than the lower bound or greater than the upper bound), you can reject H0.
If the reference value specified in H0 lies within the interval (that is, is not less than the lower bound or greater than the upper bound), you fail to reject H0.
Increase the Power of a Hypothesis Test¶
The power of a hypothesis test is the probability that the test correctly rejects the null hypothesis. The power of a hypothesis test is affected by the sample size, the difference, the variability of the data, and the significance level of the test.
- Use a larger sample: a larger sample provides more information about the population and, thus, more power. Using a larger sample is often the most practical way to increase power (a sample-size sketch follows this list).
- Use a higher significance level (also called alpha or α): a higher significance level increases the probability that you reject the null hypothesis. However, be cautious, because you do not want to reject a null hypothesis that is actually true. Rejecting a null hypothesis that is true is called a type I error.
- Choose a larger value for the difference: it is easier to detect larger differences in population means or proportions.
- Use a one-sided hypothesis: a directional hypothesis has more power to detect the difference that you specify in the direction that you specify. (The direction is either less than or greater than.) However, be cautious, because a directional hypothesis cannot detect a difference that is in the opposite direction.
- Improve your process: for a hypothesis test of means (1-sample Z, 1-sample t, 2-sample t, and paired t), improving your process decreases the standard deviation. When the standard deviation is smaller, the power increases and smaller differences can be detected.
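As an illustration of the sample-size point above, the required sample size for a target power can be estimated with the statsmodels library (not used elsewhere in this notebook, so this is only a sketch):
from statsmodels.stats.power import TTestIndPower

# sample size per group to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power, for a two-sided two-sample t-test
analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                  ratio=1.0, alternative='two-sided')
print(round(n_required))   # roughly 64 per group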
The p-value gives us the probability of observing what we observed, given that the null hypothesis is true. It does not tell us the probability that the null hypothesis is true.
A/B Testing¶
Dimension of the finch beak data is downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.g6g3h.
- Hypothesis test: did the beaks get deeper from 1975 to 2012?
df_beaks_1975=pd.read_csv('./Data/scandens_beaks_1975.csv')
df_beaks_2012=pd.read_csv('./Data/scandens_beaks_2012.csv')
df_beaks_2012
 | band | species | blength | bdepth
---|---|---|---|---
0 | 19026 | scandens | 14.3 | 9.4 |
1 | 19028 | scandens | 12.5 | 8.9 |
2 | 19029 | scandens | 13.7 | 9.5 |
3 | 19094 | scandens | 13.8 | 11.0 |
4 | 19122 | scandens | 12.0 | 8.7 |
... | ... | ... | ... | ... |
122 | 21295 | scandens | 14.2 | 9.3 |
123 | 21297 | scandens | 13.0 | 9.8 |
124 | 21340 | scandens | 14.6 | 8.9 |
125 | 21342 | scandens | 13.1 | 9.8 |
126 | 21347 | scandens | 15.2 | 9.1 |
127 rows × 4 columns
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(20,5), dpi= 100, facecolor='w', edgecolor='k')
label=['1975','2012']
colors = ['r','b']
data_var=len(label)*[None]
data_var[0]=np.array(df_beaks_1975['Beak length, mm'])
data_var[1]=np.array(df_beaks_2012['blength'])
ax1=plt.subplot(1,2,1)
title='ECDF of scandens for beaks for 1975 and 2012'
EDA_plot.CDF_plot(data_var,nvar=2,label=label,colors=colors,xlabel=None,title=title,ylabel='Cumulative Probability',
bins=100,xlim=(10,20),ylim=(0,1),linewidth=3.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=10.5)
plt.show()
m1=data_var[0].mean()
m2=data_var[1].mean()
var1=np.var(data_var[0])
var2=np.var(data_var[1])
n1=len(data_var[0])
n2=len(data_var[1])
z=(m1-m2)/(np.sqrt(var1/n1+var2/n2))
print(f"m1={m1}, m2={m2}, var1={var1}, var2={var2}, n1={n1}, n2={n2}, z={z} ")
m1=14.120919540229885, m2=13.421023622047242, var1=0.5631577751354206, var2=0.5148800545601089, n1=87, n2=127, z=6.821438385963158
Two-sample z-test for Comparing Two Means¶
The z-statistic for the difference between the means of two normally distributed, independent populations with known standard deviations is:
$\large z=\frac{m_{1}-m_{2}-\Delta}{\sqrt{\frac{\sigma^{2}_{1}}{n_{1}}+\frac{\sigma^{2}_{2}}{n_{2}}}}$
where $m_{1}$ and $m_{2}$ are the means of the two samples, Δ is the hypothesized difference between the population means (0 if testing for equal means), σ1 and σ2 are the standard deviations of the two populations, and n1 and n2 are the sizes of the two samples.
Null Hypothesis: $H_{0}: m_{1}=m_{2}$ or $H_{0}: m_{1}-m_{2}=0$
Alternative Hypothesis: $H_{a}: m_{1}\neq m_{2}$
$\large z=\frac{14.12-13.42-0}{\sqrt{\frac{0.563}{87}+\frac{0.515}{127}}}=6.82$
p_value=(1-st.norm.cdf(6.82))*2
print(f"p-value={p_value}")
p-value=9.104050846531209e-12
# Calculate distribution of the difference of means by bootstrapping (10,000 samples)
n_sample=10000
bs_replicates=np.empty(n_sample)
for _ in range(n_sample):
tmp1=stats.bootstrap_1D_replicate(data_var[0], np.mean,seed=_+50)
tmp2=stats.bootstrap_1D_replicate(data_var[1], np.mean,seed=_+100)
bs_replicates[_]=tmp1-tmp2
font = {'size' : 8}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(25,5), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
m=m1-m2
sd_m= np.sqrt(var1/n1+var2/n2)
low_lim_t=m-2*sd_m
up_lim_t=m+2*sd_m
EDA_plot.histplt (bs_replicates,bins=25,title=f'95% Confidence Interval (m±2σ) for Bootstrapping of m1-m2 \n Compared with Theory',xlabl=None,
ylabl='Percentage',xlimt=(-0.1,1.2),ylimt=(0,0.20),axt=ax1,
scale=1.25,loc=4,font=10,color='#8DC63F')
plt.annotate(text='', xy=(low_lim_t,0.12), xytext=(up_lim_t,0.12),
arrowprops=dict(arrowstyle='<->, head_width=0.5',lw=3,color='r'))
plt.axvline(x=low_lim_t,linestyle='--',color='r')
plt.axvline(x=up_lim_t,linestyle='--',color='r')
txt= f'Lower Limit from Theory= {np.round(low_lim_t,2)} \nUpper Limit from Theory= {np.round(up_lim_t,2)} '
plt.text(low_lim_t+0.03, 0.1234, txt, color='r',fontsize=10.5)
low_lim_s,up_lim_s=np.percentile(bs_replicates,[2.5,97.5])
plt.annotate(text='', xy=(low_lim_s,0.15), xytext=(up_lim_s,0.15),
arrowprops=dict(arrowstyle='<->, head_width=0.5',lw=3,color='b'))
plt.axvline(x=low_lim_s,linestyle='--',color='b')
plt.axvline(x=up_lim_s,linestyle='--',color='b')
txt= f'Lower Limit from Resampling= {np.round(low_lim_s,2)} \nUpper Limit from Resampling= {np.round(up_lim_s,2)} '
plt.text(low_lim_s+0.01, 0.158, txt, color='b',fontsize=10.5)
plt.text(0+0.01,0.158, 'Null Hypothesis \n m1-m2=0',fontsize=11,color='k',
fontweight='bold',style='oblique', ha='center',
va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5))
plt.axvline(x=0,linestyle='--',color='y',lw=2)
plt.show()
The figure above shows the 95% confidence interval for m1-m2. The value 0 lies far outside the interval on the left, so the means m1 and m2 are statistically different; we can say m1 > m2. The p-value is 9.10e-12: the probability of observing a difference at least this extreme if the null hypothesis were true. Since it is far below the significance level, we reject the null hypothesis and accept the alternative.
In practice, the z-test is not used often when we have a small number of samples (<100), because the two population standard deviations σ1 and σ2 are usually unknown. Instead, sample standard deviations and the t-distribution are used.
Two-sample t-test for Comparing Two Means¶
The t-distribution, also known as Student's t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape but has heavier tails. t-distributions have a greater chance of extreme values than normal distributions, hence the fatter tails. The z-test assumes that the population standard deviation is known; the t-distribution does not make this assumption. The t-distribution is defined by the degrees of freedom: as the degrees of freedom increase, the t-distribution approaches the normal distribution. See Figure below:
The table below shows three sets of null and alternative hypotheses. Each makes a statement about the difference d between the mean of one population μ1 and the mean of another population μ2. (In the table, the symbol ≠ means " not equal to ".)
Null hypothesis | Alternative hypothesis | Number of tails |
---|---|---|
μ1 - μ2 = d | μ1 - μ2 ≠ d | 2 |
μ1 - μ2 ≥ d | μ1 - μ2 < d | 1
μ1 - μ2 ≤ d | μ1 - μ2 > d | 1
Two-Tailed Test¶
Null Hypothesis: $H_{0}: \mu_{1}=\mu_{2}$ or $H_{0}: \mu_{1}-\mu_{2}=0$
Alternative Hypothesis: $H_{a}: \mu_{1}\neq \mu_{2}$
Standard error. Compute the standard error (SE) of the sampling distribution.
$SE=\sqrt{\frac{\sigma^{2}_{1}}{n_{1}}+\frac{\sigma^{2}_{2}}{n_{2 }}}$
where 𝜎1 is the standard deviation of sample 1, 𝜎2 is the standard deviation of sample 2, n1 is the size of sample 1, and n2 is the size of sample 2.
Degrees of freedom. The degrees of freedom (DF) is:
$ DF=\large \frac{(\frac{\sigma^{2}_{1}}{n_{1}}+\frac{\sigma^{2}_{2}}{n_{2 }})^{2}}{ \frac{(\frac{\sigma^{2}_{1}}{n_{1}})^2}{n_{1}-1}+ \frac{(\frac{\sigma^{2}_{2}}{n_{2}})^2}{n_{2}-1}}$
If DF does not compute to an integer, round it off to the nearest whole number. Some texts suggest that the degrees of freedom can be approximated by the smaller of n1 - 1 and n2 - 1; but the above formula gives better results.
t-statistic (t) defined by the following equation.
$\large t=\frac{(\mu_{1}-\mu_{2})-d}{SE}$
where μ1 is the mean of sample 1, μ2 is the mean of sample 2, d is the hypothesized difference between population means, and SE is the standard error.
- P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the t statistic, having the degrees of freedom computed above.
$\large t=\frac{14.12-13.42-0}{\sqrt{\frac{0.563}{87}+\frac{0.515}{127}}}=6.82$
DF=((var1/n1+var2/n2)**2)/((var1/n1)**2/(n1-1)+(var2/n2)**2/(n2-1))
print(f"Calculated DF by the equation is {int(DF)}")
print(f"Calculated DF as the sum of the observations in both samples, minus two, is {n1+n2-2}")
t_val=(m1-m2)/(np.sqrt(var1/n1+var2/n2))
print(f"Calculated t test statistic is t={t_val} ")
Calculated DF by the equation is 179
Calculated DF as the sum of the observations in both samples, minus two, is 212
Calculated t test statistic is t=6.821438385963158
The two-tailed p-value that corresponds to a t test statistic of 6.82 with 212 degrees of freedom (n1 + n2 - 2) is:
#calculate p-value
(1-t.cdf(x=6.82, df=212))*2
9.34294863697005e-11
The one-tailed p-value that corresponds to a t test statistic of 6.82 with 212 degrees of freedom ((μ1-μ2)>0) is:
#calculate p-value
(1-t.cdf(x=6.82, df=212))
4.671474318485025e-11
This can be computed directly with the scipy
Python library; ttest_ind tests the null hypothesis that the two distributions have identical average (expected) values:
t_stats=ttest_ind(data_var[0], data_var[1])
t_stats
Ttest_indResult(statistic=6.846559942453622, pvalue=8.027357272898686e-11)
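By default ttest_ind pools the variances (assumes equal population variances). Since the Welch degrees-of-freedom formula above allows unequal variances, the Welch variant can be requested explicitly; it should reproduce the t ≈ 6.82 computed by hand:
# Welch's t-test: does not assume equal population variances
t_stats_welch = ttest_ind(data_var[0], data_var[1], equal_var=False)
print(t_stats_welch)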
One-Tailed Test¶
Note that we now consider hypotheses that constitute a one-tailed test. The null hypothesis will be rejected if the difference between the sample means is bigger than would be expected by chance.
Null Hypothesis: $H_{0}: \mu_{1}-\mu_{2}<=0.5$
Alternative Hypothesis: $H_{a}: \mu_{1}-\mu_{2}>0.5$
t for t-test is calculated by the following equation.
$\large t=\frac{(\mu_{1}-\mu_{2})-d}{SE}$
$\large t=\frac{14.12-13.42-0.5}{\sqrt{\frac{0.563}{87}+\frac{0.515}{127}}}$
t_val=(m1-m2-0.5)/(np.sqrt(var1/n1+var2/n2))
print(f"Calculated t test statistic is t={t_val} ")
Calculated t test statistic is t=1.9482578110029694
#calculate p-value
p_value=(1-t.cdf(x=t_val, df=DF))
print (f'p-value for t=1.94 is p_value={p_value}; since p_value<0.05 significance level, the null hypothesis is rejected')
p-value for t=1.94 is p_value=0.02647205545681497; since p_value<0.05 significance level, the null hypothesis is rejected
f-test to Compare Two Variances¶
An f-test
is a catch-all term for any test that uses the f-distribution. In most cases, when people talk about the f-Test, what they are actually talking about is the f-Test to Compare Two Variances. However, the f-statistic is used in a variety of tests including regression analysis.
A Statistical f-test
uses an f-Statistic to compare two variances, s1 and s2, by dividing them. The result is always a positive number (because variances are always positive). The equation for comparing two variances with the f-test is:
$\large f=\frac{s^{2}_{1}}{s^{2}_{2}}$
You always test whether the population variances are equal when running an f-test; in other words, you assume that the ratio of the variances is equal to 1. Therefore, your null hypothesis will always be that the variances are equal.
Assumptions¶
Several assumptions are made for the test. Your population must be approximately normally distributed (i.e. fit the shape of a bell curve) in order to use the test. Plus, the samples must be independent events. In addition, you’ll want to bear in mind a few important points:
The larger variance should always go in the numerator (the top number) to force the test into a right-tailed test. Right-tailed tests are easier to calculate.
For two-tailed tests, divide alpha by 2 before finding the right critical value.
If you are given standard deviations, they must be squared to get the variances.
If your degrees of freedom aren’t listed in the F Table, use the larger critical value. This helps to avoid the possibility of Type I errors.
lower_tail_prob = 0.025
dof_num = 86
dof_den = 126
f_critical = f.ppf(q=(1-lower_tail_prob), dfn=dof_num, dfd=dof_den)
f_critical
1.4661950905302727
Example problem: conduct a two-tailed f-test on the following samples:
- Beak Length 1975: Variance = 0.563, sample size = 87.
- Beak Length 2012: Variance = 0.515, sample size = 127.
1- hypothesis statements:
- Ho: No difference in variances.
- Ha: Difference in variances.
2- F statistic: put the larger variance in the numerator and the smaller variance in the denominator:
- F Statistic = variance 1/ variance 2 = 0.563 / 0.515 = 1.093
3- Degrees of freedom:
- Sample 1 has 86 df (the numerator).
- Sample 2 has 126 df (the denominator).
4- Alpha level: No alpha was stated in the question, so use 0.05 (the standard “go to” in statistics). This needs to be halved for the two-tailed test, so use 0.025.
5- Find the critical F Value. There are several tables, so make sure you look in the alpha = .025 table. Critical F (86,126) at alpha (0.025) = 1.466.
6- Compare your calculated value (Step 2) to table value (Step 5). If your calculated value is higher than the table value, you can reject the null hypothesis:
F calculated value: 1.093
F value from table: 1.466.
1.093 < 1.466, so we cannot reject the null hypothesis in this case (a scipy-based check is sketched below).
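The same two-tailed f-test can be checked with scipy, using the variances and sample sizes stated above:
var_1975, var_2012 = 0.563, 0.515                  # sample variances from above
n_1975, n_2012 = 87, 127
f_stat = var_1975/var_2012                         # larger variance in the numerator
p_value = 2*(1 - f.cdf(f_stat, dfn=n_1975-1, dfd=n_2012-1))   # two-tailed p-value
print(f"F={f_stat:.3f}, p-value={p_value:.3f}")    # F ~ 1.09, p > 0.05 -> fail to reject H0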
Chi-square tests (Pearson’s Chi-Square)¶
The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek capital letter Chi (X) pronounced “ki” as in kite. In fact, Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable. This can also be considered as feature selection problem between categorical predictor and target.
[https://machinelearningmastery.com/chi-squared-test-for-machine-learning/]
Contingency table¶
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
Here, we take a table that shows the number of men and women buying different types of pets.
Dog | Cat | Bird | Total | |
---|---|---|---|---|
Men | 207 | 282 | 241 | 730 |
Women | 234 | 242 | 232 | 708 |
Total | 441 | 524 | 473 | 1438 |
The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related to each other.
Null hypothesis:
We start by defining the null hypothesis (H0) which states that there is no relation between the variables. An alternate hypothesis would state that there is a significant relation between the two.
The hypothesis can be verified using p-value:
We define a significance level to determine whether the relation between the variables is of considerable significance. Generally a significance level, or alpha value, of 0.05 is chosen. This alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha value is chosen in cases where we expect more precision. If the p-value for the test comes out to be strictly greater than the alpha value, then we fail to reject H0.
Expected Values Table:
Next, we prepare a similar table of calculated (expected) values. Each item in the new table is calculated as (row total × column total) / grand total (a short code sketch after the table shows this computation):
Dog | Cat | Bird | Total | |
---|---|---|---|---|
Men | (730*441)/1438=223.87343533 | (730*524)/1438=266.00834492 | (730*473)/1438=240.11821975 | 730 |
Women | (708*441)/1438=217.12656467 | (524*708)/1438=257.99165508 | (708*473)/1438=232.88178025 | 708 |
Total | 441 | 524 | 473 | 1438 |
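The expected frequencies can be computed with numpy as the outer product of the row and column totals divided by the grand total:
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])
row_totals = observed.sum(axis=1, keepdims=True)   # 730, 708
col_totals = observed.sum(axis=0, keepdims=True)   # 441, 524, 473
expected = row_totals*col_totals/observed.sum()    # (row total x column total) / grand total
print(np.round(expected, 2))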
Chi-Square Table :
We prepare this table by calculating for each item the following:
$\large \chi^{2}_c=\sum\frac{(O_{i}-E_{i})^{2}}{E_{i}}$
where: c=Degrees of freedom, O=Observed value(s), E=Expected value(s)
observed (O) | expected (E) | $\large \frac{(O_{i}-E_{i})^{2}}{E_{i}}$ |
---|---|---|
207 | 223.87343533 | 1.2717579435607573 |
282 | 266.00834492 | 0.9613722161954465 |
241 | 240.11821975 | 0.003238139990850831 |
234 | 217.12656467 | 1.3112758457617977 |
242 | 257.99165508 | 0.991245364156322 |
232 | 232.88178025 | 0.0033387601600580606 |
Total | 4.542228269825232 |
From this table, we obtain the total of the last column, which gives us the calculated value of chi-square. Hence the calculated value of chi-square is 4.542228269825232.
Now, we need to find the critical value of chi-square. We can obtain this from a table. To use this table, we need to know the degrees of freedom for the dataset. The degrees of freedom is defined as : (no. of rows – 1) * (no. of columns – 1). Hence, the degrees of freedom is (2-1) * (3-1) = 2
Now, let us look at the table and find the value corresponding to 2 degrees of freedom and a 0.05 significance level: the tabular or critical value of chi-square here is 5.991. Therefore, H0 is not rejected, since the critical value of chi-square >= the calculated chi-square (5.991 > 4.542228269825232). We can use either the critical value or the p-value to reject or fail to reject H0.
Here is the Python implementation for this.
from scipy.stats import chi2_contingency
# defining the table
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
print('Dependent (reject H0)')
else:
print('Independent (H0 holds true)')
p value is 0.10319714047309392
Independent (H0 holds true)
ANOVA¶
ANOVA stands for Analysis of Variance; it applies an f-test to compare the variances across the means (or averages) of different groups. ANOVA is also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The z-test and t-test can only be applied to find the difference between two groups; when we have three or more groups, we need a different approach, and ANOVA is one of them.
Null Hypothesis: 𝐻0 : the mean of all groups are the same: 𝜇1=𝜇2=𝜇3…
Alternative Hypothesis: 𝐻𝑎: At least 1 difference among the means
For ANOVA, the total variation is made of two parts: 1- variation within the groups, 2- variation between the groups. The f-value is calculated as the ratio of these two variances.
$\Large f=\frac{Variation\,between\,Groups}{Variation\,within\,Groups}$
The higher the f-ratio, the higher the likelihood of rejecting the null hypothesis. If the p-value (the probability of the f-value from the f-distribution table) is less than α (significance level), then the null hypothesis of equal means between groups is rejected.
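scipy provides f_oneway for a one-way ANOVA; below is a hedged example with three synthetic groups (an illustration only, not part of the datasets above):
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group_a = rng.normal(13.4, 0.7, 50)   # synthetic group means chosen for illustration
group_b = rng.normal(13.6, 0.7, 50)
group_c = rng.normal(14.1, 0.7, 50)
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"f={f_stat:.2f}, p-value={p_value:.4f}")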
Effect size¶
Effect size will also be calculated to quantify the magnitude of the difference: trivial, small, medium, or large. Effect size is a quantitative measure of the magnitude of the experimental effect. The larger the effect size, the stronger the relationship between the two variables.
The p-value is not enough! A lower p-value is sometimes interpreted as meaning there is a stronger relationship between two variables. However, statistical significance only means that the observed result would be unlikely if the null hypothesis were true (p-value below the 0.05 significance level). Therefore, a significant p-value tells us that an intervention works, whereas an effect size tells us how much it works.
Cohen's d¶
Cohen's d is an appropriate effect size for the comparison between two means. It can be used, for example, to accompany the reporting of t-test and ANOVA results. It is also widely used in meta-analysis.
To calculate the standardized mean difference between two groups, subtract the mean of one group from the other ($\mu_{1}-\mu_{2}$) and divide the result by the pooled standard deviation of the population from which the groups were sampled:
$\large d=\frac{\mu_{1}-\mu_{2}}{\sigma_{pooled}}$
where $\mu_{1}$ is the mean of the first group, $\mu_{2}$ is the mean of the second group, and $\sigma_{pooled}$ is the pooled standard deviation, calculated as:
$\large \sigma_{pooled}=\sqrt{\frac{\sigma^{2}_{1}+\sigma^{2}_{2}}{2}}$
where $\sigma_{1}$ is the standard deviation of the first group and $\sigma_{2}$ is the standard deviation of the second group. See Effect size and Cohen's D: Definition, Examples, Formulas for more information. Based on the calculated effect size, an appropriate label (Trivial, Small, Medium, Large) can be assigned from the table below; a short code sketch follows the table.
Effect size | Relative Size |
---|---|
0.0 | Trivial |
0.2 | Small |
0.5 | Medium |
0.8 | Large |
1.4 |
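A short sketch of Cohen's d for the beak-length data used earlier (data_var[0] is 1975, data_var[1] is 2012), using the pooled standard deviation defined above:
pooled_sd = np.sqrt((np.var(data_var[0]) + np.var(data_var[1]))/2)
cohens_d = (np.mean(data_var[0]) - np.mean(data_var[1]))/pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")   # ~0.95, a large effect by the table above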