T-TEST: used to compare the means of two given samples
Hypothesis testing to check whether alcohol content affects wine quality using a T-Test
Import the libraries
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import researchpy as rc
import statsmodels.api as sm
from scipy.stats import levene
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
Load and read data
df = pd.read_csv("WineQT.csv")
Check data head
df.head()
Check data tail
df.tail()
Check data shape
df.shape
Check data information
df.isnull().sum()
df.info()
Check the unique values in each column
df.nunique()
Preparing data for T Test
Let's compare the mean alcohol content across the wine quality levels
df.groupby('quality')['alcohol'].describe().T
# The means show that higher alcohol content tends to go with higher wine quality
# Select the wines rated 5 and the wines rated 6 as the two samples to compare
sample_01 = df[df['quality']== 5]
sample_02 = df[df['quality']== 6]
print(sample_01.shape,sample_02.shape)
Assumptions for T-Test
1. The variances of the two samples are equal (we will use Levene's test to check this assumption).
2. The residuals between the two groups follow a normal distribution.
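# Down-sample sample_01 to the size of sample_02 so the two samples can be subtracted element-wise later
sample_01 = sample_01.sample(len(sample_02))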
print(sample_01.shape,sample_02.shape)
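The down-sampling step above draws a different random subset on every run. A minimal sketch of a reproducible variant follows, assuming a small made-up table (`toy`) in place of the wine data; the `random_state` value and the names `group_a` and `group_b` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine table.
toy = pd.DataFrame({
    "quality": [5] * 8 + [6] * 5,
    "alcohol": [9.4, 9.8, 9.5, 10.0, 9.2, 9.9, 10.1, 9.6,
                10.5, 11.0, 10.8, 11.2, 10.2],
})

group_a = toy[toy["quality"] == 5]
group_b = toy[toy["quality"] == 6]

# Down-sample both groups to the size of the smaller one.
# A fixed random_state makes the draw repeatable across runs.
n = min(len(group_a), len(group_b))
group_a = group_a.sample(n, random_state=42)
group_b = group_b.sample(n, random_state=42)

print(group_a.shape, group_b.shape)
```

Taking the size from the data instead of hard-coding it keeps the cell working if the file changes.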
Levene's Test
alpha = 0.05
# Use a separate name for the statistic so the scipy `stats` module is not overwritten
levene_stat, Pvalue = stats.levene(sample_01['alcohol'], sample_02['alcohol'])
print(f' Test statistics : {levene_stat} \n Alpha : {alpha} \n P-value : {Pvalue}')
if Pvalue > alpha:
    print('Variances are the same: fail to reject the null hypothesis')
else:
    print('Variances are not the same: reject the null hypothesis')
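To get a feel for how Levene's test responds to unequal spread, the sketch below runs it on synthetic data. The arrays `same_a`, `same_b` and `wide_b` are made-up normal samples, not wine data, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples with the same spread but different means,
# and a third sample with a much larger spread.
same_a = rng.normal(loc=10.0, scale=1.0, size=500)
same_b = rng.normal(loc=10.5, scale=1.0, size=500)
wide_b = rng.normal(loc=10.5, scale=3.0, size=500)

# Levene's test compares spreads, so a shift in the mean alone should not matter.
_, p_same = stats.levene(same_a, same_b)
_, p_diff = stats.levene(same_a, wide_b)

print(f'Equal spread   p-value: {p_same}')
print(f'Unequal spread p-value: {p_diff}')
```

The second p-value should be far below 0.05, because one sample has three times the standard deviation of the other.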
Checking the normality of the residuals
# Standardize the element-wise difference between sample 1 and sample 2 and plot its histogram
# The histogram looks approximately normal
diff = scale(np.array(sample_01['alcohol']) - np.array(sample_02['alcohol']))
plt.figure(figsize=(12,6))
plt.hist(diff)
plt.show()
# My Observation
# Most wines in both quality groups have an alcohol content between about 9% and 12%
plt.figure(figsize=(12,6))
# `fill` replaces the deprecated `shade` argument in recent seaborn versions
sns.kdeplot(sample_01['alcohol'], fill=True, label='sample_01')
sns.kdeplot(sample_02['alcohol'], fill=True, label='sample_02')
plt.legend(fontsize=14)
# Draw each sample mean in the color of its own curve
plt.vlines(x=sample_01['alcohol'].mean(), ymin=0, ymax=0.25, color='C0', linestyle='--')
plt.vlines(x=sample_02['alcohol'].mean(), ymin=0, ymax=0.25, color='C1', linestyle='--')
plt.show()
Q-Q PLOT: plots the quantiles of the sample data against the quantiles of a theoretical distribution
# The residual looks normally distributed if the points closely follow the red line
plt.figure(figsize=(12,6))
stats.probplot(diff,plot=plt,dist='norm')
plt.show()
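The Q-Q plot can also be summarized numerically. When no plot is requested, `stats.probplot` returns the ordered quantiles together with a least-squares fit whose correlation coefficient `r` measures how closely the points follow the line. A small sketch on synthetic data (the arrays `normal_data` and `skewed_data` are illustrative, not taken from the wine data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=500)       # should hug the reference line
skewed_data = rng.exponential(size=500)  # should bend away from it

# With the default fit=True, probplot returns ((osm, osr), (slope, intercept, r)).
(_, _), (_, _, r_normal) = stats.probplot(normal_data, dist='norm')
(_, _), (_, _, r_skewed) = stats.probplot(skewed_data, dist='norm')

print(f'r for normal data: {r_normal}')
print(f'r for skewed data: {r_skewed}')
```

An `r` close to 1 corresponds to points lying on the line; the skewed sample gives a visibly lower value.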
**Normality test
H0: Normally distributed
H1: Not Normally Distributed**
# The Shapiro-Wilk test shows the residual is not normally distributed
alpha = 0.05
statistic, p_value = stats.shapiro(diff)
if p_value > alpha:
    print(f'Fail To Reject Null Hypothesis p-value: {p_value}')
else:
    print(f'Reject Null Hypothesis p-value: {p_value}')
**Independent sample t-test
H0: There is no difference in mean (wine quality doesn't depend on alcohol %)
H1: There is a difference in mean (wine quality depends on alcohol %)**
alpha = 0.05
statistic, p_value = stats.ttest_ind(sample_01['alcohol'], sample_02['alcohol'])
if p_value > alpha:
    print(f'Fail To Reject Null Hypothesis p-value: {p_value}')
else:
    print(f'Reject Null Hypothesis p-value: {p_value}')
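For reference, the statistic reported by `stats.ttest_ind` with its default equal-variance setting can be reproduced by hand from the pooled variance. The sketch below does this on synthetic data; the arrays `a` and `b`, their means and the seed are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=10.0, scale=1.0, size=200)
b = rng.normal(loc=10.2, scale=1.0, size=200)

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2

# Pooled variance: sample variances weighted by their degrees of freedom.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t statistic: difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(f'manual: t={t_manual}, p={p_manual}')
print(f'scipy : t={t_scipy}, p={p_scipy}')
```

Both lines should print the same values up to floating-point rounding.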
Conclusion
The mean alcohol content differs between wines rated 5 and wines rated 6, so alcohol content (%) is associated with wine quality