How to use Statistical Hypothesis Testing for AQI case study
Prerequisite: Knowledge of sampling, hypothesis testing, and test statistics.
Statistical Hypothesis Approach
Questions to be asked:
1. The name of the test:
The test depends on the distribution the data follows. A wide range of statistical tests is available for different distributions, and if the data does not follow any known distribution, non-parametric tests are available.
2. What the test is checking:
This depends on the problem we are solving. We may define a hypothesis between two samples, between a sample and a fixed value, or between two population groups. Based on our requirement, we choose the null and alternative hypotheses for the chosen test statistic.
3. The key assumptions of the test:
Every hypothesis test rests on assumptions, for example: is the data set normally distributed, and is the variance known? The key assumptions, and the path for deciding on a test statistic, are laid out in the flow chart below.
4. How the test result is interpreted:
Interpreting the test result is the interesting part. Deciding whether to reject the null hypothesis with traditional methods is error-prone in borderline cases, so we use the p-value to interpret the test results instead.
The p-value is interpreted in the context of a chosen significance level called alpha. A common value for alpha is 5%, or 0.05. If the p-value is at or below the significance level, the test says there is enough evidence to reject the null hypothesis and that the samples were likely drawn from populations with differing distributions.
p <= alpha: reject null hypothesis, different distribution.
p > alpha: fail to reject null hypothesis, same distribution.
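The decision rule above can be written as a small helper. This is a minimal sketch: the function name interpret_p_value is made up for illustration, and the default alpha of 0.05 simply follows the common convention mentioned above.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(interpret_p_value(0.03))  # reject H0
print(interpret_p_value(0.20))  # fail to reject H0
```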
5. Python package for using the test:
Below are a few packages that make this easier:
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from scipy.stats import ttest_rel
from statsmodels.stats.weightstats import ztest
from scipy.stats import f_oneway
Since Python is open source, many more packages for statistical testing are available.
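As a rough illustration of how the SciPy functions imported above are called, the sketch below runs them on small made-up samples. The numbers are invented purely for demonstration and are not AQI data.

```python
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel, f_oneway

# Made-up samples, for illustration only
group_a = [52.1, 48.3, 50.7, 49.9, 51.4, 47.8, 50.2, 49.5]
group_b = [55.0, 53.2, 54.8, 52.9, 56.1, 53.7, 54.3, 55.5]
group_c = [50.5, 49.1, 51.9, 50.0, 52.3, 48.7, 51.1, 50.6]

# One-sample t test: is the mean of group_a equal to 50?
res_1samp = ttest_1samp(group_a, popmean=50)

# Independent two-sample t test: do group_a and group_b share a mean?
res_ind = ttest_ind(group_a, group_b)

# Paired t test: treats group_a[i] and group_c[i] as matched pairs
res_rel = ttest_rel(group_a, group_c)

# One-way ANOVA: do all three groups share the same mean?
res_anova = f_oneway(group_a, group_b, group_c)

results = {
    "ttest_1samp": res_1samp,
    "ttest_ind": res_ind,
    "ttest_rel": res_rel,
    "f_oneway": res_anova,
}
for name, res in results.items():
    print(f"{name}: statistic={res.statistic:.3f}, p-value={res.pvalue:.4f}")
```

Each call returns a result object holding the test statistic and the p-value, which can then be compared against alpha.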
How to choose a test statistic in the first place?
Examples of test statistics:
T Test (Student's t test)
Z Test
ANOVA Test
Chi-Square Test
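A simplified rule of thumb for picking among these four tests can be sketched in code. This is not a complete decision procedure: the function name suggest_test and its parameters are hypothetical, and the conditions are a deliberately coarse assumption (categorical data for Chi-Square, more than two groups for ANOVA, large samples or known variance for the Z test, and the T test otherwise).

```python
def suggest_test(n_groups, sample_size, categorical=False, variance_known=False):
    """Suggest a test using a coarse rule of thumb (an assumption, not a full flow chart)."""
    if categorical:
        # Counts or categories rather than numeric measurements
        return "Chi-Square test"
    if n_groups > 2:
        # Comparing the means of three or more groups
        return "ANOVA test"
    if sample_size > 30 or variance_known:
        # Large samples, or the population variance is known
        return "Z test"
    # Small samples with unknown variance
    return "T test"


print(suggest_test(n_groups=2, sample_size=100))  # Z test
print(suggest_test(n_groups=2, sample_size=20))   # T test
```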
Case Study
Data Science approach to use hypothesis testing:
Here I work through one example to show hypothesis testing in depth, while keeping it simple and crisp for the blog reader. If you want to dig deeper by looking at the code, please feel free to check out my git page. The link is provided at the end.
Air Quality Index: An overview of this example is as follows.
The name of the test. – Two-sample Z test.
What the test is checking. – Whether the means of two independent data groups are equal. H0 : the difference between the means of the two groups is 0, H1 : the difference between the means of the two groups is not 0
The key assumptions of the test – Your sample size is greater than 30. Otherwise, use a t test. Data points should be independent of each other. In other words, one data point isn’t related to and doesn’t affect another data point. Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter. Your data should be randomly selected from a population, where each item has an equal chance of being selected. Sample sizes should be equal if at all possible.
How the test result is interpreted. – The p-value and the test statistic are given in the output. If the p-value < 0.05, we reject H0 (the null hypothesis); otherwise, we fail to reject H0.
Hypothesis – We use the Z test; please refer to the assumptions and result interpretation above.
H0 : the difference between the means of the two groups is 0, H1 : the difference between the means of the two groups is not 0
So let's go through each step towards our hypothesis testing:
PM 2.5 is a measure used to define the AQI (Air Quality Index). Most organizations use instruments to measure AQI. This measure has become very familiar as air pollution keeps increasing.
But do measures like temperature, humidity, rainfall, wind speed, and visibility affect AQI? Is there any way to find a relationship between the instrument-recorded AQI and these measures? This is where the hypothesis comes in.
Our main idea is to take the means of two groups: the instrument-recorded AQI and the AQI predicted from measures like temperature, humidity, rainfall, wind speed, and visibility. We then check whether the difference between the means of these two groups is zero or not.
H0 : the difference between the means of the two groups is 0, H1 : the difference between the means of the two groups is not 0
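To make the hypothesis concrete, here is a minimal sketch of a two-sided, two-sample Z test written with only the Python standard library. The function name two_sample_z_test and the sample values are made up for illustration. It assumes the unpooled standard error with sample variances standing in for the population variances; library implementations may pool the variances, so their results can differ slightly.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(sample_1, sample_2):
    """Return (z, p) for H0: mean(sample_1) - mean(sample_2) = 0, two-sided."""
    n1, n2 = len(sample_1), len(sample_2)
    # Standard error of the difference, using sample variances as estimates
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    z = (mean(sample_1) - mean(sample_2)) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up values standing in for recorded and predicted PM 2.5 (35 points each)
recorded = [40 + (i % 7) for i in range(35)]
predicted = [41 + (i % 5) for i in range(35)]

z, p = two_sample_z_test(recorded, predicted)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

A z value far from zero in either direction gives a small p-value, which is evidence against H0.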
Step 1: Data Collection:
We used the Selenium WebDriver and BeautifulSoup Python packages to simplify data collection. As an example, we scraped the Bengaluru data for measures like temperature, humidity, rainfall, wind speed, and visibility between 2013 and 2016 from https://en.tutiempo.net/climate/01-2013/ws-432950.html. We also obtained CSV files of instrument-recorded AQI between 2013 and 2016 from different sources. All the .csv files are available in the git repository for your reference. Please check out my git page for more details; the link is provided at the end of this blog.
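The table-parsing part of such a scraper might look roughly like the sketch below. It parses a small made-up HTML snippet rather than the real page, so the table layout and the column names (Day, avg_temp, rel_humidity) are assumptions; the actual page structure may differ, and fetching the page with Selenium is left out.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page; the real layout may differ
html = """
<table>
  <tr><th>Day</th><th>avg_temp</th><th>rel_humidity</th></tr>
  <tr><td>1</td><td>22.4</td><td>61</td></tr>
  <tr><td>2</td><td>23.1</td><td>58</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# The first row holds the column names, the remaining rows hold daily values
header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
records = [
    dict(zip(header, (cell.get_text(strip=True) for cell in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```

The resulting list of dictionaries can then be written to a CSV file, one file per month or year.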
Step 2: EDA:
Exploratory Data Analysis is done at various levels to understand the data distribution, outliers, yearly changes, etc. For detailed information, please visit the git page. A sample of the year-wise EDA is shown below.
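A year-wise summary of this kind can be sketched with pandas as follows. The data frame here is tiny and entirely made up, and the column names date and pm25 are hypothetical; the real data set lives in the .csv files mentioned earlier.

```python
import pandas as pd

# Made-up daily readings; column names are hypothetical
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-01-01", "2013-06-01",
        "2014-01-01", "2014-06-01",
        "2015-01-01", "2015-06-01",
    ]),
    "pm25": [110.0, 45.0, 125.0, 50.0, 98.0, 40.0],
})

# Group by calendar year and summarise PM 2.5 within each year
df["year"] = df["date"].dt.year
yearly = df.groupby("year")["pm25"].agg(["count", "mean", "min", "max"])
print(yearly)
```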
Step 3: Predicting PM 2.5 using measures scraped from the website.
After performing EDA and transformations on the data, the weather measures are combined with PM 2.5 on a day-to-day basis to form the data set for developing machine learning models. We then use the trained model to predict PM 2.5 for a portion of the data. Both the predicted PM 2.5 and the instrument-recorded PM 2.5 are used for our hypothesis testing.
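The step above could be sketched as follows. The post does not say which model was trained, so a plain linear regression from scikit-learn is swapped in here as a stand-in, and the data is synthetic: the feature values and their relationship to PM 2.5 are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_days = 200

# Synthetic stand-ins for three of the weather measures
avg_temp = rng.normal(25, 3, n_days)
rel_humidity = rng.normal(60, 10, n_days)
avg_windspeed = rng.normal(8, 2, n_days)
X = np.column_stack([avg_temp, rel_humidity, avg_windspeed])

# Synthetic PM 2.5 built from an invented relationship plus noise
pm25 = (150 - 2.0 * avg_temp + 0.5 * rel_humidity
        - 3.0 * avg_windspeed + rng.normal(0, 5, n_days))

# Hold out part of the data; the model predicts PM 2.5 for that part
X_train, X_test, y_train, y_test = train_test_split(
    X, pm25, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

predicted_pm25 = model.predict(X_test)  # group 1: predicted values
recorded_pm25 = y_test                  # group 2: recorded values
print(len(predicted_pm25), len(recorded_pm25))
```

The two arrays at the end play the role of the two groups that go into the hypothesis test.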
Step 4: Hypothesis Testing:
Now we have the required data in hand: two sample groups whose means can be calculated and used to compute the test statistic. To proceed, we ask ourselves the questions defined at the beginning of this blog:
1. The name of the test? Two-sample Z test.
2. What the test is checking? Whether the means of two independent data groups are equal. H0 : the difference between the means of the two groups is 0, H1 : the difference between the means of the two groups is not 0
3. The key assumptions of the test? Your sample size is greater than 30. Otherwise, use a t test. Data points should be independent of each other. In other words, one data point isn’t related to and doesn’t affect another data point. Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter. Your data should be randomly selected from a population, where each item has an equal chance of being selected. Sample sizes should be equal if at all possible.
4. How the test result is interpreted? The p-value and the test statistic are given in the output. If the p-value < 0.05, we reject H0 (the null hypothesis); otherwise, we fail to reject H0.
5. Python package for using the test? from statsmodels.stats.weightstats import ztest
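Putting the answers together, a call to ztest might look like the sketch below. The two arrays are randomly generated stand-ins for the recorded and predicted PM 2.5 values, not the real data, so the printed numbers say nothing about the actual case study.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two groups (100 values each)
recorded = rng.normal(loc=80, scale=15, size=100)
predicted = recorded + rng.normal(loc=0, scale=10, size=100)

# value=0 tests H0: the difference between the two means is 0
z_stat, p_value = ztest(recorded, predicted, value=0)
print(f"z statistic = {z_stat:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the two means differ")
else:
    print("Fail to reject H0: no evidence that the two means differ")
```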
Conclusion:
It is evident that parameters like ['avg_temp', 'max_temp', 'min_temp', 'rel_humidity', 'total_rainfall', 'avg_visibility', 'avg_windspeed', 'max_windspeed'] affect PM2.5 (particulate matter of 2.5 micrometers or less in diameter) and have a great influence on air pollution.
Reference Link: