The Chi-square Test is a non-parametric statistical test used to examine whether distributions of categorical variables differ from one another. It’s widely used to test the association between two categorical variables in a contingency table. The test compares the observed frequencies in each category of a contingency table with the expected frequencies, which are calculated based on the assumption that there is no association between the variables.
There are two main types of chi-square tests:
Chi-square Test of Independence: used to determine whether there is a significant association between two categorical variables.
Chi-square Goodness-of-Fit Test: used to determine whether a sample distribution matches an expected theoretical distribution.
Chi-square Test is applicable in a wide range of disciplines, including sociology, marketing, and education, to test hypotheses about associations or differences in categorical data distributions.
19.1 Chi-square Test of Independence
Overview
The Chi-square Test of Independence is a non-parametric statistical test used to determine if there is a significant association between two categorical variables. This test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies, which are calculated under the assumption of independence between the variables.
Null and Alternative Hypotheses
Null Hypothesis (H0): The null hypothesis states that there is no association between the two categorical variables; they are independent.
Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant association between the two categorical variables.
Test Statistic
The test statistic for the Chi-square Test of Independence is calculated as follows:
\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency in each category.
The test statistic follows a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.
Calculation of Expected Frequencies
Expected frequencies are calculated based on the marginal totals of the contingency table: \[
E_{ij} = \frac{(R_i \times C_j)}{N}
\] where \(R_i\) is the total for row \(i\), \(C_j\) is the total for column \(j\), and \(N\) is the overall total number of observations.
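As an illustration, the expected-frequency formula can be applied to every cell at once as the outer product of the marginal totals divided by \(N\). The 2×3 table of counts below is hypothetical, chosen only to show the computation:

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1)  # R_i
col_totals = observed.sum(axis=0)  # C_j
N = observed.sum()                 # grand total

# E_ij = (R_i * C_j) / N for every cell at once
expected = np.outer(row_totals, col_totals) / N
print(expected)  # each row: [15. 20. 25.]
```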
Interpretation of Results
If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating a significant association between the variables.
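The critical value can be obtained programmatically rather than from a printed table; for example, with SciPy's chi-squared quantile function:

```python
from scipy.stats import chi2

alpha = 0.05
df = 1  # e.g., a 2x2 table: (2-1)*(2-1) = 1

# Upper-tail critical value: P(X > critical_value) = alpha
critical_value = chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))  # 3.841
```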
Applications
Sociology: To analyze the relationship between education level and employment status.
Medicine: To study the association between a risk factor (like smoking) and the incidence of a disease.
Marketing: To evaluate the relationship between customer demographics and product preferences.
19.1.1 Example problem on Chi-square Test of Independence
A researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The data collected is as follows:
| Gender | Like | Dislike |
|--------|------|---------|
| Male   | 20   | 10      |
| Female | 30   | 40      |
Null and Alternative Hypotheses
Null Hypothesis (H0): There is no association between gender and preference for the product; they are independent.
Alternative Hypothesis (H1): There is an association between gender and preference for the product.
Step-by-Step Calculation
Observed Frequencies (O): The observed frequencies are given in the table:
| Gender        | Like | Dislike | Row Totals |
|---------------|------|---------|------------|
| Male          | 20   | 10      | 30         |
| Female        | 30   | 40      | 70         |
| Column Totals | 50   | 50      | 100        |
Expected Frequencies (E): The expected frequencies are calculated based on the assumption of independence. The expected frequency for each cell is calculated using the formula: \[
E_{ij} = \frac{(R_i \times C_j)}{N}
\] where \(R_i\) is the row total, \(C_j\) is the column total, and \(N\) is the grand total.
For Male and Like: \[
E_{11} = \frac{(30 \times 50)}{100} = 15
\]
For Male and Dislike: \[
E_{12} = \frac{(30 \times 50)}{100} = 15
\]
For Female and Like: \[
E_{21} = \frac{(70 \times 50)}{100} = 35
\]
For Female and Dislike: \[
E_{22} = \frac{(70 \times 50)}{100} = 35
\]
The expected frequencies are:
| Gender | Like (E) | Dislike (E) |
|--------|----------|-------------|
| Male   | 15       | 15          |
| Female | 35       | 35          |
Chi-square Test Statistic: The test statistic is calculated using the formula: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
For Male and Like: \[
\chi^2_{11} = \frac{(20 - 15)^2}{15} = \frac{25}{15} = 1.67
\]
For Male and Dislike: \[
\chi^2_{12} = \frac{(10 - 15)^2}{15} = \frac{25}{15} = 1.67
\]
For Female and Like: \[
\chi^2_{21} = \frac{(30 - 35)^2}{35} = \frac{25}{35} = 0.71
\]
For Female and Dislike: \[
\chi^2_{22} = \frac{(40 - 35)^2}{35} = \frac{25}{35} = 0.71
\]
The total chi-square statistic is: \[
\chi^2 = 1.67 + 1.67 + 0.71 + 0.71 = 4.76
\]
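The hand calculation can be checked with SciPy. Note that `correction=False` is needed so that `chi2_contingency` skips the continuity correction and reproduces the uncorrected statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[20, 10], [30, 40]])

# correction=False disables Yates' continuity correction,
# matching the sum of (O - E)^2 / E computed by hand
chi2_stat, p_value, dof, expected = chi2_contingency(data, correction=False)
print(round(chi2_stat, 2))  # 4.76
```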
Degrees of Freedom (df): The degrees of freedom for the test is calculated as: \[
df = (r - 1) \times (c - 1)
\] where \(r\) is the number of rows and \(c\) is the number of columns. In this case: \[
df = (2 - 1) \times (2 - 1) = 1
\]
Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 1 degree of freedom can be found in chi-square distribution tables. The critical value is 3.841.
Compare the calculated \(\chi^2\) value with the critical value:
If \(\chi^2 > 3.841\), reject the null hypothesis.
Otherwise, do not reject the null hypothesis.
In this case, \(\chi^2 = 4.76\) which is greater than 3.841, so we reject the null hypothesis.
Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4.76\) with 1 degree of freedom, the p-value is approximately 0.029.
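That p-value is the upper-tail probability of the chi-squared distribution, which SciPy exposes as the survival function:

```python
from scipy.stats import chi2

# P(X > 4.76) for a chi-squared variable with 1 degree of freedom
p_value = chi2.sf(4.76, 1)
print(round(p_value, 3))  # 0.029
```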
Interpretation
Since the p-value (0.029) is less than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis. There is sufficient evidence to conclude that there is a significant association between gender and preference for the new product.
R Code for Chi-square Test of Independence
```r
# Data for the Chi-square Test of Independence
data <- matrix(c(20, 10, 30, 40), nrow = 2, byrow = TRUE)

# Perform the test
result <- chisq.test(data)

# Output the result
print(result)
```
```
Pearson's Chi-squared test with Yates' continuity correction

data: data
X-squared = 3.8571, df = 1, p-value = 0.04953
```

Note that chisq.test applies Yates' continuity correction by default for 2×2 tables, which is why the reported statistic (3.8571) is smaller than the hand-calculated 4.76; passing correct = FALSE reproduces the uncorrected value.
Python Code for Chi-square Test of Independence
```python
import numpy as np
from scipy.stats import chi2_contingency

# Data for the Chi-square Test of Independence
data = np.array([[20, 10], [30, 40]])

# Perform the test
chi2, p, dof, expected = chi2_contingency(data)

# Output the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
```

```
Chi2 Statistic: 3.8571428571428577
P-value: 0.04953461343562668
Degrees of Freedom: 1
Expected Frequencies:
[[15. 15.]
 [35. 35.]]
```

chi2_contingency applies a continuity correction to 2×2 tables by default (correction=True), so the statistic here is the corrected 3.857 rather than the uncorrected 4.76.
19.2 Chi-square Goodness-of-Fit Test
Overview
The Chi-square Goodness-of-Fit Test is used to determine whether a sample distribution matches an expected distribution. This test compares the observed frequencies of categories to the frequencies expected under a specified theoretical distribution.
Null and Alternative Hypotheses
Null Hypothesis (H0): The null hypothesis states that the sample distribution matches the expected distribution.
Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant difference between the observed and expected distributions.
Test Statistic
The test statistic for the Chi-square Goodness-of-Fit Test is calculated as follows: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency for each category.
The test statistic follows a chi-squared distribution with \(k-1\) degrees of freedom, where \(k\) is the number of categories.
Calculation of Expected Frequencies
Expected frequencies are calculated based on the theoretical distribution. For example, if testing a uniform distribution, the expected frequency for each category would be equal.
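As a small sketch with hypothetical data (120 rolls of a die tested against a uniform distribution), the expected frequencies and the resulting statistic can be computed directly:

```python
import numpy as np

# Hypothetical counts from 120 die rolls
observed = np.array([18, 22, 20, 19, 24, 17])

# Under a uniform distribution, every face is expected equally often
expected = np.full(len(observed), observed.sum() / len(observed))  # 20 each

# Chi-square statistic: sum of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # approximately 1.7
```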
Interpretation of Results
If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating that the sample distribution significantly differs from the expected distribution.
Applications
Genetics: To determine if the observed frequencies of different genotypes match the expected frequencies under Mendelian inheritance.
Quality Control: To check if the observed defect rates in different categories match the expected rates.
Survey Analysis: To see if the distribution of survey responses matches the expected distribution based on population proportions.
19.2.1 Example problem on Chi-square Goodness-of-Fit Test
A company wants to know if the observed sales distribution across four regions (North, South, East, West) matches their expected distribution. The expected distribution is equal across all regions. The observed sales are as follows:
| Region | Observed Sales |
|--------|----------------|
| North  | 50             |
| South  | 60             |
| East   | 40             |
| West   | 50             |
The company will use the Chi-square Goodness-of-Fit Test to analyze the data.
Null and Alternative Hypotheses
Null Hypothesis (H0): The observed sales distribution matches the expected distribution (equal across all regions).
Alternative Hypothesis (H1): The observed sales distribution does not match the expected distribution.
Step-by-Step Calculation
Observed Frequencies (O): The observed frequencies are given in the table:
| Region | Observed Sales (O) |
|--------|--------------------|
| North  | 50                 |
| South  | 60                 |
| East   | 40                 |
| West   | 50                 |
Expected Frequencies (E): The expected frequencies are calculated based on the assumption that the sales are equally distributed across all regions. Since the total number of observations is 200 (50 + 60 + 40 + 50 = 200) and there are four regions, the expected frequency for each region is:
\[
E = \frac{\text{Total Sales}}{\text{Number of Regions}} = \frac{200}{4} = 50
\]
So, the expected frequencies are:
| Region | Expected Sales (E) |
|--------|--------------------|
| North  | 50                 |
| South  | 50                 |
| East   | 50                 |
| West   | 50                 |
Chi-square Test Statistic: The test statistic is calculated using the formula: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
For North: \[
\chi^2_{N} = \frac{(50 - 50)^2}{50} = 0
\]
For South: \[
\chi^2_{S} = \frac{(60 - 50)^2}{50} = \frac{100}{50} = 2
\]
For East: \[
\chi^2_{E} = \frac{(40 - 50)^2}{50} = \frac{100}{50} = 2
\]
For West: \[
\chi^2_{W} = \frac{(50 - 50)^2}{50} = 0
\]
The total chi-square statistic is: \[
\chi^2 = 0 + 2 + 2 + 0 = 4
\]
Degrees of Freedom (df): The degrees of freedom for the test is calculated as:
\[
df = k - 1
\]
where \(k\) is the number of categories (regions). In this case:
\[
df = 4 - 1 = 3
\]
Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 3 degrees of freedom can be found in chi-square distribution tables. The critical value is 7.815.
Compare the calculated \(\chi^2\) value with the critical value:
If \(\chi^2 > 7.815\), reject the null hypothesis.
Otherwise, do not reject the null hypothesis.
In this case, \(\chi^2 = 4\) which is less than 7.815, so we do not reject the null hypothesis.
Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4\) with 3 degrees of freedom, the p-value is approximately 0.261.
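Both the critical value and the p-value quoted above can be obtained from SciPy's chi-squared distribution:

```python
from scipy.stats import chi2

# Upper-tail critical value at alpha = 0.05 with 3 degrees of freedom
print(round(chi2.ppf(0.95, 3), 3))  # 7.815

# Upper-tail probability of the observed statistic
print(round(chi2.sf(4, 3), 3))  # 0.261
```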
Interpretation
Since the p-value (0.261) is greater than the significance level (\(\alpha = 0.05\)), we do not reject the null hypothesis. There is insufficient evidence to conclude that the observed sales distribution significantly differs from the expected distribution.
R Code for Chi-square Goodness-of-Fit Test
```r
# Observed sales
observed <- c(50, 60, 40, 50)

# Expected sales (equal distribution)
expected <- rep(sum(observed) / length(observed), length(observed))

# Perform the test
result <- chisq.test(observed, p = expected / sum(expected))

# Output the result
print(result)
```
```
Chi-squared test for given probabilities

data: observed
X-squared = 4, df = 3, p-value = 0.2615
```
Python Code for Chi-square Goodness-of-Fit Test
```python
import numpy as np
from scipy.stats import chisquare

# Observed sales
observed = np.array([50, 60, 40, 50])

# Expected sales (equal distribution)
expected = np.full(len(observed), np.mean(observed))

# Perform the test
chi2, p = chisquare(f_obs=observed, f_exp=expected)

# Output the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
```