The Chi-square Test is a non-parametric statistical test used to examine whether distributions of categorical variables differ from one another. It’s widely used to test the association between two categorical variables in a contingency table. The test compares the observed frequencies in each category of a contingency table with the expected frequencies, which are calculated based on the assumption that there is no association between the variables.
There are two main types of chi-square tests:
Chi-square Test of Independence: used to determine whether there is a significant association between two categorical variables.
Chi-square Goodness-of-Fit Test: used to determine whether a sample distribution matches an expected theoretical distribution.
Chi-square Test is applicable in a wide range of disciplines, including sociology, marketing, and education, to test hypotheses about associations or differences in categorical data distributions.
19.1 Chi-square Test of Independence
Overview
The Chi-square Test of Independence is a non-parametric statistical test used to determine if there is a significant association between two categorical variables. This test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies, which are calculated under the assumption of independence between the variables.
Null and Alternative Hypotheses
Null Hypothesis (H0): The null hypothesis states that there is no association between the two categorical variables; they are independent.
Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant association between the two categorical variables.
Test Statistic
The test statistic for the Chi-square Test of Independence is calculated as follows:
\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency in each category.
The test statistic follows a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.
Calculation of Expected Frequencies
Expected frequencies are calculated based on the marginal totals of the contingency table: \[
E_{ij} = \frac{(R_i \times C_j)}{N}
\] where \(R_i\) is the total for row \(i\), \(C_j\) is the total for column \(j\), and \(N\) is the overall total number of observations.
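As an illustration, the expected-frequency formula can be applied to every cell at once as the outer product of the marginal totals divided by \(N\). The 2×3 table of counts below is hypothetical, chosen only to show the computation:

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1)  # R_i
col_totals = observed.sum(axis=0)  # C_j
N = observed.sum()                 # grand total

# E_ij = (R_i * C_j) / N for every cell at once
expected = np.outer(row_totals, col_totals) / N
print(expected)  # each row: [15. 20. 25.]
```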
Interpretation of Results
If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating a significant association between the variables.
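The critical value can be obtained programmatically rather than from a printed table; for example, with SciPy's chi-squared quantile function:

```python
from scipy.stats import chi2

alpha = 0.05
df = 1  # e.g., a 2x2 table: (2-1)*(2-1) = 1

# Upper-tail critical value: P(X > critical_value) = alpha
critical_value = chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))  # 3.841
```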
Applications
Sociology: To analyze the relationship between education level and employment status.
Medicine: To study the association between a risk factor (like smoking) and the incidence of a disease.
Marketing: To evaluate the relationship between customer demographics and product preferences.
19.1.1 Example problem on Chi-square Test of Independence
A researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The data collected is as follows:
| Gender | Like | Dislike |
|--------|------|---------|
| Male   | 20   | 10      |
| Female | 30   | 40      |
Null and Alternative Hypotheses
Null Hypothesis (H0): There is no association between gender and preference for the product; they are independent.
Alternative Hypothesis (H1): There is an association between gender and preference for the product.
Step-by-Step Calculation
Observed Frequencies (O): The observed frequencies are given in the table:
| Gender        | Like | Dislike | Row Totals |
|---------------|------|---------|------------|
| Male          | 20   | 10      | 30         |
| Female        | 30   | 40      | 70         |
| Column Totals | 50   | 50      | 100        |
Expected Frequencies (E): The expected frequencies are calculated based on the assumption of independence. The expected frequency for each cell is calculated using the formula: \[
E_{ij} = \frac{(R_i \times C_j)}{N}
\] where \(R_i\) is the row total, \(C_j\) is the column total, and \(N\) is the grand total.
For Male and Like: \[
E_{11} = \frac{(30 \times 50)}{100} = 15
\]
For Male and Dislike: \[
E_{12} = \frac{(30 \times 50)}{100} = 15
\]
For Female and Like: \[
E_{21} = \frac{(70 \times 50)}{100} = 35
\]
For Female and Dislike: \[
E_{22} = \frac{(70 \times 50)}{100} = 35
\]
The expected frequencies are:
| Gender | Like (E) | Dislike (E) |
|--------|----------|-------------|
| Male   | 15       | 15          |
| Female | 35       | 35          |
Chi-square Test Statistic: The test statistic is calculated using the formula: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
For Male and Like: \[
\chi^2_{11} = \frac{(20 - 15)^2}{15} = \frac{25}{15} = 1.67
\]
For Male and Dislike: \[
\chi^2_{12} = \frac{(10 - 15)^2}{15} = \frac{25}{15} = 1.67
\]
For Female and Like: \[
\chi^2_{21} = \frac{(30 - 35)^2}{35} = \frac{25}{35} = 0.71
\]
For Female and Dislike: \[
\chi^2_{22} = \frac{(40 - 35)^2}{35} = \frac{25}{35} = 0.71
\]
The total chi-square statistic is: \[
\chi^2 = 1.67 + 1.67 + 0.71 + 0.71 = 4.76
\]
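The hand calculation can be checked with SciPy. Note that `correction=False` is needed so that `chi2_contingency` skips the continuity correction and reproduces the uncorrected statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[20, 10], [30, 40]])

# correction=False disables Yates' continuity correction,
# matching the sum of (O - E)^2 / E computed by hand
chi2_stat, p_value, dof, expected = chi2_contingency(data, correction=False)
print(round(chi2_stat, 2))  # 4.76
```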
Degrees of Freedom (df): The degrees of freedom for the test is calculated as: \[
df = (r - 1) \times (c - 1)
\] where \(r\) is the number of rows and \(c\) is the number of columns. In this case: \[
df = (2 - 1) \times (2 - 1) = 1
\]
Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 1 degree of freedom can be found in chi-square distribution tables. The critical value is 3.841.
Compare the calculated \(\chi^2\) value with the critical value:
If \(\chi^2 > 3.841\), reject the null hypothesis.
Otherwise, do not reject the null hypothesis.
In this case, \(\chi^2 = 4.76\) which is greater than 3.841, so we reject the null hypothesis.
Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4.76\) with 1 degree of freedom, the p-value is approximately 0.029.
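That p-value is the upper-tail probability of the chi-squared distribution, which SciPy exposes as the survival function:

```python
from scipy.stats import chi2

# P(X > 4.76) for a chi-squared variable with 1 degree of freedom
p_value = chi2.sf(4.76, 1)
print(round(p_value, 3))  # 0.029
```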
Interpretation
Since the p-value (0.029) is less than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis. There is sufficient evidence to conclude that there is a significant association between gender and preference for the new product.
R Code for Chi-square Test of Independence
```r
# Data for the Chi-square Test of Independence
data <- matrix(c(20, 10, 30, 40), nrow = 2, byrow = TRUE)

# Perform the test
result <- chisq.test(data)

# Output the result
print(result)
```
```
Pearson's Chi-squared test with Yates' continuity correction

data: data
X-squared = 3.8571, df = 1, p-value = 0.04953
```

Note that chisq.test applies Yates' continuity correction by default for 2×2 tables, which is why the reported statistic (3.8571) is smaller than the hand-calculated 4.76; passing correct = FALSE reproduces the uncorrected value.
Python Code for Chi-square Test of Independence
```python
import numpy as np
from scipy.stats import chi2_contingency

# Data for the Chi-square Test of Independence
data = np.array([[20, 10], [30, 40]])

# Perform the test
chi2, p, dof, expected = chi2_contingency(data)

# Output the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
```

```
Chi2 Statistic: 3.8571428571428577
P-value: 0.04953461343562668
Degrees of Freedom: 1
Expected Frequencies:
[[15. 15.]
 [35. 35.]]
```

chi2_contingency applies a continuity correction to 2×2 tables by default (correction=True), so the statistic here is the corrected 3.857 rather than the uncorrected 4.76.
19.2 Chi-square Goodness-of-Fit Test
Overview
The Chi-square Goodness-of-Fit Test is used to determine whether a sample distribution matches an expected distribution. This test compares the observed frequencies of categories to the frequencies expected under a specified theoretical distribution.
Null and Alternative Hypotheses
Null Hypothesis (H0): The null hypothesis states that the sample distribution matches the expected distribution.
Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant difference between the observed and expected distributions.
Test Statistic
The test statistic for the Chi-square Goodness-of-Fit Test is calculated as follows: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency for each category.
The test statistic follows a chi-squared distribution with \(k-1\) degrees of freedom, where \(k\) is the number of categories.
Calculation of Expected Frequencies
Expected frequencies are calculated based on the theoretical distribution. For example, if testing a uniform distribution, the expected frequency for each category would be equal.
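As a small sketch with hypothetical data (120 rolls of a die tested against a uniform distribution), the expected frequencies and the resulting statistic can be computed directly:

```python
import numpy as np

# Hypothetical counts from 120 die rolls
observed = np.array([18, 22, 20, 19, 24, 17])

# Under a uniform distribution, every face is expected equally often
expected = np.full(len(observed), observed.sum() / len(observed))  # 20 each

# Chi-square statistic: sum of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # approximately 1.7
```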
Interpretation of Results
If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating that the sample distribution significantly differs from the expected distribution.
Applications
Genetics: To determine if the observed frequencies of different genotypes match the expected frequencies under Mendelian inheritance.
Quality Control: To check if the observed defect rates in different categories match the expected rates.
Survey Analysis: To see if the distribution of survey responses matches the expected distribution based on population proportions.
19.2.1 Example problem on Chi-square Goodness-of-Fit Test
A company wants to know if the observed sales distribution across four regions (North, South, East, West) matches their expected distribution. The expected distribution is equal across all regions. The observed sales are as follows:
| Region | Observed Sales |
|--------|----------------|
| North  | 50             |
| South  | 60             |
| East   | 40             |
| West   | 50             |
The company will use the Chi-square Goodness-of-Fit Test to analyze the data.
Null and Alternative Hypotheses
Null Hypothesis (H0): The observed sales distribution matches the expected distribution (equal across all regions).
Alternative Hypothesis (H1): The observed sales distribution does not match the expected distribution.
Step-by-Step Calculation
Observed Frequencies (O): The observed frequencies are given in the table:
| Region | Observed Sales (O) |
|--------|--------------------|
| North  | 50                 |
| South  | 60                 |
| East   | 40                 |
| West   | 50                 |
Expected Frequencies (E): The expected frequencies are calculated based on the assumption that the sales are equally distributed across all regions. Since the total number of observations is 200 (50 + 60 + 40 + 50 = 200) and there are four regions, the expected frequency for each region is:
\[
E = \frac{\text{Total Sales}}{\text{Number of Regions}} = \frac{200}{4} = 50
\]
So, the expected frequencies are:
| Region | Expected Sales (E) |
|--------|--------------------|
| North  | 50                 |
| South  | 50                 |
| East   | 50                 |
| West   | 50                 |
Chi-square Test Statistic: The test statistic is calculated using the formula: \[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
For North: \[
\chi^2_{N} = \frac{(50 - 50)^2}{50} = 0
\]
For South: \[
\chi^2_{S} = \frac{(60 - 50)^2}{50} = \frac{100}{50} = 2
\]
For East: \[
\chi^2_{E} = \frac{(40 - 50)^2}{50} = \frac{100}{50} = 2
\]
For West: \[
\chi^2_{W} = \frac{(50 - 50)^2}{50} = 0
\]
The total chi-square statistic is: \[
\chi^2 = 0 + 2 + 2 + 0 = 4
\]
Degrees of Freedom (df): The degrees of freedom for the test is calculated as:
\[
df = k - 1
\]
where \(k\) is the number of categories (regions). In this case:
\[
df = 4 - 1 = 3
\]
Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 3 degrees of freedom can be found in chi-square distribution tables. The critical value is 7.815.
Compare the calculated \(\chi^2\) value with the critical value:
If \(\chi^2 > 7.815\), reject the null hypothesis.
Otherwise, do not reject the null hypothesis.
In this case, \(\chi^2 = 4\) which is less than 7.815, so we do not reject the null hypothesis.
Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4\) with 3 degrees of freedom, the p-value is approximately 0.261.
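Both the critical value and the p-value quoted above can be obtained from SciPy's chi-squared distribution:

```python
from scipy.stats import chi2

# Upper-tail critical value at alpha = 0.05 with 3 degrees of freedom
print(round(chi2.ppf(0.95, 3), 3))  # 7.815

# Upper-tail probability of the observed statistic
print(round(chi2.sf(4, 3), 3))  # 0.261
```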
Interpretation
Since the p-value (0.261) is greater than the significance level (\(\alpha = 0.05\)), we do not reject the null hypothesis. There is insufficient evidence to conclude that the observed sales distribution significantly differs from the expected distribution.
R Code for Chi-square Goodness-of-Fit Test
```r
# Observed sales
observed <- c(50, 60, 40, 50)

# Expected sales (equal distribution)
expected <- rep(sum(observed) / length(observed), length(observed))

# Perform the test
result <- chisq.test(observed, p = expected / sum(expected))

# Output the result
print(result)
```
```
Chi-squared test for given probabilities

data: observed
X-squared = 4, df = 3, p-value = 0.2615
```
Python Code for Chi-square Goodness-of-Fit Test
```python
import numpy as np
from scipy.stats import chisquare

# Observed sales
observed = np.array([50, 60, 40, 50])

# Expected sales (equal distribution)
expected = np.full(len(observed), np.mean(observed))

# Perform the test
chi2, p = chisquare(f_obs=observed, f_exp=expected)

# Output the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
```