20  Phi-Coefficient of Correlation

The Phi-Coefficient of Correlation measures the degree of association between two binary variables. It is a special case of the Pearson correlation coefficient: applying Pearson's formula to two dichotomous variables coded 0 and 1 yields the phi coefficient.
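As a rough illustration of the link to Pearson's correlation, the sketch below expands a small 2x2 table into 0/1 observations and compares Pearson's r with the cell-count formula for phi used later in this chapter. The counts and the helper names `pearson` and `phi_from_counts` are hypothetical, chosen only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

def phi_from_counts(a, b, c, d):
    """Phi computed from the four cell counts of a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts, chosen only for illustration (not data from this chapter)
a, b, c, d = 3, 1, 1, 3

# Expand the table into one 0/1 pair per observation
x = [1] * (a + b) + [0] * (c + d)            # row variable
y = [1] * a + [0] * b + [1] * c + [0] * d    # column variable

r = pearson(x, y)
phi = phi_from_counts(a, b, c, d)
print("Pearson r on 0/1 data:", r)
print("Phi from cell counts :", phi)
```

Under these assumptions the two printed values should agree, which is what "special case of Pearson" means in practice.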

The phi coefficient ranges from -1 to 1, where:

  • 1 or -1 indicates a perfect positive or negative association, respectively,
  • 0 indicates no association between variables.
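The three reference points can be checked on small tables. This is a minimal sketch with made-up counts; `phi_from_counts` is a hypothetical helper that applies the cell-count formula for phi given in the worked example of this chapter.

```python
import math

def phi_from_counts(a, b, c, d):
    """Phi computed from the four cell counts of a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Made-up tables, written as (a, b, c, d) = (row1col1, row1col2, row2col1, row2col2)
perfect_positive = phi_from_counts(50, 0, 0, 50)    # all cases on the main diagonal
perfect_negative = phi_from_counts(0, 50, 50, 0)    # all cases on the off-diagonal
no_association = phi_from_counts(25, 25, 25, 25)    # identical rows

print(perfect_positive, perfect_negative, no_association)
```

Note that the formula is undefined when any row or column total is zero, because the denominator becomes zero.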

The phi coefficient is often used in conjunction with the Chi-square test for 2x2 contingency tables to quantify the strength of the association between the variables. It provides a numeric measure of the relationship’s strength, whereas the chi-square test assesses the significance of that relationship.
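For a 2x2 table, the link between the two statistics can be written compactly. The relation below assumes the uncorrected Pearson chi-square statistic, and the symbol $N$ (total sample size) is introduced here for illustration:

\[ \phi^2 = \frac{\chi^2}{N} \quad \Longrightarrow \quad |\phi| = \sqrt{\frac{\chi^2}{N}} \]

Because the square root is never negative, this route recovers only the magnitude of phi; the direction of the association has to be read from the table itself.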

Application Contexts:

  • The phi coefficient serves as a straightforward measure of association strength between two dichotomous variables, and it is used for this purpose in medical, psychological, and social science research.

Each of these tests and measures has its specific conditions and assumptions that must be met to ensure valid and reliable results. They are powerful tools in the arsenal of statistical analysis for categorical data, providing insights into patterns, associations, and differences among groups or variables.

20.1 Example Problem:

Imagine a study looking at the relationship between having a gym membership (Yes or No) and being classified as physically active (Active or Not Active). Here’s the data collected from 200 individuals:

| | Active | Not Active | Total |
|---|---|---|---|
| Gym Member | 80 | 20 | 100 |
| No Gym Member | 30 | 70 | 100 |
| Total | 110 | 90 | 200 |
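Before computing anything, it can help to confirm the margins of the table and look at the share of active people in each group. The following is a small sketch in plain Python; the variable names are illustrative and not taken from the original code.

```python
# Observed counts from the table above: rows = gym membership, columns = activity
observed = [[80, 20],   # Gym Member:    Active, Not Active
            [30, 70]]   # No Gym Member: Active, Not Active

row_totals = [sum(row) for row in observed]          # totals per membership group
col_totals = [sum(col) for col in zip(*observed)]    # totals per activity class
grand_total = sum(row_totals)

# Share of each membership group classified as active
active_rate = [row[0] / total for row, total in zip(observed, row_totals)]

print("Row totals   :", row_totals)
print("Column totals:", col_totals)
print("Grand total  :", grand_total)
print("Active rate  :", active_rate)
```

The row and column totals computed here are the four factors that appear under the square root in the phi formula.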

Step-by-Step Calculation:

First, let’s label the counts in our contingency table:

  • $a = 80$ (Active and Gym Member)
  • $b = 20$ (Not Active and Gym Member)
  • $c = 30$ (Active and No Gym Member)
  • $d = 70$ (Not Active and No Gym Member)

Phi Coefficient Formula: \[ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]

Plugging in the values, we get:

\[ \phi = \frac{(80 \times 70) - (20 \times 30)}{\sqrt{(80+20)(30+70)(80+30)(20+70)}} \]

\[ \phi = \frac{5600 - 600}{\sqrt{100 \times 100 \times 110 \times 90}} \]

\[ \phi = \frac{5000}{\sqrt{10000 \times 9900}} \]

\[ \phi = \frac{5000}{\sqrt{99000000}} \]

\[ \phi \approx \frac{5000}{9949.87} \]

\[ \phi \approx 0.5025 \]

Interpretation:

The calculated Phi Coefficient of approximately 0.50 suggests a moderate positive association between having a gym membership and being physically active. Individuals with gym memberships are more likely to be classified as active than those without: 80 of 100 members (80%) are active, compared with 30 of 100 non-members (30%). The positive sign shows that the association runs in the expected direction, and a value of 0.50 indicates a noticeable but not extremely strong correlation.

20.1.1 Phi-Coefficient of Correlation calculation using R and Python:

Phi-coefficient correlation (R)
# METHOD 1: Direct Phi Formula

# Assign the 2x2 table values
a <- 80  # Gym Member & Active
b <- 20  # Gym Member & Not Active
c <- 30  # No Gym Member & Active
d <- 70  # No Gym Member & Not Active

# Apply the Phi formula
phi_direct <- (a*d - b*c) / sqrt((a+b)*(c+d)*(a+c)*(b+d))

cat("Phi Coefficient (Direct Formula) =", phi_direct, "\n")
Phi Coefficient (Direct Formula) = 0.5025189 
# METHOD 2: Using Chi-square Test

# Create the contingency table
observed <- matrix(c(a, b, c, d),
                   nrow = 2,
                   byrow = TRUE,
                   dimnames = list(
                     c("Gym Member", "No Gym Member"),
                     c("Active", "Not Active")))

# Run Chi-square Test (Yates' continuity correction is applied to 2x2 tables by default)
chi_test <- chisq.test(observed)

# Phi = sqrt(Chi-square / total sample size)
phi_chi <- sqrt(chi_test$statistic / sum(observed))

cat(
  "Chi-square Statistic =", round(chi_test$statistic, 4), "\n",
  "Degrees of Freedom   =", chi_test$parameter, "\n",
  "P-value              =", round(chi_test$p.value, 5), "\n",
  "Phi Coefficient      =", round(phi_chi, 4), "\n"
)
Chi-square Statistic = 48.5051 
 Degrees of Freedom   = 1 
 P-value              = 0 
 Phi Coefficient      = 0.4925 
Phi-coefficient correlation (Python)
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[80, 20],
                     [30, 70]])

# METHOD 1: Direct Phi Formula

a, b, c, d = 80, 20, 30, 70

phi_direct = (a*d - b*c) / np.sqrt((a+b)*(c+d)*(a+c)*(b+d))
print("Phi Coefficient (Direct Formula) =", phi_direct)
Phi Coefficient (Direct Formula) = 0.502518907629606
# METHOD 2: Using Chi-square (default: with Yates' correction)
chi2, p, dof, expected = chi2_contingency(observed)

# Phi Coefficient
phi_chi = np.sqrt(chi2 / observed.sum())

# Display all results together
print(f"""
Chi-square Statistic : {chi2:.4f}
Degrees of Freedom   : {dof}
P-value              : {p:.5f}
Phi Coefficient      : {phi_chi:.4f}
""")

Chi-square Statistic : 48.5051
Degrees of Freedom   : 1
P-value              : 0.00000
Phi Coefficient      : 0.4925

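By default, the chi-square tests in R (chisq.test) and Python (chi2_contingency) apply Yates’ continuity correction to 2×2 tables. The correction slightly reduces the Chi-square statistic, so the Phi value derived from it is a bit smaller (0.4925 instead of 0.5025). Disabling the correction (correct = FALSE in R, correction=False in SciPy) reproduces the value from the direct formula. Because a square root is never negative, the chi-square route also gives only the magnitude of Phi; the sign comes from the direct formula.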
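To see where the gap comes from, the chi-square statistic can be computed from scratch with and without the correction. This is a sketch in plain Python rather than a call into R or SciPy; `chi_square` is a hypothetical helper, and it assumes the usual form of Yates' correction, which shrinks each |observed − expected| difference by 0.5 before squaring.

```python
import math

observed = [[80, 20],
            [30, 70]]

def chi_square(table, yates):
    """Pearson chi-square for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(obs - expected)
            if yates:
                # Shrink the deviation by 0.5, never below zero
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

n = sum(sum(row) for row in observed)

chi2_plain = chi_square(observed, yates=False)
chi2_yates = chi_square(observed, yates=True)

phi_plain = math.sqrt(chi2_plain / n)   # should match the direct formula
phi_yates = math.sqrt(chi2_yates / n)   # should match the library defaults

print(f"Uncorrected: chi2 = {chi2_plain:.4f}, phi = {phi_plain:.4f}")
print(f"With Yates : chi2 = {chi2_yates:.4f}, phi = {phi_yates:.4f}")
```

The uncorrected statistic leads back to the direct-formula value, while the corrected one leads to the smaller value reported by the default library calls.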

In the Python code, the chi2_contingency() function from SciPy’s stats module computes the chi-square statistic, and the Phi Coefficient is then calculated as the square root of the ratio of that statistic to the total sample size.