9 Measures of Central Tendency

Measures of central tendency

Measures of central tendency are statistical metrics that summarize or describe the center point or typical value of a dataset. These measures are crucial in data analysis as they provide a simple summary about the sample and the measures. The three main measures of central tendency are the mean, median, and mode. Each measure provides different insights into the distribution and central point of a dataset.

Central tendency measures identify the central point around which data points cluster, offering insights into the dataset’s overall behavior.

Mean: The arithmetic average, calculated by summing all observations and dividing by the count of observations.
Median: The middle value in an ordered dataset, dividing it into two equal halves.
Mode: The most frequently occurring value(s) in a dataset, indicating the highest peak of the distribution.

9.1 Mean

The mean, often referred to as the average, is calculated by adding all the numbers in a dataset and then dividing by the count of those numbers. It is the most common measure of central tendency.

Formula: \(\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}\), where \(x_i\) represents each value in the dataset and \(n\) is the number of values.
Sensitive to Outliers: The mean is influenced by outliers (extremely high or low values), which can skew the result.
Used for: Interval and ratio levels of measurement.

Example

Consider the set of exam scores: 85, 90, 78, 92, 85.

To calculate the mean:

Sum all the scores: \(85 + 90 + 78 + 92 + 85 = 430\).
Divide by the number of scores: \(430 / 5 = 86\).

So, the mean score is 86.

Python

Application

The mean is often used in educational settings to calculate the average score of a test, student grades, or even teacher evaluations. It provides a quick snapshot of the overall performance but can be misleading if a few students scored exceptionally high or low compared to the rest.

9.2 Median

The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle numbers.

Represents: The 50th percentile of the dataset.
Not Sensitive to Outliers: Unlike the mean, the median is not affected by outliers, making it a better measure of central tendency for skewed distributions.
Used for: Ordinal, interval, and ratio levels of measurement.

Example

Using the same set of exam scores: 85, 90, 78, 92, 85. First, arrange them in ascending order: 78, 85, 85, 90, 92.

The median is the middle number, so in this case, it’s the third score: 85.

If the dataset had an even number of observations, say we add another score, 88, making the set: 78, 85, 85, 88, 90, 92. The median would be the average of the two middle scores, \(85 + 88 = 173\), then \(173 / 2 = 86.5\).

Python

Application

The median is valuable in real estate to determine the median house price in a region, providing a more accurate representation than the mean, which could be skewed by a few very high-priced or very low-priced sales.

9.3 Mode

The mode is the value that appears most frequently in a dataset. A dataset may have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if all values are unique.

Useful for: Identifying the most common or popular item in a dataset.
Can Be Applied to: Nominal, ordinal, interval, and ratio levels of measurement. It is the only measure of central tendency that can be used with nominal data.
Limitations: The mode might not be representative of the dataset as a whole, especially in datasets with a large number of unique values or multiple modes.

Example 1 (unimodal)

In the scores 85, 90, 78, 92, 85, the mode is 85, as it appears more frequently than any other score.

Example 2 (bimodal)

In a different set of scores: 70, 75, 80, 75, 80, 85. The dataset is bimodal because two numbers appear most frequently, 75 and 80.

Example 3 (no mode)

If all scores are unique, for example, 70, 75, 80, 85, 90, there is no mode, as no number appears more than once.

Python

1.  table(exam_scores): Creates a frequency table of the scores.
2.  which.max(...): Finds the value with the highest frequency.
3.  names(...): Extracts the most frequent score.
4.  as.numeric(...): Converts the result from character to numeric.

Application

The mode is used in marketing research to identify the most popular product size or color. It’s also used in demography to determine the most common age of a population.

9.3.1 When to Use Each Measure

Mean: Ideal for datasets without outliers and when every value is relevant. For example, calculating the average temperature of a city over a month to gauge climate change.
Median: Best for skewed distributions or when outliers are present, like in income surveys where a few extremely high or low incomes can skew the mean.
Mode: Useful for categorical data or to find the most common value in a dataset. For instance, finding the most common shoe size sold in a store to manage inventory efficiently.

9.3.2 Choosing the Right Measure

Symmetrical Distributions: The mean is typically preferred for symmetric distributions without outliers, as it considers every value in the dataset.
Skewed Distributions: The median is better for skewed distributions or when there are outliers, as it is not influenced by extreme values.
Categorical Data: The mode is best for categorical data or when identifying the most common value is of interest.

Importance

Each measure of central tendency offers unique insights into the data. The mean provides a mathematical average, the median gives the midpoint unaffected by outliers, and the mode indicates the most frequently occurring value. Selecting the appropriate measure depends on the nature of the data, its distribution, and the information you seek to extract from it. Understanding these measures enhances our ability to summarize, analyze, and make decisions based on data.