Module 5 - Part 3: Measures of Variation

Beyond measures of location, other characteristics of a set of numerical measurements are of particular interest. Variation, or spread, in a set of numerical measurements is discussed here.

Consider Situation 1), which is the dotplot of Household Income for the 87 counties in Minnesota. The data provided in Situation 2) is fictitious data that has the same average as Situation 1).

Situation	Dotplot of Income	Description
1)		The average is $57,350. Situation 1) has Household Income levels that are more spread out than Situation 2). There is more income disparity in Situation 1).
2)		The average is $57,350. The Household Income levels across the counties are more similar in Situation 2). There is less income disparity in Situation 2).

The language used for the measures of center is consistent, e.g., mean, average, and median. However, a variety of terms are used in place of variation, e.g., spread, consistency, disparity, volatility, risk, etc.

Disparity: If a state has high income disparity, then incomes are dissimilar.

Volatility: Stocks with high volatility are considered risky as the rate of return is uncertain. Strategy A has more volatility than Strategy B here.

Measures of Spread

Range
Interquartile Range
Mean Absolute Deviation
Standard Deviation / Variance

Example 5.3.1 Reconsider the data of the Household Income levels for the 87 counties in Minnesota. This data was obtained from the United States Census Bureau.

Download Data: Link

Measure of Spread: Range

The range is easy to compute and easy to understand; however, the range is very limiting in the information it provides about a set of measurements.

\[ Range = Maximum - Minimum \]

\[ \begin{array}{rcl} Range & = & \$93,151 - \$42,439 \\ & = & \$50,712 \\ \end{array} \]

Household Income	Percentiles
$42,439	0%
$48,096	10%
$50,518	20%
$51,174	25%
$52,382	30%
$53,642	40%
$54,645	50%
$56,538	60%
$58,479	70%
$59,611	75%
$61,635	80%
$73,219	90%
$93,151	100%

Interpretation of Range: The distance between the poorest county’s Household Income level and richest county’s Household Income level is $50,712.
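The range arithmetic above can be sketched in a few lines of Python. This is only an illustration: it uses the minimum and maximum from the percentile table, and the variable names are assumptions introduced for this example.

```python
# Minimum (0th percentile) and maximum (100th percentile) Household Income
# levels, taken from the percentile table above.
minimum_income = 42_439
maximum_income = 93_151

# Range = Maximum - Minimum
income_range = maximum_income - minimum_income
print(f"Range = ${income_range:,}")  # Range = $50,712
```

Notice that no other value in the dataset enters the calculation.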

The range is not a commonly used measure of variation (or spread) by statisticians. There are two main reasons for this.

The range uses only two values in a dataset and ignores all other values.
Note: One would never find it acceptable to use only two values to compute a mean; similarly, statisticians do not feel it is appropriate to use only two values to compute a measure of spread.
By definition, the range is directly and adversely affected by outliers, because outliers always sit at the lower or upper end of a distribution.

Questions

1. What is the smallest possible value for the range?

2. What does it mean if the range is at the smallest possible value? Briefly discuss.

Measure of Spread: Interquartile Range

John Tukey, the inventor of the box-and-whisker plot, used a quantity called the interquartile range. The interquartile range is not adversely affected by outliers because the lower and upper 25% of the distribution are truncated. The interquartile range captures the middle 50% of a distribution.

The interquartile range, denoted by IQR, is computed as follows.

\[ IQR = 75^{th} \space Percentile - 25^{th} \space Percentile \]

The Interquartile Range is computed as follows for the Household Income levels for the 87 counties in Minnesota.

IQR encompasses the middle 50% of the measurements

\[ \begin{array}{rcl} IQR & = & \$59,611 - \$51,174 \\ & = & \$8,437 \\ \end{array} \]

Household Income	Percentiles
$42,439	0%
$48,096	10%
$50,518	20%
$51,174	25%
$52,382	30%
$53,642	40%
$54,645	50%
$56,538	60%
$58,479	70%
$59,611	75%
$61,635	80%
$73,219	90%
$93,151	100%

Interpretation of Interquartile Range: The distance between the 75th Percentile and the 25th Percentile for the Household Income levels for the 87 counties in Minnesota is $8,437.
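The following Python sketch repeats the IQR arithmetic from the percentile table and then illustrates, with a small hypothetical dataset (not the Minnesota data), why the IQR resists outliers while the range does not. The helper name `spread_summary` is an assumption made for this example, and `statistics.quantiles` with `method="inclusive"` is just one of several percentile conventions; other software may give slightly different quartiles.

```python
from statistics import quantiles

# 75th and 25th percentiles from the percentile table above.
iqr_minnesota = 59_611 - 51_174
print(f"IQR = ${iqr_minnesota:,}")  # IQR = $8,437

def spread_summary(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

# Hypothetical data: identical except that the largest value is an outlier.
typical = [10, 20, 30, 40, 50, 60, 70, 80, 90]
with_outlier = [10, 20, 30, 40, 50, 60, 70, 80, 900]

range_typical, iqr_typical = spread_summary(typical)
range_outlier, iqr_outlier = spread_summary(with_outlier)

print(range_typical, iqr_typical)   # 80 40.0
print(range_outlier, iqr_outlier)   # 890 40.0
```

In this sketch the outlier inflates the range from 80 to 890, while the IQR stays at 40.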

Measure of Spread: Distance-to-Center

The standard approach to measuring spread or variation in data relies on the concept of distance-to-center. In particular, a set of measurements that has a smaller total or average distance-to-center is said to have less spread.

Consider again a map that shows the Household Income levels for the 87 counties across Minnesota. Counties whose household incomes are near the average are white. A more intense color indicates a further distance from the average – red indicating income below average and green indicating above.

Example 5.3.1
Consider the household income levels for the state of Rhode Island. Rhode Island has only five counties. Consider the following fictitious datasets for Rhode Island.

[Figures: Data Set A, Data Set B, Data Set C]

A residual, also known as a deviation, is defined to be the signed distance between a data point and the average. The residual is the most commonly used quantity when measuring distance-to-center.

\[ Residual = \big( \text{Data Point} - \text{Average} \big) \]

The residual values for each data point are shown below for the datasets presented above. Unfortunately, when the residual values are added together to get an overall distance-to-center, the Total is 0. The Total is equal to 0 because the negative residuals cancel out the positive residuals.

Data Set	Residuals	Total
A	$0, $0, $0, $0, $0	0
B	-$20,000, $0, $0, $0, +$20,000	0
C	-$20,000, -$10,000, $0, +$10,000, +$20,000	0

If the residuals are added up across all observations, the Total is equal to 0. The cancelling out of the negative and positive residuals is not unique to just this situation and happens with all data. Recall, the average is the balance point in the distribution and when the residual values are summed across all observations, the Total will be zero.

\[ \begin{array}{rcl} \text{Total} & = & \sum{Residuals} \\ & = & 0 \\ \end{array} \]

In mathematics, there are two common approaches to getting rid of negative values: absolute value and squaring. One of the two following modifications can be used when the goal is to obtain a total distance-to-center.

Use Absolute Value of Residuals: One possible fix for the cancelling out effect when obtaining a sum of the residuals across all observations is to use

\[ \text{Total} = \sum{| Residual |} \]

Squaring the Residuals: Another possible fix for the cancelling out effect when obtaining a sum of the residuals across all observations is to use

\[ \text{Total} = \sum{\big(Residual\big)^{2}} \]
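The cancelling out effect and the two fixes can be checked with a short Python sketch. The income levels below are hypothetical: they were chosen so that the residuals match those of Data Set C, and the average of $60,000 is an assumption made only for this illustration.

```python
# Hypothetical income levels whose residuals match Data Set C;
# the $60,000 average is assumed for illustration only.
data_set_c = [40_000, 50_000, 60_000, 70_000, 80_000]

average = sum(data_set_c) / len(data_set_c)
residuals = [value - average for value in data_set_c]

total_raw = sum(residuals)                       # negatives cancel positives
total_absolute = sum(abs(r) for r in residuals)  # absolute value fix
total_squared = sum(r ** 2 for r in residuals)   # squaring fix

print(total_raw)       # 0.0
print(total_absolute)  # 60000.0
print(total_squared)   # 1000000000.0
```

The raw total is zero regardless of how spread out the values are, whereas both modified totals grow as the values move away from the average.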

Measure of Spread: Mean Absolute Deviation

The mean absolute deviation (or mean absolute residual), denoted by MAD, is computed as follows. This is one method used to measure spread using the notion of distance-to-middle.

\[ \text{Mean Absolute Deviation} = \frac{\sum{| Residual |}}{n} \]

The Mean Absolute Deviation has been computed for Data Set A, B, and C from above.

Data Set A

Residual	|Residual|
$0	$0
$0	$0
$0	$0
$0	$0
$0	$0
Total	$0

\[ \begin{array}{rcl} MAD & = & \frac{\$0}{5} \\ & = & \$0 \\ \end{array} \]

Data Set B

Residual	|Residual|
-$20,000	+$20,000
$0	$0
$0	$0
$0	$0
+$20,000	+$20,000
Total	$40,000

\[ \begin{array}{rcl} MAD & = & \frac{\$40,000}{5} \\ & = & \$8,000 \\ \end{array} \]

Data Set C

Residual	|Residual|
-$20,000	+$20,000
-$10,000	+$10,000
$0	$0
+$10,000	+$10,000
+$20,000	+$20,000
Total	$60,000

\[ \begin{array}{rcl} MAD & = & \frac{\$60,000}{5} \\ & = & \$12,000 \\ \end{array} \]

Interpretation for Mean Absolute Deviation for Data Set B: The average distance to the middle is $8,000. That is, the typical distance from each point to the average is $8,000.

Comments

The mean absolute deviation for Data Set A is $0. This implies that there is no spread in the values for Data Set A.
The mean absolute deviation for Data Set C is $12,000. This value is somewhat larger than that of Data Set B because the values in Data Set C are, on average, farther from the center than the values in Data Set B.
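The MAD calculations for the three data sets can be reproduced with the sketch below. The function name `mean_absolute_deviation` and the income levels are assumptions for this example: the values are hypothetical and were chosen (with an assumed average of $60,000) so that their residuals match the tables above.

```python
def mean_absolute_deviation(values):
    """Average of the absolute residuals, dividing by n."""
    average = sum(values) / len(values)
    return sum(abs(value - average) for value in values) / len(values)

# Hypothetical income levels whose residuals match Data Sets A, B, and C.
data_sets = {
    "A": [60_000, 60_000, 60_000, 60_000, 60_000],
    "B": [40_000, 60_000, 60_000, 60_000, 80_000],
    "C": [40_000, 50_000, 60_000, 70_000, 80_000],
}

mad_by_set = {name: mean_absolute_deviation(values)
              for name, values in data_sets.items()}

for name, mad in mad_by_set.items():
    print(f"Data Set {name}: MAD = ${mad:,.0f}")
```

Under these assumptions the sketch gives $0, $8,000, and $12,000, in agreement with the hand calculations.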

Measure of Spread: Standard Deviation

The residuals could have been squared to prevent the cancelling out effect when obtaining a sum of residuals across all observations.

Data Set A

Residual	(Residual)^2
$0	0
$0	0
$0	0
$0	0
$0	0
Total	0

\[ \begin{array}{rcl} Average & = & \frac{0}{5} \\ & = & 0 \\ \end{array} \]

Data Set B

Residual	(Residual)^2
-$20,000	400,000,000
$0	0
$0	0
$0	0
+$20,000	400,000,000
Total	800,000,000

\[ \begin{array}{rcl} Average & = & \frac{800,000,000}{5} \\ & = & 160,000,000 \\ \end{array} \]

Data Set C

Residual	(Residual)^2
-$20,000	400,000,000
-$10,000	100,000,000
$0	0
+$10,000	100,000,000
+$20,000	400,000,000
Total	1,000,000,000

\[ \begin{array}{rcl} Average & = & \frac{1,000,000,000}{5} \\ & = & 200,000,000 \\ \end{array} \]

The population variance is defined to be the average distance-to-center when the residuals are squared to alleviate the cancelling out effect described above.

\[ \text{Population Variance} = \frac{\sum{(Residual)^2}}{n} \]

The (sample) variance is used more often than the population variance. The denominator for the sample variance uses $n-1$ instead of $n$. This adjustment is necessary because only $n-1$ of the residuals are free to vary, since their sum is constrained to be zero. A more intuitive explanation for why the variance uses $n-1$ is that calculating the residuals requires estimating $1$ additional quantity, i.e., an estimate of the average is needed before a residual can be calculated.

\[ \text{Variance} = \frac{\sum{(Residual)^2}}{n-1} \]

One issue with the variance is that its scale is not the same as the scale of the original data. Squaring the residuals puts the variance on the squared scale of the original data. Taking the square root of the variance overcomes this problem. The result is called the standard deviation.

\[ \text{Standard Deviation} = \sqrt{\frac{\sum{(Residual)^2}}{n-1}} \]

The standard deviation has been computed for Data Set A, B, and C from above.

Data Set A

\[ \begin{array}{rcl} \text{Std Dev} & = & \sqrt{\frac{\sum{(Residuals)^2}}{(n-1)}} \\ & = & \sqrt{\frac{0}{4}} \\ & = & \sqrt{0} \\ & = & \$0 \\ \end{array} \]

Data Set B

\[ \begin{array}{rcl} \text{Std Dev} & = & \sqrt{\frac{\sum{(Residuals)^2}}{(n-1)}} \\ & = & \sqrt{\frac{800,000,000}{4}} \\ & = & \sqrt{200,000,000} \\ & = & \$14,142 \\ \end{array} \]

Data Set C

\[ \begin{array}{rcl} \text{Std Dev} & = & \sqrt{\frac{\sum{(Residuals)^2}}{(n-1)}} \\ & = & \sqrt{\frac{1,000,000,000}{4}} \\ & = & \sqrt{250,000,000} \\ & = & \$15,811 \\ \end{array} \]

Interpretation of Standard Deviation for Data Set B: The standard deviation for Data Set B is $14,142. That is, the typical distance from a county's income level to the overall average is a little over $14,000.
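The difference between dividing by $n$ and dividing by $n-1$ can be seen with Python's standard `statistics` module, where `pvariance` divides by $n$ and `variance` and `stdev` divide by $n-1$. The income levels are hypothetical values whose residuals match Data Sets B and C (an assumed average of $60,000); the dictionary names are introduced only for this sketch.

```python
from statistics import pvariance, stdev, variance

# Hypothetical income levels whose residuals match Data Sets B and C.
data_sets = {
    "B": [40_000, 60_000, 60_000, 60_000, 80_000],
    "C": [40_000, 50_000, 60_000, 70_000, 80_000],
}

results = {}
for name, values in data_sets.items():
    results[name] = {
        "population_variance": pvariance(values),  # divides by n
        "sample_variance": variance(values),       # divides by n - 1
        "standard_deviation": stdev(values),       # sqrt of sample variance
    }
    print(name, results[name])
```

With these values the sample standard deviations round to $14,142 and $15,811, matching the calculations above, while the population variances match the averages of the squared residuals (160,000,000 and 200,000,000).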

Comments

The standard deviation for Data Set A is zero because there is no spread in the household income levels for Data Set A.
The information and all the formulas presented above for the standard deviation may be overwhelming. Intuitively, the standard deviation measures the typical distance each observation is from the average. The standard deviation for the following data is 1.0 as this is the typical distance each observation is from 20.

Overall Comments Regarding Measures of Spread

The range should not be used to measure variation for a set of numerical measurements.
The mean absolute deviation and standard deviation are acceptable measures of variation for a set of numerical measurements. The standard deviation is most commonly used by statisticians because of its optimal theoretical properties.
The mean absolute deviation and standard deviation cannot be less than zero. A value near zero implies less variation and a large value implies more variation.
Measuring distance-to-middle is different from measuring the gap between values. For example, the following data has a consistent uniform pattern and has a small gap between each value; however, this data has considerable variation as measured by distance-to-center.
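The last comment can be illustrated with a small sketch. The evenly spaced values below are hypothetical and are not the data from the original figure; they only show that a small, constant gap between neighboring values does not imply a small standard deviation.

```python
from statistics import stdev

# Hypothetical evenly spaced data: 10, 20, ..., 100.
evenly_spaced = list(range(10, 101, 10))

# Gap between each value and the next one.
gaps = [later - earlier
        for earlier, later in zip(evenly_spaced, evenly_spaced[1:])]
largest_gap = max(gaps)

# Spread measured by distance-to-center.
spread = stdev(evenly_spaced)

print(f"Largest gap: {largest_gap}")
print(f"Standard deviation: {spread:.2f}")
```

Here every gap is 10, yet the standard deviation is roughly three times that size because the values at the ends sit far from the average of 55.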

Measures of Spread: Spreadsheet

Spreadsheets have formulas that will automatically compute the mean absolute deviation and the standard deviation for a set of numerical measurements. The mean absolute deviation formula is =AVEDEV() and the standard deviation formula is =STDEV(). These formulas are used to compute the MAD and standard deviation for Data Set C from above.

A PivotTable can be used to compute the standard deviation. Using a PivotTable to compute basic summaries is advantageous when several summaries are to be computed.

The PivotTable summaries provided below were obtained using the following specifications.

Rows: State

Values: 4 instances of Household Income; One for MEDIAN, AVERAGE, STDEV, and COUNT

Filter: State; deselect (Blanks) to remove blank row in PivotTable

The resulting PivotTable Summaries are shown below.
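For readers working outside a spreadsheet, the same PivotTable specification (Rows, Values, Filter) can be approximated in Python. This is a sketch under stated assumptions: the rows are made-up values rather than the course dataset, and the names `rows`, `incomes_by_state`, and `summaries` are introduced only for this example.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (state, household income) rows; not the course dataset.
rows = [
    ("Minnesota", 52_000), ("Minnesota", 58_000), ("Minnesota", 64_000),
    ("Rhode Island", 60_000), ("Rhode Island", 70_000),
    (None, 55_000),  # blank state, like the (Blanks) row in the PivotTable
]

# Rows: State, with the Filter step that drops blank states.
incomes_by_state = defaultdict(list)
for state, income in rows:
    if state is not None:
        incomes_by_state[state].append(income)

# Values: four summaries of Household Income per state.
summaries = {
    state: {
        "MEDIAN": median(incomes),
        "AVERAGE": mean(incomes),
        "STDEV": stdev(incomes),  # sample standard deviation (n - 1)
        "COUNT": len(incomes),
    }
    for state, incomes in incomes_by_state.items()
}

for state, summary in summaries.items():
    print(state, summary)
```

The `stdev` function divides by $n-1$, which is the same convention as the spreadsheet formula =STDEV().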

Questions

Consider the PivotTable Summaries provided above and the following dotchart when answering the questions below.

3. Does the fact that these states have a different number of counties adversely affect our ability to make fair comparisons between these states? Explain.

4. How does the average income compare across these three states? Discuss.

5. How does the income disparity, i.e. spread in income, vary across these three states? Which state has the largest amount of income disparity? Discuss.

6. Do the two measures of center, i.e. average and median, agree for each state? Briefly discuss.

7. Consider the CDF plots for each of these states. [See graphic below] Does this chart support or refute what you stated above regarding the measures of center and spread? Briefly discuss.