Beyond measures of location, other characteristics of a set of numerical measurements are of particular interest. Variation, or spread, in a set of numerical measurements is discussed here.
Consider Situation 1), which is the dotplot of Household Income for the 87 counties in Minnesota. The data in Situation 2) are fictitious and have the same average as Situation 1).
Situation | Dotplot of Income | Description |
1) | The average is $57,350. Situation 1) has Household Income levels that are more spread out than Situation 2). There is more income disparity in Situation 1). | |
2) | The average is $57,350. The Household Income levels across the counties are more similar in Situation 2). There is less income disparity in Situation 2). |
The language used for measures of center is consistent, e.g. mean, average, and median. However, a variety of terms are used in place of variation, e.g. spread, consistency, disparity, volatility, and risk.
Disparity: If a state has high income disparity, then incomes are dissimilar.
Volatility: Stocks with high volatility are considered risky as the rate of return is uncertain. Strategy A has more volatility than Strategy B here.
Measures of Spread
Range
Interquartile Range
Mean Absolute Deviation
Standard Deviation / Variance
Example 5.3.1 Reconsider the data of the Household Income levels for the 87 counties in Minnesota. This data was obtained from the United States Census Bureau.
The range is easy to compute and easy to understand; however, the range is very limited in the information it provides about a set of measurements.
\[ Range = Maximum - Minimum \]
\[ \begin{array}{rcl} Range & = & $93,151 - $42,439 \\ & = & $50,712 \\ \end{array} \]
Interpretation of Range: The distance between the poorest county’s Household Income level and richest county’s Household Income level is $50,712.
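The range calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `value_range` is an assumption, and the list of incomes is hypothetical apart from its smallest and largest values, which are the minimum and maximum reported above.

```python
def value_range(values):
    """Return the range of a collection: maximum minus minimum."""
    return max(values) - min(values)

# Hypothetical incomes; only the smallest (42,439) and largest (93,151)
# values come from the Minnesota figures quoted above.
incomes = [42_439, 51_174, 57_350, 59_611, 93_151]

print(value_range(incomes))  # 50712
```

Note that the three middle values play no role in the result, which is the first weakness of the range.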
The range is not a commonly used measure of variation (or spread) among statisticians. There are two main reasons for this.
The range uses only two values in a dataset and ignores all other values.
Note: One would never find it acceptable to use only two values to compute a mean; similarly, statisticians do not feel it is appropriate to use only two values to compute a measure of spread.
By definition, the range is directly and adversely affected by outliers, because outliers always lie at the lower or upper end of a distribution.
Questions
John Tukey, the inventor of the box-and-whisker plot, used a quantity called the interquartile range. The interquartile range is not adversely affected by outliers because the lower and upper 25% of the distribution are truncated. The interquartile range captures the middle 50% of a distribution. The interquartile range, denoted by IQR, is computed as follows.
\[ IQR = 75^{th} \space Percentile - 25^{th} \space Percentile \]
The Interquartile Range is computed as follows for the Household Income levels for the 87 counties in Minnesota.
The IQR encompasses the middle 50% of the measurements.
\[ \begin{array}{rcl} IQR & = & $59,611 - $51,174 \\ & = & $8,437 \\ \end{array} \]
Interpretation of Interquartile Range: The distance between the 75th Percentile and the 25th Percentile for the Household Income levels for the 87 counties in Minnesota is $8,437.
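A minimal sketch of why the IQR resists outliers while the range does not is given below. The incomes (in thousands of dollars) are hypothetical and are not the Minnesota data, and the helper name `iqr` is an assumption. Percentiles are computed with the `inclusive` method of Python's `statistics.quantiles`; other software may use a slightly different percentile convention and give slightly different values.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical incomes in thousands of dollars (not the Minnesota data).
incomes = [40, 45, 50, 52, 55, 58, 60, 65, 70]
# Same data, with the largest value replaced by an extreme outlier.
with_outlier = incomes[:-1] + [200]

for label, data in [("original", incomes), ("with outlier", with_outlier)]:
    print(label, "range:", max(data) - min(data), "IQR:", iqr(data))
```

In this sketch the outlier stretches the range from 30 to 160, while the IQR stays at 10 because the middle 50% of the values is unchanged.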
The standard approach to measuring spread or variation in data relies on the concept of distance-to-center. In particular, a set of measurements that has a smaller total or average distance-to-center is said to have less spread.
Consider again a map that shows the Household Income levels for the 87 counties across Minnesota. Counties whose household incomes are near the average are white. A more intense color indicates a greater distance from the average, with red indicating income below average and green indicating income above.
Example 5.3.1
Consider the household income levels for the state of Rhode Island. Rhode Island has only five counties. Consider the following fictitious datasets for Rhode Island.
[Tables: Data Set A, Data Set B, Data Set C]
A residual, also known as a deviation, is defined to be the signed distance between a data point and the average. Residuals are the most commonly used way to measure distance-to-center.
\[Residual = \big( \text{Data Point} - \text{Average} \big)\]
The residual values for each data point are shown below for the datasets presented above. Unfortunately, when the residual values are added together to get an overall distance-to-center, the Total is 0. The Total is equal to 0 because the negative residuals cancel out the positive residuals.
[Tables: residuals for Data Set A, Data Set B, Data Set C]
The cancelling out of negative and positive residuals is not unique to these datasets; it happens with all data. Recall that the average is the balance point of the distribution, so when the residual values are summed across all observations, the Total is always zero.
\[ \begin{array}{rcl} \text{Total} & = & \sum{Residuals} \\ & = & 0 \\ \end{array} \]
In mathematics, there are two common approaches to getting rid of negative values: absolute value and squaring. One of the following two modifications can be used when the goal is to obtain a total distance-to-center.
\[ \text{Total} = \sum{| Residual |} \]
\[ \text{Total} = \sum{\big(Residual\big)^{2}} \]
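The cancellation and the two modifications can be checked numerically. The five incomes below are hypothetical, since the original data tables are not reproduced in this text; they were chosen, as an assumption, so that the absolute and squared totals match the $40,000 and 800,000,000 reported later in this section for Data Set B.

```python
# Hypothetical five-county incomes (an assumption, not the original table).
incomes = [40_000, 60_000, 60_000, 60_000, 80_000]

average = sum(incomes) / len(incomes)
residuals = [x - average for x in incomes]

total = sum(residuals)                      # negatives cancel positives
total_abs = sum(abs(r) for r in residuals)  # absolute-value modification
total_sq = sum(r ** 2 for r in residuals)   # squaring modification

print(total, total_abs, total_sq)  # 0.0 40000.0 800000000.0
```

With less convenient numbers, floating-point rounding can leave the plain total a tiny distance from zero rather than exactly zero.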
The mean absolute deviation (or mean absolute residual), denoted by MAD, is computed as follows. This is one method used to measure spread using the notion of distance-to-middle.
\[ \text{Mean Absolute Deviation} = \frac{\sum{| Residual |}}{n} \]
The Mean Absolute Deviation has been computed for Data Set A, B, and C from above.
Data Set A
\[ \begin{array}{rcl} MAD & = & \frac{$0}{5} \\ & = & $0 \\ \end{array} \]
Data Set B
\[ \begin{array}{rcl} MAD & = & \frac{$40,000}{5} \\ & = & $8,000 \\ \end{array} \]
Data Set C
\[ \begin{array}{rcl} MAD & = & \frac{$60,000}{5} \\ & = & $12,000 \\ \end{array} \]
Interpretation for Mean Absolute Deviation for Data Set B: The average distance to the middle is $8,000. That is, the typical distance from each point to the average is $8,000.
Comments
The mean absolute deviation for Data Set A is $0. This implies that there is no spread in the values for Data Set A.
The mean absolute deviation for Data Set C is $12,000. This value is larger than that of Data Set B because the values in Data Set C are, on average, farther from the center than those in Data Set B.
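The MAD formula can be written as a short Python function. The three lists below are hypothetical stand-ins for Data Sets A, B, and C: the original tables are not reproduced in this text, so the values (and the common average of $60,000) are assumptions chosen to reproduce the totals of $0, $40,000, and $60,000 used above. The function name is likewise an assumption.

```python
def mean_absolute_deviation(values):
    """Average absolute distance between each value and the mean."""
    average = sum(values) / len(values)
    return sum(abs(x - average) for x in values) / len(values)

# Hypothetical stand-ins for the three five-county datasets.
data_a = [60_000, 60_000, 60_000, 60_000, 60_000]
data_b = [40_000, 60_000, 60_000, 60_000, 80_000]
data_c = [40_000, 50_000, 60_000, 70_000, 80_000]

for name, data in [("A", data_a), ("B", data_b), ("C", data_c)]:
    print(name, mean_absolute_deviation(data))
```

Under these assumed values the function returns 0, 8,000, and 12,000, matching the hand calculations.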
The residuals could have been squared to prevent the cancelling out effect when obtaining a sum of residuals across all observations.
Data Set A
\[ \begin{array}{rcl} Average & = & \frac{0}{5} \\ & = & 0 \\ \end{array} \]
Data Set B
\[ \begin{array}{rcl} Average & = & \frac{800,000,000}{5} \\ & = & 160,000,000 \\ \end{array} \]
Data Set C
\[ \begin{array}{rcl} Average & = & \frac{1,000,000,000}{5} \\ & = & 200,000,000 \\ \end{array} \]
The population variance is defined to be the average distance-to-center when the residuals are squared to alleviate the cancelling out effect described above.
\[ \text{Population Variance} = \frac{\sum{(Residual)^2}}{n} \]
The (sample) variance is used more often than the population variance. The denominator for the sample variance uses \(n-1\) instead of \(n\). This adjustment is necessary because only \(n-1\) of the residuals are free to vary, since their sum is constrained to be zero. A more intuitive explanation for why the variance uses \(n-1\) is that calculating the residuals requires the estimation of \(1\) additional quantity, i.e. an estimate of the average is needed before a residual can be calculated.
\[ \text{Variance} = \frac{\sum{(Residual)^2}}{n-1} \]
One issue with the variance is that its scale is not the same as the scale of the original data. Squaring the residuals puts the variance on the squared scale of the original data. Taking the square root of the variance overcomes this problem. The result is called the standard deviation.
\[ \text{Standard Deviation} = \sqrt{\frac{\sum{(Residual)^2}}{n-1}} \]
The standard deviation has been computed for Data Set A, B, and C from above.
Data Set A
Data Set B
\[ \begin{array}{rcl} \text{Std Dev} & = & \sqrt{\frac{\sum{(Residuals)^2}}{(n-1)}} \\ & = & \sqrt{\frac{800,000,000}{4}} \\ & = & \sqrt{200,000,000} \\ & = & $14,142 \\ \end{array} \]
Data Set C
\[ \begin{array}{rcl} \text{Std Dev} & = & \sqrt{\frac{\sum{(Residuals)^2}}{(n-1)}} \\ & = & \sqrt{\frac{1,000,000,000}{4}} \\ & = & \sqrt{250,000,000} \\ & = & $15,811 \\ \end{array} \]
Interpretation of Standard Deviation for Data Set B: The standard deviation for Data Set B is $14,142. That is, the typical distance from a county's income level to the overall average is a little over $14,000.
Comments
The standard deviation for Data Set A is zero because there is no spread in the household income levels for Data Set A.
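The difference between the \(n\) and \(n-1\) denominators can be seen with Python's standard `statistics` module, where `pvariance` divides by \(n\) and `variance` and `stdev` divide by \(n-1\). The lists are hypothetical stand-ins for Data Sets A, B, and C (the original tables are not reproduced in this text); they are assumed values chosen so that the sums of squared residuals equal 0, 800,000,000, and 1,000,000,000.

```python
from statistics import pvariance, variance, stdev

# Hypothetical stand-ins for the three five-county datasets.
datasets = {
    "A": [60_000, 60_000, 60_000, 60_000, 60_000],
    "B": [40_000, 60_000, 60_000, 60_000, 80_000],
    "C": [40_000, 50_000, 60_000, 70_000, 80_000],
}

for name, data in datasets.items():
    print(
        name,
        "population variance:", pvariance(data),  # denominator n
        "sample variance:", variance(data),       # denominator n - 1
        "std dev:", round(stdev(data)),           # square root of sample variance
    )
```

Under these assumptions the sample standard deviations round to 0, 14,142, and 15,811, in agreement with the calculations above.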
The information and all the formulas presented above for the standard deviation may be overwhelming. Intuitively, the standard deviation measures the typical distance each observation is from the average. The standard deviation for the following data is 1.0 as this is the typical distance each observation is from 20.
Overall Comments Regarding Measures of Spread
The range should not be used to measure variation for a set of numerical measurements.
The mean absolute deviation and standard deviation are acceptable measures of variation for a set of numerical measurements. The standard deviation is most commonly used by statisticians because of its optimal theoretical properties.
The mean absolute deviation and standard deviation cannot be less than zero. A value near zero implies less variation and a large value implies more variation.
Measuring distance-to-middle is different from measuring the gap between values. For example, the following data has a consistent uniform pattern and has a small gap between each value; however, this data has considerable variation as measured by distance-to-center.
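The last point can be illustrated with a small sketch. The data here are hypothetical (the whole numbers 1 through 100, not the dataset the text refers to): every gap between neighboring values is 1, yet the values lie far from the average of 50.5.

```python
from statistics import mean, stdev

# Hypothetical evenly spaced data: 1, 2, ..., 100.
data = list(range(1, 101))

# Gap between each pair of neighboring values.
gaps = [b - a for a, b in zip(data, data[1:])]

average = mean(data)
mad = sum(abs(x - average) for x in data) / len(data)

print("largest gap:", max(gaps))
print("mean absolute deviation:", mad)
print("standard deviation:", round(stdev(data), 2))
```

The largest gap is 1, while the mean absolute deviation is 25 and the standard deviation is about 29, so small gaps do not imply small spread.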
Spreadsheets have functions that will automatically compute the mean absolute deviation and the standard deviation for a set of numerical measurements. The mean absolute deviation function is =AVEDEV() and the standard deviation function is =STDEV(). These functions are used to compute the MAD and standard deviation for Data Set C from above.
A PivotTable can be used to compute the standard deviation. Using a PivotTable to compute basic summaries is advantageous when several summaries are to be computed.
Rows: State
Values: 4 instances of Household Income, one each for MEDIAN, AVERAGE, STDEV, and COUNT
Filter: State; deselect (Blanks) to remove the blank row in the PivotTable
The resulting PivotTable Summaries are shown below.
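A comparable grouped summary can also be sketched outside a spreadsheet. The Python sketch below uses only the standard library; the state labels and incomes are hypothetical and are not the Census Bureau data, and the layout of `summary` is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (state, household income) rows; not the Census Bureau data.
rows = [
    ("State A", 50_000), ("State A", 60_000), ("State A", 70_000),
    ("State B", 48_000), ("State B", 52_000),
    ("State B", 56_000), ("State B", 60_000),
]

# Group incomes by state, like placing State in the Rows area.
by_state = defaultdict(list)
for state, income in rows:
    by_state[state].append(income)

# One summary per state, like the four Values entries.
summary = {
    state: {
        "median": median(values),
        "average": mean(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1)
        "count": len(values),
    }
    for state, values in by_state.items()
}

for state, stats in summary.items():
    print(state, stats)
```

Including the count alongside each summary makes it easy to see how many counties each state's statistics are based on.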
Questions
Consider the PivotTable Summaries provided above and the following dotchart when answering the questions below.
3. Does the fact that these states have a different number of counties adversely affect our ability to make fair comparisons between these states? Explain.