A previous tutorial introduced some summary statistics that apply to both categorical and metric variables. Now it’s time to turn to some measures that apply to metric variables exclusively. The most important ones are the mean (or average), the variance and the standard deviation.
Most of us are probably familiar with the mean (or average) but we’ll briefly review it for the sake of completeness. The mean is the sum of all values divided by the number of values that were added. We can represent this definition by the formula
$$\bar{X} = \frac{\sum X}{n}$$
where $\sum X$ is the sum of all values and $n$ is the number of values.
Example Calculation Mean
Now let’s say we have some variable X1 containing the values 8, 9, 10, 11 and 12. If we fill these values into our formula, we’ll see that their mean is 10:
$$\bar{X} = \frac{8 + 9 + 10 + 11 + 12}{5} = \frac{50}{5} = 10$$
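The same calculation can be sketched in a few lines of Python. This is only an illustration: the names `x1` and `mean_x1` are ours, and `statistics.mean` from the standard library is used merely as a cross-check.

```python
from statistics import mean

# Values of variable X1 from the example
x1 = [8, 9, 10, 11, 12]

# Mean = sum of all values divided by the number of values
mean_x1 = sum(x1) / len(x1)

print(mean_x1)              # 10.0
print(mean_x1 == mean(x1))  # True
```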
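The variance is the average squared deviation from the mean. We can represent this definition by the formula
$$S^2 = \frac{\sum (X - \bar{X})^2}{n}$$
where $\bar{X}$ is the mean and $n$ is the number of values.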
The variance is a measure of dispersion; it indicates how far the data values lie apart.
Example Calculation Variance
Let’s reconsider variable X1 holding values 8, 9, 10, 11 and 12. If we apply the formula, we’ll find that the variance is 2.*
Now we have a second variable, X2, holding values 6, 8, 10, 12 and 14. How would you describe in words the difference between variables X1 and X2? They both have a mean value of 10. Well, the difference is that the values of X2 lie further apart; that is, X2 has a larger variance than X1.
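To make the comparison concrete, here is a minimal Python sketch that computes the variance of both variables as the average squared deviation from the mean. The function name `variance` and the list names are our own choices; `statistics.pvariance`, which also divides by the number of values, serves as a cross-check.

```python
from statistics import pvariance

x1 = [8, 9, 10, 11, 12]
x2 = [6, 8, 10, 12, 14]

def variance(values):
    """Average squared deviation from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

var_x1 = variance(x1)
var_x2 = variance(x2)

print(var_x1)  # 2.0
print(var_x2)  # 8.0

# Cross-check against the standard library's population variance
print(var_x1 == pvariance(x1), var_x2 == pvariance(x2))  # True True
```

Both variables have a mean of 10, but the squared deviations of X2 (16, 4, 0, 4, 16) are four times those of X1 (4, 1, 0, 1, 4), so its variance is four times as large.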
Variance and Histogram
A variable’s variance is reflected by the shape of its histogram. Everything else equal, as the variance increases, the histogram becomes wider and lower. The figure below illustrates this for real data. Each variable has 1,000 observations and a mean of precisely 100. Note that the three histograms use the same scales for their horizontal and vertical axes.
Note how the histograms become lower and wider as variance increases.
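The effect can be reproduced with a small simulation. The sketch below does not use the real data behind the figure. It assumes three normally distributed variables of 1,000 observations each, rescaled to a mean of 100 and hypothetical standard deviations of 5, 10 and 20, and counts observations in bins of width 5. The seed, the standard deviations and the bin width are all arbitrary choices made for illustration.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable

# 1,000 random draws, rescaled to mean 0 and standard deviation 1
raw = [rng.gauss(0, 1) for _ in range(1000)]
m, s = mean(raw), pstdev(raw)
z = [(v - m) / s for v in raw]

def peak_and_width(sd, bin_width=5):
    """Return the tallest bin count and the total spread of the values."""
    values = [100 + sd * v for v in z]  # mean 100, standard deviation sd
    counts = {}
    for v in values:
        b = math.floor(v / bin_width)   # index of the bin holding v
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values()), max(values) - min(values)

results = {sd: peak_and_width(sd) for sd in (5, 10, 20)}
for sd, (peak, width) in results.items():
    print(f"sd={sd}: tallest bin={peak}, spread={width:.1f}")
```

With the same bins for all three variables, the tallest bin shrinks and the spread grows as the standard deviation (and hence the variance) increases, which is the "lower and wider" pattern described above.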
The standard deviation is the square root of the variance. Its formula is therefore almost identical to that of the variance:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$
Just like the variance, the standard deviation is a measure of dispersion; it indicates how far a number of values lie apart.
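As a quick illustration, the sketch below computes the standard deviation of the two example variables, X1 (8, 9, 10, 11, 12) and X2 (6, 8, 10, 12, 14), by taking the square root of the variance. The function name `std_dev` is our own; `statistics.pstdev` is used only as a cross-check.

```python
import math
from statistics import pstdev

x1 = [8, 9, 10, 11, 12]
x2 = [6, 8, 10, 12, 14]

def std_dev(values):
    """Square root of the average squared deviation from the mean."""
    m = sum(values) / len(values)
    variance = sum((v - m) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

sd_x1 = std_dev(x1)
sd_x2 = std_dev(x2)

print(round(sd_x1, 3))  # 1.414
print(round(sd_x2, 3))  # 2.828

# Cross-check against the standard library's population standard deviation
print(math.isclose(sd_x1, pstdev(x1)), math.isclose(sd_x2, pstdev(x2)))  # True True
```

The values of X2 lie twice as far from the mean as those of X1, and its standard deviation is twice as large, whereas its variance is four times as large.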
The standard deviation and the variance thus basically express the same thing, albeit on different scales. So why don’t we just use one measure for expressing the dispersion of a number of values? The reason is that the standard deviation is mathematically more convenient in some scenarios, and the variance in others.