Sunday, July 28, 2013

Statistically Writing: Variance and Standard Deviation

I hope people weren't too annoyed by my previous statistics post about measures of central tendency. But it's important to really understand the concept of a mean, and its implications for research, before moving on to bigger and better things. This time around, we're going to look at measures of dispersion.


Say you have a set of data points, and you've figured out the mean for that set. You might, then, want to know how far from the mean each of your data points is. If you subtract the mean from each data point and take the absolute value, you get that distance for each point.

Consider teenagers. You have a group of 5 teens, and each spends a certain number of hours per day on Facebook:

T1=3; T2=5; T3=2; T4=6; T5=2

If you calculate the mean here, you get: 3.6. So, on average, each teen spends 3.6 hours per day on Facebook.

Now suppose you want to know how close or far from average each kid's time on Facebook is (Why? To see if your kid is a freak):

T1: |3-3.6|= 0.6
T2: |5-3.6|= 1.4
T3: |2-3.6|= 1.6
T4: |6-3.6|= 2.4
T5: |2-3.6|= 1.6

Well, that's nice, but notice, you have another data set here, for which you can also find the mean. This is called the Mean Absolute Deviation. In this case, it's equal to 1.52 hours.
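If you'd rather let a computer do the subtraction, here is a minimal Python sketch of the steps so far. The variable names (`hours`, `abs_devs`, `mad`) are my own choices for illustration; the data are the five teens above.

```python
# Hours per day on Facebook for T1..T5, from the example above
hours = [3, 5, 2, 6, 2]

# Mean of the data set
mean = sum(hours) / len(hours)

# Distance of each data point from the mean
abs_devs = [abs(h - mean) for h in hours]

# Mean Absolute Deviation: the mean of those distances
mad = sum(abs_devs) / len(abs_devs)

print(round(mean, 2))                   # 3.6
print([round(d, 2) for d in abs_devs])  # [0.6, 1.4, 1.6, 2.4, 1.6]
print(round(mad, 2))                    # 1.52
```

The rounding is only there to hide floating-point noise in the printed values.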


But Mean Absolute Deviation is not Variance. Variance, denoted by sigma squared, is the sum of the squares of each of these distances, averaged out:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$


So here, the Variance = (0.6² + 1.4² + 1.6² + 2.4² + 1.6²)/5 = 13.2/5 = 2.64.
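As a quick check on the arithmetic, here is a hedged sketch that squares each distance, averages the squares, and compares the result with `statistics.pvariance` from Python's standard library. The names `mu`, `squared_devs`, and `pop_var` are just illustrative.

```python
import statistics

hours = [3, 5, 2, 6, 2]  # the five teens, treated as the whole population

mu = sum(hours) / len(hours)

# Square each distance from the mean, then average the squares
squared_devs = [(h - mu) ** 2 for h in hours]
pop_var = sum(squared_devs) / len(hours)

print(round(sum(squared_devs), 2))            # 13.2
print(round(pop_var, 2))                      # 2.64
print(round(statistics.pvariance(hours), 2))  # 2.64
```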


You may recall from my last stats post that I wrote about the distinction between the sample mean and the population mean. In the above example, the 5 teens constitute our entire population, and the formula above is for population variance, denoted by sigma squared. (Also note that population mean is denoted by mu.)

But let's say you want to use this group of 5 teens to estimate the average number of hours on Facebook for all teens in the US. Then the group of 5 teens is a sample. And weird as this may sound, a better way to estimate the variance of a population based on a sample is to calculate the "unbiased sample variance", denoted by s squared, where you measure distances from the sample mean (x-bar) and divide by n-1 rather than by n:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$


In this case, the unbiased sample variance = 13.2/4 = 3.3.
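The only difference between the two variances is the number you divide by, which a small sketch makes easy to see. The helper `variance_with_ddof` and its `ddof` parameter are hypothetical names I made up for this post (the term "ddof" is borrowed from NumPy); `statistics.variance` is the standard library's n-1 version.

```python
import statistics

def variance_with_ddof(data, ddof=0):
    """Average squared distance from the mean, dividing by (n - ddof).

    ddof=0 gives the population variance, ddof=1 the unbiased sample variance.
    """
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

hours = [3, 5, 2, 6, 2]

print(round(variance_with_ddof(hours, ddof=0), 2))  # 2.64 (divide by n)
print(round(variance_with_ddof(hours, ddof=1), 2))  # 3.3  (divide by n-1)
print(round(statistics.variance(hours), 2))         # 3.3
```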

Variance is a useful measure of how far from the mean the data points are. But notice, it's a squared value, so large distances from the mean get exaggerated. Just looking at the number, without units, you can see that the variance we just computed is bigger than the largest distance of any single teen from the mean, 2.4 hours.

And if you have outliers, say some weird kid was on Facebook 20 hours a day, the variance will be huge. For those readers who thought my last statistics post was overly simplistic, this is where it starts to be important to know which measures are good for data with outliers, and which aren't.
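To get a feel for how hard one outlier pulls on the variance, here is a rough sketch that adds a sixth teen at 20 hours to the same five values. That sixth value is invented purely for illustration, to match the weird kid above.

```python
import statistics

hours = [3, 5, 2, 6, 2]
with_outlier = hours + [20]  # hypothetical sixth teen, 20 hours a day

base_var = statistics.pvariance(hours)
outlier_var = statistics.pvariance(with_outlier)

print(round(base_var, 2))     # 2.64
print(round(outlier_var, 2))  # 39.56
```

One extra kid makes the population variance roughly fifteen times larger, because his distance from the mean gets squared.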

Also, if your data is measured in hours, it's unintuitive to think about distance from the mean in hours squared. This is where Standard Deviation comes in handy.

Standard Deviation is nothing but the square root of variance:

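For an entire population,

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

And for a sample,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$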

In this case, the Population Standard Deviation = √2.64 ≈ 1.62 hours,

and the Sample Standard Deviation = √3.3 ≈ 1.82 hours.
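Here is a minimal sketch of the same calculation in Python, taking the square root of each variance and comparing against `statistics.pstdev` and `statistics.stdev` from the standard library. The variable names are illustrative.

```python
import math
import statistics

hours = [3, 5, 2, 6, 2]

# Standard deviation is the square root of the variance
pop_sd = math.sqrt(statistics.pvariance(hours))    # divide by n
sample_sd = math.sqrt(statistics.variance(hours))  # divide by n-1

print(round(pop_sd, 2))                    # 1.62
print(round(sample_sd, 2))                 # 1.82
print(round(statistics.pstdev(hours), 2))  # 1.62
print(round(statistics.stdev(hours), 2))   # 1.82
```

Unlike the variance, these results are back in hours, the same units as the data.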


Visually, it looks something like this:



The mean is in blue, the data points in green, and the purple lines represent one standard deviation in each direction from the mean.
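If you want to know which teens land between the two purple lines, a short sketch can list them. I'm assuming here that the band uses the population standard deviation; the names `low`, `high`, and `inside` are my own.

```python
import statistics

hours = [3, 5, 2, 6, 2]

mu = statistics.mean(hours)
sigma = statistics.pstdev(hours)  # assumption: population standard deviation

# One standard deviation in each direction from the mean
low, high = mu - sigma, mu + sigma
inside = [h for h in hours if low <= h <= high]

print(round(low, 2), round(high, 2))  # 1.98 5.22
print(inside)                         # [3, 5, 2, 2]
```

Only the teen at 6 hours falls outside the band. Using the sample standard deviation instead widens the band a little but gives the same list for this data.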