Sunday, October 27, 2013

Statistically Writing: Normal Distribution

I know it's been a while, but it's time to pick up again with our statistics education.

Last time, we learned about standard deviation and variance. Let's review.
Note: we're going to talk about population statistics, not sample statistics, in this post.

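Variance is denoted by sigma squared, where mu is the mean and N is the number of values in the population:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$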


You'll hopefully recall that variance is a measure of how far the values in a collection of data are spread out from the mean: it's the average of the squared distances from the mean. But it's a little awkward because of the squared element, which leads us to:

Standard Deviation, denoted by sigma:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Standard Deviation is the square root of Variance, which yields a measure of distance from the mean that's not squared, and therefore has the same units as the mean.

So if you're considering the number of cockroaches in NYC apartments, let's say the mean is 50 (probably more, but yuck). The Variance would be in units of cockroaches squared, while the Standard Deviation would be in units of cockroaches.
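
To make the units point concrete, here is a minimal sketch using Python's built-in statistics module. The five apartment counts are made up for illustration; only the mean of 50 comes from the example above.

```python
import statistics

# Hypothetical cockroach counts for five apartments (invented for illustration).
counts = [30, 40, 50, 60, 70]

mu = statistics.mean(counts)             # population mean, in cockroaches
variance = statistics.pvariance(counts)  # population variance, in cockroaches squared
sigma = statistics.pstdev(counts)        # population standard deviation, in cockroaches

print(mu, variance, round(sigma, 2))
```

The "p" in pvariance and pstdev stands for population, which matches the note above about population versus sample.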

Normal Distribution

This graph depicts a Normal Distribution, with the center line equal to the mean:

[Figure: Normal Distribution]

And this is the function that describes the graph:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Notice that the function is described as "p of x". Normal Distribution starts with an "N". It's sometimes referred to as a bell curve, because of the shape, and bell starts with a "b". So why a "p"?

"P" is for probability.

A watermelon can weigh anywhere from a few pounds to 20 pounds, but say that, on average, watermelons weigh 10 pounds. If you look at a graph of the weights of watermelons, it will look like the graph above, and the center line will be 10 pounds, the mean. In other words, the vast majority of watermelons will weigh 10 pounds, plus or minus maybe 5 pounds. And a few crazy melons will weigh 20 pounds, or 1 pound.

So that graph is the likelihood, or probability, that a given watermelon will have a certain weight. If I go to Whole Foods and pick up an average sized watermelon, the chance, or probability that it weighs around 10 pounds is very high.

However, the chance that it weighs exactly 10 pounds is not high. In fact, it's zero. Why? Because it's impossible to be that precise. It could weigh 10.00000001 pounds, for example.

The way you need to think about it is, what is the probability that a given melon will weigh between 9.5 pounds and 11.5 pounds? A range. And the probability is given by the area under the curve in that range, which, if you recall your calculus, is the integral of p(x) from 9.5 to 11.5.

The fact that the probability is the area under the curve helps clarify why the chance that a watermelon weighs exactly 10 pounds is zero. Because the area under the curve at 10 pounds, or at any individual point, is the area of a line, which is zero. A line has no width.
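
Here is a rough sketch of that area calculation in Python. The mean of 10 pounds is from the example, but the standard deviation of 2.5 pounds is my assumption, since the example never gives one. The cdf method returns the area under the curve to the left of a point, so the area over a range is the difference of two of them.

```python
from statistics import NormalDist

# Mean of 10 pounds is from the text; the 2.5-pound standard deviation is an assumption.
melons = NormalDist(mu=10, sigma=2.5)

def prob_between(dist, low, high):
    """Area under the curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

p_range = prob_between(melons, 9.5, 11.5)  # a range has some width, so some area
p_exact = prob_between(melons, 10, 10)     # a single point has no width

print(round(p_range, 3), p_exact)
```

With these assumed numbers, the range comes out to roughly a 30% chance, and the single point comes out to zero.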

A note on probability:

The probability of ANYTHING is between zero and one, which is the same as 0% and 100%. It's important to know this because the graph of a normal distribution extends to plus and minus infinity, so the total area under the graph, which is the integral of that scary looking function, p(x), from minus infinity to plus infinity, is one.
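
If you'd like to see the total area come out to one without doing any calculus, here is a small numerical sketch. The mean of 10 and standard deviation of 2.5 are assumed values, and 10 standard deviations on either side stands in for infinity, since the curve is essentially flat on the axis beyond that.

```python
import math

def p(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area(mu, sigma, low, high, steps=100_000):
    """Midpoint-rule approximation of the integral of p(x) from low to high."""
    width = (high - low) / steps
    return sum(p(low + (i + 0.5) * width, mu, sigma) for i in range(steps)) * width

# Assumed values: mean 10, s.d. 2.5; +/- 10 s.d. stands in for +/- infinity.
total = area(10, 2.5, 10 - 25, 10 + 25)
print(round(total, 6))
```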

Many real-life statistics are normally distributed (that is, they can be described by a symmetric bell-shaped curve). For example, heights of 3rd graders, or weights of watermelons. But not all statistics are normally distributed. There can be sets of data with many outliers. Or there can be sets of data that follow different distributions. But much of the data we look at in medical studies is at least approximately normally distributed, and the statistical analysis tools we're used to reading about (ANOVA, t-test, paired t-test) all assume a normal distribution.

These are some of the defining features of a normal distribution:

* Mean = Median = Mode
* Symmetry: the left side of the graph mirrors the right side of the graph, and because of this:
* 50% of the area under the graph is to the right of the mean, and 50% is to the left of the mean.
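
These three features can be checked on any particular normal curve. This is only a sketch, with an arbitrary mean of 10 and standard deviation of 2.5 chosen for illustration.

```python
from statistics import NormalDist

# Arbitrary example values: mean 10, standard deviation 2.5.
dist = NormalDist(mu=10, sigma=2.5)

# Mean = Median = Mode
same_center = dist.mean == dist.median == dist.mode

# Symmetry: the curve is the same height 3 units left and 3 units right of the mean.
mirror = abs(dist.pdf(10 - 3) - dist.pdf(10 + 3)) < 1e-12

# Half of the area lies to the left of the mean.
left_half = dist.cdf(dist.mean)

print(same_center, mirror, left_half)
```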

Let's do an example.

I rolled 2 (virtual) dice 100 times, and these are the totals:

7 4 2 2 6 7 8 7 6 5
4 7 12 7 4 8 11 7 6 6
6 4 10 10 6 7 7 11 7 2
5 4 2 8 7 12 6 12 8 4
6 4 7 3 6 5 12 5 12 6
8 7 9 7 2 7 5 6 6 6
7 11 12 12 7 6 8 9 3 5
7 7 4 10 5 7 7 6 8 9
9 7 11 8 5 8 5 6 8 7 9
6 8 11 4 11 7 5 3 6 6

The average total = 6.8.
And the standard deviation = 2.5
(I let my spreadsheet compute these for me)

Now, we probably all know that the most likely total of two dice will be a 7, but let's just check:

#2's:    5
#3's:    3
#4's:    9
#5's:    9
#6's:   19
#7's:   23
#8's:   11
#9's:    5
#10's:  3
#11's:  6
#12's:  7

Yes, there are twenty-three 7's, making 7 the most frequent result. What we're looking at here, then, are frequencies: how often an individual total was rolled. And remember, these frequencies give us estimates of probabilities. So the probability, or likelihood, of rolling a 7 is 23/100 or 23%, and the probability, or likelihood, of rolling a 4 is 9/100 or 9%. In the future, if I choose to bet on a pair of dice, based on the data above, I'd have a 9% chance of rolling a 4. (There are better and simpler ways to compute the odds on dice, BTW).
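
For anyone who'd rather not use a spreadsheet, the same numbers can be worked out from the tallies. This sketch uses only the counts listed above; the variable names are my own. The mean and standard deviation it prints come out close to the spreadsheet figures.

```python
import math

# Tallies copied from the list above: total -> number of times it was rolled.
tally = {2: 5, 3: 3, 4: 9, 5: 9, 6: 19, 7: 23, 8: 11, 9: 5, 10: 3, 11: 6, 12: 7}

n = sum(tally.values())
mean = sum(total * count for total, count in tally.items()) / n
variance = sum(count * (total - mean) ** 2 for total, count in tally.items()) / n
sd = math.sqrt(variance)          # population standard deviation
mode = max(tally, key=tally.get)  # the most frequently rolled total

# Relative frequency of each total, used as an estimate of its probability.
rel_freq = {total: count / n for total, count in tally.items()}

print(n, round(mean, 2), round(sd, 2), mode, rel_freq[7], rel_freq[4])
```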

If we graph these frequencies, we get:
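
A rough text version of this bar chart can be drawn straight from the tallies. This is just a sketch: one "#" per roll, with the count at the end of each bar.

```python
# Tallies from the list above: total -> number of times it was rolled.
tally = {2: 5, 3: 3, 4: 9, 5: 9, 6: 19, 7: 23, 8: 11, 9: 5, 10: 3, 11: 6, 12: 7}

def text_histogram(counts):
    """Return one line per total, with one '#' per time that total was rolled."""
    return [f"{total:>2} | {'#' * count} {count}" for total, count in sorted(counts.items())]

bars = text_histogram(tally)
print("\n".join(bars))
```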




Notice, this doesn't quite look like a normal distribution, even if you draw in the curve along the tops of the bars and smooth it out. There are several reasons for this.

1. There are only 100 data points. That's not bad, but the larger the sample size, the more the graph will look like a perfect bell curve, and we're not quite there with this one.

2. It's not really a normal distribution. The result of a dice roll is an integer between 2 and 12. These are discrete results (discrete like individual, not discreet like secret). And for a normal distribution, you really need continuous results, like the weight of a watermelon. But since I didn't want to buy and measure 100 watermelons, this will have to do. And it does approximate what we're talking about. But how well?

Let's check it against our requirements for a normal distribution:

1. The mean = 6.8. The mode would be 7, because that's the most common result. And the median is also 7, the middle value. So the mean is off a little. If I rolled the dice 1000 times, the mean would move closer to 7, and if I rolled the dice infinitely many times, the mean would be exactly 7. But I don't have that kind of free time, so this will have to do.

2. Symmetry: Are the left and right sides mirror images of each other? Not really, but if we rolled infinitely many times, they would be.

3. Are 50% of the values to the right of the mean, and 50% to the left? No.

But overall it's not a terrible approximation of a normal distribution.
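
The "infinitely many rolls" claims can be checked without the free time. The sketch below lists all 36 equally likely pairs of faces to get the exact odds, then simulates bigger and bigger batches of rolls. The seed and the batch sizes are arbitrary choices of mine.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product

# Exact distribution: each of the 36 (die1, die2) pairs is equally likely.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
exact = {total: Fraction(count, 36) for total, count in ways.items()}
exact_mean = sum(total * prob for total, prob in exact.items())

# Symmetry about 7: a total t is exactly as likely as 14 - t.
symmetric = all(exact[t] == exact[14 - t] for t in exact)

def mean_of_rolls(n_rolls, rng):
    """Average total over n_rolls simulated rolls of two dice."""
    return sum(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

rng = random.Random(2013)  # arbitrary seed so runs are repeatable
sample_means = {n: mean_of_rolls(n, rng) for n in (100, 10_000, 100_000)}

print(exact[7], exact[4], exact_mean, symmetric)
print(sample_means)
```

The exact chance of a 7 is 1/6 and of a 4 is 1/12, and the exact mean is 7. The simulated means should settle closer to 7 as the number of rolls grows.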

Let's consider another property of normal distributions. The standard deviation is 2.5, and the mean is 6.8. This implies that anything that falls between 4.3 and 9.3 (6.8 +/- 2.5) is within 1 standard deviation of the mean. If we count up the data, there are 9 fives, 19 sixes, 23 sevens, 11 eights, and 5 nines, and five, six, seven, eight, and nine all fall between 4.3 and 9.3. So there are 67 data points within one standard deviation of the mean, or 67% of all data is within one s.d. of the mean.
Two standard deviations would be between 1.8 and 11.8, and all but 7 data points are in that range (everything but the twelves, of which there are 7). So 93% of the data lies within two standard deviations of the mean.

These figures approximate something called The Empirical Rule, which is another property of a normal distribution, and which states that:

68% of the data lies within 1 s.d. of the mean
95% of the data lies within 2 s.d.'s of the mean, and
99.7% of the data lies within 3 s.d.'s of the mean

The empirical rule is useful because the integral of that nasty looking p(x) that represents the normal distribution has no simple closed-form solution, so it's generally determined numerically, using something called the cumulative distribution function, which we won't get into, but which tells us the area under the curve to the left of any given point. But in the absence of the ability to solve that integral, or access to the cumulative distribution function, we can learn a lot from the empirical rule, because we know the whole 68-95-99.7 thing.

And the values we computed above, 67% for 1 s.d. from the mean, and 93% for 2 s.d.'s from the mean, are pretty close, especially considering the limitations of our example (not enough data points, not really a normal distribution).
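
The counting can also be done mechanically. This sketch takes the tallies from earlier, along with the mean of 6.8 and standard deviation of 2.5 quoted in the post, and works out the share of rolls within 1, 2, and 3 standard deviations. The function name is my own.

```python
# Tallies from earlier in the post: total -> number of times it was rolled.
tally = {2: 5, 3: 3, 4: 9, 5: 9, 6: 19, 7: 23, 8: 11, 9: 5, 10: 3, 11: 6, 12: 7}
mean, sd = 6.8, 2.5  # figures quoted in the post

def share_within(counts, center, spread, k):
    """Fraction of rolls whose total lies within k * spread of center."""
    n = sum(counts.values())
    inside = sum(c for total, c in counts.items() if abs(total - center) <= k * spread)
    return inside / n

within_1 = share_within(tally, mean, sd, 1)
within_2 = share_within(tally, mean, sd, 2)
within_3 = share_within(tally, mean, sd, 3)

print(within_1, within_2, within_3)
```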

Pictorially, it looks like this:

I'll wrap up now, but keep in mind, a real normal distribution goes on forever, so there will always be points way, way to the right and left of the mean. But even so, the vast majority of the data lies within 3 s.d.'s of the mean. And importantly, about 95% of the data is within 2 s.d.'s, so that only about 5% of the data lies outside 2 s.d.'s, meaning that the probability, p, that a data point lies further than 2 s.d.'s from the mean, in either direction, is < 0.05.
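
If you do have access to a cumulative distribution function, the 68-95-99.7 figures are easy to reproduce. This is a sketch using the one built into Python's statistics module, applied to a standard normal curve (mean 0, standard deviation 1). The percentages are the same for any mean and standard deviation.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def within(k):
    """Area under the curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

coverage = {k: within(k) for k in (1, 2, 3)}
tail_beyond_2 = 1 - coverage[2]  # area further than 2 s.d.'s out, both sides combined

print({k: round(v, 4) for k, v in coverage.items()}, round(tail_beyond_2, 4))
```

The area outside 2 s.d.'s comes to about 0.0455, which is where the p < 0.05 at the end of the post comes from.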