
Wednesday, July 10, 2013

Statistically Writing

I took a Statistics class my sophomore year of college. Got an A, didn't learn anything. I subsequently learned some probability and combinatorics, which I find immensely useful (not being facetious), but I still don't know any stats. And it seems to me that if I'm going to try to read papers intelligently, I should know, really know, what ANOVA is, and how to compute number needed to treat, and all that other jazz.

Since this is the kind of information that is useful to most or all clinicians, I thought I'd break the topics up into individual posts, and share my understanding, or lack thereof. Now, most of you probably already know all of it. You haven't forgotten anything you learned about statistics in medical school, and you read through the minutiae of the statistical analyses in all studies you peruse. So you probably don't need it. But for the minority who don't remember so well, here's a refresher.

And please let me know if this is overly simplistic. Maybe I'm the only one, but honestly, I didn't really get the implications of this stuff until I wrote this post.

Let's start with the basic basics, measures of central tendency. These are the mean, median, and mode. The definitions are pretty simple. The mean is the average, the median is the middle value, and the mode is the most common value. But the important question, when it comes to understanding the results of a study, for instance, is why would you use one rather than the other?

For the record, what's referred to as the "mean" is generally the "sample mean", i.e. the average value of all the data points in a given sample. This stands in contradistinction to the "population mean", the average value of all the data points in an entire population.

Say you wanted to know what percentage of redheads in the US are left-handed. One way to determine this would be to find every last redhead in the country, and count the number of lefties. In this case, you'd be looking at the entire population, which, practically speaking, is impossible to do on a limited grant. So instead, you'd pick a sample of redheads, maybe all the redheads in your town who were willing to sign up for the study. This is a more do-able project, and you hope that the sample you're looking at is representative of the entire population. But if for some reason your town had an unusually high number of lefties, then the sample would not be representative of the general population in the US.

If you think about it this way, you can see how a perfectly conducted study can draw erroneous conclusions, because it can't look at the entire population, just a sample of it.
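Here's a rough way to see this for yourself in Python. Everything in it is made up for illustration: the population size, the 10% rate of left-handedness, and the five "towns" of 50 redheads each. It's a sketch of the idea, not real data, and it assumes each town is a purely random sample, which real towns are not.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 redheads, exactly 10% left-handed.
# True = left-handed, False = right-handed.
population = [True] * 10_000 + [False] * 90_000
population_rate = sum(population) / len(population)

# Five hypothetical "towns", each a random sample of 50 redheads.
sample_rates = []
for town in range(5):
    sample = random.sample(population, 50)
    sample_rate = sum(sample) / len(sample)
    sample_rates.append(sample_rate)
    print(f"town {town}: {sample_rate:.0%} left-handed "
          f"(population: {population_rate:.0%})")
```

Each town was sampled perfectly fairly, and the percentages can still land above or below the population figure. That's the gap between a sample and a population.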

The function of any measure of central tendency is to give you a handle on a collection of data points, a sense of what the data is "telling" you. But it's important to note that there is no best measure of central tendency. The one we see the most is the mean, but it has its limitations.

The mean is useful for including all data points, even in very large sets. And it's easy to incorporate new data.
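To make "easy to incorporate new data" concrete: if you know the current mean and how many values went into it, you can fold in a new value without going back to the original data. A minimal sketch in Python (the function name `update_mean` and the numbers are mine, just for illustration):

```python
def update_mean(old_mean, count, new_value):
    """Return the mean after adding one value to `count` existing values."""
    return old_mean + (new_value - old_mean) / (count + 1)

# Three values, 2, 4 and 6, have a mean of 4.
mean_of_three = (2 + 4 + 6) / 3

# Add a fourth value, 8, without re-adding the first three.
mean_of_four = update_mean(mean_of_three, 3, 8)
print(mean_of_four)  # 5.0, same as (2 + 4 + 6 + 8) / 4
```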

Where it starts to falter is with outliers. Suppose you want to know the typical number of marbles owned by each of five children. And suppose the numbers are as follows:

3; 7; 4; 5; 100

The mean value here is 23.8. But it would be misleading to say that on average, each child has 24 marbles. This is where the median is useful. If you put the numbers in order:

3; 4; 5; 7; 100,

you can see that the median is 5, which is much closer to the number of marbles owned by most children in this group.
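If you'd like to check the arithmetic, here is a short Python sketch using the standard `statistics` module. The second pair of numbers, with the 100-marble child left out, is my own addition to show how much that one child moves the mean.

```python
from statistics import mean, median

marbles = [3, 7, 4, 5, 100]

print(mean(marbles))    # 23.8
print(median(marbles))  # 5

# Drop the child with 100 marbles and compare.
without_outlier = [3, 7, 4, 5]
print(mean(without_outlier))    # 4.75
print(median(without_outlier))  # 4.5
```

Removing one child drops the mean from 23.8 to 4.75, while the median barely moves, from 5 to 4.5.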

This is something to keep in mind when reading a study. If a new antidepressant, Happyzac, caused massive improvement in 2 out of 30 subjects, but poor to moderate improvement in the other 28 subjects, the mean improvement would be pulled up by those 2 subjects, and could make the drug look better for the typical patient than it really is.

On the other hand, the median can be cumbersome to use with a very large data set, since the values have to be put in order before you can find the middle one.

Also, if some data points are very close together, and others spread out, the middle number may not be the most useful way to think about the data set. Consider the following sequence:

1; 2; 3; 30; 70; 200; 554

The median here is 30, which really doesn't tell you much about the nature of this set. Unlike the example above, there are no outliers, just one small cluster, and a bunch of other numbers all over the place.
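A quick Python sketch of the same sequence. The list of gaps between neighboring values is my own addition, as one simple way to see how unevenly the numbers are spread.

```python
from statistics import median

values = [1, 2, 3, 30, 70, 200, 554]

print(median(values))  # 30

# Gap between each value and the next one in the ordered list.
gaps = [b - a for a, b in zip(values, values[1:])]
print(gaps)  # [1, 1, 27, 40, 130, 354]
```

The median sits 27 away from its nearest neighbor below and 40 from its nearest neighbor above, so it doesn't really describe any of the other numbers.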

The mode is good for categories, particularly non-numerical ones. Say you wanted to find out the most common hair color of lefties. The mean isn't useful because how do you average hair color? And the median isn't useful because how do you put hair color in order, so you can determine the middle value?

The problem with the mode is that it can be very far from the middle value. Also, there can be more than one mode, e.g. if it turned out that there are the same number of blonde lefties as red-headed lefties.
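Here is a small Python sketch of a tie for the mode. The list of hair colors is invented for illustration, and it assumes Python 3.8 or later, which is when `statistics.multimode` was added.

```python
from statistics import multimode

# Hypothetical hair colors of six left-handed people.
lefty_hair = ["blonde", "red", "brown", "red", "blonde", "black"]

# multimode returns every value that ties for most common.
print(multimode(lefty_hair))  # ['blonde', 'red']
```

Notice that nothing here requires averaging or ordering the colors, only counting them.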

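To summarize:

The mean includes every data point and is easy to update with new data, but outliers can distort it.

The median isn't thrown off by outliers, but the data has to be put in order first, and it may not tell you much when some values are clustered and others are spread out.

The mode works for categories, including non-numerical ones, but it can be far from the middle value, and there can be more than one.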

Click HERE to read the next Statistics post on Variance and Standard Deviation.