# Testing for Group Differences

Very often, we want to know whether there is a noticeable or important difference between two groups. These groups could be a treatment and a control group in a medical or psychological experiment, different demographic groups in social research, different countries or states in political or economic research, separate populations of animals in zoological research, or various experimental setups in chemical or physical research. There are various statistical methods for making such comparisons, the details of which can be rather technical, but the basic approach is generally the same. Essentially, it involves taking the average value for each group (for numerical data) or the proportion of the sample making a particular response (for categorical data), and then comparing these values between groups to see whether the difference between them is ‘big enough’ to regard as meaningful. If the difference is small, it is judged to be likely due only to chance (what is called sampling error), and thus not of any significance. On the other hand, if the difference is sufficiently large, it is judged that such a difference is unlikely to be due to chance, and therefore probably reflects a real difference between the groups.

Consider an example. Suppose we wish to know whether males and females (in some defined age group) are typically the same weight, or whether males are typically heavier than females, or vice versa. We collect our data and find that in our sample, the mean weight of males is 85kg while the mean weight of females is 75kg. We cannot now simply state that we have found that males are on average 10kg heavier than females, because we have not measured the entire population, only a small sample from it. If we could weigh everyone in the population, then it would be easy to say whether the 10kg difference reflects a real tendency for males to weigh more than females on average. However, since we only weighed a small number of people, there is some probability that we happened to select heavier males and lighter females by chance, and that our result of a 10kg difference is therefore due to luck rather than to a real difference between the groups.
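This sampling variability is easy to see in a quick simulation. The sketch below uses made-up population parameters (male weights averaging 85kg, female weights 75kg, with assumed standard deviations) and draws several small samples from each hypothetical population: each repetition gives a somewhat different estimate of the same underlying 10kg gap, purely by chance.

```python
import random
import statistics

random.seed(42)

# Hypothetical populations (assumptions for illustration only):
# male weights ~ Normal(mean=85, sd=12), female weights ~ Normal(mean=75, sd=10)
def sample_mean(pop_mean, pop_sd, n):
    """Draw a random sample of size n from the population and return its mean."""
    return statistics.mean(random.gauss(pop_mean, pop_sd) for _ in range(n))

# Repeat the 'survey' five times: the estimated male-female gap varies
# from sample to sample, even though the populations never change.
diffs = [sample_mean(85, 12, 30) - sample_mean(75, 10, 30) for _ in range(5)]
print([round(d, 1) for d in diffs])
```

Each printed value is an estimate of the same true 10kg difference; the scatter among them is sampling error in action.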

So how can we determine whether this 10kg difference is large enough to count as meaningful? Generally, this is done by examining the standard deviation of each group. The key idea is to compare the size of the standard deviation to the size of the difference between the two sample means. If the standard deviation is large compared to the between-group difference, it means that the natural variability of weights amongst males and amongst females (as measured by the standard deviation) is high compared to the average measured difference between males and females. In that case, the 10kg difference is more likely to be due to chance than if the natural variability within each gender were smaller, since there is more scope within each group for us to choose individuals with very high or very low weights. Imagine, for instance, a case in which all males were exactly the same weight. The standard deviation of male weights would then be zero (since there is no deviation at all), and so we could be sure that we didn’t happen to pick a group of heavier males just by chance, since there are no heavier males. Now imagine a case in which male weights vary greatly, with some males very heavy and others very light. Here it is fairly likely that we selected a heavier-than-average sample of males simply by chance, and thus got our 10kg weight difference by luck rather than owing to any actual difference between males and females. This is the fundamental basis of most statistical comparisons: if the difference between groups is large compared to the variability within each group, then the difference is more likely to be ‘real’, and not simply due to chance.
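This between-versus-within comparison is exactly what a two-sample t statistic captures. The sketch below uses invented weight data and only Python's standard library; the `t_statistic` helper is our own illustrative implementation of Welch's formula, not part of any library.

```python
import math
import statistics

# Illustrative (made-up) weight samples in kg; not real survey data.
males = [92, 81, 79, 88, 95, 84, 77, 90, 86, 83]
females = [74, 70, 78, 72, 69, 80, 73, 76, 71, 77]

def t_statistic(a, b):
    """Welch's t: the between-group difference in means, divided by the
    combined within-group variability (scaled down by sample size)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / se

t = t_statistic(males, females)
print(round(t, 2))  # a large |t| means the gap is unlikely to be chance alone
```

With these (invented) numbers the between-group gap is several times the within-group noise, so the t value is large; if the two lists overlapped heavily, t would be close to zero.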

Another important connection to understand is that between sample size and statistical reliability. Other things being equal, the larger the sample size of a particular survey or experiment (that is, the larger the number of people or things measured), the more confident we can be that the results derived from the data are ‘real’, and not simply the result of chance. This is because, over the long run, unlikely outcomes in one direction typically cancel out unlikely outcomes in the other direction, yielding an average that is more representative of the whole. This is why, for example, it is much more likely that one will flip three heads in a row (100% heads, three heads from three tosses) than thirty heads in a row (also 100% heads, but this time thirty from thirty tosses). If a coin really did come up heads 30 times in a row, it would be reasonable to conclude that it was not a fair coin, since such an outcome is otherwise extremely unlikely (about one in a billion). Getting three heads in a row, however, is not particularly unlikely (one in eight), and so inferring that a coin is biased on the basis of only three heads in a row would not be a valid inference. This is the essential reason why large sample sizes are preferred in surveys and experiments: they allow more robust conclusions to be drawn from the results.
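The coin-flip figures quoted above are easy to verify directly, since the probability of an all-heads run from a fair coin is simply 0.5 raised to the number of tosses:

```python
# Probability of an all-heads run from a fair coin: 0.5 ** n
p3 = 0.5 ** 3    # three heads in a row
p30 = 0.5 ** 30  # thirty heads in a row

print(p3)              # 0.125, i.e. one in eight
print(round(1 / p30))  # roughly one in a billion (2**30 = 1,073,741,824)
```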

Understanding statistical tests for differences between groups is also important for interpreting much of the statistical information presented in the media. A very common example is opinion polls, even the best of which typically have margins of error of 2-3%, owing to the sample sizes used and the degree of variance in the responses given. This means that very small differences in the proportions expressing one opinion over another, a 49% to 51% split in a political campaign, for example, are often not very meaningful, and should generally not be used as the basis for strong conclusions about the distribution of popular opinion on the subject in question, since the difference is well within typical sampling error.
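As a rough check on those poll figures, the standard 95% margin of error for a sample proportion is about 1.96 × √(p(1−p)/n). The sketch below assumes a typical poll of 1,000 respondents (an assumed figure, not one from this text) and the worst-case proportion p = 0.5:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(0.5, 1000)
print(f"{moe:.1%}")  # about 3%, so a 49% vs 51% split is inside the noise
```

This is why a two-point gap in a 1,000-person poll cannot distinguish a genuine lead from sampling error.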