|Instructor:||Dr Nicola Hodges|
Null Hypothesis Significance Testing (NHST)
Introduction to NHST
NHST is a statistical method that originated in the early part of the 20th century. Developed by the statisticians Neyman, Pearson, and Fisher (Franks & Huck, 1986), it was intended to protect against researcher bias (researchers naturally tend to want to make as much of their results as possible). While this attempt at preventing bias was well intentioned, and NHST is still a valuable tool when used properly, it is important to consider how much has changed since its advent. Statistical software has essentially eliminated the practice of computing statistics by hand. Also, whereas research done years ago was largely exploratory in nature, the majority of research conducted today is very specific and builds on existing research. These changes, among others, have led to some very prevalent misinterpretations of the scope of NHST (discussed below). It is also important to consider that while NHST was developed as a tool to guard against one of many threats to validity in research, it has become a 'gold standard' for publication and unfortunately is given much more consideration than other potential threats to validity.
What NHST is:
NHST is a statistical method that tells us about the statistical 'rareness' of our findings (Carver, 1978). The use of NHST requires a research hypothesis (a statement of difference, for example: the intervention will result in higher scores for our treatment group than for our control group) and a null hypothesis (a statement of equality, for example: there will be no difference between the treatment and control groups as a result of the intervention). It is important to keep in mind that the null hypothesis comes with some basic assumptions (Hunter, 1997):
- We assume that the sample we obtained comes from the population we hope to generalize to.
- We assume that any differences we do find are due to sampling error (because, under the null hypothesis, the treatment has no effect).
Remember that NHST stands for null hypothesis significance testing, which means that when we run these tests we are assuming the null hypothesis is true (Hunter, 1997). Given this, what we are actually testing with NHST is the probability that we would have found a result (treatment effect) at least as large as ours, given that the groups were actually equal on whatever it is we are testing for. NHST tells us how rare it is to find what we found in a population that is uniform with respect to our variable of interest.
The result of an NHST is a p value, a number ranging from 0 to 1. The most commonly used cutoff (significance level) is .05. A p value of .05 allows us to say that, given that the null hypothesis is true, the probability of finding a result at least as extreme as ours is 5% (or 1 out of 20 times). When the researcher finds a p value they feel is small enough, they are able to reject the null hypothesis as the best explanation for their findings and say that it is possible that the result was due to the treatment. This is the extent of what NHST allows us to say. Unfortunately, poor understanding of the theory behind NHST has led to it being used to make claims far beyond this. These misinterpretations are published in papers, written in textbooks, and taught in classrooms all too often.
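To make the definition concrete, here is a minimal sketch of one way to estimate a p value: a permutation test. The scores, the function name `permutation_p_value`, and the choice of a one-sided test are all assumptions made for illustration; they do not come from the readings. The idea is that if the null hypothesis is true, the group labels are interchangeable, so we can shuffle them many times and count how often chance alone produces a difference at least as large as the one we observed.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_shuffles=10_000, seed=0):
    """Estimate a one-sided p value for 'treatment mean > control mean'.

    Under the null hypothesis the group labels carry no information,
    so we reshuffle them and see how often a difference at least as
    large as the observed one arises from sampling error alone.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical scores, invented purely for illustration.
treatment_scores = [78, 85, 90, 72, 88, 95, 81, 84]
control_scores = [70, 75, 80, 68, 74, 79, 72, 77]

p = permutation_p_value(treatment_scores, control_scores)
print(f"Estimated p value: {p:.4f}")
```

Note what the number describes: how rare the observed difference is when the null hypothesis is assumed true. It says nothing directly about how likely the treatment is to have worked.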
What NHST isn't:
- NHST does not allow us to say anything about the likelihood that our treatment was effective. This misinterpretation is very common, and looks like this: "I found a p value of .01, so my treatment was effective and there is only a 1% possibility that what I found was due to chance."
- NHST does not allow us to say anything about the likelihood that our study could be replicated. This is also a common misinterpretation, and looks like this: "My p value was .05, so there is a 95% chance that my results could be replicated."
- NHST does not allow us to say anything about the validity of our results. Authors of papers will often report a result as "very significant" because they found a p value of .001. This is inaccurate. Based on the p value found and the cutoff chosen, a result is either statistically significant or it isn't. All a smaller p value means is that it is less likely that you would find a result as extreme as yours if the null hypothesis were true.
* For a more detailed explanation of these misconceptions, see Carver (1978).
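A short piece of arithmetic can show why a p value is not the probability that a finding is due to chance. The sketch below is illustrative only: the function name and the three input numbers (the share of tested treatments that truly work, the significance cutoff, and the statistical power) are assumptions chosen for the example, not figures from the readings.

```python
def share_of_significant_results_that_are_false(prior_true_effect, alpha, power):
    """Among studies that reach significance, what fraction tested a
    treatment that actually has no effect?

    prior_true_effect: assumed share of tested treatments that really work
    alpha: significance cutoff (chance of significance when the null is true)
    power: chance of significance when the treatment really works
    """
    false_positives = (1 - prior_true_effect) * alpha
    true_positives = prior_true_effect * power
    return false_positives / (false_positives + true_positives)


# Assumed values for illustration: 10% of treatments work,
# cutoff of .05, and 80% power.
share = share_of_significant_results_that_are_false(
    prior_true_effect=0.10, alpha=0.05, power=0.80
)
print(f"Share of significant results with a true null: {share:.2f}")
```

Under these assumed inputs, far more than 5% of the significant results come from treatments with no effect, even though every one of them has p below .05. The exact figure depends entirely on the assumed inputs, which is the point: the p value alone cannot supply it.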
What's the Problem With These Misinterpretations?
P values are appealing. They allow us to pick through our results relatively quickly and easily (thanks to statistical software packages like SAS and SPSS) and say, "There's a pretty good chance that I'm onto something here." What researcher wouldn't find that appealing? Unfortunately, the appeal of NHST has led to it being used and abused as a method of determining the overall or practical significance of results (which is very different from statistical significance). P values are very sensitive to sample size: the larger the sample, the easier it is to get a statistically significant result (Carver, 1978). But who's to say results from studies with small sample sizes aren't worth consideration?
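The sensitivity of p values to sample size can be sketched with a few lines of arithmetic. The example below assumes a simple two-group z test (normal approximation, equal group sizes, known standard deviation) and a small standardized difference of 0.2; the function name and all of the numbers are assumptions for illustration. The size of the effect is held constant while only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_for_fixed_effect(standardized_diff, n_per_group):
    """Two-sided p value from a two-group z test (normal approximation).

    standardized_diff: difference between group means in SD units
    n_per_group: number of participants in each group
    """
    z = standardized_diff * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same assumed effect (0.2 SD) at three hypothetical sample sizes.
p_by_n = {n: two_sided_p_for_fixed_effect(0.2, n) for n in (10, 100, 1000)}
for n, p_value in p_by_n.items():
    print(f"n per group = {n:4d}  p = {p_value:.6f}")
```

The effect is identical in every row, yet only the largest sample crosses the .05 cutoff. The p value is therefore reporting on the sample size as much as on the treatment.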
- Interpreting the p value as the likelihood that the treatment was effective leads to overconfidence in study results (Dracup, 1995). Depending on what that treatment is, false claims about its efficacy could be harmful to others, and could also mean more effective treatments don't get discovered.
- The focus of academia as a whole on statistical significance means it is more difficult to publish studies that don't find statistically significant results. If theories are being formed based only on studies that have found statistically significant results, aren't we missing the other side of the story? If we considered all the unpublished studies that found treatments to be ineffective, would it change the way we view existing theories? (Rosenthal, 1979)
- Some researchers use NHST to make black-and-white decisions regarding the truth of their research hypothesis. If they find a statistically significant result, they reject the null hypothesis and automatically claim support for their research hypothesis. This way of interpreting results often fails to consider what the size of the findings actually means relative to the research question.
- Finally, NHST was developed to ensure that the interpretation of research results is an objective process; it was developed to guard against researcher bias. However, the reality is that much of science today is applied to fields where it is impossible to be completely objective. The social sciences require human interpretation of non-tangible things, which inherently involves some level of subjectivity. Why are we so obsessed with this method of 'objectivity' when we now accept the fact that science involves some level of subjectivity (Carver, 1978)? After all, researchers make decisions throughout the whole research process. Why limit the value of their interpretation of results with NHST?
What Can We Do?
- Report exact p values - some researchers simply publish "this effect was non-significant" or "p = ns". It would benefit readers to know whether p = .06 or p = .99, because there is a big difference between the two (Carver, 1978).
- Report effect sizes - understanding the size of effects allows us to interpret practical significance rather than just statistical significance (Carver, 1978).
- Make more specific hypotheses - this will encourage consideration of means and other descriptive statistics as opposed to just looking at p values (Dracup, 1995).
- Be critical of results as a reader - understand that statistical significance is not the be-all and end-all.
- Consider publishing confidence intervals as a part of your research - these allow for visual interpretation of results and are an extremely valuable (but underused) method of reporting results (Belia, Fidler, Williams, & Cumming, 2005).
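As a rough sketch of what reporting an effect size and a confidence interval might look like, the example below computes Cohen's d (one common standardized effect size, chosen here as an assumption since the readings do not prescribe a particular one) and an approximate 95% confidence interval for the difference between two group means. The scores are hypothetical, and the interval uses a normal approximation; with samples this small, a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical scores, invented purely for illustration.
treatment_scores = [78, 85, 90, 72, 88, 95, 81, 84]
control_scores = [70, 75, 80, 68, 74, 79, 72, 77]

n_t, n_c = len(treatment_scores), len(control_scores)
var_t, var_c = variance(treatment_scores), variance(control_scores)
mean_diff = mean(treatment_scores) - mean(control_scores)

# Cohen's d: mean difference expressed in pooled standard deviation units.
pooled_sd = sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
cohens_d = mean_diff / pooled_sd

# Approximate 95% confidence interval for the mean difference
# (normal approximation; an assumption made to keep the sketch short).
z_crit = NormalDist().inv_cdf(0.975)
std_error = sqrt(var_t / n_t + var_c / n_c)
ci_low = mean_diff - z_crit * std_error
ci_high = mean_diff + z_crit * std_error

print(f"Mean difference: {mean_diff:.2f}")
print(f"Cohen's d: {cohens_d:.2f}")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Reported together, these numbers tell a reader how large the difference is and how precisely it was estimated, which a p value alone does not.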
References
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389-396.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399.
Dracup, C. (1995). Hypothesis testing: What it really is. The Psychologist, 359-362.
Franks, B. D., & Huck, S. W. (1986). Why does everyone use the .05 significance level? Research Quarterly for Exercise and Sport, 57, 245-249.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638-641.