Statistical significance
Declaring statistical significance is the closest that statistics comes to proving a result. Informally, it's the term used to say that the null hypothesis is probably not true. More formally, it is the measure of how willing an experimenter is to erroneously reject the null hypothesis. Statistical significance is declared when the probability of observing data at least as extreme as the data actually collected, under the assumption that the null hypothesis is true (the p-value), falls below some arbitrary threshold, represented by a lower-case alpha (α). A p-value below alpha suggests that the data are not consistent with a true null hypothesis. Conventionally, the null hypothesis is then rejected, statistical significance declared, and parties commenced.
The word "significant", in this sense, does not mean "large" or "important" as it does in the everyday use of the word. It just means an effect is large enough to seem unlikely to have occurred solely by chance. Statistically significant effects can, in fact, be very small. Typically, larger sample sizes are required to demonstrate significance of smaller effects.
Basics of statistical significance
- Start with data collected especially for the question at hand
- Clearly formulate a null and alternative hypothesis
- Establish an alpha level appropriate for the study
- Compute a test statistic and its corresponding probability under the null hypothesis (or use a computer to do it)
- Compare the p-value to the alpha level and
  - Reject the null if the p-value is lower
  - Don't reject the null if the p-value is higher
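The steps above can be sketched in a few lines of Python. This is only an illustration, not a recommended analysis: it assumes a two-sided one-sample z-test with a known population standard deviation, and the measurements, the null mean of 100 and the standard deviation of 15 are all made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, null_mean, known_sd, alpha=0.05):
    """Two-sided one-sample z-test; returns (z, p_value, reject_null)."""
    n = len(sample)
    # Test statistic: distance of the sample mean from the null mean,
    # measured in standard errors.
    z = (mean(sample) - null_mean) / (known_sd / sqrt(n))
    # Probability, under the null, of a statistic at least this extreme.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical data. H0: population mean is 100. H1: it is not 100.
sample = [112, 104, 118, 99, 121, 108, 115, 103, 110, 117]
z, p, reject = z_test(sample, null_mean=100, known_sd=15, alpha=0.05)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Note that alpha is passed in as an argument fixed before the data are looked at; the p-value is the only quantity that depends on the sample.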
In proper hypothesis testing, alpha is determined prior to collecting data. Selecting the proper alpha should be based on thoughtful contemplation of the risks of drawing the wrong conclusion, but it is commonly set at either 0.05 or 0.01. There is a trade-off between significance and statistical power (the probability that the null hypothesis is rejected given that it's false). A low alpha value means that a rejection of the null hypothesis is less likely to be in error, but it also decreases the chance of obtaining such a rejection. Increasing the sample size can increase power without raising alpha.
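The trade-off between alpha and power can be made concrete with a small calculation. The sketch below assumes a one-sided z-test with known variance and a true effect expressed in standard-deviation units; the function name and the effect size of 0.3 are illustrative choices, not taken from any particular study.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha):
    """Power of a one-sided z-test for a true standardized effect size."""
    std_normal = NormalDist()
    # Rejection threshold for the test statistic at this alpha.
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the statistic is shifted by effect_size * sqrt(n).
    return 1 - std_normal.cdf(critical_z - effect_size * sqrt(n))


for alpha in (0.05, 0.01):
    for n in (20, 80):
        power = power_one_sided_z(0.3, n, alpha)
        print(f"alpha={alpha}, n={n}: power={power:.2f}")
```

Under these assumptions, lowering alpha at a fixed sample size lowers power, while a larger sample raises power at the same alpha.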
In more common statistical approaches (i.e. "frequentist"), statistical significance emerges from the results of hypothesis testing. An alternative hypothesis (that there is an effect) is favoured — and a null hypothesis (that there is not an effect) is rejected — if the experimental evidence shows a significant difference from the null hypothesis. If a significant difference is not present, the null hypothesis is not rejected.
To be clear, statistical significance testing does not prove either hypothesis. Rejecting the null just suggests that the evidence disfavors the null enough for us to jump into the arms of the welcoming alternative hypothesis. Not rejecting the null hypothesis says that either the null hypothesis is probably true or there is not enough evidence to reject it; it does not prove the null hypothesis. Statistical significance is just a way to make a statement about the strength of the evidence.
Alpha value versus p-value
Hypothesis testing consists of formulating a null hypothesis and alternative hypothesis, choosing an alpha value, determining the rejection region, collecting data, calculating a statistic, and evaluating whether the statistic falls in the rejection region. There are four possible results of a hypothesis test: the null hypothesis is true and retained, the null hypothesis is false and rejected, the null hypothesis is true and rejected, and the null hypothesis is false and retained. If the null hypothesis is true but is rejected, that is a Type I error. If the null hypothesis is false but retained, that is a Type II error. The probability of a Type I error when the null hypothesis is true is, by definition, equal to the alpha value. The probability of a Type II error generally cannot be calculated unless a specific effect size is assumed, as the alternative hypothesis does not otherwise incorporate a known distribution.
If the possible results of an experiment can be ordered from "most likely" (given the null hypothesis) to "least likely", then the actual results can be assigned a value equal to the probability of those results plus the probability of all "less likely" results. This probability is known as the "p-value". If the p-value is less than the alpha value, the null hypothesis is rejected.
The significance of the test is determined by the alpha value, which is unaffected by the test results. The only effect the p-value has is that it is either less than the alpha value, in which case the null hypothesis is rejected, or more than the alpha value, in which case the null hypothesis is retained. A result does not become "more" statistically significant if the p-value is "a lot smaller" than the alpha value, as opposed to being simply "slightly smaller".
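The "probability of the observed result plus all less likely results" definition can be computed directly when the outcomes are discrete. As a sketch, assume the experiment is a series of coin flips and the null hypothesis is that the coin is fair; the function name and the example of 9 heads in 10 flips are illustrative.

```python
from math import comb


def binomial_p_value(successes, trials, null_prob=0.5):
    """Exact p-value: total probability, under the null, of every outcome
    that is no more likely than the one actually observed."""
    probs = [
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    # Small tolerance so equally likely outcomes are not lost to rounding.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


alpha = 0.05  # chosen before looking at the data
p = binomial_p_value(successes=9, trials=10)
print(f"p = {p:.4f}, reject H0: {p < alpha}")
```

The decision depends only on whether p falls below alpha. A much smaller p-value leads to exactly the same decision: reject.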
Abuse
An abuse of statistics occurs when journalists or certain agenda pushers ignore the concept of significance entirely, leading to false information being given out to people. In 2005, a report commissioned by the UK government concluded that there had been "no significant increase in drug use in UK schools". Not content with the conclusion that "things aren't that bad, actually", a few newspapers jumped on the report and decided to draw their own conclusions. In their frankly amateurish search for something to data mine (post-designation), they noticed that cocaine use in schools went from 1% to 2%. These figures had been rounded off for the summary; the actual values were 1.4% and 1.9%, a relative increase of roughly 36% rather than 100%. They had their smoking gun: despite what the government concluded, cocaine use had doubled, cocaine was flooding the playground and the government was covering it up. However, the government's conclusion was more accurate, because it took into account significance, clustering and the fact that the use of many different drugs had been polled. If you test many variables, the probability of one of them showing a clear trend by chance increases, and so tests for significance have to be adjusted appropriately. Upon doing the actual maths, the increase was not statistically significant; it was essentially produced by accident, the random chance that the sample happened to fall on a cluster of individuals using drugs that wasn't representative of the whole population.^{[1]}
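The effect of rounding on the headline figure is simple arithmetic, shown here as a small check. The helper name is made up for the example; the percentages are the ones quoted above.

```python
def relative_increase(before, after):
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100


rounded = relative_increase(1, 2)        # figures as rounded in the summary
unrounded = relative_increase(1.4, 1.9)  # figures before rounding
print(f"rounded: {rounded:.0f}% increase, unrounded: {unrounded:.1f}% increase")
```

Neither number says anything about significance on its own; that depends on the sample size, the clustering and the number of drugs polled.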
Problems with statistical significance
The alpha value is usually set at 0.05 or less. This means that, when the null hypothesis is true, there is at most a five percent chance of rejecting it by chance alone. There is nothing fundamentally magic about an alpha level of 0.05, yet after many generations of using it in analysis it seems to have taken on a certain magical value for many sciences. If a statistical test comes back with p=0.04, results are called significant, and if p=0.06, they are called non-significant.
With this standard alpha level, about 1 in 20 tests should come back significant when there really is no effect. This does happen frequently, so it is wrong to assume a good p-value means you're completely certain; it's still all about probability. This is a problem in individual experiments that run many statistical tests: if you run 40 tests where no effect exists, about 2 of them will show an effect that is not really there. The probability of getting at least one such false positive across a set of tests is referred to as the family-wise error rate; it is difficult to control, but some measures can be used. While it is easy to see this problem in a single set of experiments in a single paper, the same phenomenon emerges if a bunch of single experiments are published in multiple papers. With the thousands of experiments run every day all over the world, a very large number of them will show statistical significance when there really is no effect at all. Publishing biases in journals exaggerate this problem because journals rarely publish experiments that show only a non-effect (i.e., "failed" experiments), and are much more likely to publish papers that show an effect. So one winds up with a massive uncontrolled bias in the published papers towards showing statistical significance where there really is none.
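The arithmetic behind the 40-test example can be checked directly. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function names are made up for the illustration.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Average number of tests that come back significant by chance."""
    return num_tests * alpha


def family_wise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** num_tests


for m in (1, 5, 40):
    print(
        f"{m} tests: expect {expected_false_positives(m):.2f} false positives, "
        f"P(at least one) = {family_wise_error_rate(m):.3f}"
    )
```

With 40 tests the expected count is 2, and under these assumptions the chance of seeing at least one spurious "effect" is close to 87%.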
Abuse from pseudoscience
This is one reason why picking out a single test in a single paper to make a point is meaningless. It is a common tactic in pseudoscience to search through thousands of papers to find that one result that's significant and makes their point. Real science must be accompanied by the preponderance of evidence, and experimental results need to be replicated repeatedly and reliably before they should be incorporated in the body of accepted knowledge. This is why scientific consensus is important and quacks and cranks that go against this consensus do not gain points by finding a single example in a paper that might support their claims.
The problems above are due mostly to the use of frequentist approaches to statistical analysis. There is a growing movement of scientists who are encouraging the use of Bayesian-based statistics. Bayesian approaches are not subject to the same sort of systematic error propagation issues as frequentist approaches (however they are subject to their own unique sets of issues).
P-value fishing or "p-hacking"
"P-value fishing" (a.k.a. "fishing expedition"), more commonly known as "p-hacking," is a pejorative term for a statistical sleight of hand often abused by cranks and those with an agenda to push. There are two common ways to get a statistically significant result that doesn't mean much at all. The first is, in studies with a large number of variables, to run comparisons of all the variables and hope that something comes out significant. Proper methodology dictates that the experimenter choose which variables are being compared beforehand and to run post-hoc corrections on any further comparisons. In other words, just comparing as many variables as possible will eventually turn up a significant result, though it's likely to be statistical noise. The post-hoc correction either reduces the post-hoc test's alpha level or increases its p-value so that the family wise error rate (e.g. 1 out of 20) is maintained.^{[note 1]}
The second trick is to fish for p-values by cranking up the number of subjects until significance is achieved. Normally it's good to have more subjects, but the data should be interpreted in light of the sample size. With a large subject pool, even a slight difference in means will often become significant even though the effect size is close to nothing. This is why it's important to look at the effect size in addition to the p-value.^{[4]}
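To see how sample size alone can push a negligible effect past the threshold, consider a two-group comparison where the observed difference in means is fixed at 0.02 standard deviations. The sketch assumes a two-sided z-test with known, equal variances and equal group sizes; the effect size and group sizes are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(effect_size, n_per_group):
    """Two-sided p-value for an observed standardized mean difference
    (effect_size, in SD units) between two groups of equal size."""
    # The standard error of the difference in SD units is sqrt(2 / n).
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


effect_size = 0.02  # tiny by any conventional standard
for n in (100, 20_000, 1_000_000):
    p = two_sample_z_p_value(effect_size, n)
    print(f"n per group = {n:>9,}: p = {p:.4f}")
```

The effect is identical in every row; only the p-value changes, which is why it cannot stand in for the effect size.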
A 2011 paper by Joseph P. Simmons et al. showed that, despite an outward commitment among psychological researchers to a low rate of false positives through endorsing the use of α=0.05, researchers are often able to manipulate the results by choosing when to stop data collection, which variables to measure, and which statistical methods to use.^{[5]} Such practices, when unchecked by journal policies and article reviewers, are likely to have contributed to the replication crisis in psychology.
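The effect of choosing when to stop data collection can be simulated. The sketch below is not the procedure from the paper; it is a toy setup in which the null hypothesis is true (draws from a standard normal), a z-test is run every few observations, and the experimenter stops as soon as p drops below alpha. All names and parameters are assumptions made for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value_z(total, n):
    """Two-sided p-value for H0: mean 0 with known SD 1, from a running sum."""
    z = total / sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek_every, max_n, alpha=0.05, trials=2000, seed=1):
    """Fraction of null experiments declared significant at any peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # the null is true: there is no effect
            if n % peek_every == 0 and p_value_z(total, n) < alpha:
                false_positives += 1  # stop early and "publish"
                break
    return false_positives / trials


fixed_n = false_positive_rate(peek_every=100, max_n=100)  # one test at the end
peeking = false_positive_rate(peek_every=10, max_n=100)   # ten chances to stop
print(f"fixed sample size: {fixed_n:.3f}, with peeking: {peeking:.3f}")
```

With a single test at the planned sample size the false-positive rate stays near the nominal alpha; checking repeatedly and stopping at the first significant result pushes it well above that.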
Proposed solutions to the problems
Another approach has been to argue that statistics needs to lose its magical status in science as some sort of analogue of a proof, and instead be seen as an argument or a measure of the strength of evidence.^{[note 2]} P-values are just one piece of the broader picture and should be weighed against other types of evidence. P-values can be reported directly, allowing people to integrate them with other evidence in making their conclusions. If other evidence is weak, maybe a p-value of 0.05 is not convincing, or if all the other evidence is strong, maybe a p-value of 0.1 is good enough. However, this is problematic, as dealing directly with p-values opens up the possibility of a large variety of statistical fallacies, such as multiplying the p-values of two studies to get the "combined" p-value.
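To illustrate why multiplying p-values is a fallacy, the naive product can be compared with one established way of combining independent p-values, Fisher's method. Fisher's method is brought in here purely for comparison and is not discussed elsewhere in this article; the sketch assumes the studies are independent and uses the closed form of the chi-square tail for an even number of degrees of freedom.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom when all k null hypotheses are true."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # half of Fisher's statistic
    # Chi-square survival function for 2k degrees of freedom, closed form.
    return exp(-half_stat) * sum(half_stat**i / factorial(i) for i in range(k))


p_values = [0.5, 0.5]  # two thoroughly unremarkable hypothetical studies
naive = p_values[0] * p_values[1]
combined = fisher_combined_p(p_values)
print(f"naive product: {naive:.3f}, Fisher's method: {combined:.3f}")
```

The product of any two p-values is smaller than either one, so multiplying makes evidence appear out of nothing; a proper combination of two p-values of 0.5 stays unremarkable.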
Focusing on confidence intervals instead of p-values provides a more flexible and less arbitrary method to weigh evidence. Certainly, a 95% confidence interval can be interpreted as rejecting the null hypothesis of some value outside the confidence interval at an alpha of 0.05. However, the confidence interval itself allows one to see all the values in the plausible range and decide if this estimate of a difference is even precise enough to be worth relying on. A wide confidence interval associated with a low p-value could seem less useful than a narrow, more precise confidence interval that fails to reject the null.
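The link between a 95% confidence interval and a test at alpha = 0.05 can be sketched as follows. The example assumes a normal model with a known standard deviation; the data, the standard deviation of 15 and the null value of 100 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def confidence_interval(sample, known_sd, level=0.95):
    """Normal-theory confidence interval for the mean, known SD."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * known_sd / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


sample = [112, 104, 118, 99, 121, 108, 115, 103, 110, 117]  # made-up data
null_value = 100
low, high = confidence_interval(sample, known_sd=15)
reject = not (low <= null_value <= high)
print(f"95% CI: ({low:.1f}, {high:.1f}), reject H0 at alpha 0.05: {reject}")
```

Here the null value falls just outside the interval, so the test rejects, but the interval is wide: the data are compatible with anything from a trivial to a large difference, which a bare p-value would not reveal.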
External links
- What does it mean for a result to be "statistically significant"? Stats FAQ, George Mason University
- 9 circles of scientific hell, Neuroskeptic
- The Cult of Significance Testing, John D. Cook
- A cartoon example of p-value fishing over on XKCD
- Always test your hypothesis…
- Science isn't broken. Has a cool interactive on how p-hacking works.
Notes
References
- Cocaine Floods The Playground by Ben Goldacre (April 1, 2006) The Guardian via Bad Science.
- Bonferroni Correction by Eric W. Weisstein, Wolfram MathWorld.
- The Bonferonni and Šidák Corrections for Multiple Comparisons by Hervé Abdi, The University of Texas at Dallas.
- It's the Effect Size, Stupid: What effect size is and why it is important by Robert Coe, Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002.
- False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant by Joseph P. Simmons, Leif D. Nelson, Uri Simonsohn (2011) Psychological Science 22(11):1359–1366. doi:10.1177/0956797611417632.