## Science - Not So Sloppy

Recent Wall Street Journal coverage1 has revived interest in Dr. John Ioannidis’ 2005 PLoS article, "Why Most Published Research Findings Are False."2 Somehow, I missed it the first time around, but this time the paper got me to thinking. In short, I don’t buy his argument; I think the whole foundation is wrong.

## Background

First, for those who haven’t read his paper, or don’t follow the math, let me outline what he’s saying.

He’s talking about a certain kind of experiment, one that’s much more prevalent in medicine than it is in, say, physics. In medicine, it commonly appears in the form of clinical trials.3

The experiments he’s talking about try to show some relationship – typically a causal relationship – between two entities. In a clinical trial, you’re trying to show a causal relationship between a particular treatment and patient outcomes. The hypothesis being tested is that performing the treatment improves the patient’s condition.

To do this, you identify two groups of patients. One, the "test" group, gets the relevant treatment. The other, the "control" group, doesn’t – though they’re not supposed to know that. Most commonly, the treatment is some regimen of drugs, so the control group is given a placebo. At the end of the experiment, you look at both groups to see whether the outcomes in the test group were better than the outcomes in the control group.

## Significance

However, even if you took two groups and treated them both as control groups, there would be differences in the outcomes – so we need some way to measure how likely the observed difference is to have occurred by chance alone, rather than as a result of the treatment.

The statistical trick researchers use for this is called a significance measure,4 and is almost always indicated by the variable p. If p < 0.05, then there’s less than a one in twenty probability the observed effect happened by chance. If p < 0.0001, then there’s less than a one in ten thousand chance.
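One way to estimate such a p is a permutation test. The sketch below is only an illustration, not the method any particular trial uses: the outcome scores are made up, and `permutation_p_value` is a hypothetical helper. It shuffles the group labels many times and counts how often a random relabelling produces a gap at least as large as the one observed, which is what the result would look like if the treatment made no difference.

```python
import random
import statistics

def permutation_p_value(test, control, n_shuffles=5000, seed=0):
    """One-sided permutation estimate of p for 'test did better than control'."""
    rng = random.Random(seed)
    observed = statistics.fmean(test) - statistics.fmean(control)
    pooled = list(test) + list(control)
    n_test = len(test)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Relabel patients at random, as if treatment had no effect.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_test]) - statistics.fmean(pooled[n_test:])
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Hypothetical outcome scores (higher is better); not data from any real trial.
test_scores = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1, 5.7, 6.2]
control_scores = [4.9, 5.0, 5.2, 4.8, 5.1, 5.3, 4.7, 5.0]

p = permutation_p_value(test_scores, control_scores)
print(f"estimated p = {p:.4f}")
```

With groups this well separated the estimate comes out far below 0.05. Feeding in two groups with identical scores gives a value near one half instead.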

For very many research areas, p < 0.05 is considered an acceptable risk, and such results are considered "significant." When p < 0.01, the results are considered "highly significant." Clinical trials in medicine almost always use p < 0.05. In particle physics, the standard is more commonly p < 0.0001. It’s a trade-off between too many false positives and the expense of running an experiment. Keeping the costs of experimentation down and getting treatments to the patients are some of the forces behind the looser standard for clinical trials. Nobody dies if the physicists take a few more years to get their results.

## Possible Outcomes

Now, before the trial takes place, there are four possible outcomes:

1. the treatment fails
2. the treatment works
3. the treatment fails, but chance makes it appear to work (type 1 error, or false positive)
4. the treatment works, but chance makes it appear to fail (type 2 error, or false negative)

At the end of the trial, if the test group comes out better than the control group, you’re in one of the middle two cases. If both groups come out about the same, or the control group comes out better, you’re in the first or last case – but you don’t know which, for sure.

Ioannidis’ paper only discusses the cases where the treatment appears to work – the middle two. You could make an analogous argument about the other two (mathematically, they’re "dual" cases), but it doesn’t really add anything to the overall picture.

## Ioannidis’ Arguments

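The first definition Ioannidis makes is R, which for our purposes represents the fraction of all relationships one might test which are real. (Strictly, the paper defines R as the ratio of true relationships to false ones, which makes the prior probability R/(R + 1); when true relationships are rare, the two are nearly identical.) To use our clinical trial as an example, R represents the fraction of all treatments that actually work. So if there are two million possible ways you could treat the patients, and only twenty of those work, then R = 0.00001.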

Ioannidis’ paper asserts that when you’re setting up an experiment testing for the existence of some relationship – like our clinical trial – there’s an a priori probability of R that the treatment will be effective. There are two million things you could have tested, but only twenty of them are "true". So there’s a 99.999% chance we’ve chosen a treatment that doesn’t work and only a 0.001% chance we’ve chosen one that works. But there’s some probability, even having chosen a treatment that doesn’t work, that our test group still does better than the control – that is, we’ve had a type 1 error. That chance is called α, and the chance a working treatment appears to fail (type 2 error) is β.

If we’ve done our math right, then α is the same as the statistical significance of our test – the thing we called p earlier. For our two million treatments, twenty are "true" (that is, they improve patient outcomes) and 1,999,980 are "false" (no effect on patient outcomes, or they make things worse).

But, if we did separate clinical trials of all two million possible treatments, 99,999 of the false treatments will still appear to work just by chance – 5% (α) of the 1,999,980 false treatments. On the other hand, of the 20 true treatments, some fraction will fail in trials by chance and the rest will succeed. If we assume, for the sake of argument, that α and β are about the same, then 19 work and one fails.
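The arithmetic above can be checked with a few lines of Python. This is just a sketch of the counting argument, using the article's illustrative numbers (two million treatments, twenty of them real, and both error rates at 5%); the function name `expected_outcomes` is made up for the example.

```python
def expected_outcomes(total, true_count, alpha, beta):
    """Expected results of running one trial per candidate treatment."""
    false_count = total - true_count
    false_positives = false_count * alpha      # false treatments that appear to work
    true_positives = true_count * (1 - beta)   # true treatments that appear to work
    false_negatives = true_count * beta        # true treatments that appear to fail
    apparent_successes = false_positives + true_positives
    return {
        "false_positives": false_positives,
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        # Share of apparently successful trials that tested a real treatment.
        "share_real": true_positives / apparent_successes,
    }

outcomes = expected_outcomes(total=2_000_000, true_count=20, alpha=0.05, beta=0.05)
for name, value in outcomes.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions, only about 0.02% of the trials that appear to succeed involve a treatment that really works.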

The central point of the first part of Ioannidis’ essay ("Modeling the Framework for False Positive Findings") is that it’s far more likely we’ve chosen one of the 99,999 false treatments that gave a good result than one of the 19 good treatments that gave a good result. Of course, repeating the experiment increases our confidence in it – but Ioannidis suggests such repeated experiments are often not performed.

The second part of the paper ("Bias") goes on to introduce a new term, u, which represents experimenter bias. The argument is that experimenters often – whether intentionally or just due to sloppy practices – dig up apparent relationships in the data that aren’t really there. They perform an experiment to show one thing, it doesn’t show it, but they notice that some interesting subset of their data does show some relationship, so they jump on that as a conclusion. Rather than performing a new test to verify, they assume the old data is good enough. The issue here is that in any given data set there are a huge number of ways you can partition it into subsets, and if each one has a 5% chance of showing a false positive, then it’s virtually certain a few of them do.
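To get a feel for how quickly this adds up, here is a rough sketch. It assumes each subset is an independent test with a 5% false positive rate, which is a simplification: subsets of the same data overlap, so real tests are correlated. The function name is made up for the example.

```python
def chance_of_any_false_positive(n_subsets, alpha=0.05):
    """Probability that at least one of n independent null tests looks significant."""
    # Each test stays clean with probability (1 - alpha); all n stay clean
    # with probability (1 - alpha) ** n under the independence assumption.
    return 1 - (1 - alpha) ** n_subsets

for n in (1, 5, 14, 50, 100):
    print(n, round(chance_of_any_false_positive(n), 3))
```

Under that assumption, the chance of at least one spurious result passes one half at fourteen subsets and exceeds 99% at a hundred.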

So u represents the tendency for experimenters to "cheat" along these or similar lines. Naturally, such tendencies to find relationships where there aren’t any will increase the chance of false positives, and exacerbate the effect the first part of the paper describes.

The third part ("Testing by Independent Teams") extends the first part to cover the case where multiple, independent research teams are investigating the same area.

## Issues

So, what’s wrong with all of this? Well, almost everything.

First, let’s talk about R (the fraction of all relationships one might test which are real). Can you really even assume such a statistic exists? Most of the time, the relationships are not discrete entities. When a relationship is known to exist, then some of the other relationships become more probable, while others become less probable. These are constantly shifting as scientists explore a field. Hypothetical experiments come in clusters, too – even in a drug trial there’s a presumably continuous range of dosages you could try of each possible drug. So the denominator for R simply isn’t countable. From a statistical perspective, this ruins the whole picture – there’s not even a valid Lebesgue measure5 on the dataset, so treating R as a probability measure6 is meaningless.

Even if it weren’t, it’s incorrect to assume experimenters choose experimental hypotheses at random from the full collection. It’s not like the team gets together and says, "OK, we’ve got a new grant for research into curing AIDS. So what treatment shall we try this time? I know, let’s try poking them with sticks! Or maybe smear them with peanut butter!" The team chooses experiments based on some model of how the underlying phenomenon works.

Ultimately experiments are designed to confirm or refute some proposed relationship in the context of an existing model. The degree to which the model accurately reflects reality (the "fidelity" of the model) increases the chance they’ll pick an experiment that shows a true relationship.

So treating R as the a priori probability of selecting an experiment that shows a true relationship is nonsense. In reality, the "fidelity" of the model for the discipline acts as a kind of "good" bias; experimenters with good models are far more likely to pick true treatments. This would be modeled as a v term – analogous to the u term from the latter half of the paper, but acting in the opposite direction. v is relevant to the reasoning given in the first part of the paper, before we even start talking about the possibility of sloppy science.
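A rough way to picture the effect of v is sketched below. It treats a good model as nothing more than a higher prior probability of picking a true treatment. That modeling choice is an assumption made for illustration, not something taken from Ioannidis’ paper, and `chance_positive_is_real` is a made-up name.

```python
def chance_positive_is_real(prior, alpha=0.05, beta=0.05):
    """Probability that a treatment which appears to work really does work."""
    true_pos = prior * (1 - beta)      # picked a true treatment and the trial shows it
    false_pos = (1 - prior) * alpha    # picked a false treatment but chance favours it
    return true_pos / (true_pos + false_pos)

# Priors range from "chosen at random" (the article's 0.00001) up to a
# hypothetical team whose model is right half the time.
for prior in (0.00001, 0.01, 0.1, 0.5):
    print(f"prior={prior:<8} -> {chance_positive_is_real(prior):.4f}")
```

With a one-in-ten prior, roughly two out of three positive results are real under these error rates, and at even odds the figure is 95%. How high the prior really is in any given field is exactly the open question.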

Additionally, in a general sense, I take issue with Ioannidis’ paper only talking about certain kinds of experiments (e.g., clinical trials) – not all scientific experiments. Yes, the form of experiment considered is very common in some disciplines, but it’s by no means comprehensive – nor are the standards in use the same in all disciplines. It’s a stretch to use his argument to conclude "most science is tainted."

## Not Completely Useless

Ioannidis’ paper does make some useful points. It’s hard to overemphasize how important it is to treat data properly. When an experiment fails, you can’t look around to see if the results support some other hypothesis – or at least, if you do, you have to do a proper experiment to test whether it is real. Furthermore, independent verification of experimental results is clearly important. It acts to counter subconscious "cheating" on the part of research teams, which are highly competitive in some disciplines.

The study also suggests researchers could benefit from more math training. They shouldn’t just know how to figure the p for some dataset; they need to understand what it really means. Science without math is philosophy – and even philosophers are getting on board with the numbers, today. "Analytic" philosophy7 has become the de facto standard.

1 Hotz, Robert Lee. "Most Science Studies Appear to Be Tainted By Sloppy Analysis." Wall Street Journal: Science Journal. September 14, 2007; Page B1. Accessed October 2007 from http://online.wsj.com/article/SB118972683557627104.html.

2 Ioannidis, John P. A. "Why Most Published Research Findings Are False." PLoS Med. 2005 August; 2(8): e124. Accessed October 2007 from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1182327.

3 "Clinical Trial." Wikipedia.org. Accessed October 2007 from http://en.wikipedia.org/wiki/Clinical_trial.

4 "Significance." Mathworld. Accessed October 2007 from http://mathworld.wolfram.com/Significance.html.

5 "Lebesgue measure." Mathworld. Accessed October 2007 from http://mathworld.wolfram.com/LebesgueMeasure.html.

6 "Probability measure." Mathworld. Accessed October 2007 from http://mathworld.wolfram.com/ProbabilityMeasure.html.

7 Leiter, Brian. "‘Analytic’ and ‘Continental’ Philosophy." The Philosophical Gourmet. Blackwell Publishing, 2006. Accessed October 2007 from http://www.philosophicalgourmet.com/analytic.asp. A discussion of the distinction between the two types of philosophy.

##### In fairness... by JSinger

Clinical trials are required to specify the details of their analysis in advance and correct for multiple testing, so while your explanation is correct, your choice of example is unfair. Ioannidis is talking about epidemiology, which is notorious for exactly the kind of fishing you describe, and as you say, that issue isn’t especially applicable to more qualitative experiments.

##### A common error regarding the p-value by Anonymous

Although you are to be commended for suggesting limitations regarding the applicability of Dr. Ioannidis’s controversial hypothesis, you inadvertently perpetuated a common misunderstanding regarding the meaning of p-values. Whereas you stated that "[i]f p < 0.05, then there’s less than a one in twenty probability the observed effect happened by chance," the p-value does not comment upon the probability that the outcome was due to chance. Rather, the p-value is calculated after assuming that there is no real difference between the treatment arms and represents the probability of obtaining a distribution of data that is at least as extreme as that observed, again in the absence of any real difference. The p-value describes the data observed, not the probability of the hypothesis. Of course, a very small p-value suggests that the data are not consistent with the null hypothesis; one may conclude that the null hypothesis is improbable, presuming that the experimental hypothesis is plausible (if it weren’t, I wonder why the experiment was conducted). To sum up, a p < 0.05 is equivalent to the statement "Even if there is no difference between the treatment arms, there is less than a 5% chance of observing this much difference in outcome between them." All of which is consistent with your commentary as a whole, and with Ioannidis’s article as well.