Fermat's Library | Why Most Published Research Findings Are False annotated/explained version.

This PLoS Medicine paper: “Why most Published Research Findings are...

Does the title apply to this paper?

Dr. Ioannidis is a Professor of Medicine at Stanford University and...

Here is a nice table explaining the different types of errors and r...

There have been many recent efforts to evaluate the reproducibility...

This short video explains P-values and significance tests nicely: ...

Researchers typically only report statistically significant results...

This is a very subtle point-> if independent teams are conducting e...

Significance testing and the arbitrary pvalue cutoff of .05 were in...

A great follow up to this point is a modelling exercise in which Gr...

A great follow up to how to improve comes from one of the sequels o...

PLoS Medicine | www.plosmedicine.org 0696

Essay

Open access, freely available online

August 2005 | Volume 2 | Issue 8 | e124

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R⁄(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R⁄(R − βR + α). A research finding is thus

The Essay section contains opinion pieces on topics

of broad interest to a general medical audience.

Why Most Published Research Findings Are False

John P. A. Ioannidis

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abbreviation: PPV, positive predictive value

John P. A. Ioannidis is in the Department of Hygiene

and Epidemiology, University of Ioannina School of

Medicine, Ioannina, Greece, and Institute for Clinical

Research and Health Policy Studies, Department of

Medicine, Tufts-New England Medical Center, Tufts

University School of Medicine, Boston, Massachusetts,

United States of America. E-mail: jioannid@cc.uoi.gr

Competing Interests: The author has declared that

no competing interests exist.

DOI: 10.1371/journal.pmed.0020124

Summary

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.



more likely true than false if (1 − β)R > α. Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 − β)R > 0.05.
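The relation PPV = (1 − β)R⁄(R − βR + α) and the threshold (1 − β)R > α can be checked numerically. The following is a minimal Python sketch, not code from the essay; the function name `ppv` and the example inputs are illustrative assumptions.

```python
def ppv(R: float, power: float, alpha: float = 0.05) -> float:
    """Post-study probability that a claimed finding is true (one study, no bias).

    R     -- pre-study odds of true to no relationships
    power -- 1 - beta (one minus the Type II error rate)
    alpha -- Type I error rate
    """
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


# Illustrative inputs (assumptions, not values taken from the essay).
for R, power in [(1.0, 0.80), (0.1, 0.80), (0.1, 0.20)]:
    value = ppv(R, power)
    # A finding is more likely true than false exactly when power * R > alpha.
    print(f"R={R}, power={power}: PPV={value:.3f}, "
          f"power*R > alpha: {power * R > 0.05}")
```

With α fixed at 0.05, the sketch makes the dependence on R visible: at equal power, lowering the pre-study odds from 1:1 to 1:10 lowers the PPV substantially.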

What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.

Bias

First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced. Let u be the proportion of probed analyses that would not have been “research findings,” but nevertheless end up presented and reported as such, because of bias. Bias should not be confused with chance variability that causes some findings to be false by chance even though the study design, data, analysis, and presentation are perfect. Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias. We may assume that u does not depend on whether a true relationship exists or not. This is not an unreasonable assumption, since typically it is impossible to know which relationships are indeed true. In the presence of bias (Table 2), one gets PPV = ([1 − β]R + uβR)⁄(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α, i.e., 1 − β ≤ 0.05 for most situations. Thus, with increasing bias, the chances that a research finding is true diminish considerably. This is shown for different levels of power and for different pre-study odds in Figure 1.
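The effect of bias can be explored with a short Python sketch of the Table 2 formula, PPV = ([1 − β]R + uβR)⁄(R + α − βR + u − uα + uβR). The function name `ppv_with_bias` and the chosen values of R, power, and u are illustrative assumptions rather than anything specified in the essay.

```python
def ppv_with_bias(R: float, power: float, u: float, alpha: float = 0.05) -> float:
    """PPV of a claimed finding when a proportion u of would-be non-findings
    is nevertheless reported as findings because of bias."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# Illustrative scan over bias levels at 1:1 pre-study odds and 80% power.
for u in (0.0, 0.05, 0.20, 0.50, 0.80):
    print(f"u={u:.2f}: PPV={ppv_with_bias(R=1.0, power=0.80, u=u):.3f}")
```

Setting u = 0 recovers the bias-free expression, and when power equals α the result no longer depends on u, which is the boundary case mentioned in the text.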

Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors relationships are lost in noise [12], or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to “bury” significant findings [13]. There is no good large-scale empirical evidence on how frequently such reverse bias may occur across diverse research fields. However, it is probably fair to say that reverse bias is not as common. Moreover, measurement errors and inefficient use of data are probably becoming less frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data. Regardless, reverse bias may be modeled in the same way as bias above. Also, reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.

Testing by Several Independent Teams

Several independent teams may be addressing the same sets of research questions. As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions. Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation. An increasing number of questions have at least one study claiming a research finding, and this receives unilateral attention. The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate. For n independent studies of equal power, the 2 × 2 table is shown in Table 3: PPV = R(1 − β^n)⁄(R + 1 − [1 − α]^n − Rβ^n) (not considering bias). With increasing number of independent studies, PPV tends to decrease, unless 1 − β < α, i.e., typically 1 − β < 0.05. This is shown for different levels of power and for different pre-study odds in Figure 2. For n studies of different power, the term β^n is replaced by the product of the terms β_i for i = 1 to n, but inferences are similar.
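The multiple-team formula can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `ppv_multiple_studies`, the choice to pass a list of per-study powers (so that the product of the β_i covers both the equal-power and unequal-power cases), and the example inputs are not from the essay.

```python
from math import prod


def ppv_multiple_studies(R: float, powers: list[float], alpha: float = 0.05) -> float:
    """PPV when at least one of len(powers) independent studies claims a finding
    (no bias). With equal powers this reduces to the beta**n form of Table 3."""
    n = len(powers)
    # Probability that every study misses a true relationship: product of beta_i.
    all_miss = prod(1.0 - p for p in powers)
    # Probability that no study claims a finding when no relationship exists.
    none_false = (1.0 - alpha) ** n
    return R * (1.0 - all_miss) / (R + 1.0 - none_false - R * all_miss)


# Illustrative scan: 1:2 pre-study odds, each study with 80% power.
for n in (1, 5, 10, 50):
    print(f"n={n}: PPV={ppv_multiple_studies(R=0.5, powers=[0.80] * n):.3f}")
```

For a single study the expression collapses to the basic PPV formula, and for power above α the printed values fall as n grows.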

Corollaries

A practical example is shown in Box 1. Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases towards 1 − β = 0.05. Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized) [14] than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller) [15].
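The chain from sample size to power to PPV can be illustrated with a rough Python sketch. The essay does not specify a power calculation; the two-sided, two-sample z-test approximation, the standardized effect size `d`, the pre-study odds `R = 0.25`, and the function names below are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test (assumed design)
    for a standardized mean difference d with n_per_group subjects per arm."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = d * sqrt(n_per_group / 2.0)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def ppv(R: float, power: float, alpha: float = 0.05) -> float:
    """Bias-free PPV; R - beta*R + alpha equals power*R + alpha."""
    return power * R / (power * R + alpha)


# Hypothetical field: small effect (d = 0.2), pre-study odds 1:4.
for n in (20, 200, 2000):
    power = approx_power(d=0.2, n_per_group=n)
    print(f"n per group={n}: power={power:.2f}, PPV={ppv(0.25, power):.2f}")
```

Under these assumptions the smallest studies have power only slightly above α, and the PPV stays close to the pre-study probability; the same sketch with a larger `d` shows the effect-size dependence as well.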

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. Power is also related to the effect size. Thus research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3–20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1–1.5) [7]. Modern epidemiology is increasingly obliged to target smaller

Table 1. Research Findings and True Relationships

Research Finding    True Relationship: Yes    True Relationship: No    Total
Yes                 c(1 − β)R/(R + 1)         cα/(R + 1)               c(R + α − βR)/(R + 1)
No                  cβR/(R + 1)               c(1 − α)/(R + 1)         c(1 − α + βR)/(R + 1)
Total               cR/(R + 1)                c/(R + 1)                c

DOI: 10.1371/journal.pmed.0020124.t001

Table 2. Research Findings and True Relationships in the Presence of Bias

Research Finding    True Relationship: Yes        True Relationship: No        Total
Yes                 (c[1 − β]R + ucβR)/(R + 1)    (cα + uc[1 − α])/(R + 1)     c(R + α − βR + u − uα + uβR)/(R + 1)
No                  (1 − u)cβR/(R + 1)            (1 − u)c(1 − α)/(R + 1)      c(1 − u)(1 − α + βR)/(R + 1)
Total               cR/(R + 1)                    c/(R + 1)                    c

DOI: 10.1371/journal.pmed.0020124.t002


effect sizes [16]. Consequently, the proportion of true research findings is expected to decrease. In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost ubiquitous false positive claims. For example, if the majority of true genetic or nutritional determinants of complex diseases confer relative risks less than 1.05, genetic or nutritional epidemiology would be largely utopian endeavors.

Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. As shown above, the post-study probability that a finding is true (PPV) depends a lot on the pre-study odds (R). Thus, research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments. Fields considered highly informative and creative given the wealth of the assembled and tested information, such as microarrays and other high-throughput discovery-oriented research [4,8,17], should have extremely low PPV.

Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u. For several research designs, e.g., randomized controlled trials [18–20] or meta-analyses [21,22], there have been efforts to standardize their conduct and reporting. Adherence to common standards is likely to increase the proportion of true findings. The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised (e.g., scales for schizophrenia outcomes) [23]. Similarly, fields that use commonly agreed, stereotyped analytical methods (e.g., Kaplan-Meier plots and the log-rank test) [24] may yield a larger proportion of true findings than fields where analytical methods are still under experimentation (e.g., artificial intelligence methods) and only “best” results are reported. Regardless, even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials [25]. Simply abolishing selective publication would not make this problem go away.

Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. Conflicts of interest and prejudice may increase bias, u. Conflicts of interest are very common in biomedical research [26], and typically they are inadequately and sparsely reported [26,27]. Prejudice may not necessarily have financial roots. Scientists in a given field may be prejudiced purely because of their belief in a scientific theory or commitment to their own findings. Many otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure. Such nonfinancial conflicts may also lead to distorted reported results and interpretations. Prestigious investigators may suppress via the peer review process the appearance and dissemination of findings that refute their findings, thus condemning their field to perpetuate false dogma. Empirical evidence on expert opinion shows that it is extremely unreliable [28].

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. This seemingly paradoxical corollary follows because, as stated above, the PPV of isolated findings decreases when many teams of investigators are involved in the same field. This may explain why we occasionally see major excitement followed rapidly by severe disappointments in fields that draw wide attention. With many teams working on the same field and with massive experimental data being produced, timing is of the essence in beating competition. Thus, each team may prioritize on pursuing and disseminating its most impressive “positive” results. “Negative” results may become attractive for dissemination only if some other team has found a “positive” association on the same question. In that case, it may be attractive to refute a claim made in some prestigious journal. The term Proteus phenomenon has been coined to describe this phenomenon of rapidly

Table 3. Research Findings and True Relationships in the Presence of Multiple Studies

Research Finding    True Relationship: Yes    True Relationship: No         Total
Yes                 cR(1 − β^n)/(R + 1)       c(1 − [1 − α]^n)/(R + 1)      c(R + 1 − [1 − α]^n − Rβ^n)/(R + 1)
No                  cRβ^n/(R + 1)             c(1 − α)^n/(R + 1)            c([1 − α]^n + Rβ^n)/(R + 1)
Total               cR/(R + 1)                c/(R + 1)                     c

DOI: 10.1371/journal.pmed.0020124.t003

Figure 1. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Levels of Bias, u

Panels correspond to power of 0.20, 0.50, and 0.80.

DOI: 10.1371/journal.pmed.0020124.g001


alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies, where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable

Box 1. An Example: Science at Low Pre-Study Odds

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10^−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^−4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10^−4.

Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^−4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^−4, hardly any higher than the probability we had before any of this extensive research was undertaken!
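The single-study figures in Box 1 follow directly from the PPV formulas. The Python sketch below plugs in the Box 1 inputs (R = 10^−4, power 0.60, α = 0.05, u = 0.10); the function names are illustrative assumptions, and only the unbiased and biased single-study cases are covered.

```python
def ppv(R: float, power: float, alpha: float = 0.05) -> float:
    """Bias-free PPV for a single study."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


def ppv_with_bias(R: float, power: float, u: float, alpha: float = 0.05) -> float:
    """PPV for a single study with bias proportion u (Table 2 formula)."""
    beta = 1.0 - power
    return ((1.0 - beta) * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R
    )


R = 10 / 100_000          # ten true polymorphisms among 100,000 tested
prior = R / (R + 1.0)     # pre-study probability, about 1e-4

unbiased = ppv(R, power=0.60)
biased = ppv_with_bias(R, power=0.60, u=0.10)

print(f"pre-study probability: {prior:.2e}")
print(f"PPV without bias:      {unbiased:.2e} ({unbiased / prior:.1f}-fold increase)")
print(f"PPV with u = 0.10:     {biased:.2e}")
```

Under these inputs the unbiased PPV is roughly twelve times the pre-study probability, and a bias of u = 0.10 removes most of that gain.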

Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n

Panels correspond to power of 0.20, 0.50, and 0.80.

DOI: 10.1371/journal.pmed.0020124.g002


standardization of laboratory and

statistical methods, outcomes, and

reporting thereof to minimize bias.
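The PPV column of Table 4 can be recomputed from the bias formula with α = 0.05. The Python sketch below is an illustration, not the author's simulation code; the function name and the row encoding are assumptions, while the (power, R, u, PPV) values are copied from Table 4.

```python
def ppv_with_bias(R: float, power: float, u: float, alpha: float = 0.05) -> float:
    """PPV for a single study with bias proportion u (Table 2 formula)."""
    beta = 1.0 - power
    return ((1.0 - beta) * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R
    )


# (power, R, u, published PPV, decimals used in the table)
TABLE_4 = [
    (0.80, 1 / 1,    0.10, 0.85,   2),  # adequately powered RCT
    (0.95, 2 / 1,    0.30, 0.85,   2),  # confirmatory meta-analysis
    (0.80, 1 / 3,    0.40, 0.41,   2),  # meta-analysis of small studies
    (0.20, 1 / 5,    0.20, 0.23,   2),  # underpowered, well-performed phase I/II RCT
    (0.20, 1 / 5,    0.80, 0.17,   2),  # underpowered, poorly performed phase I/II RCT
    (0.80, 1 / 10,   0.30, 0.20,   2),  # adequately powered exploratory epidemiology
    (0.20, 1 / 10,   0.30, 0.12,   2),  # underpowered exploratory epidemiology
    (0.20, 1 / 1000, 0.80, 0.0010, 4),  # discovery-oriented, massive testing
    (0.20, 1 / 1000, 0.20, 0.0015, 4),  # same, with more limited bias
]

computed = [round(ppv_with_bias(R, power, u), digits)
            for power, R, u, _, digits in TABLE_4]

for (power, R, u, published, _), value in zip(TABLE_4, computed):
    print(f"power={power:.2f} R={R:.4f} u={u:.2f} "
          f"published={published} computed={value}")
```

Only the first two scenarios clear the 50% mark, which matches the statement that a PPV above 50% is difficult to obtain in this framework.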

Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias

As shown, the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings. Let us suppose that in a research field there are no true findings at all to be discovered. History of science teaches us that scientific endeavor has often in the past wasted effort in fields with absolutely no yield of true scientific information, at least based on our current understanding. In such a “null field,” one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.

For example, let us suppose that no nutrients or dietary patterns are actually important determinants for the risk of developing a specific tumor. Let us also suppose that the scientific literature has examined 60 nutrients and claims all of them to be related to the risk of developing this tumor with relative risks in the range of 1.2 to 1.4 for the comparison of the upper to lower intake tertiles. Then the claimed effect sizes are simply measuring nothing else but the net bias that has been involved in the generation of this scientific literature. Claimed effect sizes are in fact the most accurate estimates of the net bias. It even follows that between “null fields,” the fields that claim stronger effects (often with accompanying claims of medical or public health importance) are simply those that have sustained the worst biases.

For fields with very low PPV, the few true relationships would not distort this overall picture much. Even if a few relationships are true, the shape of the distribution of the observed effects would still yield a clear measure of the biases involved in the field. This concept totally reverses the way we view scientific results. Traditionally, investigators have viewed large and highly significant effects with excitement, as signs of important discoveries. Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research. They should lead investigators to careful critical thinking about what might have gone wrong with their data, analyses, and results.

Of course, investigators working in any field are likely to resist accepting that the whole field in which they have spent their careers is a “null field.” However, other lines of evidence, or advances in technology and experimentation, may lead eventually to the dismantling of a scientific field. Obtaining measures of the net bias in one field may also be useful for obtaining insight into what might be the range of bias operating in other fields where similar analytical methods, technologies, and conflicts may be operating.

How Can We Improve the Situation?

Is it unavoidable that most research findings are false, or can we improve the situation? A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure “gold” standard is unattainable. However, there are several approaches to improve the post-study probability.

Better powered evidence, e.g., large studies or low-bias meta-analyses, may help, as it comes closer to the unknown “gold” standard. However, large studies may still have biases and these should be acknowledged and avoided. Moreover, large-scale evidence is impossible to obtain for all of the millions and trillions of research questions posed in current research. Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive. Large-scale evidence is also particularly indicated when it can test major concepts rather than narrow, specific questions. A negative finding can then refute not only a specific proposed claim, but a whole field or considerable portion thereof. Selecting the performance of large-scale studies based on narrow-minded criteria, such as the marketing promotion of a specific drug, is largely wasted research. Moreover, one should be cautious that extremely large studies may be more likely to find a formally statistically significant difference for a trivial effect that is not really meaningfully different from the null [32–34].

Second, most research questions are addressed by many teams, and it is misleading to emphasize the statistically significant findings of any single team. What matters is the

Table 4. PPV of Research Findings for Various Combinations of Power (1 − β), Ratio of True to Not-True Relationships (R), and Bias (u)

1 − β    R          u       Practical Example                                                       PPV
0.80     1:1        0.10    Adequately powered RCT with little bias and 1:1 pre-study odds          0.85
0.95     2:1        0.30    Confirmatory meta-analysis of good-quality RCTs                         0.85
0.80     1:3        0.40    Meta-analysis of small inconclusive studies                             0.41
0.20     1:5        0.20    Underpowered, but well-performed phase I/II RCT                         0.23
0.20     1:5        0.80    Underpowered, poorly performed phase I/II RCT                           0.17
0.80     1:10       0.30    Adequately powered exploratory epidemiological study                    0.20
0.20     1:10       0.30    Underpowered exploratory epidemiological study                          0.12
0.20     1:1,000    0.80    Discovery-oriented exploratory research with massive testing            0.0010
0.20     1:1,000    0.20    As in previous example, but with more limited bias (more standardized)  0.0015

The estimated PPVs (positive predictive values) are derived assuming α = 0.05 for a single study.
RCT, randomized controlled trial.

DOI: 10.1371/journal.pmed.0020124.t004


totality of the evidence. Diminishing bias through enhanced research standards and curtailing of prejudices may also help. However, this may require a change in scientific mentality that might be difficult to achieve.

In some research designs, efforts may also be more successful with upfront registration of studies, e.g., randomized trials [35]. Registration would pose a challenge for hypothesis-generating research. Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.

Finally, instead of chasing statistical significance, we should improve our understanding of the range of R values (the pre-study odds) where research efforts operate [10]. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship. Speculated high R values may sometimes then be ascertained. As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established “classics” will fail the test [36].

Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should then acknowledge that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the report and in the relevant field at large. Despite a large statistical literature for multiple testing corrections [37], usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding. Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs. The wider field may yield some guidance for estimating this probability for the isolated research project. Experiences from biases detected in other neighboring fields would also be useful to draw upon. Even though these assumptions would be considerably subjective, they would still be very useful in interpreting research claims and putting them in context.

References

1. Ioannidis JP, Haidich AB, Lau J (2001) Any casualties in the clash of randomised and observational evidence? BMJ 322: 879–880.

2. Lawlor DA, Davey Smith G, Kundu D, Bruckdorfer KR, Ebrahim S (2004) Those confounded vitamins: What can we learn from the differences between observational versus randomised trial evidence? Lancet 363: 1724–1727.

3. Vandenbroucke JP (2004) When are observational studies as credible as randomised trials? Lancet 363: 1728–1731.

4. Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365: 488–492.

5. Ioannidis JPA, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG (2001) Replication validity of genetic association studies. Nat Genet 29: 306–309.

6. Colhoun HM, McKeigue PM, Davey Smith G (2003) Problems of reporting genetic associations with complex outcomes. Lancet 361: 865–872.

7. Ioannidis JP (2003) Genetic associations: False or true? Trends Mol Med 9: 135–138.

8. Ioannidis JPA (2005) Microarrays and molecular research: Noise discovery? Lancet 365: 454–455.

9. Sterne JA, Davey Smith G (2001) Sifting the evidence—What’s wrong with significance tests. BMJ 322: 226–231.

10. Wacholder S, Chanock S, Garcia-Closas M, El ghormli L, Rothman N (2004) Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J Natl Cancer Inst 96: 434–442.

11. Risch NJ (2000) Searching for genetic determinants in the new millennium. Nature 405: 847–856.

12. Kelsey JL, Whittemore AS, Evans AS, Thompson WD (1996) Methods in observational epidemiology, 2nd ed. New York: Oxford U Press. 432 p.

13. Topol EJ (2004) Failing the public health—Rofecoxib, Merck, and the FDA. N Engl J Med 351: 1707–1709.

14. Yusuf S, Collins R, Peto R (1984) Why do we need some large, simple randomized trials? Stat Med 3: 409–422.

15. Altman DG, Royston P (2000) What do we mean by validating a prognostic model? Stat Med 19: 453–473.

16. Taubes G (1995) Epidemiology faces its limits. Science 269: 164–169.

17. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531–537.

18. Moher D, Schulz KF, Altman DG (2001) The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 357: 1191–1194.

19. Ioannidis JP, Evans SJ, Gotzsche PC, O’Neill RT, Altman DG, et al. (2004) Better reporting of harms in randomized trials: An extension of the CONSORT statement. Ann Intern Med 141: 781–788.

20. International Conference on Harmonisation E9 Expert Working Group (1999) ICH Harmonised Tripartite Guideline. Statistical principles for clinical trials. Stat Med 18: 1905–1942.

21. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, et al. (1999) Improving the quality of reports of meta-analyses of randomised controlled trials: The QUOROM statement. Quality of Reporting of Meta-analyses. Lancet 354: 1896–1900.

22. Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, et al. (2000) Meta-analysis of observational studies in epidemiology: A proposal for reporting. Meta-analysis of Observational Studies in Epidemiology (MOOSE) group. JAMA 283: 2008–2012.

23. Marshall M, Lockwood A, Bradley C, Adams C, Joy C, et al. (2000) Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. Br J Psychiatry 176: 249–252.

24. Altman DG, Goodman SN (1994) Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA 272: 129–132.

25. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG (2004) Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. JAMA 291: 2457–2465.

26. Krimsky S, Rothenberg LS, Stott P, Kyle G (1998) Scientific journals and their authors’ financial interests: A pilot study. Psychother Psychosom 67: 194–201.

27. Papanikolaou GN, Baltogianni MS, Contopoulos-Ioannidis DG, Haidich AB, Giannakakis IA, et al. (2001) Reporting of conflicts of interest in guidelines of preventive and therapeutic interventions. BMC Med Res Methodol 1: 3.

28. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC (1992) A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 268: 240–248.

29. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 58: 543–549.

30. Ntzani EE, Ioannidis JP (2003) Predictive ability of DNA microarrays for cancer outcomes and correlates: An empirical assessment. Lancet 362: 1439–1444.

31. Ransohoff DF (2004) Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 4: 309–314.

32. Lindley DV (1957) A statistical paradox. Biometrika 44: 187–192.

33. Bartlett MS (1957) A comment on D.V. Lindley’s statistical paradox. Biometrika 44: 533–534.

34. Senn SJ (2001) Two cheers for P-values. J Epidemiol Biostat 6: 193–204.

35. De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, et al. (2004) Clinical trial registration: A statement from the International Committee of Medical Journal Editors. N Engl J Med 351: 1250–1251.

36. Ioannidis JPA (2005) Contradicted and initially stronger effects in highly cited clinical research. JAMA 294: 218–228.

37. Hsueh HM, Chen JJ, Kodell RL (2003) Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 13: 675–689.


Discussion

This is a very subtle point: if independent teams are conducting experiments on the same hypothesis (and not communicating with one another), our interpretation of reported significant results should be adjusted.

This PLoS Medicine paper, “Why Most Published Research Findings Are False,” has been the most-accessed article in the history of the Public Library of Science, with over 2.5 million reads. Most scientists care about unbiased, reproducible, good science, and this paper has made many aware of how biased and non-reproducible science and science publishing can be.

Significance testing and the arbitrary p-value cutoff of .05 were introduced by R.A. Fisher in the 1920s. Here is a quote from him about .05: "If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." For more: https://www.bmj.com/rapid-response/2011/11/03/origin-5-p-value-threshold

Researchers typically report only statistically significant results, so you never hear about the “negative” results, or the ones that are not “significant” under this arbitrary p-value threshold. This system leads to misaligned incentives, non-reproducible research, and a major distortion of scientific output, dissemination, and understanding.

This short video explains p-values and significance tests nicely: https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests I also love the following Wikipedia article, titled "Misunderstandings of p-values": https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

A great follow-up to this point is a modelling exercise in which Grimes and Ioannidis show how research becomes progressively untrustworthy under publish-or-perish pressure: http://rsos.royalsocietypublishing.org/content/5/1/171511.article-info

There have been many recent efforts to evaluate the reproducibility of research across different fields and, in fact, to show that most research claims are false. One major effort, the Reproducibility Project: Psychology, is run by the Center for Open Science and aims to reproduce landmark social psychology results. This project was a collaboration of 270 contributing authors who repeated 100 published experimental and correlational psychological studies and were able to reproduce only 36% of them. Another recent effort, performed by Amgen researchers, aimed to reproduce cancer biology research and could reproduce only 6 of 53 studies. For more studies and commentary: https://medium.com/@stelios.serghiou/being-earnest-about-science-a-plead-from-the-new-generation-6dc256d645e1

Does the title apply to this paper?

Dr. Ioannidis is a Professor of Medicine at Stanford University and Director of the Stanford Prevention Research Center at Stanford University School of Medicine. His work tackles clinical research methodology as well as evidence-based medicine, and he is internationally recognized as a leader in empirical studies assessing biases, replication, and reliability of research findings in biomedicine and beyond. He also helps run the Meta-Research Innovation Center at Stanford (METRICS), which is reimagining science for the 21st century with the goal of strengthening the research enterprise to improve the quality of scientific studies in biomedicine and beyond.

"How Not to Be Wrong" by Jordan Ellenberg, starting on p. 145, has an excellent discussion of this paper and others like it.

Here is a nice table explaining the different types of errors and rates: ![Imgur](https://i.imgur.com/LKVWFVY.png)

A great follow-up on how to improve comes from one of the sequels to this article by Ioannidis, titled "How to make more published research true": http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001747
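
The first note in this discussion, about independent teams testing the same hypothesis, can be made concrete with a small sketch. It assumes the essay's expression for n independent studies of equal power, where a finding is claimed as soon as at least one study reaches significance: PPV = R(1 − β^n)/(R + 1 − [1 − α]^n − Rβ^n). The function name and the example values of R and power are illustrative assumptions.

```python
def ppv_multiple_teams(R, power, n, alpha=0.05):
    """PPV when n independent, equally powered studies probe the same question
    and a finding is claimed if at least one of them is significant."""
    beta = 1.0 - power
    # Chance that at least one study detects a true relationship: 1 - beta**n
    # Chance that at least one study gives a false positive: 1 - (1 - alpha)**n
    return R * (1.0 - beta ** n) / (R + 1.0 - (1.0 - alpha) ** n - R * beta ** n)


# Illustrative case: even pre-study odds (R = 1) and 80% power
for n in (1, 2, 5, 10):
    print(f"{n:>2} team(s): PPV = {ppv_multiple_teams(R=1.0, power=0.80, n=n):.3f}")
```

In this example the PPV of an isolated significant result drops as more teams chase the same question, which is the adjustment the note refers to.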
