NO EVIDENCE FOR PSI

# Problem 3: p-Values Overstate the Evidence Against the Null

Consider a data set for which p = .001, indicating a low probability of encountering a test statistic that is at least as extreme as the one that was actually observed, given that the null hypothesis H0 is true. Should we proceed to reject H0? Well, this depends at least in part on how likely the data are under H1. Suppose, for instance, that H1 represents a very small effect; then it may be that the observed value of the test statistic is almost as unlikely under H0 as under H1. What is going on here?
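To make this concrete, the two likelihoods can be compared directly. The sketch below uses illustrative numbers (not taken from any study discussed here): a z statistic of 3.29, which yields a two-sided p of about .001 under H0, evaluated against a hypothetical small-effect alternative that shifts the mean of the test statistic by only half a standard deviation.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution (avoids a SciPy dependency)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

z_obs = 3.29    # two-sided p ~= .001 under H0: Z ~ N(0, 1)
mu_alt = 0.5    # hypothetical small-effect H1: Z ~ N(0.5, 1)

lik_h0 = normal_pdf(z_obs, mean=0.0)    # p(data | H0)
lik_h1 = normal_pdf(z_obs, mean=mu_alt) # p(data | H1)

# Likelihood ratio in favor of H1: far less dramatic than p = .001 suggests
print(lik_h1 / lik_h0)  # ~ 4.6
```

Despite the seemingly impressive p value, the data are only about 4.6 times more likely under this small-effect H1 than under H0; this is the sense in which the p value, taken alone, overstates the evidence against the null.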

The underlying problem is that evidence is a relative concept, and it is of limited interest to consider the probability of the data under just a single hypothesis. For instance, if you win the state lottery you might be accused of cheating; after all, the probability of winning the state lottery is rather small. This may be true, but this low probability in itself does not constitute evidence—the evidence is assessed only when this low probability is pitted against the much lower probability that you could somehow have obtained the winning number by acquiring advance knowledge on how to buy the winning ticket.

Therefore, in order to evaluate the strength of evidence that the data provide for or against precognition, we need to pit the null hypothesis against a specific alternative hypothesis, and not consider the null hypothesis in isolation. Several methods are available to achieve this goal. Classical statisticians can achieve this goal with the Neyman-Pearson procedure, statisticians who focus on likelihood can achieve this goal using likelihood ratios (Royall, 1997), and Bayesian statisticians can achieve this goal using a hypothesis test that computes a weighted likelihood ratio (e.g., Rouder et al., 2009; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010; Wetzels, Raaijmakers, Jakab, & Wagenmakers, 2009). As an illustration, we focus here on the Bayesian hypothesis test.

In a Bayesian hypothesis test, the goal is to quantify the change from prior to posterior odds that is brought about by the data. For a choice between H0 and H1, we have

$$\frac{p(H_0 \mid D)}{p(H_1 \mid D)} = \frac{p(H_0)}{p(H_1)} \times \frac{p(D \mid H_0)}{p(D \mid H_1)}, \qquad (1)$$

which is often verbalized as

Posterior model odds = Prior model odds × Bayes factor.  (2)

Thus, the change from prior odds p(H0)/p(H1) to posterior odds p(H0|D)/p(H1|D) that is brought about by the data is given by the ratio p(D|H0)/p(D|H1), a quantity known as the Bayes factor (Jeffreys, 1961). The Bayes factor (or its logarithm) is often interpreted as the weight of evidence provided by the data (Good, 1985; for details see Berger & Pericchi, 1996; Bernardo & Smith, 1994, Chapter 6; Gill, 2002, Chapter 7; Kass & Raftery, 1995; and O'Hagan, 1995).
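Equation 1 can be sketched in a few lines of code. The example below uses made-up binomial data (k hits in n forced-choice trials, not figures from the paper) and, for simplicity, a point alternative; a genuine Bayes factor would instead average p(D | theta) over a prior distribution on theta under H1.

```python
from math import comb

def binom_lik(k, n, theta):
    """Binomial likelihood p(D | theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Illustrative data: 60 hits in 100 trials (hypothetical, not from the paper)
k, n = 60, 100
lik_h0 = binom_lik(k, n, 0.5)  # H0: chance performance, theta = .5
lik_h1 = binom_lik(k, n, 0.6)  # point H1 chosen for simplicity only

bf01 = lik_h0 / lik_h1         # Bayes factor for H0 over H1
prior_odds = 1.0               # equal prior odds p(H0)/p(H1)
posterior_odds = prior_odds * bf01  # Equation 1: posterior odds for H0
print(bf01, posterior_odds)
```

Here bf01 comes out near 0.13, so the data shift the odds toward H1 by a factor of roughly 7.5; with equal prior odds, the posterior odds equal the Bayes factor.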

When the Bayes factor for H0 over H1 equals 2 (i.e., BF01 = 2), this indicates that the data are twice as likely to have occurred under H0 than under H1. Even though the Bayes factor has an unambiguous and continuous scale, it is sometimes useful to summarize the Bayes factor in terms of discrete categories of evidential strength. Jeffreys (1961, Appendix B) proposed the classification scheme shown in Table 1.
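A discrete summary of this kind is straightforward to implement. The sketch below uses the cut-points commonly attributed to Jeffreys (1961); the exact category labels in Table 1 may be worded slightly differently, so treat the strings here as an approximation.

```python
def jeffreys_label(bf10):
    """Rough evidential category for a Bayes factor BF10 in favor of H1.
    Cut-points follow Jeffreys (1961); label wording is approximate."""
    if bf10 < 1:
        return "evidence favors H0 (report 1 / BF10 instead)"
    if bf10 < 3:
        return "not worth more than a bare mention"
    if bf10 < 10:
        return "substantial"
    if bf10 < 30:
        return "strong"
    if bf10 < 100:
        return "very strong"
    return "decisive"

print(jeffreys_label(2))  # "not worth more than a bare mention"
```

Note that a Bayes factor of 2, the example used above, falls in the weakest category: evidence barely worth mentioning.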

Several researchers have recommended Bayesian hypothesis tests (e.g., Berger & Delampady, 1987; Berger & Sellke, 1987; Edwards, Lindman, & Savage, 1963; see also Wagen-