Bayesian Versus Orthodox Statistics: Which Side Are You On? - Dienes - 2011 - Article

Researchers are often confused about what can be inferred from significance tests. One problem occurs when people apply Bayesian intuitions to significance testing: the two approaches must be kept firmly separate.

Psychology and other disciplines have benefited enormously from having a rigorous procedure for extracting inferences from data. But can we do that better than we do now? For the Bayesian approach, the practical problems have been largely solved, so there is little to stop researchers from using it in almost all circumstances.

Real research questions do not have pat answers, but see whether you nonetheless have clear preferences. Almost all responses are consistent either with some statistical approach or with what a large section of researchers do in practice. Three research scenarios show where your intuitions lie: (1) the stopping rule, (2) planned versus post hoc comparisons, and (3) multiple testing.

## What is the difference between Orthodox and Bayesian statistics?

The orthodox logic of statistics, as developed by Neyman and Pearson, starts from the assumption that probabilities are long-run relative frequencies. This requires an indefinitely large series of events that constitutes the collective; the probability of some property (q) occurring is then the proportion of events in the collective with property q. Long-run relative frequencies do not apply to the truth of individual theories because theories are not collectives: a theory is simply true or false. So, on this approach to probability, the null hypothesis of no population difference between two particular conditions cannot be assigned a probability; it is either true or false.

The logic of Neyman-Pearson (orthodox) statistics is to adopt decision procedures with known long-term error rates and then control those errors at acceptable levels. The error rate for false positives is called alpha (the significance level, conventionally set at .05), and the error rate for false negatives is called beta, where beta is 1 - power.
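The relation between alpha, beta, and power can be made concrete with a small calculation. The sketch below assumes a two-sided z-test with a normally distributed estimate; the function name and the example numbers are illustrative and do not come from the article.

```python
from statistics import NormalDist

def power_two_sided_z(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `effect`
    and the estimate has standard error `se` (normal model assumed)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = .05
    shift = effect / se                 # true effect in standard-error units
    # Probability that the test statistic falls in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical example: a true effect 2.8 standard errors from zero
power = power_two_sided_z(effect=2.8, se=1.0)
beta = 1 - power                        # long-run false negative rate

# With no true effect, the rejection rate is alpha itself
false_positive_rate = power_two_sided_z(effect=0.0, se=1.0)
```

Under these assumptions, power is a property of the decision procedure and of the effect size one specifies in advance, not of any single data set.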

The probability of a theory being true given data can be symbolized as P(theory | data), and that is what many of us would like to know. But this is the inverse of what orthodox statistics tells us, namely P(data | theory).
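A short numerical sketch shows why the two conditional probabilities should not be confused. It assumes one is willing to assign a prior probability to the null hypothesis, which is a Bayesian step that the orthodox approach does not allow; the prior values, alpha, and power below are purely illustrative.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """P(null | significant result), via Bayes' theorem.

    alpha = P(significant | null), power = P(significant | alternative).
    `prior_null` is an assumed prior probability that the null is true.
    """
    p_sig_and_null = alpha * prior_null
    p_sig_and_alt = power * (1 - prior_null)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)

# P(significant | null) is .05 in both cases, but P(null | significant)
# depends heavily on the assumed prior.
even_prior = prob_null_given_significant(prior_null=0.5)
sceptical_prior = prob_null_given_significant(prior_null=0.9)
```

The same significance level therefore corresponds to very different probabilities of the null, depending on inputs that a p value does not contain.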

When people directly infer a probability of the null hypothesis from a p value or significance level, they are violating the logic of Neyman Pearson statistics. Such people want to know the probability of theories and hypotheses. Neyman Pearson does not directly tell them that. Bayesian statistics starts from the premise that we can assign degrees of plausibility to theories, and what we want our data to do is tell us how to adjust these plausibilities.

## The likelihood

In the Bayesian approach, probability applies to the truth of theories. Thus, we can answer questions about p(H), the probability of a hypothesis being true (the prior probability), and also p(H|D), the probability of the hypothesis given data (the posterior probability), neither of which we can do in the orthodox approach. The probability of obtaining the exact data given the hypothesis, p(D|H), is the likelihood.

From Bayes' theorem comes the likelihood principle: all information relevant to inference contained in the data is provided by the likelihood. The likelihood is the probability of obtaining the exact data obtained given a hypothesis. This is different from a p value, which is the probability of obtaining the same or more extreme data given both a hypothesis and a decision procedure.

In orthodox statistics, the p values are changed according to the decision procedure: under what conditions one would stop collecting data, whether or not the test is post hoc, or how many other tests one conducted. So, orthodox statistics violates the likelihood principle.
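A standard textbook illustration (not taken from the article) makes the contrast concrete: the same coin-toss data, 9 heads and 3 tails, analysed under two different stopping rules. The sketch assumes a null hypothesis of a fair coin and an arbitrary alternative of P(heads) = 0.75; all names are hypothetical.

```python
from math import comb

heads, tails = 9, 3
n = heads + tails

def likelihood_fixed_n(p):
    """Likelihood when the plan was to toss exactly n = 12 times (binomial)."""
    return comb(n, heads) * p**heads * (1 - p)**tails

def likelihood_stop_at_third_tail(p):
    """Likelihood when the plan was to toss until the 3rd tail (negative binomial)."""
    return comb(n - 1, tails - 1) * p**heads * (1 - p)**tails

# One-sided p values under the null p = 0.5: "this many heads or more"
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
# At least 9 heads before the 3rd tail means at most 2 tails in the first 11 tosses
p_stop_rule = sum(comb(n - 1, k) for k in range(tails)) / 2**(n - 1)

# Likelihood ratio for p = 0.75 versus p = 0.5 under each stopping rule
ratio_fixed_n = likelihood_fixed_n(0.75) / likelihood_fixed_n(0.5)
ratio_stop_rule = likelihood_stop_at_third_tail(0.75) / likelihood_stop_at_third_tail(0.5)

print(f"p values: {p_fixed_n:.4f} vs {p_stop_rule:.4f}")
print(f"likelihood ratios: {ratio_fixed_n:.3f} vs {ratio_stop_rule:.3f}")
```

The p value falls on different sides of .05 depending on the stopping rule, while the likelihood ratio is identical, because the two likelihoods differ only by a constant that cancels.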

## The Bayes factor

The Bayes factor (B) allows us to consider the contrast between the orthodox and Bayesian approaches in detail. It pits one theory against another. Once data are collected, we can calculate the likelihood for each theory. These likelihoods are things we want researchers to agree on. B is the ratio of the two likelihoods, and it tells us how much the data should shift the odds in favor of one theory over the other:

posterior odds = B x prior odds

B automatically gives a notion of sensitivity: it directly distinguishes data supporting the null from data that are uninformative about whether the null or your theory is supported.
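The updating rule can be sketched in a few lines. The likelihood values and prior odds below are made-up numbers chosen only to show the arithmetic, and the helper names are hypothetical.

```python
def bayes_factor(likelihood_theory, likelihood_null):
    """B: how much more probable the data are under the theory than under the null."""
    return likelihood_theory / likelihood_null

def posterior_odds(b, prior_odds):
    """posterior odds = B x prior odds"""
    return b * prior_odds

def odds_to_probability(odds):
    """Convert odds in favor of the theory into a probability."""
    return odds / (1 + odds)

# Illustrative values only: data four times as probable under the theory
b = bayes_factor(likelihood_theory=0.08, likelihood_null=0.02)

# Two researchers with different prior odds update by the same factor
neutral = posterior_odds(b, prior_odds=1.0)      # started at 1:1
sceptic = posterior_odds(b, prior_odds=0.25)     # started at 1:4 against

neutral_probability = odds_to_probability(neutral)
sceptic_probability = odds_to_probability(sceptic)
```

The two researchers end with different posterior odds, but they agree on B, which is the part of the inference that depends on the data.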

1. The stopping rule: When researchers 'top up' subject numbers, most report the topped-up data set without taking into account that the initially planned number of subjects was lower. On the Neyman-Pearson approach, one must specify the stopping rule in advance. Typically, this means using a power calculation to plan how many subjects to run. Researchers might justify topping up because in their hearts they believe in the likelihood principle. With Bayes, you can run as many subjects as you like and stop when you like.
2. Planned versus post hoc comparisons: Many people may have treated post hoc results as if they had been predicted, because of their Bayesian intuitions, and so used the wrong tools for the right reasons. On the Neyman-Pearson approach, it matters whether you formulated your hypothesis before or after looking at the data (planned vs. post hoc comparisons). The likelihood principle contradicts not only Neyman-Pearson on this point but also the advice of Popper. On these views, the novelty of predictions is valued and the practice of HARKing (hypothesizing after the results are known) is criticized. Post hoc fitting can involve a preference for one auxiliary hypothesis over many others of at least equal plausibility.
3. Multiple testing: Whether one modifies the conclusions for one test of subliminal perception because other methods were also tested varies in practice, depending on how the author feels. There is no strict standard about what counts as a 'family' for the sake of multiple testing. On the Neyman-Pearson approach, one must correct for how many tests are conducted in a family of tests. The moral is that in assessing the evidence for or against a theory, one should take into account all the evidence relevant to the theory and not cherry-pick the cases that seem to support it. Cherry-picking is wrong on all statistical approaches.
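Why the stopping rule matters for orthodox error rates can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions (a true null, a two-sided z-test with known standard deviation, 20 planned subjects topped up with 10 more if the first test is not significant); the sample sizes and function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)   # two-sided critical value for alpha = .05

def significant(sample):
    """Two-sided z-test of 'population mean = 0' with known SD of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return abs(z) > Z_CRIT

def false_positive_rate(n_first=20, n_extra=10, n_sims=20000, seed=1):
    """Long-run rate of significant results when the null is true and
    researchers top up once after a non-significant first test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(n_first)]
        if not significant(sample):
            # Top up and test again without any correction
            sample += [rng.gauss(0, 1) for _ in range(n_extra)]
        hits += significant(sample)
    return hits / n_sims

rate = false_positive_rate()
print(f"false positive rate with one top-up: {rate:.3f}")
```

In this setup the long-run false positive rate comes out above the nominal .05, which is why the Neyman-Pearson approach requires the stopping rule to be fixed in advance. A Bayes factor computed on the same data does not depend on the stopping rule.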

## What is the rationality of the Bayesian approach?

One definition of rationality is having sufficient justification for one's beliefs, and another is that it is a matter of having subjected one's beliefs to critical scrutiny. Popper followed the latter definition and termed it critical rationalism. In this view there is never a sufficient justification for a given belief because knowledge has no absolute foundation. Critical rationalism bears some striking similarities to the orthodox approach to statistical inference - in this view the statistical inference cannot tell you how confident to be in different hypotheses; it only gives conventions for behavioral acceptance or rejection of different hypotheses, which, given a relevant statistical model, results in controlled preset long-term error rates.

One version of degrees of belief is subjective probability: a personal conviction in an opinion. When probabilities of different propositions form part of the inferential procedure we use in deriving conclusions from data, we need to make sure that the procedure is fair. For this reason, there have been attempts to specify objective probabilities that follow from the informational specifications of a problem.

In sum, one notion of rationality is having sufficient justification for one's beliefs. If one can assign numerical continuous degrees of justification to beliefs, then some simple minimal desiderata lead to the 'likelihood principle' of inference. Hypothesis testing violates the likelihood principle, indicating that some of the most deeply held intuitions we train ourselves to have as orthodox users of statistics are irrational on a key intuitive notion of rationality.

## The effect size

The typical use of statistics is often not influenced by a factor that is logically relevant to inference: the effect size. A problem in many areas is that researchers have been relating theories to statistics by using the wrong questions: 'is there a difference?' with the only acceptable answers being 'yes' and 'withhold judgment'.

Neyman developed two specific measures of sensitivity: power and confidence intervals. A confidence interval is the set of population values that the data are consistent with. It may include zero but must include other values too. However, theories and practical questions generally specify, even if vaguely, relevant effect sizes. And they must, if predictions of a difference are ever to be tested.
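A confidence interval can be computed directly from a mean and its standard error. The sketch below assumes a normal approximation; the example values are invented for illustration.

```python
from statistics import NormalDist

def confidence_interval(mean, se, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Hypothetical result: mean difference of 5 units with a standard error of 3
lower, upper = confidence_interval(mean=5.0, se=3.0)
includes_zero = lower <= 0.0 <= upper

print(f"95% CI: [{lower:.2f}, {upper:.2f}], includes zero: {includes_zero}")
```

Here the interval includes zero, so the result is non-significant, but it also includes effects of more than 10 units. Whether the data count against a theory therefore depends on what size of effect the theory predicts.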

Effect size is very important in the Neyman-Pearson approach: one must specify the sort of effect one predicts in order to calculate power. Fisherian significance testing, on the other hand, leads people to ignore effect sizes. To calculate a Bayes factor, by contrast, one must specify what sort of effect sizes a theory predicts. Despite some attempts to encourage researchers to use confidence intervals, their use has not taken off. Confidence intervals of some sort would deal with many problems.

## How to calculate the Bayes factor

To calculate a Bayes factor in support of a theory, one has to specify the probabilities of different effect sizes, given the theory. In terms of data, the Bayes factor calculator asks for a mean together with its standard error. In terms of the predictions of the theory, one has to decide what range of effects is relevant to the theory. The hard part is determining the best way to represent the predictions of a theory: which distribution, and with what parameters?
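The calculation described above can be sketched as follows. This is a minimal illustration, not the calculator referred to in the text: it assumes the data are summarized by a normally distributed mean with a known standard error, that the theory predicts every effect between `lower` and `upper` with equal probability (a uniform distribution), and that the null predicts an effect of exactly zero. The function name and example numbers are hypothetical.

```python
from statistics import NormalDist

def bayes_factor_uniform(mean, se, lower, upper, steps=2000):
    """Bayes factor for 'effect is uniform on [lower, upper]' versus 'effect is 0'."""
    # Density of the observed mean, viewed as a function of the true effect
    likelihood = NormalDist(mu=mean, sigma=se)
    width = (upper - lower) / steps
    # Average the likelihood over the effects the theory allows (midpoint rule)
    area = sum(likelihood.pdf(lower + (i + 0.5) * width) for i in range(steps)) * width
    likelihood_theory = area / (upper - lower)
    likelihood_null = likelihood.pdf(0.0)
    return likelihood_theory / likelihood_null

# Illustrative theory: the effect lies somewhere between 0 and 10 units
b_effect_found = bayes_factor_uniform(mean=5.0, se=2.0, lower=0.0, upper=10.0)
b_near_zero = bayes_factor_uniform(mean=0.0, se=2.0, lower=0.0, upper=10.0)

print(f"B when the mean is 5: {b_effect_found:.2f}")   # above 1: favors the theory
print(f"B when the mean is 0: {b_near_zero:.2f}")      # below 1: favors the null
```

Changing `lower` and `upper` changes B, which is the point of the paragraph above: the result depends on how the predictions of the theory are represented, so that choice has to be justified.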

## When to use the Bayes factor

Some researchers have suggested a 'default' Bayes factor to be used on any data where the null hypothesis is compared with a default theory, namely the theory that effects may occur in either direction, scaled to a large standardized effect size. But the Bayes factor is just one form of Bayesian inference, namely a method for evaluating one theory against another.

## Multiple testing and cheating

With the Bayes factor, one does not have to worry about corrections for multiple testing, stopping rules, or planned versus post hoc comparisons. But, you might insist, all these rules in orthodox statistics were there to stop cheating. For example, when different assumptions concerning the predictions of a theory lead to different Bayes factors, what is to stop a researcher from picking the best one?

Strictly, every Bayes factor is a completely accurate indication of the support the data give to one theory over another, where the theories are defined by the precise predictions they make, as we have represented them. The crucial question is which of these representations best matches the theory as the researcher has described it and related it to the existing literature. Note that there is nothing wrong with finding out which ways of representing predictions produce especially high Bayes factors. This is not cheating but determining possible constraints on theory.

## The weaknesses of the Bayesian approach

The strengths of Bayesian analyses are also their weaknesses:

1. Bayesian analyses force people to consider what a theory actually predicts, but specifying the predictions in detail may be contentious.
2. Bayesian analyses escape the paradoxes of violating the likelihood principle, but in so doing no longer control the Type I and Type II errors.

Ultimately, the issue is about what is more important to us: using a procedure with known long-term error rates or knowing the degree of support for our theory (the amount by which we should change our conviction in a theory). If we want to know the degree of evidence or support for our theory, then our reliance on orthodox statistics is irrational.
