What is the Bayes factor?

The Bayes factor (B) compares the probability of an experimental theory to the probability of the null hypothesis.
It gives the means of adjusting your odds in a continuous way.

  • If B is greater than 1, your data support the experimental hypothesis over the null
  • If B is less than 1, your data support the null over the experimental hypothesis
  • If B is about 1, then your experiment was not sensitive

For more information, look at the (free) summaries of 'Bayes and the probability of hypotheses' and 'Bayesian versus orthodox statistics: which side are you on?'

Supporting content
Bayes and the probability of hypotheses - summary of Chapter 4 of Understanding Psychology as a science by Dienes


Critical thinking
Chapter 4 of Understanding Psychology as a science by Dienes
Bayes and the probability of hypotheses

Objective probability: a long-run relative frequency.
Classic (Neyman-Pearson) statistics can tell you the long-run relative frequency of different types of errors.

  • Classic statistics do not tell you the probability of any hypothesis being true.

An alternative approach to statistics is to start with what Bayesians say are people’s natural intuitions.
People want statistics to tell them the probability of their hypothesis being right.
Subjective probability: the subjective degree of conviction in a hypothesis.


Subjective probability

Subjective or personal probability: the degree of conviction we have in a hypothesis.
Probabilities are in the mind, not in the world.

The initial problem to address in making use of subjective probabilities is how to assign a precise number to how probable you think a proposition is.
The initial personal probability that you assign to any theory is up to you.
Sometimes it is useful to express your personal convictions in terms of odds rather than probabilities.

Odds(theory is true) = probability(theory is true)/probability(theory is false)
Probability = odds/(odds +1)
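These conversions translate directly into code; a minimal sketch (the 80% conviction is an invented example):

```python
# Converting between a subjective probability and odds.
def odds(p):
    """Odds in favour of a theory you assign probability p of being true."""
    return p / (1 - p)

def probability(o):
    """Convert odds back into a probability: odds/(odds + 1)."""
    return o / (o + 1)

p = 0.8                      # say you are 80% convinced the theory is true
print(odds(p))               # odds of about 4:1 in favour
print(probability(odds(p)))  # recovers the original 0.8
```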

These numbers we get from deep inside us must obey the axioms of probability.
This is the stipulation that ensures the way we change our personal probability in a theory is coherent and rational.

  • People’s intuitions about how to change probabilities in the light of new information are notoriously bad.

This is where the statistician comes in and forces us to be disciplined.

There are only a few axioms, each more-or-less self-evidently reasonable.

  • Two axioms effectively set limits on what values probabilities can take.
    All probabilities will lie between 0 and 1
  • P(A or B) = P(A) + P(B), if A and B are mutually exclusive.
  • P(A and B) = P(A) x P(B|A)
    • P(B|A) is the probability of B given A.
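The axioms can be checked by brute enumeration. A sketch using a fair six-sided die (an invented example, not from the text):

```python
from fractions import Fraction

outcomes = range(1, 7)  # a fair six-sided die: six equally likely outcomes

def P(event):
    """Probability of an event as the fraction of outcomes satisfying it."""
    return Fraction(sum(1 for o in outcomes if event(o)), 6)

A = lambda o: o <= 2         # rolling 1 or 2
B = lambda o: o == 6         # rolling 6: mutually exclusive with A
even = lambda o: o % 2 == 0  # rolling an even number

# P(A or B) = P(A) + P(B) when A and B are mutually exclusive
assert P(lambda o: A(o) or B(o)) == P(A) + P(B)

# P(A and even) = P(A) x P(even|A); given A, only the roll of 2 is even
P_even_given_A = Fraction(sum(1 for o in outcomes if A(o) and even(o)),
                          sum(1 for o in outcomes if A(o)))
assert P(lambda o: A(o) and even(o)) == P(A) * P_even_given_A
```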

Bayes’ theorem

H is the hypothesis
D is the data

P(H and D) = P(D) x P(H|D)
P(H and D) = P(H) x P(D|H)

so

P(D) x P(H|D) = P(H) x P(D|H)

Moving P(D) to the other side

P(H|D) = P(D|H) x P(H) / P(D)

This last one is Bayes' theorem.
It tells you how to go from one conditional probability to its inverse.
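A worked numeric sketch (all numbers are invented for illustration): suppose your prior is P(H) = 0.5 and the data are more probable under H than under its negation.

```python
# Hypothetical numbers, chosen only to illustrate the update.
P_H = 0.5             # prior probability of the hypothesis
P_D_given_H = 0.8     # probability of the data if H is true
P_D_given_notH = 0.3  # probability of the data if H is false

# P(D) by the law of total probability
P_D = P_D_given_H * P_H + P_D_given_notH * (1 - P_H)

# Bayes' theorem: P(H|D) = P(D|H) x P(H) / P(D)
P_H_given_D = P_D_given_H * P_H / P_D
print(round(P_H_given_D, 3))  # 0.727: the data raised the probability of H
```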
We can simplify this equation if we are interested in comparing the probability of different hypotheses given the same data D.
Then P(D) is just a constant for all these comparisons.

P(H|D) is proportional to P(D|H) x P(H)

P(H) is called the prior.
It is how probable you thought the hypothesis was prior to collecting data.
It is your personal subjective probability and its value is completely up to you.

P(H|D) is called the posterior.
It is how probable your hypothesis is to you, after you have collected data.

P(D|H) is called the likelihood of the hypothesis
The probability of obtaining the data, given your hypothesis.

  • Your posterior is proportional to the likelihood times the prior.

This tells you how you can update your prior probability in a hypothesis given some data.
Your prior can be up to you, but having settled on it, the posterior is determined by the axioms of probability.
From the Bayesian perspective, scientific inference consists precisely in updating one’s personal conviction in a hypothesis in the light of data.

The likelihood

According to Bayes’ theorem, if you want to update your personal probability in a hypothesis, the likelihood tells you everything you need to know about the data.

  • Posterior is proportional to likelihood times prior

The likelihood principle: the notion that all the information relevant to inference contained in data is provided by the likelihood.

The data could be obtained given many different population proportions, but the data are more probable for some population proportions than others.

The highest likelihood is not the same as the highest probability.

  • The probability of the hypothesis in the light of the data is P(H|D), which is our posterior.
  • The likelihood of the hypothesis is the probability of the data given the hypothesis P(D|H)

We can use the likelihood to obtain our posterior, but they are not the same.
Just because a hypothesis has the highest likelihood, it does not mean you will assign the highest posterior probability.

  • The fact that a hypothesis has the highest likelihood means the data support that hypothesis most.
  • If the prior probabilities for each hypothesis were the same, then the hypothesis with the highest likelihood will have the highest posterior probability.
    • But the prior probabilities may mean that the hypothesis with the greatest support from the data, does not have the highest posterior probability.

Probability density distribution: the distribution used when the dependent variable can be assumed to vary continuously.
A likelihood could be (or be proportional to) a probability density as well as a probability.

In significance testing, we calculate a form of P(D|H).
But, the P(D|H) used in significance testing is conceptually very different from the likelihood, the P(D|H) we are dealing with here.

  • The p-value in significance testing is the probability of obtaining data as extreme or more extreme than those observed, given the null is really true.
    • P(obtaining data as extreme or more extreme than D|H0)
    • In calculating a significance value, we hold fixed the hypothesis under consideration, H0, and we vary the data we might have obtained.
  • The likelihood is P(obtaining exactly this D|H)
    • Here H is free to vary, but the D considered is always exactly the data obtained.
  • In calculating the likelihood, we are interested in the height of the curve for each hypothesis
    • It reflects just what the data were
  • In significance testing, we are interested in the ‘tail area’
    • This area is the probability of obtaining our data or data more extreme
  • In significance testing, we make a black and white decision
  • Likelihoods give a continuous graded measure of support for different hypotheses
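The tail-area versus curve-height contrast can be made concrete. A sketch assuming a normal distribution for the sample mean; the observed value 1.8 and the candidate population means are invented numbers:

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_tail(x, mu, sigma):
    """P(X >= x): the one-tailed area under the normal curve."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Observed sample mean 1.8 with standard error 1, under H0: mu = 0
obs, se = 1.8, 1.0

# Significance testing looks at the tail area under H0 (data as or more extreme):
p_value = normal_tail(obs, 0, se)  # roughly 0.036, one-tailed

# The likelihood looks at the curve's height at exactly the observed data,
# while the hypothesis is free to vary:
like_H0 = normal_pdf(obs, 0, se)   # likelihood of mu = 0
like_H1 = normal_pdf(obs, 2, se)   # likelihood of mu = 2 (a hypothetical value)
print(like_H1 > like_H0)           # True: the data support mu = 2 over mu = 0
```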

In significance testing, tail areas are calculated in order to determine long-run error rates.
The aim of classic statistics is to come up with a procedure for making decisions that is reliable, which is to say that the procedure has known, controlled long-run error rates.
To decide the long-run error rates, we need to define a collective.

Bayesian analysis

Bayes’ theorem says that posterior is proportional to likelihood times prior.
We can use this in two ways when dealing with real data

  • We can calculate a credibility interval
    • Credibility interval: the Bayesian equivalent of a confidence interval
  • We can calculate how to adjust our odds in favour of a theory we are testing over the null hypothesis in the light of our experimental data
    • The Bayes factor: the Bayesian equivalent of null hypothesis testing

Credibility intervals

Flat prior or uniform prior: you have no idea what the population value is likely to be

In choosing a prior decide:

  • Whether your prior can be approximated by a normal distribution
    • if so, what the mean of this distribution is
    • if so, what the standard deviation of this distribution is

Formulae for normal posterior:

  • Mean of prior = M0
  • Mean of sample = Md
  • Standard deviation of prior = S0
  • Standard error of sample = SE
  • Precision of prior: c0 = 1/S0^2
  • Precision of sample: cd = 1/SE^2
  • Posterior precision: c1 = c0 + cd
  • Posterior mean: M1 = (c0/c1)M0 + (cd/c1)Md
  • Posterior standard deviation: S1 = square root(1/c1)
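The formulae above translate directly into code; a sketch with invented numbers, using a prior with a large standard deviation (i.e. a fairly diffuse prior):

```python
import math

def normal_posterior(M0, S0, Md, SE):
    """Combine a normal prior (mean M0, sd S0) with normally distributed data
    (mean Md, standard error SE), following the precision-weighting formulae."""
    c0 = 1 / S0 ** 2                       # precision of prior
    cd = 1 / SE ** 2                       # precision of sample
    c1 = c0 + cd                           # posterior precision
    M1 = (c0 / c1) * M0 + (cd / c1) * Md   # precision-weighted posterior mean
    S1 = math.sqrt(1 / c1)                 # posterior standard deviation
    return M1, S1

# Diffuse prior (S0 = 100): the posterior is dominated by the likelihood
M1, S1 = normal_posterior(M0=0, S0=100, Md=5, SE=2)
print(round(M1, 2), round(S1, 2))  # close to the sample values 5 and 2
```

A 95% credibility interval is then roughly M1 ± 1.96 × S1.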

For a reasonably diffuse prior (one representing fairly vague prior opinions), the posterior is dominated by the likelihood.
If you started with a flat or uniform prior (you have no opinion concerning which values are most likely), the posterior would be identical to the likelihood.
Even if people started with very different priors, then as long as the priors were smooth and allowed some non-negligible probability in the region of the true population value, collecting enough data would make the posteriors, being dominated by the likelihood, come to be very similar.

If the prior and likelihood are normal, the posterior is also normal.
Having found the posterior distribution, you have really found out all you need to know.

The credibility interval is affected by any prior information you had, but not by all the things that affect the confidence interval.

The Bayes factor

There is no such thing as significance testing in Bayesian statistics.
Often, all one has to do as a Bayesian statistician is determine posterior distributions.
With the Bayes factor you can compare the probability of an experimental theory to the probability of the null hypothesis.

H1 is your experimental hypothesis
H0 is the null hypothesis

P(H1|D) is proportional to P(D|H1) x P(H1)
P(H0|D) is proportional to P(D|H0) x P(H0)

P(H1|D)/P(H0|D) = [P(D|H1)/P(D|H0)] x [P(H1)/P(H0)]
Posterior odds = likelihood ratio x prior odds

The likelihood ratio is (in this case) called the Bayes factor B in favour of the experimental hypothesis.
Whatever your prior odds were in favour of the experimental hypothesis over the null, after data collection multiply those odds by B to get your posterior odds.

  • If B is greater than 1, your data support the experimental hypothesis over the null
  • If B is less than 1, your data support the null over the experimental hypothesis
  • If B is about 1, then your experiment was not sensitive

The Bayes factor gives the means of adjusting your odds in a continuous way.
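A minimal sketch of this update (the prior odds and B values are invented numbers):

```python
# Hypothetical illustration: updating odds in a theory with a Bayes factor.
prior_odds = 0.5  # you thought the null twice as likely as the theory (1:2)
B = 6.0           # data are 6 times more probable under the theory than the null

posterior_odds = B * prior_odds  # the theory is now 3 times as likely as the null
posterior_prob = posterior_odds / (posterior_odds + 1)
print(posterior_odds, round(posterior_prob, 2))  # 3.0 0.75
```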

Bayesian Versus orthodox statistics: which side are you on? - summary of an article by Dienes, 2011


Critical thinking
Article: Dienes, Z, 2011
Bayesian Versus orthodox statistics: which side are you on?
doi: 10.1177/1745691611406920


The contrast: orthodox versus Bayesian statistics

The orthodox logic of statistics, starts from the assumption that probabilities are long-run relative frequencies.
A long-run relative frequency requires an indefinitely large series of events that constitutes a collective; the probability of some property (q) occurring is then the proportion of events in the collective with property q.

  • The probability applies to the whole collective, not to any one person.
    • One person may belong to two different collectives that have different probabilities
  • Long run relative frequencies do not apply to the truth of individual theories because theories are not collectives. They are just true or false.
    • Thus, when using this approach to probability, the null hypothesis of no population difference between two particular conditions cannot be assigned a probability.
  • Given both a theory and a decision procedure, one can determine a long-run relative frequency with which certain data might be obtained. We can symbolize this as P(data| theory and decision procedure).

The logic of Neyman Pearson (orthodox) statistics is to adopt decision procedures with known long-term error rates and then control those errors at acceptable levels.

  • Alpha: the error rate for false positives, the significance level
  • Beta: the error rate for false negatives

Thus, setting significance and power controls long-run error rates.

  • An error rate can be calculated from the tail area of test statistics.
  • An error rate can be adjusted for factors that affect long-run error rates
  • These error rates apply to decision procedures, not to individual experiments.
    • An individual experiment is a one-time event, so does not constitute a long-run set of events
    • A decision procedure can in principle be considered to apply over an indefinitely long run of experiments.

The probabilities of data given theory and theory given data

The probability of a theory being true given data can be symbolized as P(theory|data).
This is not what orthodox statistics tell us; they provide P(data|theory and decision procedure).
One cannot infer one conditional probability just by knowing its inverse, so P(theory|data) remains unknown.

Bayesian statistics starts from the premise that we can assign degrees of plausibility to theories, and what we want our data to do is to tell us how to adjust these plausibilities.

  • When we start from this assumption, there is no longer a need for the notion of significance, p value, or power.
  • Instead, we simply determine the factor by which we should change the probability of different theories given the data.

The likelihood

In the Bayesian approach, probability applies to the truth of theories.
We can answer the questions about:

  • p(H), the probability of a hypothesis being true (our prior probability)
  • p(H|D), the probability of a hypothesis given the data (our posterior probability).

Neither of these can be obtained using the orthodox approach.
Likelihood: the probability of obtaining the exact data given the hypothesis.

Posterior is given by likelihood times prior.

The likelihood principle: all information relevant to inference contained in data is provided by the likelihood.
When we are determining how given data changes the relative probability of our different theories, it is only the likelihood that connects the prior to the posterior.

The likelihood is the probability of obtaining the exact data obtained given a hypothesis, P(D|H).
This is different from a p value, which is the probability of obtaining the same or more extreme data given both a hypothesis and a decision procedure.

  • A p-value for a t test is a tail area of the t distribution
  • The corresponding likelihood is the height of the distribution at the point representing the data

In orthodox statistics, p values are changed according to the decision procedure: under what conditions one would stop collecting data, whether or not the test is post hoc, how many other tests one conducted.
None of these factors influences the likelihood.

The Bayes factor

The Bayes factor pits one theory against another.

Prior probabilities and prior odds can be entirely personal and subjective.
There is no reason why people should agree about these before data are collected if they are not part of the publicly presented inferential procedure.
If the priors form part of the inferential procedure, they must be fairly produced and subjected to the tribunal of peer judgement.

Once data are collected, we can calculate the likelihood for each theory.
These likelihoods are things we want researchers to agree on; any probabilities that contribute to them should be plausibly or simply determined by the specification of the theories.
The Bayes factor (B): the ratio of likelihoods.

Posterior odds = B x prior odds.

  • If B is greater than 1, the data supported your experimental hypothesis over the null.
  • If B is less than 1, the data supported the null hypothesis over the experimental one.
  • If B is about 1, the experiment was not sensitive.

The evidence is continuous and there are no thresholds in Bayesian theory.
B automatically gives a notion of sensitivity: it directly distinguishes data supporting the null from data uninformative about whether the null or your theory was supported.

For both the p value associated with a t test and for B, if the null is false, then as the number of subjects increases, test scores are driven in one direction.

  • p values are expected to become smaller
  • Both t and B values are expected to become larger

When the null hypothesis is true, p values are not driven in any direction; only B is. B is then driven to zero.

Problems with the Neyman Pearson approach

Stopping rule

In the Neyman Pearson approach, one must specify the stopping rule in advance.
Once those conditions are met, there is to be no more data collection.
Typically, this means one should use a power calculation to plan in advance how many subjects to run.

The Bayes factor behaves differently from p values as more data are run (regardless of stopping rule).

  • For a p value, if the null is true, any value in the interval 0 to 1 is equally likely no matter how much data you collect
    • For this reason, sooner or later, you are guaranteed to get a significant result if you run subjects long enough and stop when you get the p value you want
    • When the null is true, as the number of subjects increases, the p value is not driven to any particular value.
  • As the number of subjects increases and the null is true, the Bayes factor is driven toward zero.

Planned versus post hoc comparisons

When using Neyman Pearson, it matters whether you formulated your hypothesis before or after looking at the data (post hoc vs. planned comparisons).
Predictions made before rather than after looking at the data are treated differently.

  • Post hoc fitting can involve preference for one auxiliary over many others of at least equal plausibility.

In Bayesian inference, the evidence for a theory is just as strong regardless of its timing relative to the data.
This is because the likelihood is unaffected by the time the data were collected.
The likelihood principle follows from the axioms of probability.
It is not the ability to predict in advance per se that is important, that ability is just an (imperfect) indicator of the prior probability of relevant hypotheses.
When performing Bayesian inference, there is no need to adjust for the timing of predictions per se.

Multiple testing

When using Neyman Pearson, one must correct for how many tests are conducted in a family of tests.

When using Bayes, it does not matter how many other statistical hypotheses are investigated. All that matters is the data relevant to each hypothesis under investigation.
Once one takes into account the full context, the axioms of probability lead to sensible answers.

It is the Bayes approach, rather than the Neyman Pearson approach, that is most likely to demand that researchers draw appropriate conclusions from a body of relevant data involving multiple testing.

The rationality of the Bayesian approach

If we want to determine by how much we should revise continuous degrees of belief, we need to make sure our system of inference obeys the axioms of probability.
If researchers want to think in terms of degree of support data provide for a hypothesis, they should make sure their inferences obey the axioms of probability.

One version of degrees of belief is subjective probabilities.
Subjective probabilities: personal convictions in an opinion.
When probabilities of different propositions form part of the inferential procedure we use in deriving conclusions from data, then we need to make sure that the procedure is fair.
Thus, there has been an attempt to specify objective probabilities that follow from the informational specification of a problem.
In this way, the probabilities become an objective part of the problem, with values that can be argued about, given the explicit assumptions, and that do not depend any further on personal idiosyncrasies.

One notion of rationality is having sufficient justification for one’s beliefs.
If one can assign numerical continuous degrees of justification to beliefs, then some simple minimal desiderata lead to the likelihood principle of inference.
Hypothesis testing violates the likelihood principle.

Effect size

Bayes factors demand consideration of relevant effect sizes.

Neyman developed two specific measures of sensitivity:

  • Power
  • Confidence intervals: the set of population values that the data are consistent with.

For any continuous measure based on a finite number of subjects, an interval cannot be an infinitesimally small point.
A null result is always consistent with population values other than zero.
That is why a non-significant result cannot on its own lead to the conclusion that the null hypothesis is true.

Theories and practical questions generally specify, even if vaguely, relevant effect sizes.
The research context usually provides a range of effects that are too small to be relevant and a range of effects that are consistent with theory or practical use.

Researchers have relevant intuitions, and that is why it has made sense to them to assert null hypotheses.
Bayes makes them explicit.
If we want to use null results in any way to count against theories that predict an effect, we must consider the range of effect sizes consistent with the theory.

Effect size is very important in the Neyman Pearson approach.

  • One must specify the sort of effect one predicts in order to calculate power.

On the other hand, Fisherian significance testing leads people to ignore effect sizes.

  • People have followed Fisher’s method, while paying lip service to effect sizes, but not heeding Fisher’s advice that nothing follows from a null result.

One must specify what sort of effect sizes a theory predicts to calculate a Bayes factor.
Because it takes into account effect size, the Bayes factor distinguishes evidence that there is no relevant effect from no evidence of a relevant effect.
One can only confirm a null hypothesis when one has specified the effect size expected on the theory being tested.

In specifying theoretically expected effect sizes, we should ask ourselves "What size effect does the literature suggest is interesting for this particular domain?" rather than following the common practice of plucking a standardized effect size of 0.5 out of thin air; researchers should get to know the data of the field.

Confidence intervals themselves have all the problems of Neyman Pearson inference in general (unlike credibility or likelihood intervals).
Because confidence intervals consist of all values non-significantly different from the sample mean, they inherit the arbitrariness of significance testing.

How to calculate a Bayes factor

To calculate a Bayes factor in support of a theory, one has to specify what the probability of different effect sizes are, given the theory.
Bayes gives us the apparatus to flexibly deal with different degrees of uncertainty regarding the predicted effect size.
Logically, one needs to know what a theory predicts in order to know how much it is supported by evidence.

Three distributions

In terms of predictions of the theory (or requirements of a practical effect), one has to decide what range of effects are relevant to the theory.
Three ranges:

  • A uniform distribution
    All values between a lower bound and an upper bound.
    All values within the bounds are possible and equally likely given the theory; all values outside are inconsistent with it
  • A normal distribution
    One value is the most likely given the theory, and any values lower or higher are progressively less likely
  • Normal distribution centred on zero with only one tail
    The theory predicts an effect in one direction, but smaller values are generally more likely than larger values
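As a rough numerical sketch of the third option, one can integrate the likelihood over a half-normal distribution of predicted effects and compare the result with the likelihood under the null. This assumes a normal likelihood for the sample mean; the function name, numbers, and grid settings are illustrative assumptions, not a definitive implementation:

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_factor_half_normal(obs_mean, se, sd_theory, steps=2000):
    """B in favour of a theory whose predicted effects follow a half-normal
    centred on zero (one tail, sd = sd_theory), against a null of exactly zero.
    Uses simple midpoint integration over the predicted effect sizes."""
    upper = 5 * sd_theory       # integrate far enough into the tail
    step = upper / steps
    like_theory = 0.0
    for i in range(steps):
        effect = (i + 0.5) * step
        weight = 2 * normal_pdf(effect, 0, sd_theory)  # half-normal density
        like_theory += weight * normal_pdf(obs_mean, effect, se) * step
    like_null = normal_pdf(obs_mean, 0, se)
    return like_theory / like_null

# Invented data: observed mean 5, standard error 2, predicted effects of scale 5
B = bayes_factor_half_normal(obs_mean=5, se=2, sd_theory=5)
print(round(B, 1))  # B well above 1: these data favour the theory over the null
```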

Different ways of using Bayes factors

The Bayes factor can be used on any data where the null hypothesis is compared with a default theory; when inference is based on the posterior and thus takes into account the priors of the hypotheses; or for specific hypotheses that interest the researcher, allowing priors to remain personal and not part of public inference.

By following Bayes rule, each of these approaches means rational answers are provided for the given assumptions, and researchers may choose each according to their goals and which assumptions seem relevant to them.

Bayes factors are just one form of Bayesian inference, namely a method for evaluating one theory against another.

Multiple testing and cheating

With Bayes factors, one does not have to worry about corrections for multiple testing, stopping rules, or planned versus post hoc comparisons.
Bayes factor just tells you how much support given data provides for one theory over another.
There is no right Bayes factor.

Strictly, each Bayes factor is a completely accurate indication of the support the data provide for one theory over another.
The theories are defined by the precise predictions they make.
The crucial question is which of these representations best matches the theory as the researcher has described it and related it to the existing literature.

One constraint on the researcher will be the demand for consistency: arguing for one application of a theory ties one’s hands when it comes to another application.
One solution is to use a default Bayes factor for all occasions, though this amounts to evaluating a default theory for all occasions, regardless of one's actual theory.
A default Bayes factor will only test your theory if it happens to correspond to the default.
Another solution is to define the predictions according to simple procedures to ensure the theory proposed is tested according to fair criteria.

When using Bayes in multiple testing, one can use the fact that one is testing multiple hypotheses to inform the results if one believes that testing these multiple hypotheses is relevant to the probability of any of them being true.

Weaknesses of the Bayesian approach

  • Bayesian analyses force people to consider what a theory actually predicts, but specifying the predictions in detail may be contentious.
  • Bayesian analyses escape the paradoxes of violating the likelihood principle, but in doing so they no longer control for Type I and Type II errors.

Calculating a Bayes factor depends on answering the following question about which there may be disagreement: What way of assigning probability distributions of effect sizes as predicted by theories would be accepted by protagonists on all sides of a debate?

Ultimately, the issue is about what is more important to us: using a procedure with known long-term error rates or knowing the degree of support for our theory.

WSRt, critical thinking - a summary of all articles needed in the fourth block of second year psychology at the uva


This is a summary of the articles and reading materials that are needed for the fourth block in the course WSR-t. This course is given to second year psychology students at the Uva. The course is about thinking critically about how scientific research is done and how this could be done differently.
