Critical thinking

Article: Dienes (2003)

Neyman, Pearson and hypothesis testing

In this article, we will consider the standard logic of statistical inference.

Statistical inference: the logic underlying all the statistics you see in the professional journals of psychology and most other disciplines that regularly use statistics.

The underlying logic of statistics (Neyman-Pearson) is highly controversial, frequently attacked (and defended) by statisticians and philosophers, and even more frequently misunderstood.

The meaning of probability we choose determines what we can do with statistics.

The proper way of interpreting probability remains controversial, so there is still debate over what can be achieved with statistics.

The Neyman-Pearson approach follows from one particular interpretation of probability. The Bayesian approach considered later follows from another.

Interpretations often start with a set of axioms that probabilities must follow.

Two interpretations of probability:

- the subjective interpretation: a probability is a degree of conviction in a belief
- the objective interpretation: probability is located in the world.

The most influential objective interpretation of probability is the long-run relative frequency interpretation. Here, probability is a relative frequency.

Because the long-run relative frequency is a property of all the events in the collective, it follows that a probability applies to a collective, not to any single event.

A single event could be a member of different collectives. So a singular event does not have a probability, only collectives do.

Objective probabilities do not apply to single cases. They also do not apply to the truth of hypotheses.

A hypothesis is simply true or false, just as a single event either occurs or does not.

A hypothesis is not a collective, it therefore does not have an objective probability.

Data = D

Hypothesis = H

P(H|D) is the inverse of the conditional probability P(D|H). Inverting conditional probabilities makes a big difference.

P(A|B) can have a very different value from P(B|A).

Knowing P(D|H) does not mean you know what P(H|D) is.

There are two reasons for this:

- inverse conditional probabilities can have very different values
- in any case, it is meaningless to assign an objective probability to a hypothesis.
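The first point can be made concrete with a toy base-rate example. The numbers below (a rare disease and an imperfect diagnostic test) are hypothetical, chosen only to show how far P(D|H) and P(H|D) can diverge once base rates enter via Bayes' theorem:

```python
# Toy illustration (hypothetical numbers): a test with a high
# P(positive | disease) can still have a low P(disease | positive)
# when the disease is rare.
p_disease = 0.001            # base rate of the hypothesis "has disease"
p_pos_given_disease = 0.99   # P(D|H): probability of the data given the hypothesis
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive result
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ≈ 0.019, nowhere near 0.99
```

Note that computing the inversion at all required P(H), the prior probability of the hypothesis, which is exactly what the objective interpretation refuses to assign.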

Statistics cannot tell us how much to believe a certain hypothesis. What we can do, according to Neyman and Pearson, is set up decision rules for certain behaviours such that in following those rules in the long run we will not often be wrong. We can work out what the error rates are for certain decision procedures and we can choose procedures that control the long-run error rates at acceptable levels.

Decision rules work by setting up two contrasting hypotheses.

For a given experiment we can calculate p = P(obtaining a t as extreme as, or more extreme than, the one obtained | H_{0}).

If p is less than alpha, the significance level we decided on in advance, we reject H_{0}. By following this rule we know that, in the long run, when H_{0} is actually true we will conclude it is false only alpha (e.g. 5%) of the time.
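This long-run guarantee can be checked by simulation. The sketch below (assuming a two-sided z-test with known standard deviation, for simplicity) repeatedly samples from a population where the null is true and counts how often the rule rejects:

```python
import random
from statistics import NormalDist

# Sketch: simulate the long-run Type I error rate of a two-sided z-test.
# When H0 is true (true mean 0, known sd 1), rejecting whenever |z|
# exceeds the critical value for alpha = .05 should be wrong about
# 5% of the time.
random.seed(1)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96
n, sims = 30, 20_000

rejections = 0
for _ in range(sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n ** 0.5)    # z = mean / (sd / sqrt(n))
    if abs(z) > z_crit:
        rejections += 1

type_i_rate = rejections / sims
print(type_i_rate)  # close to 0.05 across many repetitions
```

The error rate is a property of the procedure applied indefinitely, not of any single experiment, which is the point of the next paragraphs.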

In this procedure, the p-value has no meaning in itself. It is just part of a convenient mechanical procedure for accepting or rejecting a hypothesis.

Alpha is an objective probability, a relative long-run frequency.

It is the proportion of errors of a certain type we will make in the long run, if we follow the above procedure and the null hypothesis is in fact true.

Neither alpha nor our calculated p tells us how probable the null hypothesis is.

Alpha: the long-term error rate for one type of error: saying the null is false when it is true.

There are two ways of making an error with the decision procedure.

- Type I error: when the null is true and we reject it

In the long run, when the null is true, we will make a Type I error in alpha proportion of our decisions.

- Type II error: accepting the null when it is false

In the long run, when the null is false, the proportion of times we nonetheless accept it is labelled beta.

Both alpha and beta should be controlled at acceptable levels.

Sometimes significance or alpha is defined simply as ‘the probability of a Type I error’. This is wrong.

Alpha is specifically the probability (long-run frequency) of a Type I error when the null hypothesis is true.

Strictly, using a significance level of 5% does not guarantee that only 5% of all published significant results are in error.

Controlling for alpha does not mean you have controlled for beta.

Power is 1 – β

Power is the probability of detecting an effect, given an effect really exists in the population.

In order to control β, you need to:

- Estimate the size of effect you think is interesting, given your theory is true.
- Estimate the amount of noise your data will have

The more participants you run, the greater the power.

Studies should systematically use power calculations to determine the number of participants.
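Such a power calculation can be sketched with the standard normal approximation. The function below is an illustrative sketch, not the exact t-test formula: it takes the minimally interesting standardized effect size (Cohen's d, your estimate of effect over noise) and returns an approximate group size for a two-sided, two-sample test:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate participants per group for a two-sided two-sample z-test.

    Uses the normal-approximation formula
        n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    a sketch of the idea rather than an exact t-test power analysis.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical z for alpha
    z_beta = NormalDist().inv_cdf(power)           # z corresponding to power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # medium effect, alpha = .05, power = .8 → 63 per group
```

Smaller interesting effects demand many more participants: `n_per_group(0.2)` is several hundred per group, which is why stating the minimally interesting effect size in advance matters.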

Significance of 5% means that, if the null hypothesis were true, one would expect 5% of studies to be significant.

Meta-analysis: the process of combining groups of studies together to obtain overall tests of significance.
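One simple way to combine significance tests (an illustrative choice; the article does not prescribe a particular method) is Stouffer's method, which converts each study's one-tailed p-value to a z-score and tests the average:

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combined_p(p_values):
    """Combine one-tailed p-values from independent studies (Stouffer's method).

    Each p is converted to a z-score; the summed z divided by sqrt(k) is
    again standard normal under the joint null, giving an overall test.
    """
    nd = NormalDist()
    zs = [nd.inv_cdf(1 - p) for p in p_values]
    z_combined = sum(zs) / sqrt(len(zs))
    return 1 - nd.cdf(z_combined)

# Three individually non-significant studies can be jointly significant:
combined = stouffer_combined_p([0.10, 0.08, 0.12])
print(round(combined, 4))  # below 0.05
```

This is why a set of null results, pooled, may tell you to reject the null rather than accept it.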

A set of null results does not mean you should accept the null; combined, they may even indicate that you should reject it.

If your study has low power, getting a null result tells you nothing in itself.

You would expect a null result whether or not the null hypothesis was true.

In the Neyman-Pearson approach, you set power at a high level in designing the experiment, before you run it. Then you are entitled to accept the null hypothesis when you obtain a null result. Doing this procedure you will make errors at a small controlled rate, a rate you have decided in advance is acceptable for you.

Statistics never allows absolute proof or disproof.

Sensitivity can be determined in three ways:

- power
- confidence intervals
- finding an effect significantly different from another reference one.

Whenever you find a null result and it is interesting to you that the result is null, you should always indicate the sensitivity of your analysis.

The conditions under which you will stop collecting data for a study define the stopping rule you use.

- The standard Neyman-Pearson stopping rule is to use power calculations in advance of running the study to determine how many participants should be run to control power at a predetermined level. Then run that number of subjects.
- Both alpha and beta can then be controlled at known acceptable levels.

- Another legitimate stopping rule involves the use of confidence intervals.

In the Neyman-Pearson approach it is essential to know the collective or reference class for which we are calculating our objective probabilities alpha and beta.

The relevant collective is defined by a testing procedure applied an indefinite number of times.

In the Neyman-Pearson approach, in order to control overall Type I error, if we perform a number of tests we need to test each one at a stricter level of significance in order to keep overall alpha at 0.05. There are numerous corrections.
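The simplest such correction is Bonferroni: to keep the familywise Type I error at alpha across k tests, test each one at alpha / k. A minimal sketch, with hypothetical p-values:

```python
# Bonferroni correction: each of k tests is held to alpha / k so that the
# overall (familywise) Type I error rate stays at alpha.
alpha, k = 0.05, 4
per_test_alpha = alpha / k

p_values = [0.030, 0.011, 0.240, 0.004]  # hypothetical p-values
decisions = [p < per_test_alpha for p in p_values]
print(per_test_alpha, decisions)  # 0.0125 [False, True, False, True]
```

Note that 0.030 would count as significant on its own at the 0.05 level but fails the corrected threshold, which is exactly the situation the next paragraph describes.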

A researcher might mainly want to look at one particular comparison, but throw in some other conditions out of curiosity while going to the effort of recruiting, running and paying participants. It might then feel unfair that the required significance level is made stricter just because you collected other conditions you did not need to have.

The solution is that if you planned one particular comparison in advance then you can test at the 0.05 level, because that one was picked out in advance of seeing the data.

But, the other tests must involve a correction.

Alpha is an objective probability and hence a property of a collective and not any individual event, not a particular sample.

In the Neyman-Pearson approach, the relevant probabilities alpha and beta are the long-run error rates you decide are acceptable and so must be set in advance.

If alpha is set at 0.05, the only meaningful claim to make about the p-value of a particular experiment is that it is either less than 0.05 or not.

The statistics tell you nothing about how confident you should be in a hypothesis nor what strength of evidence there is for different hypotheses.

It is hard to construct an argument for why p-values should be taken as strength of evidence per se. Conceptually, the strength of evidence for or against a hypothesis is distinct from the probability of obtaining such evidence.

There is no need to force p-values into the role of measuring strength of evidence, a role in which they may often give a reasonable answer, but not always.

Significance is not a property of populations.

Hypotheses are about population properties. Significance is not a property of population means or differences.

Decision rules are laid down before data are collected; we simply make black and white decisions with known risks of error.

A more significant result does not mean a more important result, or a larger effect size.

The Neyman-Pearson approach is not just about null hypothesis testing.

Neyman also developed the concept of confidence interval, a set of possible population values the data are consistent with.

Instead of saying merely we reject one value, one reports the set of values rejected, and the set of possible values remaining.

To calculate the 95% confidence interval, find the set of all population values that are non-significantly different from your sample value at the 5% level.
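As a sketch (with hypothetical sample numbers, and a z-based interval with known standard deviation for simplicity), the interval below is exactly that set of hypothesized population means a two-sided 5% test would not reject:

```python
from statistics import NormalDist

# Hypothetical sample: mean 10.4, known sd 2.0, n = 25.
sample_mean, sd, n = 10.4, 2.0, 25
se = sd / n ** 0.5                    # standard error of the mean
z = NormalDist().inv_cdf(0.975)       # ≈ 1.96 for a 95% interval

lower, upper = sample_mean - z * se, sample_mean + z * se
print(round(lower, 2), round(upper, 2))  # 9.62 11.18

# Any hypothesized mean mu0 inside (lower, upper) gives
# |(sample_mean - mu0) / se| < z, i.e. a non-significant test;
# values outside the interval are rejected.
```

So reporting the interval reports every value rejected and every value retained in one step.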

Use of confidence intervals overcomes some of the problems people otherwise have when using Neyman-Pearson statistics:

- it tells you the sensitivity of your experiment directly
- it turns out you can use the confidence interval to determine a useful stopping rule. Stop collecting data when the interval is of a certain predetermined width. Such a stopping rule would ensure that people do not get into situations where illegitimate stopping rules are tempting.

Confidence intervals are a very useful way of summarizing what a set of studies as a whole is telling us. You can calculate the confidence interval on the parameter of interest by combining the information provided in all the studies.

The 95% confidence interval is interpreted in terms of an objective probability.

The procedure of calculating 95% confidence intervals will produce intervals that include the true population value 95% of the time.

There is no probability attached to any one calculated interval. That interval either includes the population value or it does not.

There is not a 95% probability that the 95% confidence limits for a particular sample include the true population mean. But if you acted as if the true population value were included in your interval each time you calculated a 95% confidence interval, you would be right 95% of the time.
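This "95% of the time" claim is again a property of the procedure, and can be checked by simulation (a sketch assuming a known true mean and z-based intervals):

```python
import random
from statistics import NormalDist

# Sketch: repeatedly sample from a population with a known true mean and
# count how often the z-based 95% interval covers it. The procedure, not
# any single interval, has the 95% success rate.
random.seed(7)
true_mean, sd, n, sims = 50.0, 5.0, 40, 10_000
z = NormalDist().inv_cdf(0.975)
se = sd / n ** 0.5

covered = 0
for _ in range(sims):
    m = sum(random.gauss(true_mean, sd) for _ in range(n)) / n
    if m - z * se <= true_mean <= m + z * se:
        covered += 1

coverage = covered / sims
print(coverage)  # close to 0.95 across many repetitions
```

Each individual interval either covers 50.0 or it does not; only the long-run proportion is 95%.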

Inference consists of simple acceptance or rejection

- Data seem to provide continuous support for or against different hypotheses.
- What a scientist wants to know is either how likely certain hypotheses are in the light of the data or how strongly the evidence supports one hypothesis rather than another.
- It is meaningless to use the tools and concepts developed in the Neyman-Pearson framework to draw inferences about the probability of hypotheses or the strength of evidence.

Null hypothesis testing encourages weak theorizing

- It encourages an alternative hypothesis of merely ‘not the null hypothesis’ rather than a prediction of a certain value.
- The habitual use of confidence intervals instead of simple null hypothesis testing would overcome this objection.

In the Neyman-Pearson approach it is important to know the reference class: we must know what endless series of trials might have happened but never did.

- This is important when considering both multiple testing and stopping rules.
- The decision is basically arbitrary and tacit conventions determine practice.
- In the Neyman-Pearson approach, the same data can lead to different conclusions.
- The limits of the confidence intervals are sensitive to the stopping rule and multiple testing issues as well.

If the article uses significance or hypothesis tests, then two hypotheses need to be specified for each test.

Most papers fall down at the first hurdle because the alternative is not well specified.

The stopping rule should be specified in advance as a fixed number of participants, with significance testing taking place once, at the end of data collection.

Even if minimally interesting effect sizes and power were not stated in advance, a crucial point is how the authors dealt with interesting null results.

Given a null result was obtained, did the authors give some measure of sensitivity of the test?
