The beast of bias - summary of chapter 6 of Statistics by A. Field (5th edition)

Statistics
Chapter 6
The beast of bias


What is bias?

Bias: the summary information is at odds with the objective truth.

An unbiased estimator: an estimator whose expected value equals the quantity it is trying to estimate.

We predict an outcome variable from a model described by one or more predictor variables and parameters that tell us about the relationship between each predictor and the outcome variable.
The model will not predict the outcome perfectly, so for each observation there is some amount of error.

Statistical bias enters the statistical process in three ways:

  • things that bias the parameter estimates (including effect sizes)
  • things that bias standard errors and confidence intervals
  • things that bias test statistics and p-values

Outliers

An outlier: a score very different from the rest of the data.

Outliers have a dramatic effect on the sum of squared errors.
If the sum of squared errors is biased, the associated standard error, confidence interval and test statistic will be too.
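
To see why a single extreme score matters so much, here is a small sketch in Python using made-up numbers (not data from the book):

```python
import numpy as np

# Hypothetical scores (not from the book): the same data with and without one outlier.
scores = np.array([5, 6, 6, 7, 7, 8], dtype=float)
scores_with_outlier = np.append(scores, 25.0)

def sum_of_squared_errors(x):
    """Sum of squared deviations of each score from the mean of those scores."""
    return np.sum((x - x.mean()) ** 2)

print(sum_of_squared_errors(scores))               # small: the scores sit close to their mean
print(sum_of_squared_errors(scores_with_outlier))  # much larger: the single outlier dominates
```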

Overview of assumptions

The second bias is ‘violation of assumptions’.

An assumption: a condition that ensures that what you’re attempting to do works.
If any of the assumptions are not true then the test statistic and p-value will be inaccurate and could lead us to the wrong conclusion.

The main assumptions that we’ll look at are:

  • additivity and linearity
  • normality of something or other
  • homoscedasticity/ homogeneity of variance
  • independence

Additivity and linearity

The assumption of additivity and linearity: the relationship between the outcome variable and any predictors is accurately described by the linear model equation.
The scores on the outcome variable are, in reality, linearly related to any predictors. If you have several predictors then their combined effect is best described by adding their effects together.
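
For reference, this is the general linear model; a minimal rendering in standard notation (not quoted verbatim from the book) is:

```latex
\text{outcome}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i
```

The 'additivity' part is that the predictors' effects are summed, and the 'linearity' part is that each predictor enters the equation multiplied by a single parameter rather than through a curve.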

If the assumption is not true, even if all the other assumptions are met, your model is invalid because your description of the process you want to model is wrong.

Normally distributed something or other

The assumption of normality relates in different ways to things we want to do when fitting models and assessing them:

  • Parameter estimates.
    The mean is a parameter and extreme scores can bias it.
    Estimates of parameters are affected by non-normal distributions (such as those with outliers).
    Parameter estimates differ in how much they are biased in a non-normal distribution.
  • Confidence intervals
    We use values of the standard normal distribution to compute the confidence interval around a parameter estimate. Using values of the standard normal distribution makes sense only if the parameter estimate comes from one (see the formula after this list).
    For confidence intervals around a parameter estimate to be accurate, that estimate must have a normal sampling distribution.
  • Null hypothesis significance testing
    If we want to test a hypothesis about a model, we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use have distributions related to the normal distribution. If our parameter estimate is normally distributed, then these test statistics and p-values will be accurate.
    For significance tests of models to be accurate the sampling distribution of what’s being measured must be normal.
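
As a reminder of the formula involved (standard notation, shown as a sketch rather than quoted from the book), a 95% confidence interval combines the parameter estimate, its standard error, and the 1.96 cut-offs of the standard normal distribution:

```latex
95\%\ \text{CI} = \hat{b} \pm 1.96 \times SE(\hat{b})
```

If the sampling distribution of the estimate is not normal, the 1.96 cut-offs no longer capture the middle 95% of estimates, which is why the confidence interval can be inaccurate.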

The central limit theorem revisited

As the sample sizes get bigger the sampling distributions become more normal, until a point at which the sample is big enough that the sampling distribution is normal, even though the population of scores is very non-normal indeed.

The central limit theorem: regardless of the shape of the population, parameter estimates of that population will have a normal sampling distribution provided the samples are ‘big enough’.
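
A minimal simulation can make this concrete (a sketch with made-up settings, using NumPy/SciPy): draw many samples from a heavily skewed population and watch the sampling distribution of the mean become roughly normal as the sample size grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A heavily skewed 'population' (exponential distribution).
for n in (5, 30, 200):                       # increasing sample sizes
    # 5000 samples of size n; the mean of each sample forms the sampling distribution of the mean.
    sample_means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    print(f"n = {n:3d}  skewness of the sampling distribution = {stats.skew(sample_means):.3f}")
# The skewness shrinks towards 0 (the skewness of a normal distribution) as n increases.
```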

When does the assumption of normality matter?

The central limit theorem means that there are a variety of situations in which we can assume normality regardless of the shape of our sample data.

  • For confidence intervals around a parameter estimate to be accurate, that estimate must come from a normal sampling distribution. The central limit theorem tells us that in large samples, the estimate will have come from a normal distribution regardless of what sample or population data look like.

If we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough.

  • For significance tests of models to be accurate the sampling distribution of what’s being tested must be normal.

The shape of our data shouldn’t affect significance tests provided our sample is large enough (central limit theorem). But, the extent to which test statistics perform as they should do in large samples varies across different test statistics.

  • For the estimates of model parameters to be optimal the residuals in the population must be normally distributed.

The method of least squares will always give you an estimate of the model parameters that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and estimate the parameters that define it.
But, there are other methods for estimating model parameters, and if you happen to have normally distributed errors then the estimates that you obtained using the method of least squares will have less error than the estimates you would have got using any of these other methods.

If all you want to do is estimate the parameters of your model then normality matters mainly in deciding how best to estimate them.
If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples. Because of the central limit theorem, we don’t really need to worry about this assumption in larger samples.

Provided your sample is large, outliers are a more pressing concern than normality.
You can have outliers that are less extreme but are not isolated cases. These outliers can dramatically reduce the power in significance tests.

Homoscedasticity/ homogeneity of variance

This impacts two things:

  • Parameters.
    Using the method of least squares to estimate the parameters in the model, we get optimal estimates if the variance of the outcome variable is equal across different values of the predictor variable.
  • Null hypothesis testing
    Test statistics often assume that the variance of the outcome variable is equal across different values of the predictor variable. If this is not the case then these test statistics will be inaccurate.

What is homoscedasticity/ homogeneity of variance?

In designs in which you test groups of cases this assumption means that these groups come from populations with the same variance.

In correlational designs, this assumption means that the variance of the outcome variable should be stable at all levels of the predictor variable.
As you go through levels of the predictor variable, the variance of the outcome variable should not change.

When does homoscedasticity/ homogeneity of variance matter?

If equality of variance can be assumed, then the parameter estimates for a linear model are optimal when using the method of least squares.
The method of least squares will produce ‘unbiased’ estimates of parameters even when homogeneity of variance can’t be assumed, but they won’t be optimal.
Better estimates can be achieved using a method other than least squares.

If all you care about is estimating the parameters of the model in your sample then you don’t need to worry about homogeneity of variance in most cases: the method of least squares will produce unbiased estimates.

But

  • Unequal variances/heteroscedasticity creates bias and inconsistency in the estimate of the standard error associated with the parameter estimates in your model.
  • Confidence intervals, significance tests (and therefore p-values) for the parameter estimates will be biased, because they are computed using the standard error.
  • Confidence intervals can be extremely inaccurate when homogeneity of variance/homoscedasticity cannot be assumed.

Independence

Independence: the errors in your model are not related to each other.

The equation that we use to estimate the standard error is valid only if observations are independent.

SPSS

  • To check that the distribution of scores is approximately normal, look at the values of skewness and kurtosis in the output.
  • positive values of skewness indicate too many low scores in the distribution, whereas negative values indicate a build-up of high scores.
  • positive values of kurtosis indicate a heavy-tailed distribution, whereas negative values indicate a light-tailed distribution.
  • the further the value is from zero, the more likely it is that the data are not normally distributed.
  • you can convert these scores to z-scores by dividing them by their standard error. If the resulting score (ignoring the sign) is greater than 1.96, then it is significant (p < 0.05); a sketch of this calculation follows the list.
  • Significance tests of skew and kurtosis should not be used in large samples (because they are likely to be significant even when skew and kurtosis are not too different from normal).
  • The K-S (Kolmogorov-Smirnov) test can be used (but shouldn’t be) to see if a distribution of scores significantly differs from a normal distribution
  • if the K-S test is significant (Sig. in the SPSS table is less than 0.05) then the scores are significantly different from a normal distribution.
  • otherwise, the scores are approximately normally distributed.
  • the Shapiro-Wilk test does much the same thing, but it has more power to detect differences from normality (so this test might be significant when the K-S test is not).
  • Warning: in large samples these tests can be significant even when the scores are only slightly different from a normal distribution.
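
A rough equivalent of these SPSS checks can be sketched in Python with SciPy (made-up scores; note that the standard errors below are the common large-sample approximations, not the exact formulas SPSS prints):

```python
import numpy as np
from scipy import stats

scores = np.random.default_rng(1).normal(loc=50, scale=10, size=100)   # made-up scores
n = len(scores)

skew = stats.skew(scores)
kurt = stats.kurtosis(scores)        # excess kurtosis: 0 for a perfectly normal distribution

se_skew = np.sqrt(6 / n)             # large-sample approximation to the SE of skewness
se_kurt = np.sqrt(24 / n)            # large-sample approximation to the SE of kurtosis
print("z_skew =", skew / se_skew)    # |z| > 1.96 -> significant at p < .05
print("z_kurt =", kurt / se_kurt)

# Shapiro-Wilk test and a K-S test against a normal distribution with the sample mean and SD
# (SPSS applies a Lilliefors correction to its K-S test, so the p-values will differ somewhat).
print(stats.shapiro(scores))
print(stats.kstest(scores, 'norm', args=(scores.mean(), scores.std(ddof=1))))
```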

The reason for looking at the assumption of linearity and homoscedasticity together is that we can check both with a single graph.
Both assumptions relate to errors in the model and we can plot the values of these residuals against corresponding values of the outcome predicted by the model in a scatterplot.
The resulting plot shows whether there is a systematic relationship between what comes out of the model and the errors in the model.
Normally we convert the predicted values and errors to z-scores, so this plot is sometimes referred to as zpred vs. zresid.
If linearity and homoscedasticity hold true then there should be no systematic relationship between the errors in the model and what the model predicts.

  • Homogeneity of variance/homoscedasticity is the assumption that the spread of outcome scores is roughly equal at different points on the predictor variable.
  • the assumption can be evaluated by looking at a plot of the standardized predicted values from your model against the standardized residuals (a sketch of this check follows the list).
  • when comparing groups, this assumption can be tested with Levene’s test and the variance ratio (Hartley’s Fmax)
    If Levene’s test is significant then the variances are significantly different in different groups
    otherwise, homogeneity of variance can be assumed
    the variance ratio is the largest group variance divided by the smallest. This value needs to be smaller than the critical values in the additional material.
  • Warning:
    there are good reasons not to use Levene’s test or the variance ratio! In large samples they can be significant when group variances are similar, and in small samples they can be non-significant when group variances are very different.
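
Outside SPSS, both checks can be sketched in Python with hypothetical data (the variables and cut-offs below are illustrative assumptions, not the book's example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# --- zpred vs. zresid for a simple linear model (hypothetical predictor and outcome) ---
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)
fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x
residuals = y - predicted
zpred = (predicted - predicted.mean()) / predicted.std(ddof=1)
zresid = (residuals - residuals.mean()) / residuals.std(ddof=1)
# Plot zresid against zpred (e.g., with matplotlib): a shapeless cloud around zero is
# consistent with linearity and homoscedasticity; a funnel or curve suggests a problem.

# --- Levene's test and the variance ratio (Hartley's F_max) when comparing groups ---
group1 = rng.normal(loc=0, scale=1.0, size=50)
group2 = rng.normal(loc=0, scale=1.2, size=50)
print(stats.levene(group1, group2))                  # significant -> variances differ
variances = [group1.var(ddof=1), group2.var(ddof=1)]
print("Hartley's F_max =", max(variances) / min(variances))
```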

Reducing bias

Four approaches for correcting problems with data:

  • Trim the data: delete a certain quantity of scores from the extremes
  • Winsorizing: substitute outliers with the highest value that isn’t an outlier
  • Apply a robust estimation method: a common approach is to use bootstrapping
  • Transform the data: apply a mathematical function to scores to correct problems.

Probably the best of these choices is to use robust tests.
Robust tests: a family of procedures to estimate statistics that are unbiased even when the normal assumptions of the statistic are not met.

Trimming the data

Trimming the data: deleting some scores from the extremes.

In its simplest form it could be deleting the data from the person who contributed the outlier.
But, this could be done only if you have good reason to believe that this case is not from the population that you intended to sample.

More often, trimming involves removing extreme scores using one of two rules:

  • a percentage based rule
  • a standard deviation based rule

For example: a percentage based rule could be deleting the 10% highest and lowest scores.

If you take trimming to its extreme you get the median, which is the value left when you have trimmed all but the middle score.
Trimmed mean: the mean in a sample that has been trimmed this way.

M-estimator: a robust measure of location, differs from a trimmed mean in that the amount of trimming is determined empirically.
Rather than the researcher deciding before the analysis how much of the data to trim, an M-estimator determines the optimal amount of trimming necessary to give a robust estimate of, say, the mean.

The trimmed mean (and variance) will be relatively accurate even when the distribution is not symmetrical, because by trimming the ends of the distribution we remove outliers and skew that bias the mean.

Standard deviation based trimming involves calculating the mean and standard deviation of a set of scores, and then removing values that are a certain number of standard deviations greater than the mean.
A caveat of standard deviation based rules is that the mean and standard deviation used to define outliers are themselves inflated by those outliers, which can make extreme scores harder to detect.
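
A minimal sketch of both trimming rules in Python, using made-up scores (SciPy's trim_mean implements the percentage-based rule; the 3 SD cut-off is one common choice):

```python
import numpy as np
from scipy import stats

# Hypothetical scores with one outlier (40)
scores = np.array([4, 5, 5, 5, 6, 6, 6, 6, 7, 7,
                   7, 7, 7, 8, 8, 8, 9, 9, 10, 40], dtype=float)

# Percentage-based rule: the 10% trimmed mean drops the lowest and highest 10% of scores.
print("10% trimmed mean:", stats.trim_mean(scores, proportiontocut=0.10))

# Standard deviation based rule: keep only scores within 3 SDs of the mean.
z = (scores - scores.mean()) / scores.std(ddof=1)
kept = scores[np.abs(z) < 3]
print("mean after 3 SD trimming:", kept.mean())      # the outlier (z ≈ 4.2) is removed
```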

Winsorizing

Winsorizing the data: replacing outliers with the next highest score that is not an outlier.

There are some variations on winsorizing, such as replacing extreme scores with a score 3 standard deviations from the mean.
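
A sketch of both variants in Python with the same made-up scores (scipy.stats.mstats.winsorize handles the percentage-based replacement; the 3 SD variant is approximated here with np.clip):

```python
import numpy as np
from scipy.stats import mstats

# The same hypothetical scores with one outlier (40)
scores = np.array([4, 5, 5, 5, 6, 6, 6, 6, 7, 7,
                   7, 7, 7, 8, 8, 8, 9, 9, 10, 40], dtype=float)

# Classic winsorizing: the top and bottom 10% of scores are replaced by the next most
# extreme remaining score (here the two highest scores, 40 and 10, both become 9).
print(mstats.winsorize(scores, limits=[0.10, 0.10]))

# Variant: replace scores further than 3 SDs from the mean with the value exactly 3 SDs away.
upper = scores.mean() + 3 * scores.std(ddof=1)
lower = scores.mean() - 3 * scores.std(ddof=1)
print(np.clip(scores, lower, upper))                 # caps 40 at roughly mean + 3 SD (≈ 31)
```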

Robust methods

By far the best option if you have irksome data is to estimate parameters and their standard errors with methods that are robust to violations of assumptions and outliers.
Use methods that are relatively unaffected by irksome data.

  • The first set of tests are ones that do not rely on the assumption of normally distributed data. → robust methods.

Two simple concepts behind robust methods:

  • parameter estimates based on trimmed data
  • bootstrap.
    Estimates the properties of the sampling distribution from the sample data.

Bootstrap

The sample data are treated as a population from which smaller samples (bootstrap samples) are taken (putting each score back before a new one is drawn from the sample). The parameter of interest is calculated in each bootstrap sample. This process is repeated perhaps 2000 times. The result is 2000 parameter estimates, one from each bootstrap sample.

There are two things we can do with these estimates:

  • order them and work out the limits within which 95% of them fall. We can use these values as estimates of the limits of the 95% confidence interval of the parameter → a percentile bootstrap confidence interval
  • calculate the standard deviation of the parameter estimates from the bootstrap samples and use it as the standard error of the parameter estimate.

Because bootstrapping is based on taking random samples from the data you’ve collected, the estimates you get will be slightly different every time.
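
A minimal percentile bootstrap for the mean, sketched in Python with made-up data (SPSS offers bootstrapping through its own dialog box; this sketch only illustrates the logic described above):

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.exponential(scale=10, size=50)          # hypothetical, skewed sample

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement from the observed scores (treated as the 'population')
    resample = rng.choice(scores, size=len(scores), replace=True)
    boot_means[i] = resample.mean()

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print("bootstrap SE of the mean:", boot_means.std(ddof=1))
print("95% percentile bootstrap CI:", ci_low, ci_high)
```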

Transforming data

The idea behind transformations is that you do something to every score to correct for distributional problems, outliers, lack of linearity or unequal variances.
You do the same thing to all your scores.

If you are looking at the relationship between variables you can transform only the problematic variable.
If you are looking at differences between variables you must transform all the variables.
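
A small sketch of common transformations in Python, applied to hypothetical positively skewed scores (which transformation, if any, helps depends on the data):

```python
import numpy as np

scores = np.array([1, 2, 2, 3, 3, 4, 5, 8, 15, 60], dtype=float)  # hypothetical, positively skewed

log_scores = np.log(scores)      # log transform (add a constant first if any score is zero or negative)
sqrt_scores = np.sqrt(scores)    # square-root transform (milder than the log)
recip_scores = 1 / scores        # reciprocal transform (note: reverses the ordering of scores)
```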

 

 
