Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition
Statistics
Chapter 6
The beast of bias
Bias: the summary information is at odds with the objective truth.
An unbiased estimator: an estimator whose expected value equals the thing it is trying to estimate.
We predict an outcome variable from a model described by one or more predictor variables and parameters that tell us about the relationship between each predictor and the outcome variable.
The model will not predict the outcome perfectly, so for each observation there is some amount of error.
Bias can affect three things in the statistical process: parameter estimates, standard errors and confidence intervals, and test statistics with their p-values. Two main sources of this bias are outliers and violations of assumptions:
An outlier: a score very different from the rest of the data.
Outliers have a dramatic effect on the sum of squared error.
If the sum of squared errors is biased, the associated standard error, confidence interval and test statistic will be too.
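As a minimal sketch of this point (in Python with made-up scores, not the book's data or SPSS output), a single extreme score can dramatically inflate the sum of squared errors around the mean:

```python
import numpy as np

# Hypothetical scores; the "model" here is simply the mean of the data.
scores = np.array([3, 4, 4, 5, 5, 6, 6, 7], dtype=float)
scores_with_outlier = np.append(scores, 25.0)  # add one extreme score

def sum_squared_error(x):
    """Sum of squared deviations of each score from the mean."""
    return np.sum((x - x.mean()) ** 2)

print(sum_squared_error(scores))               # modest error
print(sum_squared_error(scores_with_outlier))  # dramatically larger error
```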
The second source of bias is violation of assumptions.
An assumption: a condition that ensures that what you’re attempting to do works.
If any of the assumptions are not true then the test statistic and p-value will be inaccurate and could lead us to the wrong conclusion.
The main assumptions that we’ll look at are:
Additivity and linearity
The assumption of additivity and linearity: the relationship between the outcome variable and the predictors is accurately described by the equation of the linear model.
The scores on the outcome variable are, in reality, linearly related to any predictors. If you have several predictors then their combined effect is best described by adding their effects together.
If the assumption is not true, even if all the other assumptions are met, your model is invalid because your description of the process you want to model is wrong.
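As a small illustration (Python, with hypothetical parameter values rather than anything from the book), an additive linear model predicts the outcome by adding together each predictor multiplied by its parameter:

```python
import numpy as np

# Hypothetical intercept, slopes and predictor values.
b0, b1, b2 = 2.0, 0.5, 1.5
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# Additivity and linearity: the predicted outcome is the intercept plus
# the sum of each predictor weighted by its parameter.
predicted = b0 + b1 * x1 + b2 * x2
print(predicted)
```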
Normally distributed something or other
The assumption of normality relates in different ways to things we want to do when fitting models and assessing them:
The central limit theorem revisited
As sample sizes get bigger, sampling distributions become more normal, up to the point at which the sample is big enough that the sampling distribution is normal even though the population of scores is very non-normal indeed.
The central limit theorem: regardless of the shape of the population, parameter estimates of that population will have a normal distribution provided the samples are 'big enough'.
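The theorem is easy to see by simulation. Below is a rough sketch (Python/NumPy, with an exponential population chosen only for illustration): as the sample size grows, the sampling distribution of the mean becomes less skewed, i.e. more normal, even though the population itself is heavily skewed.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # very non-normal population

def sample_means(n, reps=5000):
    """Means of `reps` random samples of size n: the sampling distribution."""
    return np.array([rng.choice(population, size=n).mean() for _ in range(reps)])

for n in (5, 30, 100):
    means = sample_means(n)
    skewness = np.mean((means - means.mean()) ** 3) / means.std() ** 3
    print(n, round(skewness, 3))  # skewness shrinks towards 0 as n grows
```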
When does the assumption of normality matter?
The central limit theorem means that there are a variety of situations in which we can assume normality regardless of the shape of our sample data.
If we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough.
The shape of our data shouldn’t affect significance tests provided our sample is large enough (central limit theorem). But, the extent to which test statistics perform as they should do in large samples varies across different test statistics.
The method of least squares will always give you an estimate of the model parameters that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and estimate the parameters that define it.
But, there are other methods for estimating model parameters, and if you happen to have normally distributed errors then the estimates that you obtained using the method of least squares will have less error than the estimates you would have got using any of these other methods.
If all you want to do is estimate the parameters of your model then normality matters mainly in deciding how best to estimate them.
If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples. Because of the central limit theorem, we don’t really need to worry about this assumption in larger samples.
Provided your sample is large, outliers are a more pressing concern than normality.
You can have outliers that are less extreme but are not isolated cases. These outliers can dramatically reduce the power in significance tests.
Homoscedasticity/ homogeneity of variance
This assumption impacts two things: parameter estimates and null hypothesis significance tests.
What is homoscedasticity/ homogeneity of variance?
In designs in which you test groups of cases this assumption means that these groups come from populations with the same variance.
In correlational designs, this assumption means that the variance of the outcome variable should be stable at all levels of the predictor variable.
As you go through levels of the predictor variable, the variance of the outcome variable should not change.
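A rough way to see (and test) this is sketched below in Python with SciPy, using made-up groups rather than any data from the book; Levene's test is one common check of whether group variances differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical outcome scores at three levels of a grouping predictor.
group_a = rng.normal(10, 2, size=30)
group_b = rng.normal(12, 2, size=30)
group_c = rng.normal(14, 6, size=30)  # much larger spread: heteroscedasticity

print([round(np.var(g, ddof=1), 1) for g in (group_a, group_b, group_c)])

# Levene's test: a significant result suggests the population variances differ.
statistic, p_value = stats.levene(group_a, group_b, group_c)
print(round(statistic, 2), round(p_value, 4))
```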
When does homoscedasticity/ homogeneity of variance matter?
If equality of variance can be assumed, then the parameter estimates for a linear model are optimal when using the method of least squares.
The method of least squares will produce ‘unbiased’ estimates of parameters even when homogeneity of variance can’t be assumed, but they won’t be optimal.
Better estimates can be achieved using a method other than least squares.
If all you care about is estimating the parameters of the model in your sample, then you don't need to worry about homogeneity of variance in most cases: the method of least squares will produce unbiased estimates.
But if you want to compute confidence intervals or significance tests relating to those parameters, then violations of homogeneity of variance do matter.
Independence
Independence: the errors in your model are not related to each other.
The equation that we use to estimate the standard error is valid only if observations are independent.
The reason for looking at the assumption of linearity and homoscedasticity together is that we can check both with a single graph.
Both assumptions relate to errors in the model and we can plot the values of these residuals against corresponding values of the outcome predicted by the model in a scatterplot.
The resulting plot shows whether there is a systematic relationship between what comes out of the model and the errors in the model.
Normally we convert the predicted values and errors to z-scores, so this plot is sometimes referred to as zpred vs. zresid.
If linearity and homoscedasticity hold true, then there should be no systematic relationship between the errors in the model and what the model predicts.
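A sketch of such a plot (Python with matplotlib, on simulated well-behaved data rather than SPSS output) is shown below; a random, evenly spread cloud of points around zero is what you hope to see.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 3 + 0.8 * x + rng.normal(0, 1, size=200)  # hypothetical well-behaved data

result = stats.linregress(x, y)
predicted = result.intercept + result.slope * x
residuals = y - predicted

# Standardize both, giving the ZPRED vs. ZRESID plot that SPSS produces.
zpred = (predicted - predicted.mean()) / predicted.std(ddof=1)
zresid = (residuals - residuals.mean()) / residuals.std(ddof=1)

plt.scatter(zpred, zresid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values (ZPRED)")
plt.ylabel("Standardized residuals (ZRESID)")
plt.show()
```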
Four approaches for correcting problems with data: trimming the data, winsorizing, using robust methods, and transforming the data.
Probably the best of these choices is to use robust tests.
Robust tests: a family of procedures to estimate statistics that are unbiased even when the normal assumptions of the statistic are not met.
Trimming the data
Trimming the data: deleting some scores from the extremes.
In its simplest form it could be deleting the data from the person who contributed the outlier.
But, this could be done only if you have good reason to believe that this case is not from the population that you intended to sample.
More often, trimming involves removing extreme scores using one of two rules:
For example: a percentage based rule could be deleting the 10% highest and lowest scores.
If you take trimming to its extreme you get the median, which is the value left when you have trimmed all but the middle score.
Trimmed mean: the mean in a sample that has been trimmed this way.
M-estimator: a robust measure of location, differs from a trimmed mean in that the amount of trimming is determined empirically.
Rather than the researcher deciding before the analysis how much of the data to trim, an M-estimator determines the optimal amount of trimming necessary to give a robust estimate of, say, the mean.
The trimmed mean (and variance) will be relatively accurate even when the distribution is not symmetrical, because by trimming the ends of the distribution we remove outliers and skew that bias the mean.
Standard deviation based trimming involves calculating the mean and standard deviation of a set of scores, and then removing values that are a certain number of standard deviations greater than the mean.
A problem with this rule is that the mean and standard deviation used to set the cut-off are themselves influenced by the outliers they are supposed to remove.
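Both trimming rules can be sketched in a few lines of Python (with hypothetical scores; SciPy's trim_mean handles the percentage-based rule):

```python
import numpy as np
from scipy import stats

scores = np.array([2, 3, 3, 4, 4, 5, 5, 6, 6, 40], dtype=float)  # 40 is an outlier

# Percentage-based rule: a 20% trimmed mean drops the highest and lowest
# 20% of scores before averaging.
print(stats.trim_mean(scores, proportiontocut=0.20))

# Standard-deviation-based rule: drop scores more than 3 SDs from the mean.
# Note that the outlier inflates the mean (7.8) and SD (about 11.4), so here
# the score of 40 actually survives the cut-off of mean +/- 3 SDs.
m, sd = scores.mean(), scores.std(ddof=1)
trimmed = scores[np.abs(scores - m) <= 3 * sd]
print(trimmed.mean())
```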
Winsorizing
Winsorizing the data: replacing outliers with the next highest score that is not an outlier.
There are some variations on winsorizing, such as replacing extreme scores with a score 3 standard deviations from the mean.
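A minimal sketch (Python/SciPy, with the same hypothetical scores as above) of winsorizing the most extreme 10% of scores at each end:

```python
import numpy as np
from scipy.stats.mstats import winsorize

scores = np.array([2, 3, 3, 4, 4, 5, 5, 6, 6, 40], dtype=float)

# Replace the highest 10% of scores with the next highest remaining score
# (40 becomes 6) and the lowest 10% with the next lowest (2 becomes 3).
winsorized = winsorize(scores, limits=(0.10, 0.10))
print(np.asarray(winsorized))
```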
Robust methods
By far the best option if you have irksome data is to estimate parameters and their standard errors with methods that are robust to violations of assumptions and outliers.
Use methods that are relatively unaffected by irksome data.
Two simple concepts underlie robust methods: robust measures of the centre of the distribution (such as the trimmed mean and M-estimators described above) and the bootstrap.
Bootstrap
The sample data are treated as a population from which smaller samples (bootstrap samples) are taken (putting each score back before a new one is drawn from the sample). The parameter of interest is calculated in each bootstrap sample. This process is repeated perhaps 2000 times. The result is 2000 parameter estimates, one from each bootstrap sample.
There are two things we can do with these estimates: order them and use the limits between which 95% of them fall as a confidence interval, or use their standard deviation as an estimate of the standard error of the parameter.
Because bootstrapping is based on taking random samples from the data you’ve collected, the estimates you get will be slightly different every time.
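A sketch of a percentile bootstrap in Python (using a made-up skewed sample, rather than the SPSS bootstrap options the book describes):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=50)  # hypothetical skewed sample

# Treat the sample as a population and resample from it with replacement.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# (1) Percentile bootstrap confidence interval: the limits between which
#     the middle 95% of the bootstrap estimates fall.
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
# (2) The standard deviation of the bootstrap estimates serves as an
#     estimate of the standard error of the parameter.
boot_se = boot_means.std(ddof=1)

print(round(ci_lower, 2), round(ci_upper, 2), round(boot_se, 3))
```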
Transforming data
The idea behind transformations is that you do something to every score to correct for distributional problems, outliers, lack of linearity or unequal variances.
You do the same thing to all your scores.
If you are looking at the relationship between variables you can transform only the problematic variable.
If you are looking at differences between variables you must transform all the variables.
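As a small sketch (Python, with simulated positively skewed scores rather than any of the book's data), applying the same log transformation to every score can remove the skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
scores = rng.lognormal(mean=5.0, sigma=0.8, size=500)  # positively skewed scores

log_scores = np.log(scores)  # the same transformation applied to every score

print(round(stats.skew(scores), 2))      # strong positive skew
print(round(stats.skew(log_scores), 2))  # close to 0: roughly symmetric
```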