The linear model - summary of Chapter 9 by A. Field 5th edition

Statistics
Chapter 9
The linear model (regression)

An introduction to the linear model (regression)
Bias in linear models?
Generalizing the model
Sample size and the linear model
The linear model with two or more predictors (multiple regression)

An introduction to the linear model (regression)

The linear model with one predictor

outcome_i = (b₀ + b₁x_i) + error_i

This model uses an unstandardised measure of the relationship (b₁) and consequently we include a parameter b₀ that tells us the value of the outcome when the predictor is zero.

Any straight line can be defined by two things:

the slope of the line (usually denoted by b₁)
the point at which the line crosses the vertical axis of the graph (the intercept of the line, b₀)

These parameters are regression coefficients.

The linear model with several predictors

The linear model expands to include as many predictor variables as you like.
An additional predictor can be placed in the model given a b to estimate its relationship to the outcome:

Y_i = (b₀ + b₁X_1i + b₂X_2i + … + b_nX_ni) + ε_i

b_n is the coefficient of the nth predictor (X_ni).
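
As a minimal sketch of the equation above, the function below computes the predicted part of the model: b₀ plus the sum of each coefficient times its predictor value. The function name `predict` and all coefficient values are hypothetical and chosen only for illustration.

```python
def predict(b0, bs, xs):
    """Predicted outcome: b0 + b1*x1 + ... + bn*xn (the error term is excluded)."""
    return b0 + sum(b * x for b, x in zip(bs, xs))


# Hypothetical coefficients for two predictors, not taken from any real data.
b0 = 1.0
bs = [2.0, 0.5]
case = [3.0, 4.0]  # predictor values X_1i and X_2i for one case

predicted = predict(b0, bs, case)
print(predicted)
```

The observed outcome for a case is this predicted value plus that case's error term.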

Regression analysis is a term for fitting a linear model to data and using it to predict values of an outcome variable from one or more predictor variables.
Simple regression: with one predictor variable
Multiple regression: with several predictors

Estimating the model

No matter how many predictors there are, the model can be described entirely by a constant (b₀) and by parameters associated with each predictor (bs).

To estimate these parameters we use the method of least squares.
We could assess the fit of a model by looking at the deviations between the model and the data collected.

Residuals: the differences between what the model predicts and the observed values.

To calculate the total error in a model we square the differences between the observed values of the outcome, and the predicted values that come from the model:

total error = Σ_{i=1}^{n} (observed_i – model_i)²

Because we call these errors residuals, this is called the residual sum of squares (SS_R).
It is a gauge of how well a linear model fits the data.

if the SS_R is large, the model is not representative
if the SS_R is small, the model is representative for the data

The model with the smallest SS_R is the one that fits the data best.
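
The following sketch illustrates least squares for a single predictor. It uses the standard closed-form estimates for the slope and intercept (not stated in this summary, so treat them as an assumption here) and then computes SS_R. The data and the function names `fit_simple` and `residual_ss` are made up for illustration.

```python
def fit_simple(x, y):
    """Return (b0, b1), the intercept and slope that minimise SS_R."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residual_ss(x, y, b0, b1):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple(x, y)
ss_r = residual_ss(x, y, b0, b1)
print(b0, b1, ss_r)
```

Any other line through these points, for example one with a slightly shifted intercept, gives a larger SS_R than the least-squares line.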

Assessing the goodness of fit, sums of squares, R and R²

Goodness of fit: how well the model fits the observed data

Total sum of squares (SS_T): the sum of squared differences between the observed outcome scores and their mean. It shows how good the mean is as a model of the observed outcome scores.

We can use the values of SS_T and SS_R to calculate how much better the linear model is than the baseline model of ‘no relationship’.
The improvement in prediction resulting from using the linear model rather than the mean is calculated as the difference between SS_T and SS_R.
This improvement is the model sum of squares (SS_M): SS_M = SS_T – SS_R.

If SS_M is large, the linear model is very different from using the mean to predict the outcome variable. It is a big improvement.

R² = SS_M/ SS_T

R² is the improvement due to the model

To express this value as a percentage, multiply it by 100.
R² represents the amount of variance in the outcome explained by the model relative to how much variation there was to explain in the first place.
We can take the square root of this value to obtain Pearson’s correlation coefficient for the relationship between values of the outcome predicted by the model and the observed values of the outcome.
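
A small sketch of these quantities, assuming a model that has already been fitted. The data, the coefficients b₀ = 2.2 and b₁ = 0.6, and the helper names are hypothetical; the code also checks that the square root of R² matches the correlation between predicted and observed values.

```python
import math


def sums_of_squares(y, predicted):
    """Return (SS_T, SS_R, SS_M) for observed values y and model predictions."""
    mean_y = sum(y) / len(y)
    ss_t = sum((yi - mean_y) ** 2 for yi in y)                 # error of the mean as a model
    ss_r = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # error of the linear model
    ss_m = ss_t - ss_r                                          # improvement due to the model
    return ss_t, ss_r, ss_m


def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical data and least-squares coefficients, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * xi for xi in x]

ss_t, ss_r, ss_m = sums_of_squares(y, predicted)
r_squared = ss_m / ss_t
print(r_squared, r_squared * 100)  # proportion and percentage of variance explained
```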

Another use of the sums of squares is in calculating the F-statistic.

F is based upon the ratio of the improvement due to the model and the error in the model.

Mean squares (MS): the sum of squares divided by the associated degrees of freedom.

MS_M = SS_M/k

MS_R = SS_R/ (N – k – 1)

F = MS_M/MS_R

F has an associated probability distribution from which a p-value can be derived to tell us the probability of getting an F at least as big as one we have if the null hypothesis were true.
The F-statistic can also be used to calculate the significance of R²:

F = ((N – k – 1)R²) / (k(1 – R²))
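
The two routes to F described above can be sketched as follows. The numbers (SS_M = 3.6, SS_R = 2.4, N = 5, k = 1) are hypothetical, and the function names are illustrative. Obtaining the p-value would additionally need an F distribution from a statistics library, which is left out here.

```python
def f_from_sums(ss_m, ss_r, n, k):
    """F as the ratio of the model mean squares to the residual mean squares."""
    ms_m = ss_m / k
    ms_r = ss_r / (n - k - 1)
    return ms_m / ms_r


def f_from_r2(r2, n, k):
    """F computed directly from R-squared."""
    return ((n - k - 1) * r2) / (k * (1 - r2))


# Hypothetical values: N = 5 cases, k = 1 predictor.
ss_m, ss_r, n, k = 3.6, 2.4, 5, 1
r2 = ss_m / (ss_m + ss_r)

f_a = f_from_sums(ss_m, ss_r, n, k)
f_b = f_from_r2(r2, n, k)
print(f_a, f_b)  # both routes should agree up to rounding
```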

Assessing individual predictors

Any predictor in a linear model has a coefficient (b_i). The value of b represents the change in the outcome resulting from a unit change in a predictor.
The t-statistic is based on the ratio of explained variance against unexplained variance or error.

t = (b_observed – b_expected) / SE_b

Here b_expected is the value of b we would expect if the null hypothesis were true (usually 0).
The statistic t has a probability distribution that differs according to the degrees of freedom for the test (N – k – 1 in regression).
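
A minimal sketch of this test for a model with one predictor. The standard error formula used here, SE_b = √(MS_R / Σ(x_i – mean of x)²), is the usual one for simple regression but is not given in this summary, so treat it as an assumption. The data and the function name are hypothetical.

```python
import math


def slope_t_statistic(x, y, b_expected=0.0):
    """Return (b1, SE of b1, t) for a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ms_r = ss_r / (n - 2)            # N - k - 1 degrees of freedom with k = 1
    se_b1 = math.sqrt(ms_r / sxx)    # assumed standard-error formula (simple regression)
    t = (b1 - b_expected) / se_b1
    return b1, se_b1, t


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se_b1, t = slope_t_statistic(x, y)
print(b1, se_b1, t)
```

With a single predictor, t² equals the F-statistic of the model, which is a quick consistency check.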

Bias in linear models?

Outliers

An outlier: a case that differs substantially from the main trend in the data.
Outliers can affect the estimates of the regression coefficients.

Standardized residuals: the residuals converted to z-scores and so are expressed in standard deviation units.
Regardless of the variables of the model, standardized residuals are distributed around a mean of 0 with a standard deviation of 1.

Standardized residuals with an absolute value greater than 3.29 are cause for concern because in an average sample a value this high is unlikely to occur.
If more than 1% of our sample cases have standardized residuals with an absolute value greater than 2.58, there is evidence that the level of error within our model may be unacceptable.
If more than 5% of cases have standardized residuals with an absolute value greater than 1.96, then the model may be a poor representation of the data.
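
These three rules of thumb can be sketched as a simple screening function. It assumes the standardized residuals have already been computed; the list of values and the function name are hypothetical.

```python
def screen_standardized_residuals(z):
    """Apply the three rules of thumb to a list of standardized residuals."""
    n = len(z)
    share_258 = sum(abs(v) > 2.58 for v in z) / n
    share_196 = sum(abs(v) > 1.96 for v in z) / n
    return {
        "any_above_3.29": any(abs(v) > 3.29 for v in z),
        "share_above_2.58": share_258,
        "share_above_1.96": share_196,
        "too_many_above_2.58": share_258 > 0.01,  # more than 1% of cases
        "too_many_above_1.96": share_196 > 0.05,  # more than 5% of cases
    }


# Hypothetical standardized residuals, for illustration only.
z = [0.1, -0.5, 2.0, -2.7, 0.3, 1.0, -1.2, 0.0, 0.4, 3.5]
report = screen_standardized_residuals(z)
print(report)
```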

Influential cases

There are several statistics used to assess the influence of a case.

Adjusted predicted value: the predicted value of the outcome for that case from a model in which the case is excluded. If the model is stable, the predicted value of a case should be the same regardless of whether that case was used to estimate the model.
Deleted residual: the difference between the adjusted predicted value and the original observed value.
Studentized deleted residual: the deleted residual divided by its standard error.
Cook’s distance: a measure of the overall influence of a case on the model.
Leverage: gauges the influence of the observed value of the outcome variable over the predicted values.
Mahalanobis distances: measure the distance of cases from the mean(s) of the predictor variable(s).

It is also possible to look at how the estimates b in a model change as a result of excluding a case.

DFBeta: the difference between a parameter estimated using all cases and estimated when one case is excluded.
DFFit: the difference between the predicted values for a case when the model is estimated including or excluding that case.
Covariance ratio (CVR): quantifies the degree to which a case influences the variance of the regression parameters.
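
Several of these statistics rest on the same idea: refit the model without one case and compare. The sketch below does this for a model with one predictor and reports the adjusted predicted value, the deleted residual, DFBeta and DFFit in unstandardized form. The data, the function names and the sign conventions (observed minus predicted, all cases minus case excluded) are assumptions made for illustration.

```python
def fit_simple(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def leave_one_out(x, y, i):
    """Compare the model fitted on all cases with the model fitted without case i."""
    b0_all, b1_all = fit_simple(x, y)
    b0_del, b1_del = fit_simple(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    original_pred = b0_all + b1_all * x[i]
    adjusted_pred = b0_del + b1_del * x[i]
    return {
        "adjusted_predicted": adjusted_pred,
        "deleted_residual": y[i] - adjusted_pred,
        "dfbeta_b0": b0_all - b0_del,
        "dfbeta_b1": b1_all - b1_del,
        "dffit": original_pred - adjusted_pred,
    }


# Hypothetical data in which the last case departs strongly from the trend.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 20]
influence = leave_one_out(x, y, 5)
print(influence)
```

In this example the slope changes by 2 when the last case is removed, which marks it as highly influential.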

Generalizing the model

Assumptions of the linear model

Additivity and linearity: the outcome variable should be linearly related to any predictors and, with several predictors, their combined effect is best described by adding their effects together.
Independent errors: for any two observations the residual terms should be uncorrelated. This can be tested with the Durbin-Watson test.
Homoscedasticity: at each level of the predictor variable(s) the variance of the residual terms should be constant.
Normally distributed errors: the differences between the predicted and observed data are most frequently zero or close to zero, and differences much greater than zero happen only occasionally.
Predictors are uncorrelated with ‘external variables’: external variables are variables that haven’t been included in the model and that influence the outcome variable.
Variable types: all predictor variables must be quantitative or categorical. The outcome variable must be quantitative, continuous and unbounded.
No perfect multicollinearity: if your model has more than one predictor, there should be no perfect linear relationship between two or more of the predictors.
Non-zero variance: the predictors should have some variation in value (their variance should not be 0).
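
As an illustration of the independent-errors check, the sketch below computes the Durbin-Watson statistic from a list of residuals in their observed order. The formula, d = Σ(e_t – e_{t–1})² / Σe_t², is the standard definition and is not given in this summary, so it is stated here as an assumption. The statistic ranges from 0 to 4, and a value near 2 suggests uncorrelated adjacent residuals. The residual values are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in their observed order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]  # adjacent residuals negatively related
constant = [1.0, 1.0, 1.0, 1.0]       # adjacent residuals positively related

d_alternating = durbin_watson(alternating)
d_constant = durbin_watson(constant)
print(d_alternating, d_constant)
```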

Cross-validation of the model

Even if we can’t be confident that the model derived from our sample accurately represents the population, we can assess how well our model might predict the outcome in a different sample.
Cross-validation: assessing the accuracy of a model across different samples.
If a model can be generalized, then it must be capable of accurately predicting the same outcome variable from the same set of predictors in a different group of people.

Once we have estimated the model there are two main methods of cross-validation:

Adjusted R²
Adjusted R² tells us how much variance in Y would be accounted for if the model had been derived from the population from which the sample was taken.
The adjusted value indicates the loss of predictive power.
Data splitting
involves randomly splitting your sample data, estimating the model in both halves of the data and comparing the resulting models.
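
A short sketch of the adjusted R². The summary does not state a formula, so the commonly used adjustment 1 – (1 – R²)(N – 1)/(N – k – 1) is assumed here. The input values and the function name are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n cases and k predictors (assumed common formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: R-squared of 0.6 from 5 cases and 1 predictor.
r2, n, k = 0.6, 5, 1
adj = adjusted_r_squared(r2, n, k)
shrinkage = r2 - adj  # loss of predictive power
print(adj, shrinkage)
```

The difference between R² and the adjusted value is small when the sample is large relative to the number of predictors, and large when it is not.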

Sample size and the linear model

The sample size required depends on the size of the effect that we’re trying to detect and how much power we want to detect these effects.
The bigger the sample size the better.

Summary

A linear model (regression) is a way of predicting values of one variable from another based on a model that describes a straight line.
This line is the line that best summarizes the pattern of the data.
To assess how well the model fits the data use:
- R², which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared by the predictor variable
- F, which tells us how much variability the model can explain relative to how much it can’t explain.
- the b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant then the predictor variable significantly predicts the outcome variable.

The linear model with two or more predictors (multiple regression)

A great deal of care should be taken in selecting predictors for a model because the estimates of the regression coefficients depend upon the variables in the model.

Methods of entering predictors into the model

Having chosen predictors, you must decide the order to enter them into the model.

When predictors are completely uncorrelated, the order of variable entry has very little effect on the parameters estimated, but we rarely have uncorrelated predictors.
Other things being equal, use hierarchical regression.
You select predictors based on past work and decide in which order to enter them into the model.
You should enter known predictors into the model first, in order of their importance in predicting the outcome.
An alternative method is forced entry.
Here you force all predictors into the model simultaneously.
Stepwise regression: predictors are selected on purely mathematical criteria rather than theory, so it is best avoided.

Comparing models

Hierarchical methods involve adding predictors to the model in stages, and it is useful to assess the improvement to the model at each stage.
A simple way to quantify the improvement is to compare R² for the new model to that for the old model.

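F_change = ((N – k_new – 1)R²_change) / (k_change(1 – R²_new))

Here R²_change is the difference between the R² of the new model and that of the old model, k_change is the number of predictors added, and R²_new and k_new are the R² and the number of predictors of the new model.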

We can compare models using this F-statistic.
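
A minimal sketch of this comparison, with the new model's R² in the denominator. All values (N = 100, an old model with one predictor and R² = 0.30, a new model with three predictors and R² = 0.50) are hypothetical, as is the function name.

```python
def f_change(n, r2_old, k_old, r2_new, k_new):
    """F-statistic for the improvement in R-squared when predictors are added."""
    r2_change = r2_new - r2_old
    k_change = k_new - k_old
    return ((n - k_new - 1) * r2_change) / (k_change * (1 - r2_new))


# Hypothetical models, for illustration only.
f = f_change(n=100, r2_old=0.30, k_old=1, r2_new=0.50, k_new=3)
print(f)
```

The resulting F is evaluated with k_change and N – k_new – 1 degrees of freedom.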

Multicollinearity

Multicollinearity exists when there is a strong correlation between two or more predictors.
Perfect collinearity: when at least one predictor is a perfect linear combination of the others.

As collinearity increases, three problems arise:

- Untrustworthy bs: as collinearity increases, so do the standard errors of the b coefficients. Big standard errors for b coefficients mean more variability in these bs across samples, and a greater chance of predictor equations that are unstable across samples and of b coefficients in the sample that are unrepresentative of those in the population.
- It limits the size of R.
- Importance of predictors: it makes it difficult to assess the individual importance of a predictor.

Variance inflation factor (VIF): indicates whether a predictor has a strong linear relationship with the other predictor(s). The tolerance statistic is its reciprocal.

If the largest VIF is greater than 10, this indicates a serious problem.
If the average VIF is substantially greater than 1, then the regression may be biased.
Tolerance below 0.2 indicates a potential problem.
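
To make these statistics concrete, the sketch below computes tolerance and VIF for the simplest case of exactly two predictors, where tolerance is 1 – r² (with r the correlation between the two predictors) and VIF is 1 / tolerance. This definition is standard but not spelled out in the summary, so treat it as an assumption. With more than two predictors, r² would be replaced by the R² from regressing each predictor on all the others. The data and function names are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) for a model with exactly two predictors."""
    tolerance = 1 - pearson_r(x1, x2) ** 2
    return 1 / tolerance, tolerance


# Hypothetical predictor values, for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif, tolerance = vif_two_predictors(x1, x2)
print(vif, tolerance)
```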


This content is related to:

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition


Author: SanneA
