Applying multiple regression

Predicting and explaining (causal) relations often involves more than two variables, because a phenomenon is usually influenced by multiple factors.

Using multiple regression has three advantages over using Pearson correlations.

First, it provides the optimal prediction of Y from a combination of X variables. Second, it allows you to determine how good that prediction is, by examining the total contribution of the set of predictors. Finally, it allows you to determine the contribution of each predictor separately. Note that the optimal prediction is not necessarily a correct prediction. The last advantage can be used to examine a causal relation more clearly, or to determine the added value of a predictor.

The formula for multiple regression is:

\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \]

  • ŷ: predicted or expected value of the dependent variable Y
  • x1 through xp: distinct independent or predictor variables
  • b0: the constant (intercept), i.e. the predicted value of Y when all of the independent variables (x1 through xp) are equal to zero
  • b1 through bp: the estimated regression coefficients
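As a minimal sketch of how the coefficients in this formula can be estimated, the example below fits a regression with two predictors by least squares. The data are hypothetical and were constructed without noise, and NumPy is an assumption here, not something the text prescribes.

```python
import numpy as np

# Hypothetical toy data: two predictors for six cases.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
# Outcome built exactly as y = 1 + 2*x1 + 0.5*x2 (no noise),
# so the fit should recover these weights.
y = 1.0 + 2.0 * x1 + 0.5 * x2

# Design matrix: a column of ones for the constant b0,
# followed by one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of (b0, b1, b2).
coef = np.linalg.lstsq(X, y, rcond=None)[0]
b0, b1, b2 = coef

# Predicted values from the regression equation.
y_hat = b0 + b1 * x1 + b2 * x2
print(coef)
```

With real data the outcome contains noise, so the predicted values will differ from the observed ones.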

Multiple correlations

The multiple correlation (R) is the correlation between the observed and the predicted values of Y. It has a value between 0 and 1 and, unlike the Pearson correlation, cannot be negative.

R² refers to the proportion of explained variance of Y, in which a higher R² indicates a better prediction. Because R² in a sample tends to overestimate the explained variance, one can use the adjusted R², which is calculated as:

\[R^2_{adjusted}=1-\frac{(1-R^2)(N-1)}{N-p-1}\]

  • R²: proportion of explained variance of Y
  • N: number of points in your data sample
  • p: number of independent regressors, i.e. the number of variables in your model, excluding the constant
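The adjustment can be sketched as a small function. The function name and the example values are hypothetical; the calculation follows the formula above.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n cases and p predictors (constant excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("the sample size must exceed p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: the same R^2 of 0.50 with 3 predictors,
# once in a small sample and once in a large sample.
small_sample = adjusted_r2(0.50, n=24, p=3)
large_sample = adjusted_r2(0.50, n=204, p=3)
print(small_sample, large_sample)
```

The correction is larger in the small sample, which reflects that R² is more optimistic when there are few cases per predictor.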

Partial and semi-partial correlation

The (semi-)partial correlation coefficients control for the effect of one or more other variables. 

Partial correlation

The partial correlation r01.2 is the correlation between two variables (0 and 1) after the influence of a third variable (2) has been removed from both of them. More than one variable can be removed in this way.

Imagine that we want to examine the relation between income and school achievement. We find a significant correlation between these two variables. However, this does not necessarily mean that success at school results in a higher income. The correlation might also be explained by IQ, for example, if a higher IQ causes both higher school achievement and a higher income. One way to examine this is to calculate the partial correlation between school achievement and income, after removing IQ from both variables.

For the partial correlation, we conduct a separate regression analysis for each of the two variables, with the variables to be controlled as predictors (in the example: income predicted from IQ and school achievement predicted from IQ). We take the residuals of both analyses. These are the parts of the variables that are not explained by IQ. The correlation between the two sets of residuals is the partial correlation.

The notation of the partial correlation coefficient is r01.23..p, in which the correlated variables are written to the left of the dot and the controlled variables to the right of the dot.

The squared partial correlation is the proportion of the variance in Y not explained by the controlled variables that is explained by the predictor.
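The residual procedure described above can be sketched in plain Python. The scores for IQ, school achievement and income are invented for illustration, and the helper functions are hypothetical names, not part of any library.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

def residuals(y, x):
    """Residuals of the simple regression of y on x (the part of y not explained by x)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical scores for eight people.
iq = [90, 95, 100, 105, 110, 115, 120, 125]
school = [6.0, 5.5, 7.0, 6.5, 7.5, 7.0, 8.5, 8.0]
income = [20, 24, 23, 27, 30, 29, 33, 36]

# Ordinary correlation between school achievement and income.
r_raw = pearson(school, income)

# Partial correlation: correlate what is left of both variables after removing IQ.
r_partial = pearson(residuals(school, iq), residuals(income, iq))
print(r_raw, r_partial)
```

Comparing the two values shows how much of the original correlation remains once IQ is controlled for.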

Semi-partial correlation

The semi-partial correlation is the correlation between criterion Y and a controlled (partialled) predictor variable. While the partial correlation removes a variable from both the criterion and the predictor, here we only remove a variable from the predictor. The semi-partial correlation is the correlation of Y with that part of X1 that is independent of X2 (the residual). The notation of the semi-partial correlation is r0(1.2), in which we remove variable 2 from predictor 1. The squared semi-partial correlation equals the increase in explained variance when predictor 1 is added to a model that already contains predictor 2:

\[r^2_{0(1.2)} = R^2_{0.12} - r^2_{02}\]
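The difference between the two coefficients can be illustrated with the standard two-predictor formulas, which express both in terms of the three pairwise correlations. These formulas are not given in the text above and the correlation values are hypothetical, so treat this as a sketch under those assumptions.

```python
from math import sqrt

# Hypothetical correlations: 0 = criterion Y, 1 = predictor X1, 2 = predictor X2.
r01, r02, r12 = 0.5, 0.4, 0.3

# Semi-partial correlation: X2 is removed from X1 only.
r_semi = (r01 - r02 * r12) / sqrt(1 - r12 ** 2)

# Partial correlation: X2 is removed from both Y and X1.
r_part = (r01 - r02 * r12) / sqrt((1 - r02 ** 2) * (1 - r12 ** 2))

# Squared multiple correlation of Y with X1 and X2 together.
R2_0_12 = (r01 ** 2 + r02 ** 2 - 2 * r01 * r02 * r12) / (1 - r12 ** 2)

# Increase in explained variance when X1 is added to a model with X2 only.
r2_increase = R2_0_12 - r02 ** 2
print(r_semi ** 2, r2_increase)
```

The two printed values coincide, which is the identity between the squared semi-partial correlation and the gain in explained variance.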

Constants and regression weights

In general, the constant has no intrinsic meaning for researchers and is therefore difficult to interpret. The regression weights can also be difficult to interpret, because the measurement units are often arbitrary. This also makes it difficult to determine which predictor is most important. The latter problem can be resolved by using standardized regression weights, which are denoted by β (beta).

Standardized weights are independent of measurement units, so different predictors can be compared directly. The drawback is that they depend on the standard deviations within the sample, which is especially problematic if you want to compare different studies. Regression weights are always partial: each weight is corrected for the effects of all other variables in the equation, and is only valid for that set of variables. A regression weight can therefore not be interpreted in isolation, but only in the context of the other predictors.
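One common way to obtain a standardized weight is to rescale the unstandardized weight by the standard deviations of the predictor and of Y. The sketch below uses that conversion; the function name and all numbers are hypothetical.

```python
def standardized_weight(b: float, sd_x: float, sd_y: float) -> float:
    """Convert an unstandardized weight b into a standardized weight (beta)."""
    return b * sd_x / sd_y

# Hypothetical predictors measured on very different scales.
sd_y = 20.0
beta_1 = standardized_weight(b=0.8, sd_x=15.0, sd_y=sd_y)    # e.g. a test score
beta_2 = standardized_weight(b=120.0, sd_x=0.1, sd_y=sd_y)   # e.g. a proportion
print(beta_1, beta_2)
```

The unstandardized weights differ by a factor of 150, yet both standardized weights are 0.6. The example also shows the drawback mentioned above: a sample with different standard deviations would give different β values for the same b.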

Testing: from samples to population

So far, we have only looked at descriptive statistics. However, we can also use inferential statistics to say something about the population from which the sample is drawn. To determine whether the total contribution of all predictors differs from zero, the F-test can be used. To determine the unique contribution of each predictor, a t-test can be conducted for each predictor. However, the more predictors (and thus the more t-tests), the larger the chance of a type I error. Therefore, the F-test is used as a kind of ‘gatekeeper’ to determine whether the t-tests should be considered at all. Only if the F-test is significant are the t-tests interpreted.

The F-test is calculated as:

\[F=\frac{(N-p-1)R^2}{p(1-R^2)}\]

  • R²: proportion of explained variance of Y
  • N: number of points in your data sample
  • p: number of independent regressors, i.e. the number of variables in your model, excluding the constant
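As a sketch, the F-value can be computed directly from R², the sample size N and the number of predictors p. The function name and the example values are hypothetical; the degrees of freedom are p for the numerator and N − p − 1 for the denominator.

```python
def f_statistic(r2: float, n: int, p: int) -> float:
    """F = (N - p - 1) * R^2 / (p * (1 - R^2)) for N cases and p predictors."""
    return (n - p - 1) * r2 / (p * (1.0 - r2))

# Hypothetical example: R^2 = 0.50 with 3 predictors and 24 cases.
n, p = 24, 3
f_value = f_statistic(0.50, n, p)
df_numerator, df_denominator = p, n - p - 1
print(f_value, df_numerator, df_denominator)
```

The resulting value is compared with the critical value of the F distribution for these degrees of freedom, which statistical software or a table provides.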

The t-test is used to check the significance of individual regression coefficients in the multiple linear regression model. Adding a significant variable to a regression model improves its predictions, while adding an unimportant variable may make the model worse.

The hypotheses for testing the significance of a particular regression coefficient βj are:

H0 : βj = 0

H1 : βj ≠ 0

The test statistic for this test is based on the t distribution (and is similar to the one used in the case of simple linear regression models):

\[T_0=\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}\]
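The sketch below computes this test statistic for each coefficient of a small regression. The data are hypothetical, NumPy is an assumption, and the standard errors use the usual least-squares formula (residual variance times the diagonal of the inverse of X'X), which the text above does not spell out.

```python
import numpy as np

# Hypothetical data: two predictors and an outcome with some noise.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
noise = np.array([0.10, -0.20, 0.15, -0.05, -0.10, 0.10])
y = 1.0 + 2.0 * x1 + 0.5 * x2 + noise

# Design matrix with a column of ones for the constant.
X = np.column_stack([np.ones_like(x1), x1, x2])
n, k = X.shape                      # k = p + 1 (predictors plus constant)

# Least-squares coefficients and residuals.
coef = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ coef

# Residual variance with N - p - 1 degrees of freedom.
df = n - k
s2 = resid @ resid / df

# Standard error of each coefficient, then t = coefficient / standard error.
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = coef / se
print(t, df)
```

Each t-value is compared with the critical value of the t distribution with N − p − 1 degrees of freedom.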

Assumptions

Different assumptions have to be met:

  1. The dependent variable should be interval scaled; predictors can be binary or interval scaled. Fortunately, multiple regression is fairly robust against small deviations from the interval level.

  2. There is a linear relation between the predictors and the dependent variable. With a standard multiple regression, only linear relations can be identified (and, for example, no curvilinear relations). Deviations from linearity can be detected with a residual plot.

  3. The residuals have (a) a normal distribution (b) the same variance for all values of the linear combinations of predictors and (c) are independent of each other.

The assumption of normally distributed residuals is the least critical, because regression tests are robust against violations when the sample is large enough (N > 100). The assumption is often checked with a histogram. The assumption of homoscedasticity (3 (b)) should be checked carefully, because regression is not robust against violations of it. A residual plot is used for this. The last assumption (independence of errors, 3 (c)) is very important, but difficult to check. Fortunately, most research designs meet this assumption. Checking assumptions therefore always depends on the judgment of the researcher, and different researchers may reach different conclusions.

Multicollinearity and outliers

Outliers are scores three or more standard deviations above or below the mean. It is important to consider why the score of an individual is an outlier in the analysis. In addition, outliers can have a disproportionate influence on the regression weights. If you decide to exclude outliers from the analysis, it is good practice to be very explicit about this in your report, and to note why you chose to do so.
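A minimal sketch of this rule converts each score to a z-score and flags scores that lie three or more standard deviations from the mean. The scores are hypothetical, and the sample standard deviation is used here as an assumption.

```python
from statistics import mean, stdev

# Hypothetical scores: nineteen ordinary values and one extreme value.
scores = [8, 9, 10, 11, 12] * 3 + [9, 10, 11, 10, 40]

m, s = mean(scores), stdev(scores)

# z-score: distance from the mean in standard deviations.
z_scores = [(x - m) / s for x in scores]

# Flag scores at least three standard deviations away from the mean.
outliers = [x for x, z in zip(scores, z_scores) if abs(z) >= 3]
print(outliers)
```

Note that an extreme score inflates the standard deviation itself, so in very small samples this rule may fail to flag anything.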

Different problems may arise when the correlations between predictor variables are strong (multicollinearity). Sometimes the regression cannot be estimated at all. In other cases, the estimates are unreliable or the results are difficult to interpret. To check for multicollinearity, you can check the tolerance of each predictor (it should exceed 0.10).

Tolerance is calculated as:

\[Tolerance = 1 - R^2_j\]

  • Rj: the multiple correlation between predictor j and all other predictor variables

Furthermore, you can check the variance inflation factor (VIF), which is calculated as 1/tolerance. This should be as low as possible, and at least below 10, which corresponds to a tolerance above 0.10.
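Both checks can be sketched as a small function that starts from the squared multiple correlation of one predictor with the other predictors. The function name and the example values are hypothetical; a tolerance above 0.10 is the same cut-off as a VIF below 10.

```python
def collinearity_check(r2_j: float) -> tuple[float, float, bool]:
    """Return (tolerance, VIF, flagged) for a predictor with squared multiple correlation r2_j."""
    tolerance = 1.0 - r2_j
    vif = 1.0 / tolerance
    flagged = tolerance <= 0.10
    return tolerance, vif, flagged

# Hypothetical predictors: one almost fully predictable from the others, one not.
problem = collinearity_check(0.95)
fine = collinearity_check(0.36)
print(problem, fine)
```

The first predictor is flagged because 95% of its variance is shared with the other predictors, which leaves a tolerance of about 0.05 and a VIF of about 20.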

Mediating and moderating relations

Mediators and moderators are variables that play a role in the relation between two other variables.

Mediation

A mediator transmits the effect of one variable on another. For example: the relation between the amount of care received from parents and someone's confidence in raising children is mediated by self-confidence (caring parents result in high self-confidence, which in turn results in confidence to raise children).

Baron and Kenny describe three steps that have to be taken in order to demonstrate a mediating effect.

  1. You have to show that the independent variable has a significant relation with the mediator.
  2. You have to show that there is a significant relation between the mediator and the dependent variable and between the independent and dependent variable.
  3. You have to demonstrate that, when the mediator and independent variable are used together to predict the dependent variable, the path between the independent and dependent variable (c) becomes less strong (preferably non-significant).

But what if path ‘c’ does not disappear completely and remains significant? One option is the Sobel test, which tests whether the full mediating path from the independent variable via the mediator to the dependent variable is significant. For this, we need the regression coefficients and standard errors of the two paths. When the standard error of a coefficient (se β) is not reported, it can be derived from the coefficient and its t-value:

\[t = \frac{\beta}{se_\beta}\]

so

\[se_\beta=\frac{\beta}{t}\]
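A sketch of the Sobel test, assuming the commonly cited first-order version of the statistic, z = ab / sqrt(b² se_a² + a² se_b²), where a is the path from the independent variable to the mediator and b the path from the mediator to the dependent variable. This formula is not given in the text above, and the coefficients and t-values are hypothetical.

```python
from math import sqrt

def standard_error(coef: float, t: float) -> float:
    """Recover the standard error from a coefficient and its t-value (se = coef / t)."""
    return coef / t

def sobel_z(a: float, se_a: float, b: float, se_b: float) -> float:
    """First-order Sobel statistic for the indirect path a * b."""
    return (a * b) / sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)

# Hypothetical output of the two regressions.
a, t_a = 0.5, 5.0   # independent variable -> mediator
b, t_b = 0.4, 4.0   # mediator -> dependent variable

se_a = standard_error(a, t_a)
se_b = standard_error(b, t_b)
z = sobel_z(a, se_a, b, se_b)
print(z)
```

The z-value is compared with the standard normal distribution, so an absolute value above 1.96 indicates a significant indirect path at the 5% level (two-sided).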

Moderation

With moderating relations, the relation between the independent and dependent variable changes depending on a third (moderator) variable. For example: we examine the influence of daily stress events on the number of stress symptoms reported by a student. We find that, under the same amount of stress, a student who receives much social support shows fewer symptoms than a student who receives little social support.
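A common way to examine moderation in a regression is to add the product of the predictor and the moderator as an extra term. That approach is an assumption here, as the text above does not name a method. The sketch uses hypothetical, noise-free data and NumPy.

```python
import numpy as np

# Hypothetical data: number of daily stress events and social support (0 = little, 1 = much).
stress = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
support = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Symptoms built exactly (no noise) so that the effect of stress is weaker with much support.
symptoms = 2.0 + 1.5 * stress - 0.5 * support - 1.0 * stress * support

# Design matrix: constant, both main effects, and the product (interaction) term.
X = np.column_stack([np.ones_like(stress), stress, support, stress * support])
b0, b_stress, b_support, b_interaction = np.linalg.lstsq(X, symptoms, rcond=None)[0]

# Effect of one extra stress event at each level of support.
slope_little_support = b_stress
slope_much_support = b_stress + b_interaction
print(slope_little_support, slope_much_support)
```

A non-zero weight for the product term means that the slope of stress differs between the two levels of support, which is what moderation refers to.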

Stats for students: Simple steps for passing your statistics courses


How to triumph over the theory of statistics (without understanding everything)?

Stats of students

  • The first years that you follow statistics, it is often a case of taking knowledge for granted and simply trying to pass the courses. Don't worry if you don't understand everything right away: in later years it will fall into place, and you will see the importance of the theory you had to know before.
  • The book you need to study may be difficult to understand at first. Be patient: later in your studies, the effort you put in now will pay off.
  • Be a Gestalt Scientist! In other words, recognize that the whole of statistics is greater than the sum of its parts. It is very easy to get hung up on nit-picking details and fail to see the forest for the trees.
  • Tip: Precise use of language is important in research. Try to reproduce the theory verbatim (i.e. learn by heart) where possible. With that, you don't have to understand it yet, you show that you've been working on it, you can't go wrong by using the wrong word and you practice for later reporting of research.
  • Tip: Keep study material, handouts, sheets, and other publications from your teacher for future reference.

How to score points with formulas of statistics (without learning them all)?

  • The direct relationship between data and results consists of mathematical formulas. These follow their own logic, are written in their own language, and can therefore be complex to comprehend.
  • If you don't understand the math behind statistics, you don't understand statistics. This does not have to be a problem, because statistics is an applied science from which you can also get excellent results without understanding. None of your teachers will understand all the statistical formulas.
  • Please note: you will probably have to know and understand a number of formulas, so that you can demonstrate that you know the principle of how statistics work. Which formulas you need to know differs from subject to subject and lecturer to lecturer, but in general these are relatively simple formulas that occur frequently, and your lecturer will likely tell you (often several times) that you should know this formula.
  • Tip: if you want to recognize statistical symbols, you can use: Recognizing commonly used statistical symbols
  • Tip: have fun with LaTeX! LaTeX code gives us a simple way to write out mathematical formulas and make them look professional. Play with LaTeX. With that, you can include used formulas in your own papers and you learn to understand how a formula is built up – which greatly benefits your understanding and remembering that formula. See also (in Dutch): How to create formulas like a pro on JoHo WorldSupporter?
  • Tip: Are you interested in a career in sciences or programming? Then take your formulas seriously and go through them again after your course.

How to practice your statistics (with minimal effort)?

How to select your data?

  • Your teacher will regularly use a dataset for lessons during the first years of your studying. It is instructive (and can be a lot of fun) to set up your own research for once with real data that is also used by other researchers.
  • Tip: scientific articles often indicate which datasets have been used for the research. There is a good chance that those datasets are valid. Sometimes there are also studies that determine which datasets are more valid for the topic you want to study than others. Make use of datasets other researchers point out.
  • Tip: Do you want an interesting research result? You can use the same method and question, but use an alternative dataset, and/or alternative variables, and/or alternative location, and/or alternative time span. This allows you to validate or falsify the results of earlier research.
  • Tip: for datasets you can look at Discovering datasets for statistical research

How to operationalize clearly and smartly?

  • For the operationalization, it is usually sufficient to indicate the following three things:
    • What is the concept you want to study?
    • Which variable does that concept represent?
    • Which indicators do you select for those variables?
  • It is smart to argue that a variable is valid, or why you choose that indicator.
  • For example, if you want to know whether someone is currently a father or mother (concept), you can search the variables for how many children the respondent has (variable) and then select on a value greater than 0 (indicator). Where possible, use the terms 'concept', 'variable', 'indicator' and 'valid' in your communication. For example, as follows: “The variable [variable name] is a valid measure of the concept [concept name] (if applicable: source). The value [description of the value] is an indicator of [what you want to measure].” (e.g.: The variable "Number of children" is a valid measure of the concept of parenthood. A value greater than 0 is an indicator of whether someone is currently a father or mother.)

How to run analyses and draw your conclusions?

  • The choice of your analyses depends, among other things, on what your research goal is, which methods are often used in the existing literature, and practical issues and limitations.
  • The more you learn, the more independently you can choose research methods that suit your research goal. In the beginning, follow the lecturer – at the end of your studies you will have a toolbox with which you can vary in your research yourself.
  • Try to link up as much as possible with research methods that are used in the existing literature, because otherwise you could be comparing apples with oranges. Deviating can sometimes lead to interesting results, but discuss this with your teacher first.
  • For as long as you need, keep a step-by-step plan at hand on how you can best run your analysis and achieve results. For every analysis you run, there is a step-by-step explanation of how to perform it; if you do not find it in your study literature, it can often be found quickly on the internet.
  • Tip: Practice a lot with statistics, so that you can show results quickly. You cannot learn statistics by just reading about it.
  • Tip: The measurement level of the variables you use (ratio, interval, ordinal, nominal) largely determines the research method you can use. Show your audience that you recognize this.
  • Tip: conclusions from statistical analyses are never certain, but at most likely. There is usually a standard formulation for each research method with which you can express the conclusions from that analysis and at the same time indicate that they are not certain. Use that standard wording when communicating about results from your analysis.
  • Tip: see explanation for various analyses: Introduction to statistics