Multiple Regression (12)

## 12. Multiple Regression

Simple regression (see chapter 11) predicts a dependent variable as a function of a single independent variable, but often multiple variables are at play. Multiple regression is used to determine the simultaneous effect of several independent variables on a dependent variable. The model is fitted using the least squares principle.

## 12.1. The model

As with simple regression, the first step in the model development is model specification, the selection of the model variables and functional form of the model. This is influenced by the model objectives, namely: (1) predicting the dependent variable, and/or (2) estimating the marginal effect of each independent variable. The second objective is hard to achieve, however, in a model with multiple independent variables, because these variables are not only related to the dependent variable but also to each other. This leaves a web of effects that is not easily untangled.

To make multiple regression models more realistic, an error term ε is added. This recognizes that none of the relationships described in the model will hold exactly, and that there are likely to be variables that affect the dependent variable but are not included in the model.
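Written out in full (standard form, with K independent variables and observations indexed by i), the population model is:

yi = β0 + β1x1i + β2x2i + … + βKxKi + εi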

## 12.2. Estimating Coefficients

Multiple regression coefficients are calculated with the least squares procedure. However, this is again more complicated than in simple regression, because the independent variables are related not only to the dependent variable but also to each other, making it impossible to identify the unique effect of each independent variable on the dependent variable. As a result, the higher the correlations between two or more of the independent variables in a model, the less reliable the estimated regression coefficients are.

There are 5 assumptions for standard multiple regression. The first 4 are the same as for simple regression (see chapter 11). The 5th states that it is not possible to find a set of nonzero numbers such that the corresponding weighted sum of the independent variables is identically zero; in other words, no independent variable may be an exact linear combination of the others. This assumption excludes the cases in which there is an exact linear relationship between independent variables. In most cases it will not be violated if the model is properly specified.

Whereas in simple regression the least squares procedure finds the line that best represents the set of points, multiple regression finds the plane (or, with more than two independent variables, the hyperplane) that best represents these points, as each variable is represented by its own dimension.
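As a sketch of what the least squares procedure computes, the following pure-Python example (made-up data; `fit_ols` is a hypothetical helper name, not from the text) solves the normal equations (XᵀX)b = Xᵀy for a model with two independent variables:

```python
def fit_ols(rows, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2; rows is a list of (x1, x2)."""
    X = [[1.0, x1, x2] for x1, x2 in rows]   # design matrix with intercept column
    n = len(X)
    # Normal equations: (X'X) b = X'y
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting
    A = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    b = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        b[r] = (A[r][3] - sum(A[r][c] * b[c] for c in range(r + 1, 3))) / A[r][r]
    return tuple(b)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit recovers
# the coefficients (up to floating point): b0 ≈ 1, b1 ≈ 2, b2 ≈ 3.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b0, b1, b2 = fit_ols(rows, y)
```

When the independent variables are highly correlated, the off-diagonal entries of XᵀX make the system nearly singular, which is the algebraic face of the unreliable-coefficients warning above.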

It is important to be aware that in multiple regression it is not possible to know which independent variable is responsible for which change in the dependent variable. After all, each estimated slope coefficient is affected by the correlations between all independent variables and the dependent variable. This also means that any multiple regression coefficient depends on all the independent variables in the model; such coefficients are therefore referred to as conditional coefficients. This holds in all multiple regression models, unless two independent variables happen to have a sample correlation of exactly zero (which is very unlikely). Because of this effect, highly correlated independent variables should be avoided where possible, so as to minimize their influence on the estimated coefficients. This is also why proper model specification, based on an adequate understanding of the problem context and theory, is crucial in multiple regression models.

## 12.3. Inferences with Multiple Regression Equations

Multiple regression isn’t exact, and the variability of the dependent variable is only partly explained by the linear function of the independent variables. A commonly used summary of the explained variability is the mean square regression (MSR): the sum of squares regression adjusted for the number of independent variables. It is calculated as follows:
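The standard formula, with K the number of independent variables:

MSR = SSR / K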

In multiple regression the sum-of-squares decomposition is performed as follows:

sum of squares total (SST) = sum of squares regression (SSR) + sum of squares error (SSE)

Which can be interpreted as:

total sample variability = explained variability + unexplained variability
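In symbols (standard definitions, where ŷi is the predicted value and ȳ the sample mean):

SST = Σ(yi − ȳ)²,   SSR = Σ(ŷi − ȳ)²,   SSE = Σ(yi − ŷi)²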

As with simple regression the SSE can be used to calculate the estimated variance of population model errors, which is used for statistical inference.

Another useful measure is R², the coefficient of determination. R² describes the strength of the linear relationship between the independent variables and the dependent variable.

The equation is as follows:
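In its standard form:

R² = SSR / SST = 1 − SSE / SST

so R² lies between 0 and 1, with values near 1 indicating that the model explains most of the sample variability.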

Be aware that R² can only be used to compare regression models if they are based on the same set of sample observations of the dependent variable.

Using R² as an overall measure of the quality of a fitted equation has one potential problem: the SSR (explained sum of squares) will increase as more independent variables are added to the model, even if the added variables are not important predictors. This increase in SSR leads to a misleadingly high R².

This problem can be avoided by calculating the adjusted coefficient of determination:
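The standard adjusted formula, with n observations, penalizes additional variables through the degrees of freedom:

R̄² = 1 − [SSE / (n − K − 1)] / [SST / (n − 1)]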

In which K stands for the number of independent variables.

Lastly, R, the coefficient of multiple correlation, is the correlation between the observed and predicted values of the dependent variable. It is equal to the square root of the coefficient of determination:
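In symbols (the positive square root is taken):

R = √R²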

## 12.4. Confidence Intervals & Hypothesis Tests for Coefficients

Both confidence intervals and hypothesis tests for estimated regression coefficients in multiple regression models depend on the variance of the coefficients and the probability distribution of the coefficient.

In a multiple regression model the dependent variable has the same normal distribution and variance as the error term, ε. This means that the regression coefficients also have a normal distribution, and their variance can be derived from the linear relationship between the dependent variable and the regression coefficients. Again this involves the same kind of calculation as in simple regression, only more complicated.

The error term is made up of a large number of components with random effects, which is why it can generally be assumed that it is normally distributed. Interestingly, because of the central limit theorem, the coefficient estimates are generally normally distributed even if ε is not, meaning that the use of ε does not affect the developed hypothesis tests and confidence intervals.

The problematic factor that remains with multiple regression models is that the multitude of relations often leads to interpretation errors. In particular, the correlations between the independent variables influence both the confidence intervals and the hypothesis tests, and increase the variance of the coefficient estimators. This variance is thus conditional on the entire set of independent variables in the model.

In order to get a good coefficient estimate there should be, if possible: (1) a wide range for the independent variables, (2) independent variables that have low correlations, and (3) a model that is close to all data points. It is not always possible to make such choices, but by being aware of these effects good judgements can be made about the applicability of available models.

Other effects of the multitude of independent variables are:

• An increase in the correlation between the independent variables causes the variance of the coefficient estimators to increase. This is because it becomes more difficult to separate the individual effects of the independent variables on the dependent variable.
• An increase in the number of independent variables makes the algebraic structure of the model more complex. The importance of the influences on the coefficient variance remains the same.

A coefficient variance estimator is denoted s²b and is calculated as follows:
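For the case of two independent variables, the standard estimator for the variance of b1 is:

s²b1 = s²e / [(n − 1) · s²x1 · (1 − r²x1x2)]

where s²e is the estimated error variance, s²x1 the sample variance of x1, and rx1x2 the sample correlation between the two independent variables. The (1 − r²x1x2) term in the denominator shows directly how correlated independent variables inflate the coefficient variance.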

The square root of a variance estimator is known as the coefficient standard error.

Hypothesis tests for regression coefficients are developed using the coefficient variance estimates. In a multiple regression model the hypothesis test H0: βj = 0 is used most often, as it determines whether a specific independent variable is conditionally important in the model, given the other variables. A conclusion can be drawn immediately in this case, using the printed Student’s t-statistic or the p-value.
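The test statistic has the familiar form, with n − K − 1 degrees of freedom:

t = bj / sbj

H0: βj = 0 is rejected at significance level α when |t| exceeds the critical value t(n−K−1, α/2).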

It is important to note that these hypothesis tests are only valid for the particular set of variables included in the regression model, as the tests are conditional on exactly those variables. Adding further predictor variables invalidates the tests.

## 12.5. Testing Regression Coefficients

It can also occur that the focus of interest lies on the effect of the combination of several variables. This is then calculated as follows:

1. Hypothesis tests are presented to determine whether sets of coefficients are simultaneously equal to zero. If this hypothesis is accepted, it means that none of the independent variables in the model is statistically significant (i.e. none provides useful information). This hypothesis is almost always rejected in an applied regression situation.
This hypothesis is tested using “the partitioning of variability” (SST = SSR + SSE; see 12.3). From this an F statistic is calculated and compared with the critical value of F in Table 9 (Appendix A) at significance level α. If the calculated value is larger than the value in the table, the null hypothesis can be rejected, leading to the conclusion that at least one coefficient is not equal to zero.
2. Next a hypothesis test is developed for a subset of regression parameters that is worth looking into. This test can be used to determine whether the combined effect of several independent variables is significant within the regression model.
The test is conducted by comparing the SSE from the complete regression model to the SSE(R) from a restricted model that excludes the independent variables being tested. If the calculated F is larger than the critical value of F, the null hypothesis can be rejected, and it can be concluded that the variables should be included in the model.
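In symbols, the two tests use the standard F statistics:

1. All K coefficients:  F = MSR / MSE = (SSR / K) / (SSE / (n − K − 1)), with K and n − K − 1 degrees of freedom.
2. A subset of r coefficients:  F = [(SSE(R) − SSE) / r] / [SSE / (n − K − 1)], with r and n − K − 1 degrees of freedom.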

It is also possible to test the hypothesis that a single independent variable, given the other independent variables in the model, does not improve the prediction of the dependent variable, using the same method of calculation as in step 2 above. This can also be done with a Student’s t test, which will yield the same conclusion as an F-test.

## 12.6. Predicting the Dependent Variable

An important use of a regression model is to predict the value of the dependent variable, given values for the independent variables. These forecasts can be calculated using the coefficient estimates. Besides the predicted value itself, it is also desirable to have a confidence interval for the expected value (with probability 1 − α) or a prediction interval, which is wider because it also accounts for the random error ε of an individual outcome. To calculate these intervals, estimates of the standard deviations of the expected values and of the individual points are needed. These calculations are, again, of the same kind as in simple regression, but more complicated; for this reason they are normally done using statistical software.

## 12.7. Non-linear Models

Regression models assume a relationship that is linear in the coefficients, but sometimes a nonlinear relationship needs to be analysed. Luckily there are ways to transform regression models so they can be used for broader applications. This is possible because the assumptions about the independent variables in multiple regression are very loose. Non-linear models that can be used are as follows:

• Quadratic models: This is the simplest non-linear specification. To estimate the coefficients of a quadratic model, the variables first need to be transformed so that the model becomes linear. This can be done simply as follows:
Quadratic function:  Y = β0 + β1X1 + β2X1² + ε
Transformation:   z1 = x1   &   z2 = x1²

Linear function:  yi = β0 + β1z1i + β2z2i + εi
Transforming the variables means the model can be estimated as a linear multiple regression model, while the results can be interpreted in terms of the original nonlinear model. For adequate interpretation, however, the linear and quadratic coefficients need to be considered together.
• Logarithmic models: Exponential functions have constant elasticity and are widely used in the analysis of market behaviour. The transformations and calculations associated with this model are more complicated, but they are included in any quality statistical software.
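The quadratic transformation can be sketched in a few lines of Python (toy data; with three observations and three parameters the least squares fit is exact, so the equations can be solved directly):

```python
# Quadratic model y = b0 + b1*x + b2*x**2 made linear via z1 = x, z2 = x**2.
x = [0.0, 1.0, 2.0]
y = [2 + xi - 0.5 * xi ** 2 for xi in x]    # true curve: 2 + x - 0.5x^2
z = [(xi, xi ** 2) for xi in x]             # transformed regressors (z1, z2)

# Exact fit: solve b0 + b1*z1 + b2*z2 = y for the three observations.
b0 = y[0]                                   # the x = 0 row gives b0 directly
a11, a12, r1 = z[1][0], z[1][1], y[1] - b0  # row x = 1: 1*b1 + 1*b2 = r1
a21, a22, r2 = z[2][0], z[2][1], y[2] - b0  # row x = 2: 2*b1 + 4*b2 = r2
b2 = (r2 - (a21 / a11) * r1) / (a22 - (a21 / a11) * a12)
b1 = (r1 - a12 * b2) / a11
# Recovers b0 = 2, b1 = 1, b2 = -0.5: the original quadratic coefficients.
```

In a real analysis the transformed (z1, z2) columns would simply be passed to least squares software; the point is only that the transformed model is linear in its coefficients.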

## 12.8. Dummy Variables

Thus far the independent variables have been assumed to vary over a range. But it is also possible for a variable to be categorical. In the simplest case such a variable takes only two values, x = 0 and x = 1 (a categorical variable with more than two categories is represented by several such variables, one fewer than the number of categories). This construct is called a dummy variable, or an indicator variable.

Introducing a dummy variable into a multiple regression model shifts the linear relationship between the dependent variable and the other independent variables by the coefficient β of that dummy variable.

When there is a shift of the linear function by identifiable categorical factors, a dummy variable with values of 1 and 0 can estimate this shift effect. A dummy variable can also be used to model and test for differences in the slope coefficient, by adding an interaction variable.
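A minimal sketch of the shift effect (made-up numbers): when a 0/1 dummy is the only regressor, the least squares solution reduces to the two group means, so the dummy's coefficient is exactly the shift between the groups.

```python
# y observed for two groups; d = 0/1 is the dummy (indicator) variable.
y = [10.0, 12.0, 11.0, 15.0, 17.0, 16.0]
d = [0, 0, 0, 1, 1, 1]

mean0 = sum(yi for yi, di in zip(y, d) if di == 0) / d.count(0)
mean1 = sum(yi for yi, di in zip(y, d) if di == 1) / d.count(1)

# Least squares fit of y = b0 + b1*d: intercept = mean of the d = 0 group,
# dummy coefficient = shift between the group means.
b0, b1 = mean0, mean1 - mean0   # here b0 = 11.0 and b1 = 5.0
```

With other independent variables in the model the same idea applies: the dummy's coefficient shifts the fitted line up or down without changing its slope (unless an interaction term is added).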

Summary by WorldSupporter author Dara Yapp.