Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Regression analysis is the process of studying associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association, and 3) constructing a regression equation to predict the value of the response variable from the explanatory variable.
The response variable is denoted by y and the explanatory variable by x. A linear function means that the data points in a graph fall along a straight line. A linear function has the form y = α + βx, in which alpha (α) is the y-intercept and beta (β) is the slope.
The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.
The y-intercept is the value of y when x = 0. In that case βx equals 0 and only y = α remains. The y-intercept is where the line crosses the y-axis.
The slope (β) indicates the change in y for an increase of 1 in x, so the slope shows how steep the line is: the larger the absolute value of β, the steeper the line.
When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
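As a small illustration (using made-up values for α and β), the sketch below evaluates a linear function in Python and shows the slope as the change in y per one-unit increase in x.

```python
# Minimal sketch of a linear function y = alpha + beta * x,
# with hypothetical values for the intercept and slope.
alpha, beta = 3.0, 0.5

def predict(x):
    return alpha + beta * x

print(predict(0))                # 3.0: the y-intercept (y when x = 0)
print(predict(1) - predict(0))   # 0.5: y changes by beta when x increases by 1
```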
A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.
In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.
The variable y is estimated by ŷ. The equation is estimated by the prediction equation ŷ = a + bx, the line that lies closest to all data points. In the prediction equation, a = ȳ – bx̄ and b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)².
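A minimal sketch of these formulas, computing a and b for a small made-up data set (all numbers hypothetical):

```python
# Least squares estimates for the prediction equation y_hat = a + b*x:
# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b * x_bar
x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 6]   # hypothetical response values
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
print(f"prediction equation: y_hat = {a:.2f} + {b:.2f}x")  # 1.80 + 0.80x
```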
A regression outlier is a data point far outside the trend of the other data points. It's called influential when removing it would cause a big change for the prediction equation. The effect is smaller for large datasets. Sometimes it's better for the prediction equation to leave the outlier out and explain this when reporting the results.
The prediction equation estimates the values of y, but they won't completely match the actual observed values. Studying the differences indicates the quality of the prediction equation. The difference between an observed value (y) and the predicted value (ŷ) is called a residual: y – ŷ. When the observed value is bigger, the residual is positive; when the observed value is smaller, the residual is negative. The smaller the absolute value of the residual, the better the prediction.
The best prediction equation has the smallest residuals. To find it, the SSE (sum of squared errors) is used. The SSE measures how well ŷ predicts y. The formula is: SSE = Σ(y – ŷ)².
The least squares estimates a and b in the least squares line ŷ = a + bx have the values for which the SSE is as small as possible. The result is the best possible line that can be drawn through the data. In most software the SSE is called the residual sum of squares.
The residuals of the best regression line are both negative and positive (squaring makes them all positive in the SSE), and their sum and mean are 0. The best line passes through (x̄, ȳ), the center of the data.
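Continuing the hypothetical data above, the sketch below computes the residuals and the SSE and checks both properties: the residuals sum to 0, and the line passes through (x̄, ȳ).

```python
# Residuals y - y_hat, their sum, and the SSE for the least squares
# line fitted above (hypothetical data, a = 1.8, b = 0.8).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
a, b = 1.8, 0.8

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

print(round(sum(residuals), 10))             # 0.0: residuals sum to zero
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
print(abs((a + b * x_bar) - y_bar) < 1e-9)   # True: line passes through (x_bar, y_bar)
print(sse)                                   # 2.4: sum of squared errors
```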
In y = a + bx there is exactly one y-value for every x-value. This is a deterministic model. Usually this isn't how reality works. For instance, when age (x) predicts the number of relationships someone has been in (y), not everybody has had the same number at age 22. In that case a probabilistic model is better: a model that allows variability in the y-values. The data can then be visualized in a conditional distribution, a distribution with the extra condition that x has a certain value.
A probabilistic model describes the mean of the y-values, not the individual values. The mean of the conditional distribution is E(y) = α + βx, in which the symbol E means the expected value. When, for instance, people aged 22 have had different numbers of relationships, the probabilistic model can predict the mean number of relationships.
A regression function is a mathematical equation that describes how the mean of the response variable changes when the value of the explanatory variable changes.
Another parameter of the linear regression model is the standard deviation of a conditional distribution, σ. This parameter measures the variability of the y-values for all subjects with a certain x-value. It is called the conditional standard deviation.
Because the real standard deviation is unknown, the sample standard deviation is used: s = √( Σ(y – ŷ)² / (n – 2) ) = √( SSE / (n – 2) ).
The assumption is made that the standard deviation is the same for every x-value. If the variability differed per conditional distribution, s would indicate the mean variability. The Mean Square Error (MSE) is s². In software the conditional standard deviation has several names: Standard error of the estimate (SPSS), Residual standard error (R), Root MSE (Stata and SAS).
The degrees of freedom for a regression function are df = n – p, in which p is the number of unknown parameters. In E(y) = α + βx there are two unknown parameters (α and β), so df = n – 2.
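A minimal sketch (same hypothetical numbers as above) of the conditional standard deviation and its degrees of freedom:

```python
# s = sqrt(SSE / df) with df = n - 2, since alpha and beta are estimated.
import math

sse, n = 2.4, 5        # SSE and sample size from the sketches above
df = n - 2
s = math.sqrt(sse / df)
print(f"df = {df}, s = {s:.3f}")   # df = 3, s ≈ 0.894 (this is also sqrt(MSE))
```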
The conditional standard deviation depends both on y and on x and is written as σy|x (for the population) and sy|x (for the sample), shortened to σ and s. In a marginal distribution the standard deviation only depends on y, so it is written as σy (for the population) and sy (for the sample). The formula of a point estimate of the marginal standard deviation is: sy = √( Σ(y – ȳ)² / (n – 1) ).
The numerator inside the root, Σ(y – ȳ)², is the total sum of squares. The marginal standard deviation (independent of x) and the conditional standard deviation (dependent on a certain x) can differ.
The slope tells how steep a line is and whether the association is negative or positive, but it doesn't tell how strong the association between two variables is.
The association is measured by the correlation (r), a standardized version of the slope. It is also called the standardized regression coefficient or Pearson correlation. The correlation is the value the slope would have if the two variables had equal variability. In terms of the slope (b), the formula is: r = (sx / sy) b, in which sx is the standard deviation of x and sy is the standard deviation of y.
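A minimal sketch (same hypothetical data) computing r both from the slope, via r = (sx / sy) b, and directly from the data; the two values agree.

```python
# Correlation from the slope and directly via the Pearson formula.
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
b = 0.8                                  # least squares slope from above

s_x, s_y = statistics.stdev(x), statistics.stdev(y)
print(round((s_x / s_y) * b, 4))         # 0.8528: r = (s_x / s_y) * b

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = (sum((xi - x_bar) ** 2 for xi in x)
       * sum((yi - y_bar) ** 2 for yi in y)) ** 0.5
print(round(num / den, 4))               # 0.8528: same value
```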
The correlation has the following characteristics:
It can only be used if a straight line makes sense.
It lies between –1 and +1.
It has the same sign (positive or negative) as b.
If b is 0, then r is 0, because then there is no slope and no association.
The larger the absolute value of r, the stronger the linear association. If r is exactly –1 or +1, the linear association is perfectly negative or perfectly positive, without errors.
r does not depend on the units of measurement.
The correlation implies regression toward the mean: when x increases by one standard deviation, the predicted value of y changes by only r standard deviations. Because |r| ≤ 1, the prediction ŷ tends to lie closer to its mean than x lies to its mean.
The coefficient of determination, r², indicates how well x predicts y. It measures how well the least squares line ŷ = a + bx predicts y compared to the prediction ȳ.
The r² has four elements:
Rule 1: y is predicted without using x. The best prediction is then the sample mean ȳ.
Rule 2: y is predicted by x, using the prediction equation ŷ = a + bx.
E1 is the total prediction error under rule 1 and E2 is the total prediction error under rule 2.
The proportional reduction in error is the coefficient of determination: r² = (E1 – E2) / E1, in which E1 = Σ(y – ȳ)², the total sum of squares (TSS), and E2 = Σ(y – ŷ)², the SSE.
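A minimal sketch (same hypothetical data) of r² as the proportional reduction in error; the result matches r ≈ 0.8528 squared.

```python
# r^2 = (E1 - E2) / E1 = (TSS - SSE) / TSS.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
a, b = 1.8, 0.8

y_bar = sum(y) / len(y)
tss = sum((yi - y_bar) ** 2 for yi in y)                     # E1: errors using y_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # E2: errors using y_hat
print(round((tss - sse) / tss, 4))                           # 0.7273 ≈ 0.8528 ** 2
```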
r² has a number of characteristics similar to r:
Because r lies between –1 and +1, r² lies between 0 and 1.
When SSE = 0, then r² = 1. All points are on the line.
When b = 0, then r² = 0.
The closer r² is to 1, the stronger the linear association.
Neither the units of measurement nor the choice of explanatory variable (x or y) matters for r².
The TSS describes the variability of the observations of y. The SSE describes the variability around the prediction equation. The coefficient of determination indicates the proportion by which the variance of the conditional distribution is smaller than that of the marginal distribution. Because the coefficient of determination uses a squared scale rather than the original scale, some researchers prefer the standard deviation and the correlation, whose information is easier to interpret.
For categorical variables, the chi-squared test is used to test for independence. For quantitative variables, a significance test of the slope or of the correlation provides a test for independence.
The assumptions for inference applied to regression are:
Randomization
The mean of y is approximated by E(y) = α + βx
The conditional standard deviation σ is equal for every x-value
The conditional distribution of y for every x-value is a normal distribution
The null hypothesis is H0: β = 0 (the slope is zero and the variables are independent); the alternative hypothesis is Ha: β ≠ 0.
The t-score is found by dividing the sample slope (b) by the standard error of b: t = b / se. This matches the general form of a t-score: the estimate minus the null hypothesis value (0 in this case), divided by the standard error of the estimate. The P-value is found for df = n – 2. The standard error of b is: se = s / √( Σ(x – x̄)² ), in which s = √( SSE / (n – 2) ). The smaller the conditional standard deviation s, the more precisely b estimates β.
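A minimal sketch of this test for the hypothetical data used above (SciPy is assumed to be available for the P-value):

```python
# t = b / se with se = s / sqrt(sum((x - x_bar)^2)) and df = n - 2.
import math
from scipy import stats

x = [1, 2, 3, 4, 5]
b, sse, n = 0.8, 2.4, 5        # estimates from the sketches above
x_bar = sum(x) / n

s = math.sqrt(sse / (n - 2))
se = s / math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
t = b / se
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided P-value
print(f"t = {t:.3f}, P = {p:.4f}")     # t ≈ 2.828
```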
The population correlation is denoted by the Greek letter ρ (rho). ρ is 0 in exactly the same situations in which β = 0. A test of H0: ρ = 0 is performed in the same way as a test for the slope. For the correlation the formula is: t = r √( (n – 2) / (1 – r²) ).
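A minimal sketch showing that the correlation version gives the same t statistic as the slope test above (hypothetical values):

```python
# t = r * sqrt((n - 2) / (1 - r^2)) reproduces the slope test's t.
import math

r, n = 0.8528, 5
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 3))   # ≈ 2.828, matching t = b / se
```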
When many variables possibly influence a response variable, these can be portrayed in a correlation matrix, which shows the correlation for each pair of variables.
A confidence interval gives more information about a slope than an independence test. The confidence interval for the slope β is b ± t(se), with the t-value taken for df = n – 2.
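A minimal sketch of this interval for the hypothetical data used above (SciPy assumed for the t critical value):

```python
# 95% confidence interval for beta: b ± t * se, with df = n - 2.
from scipy import stats

b, se, n = 0.8, 0.2828, 5                # values from the sketches above
t_crit = stats.t.ppf(0.975, df=n - 2)    # critical value for a 95% interval
print(f"95% CI for beta: ({b - t_crit * se:.2f}, {b + t_crit * se:.2f})")
```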
Calculating a confidence interval for a correlation is more difficult, because the sampling distribution isn't symmetrical unless ρ = 0.
r² indicates how well x predicts y and depends on the TSS (the variability of the observations of y) and the SSE (the variability around the prediction equation). The difference, TSS – SSE, is called the regression sum of squares or the model sum of squares. It is the part of the total variability in y that is explained by x using the least squares line.
Often the assumption is made that a linear association exists. It's important to check the data in a scatterplot first to see whether a linear model makes sense. If the data is U-shaped, then a straight line doesn't make sense. Making this error could cause the result of an independence test of the slope to be wrong.
Other assumptions are that the distribution is normal and that σ is identical for every x-value. Even when the distribution isn't normal, the least squares line, the correlation, and the coefficient of determination remain useful. But if the standard deviation isn't equal across x-values, other methods are more efficient than the least squares line.
Some outliers have big effects on regression lines and correlations, so sometimes outliers need to be left out. Even one point can have a big influence, particularly in a small sample.
The assumption of randomization, both for x and y, is important for the correlation. If there is no randomization and the variability is small, then the sample correlation will be small and it will underestimate the population correlation. For other aspects of regression, like the slope, the assumption of randomization is less important.
The prediction equation shouldn't be extrapolated to (non-existent) data points outside the range of the observed data. This can give absurd results, such as predictions that are physically impossible.
The theoretical risk exists that the mean of y for a certain value of x doesn't estimate the actual individual observation properly. The Greek letter epsilon (ε) denotes the error term; how much y differs from the mean. The population model is y = α + β x + ε and the sample prediction equation is y = a + bx + e. The ε is also called the population residual.
A model is only an approximation of reality. If a model is too simple to describe the data well, it should be adjusted.