Practice Exam 2015/2016: Statistics II for IB – UG

Part 1 - Multiple choice questions
Part 2 - Problem on Multivariate Regression Analysis
Part 3 - Problem on Factor Analysis
Answers Part 1 - Multiple choice questions
Answers Part 2 - Problem on Multivariate Regression Analysis
Answers Part 3 - Problem on Factor Analysis

Part 1 - Multiple choice questions

Question 1

Which of the following statements on Type I and Type II errors is correct?

A type II error is the inability to reject a wrong null hypothesis.
A type II error is the rejection of a true null hypothesis.
A type I error is the inability to reject a wrong null hypothesis.
A type I error is the rejection of a true null hypothesis.

Question 2

Which kind of relation do we have between Type I and II errors?

The probability of a Type II error increases as the probability of a Type I error increases.
Type I and II errors are independent.
The probability of a Type II error decreases as the probability of a Type I error increases.
Type I and II errors are directly proportional.

Question 3

Suppose you have to choose among tests in order to achieve a given power level. In other words, you have a target power for your test. Which statement is correct?

We cannot increase power by choosing a larger alpha level.
Other things being equal, for a greater effect size we need a larger sample size to achieve the target power.
We cannot increase power by choosing a larger sample size.
A smaller alpha level typically requires a larger sample to achieve the target power.

Question 4

A scatter plot of number of teachers (T) and number of people with University degrees in Dutch cities (P) shows a positive relation. Which is the most likely explanation for this positive association?

Teachers at any school always advise students to get a job requiring a University degree, so an increase in T is causing an increase in P.
Larger cities tend to have both more teachers and more people with University degrees. We can then expect T and P to be increasing in the variable "size of the city".
Teaching is a common profession for Dutch people with a high income, so an increase in the number of people with University degree and a high income causes an increase in T.
Dutch cities with higher incomes tend to have more teachers and more people going to the University. We then expect T and P to be increasing in the variable "income". Therefore the causation between T and P is difficult to prove.

Question 5

Which of the following statements regarding scatterplots is correct?

In a scatterplot the value of a variable is displayed as a function of the value of another variable.
In a scatterplot the value of a variable is displayed as a function of space.
In Multivariate Regression Analysis (MRA), scatterplots are the unique tools to check the relation between a dependent variable and an independent variable.
A scatterplot provides insights on just the linear relation between two variables.

Question 6

Consider missing data.

We can replace Missing Completely at Random data solely by their sample mean.
If the missing data are less than 10%, we can replace them solely by sample means.
We consider only the cases with observed values for all the variables.
We can exclude cases listwise or pairwise.

Question 7

A researcher observed that in her survey study about travel expenditures individuals who did not provide their household income tended to be almost exclusively those in the higher income bracket. Which sentence is correct?

Statistical results based on a sample reduced to 40% of its original size are surely biased.
Any statistical results based on data with non-random missing data could be biased.
Individuals from a higher income bracket should be excluded from such a survey.
There is no problem; any kind of statistical analysis can be executed based on this data.

Question 8

What is one of the distinctions between a population parameter and a sample statistic?

The true value of a population parameter can never be known, and the true value of a sample statistic can be computed if there is no missing data.
A sample statistic changes across samples, while a population parameter remains fixed.
A population parameter changes across samples, while a sample statistic remains fixed.
The true value of a sample statistic can never be known but the true value of a population parameter can be known.

Question 9

Which of the following properties would indicate that a dataset is not symmetric?

The range is equal to 5 standard deviations.
The range is larger than the interquartile range.
The mean is much smaller than the median.
There are no outliers.

Question 10

Which one of these statistics can be unaffected by outliers?

Mean.
Interquartile range.
Standard deviation.
Range.

Question 11

What is the effect of an outlier on the value of a correlation coefficient between a dependent and an independent variable?

An outlier always decreases this coefficient.
An outlier might decrease or increase this coefficient, depending on its relation with the other data.
An outlier always increases this coefficient.
An outlier does not have any effect on this coefficient.

Question 12

A regression model with variable Y regressed on variable X is used

To determine if any values for X are outliers.
To determine if any values for Y are outliers.
To determine if a change in variable X causes a change in variable Y.
To estimate the change in variable Y for a given change in variable X.

Question 13

Consider the following population model:

Y_j = β₀ + β₁X_1,j + β₂X_2,j + ... + β_k X_k,j + ε_j

for any j = 1, ..., N. If all the values of the dependent variable are multiplied by the same constant, what happens to the norm of the residuals and to the R² of the regression?

The norm of the residuals changes, and R² stays the same.
Both the norm of the residuals and R² stay the same.
Both the norm of the residuals and R² change.
The norm of the residuals stays the same, and R² changes.

Question 14

You collect data on the score S_f on the final exam of a course and on the scores S₁, S₂, S₃ on the first, second and third assignment of the course, respectively. All the scores are expressed on an integer scale from 1 to 10. Some data show that there is a relation between these variables. The estimated linear regression model is:

Ŝ_f = 6.8 + 0.0 * S₁ + 0.25 * S₂ + 0.0 * S₃

One interpretation of the coefficients is

A student who gets 0 on the second assignment (S₂ = 0) is predicted to get 6 on the final exam.
A student who gets 0 on the third assignment (S₃ = 0) is predicted to get 7 on the final exam.
A student who gets 4 points more than another student on the second assignment is predicted to get 1 point more than the other student on the final exam.
Students only receive a fourth (.25) of the points for a correct answer on the final exam compared to a correct answer on the second assignment.

Question 15

Pick the choice that best completes the following sentence. If a relationship between two variables is called statistically significant, it means the investigators think the variables are

Related in the population from which the sample is drawn.
Not related in the population represented by the sample.
Related in the sample due to chance alone.
Very important.

Question 16

Consider a dependent variable whose variance does not change for different values of an independent variable. With respect to this independent variable, the dependent variable is characterized by

Linearity.
Multicollinearity.
Homoscedasticity.
Heteroscedasticity.

Question 17

Consider the following component matrix of a Principal Component Analysis.

Variable	Component 1	Component 2	Component 3	Component 4
X6 Product Quality	.248	-.501	-.081	.670
X7 E-Commerce Activities	.307	.713	.306	.284
X8 Technical Support	.292	-.369	.794	-.202
X9 Complaint Resolution	.871	.031	-.274	-.215
X10 Advertising	.340	.581	.115	.331
X11 Product Line	.716	-.455	-.151	.212
X12 Salesforce Image	.377	.754	.341	.232
X13 Competitive Pricing	-.281	.660	-.069	-.348
X14 Warranty & Claims	.394	-.305	.778	-.193
X16 Order & Billing	.809	.042	-.220	-.247
X18 Delivery Speed	.879	.117	-.302	-.206
Sum of Squares (eigenvalue)	3.427	2.551	1.691	1.087
Percentage of trace	31.15	23.19	15.37	9.88

What is the total percentage of variance explained by the four factors?

(100/31.15)x3.427 ≈ 11
31.15 + 23.19 + 15.37 + 9.88 = 79.59
(100/31.15)x3.427+(100/23.19)x2.551+(100/15.37)x1.691+(100/9.88)x1.087 ≈ 44
3.427 + 2.551 + 1.691 + 1.087 ≈ 8.756

Question 18

Which of the following is/are critical assumption(s) for factor analysis?

There is a balanced mixture of dependent and independent variables.
Some underlying structure does exist in the set of analysed variables.
There is no multicollinearity, because this property would cause several estimation problems.
Normality, homoscedasticity and linearity.

Part 2 - Problem on Multivariate Regression Analysis

A researcher applies Multivariate Regression Analysis to important characteristics that can influence the number of customers of a company. For the study, the researcher has data from 92 customers on 4 metric variables:

X8 Technical Support
X11 Product Line
X15 New Products
X19 Satisfaction

Each variable is measured on an integer scale with points from 1 to 10, with 1 being "Poor" and 10 being "Excellent". The researcher considers variable X19 as representative of the customer satisfaction with respect to the overall company's activity, while she considers variables X8, X11, and X15 as representative of the customer satisfaction with respect to just a specific part of the company activities, as explained by the variable names. Therefore, the researcher tries to explain the variation in X19 by means of the variation in X8, X11, and X15. The appendix "PART 2 - Problem on Multivariate Regression Analysis" on pages 13-17 contains the SPSS output necessary to answer the questions.

Question 1

Explain if Multivariate Regression Analysis is allowed for the given dataset.

Question 2

Are there any problems with missing data and outliers?

Question 3

Discuss the assumption of normality in this data set. Use a significance level of α = 0.05.

Question 4

Explain how to test for the presence of heteroscedasticity for the four variables in the dataset. Interpret the test statistics given in the tables. What do you conclude?

Questions 5-10 refer to the model considered in "Tables and graphs for MODEL 1 in PART 2" in the Appendix

Question 5

Provide the linear regression model, and explain what coefficients and variables represent.

Question 6

Provide the regression equation for the linear regression model using the enter method.

Question 7

Determine the percentage of variation in the dependent variable that is explained by the regression model. Is this percentage significant? Specify and explain the used test.

Question 8

Explain which independent variables have a significant contribution in the prediction of the dependent variable in the regression model. Use a significance level of α = 0.05.

Question 9

Indicate and explain which independent variable has the highest influence on the dependent variable of the regression equation.

Question 10

Does multicollinearity cause a problem in the regression? Explain your answer.

Questions 11-13 refer to the model considered in "Tables and graphs for MODEL 2 in PART 2" in the Appendix

Question 11

Provide the regression equations for the linear regression models using the sequential forward method.

Question 12

Explain which independent variables have a unique, significant contribution to the prediction of the dependent variable. Indicate exactly which table you use in your explanation.

Question 13

Part 3 - Problem on Factor Analysis

A researcher is studying the market segmentation of a company's customers and applies factor analysis to important characteristics that can influence this market segmentation. The researcher has data from 92 customers on 12 metric variables measured on a 0-10 scale, with 10 being "Excellent" and 0 being "Poor". The variables are

X6 - Product Quality
X7 - E-Commerce Activities
X8 - Technical Support
X9 - Complaint Resolution
X10 - Advertising
X11 - Product Line
X12 - Salesforce Image
X13 - Competitive Pricing
X15 - New Products
X16 - Ordering & Billing
X17 - Price Flexibility
X18 - Delivery Speed

Appendix B contains the SPSS output necessary to answer the questions.

Question 1

What is factor analysis and what is the goal of factor analysis?
What is the difference between principal component analysis and common factor analysis and why do you apply these methods?
Consider the first set of tables and graphs for PART 3 (pages 19-21). Which extraction method has been used here?

Question 2

Is factor analysis allowed on this dataset?
Describe three different ways to check if the considered variables are correlated, and evaluate whether the dataset meets the correlation assumption necessary for factor analysis.
In case the assumption is not met, which variable should be removed to apply factor analysis?

Question 3

What is a factor loading and what is a cross-loading?
How are factor loadings and factor eigenvalues related?
Consider the second set of tables and graphs for PART 3 (pages 22-23). How much of the variance of variable X12 - Salesforce Image is explained by the first factor?
Consider the factor solution provided in the second set of tables and graphs for PART 3 (pages 22-23). Is it a good factor solution? Why?

Question 4

What is a factor rotation?
In what situation would an oblique factor rotation be more appropriate?

Question 5

What is the difference between the Varimax and Oblimin factor rotation? (note that the names for the factor rotations are the ones used in SPSS)
Consider the third set of tables and graphs for PART 3 (page 24). Explain which factor model leads to the easiest interpretation of the underlying structure of the data, and why this happens.
Describe a strategy for validating the factor analysis results.

Answers Part 1 - Multiple choice questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

Question 12

Question 13

Question 14

Question 15

Question 16

Question 17

Answers Part 2 - Problem on Multivariate Regression Analysis

Question 1

X19 is the dependent variable, and X8, X11, and X15 are the independent variables. From the table "Descriptive Statistics", there are 92 cases with values for the four considered variables. So the ratio of sample size to independent variables is 92:3 = 30.7:1.
This is in agreement with the adopted rule of thumb of having at least 10 times as many cases as independent variables. It is also in accordance with the minimum ratio 5:1 considered in the textbook.
All variables are metric. Therefore Multivariate Regression Analysis (MRA) is allowed.

Question 2

From the table "Case Processing Summary" there is no missing data, so there is no problem.
There is one possible outlier in X15 - New Products. This observation can be seen in the X15 boxplot and histogram. There are no apparent outliers in the other plots.
There are no problems with outliers.

Question 3

The assumption of normality can be checked in different ways:

Based on the histograms and boxplots, the variables may be normal. However, the sample distributions of X8 and X11 are the closest to a normal distribution. The sample distributions of X15 and X19 are characterized by several local maxima.
Based on the P-P plots, all the variables are close to normal variables.
We can consider the Kolmogorov-Smirnov test. This test is a test for the equality of probability distributions. The following hypothesis H0 is tested against the alternative hypothesis H1:
1. H0 = The variable is normally distributed.
2. H1 = The variable is not normally distributed.

As asked in the question, we consider a significance level of 0.05.

We do not reject normality of a variable at the 5% significance level if the value in the column Sig. of the table "Tests of Normality" is greater than 0.05. All the variables therefore behave as normal at the 5% significance level.
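To make the test statistic concrete, here is a minimal sketch of how the Kolmogorov-Smirnov distance D against a fitted normal distribution could be computed. It is not the SPSS procedure itself: the data are hypothetical, the function name `ks_statistic_normal` is an assumption of this sketch, and no p-value is computed (when the mean and standard deviation are estimated from the sample, the p-value needs a correction that is not reproduced here).

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal with sample mean and sd
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x:
        # check the gap on both sides of the jump.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Hypothetical scores on a 1-10 scale (NOT the exam data)
scores = [4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]
d = ks_statistic_normal(scores)
print(f"D = {d:.3f}")
```

A small D means the sample distribution stays close to the fitted normal curve; the Sig. column in the output translates D into the p-value that is compared with 0.05.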

Question 4

First, we can consider a Levene test. In this test the following hypothesis is tested:

H0) The variable is homoscedastic
H1) The variable is heteroscedastic

We can consider the Levene test based on different statistics, such as mean and median. This test considers the variance of a metric variable compared across levels of another variable. In particular, the test focuses on a particular statistic, such as the mean and the median. In the four tables "Test of Homogeneity of Variances" given in this exam, the additional information on the chosen statistic is not provided. Considering the Levene test with output given in the tables and a significance level of 0.05 we observe that

With X19 as factor) - Technical Support (0.094) - Product Line (0.000) - New Products (0.034)
With X15 as factor) - Technical Support (0.000) - Product Line (0.027) - Satisfaction (0.001)
With X11 as factor) - Technical Support (0.000) - New Products (0.001) - Satisfaction (0.000)
With X8 as factor) - Product Line (0.005) - New Products (0.074) - Satisfaction (0.000)

In two cases (Technical Support with X19 as factor, 0.094, and New Products with X8 as factor, 0.074), the significance is higher than α = 0.05 and we fail to reject the null hypothesis. We reject H0 in all the other cases, and we say that in these cases the variables are statistically heteroscedastic at the 5% significance level.
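The decision rule applied to the twelve Levene tests can be sketched as follows. The Sig. values are the ones listed in this answer; the dictionary layout and the helper name `rejects_h0` are assumptions made for illustration only.

```python
ALPHA = 0.05

# Sig. values from the four "Test of Homogeneity of Variances" tables,
# keyed by (factor variable, tested variable).
# X8 = Technical Support, X11 = Product Line,
# X15 = New Products, X19 = Satisfaction.
levene_sig = {
    ("X19", "X8"): 0.094, ("X19", "X11"): 0.000, ("X19", "X15"): 0.034,
    ("X15", "X8"): 0.000, ("X15", "X11"): 0.027, ("X15", "X19"): 0.001,
    ("X11", "X8"): 0.000, ("X11", "X15"): 0.001, ("X11", "X19"): 0.000,
    ("X8", "X11"): 0.005, ("X8", "X15"): 0.074, ("X8", "X19"): 0.000,
}


def rejects_h0(p_value, alpha=ALPHA):
    """Reject H0 (homoscedasticity) when the p-value is below alpha."""
    return p_value < alpha


rejected = sorted(pair for pair, p in levene_sig.items() if rejects_h0(p))
not_rejected = sorted(pair for pair, p in levene_sig.items() if not rejects_h0(p))

print("H0 rejected (heteroscedastic):", rejected)
print("H0 not rejected:", not_rejected)
```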

Second, we can perform a graphical analysis, considering the scatterplots of pairs of variables. We consider where the pattern has an overall shape that differs from a rectangle (e.g., an overall triangular shape). For example, X11 seems to be heteroscedastic, showing this kind of pattern.

Question 5

The considered model is "Model 1". The (theoretical) linear regression model in vector form (that is, with no label for the observation) is

X19 = a + b₁*X8 + b₂*X11 + b₃*X15 + e

In this notation X19 is the regressand (dependent, explained variable) vector. Similarly, X8, X11 and X15 are the regressors (independent, explanatory variables) vectors.

The parameter a is the constant coefficient.
The parameters b₁, b₂, b₃ are the coefficients of the various regressors, and they represent the impact of an increase/decrease of a regressor on the explained variable.

e is the vector of errors.

Question 6

The regression equation is the estimated version of the (theoretical) regression model. From the table "Coefficients" we get

^X19 = 3.455 + 0.017*X8 + 0.506*X11 + 0.073*X15

where the symbol "^X19" indicates the vector of the estimated (or predicted) values of the variable vector X19, for the values of the regressor vectors X8, X11 and X15.

You can also answer this question by providing the regression equation for each single observation (that is, with the index for the observation).
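As an illustration of the per-observation form, the estimated equation can be evaluated for a single customer. This is only a sketch: the coefficients are the ones reported above, while the function name `predict_x19` and the example customer are hypothetical.

```python
# Estimated coefficients from the "Coefficients" table (enter method)
COEFFICIENTS = {"const": 3.455, "X8": 0.017, "X11": 0.506, "X15": 0.073}


def predict_x19(x8, x11, x15):
    """Predicted satisfaction ^X19 for one observation."""
    return (COEFFICIENTS["const"]
            + COEFFICIENTS["X8"] * x8
            + COEFFICIENTS["X11"] * x11
            + COEFFICIENTS["X15"] * x15)


# Hypothetical customer: X8 = 5, X11 = 6, X15 = 4
base = predict_x19(5, 6, 4)
# Same customer with X11 one point higher, other regressors unchanged
plus_one = predict_x19(5, 7, 4)

print(f"predicted X19: {base:.3f}")
print(f"effect of +1 on X11: {plus_one - base:.3f}")
```

The difference between the two predictions equals the coefficient of X11, which is the usual "other things being equal" reading of a regression coefficient.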

Question 7

From the "Model Summary" table we get that R² = 0.325. This means that 32.5% of the variation in the dependent variable X19 is explained by the independent variables, so the model explains about a third of the total variation of X19.

The test to determine whether this percentage is significant is the F-test reported in the ANOVA table. The hypothesis is

H0) R² = 0 vs. H1) R² > 0

The test is equivalently formulated as

H0) b₁ = b₂ = b₃ = 0 vs. H1) b_i ≠ 0 for at least one i

(we test if all the regression coefficients are equal to zero versus the hypothesis that there is at least one regression coefficient that is different from zero).

F-statistic= 14.108, p-value= .000 (close to 0). Interpretation of the values:

The p-value is < α = 0.05, so we reject H0. We conclude that R² is significantly different from zero.
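The link between R² and the overall F-statistic can be checked with a short computation. This is a sketch using the standard formula F = (R²/k) / ((1 - R²)/(n - k - 1)); the function name is an assumption, and the p-value is not recomputed here.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic for a regression with k regressors and n cases."""
    explained_per_df = r2 / k
    unexplained_per_df = (1.0 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


# Values used in this answer: R² = 0.325, 92 cases, 3 independent variables
f_value = f_statistic(0.325, n=92, k=3)
print(f"F = {f_value:.2f} with (3, 88) degrees of freedom")
```

The result is about 14.1, which agrees with the reported 14.108 up to the rounding of R² to three decimals.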

Question 8

The i-th independent variable has a significant contribution in the prediction of the dependent variable if its coefficient is statistically different from 0. We then test the hypotheses H0) b_i=0 vs. H1) b_i ≠ 0. The t-test is reported in the coefficients table.

X8: t-value = 0.233, p-value = 0.816; p > α, fail to reject H0
X11: t-value = 6.162, p-value = 0.000; p < α, reject H0
X15: t-value = 1.001, p-value = 0.320; p > α, fail to reject H0
Conclusion: the coefficient of X11 is significantly different from 0 (i.e., this independent variable has a significant contribution in the prediction of the dependent variable).

Question 9

From the "Coefficients" table we get that the independent variable that has the highest influence on the dependent variable is X11, since it has the highest standardized regression coefficient (it also has the highest unstandardized coefficient).

Question 10

No. All tolerances are above 0.10. All VIFs are below 10.
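The two rules of thumb are two views of the same quantity, since VIF = 1 / tolerance. A minimal sketch follows; the tolerance values are hypothetical (the actual values are in the SPSS output, which is not reproduced here), and the function names are assumptions.

```python
def vif(tolerance):
    """Variance inflation factor for a given tolerance (VIF = 1 / tolerance)."""
    return 1.0 / tolerance


def collinearity_problem(tolerance, tol_cutoff=0.10, vif_cutoff=10.0):
    """Flag a regressor whose tolerance or VIF crosses the rule-of-thumb cutoff."""
    return tolerance <= tol_cutoff or vif(tolerance) >= vif_cutoff


# Hypothetical tolerances for the three regressors (NOT the exam output)
tolerances = {"X8": 0.95, "X11": 0.80, "X15": 0.82}

for name, tol in tolerances.items():
    print(name, f"tolerance={tol:.2f}", f"VIF={vif(tol):.2f}",
          "problem" if collinearity_problem(tol) else "ok")
```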

Question 11

First of all, we must always check whether the estimation and test results we consider are appropriate! Sometimes we may accidentally be given results that are of no use! On pages 17-18 of this exam, there are two tables that do not refer to the models considered in this exam! The tables are "ANOVA" and "Coefficients" on page 17. They refer to a different regression model, where the dependent variable is X22 - Purchase Level! We can understand this on the basis of two facts:

Direct observation. Under the tables dependent and independent variables are specified.
Intuition. In previous questions we have built the same model (Model 1) by the enter method and found this regression equation:
^X19 = 3.455 + 0.017*X8 + 0.506*X11 + 0.073*X15
We have also found that the impact of X8 and X15 on X19 is not statistically significant. Intuitively we can then expect to find a regression equation that is "quite similar" to
^X19 = 3.455 + 0.506*X11
On the basis of these remarks, we should then check for a table with similar findings, since the two methods (enter method and forward method) are just two ways to analyze the same model on the basis of the available data.

The regression equation for the regression model obtained by using the forward method is obtained from the last table of the "Tables and graphs for MODEL 2 in PART 2".

Model 2 (forward method)

^X19 = 3.878 + 0.513 * X11

Question 12

Again, the table to use is the last table of the "Tables and graphs for MODEL 2 in PART 2". From this table we understand that the only statistically significant variable to be taken as explanatory variable is X11. This is consistent with the findings in the answer to question 8 of this part of the exam. The regression equation is also quite similar.

Question 13

Overall, the model with only X11 is the model to adopt. In previous questions we have seen that X8 and X15 do not add much information to explain the variation of X19.

By comparing the Adjusted R² of Model 1 (0.302) with that of Model 2 (0.309), we see that this value is slightly higher for Model 2, which supports the same conclusion: select the model obtained with the forward method.
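The adjusted R² penalizes R² for the number of regressors, which is why it is the appropriate quantity for comparing models of different size. As a sketch (the function name is an assumption), the value for Model 1 can be reproduced from R² = 0.325 with 92 cases and 3 regressors. The R² of Model 2 is not quoted in this answer, so only Model 1 is recomputed.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for a regression with k regressors and n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Model 1 (enter method): R² = 0.325, n = 92, k = 3
adj_model_1 = adjusted_r2(0.325, n=92, k=3)
print(f"Adjusted R² of Model 1: {adj_model_1:.3f}")
```

The computed value rounds to 0.302, matching the figure used in the comparison.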

Answers Part 3 - Problem on Factor Analysis

Question 1

Factor analysis (FA) is a technique to analyze interdependences among variables. Its primary purpose is to define an underlying structure among the variables in the analysis. The goal is to reduce or summarize the data.
Principal Component Analysis (PCA) is used to reduce the dimensionality of data. The total variance of the p observed variables is redistributed over p principal components, where p is a positive integer. The first principal component has the largest contribution to the total variance, the second has the second largest contribution, etc. Common factor analysis is used to summarize/explain the data. The method reproduces the observed correlations as well as possible, using a small number of common factors. Common factor analysis considers only the common variance among variables. The observed relations between the variables describe underlying constructs (i.e. the common factors), which may serve further as theoretical deepening.
The extraction method used is Principal Component Analysis.
In common factor analysis the goal is to explain as much of the common or shared variance as possible using the factors. Therefore, the initial communalities rather than the total variance are inserted on the diagonal of the correlation matrix, and these are smaller than 1. In Principal Component Analysis, by contrast, the total variance is analyzed, so the diagonal entries are equal to 1.

Question 2

The variables are all metric. The ratio of observations to variables is 92 : 12 (which is about 7.7 : 1, therefore above the 5 : 1 threshold, that is our "rule of thumb" threshold) -> FA is allowed
We can consider
1. The anti-image correlation matrix, which is the matrix of (negative) partial correlations. This matrix gives us the correlation that is unexplained when the effects of other variables are taken into account. In this exam the anti-image correlation matrix has "small" values outside the main diagonal: these values are < 0.7.
2. Bartlett's test of sphericity, which tests for the presence of correlations among the variables. It provides the statistical significance that the correlation matrix has significant correlations among at least some of the variables. From the table "KMO and Bartlett's test" on page 20, the test is significant, with p-value=.000. The assumption is then met.
3. Kaiser-Meyer-Olkin Measure of Sampling Adequacy (MSA), which is 1 if the variable is perfectly predicted without error by the other variables
  - overall MSA = .622 > .5
  - the MSAs for X11, X15 and X17 are not > 0.50 -> the correlation assumption that is necessary for FA is not met
The variable with the lowest MSA is X17, so this is the variable that should be removed.

Question 3

A factor loading is the correlation between a variable and a factor.
A cross-loading is a "high correlation" (on the basis of a chosen threshold for the correlation) between a variable and different factors.
The sum of the squared factor loadings is the eigenvalue of the factor.
The variance of the variable explained by the factor is equal to the squared factor loading. From the component matrix, for X12 this is 0.595² ≈ 0.354 or 35.4%.
No, while all communalities are >0.5, there are cross-loadings (absolute value of a loading on different factors that is > 0.4) for the variables X6, X7, X12, X17. Again, let us recall that 0.4 is a threshold chosen as a rule of thumb!
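The relation between loadings, eigenvalues and explained variance can be verified numerically. Since the Part 3 component matrix is not reproduced in this document, the sketch below uses the first-component loadings of the matrix shown in Part 1, Question 17 (11 variables); the variable names in the code are assumptions of this sketch.

```python
# Loadings on component 1, copied from the matrix in Part 1, Question 17
component_1 = {
    "X6": 0.248, "X7": 0.307, "X8": 0.292, "X9": 0.871, "X10": 0.340,
    "X11": 0.716, "X12": 0.377, "X13": -0.281, "X14": 0.394,
    "X16": 0.809, "X18": 0.879,
}

# Eigenvalue of a factor = sum of its squared loadings
eigenvalue = sum(loading ** 2 for loading in component_1.values())

# With standardized variables the trace equals the number of variables
pct_of_trace = 100.0 * eigenvalue / len(component_1)

# Share of one variable's variance explained by the factor = squared loading
explained_x9 = component_1["X9"] ** 2

print(f"eigenvalue      = {eigenvalue:.3f}")
print(f"percent of trace = {pct_of_trace:.2f}")
print(f"X9 explained     = {explained_x9:.3f}")
```

The eigenvalue comes out at about 3.43 and the percentage of trace at about 31.2, close to the tabulated 3.427 and 31.15; the small gap comes from the loadings being rounded to three decimals.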

Question 4

A factor rotation is a method of redistributing variance from factors previously obtained (the unrotated solution). We want to have different variables with factor loadings on different factors. In other words, we want to have a situation without cross-loadings: cross-loadings make the interpretation of the factors harder to achieve.
Oblique rotation methods produce factors which are correlated. Sometimes oblique rotation gives us a situation with no cross-loadings. When we have a good reason to believe that the factors may be correlated, and we eliminate cross-loadings so that we enhance the interpretability of the factors, oblique factor rotation is more appropriate.
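What "redistributing variance" means can be illustrated for the orthogonal case with two factors: the loadings of each variable change, but its communality (the sum of its squared loadings) does not. The sketch below rotates the first two columns of the Part 1, Question 17 matrix for four variables by an arbitrary angle. The angle is purely illustrative and is NOT the Varimax solution, and the function names are assumptions.

```python
import math

# Loadings on components 1 and 2 for four variables,
# copied from the component matrix in Part 1, Question 17
loadings = {
    "X6": (0.248, -0.501),
    "X7": (0.307, 0.713),
    "X9": (0.871, 0.031),
    "X18": (0.879, 0.117),
}


def rotate(pair, angle_deg):
    """Apply an orthogonal rotation to one variable's two loadings."""
    a, b = pair
    t = math.radians(angle_deg)
    return (a * math.cos(t) + b * math.sin(t),
            -a * math.sin(t) + b * math.cos(t))


def communality(pair):
    """Variance of the variable explained by the two factors together."""
    return sum(loading ** 2 for loading in pair)


ANGLE = 30.0  # arbitrary illustrative angle, not an optimized rotation
rotated = {name: rotate(pair, ANGLE) for name, pair in loadings.items()}

for name in loadings:
    print(name,
          "rotated loadings:", [round(l, 3) for l in rotated[name]],
          "communality before/after:",
          round(communality(loadings[name]), 3),
          round(communality(rotated[name]), 3))
```

A rotation criterion such as Varimax searches for the orthogonal rotation that makes each variable load highly on as few factors as possible; an oblique rotation additionally drops the requirement that the rotated axes stay perpendicular.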

Question 5

With an orthogonal rotation (Varimax) factors remain uncorrelated. With an oblique rotation factors can be correlated (Oblimin).
We get cross-loadings with all the procedures. Therefore no factor rotation leads us to a relatively easy interpretation of the underlying structure of the data.
These are possible strategies:
- Split the sample, redo the factor analysis for both samples and compare the results.
- Take a new sample, redo the factor analysis with the new sample and compare the results.
- Detect influential observations, remove them from the dataset, redo the factor analysis excluding the influential observations and then compare the results.