Seminar Assumptions and Bootstraps

Summary and study notes 

Which topics are covered in the lecture?

Assumptions. Five assumptions are discussed in the lecture: outliers, multicollinearity, homoscedasticity, linearity and normally distributed residuals. See the notes later in this document.

Bootstrapping. You use the bootstrap when the distributions do not agree with the assumptions.

Mediation = the relationship between an independent variable and a dependent variable runs via a third hypothetical variable, the mediator variable. When you do mediation you always have to use the bootstrap, because there is an indirect effect.

Which topics are discussed that are not covered in the literature?

No topics are discussed in this lecture that are not covered in the literature.

Which recent developments in the field are discussed?

No recent developments are discussed.

What remarks does the lecturer make during the lecture about the exam?

No questions about effect size will be asked on the exam.

Which questions are discussed that could be asked on the exam?

No exam questions are discussed.

Lecture notes

Assumptions and violations

  1. Outliers. An outlier may influence your results. You can remove it, especially if it is a theoretically illogical value. You can run the analysis with and without it and inspect whether the conclusion is the same (it is less clear what to do if it is not). Or you can leave it in but correct for it, e.g. by using a robust estimator (look at the median instead of the mean).
  2. Multicollinearity. Tolerance = 1/VIF. Tolerance < .2 is a possible problem, tolerance < .1 is a problem. VIF > 5 is a possible problem, VIF > 10 is a problem. When predictors correlate strongly (> 0.8), it becomes hard to compute unique estimates for the regression coefficients: the estimates of the b-coefficients are unreliable and the importance of individual predictors is difficult to determine. When there is multicollinearity, you have to decide which predictors to keep.
  • In case of an interaction, you have to center the variables.
  • Remove one of the predictors (leave at least one out, but not all).
  • Use a ‘latent’ effect instead of the original variables; use factor analysis or a sum score.
  3. Homoscedasticity = for each x-value there is the same spread. If this assumption is violated (heteroscedasticity), hypothesis tests are no longer valid. If there is heteroscedasticity, you can use the bootstrap. You can also check linearity with the homoscedasticity plots: when there is no symmetry around the middle line, there is no linear relationship.
  4. Linearity. You can look for linearity in a plot to which you add a line for quadratic effects (R²). There is a linear relationship when the spread around the line is symmetric. When there is no linearity, you have to add a quadratic effect (so, transform the predictor into a new variable which is squared).
  5. Normally distributed residuals. The residuals have to be normally distributed. You can check this assumption in a plot. You can also test it, but the outcome depends on the sample size: with a small sample the test is rarely significant, whereas with a large sample even small deviations from normality give a significant result. Dealing with non-normality:
  • Ignore the problem, claim ML estimation is robust. Defensible if the distribution is not extreme and N is large.
  • Use a normalizing transformation (of the dependent variable): square root, logarithm, inverse, normalized scores. It often does not make sense to interpret the transformed variables.
  • Use robust estimators (MLR). Works well in many situations, but needs N > 200, and more with large models.
  • Bootstrapping.
  • Bayes estimation.
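
The tolerance and VIF rules of thumb from the multicollinearity item can be illustrated with a small sketch. This is only one possible way to compute them, assuming NumPy is available; the function name `vif_and_tolerance` and the simulated predictors are hypothetical and not part of the lecture. Each predictor is regressed on the other predictors, and VIF = 1 / (1 - R²) of that regression.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (vif, tolerance) arrays, one entry per predictor column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vif = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on all other predictors (with an intercept).
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vif[j] = 1.0 / (1.0 - r_squared)
    return vif, 1.0 / vif  # tolerance = 1 / VIF

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vif, tol = vif_and_tolerance(np.column_stack([x1, x2, x3]))
for name, v, t in zip(["x1", "x2", "x3"], vif, tol):
    print(f"{name}: VIF = {v:.2f}, tolerance = {t:.3f}")
```

In this constructed example x1 and x2 should be flagged by the VIF > 10 rule, while x3 should not.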


It is called bootstrapping because you have to help yourself out with the means that you have. You use the bootstrap when the distributions do not agree with the assumptions, for example with non-normal errors, heteroscedasticity, small samples, moderation, mediation or count variables. When you cannot assume that the statistic follows a t-distribution, you approximate the sampling distribution by resampling (with replacement). In the end you get p-values and other results that are valid. If you only sampled 10 persons, you resample your sample many times (e.g. 1000 times), which gives you an empirical sampling distribution of the statistic. This distribution is then used in your calculations. Sampling with replacement means that you can pick one individual more than once. The resampling is done randomly.
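
A minimal sketch of the resampling idea, assuming NumPy; the ten scores are made-up values, not data from the lecture. It draws 1000 resamples of size 10 with replacement and collects the mean of each one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scores of a sample of 10 persons.
sample = np.array([4.0, 7.0, 5.0, 9.0, 6.0, 3.0, 8.0, 5.0, 7.0, 6.0])
n_boot = 1000

# One resample: same size as the original sample, drawn with replacement,
# so the same person can appear more than once (or not at all).
one_resample = rng.choice(sample, size=sample.size, replace=True)
print("one resample:", one_resample)

# Repeat the resampling many times and store the statistic of interest.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

boot_se = boot_means.std(ddof=1)                          # bootstrap standard error
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)   # textbook SE of the mean
print(f"bootstrap SE = {boot_se:.3f}, analytic SE = {analytic_se:.3f}")
```

For a simple statistic such as the mean, the bootstrap standard error should land close to the textbook formula; the bootstrap becomes more useful for statistics where no such formula is available.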

Advantages of bootstrapping

It is simple: you don’t need to assume a distribution, so there is no distributional assumption to check. We don’t need normal data or a large sample size. We can obtain the SE and CI for complex parameters, such as correlation coefficients. You can check the stability of the results. It may give you more accurate results.

Why do we use bootstraps?

We don’t know the real distribution (population), but only the data. We can use the data as a proxy for the population. We draw multiple samples from this proxy (resampling), as if we were sampling from the population. We compute the statistic of interest on each of the resampled datasets, and then calculate the mean and confidence interval from the distribution of these statistics.
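
The procedure above can be sketched as a small generic function. This is an illustration under assumptions, not the lecture’s own implementation: it uses NumPy, the helper names `bootstrap_ci` and `correlation` are hypothetical, and the data are simulated. It applies a percentile interval to a correlation coefficient, one of the “complex parameters” mentioned earlier.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap: returns (mean, SE, (lower, upper)) of the statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # row indices drawn with replacement
        stats[b] = statistic(data[idx])    # statistic on the resampled dataset
    alpha = 1.0 - level
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stats.mean(), stats.std(ddof=1), (lower, upper)

def correlation(rows):
    """Pearson correlation between the two columns of a dataset."""
    return np.corrcoef(rows[:, 0], rows[:, 1])[0, 1]

# Simulated data for 60 cases (hypothetical).
rng = np.random.default_rng(1)
x = rng.normal(size=60)
y = 0.6 * x + rng.normal(scale=0.8, size=60)
data = np.column_stack([x, y])

boot_mean, boot_se, (lo, hi) = bootstrap_ci(data, correlation)
print(f"r = {correlation(data):.3f}, bootstrap SE = {boot_se:.3f}, "
      f"95% CI = [{lo:.3f}, {hi:.3f}]")
```

Whole rows are resampled, so each person’s x and y values stay together; resampling the two columns separately would destroy the relationship being estimated.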

The only assumption is that your sample needs to be representative of the population! When we use the bootstrap, we don’t need the other assumptions.


Be aware of the difference between mediation and moderation.

Moderation = the effect of X1 on Y depends on the value of X2 (the moderator).

Mediation = the relationship between an independent variable and a dependent variable runs via a third hypothetical variable, the mediator variable. When you do mediation you always have to use the bootstrap, because there is an indirect effect.

Complete mediation = there is no direct effect of X on Y.

Partial mediation = a combination of a direct effect of X on Y and an indirect effect of X on Y through M.

You need the bootstrap because the distribution of the indirect effect ab is not normal: ab is a product of two coefficients.

Steps bootstrap indirect effect

  1. Repeatedly sample from the dataset with replacement. Draw 5000 (default) new samples of N cases from the original sample, with replacement. Note: cases can be drawn more than once or not at all.
  2. Estimate the indirect effect 𝑎𝑏 in every sub-sample: X → M → Y.
  3. Make a histogram of all values of 𝑎𝑏 → bootstrap sampling distribution of 𝑎𝑏 (positively skewed). Or: order all these estimates of the indirect effect.
  4. Look at the middle 95% (read off the 2.5th and 97.5th percentiles of the distribution) → 95% CI.
  5. Optional: determine the bias-corrected and accelerated (BCa) interval.
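
The steps above can be sketched in code. This is a rough illustration assuming NumPy and ordinary least squares for both paths, not the SPSS procedure used in the lecture; the helper `indirect_effect` and the simulated X, M and Y are hypothetical. Only the percentile interval is computed; the optional BCa interval is left out.

```python
import numpy as np

def indirect_effect(x, m, y):
    """Estimate ab: a from M ~ X, b from Y ~ M + X (both with an intercept)."""
    n = x.size
    a = np.linalg.lstsq(np.column_stack([np.ones(n), x]), m, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([np.ones(n), m, x]), y, rcond=None)[0][1]
    return a * b

# Simulated data with a true indirect effect of 0.5 * 0.4 = 0.2 (hypothetical).
rng = np.random.default_rng(2024)
n = 200
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.4 * m + 0.2 * x + rng.normal(size=n)

# Steps 1 and 2: resample cases with replacement, estimate ab in each resample.
n_boot = 5000
ab_boot = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)   # the same case may be drawn repeatedly
    ab_boot[i] = indirect_effect(x[idx], m[idx], y[idx])

# Steps 3 and 4: the ordered estimates form the bootstrap distribution;
# the 2.5th and 97.5th percentiles give the 95% percentile interval.
boot_llci, boot_ulci = np.percentile(ab_boot, [2.5, 97.5])
zero_in_ci = boot_llci <= 0.0 <= boot_ulci

print(f"ab = {indirect_effect(x, m, y):.3f}, "
      f"95% CI = [{boot_llci:.3f}, {boot_ulci:.3f}], zero in CI: {zero_in_ci}")
```

Because the simulated data are built with a nonzero indirect effect, the interval would be expected to exclude zero here, although that depends on the random draw.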


When zero is included in the confidence interval, you don’t reject H0, so there is no evidence of an indirect effect. When zero is outside the confidence interval, you reject H0: there is an indirect effect. In SPSS you look at BootLLCI and BootULCI, the lower and upper limits of the bootstrap confidence interval. To decide whether the mediation is complete or partial, you look at whether there is a direct effect and whether it is significant.

Summaries & Study Note of Britt van Dongen