Bullet points per chapter for the 2022 edition of Analysing Data using Linear Models by van den Berg


What are variables, variation and co-variation? - Chapter 1

  • Data on units and variables can be stored in a data matrix (see the R sketch after this list).
  • There are different kinds of variables. How you describe variables and how you visualise them depends on their measurement level.
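
A minimal R sketch of such a data matrix; the variables and values are hypothetical:

    # Rows are units, columns are variables with different measurement levels
    d <- data.frame(id     = 1:3,
                    sex    = factor(c("m", "f", "f")),  # nominal variable
                    height = c(181, 168, 172))          # numeric variable
    str(d)  # shows the class (and thus measurement level) of each variable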

How can we make inferences about a mean? - Chapter 2

  • In statistics, inference refers to drawing conclusions about population (complete) data on the basis of sample data (a selection of data).
  • When you randomly draw equally-sized samples from a population, and for each sample you compute the mean, you can make a histogram of all sample means. This histogram represents the sampling distribution of the sample mean (simulated in the R sketch after this list).
  • When you randomly draw equally-sized samples from a population, and for each sample you compute the variance, you can make a histogram of all sample variances. This histogram represents the sampling distribution of the sample variance.
  • The sample mean is an unbiased estimator of the population mean.
  • The sample variance is a biased estimator of the population variance.
  • The standard deviation of the sampling distribution is called the standard error.
  • The larger the sample size, the smaller the standard error.
  • A 95% confidence interval is constructed such that it would contain 95% of all possible sample means, had the population mean been equal to the observed sample mean. Its construction is based on the estimated sampling distribution of the sample mean.
  • If you know the standard error (because you know the population variance), the standardised sample means follow a normal distribution. If you don’t know the standard error, you have to estimate it from the sample variance. If the sample size is very large, you can estimate the population variance quite well, and the sample variances will be very similar to each other; the sampling distribution will then look very much like a normal distribution. But if the sample size is relatively small, each sample will show a different sample variance, resulting in different standard error estimates. If you standardise each sample mean with a different standard error, the sampling distribution will not look normal: this distribution is called a t-distribution.
  • The shape of the sampling distribution is a t-distribution. The shape of this t-distribution depends on sample size (expressed as degrees of freedom).
  • The objective of null-hypothesis testing is to decide, using the data from a random sample, whether or not to reject the null-hypothesis. In the procedure, we assume that the null-hypothesis is true and compare the sample data with data that would result if the null-hypothesis were true.
  • The p-value represents the probability of finding a t-value equal to or more extreme than the one found, assuming that the null-hypothesis is true. Often a p-value of 5% or smaller is used to support the conclusion that the null-hypothesis is not tenable.
  • In two-tailed testing, we have rejection regions on both sides of the t-distribution. In one-tailed testing, we have only one rejection region.
  • A type I error is the mistake of rejecting the null-hypothesis while it is in fact true. A type II error is the mistake of not rejecting the null-hypothesis while it is in fact not true.
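
A minimal R sketch of these ideas; the population mean, standard deviation and sample size are hypothetical:

    # Simulate the sampling distribution of the sample mean
    set.seed(1)
    means <- replicate(10000, mean(rnorm(25, mean = 100, sd = 15)))
    hist(means)  # approximates the sampling distribution of the sample mean
    sd(means)    # approximates the standard error, 15 / sqrt(25) = 3

    # One-sample t-test of the null-hypothesis that the population mean is 100
    x <- rnorm(25, mean = 100, sd = 15)
    t.test(x, mu = 100)  # reports t, df = n - 1, two-tailed p-value and a 95% CI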

How can we make inferences about proportions? - Chapter 3

  • The sampling distribution of a sample proportion is closely related to the binomial distribution.
  • With increasing sample size, the binomial distribution becomes approximately normal, so the sampling distribution of a sample proportion also becomes approximately normal (Central Limit Theorem); see the R sketch after this list.
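
A minimal R sketch; the population proportion, sample size and observed count are hypothetical:

    # Sampling distribution of a sample proportion via the binomial distribution
    set.seed(1)
    props <- rbinom(10000, size = 100, prob = 0.3) / 100
    hist(props)  # close to normal at this sample size (Central Limit Theorem)

    # Normal-approximation test and confidence interval for one observed proportion
    prop.test(x = 36, n = 100, p = 0.3)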

What does linear modelling entail? - Chapter 4

  • A simple linear equation represents a straight line.
  • A straight line has an intercept and a slope.
  • The intercept is the value of Y given that X=0.
  • Data usually do not show a straight line, but a straight line might be a good approximation (prediction model) for the data.
  • Finding a reasonable straight line is called regression.
  • A residual is the difference between an observed Y-value and the predicted Y-value.
  • Finding the best regression line is usually based on the least squares principle.
  • Correlation stands for the co-relation between two variables. It tells you how well one variable can be predicted from the other.
  • Correlation is standardised to be between -1 and 1. It is the slope of the regression line for standardised X- and Y-values.
  • Covariance is an unstandardised measure for the co-relation.
  • Unexplained variance is the variance of the residuals.
  • Explained variance is total variance minus the unexplained variance.
  • R-squared is the proportion of explained variance (see the R sketch after this list).
  • It is possible to include more than one independent variable in a linear model. We then talk about multiple regression.
  • It is not wise to include two independent variables that are highly correlated with each other. A high correlation among independent variables is called collinearity.
  • The relationship between one independent variable and the dependent variable can change, depending on what other independent variables are included in the model. When this change is dramatic, one usually refers to Simpson’s paradox.
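
A minimal R sketch; the data frame d and the variables y, x1 and x2 are hypothetical:

    # Simple regression: least-squares line, residuals and R-squared
    m1 <- lm(y ~ x1, data = d)
    coef(m1)               # intercept b0 and slope b1
    resid(m1)              # observed y minus predicted y
    summary(m1)$r.squared  # proportion of explained variance

    # Multiple regression, and a collinearity check on the predictors
    m2 <- lm(y ~ x1 + x2, data = d)
    cor(d$x1, d$x2)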

How can we make inferences about linear models? - Chapter 5

  • You can use sample data to perform inference on population data.
  • The linear model coefficients that are based on sample data have uncertainty due to the random sampling.
  • Uncertainty of model parameters is quantified using standard errors.
  • Standard errors become smaller with increasing sample size, and coefficients become more precise.
  • Model parameters have t distributions, which can be used to construct confidence intervals.
  • In a model with K independent variables, the residual degrees of freedom are n − K − 1.
  • With null-hypothesis testing, usually the hypothesis is tested that a particular slope in the linear model is equal to 0 in the population.
  • Statistical power refers to the probability of finding a significant result, given that the population coefficient is a certain non-zero value.
  • Statistical power depends on the population value (if it is large, then you have a higher probability of finding a significant result than when it is close to 0).
  • Statistical power increases with increasing sample size.
  • Power analysis can give you insight about how many datapoints you need in order to have reasonable statistical power.
  • Null-hypothesis testing is increasingly regarded as outdated. Always consider whether an alternative approach, such as reporting confidence intervals, is more appropriate for answering your research question.
  • A null-hypothesis test can be carried out by checking whether a certain confidence interval includes 0. If 0 is not in the interval, the null-hypothesis that the population value is 0 can be rejected (see the R sketch after this list).
  • With an intercept-only model, you can test the null-hypothesis that the population mean of the dependent variable equals 0, and compute a confidence interval.
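
A minimal R sketch of this inference, again assuming a hypothetical data frame d with variables y, x1 and x2:

    # Standard errors, t-values and p-values for each coefficient
    m <- lm(y ~ x1 + x2, data = d)
    summary(m)   # tests H0: slope = 0, with n - K - 1 residual degrees of freedom
    confint(m)   # 95% CIs; reject H0: slope = 0 when 0 lies outside the interval

    # Intercept-only model: inference about the population mean of y
    m0 <- lm(y ~ 1, data = d)
    confint(m0)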

What are categorical predictor variables? - Chapter 6

  • A categorical independent variable can be added to a model by creating a quantitative variable.
  • A categorical variable with only two classes (groups/categories) can be recoded into a dummy variable.
  • A categorical variable with more than two classes (groups/categories) can be recoded into a set of dummy variables.
  • With dummy coding there is always a reference group (the one that is coded as 0).
  • When testing a null-hypothesis about the equality of more than two group means, an analysis of variance (ANOVA) should be carried out, reporting the F-statistic (see the R sketch after this list).
  • When the null-hypothesis is true, then you expect to see an F-value of around 1.
  • An F-value much greater than 1 indicates evidence for rejecting the null-hypothesis. How much evidence depends on the degrees of freedom (model and residual degrees of freedom).
  • The F-distribution is closely related to the t-distribution.
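
A minimal R sketch, assuming a hypothetical data frame d with a numeric y and a three-class variable group:

    # R turns a factor into dummy variables; the first level is the reference group
    d$group <- factor(d$group)
    m <- lm(y ~ group, data = d)
    summary(m)  # each dummy coefficient compares a group with the reference group
    anova(m)    # F-test of H0: all group means are equal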

What are the assumptions of linear models? - Chapter 7

  • The general assumptions of linear models are linearity (additivity), independence, normality and homogeneity of variance.
  • Linearity refers to the characteristic that the model equation is a sum of terms that are linear in the parameters, e.g. b0 + b1X1 + b2X2 + ….
  • Normality refers to the characteristic that the residuals are drawn from a normal distribution, i.e. e ∼ N(0, σ²).
  • Independence refers to the characteristic that the residuals are completely randomly drawn from the normal distribution. There is no systematic pattern in the residuals.
  • Homogeneity of variance refers to the characteristic that there is only one normal distribution that the residuals are drawn from, that is, with one specific variance. Variance of residuals should be the same for every subset of the data.
  • Assumptions are best checked visually (see the R sketch after this list).
  • Problems can often be resolved by some transformation of the data, for example taking the logarithm of a variable, or computing squares.
  • Inference is generally robust against violations of these assumptions, except for the independence assumption.
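
A minimal R sketch of the visual checks, for a hypothetical model of y on x1 in a data frame d:

    m <- lm(y ~ x1, data = d)
    plot(m, which = 1)  # residuals vs fitted: checks linearity and homogeneity
    plot(m, which = 2)  # normal Q-Q plot of the residuals: checks normality
    hist(resid(m))      # rough check of the residual distribution

    # An example transformation when assumptions are violated
    m_log <- lm(log(y) ~ x1, data = d)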

What should we do when the assumptions are not met? - Chapter 8

  • When a distribution of residuals looks very far removed from a normal distribution, consider using a nonparametric method of analysis.
  • Options include Spearman's rho, Kendall's rank-order correlation and the Kruskal-Wallis test for group comparisons (see the R sketch after this list).
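
A minimal R sketch of these alternatives; d, y, x1 and group are hypothetical:

    cor.test(d$x1, d$y, method = "spearman")  # Spearman's rho
    cor.test(d$x1, d$y, method = "kendall")   # Kendall's rank-order correlation
    kruskal.test(y ~ group, data = d)         # Kruskal-Wallis group comparison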

What does moderation entail? - Chapter 9

  • When you have two independent variables, it is possible to also quantify the extent to which one variable moderates (modifies) the effect of the other variable on the dependent variable.
  • This quantity is termed the interaction effect (see the R sketch after this list).
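
A minimal R sketch; d, y, x1 and x2 are hypothetical:

    # x1 * x2 is shorthand for x1 + x2 + x1:x2
    m_int <- lm(y ~ x1 * x2, data = d)
    summary(m_int)  # the x1:x2 coefficient quantifies the interaction effect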

How do researchers use contrast in statistical analysis? - Chapter 10

  • When analysing categorical variables in a linear model, the categorical variable is represented by a new set of numeric variables.
  • These new variables can be dummy variables (default) but can also be other types of numeric variables.
  • How these numeric variables relate to the original categorical variable is summarised in a coding matrix S.
  • The coding matrix S determines what values are printed in the regression table. These values are actually contrasts.
  • Contrasts are weighted sums of group means. The contrasts are represented in a contrast matrix L.
  • Contrasts are meant to address specific research questions.
  • Matrix L is the inverse of S, and vice versa (see the R sketch after this list).
  • If your model involves moderation, you can calculate “simple effects” (or “simple slopes”): the effects of one independent variable given particular values of a second independent variable.
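
A minimal R sketch of custom contrasts for a hypothetical three-group factor g; the contrast weights are made up for illustration:

    g <- factor(rep(c("a", "b", "c"), each = 10))

    # Contrast matrix L: each row is a weighted sum of the group means
    L <- rbind(mean   = c(1/3, 1/3, 1/3),  # intercept row
               b_vs_a = c(-1, 1, 0),
               c_vs_a = c(-1, 0, 1))

    # Coding matrix S is the inverse of L; drop the intercept column for R
    S <- solve(L)
    contrasts(g) <- S[, -1]  # lm(y ~ g) now reports exactly these contrasts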

How do we perform post hoc comparisons? - Chapter 11

  • Your main research questions are generally very limited in number. If they can be translated into contrasts, we call these a priori contrasts.
  • Your a priori contrasts can be answered using a pre-set level of significance; in the social and behavioural sciences this is often 5% for p-values and 95% for confidence intervals. No adjustment is necessary.
  • This pre-set level of significance, α, should be set before looking at the data (if possible before the collection of the data).
  • If you are looking at the data and want to answer specific research questions that arise because of what you see in the data (post hoc), you should use adjusted p-values and confidence intervals.
  • There are several ways of adjusting the test-wise αs to obtain a reasonable family-wise α: Bonferroni is the simplest method but rather conservative (low statistical power). Many alternative methods exist, among them Scheffé’s procedure and Tukey’s HSD method (see the R sketch after this list).
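
A minimal R sketch of adjusted post hoc comparisons; d, y and group are hypothetical:

    pairwise.t.test(d$y, d$group, p.adjust.method = "bonferroni")
    TukeyHSD(aov(y ~ group, data = d))  # Tukey HSD with adjusted confidence intervals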

How do we perform linear mixed modelling? - Chapter 12

  • In linear models, certain numbers, namely the intercept and the slope, are the same for every case. We therefore call these effects of intercept and slope fixed effects, as they are the same for all units of analysis. In contrast, we call the e term, the random error term or residual in the regression, a random effect, because the error term is different for every unit of analysis.
  • When studying the data of pre-post intervention designs, we are mainly focused on the fixed effect of the intervention: Is X related to a change in Y?
  • Whenever the main interest of a study is in the variances of the random effects, REML is the estimation method of choice. If the main research question is about the fixed effects, use ML (see the R sketch after this list).
  • When you want to write down the results from a linear mixed model, it is important to explain what the model looked like, in terms of fixed and random effects. Usually it is not necessary to mention whether you used REML or ML, or what method you used to determine degrees of freedom. 
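
A minimal R sketch using the lme4 package; d, y, time and participant are hypothetical:

    library(lme4)
    # Fixed effect of time, random intercept per participant
    m_ml   <- lmer(y ~ time + (1 | participant), data = d, REML = FALSE)  # ML
    m_reml <- lmer(y ~ time + (1 | participant), data = d, REML = TRUE)   # REML
    summary(m_ml)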

How do we conduct linear mixed models for more than two measurements? - Chapter 13

  • An intraclass correlation (ICC) indicates how much clustering there is within groups. When this correlation is 0 or very close to 0, it does not matter whether you include the random effects; we might as well use an ordinary linear model with the lm() function. Because linear models assume that the residuals are completely random, with no systematic effects, it is very important to include random effects if we observe a high ICC, say 0.85. Ignoring a factor that causes an ICC of 0.85 can lead to wrong inference: wrong standard errors and confidence intervals (see the R sketch after this list).
  • We can report the analysis in a fashion like: "A linear mixed model was run on the Y levels, using a fixed effect for the numeric predictor variable time and random effects for the variable participant. We saw a significant linear effect of time on Y level, t(200) = −24.42, p < .001."
  • Our null-hypothesis is that the effect of X on Y is the same in group 1 and group 2. We can investigate whether the effect of X is the same for both groups.
  • Results can be reported in a fashion such as: "The null-hypothesis that the effect of X is the same in the two groups cannot be rejected, t(100) = 0.743, p = .46. We therefore conclude that there is no evidence that X has a different effect for group 1 than for group 2."
  • In statistically analysing the interaction effect in a pre-mid-post intervention design, we make use of dummy coding.  
  • A mixed design includes two kinds of variables: one is a between-individuals variable, and one variable is a within-individual variable.  
  • When there is at least one within variable in your design, you have to use a linear mixed model.
  • A possible answer based on the F-statistic of an ANOVA can be formulated as follows: "We see a significant measure by group interaction effect, F(2, 2850) = 1859.16, p < .001. The null-hypothesis of the same change in Y in three different populations can be rejected, and we conclude that the different groups show a different change in Y over time."
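
A minimal R sketch of the ICC and the interaction test; d, y, time, group and participant are hypothetical:

    library(lme4)

    # ICC from the variance components of an intercept-only model
    m0 <- lmer(y ~ 1 + (1 | participant), data = d)
    v  <- as.data.frame(VarCorr(m0))
    v$vcov[1] / sum(v$vcov)  # between-participant variance / total variance

    # Mixed design: time (within) by group (between) interaction
    m <- lmer(y ~ time * group + (1 | participant), data = d, REML = FALSE)
    summary(m)  # the time:group terms test whether change over time differs by group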

What are non-parametric alternatives for linear mixed models? - Chapter 14

  • When a distribution of residuals looks very far removed from a normal distribution and/or shows heterogeneity of variance, consider either a data transformation or using a non-parametric method of analysis.
  • Friedman’s and Wilcoxon’s tests are non-parametric alternatives for linear mixed modelling with one categorical predictor variable, in a within-subjects design.
  • Wilcoxon’s test can be used for an independent variable with only two categories.
  • Friedman’s test can be used for an independent variable with more than two categories (see the R sketch after this list).
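
A minimal R sketch; the wide-format columns pre, mid and post are hypothetical scores of the same participants:

    wilcox.test(d$pre, d$post, paired = TRUE)  # two within-subject conditions

    # Friedman's test takes a matrix: rows are participants, columns are conditions
    friedman.test(as.matrix(d[, c("pre", "mid", "post")]))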

How is logistic regression conducted with generalised linear models? - Chapter 15

  • Logistic regression is appropriate when the dependent variable is dichotomous (yes/no, 1/0, TRUE/FALSE); see the R sketch after this list.
  • Logistic regression is a form of a generalised linear model.
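
A minimal R sketch; the 0/1 outcome passed and the predictors are hypothetical:

    m <- glm(passed ~ x1 + x2, data = d, family = binomial)
    summary(m)    # coefficients on the log-odds (logit) scale
    exp(coef(m))  # the same coefficients expressed as odds ratios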

How can generalised linear models be used for count data? - Chapter 16

  • A Poisson regression model is a form of a generalised linear model. 
  • Poisson regression is appropriate in situations where the dependent variable is a count variable (0,1,2,3,…).
  • Whereas the normal distribution has two parameters (mean μ and variance σ²), the Poisson distribution has only one parameter: λ.
  • A Poisson distribution with parameter λ has mean equal to λ and variance equal to λ.
  • λ can take any real value between 0 and ∞.
  • The ANOVA analogue in Poisson regression is a comparison of the model deviances, making use of the χ² distribution. The degrees of freedom are determined by the difference in the number of parameters between the two models (see the R sketch after this list).
  • Poisson regression with only categorical independent variables is the same as analysing counts in cross-tables. For two categorical independent variables, one often reports a Pearson chi-square test.
  • Poisson regression is more versatile than limiting yourself to Pearson chi-square test statistics.
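
A minimal R sketch; the count outcome n_events and the predictors are hypothetical:

    m1 <- glm(n_events ~ x1, data = d, family = poisson)
    m2 <- glm(n_events ~ x1 + x2, data = d, family = poisson)
    anova(m1, m2, test = "Chisq")  # deviance comparison on the chi-square distribution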

What does big data analytics entail? - Chapter 17

  • Data science is about making data available for analysis. This field of research aims to extract knowledge and insight from structured and unstructured data. To do that, it draws from statistics, mathematics, computer science and information science.
  • There are a couple of reasons why big data analytics is different from the data analysis framework discussed in previous chapters. These relate to different types of questions, the p>n problem and the problem of over-fitting.
  • In big data analytics one uses these models too, but in addition there is a wealth of other models and methods. To name the most well-known: decision trees, support vector machines, smoothing splines, generalised additive models, naive Bayes, and neural networks. Each model or method in itself has many subversions.
  • Cross-validation is a re-sampling method. In re-sampling methods, different subsets of the training data are used to fit the same model, different models, or different versions of a model (a minimal k-fold sketch in R follows this list).
  • In big data problems, we often see the following steps: problem identification; selection of data sources; feature selection; construction of a data matrix; splitting the data into training and test (validation) sets; model selection; model building; model validation; and finally interpretation and evaluation of the results.
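
A minimal k-fold cross-validation sketch in R; the data frame d with outcome y is hypothetical:

    set.seed(1)
    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(d)))  # random fold assignment
    mse <- numeric(k)
    for (i in 1:k) {
      train <- d[folds != i, ]
      test  <- d[folds == i, ]
      m <- lm(y ~ ., data = train)                             # fit on training folds
      mse[i] <- mean((test$y - predict(m, newdata = test))^2)  # held-out fold error
    }
    mean(mse)  # cross-validated estimate of prediction error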
