THE ASSOCIATION BETWEEN TWO CATEGORICAL VARIABLES

When analysing data, the first step is to distinguish between the response variable and the explanatory variable. The response variable is the outcome variable on which comparisons are made. If the explanatory variable is categorical, it defines the groups to be compared with respect to values of the response variable. If the explanatory variable is quantitative, it defines the numerical values whose changes are compared with respect to values of the response variable. The explanatory variable should explain the response variable (e.g. survival status is a response variable and smoking status is the explanatory variable).

An association exists between two variables if a particular value of one variable is more likely to occur with certain values of the other variable. A contingency table is a display for two categorical variables. A conditional proportion is a proportion computed within a fixed category of another variable: it is always conditional on something (e.g. the proportion who died, conditional on being a smoker) and is often reported as a percentage. A proportion of the overall total (e.g. the percentage of all observations that are 'no') is called a marginal proportion.

Conditional proportions are useful in determining whether there is an association: an association is suggested when the conditional proportions differ across groups. A clear explanatory/response relationship dictates in which direction we compute the conditional proportions. If the conditional proportions are the same in every group, the two variables are independent.

THE ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES

We examine a scatterplot to study association. There is a difference between a positive association and a negative association: with a positive association, y tends to go up as x goes up; with a negative association, y tends to go down as x goes up. Correlation describes the strength of the linear association. Correlation (r) summarizes the direction of the association between two quantitative variables and the strength of...
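The conditional and marginal proportions described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical counts (the numbers and the smoking/survival labels are made up for the example), not data from the text.

```python
# Hypothetical 2x2 contingency table.
# Rows: smoking status (explanatory); columns: survival status (response).
table = {
    "smoker":     {"died": 139, "survived": 443},
    "non-smoker": {"died": 230, "survived": 502},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Conditional proportions: within each explanatory group, the share of
# each response outcome. Computed row-wise because smoking status is the
# explanatory variable and survival status is the response.
for group, counts in table.items():
    row_total = sum(counts.values())
    conditional = {outcome: n / row_total for outcome, n in counts.items()}
    print(group, conditional)

# Marginal proportion: the share of a response value in the whole sample,
# ignoring the explanatory variable.
total_died = sum(counts["died"] for counts in table.values())
print("marginal proportion died:", total_died / grand_total)
```

If the two conditional proportions of "died" differ clearly between the rows, that is evidence of an association; if they were equal, the variables would be independent.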
MODEL HOW TWO VARIABLES ARE RELATED

A regression line is a straight line that predicts the value of a response variable y from the value of an explanatory variable x. The correlation is a summary measure of association. The regression line uses the following formula:

ŷ = a + bx

where a is the y-intercept and b is the slope. The data are plotted before a regression line is fitted, because the line can be strongly influenced by outliers. The regression equation is often called a prediction equation. The difference y − ŷ between an observed outcome y and its predicted value ŷ is the prediction error, called the residual. The average of the residuals is zero. The regression line has a smaller sum of squared residuals than any other line; it is therefore called the least squares line. The population regression equation has the following formula:

μ_y = α + βx

This formula is a model: a simple approximation for how variables relate in a population. The probability distribution of y values at a fixed value of x is a conditional distribution (e.g. the distribution of annual income among people with 12 years of education).

DESCRIBE STRENGTH OF ASSOCIATION

Correlation does not differentiate between response and explanatory variables. The slope uses the correlation and can be calculated as follows:

b = r (s_y / s_x)

Using this formula, the y-intercept can be calculated:

a = ȳ − b x̄

The slope cannot be used to determine the strength of the association, because it depends on the units of measurement. The correlation is the standardized version of the slope:

r = b (s_x / s_y)

A property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If a particular x value falls 2.0 standard deviations from its mean and the correlation is 0.80, then the predicted y is r times that many standard deviations from its mean, so the predicted y would be 0.80 × 2.0 = 1.6 standard deviations from its mean. The predicted y is...
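The slope, intercept, and standardized-slope relations in this section can be checked numerically. The sketch below uses small made-up data (the education/income labels are only illustrative) and the `statistics` module from the Python standard library.

```python
import statistics

x = [8, 10, 12, 14, 16]   # e.g. years of education (hypothetical data)
y = [20, 24, 27, 33, 35]  # e.g. annual income in $1000s (hypothetical data)

n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Correlation from standardized scores: r = sum(z_x * z_y) / (n - 1)
r = sum((xi - xbar) / sx * (yi - ybar) / sy for xi, yi in zip(x, y)) / (n - 1)

b = r * sy / sx        # slope: b = r * (s_y / s_x)
a = ybar - b * xbar    # y-intercept: a = ybar - b * xbar

# For the least squares line the residuals y - yhat sum to zero,
# and the correlation is recovered as the standardized slope b * (s_x / s_y).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print("slope:", b, "intercept:", a, "sum of residuals:", round(sum(residuals), 10))
```

Running this confirms the two defining properties stated above: the residuals average to zero, and r = b (s_x / s_y).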
ONE-WAY ANOVA: COMPARING SEVERAL MEANS

The inferential method for comparing means of several groups is called analysis of variance (ANOVA). Categorical explanatory variables in multiple regression and in ANOVA are referred to as factors, also known as independent variables. An ANOVA with only one independent variable is called a one-way ANOVA. Evidence against the null hypothesis in an ANOVA test is stronger when the variability within each sample is smaller or when the variability between groups is larger. The formula for the F (ANOVA) test statistic is:

F = (between-groups estimate of variance) / (within-groups estimate of variance)

When the null hypothesis is true, the mean of the F-distribution is approximately 1. If the null hypothesis is false, then F tends to exceed 1, and it also increases as the sample size increases. The larger the F-statistic, the smaller the P-value. The F-distribution has two degrees of freedom values: df1 = g − 1 and df2 = N − g, where g is the number of groups and N the total sample size. The ANOVA test has five steps:

1. Assumptions: a quantitative response variable for g groups; independent random samples; normal population distributions with equal standard deviations.
2. Hypotheses: the null hypothesis states that all g population means are equal; the alternative states that at least two differ.
3. Test statistic: the F-statistic above.
4. P-value: the right-tail probability of the observed F-value.
5. Conclusion: the null hypothesis is normally rejected if the P-value is smaller than 0.05.

If the sample sizes are equal, the within-groups estimate of the variance is the mean of the g sample variances for the g groups. If the sample sizes are equal (n per group), the between-groups estimate of the variance is n times the sum of squared deviations of the group means from the grand mean, divided by g − 1.

The ANOVA F-test is robust to violations of its assumptions if the sample size is large enough. If the group sample sizes are not equal, the F-test works quite well as long as the largest group standard deviation is no more than about twice the smallest group standard deviation. A disadvantage of the F-test is that it tells us whether groups differ, but not which groups differ.

ESTIMATING DIFFERENCES IN GROUPS FOR A SINGLE FACTOR

The F-test only...
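For equal group sizes, the two variance estimates and the F-statistic can be computed directly. The sketch below uses three small hypothetical samples of equal size; only the standard-library `statistics` module is needed.

```python
import statistics

groups = [                 # g = 3 hypothetical samples, n = 5 each
    [4, 5, 6, 5, 5],
    [7, 8, 6, 7, 7],
    [6, 5, 7, 6, 6],
]
g = len(groups)
n = len(groups[0])

grand_mean = statistics.mean(x for grp in groups for x in grp)
group_means = [statistics.mean(grp) for grp in groups]

# Within-groups estimate (df = N - g): with equal n, simply the mean
# of the g sample variances.
within = statistics.mean(statistics.variance(grp) for grp in groups)

# Between-groups estimate (df = g - 1): n times the sum of squared
# deviations of the group means from the grand mean, divided by g - 1.
between = n * sum((m - grand_mean) ** 2 for m in group_means) / (g - 1)

F = between / within
print("F =", F, "with df1 =", g - 1, "and df2 =", g * n - g)
```

A large F here reflects group means that vary much more than the within-group variability would predict under the null hypothesis; the P-value would be the right-tail probability from the F-distribution with these two df values.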
COMPARE TWO GROUPS BY RANKING

Nonparametric statistical methods are inferential methods that do not assume a particular form (e.g. a normal distribution) for the population distribution. The Wilcoxon test is the best-known nonparametric method. Nonparametric methods are useful when the data are ranked and when the assumption of normality is inappropriate. The Wilcoxon test sets up a sampling distribution using the probability of each possible difference between the mean ranks. This test has five steps:

1. Assumptions: independent random samples from two groups.
2. Hypotheses: the null hypothesis states that the two groups have identical population distributions.
3. Test statistic: the difference between the sample mean ranks for the two groups.
4. P-value: a one-tail or two-tail probability, depending on the alternative hypothesis.
5. Conclusion: the null hypothesis is either rejected in favour of the alternative hypothesis or not.

The sum of the ranks can also be used instead of the mean of the ranks. When the sample is large enough, the Wilcoxon test can also be conducted as a z-test, dividing the difference between the sample mean ranks by its standard error:

z = (difference between sample mean ranks) / se

A Wilcoxon test can also be conducted by converting quantitative observations to ranks. The Wilcoxon test is not affected by outliers: an extreme outlier simply receives the lowest or highest rank, no matter how far it lies from the next observation. The difference between the population medians can also be estimated when the distribution is highly skewed, but this requires the extra assumption that the population distributions of the two groups have the same shape. The point estimate of the difference between the two medians equals the median of the differences between observations from the two groups. A sample proportion can also be used, by checking what proportion of observations in group one is better than in group two. A proportion of 0.50 means there is no effect; the closer the proportion gets to 0 or 1, the greater the difference between the two groups. NONPARAMETRIC METHODS FOR...
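The ranking step and the mean-rank test statistic can be sketched as follows. The data are hypothetical, and the small `ranks` helper is written here for illustration (real analyses would use a statistics library); it assigns tied values the average of their rank positions, the usual midrank convention.

```python
def ranks(values):
    """Ranks 1..n; tied values share the average of their rank positions."""
    sorted_vals = sorted(values)
    return [
        sum(i + 1 for i, s in enumerate(sorted_vals) if s == v) / sorted_vals.count(v)
        for v in values
    ]

group1 = [12, 15, 17, 20]   # hypothetical scores
group2 = [9, 11, 14, 16]

# Pool the observations, rank them, then split the ranks back per group.
pooled = group1 + group2
r = ranks(pooled)
mean_rank_1 = sum(r[: len(group1)]) / len(group1)
mean_rank_2 = sum(r[len(group1):]) / len(group2)

# Test statistic: the difference between the sample mean ranks. Replacing
# 20 by an extreme outlier like 2000 would not change any rank, which is
# why the test is unaffected by outliers.
print("mean ranks:", mean_rank_1, mean_rank_2,
      "difference:", mean_rank_1 - mean_rank_2)
```

Under the null hypothesis of identical population distributions, the two mean ranks should be similar; a large difference in either direction gives a small P-value.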
Probability refers to the proportion of occurrences when a particular experiment is repeated infinitely often under identical circumstances. It is a long-term relative frequency; it does not apply to unique events and it depends on the reference category. Subjective probability refers to the subjective degree of conviction in a hypothesis. Objective probability refers to the long-term relative frequency and is the probability used in classical statistics.

The p-value is the probability of finding a test statistic at least as extreme as the one observed, given that the null hypothesis is true. An X% confidence interval for a parameter is an interval that, in repeated use, has an X% chance of capturing the true value of the parameter. P-values are only concerned with the null hypothesis; it is not possible to make statements about the probability of a hypothesis in classical statistics.

If the null hypothesis is true, the p-value drifts randomly as data accumulate. Therefore, it is possible for the p-value to become significant by chance. This is why stopping rules are imperative in classical statistics. In Bayesian statistics, the Bayes factor does not drift randomly but drifts towards the correct decision.

In classical statistics, the conclusion is influenced by (1) the stopping rules, (2) the timing of explanations (post-hoc test or not) and (3) multiple testing. This is not the case in Bayesian statistics. Classical statistics does not allow probabilities to be assigned to hypotheses or parameters, whereas Bayesian statistics does allow this.

Bayesian statistics is a method of learning from prediction errors. It assumes that objective probability does not exist but only uncertainty, which has to be quantified in a principled manner. Therefore, in Bayesian statistics, probability can be assigned to a single hypothesis. The data drive an update from prior knowledge to posterior knowledge. This method investigates the probability of a hypothesis given the data, whereas classical statistics investigates the probability of the data given a hypothesis. The Bayes...
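The update from prior to posterior knowledge described above follows Bayes' rule: posterior odds = prior odds × Bayes factor. The sketch below uses made-up numbers (a prior of 0.5 and a hypothetical Bayes factor of 6) purely to show the mechanics.

```python
# Bayesian update: probability is assigned directly to a hypothesis H1.
prior_p_h1 = 0.5       # prior P(H1); 0.5 expresses no initial preference
bayes_factor = 6.0     # hypothetical BF10 = P(data | H1) / P(data | H0)

prior_odds = prior_p_h1 / (1 - prior_p_h1)
posterior_odds = prior_odds * bayes_factor

# Convert odds back to a probability for the single hypothesis H1.
posterior_p_h1 = posterior_odds / (1 + posterior_odds)
print("posterior P(H1):", posterior_p_h1)
```

This is exactly what classical statistics cannot do: the p-value speaks only about the data under the null hypothesis, whereas here the data update the probability of the hypothesis itself.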