## Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)


What are statistical methods? – Chapter 1

Statistics is used more and more often to study the behavior of people, not only in the social sciences but also by companies. Everyone can learn to use statistics, even without much mathematical background and even with a fear of statistics. Most important are logical thinking and perseverance.

The first step in using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinions of 1000 people on whether marijuana should be allowed. Data can be obtained through questionnaires, experiments, observations or existing databases.

But statistics is more than just numbers obtained from data. A broader definition of statistics encompasses all methods for obtaining and analyzing data.

Before data can be analyzed, a design is made describing how to obtain the data. There are two types of statistical analysis: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information in a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind of statistics is used depends on the goal of the research (summarizing or predicting).

To understand the differences better, a number of basic terms are important. The subjects are the entities that are observed in a research study, most often people but sometimes families, schools, cities etc. The population is the entire set of subjects that you want to study (for instance foreign students). The sample is a limited number of selected subjects on which you will collect data (for instance 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it's usually impossible to research the entire population, a sample is drawn.

Descriptive statistics can be used both when data is available for the entire population and when it is available only for a sample. Inferential statistics applies only to samples, because it draws conclusions about something that has not been fully observed. Hence the definition of inferential statistics: making predictions about a population, based on data gathered from a sample.

The goal of statistics is to learn more about the parameter. The parameter is a numerical summary of the population: an unknown value that says something about the whole. So it is about the population, not about the sample. This is why an important part of


Which kinds of samples and variables are possible? – Chapter 2

All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). A variable captures the variability of a value; as an example, the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to distinguish variables, are possible.

The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of siblings, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status, or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (e.g. the average age), but for categorical variables this isn't possible (e.g. there is no average sex).

There are also four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

The ordinal scale, on the other hand, assumes a certain order. Take happiness: if the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. A respondent who indicates being neutral is happier than one who is considerably unhappy, who in turn is happier than one who is unhappy. Importantly, the distances between the values cannot be measured; this is the difference between ordinal and interval.

Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

The difference between interval and ratio is that an interval scale has no true zero point, while a ratio scale does. So the ratio scale has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage and income.

Furthermore, there are discrete and continuous variables. A variable is discrete when the possible values are limited to separate numbers. A variable is continuous when any value in a range is possible. For instance, the number of brothers and sisters is discrete, because it's not possible to have 2.43 siblings. And for instance


What are the main measures and graphs of descriptive statistics? - Chapter 3

- 3.1 Which tables and graphs display data?
- 3.2 How do you describe the center of data using mean, median and mode?
- 3.3 How can you measure the variability of data?
- 3.4 How can you measure quartiles and other positions on a distribution?
- 3.5 What are statistics for multiple variables called?
- 3.6 Which letters are used in formulas to mark the difference between the sample and the population?

Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, each has different descriptive statistics.

To create an overview of categorical data, it's easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often subjects fall within this category relative to the whole sample. It can be expressed as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations, multiplied by 100. A proportion is calculated the same way, but without multiplying by 100. The sum of all proportions should be 1.00; the sum of all percentages should be 100.

Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows the comparison with the whole sample.

Example (relative) frequency distribution:

| Gender | Frequency | Proportion | Percentage |
|--------|-----------|------------|------------|
| Male   | 150       | 0.43       | 43%        |
| Female | 200       | 0.57       | 57%        |
| Total  | 350 (= n) | 1.00       | 100%       |
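Using the counts from the table, the proportion and percentage calculations can be sketched in a few lines of Python (a minimal illustration, not an example from the book):

```python
# Relative frequencies for a categorical variable (counts from the table).
counts = {"Male": 150, "Female": 200}

n = sum(counts.values())  # total number of observations (n = 350)
proportions = {k: v / n for k, v in counts.items()}
percentages = {k: 100 * v / n for k, v in counts.items()}

print(proportions)   # proportions sum to 1.00
print(percentages)   # percentages sum to 100
```

The proportions here round to 0.43 and 0.57, matching the table.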

Aside from tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then


What role do probability distributions play in statistical inference? – Chapter 4

- 4.1 What are the basic rules of probability?
- 4.2 What is the difference in probability distributions for discrete and continuous variables?
- 4.3 How does the normal distribution work exactly?
- 4.4 What is the difference between sample distributions and sampling distributions?
- 4.5 How do you create the sampling distribution for a sample mean?
- 4.6 What is the connection between the population, the sample data and the sampling distribution?

Randomization is important for collecting data: the possible observations are known, but it's not yet known which possibility will occur. What will happen depends on probability. The probability of an outcome is the proportion of times that the outcome occurs in a long sequence of similar observations. The length of the sequence matters: the longer the sequence, the more accurate the probability, and the more the sample proportion resembles the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.
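The long-run idea can be illustrated with a small simulation; the event probability of 0.7 below is made up for the sketch:

```python
import random

# Long-run relative frequency approaching the underlying probability
# (hypothetical event with probability 0.7).
random.seed(42)
p = 0.7

def sample_proportion(n):
    """Simulate n observations and return the proportion of 'successes'."""
    hits = sum(random.random() < p for _ in range(n))
    return hits / n

short_run = sample_proportion(10)      # can be far from 0.7
long_run = sample_proportion(100_000)  # very close to 0.7
print(short_run, long_run)
```

With only 10 observations the proportion can deviate considerably; with 100,000 it lands very close to 0.7.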

A probability is written as P(A), where P stands for probability and A is an outcome. If only two outcomes are possible and they exclude each other, then the probability that B happens is 1 − P(A).

Imagine research about people's favorite colors, say red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you multiply the probability that someone is married (A) with the probability that someone has kids (B) given that they are married. The formula is: P(A and B) = P(A) × P(B | A). Because B depends on A, P(B | A) is called a conditional probability.

Now, imagine researching multiple possibilities that are not connected. The probability that a random person likes to wear sweaters (A) and that another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
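The probability rules above can be checked numerically; all values below are made up for illustration:

```python
# Numeric check of the basic probability rules, with hypothetical values.
P_A = 0.4              # P(A): e.g. someone is married
P_B_given_A = 0.6      # P(B | A): e.g. has kids, given married

# Complement rule for two mutually exclusive outcomes
P_not_A = 1 - P_A                 # 0.6

# Addition rule for mutually exclusive outcomes
P_red, P_blue = 0.3, 0.2
P_red_or_blue = P_red + P_blue    # 0.5

# Multiplication rule with a conditional probability
P_A_and_B = P_A * P_B_given_A     # 0.24

# Independent events: P(B | A) reduces to P(B)
P_C = P_D = 0.5
P_C_and_D = P_C * P_D             # 0.25
```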

A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has set possible values,


How can you make estimates for statistical inference? – Chapter 5

Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).

Two kinds of parameter estimates exist:

- A point estimate is a single number that is the best prediction.
- An interval estimate is an interval surrounding a point estimate, which is believed to contain the population parameter.

There is a difference between the estimator (the method by which estimates are made) and the point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the proportion of the population that believes in love at first sight.

A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.

An estimator is unbiased when its sampling distribution is centered around the parameter. This holds for the sample mean: the sampling distribution of ȳ (the sample mean) is centered around µ (the population mean). ȳ is then regarded a good estimator of µ.

When an estimator is biased, it does not estimate the parameter well. Usually the estimate falls below, because the extremes in a sample can never exceed those in the population, only fall short of them. The sample variability is therefore smaller, which makes it underestimate the population variability.

An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Imagine a normal distribution: the standard error of the sample median is about 25% larger than the standard error of the sample mean, so the sample mean tends to be closer to the population mean than the sample median is. The sample mean is then the more efficient estimator.

A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).
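The efficiency claim above (the median's standard error being roughly 25% larger than the mean's for normal data) can be checked with a small simulation sketch; the sample size and repetition count below are arbitrary choices:

```python
import random
import statistics

# Simulation sketch: for normal data, the sample mean is a more efficient
# estimator of the center than the sample median (smaller standard error).
random.seed(7)

def sampling_sd(estimator, n=30, reps=2000, mu=0.0, sigma=1.0):
    """Empirical standard error of an estimator over many repeated samples."""
    estimates = [
        estimator([random.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(estimates)

se_mean = sampling_sd(statistics.mean)
se_median = sampling_sd(statistics.median)
print(se_mean, se_median)  # se_median comes out noticeably larger
```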

Usually the sample mean serves as an estimator for the population mean, the sample standard deviation


How do you perform significance tests? – Chapter 6

- 6.1 What are the five components of a significance test?
- 6.2 How do you perform a significance test for a mean?
- 6.3 How do you perform a significance test for a proportion?
- 6.4 Which errors can be made in significance tests?
- 6.5 Which limitations do significance tests have?
- 6.6 How can you calculate the probability of type II error?
- 6.7 How is the binomial distribution used in significance tests for small samples?

A hypothesis is a prediction that a parameter within the population has a certain value or falls within a certain interval. A distinction can be made between two kinds of hypotheses. A null hypothesis (H_{0}) is the assumption that a parameter takes a certain value. Opposite is the alternative hypothesis (H_{a}), the assumption that the parameter falls in a range outside of that value. Usually the null hypothesis means no effect. A significance test (also called hypothesis test or test) determines whether enough evidence exists to support the alternative hypothesis. A significance test compares point estimates of parameters with the values expected under the null hypothesis.

Significance tests consist of five parts:

- Assumptions. Each test makes assumptions about the type of data (quantitative/categorical), the required level of randomization, the population distribution (for instance the normal distribution) and the sample size.
- Hypotheses. Each test has a null hypothesis and an alternative hypothesis.
- Test statistic. This indicates how far the estimate lies from the parameter value under H_{0}. Often, this is expressed as the number of standard errors between the estimate and the H_{0} value.
- P-value. This gives the weight of evidence against H_{0}. The smaller the P-value, the more evidence that H_{0} is incorrect and that H_{a} is correct.
- Conclusion. This is an interpretation of the P-value and a decision on whether H_{0} should be accepted or rejected.
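As an illustration of these five parts, a large-sample significance test for a mean can be sketched in a few lines; all numbers below are hypothetical, and the standard normal distribution is used for the P-value:

```python
import math

# Sketch: test H0: mu = 100 against the two-sided Ha: mu != 100
# with hypothetical sample data.
n = 64
y_bar = 103.0   # sample mean
s = 12.0        # sample standard deviation
mu_0 = 100.0    # value under H0

se = s / math.sqrt(n)        # standard error of the mean: 1.5
z = (y_bar - mu_0) / se      # test statistic: 2.0

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two-sided P-value: probability of a result at least this extreme under H0
p_value = 2 * (1 - phi(abs(z)))
print(z, p_value)  # about 2.0 and 0.0455: evidence against H0 at the 5% level
```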

Significance tests for quantitative variables usually research the population mean µ. The five parts of a significance test come to play here.

It is assumed that the data are retrieved from a random sample and that the population has a normal distribution.

The test is two-sided, meaning that the alternative hypothesis contains values on both sides of the null value. Usually the null hypothesis is H_{0}: µ = µ_{0}, in which µ_{0} is a specific value for the population mean. This hypothesis says that there is no effect (0). The alternative hypothesis then contains all other values and looks


How do you compare two groups in statistics? - Chapter 7

- 7.1 What are the basic rules for comparing two groups?
- 7.2 How do you compare two proportions of categorical data?
- 7.3 How do you compare two means of quantitative data?
- 7.4 How do you compare the means of dependent samples?
- 7.5 Which complex methods can be used for comparing means?
- 7.6 Which complex methods can be used for comparing proportions?
- 7.7 Which nonparametric methods exist for comparing groups?

In social science, two groups are often compared: means for quantitative variables, proportions for categorical variables. When comparing two groups, a binary variable is created: a variable with two categories (also called dichotomous). For instance, for sex as a variable the categories are men and women. This is an example of bivariate statistics.

Two groups can be dependent or independent. They are dependent when the respondents naturally match with each other. An example is longitudinal research, where the same group is measured in two moments in time. For an independent sample the groups don't match, for instance in cross-sectional research, where people are randomly selected from the population.

Imagine comparing two independent groups, men and women, on the time they spend sleeping. Men and women are two different groups, with two population means, two estimates and two standard errors. The standard error indicates how much the mean varies from sample to sample. Because we want to investigate the difference, this difference has a standard error of its own. What you want to know is µ₂ – µ₁, which is estimated by ȳ_{2} – ȳ_{1}. This can be shown in a sampling distribution. The standard error of ȳ_{2} – ȳ_{1} indicates how much the difference varies between samples. The formula is:

Estimated standard error = √(se_{1}^{2} + se_{2}^{2})

In this case se_{1} is the standard error of group 1 (men) and se_{2} the standard error of group 2 (women).
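As a quick numeric sketch, with made-up sample sizes and standard deviations for the two groups:

```python
import math

# Standard error of a difference between two independent sample means,
# using hypothetical sleep-time data for men and women.
n1, s1 = 80, 1.2   # men: sample size and sample standard deviation
n2, s2 = 90, 1.1   # women

se1 = s1 / math.sqrt(n1)
se2 = s2 / math.sqrt(n2)

# The two standard errors combine through their squares:
se_diff = math.sqrt(se1**2 + se2**2)
print(se1, se2, se_diff)
```

Note that the combined standard error is larger than either individual one, but smaller than their simple sum.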

Instead of the difference also the ratio can be given. This is especially useful in case of very small proportions.

The difference between the proportions of two populations (π_{2} – π_{1}) is estimated by the difference between the sample proportions. When the samples are very large, the standard error of this difference is small.

The confidence interval is the point estimate of the difference ± the t-score multiplied by the standard error. The formula for the group difference is:

(ȳ_{2} – ȳ_{1}) ± t × se, in which se = √(se_{1}^{2} + se_{2}^{2})

When


How do you analyze the association between categorical variables? – Chapter 8

A contingency table contains the counts of all possible combinations of categorical data. A 4x5 contingency table has 4 rows and 5 columns. It often shows percentages; this is called relative data.

A conditional distribution shows the data as percentages of a subtotal, given a certain condition, like the percentage of women that have a cold. A marginal distribution contains the separate row and column totals. A simultaneous distribution shows the percentages with respect to the entire sample.

Two categorical variables are statistically independent when the probability that one occurs is unrelated to the probability that the other occurs. So this is when the probability distribution of one variable is not influenced by the outcome of the other variable. If this does happen, they are statistically dependent.

When two variables are independent, this gives information about variables in the population. Probably the sample will be similarly distributed, but not necessarily. The variability can be high. A significance test tells whether it's plausible that the variables really are independent in the population. The hypotheses for this test are:

H_{0}: the variables are statistically independent

H_{a}: the variables are statistically dependent

A cell in a contingency table shows the observed frequency (f_{o}), the number of times that an observation is made. The expected frequency (f_{e}) is the number that is expected if the null hypothesis is true, so when the variables are independent. The expected frequency is calculated by multiplying the total of a row by the total of a column and then dividing this number by the sample size.

A significance test for independence uses a special test statistic. X^{2} says how close the expected frequencies are to the observed frequencies. The test that is performed is called the chi-squared test (of independence). The formula for this test is:

X^{2} = Σ (f_{o} – f_{e})^{2} / f_{e}

This method was developed by Karl Pearson. When X^{2} is small, the expected and observed frequencies are close together; the bigger X^{2}, the further they are apart. So this test statistic indicates how plausible it is that the deviation from independence is due to chance.
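A minimal sketch of the chi-squared computation, using a hypothetical 2x2 contingency table:

```python
# Chi-squared statistic for a 2x2 contingency table (hypothetical counts):
# rows could be gender, columns an opinion with two categories.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)
print(chi2)  # 4.0 for these counts
```

For a 2x2 table the degrees of freedom are (2 − 1)(2 − 1) = 1, and X² = 4.0 would be compared against a chi-squared critical value for df = 1.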

A binomial distribution shows the probabilities of outcomes of a small sample with categorical discrete variables, like tossing a coin. This is not a distribution of observations or a


How do linear regression and correlation work? – Chapter 9

- 9.1 What are linear associations?
- 9.2 What is the least squares prediction equation?
- 9.3 What is a linear regression model?
- 9.4 How does the correlation measure the association of a linear function?
- 9.5 How do you predict the slope and the correlation?
- 9.6 What happens when the assumptions of a linear model are violated?

Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable using the explanatory variable.

The response variable is denoted as y and the explanatory variable as x. A linear function means that a straight line runs through the data points in a graph. A linear function is: y = α + βx, in which alpha (α) is the y-intercept and beta (β) is the slope.

The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

The y-intercept is the value of y when x = 0. In that case βx equals 0 and only y = α remains. The y-intercept is where the line crosses the y-axis.

The slope (β) indicates the change in y for an increase of 1 in x. So the slope is an indication of how steep the line is: the larger the absolute value of β, the steeper the line.

When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.

A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.

In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.
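Estimating α and β from data is done with the least squares formulas; a minimal sketch, with made-up (x, y) values:

```python
# Least squares estimates for the prediction equation y-hat = a + b*x,
# computed from made-up data with the standard formulas.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# slope: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar  # y-intercept

print(a, b)  # prediction equation: y-hat = a + b*x
```

For these values the fitted line is approximately ŷ = 0.06 + 1.98x, close to the pattern the data were made to follow.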

The variable y is estimated by ŷ. The equation is estimated


Which types of multivariate relationships exist? – Chapter 10

Many scientific studies research more than two variables, requiring multivariate methods. A lot of research is focused on the causal relationship between variables, but finding proof of causality is difficult. A relationship that appears causal may be caused by another variable. Statistical control is the method of checking whether an association between variables changes or disappears when the influence of other variables is removed. In a causal relationship, x → y, the explanatory variable x causes the response variable y. This is asymmetrical, because y does not need to cause x.

There are three criteria for a causal relationship:

- Association between the variables
- Appropriate time order
- Elimination of alternative explanations

An association is required for a causal relationship, but it does not by itself establish one. Usually the logical time order is immediately clear, such as an explanatory variable preceding a response variable. Apart from x and y, extra variables may provide an alternative explanation. In observational studies it can almost never be proved that one variable causes another. Sometimes there are outliers or anecdotes that seem to contradict causality, but a single anecdote usually isn't enough proof. It's easier to establish causality with randomized experiments than with observational studies, because randomization assigns subjects to the two groups at random and fixes the time order before the experiment starts.

Eliminating alternative explanations is often tricky. A method of testing the influence of other variables is controlling them: eliminating them or keeping them at a constant value. Controlling means making sure that the control variables (the other variables) no longer influence the association between x and y. A randomized experiment in a way also controls variables: the subjects are assigned randomly, so the other variables manifest themselves randomly across the groups.

Statistical control is different from experimental control. In statistical control, subjects with certain characteristics are grouped together. Observational studies in social science often form groups based on socio-economic status, education or income.

The association between two quantitative variables is shown in a scatter plot. Controlling this association for a categorical variable is done by comparing the means.

The association between two categorical variables is shown in a contingency table. Controlling this association


What is multiple regression? – Chapter 11

- 11.1 What does a multiple regression model look like?
- 11.2 How do you interpret the coefficient of determination for multiple regression?
- 11.3 How do you predict the values of multiple regression coefficients?
- 11.4 How does a statistical model represent interaction effects?
- 11.5 How do you compare possible regression models?
- 11.6 How do you calculate the partial correlation?
- 11.7 How do you compare the coefficients of variables with different units of measurement by using standardized regression coefficients?

A multiple regression model has more than one explanatory variable and sometimes also one or more control variables: E(y) = α + β_{1}x_{1} + β_{2}x_{2}. The explanatory variables are numbered: x_{1}, x_{2}, etc. Each added explanatory variable extends the equation with another term like β_{2}x_{2}. The parameters are α, β_{1} and β_{2}. The y-axis is vertical, x_{1} is horizontal and x_{2} is perpendicular to x_{1}. In this three-dimensional graph the multiple regression equation describes a flat surface, called a plane.

A partial regression equation describes only part of the possible observations, only those with a certain value.

In multiple regression a coefficient indicates the effect of an explanatory variable on a response variable, while controlling for other variables. Bivariate regression completely ignores the other variables; multiple regression holds them constant. This is the basic difference between bivariate and multiple regression. The coefficient (like β_{1}) of a predictor (like x_{1}) gives the change in the mean of y when the predictor increases by one unit, controlling for the other variables (like x_{2}). In that case, β_{1} is a partial regression coefficient. The parameter α is the mean of y when all explanatory variables are 0.

The multiple regression model has its limitations. An association doesn't automatically mean that there is a causal relationship; there may be other factors. Some researchers are therefore careful and call statistical control 'adjustment'. The regular multiple regression model assumes that there is no statistical interaction, so that a slope β doesn't depend on the values of the other explanatory variables.

Multiple regression that exists in the population is estimated by the prediction equation: ŷ = a + b_{1}x_{1} + b_{2}x_{2} + … + b_{p}x_{p}, in which p is the number of explanatory variables.

Just like the bivariate model, the multiple regression model uses residuals to measure prediction errors. For a predicted response ŷ and a measured response y, the residual is the difference between them: y – ŷ. The SSE (Sum of Squared Errors/Residual Sum of Squares) is the same as for bivariate models: SSE = Σ(y – ŷ)^{2}; the only difference is that the estimate ŷ is now based on multiple explanatory variables.
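A minimal sketch of computing residuals and the SSE for a multiple regression prediction equation; the coefficients and observations below are made up, not estimates from real data:

```python
# Prediction equation ŷ = a + b1*x1 + b2*x2 with illustrative coefficients.
a, b1, b2 = 2.0, 0.5, 1.5

def predict(x1, x2):
    """Predicted response for one observation."""
    return a + b1 * x1 + b2 * x2

# Observed (x1, x2, y) triples, made up for the example.
obs = [(1, 2, 6.0), (2, 1, 4.0), (3, 3, 8.5)]

# SSE = sum of squared residuals (y - ŷ)².
sse = sum((y - predict(x1, x2)) ** 2 for x1, x2, y in obs)
print(round(sse, 2))  # 0.75
```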


What is ANOVA? – Chapter 12

For analyzing categorical variables without assigning a ranking, dummy variables are an option. This means that artificial variables are created to code the categories:

- z_{1} = 1 and z_{2} = 0: observations of category 1 (men)
- z_{1} = 0 and z_{2} = 1: observations of category 2 (women)
- z_{1} = 0 and z_{2} = 0: observations of category 3 (transgender and other identities)

The model is: E(y) = α + β_{1}z_{1} + β_{2}z_{2}. The means are derived from the model: μ_{1} = α + β_{1}, μ_{2} = α + β_{2} and μ_{3} = α. Three categories only require two dummy variables, because observations with z_{1} = z_{2} = 0 fall in category 3.
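The dummy coding and the resulting group means can be sketched as follows; the parameter values are illustrative, not from real data:

```python
# Dummy coding for a three-category variable. Category 3 is the reference,
# so mu3 = alpha, mu1 = alpha + beta1, mu2 = alpha + beta2.
def dummies(category):
    z1 = 1 if category == 1 else 0
    z2 = 1 if category == 2 else 0
    return z1, z2

# Illustrative parameter values (not estimates from real data).
alpha, beta1, beta2 = 10, 2, -1

def expected_y(category):
    """E(y) = alpha + beta1*z1 + beta2*z2 for the given category."""
    z1, z2 = dummies(category)
    return alpha + beta1 * z1 + beta2 * z2

print([expected_y(c) for c in (1, 2, 3)])  # [12, 9, 10]
```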

A significance test using the F-distribution tests whether the means are the same. The null hypothesis H_{0} : μ_{1} = μ_{2} = μ_{3} is the same as H_{0} : β_{1} = β_{2} = 0. A large F means a small P and much evidence against the null hypothesis.

The F-test is robust against small violations of normality and differences in the standard deviations. However, it can't handle very skewed data. This is why randomization is important.

A small P doesn't say which means differ or by how much. Confidence intervals give more information. A confidence interval can be constructed for every mean, or for the difference between two means. An estimate of the difference in population means is ȳ_{i} – ȳ_{j}, with confidence interval (ȳ_{i} – ȳ_{j}) ± t s √(1/n_{i} + 1/n_{j}), in which s = √(SSE/(N – g)) is the pooled standard deviation.

The degrees of freedom of the t-score are df = N – g, in which g is the number of categories and N is the combined sample size (n_{1} + n_{2} + … + n_{g}). When the confidence interval doesn't contain 0, this is proof of difference between the means.
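A sketch of this interval, assuming the pooled standard deviation s = √(SSE/(N − g)) and a t critical value looked up in a t-table or software (4.303, the 95% value for df = 2, is used below):

```python
from math import sqrt
from statistics import mean

def ci_difference(group_i, group_j, all_groups, t_crit):
    """(ȳi - ȳj) ± t * s * sqrt(1/ni + 1/nj), with s pooled over all groups."""
    n_total = sum(len(g) for g in all_groups)  # N, combined sample size
    g = len(all_groups)                        # number of categories
    sse = sum((y - mean(grp)) ** 2 for grp in all_groups for y in grp)
    s = sqrt(sse / (n_total - g))              # pooled standard deviation
    estimate = mean(group_i) - mean(group_j)
    half_width = t_crit * s * sqrt(1 / len(group_i) + 1 / len(group_j))
    return estimate - half_width, estimate + half_width

# Tiny made-up samples; t_crit = 4.303 assumed from a t-table (df = N - g = 2).
low, high = ci_difference([4, 6], [7, 9], [[4, 6], [7, 9]], t_crit=4.303)
print(round(low, 2), round(high, 2))  # interval contains 0: no clear difference
```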

In case of lots of groups with equal population means, it might happen that a confidence interval finds a difference anyway, due to the increase in errors that comes with the increase in the number of comparisons. Multiple comparison methods control the probability that all intervals of a lot of comparisons contain the real differences. The multiple comparison error rate is the probability that at least one interval fails to contain the true difference. One such method is the Bonferroni method, which divides the desired overall error rate by the number of comparisons.
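The Bonferroni adjustment itself is simple to sketch: with g groups there are g(g − 1)/2 pairwise comparisons, and the overall error rate is divided among them. The values below are illustrative:

```python
from math import comb

def bonferroni_alpha(alpha, n_groups):
    """Per-comparison error rate that keeps the overall rate at alpha."""
    m = comb(n_groups, 2)   # number of pairwise comparisons, g*(g-1)/2
    return m, alpha / m

# With 5 groups and an overall 5% error rate:
m, per_comparison = bonferroni_alpha(0.05, 5)
print(m, per_comparison)  # 10 comparisons, each built at the 0.5% level
```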


How does multiple regression with both quantitative and categorical predictors work? – Chapter 13

- 13.1 What do models with both quantitative and categorical predictors look like?
- 13.2 Which inferential methods are available for regression with quantitative and categorical predictors?
- 13.3 In what kind of case studies is multiple regression analysis required?
- 13.4 How do you use adjusted means?
- 13.5 What does a linear mixed model look like?

Multiple regression is also feasible for a combination of quantitative and categorical predictors. In a lot of research it makes sense to control for a quantitative variable. A quantitative control variable is called a covariate and it is studied using analysis of covariance (ANCOVA).

A graph helps to research the effect of quantitative predictor x on the response y, while controlling for the categorical predictor z. For two categories a single dummy variable z suffices; more categories require more dummy variables (like z_{1} and z_{2}). The values of z can be 1 ('agree') or 0 ('don't agree'). If there is no interaction, the lines that fit the data best are parallel and the slopes are the same. It's even possible that the regression lines are exactly the same. But if they aren't parallel, there is interaction.

The predictor can be quantitative and the control variable categorical, but it can also be the other way around. Software compares the means. A regression model with three categories is: E(y) = α + βx + β_{1}z_{1} + β_{2}z_{2}, in which β is the effect of x on y for all groups z. For every additional quantitative variable a βx term is added. For every additional categorical variable a dummy variable is added (or several, depending on the number of categories). Cross-product terms are added in case of interaction.
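A sketch of such a model with made-up parameters shows why the fitted lines are parallel when there is no interaction: every category shares the slope β and only the intercept shifts.

```python
# E(y) = alpha + beta*x + beta1*z1 + beta2*z2, with illustrative parameters.
alpha, beta, beta1, beta2 = 5.0, 2.0, 3.0, -1.0

def expected_y(x, category):
    """Mean response for quantitative x in the given category (3 = reference)."""
    z1 = 1 if category == 1 else 0
    z2 = 1 if category == 2 else 0
    return alpha + beta * x + beta1 * z1 + beta2 * z2

# Same slope beta for every category; only the intercept differs.
print(expected_y(10, 1) - expected_y(10, 3))  # 3.0, the category-1 intercept shift
```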

The first step in making predictions is testing whether a model needs to include interaction. An F-test compares a model with cross-product terms to a model without. For this the F-test uses the partial sum of squares: the variability in y that is explained by a certain variable when the other variables are already accounted for. The null hypothesis says that the slopes of the cross-product terms are 0; the alternative hypothesis says that there is interaction. In a graph, interaction shows as regression lines that aren't parallel.

Another F-test checks whether a complete or a reduced model is better. To compare a complete model (E(y) = α + βx + β_{1}z_{1} + β_{2}z_{2}) with a reduced model (E(y) = α + βx), the test checks whether dropping the dummy terms significantly worsens the fit.


How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

- 14.1 What strategies are available for selecting a model?
- 14.2 How can you tell when a statistical model doesn't fit?
- 14.3 How do you detect multicollinearity and what are its consequences?
- 14.4 What are the characteristics of generalized linear models?
- 14.5 What is polynomial regression?
- 14.6 What do exponential regression and log transforms look like?
- 14.7 What are robust variance and nonparametric regression?

Three basic rules for selecting variables to add to a model are:

- Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables
- Add enough variables for a good predictive power
- Keep the model simple

The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all candidate variables and removes, step by step, the variable with the highest P-value until only significant variables remain. Forward selection starts from scratch, adding at each step the variable with the lowest P-value. Stepwise regression is a variation of forward selection that also removes variables that become redundant when new variables are added.
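Backward elimination can be sketched as a loop over P-values. The variable names and P-values below are hypothetical, and real software refits the model after every removal, which this simplified sketch skips:

```python
def backward_eliminate(p_values, alpha=0.05):
    """p_values: dict mapping variable name -> P-value.
    Repeatedly drop the least significant variable until every remaining
    variable has P <= alpha. (Real software refits after each removal.)"""
    selected = dict(p_values)
    while selected:
        worst = max(selected, key=selected.get)  # highest P-value
        if selected[worst] <= alpha:
            break                                # all remaining are significant
        del selected[worst]
    return sorted(selected)

# Hypothetical P-values from a full model:
print(backward_eliminate(
    {"education": 0.001, "age": 0.21, "income": 0.03, "region": 0.48}
))  # ['education', 'income']
```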

Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

Several criteria are indications of a good model. To find a model with high predictive power but without an overabundance of variables, the adjusted R^{2} is used: R^{2}_{adj} = 1 – (1 – R^{2})(n – 1)/(n – p – 1), in which n is the sample size and p the number of explanatory variables.

The adjusted R^{2} decreases when an unnecessary variable is added.
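A sketch of this computation, using the standard form 1 − (1 − R²)(n − 1)/(n − p − 1) with made-up values:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R²: penalizes R² for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up values: a 4th predictor raises R² slightly but lowers adjusted R².
print(round(adjusted_r2(0.50, 30, 3), 3))  # 0.442
print(round(adjusted_r2(0.51, 30, 4), 3))  # 0.432, lower despite the higher R²
```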

Cross-validation continuously checks whether the predicted values are as close as possible to the observed values. The result is the predicted residual sum of squares (PRESS): PRESS = Σ(y_{i} – ŷ_{(i)})^{2}, in which ŷ_{(i)} is the prediction for observation i from the model fitted without observation i.

A smaller PRESS means better predictions. However, this criterion assumes a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷ_{i} is as close as possible to E(y_{i}). A smaller AIC indicates a better model.
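The leave-one-out idea behind PRESS can be sketched with a deliberately simple "model", the mean of the remaining observations, standing in for a fitted regression:

```python
from statistics import mean

def press(ys):
    """PRESS = sum of (y_i - ŷ_(i))², where ŷ_(i) is predicted from a model
    fitted without observation i. Here the 'model' is just the mean of the
    remaining values, to keep the sketch self-contained."""
    total = 0.0
    for i, y in enumerate(ys):
        rest = ys[:i] + ys[i + 1:]      # leave observation i out
        total += (y - mean(rest)) ** 2  # squared leave-one-out residual
    return total

print(press([2, 4, 6]))  # 18.0
```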

Inference of parameters in a regression model has the following assumptions:

- The model fits the shape of the data
- The conditional distribution of y is normal
- The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)


What is logistic regression? – Chapter 15

- 15.1 What are the basics of logistic regression?
- 15.2 What does multiple logistic regression look like?
- 15.3 How does inference with logistic regression models work?
- 15.4 How is logistic regression performed for ordinal variables?
- 15.5 What do logistic models with nominal responses look like?
- 15.6 How do loglinear models describe the associations between categorical variables?
- 15.7 How do goodness-of-fit tests work for contingency tables?

A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). It's also possible for logistic regression models to have ordinal or nominal response variables. The mean is the proportion of responses that are 1. The *linear probability model* is P(y=1) = α + βx. This model is often too simple; a more extended version is the logistic regression model.

The logarithm can be calculated using software. The odds are: P(y=1)/[1-P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit) is the logistic regression model: logit[P(y=1)] = α + βx.

To find the outcome for a certain value of a predictor, the following formula is used: P(y=1) = e^{α + βx} / (1 + e^{α + βx}).

Raising e to a certain power gives the antilog of that number.
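The transformations between the logit scale, probabilities and odds can be sketched as follows; α and β are illustrative values, not estimates from real data:

```python
from math import exp

# Illustrative parameters for logit[P(y=1)] = alpha + beta*x.
alpha, beta = -2.0, 0.5

def prob(x):
    """P(y=1) = e^(alpha+beta*x) / (1 + e^(alpha+beta*x))."""
    z = alpha + beta * x
    return exp(z) / (1 + exp(z))

def odds(x):
    """Odds P(y=1) / [1 - P(y=1)] at a given x."""
    p = prob(x)
    return p / (1 - p)

print(round(prob(4), 3))            # 0.5, since alpha + beta*4 = 0 here
print(round(odds(6) / odds(4), 3))  # 2.718: odds ratio e^(2*beta) for a 2-unit increase
```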

A straight line is drawn next to the curve of a logistic graph to analyze it. The curve is steepest where P(y=1) = ½. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is: P(y=1)/[1 – P(y=1)] = e^{α + βx} = e^{α}(e^{β})^{x}.

The estimate replaces the parameters with their sample values: e^{a}(e^{b})^{x}.

With this the odds ratio can be calculated.

There are two ways to present the data. In ungrouped data, every row contains the response of a single subject. In grouped data, a row contains the counts for a whole group, like one row with the number of subjects that agreed, followed by the total number of subjects.

An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y*: y = 1 when y* is above a certain threshold T and y = 0 when it's below T. Because y* is hidden, it's called a latent variable. It can nevertheless be used to make a probit model: probit[P(y=1)] = α + βx.

Logistic regression with repeated measures and random effects is analyzed with a linear mixed model: logit[P(y_{ij} = 1)] = α + βx_{ij} + s_{i}.

The multiple logistic regression model is: logit[P(y = 1)] = α + β_{1}x_{1} + … + β_{p}x_{p}. The further β_{i} is from 0, the stronger the effect of x_{i} on the response, controlling for the other variables.


Which kinds of samples and variables are possible? – Chapter 2

All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). A variable expresses the variability of a value; an example is the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to distinguish between variables, are possible.

The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status, religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (e.g. the average age), but for categorical variables this isn't possible (e.g. there is no average sex).

There are also four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

The ordinal scale on the other hand assumes a certain order, for instance happiness. If the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. A respondent who indicates being neutral is happier than one who is considerably unhappy, who in turn is happier than one who is unhappy. What matters is that the distances between the values cannot be measured; this is the difference between ordinal and interval.

Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

The difference between interval and ratio is that an interval scale has no meaningful zero point (0 degrees Celsius doesn't mean 'no temperature'), while a ratio scale does. So the ratio scale has numerical values, with a certain order, with measurable differences and with a true zero. Examples are percentage or income.

Furthermore there are discrete and continuous variables. A variable is discrete when the possible values can only be limited, separate numbers. A variable is continuous when any value within a range is possible. For instance the number of brothers and sisters is discrete, because it's not possible to have 2.43 brothers/sisters. A variable like height, on the other hand, is continuous.


What are the main measures and graphs of descriptive statistics? - Chapter 3

- 3.1 Which tables and graphs display data?
- 3.2 How do you describe the center of data using mean, median and mode?
- 3.3 How can you measure the variability of data?
- 3.4 How can you measure quartiles and other positions on a distribution?
- 3.5 What do you call statistics for multiple variables?
- 3.6 Which letters are used in formulas to mark the difference between the sample and the population?

Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, each has different descriptive statistics.

To create an overview of categorical data, it's easiest if the categories are in a list including the frequency for each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category compared to the sample. This can be calculated as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations, multiplied by 100. Calculating a proportion works the same way, except the number isn't multiplied by 100. The sum of all proportions should be 1.00, the sum of all percentages should be 100.

Frequencies can be shown using a frequency distribution, a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows each value's share of the sample.

Example (relative) frequency distribution:

| Gender | Frequency | Proportion | Percentage |
|--------|-----------|------------|------------|
| Male   | 150       | 0.43       | 43%        |
| Female | 200       | 0.57       | 57%        |
| Total  | 350 (= n) | 1.00       | 100%       |
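A table like this can be computed from raw observations; the counts below are chosen to reproduce the example:

```python
from collections import Counter

# 150 'male' and 200 'female' responses, matching the example table.
responses = ["male"] * 150 + ["female"] * 200

counts = Counter(responses)       # frequency per category
n = sum(counts.values())          # total sample size

for category, freq in counts.items():
    proportion = freq / n
    # frequency, proportion (rounded) and percentage per category
    print(category, freq, round(proportion, 2), f"{proportion:.0%}")
```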

Aside from tables, other visual displays are also used, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values; then values are grouped into intervals and each bar represents an interval.


What role do probability distributions play in statistical inference? – Chapter 4

- 4.1 What are the basic rules of probability?
- 4.2 What is the difference in probability distributions for discrete and continuous variables?
- 4.3 How does the normal distribution work exactly?
- 4.4 What is the difference between sample distributions and sampling distributions?
- 4.5 How do you create the sampling distribution for a sample mean?
- 4.6 What is the connection between the population, the sample data and the sampling distribution?

Randomization is important for collecting data: the possible observations are known, but it's not yet known which possibility will occur. What will happen depends on probability. The probability is the proportion of times that a certain observation occurs in a long sequence of similar observations. The length of the sequence matters: the longer the sequence, the more accurate the probability, and the more the sample proportion resembles the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch within statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

A probability is written like P(A), where P is the probability and A is an outcome. If only two outcomes A and B are possible and they exclude each other, then the probability that B happens is 1 – P(A).

Imagine research about people's favorite colors, say red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you can multiply the probability that someone is married (A) with the probability that someone has kids (B) given that they are married. The formula for this is: P(A and B) = P(A) × P(B|A). Because B depends on A, P(B|A) is called a conditional probability.

Now, imagine researching multiple possibilities that are not connected. The probability that a random person likes to wear sweaters (A) and that another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
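The three rules can be sketched with made-up probabilities:

```python
# Made-up probabilities illustrating the three rules:
p_red, p_blue = 0.3, 0.2            # mutually exclusive favorite colors
p_married = 0.6
p_kids_given_married = 0.7          # conditional probability P(B | A)
p_sweater = 0.4                     # same for two independent random people

p_red_or_blue = p_red + p_blue                           # P(A or B), exclusive
p_married_with_kids = p_married * p_kids_given_married   # P(A and B) = P(A)*P(B|A)
p_both_sweaters = p_sweater * p_sweater                  # independent events

print(p_red_or_blue, p_married_with_kids, p_both_sweaters)
```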

A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has a set of separate possible values, a continuous variable can take any value within an interval.



Follow the author: Annemarie JoHo
