Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). A variable captures the variability of a value, for example the number of beers consumed per week by students. The values a variable can take constitute its measurement scale. Several measurement scales, or ways to distinguish variables, are possible.
The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status, or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex).
In addition, there are four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.
The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.
The ordinal scale, on the other hand, assumes a certain order. Take happiness: if the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. A respondent who indicates being neutral is happier than one who is considerably unhappy, who in turn is happier than one who is unhappy. Importantly, the distances between the values cannot be measured; this is the difference between ordinal and interval scales.
Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values. For instance temperature in Celsius: there is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.
The difference between interval and ratio is that an interval scale has no true zero point (0 degrees Celsius does not mean an absence of temperature), whereas a ratio scale does. So the ratio scale has numerical values, with a certain order, with measurable differences and with a meaningful zero, which makes statements like "twice as much" possible. Examples are percentage or income.
Furthermore there are discrete and continuous variables. A variable is discrete when the possible values are limited, separate numbers. A variable is continuous when any value in a range is possible. For instance the number of brothers and sisters is discrete, because it's not possible to have 2.43 brothers/sisters. Weight, on the other hand, is continuous, because it's possible to weigh 70 kilos but also 70.52 kilos.
Categorical variables (nominal or ordinal) are always discrete because they have a limited number of categories. Quantitative variables can be either discrete or continuous. When quantitative variables can take very many possible values, they are treated as continuous.
Randomization is the mechanism of obtaining a representative sample. In a simple random sample every subject of the population has an equal chance of becoming part of the sample. The randomness is important, because it needs to be guaranteed that the data isn't biased. Biased information would make inferential statistics useless, because then it's impossible to say anything about the population.
For a random sample a sampling frame is necessary: a list of all subjects within the population. Next all subjects are numbered and then numbers are drawn at random. Drawing random numbers can be done using software, for instance R. In R the following command is used:

> sample(1:60, 4)
[1] 22 47 38 44

The symbol > is R's prompt, showing where a command is entered. In this example the goal is to select four random subjects from a list of 60 subjects in total. The program shows which subjects are chosen: numbers 22, 47, 38 and 44.
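The example above uses R; as a rough equivalent in Python (a sketch, not part of the original text), the standard library's random.sample performs the same draw without replacement:

```python
import random

# Draw a simple random sample: pick 4 distinct subject
# numbers from a sampling frame numbered 1..60.
random.seed(1)  # fixed seed only so the draw is reproducible
chosen = random.sample(range(1, 61), 4)
print(chosen)  # four distinct numbers between 1 and 60
```

Because every subject has the same chance of being drawn, repeating this with a different seed gives a different, equally valid sample.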
Data can be collected using surveys, experiments and observational studies. All these methods can have a degree of randomization.
Different types of surveys are possible; online, offline etc. Every way to gather data has challenges in terms of representing the population accurately.
Experiments are used to measure and compare the reactions of subjects under different conditions. These conditions, so-called treatments, are values of a variable that can influence the reaction. It is up to the researcher to decide which subjects will receive which treatments. This is where randomization plays a part: the researcher needs to divide the subjects into groups randomly. An experimental design determines which subjects receive which treatments.
In observational studies the researcher measures the values of variables without influencing or manipulating the situation. Who will be observed is determined at random. The biggest risk of this method is that a variable that influences the results remains unseen.
In theory, a measure must be valid, meaning that it is clear what it's supposed to measure and that it accurately reflects this concept. A measure must also be reliable, meaning that it's consistent: a respondent would give the same answer when asked twice. In reality, however, all kinds of factors can influence a study.
Even multiple completely random samples will each differ somewhat from the population. This difference is called the sampling error: how much the statistic drawn from the sample differs from the parameter, the corresponding value in the population. If in the population 66% agrees with government policy, but in the sample 68%, then the sampling error is 2 percentage points. For samples of over 1000 subjects the sampling error in most cases remains limited to about 3 percentage points. This is called the margin of error. This concept is often used in statistics because it says something about the quality of a sample.
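As a small illustration in Python (a sketch; the 1/sqrt(n) approximation is a standard rule of thumb for a proportion at roughly 95% confidence, not stated explicitly in the text):

```python
import math

def margin_of_error(n):
    # Conservative ~95% margin of error for a sample
    # proportion: approximately 1 / sqrt(n).
    return 1 / math.sqrt(n)

# The example from the text: the sample says 68%, the
# population value is 66%, a sampling error of 2 points.
sampling_error = abs(0.68 - 0.66)
print(round(sampling_error, 2))         # 0.02

# For n = 1000 the margin of error is about 3 percentage points.
print(round(margin_of_error(1000), 3))  # 0.032
```

This shows why samples of about 1000 subjects are so common: around that size the margin of error drops to roughly 3 percentage points.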
Apart from sampling error there are other factors that influence the results from random samples, such as sampling bias, response bias and non-response bias.
In probability sampling the chance of every possible sample is known. In nonprobability sampling this is not known, the reliability is unknown and sampling bias can occur. Sampling bias occurs when it's not possible to guarantee that all members of the population have an equal chance of becoming part of the sample. This happens for instance when only volunteers take part in a study. Volunteers can differ from people who choose not to participate; the resulting difference on certain variables is called selection bias.
When questions in a survey or interview are asked in a certain fashion or sequence, response bias can occur. Leading questions such as "Do you agree that...?" invite socially desirable answers: respondents prefer not to disagree with the interviewer and are more inclined to agree, even if they might not want to. The general inclination to give answers that people think the interviewer favors is also part of response bias.
Non-response bias happens when people quit during research or other factors result in missing data. Some people choose not to answer certain questions, for various reasons. When people decide to quit, they may have different values on important variables compared to the respondents that remain. This can influence the data, even in a random sample.
Apart from simple random samples, other methods are possible. There are cases when a simple random sample isn't possible, and sometimes another design is easier or more desirable. These other methods still use probability sampling (so that the chance is known for every possible sample) and randomization (aiming for a representative sample).
In a systematic random sample the subjects are chosen in a systematic manner, by consistently skipping a fixed number of subjects. An example is selecting every tenth house in a street. The skip number k follows from k = N/n, where N is the population size and n is the sample size: after a random starting point, every k-th subject is selected.
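The k = N/n rule can be sketched in Python (a hypothetical illustration, assuming N is divisible by n):

```python
import random

def systematic_sample(population, n):
    # Systematic random sample: skip number k = N / n,
    # a random start among the first k subjects,
    # then every k-th subject after that.
    N = len(population)
    k = N // n
    start = random.randrange(k)      # random start in 0..k-1
    return population[start::k][:n]

random.seed(7)
houses = list(range(1, 101))         # houses numbered 1..100
s = systematic_sample(houses, 10)    # every 10th house from a random start
print(s)
```

Each selected house is exactly k = 10 positions after the previous one, matching the "every tenth house" example.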
A stratified sample divides the population into groups, also called strata. From each stratum a number of subjects is chosen at random to form the sample. This can be proportional or disproportional. In a proportional stratified sample the proportions in the strata equal the proportions in the population: if for instance 60% of the population is male and 40% female, then this needs to be the same in the sample. Sometimes it may be better to use a disproportional stratified sample. If only 10% of the population is female, a proportional sample of 100 subjects would contain only 10 women; a group that small is too small to be representative, so no conclusions could be drawn about women in the population. In that case it's better to choose a disproportional stratified sample.
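Proportional allocation can be sketched as follows (a hypothetical Python illustration; the 60/40 split mirrors the example above):

```python
import random

def proportional_stratified_sample(strata, n):
    # strata: dict mapping stratum name -> list of subjects.
    # Each stratum contributes subjects in proportion to its
    # share of the population, so the sample mirrors it.
    N = sum(len(members) for members in strata.values())
    return {
        name: random.sample(members, round(n * len(members) / N))
        for name, members in strata.items()
    }

random.seed(3)
population = {
    "male": [f"M{i}" for i in range(60)],    # 60% of the population
    "female": [f"F{i}" for i in range(40)],  # 40% of the population
}
sample = proportional_stratified_sample(population, 10)
print({name: len(chosen) for name, chosen in sample.items()})  # {'male': 6, 'female': 4}
```

A disproportional design would simply override the computed sizes, oversampling the small stratum so that conclusions about that group become possible.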
Most sampling methods require access to the entire population, but in reality this may not be given. In that case cluster sampling may be an option. This requires dividing the population into clusters (for instance city districts) and randomly choosing a number of clusters, whose subjects are then observed. The difference with stratified samples is that not every cluster is represented.
Another option is multistage sampling: a sample drawn in several stages. For instance first provinces are selected, then cities within those provinces and then streets within those cities.
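Cluster and multistage sampling can be sketched together (a hypothetical two-stage Python illustration; the district names and sizes are invented):

```python
import random

random.seed(5)

# The population divided into clusters (city districts).
districts = {
    "north": ["n1", "n2", "n3", "n4"],
    "south": ["s1", "s2", "s3", "s4"],
    "east":  ["e1", "e2", "e3", "e4"],
    "west":  ["w1", "w2", "w3", "w4"],
}

# Stage 1: randomly pick 2 of the 4 clusters. Unlike strata,
# the clusters that are not drawn are not represented at all.
chosen = random.sample(sorted(districts), 2)

# Stage 2: within each chosen cluster, randomly pick 2 subjects.
sample = [s for d in chosen for s in random.sample(districts[d], 2)]
print(sample)  # four subjects, drawn from only two districts
```

Dropping stage 2 and observing everyone in the chosen districts would be plain cluster sampling; adding more stages (provinces, then cities, then streets) gives the multistage design described above.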