Bullet summary per chapter of the 3rd edition of Psychometrics: An Introduction by Furr & Bacharach


What is psychometrics? - BulletPoints 1

  • According to Cronbach, a psychological test is a systematic procedure for comparing the behavior of two or more people. A test must meet three conditions: (1) the test must consist of behavioral samples; (2) the behavioral samples must be collected in a systematic manner; and (3) the purpose of the test must be to compare the behavior of two or more people (inter-individual differences). It is also possible to measure the behavior of one individual at different times, in which case we speak of intra-individual differences.

  • Criterion-referenced tests (also called domain-referenced tests) are most common in situations where a statement must be made about a specific skill of a person. One predetermined cutoff score is used to divide people into two groups: (1) people whose score is higher than the cutoff score; and (2) people whose score is lower than the cutoff score.

  • Norm-referenced tests are mainly used to compare a person's scores with the scores of a norm group. Nowadays it is often difficult to draw a sharp distinction between criterion-referenced and norm-referenced tests.

  • Another well-known distinction is that between so-called speed tests and power tests. Speed tests are time-limited; it is often impossible to answer all the questions within the time given, and we look at how many questions a person answers correctly in that time. Power tests are not time-limited; here it is likely that a person can attempt all the questions. The questions typically become progressively more difficult, and we check how many questions a person answers correctly.

  • Psychometrics is the collection of procedures used to measure variability in human behavior and to relate these measurements to psychological phenomena. It is a relatively young, but rapidly developing, scientific discipline.

What is important when assigning numbers to psychological constructs? - BulletPoints 2

  • There are two potential meanings of zero. Zero can mean that the property is completely absent (absolute zero), as with reaction time. Zero can also be an arbitrary amount of a property (arbitrary or relative zero), as on a clock or a thermometer. It is important to determine whether the zero in a psychological test is relative or absolute: a test can indicate zero even though the person does possess the characteristic, in which case the zero must be treated as relative even if it was originally intended as absolute. Identity, rank order, quantity, and the meaning of zero are the key issues in understanding scores on psychological tests.

  • Measuring is the assignment of numbers to observations of behavior so that the differences between psychological traits become visible. There are four measurement levels, or four scales: nominal, ordinal, interval, and ratio.

What are variability and covariability? - BulletPoints 3

  • You can display scores of a group of people, or scores of one person at different times, in a so-called distribution of scores. A distribution of scores is quantitative because the differences between scores are expressed in numbers. The difference between scores within a distribution is called the variability.

  • With a variance, the spread is calculated within one set of scores. With covariability, also called covariance, the spread of one set of scores is compared with the spread of another set of scores. In other words: a covariance concerns the relationship between two variables, for example IQ and GPA, whereas a variance concerns a single variable.
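
    The distinction can be sketched with a few lines of Python, using made-up IQ and GPA scores (the numbers are purely illustrative):

    ```python
    # Variance of one set of scores vs. covariance between two sets.
    iq  = [100, 110, 120, 130, 140]
    gpa = [2.0, 2.4, 3.0, 3.2, 3.9]

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        # average squared deviation from the mean (one variable)
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def covariance(xs, ys):
        # average product of paired deviations (two variables);
        # positive when high scores on one variable go with high scores on the other
        mx, my = mean(xs), mean(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

    print(variance(iq))        # spread within the IQ scores -> 200.0
    print(covariance(iq, gpa)) # joint variability of IQ and GPA -> 9.2
    ```

    Here the positive covariance reflects a direct relationship: higher IQ scores go together with higher GPA scores.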

  • The relationship between two variables can be positive or negative in direction. A positive (or direct) relationship exists when high scores on the first variable tend to occur together with high scores on the second variable. A negative relationship exists when high scores on the first variable occur together with low scores on the second variable, or the reverse: low scores on the first variable with high scores on the second.

What is dimensionality and what is factor analysis? - BulletPoints 4

  • When a psychological test contains items that reflect a single trait of a person, and the responses are not influenced by other traits of that person, the test is unidimensional. Conceptual homogeneity means that all responses to the items are influenced by one and the same psychological trait.

  • If a psychological test contains items that reflect more than one trait of a person, the test can be subdivided into dimensions (it is multidimensional). Such a test can be multidimensional with correlated dimensions or multidimensional with uncorrelated dimensions.

  • Factor analysis is the most commonly used statistical procedure to measure and test dimensionality. There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is the type that is used most often.
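
    A full EFA requires a statistics package, but a rough sense of a test's dimensionality can be obtained from the eigenvalues of the item correlation matrix (the common "eigenvalues greater than 1" rule of thumb). The correlation matrix below is hypothetical, constructed so that items 1-2 and items 3-4 form two clusters:

    ```python
    import numpy as np

    # Hypothetical 4-item correlation matrix: items correlate strongly
    # within pairs (0.8) but weakly across pairs (0.1) -> two dimensions.
    R = np.array([
        [1.0, 0.8, 0.1, 0.1],
        [0.8, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ])

    eigenvalues = np.linalg.eigvalsh(R)[::-1]   # sorted, largest first
    n_factors = int((eigenvalues > 1.0).sum())  # Kaiser's eigenvalue > 1 rule
    print(eigenvalues)  # [2.0, 1.6, 0.2, 0.2]
    print(n_factors)    # 2
    ```

    Two eigenvalues exceed 1, suggesting a two-dimensional structure, consistent with how the matrix was built.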

What is reliability? - BulletPoints 5

  • According to Classical Test Theory (CTT), reliability can be determined on the basis of observed scores (Xo), true scores (Xt), and error scores (Xe). Error scores are also called measurement errors.

  • Rxx (reliability coefficient) = St² / So²
    Rxx = 0 means that everyone has the same true score (St² = 0).
    Rxx = 1 means that the variance of the true scores is equal to the variance of the observed scores. In other words: there are no measurement errors!
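
    A toy illustration of Rxx = St² / So², with made-up true scores and error scores (under the CTT assumption that errors are random and uncorrelated with true scores):

    ```python
    # Rxx = St^2 / So^2 with fabricated true scores and errors.
    true_scores = [10, 12, 14, 16, 18]
    errors      = [1, -1, 0, -1, 1]   # random measurement error, mean zero
    observed    = [t + e for t, e in zip(true_scores, errors)]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # So^2 = St^2 + Se^2 when errors are uncorrelated with true scores
    rxx = variance(true_scores) / variance(observed)
    print(round(rxx, 3))  # 0.909: most observed variance is true-score variance
    ```

    A value near 1 means the observed differences between people mostly reflect true differences rather than measurement error.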

  • The greater the correlation between the observed scores and the error scores, the smaller Rxx. So reliability will be relatively high if the observed scores have a low correlation with the error scores.

  • A reliability of 1.0 indicates that the differences between the observed test scores perfectly match the differences between the true scores. A reliability of 0.0 indicates that the differences between the observed scores are entirely unrelated to the differences between the true scores.

  • Although we cannot determine with certainty what the reliability or standard error of measurement of a test is, advanced methods have been developed to estimate them. Examples of such techniques are administering two versions of the test, administering the same test twice, and so on. In this section, four models are discussed for estimating the reliability and standard error of measurement of a test: (1) the parallel test model; (2) the tau-equivalent test model; (3) the essentially tau-equivalent test model; (4) the congeneric test model. Each model offers a perspective on how two or more tests are the same.

How to empirically estimate the reliability? - BulletPoints 6

  • There are three methods to estimate reliability: (1) parallel test; (2) test retest; (3) internal consistency.
     
  • The first method to estimate reliability is the parallel test. Two tests are administered: the test of interest and a second, parallel form. The correlation between the two sets of scores can then be interpreted as an estimate of the reliability. The two tests are parallel if they measure the same set of true scores and have the same error variance; in that case the correlation between the two parallel tests equals the reliability of the test scores. A practical problem is that one can never be sure that the parallel form meets these assumptions of classical test theory: the true scores of the first form may not equal those of the parallel form, since different test forms have different content. If the parallel form does not correspond well with the first test, the correlation is not a good estimate of the reliability.
     
  • The second method to estimate reliability is the test-retest method. This method is useful for measuring stable psychological constructs such as intelligence and extraversion. The same people take the same test at two (or more) points in time, and the correlation between the first scores and the repeated scores serves as the estimate of the test-retest reliability. As with the parallel test, the applicability of this method rests on assumptions: the true scores must be the same at both occasions, and the error variance of the first administration must equal the error variance of the second. If these assumptions are met, the correlation between the scores of the two administrations is an estimate of the reliability.

  • The third method to estimate reliability is the split-half reliability: the test is split in two and the correlation between the two halves is calculated, effectively creating two small parallel tests. The procedure has three steps. The first step is to divide the items into two halves. The second step is to calculate the correlation between the two halves; this split-half correlation (rhh) indicates the degree to which the two halves agree. The third step is to enter this correlation into the Spearman-Brown formula to estimate the reliability: Rxx = (2 × rhh) / (1 + rhh).
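
    The three steps can be sketched in Python, using made-up scores on the two test halves:

    ```python
    # Split-half reliability with the Spearman-Brown correction.
    half1 = [10, 12, 9, 15, 11, 14]   # illustrative scores on half 1
    half2 = [11, 13, 10, 14, 10, 15]  # illustrative scores on half 2

    def pearson(xs, ys):
        # Pearson correlation between two paired score lists
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    r_hh = pearson(half1, half2)    # step 2: split-half correlation
    rxx = (2 * r_hh) / (1 + r_hh)   # step 3: Spearman-Brown formula
    print(round(r_hh, 3), round(rxx, 3))
    ```

    Note that the Spearman-Brown estimate is always higher than the raw split-half correlation, because it projects the reliability of the full-length test rather than of a half-length test.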

  • The accuracy of alpha and omega depends on the validity of certain assumptions. In summary, the alpha method only yields accurate reliability estimates when the items are essentially tau-equivalent or parallel (see Chapter 5 for a discussion of these models). Omega is more broadly applicable: it also provides accurate reliability estimates for congeneric tests.

  • Many psychological tests have binary items (items with two answer options). For these tests a special formula can be used to estimate the reliability, namely the Kuder-Richardson 20 formula. This involves two steps. First, all item statistics are collected: the proportion of people answering each item correctly (p) and incorrectly (q). Then the variance of each item is calculated as si² = pq, along with the variance of the total test scores (sx²). The second step is to enter these statistics into the Kuder-Richardson formula (KR20): Rxx = (k / (k − 1)) × (1 − (∑pq / sx²))
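
    The two KR20 steps can be sketched on a small made-up data set, where rows are persons and columns are items scored correct (1) or incorrect (0):

    ```python
    # KR-20 for binary items on fabricated data (5 persons, 4 items).
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]

    k = len(data[0])   # number of items
    n = len(data)      # number of persons

    # Step 1: collect the statistics.
    p = [sum(row[j] for row in data) / n for j in range(k)]  # proportion correct
    sum_pq = sum(pj * (1 - pj) for pj in p)                  # sum of item variances

    totals = [sum(row) for row in data]                      # total score per person
    mean_t = sum(totals) / n
    sx2 = sum((t - mean_t) ** 2 for t in totals) / n         # variance of total scores

    # Step 2: enter them into the KR20 formula.
    kr20 = (k / (k - 1)) * (1 - sum_pq / sx2)
    print(round(kr20, 3))  # 0.8
    ```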

What is the importance of reliability? - BulletPoints 7

  • There are two important sources of information that can help us evaluate an individual test score. The first is a point estimate: a value that is interpreted as the best estimate of someone's score on a psychological trait. The second is a confidence interval, which gives a range of values within which the true score of a person is likely to lie. If the confidence interval is large, we know that the observed score is a poor point estimate of the true score.
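
    In CTT the confidence interval is built from the standard error of measurement, SEM = So × √(1 − Rxx). A minimal sketch with illustrative, IQ-like numbers:

    ```python
    # 95% confidence interval around an observed score,
    # using SEM = So * sqrt(1 - Rxx). Numbers are illustrative.
    observed = 110   # the point estimate of the true score
    so = 15          # standard deviation of observed scores
    rxx = 0.91       # reliability of the test

    sem = so * (1 - rxx) ** 0.5                  # standard error of measurement
    lower = observed - 1.96 * sem                # 95% interval bounds
    upper = observed + 1.96 * sem
    print(round(sem, 2), round(lower, 1), round(upper, 1))
    ```

    Lower reliability widens the interval: with rxx = 0.50 instead of 0.91, the SEM roughly doubles, and the observed score becomes a much poorer point estimate.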

  • According to classical test theory, the correlation between the observed scores of two measurements (rxoyo) depends on two factors: the correlation between the true scores of the two psychological constructs (rxtyt) and the reliability of the two measurements (Rxx and Ryy).
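
    This relationship is the attenuation formula, rxoyo = rxtyt × √(Rxx × Ryy). A short sketch with illustrative values shows how unreliability shrinks an observed correlation, and how the "correction for attenuation" reverses the computation:

    ```python
    # Attenuation of an observed correlation by unreliability.
    r_true = 0.60           # correlation between the true scores
    rxx, ryy = 0.80, 0.70   # reliabilities of the two measures

    r_observed = r_true * (rxx * ryy) ** 0.5   # attenuated correlation
    print(round(r_observed, 3))                # noticeably below 0.60

    # Correction for attenuation recovers the true-score correlation:
    r_corrected = r_observed / (rxx * ryy) ** 0.5
    print(round(r_corrected, 3))               # back to 0.60
    ```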

  • Because measurement error reduces the observed correlation, this has consequences for interpreting and conducting research. Results must always be interpreted in light of reliability. An important result of a study is the effect size: some effect sizes show the extent to which variables are interrelated, and others show the magnitude of the differences between groups.

  • Statistical significance is a second important result of a study. If a result is statistically significant, it is regarded as a real finding and not just a fluke. The observed effect size has a major influence on statistical significance: the larger the effect size, the more likely the test is to be statistically significant. The third implication of reliability is that researchers should report reliability estimates of their measurements, so that readers can interpret the results.

What is validity? - BulletPoints 8

  • For more than 60 years, the following basic definition of validity was assumed: validity is the extent to which a test measures what it is intended to measure. Although this definition is widely used, and still is, it presents the concept of validity a little too simply. A better definition is that validity is the extent to which the interpretation of test scores for a specific purpose is supported by evidence and theory.

  • Validity has three important implications. (1) Validity refers to the interpretation of test scores regarding a specific psychological construct; it is not about the test itself. This means that a measurement itself is not valid or invalid, but that validity relates to the interpretation and use of measurements. (2) Validity is a matter of degree, not an "all or nothing" construct. (3) Validity is entirely based on empirical evidence and theory.

  • For years there was a traditional perspective on validity in which three types of validity were identified: (1) content validity; (2) criterion validity; (3) construct validity. Nowadays construct validity is seen as the essential concept in validity. Construct validity is the extent to which a test score can be interpreted as a reflection of a certain psychological construct.

  • Three major organizations (AERA, APA, and NCME) published a revision of the Standards for Educational and Psychological Testing in 2014, emphasizing five facets of construct validity. Construct validity is supported by five types of evidence: (1) content; (2) internal structure; (3) response processes; (4) associations with other variables; (5) consequences.

How to evaluate evidence for convergent and divergent validity? - BulletPoints 9

  • There are four methods for examining convergent and discriminant associations. The following are common methods for evaluating convergent validity and discriminant validity: (1) focusing on particular associations; (2) sets of correlations; (3) multitrait-multimethod matrices; (4) quantifying construct validity (QCV).

  • Validity coefficients are statistical results that represent the degree of association between a test and one or more criterion variables. It is important to be aware of the factors that can influence validity coefficients.

  • Sensitivity and specificity are used to summarize the proportions of correct identifications. Sensitivity is the probability that someone who has the disorder is correctly identified by the test. Specificity is the probability that someone who does not have the disorder is correctly identified by the test. In reality one can never know for certain whether someone has the disorder, but these indices serve as a trusted guideline.

  • Sensitivity = true positives / (true positives + false negatives).

  • Specificity = true negatives / (true negatives + false positives)
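
    Both formulas can be applied to a 2×2 classification table; the counts below are made up for illustration:

    ```python
    # Sensitivity and specificity from fabricated test decisions
    # cross-tabulated against actual disorder status.
    true_pos, false_neg = 40, 10   # among 50 people with the disorder
    true_neg, false_pos = 85, 15   # among 100 people without the disorder

    sensitivity = true_pos / (true_pos + false_neg)  # 40/50
    specificity = true_neg / (true_neg + false_pos)  # 85/100
    print(sensitivity, specificity)  # 0.8 0.85
    ```

    So this hypothetical test correctly flags 80% of the people who have the disorder and correctly clears 85% of those who do not.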

What types of response bias are there? - BulletPoints 10

  • Consciously or unconsciously, cooperatively or not, self-enhancing or even self-effacing, response bias plays a constant role in psychological measurement. Response bias means that the way respondents react (negatively) influences the quality of the psychological measurement: responses are systematically distorted and therefore often incorrect.

  • There are different types of response bias, and each type is influenced by different factors (the content and design of a test, the test context, conscious opportunities to respond in an invalid manner, unconscious factors, etc.). These factors lead to six types of response bias: (1) acquiescence bias (yea-saying and nay-saying); (2) extreme (vs. moderate) responding; (3) social desirability ("faking good"); (4) malingering ("faking bad"); (5) random or careless responding; (6) guessing.

  • There are roughly three strategies for dealing with response bias: (1) managing the test context; (2) managing the test content and/or scores; (3) using specially designed 'bias' tests.

  • In addition, we can distinguish three goals when dealing with response bias: (1) minimizing the occurrence of response bias; (2) minimizing the effects of response bias; (3) detecting response bias and, if necessary, intervening.

  • These strategies and goals can be combined to summarize different methods for handling response bias.
    Strategy 1 + goal 1 = anonymization, minimize frustration, warnings
    Strategy 2 + goal 1 = simple items, neutral items, forced choices, minimum choice
    Strategy 2 + goal 2 = balanced scales, probability corrections
    Strategy 2 + goal 3 = embedded validity scales
    Strategy 3 + goal 3 = social desirability tests, extremity tests, acquiescence tests  

What types of test bias are there? - BulletPoints 11

  • There are generally two types of test bias to distinguish: construct bias and predictive bias. Construct bias is bias regarding the meaning of a test; predictive bias is bias regarding the usability of a test for prediction. These two types of test bias are independent of each other: one bias can exist in a test without the other.

  • Although there may be a difference in test scores between two groups, this does not necessarily mean that there is test bias; the difference may reflect reality. For example, if a test shows that the weight of men is on average higher than the weight of women, this is based on reality. But one may have doubts when it comes to math skills: there is no obvious reason why the math skills of men should be higher than the math skills of women.

  • We use internal structures to find out whether there is construct bias. These consist of the pattern of correlations among items and/or the correlations between each item and the total score. The evaluation works as follows: we examine the internal structure of the test separately for the two groups. If the two groups exhibit the same internal structure in their test responses, we can conclude that the test does not suffer from construct bias. Conversely, if the two groups differ in the internal structure of their test responses, there is construct bias. There are five methods to detect construct bias: (1) reliability; (2) rank order; (3) item discrimination index; (4) factor analysis; (5) differential item functioning analysis.

  • An external evaluation of the test is required to discover the predictive bias. Two considerations are: (1) Does the test really help you predict the outcome? (2) Does the test predict the outcome evenly for several groups? We can investigate this on the basis of regression analysis. 

  • Test bias is not the same as test fairness. Test fairness has to do with an appropriate use of test scores, in terms of social and/or legal norms and the like. Test fairness is not a psychometric aspect of a test. Test bias, on the other hand, is a psychometric concept, embedded in theory about test score validity. Test bias is defined by specific statistical and research methods, which enable the researcher to make decisions about the test bias. Both are important for psychological testing.

What is a confirmatory factor analysis? - BulletPoints 12

  • There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). These two types of factor analysis are most suitable for different phases of test development and evaluation. EFA is most suitable for the first phases of test use (clarifying the construct and the test). CFA is most suitable in later phases of test use, after the initial evaluations of item properties and dimensionality and after major revisions of the test content (i.e., when the test content is virtually fixed). Confirmatory factor analysis (CFA) is used to investigate the dimensionality of a test when there are already hypotheses about the number of underlying factors (dimensions), the connections between items and factors, and the coherence of the factors.

  • Performing a CFA consists of four steps: (1) specification of the measurement model; (2) calculations; (3) interpret and report the results; (4) model changes and new analysis (if necessary). These four steps are discussed below.

  • Measurement invariance can be investigated with CFA by comparing groups on specific parameters of the measurement model (such as lambda, theta, etc.). If groups have different values for a parameter, this is evidence of a lack of invariance for that parameter (and therefore evidence of a certain degree of construct bias, because the parameters differ between groups). The extent to which there are differences can be summarized in four levels of measurement invariance: (1) configural; (2) weak/metric; (3) strong/scalar; (4) strict. In short, the greater the difference between groups, the lower the level of measurement invariance the test attains (the first level is therefore the weakest, least restrictive level).

What is the generalizability theory? - BulletPoints 13

  • The Generalizability Theory (G theory) helps us to distinguish the effects of multiple facets and then to use different measurement strategies. It is an ideal framework for complex measurement strategies in which several facets influence the measurement quality. This is a fundamental difference from classical test theory (CTT), which does not distinguish separate facets.

  • The G theory can be used for multiple types of analysis, but a basic psychometric analysis consists of a two-phase process: the G study and the D study. The variance components are estimated in the first phase. In such a study, factors are identified that influence the observed variance (and therefore the generalizability). This phase is called a G study, because it is used to identify to what extent the different facets could influence generalizability. In the second phase, the results of phase one are used to estimate the generalizability of the different combinations of facets. This phase is known as a D study, because the phase is used to make decisions about future measurement strategies.

What is the Item Response Theory (IRT) and which models are there? - BulletPoints 14

  • Item Response Theory (IRT) is an alternative to classical test theory (CTT) for analyzing measurements in the behavioral sciences. According to IRT, an individual's response to a particular test item is influenced by characteristics of the individual (the trait level) and properties of the item (such as its difficulty level).

  • Item discrimination refers to how well an item distinguishes between individuals with low and high trait levels. The discrimination value of an item indicates the relevance of the item to the trait level being measured.

    • Positive discrimination (> 0): the item is related to the trait (property) being measured.
    • Negative discrimination (< 0): the item is inconsistent with the trait.
    • Discrimination value of 0: no relationship between the item and the trait (property) measured by the test.

  • The Rasch model (one-parameter logistic model, 1PL) includes only the trait level of the individual and the difficulty of the item as components that influence the scores.

  • The two-parameter model (2PL) has three components that influence the scores: the trait level of the individual, the item difficulty, and the item discrimination.

  • The three-parameter model (3PL) also includes the probability of guessing. The 3PL model can be seen as a variation on the 2PL model with one added component, the guessing parameter, which represents the lower bound on the probability of answering the item correctly. According to the 3PL model, the probability of a correct answer is therefore influenced by: (1) the trait level of the individual, Ө; (2) the item difficulty, β; (3) the item discrimination, α; (4) the guessing parameter.
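
    The three models share one item response function; a sketch with made-up parameter values (the standard 3PL logistic form, which reduces to 2PL when c = 0 and to the Rasch/1PL model when additionally a = 1):

    ```python
    import math

    # Item response function for the 1PL, 2PL, and 3PL models.
    # theta = trait level, b = item difficulty,
    # a = item discrimination, c = guessing parameter (lower asymptote).
    def p_correct(theta, b, a=1.0, c=0.0):
        # probability of a correct answer under the 3PL model;
        # defaults give the Rasch (1PL) model
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    print(p_correct(0.0, 0.0))                  # Rasch: theta == b gives 0.5
    print(p_correct(0.0, 0.0, a=2.0))           # 2PL: same point, steeper curve
    print(p_correct(-3.0, 0.0, a=2.0, c=0.25))  # 3PL: low trait, floor near c
    ```

    Note how, in the 3PL case, a person with a very low trait level still has roughly a 25% chance of a correct answer: the guessing parameter sets the floor.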

  • The Graded Response Model (GRM) is designed for tests with more than two response options. As with the previous models, the GRM assumes that a person's response to an item is affected by that person's trait level, the item difficulty, and the item discrimination. However, the GRM has several difficulty parameters per item.
