Summary of Chapter 3 of Psychological Testing: History, Principles and Applications - Gregory
A: Norms and Test Standardization
Norms for a test are established by reference to the scores of norm groups. This process is called standardization, and it ensures that individual test scores can be interpreted in a meaningful way. In addition, the usefulness of a test score is determined by the consistency (reliability) of the test. A norm group consists of a sample of participants who are representative of the population for which the test is intended.
Raw scores
The raw score is the most basic level of information provided by a psychological test (for example, the number of correct answers). Raw scores in themselves are meaningless; only in reference to norms do scores acquire meaning. Almost all psychological tests are interpreted by means of norms, although other types of tests also exist (such as criterion-referenced tests).
Essential Statistical Concepts
Frequency distributions
The large amount of data produced by administering tests must first be summarized. A simple way to do this is to draw up a frequency distribution: class intervals (for example 1-3) are defined, and for each interval the frequency of the scores falling within it is recorded. A histogram is a graph in which this frequency information is displayed by means of columns. A similar graph is the frequency polygon, in which the frequencies are connected by a single line instead of shown as columns.
Measures of central tendency
To obtain a single, representative score we need a measure of central tendency. Possible measures are the mean (the sum of the scores divided by N), the median (the middle score when all scores are ranked) and the mode (the score that occurs most often).
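As a minimal sketch (the raw scores are hypothetical), all three measures can be computed with Python's standard library:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical raw scores
print(statistics.mean(scores))       # 5   -> sum of the scores divided by N
print(statistics.median(scores))     # 4.5 -> middle of the ranked scores
print(statistics.mode(scores))       # 4   -> most frequently occurring score
```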
Measures of variability
The standard deviation (s) is usually used to describe the spread of the scores. When the standard deviation is low, the scores are densely packed around a central value; as the scores spread out, the standard deviation increases. The standard deviation is the square root of the variance (s²).
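Continuing the hypothetical scores above, a minimal sketch of the relationship between variance and standard deviation:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]        # the same hypothetical raw scores
variance = statistics.pvariance(scores)  # population variance s**2 = 4.0
sd = statistics.pstdev(scores)           # standard deviation s = sqrt(4.0) = 2.0
```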
The normal distribution
When a test is administered to a large sample, the scores often form a normal distribution, with a bell-shaped, symmetrical graph. In psychology, normal distributions are preferred over other types of distributions for several reasons. First, normal distributions have useful mathematical features that form the basis for various kinds of statistical analysis. Second, normal distributions are precisely defined, making it possible to know exactly what percentage of scores falls within a certain range. Finally, in many cases a normal distribution arises naturally, as with many human physical and mental characteristics.
Skewness
Skewness denotes the degree of asymmetry of a frequency distribution. If many scores fall at the low end of the scale, the distribution is right-skewed (positively skewed); if many scores fall at the high end of the scale, the distribution is left-skewed (negatively skewed). In testing, a skewed distribution often means that the test contains too few easy items (positive skew) or too few difficult items (negative skew).
Transformation of raw scores
Percentiles
A percentile expresses the percentage of individuals in the standardization sample who scored below a particular raw score; this is written as, for example, P94 (a raw score of 25 corresponding to the 94th percentile). Percentiles can also be thought of as rankings in a group of 100 representative participants, with PR 1 at the bottom of the sample and PR 99 at the top. The 50th percentile (P50) corresponds to the median, P25 to the first quartile (Q1) and P75 to the third quartile (Q3).
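A minimal sketch of this definition (the norm sample passed in is hypothetical):

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of the standardization sample scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100 * below / len(norm_scores)

# A raw score of 25 that exceeds 94% of the norm sample corresponds to P94.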
Standard scores
The standard score (also called a z-score) expresses the distance from the mean in units of standard deviation. A raw score that lies exactly one standard deviation above the mean has the standard score +1.00. Standard scores, unlike percentiles, have the desirable psychometric property of preserving the relative magnitudes of the distances between consecutive raw scores. Another advantage of the standard score is that results on different tests can be compared on a common scale; however, the distributions of the tests to be compared must have the same form.
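In the usual notation, with X the raw score, M the mean and SD the standard deviation of the norm group:

z = (X − M) / SD

For example (hypothetical numbers), a raw score of 65 on a test with M = 50 and SD = 10 gives z = (65 − 50) / 10 = +1.50.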
T scores
Standardized scores are conceptually identical to standard scores, with the difference that standardized scores are always expressed as positive numbers. A popular type of standardized score is the T-score, which has a mean of 50 and a standard deviation of 10. The T-score is in fact a transformation of the z-score.
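The transformation is linear:

T = 10z + 50

so the hypothetical z of +1.50 above becomes T = 65, and an exactly average performance (z = 0) becomes T = 50.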
Normalized standard scores
As mentioned earlier, test developers prefer normal distributions, and an asymmetric distribution can be normalized. The percentile of each raw score is used to determine the corresponding standard score; if this is done for every case, the resulting distribution is normally distributed. There is a major drawback to normalizing non-normal distributions: mathematical relationships that hold in the raw scores may not be valid for the normalized standard scores. In practice, normalized standard scores are rarely used.
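A minimal sketch of this percentile-based normalization, assuming SciPy is available (scipy.stats.norm.ppf is the inverse of the cumulative normal distribution):

```python
from scipy.stats import norm

def normalized_z(percentile):
    """z-score a raw score would receive if the distribution were normal,
    given the percentile of that raw score in the sample."""
    return norm.ppf(percentile / 100)

print(normalized_z(94))   # about +1.55
```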
Stanines, Stens, and C scale
The stanine scale was developed during World War II. All raw scores are converted to a single-digit system with a range of 1-9. The mean of stanine scores is always 5 and the standard deviation is approximately 2. Variations on the stanine scale are the sten scale (10 units) and the C scale (11 units).
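A common shortcut (an approximation, not necessarily the exact conversion table the book uses) maps z-scores to stanines like this:

```python
def stanine(z):
    """Approximate stanine: mean 5, SD about 2, clipped to the 1-9 range."""
    return max(1, min(9, round(2 * z + 5)))

print(stanine(0.0))   # 5 (average performance)
print(stanine(2.5))   # 9 (top of the scale)
```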
Selecting a norm group
When selecting a norm group, the aim is to obtain a representative cross-section of the population for which the test is intended. The simplest approach is simple random sampling, in which every member of the population has an equal chance of being chosen. In practice this often fails, because not every member of the population is reachable or willing to participate. An alternative is stratified random sampling: the population is divided into strata on the basis of important background variables (e.g. age or gender), after which a certain percentage is randomly selected from each stratum.
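A minimal sketch of stratified random sampling (the structure of population_by_stratum is hypothetical):

```python
import random

def stratified_sample(population_by_stratum, fraction):
    """Draw the same fraction at random from each stratum
    (e.g. from each age group or gender)."""
    sample = []
    for members in population_by_stratum.values():
        n = round(fraction * len(members))
        sample.extend(random.sample(members, n))
    return sample
```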
Age and grade norms
An age norm represents the level of test performance for each separate age group in the normative sample, so that participants are compared with their own age peers. A grade norm represents the level of test performance for each separate school grade (for example, grade 5 of primary school) in the normative sample.
Local and subgroup norms
Local norms are derived from representative local participants, as opposed to a national sample. Subgroup norms consist of the scores obtained from a particular subgroup (e.g. women or Turkish immigrants).
Expectancy table
An expectancy table shows the established relationship between test scores and the expected outcome on a particular task. For example, an expectancy table could show the relationship between final exam scores (the predictor) and later college grades (the criterion). When using an expectancy table, it must always be checked whether the conditions or rules regarding the predictor or the criterion have remained the same.
Criterion-referenced testing
Where norm-referenced tests are intended to rank participants on a continuum of skill or performance, criterion-referenced tests compare participants' results against a predetermined performance standard. These tests are often used in education. The content of a criterion-referenced test is chosen for its relevance to the curriculum; this contrasts with norm-referenced tests, whose content is chosen so that it discriminates as well as possible between participants.
B: Concepts of reliability
Reliability refers to the degree of consistency in measurement, on a continuum from minimal consistency (e.g., response time) to near-perfect repeatability of results (e.g., a weighing scale).
Classical test theory
Classical test theory formed the basis for test development throughout the twentieth century. The alternative, item response theory, is discussed at the end of this chapter. Classical test theory assumes that test scores result from two kinds of factors: factors that contribute to consistency (the individual's stable traits) and factors that contribute to inconsistency (characteristics or conditions that have nothing to do with the trait being measured). The latter constitute measurement error, which should be minimized as much as possible during testing.
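In the usual notation of classical test theory this reads:

X = T + E

where X is the obtained score, T the true score (reflecting the stable trait) and E the measurement error.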
Sources of measurement error
Measurement error can arise from many different sources and only the most important are discussed here.
Item selection can cause measurement error because the selection is always just a sample of all possible items.
Test administration can be a source of measurement error because it is never entirely possible to create identical test situations for different participants; consider, for example, background noise, temperature, lighting, and fluctuations in the participant's mood.
Test scoring can sometimes be a source of measurement error when subjective scoring systems are used, such as with projective tests or essay questions.
The above sources are collectively described as unsystematic measurement error, meaning that their effects are unpredictable and inconsistent. Systematic measurement error, on the other hand, occurs when the test consistently measures something other than the characteristic it was intended to measure.
Measurement error and reliability
A higher degree of measurement error reduces the reliability of psychological test results; reliability and measurement error are in fact different ways of expressing how consistent a test is. A crucial assumption of classical test theory is that unsystematic measurement errors act as random influences (unintended background noise, accidentally seeing an answer, etc.). Because they are random, unsystematic measurement errors are positive and negative to approximately the same degree and therefore average out to approximately zero across a large group of participants. Their randomness also means that they correlate neither with the true score nor with measurement errors on other tests. From classical test theory it can therefore be deduced that the variance of the obtained scores is simply the variance of the true scores plus the variance of the measurement errors.
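In symbols, that last deduction is:

σ²_X = σ²_T + σ²_E

with σ²_X the variance of the obtained scores, σ²_T the true-score variance and σ²_E the error variance.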
The reliability coefficient
The reliability coefficient (r_xx) is the ratio of the true-score variance to the total variance of the test scores. The reliability coefficient can take a value between 0 (completely unreliable) and 1 (completely reliable).
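In the notation above:

r_xx = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E)

so r_xx equals 1 only when the error variance is zero.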
The correlation coefficient
The correlation coefficient (r) in its most common application expresses the degree of linear relationship between two sets of scores obtained from the same persons. The coefficient ranges from -1.00 (perfect negative correlation) through 0.00 (no correlation) to +1.00 (perfect positive correlation). Negative and positive correlations of the same absolute value express the same strength of relationship; the sign depends simply on how one of the two variables is scored.
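For two sets of scores x and y, the usual product-moment form of this coefficient is:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]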
The correlation coefficient as a reliability coefficient
If test results are highly consistent, the scores of individuals taking the same test on two occasions will be highly correlated. In this sense, the correlation coefficient is also a reliability coefficient. Retesting the same (groups of) persons is one of many available methods for determining reliability, several of which are explained below.
Reliability as temporal stability
Test-retest reliability
As just mentioned, the simplest method for estimating reliability is to retest the same people. The higher the correlation between a person's first and second scores on the same test, the higher the reliability. Acceptable reliability coefficients usually fall between 0.80 and 0.90.
Alternate-forms reliability
Sometimes test developers produce two different forms of a test, both of which are administered to the same group. The higher the correlation between the scores on the two forms, the higher the reliability. This resembles test-retest reliability, with the important difference that differences in item sampling now also contribute to the error variance. Moreover, developing alternate forms is very expensive.
Reliability as internal consistency
Split-half reliability
Scores of the same person on two equivalent halves of a test are correlated with each other. This works on the same principle as test-retest reliability, although it often yields higher estimates of reliability. It is cheaper than test-retest reliability and there are no practice effects; on the other hand, it is often difficult to divide a test into equivalent halves. To obtain the split-half reliability, the Pearson r between the halves must not only be calculated but also adjusted by means of the Spearman-Brown formula.
The Spearman-Brown Formula
The correlation between the two halves estimates the reliability of a test only half as long as the original. Since shorter tests are generally less reliable than longer tests, the coefficient must be adjusted upward. Despite its widespread use, the split-half method is often criticized for its lack of precision.
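A minimal sketch of the whole procedure, assuming an odd-even split (one common way of forming equivalent halves) and Python 3.10+ for statistics.correlation:

```python
import statistics

def split_half_reliability(item_scores):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    item_scores[p][i] is the score of person p on item i.
    """
    odd = [sum(person[0::2]) for person in item_scores]
    even = [sum(person[1::2]) for person in item_scores]
    r_half = statistics.correlation(odd, even)   # Pearson r between the halves
    return 2 * r_half / (1 + r_half)             # Spearman-Brown formula
```

For example, a half-test correlation of 0.80 is corrected to 2(0.80) / (1 + 0.80) ≈ 0.89 for the full-length test.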
Coefficient alpha
Coefficient alpha (also called Cronbach's alpha) can be seen as the mean of all possible split-half coefficients, corrected by the Spearman-Brown formula. Coefficient alpha is an index of the internal consistency of the items. While this is a valuable approach to reliability, it is not a substitute for the test-retest approach.
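A minimal sketch of the usual computational formula for alpha (item variances against total-score variance); the input layout is hypothetical:

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores[i][p] is person p's score on item i."""
    k = len(item_scores)                                    # number of items
    item_vars = [statistics.variance(item) for item in item_scores]
    totals = [sum(person) for person in zip(*item_scores)]  # total per person
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```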
The Kuder-Richardson Estimate of Reliability
Cronbach's alpha is a generalization of the earlier Kuder-Richardson formula 20 (KR-20), which applies when each test item is scored 0 or 1 (i.e. incorrect or correct).
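In formula form:

KR-20 = [k / (k − 1)] · (1 − Σ pᵢqᵢ / s²)

where k is the number of items, pᵢ the proportion of participants answering item i correctly, qᵢ = 1 − pᵢ, and s² the variance of the total scores.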
Inter-rater reliability
For tests in which the judgment of the assessor is a major factor in scoring, it is important to calculate the inter-rater reliability. In this method, the scores that different assessors assign to the same test (taken by the same person) are correlated with each other.
What type of reliability is applicable?
To determine which type of reliability estimate is most appropriate, the nature and purpose of the test must be considered. For tests that should show temporal stability, test-retest reliability is the obvious choice; for tests that strive for factorial purity (measuring a single factor), coefficient alpha is suitable. Split-half methods work well for tests whose items are ordered by difficulty. Many test manuals report several sources of reliability information.
Item Response Theory
From the 1960s onward, an alternative model came into increasing use alongside classical test theory: item response theory (IRT; also known as latent trait theory).
Item response functions
An item response function (IRF) is a mathematical equation that describes the relationship between the amount of a latent trait an individual possesses and the probability that they will give a particular answer to a test item measuring that construct. Each individual is considered to possess a certain amount of the latent trait, which directly affects the responses they give on a test. The IRFs of all items together can be used, among other things, to calculate the reliability of the test. In addition, the difficulty of an item can be derived from its IRF: if only individuals with a large amount of the trait answer the item correctly, the item has a high difficulty. The degree of discrimination of an item can also be determined: if people with different amounts of the trait give the same answer to the item, the item has a low degree of discrimination.
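One widely used mathematical form for an IRF (a common model, not necessarily the one the book presents) is the two-parameter logistic model; a minimal sketch:

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function.

    theta: the individual's latent trait level
    a:     item discrimination (steepness of the curve)
    b:     item difficulty (trait level at which P = .50)
    """
    return 1 / (1 + math.exp(-a * (theta - b)))
```

A difficult item has a high b; an item with low discrimination has a small a, so the probability of a correct answer barely changes across trait levels.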
Information functions
In the context of psychological measurement, information represents the ability of a test item to differentiate between people. Some items are designed to differentiate between people with a low level of the trait, others between people with a high level. Test items therefore provide different amounts of information at each level of the measured trait. An item information function graphically shows the relationship between the participant's trait level and the information provided by the test item.
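Under the same hypothetical two-parameter logistic model sketched above, the information an item provides at trait level theta is a² · P · (1 − P):

```python
def item_information(theta, a, b):
    """Information of a 2PL item at trait level theta (uses irf_2pl above)."""
    p = irf_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)
```

Information peaks where P = .50, i.e. at trait levels near the item's difficulty b.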
Invariance in IRT
Invariance has two related but separate meanings within IRT.
First, it denotes the assumption that a participant's position on the latent trait continuum can be estimated from the responses to any set of test items, as long as the IRFs of those items are known.
Second, it denotes the assumption that IRFs do not depend on the characteristics of a particular population.
The IRF of each item is thus considered to exist in an abstract, population-independent and timeless manner. Although IRT analyses usually require very large samples, the necessary software is relatively simple and widely available.
The new rules of measurement
A number of conclusions from classical test theory do not hold within the IRT framework. For example, within classical test theory the standard error of measurement is taken to be the same for individuals at all levels, while within IRT the standard error is larger at both extremes of the trait continuum. The axiom of classical test theory that shorter tests are always less reliable than longer tests does not apply within IRT either. In addition, tests built on the IRT model lend themselves better to computerized adaptive testing, in which the items an individual receives depend on the answers given to earlier items.
Special circumstances in estimating reliability
Traditional approaches to estimating reliability are misleading or inapplicable for some applications.
Unstable characteristics
Some characteristics, such as the galvanic skin response, fluctuate so quickly that a test and retest would have to take place almost simultaneously to say anything useful about reliability.
Speed and power tests
In speed tests, most items can be answered correctly by all participants; the score then depends on how many items a participant completes within the time limit. In power tests, participants have enough time, but they cannot answer all items equally well. Applied to a pure speed test, a traditional split-half approach (e.g. odd versus even items) would yield a spuriously high reliability coefficient.
Restriction of range
Test-retest reliability will be misleadingly low if it is based on a homogeneous sample with a restricted range on the measured attribute (for example, an intelligence test administered only to university students).
Reliability of criterion-referenced tests
The structure of criterion-referenced tests (explained earlier) means that the variability in participants' scores is minimal. Traditional approaches to reliability, which depend on score variability, are therefore not applicable here.
The interpretation of reliability coefficients
There is no standard answer to the question of what an acceptable level of reliability is. There is some consensus that a precise measure of individual differences should have a reliability coefficient above 0.90; however, tests with a reliability of 0.70 sometimes also prove useful.
Reliability and the standard error of measurement
Suppose a person were to take the same IQ test an infinite number of times. The distribution of all these scores would be a normal distribution, with its mean equal to the person's true score. The standard deviation of this distribution is the standard error of measurement (SEM).
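In formula form (a standard result, with s the standard deviation of the test and r_xx its reliability coefficient):

SEM = s · √(1 − r_xx)

For a hypothetical IQ test with s = 15 and r_xx = 0.90, SEM = 15 · √0.10 ≈ 4.7 IQ points.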