Bullet summary per chapter of the 3rd edition of Psychological Testing: A Practical Introduction by Hogan


What is psychological testing? - Bulletpoints 1

  • Broadly speaking, we can distinguish five major categories of tests: (1) mental ability tests; (2) achievement tests; (3) personality tests; (4) interests and attitudes; (5) neuropsychological tests. 

  • Many different people use psychological tests. Although there is considerable diversity within each group, we broadly classify four major contexts for the use of psychological tests: (1) clinical (for example, psychologists); (2) educational (for example, teachers); (3) personnel (for example, the military and businesses); and (4) research. In research, tests can be used for different purposes. First, the test may serve as the operational definition of the dependent variable. For example, the WAIS may serve as a definition of what constitutes intelligence. Second, researchers may use tests to describe samples. Third, researchers may conduct research on the tests themselves, for example by assessing the psychometric properties of a test, such as reliability and validity.

  • In psychological testing, there are four basic assumptions: (1) people differ in important traits; (2) we can quantify these traits; (3) the traits are reasonably stable; (4) measures of the traits relate to actual behaviour. Quantification (the second assumption) means that objects (in psychology: individuals) can be arranged along a continuum. This quantification assumption is crucial to the concept of measuring, as illustrated by the following two questions: "How did you measure the child's school performance?" and "How did you test the child's school performance?". Both questions mean essentially the same thing.

  • There are six major forces that influenced the development of psychological testing: (1) the scientific impulse; (2) concern for the individual; (3) practical applications; (4) statistical methodology; (5) the rise of clinical psychology; (6) computers. These six major forces are briefly discussed below. 

  • There are five fundamental issues in psychological testing. First, reliability refers to the stability of test scores. For example, if I take a test today and again tomorrow, will I get (roughly) the same score? Second, a fundamental question concerns the validity of a test: what is the test actually measuring? Third, how can the scores from a test be interpreted? Interpretation of test results generally depends on the norms that are used. Fourth, how was the test developed? And fifth, what are the practical issues that need to be considered? For example, how long does the test take? Is it available in multiple languages? And so on.

  • To define a test, it is important to summarize the characteristics that comprise it. That is, a test is (1) a procedure or device (2) that yields information (3) about behaviour. A test generally does not cover all behaviour, but is used to examine (4) a sample of behaviour. This is done (5) in a systematic, standardized manner. Lastly, a test concerns (6) some form of quantification or measurement.

What are the nine major sources of information about tests? - Bulletpoints 2

  • There are two common problems in practice that require information about tests. The first is: How do I get information about a particular test? Sometimes, one is interested in a test but knows little about it, and wants to know more before using it. The second is: What tests are available for a particular purpose? This is a different question. One is, for example, interested in the relationship between intelligence and SES and therefore wants to measure intelligence. The question then arises: how to measure intelligence? Which tests have been developed to measure intelligence, and how to choose between them?

  • In the remainder of this chapter, nine major sources of information about tests are discussed. These nine sources are: (1) comprehensive lists of tests; (2) systematic reviews of published tests; (3) electronic listings; (4) special-purpose collections; (5) books about single tests; (6) textbooks on testing; (7) professional journals; (8) publishers' catalogs; (9) other users of tests. In the sections below, we briefly discuss each of these major sources.

What are test norms and how to use them? - Bulletpoints 3

  • Norms are used to provide meaning to scores. This is done because raw scores are often difficult to interpret in the absence of further information. The basic idea of test norms is to translate raw scores (the more or less direct result of an individual's responses to a test) into normed scores, in which the individual's raw score is compared with the scores of individuals in the norm group. Normed scores are also referred to as derived scores or scale scores.

  • A very important term in descriptive statistics is the z-score. The z-score is defined as: z = (X - M) / SD, where X is the raw score, M the mean, and SD the standard deviation. In words: a z-score tells you how many standard deviations the score of an individual deviates from the mean. Because z-scores are standardized (which means they have a mean of zero and a standard deviation of one, regardless of the values of the original scores), they play a key role in the development of test norms.
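
    As a concrete illustration, here is a minimal Python sketch of the z-score formula; the norm-group mean and standard deviation are invented for the example:

      # z-score: how many SDs a raw score lies above or below the norm-group mean.
      def z_score(raw, mean, sd):
          return (raw - mean) / sd

      # Hypothetical norm group with mean 50 and SD 10.
      print(z_score(65, 50, 10))   # 1.5  -> one and a half SDs above the mean
      print(z_score(42, 50, 10))   # -0.8 -> slightly below the mean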

  • The second main category of norms is formed by standard scores. A standard score system converts raw scores into a scale with an arbitrarily chosen mean (M) and standard deviation (SD). The formula to convert a raw score into a standard score is: SS = (SDs / SDr)(X - Mr) + Ms, in which subscript s denotes the standard score system and subscript r denotes the raw score system. When the score (X) is already translated into z-score form, the formula is: SS = z(SDs) + Ms. Commonly, the conversion of raw scores to standard scores is obtained by a linear transformation. However, some raw-score distributions deviate much from a normal distribution. In those cases, a nonlinear transformation might be needed to approximate a normal distribution. The scores are then referred to as normalized standard scores.
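
    To make the linear transformation concrete, a minimal Python sketch follows; it converts hypothetical raw scores (Mr = 30, SDr = 6) to the familiar T-score system (Ms = 50, SDs = 10). All numbers are invented:

      # SS = (SD_s / SD_r) * (X - M_r) + M_s, equivalently SS = z * SD_s + M_s.
      def standard_score(x, raw_mean, raw_sd, ss_mean, ss_sd):
          z = (x - raw_mean) / raw_sd    # express X as a z-score first
          return z * ss_sd + ss_mean     # then rescale to the target system

      print(standard_score(36, 30, 6, 50, 10))   # 60.0 -> one SD above the mean
      print(standard_score(27, 30, 6, 50, 10))   # 45.0 -> half an SD below the mean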

  • Results can be interpreted in two ways: criterion-referenced and norm-referenced. Criterion-referenced interpretation refers to the situation in which there is no reference to any norm group. For example, a student gets 30 of the 50 items right (60%). The teacher judges this as being unsatisfactory. In making the judgement that 60% was unsatisfactory, there was no reference to any norm group. This is in contrast with norm-referenced interpretation, in which a test result is interpreted in relation to some norm group.
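
    The contrast can be shown in a few lines of Python; the norm group below is invented. Percent correct requires no norm group, while a percentile rank is meaningless without one:

      raw, n_items = 30, 50

      # Criterion-referenced: judge the score against a fixed standard.
      percent_correct = 100 * raw / n_items                   # 60.0

      # Norm-referenced: judge the same score against a norm group.
      norm_group = [22, 25, 28, 29, 31, 33, 35, 38, 41, 44]   # hypothetical raw scores
      percentile_rank = 100 * sum(s < raw for s in norm_group) / len(norm_group)

      print(percent_correct)    # 60.0 (no norm group involved)
      print(percentile_rank)    # 40.0 (depends entirely on the norm group)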

  • It is important to be aware of the usefulness of a norm group. To determine the usefulness, it is helpful to compare the norm group to the population in terms of gender, race, SES, geographic region, and so on. Two important issues should be considered in doing so. First, what is the stability of the norm group? Is the norm group relatively stable? This depends on the size of the group. In other words: Is the sample size of the norm group large enough to ensure statistical stability? Second, is the norm group representative of the target population for the test? When the norm group is not (properly) representative of the population, a common procedure is to weight certain cases in the norm group to improve the match. Suppose that the male-female split in the population is 50-50, but in the norm group we obtain a 40-60 division. Then, the male participants are assigned a weight of 1.5 (since 40 × 1.5 = 60, matching the 60 females), or equivalently the female participants a weight of 0.67, to obtain the 50-50 split.
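
    A minimal Python sketch of this weighting procedure, using the (hypothetical) 40-60 norm group from the example:

      males, females = 40, 60            # norm group counts; population is 50-50

      male_weight = 1.5                  # 40 * 1.5 = 60, matching the 60 females
      weighted_males = males * male_weight
      weighted_females = females * 1.0

      total = weighted_males + weighted_females
      print(weighted_males / total, weighted_females / total)   # 0.5 0.5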

How is reliability related to psychological testing? - Bulletpoints 4

  • Chapter four deals with the topic of reliability. Before discussing the use of reliability in psychological testing, it is important to distinguish between validity and reliability. Validity refers to the question whether a test measures what it is intended to measure. Reliability refers to the question whether a test is consistent, regardless of what it is that it is measuring. Be aware that a test can be reliable (thus consistent) without being valid. The opposite, however, cannot occur: a test cannot be valid without being reliable.

  • There are four factors that affect the correlation coefficient: (1) linearity; (2) heteroscedasticity; (3) relative (not absolute) position; (4) group heterogeneity.

  • There are four major sources of unreliability: (1) test scoring; (2) test content; (3) test administration conditions; (4) personal conditions.

  • Four common ways to determine reliability are: (1) test-retest; (2) inter-scorer; (3) alternate form; (4) internal consistency. Each method treats one or more of the different sources of unreliability. 
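
    As an illustration of the internal consistency approach, here is a minimal Python sketch of coefficient alpha, a widely used internal consistency index; the item responses are invented:

      # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
      def variance(xs):
          m = sum(xs) / len(xs)
          return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

      def cronbach_alpha(items):
          # items: one inner list of scores per item (columns are persons).
          k = len(items)
          item_var_sum = sum(variance(item) for item in items)
          totals = [sum(scores) for scores in zip(*items)]   # total score per person
          return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

      # Hypothetical 3-item test taken by 5 persons (1 = correct, 0 = incorrect).
      items = [
          [1, 0, 1, 1, 0],
          [1, 0, 1, 0, 0],
          [1, 1, 1, 1, 0],
      ]
      print(round(cronbach_alpha(items), 2))   # 0.79 for these invented data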

  • While the formulas may be hard to understand, we can draw three rather simple conclusions from them. First, test length matters: in general, the longer the test, the more reliable it will be. Second, reliability increases as the proportion of respondents answering yes/correctly approaches .50 (thus p = .50). Third, the correlation between items is important: the higher the correlation between items, the higher the reliability generally is. To ensure reliability, it is important that the items measure a well-defined trait.
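
    The first conclusion (test length matters) is usually formalised with the Spearman-Brown formula, which estimates the reliability of a test lengthened (or shortened) by a factor n. A minimal Python sketch, with an invented starting reliability:

      # Spearman-Brown: r_new = (n * r) / (1 + (n - 1) * r)
      def spearman_brown(r, n):
          return (n * r) / (1 + (n - 1) * r)

      r = 0.70                                  # hypothetical original reliability
      print(round(spearman_brown(r, 2), 2))     # 0.82 -> doubling the test length
      print(round(spearman_brown(r, 0.5), 2))   # 0.54 -> halving the test length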

  • A widely asked question is: How high should reliability be? The short answer is: It depends. For high-stakes tests, in which the decision has far-reaching consequences, one should aim for a high reliability. If a test, on the other hand, is used in a research project in which only group averages are of interest, a lower degree of reliability may be sufficient. A rule of thumb is that reliability should be at least .90 for important decisions, and .80 when the test is only one of several types of information.

How is validity related to psychological testing? - Bulletpoints 5

  • Before further formalising the definition of validity, we introduce two more concepts that are important for valid measurement. First, a construct is a trait or characteristic that we wish to measure. Examples are depression and mathematical reasoning ability. A test is designed to measure a specific construct. The part of the construct that is not covered by the test is referred to as construct underrepresentation. However, the opposite may also occur: a test can also measure more than the construct we are interested in. This surplus measurement is referred to as construct-irrelevant variance. A valid measurement minimizes both construct underrepresentation and construct-irrelevant variance.

  • Content validity refers to the degree to which there is a relationship between the content of a test and some well-defined domain of knowledge or behaviour. Thus, a content-valid test is one in which there is a good match between the content of the test and the content of the relevant domain. Content validity is the most important type of validity for (academic) achievement tests. The second major application field of content validity is employment tests. The application of content validity to other areas, such as personality and intelligence, is limited, because these areas often do not have a clear specification of the domains to be covered. For example, what is the content outline for extraversion or social intelligence?

  • Criterion-related validity purports to establish the relationship between test performance and performance on some other criterion that is considered an important indicator of the construct of interest. There are two clearly distinguishable contexts for criterion-related validity: predictive and concurrent. First, predictive validity is considered when the test aims to predict status on some criterion that will be attained in the future. For instance, a college entrance test may be used to predict student performance later in the educational trajectory. Second, concurrent validity is used to check whether there is agreement between test performance and performance on some other variable at the same time. For example, we may assess the concurrent validity of a depression test by comparing its results with the clinician's rating of depression.
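
    In practice, criterion-related validity is typically summarised as a correlation (the validity coefficient) between test scores and criterion scores. A minimal Python sketch with invented data for the college entrance example:

      # Pearson correlation between entrance-test scores and a later criterion (GPA).
      def pearson_r(xs, ys):
          n = len(xs)
          mx, my = sum(xs) / n, sum(ys) / n
          cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          sx = sum((x - mx) ** 2 for x in xs) ** 0.5
          sy = sum((y - my) ** 2 for y in ys) ** 0.5
          return cov / (sx * sy)

      test = [45, 52, 58, 61, 70, 74]          # hypothetical entrance-test scores
      gpa  = [2.1, 2.8, 2.6, 3.2, 3.4, 3.9]    # hypothetical first-year GPAs
      print(round(pearson_r(test, gpa), 2))    # 0.95: the predictive validity coefficient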

  • In conclusion, assessing the validity of a test is a complex procedure, in which there are often many different sources of validity evidence. The process of weighing all these different sources of evidence and judging their relevance is referred to as validity generalization. When combining all the evidence and making a final judgement, one should consider the following question: Am I better off using this test as an information source, or is it better not to use this test?

How to develop a test? - Bulletpoints 6

  • Test development comprises six major steps: (1) Defining the purpose of the test; (2) Preliminary design issues; (3) Item preparation; (4) Item analysis; (5) Standardization and ancillary research programs; (6) Preparation of final materials and publication. 
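
    As an illustration of step 4 (item analysis), two statistics are commonly computed per item: the difficulty index p (proportion correct) and a discrimination index such as the item-total correlation. A minimal Python sketch with invented 0/1 response data:

      # Rows are persons, columns are items (1 = correct, 0 = incorrect). Invented data.
      responses = [
          [1, 1, 0, 1],
          [1, 0, 0, 1],
          [1, 1, 1, 1],
          [0, 0, 0, 1],
          [1, 1, 0, 0],
      ]

      def pearson_r(xs, ys):
          n = len(xs)
          mx, my = sum(xs) / n, sum(ys) / n
          cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          sx = sum((x - mx) ** 2 for x in xs) ** 0.5
          sy = sum((y - my) ** 2 for y in ys) ** 0.5
          return cov / (sx * sy)

      totals = [sum(row) for row in responses]       # total score per person
      for j in range(len(responses[0])):
          item = [row[j] for row in responses]
          p = sum(item) / len(item)                  # difficulty index
          r_it = pearson_r(item, totals)             # discrimination (uncorrected)
          print(f"item {j + 1}: p = {p:.2f}, item-total r = {r_it:.2f}")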

  • Item preparation includes two processes: item writing and item review. An item comprises four parts. First, there is a stimulus, also known as the item stem, to which examinees respond. Second, there is a response format (for example, multiple-choice or true/false). Third, there are conditions governing the examinee's response (for example, a time limit). And fourth, there is a scoring procedure (scoring rubric), for example dichotomous (correct versus incorrect) or with partial credit for selecting certain options.

  • Selected-response items require little judgement in scoring, and therefore are beneficial regarding scoring reliability. In addition, they often require little time: within a fixed time limit, an examinee can usually respond to more selected-response items than to constructed-response items. Third, selected-response items can be scored more efficiently than constructed-response items.

  • Constructed-response items allow for an easier observation of test-taking behaviour and processes than selected-response items do. Moreover, they allow for exploring unusual areas that might be missed by a selected-response item. Lastly, some test developers believe that multiple-choice questions encourage an undesirable study tactic of rote memorisation and an atomistic approach to learning. This is less so for constructed-response items, as they encourage a more holistic, meaningful approach to studying the content.

  • One last question when writing items is: how many items should be written? There is no definitive answer to this question. It depends, among other things, on how well decisions were made in the preliminary design stage. If the area to be tested is examined thoroughly and appropriate item types are selected, fewer items are necessary. A general rule of thumb is to write two to three times as many items as needed for the final test. Thus, if the final test comprises 50 questions, it is recommended to prepare 100 to 150 items. After the items are written, they are usually reviewed for content correctness, clarity, grammar, bias (for example gender, racial, or ethnic bias), and conformity with the rules for writing items that we discussed earlier.

  • Test fairness implies that a test measures a trait, construct, or target with equivalent validity in different groups. In contrast, an unfair (biased) test does not measure the trait equivalently for different groups. It is important to note that fairness does not imply that a test yields equal scores for different groups. For example, if some groups really do differ on the ability or trait we are trying to measure, then the test should reflect that difference.

How to study intelligence? - Bulletpoints 7

  • What is intelligence? First of all, it is important to note that there is no universal definition of intelligence. Yet, there is surprisingly good agreement among most psychologists about what constitutes intelligence. The following terms are commonly used in definitions of intelligence: think abstractly, solve problems, identify relationships, learn quickly, memory functions, speed of mental processing, learn from experience, plan effectively, deal effectively with symbols.

  • The first of two classical theories of intelligence was developed by Charles Spearman (1904, 1927). Spearman claimed that test performance was mostly affected by one general mental ability, called "g". In addition to this general factor, each test had some unique or specific variance, denoted by "s". Each "s" comprises variance due to a specific ability plus error variance. His theory is also referred to as the two-factor theory, because it has two kinds of factors (g and s). However, the g-factor is the dominant one in his theory.

  • The second classical theory of intelligence was developed by the American psychologist L. L. Thurstone. In contrast to Spearman, Thurstone did not believe in a single, general underlying factor of intelligence. His theory is referred to as the multiple-factor theory. He extracted twelve factors of intelligence, of which he considered nine interpretable. The nine factors of intelligence are: (1) spatial; (2) perceptual; (3) numerical; (4) verbal; (5) memory; (6) words; (7) induction; (8) reasoning; (9) deduction.

  • The 'battle' between the first and second theory is also referred to as the one-versus-many argument. A compromise position is offered by hierarchical models. In a hierarchical model, there are many separate abilities, but these are arranged in a hierarchy, with just one or a few dominant factors of intelligence at the top. Examples of such hierarchical models were developed by Cattell (fluid versus crystallized intelligence), Vernon, and Carroll (three-stratum theory, with Spearman's "g" at the highest level, or stratum).

  • A topic on which there is much debate in the literature concerns the influences of heredity (nature) and environment (nurture). However, the question is no longer whether it is nature or nurture. All scholars nowadays agree that intelligence results from an interaction between these two components. In addition, these two influences are not additive, but are related in more of a multiplicative way. That is, if one is completely absent, no matter how high the other is, intelligence will not develop. Another misconception is that nature traits are already present at birth, whereas nurture influences develop later. This is not true. Think, for example, about baldness. While it is not visible at birth and does not manifest until midlife or later, it is already present in the genes at birth.

How to administer individual intelligence tests? - Bulletpoints 8

  • In sum, there are eight common features of individual intelligence tests: (1) individual administration; (2) advanced training is required to administer the test; (3) coverage of a broad range of ages and abilities; (4) establishment of rapport; (5) free-response formatted items; (6) immediate scoring of items; (7) administration takes about one hour; (8) it allows for observation.

  • Although individual intelligence tests may be very different, they usually contain items covering the following nine categories: (1) vocabulary; (2) verbal relations; (3) information; (4) meaning, comprehension; (5) arithmetic; (6) short-term memory; (7) form patterns; (8) psychomotor; (9) matrices.

  • For many years, the Stanford-Binet test was the most popular method for measuring intelligence. However, two drawbacks of this test were (according to David Wechsler) that it was oriented toward children only, and that it yielded only a single, general score. Hence, Wechsler developed his own scale, first published in 1939 as the Wechsler-Bellevue; its successor, the Wechsler Adult Intelligence Scale (WAIS), appeared in 1955. Since then, many revisions and new editions have been published. Somewhat paradoxically, he later also developed the Wechsler Intelligence Scale for Children (WISC) for children aged 6-16 years, and the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) for children aged roughly 2½ to 7 years. Even now, new versions are published using his name, even though Wechsler died in 1981.

  • Which one is better, the Stanford-Binet test or the Wechsler scales? This question remains unanswered. However, the two series are nowadays certainly much more similar than they were in the past.

  • In the early twenty-first century, professionals became increasingly concerned about the stigmatising connotation of older terms such as mental retardation. An alternative term was found in intellectual disability. The definition of intellectual disability depends heavily on the concept of adaptive behaviour, that is, how well a person copes with ordinary life. Examples of adaptive behaviours are: feeding oneself, clothing oneself (level 1), reading simple words (level 2), and taking the bus (level 3). The formal definition of intellectual disability comprises three criteria that must all be met: (1) significantly subaverage intellectual functioning; (2) limitations in adaptive behaviour; (3) onset before age 18.

How to apply a group test of mental ability? - Bulletpoints 9

  • Group mental ability tests are mainly used in (a) elementary and secondary schools, in conjunction with achievement tests; (b) predicting school success in college and graduate or professional school; (c) job selection or placement in the military and business; and (d) research.

  • Similar to the individually administered tests, there are a number of characteristics that are very common in group tests of mental ability. The eight common characteristics are: (1) obviously, the test can be administered to large groups, with theoretically no limit to group size; (2) multiple-choice items, so that the tests are amenable to machine scoring; (3) the content of the test is very similar to that of the individually administered tests; (4) a fixed time limit and a fixed number of items; (5) administration time is bimodally distributed: administration often takes about one hour or about three hours; (6) a total score and several subscores; (7) a very large research base for norming, equating, reliability, and so on; (8) their primary purpose is to predict future success in school or on the job.

  • Six generalisations can be made about group mental ability tests. First, although there are huge differences in target groups and purposes of the tests, there is notable similarity in their content. Commonly, there are items on vocabulary, verbal relationships, reading, and so on. Second, total scores of group mental ability tests are commonly very reliable, with internal consistency reliabilities around .95 and test-retest reliabilities around .90. Third, there is notable similarity in the predictive validity of these tests. Fourth, there is a lack of differential validity: various combinations of subtests often do not yield higher validity coefficients than those obtained with the total scores. Fifth, two statistical issues affect these group tests: range restriction and imperfect reliability. Sixth, so far, there is no successful culture-free group test of intelligence.
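
    The range restriction issue in the fifth generalisation can be corrected statistically; a minimal Python sketch of the classic correction for direct range restriction on the predictor, with invented numbers:

      # r_c = (r * k) / sqrt(1 - r^2 + r^2 * k^2), where k = SD_unrestricted / SD_restricted.
      def correct_range_restriction(r, sd_unrestricted, sd_restricted):
          k = sd_unrestricted / sd_restricted
          return (r * k) / (1 - r**2 + r**2 * k**2) ** 0.5

      # Hypothetical: r = .30 observed in a selected group (SD 9); full-range SD is 15.
      print(round(correct_range_restriction(0.30, 15, 9), 2))   # 0.46: estimate for the full range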

What is neuropsychological assessment? - Bulletpoints 10

  • The history of clinical neuropsychology can be traced back to the ancient Greeks, although its official birth is attributed to Arthur Benton in the 20th century. There are six main reasons for neuropsychological evaluation: (1) diagnosis; (2) identifying strengths and weaknesses; (3) vocational planning; (4) treatment planning; (5) forensics; (6) research.

  • There are two main approaches to neuropsychological assessment: (1) the fixed battery; (2) the flexible battery. In the fixed battery approach, the same set of tests (a battery) is used for each respondent. In the flexible battery approach, the clinician chooses the subtests he or she considers best suited to assess a certain individual. Approximately 5% of clinical neuropsychologists adopt a standardised, fixed battery approach and 78% adopt a flexible battery approach.

  • A complete neuropsychological assessment involves more than only administering a test (battery). Supplementary information can be provided by (1) medical history; (2) psychiatric history; (3) psychosocial history; (4) school records; (5) collateral information; (6) observations of behaviour. Collateral information is information that is collected via, for example, family members and close relatives. For instance, a patient who is now suffering from dementia may previously have been very pleasant and socially appropriate; this kind of information can typically only be provided by people close to the patient.

How to test achievement? - Bulletpoints 11

  • What is the difference between ability tests and achievement tests? It is best to think of a continuum, rather than considering these tests as rigidly distinct compartments. Where a test falls on the ability-achievement continuum depends on its degree of dependence on specific training. At the extreme right of the continuum (specific achievement) fall tests that are highly dependent on specific training, for example tests of historical facts (civil war battles) or specialised skills (riding a bicycle). At the extreme left of the continuum (general ability) fall tests that depend little on specific training; examples are solving puzzles and detecting patterns. In the middle are traits such as reading comprehension and arithmetic problem solving.

  • Although there are huge differences between achievement tests, we identified five common features. First, most batteries (although referred to as "a" test) are a system of many interrelated tests. Second, the tests are oftentimes accompanied by a large amount of supplementary materials (detailed lists of objectives; interpretive booklets for students, teachers, parents, and school administrators; computer-generated scoring reports; and so on). Third, there often are exemplary norming procedures and other research programs, with exhaustive technical manuals. Fourth, while achievement tests traditionally relied solely on multiple-choice items, they are now often accompanied by free-writing exercises and open-ended questions. Fifth, many achievement batteries depend on the same sources of information for their content, for example major textbook series and outlines provided by the National Assessment of Educational Progress.

How is testing applied in clinical settings? - Bulletpoints 13

  • There are both similarities and differences between tests of normal personality and clinical instruments. Four similarities are: (1) nature of items (simple), and response format (simple); (2) subdivisions for comprehensive and specific domain tests; (3) strategies for development; (4) threat of response sets and faking. Three differences are: (1) orientation; (2) administration setting; (3) manual and purpose.

  • One way of testing in clinical settings is by using an interview. First, we have to distinguish between the different types of interviews: unstructured, semistructured, and structured. Note that these three types are not discrete categories, but fall on a continuum. At one end, the unstructured (traditional) interview does not follow a set pattern and consequently varies from one respondent to another, as well as from one examiner to another. At the other end of the spectrum, a structured interview uses a strict pattern: the same topics, with the same questions, are administered to each respondent. The semistructured interview falls between these two approaches: there are some standard questions, but there is also flexibility to tailor the interview to the individual respondent. In this section, we mainly focus on the structured clinical interview.

  • The MMPI-2 is an extensive self-report inventory, consisting of 567 items. The MMPI was first published in 1942; the revision (second edition) appeared in 1989. Test administration usually takes about 60 to 90 minutes (up to 120 minutes for examinees with low reading levels or high distraction levels). The test is widely used: it is the most frequently used test among neuropsychologists and the second most frequently used test among clinical psychologists. The MMPI has its own language, customs, and rituals. For instance, a respondent could be described as a "24 code type with elevated F scale". An elevated F score may indicate, among other things, severe pathology or "a cry for help".

  • A group of clinical instruments that forms its own category is the behaviour rating scales (BRS). These scales are widely used to assess conditions such as attention disorders and assorted emotional problems. Two important features of these scales are: (1) someone else completes the rating, typically a teacher; (2) they list specific behaviours, and the descriptors are short, usually one to three words.

What are projective techniques? - Bulletpoints 14

  • Two key features of projective techniques are the ambiguous nature of the stimuli and the projective hypothesis. First, it is not immediately obvious what the test stimulus means. Items from an objective personality test, for example "I often feel happy", have a reasonably obvious meaning. This is not the case for items of projective techniques. Second, the rationale underlying projective techniques is: when the stimulus for a response is ambiguous, the examinee's personality dynamics will determine the response itself. This is called the projective hypothesis. According to this hypothesis, a respondent will formulate the answer in terms of his or her desires, fantasies, fears, motives, and so on. Therefore, the projective test is considered an ideal tool to uncover deep-seated, perhaps even unconscious, personality characteristics.

  • Four commonly used projective techniques are: (1) the Rorschach Inkblot Test; (2) the Thematic Apperception Test (TAT); (3) the Rotter Incomplete Sentences Blank (RISB); (4) human figure drawings (for example the House-Tree-Person test, HTP).

How to measure interests and attitudes? - Bulletpoints 15

  • Two pioneers in the field of career assessment are Edward K. Strong Jr. and G. Frederic Kuder. Broadly speaking, there are two traditional differences in approaches to career interest measurement: (1) origin of scales: criterion-keying versus broad areas; (2) item format: absolute versus relative level of interest. 

  • Based on our examination of the three widely used career interest inventories, we derived the following five generalisations. First, career-related interest patterns of individuals seem to be quite reliable, at least from middle adolescence onward. Second, measures of career interest have a respectable degree of validity: different occupational groups tend to differ in their interest patterns, and people tend to enter occupations that are consistent with their interests. Third, manuals of these tests provide little or no reference to more modern psychometric techniques (such as IRT or differential item functioning). Fourth, career interest inventories are increasingly administered online. Fifth, not surprisingly, there is a positive relationship between interests and ability. Career interest testing must be accompanied by relevant information from the ability domain: someone who wants to be a physician but does not have the ability to successfully complete the requisite science courses is not suited to the occupation.

  • Measurement of attitudes overlaps partly with public opinion polls. What is the difference between these? Clearly, there is no difference in the nature of the questions. Items from public opinion polls could easily be used in attitude scales and vice versa. The difference lies in the target for inference. Attitude measures aim to assess the attitude of an individual. Public opinion polls aim to assess the position of the group. 

How to deal with ethical and legal issues in testing? - Bulletpoints 16

  • There is a close relationship between ethical and legal issues, yet they are not the same. Ethics concerns what one should or should not do, according to norms of conduct or principles. Law concerns what one must or must not do, according to legal dictates. Commonly, ethical principles and laws overlap. For instance, it is both illegal and unethical to steal or murder. However, they may also differ. For instance, suppose you lie to your wife about your income. Unethical? Yes. Illegal? No. In this chapter, we discuss the ethical and legal issues related to psychological testing.

  • In sum, the broadly applicable principles of ethical test use are: (1) ensuring competence; (2) obtaining informed consent; (3) providing knowledge of results; (4) maintaining confidentiality; (5) guarding test security. The more narrowly applicable principles, which apply specifically to psychological testing, are: setting high standards for test development, assuming responsibility for automated reporting, and striving to prevent unqualified test use.
