Summary of Chapter 13: Standardized Tests in Education, Civil Service, and the Military, from the book Psychological Assessment and Theory: Creating and Using Psychological Tests (Kaplan, R.M., Saccuzzo, D.P., 2013)
During your studies, you have doubtless encountered a standardized test. This may have been the GRE Revised General Test (GRE), the SAT Reasoning Test (SAT-I), or even the Goodenough-Harris Drawing Test. Many universities handle admissions through standardized group entrance exams. The key factor in these standardized tests is the test criterion, i.e. what the test is trying to predict. Choosing a criterion can prove difficult. The GRE, for example, is widely used in admissions to postgraduate programs, yet it predicts neither clinical skill nor the capacity to solve real-world problems.
While the tests discussed in this chapter improve the accuracy of a selection process, it is important to note that they account for only a small proportion of the variability in the outcomes they are used to predict.
Comparison of Group and Individual Ability Tests
Individual tests and group tests each have their own advantages and disadvantages. Individual tests are carried out with a single examiner assigned to a single subject. The examiner follows the instructions provided in the manual of the standardized test. What follows is an interaction in which the subject responds and the examiner records the response exactly. These responses are then evaluated, a process which can require a high degree of skill. In contrast, a single examiner can administer a group test to many individuals at the same time. The examiner reads the instructions to the subjects, time limits are established, subjects record their responses themselves, and the responses are scored as a percentage, which usually requires very little skill.
If a subject is experiencing difficulty for any reason, be it fear, stress, or an uncooperative nature, the examiner in an individual test takes responsibility for eliciting maximum performance. In a group test, it must be assumed that the subject is fully motivated and cooperative. For this reason, low scores on group tests can be difficult to interpret. They can be attributed to a wide range of factors, such as low motivation, clerical error, or a misunderstanding of the instructions.
Advantages of Individual Tests
Through individual tests, it is possible to learn more about a subject than the test score alone reveals. Over time, examiners develop internal norms. With these internal norms, examiners can readily identify unusual reactions to certain tasks or situations. Individual testing therefore offers the chance to observe behaviour in a standardized setting, which allows the examiner to see beyond the test scores in a unique way.
Advantages of Group Tests
Compared with individual tests, group tests are more cost-efficient, require less expensive materials, and demand less examiner skill. They are usually more objective, because subjects record their own responses, and therefore usually more reliable. Individual tests are mostly applied in clinical settings, whereas group tests are applied in a much broader range of settings. Group tests are commonly used at various levels of schooling. The military, industry, and research also rely heavily on them.
Overview of Group Tests
Characteristics of Group Tests
For the most part, group tests can be categorized as paper-and-pencil or booklet-and-pencil tests, because most of them consist of a printed booklet, test manual, scoring key, answer sheet, and pencil. This is changing, however, as computerized testing increasingly replaces paper and pencil. The number of group tests far exceeds the number of individual tests. Generally, group test scores are converted to percentiles or standard scores, although a few are converted to ratio or deviation IQs.
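As a rough illustration of the score conversions mentioned above, the following Python sketch turns a raw score into a standard score (z-score), a percentile rank, and a deviation IQ. The norm-group scores, the function names, and the IQ scale with mean 100 and standard deviation 15 are assumptions chosen for illustration; they are not taken from any particular test.

```python
from statistics import mean, pstdev

def standard_score(raw, norm_scores):
    """z-score: distance of a raw score from the norm-group mean, in SD units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group that scored below the raw score."""
    below = sum(1 for s in norm_scores if s < raw)
    return 100.0 * below / len(norm_scores)

def deviation_iq(raw, norm_scores, iq_mean=100.0, iq_sd=15.0):
    """Rescale the z-score to an IQ-style scale (assumed mean 100, SD 15)."""
    return iq_mean + iq_sd * standard_score(raw, norm_scores)

# Hypothetical norm group of ten raw scores (mean 19).
norm_group = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

print(standard_score(24, norm_group))
print(percentile_rank(24, norm_group))
print(deviation_iq(24, norm_group))
```

The same raw score can thus be reported in several ways, each of which locates the subject relative to the norm group rather than giving an absolute quantity.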
Selecting Group Tests
Because of the sheer number of group tests available, the test user can choose from a selection of well-documented and psychometrically sound tests. In particular, ability tests used in schools have been found to be very reliable.
Using Group Tests
The tests discussed here are almost as reliable and soundly standardized as the best individual tests. As is the case with some individual tests, however, validity data for some group tests are weak, meagre, or contradictory, and sometimes all three. When working with group test information, the following cautions should be exercised. First, use results with caution: avoid overinterpretation, do not treat scores as absolute or isolated, and be careful when using results for prediction. Second, be especially suspicious of low scores: many factors can contribute to a low score, so be aware of them. Third, consider wide discrepancies a warning signal: large discrepancies among an individual’s test scores, or between test scores and other data, may be a sign that all is not well with the individual. Fourth, when in doubt, refer: in the case of low scores, wide discrepancies, or reason to doubt validity, the best option is to refer the subject for individual testing.
Group Tests in the Schools: Kindergarten Through 12th Grade
The goal of tests used in schools is to measure educational achievement in children.
Achievement Tests Versus Aptitude Tests
Achievement tests aim to ascertain what an individual has learned following specific instruction. These tests measure how much a student has learned after sufficient training has been provided. Validity is determined by content-related evidence: the test is said to be valid if it accurately samples the domain of the construct being assessed.
Aptitude tests, on the other hand, aim to measure how much potential for learning an individual possesses. A wide variety of experiences are evaluated in a multitude of ways. The validity of an aptitude test is determined by its ability to predict future performance. Hence, these tests rely extensively on criterion-oriented evidence.
Group Achievement Tests
The Stanford Achievement Test (SAT) is renowned as being one of the oldest standardized achievement tests still widely used within the education system. The SAT is in its 10th edition and is currently well normed and criterion referenced, with outstanding psychometric documentation. It primarily evaluates achievement in kindergarten to 12th grade in a variety of areas.
The Metropolitan Achievement Test (MAT) is another well standardized and psychometrically sound group measure of achievement. This test measures achievement in reading by assessing word recognition, vocabulary, and reading comprehension. Versions of this test include Braille, large print, and audio formats.
The MAT and the SAT are the pinnacle of modern achievement testing. These tests are psychometrically well documented, reliable, and normed on large samples. Both sample a wide variety of educational factors and cover all grade levels.
Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT) – Eighth Edition
The Kuhlmann-Anderson Test (KAT) is a group intelligence test which is applied from kindergarten through 12th grade. The test comprises eight separate levels, with a variety of items at each. Unlike most tests, the KAT does not become more verbal as the age of the group being tested increases; instead, it remains primarily non-verbal throughout. This makes the KAT suitable not just for young children but also for individuals who may be handicapped in following verbal procedures. It may even prove to be suitable for non-English-speaking populations, after proper norming. Results from the KAT can be reported as verbal, quantitative, and total scores. Scores can also be expressed as percentile bands. A percentile band provides the range of percentiles which most likely contains a subject’s true score, much like a confidence interval. The KAT is a reliable, valid, and sophisticated test, and its non-verbal qualities make it a strong candidate for testing non-native English speakers.
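The idea of a percentile band can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KAT’s own procedure: it assumes normally distributed scores, builds the band as the observed score plus or minus one standard error of measurement (SEM = SD × sqrt(1 − reliability)), and uses hypothetical values for the norm mean, standard deviation, and reliability.

```python
from math import sqrt
from statistics import NormalDist

def percentile_band(observed, norm_mean, norm_sd, reliability, n_sem=1.0):
    """Return (low, high) percentiles spanning observed +/- n_sem * SEM."""
    # Standard error of measurement from the test's reliability coefficient.
    sem = norm_sd * sqrt(1.0 - reliability)
    # Assumption: scores in the norm group follow a normal distribution.
    dist = NormalDist(norm_mean, norm_sd)
    low = dist.cdf(observed - n_sem * sem) * 100.0
    high = dist.cdf(observed + n_sem * sem) * 100.0
    return low, high

# Hypothetical scale: mean 100, SD 15, reliability 0.91 (so SEM is about 4.5).
low, high = percentile_band(observed=100, norm_mean=100, norm_sd=15, reliability=0.91)
print(f"percentile band: {low:.1f} to {high:.1f}")
```

A less reliable test yields a larger SEM and therefore a wider band, which is why a band conveys more honestly than a single percentile how precisely a score is known.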
Henmon-Nelson Test (H-NT)
The Henmon-Nelson Test of Mental Abilities is another widely used test applicable to all grade levels. This test produces a single score, which is thought to measure general intelligence. This has been, and continues to be, a source of some controversy. However, the test remains a quick predictor of future academic success. Because it yields only a general intelligence score, the H-NT does not consider multiple intelligences. The H-NT manual also calls for caution when testing individuals from an educationally disadvantaged background. Research has also shown that the H-NT tends to underestimate Wechsler full-scale IQ scores by 10 to 15 points for a number of populations.
Cognitive Abilities Test (COGAT)
In terms of reliability and validity, the COGAT is similar to the H-NT. The COGAT provides three scores: verbal, non-verbal, and quantitative. Unlike the H-NT, the COGAT was designed with poor readers, poorly educated individuals, and non-native English speakers in mind. Additionally, research has shown that the COGAT is a sensitive differentiator for giftedness, a good predictor of future performance, and a good measure of verbal underachievement. However, the COGAT has been found to be very time consuming, there is uncertainty about whether the norms are representative, and minority students have been found to score lower than white students across the test batteries and grade levels. For these reasons, great care should be taken when interpreting the scores of minority populations.
College Entrance Tests
The SAT Reasoning Test (SAT)
Formerly known as the Scholastic Aptitude Test, the SAT Reasoning Test (SAT-I) is still the most widely used university entrance test. The SAT was renormed in 1994 in an attempt to restore the national average to the 500-point level, as it was in 1941. More recent changes set the number of scored sections at three, each scored from 200 to 800 points. This will likely lead to fewer interpretation errors, because interpreters can no longer rely on old versions as points of reference. At 3 hours and 45 minutes long, the modern SAT is an endurance race which rewards determination, motivation, stamina, and persistent attention. The SAT is a good predictor of first-year college GPA.
Cooperative School and College Ability Tests (SCAT)
The SCAT, developed in 1955, is second only to the SAT; however, it has not been updated since its introduction. It covers the college level as well as three precollege levels, starting at 4th grade. Its primary goal is to measure school-learned abilities and an individual’s potential to take on further schooling. Compared with the SAT, the SCAT’s psychometric documentation is neither as strong nor as extensive. Revisions and extensions of the SCAT are encouraged, as it is currently unable to compete with the SAT.
The American College Test (ACT)
The American College Test is a widely used aptitude test for college entrants. Its biggest strength is that it is particularly useful for non-native English speakers. The results of the ACT consist of specific content scores and a composite score. The ACT is about as successful as the SAT in predicting college GPA, whether alone or in combination with high-school GPA. However, its internal consistency coefficients are not as strong as those of the SAT.
Graduate and Professional School Entrance Tests
Graduate Record Examination Aptitude Test (GRE)
The GRE is among the most widely used tests for graduate-school entrance. It primarily measures general scholastic ability. The test is administered throughout the year at examination centres across the globe. It consists of three parts: verbal (GRE-V), quantitative (GRE-Q), and analytical reasoning (GRE-A). Based on Kuder-Richardson and odd-even reliability, the GRE is stable, with coefficients only slightly lower than those of the SAT. False-negative rates are high, however, and the GRE was found not to be a significant predictor for a group of Native American students. It has also shown a tendency to over-predict the achievement of younger students while under-predicting the performance of older students. Despite this, many schools have developed their own methods of using the GRE, either on its own or in combination with other sources of data. The best way to use a GRE score is in conjunction with other data. Combining the GRE with GPA predicts graduate success more accurately than relying on the GRE alone. A common problem among colleges is grade inflation, which refers to the rise in average college grades despite the decline in average SAT scores.
Miller Analogies Test
Similar to the GRE is the Miller Analogies Test, another measure of scholastic aptitude for graduate studies. The difference is that this test is strictly verbal. Hence, knowledge of specific content coupled with a proficient vocabulary is very useful. In terms of odd-even reliability, the Miller Analogies Test is sufficiently reliable. However, it lacks validity support. Like the GRE, this test also tends to over-predict the GPAs of younger students and under-predict the GPAs of older students.
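Odd-even reliability, mentioned for both the GRE and the Miller Analogies Test, can be sketched as follows: the test is split into its odd-numbered and even-numbered items, the two half-test scores are correlated across examinees, and the correlation is stepped up to full test length with the Spearman-Brown formula, 2r / (1 + r). The item responses below are hypothetical (1 = correct, 0 = incorrect), and the function names are chosen for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def odd_even_reliability(responses):
    """Split-half (odd-even) reliability with the Spearman-Brown correction."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_scores, even_scores)
    # Spearman-Brown: estimate reliability of the full-length test.
    return 2 * r_half / (1 + r_half)

# Hypothetical data: five examinees (rows) answering six items (columns).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(odd_even_reliability(responses), 2))
```

The correction is needed because each half contains only half the items, and shorter tests are less reliable than longer ones, so the raw half-test correlation understates the reliability of the whole test.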
The Law School Admission Test (LSAT)
The LSAT is taken under extreme time pressure and requires almost no specific knowledge. Like the Miller Analogies Test, it contains some of the most difficult problems one can encounter on a standardized test. The LSAT covers three types of problems: reading comprehension, logical reasoning, and analytical reasoning. Every test administered since the format changed in 1991 is available for study. The LSAT has been found to be psychometrically sound. However, researchers have raised concerns that the test is biased and favours whites over blacks. This and other concerns have led to a 10 million dollar initiative to increase diversity in American law schools.
Nonverbal Group Ability Tests
Raven Progressive Matrices (RPM)
The Raven Progressive Matrices test is among the most widely known and used nonverbal group tests. It can be used whenever an estimate of an individual’s intelligence is needed, though it is most commonly used in educational settings. The RPM instructions are very simple and can be given without the use of language. For this reason, the test is used throughout the world. The test consists of 60 matrices, each of which contains a pattern with a piece missing. The RPM has the advantage of minimizing the effects of language and culture.
Goodenough-Harris Drawing Test (G-HDT)
Originally standardized in 1926 and then re-standardized in 1963, the Goodenough-Harris Drawing Test is one of the simplest, quickest, and most cost-efficient tests of nonverbal intelligence available. Requiring just a pen and paper, the test asks subjects to draw a whole man and to do the best job they can. Subjects earn credit for each item they include in the drawing. The test is commonly used because it is so easy to administer. It gives a quick and rough estimate of a child’s intelligence. However, caution is advised, as results based purely on the G-HDT can be misleading.
The Culture Fair Intelligence Test
One goal of nonverbal tests has always been to restrict cultural influences on scores. The Culture Fair Intelligence Test was designed with this in mind, to provide an estimate of intelligence which is free of cultural and linguistic influences. Research has shown that this test does not succeed at this any more than other tests do; however, its popularity reflects the desire for a test which reduces cultural factors. The test has been found to work best when measuring the intelligence of Western European or Australian individuals. More work is needed if the Culture Fair Intelligence Test is to compete with the RPM.
Standardized Tests Used in the U.S. Civil Service
The General Aptitude Test Battery (GATB), which measures a wide range of aptitudes relevant to a number of occupations, is widely used to assist employment decisions. The GATB has been the subject of controversy because it used within-group norming prior to the Civil Rights Act of 1991. Under within-group norming, an individual’s score is compared only with the scores of others in the same group: women only with other women, men only with other men, Latinos only with other Latinos, and so on. The argument for within-group norming was one of fairness; however, it was outlawed and labelled as reverse discrimination.
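The effect of within-group norming can be made concrete with a small Python sketch. It computes a percentile rank for the same raw score twice: once against the examinee’s own group and once against all examinees pooled together. The group labels and scores are entirely hypothetical and are not GATB data.

```python
def percentile_rank(raw, scores):
    """Percentage of the comparison scores that fall below the raw score."""
    below = sum(1 for s in scores if s < raw)
    return 100.0 * below / len(scores)

# Hypothetical raw scores for two groups of examinees.
groups = {
    "group_a": [40, 50, 60, 70, 80],
    "group_b": [30, 40, 50, 60, 70],
}
# Pooled norms: every examinee compared against everyone.
pooled = [s for scores in groups.values() for s in scores]

def within_group_percentile(raw, group):
    """Within-group norming: compare only with members of the same group."""
    return percentile_rank(raw, groups[group])

def pooled_percentile(raw):
    """Pooled norming: compare with all examinees regardless of group."""
    return percentile_rank(raw, pooled)

raw_score = 60
print(within_group_percentile(raw_score, "group_a"))
print(within_group_percentile(raw_score, "group_b"))
print(pooled_percentile(raw_score))
```

In this made-up example the same raw score of 60 receives a different percentile depending on the comparison group, which is precisely the property that made within-group norming controversial.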
Standardized Tests in the U.S. Military: The Armed Services Vocational Aptitude Battery (ASVAB)
The ASVAB is a test designed by the Department of Defense which is administered to over 1.3 million individuals per year. The test consists of 10 subtests, which cover a wide range of factors. The psychometric characteristics of the ASVAB are exemplary. The test has been shown to be reliable and a valid predictor of performance during training for a variety of civilian and military occupations. The ASVAB has been moving away from the pen-and-paper format in favour of computerized testing, which allows the test to be adapted to each subject’s ability.
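How a computerized test can adapt to the subject may be illustrated with a deliberately simple Python sketch. It is not the ASVAB’s actual algorithm, which this summary does not describe; it is a generic "staircase" procedure in which each new item is the unused one closest in difficulty to the current ability estimate, and the estimate moves up after a correct answer and down after an incorrect one, by a step that halves each time. The item bank, the difficulty scale, and the simulated examinee are all hypothetical.

```python
def run_adaptive_test(item_difficulties, answers_correctly, n_items=5,
                      start=0.0, step=1.0):
    """Administer n_items adaptively; return (ability estimate, item log)."""
    estimate = start
    remaining = list(item_difficulties)
    administered = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item whose difficulty best matches the estimate.
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        correct = answers_correctly(item)
        administered.append((item, correct))
        # Move the estimate toward harder or easier items, then shrink the step.
        estimate += step if correct else -step
        step /= 2
    return estimate, administered

# Hypothetical item bank on an arbitrary difficulty scale.
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

# Simulated examinee: answers correctly whenever the item is no harder
# than an assumed true ability of 1.2.
true_ability = 1.2
estimate, log = run_adaptive_test(bank, lambda d: d <= true_ability)

print(estimate)
for difficulty, correct in log:
    print(difficulty, correct)
```

Because items far too easy or far too hard for the subject are skipped, an adaptive test can reach a usable estimate with fewer items than a fixed-form test that gives everyone the same questions.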