Summary: Psychological Assessment and Theory - Creating and Using Psychological Tests

This summary of Psychological Assessment and Theory: Creating and Using Psychological Tests, based on Kaplan & Saccuzzo, was written in 2016.


Chapter 1 - Introduction

Book Structure

R. M. Kaplan & D. P. Saccuzzo, Psychological Testing: Principles, Applications, and Issues, 8th edition, 2013.

This book is structured in a way that lets the reader grasp both the simplest and the most complex issues of testing. It is roughly divided into three sections: principles, applications, and issues.

Basic Concepts

We use tests to measure certain behaviour and give it a quantitative value; tests also give us a better understanding of the behaviour, which in turn lets us predict behaviour. The measures or test scores we obtain are never perfect, though they greatly help the prediction process. Tests consist of items: stimuli such as questions or problems that must be worked on during the test.

Psychological testing is used when we want to measure features of human behaviour. A distinction is made between overt behaviour, which is directly observable, and covert behaviour, which is internal and less obvious (e.g. thoughts).

Be careful when interpreting the scores a test produces: their meaning depends on how the scoring is defined. To avoid interpretation problems we use scales, which relate raw scores to more specific, well-defined distributions.

Since tests measure a variety of behaviours, many test variations are in use. In an individual test, only one person at a time takes the test; in a group test, several people take the same test at once (e.g. high school class exams).

Ability tests measure speed, accuracy, or both. The three types of ability measured are: achievement, which reflects previous learning; aptitude, which concerns the potential to master a skill; and intelligence, which refers to a person's general capacity to solve problems, adapt, think abstractly, and benefit from experience. These three constructs often interact with each other.

Personality tests measure overt and covert behaviours, more specifically a person's typical behaviour. A general distinction is made between structured (objective) and unstructured (projective) personality tests. Structured tests are those in which you, for example, tick a box to state true or false, while the Rorschach is a projective test in which an individual is asked to interpret a rather ambiguous stimulus.

The main use of psychological testing is to compare individuals and, where possible, draw conclusions about their differences.

 

Historical Perspective

The tests we encounter nowadays were mostly developed during the past 100 years, even though the origins of testing can be traced back more than 4,000 years to China, where oral examinations were used to assess promotion and evaluate work.

Test batteries represent the use of two or more tests at once and they were common during the Han Dynasty. The Western world most likely got familiar with testing via the Chinese.

Charles Darwin’s contribution to testing culture was an indirect one. Sir Francis Galton, Darwin’s relative, used the evolutionary theory proposed by Darwin to study humans. If the fittest ones survive and we all differ from one another, then some people must have certain characteristics that make them fitter than the rest, Galton argued.

His most valuable contribution was demonstrating the existence of individual differences in sensory and motor functioning, which became a cornerstone of modern scientific psychology. Cattell took this work further and introduced mental tests.

Another stream of thought laid the groundwork for experimental psychology through Herbart, Fechner, Weber and Wundt, whose work was more theoretical and led to an appreciation of how important control and standardization are in testing.

The modern tests of today, however, originated from the growing need to test those who were emotionally and mentally impaired; providing these individuals with adequate education made further test development necessary. Alfred Binet is the name associated with the first general intelligence test. The Binet-Simon scale consisted of 30 items and results were compared against a standardization sample. Binet was aware of the significance of standardization, though the sample used for comparison was not necessarily an appropriate one. Take, for example, 100 Asian girls from poor families as a standardization sample and then use it to interpret the test result of a wealthy African American adult man: the comparison is of no use in this case.

This led to the emergence of the representative sample that is needed to compare the person being tested to people similar to him/her in order to get useful test outcomes.

The Binet-Simon scale was revised several times and the standardization sample grew over time. Even more importantly, the term mental age was introduced, drawing attention to measuring a child's performance against his or her own age group. The term captures the possible difference between a child's chronological age (say 8) and mental age (say 6), meaning the child is 8 years old but performs like an average 6-year-old. The scale was heavily criticized for its focus on verbal and language skills.

World War I increased the demand for testing because military recruits had to be evaluated. Since the Binet scale was an individual test, a need for mass testing arose, leading to the development of two structured group tests: the Army Alpha, which required literacy, and the Army Beta, which did not.

Another development that followed was achievement testing, which consisted of multiple-choice questions with a large standardization sample as the norm against which results could be compared. Such tests are easy to administer and less subject to scorer bias.

Furthermore, the Wechsler-Bellevue Intelligence Scale (W-B) introduced an innovation in intelligence testing: the opportunity to test multiple abilities, and combinations of them, in one individual. The inclusion of a nonverbal performance scale meant that verbal ability was no longer required to assess performance.

Personality testing is associated with measuring traits. Traits are (partly) stable dispositions that can be used to differentiate between people; optimists, for instance, tend to remain optimistic even during harsh times. The Woodworth Personal Data Sheet, developed during World War I, was the first structured personality test. It included items such as "Do you wet the bed?" answered "yes" or "no", and the responses were taken at face value, disregarding dishonesty and personal interpretation of the question. Personality tests were harshly criticized and had almost disappeared from use by the late 1940s.

Projective tests emerged at around the same time; they use ambiguous stimuli and allow open-ended, unstructured responses. An example is the Rorschach inkblot test, which presents the subject with an ambiguous inkblot and asks for an interpretation of it. A similar approach was taken in the Thematic Apperception Test (TAT), in which the individual is asked to make up a story about a presented picture; in this way the TAT is supposed to assess human needs and motivations.

Projective tests became popular during the period when structured personality tests were in disrepute. Over time, however, projective tests failed to demonstrate solid psychometric properties. The need for empirically constructed tests grew, and structured personality tests such as the Minnesota Multiphasic Personality Inventory (MMPI) emerged. Its authors argued that the meaning of test responses had to be established empirically. The MMPI remains one of the most widely used tests today.

The Sixteen Personality Factor Questionnaire, introduced by R. B. Cattell, uses factor analysis to find the minimum number of characteristics (dimensions or factors) needed to represent a large number of variables (the main issue with earlier personality tests such as the Woodworth was that too many assumptions had to be investigated). It is still widely used.

As test development progressed, many applied areas of psychology developed. Tests remain controversial; nevertheless, all areas of psychology depend on them greatly.

Chapter 2 – Norms and Basic Statistics for Testing

The need for statistics

Science needs to know how likely it is that certain events happen by chance alone, which is why we use statistical methods. More specifically, we use statistics for two purposes: description, because numbers can serve as summaries of observations, and inference, the logical deductions we make about events that cannot be observed directly. For example, imagine you want to know how many people listen to a certain radio station. You cannot ask everyone, so you take a sample and, by examining the sample, make inferences about the population.

 

Scales of measurement

We need to define measurements in order to make sense of results. For this we use different scales. The following measurement properties of scales are important:

Magnitude ("moreness") is the property that one instance of an attribute can be described as more, less, or equal in amount compared to another instance. For example, if Anna weighs more than Hannah, the scale of weight has the property of magnitude.

 

Equal intervals is the property that the difference between any two points on the scale means the same thing anywhere on the scale. A difference of 5 IQ points between scores of 35 and 40 is not necessarily equivalent to a difference of 5 points between 130 and 135, even though the numerical difference is the same.

 

Absolute zero is the point at which none of the property being measured exists. This is hard to establish in psychology, since defining an absolute zero point of, for example, friendliness is difficult and somewhat meaningless.

 

These properties are used to determine different types of scales:

Nominal scales have one purpose: to name objects. They are used when information is qualitative, for example coding a person's gender as 1 = male, 2 = female.

Ordinal scales allow us to rank individuals but do not describe the distances between those ranks. An ordinal scale has magnitude but lacks equal intervals and an absolute zero.

Interval scales have magnitude and equal intervals but no absolute zero (for example, temperature in Fahrenheit).

A ratio scale has all three properties (speed of travel, for example, where 0 km/h means no movement). Mathematical operations are meaningful with ratio scales; we can, for example, say that 120 km/h is twice as fast as 60 km/h.

 

Frequency distributions

To get an overview of the scores of a group or an individual we use a distribution of scores. A frequency distribution shows how often each value was obtained; usually the scores are on the X-axis and their frequencies on the Y-axis. When the distribution is bell-shaped, the highest frequencies lie towards the centre of the distribution.

 

A distribution is skewed when it is asymmetrical: if the tail trails off to the right of the X-axis the skew is positive, if it trails off to the left the skew is negative. Income is an example of a highly skewed variable, because very few people are extremely rich while a large part of the population has a low income.

The class interval is the unit on the X-axis that explains a particular score interval.

 

Percentile ranks

Calculating a percentile rank answers the question of what proportion of scores fall below a certain value Xi. It is calculated with the following formula:

Pr = B/N * 100 = percentile rank of Xi

Pr stands for percentile rank, Xi for the score of interest, B for the number of scores below Xi and N for the total number of scores. Since B is always less than or equal to N, the fraction is at most 1; multiplying by 100 converts it to a percentage. Keep in mind that a percentile rank depends entirely on the comparison group.
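
As an illustration, here is a minimal Python sketch of the percentile-rank formula Pr = (B/N) × 100; the comparison group used below is invented for the example.

```python
def percentile_rank(scores, xi):
    """Percentile rank of the value xi within a list of scores: Pr = (B / N) * 100."""
    b = sum(1 for s in scores if s < xi)   # B: number of scores below Xi
    n = len(scores)                        # N: total number of scores
    return 100.0 * b / n

scores = [2, 4, 5, 7, 8, 8, 9, 10, 11, 15]   # invented comparison group
print(percentile_rank(scores, 8))            # 40.0 -> 40% of scores fall below 8
```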

 

Percentiles and percentile ranks are closely related: a percentile is the point in a distribution below which a certain percentage of cases fall, while the percentile rank (Pr) is the percentage of cases falling below a given score.

 

Describing distributions

The mean (X bar) is the arithmetic average of a distribution of scores and is one way to summarize data. To calculate the mean we divide the sum of the scores (sum of X's) by the number of cases (N).

 

The standard deviation (S) is an approximation of the average deviation around the distribution's mean. If the mean is, for example, 4 and the standard deviation 2, then values between 2 and 6 fall within one standard deviation of the mean.

 

Variance (S²) is the average squared deviation around the mean. We work with squared deviations because the sum of the raw deviations around the mean is always zero. To get the standard deviation we take the square root of the variance; in short, the standard deviation is the square root of the average squared deviation around the mean.
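
A short sketch of these summary statistics, using the definitional formulas described above (sum of squared deviations divided by N); the data values are invented.

```python
def mean(xs):
    """Arithmetic mean: sum of scores divided by the number of cases."""
    return sum(xs) / len(xs)

def variance(xs):
    """Average squared deviation around the mean (S squared)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    """Standard deviation: the square root of the variance."""
    return variance(xs) ** 0.5

data = [2, 4, 4, 4, 5, 5, 7, 9]                    # invented scores
print(mean(data), variance(data), std_dev(data))   # 5.0 4.0 2.0
```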

 

The Z-score transforms data into standardized units that are easier to interpret. It is calculated by dividing the difference between the individual score and the mean (Xi ‑ X bar) by the standard deviation (S).

 

A Z-score can be positive or negative, depending on whether the score falls above the mean (positive) or below it (negative). If there is no difference between the score and the mean, the Z-score equals 0.

Formula to obtain Z score:

 

Z = (Xi ‑ Xbar) / S
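
A minimal sketch of the Z-score transformation Z = (Xi − Xbar)/S, reusing the definitional mean and standard deviation; the data are invented.

```python
def z_scores(xs):
    """Transform raw scores into standardized units: Z = (Xi - mean) / S."""
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / s for x in xs]

data = [2, 4, 4, 4, 5, 5, 7, 9]                # invented scores (mean 5, S 2)
print([round(z, 2) for z in z_scores(data)])   # e.g. 2 -> -1.5, 9 -> 2.0
```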

 

A standard normal distribution has a mean of 0 and a variance of 1.0, because any variable transformed into Z-scores takes on these properties. Consider the formula for Z: the mean of a set of Z-scores has the sum of the deviations around the mean, sum(Xi − Xbar), in its numerator, and since that sum is always 0, the mean of Z-scores is always 0. Dividing each deviation by S rescales the scores so that their standard deviation (and therefore variance) equals 1.0.

 

There are many ways to transform raw data to make more sense of it. McCall's T is one example: a system with the mean set at 50 and the standard deviation at 10. There is nothing special about these numbers; the T score is a simple transformation of Z: T = 10Z + 50. You can create any system that suits you by multiplying the Z-score by the desired standard deviation and adding the desired mean. This is how scores are standardized (the SAT is an example). Standardizing is different from normalizing the scores: if you transform the scores of a skewed distribution in this way, the distribution remains skewed.
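
A sketch of McCall's T (T = 10Z + 50) and of the general recipe of multiplying a Z-score by the desired standard deviation and adding the desired mean; the example Z value and the SAT-style scale are illustrative assumptions.

```python
def transform(z, new_mean, new_sd):
    """Rescale a Z score to any standardized system: score = new_sd * Z + new_mean."""
    return new_sd * z + new_mean

z = 1.5                                 # a person 1.5 SDs above the mean (invented)
print(transform(z, 50, 10))             # McCall's T: 65.0
print(transform(z, 100, 15))            # a deviation-IQ style scale: 122.5
print(transform(z, 500, 100))           # an SAT-style scale: 650.0
```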

 

Quartiles are points that divide the distribution into equal fourths: the first quartile (Q1) is the 25th percentile, the second (Q2) is the median, and the third (Q3) is the 75th percentile. The interquartile range is the interval between Q1 and Q3, i.e. the middle 50% of the distribution.

 

Deciles are points that divide the distribution into equal tenths. They range from D1 to D10, each covering 10% of the distribution. The stanine system, developed by the U.S. Air Force, converts a set of scores into a scale ranging from 1 to 9.

 

Norms

Norms describe the performance of defined groups on a specific test. They provide information about a person's performance by comparing it to what was observed in the standardization sample. Take IQ or SAT scores as examples: you obtain a score and compare it to the standardized, or normative, scores. If you score 130 on an IQ test known to have a mean of 100 and a standard deviation of 15, your score indicates above-average intelligence.

 

Age-related norms are found with tests that have several normative groups; intelligence tests, for example, must take into account whether the test taker is a child or an adult. Paediatricians commonly use age-related norms to assess children's growth. Bear in mind that children of the same age may follow different patterns of development; however, children tend to stay at about the same level relative to their peers, which is called tracking. Tracking is often controversial, especially in education, where children are sometimes assigned to different classes on the basis of the performance they show at a particular moment.

 

A norm-referenced test works by comparing an individual to a norm. This has drawn criticism because many young children are exposed to competition in areas where they perform below average.

 

Criterion-referenced tests assess a specific skill or ability that the test taker can demonstrate (e.g. maths skills). The results are not used for comparison with any group or individual; they serve a more diagnostic purpose and are used to identify issues that can then be worked on.

 

 

 

Chapter 4 - Reliability

 

History and concept of reliability

Measurement is a difficult task for psychology as a science: complex features such as intelligence are not simple to assess. Fortunately, the theory of measurement error is well developed within psychological research. Reliability (the consistency of scores across repeated examinations) has a special place in psychological testing, since it provides evidence for the scientific character of psychology as a discipline. Charles Spearman was a pioneer in the development of reliability assessment; later, many reliability coefficients were introduced to the field.

 

Classical test score theory states that everyone has a true score that we could obtain if no measurement errors were made. The observed score is the true score plus error: X = T + E, where X is the observed score, T the true score and E the error. The error of measurement is then the difference between the observed score and the true score: X − T = E. Classical test theory assumes that errors of measurement are random.
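
A small simulation of the classical model X = T + E with random, normally distributed error: repeated observed scores scatter around a fixed true score, and their average approaches it. All numbers are invented for illustration.

```python
import random

random.seed(0)

true_score = 100                         # T: the (unknown) true score
error_sd = 5                             # spread of the random measurement error

# Each administration yields X = T + E, with E drawn at random.
observed = [true_score + random.gauss(0, error_sd) for _ in range(1000)]

mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 2))           # close to 100: random errors average out
```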

 

Sampling theory indicates that the distribution of these random errors is bell-shaped, so the centre of the distribution should reflect the true score, and what lies around the centre is the distribution of sampling errors. This error distribution tells us how much error there is in our measurement. Classical theory assumes that an individual's true score does not change when the same test is repeated; however, because of random errors, repeated administrations can yield different observed scores.

Because the error distribution is assumed to be the same for everyone, classical test theory uses the standard deviation of errors as its basic measure of error: the standard error of measurement. It tells us, on average, how much an observed score differs from the true score.

 

The Domain Sampling Model

The domain sampling model is a concept related to classical theory that addresses the problem of using a limited number of items to assess a broader, more complex construct. The model defines reliability as the ratio of the variance of the observed scores on the (short) test, which has no room to cover everything that could explain, for example, intelligence, to the variance of the long-run true scores. Reliability could be estimated from the correlation of the observed test score with the true score, but true scores can rarely be obtained: finding someone's true ability to spell in German would require that person to spell every existing word. The alternative is to estimate true scores, and these estimates should be random and normally distributed. To estimate reliability we can repeatedly create random parallel tests by drawing new random samples of items from the same domain, correlate the scores on one test with the scores on all the other random parallel tests, average the correlations, and take the square root. Because of the squaring, the reliability estimate is always positive.

 

Item Response Theory

Item response theory (IRT) is a psychometric approach that is becoming increasingly important as the field moves away from classical theory. IRT relies on computers and focuses on the difficulty levels of items, which helps build up knowledge about a person's ability: the computer adapts to the person's responses, for example switching to harder items when the person answers several items in a row correctly, and vice versa. With IRT, higher reliability can be obtained from a shorter test containing fewer items.

 

Reliability Models

Reliability is usually expressed with correlation coefficients, though it can also be expressed as a ratio: the ratio of true score variance to observed score variance. The observed score need not equal the true score, because of many external influences such as noise or temperature. Reliability is most commonly estimated in the following ways: test-retest, parallel forms, or internal consistency.

 

Test-retest reliability estimates the error involved in administering a test at two different points in time. This is of course valuable only if we are measuring something that is not supposed to change over time; tests of attributes that change are not usefully assessed this way. This type of reliability is fairly easy to obtain: administer the test at two different, well-planned time points and correlate the scores. A drawback is the possibility of a carryover effect, where the first testing influences the second. The practice effect is a well-known type of carryover effect, in which skills improve over time because of practice. Because of these issues, the time interval between the two administrations must be set carefully.

To make a reliable test, one must be sure that the test scores do not merely reflect some particular subset of items from the domain being studied. Parallel forms reliability compares two equivalent forms of a test measuring the same attribute; the items themselves differ but are selected according to the same rules and have the same difficulty. Another term for parallel forms reliability is equivalent forms reliability. If the two forms are administered at different times, the error related to the time difference is included as well.

 

The split-half method for assessing reliability divides the administered test into halves that are scored independently, and the results of the two halves are then compared. For a long test it is preferable to split the items at random; for simplicity, splitting into the first and second half is also possible. If the items become increasingly difficult, the odd-even split is most commonly used.

However, the reliability of half a test is lower than that of the whole test, and this is where the Spearman-Brown formula can be used to estimate what the correlation would be for the full-length test. The formula raises the reliability estimate, but it is not always appropriate, for example when the two halves do not have equal variances. In that case the general reliability coefficient α can be used; α provides the lowest possible estimate of reliability. Importantly, alpha can show that a test has the required reliability, but it cannot show that a test is unreliable. When the variances of the two halves are equal, alpha and the Spearman-Brown coefficient give the same result. For items that are scored dichotomously (0 or 1), the Kuder-Richardson formula (KR20) is used.
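
A sketch of the Spearman-Brown correction for the split-half case, where the correlation between the two halves is stepped up to estimate the reliability of the full-length test; the .70 half-test correlation is invented.

```python
def spearman_brown(r, factor=2):
    """Estimated reliability of a test lengthened by `factor` (2 for split halves)."""
    return factor * r / (1 + (factor - 1) * r)

r_halves = 0.70                             # invented correlation between the two halves
print(round(spearman_brown(r_halves), 3))   # about 0.824 for the full-length test
```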

 

Coefficient alpha is used when there are no right or wrong answers, as in personality tests. Imagine a scale from strongly disagree to strongly agree: no response is incorrect, it simply marks your position between agreement and disagreement. Alpha is a very general reliability estimate. The key formula is: r = α = (N/(N − 1)) × ((S² − sum of Si²)/S²), where N is the number of items, S² the variance of the total test scores, and Si² the variance of each individual item.

Alpha is more general because it can describe items even when there is no right-wrong scoring (in contrast to the Kuder-Richardson formula). Alpha estimates reliability through internal consistency: if the items are not measuring the same attribute, the test lacks internal consistency. When that happens, factor analysis is the most common way to deal with the inconsistent measurements.
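
A sketch of coefficient alpha computed directly from the formula above, α = (N/(N − 1)) × (S² − sum of Si²)/S²; the small response matrix is invented (4 persons rating 3 items).

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(responses):
    """responses: one list of item scores per person."""
    n_items = len(responses[0])
    items = list(zip(*responses))                         # one tuple per item
    item_vars = sum(variance(list(it)) for it in items)   # sum of Si^2
    total_var = variance([sum(p) for p in responses])     # S^2 of total scores
    return (n_items / (n_items - 1)) * (total_var - item_vars) / total_var

data = [          # invented 5-point ratings: 4 persons x 3 items
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]
print(round(cronbach_alpha(data), 3))
```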

 

Sometimes we want to study a behaviour or characteristic by taking the difference between two scores and evaluating it. When comparing two different attributes in this way, we must first convert the scores to Z-scores, since Z is the standardized unit. Difference scores are a common source of problems: when a difference score is formed, its error component (E) is likely larger than in either score separately, because it contains the errors of both scores that make up the difference.

Moreover, the true-score component (T) is expected to shrink relative to E, because whatever the two measures have in common cancels out when the difference is taken. For this reason, the reliability of difference scores is expected to be lower than that of either of the component scores.

 

Use in Behavioural Observational Studies

Some psychologists prefer observational studies to tests. Observational studies seem simple and straightforward, yet they involve many sources of error; sampling errors in particular must be taken into account when evaluating results. Behavioural observation often suffers from low reliability, mostly because of discrepancies between the true behaviour and what the observer records. To control this and improve reliability, several techniques are used to estimate it: interrater, interscorer, interobserver, and interjudge reliability, all of which test how consistently different judges report on the same behaviour. We can simply record the percentage of times they agree, but this technique has two problems: it does not take into account the level of agreement expected by chance, and the percentages cannot properly be averaged.

 

The kappa statistic is regarded as the most suitable way to evaluate the level of agreement among observers. Kappa measures agreement on a nominal scale, expressing the proportion of actual agreement after taking chance agreement into account. Kappa varies between −1 and 1 (from less-than-chance agreement to perfect agreement).
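
A sketch of kappa for two raters who categorize the same cases on a nominal scale: agreement beyond chance, divided by the maximum possible agreement beyond chance. The rating data and category labels are invented.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    chance = sum((c1[c] / n) * (c2[c] / n) for c in categories)
    return (observed - chance) / (1 - chance)

r1 = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
r2 = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task"]
print(round(cohen_kappa(r1, r2), 3))   # about 0.667
```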

 

Sources of Error and how to assess them

Sources of error: time sampling error arises when the same test is given at different times, even to the same people. Item sampling error arises because a given attribute could be assessed with many different sets of items. Internal consistency refers to the intercorrelations among the items within the same test.

Important: When assessing reliability, take the possible sources of errors into consideration.

 

The use of Reliability Information

The following describes the practical side of reliability evaluation. The standard error of measurement indicates how inaccurate a measurement may be: the larger it is, the less certain we are about the accuracy of the measurement. It can be calculated from the reliability coefficient and the standard deviation: Sm = S*√(1 − r), where Sm is the standard error of measurement, S the standard deviation and r the reliability coefficient. Researchers use the standard error to build a confidence interval around an observed score: we cannot know whether the observed score is the true one, but a confidence interval around it tells us the probability that the true score falls within that interval.
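
A sketch of the standard error of measurement, Sm = S√(1 − r), together with an approximate 95% confidence interval around an observed score; the standard deviation, reliability and observed score are invented.

```python
def standard_error_of_measurement(sd, reliability):
    """Sm = S * sqrt(1 - r)."""
    return sd * (1 - reliability) ** 0.5

sd, r = 15.0, 0.91          # invented test SD and reliability coefficient
observed = 110              # an invented observed score
sm = standard_error_of_measurement(sd, r)
low, high = observed - 1.96 * sm, observed + 1.96 * sm   # ~95% interval
print(round(sm, 2), (round(low, 1), round(high, 1)))     # 4.5 (101.2, 118.8)
```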

How high should reliability be before we call it high? A range of .70 to .80 is good enough for most research purposes, though it depends on the purpose of the test; others hold that anything below .90 is not worth mentioning. Highly focused tests tend to have high reliability, while measures of complex constructs are usually less reliable.

 

To increase test reliability, psychometric theory suggests two methods: lengthening the test and discarding items with low reliability. Reliability increases as the number of items increases, though adding items may cost the researcher a great deal of time and money. The Spearman-Brown formula can help here, since it can indicate how many additional items are needed to reach a desired reliability. Often it turns out that some items do not measure the construct in question; leaving them out increases reliability. To check that the items measure the same thing, one can use factor analysis or inspect the correlation between each item and the total test score (discriminability). A low correlation indicates a discrepancy between what the item and the rest of the test measure; it can also mean that the item is too easy or too hard to be informative. Items with low correlations should be excluded.

 

Measurement error attenuates (diminishes) the potential correlation between two measures. We correct for attenuation by dividing the observed correlation between tests 1 and 2 by the square root of the product of their reliabilities: r̂12 = r12/√(r11*r22). The difference between the corrected and observed values shows by how much the correlation would increase if both measures were perfectly reliable.
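
A sketch of the correction for attenuation, r̂12 = r12/√(r11·r22); all three coefficients below are invented.

```python
def correct_for_attenuation(r12, r11, r22):
    """Estimated correlation if both tests were perfectly reliable."""
    return r12 / (r11 * r22) ** 0.5

r12 = 0.30                # observed correlation between tests 1 and 2 (invented)
r11, r22 = 0.70, 0.80     # reliabilities of tests 1 and 2 (invented)
print(round(correct_for_attenuation(r12, r11, r22), 3))   # about 0.401
```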

 

Chapter 5 - Validity

 

Definition of Validity

In testing, validity concerns meaning: the agreement between a test score (the measure) and the quality the test is supposed to measure. In simple words: "Is the test measuring what we want it to measure?" Validity can also be defined as the evidence for the inferences made from a test score. Three types of evidence are distinguished: construct-related, criterion-related and content-related. Face validity is not officially a form of validity, though the term is widely used in testing: it is merely the impression that the items appear related to the purpose of the test. It has nothing to do with validity proper, because it provides no evidence in support of conclusions drawn from test scores.

 

The different aspects of Validity

Content-related evidence for validity concerns how adequately a test represents the domain it is designed to cover. Does what is on your exam really represent your knowledge of the subject? It is a logical type of evidence, in contrast to the others, which are more statistical. Two concepts related to content validity evidence are construct underrepresentation, the failure to capture important components of a construct, and construct-irrelevant variance, when scores are influenced by factors unrelated to the construct itself.

 

Criterion-related evidence for validity assesses how well a test corresponds to a specific criterion; such evidence is provided when the correlation between the test and the criterion measure is high. The test stands in for what we actually want to measure: for example, a premarital test serves to predict future marital satisfaction. This forecasting feature is known as predictive validity evidence. Take the SAT and GPA as examples: the SAT is the predictor variable and the GPA is the criterion; the test is used to predict success on that criterion. Concurrent validity evidence describes the simultaneous relationship between the test and the criterion; it can only be assessed when both can be measured at the same time, for example a test for learning disabilities and current school performance. Moreover, when a person is unsure how to respond to a criterion measure, say occupational choice, the Strong Interest Inventory (SII), which uses patterns of interest among people satisfied with their jobs, is a better predictor of perceived career fit than personality; vocational interests in general are better predictors.

The test-criterion relationship is usually expressed as a correlation called the validity coefficient, which expresses how well the test predicts the criterion. In general, a coefficient between .30 and .40 is often regarded as high. Squaring the coefficient gives the proportion of variation in the criterion accounted for by the test.

 

Construct-related validity evidence comes from a series of procedures in which a researcher simultaneously defines a construct and develops the tools to measure it, thereby building evidence of what a specific test means. Gathering this evidence is a continuous process, much like accumulating support for a complex theory. In 1959, Campbell and Fiske distinguished two essentials for a test to be meaningful: convergent and discriminant evidence. Convergent evidence is present when a measure correlates with other tests believed to measure the same construct; measures of the same construct thus converge.

 

Convergent evidence can be gathered in two ways: one is to show that a test measures the same thing as other tests used for the same purpose; the other is to demonstrate the specific relationships we would expect if the test really measures what it is supposed to measure.

 

Discriminant evidence is needed in test validation as proof that the test measures something distinctive. This demonstration of distinctiveness is called discriminant evidence (also known as divergent validation); it shows that the measure does not represent a construct other than the one it was designed for. The different categories of validity are no longer treated as separate constructs; instead, we speak of different categories of evidence.

In theory, we can have reliability without validity, but it is not possible to demonstrate that a test without reliability is valid. Reliability and validity are certainly related concepts.

Chapter 7 – Writing and Evaluating Test Items

 

Guidelines for Item Writing

Writing items can be difficult and there are many things to consider. DeVellis provided some guidelines to help with item writing:

 

  • Make the item very specific - make it clear what you actually want to measure by using the theory.

  • Pay attention when selecting and developing items, for example avoid unneeded ones.

  • Avoid items that are too long.

  • Make sure that the language difficulty is suitable for the test takers - that it is clear to them what is asked.

  • Avoid bringing up two or more ideas with one item - avoid so called “double barrelled” questions.

  • Having both positively and negatively worded items is good.

 

Being cautious about ethnic and cultural differences is necessary since the same item can be interpreted differently across cultures.

 

The dichotomous format offers two alternatives for each item. The usual form is the true-false test: you are presented with a statement and must decide whether it is true or false. The positive aspects of this format are simplicity, fast scoring, and easy test administration. A drawback is that such items rely on the test taker's ability to memorize, without allowing him or her to show understanding of the topic.

Dichotomous tests tend to be less reliable than some other formats. The format is widely used in personality tests, where absolute judgments ("true" or "false" about oneself) are required.

The polytomous (polychotomous) format is similar to the dichotomous format but offers more than two response alternatives. It is most commonly found in multiple-choice examinations; it is easy to score, and the probability of getting an item right purely by guessing is lower than the 50% chance of the dichotomous format.

 

A major advantage is that it takes less time to respond to an item, because there is no need to elaborate on or write out an answer. Only one of the alternatives is correct; the rest are called distractors. The choice of distractors is essential: too many or overly complicated distractors take up too much time and often reduce the reliability of the test. Research suggests that three or four good distractors are the best option.

Years of psychometric analysis indicate that three-option multiple-choice items are as good as, or better than, items with any other number of alternatives.

 

The problem that some right answers are expected purely from guessing is often dealt with using the "correction for guessing" formula:

 

corrected score= R-(W/(n-1)) where R=the number of right responses

W=the number of wrong responses

n=the number of choices for each item

 

W/(n-1) is an estimate of how many items the test taker got right by chance alone.
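
A minimal sketch of the correction-for-guessing formula above; the counts are invented.

```python
def corrected_score(right, wrong, n_choices):
    """Corrected score = R - W / (n - 1)."""
    return right - wrong / (n_choices - 1)

# Invented example: 40 right and 12 wrong answers on four-choice items.
print(corrected_score(right=40, wrong=12, n_choices=4))   # 36.0
```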

 

The essay is another format, very common in classroom use; its validity and reliability are rarely assessed or analysed.

 

The Likert format, very common in personality and attitude assessment, requires the test taker to indicate the degree of agreement with an attitude statement. For example: "I am afraid of spiders" - strongly disagree, disagree, neutral, agree, strongly agree. Sometimes the neutral option is omitted.

 

The category format is a form of the Likert format that provides even more answer choices; most common is the 10-point rating scale. For example: "On a scale from 1 to 10, how reliable do you find your best friend?" It can also have more or fewer than 10 points. Since people tend to change their ratings depending on the environment or context, category formats are criticized for a lack of reliability. This can be countered by defining the endpoints of the scale very clearly and reminding test takers to anchor their ratings to those endpoints. Why 10 points? It depends on the test taker's involvement in the topic: highly involved, motivated respondents can make use of many scale points because they can distinguish many "shades", while for uninterested respondents it makes no difference whether you offer a 7-point or a 27-point scale.

 

The visual analogue scale is related to the category format: it presents the test taker with a 100-mm line and asks him or her to place a mark between the endpoints as a response to the question. Scoring is time-consuming, though these scales are popular for self-rated health.

 

Checklists and Q-sorts

Adjective checklists are lists of adjectives on which the test taker indicates which ones characterize him or her; they can also be used to characterize others. Only two options are available: either an adjective (e.g. adventurous) applies to you or it does not.

 

The Q-sort technique is similar but uses more categories to assess personality. For example, you are given statements and asked to sort them into 9 piles. Most statements end up in the middle piles (4, 5 and 6), which are reserved for statements that only mildly characterize the person; statements placed at the extremes (piles 1 and 9) usually say something interesting about the individual.

 

"All of the above" is generally advised against as an answer alternative, though this advice is widely ignored.

 

Analysing Items

Item analysis refers to a group of methods used to evaluate test items and is considered an essential part of test construction.

 

Item difficulty is measured by the proportion of people who get a specific item right. As the proportion of correct answers among test takers increases, the difficulty of the item decreases. An item answered correctly by everyone is a poor item, since it provides no information about differences among test takers, which is exactly what we are trying to assess.

 

The optimal difficulty is considered to be halfway between getting the answer right by chance and everyone getting it right. To be precise, take the 100% success level (1.0), subtract the chance level (.25 for a four-choice item) and divide the result by 2 to obtain the halfway distance (.375); then add the chance level (.25) back to get the optimal item difficulty of 0.625 for a four-choice item. It is best to include items of varying difficulty so that the test can discriminate at several levels; a few easy items, for example, can help control test takers' anxiety, which in turn increases reliability.
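
A sketch of the optimal-difficulty calculation described above, generalized to any number of answer alternatives; the four-choice case reproduces the 0.625 value.

```python
def optimal_item_difficulty(n_choices):
    """Chance level (1/n) plus half the distance from chance to 1.0."""
    chance = 1.0 / n_choices
    return chance + (1.0 - chance) / 2

print(optimal_item_difficulty(4))   # 0.625 for a four-choice item
print(optimal_item_difficulty(2))   # 0.75 for a true-false item
```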

 

Item discriminability is another way to assess item quality: it looks at the relationship between performance on a particular item and performance on the test as a whole. In other words, it tests whether people who did well on a certain item also did well on the test in general. There are several ways to assess discriminability.

 

The extreme group method compares those who performed well on the test with those who performed poorly. You find the proportion of people in each extreme group who got the item right and take the difference between the two proportions; this difference is called the discrimination index. When the index is a positive number clearly above 0, the item is considered good; an index near 0 means no discriminability, and a negative index indicates a bad item.

 

The point biserial method is assessing the correlation between an item and a total test score:

rpbis = [(Y1bar − Ybar)/Sy] × √(Px/(1 − Px)), where:

rpbis = the point biserial correlation or index of discriminability

Y1bar = the mean score on the test for those who got item 1 correct

Ybar = mean score on the test for all persons

Sy = the standard deviation of the exam scores for all persons

Px = the proportion of persons getting the item correct (Allen & Yen, 1979)
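
A sketch of the point-biserial index of discriminability from the formula above; the names item_correct and total_scores, and all values in them, are invented for the example.

```python
def point_biserial(item_correct, total_scores):
    """r_pbis = ((Y1bar - Ybar) / Sy) * sqrt(Px / (1 - Px))."""
    n = len(total_scores)
    y_bar = sum(total_scores) / n                               # Ybar
    s_y = (sum((y - y_bar) ** 2 for y in total_scores) / n) ** 0.5   # Sy
    passed = [y for y, c in zip(total_scores, item_correct) if c]
    y1_bar = sum(passed) / len(passed)                          # Y1bar
    p_x = len(passed) / n                                       # Px
    return ((y1_bar - y_bar) / s_y) * (p_x / (1 - p_x)) ** 0.5

item_correct = [1, 1, 0, 1, 0, 0, 1, 0]            # invented item responses (1 = correct)
total_scores = [88, 75, 62, 91, 58, 66, 80, 54]    # invented total exam scores
print(round(point_biserial(item_correct, total_scores), 3))
```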

 

Note that it is unwise to use the point biserial correlation for tests with only a few items, since the item's own score necessarily contributes to the total score. To examine an item further we use the item characteristic curve: the total test score is plotted on the X-axis and the proportion of test takers who got the item correct on the Y-axis. A gradual positive slope, where the proportion of people passing the item increases as test scores increase, indicates a good item because it discriminates at all levels of performance.

 

A flat line indicates that test takers of all ability levels were equally likely to answer correctly - the mark of a poor item.

A curve that gradually rises and then starts turning down for people at the highest levels of performance indicates that those with the best overall scores did not have the best chances of getting the item correct. This often happens with the “none of the above” alternatives.

 

Item response theory is an approach to testing that analyses items by considering the probability of getting each item right or wrong given the ability level of each test taker. The biggest advantage of this approach is that a person's score is defined not by the total number of correct answers but by the difficulty of the items the person answered correctly. Another crucial advantage is how easily it can be adapted to computer administration.

 

Criterion-referenced testing compares performance with a specific criterion for learning: for example, Annie's score of 77 (out of 100) on a maths test is compared not to the rest of her class but to how much she "should have" learned. The first step in using this type of testing is to clearly define the learning outcome to be achieved. To properly evaluate a criterion-referenced test, one should study two groups of students: those exposed to the learning programme and those who were not.

 

The main issue with item analysis is that, although the statistics help the test maker decide which items are good and which are not, they do not in themselves contribute to the students' successful learning.

 

 

Chapter 10 - Theories of Intelligence and the Binet Scales

 

The difficulty of defining Intelligence

Alfred Binet defined intelligence as: "The tendency to take and maintain a definite direction; the capacity to make adaptations for the purpose of attaining a desired end, and the power of autocriticism." Others, such as Spearman and Freeman, held different views, which shows how hard it is to define intelligence in only one way. Taylor (1994) describes three streams of research on intelligence: the psychometric approach, which examines the fundamental structure of a test; the information-processing approach, which emphasizes the underlying processes by which humans solve problems; and the cognitive tradition, which focuses on how humans adapt to real-world demands. Binet's view falls within the psychometric approach. It was, and is, widely known that people accomplish remarkable things and differ in this capability, which points to the existence of intelligence; the main problem was how to define it. Binet, for example, was initially undecided about what he actually wanted to measure, yet alongside the definition above, he and his colleagues developed the first intelligence test.

 

Binets Principles of Test Construction

In Binet's view, intelligence is the capacity to find and retain a purpose or direction, to adapt one's strategy as necessary in order to achieve that purpose, and to criticize oneself so that the strategy towards the goal can be adjusted. Binet and his colleagues worked on developing ways to measure judgment, reasoning and attention. Binet laid the foundation for later tests of human ability and was guided by two now well-known concepts: age differentiation and general mental ability.

 

Age differentiation refers to the fact that older children are, on average, more capable than younger ones. Binet used tasks with which he could estimate a child's mental ability by comparing the child's result with that of an "average" child of a specific age; in this way one can determine a child's ability level independently of the child's chronological age. This was later called mental age. Binet also decided to measure only the total product of the distinct elements of intelligence, which he named general mental ability. He most likely chose this to spare himself from having to define every separate element of intelligence, making the approach more practical.

 

General Mental Ability - Spearman

Like Binet, Spearman used the notion of general mental ability as the basis of all intelligent behaviour. According to Spearman's theory, intelligence consists of a general intelligence factor, g, together with a large number of specific factors. Spearman's idea of general mental ability (which he called psychometric g, or simply g) was grounded in the observation that if one administers many different ability tests to an unbiased sample, almost all of the correlations among them turn out positive. This is referred to as the positive manifold which, Spearman argued, results from all the tests being influenced by g.

 

Factor analysis was introduced by Spearman as a way to support the notion of g statistically. Simply put, factor analysis reduces a set of variables (scores) to a smaller number of factors. Spearman claimed that as much as 50% of the variance in mental-ability tests is accounted for by g, a notion still in use today.

 

Present-day theories, however, tend to emphasize multiple intelligences rather than a single one. The "gf-gc" theory proposes two basic types of intelligence: fluid and crystallized. Fluid intelligence refers to the abilities that enable us to acquire new knowledge, reason and think, while crystallized intelligence is the knowledge and understanding we have already acquired.

 

Binet scale history (including Terman's Stanford-Binet Intelligence Scale)

The Binet scales went through many revisions and new forms. The 1905 Binet-Simon scale was a 30-item individual intelligence test with items of increasing difficulty. By this time Binet had solved two problems from his earlier work: he now knew exactly what he wanted to measure, and he had developed items to support those measurements. However, the Binet-Simon scale still lacked several things, among them a precise measuring unit and normative data that could support its validity. The norms for the 1905 scale were based on only 50 children, who were considered "normal" on the basis of their school performance.

 

In 1908, Binet and Simon incorporated the idea of age differentiation into their work, making the 1908 scale an age scale: the items were grouped by age level rather than simply ordered by increasing difficulty as before. This scale met a few challenges as well, because when items are grouped by a child's age level, comparing performance on different kinds of tasks becomes more difficult. Despite these challenges, the 1908 scale was a definite improvement over the 1905 version. Its main criticism was its strong focus on verbal and language ability. With the introduction of the mental age concept, Binet began to address the problem of the missing unit for expressing results. In short, the Binet-Simon scale contributed two crucial concepts: the age scale format and the mental age concept.

 

Terman's Stanford-Binet scale was developed by L. M. Terman and was the leading intelligence scale from 1916 onward, through its later revisions. The 1916 Stanford-Binet retained age differentiation, the age scale format and general mental ability, and the mental age construct was kept as well. What changed was the larger standardization sample; however, it consisted only of white, native Californian children, so it was not truly representative.

 

Furthermore, the 1916 scale introduced the concept of the IQ, or intelligence quotient. The IQ used the person's mental age together with the chronological age to obtain a ratio score, which was regarded as a reflection of the person's rate of mental development.

IQ = MA/CA*100, where MA is mental age and CA is chronological age. The result is multiplied by 100 to avoid fractions. A problem was that the scale had a maximum mental age of 19.5 years, so people older than that obtained unusually low IQs. The chronological age entered into the formula was therefore capped at 16, because mental age was believed to stop developing after about age 16.
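
A sketch of the ratio IQ computation with the chronological age capped as described above; the ages used are invented, and the cap is an assumption based on the convention mentioned in the text.

```python
def ratio_iq(mental_age, chronological_age, ca_cap=16):
    """Ratio IQ = MA / CA * 100, with CA capped (here at 16, per the convention above)."""
    ca = min(chronological_age, ca_cap)
    return mental_age / ca * 100

print(round(ratio_iq(6, 8)))     # a child of 8 performing like an average 6-year-old: 75
print(round(ratio_iq(14, 30)))   # an adult: CA capped at 16 -> about 88
```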

 

The 1937 scale contained several further improvements. The age range was extended downward to 2 years, and the maximum mental age was extended to 22 years and 10 months by adding new tasks. The standardization sample was greatly improved: the norms were now based on a sample drawn from 11 US states, though this still did not make it fully representative. The inclusion of an alternate, equivalent form made it easier to examine the psychometric properties of the scale. The 1937 scale did, however, have a major problem: its reliability coefficients were higher for older subjects than for younger ones, and reliability was higher at the lower end of the IQ scale than at the higher end. Scores were, and are, least stable for the youngest age groups at the highest IQ levels. Apart from the reliability problems, the different age groups showed significant differences in the standard deviation of IQs, so IQs at one age level could end up different from IQs at another (e.g. the standard deviation at age 6 was 12.5, whereas at ages 2.5 and 12 it was 20.6 and 20).

 

The 1960 Stanford-Binet revision and deviation IQ (SB-LM) attempted to build on the best features of the 1937 scale: the fact that test scores increased with age and that some tasks correlated highly with total test scores. In addition, the IQ tables were extended to age 18, and scoring and test administration were improved. The major problem of the previous scale, the differing variability of IQs across age levels, was solved by introducing the deviation IQ: a standard score with a mean of 100 and a standard deviation of 16 (now 15). New tables corrected for the differences in variability at different age levels, which made it possible to compare IQs at one age with those at another. Several further revisions followed, but there was a return to the 1960 model because of its better instructional value.

 

Looking at the Modern Binet Scale

The modern Binet scales incorporate the gf-gc theory, which holds that we possess multiple intelligences rather than a single one. The model is hierarchical, with g at the top. Under g are three group factors: crystallized abilities, which reflect learning, i.e. the realization of one's original capacity; fluid-analytic abilities, the original capacity one uses to acquire crystallized abilities; and short-term memory, the information one can retain briefly after a single presentation. Crystallized ability has two sub-abilities: verbal and nonverbal reasoning. Thurstone's multidimensional model rests on his argument that, in contrast to Spearman's idea of intelligence as a single construct, intelligence is best understood as a set of independent factors, or "primary mental abilities".

 

The 1986 revision kept much of the previous versions; however, the age scale was removed entirely. In its place, items of similar content are grouped together into one of 15 tests to create point scales (e.g. all language items grouped in one test).

 

The 2003 fifth edition again uses a hierarchical model with five factors and general intelligence at the top, as in the 1986 version; now, however, each of the five factors is a main factor with both a verbal and a nonverbal measure. The fifth edition integrates the point-scale and age-scale formats. The nonverbal and verbal scales carry equal weight, and testing starts with one of two routing subtests, verbal or nonverbal, each a point scale of similar content arranged in increasing difficulty. The routing subtests serve to estimate the test taker's ability: the nonverbal subtest examines nonverbal ability and the verbal subtest verbal ability. Because the verbal and nonverbal scales are equally weighted, the test taker's performance can be evaluated on all items of similar content.

 

The level of ability we initially estimate for a person is called the start point. The basal is the level at which a minimum number of correct responses is obtained, and the ceiling is the point at which a specified number of wrong answers indicates that the items have become too difficult. A main aim of the fifth edition is to restore the ability to assess the extremes of intelligence, a valuable property of the Binet that was lost in the fourth edition.
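The start point, basal, and ceiling rules amount to a simple adaptive-administration loop. The sketch below (Python) only illustrates that logic; the thresholds of two consecutive passes for the basal and three consecutive failures for the ceiling are assumptions made for the example, not the actual fifth-edition rules.

```python
# Illustrative only: given a record of pass/fail responses ordered by item
# difficulty, locate the basal and ceiling. The thresholds used here are
# invented for the sketch.

def basal_and_ceiling(responses, basal_needed=2, ceiling_errors=3):
    """responses: list of booleans, True = item passed, ordered easy -> hard."""
    basal = ceiling = None
    pass_run = fail_run = 0
    for i, passed in enumerate(responses):
        if passed:
            pass_run += 1
            fail_run = 0
            if pass_run == basal_needed and basal is None:
                basal = i - basal_needed + 1  # lowest item of the passing run
        else:
            fail_run += 1
            pass_run = 0
            if fail_run == ceiling_errors:
                ceiling = i  # items beyond this point are too difficult
                break
    return basal, ceiling

print(basal_and_ceiling([True, True, True, False, True, False, False, False]))
# -> (0, 7): basal established at the first item, ceiling reached at the last
```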

The age range now spans from 2 to 85+ years, and scores range from 40 to 160, making the scale very useful for assessing the extremes of intelligence. The reliability of the fifth edition is regarded as good, with full-scale IQ coefficients of .97 or .98 across all 23 age ranges reported in the manual. In addition, the manual reports four types of evidence supporting the validity of the test: content validity, an empirical approach to item analysis, criterion-related evidence of validity, and construct validity.

Chapter 11 – The Wechsler Intelligence Scales: WAIS-IV, WISC-IV, and WPPSI-III

 

Wechsler scales up to the WAIS-III: scales, subtests, indexes, and interpretive features

Wechsler argued that intellectual ability is not the only factor involved in intelligent behaviour; nonintellective factors play a part as well. Three Wechsler intelligence tests are available: the Wechsler Adult Intelligence Scale, Third Edition (WAIS-III), the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV), and the Wechsler Preschool and Primary Scale of Intelligence, Third Edition (WPPSI-III).

 

The Wechsler scales differ considerably from the core concepts of the Binet. Wechsler believed that because Binet’s scales were designed for children, they would lack validity when used to test adults. Two major differences between the approaches are Wechsler’s use of a point scale instead of an age scale and his inclusion of a performance scale. With a point scale, each item is assigned points and the person earns a certain number of points for each item completed. The advantage of this approach is that it makes it simple to group items of similar content, so the test produces scores for each specific area; this approach is now standard. The performance scale was a new component that provides a measure of nonverbal intelligence: it consists of tasks that require the person to do something rather than merely answer questions.

 

The original Wechsler scale included two separate scales: the verbal scale measured verbal intelligence and the performance scale measured nonverbal intelligence. Current Wechsler versions include four major scales. The performance scale was not entirely Wechsler’s innovation, as performance tasks had been used in various forms before as alternatives to the Binet. Wechsler’s innovation was to allow a direct comparison of nonverbal and verbal intelligence, since the verbal and performance scales were standardized on the same sample and the results were expressed in comparable units. The purpose of the performance scale is to overcome biases caused by differences in language, education, or culture. It took several attempts for the Wechsler scale to reach a satisfactory form: the first version, the Wechsler-Bellevue, was poorly standardized on a sample of about 1,000 white residents of the eastern United States (mostly from New York). The WAIS-III was published in 1997 and will likely be revised again before long.

 

Like Binet, Wechsler saw intelligence as the capacity to act purposefully toward a goal and to adapt to the environment. In his view, however, the elements that make up intelligence are not independent but interrelated; he used the terms global and aggregate to express this. Intelligence consists of many interrelated elements, and general intelligence is the outcome of their interaction. Unlike Binet, Wechsler concentrated on several distinct constructs in explaining general mental ability. The WAIS-III has seven verbal subtests: vocabulary, similarities, arithmetic, digit span, information, comprehension, and letter-number sequencing.

 

The vocabulary subtest requires the person to define presented words. It is one of the best and most stable measures of intelligence, which is a crucially important feature: in a patient who has suffered brain damage, the vocabulary subtest is among the least affected.

 

The similarities subtest presents the person with 15 paired items of increasing difficulty and requires the person to state how the two items in each pair are alike. Some items require the subject to think quite abstractly and to notice the similarity between constructs that are not obviously alike.

 

The arithmetic subtest presents the test taker with 15 relatively simple problems that do not require complex mathematical knowledge but do require the ability to hold the relevant information in mind until the calculation is complete. Memory, motivation, and good concentration are therefore essential for performance on this subtest.

 

The digit span subtest requires the person to repeat digits presented at a rate of one per second and measures short-term memory capacity. Bear in mind that with the Wechsler scales there are always nonintellective factors that can influence performance; in this case attention is one, and anxiety is another.

 

The information subtest involves both intellective and nonintellective components, including the need to understand what is being asked, follow instructions, and provide a response. Nonintellective factors here include curiosity and the drive to acquire knowledge. Performance on this subtest is also influenced by alertness to the environment and by cultural opportunities.

 

The comprehension subtest presents the person with three types of questions: first, questions that require the person to decide what should be done in a specific situation; second, questions that require a logical explanation of some phenomenon; and third, questions that require the person to define or explain a proverb. In general, this subtest provides information about the person’s understanding of everyday practical situations, or common sense. A problem can arise if emotional involvement affects the person’s judgment and leads to an unsuitable response. The letter-number sequencing subtest is one of the newer WAIS-III subtests; it consists of seven items in which the person must reorder a mixed list of letters and numbers, and it provides information on attention and working memory. To obtain a Verbal IQ (VIQ), one sums the age-corrected scaled scores from the verbal subtests just mentioned.

 

WAIS-III has seven performance subtests and those are: picture completion, digit symbol-coding, block design, matrix reasoning, picture arrangement, object assembly and symbol search.

 

The picture completion subtest consists of pictures, each missing an important part, and the person is asked to identify the missing part. The task is timed.

 

The digit symbol-coding subtest asks the person to copy symbols that are paired with the digits 1 through 9. It assesses the person’s capacity to learn an unfamiliar task, persistence, and speed of performance, and it also measures visual-motor dexterity.

 

The block design subtest includes nine coloured blocks and a booklet showing the blocks arranged in particular geometric patterns. The person is asked to arrange the blocks to reproduce patterns of increasing difficulty. The input is visual and the required response is motor; the subtest is a good nonverbal measure of abstract thinking.

 

The matrix reasoning subtest was added to the WAIS-III as a way of assessing fluid intelligence, which incorporates the ability to reason. The person is shown nonverbal, figural stimuli and must identify a pattern or relationship among them. The subtest is a good measure of abstract reasoning and of how well a person processes information.

 

The picture arrangement subtest requires the person to notice the relevant features of a set of pictures and understand their cause-and-effect relations. The person must place the scrambled pictures in the correct order so that they tell a story; the subtest thus assesses the capacity to work out a logical sequence of events.

 

The symbol search subtest assesses speed of information processing. The person is shown target symbols and must scan a group of symbols to report whether the targets are present or not.

The performance IQ (PIQ) is obtained by summing the age-corrected scaled scores of the performance subtests and comparing the sum with the standardization sample. The full-scale IQ (FSIQ) follows the same principle as the VIQ and PIQ: the age-corrected scaled scores from the verbal and performance scales are summed and compared with the standardization sample.
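A rough sketch of that scoring logic in Python; the subtest names come from the text, but the scaled scores and the conversion table are invented purely for illustration, since the real conversions come from the normative tables in the manual.

```python
# Sketch of composite scoring: sum age-corrected scaled scores for a scale,
# then convert the sum to a deviation IQ via a normative table. Values are
# invented for the example.

verbal_subtests = ["vocabulary", "similarities", "arithmetic", "digit span",
                   "information", "comprehension", "letter-number sequencing"]

scaled_scores = {name: 10 for name in verbal_subtests}  # 10 = average performance

verbal_sum = sum(scaled_scores[name] for name in verbal_subtests)

# In practice the manual supplies a lookup table from the sum of scaled scores
# to a deviation IQ (mean 100, SD 15); here we fake a tiny slice of one.
sum_to_viq = {68: 98, 69: 99, 70: 100, 71: 101, 72: 102}

print(sum_to_viq[verbal_sum])  # -> 100 for a uniformly average profile
```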

 

Index scores are another way of summarizing performance. There are four: verbal comprehension, perceptual organization, working memory, and processing speed. The verbal comprehension index is considered a good measure of crystallized intelligence; it is regarded as purer than the VIQ because it excludes the arithmetic subtest, which has more to do with working memory. The perceptual organization index is regarded as a good measure of fluid intelligence. One of the biggest innovations of the WAIS-III is the working memory concept, which refers to the information we can hold in mind for a short time in order to work with it. Finally, the processing speed index assesses how quickly the mind works: one person may need 30 seconds for a task that another completes in 5.

 

The ability to compare verbal and performance IQs directly is one of the most useful features of the WAIS-III relative to the Binet scales, although the influence of ethnic background needs to be taken into account. Another useful technique is pattern analysis, in which large differences among subtest scores are examined and described. For example, certain emotional problems may affect performance on particular subtests and thereby produce characteristic score patterns; if such patterns recur across testings, we may be able to draw conclusions about the test taker. Research on pattern analysis has produced contradictory results, so this kind of analysis must be done very carefully.

 

Psychometric features and evaluation of Wechsler

The WAIS-III standardization sample consists of 2,450 adults divided into 13 age groups from 16-17 up to 85-89 years. Race, gender, level of education, and geographic region were taken into account. The reliability of the WAIS-III is quite high, with internal-consistency and test-retest estimates reported for the verbal, performance, and full-scale IQs. The average coefficients across all age levels range from .94 for the PIQ to .98 for the FSIQ (VIQ = .97).

The standard error of measurement (SEM) is derived from the reliability coefficient and estimates the discrepancy between the observed score and the score a perfectly reliable instrument would yield. The validity of the WAIS-III rests largely on its correlations with previous versions, especially earlier revisions; in general the correlations are higher for the FSIQ, VIQ, and PIQ and lower for the individual subtests. Note that according to the theory of multiple intelligences we possess at least seven relatively independent intelligences: interpersonal, intrapersonal, linguistic, bodily-kinaesthetic, spatial, musical, and logical-mathematical. The WAIS-III does not support this theory and leaves little room for such an idea.
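The SEM described above can be written directly as the test's standard deviation scaled by the square root of the unreliable portion of the variance. A minimal sketch in Python, assuming the usual Wechsler metric of SD = 15 and using the average full-scale reliability quoted above:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r): the expected spread of observed scores
    around the (unobservable) true score."""
    return sd * math.sqrt(1.0 - reliability)

# Full-scale IQ: SD = 15, average reliability about .98 (figure quoted above).
sem = standard_error_of_measurement(15, 0.98)
print(round(sem, 2))            # ~2.12 IQ points
print(f"68% band: +/-{sem:.1f}")  # an observed FSIQ of 100 -> roughly 98-102
```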

 

Extensions – the WISC-IV and the WPPSI-III

The extensions of the WAIS-III are the WISC-IV and the WPPSI-III.

The WISC-IV is the most recent version of the children’s scale and measures global intelligence along with indexes of specific cognitive abilities, processing speed, and working memory. It introduced some innovations, such as the assessment of fluid reasoning, with emphasis on the working memory and processing speed concepts. Furthermore, the WISC-IV uses empirical data to evaluate item bias; although bias cannot be removed entirely, the empirical approach works in that direction. The standardization sample contains 2,200 children, with age, race, region, parental education, and parental occupation taken into account. Interpretation of the WISC-IV is very similar to that of the WAIS-III and involves evaluating the four major indexes in order to detect weaknesses in any area. Reliability is lowest for the youngest children at the highest ability levels. Validity is well supported, as it is for the competing Binet scale, and the sound standardization contributes to this.

 

The WPPSI-III is another extension of the WAIS-III and had been revised several times before this version was published. It contains almost all of the WISC-IV components, such as the five composites, but not the PIQ and VIQ. Reliability is comparable to that of the WISC-IV, and validity is well supported in the manual. The scale is more sensitive in measuring the abilities of young children with limited language ability than its older equivalents were.

Chapter 13 – Standardized Tests in Education, Civil Service, and the Military

Introduction

During your time as a student you have doubtless encountered a standardized test, perhaps the GRE Revised General Test (GRE), the SAT Reasoning Test (SAT-I), or even the Goodenough-Harris Drawing Test. Many universities handle admissions through standardized group entrance exams. The key issue for these standardized tests is the test criterion, i.e. what the test is trying to predict, and defining it can prove difficult. The GRE, for example, which is widely used in admissions to postgraduate programs, does not predict the capacity to solve real-world problems or clinical skill.

While the tests discussed in this chapter improve the accuracy of selection decisions, it is important to note that they account for only a small proportion of the variance in the criterion.

 

Comparison of Group and Individual Ability Tests

Individual tests and group tests each have their own advantages and disadvantages. Individual tests are administered by a single examiner to a single subject. The examiner follows the instructions provided in the manual of the standardized test and, in the response-recording phase, writes down exactly what the subject says or does. These responses are then evaluated, a process that can require a high degree of skill. In contrast, a single examiner can administer a group test to many individuals at the same time: the examiner reads the instructions and sets the time limits, subjects record their own responses, and scoring usually requires very little skill.

 

If a subject is experiencing distress for any reason, be it fear, stress, or an uncooperative attitude, the examiner in an individual test takes responsibility for eliciting maximum performance. In a group test, by contrast, it must simply be assumed that the subject is motivated and cooperative. For this reason, low scores on group tests can be difficult to interpret: they can be attributed to a wide range of factors such as low motivation, clerical errors, or misunderstanding of the instructions.

 

Advantages of Individual Tests

Through individual tests it is possible to learn more about a subject than just a test score. Over time, examiners develop internal norms, and with these they can readily identify unusual reactions to particular tasks or situations. The individual test thus provides a chance to observe behaviour in a standardized setting and allows the examiner to see beyond the test score in a unique way.

 

Advantages of Group Tests

Compared with individual tests, group tests are more cost efficient, require less expensive materials, and demand less examiner skill. They are usually more objective, because subjects record their own responses, and therefore tend to be more reliable. Individual tests are mostly used in clinical settings, whereas group tests are used in much broader settings: they are common at all levels of schooling, and the military, industry, and research also rely on them heavily.

 

Overview of Group Tests

Characteristics of Group Tests

For the most part, group tests can be categorized as paper-and-pencil or booklet-and-pencil tests, since most consist of a printed booklet, a test manual, a scoring key, an answer sheet, and a pencil. This is changing, however, as computerized testing increasingly replaces paper and pencil. The number of group tests far exceeds the number of individual tests. Group test raw scores are generally converted to percentiles or standard scores, though a few are converted to ratio or deviation IQs.
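As a minimal sketch of those conversions (Python), assuming normally distributed norms; the norm-group mean and standard deviation, and the raw score, are invented for the example.

```python
# Converting raw group-test scores to standard scores and percentile ranks.
# Norm parameters and the scores themselves are invented for the example.
from statistics import NormalDist

norm_mean, norm_sd = 50.0, 10.0   # hypothetical norm-group parameters

def standard_score(raw: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Re-express a raw score on a scale with the chosen mean and SD."""
    z = (raw - norm_mean) / norm_sd
    return mean + sd * z

def percentile_rank(raw: float) -> float:
    """Percentile of the raw score, assuming normally distributed norms."""
    z = (raw - norm_mean) / norm_sd
    return 100.0 * NormalDist().cdf(z)

print(standard_score(60))             # 115.0 -> one SD above the norm mean
print(round(percentile_rank(60), 1))  # ~84.1
```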

 

Selecting Group Tests

Because of the sheer number of group tests available, the test user is assured a selection of well-documented and psychometrically sound instruments. Ability tests used in schools, in particular, tend to be very reliable.

 

Using Group Tests

The tests to be discussed are almost as reliable and as soundly standardized as the best individual tests. As with some individual tests, however, validity data for some group tests are weak, meagre, or contradictory, and sometimes all three. When working with group test results, the following cautions should be exercised. Use results with caution: avoid over-interpretation, do not treat scores as absolute or in isolation, and be careful when using results for prediction. Be especially suspicious of low scores: many factors can contribute to a low score, so be aware of them. Consider wide discrepancies a warning signal: large discrepancies among test scores, or between test scores and other data, may be a sign that all is not well with the individual. When in doubt, refer: in the case of low scores, wide discrepancies, or reason to doubt validity, the best option is to refer the subject for individual testing.

 

Group Tests in the Schools: Kindergarten Through 12th grade

The goal of tests aimed at schools is to measure educational achievement in children.

Achievement Tests Versus Aptitude Tests

Achievement tests aim to ascertain what an individual has learned following specific instruction. These tests measure how much a student has learned after training has been provided, and their validity is determined by content-related evidence: the test is said to be valid if it adequately samples the domain of the construct being assessed.

Aptitude tests, on the other hand, aim to measure an individual’s potential for learning. A wide variety of experiences are evaluated in a multitude of ways. The validity of an aptitude test is determined by its ability to predict future performance, so these tests rely extensively on criterion-related evidence.

 

Group Achievement Tests

The Stanford Achievement Test (SAT) is renowned as being one of the oldest standardized achievement tests still widely used within the education system. The SAT is in its 10th edition and is currently well normed and criterion referenced, with outstanding psychometric documentation. It primarily evaluates achievement in kindergarten to 12th grade in a variety of areas.

The Metropolitan Achievement Test (MAT) is another well standardized and psychometrically sound group measure of achievement. This test measures achievement in reading by assessing word recognition, vocabulary, and reading comprehension. Versions of this test include Braille, large print, and audio formats.

The MAT and the SAT are the pinnacle of modern achievement testing. These tests are psychometrically well documented, reliable, and normed on large samples. Both sample a wide variety of educational factors and cover all grade levels.

 

Group Tests of Mental Abilities (Intelligence)

Kuhlmann-Anderson Test (KAT) – Eighth Edition

The Kuhlmann-Anderson Test (KAT) is a group intelligence test applicable from kindergarten through 12th grade. The test has eight separate levels, each with a variety of items. Unlike most tests, the KAT does not become more verbal at higher age levels; it remains primarily nonverbal throughout. This makes the KAT suitable not just for young children but also for individuals who may be handicapped in following verbal procedures, and it may even prove suitable for non-English-speaking populations after proper norming. KAT results can be expressed as verbal, quantitative, and total scores, and also as percentile bands. A percentile band gives the range of percentiles that most likely contains a subject’s true score, much like a confidence interval. The KAT is a reliable, valid, sophisticated test, and its nonverbal qualities make it a strong candidate for testing non-native English speakers.

 

Henmon-Nelson Test (H-NT)

The Henmon-Nelson Test (H-NT) of mental abilities is another widely used test applicable to all grade levels. It produces a single score, which is thought to measure general intelligence; this has been, and continues to be, a source of controversy. Nevertheless, it remains a quick predictor of future academic success. Because it yields only a general intelligence score, the H-NT does not take multiple intelligences into account. The H-NT manual also calls for caution when testing individuals from educationally disadvantaged backgrounds, and research has shown that the H-NT tends to underestimate Wechsler full-scale IQ scores by 10 to 15 points for a number of populations.

Cognitive Abilities Test (COGAT)

In terms of reliability and validity, the COGAT is similar to the H-NT. It provides three scores: verbal, nonverbal, and quantitative. Unlike the H-NT, the COGAT was designed with poor readers, poorly educated individuals, and non-native English speakers in mind. Research has shown that the COGAT is a sensitive differentiator of giftedness, a fine predictor of future performance, and a good measure of verbal underachievement. However, it is very time consuming, there is uncertainty about whether its norms are representative, and minority students have been found to score lower than white students across the test batteries and grade levels. For these reasons, great care should be taken when scores are used with minority populations.

 

College Entrance Tests

The SAT Reasoning Test (SAT)

Formerly known as the Scholastic Aptitude Test, the SAT Reasoning Test (SAT-I) is still the most widely used university entrance test. The SAT was renormed in 1994 in an attempt to restore the national average to the 500-point level it had in 1941. More recent changes increased the number of scored sections to three, each scored from 200 to 800 points. This should lead to fewer interpretation errors, since interpreters no longer rely on old versions as points of reference. At 3 hours and 45 minutes, the modern SAT is an endurance race that rewards determination, motivation, stamina, and persistent attention. The SAT is a good predictor of first-year college GPA.

 

Cooperative School and College Ability Tests (SCAT)

The SCAT, developed in 1955, is second in use only to the SAT; however, it has not been substantially updated since its introduction. It covers the college level as well as three precollege levels, beginning at 4th grade. Its primary goal is to measure school-learned abilities and an individual’s potential to undertake further schooling. The SCAT’s psychometric documentation is neither as strong nor as extensive as the SAT’s, and revisions and extensions are encouraged, as it currently cannot compete with the SAT.

 

The American College Test (ACT)

The American College Test (ACT) is a widely used aptitude test for college entrants. Its biggest strength is that it is particularly useful for non-native English speakers. ACT results consist of specific content scores and a composite. The ACT predicts college GPA, alone or in combination with high-school GPA, about as well as the SAT does, although its internal consistency coefficients are not as strong as the SAT’s.

 

 

Graduate and Professional School Entrance Tests

Graduate Record Examination Aptitude Test (GRE)

The GRE is among the most widely used tests for graduate-school entrance; its primary measure is general scholastic ability. The test is administered throughout the year at examination centres across the globe and consists of three parts: verbal (GRE-V), quantitative (GRE-Q), and analytical reasoning (GRE-A). Based on Kuder-Richardson and odd-even reliability estimates, the GRE is stable, with coefficients only slightly lower than those of the SAT. False-negative rates are high, however, and the GRE has been found not to be a significant predictor for a group of Native American students. It also tends to over-predict the achievement of younger students and under-predict the performance of older students. Despite this, many schools have developed their own ways of using the GRE, either on its own or in combination with other sources of data. The best use of a GRE score is in conjunction with other data: when combined with GPA, graduate success can be predicted with greater accuracy. A related problem among colleges is grade inflation, which refers to rising average college grades even though average SAT scores are declining.

 

Miller Analogies Test

Similar to the GRE is the Miller Analogies Test, another measure of scholastic aptitude for graduate study. The difference is that this test is strictly verbal, so knowledge of specific content coupled with a good vocabulary is very useful. In terms of odd-even reliability the Miller Analogies Test is adequately reliable, but it lacks validity support. Like the GRE, it tends to over-predict the GPAs of younger students and under-predict those of older students.

 

The Law School Admission Test (LSAT)

Taken under extreme time pressure, the LSAT requires almost no specific knowledge, yet, like the Miller Analogies Test, it contains some of the most difficult problems one can encounter on a standardized test. The three types of problems on the LSAT involve reading comprehension, logical reasoning, and analytical reasoning. Every previously administered test since the format changed in 1991 is available for study. The LSAT has been found to be psychometrically sound, but researchers have raised concerns that the test is biased in favour of whites over blacks. This and other concerns have led to a 10-million-dollar initiative to increase diversity in American law schools.

 

Nonverbal Group Ability Tests

Raven Progressive Matrices (RPM)

The Raven Progressive Matrices (RPM) is among the most widely known and used nonverbal group tests. It can be used whenever an estimate of an individual’s intelligence is needed, though it is most commonly used in educational settings. The RPM instructions are very simple and can be given without the use of language, which is why the test is used throughout the world. It consists of 60 matrices, each containing a pattern with a piece missing. The RPM has the advantage of minimizing the effects of language and culture.

 

Goodenough-Harris Drawing Test (G-HDT)

Originally standardized in 1926 and restandardized in 1963, the Goodenough-Harris Drawing Test is one of the simplest, quickest, and most cost-efficient tests of nonverbal intelligence. Requiring only pencil and paper, subjects are asked to draw a whole man and to do the best job they can; they earn credit for each feature included in the drawing. Because it is so easy to administer, the test is widely used and gives a quick, rough estimate of a child’s intelligence. Caution is advised, however, as conclusions based purely on the G-HDT can be misleading.

 

The Culture Fair Intelligence Test

One goal of nonverbal tests has always been to reduce the influence of culture on scores. The Culture Fair Intelligence Test was designed with this in mind, to provide an estimate of intelligence relatively free of cultural and linguistic influences. Research shows that it does not succeed in this any better than other tests, but its popularity reflects the continuing desire for a test that reduces cultural factors. The test appears to work best for measuring the intelligence of Western European or Australian individuals. More work is needed if the Culture Fair Intelligence Test is to compete with the RPM.

 

Standardized Tests Used in the U.S. Civil Service

The General Aptitude Test Battery (GATB), which measures aptitudes relevant to a wide range of occupations, is widely used to assist employment decisions. The GATB has been the subject of controversy because it used within-group norming prior to the Civil Rights Act of 1991: women were compared only with other women, men only with other men, Latinos only with other Latinos, and so on. Within-group norming was defended on grounds of fairness, but it was outlawed after being labelled reverse discrimination.

 

Standardized Tests in the U.S. Military: The Armed Services Vocational Aptitude Battery (ASVAB)

The ASVAB is a test designed by the Department of Defense and administered to more than 1.3 million individuals per year. It consists of 10 subtests covering a wide range of content. The psychometric characteristics of the ASVAB are exemplary: it has been shown to be reliable and a valid predictor of performance during training for a variety of military and civilian occupations. The ASVAB has been moving away from the paper-and-pencil format in favour of computerized testing, which allows the test to adapt to the subject’s ability as testing proceeds.
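The adaptive idea in the last sentence can be illustrated very simply: give the item whose difficulty is closest to the current ability estimate and nudge the estimate up or down after each answer. This toy sketch (Python) shows only the principle; the actual ASVAB item-selection and scoring procedure is more sophisticated and is not reproduced here.

```python
# Toy adaptive-testing loop: not the actual ASVAB algorithm, just the principle.

def adaptive_test(item_difficulties, answer_fn, n_items=6, start_ability=0.0,
                  step=0.5):
    """item_difficulties: pool of difficulty values on an arbitrary scale.
    answer_fn(difficulty) -> True if the examinee answers correctly."""
    remaining = list(item_difficulties)
    ability = start_ability
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item closest in difficulty to the current estimate.
        item = min(remaining, key=lambda d: abs(d - ability))
        remaining.remove(item)
        if answer_fn(item):
            ability += step   # correct answer -> try something harder
        else:
            ability -= step   # wrong answer -> back off
    return ability

# Example: a hypothetical examinee who can answer anything easier than 1.2.
pool = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]
print(adaptive_test(pool, answer_fn=lambda d: d < 1.2))
```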

Chapter 14 - Projective Personality Tests

 

Hypothesis of projection

Projective tests are among the most controversial and most misunderstood psychological testing procedures, yet five of the ten testing procedures most used in clinical settings are projective techniques. The projective hypothesis, the basic concept behind projective tests, holds that when people interpret an ambiguous stimulus, their interpretation reveals something about their feelings, experiences, thoughts, and needs. The difficulty is that examiners can never be sure how to interpret test takers’ responses and what they report seeing. Some research supports the use of projective tests and their validity, but the findings are contradictory.

 

Rorschach inkblot

The Rorschach inkblot test has been described both as the most powerful psychometric instrument available and as something resembling a party game; the support and findings are highly ambiguous, yet the test is still widely used. Rorschach’s research on the inkblots began in 1911 and was later published in the famous book Psychodiagnostik. Initially the material was largely ignored, but over time use of the test became increasingly popular. Several investigators studied the Rorschach thoroughly, often disagreeing with one another, and each in effect developed his own system for administering and scoring the test. The Rorschach is an individual test: the test taker is presented with 10 cards, five black and grey; two black, grey, and red; and three of various colours. When shown a card, the subject is asked to say what it might be, and there are no rules about what kind of answer is expected. This lack of clear rules and structure about what is expected from the subject is the defining feature of projective tests: the examiner should remain as ambiguous as possible.

 

In the first phase, the free-association phase, the examiner presents the cards one at a time, and if the subject gives only one response the examiner may encourage more by saying something like “Most people see more than one thing” or “Take your time; people usually see something here.” In the second phase, the inquiry phase, the examiner presents the cards again and gathers the information needed to score the responses on five dimensions: location, determinant, form quality, content, and frequency of occurrence, all with respect to what the subject reported seeing in the inkblot.

 

To score location, a small version of the inkblot card, the location chart, is used. The examiner records whether the subject used the whole blot (W), a common detail (D), or an unusual detail (Dd). A confabulatory response (DW) occurs when the subject overgeneralizes from a part to the whole. Normal subjects tend to show a balance of W, D, and Dd responses; otherwise, problems are suspected. The examiner must also determine what led the test taker to see what he or she reported, which is the assessment of the determinant: was it movement, colour, shading, or shape? If only shape is used, the response is called a pure form response. The movement determinant is problematic because it is an ambiguous concept in this context, and identifying the determinant is regarded as the most difficult aspect of Rorschach scoring. Scoring content appears relatively simple: responses are mostly categorized as human (H), animal (A), or nature (N). Populars are the responses most frequently given by others. Form quality refers to the degree to which the response matches the actual features of the inkblot, and it is quite hard to score. Clearly, scoring the Rorschach is a difficult and complex process; to use it, one needs advanced graduate training.

 

Work on the psychometric properties of the Rorschach after the 1960s showed the test to be far less impressive than had been believed. The Comprehensive System for scoring the Rorschach then became prominent and remains widely accepted, but it has failed to remedy the Rorschach’s inadequacies. Research suggests that Rorschach results tend to identify more than half of normal people as emotionally disturbed, which is referred to as overpathologizing.

Another problem is that people who give more responses (Rs) to an inkblot tend to make use of the whole area surrounding the blot, which is not what the test intends. Furthermore, there are no fixed rules for administering the Rorschach. Reliability research has produced inconsistent results, and even where reliability is demonstrated, validity remains questionable.

 

An alternative to the Rorschach is the Holtzman Inkblot Test, which allows the test taker to give only one response per card and has standardized administration and scoring. Despite these apparent advantages, it is nowhere near as popular as the Rorschach.

 

TAT-Thematic Apperception Test

The Thematic Apperception Test (TAT) is comparable to the Rorschach on several levels. It became very popular after its appearance and is nowadays used more than any other projective test. It is based on Murray’s theory of needs, whereas the Rorschach is not grounded in any theory. The TAT is presented not as a diagnostic instrument but as an instrument for evaluating human personality characteristics, and it is regarded as one of the crucial techniques in personality research. It is more structured and less ambiguous than the Rorschach: there are 30 pictures and one blank card, some cards are intended for male subjects and others for female subjects, some are more appropriate for particular age groups, and some are appropriate for everyone. However, the standardization of administration and scoring procedures is as poor for the TAT as for the Rorschach, if not worse. In interpreting the TAT, one considers needs, press (environmental forces that affect the satisfaction of needs), themes (recurring elements such as depression), heroes (the figure the subject identifies with), and outcomes (success or failure). Because of the lack of standardization, the psychometric properties of the TAT are inconsistent; test-retest results are likewise inconsistent, and the validity findings are even murkier. Content validity has some research support, while criterion-related validity has been hard to establish.

Other projective procedures

Projective tests need not use pictures; they may instead use words or phrases as stimuli. In word association tests, the psychologist says a word aloud and the subject responds with whatever comes to mind first; the use of such tests is limited but still present. Sentence completion tasks present incomplete sentence stems (“I am…”, “Men…”) for the subject to finish. Psychometrically, perhaps the best projective test is the Washington University Sentence Completion Test (WUSCT), which gives insight into ego development, self-acceptance, autonomy, and so on. Figure drawing tests use expressive techniques in which the subject creates something, such as a drawing. One such test regarded as valid and useful in clinical settings is the Goodenough Draw-a-Man Test, which is simple and practical.

Chapter 17 (excluding pages 460-470) – Testing in Health Psychology and Health Care

 

Neuropsychological assessment measures

Clinical neuropsychology is a field of study that focuses on psychological impairments of the central nervous system and their treatment. It examines the relationships between brain functioning and behaviour, covering cognitive, emotional, and sensory processing; impairments of the spinal cord are also studied. The field combines psychiatric, psychometric, and neurological practices and is concerned with memory, learning, spatial recognition, language, attention, and similar processes.

 

Neuroimaging has provided clinical neuropsychology with remarkable opportunities for research and development. As neuroimaging techniques developed, it became clearer that individual brains differ in structure and organization. A cornerstone of clinical neuropsychology is the work of Broca and Wernicke, who studied the localization of speech in the brain. Neuropsychologists tend to specialize in particular disorders or age groups: most are concerned with brain dysfunction, while others work with brain injuries and similar problems.

 

Furthermore, memory is one of the most widely researched constructs within the field. Memory dysfunctions are assessed with the Wechsler Memory Scale-Revised (WMS-R), the RANDT Memory Test (RTM), the Memory Assessment Scales (MAS) and the Luria-Nebraska battery. Short-term memory, specifically, is best assessed with the use of verbal tests.

Recent research challenges the view that problems in functioning are tied to problems in a specific location in the brain; the newer view holds that complex functioning is regulated by neural systems rather than by specific structures.

 

One of the more intensively studied areas is the attempt to assess deficits of the left hemisphere relative to the right, and vice versa. The findings usually come from studies of brain damage or from stimulation carried out during surgery on patients who required it (e.g. for epilepsy). Many kinds of deficits and impairments are examined within clinical neuropsychology, such as Wernicke’s aphasia, the various apraxias, deficits of the information-processing system, and so on.

 

Developmental neuropsychology focuses on the deficits and complications children experience and on changes occurring over time. Studying children poses serious challenges: children are still developing, so some deficits may only become apparent much later, and children’s brains have a remarkable potential to reorganize after injury and actively recover, a process called plasticity. There is a high diversity of neuropsychological tests for children, with examples such as the Child Development Inventory, the Children’s State-Trait Anxiety Scale, and the Reynolds Depression Scale; these tests focus on measures of adaptation and development.

 

Another group of tests focuses on children’s attention and executive function; an example is the Trail Making Test, which assesses several cognitive skills including attention, sequencing, and thought processing. Attention and executive functioning are not considered the same construct. Executive function embraces the notion of volition: being able to form a specific goal and take action to achieve it, as well as self-monitoring and self-control. Mental processing, in turn, includes four factors believed to be related to different regions of the brain: focus-execute (the ability to scan information and respond to it appropriately), sustain (the capacity to remain attentive over a period of time), encode (the capacity to store information and recall it later), and shift (the ability to be flexible and shift attention).

 

Learning disabilities refer to neuropsychological problems with reading and speech. Dyslexia is a reading disability in which people have difficulty decoding single words; it may have a genetic basis or may result from difficulties in processing phonemes.

 

The Concussion Resolution Index (CRI) is a product of neuropsychological research used to track the recovery of athletes who have suffered a concussion.

 

Anxiety and stress assessment measures: State-trait Inventory, measures, coping measurements

Stress refers to a response to events and situations that place demands and constraints on us. The concept of stress can be divided into three components: frustration, conflict, and pressure. We are frustrated when the road to our goal is blocked; this can take physical as well as mental forms, such as being stopped at a club entrance or being rejected by a university or employer. Conflict-induced stress arises when we must make a decision such as choosing between two important alternatives. Finally, pressure arises when tasks must be sped up; it can be external, as when a boss sets a deadline, or internal, when we put pressure on ourselves to reach a goal on time.

 

Reactions to such stressful situations most commonly lead to anxiety, an emotional state marked by tension and worry. Physical changes occur as well, such as a pounding heart and sweating hands. Two types of anxiety can be distinguished: state anxiety, a reaction that changes from one situation to another, and trait anxiety, a personality characteristic that stays relatively unchanged across situations. The State-Trait Anxiety Inventory (STAI) is based on this theory and yields two scores, one for state anxiety (A-State) and one for trait anxiety (A-Trait). Validity and reliability for this inventory are high, and each component measures what it is intended to measure, since the two scales target different aspects of anxiety. The inventory is available in many languages and is suitable for different age groups.

 

Test anxiety manifests itself in two kinds of responses: task-relevant responses, which direct energy toward the goal of achieving a good grade, and task-irrelevant responses, behaviour that restricts performance. Anxious test takers usually respond in the second manner, which interferes with their performance.

 

The Test Anxiety Questionnaire (TAQ) was one of the first measures of test anxiety. To distinguish the motivational states present in the test-taking situation, a distinction is made between learned task drive, the motivation to give responses relevant to the task at hand, and learned anxiety drive, which consists of task-relevant and task-irrelevant responses. The reliability of the TAQ is known to be high; one criticism is that it deals much more with state anxiety than with trait anxiety.

 

Other test anxiety measures suggest that test anxiety actually has two different components: emotionality and worry. Emotionality refers to the physical responses experienced when taking a test, such as increased heart rate and muscle tension. Worry refers to mental preoccupation with possible failure and its personal consequences. Spielberger’s 20-item Test Anxiety Inventory treats worry as a trait that is consistent over time, whereas emotionality is the way arousal is expressed in a specific situation; the emotional component is thus treated as a state, or situational, aspect.

A 5-item version of this test was recently published.

 

Another approach to the same topic is the Achievement Anxiety Test (AAT). This is an 18-item scale that distinguishes two components of anxiety: facilitating and debilitating. Facilitating anxiety is a state that motivates the person to perform, whereas debilitating anxiety interferes with performance. The facilitating component gets a person to worry just enough to, for example, finish work before a deadline; in this respect facilitating anxiety is helpful while debilitating anxiety is not.

 

An important question is how different people cope with anxiety. One measure of this is the Ways of Coping Scale, a 68-item checklist with seven subscales concerned with problem solving, wishful thinking, advice seeking, growth, support seeking, threat minimization, and self-blame. These seven are classified as either problem-focused or emotion-focused. Problem-focused strategies involve active attempts to change the source of stress, while emotion-focused strategies do not try to change the course of the stressor but instead focus on managing the emotional responses it produces. The scale is widely used, but some research has failed to replicate the original findings.

 

Closely related to this measure is the Coping Inventory, a 33-item measure that first describes the attitudes people adopt to avoid stress, then lists strategies for dealing with stressful situations, and finally considers how much each strategy would help the person cope. It is used with both adolescents and adults.

 

Life quality assessment: Quality of life measure, measuring methods: SF-36, NHP, Decision theory approaches

The most common conceptions of health rest on two observations: first, most people agree that they do not want to die prematurely, so avoidance of death is one aspect of health; second, people value quality of life, so disease and disability matter because they affect the quality of life as well as its length.

 

Quality-of-life measurement is conceptualized in two different ways: the psychometric approach and the decision theory approach. The psychometric approach tries to provide separate measures for the many different dimensions of quality of life; the best-known example is the Sickness Impact Profile (SIP), a 136-item measure. The decision theory approach, by contrast, attempts to weight the different dimensions of health so as to provide a single, unified expression of health status. In the end, views and measures of quality of life tend to be regarded as highly subjective.

 

Several tools are available for measuring quality of life. The SF-36 is a commonly used measure that covers eight health concepts: physical functioning, bodily pain, general health perceptions, role-physical, social functioning, vitality, role-emotional, and mental health. Its advantages are that it is brief and has good reliability and validity; however, it contains no age-specific questions, which is an obvious problem when assessing health and quality of life.

 

The Nottingham Health Profile (NHP) is another approach and consists of two parts. The first part contains 38 items divided into six categories: energy, pain, sleep, physical mobility, social isolation, and emotional reactions. Items within each of these sections are weighted so that section scores range from 0 to 100. The second part includes seven statements relating to the areas of life most likely to be affected by health, such as social life, sex life, home life, hobbies, holidays, interests, employment, and household activities. The NHP has some support from reliability and validity studies.
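As an illustration of the 0-to-100 scaling in part one (Python): endorsed items contribute weights that sum to 100 within a section. The item labels and weights below are invented; only the mechanics are shown.

```python
# Illustrative NHP-style section score: each endorsed item contributes its
# weight, and the weights within a section sum to 100. Labels and weights
# here are made up for the sketch.

energy_weights = {"energy item 1": 39.0,
                  "energy item 2": 37.0,
                  "energy item 3": 24.0}   # sums to 100

def section_score(weights, endorsed):
    """0 = no problems endorsed in this section, 100 = all problems endorsed."""
    return sum(w for item, w in weights.items() if item in endorsed)

print(section_score(energy_weights, {"energy item 1"}))  # -> 39.0
```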

 

Chapter 16 - Applications in Clinical Psychology

 

Structured personality tests

Personality characteristics can be defined as nonintellective features of human behaviour and are essential in clinical and counselling settings. Personality is generally defined as a set of relatively stable and distinctive behavioural patterns that characterize the way an individual reacts to the world. Personality traits are enduring ways of acting, feeling, or thinking that distinguish one person from another. Personality types are general descriptions of people; for example, very sociable people tend to go out and engage in conversation a lot. Personality states refer to the different emotional reactions we show in different situations. Lastly, self-concept is the way we define ourselves, an organized and relatively consistent set of perceptions of ourselves (Rogers, 1959). Personality tests only began to develop around the First World War, when a test was needed to screen out people who were unfit for military service. Psychologists came up with self-report questionnaires, in which people report things about themselves by marking statements true or false according to whether they apply. The difference between the structured and the projective method of assessment is that in the structured case the person responds to a written statement (yes-no, true-false), whereas in projective assessment the stimulus itself is ambiguous and there are few constraints on how the person may respond.

 

 

Strategies for constructing structured personality tests are broadly classified as deductive or empirical. Deductive strategies comprise the logical-content and theoretical approaches, while empirical strategies comprise the criterion-group and factor-analytic methods. Sometimes these procedures are combined.

 

Deductive strategies use deductive logic and reason to determine the meaning of a test response. The logical-content strategy assumes that the test item describes the subject’s behaviour and personality: if a person responds “false” to a statement saying that he or she is friendly, the test administrator assumes the person is indeed not friendly. In the theoretical strategy, items must first be consistent with the underlying theory; this strategy tries to create a homogeneous scale and may use statistics to analyse the items.

 

Empirical strategies rely on data collection and statistical procedures to determine the meaning of a test response or the nature of personality and psychopathology. A feature of these strategies is their use of experimental research to interpret the meaning of a test response, to identify major dimensions of personality, or both. The criterion-group strategy begins with a criterion group of individuals who share a specific characteristic, such as depression or leadership. A set of items is then administered to this group and to a control group representative of the general population, and the two groups are contrasted so that the examiners can learn which items discriminate between them. The scale is cross-validated when it is shown to distinguish the two groups well in a new sample. The factor-analytic strategy uses factor analysis to assess the basic dimensions of personality empirically: it reduces the data to a small number of highly descriptive units, or factors, that account for as much of the variability in the data as possible.
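A minimal sketch of what the factor-analytic strategy does with item data, using scikit-learn's FactorAnalysis as a stand-in; the respondents and item responses are simulated, so the specific numbers mean nothing.

```python
# Data-reduction sketch for the factor-analytic strategy: many correlated item
# responses are summarized by a few underlying factors. Data here are random.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents, n_items = 500, 20

# Simulate item responses driven by two latent traits plus noise.
traits = rng.normal(size=(n_respondents, 2))
loadings = rng.normal(size=(2, n_items))
responses = traits @ loadings + rng.normal(scale=0.5, size=(n_respondents, n_items))

fa = FactorAnalysis(n_components=2)
fa.fit(responses)

# Each row of components_ is a factor; large absolute loadings show which
# items "hang together" and hence which descriptive dimension they define.
print(np.round(fa.components_, 2))
```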

 

Logical-content strategy

The logical-content strategy includes one of the first structured tests, the Woodworth Personal Data Sheet. It was developed during the First World War to identify people who were emotionally unfit for combat. Items were selected on a logical-content basis, with two additional features: items endorsed in the scored direction by 25% or more of a normal sample were excluded, which reduced false positives (people identified as unfit who actually were not), and only symptoms reported about twice as often by disturbed individuals as by normal individuals were included. Two other well-known tests of the same category were the Bell Adjustment Inventory and the Bernreuter Personality Inventory, which set the grounds for many modern multidimensional tests that provide multiple scores rather than a single one. The main criticism of the logical-content approach is that subjects are not able to evaluate their own behaviour objectively, and even when a response is close to accurate there is a risk that the item has been misinterpreted; this is likely to lead to bias.

 

Criterion-group strategy

The criterion-group strategy works on the idea that nothing should be assumed about the meaning of a person’s response to a test item; the meaning should be determined by empirical research. The Minnesota Multiphasic Personality Inventory (MMPI) is a true-false self-report questionnaire. Its most notable features are its clinical scales, its content scales, and its validity scales: the clinical scales were built to assess psychological abnormalities and disorders, the content scales group items that appear to have something in common, and the validity scales provide information about the subject’s approach to the test, such as faking bad or faking good. The main purpose of the MMPI is to help distinguish normal from abnormal groups; it was originally designed to aid in the assessment of major psychiatric and psychological disorders. Around 800 patients were originally used to develop the test’s criterion groups, a number that was later drastically reduced; in the end there were eight criterion groups of roughly 50 patients each: hypochondriacs, depressives, hysterics, psychopathic deviates, paranoids, psychasthenics, schizophrenics, and hypomanics. A major criticism of the MMPI was that the control group consisted of the patients’ relatives and visitors (with mental patients explicitly excluded).

The newer version, the MMPI-2, has a much better and more representative control sample. To the original eight clinical scales two more were added: the masculinity-femininity (M-F) scale and the social-introversion scale, which measures social introversion.

 

The validity scales were developed in response to criticism in this area and assess the way a subject approaches the test: whether the approach was normal and honest or not. The L (lie) scale was developed to spot individuals who try to present themselves in a more favourable light than is realistic. The K scale was built from items that distinguished abnormal from normal groups even when both groups produced a normal test pattern; the reasoning was that pathological groups would show normal-looking patterns when being defensive, denying and hiding their problems, and this defensiveness would help reveal pathology that was absent in truly normal individuals. The F scale was designed to spot people who are "faking bad", trying to present their situation as worse than it is; a high F score raises a validity concern because it indicates a tendency to exaggerate problems. An additional scale is the "cannot say" score, which counts the items a person fails to answer true or false. One complication is that a person with a specific disturbance such as schizophrenia rarely scores high on only one scale, but more often on two, three, or more. To deal with these multiple elevations we use pattern analysis, as described before. The idea of analysing the two highest scales, the two-point code, highlighted the need for research based on individuals who show high scores on two specific scales.
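
Pattern (configural) analysis of the two-point kind is easy to picture: take a profile of clinical-scale scores and report the two highest scales, which are then interpreted against research on people with that code. A minimal sketch with an invented profile (the T-scores are made up, not a real MMPI protocol):

```python
# Hypothetical MMPI-style profile: clinical scale -> T-score (invented numbers).
profile = {
    "Hs (hypochondriasis)": 58,
    "D (depression)": 82,
    "Hy (hysteria)": 61,
    "Pd (psychopathic deviate)": 65,
    "Pa (paranoia)": 70,
    "Pt (psychasthenia)": 79,
    "Sc (schizophrenia)": 74,
    "Ma (hypomania)": 55,
}

# The two-point code is simply the pair of highest clinical scales.
two_point = sorted(profile, key=profile.get, reverse=True)[:2]
print(two_point)   # ['D (depression)', 'Pt (psychasthenia)']
```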

 

A further development was the restandardization and improvement of the MMPI, resulting in the MMPI-2. This was done to revise problematic items and to increase the number and variety of items, while retaining as many features of the original MMPI as possible; a separate form specifically for adolescents was also introduced. Major improvements in the MMPI-2 were the added validity scales: the VRIN (Variable Response Inconsistency) scale, which assesses random responding, and the TRIN (True Response Inconsistency) scale, which measures acquiescence, the tendency to answer "true" regardless of content.

 

The psychometric features of the MMPI and its revision are closely comparable. The reliability of both tests is high, around .90. Intercorrelations between the scales are very high, which calls the validity of pattern analysis into question. Another drawback is that the way items are keyed is unbalanced; this matters because many people approach tests with a particular response style that leads them to answer items in a certain way regardless of content. Nevertheless, the validity of both versions is supported by many research studies evaluating the characteristics associated with specific profile patterns. For example, after many studies of alcoholism and substance abuse with the MMPI, we can be fairly confident that the test can at least help predict who might develop alcohol problems later on (such individuals tend to have higher F-scale scores).

 

The California Psychological Inventory (CPI) is another structured personality test based primarily on the criterion-group strategy. The third edition has 36 scales intended to measure personality features such as introversion-extraversion, self-realization and sense of integration, and conventionality versus unconventionality in following norms. Unlike the MMPI and MMPI-2, the CPI aims to measure the personality of normally adjusted individuals, which is why it is used more often in counselling settings. Many items are very similar to those of the MMPI, and a number of them are identical. Its reliability is similar to that of the MMPI. Its advantage over the MMPI lies in its applicability to normal individuals.

 

Factor analytic strategy

The factor-analytic strategy, as mentioned before, reduces a large set of items to a small number of common factors by grouping components that vary together, thereby accounting for the variability in the data with as few dimensions as possible. Nowadays this is done by computer, but the approach predates computers and grew out of the basic strategy of correlating the scores on a new test with the scores on an established test intended to measure the same thing. Guilford took this a step further: he examined the intercorrelations among many tests and applied factor analysis to the results, identifying major dimensions underlying all the personality tests involved. Guilford's work was later overshadowed by the emergence of the MMPI.
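
As a rough illustration of this kind of data reduction, the sketch below (synthetic data, not Guilford's or Cattell's material) fits a factor-analysis model to simulated item scores and recovers a small number of underlying dimensions from their intercorrelations; scikit-learn's FactorAnalysis is used purely for convenience.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n_people = 500

# Two hypothetical underlying traits generate responses to ten items, plus noise.
traits = rng.normal(size=(n_people, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2], [0.1, 0.8], [0.0, 0.9],
                     [0.2, 0.7], [0.6, 0.1], [0.1, 0.6], [0.8, 0.2], [0.2, 0.8]])
items = traits @ loadings.T + 0.4 * rng.normal(size=(n_people, 10))

# Factor analysis condenses the 10 intercorrelated items back into 2 descriptive dimensions.
fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(fa.components_.round(2))   # estimated loading of each item on the two factors
scores = fa.transform(items)     # each person's position on the two factors
```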

 

Cattell began by collecting all the adjectives that could be applied to human beings and, from a very large number, narrowed them down to 171 terms and then to 36 surface traits. These were further reduced to 16 underlying variables, the source traits, and from this project the well-known Sixteen Personality Factor Questionnaire (16PF) emerged. The test has a median short-term test-retest correlation of .83 across the 16 trait scales. Professionals do not find the 16PF as useful as the MMPI, yet a good deal of research supports its validity, so its evaluation can be called inconsistent. A general problem with factor analysis is the highly subjective way the factors are named: each named factor in the final test summarizes many narrower components that have been collapsed into the single label we find in the questionnaire.

 

Theoretical strategy

The theoretical strategy developed from the idea of using theory to build personality tests and in that way avoiding biases and problems. The Edwards Personal Preference Schedule (EPPS) is one of the first and best known of its kind. It is not a test in the strict sense, since there are no right or wrong answers, and it is widely used in counselling. Its theoretical basis lies in the work of Murray, who proposed a list of human needs such as the need to achieve, the need for attention, and the need to conform. Edwards selected 15 needs from Murray's list and developed items with construct validity for each of them. When taking the test, the subject is asked to choose one need over another in each pair, thereby excluding the other. Comparing how often each need is chosen relative to the others yields what is called an ipsative score. Such scores are expressed in purely relative terms: they compare an individual with him- or herself and show the relative strength of each need within that same individual. Test-retest reliability coefficients range from .74 to .88, which is high and satisfactory for a personality test. The EPPS is widely used in applied settings and is one of the most intensively researched tests of its kind.
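
Ipsative scoring can be made concrete with a small sketch. It assumes, hypothetically, a forced-choice format in which each item pits one need against another; a need's score is simply the number of times it is chosen, so the scores sum to the number of items and describe the person's needs only relative to one another, not relative to other people.

```python
from collections import Counter

# Hypothetical forced-choice answers: (need A, need B, the need the respondent picked).
choices = [
    ("achievement", "affiliation", "achievement"),
    ("achievement", "order",       "order"),
    ("affiliation", "autonomy",    "affiliation"),
    ("order",       "autonomy",    "order"),
    ("achievement", "autonomy",    "achievement"),
    ("affiliation", "order",       "order"),
]

# Ipsative score = how often each need wins its pairings for this one person.
ipsative = Counter(chosen for _, _, chosen in choices)
print(ipsative)                                  # e.g. Counter({'order': 3, 'achievement': 2, 'affiliation': 1})
print(sum(ipsative.values()) == len(choices))    # True: the scores sum to a fixed total, so they are purely relative
```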

 

Combination strategies

Combination strategies represent the modern trend of using a mixture of the strategies mentioned above to develop personality tests.

Positive personality measurement and the NEO Personality Inventory-Revised (NEO-PI-R): research suggests that it may be useful to measure people's positive characteristics in order to grasp their capacities and how those influence behaviour and life. The concept of hardiness refers to the way one copes with stressful situations; a hardy person sees stressful situations as meaningful and changeable (Kobasa, 1979). Bandura (1986) further suggested that people with a strong sense of self-efficacy tend to believe they are in control and face hard times with "hardiness". Research supports the idea that leading a satisfying life requires concentrating on positive personal features and not merely on the absence of psychopathology.

 

The NEO-PI-R builds on this idea and uses both theory and factor analysis to create its scales. Three broad domains of the NEO-PI-R are N for Neuroticism, E for Extraversion, and O for Openness, and each domain comprises six facets. Neuroticism's facets are anxiety, depression, hostility, self-consciousness, vulnerability, and impulsiveness. Extraversion contains warmth, activity, assertiveness, excitement seeking, positive emotions, and gregariousness. Openness consists of values, actions (trying out new activities), fantasy, feelings (openness to them), aesthetics, and ideas (intellectual curiosity). Responses are given on a 5-point Likert scale, and for 14 of the 18 facets the items are balanced in wording, with 7 positively worded and 7 negatively worded. Reliability for the three domains is quite high (high .80s to low .90s), both for test-retest reliability and for internal consistency; the individual facets have lower reliabilities.
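
Scoring a balanced Likert scale is straightforward to sketch. In the hypothetical example below, responses run from 1 to 5, negatively worded items are reverse scored (6 minus the response), and a facet score is the sum of its items; the item assignments and keying are invented for illustration, not the actual NEO-PI-R key.

```python
# Hypothetical 5-point Likert responses by item number (1 = strongly disagree ... 5 = strongly agree).
responses = {1: 4, 2: 2, 3: 5, 4: 1, 5: 3, 6: 4}

# Invented scoring key for one facet: which items belong to it and which are negatively worded.
facet_items = [1, 2, 3, 4, 5, 6]
reverse_keyed = {2, 4}          # negatively worded items are reverse scored

def facet_score(responses, items, reverse_keyed):
    total = 0
    for item in items:
        r = responses[item]
        total += (6 - r) if item in reverse_keyed else r   # reverse scoring: 1<->5, 2<->4, 3 unchanged
    return total

print(facet_score(responses, facet_items, reverse_keyed))   # 4 + 4 + 5 + 5 + 3 + 4 = 25
```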

 

The NEO-PI-R supports the well-known five-factor model of personality, whose dimensions are extraversion, neuroticism, conscientiousness, agreeableness, and openness to experience. Conscientiousness has been the most widely researched and consists of two major components: dependability and achievement. Conscientiousness has been found to be a good, positive predictor of performance in all professions studied so far; it also correlates positively with effective ways of coping with stress and with life satisfaction. Openness correlates highly with crystallized intelligence, and agreeableness, extraversion, and openness are useful in predicting success in particular job environments.

The NEO-PI-R illustrates the modern approach to personality test construction, combining logic, theory, and statistical methods such as factor analysis to yield sound test results and behavioural insights with fewer biases and problems.

 

Chapter 18 – Testing in Industrial and Business Settings

 

Personnel psychology – selecting employees

I/O (industrial/organizational) psychology emphasizes structured psychological testing and relies on research and quantitative methods. Its main areas are personnel psychology, concerned with job recruitment, employee selection, and performance evaluation, and organizational psychology, concerned with the motivation and satisfaction of employees as well as leadership and other factors present in organizations.

 

The employment interview is used in industry and business to support selection and promotion decisions; structured interviews are a well-supported format. Employment interviews are known to involve a greater search for negative than for favourable evidence. Webster noted in his research that the first impression tends to have a great impact on the candidate's evaluation: if the early impression is negative, the final rejection rate is about 90%, whereas it drops to about 25% if the first impression is positive. Negative factors that usually lead to rejection are low enthusiasm, nervousness, lack of eye contact, lack of confidence, and poor communication skills; positive factors include poise, self-confidence, and the ability to sell oneself.

 

A good first impression goes with wearing professional attire, appearing competitive, showing expertise, conveying friendliness or warmth through non-verbal cues, and not overdoing any of it. One study found that female participants who both wore perfume and displayed friendly non-verbal cues were evaluated negatively, whereas wearing perfume or being friendly on its own was evaluated positively.

 

Base and hit rates

The hit rate is the percentage of cases in which the test accurately predicts success or failure, for example in employment. The base rate is the rate of success we would observe without using a test at all. The true value of a test appears when we compare the hit rate with the base rate. For dichotomous, two-choice decisions a cut-off score is usually used: those at or above the cutting score are, for example, hired, while those below are not. Establishing a cutting score does not by itself guarantee correct decisions.

Miss rates arise when we conclude (for example, in hiring) that something is true or suitable when it actually is not, or vice versa. A false negative would be diagnosing someone with a benign tumour when he or she actually has a malignant one. A false positive is the other form of miss: for example, hiring someone on the basis of test results who then performs poorly. Hits can likewise be positive or negative: hiring someone who ends up performing well is a true positive, and not hiring someone who would not have performed well is a true negative. False negatives and false positives carry different implications depending on the environment and context; a child rated as potentially aggressive in the future who turns out not to be is a false positive, even though the outcome has a positive connotation. Assessing hits and misses requires cutting scores, which in turn involve criterion validity.
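
These quantities can be pinned down with a small numerical sketch. The data below are invented: each applicant has a test score and an eventual success/failure outcome, a cutting score splits them into hired and rejected, and the resulting hit rate is compared with the base rate (the success rate we would get with no test at all).

```python
# Invented applicant data: (test score, succeeded on the job?)
applicants = [(72, True), (65, True), (58, False), (80, True), (45, False),
              (60, False), (77, True), (50, True), (83, False), (68, True)]

cutting_score = 62                                     # hire at or above this score
base_rate = sum(ok for _, ok in applicants) / len(applicants)

tp = sum(score >= cutting_score and ok     for score, ok in applicants)  # hired and succeeded
fp = sum(score >= cutting_score and not ok for score, ok in applicants)  # hired but failed (false positive)
tn = sum(score <  cutting_score and not ok for score, ok in applicants)  # rejected and would have failed
fn = sum(score <  cutting_score and ok     for score, ok in applicants)  # rejected but would have succeeded (false negative)

hit_rate = (tp + tn) / len(applicants)                 # proportion of correct decisions with the test
print(base_rate, hit_rate)                             # the test is only worth using if hit_rate beats base_rate
```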

 

Taylor-Russell tables

Taylor-Russell tables were developed as a method for evaluating the practical value of a test's validity, more specifically for examining whether the test does better than chance and serves its purpose. The tables require the following information in order to be used properly:

 

1. Definition of success: for each situation tested, the successful outcome must be clearly defined (e.g. a score above 5.5 is a pass, below is a fail)

2. Determination of the base rate: the proportion of people who would count as successful if no testing procedure were used must be determined

3. Definition of the selection ratio: the percentage of applicants who are admitted must be defined

4. Determination of the validity coefficient: the correlation between the test and the criterion is needed.

 

In short, a Taylor-Russell table gives the likelihood that a person selected on the basis of the test score will succeed; a separate table exists for each base rate. The most useful tests are those with high validity and a low selection ratio. If validity is low and the selection ratio is high, the test is of little use: zero validity means the test does no better than selecting people by chance, and a high selection ratio means almost everyone is selected anyway. Even though some rejected applicants would have performed well, the percentage of successes among those selected is higher than among those rejected. One drawback of the tables is that they require a dichotomous outcome: success or failure.
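
The quantity the tables report can also be approximated directly rather than looked up. Under the common assumption that test score and job performance are bivariate normal (an assumption of this sketch, not a statement about how the original tables were printed), the function below takes a validity coefficient, a base rate, and a selection ratio and returns the expected proportion of successes among those selected.

```python
from scipy.stats import norm, multivariate_normal

def expected_success_rate(validity, base_rate, selection_ratio):
    """Approximate Taylor-Russell value: P(success | selected) under bivariate normality."""
    x_cut = norm.ppf(1 - selection_ratio)   # test-score cutoff that admits the chosen fraction
    y_cut = norm.ppf(1 - base_rate)         # performance level that counts as "success"
    cov = [[1.0, validity], [validity, 1.0]]
    # P(score > x_cut and performance > y_cut) from the bivariate normal CDF
    p_both = (1 - norm.cdf(x_cut) - norm.cdf(y_cut)
              + multivariate_normal.cdf([x_cut, y_cut], mean=[0, 0], cov=cov))
    return p_both / selection_ratio

# High validity with a low selection ratio lifts the success rate well above the base rate;
# with zero validity the selected group succeeds at exactly the base rate (no better than chance).
print(expected_success_rate(validity=0.50, base_rate=0.50, selection_ratio=0.20))
print(expected_success_rate(validity=0.00, base_rate=0.50, selection_ratio=0.20))
```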

 

Utility theory was developed as an alternative to the Taylor-Russell tables in order to assess outcome levels beyond a simple 'success' or 'failure'. It is used in selection procedures, mostly in personnel selection, and has lately found a place in education and other fields. Research demonstrates financial advantages of using utility theory models to select employees.

 

Incremental validity

Incremental validity refers to the unique information a test adds: how much the information gathered from the test contributes beyond simpler measures that could lead to the same prediction. The idea is that validity and reliability alone are not enough; we also need to assess how valuable the test is in practice.

Some results indicate that simple self-ratings can predict traits as well as some complex personality tests (Hase & Goldberg, 1967). This is not always the case; supervisors, for example, are known to be poor raters. Interview validity offers another example: situational interviews have higher validity than job-related interviews, while psychologically based interviews have the lowest validity of all interview types studied. The general recommendation is to consider cheaper or simpler methods before turning to more complex ones, since complexity does not guarantee higher validity.
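
Incremental validity is often expressed as the gain in explained variance when the test is added to a cheaper predictor. The sketch below uses simulated data (all names and coefficients are invented): it compares the R-squared of a simple baseline predictor alone with the R-squared after adding the test score; the difference is the test's incremental contribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

simple_measure = rng.normal(size=n)                                        # e.g. a brief self-report rating
test_score = 0.6 * simple_measure + 0.8 * rng.normal(size=n)               # the more expensive test
criterion = 0.5 * simple_measure + 0.3 * test_score + rng.normal(size=n)   # later job performance

def r_squared(predictors, y):
    X = np.column_stack([np.ones(len(y))] + predictors)   # design matrix with an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

r2_simple = r_squared([simple_measure], criterion)
r2_both = r_squared([simple_measure, test_score], criterion)
print(r2_simple, r2_both, r2_both - r2_simple)   # the last number is the incremental validity
```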

 

Employee’s perspective – fitting people to jobs

Turning to personnel psychology, its aim is to match people and jobs. Temperament is often seen as a critical component of job satisfaction. The Myers-Briggs Type Indicator (MBTI) is based on Jung's theory, which distinguishes four main functions (ways we experience the world around us): sensing, intuition, feeling, and thinking. Feeling refers to attending to the emotional aspects of experience, and sensing to gaining knowledge through hearing, touching, and the other senses. Jung believed that although we all strive for some balance among the four functions, every person tends to emphasize one of them. Another dimension Jung described is extraversion versus introversion. The MBTI assesses the extraversion-introversion dimension and identifies which of the functions is emphasized. It is widely used, mostly to explore communication styles, leadership skills, and self-efficacy.

 

 

Tot nu toe hebben we het gehad over welke objecten we willen onderzoeken, waarom we dat willen en voor welke doeleinden we dat willen. Nu zullen we een aantal tijdsgerelateerde opties bespreken, die dwars door de voorgaande overwegingen lopen. Het kan hier gaan om één enkele tijdsperiode of over meerdere tijdsperioden. Tijd speelt een belangrijke rol in het ontwerpen en uitvoeren van een onderzoek, vooral wanneer het gaat om de tijd die nodig is om het onderzoek uit te voeren. Tijd speelt ook een belangrijke rol bij deRead more