Psychological measurement instruments - a summary for WSRt of an article by Oosterveld & Vorst (2010)

Critical thinking
Article: Oosterveld & Vorst (2010)
Psychological measurement instruments

The construction of measurement instruments is an important subject.

  • certain instruments age: theories about human behaviour change, or social changes undermine existing instruments
  • new instruments can be necessary because existing instruments are not sufficient
  • new instruments can be necessary because existing instruments aren't suitable for a certain target group

Measurement aims

The measurement aims of an instrument: the goal of the measurement instrument.
This concerns a more or less hypothetical property.

The domain of human functioning

The instrument is usually focused on measuring a property in a global domain of human functioning.
A domain: a broad area of more or less coherent properties.

Observation methods

Every measurement instrument uses one or more observation methods. Different observation methods are usually used for different properties of different domains.

  • performance tests
  • questionnaires
  • observation tests

When properties are measured with different observation methods, it follows that the different methods capture different domains of the traits or categories.

Instruments based on a single observation method tend to share a common method factor, which is usually stronger than the common trait factor of the same traits measured with different observation methods.


The development of an instrument is usually based on an elaborated theory, on insights from empirical research, or on ideas based on informal knowledge.
Instruments developed on the basis of formal knowledge and an elaborated theory are of better quality than instruments based on informal knowledge and a poorly formulated theory.


An instrument is the elaboration of a construct that refers to a combination of properties.
Measurement instruments for specific (latent) traits are of better quality than instruments for global traits or composite traits.


The structure of a test depends on the properties it measures.

With unstructured observation methods, the measurement conditions aren't standardized; as a result, the measurement results are difficult to compare across persons and situations, and objective scores are difficult to obtain.

Application possibilities

The application possibilities a researcher wants to achieve with a measurement instrument can relate to theoretical or descriptive research.
This involves the analysis of a great number of observations.

For individual applications, high demands are placed on the realised measurement aims.


An often decisive element in the description of the measurement aims of a measurement instrument is the cost of that instrument.


An instrument consists of one or more measurement scales or sub-tests.
More scales refer to more dimensions of the construct and a subdivision into more latent traits or latent categories.

An instrument that is based on a specific latent trait must be one-dimensional.


Three kinds of reliability:

  • Internal consistency reliability
    the mutual cohesion of the items that form a scale or sub-test
  • Test-retest reliability
    repeated measurements with the same instrument
  • Local reliability
    an impression of the reliability of the measurement within a certain range of scores


Does the test measure what it is supposed to measure?

Forms of validity:

  • impression validity
  • content validity
  • criterion validity
  • process validity
  • construct validity


Utility of an instrument: the usefulness of an instrument as it becomes apparent from a cost-benefit analysis.


A psychological measurement instrument doesn't lead to absolute results, but to relative ones. The individual scores must be compared to the scores of others.
The scores of others form the norm.
Norm group: the group of people that forms the norm.
Norming consists of converting raw scores into relative norm scores.
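The conversion from raw scores to relative norm scores can be sketched in a few lines of Python. This is a minimal illustration, not the article's procedure: the norm-group scores are hypothetical, and the z- and T-scores are standard linear standard-score transformations.

```python
import statistics

def norm_scores(raw_score, norm_group):
    """Convert a raw score to relative norm scores (z and T),
    using the norm group's mean and standard deviation."""
    mean = statistics.mean(norm_group)
    sd = statistics.stdev(norm_group)  # sample SD of the norm group
    z = (raw_score - mean) / sd
    t = 50 + 10 * z  # T-score: mean 50, SD 10
    return z, t

# Hypothetical norm group of raw test scores
norm_group = [12, 15, 18, 20, 22, 25, 28, 30, 33, 37]
z, t = norm_scores(31, norm_group)
print(round(z, 2), round(t, 1))  # → 0.87 58.7
```

The same raw score can thus correspond to very different norm scores in different norm groups, which is why the norm group must match the test-taker.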

Validity and measurement quality

Validity and measurement quality of measurement instruments

Impression validity – a subjective judgment of the measurement quality

Impression validity: a subjective judgment of the usability of a measurement instrument based on the directly observable properties of the test material.
The judgement of test-takers and other laypeople.

Content validity – content-related measurement quality

Content validity: the judgment about the representativeness of the observations, assignments, and questions for a certain purpose.
This can be determined by offering potential respondents or experts the domain descriptions and the items of the instrument, and then asking them to sort the items by domain description.

  • With great agreement between items and domain descriptions across judges, the content validity is high.

Especially important for tests and exams.
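The sorting procedure can be illustrated with a toy computation: judges assign each item to a domain description, and agreement with the intended domains is expressed as a proportion. The domain names, item assignments, and the simple proportion-of-matches index below are all hypothetical illustrations, not the article's method.

```python
def content_agreement(intended, judge_assignments):
    """Proportion of judge-item assignments that match the intended domain.
    judge_assignments[j][i] = domain that judge j assigned item i to."""
    matches = total = 0
    for judge in judge_assignments:
        for item, domain in enumerate(judge):
            matches += (domain == intended[item])
            total += 1
    return matches / total

# Hypothetical: 4 items, each intended for one of two domains
intended = ["anxiety", "anxiety", "mood", "mood"]
judges = [
    ["anxiety", "anxiety", "mood", "mood"],
    ["anxiety", "mood", "mood", "mood"],
    ["anxiety", "anxiety", "mood", "anxiety"],
]
print(round(content_agreement(intended, judges), 2))  # → 0.83
```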

Criterion validity – predictive value of the measurement

Criterion validity: the (cor)relation between the test score and a psychological or social criterion.
Can be determined by investigating the relation between test scores and criterion scores.

Process validity – procedural measurement quality

Process validity: the manner in which the response comes about.
Can be investigated with think-aloud protocols or experiments with instructions.

Construct validity – theoretical measurement quality

Concerns the degree of correspondence between the strictly formulated, hypothetical relations between the measured construct and other constructs, and the empirically demonstrated relations between the instruments that should measure those constructs.

Convergent validity: measurement results from different instruments that measure the same construct cohere, i.e. correlate highly.
Divergent validity: measurement results from instruments that measure different constructs show a low correlation.
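As a minimal sketch, convergent and divergent validity can be checked by correlating total scores from different instruments. The two "anxiety" questionnaires (same construct) and the "extraversion" questionnaire (different construct) below are invented data, used only to show the pattern of high versus near-zero correlations.

```python
def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical total scores of 8 persons
anxiety_a = [10, 14, 9, 16, 12, 18, 11, 15]
anxiety_b = [11, 13, 10, 17, 13, 17, 10, 16]
extraversion = [15, 12, 18, 14, 11, 16, 13, 17]

print(round(pearson(anxiety_a, anxiety_b), 2))    # high: convergent
print(round(pearson(anxiety_a, extraversion), 2)) # near zero: divergent
```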

Homogeneity or consistency reliability

The coherence between the separate indicators (items) in a scale.
For a psychological scale, it is assumed that the items of which the scale is composed are independent, repeated measurements of the same trait.

Homogeneity is determined with different indices:

  • mean inter-item correlation
  • split-half reliability
  • coefficient alpha

The height of the homogeneity indices usually depends on the height of the inter-item correlations and on the number of items.
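Coefficient alpha, for instance, can be computed directly from item scores. The sketch below uses population variances and hypothetical data; real analyses would normally use a statistics package.

```python
def cronbach_alpha(items):
    """Coefficient alpha from item scores.
    items[j][i] = score of person i on item j."""
    k = len(items)       # number of items
    n = len(items[0])    # number of persons

    def var(xs):         # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / var(totals))

# Hypothetical scores of 5 persons on 3 items of one scale
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
print(round(cronbach_alpha(items), 2))  # high alpha: the items cohere
```

Adding more items with the same inter-item correlations raises alpha, which is why the index depends on both cohesion and scale length.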

Generalizability of the measurement quality

The validity and reliability (measurement quality) are in principle dependent on the population or sample.
For every group of persons that differs on one or more characteristics, the validity and reliability (measurement quality) of an instrument must be determined separately.

Paradoxes and measurement qualities

Subjective judgments of the measurement quality

The unaided judgment about the measurement quality of a test can be deceiving and need not bear any relation to the measurement quality as actually investigated.

Content validity

The content validity of an instrument rests on a representative choice of items from one or more domains of items.
If the content of an instrument is chosen optimally, this can lead to a less homogeneous instrument:

  • The property is then measured with items that are diverse in content.

These items elicit a great diversity of responses, which does not lead to a homogeneous, one-dimensional scale or sub-test.
If the constructor also wants a high homogeneity or predictive value, this will be at the expense of the content representativeness of the instrument.

Predictive value

The quality of a one-dimensional measurement model requires homogeneous responses on a limited number of repeated measurements. That requires homogeneous items.

The consistency reliability of homogeneous items is higher than that of heterogeneous items, but the predictive value of homogeneous items is lower than that of heterogeneous items.

With heterogeneous items one usually can't form a scale with good measurement properties, but one can form a good predictor.
With homogeneous items, forming a scale with good measurement properties is possible, but forming a good predictor isn't.

Theoretical measurement quality

Items must be homogeneous per scale in order to meet the requirements of a one-dimensional measurement model.

  • The items of a scale must correlate with (the items of) scales that measure similar properties (convergent relations) and must not correlate highly with (items of) scales that measure dissimilar properties (divergent relations)
  • Items must not correlate highly with indicators of response tendencies

If items meet these criteria, the predictive value and content quality of the instrument can't be optimal.


Homogeneity of a scale is based on the assumption that the items of a scale or sub-test form independent, repeated measurements of a property.
These repeated measurements must be mutually coherent to yield a high consistency reliability or homogeneity of the scale or sub-test.
Homogeneous scales or sub-tests thus threaten the maximal content representativeness of the property and the predictive value of the measurement.

Selecting a measurement quality

A test constructor can't maximize all measurement qualities in one measurement instrument, and the test user shouldn't expect all measurement qualities in one instrument.

If the constructor hasn't chosen beforehand which measurement quality to maximize, the instrument will end up with arbitrary measurement qualities.
The constructor attains the best results by focusing on one measurement quality during test construction.

Optimization, capitalization on chance, and cross-validation

Most methods of test construction have an empirical character, in which the constructor attains the best result by using optimizing choice procedures, solution strategies, or analysis strategies.
Here, chance can play a big role.
By making optimal choices or using optimal techniques, the constructor can stack chance upon chance. This is called capitalization on chance.

The constructor can gain insight into the effects of capitalization on chance by cross-validating the results.
With cross-validation, the results of an optimizing strategy become more transparent, and thus more certainty can be obtained about the (in)stability of the results.

Optimizing procedures: examples

  • selection of items on the basis of optimal psychometric properties
  • selection of items on the basis of differences between groups
  • selection of optimal weights for item scores and test scores

Capitalization on chance

Typical of optimizing procedures is their empirical character: they lead to optimizing choices based on empirical data.
There are no theoretical or hypothetical considerations that steer the selection process.

The data on which the selection is based are usually unreliable to some extent.

Capitalization on chance means that with an optimizing strategy, the choice is made partly on the basis of chance.
In such a choice, no distinction can be made between true variance and false variance, or chance.

Capitalization on chance does not occur when selection is based only on true differences or true correlations, but it does occur when selection is (also) based on accidentally large differences or correlations.
Chance here is not systematic or repeatable.

Kinds of optimizing procedures and techniques

Three common forms of optimizing:

  • optimizing the psychometric characteristics of measurement and/or prediction through selection of items
  • optimizing the differences between the mean scores of groups through selection of items
  • optimizing the quality of the measurement or the accuracy of the prediction by weighting item scores or test scores

These procedures are often executed with optimizing, exploratory analysis techniques.
The most common techniques are:

  • exploratory factor analysis
  • cluster analysis
  • multiple regression analysis
  • discriminant analysis

Measurement model: concerns what the constructor wants to measure.
Structure model: concerns what the constructor wants to predict.

Cross-validation: checking the (in)stability of outcomes

The central idea of cross-validation: compute the optimal indices repeatedly.

  • The research sample is divided into comparable sub-groups.
  • From each sub-group, members are randomly assigned to two new sub-groups (A and B); these are then merged into two groups, A and B.
  • The exploratory analysis is then done twice, once per group.
  • The outcomes of the two analyses are compared.

If the optimizing technique or procedure yields similar results in both analyses, capitalization on chance is present to such a small degree that the outcomes are stable or reliable.
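The split-and-compare idea can be sketched in a bare-bones way. This sketch uses a simple random split into halves rather than the matched sub-groups described above, and the "optimizing analysis" (picking the item with the highest mean) is a toy stand-in, not the article's procedure.

```python
import random

def cross_validate(sample, analysis, seed=0):
    """Randomly split the sample into two groups A and B, run the same
    (optimizing) analysis in both, and return both outcomes so their
    (in)stability can be compared."""
    rng = random.Random(seed)  # seeded for a reproducible split
    shuffled = sample[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    group_a, group_b = shuffled[:half], shuffled[half:]
    return analysis(group_a), analysis(group_b)

# Toy "optimizing analysis": index of the item with the highest mean score
def best_item(group):
    n_items = len(group[0])
    means = [sum(person[j] for person in group) / len(group)
             for j in range(n_items)]
    return means.index(max(means))

# Hypothetical sample: each person is a tuple of 3 item scores
sample = [(1, 4, 2), (2, 5, 1), (1, 4, 3), (2, 5, 2),
          (1, 4, 2), (2, 5, 3), (1, 4, 1), (2, 5, 2)]
result_a, result_b = cross_validate(sample, best_item)
print(result_a == result_b)  # similar outcomes → little capitalization on chance
```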

Threats to validity of measures

Threats with observation methods:

  • Disruptive influence of the presence and behavior of the observer on the observed person and his or her behavior
  • Expectancy effect
    distorting effect of the expectations of the observer on the observations
  • Adjustment effect
    changes in the manner of observation over the course of time
  • Category effect
    loss of precision due to the use of global categories
  • Order effect
    distorting influence of the first and last observations on the other observations in the series
  • Effect of under-representation
    distortion of observations due to under-representation of common behavior or events
  • Effect of event rate
    distortion due to missed observations as a result of the rate of events or behaviors
  • Effect of event complexity
    distortion due to missed observations as a result of the complexity of behaviors or events

Threats with rating methods

  • Halo effect
    a positive distortion on specific traits as a result of a positive first or general impression
  • Horn effect
    a negative distortion on specific traits as a result of a negative first or general impression
  • Regression to the middle
    distortion of judgments as a result of a tendency to give average judgments or to give little variation in judgments
  • Contrast effect
    distortion due to a tendency to magnify existing differences between people or differences with the judge
  • Willingness (leniency) effect
    distortion as a result of the tendency to avoid negative judgments or to give relatively positive judgments
  • Hardness (severity) effect
    distortion as a result of the tendency to give relatively negative judgments or relatively few positive judgments
  • Logical error
    distortion as a result of assuming traits on the basis of psycho-'logical' connections, or assuming cause and effect

Psychometric research

The measurement aims of a test must be investigated empirically through psychometric research.



Summaries & Study Notes of SanneA