Article: Oostervel & Vorst (2010)
The construction of measurement instruments is an important subject.
- certain instruments age because theories about human behaviour change, or because social changes undermine existing instruments
- new instruments can be necessary because existing instruments are not sufficient
- new instruments can be necessary because existing instruments are not suitable for a certain target group
Measurement aims of an instrument: the goal of a measurement instrument.
This concerns a more or less hypothetical property.
The domain of human behaviour
The instrument is usually focused on measuring a property within a global domain of human behaviour.
A domain: a broad area of more or less coherent properties.
Every measurement instrument uses one or more observation methods. Different properties from different domains are usually measured with different observation methods.
- observation tests
When properties are measured with different observation methods, it is logical that different methods tap different domains of the traits or categories.
Instruments based on one observation method tend to form a common method factor, which is usually stronger than the common trait factor of the same traits measured with different observation methods.
The development of an instrument is usually based on an elaborated theory, on insights from empirical research, or on ideas from informal knowledge.
Instruments developed on the basis of formal knowledge and an elaborated theory are of better quality than instruments based on informal knowledge and a poorly formulated theory.
An instrument is the elaboration of a construct that refers to a combination of properties.
Measurement instruments for specific (latent) traits are of better quality than instruments for global traits or composite traits.
The structure of a test depends on the properties it measures.
Unstructured observation methods are those whose measurement conditions are not standardized; as a result, their results are difficult to compare across persons and situations, and objective scores are difficult to obtain.
The intended applications of a measurement instrument can relate to theoretical or descriptive research.
This involves the analysis of a great number of observations.
For individual applications, high requirements are placed on the realised measurement aims.
An often decisive element in the description of the measurement aims of a measurement instrument is its cost.
An instrument consists of one or more measurement scales or subtests.
Multiple scales refer to multiple dimensions of the construct and to a subdivision into multiple latent traits or latent categories.
An instrument that is based on a specific latent trait must be one-dimensional.
Three kinds of reliability:
- Internal consistency reliability
The mutual cohesion of the items that form a scale or subtest.
- Test-retest reliability
Repeated measurements with the same instrument.
- Local reliability
An impression of the reliability of the measurement within a certain range of scores.
Validity: does the test measure what it is supposed to measure?
Forms of validity:
Utility of an instrument: the usefulness of an instrument as apparent from a cost-benefit analysis.
A psychological measurement instrument doesn't yield absolute results but relative ones: the individual's scores must be compared to the scores of others.
The scores of others form the norm.
Norm group: the group of people that forms the norm.
Norming consists of converting raw scores into relative norm scores.
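As a sketch of norming, a raw score can be converted into a standard (z) norm score relative to a norm group. A minimal Python illustration with hypothetical scores (`z_norm` is an illustrative helper, not a procedure from the article):

```python
from statistics import mean, pstdev

def z_norm(raw_score, norm_scores):
    """Convert a raw score to a z norm score relative to a norm group."""
    m = mean(norm_scores)     # mean of the norm group
    sd = pstdev(norm_scores)  # spread of the norm group
    return (raw_score - m) / sd

# Hypothetical norm group of raw test scores
norm_group = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
print(round(z_norm(25, norm_group), 2))  # about one standard deviation above the norm mean
```

A raw score of 25 is thus expressed relative to the norm group rather than reported as an absolute result.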
Validity and measurement quality of measurement instruments
Impression validity – a subjective judgment of the measurement quality
Impression validity: a subjective judgment of the usability of a measurement instrument on the basis of the directly observable properties of the test material.
The judgment of test-takers and other laypeople.
Content validity – content-related measurement quality
Content validity: the judgment about the representativeness of the observations, assignments, and questions for a certain purpose.
This can be determined by giving potential respondents or experts the domain descriptions and the items of the instrument, and asking them to sort the items under the domain descriptions.
- With great agreement between items and domain descriptions across judges, the content validity is high.
Especially important for tests and exams.
Criterion validity – predictive value of the measurement
Criterion validity: the (cor)relation between the test score and a psychological or social criterion.
It can be determined by correlating test scores with criterion scores.
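Correlating test scores with criterion scores can be sketched with a Pearson correlation; a minimal example with hypothetical data:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical test scores and criterion scores (e.g. later performance ratings)
test_scores = [4, 7, 8, 10, 12, 15]
criterion   = [5, 6, 9, 9, 13, 14]
print(round(pearson_r(test_scores, criterion), 2))  # prints 0.96
```

The resulting coefficient is the validity coefficient: the higher it is, the better the test score predicts the criterion.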
Process validity – procedural measurement quality
Process validity: the manner in which the response comes about.
It can be investigated with think-aloud protocols or with experiments using instructions.
Construct validity – theoretical measurement quality
The degree of agreement between the strictly formulated, hypothetical relations between the measured construct and other constructs, and the empirically demonstrated relations between the instruments that are supposed to measure those constructs.
Convergent validity: measurement results from different instruments that measure the same construct cohere, i.e. correlate highly.
Divergent validity: measurement results from different instruments that measure different constructs show low correlations.
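Convergent and divergent relations can be illustrated with correlations: two instruments for the same construct should correlate highly, while instruments for different constructs should not. A sketch with hypothetical scores for six respondents:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical scores of six respondents
anxiety_a   = [3, 5, 6, 8, 9, 11]   # instrument A: anxiety
anxiety_b   = [4, 5, 7, 7, 10, 12]  # instrument B: also anxiety
motor_speed = [9, 4, 11, 6, 8, 5]   # instrument C: an unrelated construct

print(round(pearson_r(anxiety_a, anxiety_b), 2))    # convergent relation: high
print(round(pearson_r(anxiety_a, motor_speed), 2))  # divergent relation: low
```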
Homogeneity or consistency reliability
The coherence between the separate indicators (items) of a scale.
For a psychological scale, it is assumed that the items of which the scale is composed are independent, repeated measurements of the same trait.
Homogeneity is determined with different indices:
- mean inter-item correlation
- split-half reliability
- coefficient alpha
The magnitude of the homogeneity indices usually depends on the magnitude of the inter-item correlations and on the number of items.
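Coefficient alpha, one of the listed homogeneity indices, can be computed directly from item scores. A minimal sketch with a hypothetical three-item scale:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Coefficient alpha; items is one list of scores per item,
    with the respondents in the same order in every list."""
    k = len(items)
    item_var_sum = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # scale score per respondent
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Hypothetical 3-item scale answered by five respondents
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 5, 5],
    [2, 2, 4, 5, 5],
]
print(round(cronbach_alpha(items), 2))  # prints 0.95
```

In line with the point above, alpha rises both with the inter-item correlations and with the number of items.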
Generalizability of the measurement quality
The validity and reliability (measurement quality) are in principle dependent on the population or sample.
For every group of persons who differ on one or more characteristics, the validity and reliability (measurement quality) of an instrument must be determined separately.
Paradoxes and measurement qualities
Subjective judgments of the measurement quality
The naive judgment of the measurement quality of a test can be deceptive and need not bear any relation to the measurement quality under investigation.
The content validity of an instrument rests on a representative choice of items from one or more domains of items.
If the content of an instrument is chosen optimally, this can lead to a less homogeneous instrument.
- The property is then measured with items that are diverse in content.
- Such items elicit a great diversity of responses, which does not lead to a homogeneous, one-dimensional scale or subtest.
If the constructor also wants high homogeneity or high predictive value, this comes at the expense of the content representativeness of the instrument.
The quality of a one-dimensional measurement model requires homogeneous responses to a limited number of repeated measurements. That requires homogeneous items.
The consistency reliability of homogeneous items is higher than that of heterogeneous items, but the predictive value of homogeneous items is lower than that of heterogeneous items.
With heterogeneous items one usually cannot form a scale with good measurement properties, but one can form a predictor.
With homogeneous items, forming a scale with good measurement properties is possible, but forming a predictor is not.
Items must be homogeneous per scale in order to meet the requirements of a one-dimensional measurement model.
- The items of a scale must correlate with (the items of) scales that measure similar properties (convergent relations) and must not correlate highly with (items of) scales that measure dissimilar properties (divergent relations).
- Items must not correlate highly with indicators of response tendencies.
If items meet these criteria, the predictive value and content quality of the instrument cannot be optimal.
Homogeneity of a scale is based on the assumption that the items of a scale or subtest form independent, repeated measurements of a property.
These repeated measurements must cohere mutually to yield a high consistency reliability, or homogeneity, of the scale or subtest.
Homogeneous scales or subtests threaten the maximal content representativeness of the property and the predictive value of the measurement.
Selection of a measurement quality
A test constructor can't maximize all the measurement qualities in one measurement instrument, and the test user shouldn't expect all the measurement qualities in one instrument.
If the constructor hasn't chosen beforehand which measurement quality to maximize, the instrument will have arbitrary measurement qualities.
The constructor attains the best results by focusing on one measurement quality during test construction.
Optimization, capitalization on chance, and cross-validation
Most methods of test construction have an empirical character, in which the constructor attains the best result using optimizing choice procedures, optimizing solution strategies, or optimizing analysis strategies.
Chance can play a big role here.
By making optimal choices, or using optimal techniques, the constructor can stack chance upon chance. This is capitalization on chance.
The constructor can gain insight into the effects of capitalization on chance by cross-validating the results.
With cross-validation, the results of an optimizing strategy become more apparent, and thus more certainty can be obtained about the (in)stability of the results.
Optimizing procedures: examples
- selection of items on the basis of optimal psychometric properties
- selection of items on the basis of differences between groups
- selection of optimal weights for item scores and test scores
Typical of optimizing procedures is their empirical character: they lead to optimizing choices based on empirical data.
There are no theoretical or hypothetical considerations that steer the selection process.
The data on which the selection is based are usually unreliable to some extent.
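The first kind of optimizing procedure, item selection on the basis of psychometric properties, can be sketched as keeping the items with the highest item-rest correlation. The data and the helper `select_items` are hypothetical illustrations, not a method from the article:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def item_rest_r(items, j):
    """Correlation of item j with the total of the remaining items."""
    rest = [sum(scores) - scores[j] for scores in zip(*items)]
    return pearson_r(items[j], rest)

def select_items(items, k):
    """Keep the k items with the highest item-rest correlation."""
    ranked = sorted(range(len(items)),
                    key=lambda j: item_rest_r(items, j), reverse=True)
    return sorted(ranked[:k])

# Hypothetical 4-item pool, five respondents; item 3 behaves as noise
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 6],
    [5, 1, 4, 2, 3],
]
print(select_items(items, 3))  # the noise item is dropped
```

Because the selection optimizes on sample data, accidentally high correlations can drive the choice, which illustrates how chance enters the selection.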
Capitalization on chance means that with an optimizing strategy, the choice is made partly on the basis of chance.
With such a choice, no distinction can be made between true variance and false variance, i.e. chance.
Capitalization on chance does not occur when selection is based only on true differences or true correlations, but it does occur when selection is (also) based on accidentally large differences or correlations.
Chance here is not systematic or repeatable.
Kinds of optimizing procedures and techniques
Three common forms of optimizing:
- optimizing the psychometric characteristics of measurement and/or prediction through the selection of items
- optimizing the differences between the mean scores of groups through the selection of items
- optimizing the quality of measurement or the accuracy of prediction by assigning weights to item scores or test scores
These procedures are often carried out with optimizing, exploratory analysis techniques.
The most common techniques are:
- exploratory factor analysis
- multiple regression analysis
Measurement model: what the constructor wants to measure
Structure model: what the constructor wants to predict
Cross-validation: checking the (in)stability of outcomes
The central idea of cross-validation: compute the optimal indices repeatedly.
- Divide the research sample into two comparable groups.
- From every subgroup, two new subgroups are randomly composed; these are then merged into two groups, A and B.
- The exploratory analysis is then carried out twice, once per group.
- The outcomes of the two analyses are compared.
If the optimizing technique or procedure yields similar results in both analyses, capitalization on chance is present to such a small degree that the outcomes are stable, i.e. reliable.
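The steps above can be sketched as follows: randomly split the sample into groups A and B, compute the optimal index (here a validity coefficient) in each group, and compare. A minimal sketch with hypothetical data; the function names are illustrative:

```python
import random
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def cross_validate(test_scores, criterion, seed=0):
    """Randomly split the sample into two groups and compute the
    test-criterion correlation separately in each group."""
    idx = list(range(len(test_scores)))
    random.Random(seed).shuffle(idx)  # random, reproducible split
    half = len(idx) // 2
    return [
        pearson_r([test_scores[i] for i in grp], [criterion[i] for i in grp])
        for grp in (idx[:half], idx[half:])
    ]
```

Strongly diverging values in the two groups would signal capitalization on chance; similar values suggest that the outcome is stable.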
Threats with observation methods:
- disruptive influence of the presence and behavior of the observer on the observed person and his or her behavior
- distorting effect of the observer's expectations on the observations
- changes in the manner of observation over time
- loss of precision due to the use of global categories
- distorting influence of the first and last observations on the other observations in the series
- effect of under-representation: distortion of observations due to under-representation of common behavior or events
- effect of event rate: distortion due to missed observations as a result of the rate of events or behaviors
- effect of event complexity: distortion due to missed observations as a result of the complexity of behaviors or events
Threats with rating methods:
- halo effect
positive distortion on specific traits as a result of a positive first or general impression
- horn effect
negative distortion on specific traits as a result of a negative first or general impression
- regression to the middle (central tendency)
distortion of judgments as a result of a tendency to give average judgments or to give little variation in judgments
- contrast effect
distortion due to a tendency to magnify existing differences between people, or differences with the judge
- leniency
distortion as a result of the tendency to avoid negative judgments or to give relatively positive judgments
- severity
distortion as a result of the tendency to give relatively negative judgments or relatively few positive judgments
- logical error
distortion as a result of attributing traits on the basis of psycho-'logical' connections or of assumed cause and effect
The measurement aims of a test must be empirically investigated with psychometric research.