# Glossary for Reliability and Validity

Definitions and explanations of the most important terms generally associated with statistical reliability and validity

Reliability and validity are concepts used to evaluate the quality of your research. They indicate how well a method, technique or test measures something.

Reliability and validity are two central themes within statistics. *Reliability* refers to the extent to which a measurement instrument provides consistent results: if you repeat the same measurement, a reliable instrument will provide the same result. *Validity* describes whether the instrument indeed measures the construct it is intended to measure. Validity depends on the aim of the study: an instrument may be valid for one concept, but not for another. A valid measurement is always a reliable measurement too, but the reverse does not hold: if an instrument provides consistent results, it is reliable, but it does not have to be valid.

The score of a participant on a measurement consists of two parts: 1) the true score of the participant and 2) measurement error. In short:

\[Observed\: score = True\: score + Measurement\: error\]

The true score is the score that a participant would have had if the measurement technique were perfect and no measurement errors had been made. However, the measurement techniques that researchers use are (almost) never flawless. All measurement techniques contain measurement error. Because of these measurement errors, scientists can never determine the exact true score of a participant.

Measurement errors and reliability of a measurement are related. When a measurement has a low reliability, the measurement errors are large and the researcher knows little about the true scores of the participants. When a measurement has a high reliability, little measurement error occurred. The observed scores of a participant are then a good (but not perfect) reflection of the true score of the participant.

Scientists are never completely certain how much measurement error is present in a study and what the true scores of participants are. They also do not know precisely how reliable their measure is, but they can estimate it. If they determine that their measure is not reliable enough, they can try to make it more reliable. If that is not possible, they can decide not to use the measurement at all.

The total variance in a data set of scores consists of two parts: 1) variance by true scores and 2) variance by measurement errors. In formula form, this is:

\[{\small Total\: variance = Variance\: by\: true\: scores + Variance\: by\: measurement\: errors}\]

The part of the total variance that is associated with the true scores of the participants is called the *systematic variance*, because the true scores are systematically related to the measurement. The variance that is caused by measurement errors is called *error variance*, because this variance is not related to what the scientist examines. We can therefore say that the reliability can be computed by dividing the systematic variance by the total variance:

\[Reliability = \frac{Systematic\: variance}{Total\: variance}\]

The reliability of a measurement lies somewhere between 0 and 1. A reliability of 0 implies that the scores consist solely of measurement error and that there is no true score variance present in the data. The reverse applies to a reliability of 1: now, only true score variance is present, and there is no variance caused by measurement errors. The rule of thumb is that a measure is reliable when the reliability is at least .70. This implies that at least 70% of the variance in the data is true score variance (systematic variance).
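The ratio of systematic variance to total variance can be illustrated with a small simulation. This is only a sketch under stated assumptions: true scores and measurement errors are drawn from normal distributions, and the function name `simulate_reliability` and the chosen standard deviations (10 for true scores, 5 for errors) are hypothetical values picked for illustration. In real research the true scores are unknown, so this ratio can only be estimated.

```python
import random
import statistics


def simulate_reliability(n=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate observed = true + error and return systematic / total variance."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n)]
    # Observed score = true score + measurement error
    observed = [t + e for t, e in zip(true_scores, errors)]

    systematic_variance = statistics.pvariance(true_scores)
    total_variance = statistics.pvariance(observed)
    return systematic_variance / total_variance


reliability = simulate_reliability()
print(f"Estimated reliability: {reliability:.2f}")
```

With these assumed values the theoretical reliability is 100 / (100 + 25) = .80, and the simulated ratio should land close to that figure. Increasing `error_sd` lowers the reliability, because a larger share of the total variance is then error variance.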

Researchers use three types of reliability when analyzing their data: 1) test-retest reliability, 2) inter-item reliability and 3) inter-rater reliability.

*Test-retest reliability* refers to the consistency of participants' responses over time. Participants are measured twice, with some time between the measurement occasions. If we assume that a characteristic is stable, a person should get similar scores on both occasions. If someone scores 110 on an IQ test the first time, this person should score around 110 on the second measurement occasion, because IQ is a relatively stable concept. However, the two measurement occasions will never be completely identical, so measurement errors will occur. If the correlation between the two measurements is high (at least .70), the test (here: the IQ test) has a high test-retest reliability. Examples where we expect a high test-retest reliability are intelligence, attitude and personality tests. Examples where we expect a low test-retest reliability are less stable characteristics such as hunger, fatigue or concentration level.
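As a minimal sketch of this check, the correlation between two measurement occasions can be computed with a Pearson correlation. The IQ scores below are invented for illustration, and the helper `pearson_r` is a hypothetical name, written out by hand so the example needs no external libraries.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical IQ scores of the same eight participants on two occasions
first_occasion = [110, 95, 102, 120, 88, 105, 99, 115]
second_occasion = [108, 97, 100, 123, 90, 103, 101, 112]

test_retest_r = pearson_r(first_occasion, second_occasion)
is_reliable = test_retest_r >= 0.70  # rule of thumb from the text

print(f"Test-retest correlation: {test_retest_r:.2f}, reliable: {is_reliable}")
```

Because each participant scores almost the same on both occasions, the correlation in this invented data set is well above the .70 rule of thumb.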

Inter-item reliability is important for measurements that consist of more than one item. *Inter-item reliability* refers to the extent of consistency between multiple items measuring the same construct. Personality questionnaires, for example, often consist of multiple items that tell you something about the extraversion or confidence of participants. These items are summed to a total score. When researchers sum the answers of participants to obtain a single score, they have to be certain that all items measure the same construct (for example extraversion). To check to what extent items are in accordance with each other, the *item-total correlation* can be computed for each item. This is the correlation between an item and the sum of all other items. Each item on the measurement instrument should correlate with the remaining items. An item-total correlation of .30 or higher per item is considered sufficient.
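The item-total correlation described above can be sketched as follows. The response matrix is invented for illustration (six participants, four items, with the fourth item deliberately unrelated to the others), and the function names are hypothetical. Each item is correlated with the sum of the remaining items, so the item itself is not part of the total it is compared with.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def item_total_correlations(responses):
    """Correlate each item with the summed score of all other items."""
    n_items = len(responses[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # total without item i
        correlations.append(pearson_r(item, rest))
    return correlations


# Hypothetical answers: one row per participant, one column per item
responses = [
    [4, 5, 4, 2],
    [2, 1, 2, 4],
    [5, 5, 4, 3],
    [1, 2, 1, 3],
    [3, 3, 4, 1],
    [4, 4, 5, 5],
]

correlations = item_total_correlations(responses)
# Items below the .30 threshold are candidates for removal or revision
flagged_items = [i for i, r in enumerate(correlations) if r < 0.30]

print([round(r, 2) for r in correlations], flagged_items)
```

In this invented data set only the fourth item (index 3) falls below .30, which suggests that it does not measure the same construct as the other three.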

Besides checking whether each item is in accordance with the remaining items, it is also necessary to calculate the reliability of all items combined. In the past, the split-half reliability was calculated for this purpose. For the *split-half reliability*, all items are divided into two sets. A total score is computed for each set, and then the correlation between the two sets is calculated. If the items in both sets measure the same construct, there should be a high correlation between the two total scores. The correlation (and hence the split-half reliability) is considered high if it is .70 or higher.
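The split-half procedure can be sketched in a few lines. The data are invented, the function names are hypothetical, and the choice of which items go into which half (here items 1 and 3 versus items 2 and 4) is arbitrary.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def split_half_reliability(responses, first_half, second_half):
    """Correlate the total scores of two sets of items (given as column indices)."""
    totals_a = [sum(row[i] for i in first_half) for row in responses]
    totals_b = [sum(row[i] for i in second_half) for row in responses]
    return pearson_r(totals_a, totals_b)


# Hypothetical answers: one row per participant, one column per item
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [4, 4, 5, 4],
]

r_split = split_half_reliability(responses, first_half=[0, 2], second_half=[1, 3])
print(f"Split-half reliability: {r_split:.2f}")
```

Passing different index lists for `first_half` and `second_half` makes it easy to try other ways of dividing the items.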

The disadvantage of the split-half reliability is that the correlation that is found depends on which items are placed in which set. If you divide the items a little differently, you may get a different split-half reliability. For this reason, researchers nowadays more often calculate *Cronbach's alpha coefficient*. Cronbach's alpha can be interpreted as the mean of all possible split-half reliabilities. Researchers assume that the inter-item reliability is sufficient when Cronbach's alpha is .70 or higher.

Cronbach's alpha in formula:

\[\alpha = \frac{Items}{Items - 1} \left(1 - \frac{\sum{Variance\: of\: each\: item}}{Total\: variance\: of\: complete\: scale}\right)\]

or

\[\alpha = \frac{N\cdot\bar{c}}{\bar{v}+(N-1)\cdot\bar{c}}\]

- N: the number of items
- c-bar: the average covariance between all pairs of items
- v-bar: the average variance of the items
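Both ways of writing alpha can be checked against each other with a short sketch. The version of the first formula used here places the term `1 - (sum of item variances / total variance)` in parentheses, which is the standard form of Cronbach's alpha. The data are invented, the function names are hypothetical, and sample variances and covariances (dividing by n - 1) are assumed throughout.

```python
import statistics


def sample_covariance(x, y):
    """Sample covariance (divides by n - 1), matching statistics.variance."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def alpha_from_variances(responses):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)."""
    k = len(responses[0])
    items = [[row[i] for row in responses] for i in range(k)]
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def alpha_from_covariances(responses):
    """alpha = N * c_bar / (v_bar + (N - 1) * c_bar)."""
    k = len(responses[0])
    items = [[row[i] for row in responses] for i in range(k)]
    v_bar = statistics.mean([statistics.variance(item) for item in items])
    pair_covariances = [
        sample_covariance(items[i], items[j])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    c_bar = statistics.mean(pair_covariances)
    return (k * c_bar) / (v_bar + (k - 1) * c_bar)


# Hypothetical answers: one row per participant, one column per item
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [4, 4, 5, 4],
]

alpha_a = alpha_from_variances(responses)
alpha_b = alpha_from_covariances(responses)
print(f"alpha (variance form): {alpha_a:.3f}, alpha (covariance form): {alpha_b:.3f}")
```

The two functions return the same value up to rounding error, because the variance of the total score equals the sum of the item variances plus twice the sum of the covariances between all pairs of items.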

*Inter-rater reliability* is also called *inter-judge* or *inter-observer* reliability. It refers to the extent to which two or more observers observe and code the behavior of participants in the same way. When the observers make similar judgements (thus, when the inter-rater reliability is high), the correlation between their judgements should be .70 or higher.

A *correlation coefficient* is a statistic that indicates the strength and direction of the relation between two measurements. It ranges from -1 to 1: the sign shows whether the relation is positive or negative, and the absolute value lies between 0 (no relation between the measurements) and 1 (perfect relation between the measurements). When this statistic is squared, we see what proportion of the variance in one measure is systematically related to the other measure. The further the correlation is from 0, the more strongly the two variables are related.
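The sign and the squared value of a correlation can be illustrated with a small sketch. All numbers are invented and the variable names are hypothetical: one pair of measures rises together, the other pair moves in opposite directions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data for five participants
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 77]   # tends to rise with hours studied
errors_made = [20, 17, 15, 11, 8]   # tends to fall with hours studied

r_positive = pearson_r(hours_studied, exam_score)
r_negative = pearson_r(hours_studied, errors_made)

# Squaring removes the sign and gives the proportion of shared variance
shared_positive = r_positive ** 2
shared_negative = r_negative ** 2

print(f"r = {r_positive:.2f}, r squared = {shared_positive:.2f}")
print(f"r = {r_negative:.2f}, r squared = {shared_negative:.2f}")
```

The negative correlation is just as informative as the positive one: after squaring, both show that the two measures share a large proportion of their variance.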

Measurement techniques should not only be reliable, but also valid. Validity refers to the extent to which a measurement technique measures what it should measure. The question is thus whether we measure what we want to measure. It is important to note that reliability and validity are two different things. A measurement instrument can be reliable, whilst not being valid. A high reliability tells us that the instrument measures *something*, but does not tell us exactly *what* the instrument measures. To discover that, it is important to check the validity of the instrument. Validity is not a definite characteristic of a measurement technique or instrument. A measure can be valid for one aim, whilst not being valid for another aim.

A distinction is made between *internal validity* and *external validity*.

- Internal validity refers to drawing the right conclusions about the effects of the independent variable. Internal validity is safeguarded by experimental control, which ensures that only the independent variable differs between the conditions. If participants in different conditions differ systematically on more than only the independent variable, we are facing *confounding*.
- External validity refers to the extent to which the research results can be generalized to other samples.

Researchers distinguish three kinds of validity: 1) face validity, 2) construct validity and 3) criterion validity.

*Face validity* refers to the extent to which a measure *seems* to measure what it should measure. A measure has face validity when people think that it measures what it is supposed to measure. This form of validity cannot be computed statistically; it is an assessment of the measure based on people's impressions. Face validity is determined by the researcher, the participants and/or field experts.

Face validity is important in research, because if a measurement does not have face validity, participants may not think it is worthwhile to participate seriously (if participants have to fill in a personality test that has no face validity, they do not see the added value of the test). It is important to remember three things: 1) if a measurement has face validity, this does not necessarily mean that the measure is valid; 2) if a measurement does not have face validity, this does not necessarily mean that the measurement is not valid; 3) some researchers deliberately hide their aims to obtain honest answers. For example, if questions are clearly associated with sensitive topics, participants may not want to answer them truthfully; if the face validity of the questions is lowered, participants may not realize that they are giving sensitive information and may give it more readily.

Often, researchers are interested in *hypothetical constructs*. These are constructs that cannot be observed directly. The question arises how to determine whether the measurement of a hypothetical construct is valid. Cronbach and Meehl state that the validity of the measurement of a hypothetical construct can be determined by comparing the measure with other measures. Scores on an instrument for self-confidence, for example, should correlate positively with measures of optimism, but negatively with measures of insecurity and fear.

A measurement instrument has construct validity when 1) it correlates strongly with instruments with which it should correlate (*convergent validity*) and 2) it does not correlate (or correlates only to a small extent) with instruments with which it should not correlate (*discriminant validity*).
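A check of convergent and discriminant validity could be sketched as below. Everything here is an assumption made for illustration: the scores are invented, the measure called `unrelated_measure` stands in for any instrument that should not be related to self-confidence, and the cut-offs of .50 (strong) and .30 (small) are hypothetical thresholds that do not come from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores of eight participants on four instruments
self_confidence = [12, 18, 9, 15, 20, 7, 14, 16]
optimism = [11, 17, 10, 14, 19, 8, 13, 17]       # expected: positive relation
insecurity = [15, 8, 18, 11, 6, 19, 13, 10]      # expected: negative relation
unrelated_measure = [5, 3, 4, 6, 4, 5, 3, 6]     # expected: little or no relation

r_optimism = pearson_r(self_confidence, optimism)
r_insecurity = pearson_r(self_confidence, insecurity)
r_unrelated = pearson_r(self_confidence, unrelated_measure)

# Assumed cut-offs, chosen only for this illustration
convergent_ok = r_optimism >= 0.50 and r_insecurity <= -0.50
discriminant_ok = abs(r_unrelated) < 0.30

print(f"optimism: {r_optimism:.2f}, insecurity: {r_insecurity:.2f}, "
      f"unrelated: {r_unrelated:.2f}")
print(f"convergent: {convergent_ok}, discriminant: {discriminant_ok}")
```

In this invented data set the self-confidence scores relate strongly to optimism and insecurity in the expected directions and only weakly to the unrelated measure, which is the pattern that supports construct validity.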

Criterion validity refers to the extent to which a measurement instrument is related to a specific outcome or *behavioral criterion*. Researchers distinguish between two primary types of criterion validity: 1) concurrent criterion validity and 2) predictive criterion validity.

- Concurrent criterion validity tells us something about the correlation between a measurement instrument and an outcome - for instance whether people with a high grade for the course 'Introduction to Statistics' also have a high grade for the course 'Introduction to social sciences'. Generally, the measurements are taken at almost the same time.
- Predictive criterion validity tells us something about the predictive value of a certain measurement instrument for an outcome - for instance whether people with a high grade for the course 'Introduction to Statistics' also have a high grade for the course 'Statistics for advanced students'. Generally, measurements are made with (a lot of) time in between them.


1. What is the difference between reliability and validity, two central terms within statistics?

2. Which two parts make up the total variance in a data set of scores?

3. Between which two numbers does reliability range?

4. Which three kinds of reliability can be distinguished?

5. How can the split-half reliability be computed?

6. What is the difference between internal and external validity?
