To measure: to discover the extent, dimensions, quantity, or capacity of something, especially by comparison with a standard.
Measurement in research consists of assigning numbers to empirical events, objects or properties, or activities in compliance with a set of rules.
This definition implies that measurement is a three-part process: selecting observable empirical events, developing a set of mapping rules (a scheme for assigning numbers or symbols to represent aspects of the event being measured), and applying the mapping rules to each observation of that event.
Variables being studied in research may be classified as objects or as properties.
- Objects include the concepts of ordinary experience, such as touchable items like furniture, as well as things that are not as concrete, such as genes, attitudes, and peer-group pressures.
- Properties are the characteristics of the object; a person's physical properties may be stated in terms of weight and height.
- Psychological properties include attitudes and intelligence.
- Social properties include leadership ability, class affiliation, and status.
In a literal sense, researchers do not measure either objects or properties. They measure indicants of the properties, or indicants of the properties of objects. Since each property cannot be measured directly, one must infer its presence or absence by observing some indicant or pointer measurement.
MEASUREMENT SCALES
In measuring, one devises some mapping rule and then translates the observation of property indicants using this rule.
Several types of measurement are possible; the appropriate choice depends on what is assumed about the mapping rules.
Each one has its own set of underlying assumptions about how the numerical symbols correspond to real-world observations.
Mapping rules have four assumptions:
1. Numbers are used to classify, group, or sort responses. No order exists.
2. Numbers are ordered. One number is greater than, less than, or equal to another number.
3. Differences between numbers are ordered. The difference between any pair of numbers is greater than, less than, or equal to the difference between any other pair of numbers.
4. The number series has a unique origin indicated by the number zero. This is an absolute and meaningful zero point.
Combinations of these characteristics of classification, order, distance, and origin provide four widely used classifications of measurement scales:
1) NOMINAL SCALES – with these scales, a researcher is collecting information on a variable that naturally (or by design) can be grouped into two or more categories that are mutually exclusive and collectively exhaustive.
The only possible arithmetic operation when a nominal scale is employed is the counting of members.
Nominal classifications can consist of any number of separate groups if the groups are mutually exclusive and collectively exhaustive. These scales are the least powerful of the four data types. They suggest no order or distance relationship and have no arithmetic origin.
Any information a sample element might share about varying degrees of the property being measured is wasted by this scale. The only quantification is the number count of cases in each category (the frequency distribution), so the researcher is restricted to the use of the mode as the measure of central tendency.
It can only be concluded which category has the most members. There is no generally used measure of dispersion for nominal scales.
Dispersion: describes how scores cluster or scatter in a distribution. Nominal data are statistically weak, but they can still be useful. One can almost always classify a set of properties into a set of equivalent classes. Nominal measures are especially valuable in exploratory work where the objective is to uncover relationships rather than secure precise measurements. Nominal scales are also widely used in survey and other research when data are classified by major subgroups of the population. Classifications such as respondents’ marital status, gender, political orientation, and exposure to a certain experience provide insight into important demographic data patterns.
2) ORDINAL SCALES – include the characteristics of the nominal scale plus an indicator of order. Ordinal data require conformity to a logical postulate: If a > b and b > c, then a > c. The use of an ordinal scale implies a statement of ‘greater than’ or ‘less than’ (or equal) without stating how much greater or less. Other descriptions can be used – ‘superior to’, ‘happier than’ etc. An ordinal concept can be extended beyond the three cases used in the simple illustration of a>b>c – any number of cases can be ranked.
Another extension of the ordinal concept occurs when there is more than one property of interest. Examples of ordinal data include attitude and preference scales.
Because the numbers used with ordinal scales have only a rank meaning, the appropriate measure of central tendency is the median. The median is the midpoint of a distribution. A percentile or quartile reveals the dispersion. Correlational analysis of ordinal data is restricted to various ordinal techniques.
Measures of statistical significance are technically confined to a body of statistics known as nonparametric methods, synonymous with distribution-free statistics.
3) INTERVAL SCALES – have the power of nominal and ordinal data plus one additional strength: they incorporate the concept of equality of interval (the scaled distance between 1 and 2 equals the distance between 2 and 3). Calendar time is such a scale. Centigrade and Fahrenheit temperature scales are other examples of classical interval scales. Both have an arbitrarily determined zero point, not a unique origin. Researchers treat many attitude scales as interval. When a scale is interval and the data are relatively symmetric with one mode, you use the arithmetic mean as the measure of central tendency. When the distribution of scores computed from interval data leans in one direction or the other (skewed right or left), we often use the median as the measure of central tendency and the interquartile range as the measure of dispersion.
4) RATIO SCALES – incorporate all of the powers of the previous scales plus the provision for absolute zero or origin. Ratio data represent the actual amounts of a variable. Measures of physical dimensions such as weight, height, and distance are examples. In business research, we find ratio scales in many areas: money values, population counts, return rates, and productivity rates. For statistical purposes the analyst would use the same statistical techniques as with interval data. All statistical techniques mentioned up to this point are usable with ratio scales. Other manipulations carried out with real numbers may be done with ratio-scale values. Thus, multiplication and division can be used with this scale but not with the others mentioned. Geometric and harmonic means are measures of central tendency, and coefficients of variation may also be calculated for describing variability. Higher levels of measurement generally yield more information.
Because of the measurement precision at higher levels, more powerful and sensitive statistical procedures can be used. When we collect information at higher levels, we can always convert, rescale, or reduce the data to arrive at a lower level.
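As a rough illustration of how the scale type constrains the choice of statistics described above, the sketch below (plain Python with only the standard library; all sample values are invented) pairs each scale type with its typical measure of central tendency: mode for nominal, median for ordinal, arithmetic mean for interval, and geometric mean for ratio data.

```python
from statistics import mode, median, mean, geometric_mean

# Invented sample data, one small set per scale type.
nominal_data  = ["single", "married", "married", "single", "married"]  # categories only: count/classify
ordinal_data  = [1, 2, 2, 3, 4, 5, 5]                                  # ranks: order, but no distance
interval_data = [36.5, 37.0, 36.8, 37.2, 36.9]                         # e.g. degrees Celsius: arbitrary zero
ratio_data    = [12_000, 15_500, 9_800, 22_300]                        # e.g. money values: absolute zero

print("Nominal  -> mode:", mode(nominal_data))              # only counting is meaningful
print("Ordinal  -> median:", median(ordinal_data))          # midpoint of the ranked distribution
print("Interval -> arithmetic mean:", mean(interval_data))  # equal intervals allow the mean
print("Ratio    -> geometric mean:", round(geometric_mean(ratio_data), 1))  # ratios meaningful with a true zero
```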
SOURCES OF MEASUREMENT DIFFERENCES
Since complete control (of the study) is unattainable, error does occur. Much error is systematic (results from bias), while the remainder is random (occurs erratically). There are four major error sources which may contaminate the results:
- THE RESPONDENT – opinion differences that affect measurement come from relatively stable characteristics of the respondent. Typical of these are employee status, ethnic group membership, social class, etc.
The skilled researcher will anticipate many of these dimensions, adjusting the design to eliminate, neutralize, or otherwise deal with them. Respondents may be reluctant to express strong negative or positive feelings, may purposefully express attitudes that they perceive as different from those of others, or may have little knowledge about something but be reluctant to admit ignorance. This reluctance to admit ignorance of a topic can lead to an interview consisting of ‘guesses’ or assumptions, which, in turn, create erroneous data. Respondents may also suffer from temporary factors like fatigue, boredom, anxiety, hunger, etc.; these limit the ability to respond accurately and fully.
- SITUATIONAL FACTORS – any condition that places a strain on the interview or measurement session can have serious effects on the interviewer-respondent rapport. If another person is present, that person can distort responses by joining in, by distracting, or by merely being there. If the respondents believe anonymity is not ensured, they may be reluctant to express certain feelings.
- THE MEASURER – the interviewer can distort responses by rewording, paraphrasing, or reordering questions. Stereotypes in appearance and action introduce bias. Inflections of voice and conscious or unconscious prompting with smiles, nods, and so forth, may encourage or discourage certain replies. Checking of the wrong response or failure to record full replies will obviously distort findings.
In the data analysis stage, incorrect coding, careless tabulation, and faulty statistical calculation may introduce further errors.
- THE INSTRUMENT – a defective instrument can cause distortion in two major ways. First, it can be too confusing and ambiguous. The use of complex words and syntax beyond participant comprehension is typical. Leading questions, ambiguous meanings, mechanical defects (inadequate space for replies, response-choice omissions, and poor printing), and multiple questions suggest the range of problems.
Many of these problems are the direct result of operational definitions that are insufficient, resulting in an inappropriate scale being chosen or developed. A more elusive type of instrument deficiency is poor selection from the universe of content items. Seldom does the instrument explore all the potentially important issues.
Even if the general issues are studied, the questions may not cover enough aspects of each area of concern.
THE CHARACTERISTICS OF GOOD MEASUREMENT
The tool should be an accurate counter or indicator of what we are interested in measuring. In addition, it should be easy and efficient to use.
There are three major criteria for evaluating a measurement tool:
- VALIDITY – is the extent to which a test measures what we actually wish to measure. This text features two major forms: external and internal validity. The external validity of research findings is the data’s ability to be generalized across persons, settings, and times. Internal validity is further limited in this discussion to the ability of a research instrument to measure what it is purported to measure. One widely accepted classification of validity consists of three major forms: content validity, criterion-related validity, and construct validity.
Criterion-related validity requires a relevant criterion measure. If one is not available, how much will it cost and how difficult will it be to secure? The amount of money and effort that should be spent on development of a criterion depends on the importance of the problem for which the test is used. Once there are test and criterion scores, they must be compared in some way.
Such an approach would provide us with preliminary indications of convergent validity (the degree to which scores on one scale correlate with scores on other scales designed to assess the same construct). Another method of validating the trust construct would be to separate it from other constructs in the theory or related theories. To the extent that trust could be separated from bonding, reciprocity, and empathy, we would have completed the first steps toward discriminant validity (the degree to which scores on a scale do not correlate with scores from scales designed to measure different constructs).
- RELIABILITY – has to do with the accuracy and precision of a measurement procedure. A measure is reliable to the degree that it supplies consistent results. Reliability is a necessary contributor to validity but is not a sufficient condition for validity. If a measurement is not valid, it hardly matters if it is reliable – because it does not measure what the designer needs to measure in order to solve the research problem. In this context, reliability is not as valuable as validity, but it is much easier to assess.
Reliability is concerned with estimates of the degree to which a measurement is free of random or unstable error. Reliable instruments can be used with confidence that transient and situational factors are not interfering. Reliable instruments are robust; they work well at different times under different conditions. This distinction of time and condition is the basis of frequently used perspectives on reliability:
Some of the difficulties that can occur in the test-retest methodology and cause a downward bias in stability include:
A suggested remedy is to extend the interval between test and retest (from two weeks to a month).
When the two halves are correlated, if the results of the correlation are high, the instrument is said to have high reliability in an internal consistency sense. The high correlation tells us there is similarity (or homogeneity) among the items. The potential for incorrect inferences about high internal consistency exists when the test contains many items – which inflate the correlation index.
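A minimal sketch of the split-half idea just described, using numpy and invented item scores: the items are split into two halves, each half is summed per respondent, and the two half-scores are correlated. The Spearman-Brown adjustment at the end is a commonly used extra step for estimating full-length reliability and is an assumption added here, not something stated in the text.

```python
import numpy as np

# Invented item scores: rows = respondents, columns = six items on a 1-5 scale.
scores = np.array([
    [4, 5, 4, 5, 3, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 2, 1],
])

# Split the items into two halves (odd- vs. even-numbered items) and sum each half per respondent.
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)

# High correlation between the halves suggests internal consistency (homogeneity among items).
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction (assumed step): estimate reliability of the full-length instrument.
r_full = (2 * r_half) / (1 + r_half)

print(f"Split-half correlation: {r_half:.2f}")
print(f"Estimated full-test reliability: {r_full:.2f}")
```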
- PRACTICALITY – is concerned with a wide range of factors of economy, convenience, and interpretability. The scientific requirements of a project call for the measurement process to be reliable and valid, while the operational requirements call for it to be practical.
Scaling is the ‘procedure for the assignment of numbers (or other symbols) to a property of objects in order to impart some of the characteristics of numbers to the properties in question.’ Procedurally, numbers are assigned to indicants of the properties of objects. Thus, one assigns a number scale to the various levels of heat and cold and calls it a thermometer.
SELECTING A MEASUREMENT SCALE – selecting and constructing a measurement scale requires the consideration of several factors that influence the reliability, validity, and practicality of the scale:
1) Research objectives – researchers face two general types of scaling objectives:
With the first study objective, the scale would measure the customers’ orientation as favourable or unfavourable. With the second objective, the same data may be used, but the focus is on how satisfied people are with different design options.
2) Response types – measurement scales fall into one of four general types: rating, ranking, categorization, and sorting. A rating scale is used when participants score an object or indicant without making a direct comparison to another object or attitude. Ranking scales constrain the study participant to making comparisons and determining order among two or more properties (or their indicants) or objects. A choice scale requires that participants choose one alternative over another. Categorization asks participants to put themselves or property indicants in groups or categories. Sorting requires that participants sort cards (representing concepts or constructs) into piles using criteria established by the researcher. The cards might contain photos or images or verbal statements of product features.
3) Data properties – decisions about the choice of measurement scales are often made with regard to the data properties generated by each scale. Scales are classified in increasing order of power: nominal, ordinal, interval, or ratio. Nominal scales classify data into categories without indicating order, distance, or unique origin. Ordinal data show relationships of more than and less than but have no distance or unique origin. Interval scales have both order and distance but no unique origin. Ratio scales possess all four properties: classification, order, distance, and unique origin. The assumptions underlying each level of scale determine how a particular measurement scale’s data will be analysed statistically.
4) Number of dimensions – measurement scales are either uni/one-dimensional or multidimensional. With a uni-dimensional scale, one seeks to measure only one attribute of the participant or object. A multidimensional scale recognizes that an object might be better described with several dimensions than on a uni-dimensional continuum.
5) Balanced or unbalanced – a balanced rating scale has an equal number of categories above and below the midpoint. Generally, rating scales should be balanced, with an equal number of favourable and unfavourable response choices. An unbalanced rating scale has an unequal number of favourable and unfavourable response choices.
6) Forced or unforced choices – an unforced-choice rating scale provides participants with an opportunity to express no opinion when they are unable to make a choice among the alternatives offered. A forced-choice scale requires that participants select one of the offered alternatives. Researchers often exclude the response choice ‘no opinion’, ‘don’t know’, or ‘neutral’ when they know that most participants have an attitude on the topic. However, when many participants are clearly undecided and the scale does not allow them to express their uncertainty, the forced-choice biases results.
7) Number of scale points – a scale should be appropriate for its purpose. For a scale to be useful, it should match the stimulus presented and extract information proportionate to the complexity of the attitude, object, concept, or construct. First, as the number of scale points increases, the reliability of the measure increases. Second, in some studies, scales with 11 points may produce more valid results than 3-, 5-, or 7-point scales. Third, some constructs require greater measurement sensitivity and the opportunity to extract more variance, which additional scale points provide.
Fourth, a larger number of scale points are needed to produce accuracy when using single-dimension versus multiple-dimension scales. Finally, in cross-cultural measurement, the cultural practices may condition participants to a standard metric.
8) Errors to avoid with rating scales – Before accepting participants’ ratings, their tendencies to make errors of central tendency and halo effect should be considered. Some raters are reluctant to give extreme judgments, and this fact accounts for the error of central tendency. Participants may also be ‘easy raters’ or ‘hard raters’, making what is called an error of leniency. These errors most often occur when the rater does not know the object or property being rated. To address these tendencies, researchers can:
The halo effect: the systematic bias that the rater introduces by carrying over a generalized impression of the subject from one rating to another. Halo is especially difficult to avoid when the property being studied is not clearly defined, is not easily observed, is not frequently discussed, involves reactions with others, or is a trait of high moral importance. Ways of counteracting the halo effect include having the participant rate one trait at a time, revealing one trait per page, or periodically reversing the terms that anchor the endpoints of the scale, so positive attributes are not always on the same end of each scale.
RATING SCALES – rating scales are used to judge properties of objects without reference to other similar objects. These ratings may be in such form as ‘like-dislike’ or other classifications using even more categories.
1) Simple attitude scales – the simple category scale (also called a dichotomous scale) offers two mutually exclusive response choices. These may be ‘yes’ and ‘no’, ‘important’ and ‘unimportant’. This response strategy is particularly useful for demographic questions or where a dichotomous response is adequate. When there are multiple options for the rater but only one answer is sought, the multiple-choice, single-response scale is appropriate. Both the multiple-choice, single-response scale and the simple category scale produce nominal data. A variation, the multiple-choice, multiple-response scale (also called a checklist) allows the rater to select one or several alternatives. The cumulative feature of this scale can be beneficial when a complete picture of the participant’s choice is desired. This scale generates nominal data. Simple attitude scales are easy to develop, are inexpensive, and can be designed to be highly specific. The design approach is subjective. The researcher’s insight and ability offer the only assurance that the items chosen are a representative sample of the universe of attitudes about the attitude object. There is no evidence that each person will view all items with the same frame of reference as will other people.
2) Likert scales – the Likert scale is the most frequently used variation of the summated rating scale. Summated rating scales consist of statements that express either a favourable or an unfavourable attitude toward the object of interest. The participant is asked to agree or disagree with each statement. Each response is given a numerical score to reflect its degree of attitudinal favourableness, and the scores may be summed to measure the participant’s overall attitude.
The Likert scale is easy and quick to construct. Careful researchers ensure that each item meets an empirical test for discriminating ability between favourable and unfavourable attitudes. Likert scales are probably more reliable and provide a greater volume of data than many other scales. The scale produces interval data.
Originally, creating a Likert scale involved a procedure known as item analysis.
In the first step, a large number of statements were collected that met two criteria: (1) each statement was relevant to the attitude being studied; (2) each was believed to reflect a favourable or unfavourable position on that attitude. People similar to those who were going to be studied were asked to read each statement and to state the level of their agreement with it, using a 5-point scale.
To ensure consistent results, the assigned numerical values are reversed if the statement is worded negatively. The two extreme groups represent people with the most favourable and least favourable attitudes toward the attitude being studied.
These extremes are the two criterion groups by which individual items are evaluated.
Item analysis assesses each item based on how well it discriminates between those persons whose total score is high and those whose total score is low. The mean scores for the high-score and low-score groups are then tested for statistical significance by computing t values. After finding the t values for each statement, they are rank-ordered, and those statements with the highest t values are selected.
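A sketch of that item-analysis procedure, assuming scipy is available; the pilot responses, the 25 per cent cut-offs used to form the high- and low-score criterion groups, and the number of statements are all invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented pilot data: 40 participants rate 8 candidate statements on a 1-5 agreement scale.
responses = rng.integers(1, 6, size=(40, 8))

totals = responses.sum(axis=1)
low_cut, high_cut = np.percentile(totals, [25, 75])

low_group = responses[totals <= low_cut]    # least favourable total scores
high_group = responses[totals >= high_cut]  # most favourable total scores

# For each statement, a t value comparing the high- and low-group means;
# statements with the largest t values discriminate best and would be retained.
t_values = [
    stats.ttest_ind(high_group[:, item], low_group[:, item]).statistic
    for item in range(responses.shape[1])
]

for item, t in sorted(enumerate(t_values, start=1), key=lambda x: abs(x[1]), reverse=True):
    print(f"Statement {item}: t = {t:.2f}")
```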
3) Semantic differential scales – the semantic differential (SD) scale measures the psychological meanings of an attitude object using bipolar adjectives. Researchers use this scale for studies such as brand and institutional image. The method consists of a set of bipolar rating scales, usually with 7 points, by which one or more participants rate one or more concepts on each scale item. The SD scale is based on the proposition that an object can have several dimensions of connotative meaning. The meanings are located in multidimensional property space, called semantic space. Connotative meanings are suggested or implied meanings, in addition to the explicit meaning of an object. The semantic differential has several advantages. It is an efficient and easy way to secure attitudes from a large sample. These attitudes may be measured in both direction and intensity. The total set of responses provides a comprehensive picture of the meaning of an object and a measure of the person doing the rating. It is a standardized technique that is easily repeated but escapes many problems of response distortion found with more direct methods. It produces interval data.
4) Numerical/multiple rating list scales – numerical scales have equal intervals that separate their numeric scale points. The verbal anchors serve as the labels for the extreme points.
Numerical scales are often 5-point scales but may have 7 or 10 points. The participants write a number from the scale next to each item. The scale’s linearity, simplicity, and production of ordinal or interval data make it popular for managers and researchers. A multiple rating list scale is similar to the numerical scale but differs in two ways: (1) it accepts a circled response from the rater, and (2) the layout facilitates visualization of the results. The advantage is that a mental map of the participant’s evaluations is evident to both the rater and the researcher. This scale produces interval data.
5) Stapel scale – is used as an alternative to the semantic differential, especially when it is difficult to find bipolar adjectives that match the investigative question. For example, suppose there are three attributes of corporate image. The scale is composed of the word (or phrase) identifying the image dimension and a set of 10 response categories for each of the three attributes. Fewer response categories are sometimes used.
Participants select a plus number for the characteristic that describes the attitude object. The more accurate the description, the larger is the positive number. Similarly, the less accurate the description, the larger is the negative number chosen. Ratings range from +5 to -5, with participants selecting a number that describes the store very accurately to very inaccurately. Like the Likert, SD, and numerical scales, Stapel scales usually produce interval data.
6) Constant-sum scales – a scale that helps the researcher discover proportions. With a constant-sum scale, the participant allocates points to more than one attribute or property indicant, such that they total a constant sum, usually 100 or 10. Up to 10 categories may be used, but both participant precision and patience suffer when too many stimuli are proportioned and summed. The advantage of the scale is its compatibility with per cent and the fact that alternatives that are perceived to be equal can be so scored – unlike the case with most ranking scales. The scale is used to record attitudes, behaviour, and behavioural intent. The scale produces interval data.
7) Graphic rating scales – the scale was originally created to enable researchers to discern fine differences. Theoretically, an infinite number of ratings are possible if participants are sophisticated enough to differentiate and record them. They are instructed to mark their response at any point along a continuum. Usually, the score is a measure of length (millimetres) from either endpoint. The results are treated as interval data. The difficulty is in coding and analysis. This scale requires more time than scales with predetermined categories.
RANKING SCALES – in ranking scales, the participant directly compares two or more objects and makes choices among them.
Frequently, the participant is asked to select one as the ‘best’ or the ‘most preferred’. When there are only two choices, this approach is satisfactory, but it often results in ties when more than two choices are found. Using the paired-comparison scale, the participant can express attitudes unambiguously by choosing between two objects. The number of judgements required in a paired comparison is [(n)(n-1)/2], where n is the number of stimuli or objects to be judged. Reducing the number of comparisons per participant without reducing the number of objects can lighten this burden. Each participant can be presented with only a sample of the stimuli. In this way, each pair of objects must be compared an equal number of times. Another procedure is to choose a few objects that are believed to cover the range of attractiveness at equal intervals. All other stimuli are then compared to these few standard objects.
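A quick check of the n(n-1)/2 formula with invented values of n shows how rapidly the judgement burden grows, which is why the sampling and standard-object shortcuts above are used.

```python
def paired_comparisons(n: int) -> int:
    """Number of judgements needed to compare every pair of n objects."""
    return n * (n - 1) // 2

for n in (3, 5, 10, 15):
    print(f"{n} objects -> {paired_comparisons(n)} paired judgements")
# 3 -> 3, 5 -> 10, 10 -> 45, 15 -> 105
```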
Paired comparisons run the risk that participants will tire to the point that they give ill-considered answers or refuse to continue. A paired comparison provides ordinal data. The forced ranking scale lists attributes that are ranked relative to each other. This method is faster than paired comparisons and is usually easier and more motivating to the participant. A drawback to forced ranking is the number of stimuli that can be handled by this method. In addition, rank ordering produces ordinal data since the distance between preferences is unknown. Often the manager is interested in benchmarking. This calls for a standard by which other programs, processes, brands, or people can be compared. The comparative scale is ideal for such comparisons if the participants are familiar with the standard. Some researchers treat the data produced by comparative scales as interval data since the scoring reflects an interval between the standard and what is being compared. The rank or position of the item would be treated as ordinal data unless the linearity of the variables in question could be supported.
Arbitrary scales are designed by collecting several items that are unambiguous and appropriate to a given topic. These scales are not only easy to develop, but also inexpensive and can be designed to be highly specific. Moreover, arbitrary scales provide useful information and are adequate if developed skilfully.
Consensus scaling requires items to be selected by a panel of judges, who then evaluate them on:
In this field, the Thurstone equal-appearing interval scale is especially well known.
Item analysis scaling is the procedure for evaluating an item based on how well it discriminates between those persons whose total score is high and those whose total score is low. The most popular scale using this approach is the Likert scale.
CUMULATIVE SCALES – total scores on cumulative scales have the same meaning. Given the person’s total score, it is possible to estimate which items were answered positively and negatively. A pioneering scale of this type was the scalogram.
Scalogram analysis is a procedure for determining whether a set of items forms a uni-dimensional scale. A scale is uni-dimensional if the responses fall into a pattern in which endorsement of the item reflecting the extreme position results in endorsing all items that are less extreme.
The scalogram and similar procedures for discovering underlying structure are useful for assessing attitudes and behaviours that are highly structured, such as social distance, organizational hierarchies, and evolutionary product stages.
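A minimal sketch of the scalogram (cumulative) pattern check described above, with invented response vectors: items are assumed to be ordered from least to most extreme, and a pattern is consistent with a uni-dimensional scale only if endorsing an item implies endorsing every less extreme item.

```python
def is_cumulative(pattern: list[int]) -> bool:
    """True if endorsing an item (1) implies endorsing all less extreme items.

    The pattern is assumed to list items from least extreme to most extreme,
    so a cumulative response looks like 1,1,...,1,0,...,0.
    """
    seen_zero = False
    for response in pattern:
        if response == 0:
            seen_zero = True
        elif seen_zero:  # a 1 after a 0 breaks the cumulative structure
            return False
    return True

print(is_cumulative([1, 1, 1, 0]))  # True: endorses the milder items, rejects only the most extreme
print(is_cumulative([1, 0, 1, 0]))  # False: endorses a more extreme item while rejecting a milder one
```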
Factor scales include a variety of techniques that have been developed to address two problems:
Factoring develops measurement questions through factor analysis or similar correlation techniques. It is particularly useful in uncovering latent attitude dimensions, and it approaches sampling through the concept of multidimensional attribute space. The semantic differential scale is an example.
Other developments in scaling include multidimensional scaling and conjoint analysis. Each represents a family of related techniques with a variety of applications for handling complex judgments. Magnitude estimation and Rasch models provide an avenue for reconceptualising traditional scaling techniques for greater efficiency and freedom from error.
There are three suggested phases of developing an instrument design.
PHASE 1: REVISITING THE RESEARCH QUESTION HIERARCHY
In general, once the researcher understands the connection between the investigative questions and the potential measurement questions, the next step is a strategy for the survey. From there, the researcher proceeds to the particulars of instrument design. The following are important issues to be considered:
1) Type of scale for desired analysis – the analytical procedures available to the researcher are determined by the scale types used in the survey. It is important to plan the analysis before developing the measurement questions.
2) Communication approach – Communication-based research may be conducted by personal interview, telephone, mail, computer, or some combination of these (called hybrid studies). The different delivery mechanisms result in different introductions, instructions, instrument layout, and conclusions.
3) Disguising objectives and sponsors – it has to be decided whether the purpose of the study should be disguised. A disguised question is designed to conceal the question’s true purpose. The decision about when to use disguised questions within surveys may be made easier by identifying four situations where disguising the study objective is or is not an issue:
- Willingly shared, conscious-level information – in surveys requesting conscious-level information that should be willingly shared, either disguised or undisguised questions may be used, but the situation rarely requires disguised techniques.
- Reluctantly shared, conscious-level information – sometimes the participant knows the information a researcher needs but is reluctant to share it for a variety of reasons. When the participant is asked for an opinion on some topic on which he may hold a socially unacceptable view, projective techniques are used. In this type of disguised question, the survey designer phrases the questions in a hypothetical way or asks how other people in the participant’s experience would answer the question. The assumption is that responses to these questions will indirectly reveal the participant’s opinions.
- Knowable, limited-conscious-level information – not all information is at the participant’s conscious level. Given some time – and motivation – the participant can express this information.
Asking about individual attitudes when participants know they hold the attitude but have not explored why they hold the attitude may encourage the use of disguised questions.
- Subconscious-level information – in assessing buying behaviour, it is accepted that some motivations are subconscious. This is true for attitudinal information as well. Seeking insight into the basic motivations underlying attitudes or consumption practices may or may not require disguised techniques.
4) Preliminary analysis plan – researchers are concerned with adequate coverage of the topic and with securing the information in its most usable form. A good way to test how well the study plan meets those needs is to develop ‘dummy’ tables that display the data one expects to secure. Each dummy table is a cross-tabulation between two or more variables. The preliminary analysis plan serves as a check on whether the planned measurement questions meet the data needs of the research question. This also helps the researcher determine the type of scale needed for each question – a preliminary step to developing measurement questions for investigative questions.
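A dummy table of the kind described above can be mocked up before any data exist. The sketch below assumes pandas and uses invented variable names (‘age group’ and ‘satisfaction’) purely to check that the planned measurement questions would fill every cell of the cross-tabulation.

```python
import pandas as pd

# Invented categories for a planned cross-tabulation of satisfaction by age group.
age_groups = ["18-29", "30-44", "45-59", "60+"]
satisfaction = ["Dissatisfied", "Neutral", "Satisfied"]

# The dummy table stays empty (all zeros) until survey responses arrive.
dummy_table = pd.DataFrame(0, index=age_groups, columns=satisfaction)
dummy_table.index.name = "Age group"
print(dummy_table)

# Once data exist, pd.crosstab(responses["age_group"], responses["satisfaction"])
# would populate the same layout.
```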
PHASE 2: CONSTRUCTING AND REFINING THE MEASUREMENT QUESTIONS
Drafting or selecting questions begins once a complete list of investigative questions is developed and a decision is made on the collection processes to be used. The order, type, and wording of the measurement questions, the introduction, the instructions, the transitions, and the closure in a quality questionnaire should accomplish the following:
These questions usually appear at the end of a survey (except for those used as filters or screens, questions that determine whether a participant has the requisite level of knowledge to participate).
2) Question content - is first and foremost dictated by the investigative questions guiding the study. From these questions, questionnaire designers craft or borrow the target and classification questions that will be asked of participants.
Four questions, covering numerous issues, guide the instrument designer in selecting appropriate question content:
3) Question wording - a dilemma arises from the requirements of question design (the need to be explicit, to present alternatives, and to explain meanings). All contribute to longer and more involved sentences. The difficulties caused by question wording exceed most other sources of distortion in surveys. The diligent question designer will put a survey question through many revisions. Leading questions can inject significant error by implying that one response should be favoured over another.
4) Response strategy - a third major area in question design is the degree and form of structure imposed on the participant.
The various response strategies offer options that include unstructured response (or open-ended response, the free choice of words) and structured response (or closed response, specified alternatives provided).
Free-response questions - also known as open-ended questions, ask the participant a question and either the interviewer pauses for the answer (which is unaided) or the participant records his or her ideas in his or her own words in the space provided on a questionnaire.
Dichotomous question - suggest opposing responses (yes/no) and generate nominal data.
Multiple-choice questions - are appropriate when there are more than two alternatives or when a researcher seeks gradations of preference, interest, or agreement. Multiple-choice questions usually generate nominal data. When the choices are numeric alternatives, this response structure may produce at least interval and sometimes ratio data. When the choices represent ordered but unequal numerical ranges or a verbal rating scale, the multiple-choice question generates ordinal data.
Checklist – when multiple responses to a single question are required, the question should be asked in one of three ways: the checklist, rating, or ranking strategy. If relative order is not important, the checklist is the logical choice. Checklists are more efficient than asking for the same information with a series of dichotomous selection questions, one for each individual factor. Checklists generate nominal data.
Rating questions - ask the participant to position each factor on a companion scale, either verbal, numeric, or graphic. Generally, rating-scale structures generate ordinal data; some carefully crafted scales generate interval data. It is important to remember that the researcher should represent only one response dimension in rating-scale response options. Otherwise, the participant is presented with a double-barreled question with insufficient choices to reply to both aspects.
Ranking questions - ideal when relative order of the alternatives is important. The checklist strategy would provide the three factors of influence, but there is no way of knowing the importance the participant places on each factor. Ranking generates ordinal data.
PHASE 3: DRAFTING AND REFINING THE INSTRUMENT
Phase 3 of instrument design – drafting and refinement – is a multistep process:
1) Participant screening and introduction – the introduction must supply the sample unit with the motivation to participate in the study.
It must reveal enough about the forthcoming questions, usually by revealing some or all of the topics to be covered, for participants to judge their interest level and their ability to provide the desired information. In any communication study, the introduction also reveals the amount of time participation is likely to take. The introduction also reveals the researcher organization or sponsor (unless the study is disguised) and possibly the objective of the study. In personal or phone interviews the introduction usually contains one or more screen questions or filter questions to determine if the potential participant has the knowledge or experience necessary to participate in the study.
2) Measurement question sequencing - the design of survey questions is influenced by the need to relate each question to the others in the instrument. Often the content of one question (called a branch question) assumes other questions have been asked and answered.
The basic principle used to guide sequence decisions is this: the nature and needs of the participant must determine the sequence of questions and the organization of the interview schedule. Four guidelines are suggested to implement this principle:
- The question process must quickly awaken interest and motivate the participant to participate in the interview. Put the more interesting topical target questions early. Leave classification questions not used as filters or screens to the end of the survey.
- The participant should not be confronted by early requests for information that might be considered personal or ego-threatening. Put questions that might influence the participant to discontinue or terminate the questioning process near the end. Use buffer questions – neutral questions designed chiefly to establish rapport with the participant.
- The questioning process should begin with simple items and then move to the more complex, as well as move from general items to the more specific. Put taxing and challenging questions later in the questioning process. The procedure of moving from general to more specific questions is sometimes called the funnel approach. The objectives of this procedure are to learn the participant’s frame of reference and to extract the full range of desired information while limiting the distortion effect of earlier questions on later ones.
- Changes in the frame of reference should be small and should be clearly pointed out. Use transition statements between different topics of the target question set.
3) Instructions - to the interviewer or participant attempt to ensure that all participants are treated equally, thus avoiding building error into the results. Two principles form the foundation for good instructions: clarity and courtesy. Instruction topics include those for:
- Terminating an unqualified participant – defining for the interviewer how to terminate an interview when the participant does not correctly answer the screen or filter questions.
- Terminating a discontinued interview – defining for the interviewer how to conclude an interview when the participant decides to discontinue.
- Moving between questions on an instrument – defining for an interviewer or participant how to move between questions or topic sections of an instrument (skip directions) when movement is dependent on the specific answer to a question or when branched questions are used.
- Disposing of a completed questionnaire – defining for an interviewer or participant completing a self-administered instrument how to submit the completed questionnaire.
4) Conclusion - its role is to leave the participant with the impression that his or her involvement has been valuable. Subsequent researchers may need this individual to participate in new studies.
OVERCOMING INSTRUMENT PROBLEMS – there is no substitute for a thorough understanding of question wording, question content, and question sequencing issues.
However, the researcher can do several things to help improve survey results, among them:
There are abundant reasons for pretesting individual questions, questionnaires, and interview schedules:
A population element = the unit of study - the individual participant or object on which the measurement is taken.
A population = the total collection of elements about which some conclusion is to be drawn.
A census = a count of all the elements in a population. The listing of all population elements from which the sample will be drawn is called the sample frame.
Accuracy = the degree to which bias is absent from the sample. When the sample is drawn properly, the measure of behaviour, attitudes or knowledge of some sample elements will be less than the measure of those same variables drawn from the population. Also, the measure of the behaviour, attitudes, or knowledge of other sample elements will be more than the population values. Variations in these sample values offset each other, resulting in a sample value that is close to the population value.
Systematic variance = “the variation in measures due to some known or unknown influences that ‘cause’ the scores to lean in one direction more than another.” The systematic variance may be reduced by e.g. increasing the sample size.
Precision: precision of estimate is the second criterion of a good sample design. In order to interpret the findings of research, a measurement of how closely the sample represents the population is needed.
Sampling error = The numerical descriptors that describe samples may be expected to differ from those that describe populations because of random fluctuations natural to the sampling process.
Representation =The members of a sample are selected using probability or non-probability procedures.
Probability sampling is based on the concept of random selection – a controlled procedure which ensures that each population element is given a known nonzero chance of selection.
Non-probability sampling is arbitrary and subjective; when elements are chosen subjectively, there is usually some pattern or scheme used. Thus, each member of the population does not have a known chance of being included.
Element selection - samples may also be classified by how elements are selected: whether the elements are selected individually and directly from the population – viewed as a single pool – or whether additional controls are imposed.
Probability sampling - is based on the concept of random selection – a controlled procedure that assures that each population element is given a known nonzero chance of selection. Only probability samples provide estimates of precision and offer the opportunity to generalize the findings to the population of interest from the sample population.
Population parameters = summary descriptors (e.g., incidence proportion, mean, variance) of variables of interest in the population.
Sample statistics = used as estimators of population parameters. The sample statistics are the basis of conclusions about the population. Depending on how measurement questions are phrased, each may collect a different level of data. Each different level of data also generates different sample statistics.
The population proportion of incidence “is equal to the number of elements in the population belonging to the category of interest, divided by the total number of elements in the population.”.
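With invented figures, the proportion of incidence is just the count of elements in the category of interest divided by the population size:

```python
# Invented figures: 1,200 of 8,000 population elements belong to the category of interest.
category_count = 1_200
population_size = 8_000

proportion_of_incidence = category_count / population_size
print(proportion_of_incidence)  # 0.15
```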
The sampling frame = is closely related to the population. It is the list of elements from which the sample is actually drawn. Ideally, it is a complete and correct list of population members only.
Stratified random sampling = the process by which the sample is constrained to include elements from each of the segments (strata).
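A minimal sketch of proportionate stratified sampling, assuming pandas and an invented sampling frame with a ‘region’ column as the stratification variable: a simple random sample is drawn within each stratum in proportion to the stratum’s share of the population.

```python
import pandas as pd

# Invented sampling frame: 1,000 elements stratified by region.
frame = pd.DataFrame({
    "element_id": range(1_000),
    "region": ["north"] * 500 + ["south"] * 300 + ["east"] * 200,
})

sample_size = 100
fraction = sample_size / len(frame)  # proportionate allocation: same fraction in every stratum

stratified_sample = (
    frame.groupby("region", group_keys=False)
         .apply(lambda stratum: stratum.sample(frac=fraction, random_state=1))
)
print(stratified_sample["region"].value_counts())  # 50 north, 30 south, 20 east
```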
Cluster sampling = this is where the population is divided into groups of elements with some groups randomly selected for study.
Area sampling = the most important form of cluster sampling. It can be used when research involves populations that can be identified with some geographic area. This method overcomes the problems of both high sampling cost and the unavailability of a practical sampling frame for individual elements.
The theory of clustering = that the means of sample clusters are unbiased estimates of the population mean. This is more often true when clusters are naturally equal, such as households in city blocks. While one can deal with clusters of unequal size, it may be desirable to reduce or counteract the effects of unequal size.
Double sampling = It may be more convenient or economical to collect some information by sample and then use this information as the basis for selecting a subsample for further study. This procedure is called double sampling, sequential sampling, or multiphase sampling. It is usually found with stratified and/or cluster designs.
Convenience = Non-probability samples that are unrestricted are called convenience samples. They are the least reliable design but normally the cheapest and easiest to conduct. Researchers or field workers have the freedom to choose whomever they find.
Purposive sampling = A non-probability sample that conforms to certain criteria is called purposive sampling. There are two major types – judgment sampling and quota sampling:
Judgment sampling occurs when a researcher selects sample members to conform to some criterion. When used in the early stages of an exploratory study, a judgment sample is appropriate. When one wishes to select a biased group for screening purposes, this sampling method is also a good choice.
Quota sampling is the second type of purposive sampling. It is used to improve representativeness. The logic behind quota sampling is that certain relevant characteristics describe the dimensions of the population. If a sample has the same distribution on these characteristics, then it is likely to be representative of the population regarding other variables on which the researcher has no control. In most quota samples, researchers specify more than one control dimension. Each should meet two tests: (1) It should have a distribution in the population that can be estimated, and (2) be pertinent to the topic studied.
Snowball = In the initial stage of snowball sampling, individuals are discovered and may or may not be selected through probability methods. This group is then used to refer the researcher to others who possess similar characteristics and who, in turn, identify others.
The observation approach: involves observing conditions, behaviour, events, people or processes.
The communication approach: involves surveying/interviewing people and recording their responses for analysis. Communicating with people covers various topics, including participants’ attitudes, motivations, intentions and expectations.
Survey: a measurement process used to collect information during a highly structured interview – sometimes with a human interviewer and other times without.
Participant receptiveness = the participant’s willingness to cooperate.
Dealing with non-response errors - By failing to respond or refusing to respond, participants create a non-representative sample for the study overall or for a particular item or question in the study.
In surveys, non-response error occurs when the responses of participants differ in some systematic way from the responses of nonparticipants.
Response errors: occur during the interview (created by either the interviewer or participant) or during the preparation of data for analysis.
Participant-initiated error: when the participant fails to answer fully and accurately – either by choice or because of inaccurate or incomplete knowledge.
Interviewer error: response bias caused by the interviewer.
Response bias = Participants also cause error by responding in such a way as to unconsciously or consciously misrepresent their actual behaviour, attitudes, preferences, motivations, or intentions.
Social desirability bias = Participants create response bias when they modify their responses to be socially acceptable or to save face or reputation with the interviewer
Acquiescence = the tendency to be agreeable.
Noncontact rate = ratio of potential but unreached contacts (no answer, busy, answering machine, and disconnects but not refusals).
The refusal rate refers to the ratio of contacted participants who decline the interview to all potential contacts.
Random dialling: requires choosing telephone exchanges or exchange blocks and then generating random numbers within these blocks for calling.
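A small sketch of the random dialling idea, with invented three-digit exchange prefixes: an exchange block is chosen at random and the remaining digits of each number are generated at random within it.

```python
import random

random.seed(7)

exchanges = ["555", "556", "557"]  # invented exchange blocks chosen for the study area

def random_number(exchange: str) -> str:
    """Append four random digits to a chosen exchange to form a number to dial."""
    suffix = "".join(str(random.randint(0, 9)) for _ in range(4))
    return f"{exchange}-{suffix}"

call_list = [random_number(random.choice(exchanges)) for _ in range(5)]
print(call_list)
```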
A survey via personal interview is a two-way conversation between a trained interviewer and a participant.
Computer-assisted personal interviewing (CAPI): special scoring devices and visual materials are used.
Intercept interview: targets participants in centralised locations, such as shoppers in retail malls. It reduces the costs associated with travel.
Outsourcing survey services offers special advantages to managers. A professionally trained research staff, centralized location interviewing, focus group facilities and computer assisted facilities are among them.
Causal methods are research methods which answer questions such as “Why do events occur under some conditions and not under others?”
Ex post facto research designs, in which a researcher interviews respondents or observes what is or what has been, have the potential for discovering causality. The distinction is that with ex post facto designs the researcher must accept the world as it is found, whereas an experiment allows the researcher to systematically alter the variables of interest and observe what changes follow.
Experiments are studies which involve intervention by the researcher beyond what is required for measurement.
Replication = repeating an experiment with different subject groups and conditions
Field experiments = a study of the dependent variable in actual environmental conditions
Hypothesis = a relational statement as it describes a relationship between two or more variables
In an experiment, participants experience a manipulation of the independent variable, called the experimental treatment.
The treatment levels of the independent variable are the arbitrary or natural groups the researcher makes within the independent variable of an experiment. The levels assigned to an independent variable should be based on simplicity and common sense.
A control group could provide a base level for comparison. The control group is composed of subjects who are not exposed to the independent variable(s), in contrast to those who receive the experimental treatment. When subjects do not know if they are receiving the experimental treatment, they are said to be blind. When the experimenters do not know if they are giving the treatment to the experimental group or to the control group, the experiment is said to be double blind.
Random assignment to the groups is required to make the groups as comparable as possible with respect to the dependent variable. Randomization does not guarantee that if the groups were pretested they would be pronounced identical; but it is an assurance that those differences remaining are randomly distributed.
Matching may be used when it is not possible to randomly assign subjects to groups. This employs a non-probability quota sampling approach. The object of matching is to have each experimental and control subject matched on every characteristic used in the research.
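A minimal sketch of random assignment with an invented pool of subject identifiers: shuffling the pool and splitting it in half produces experimental and control groups whose remaining differences are randomly distributed rather than systematic.

```python
import random

random.seed(11)

subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # invented pool of 20 subjects

random.shuffle(subjects)                  # randomization step
midpoint = len(subjects) // 2
experimental_group = subjects[:midpoint]  # receives the experimental treatment
control_group = subjects[midpoint:]       # provides the base level for comparison

print("Experimental:", experimental_group)
print("Control:     ", control_group)
```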
Validity = whether a measure accomplishes its claims.
Internal validity = do the conclusions drawn about a demonstrated experimental relationship truly imply cause?
External validity – does an observed causal relationship generalize across persons, settings, and times? Each type of validity has specific threats a researcher should guard against.
Statistical regression = this factor operates especially when groups have been selected by their extreme scores. No matter what is done between O1 and O2, there is a strong tendency for the average of the high scores at O1 to decline at O2 and for the low scores at O1 to increase. This tendency results from imperfect measurement that, in effect, records some persons abnormally high and abnormally low at O1. In the second measurement, members of both groups score more closely to their long-run mean scores.
Experimental mortality – this occurs when the composition of the study groups changes during the test.
Attrition is especially likely in the experimental group and with each dropout the group changes. Because members of the control group are not affected by the testing situation, they are less likely to withdraw.
Diffusion or imitation of treatment = if the control group learns of the treatment (by talking to people in the experimental group) it eliminates the difference between the groups.
Compensatory equalization = where the experimental treatment is much more desirable, there may be an administrative reluctance to withdraw the control group members. Compensatory actions for the control groups may confound the experiment.
Compensatory rivalry = this may occur when members of the control group know they are in the control group. This may generate competitive pressures.
Resentful demoralization of the disadvantaged = when the treatment is desirable and the experiment is obtrusive, control group members may become resentful of their deprivation and lower their cooperation and output.
Reactivity of testing on X – the reactive effect refers to sensitising subjects via a pre-test so that they respond to the experimental stimulus (X) in a different way. This before-measurement effect can be particularly significant in experiments where the IV is a change in attitude.
Interaction of selection and X = the process by which test subjects are selected for an experiment may be a threat to external validity. The population from which one selects subjects may not be the same as the population to which one wishes to generalize results.
Static Group Comparison – the design provides for two groups, one of which receives the experimental stimulus while the other serves as a control.
Pre-test-Post-test Control Group Design – this design consists of adding a control group to the one-group pre-test-post-test design and assigning the subjects to either of the groups by a random procedure (R).
Post-test-Only Control Group Design – The pre-test measurements are omitted in this design. Pre-tests are well established in classical research design but are not really necessary when it is possible to randomize.
Non-equivalent Control Group Design – this is a strong and widely used quasi-experimental design. It differs from the pre-test-post-test control group design in that the test and control groups are not randomly assigned.
There are two varieties.
- Intact equivalent design, in which the membership of the experimental and control groups is naturally assembled. Ideally, the two groups are as alike as possible. This design is especially useful when any type of individual selection process would be reactive.
- The self-selected experimental group design is weaker because volunteers are recruited to form the experimental group, while non-volunteer subjects are used as the control. This design is likely when subjects believe it would be in their interest to be a subject in an experiment.
Separate Sample Pre-test-Post-test Design = most applicable when the researcher cannot control when and to whom to introduce the treatment but can decide when and whom to measure. This is a weaker design because several threats to internal validity are not handled adequately.
Measurement in research consists of assigning numbers to empirical events, objects or properties, or activities in compliance with a set of rules.
Mapping rules = a scheme for assigning numbers or symbols to represent aspects of the event being measured
Objects include the concepts of ordinary experience, such as touchable items like furniture. Objects also include things that are not as concrete, such as genes, attitudes, and peer-group pressures.
Properties are the characteristics of the object. A person’s physical properties may be stated in terms of weight and height.
Psychological properties: include attitudes and intelligence.
Social properties include leadership ability, class affiliation, and status. In a literal sense, researchers do not measure either objects or properties.
Dispersion: describes how scores cluster or scatter in a distribution. Nominal data are statistically weak, but they can still be useful.
Nominal scales = with these scales, a researcher is collecting information on a variable that naturally (or by design) can be grouped into two or more categories that are mutually exclusive and collectively exhaustive.
Ordinal scales = include the characteristics of the nominal scale plus an indicator of order. Ordinal data require conformity to a logical postulate: If a > b and b > c, then a > c.
Interval scales = have the power of nominal and ordinal data plus one additional strength: they incorporate the concept of equality of interval (the scaled distance between 1 and 2 equals the distance between 2 and 3).
Ratio scales = incorporate all of the powers of the previous scales plus the provision for absolute zero or origin. Ratio data represent the actual amounts of a variable. Measures of physical dimensions such as weight, height, and distance are examples.
Content Validity – of a measuring instrument is the extent to which it provides adequate coverage of the investigative questions guiding the study. If the instrument contains a representative sample of the universe of subject matter of interest, then content validity is good. To evaluate the content validity of an instrument, one must first agree on what elements constitute adequate coverage. A determination of content validity involves judgment.
Criterion-Related Validity – reflects the success of measures used for prediction or estimation. You may want to predict an outcome or estimate the existence of a current behaviour; the two differ only in time perspective.
Construct validity – in attempting to evaluate construct validity, we consider both the theory and the measuring instrument being used. If we were interested in measuring the effect of trust in cross-functional teams, the way in which ‘trust’ was operationally defined would have to correspond to an empirically grounded theory. If a known measure of trust were available, we might correlate the results obtained using this measure with those derived from our new instrument.
Reliability – has to do with the accuracy and precision of a measurement procedure. A measure is reliable to the degree that it supplies consistent results. Reliability is a necessary contributor to validity but is not a sufficient condition for validity.
Stability – a measure is said to possess stability if consistent results with repeated measurements of the same person with the same instrument can be secured.
An observation procedure is stable if it gives the same reading on a particular person when repeated one or more times.
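Stability is often summarised as a test-retest correlation between two administrations of the same instrument. A rough sketch, with scores invented purely for illustration:

```python
import numpy as np

# Hypothetical scores for the same ten respondents, measured on two occasions
time_1 = np.array([12, 15, 9, 20, 14, 18, 11, 16, 13, 17])
time_2 = np.array([13, 14, 10, 19, 15, 17, 12, 15, 12, 18])

# Pearson correlation used as a simple stability (test-retest) coefficient
r = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability: {r:.2f}")
```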
Equivalence – a second perspective on reliability considers how much error may be introduced by different investigators (in observation) or different samples of items being studied (in questioning or scales).
Internal Consistency – a third approach to reliability uses only one administration of an instrument or test to assess the internal consistency or homogeneity among the items.
The split-half technique can be used when the measuring tool has many similar questions or statements to which participants can respond.
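One common way to apply the split-half technique is to correlate the odd-item and even-item totals and then adjust the result with the Spearman-Brown formula, which estimates the reliability of the full-length instrument. The item scores below are invented:

```python
import numpy as np

# Hypothetical responses: 8 participants x 6 similar statements (1-5 agreement)
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 2, 1, 2],
    [4, 4, 5, 4, 4, 4],
    [2, 3, 2, 2, 3, 2],
    [5, 5, 4, 5, 5, 5],
])

odd_half = scores[:, 0::2].sum(axis=1)    # statements 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)   # statements 2, 4, 6

r_half = np.corrcoef(odd_half, even_half)[0, 1]
split_half = (2 * r_half) / (1 + r_half)  # Spearman-Brown step-up
print(f"half-test r = {r_half:.2f}, split-half reliability = {split_half:.2f}")
```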
Practicality = concerned with a wide range of factors of economy, convenience, and interpretability. The scientific requirements of a project call for the measurement process to be reliable and valid, while the operational requirements call for it to be practical.
Scaling = the ‘procedure for the assignment of numbers (or other symbols) to a property of objects in order to impart some of the characteristics of numbers to the properties in question.’
Ranking scales constrain the study participant to making comparisons and determining order among two or more properties (or their indicants) or objects.
A choice scale requires that participants choose one alternative over another.
Categorization asks participants to put themselves or property indicants in groups or categories.
Sorting requires that participants sort cards (representing concepts or constructs) into piles using criteria established by the researcher. The cards might contain photos or images or verbal statements of product features.
Nominal scales classify data into categories without indicating order, distance, or unique origin.
Ordinal data show relationships of more than and less than but have no distance or unique origin.
Interval scales have both order and distance but no unique origin.
Ratio scales possess all four properties: classification, order, distance, and origin. The assumptions underlying each level of scale determine how a particular measurement scale’s data will be analysed statistically.
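As an illustration of how the level of measurement restricts the statistics that can be reported (the data below are invented): the mode suits nominal data, the median suits ordinal data, and the mean requires at least interval data.

```python
import statistics

nominal = ["red", "blue", "red", "green", "red"]        # classification only -> mode
ordinal = [1, 2, 2, 3, 5]                               # order only -> median
interval_or_ratio = [36.5, 37.0, 36.8, 38.1, 36.9]      # distance/origin -> mean

print("mode:", statistics.mode(nominal))
print("median:", statistics.median(ordinal))
print("mean:", round(statistics.mean(interval_or_ratio), 2))
```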
With a uni-dimensional scale, one seeks to measure only one attribute of the participant or object.
A multidimensional scale recognizes that an object might be better described with several dimensions than on a uni-dimensional continuum.
Balanced rating scale has an equal number of categories above and below the midpoint. Generally, rating scales should be balanced, with an equal number of favourable and unfavourable response choices.
Unbalanced rating scale has an unequal number of favourable and unfavourable response choices.
Unforced-choice rating scale provides participants with an opportunity to express no opinion when they are unable to make a choice among the alternatives offered.
Forced-choice scale requires that participants select one of the offered alternatives. Researchers often exclude the response choice ‘no opinion’, ‘don’t know’, or ‘neutral’ when they know that most participants have an attitude on the topic.
Halo effect = the systematic bias that the rater introduces by carrying over a generalized impression of the subject from one rating to another. Halo is especially difficult to avoid when the property being studied is not clearly defined, is not easily observed, is not frequently discussed, involves reactions with others, or is a trait of high moral importance.
Simple category scale (also called a dichotomous scale) offers two mutually exclusive response choices. These may be ‘yes’ and ‘no’, ‘important’ and ‘unimportant’.
When there are multiple options for the rater but only one answer is sought, the multiple-choice, single-response scale is appropriate.
Likert scale is the most frequently used variation of the summated rating scale. Summated rating scales consist of statements that express either a favourable or an unfavourable attitude toward the object of interest. The participant is asked to agree or disagree with each statement. Each response is given a numerical score to reflect its degree of attitudinal favourableness, and the scores may be summed to measure the participant’s overall attitude.
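Scoring a summated (Likert-type) scale is mostly bookkeeping: assign 1-5 to each response, reverse-score the unfavourably worded statements, and sum. A minimal sketch with invented items (which items are reverse-scored is assumed here):

```python
# Hypothetical five-point responses (1 = strongly disagree ... 5 = strongly agree)
responses = {"item_1": 4, "item_2": 2, "item_3": 5, "item_4": 1}

# Statements worded unfavourably must be reverse-scored before summing
reverse_scored = {"item_2", "item_4"}

total = sum(
    (6 - score) if item in reverse_scored else score
    for item, score in responses.items()
)
print("summated attitude score:", total)   # higher = more favourable attitude
```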
Item analysis assesses each item based on how well it discriminates between those persons whose total score is high and those whose total score is low. The mean scores for the high-score and low-score groups are then tested for statistical significance by computing t values. After the t values for each statement have been found, the statements are rank-ordered, and those with the highest t values are selected.
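A sketch of item analysis under these assumptions: high- and low-scoring groups are defined as the top and bottom quartiles on the total score, the ratings are invented, and scipy's independent-samples t-test is used for convenience.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
item_scores = rng.integers(1, 6, size=(60, 5))   # hypothetical 1-5 ratings: 60 people x 5 statements
totals = item_scores.sum(axis=1)

# High and low groups: top and bottom 25% on the summated score
high = item_scores[totals >= np.percentile(totals, 75)]
low = item_scores[totals <= np.percentile(totals, 25)]

for i in range(item_scores.shape[1]):
    t, _ = stats.ttest_ind(high[:, i], low[:, i])
    print(f"statement {i + 1}: t = {t:.2f}")
# Statements with the highest t values discriminate best and are retained.
```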
The semantic differential (SD) scale measures the psychological meanings of an attitude object using bipolar adjectives. Researchers use this scale for studies such as brand and institutional image.
Numerical/multiple rating list scales have equal intervals that separate their numeric scale points. The verbal anchors serve as the labels for the extreme points.
Stapel scale = used as an alternative to the semantic differential, especially when it is difficult to find bipolar adjectives that match the investigative question.
Constant-sum scales = a scale that helps the researcher discover proportions.
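A constant-sum response is typically validated against its fixed total and then converted to proportions; a small sketch with hypothetical attributes and point allocations:

```python
# Hypothetical constant-sum allocation: 100 points spread over four attributes
allocation = {"price": 40, "quality": 30, "service": 20, "design": 10}

assert sum(allocation.values()) == 100, "allocations must sum to the fixed total"

proportions = {attribute: points / 100 for attribute, points in allocation.items()}
print(proportions)   # the proportions the researcher set out to discover
```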
Graphic rating scales – the scale was originally created to enable researchers to discern fine differences. Theoretically, an infinite number of ratings are possible if participants are sophisticated enough to differentiate and record them.
Participants are instructed to mark their response at any point along a continuum. Usually, the score is a measure of length (millimetres) from either endpoint. The results are treated as interval data. The difficulty is in coding and analysis. This scale requires more time than scales with predetermined categories.
Ranking scales – the participant directly compares two or more objects and makes choices among them.
Arbitrary scales are designed by collecting several items that are unambiguous and appropriate to a given topic. These scales are not only easy to develop, but also inexpensive and can be designed to be highly specific. Moreover, arbitrary scales provide useful information and are adequate if developed skilfully.
Consensus scaling requires items to be selected by a panel of judges, who evaluate them on their relevance to the topic, their potential for ambiguity, and the level of attitude they represent.
Scalogram analysis = a procedure for determining whether a set of items forms a uni-dimensional scale.
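A rough sketch of the idea behind scalogram (Guttman) analysis, assuming dichotomous items ordered from easiest to hardest to endorse and using invented responses; deviations from the perfect cumulative pattern are counted and expressed as a coefficient of reproducibility.

```python
import numpy as np

# Hypothetical 0/1 responses; columns ordered from easiest to hardest item
data = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],   # one deviation from the ideal cumulative pattern
])

errors = 0
for row in data:
    k = int(row.sum())                                   # scale score for this respondent
    ideal = np.array([1] * k + [0] * (len(row) - k))     # perfect Guttman pattern for that score
    errors += int(np.sum(row != ideal))

reproducibility = 1 - errors / data.size
print(f"coefficient of reproducibility = {reproducibility:.2f}")  # ~0.90 or higher suggests a scalable item set
```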
Factor scales include a variety of techniques that have been developed to address two problems: how to deal with a universe of content that is multidimensional, and how to uncover underlying dimensions that have not yet been identified.
A disguised question = designed to conceal the question’s true purpose.
Administrative questions – identify the participant, interviewer, interview location, and conditions. These questions are rarely asked of the participant but are necessary for studying patterns within the data and identifying possible error sources.
Classification questions – usually cover sociological-demographic variables that allow participants’ answers to be grouped so that patterns are revealed and can be studied.
Target questions (structured or unstructured) – address the investigative questions of a specific study. These are grouped by topic in the survey. Target questions may be structured (they present the participants with a fixed set of choices, often called closed questions) or unstructured (they do not limit responses but do provide a frame of reference for participants’ answers; sometimes referred to as open-ended questions).
Response strategy - a third major area in question design is the degree and form of structure imposed on the participant.
The various response strategies offer options that include unstructured response (or open-ended response, the free choice of words) and structured response (or closed response, specified alternatives provided).
Free-response questions - also known as open-ended questions, ask the participant a question and either the interviewer pauses for the answer (which is unaided) or the participant records his or her ideas in his or her own words in the space provided on a questionnaire.
Dichotomous question – suggests opposing responses (yes/no) and generates nominal data.
Checklist – when multiple responses to a single question are required, the question should be asked in one of three ways: the checklist, rating, or ranking strategy. If relative order is not important, the checklist is the logical choice. Checklists are more efficient than asking for the same information with a series of dichotomous selection questions, one for each individual factor. Checklists generate nominal data.