# Understanding data: distributions, connections and gatherings

## In short: Data

• Data is any collection of facts, statistics, or information that can be used for analysis or decision-making. It can be raw or processed, and it can be in the form of numbers, text, images, or sounds.

Knowing what sort of data you have is essential for conducting successful statistical tests. Here you will find general introductory information about the types of data you will most likely encounter in your research.

## Collecting data

Data collection can be subdivided into three groups:

• Observational measurements: behavior is observed directly. This is possible in every study in which the behavior to be examined can be seen. Researchers can observe the behavior themselves, or they can make audio or video recordings from which information about the participants can be deduced. In observational studies, the dependent and independent variable(s) of interest are not manipulated. Therefore, no claims can be made about cause-and-effect relationships between variables.

• Physical measurements: these are used when the researcher is interested in the relation between behavior and physical processes that are not directly observable. These are processes of the human body that often cannot be seen with the naked eye, for example heart rate, sweating, brain activity and hormonal changes.

• Self-report measurements: participants answer questions themselves, in questionnaires or interviews. There are three kinds of self-report measures: 1) cognitive: these measure what people think, 2) affective: these measure what people feel, and 3) behavioral: these measure what people do.

## Quantitative versus qualitative data

In statistics, a subdivision is made into quantitative and qualitative data. Quantitative data results from a certain measurement, for example the grade on a test, weight or the scores on a scale. A measurement instrument is used to determine how much a certain characteristic is present in an object.

Qualitative data is also called frequency or categorical data. It results from sorting objects into categories. For example, 15 people are categorized as ‘very anxious’, 33 people as ‘neutral’ and 12 people as ‘slightly anxious’. The data then consists of the frequencies for each category.

## Associations

Most research is conducted to discover or examine associations between variables, for example the relation between sleeping habits and school achievement.

The first research technique to examine relations is the correlational method. With the correlational method, the researcher observes two variables to discover whether there is a relation between them.

The experimental method is used when the researcher is interested in the cause-and-effect relation between variables, in which a change in one variable causes a change in another variable. This method has two essential characteristics. First, there is manipulation: the researcher changes the values of a variable (X) and then measures the values of a second variable (Y) to see whether changes in X influence the values of Y. Second, there is control: the researcher keeps the research situation constant. When all other variables and conditions are kept constant, the researcher can claim that changes in Y are caused by X and not by another variable.

It is important to be aware of the distinction between correlation and causation. A correlation implies that there is a relation between variables, but it does not tell us anything about the direction of the effect. Hence, you cannot say that changes in one variable are caused by the other variable. Three conditions have to be met in order to make statements about causality:

1. Covariance: the variables should covary. For example, high scores on the x-variable should go together with high scores on the y-variable.

2. Direction: the cause should precede the effect.

3. Exclusion of the influence of other variables: it has to be ruled out that a third variable (z) influences both x and y.

## Interpreting and displaying raw data

### Frequency distributions, proportions and intervals

When participants are measured, the obtained data are called raw data. Raw data are only a collection of numbers and are difficult to interpret, so steps have to be taken to process them. Structure can be added by, for example, displaying the data in a graph.

When reaction times are measured, one can for example make a frequency distribution. In a frequency distribution, you note how often a certain value (here: reaction time) occurred. This helps you to see which value occurred most frequently.

Describing proportions and percentages is also useful in a frequency distribution. A proportion is calculated by dividing the frequency that belongs to a certain X-value by the total number of participants. For example, if two people in a class of 20 scored a six (X=6), the proportion for the score six is 2/20 = 0.10. The formula is:

$p = \frac{f}{N}$

• p: proportion for the score
• f: frequency of the score
• N: total number of participants or observations

Because proportions are always calculated relative to the total number of participants or observations (N), we call them relative frequencies. Percentages can be obtained by multiplying proportions by 100. Thus:

$p_{(100)} = \frac{f}{N}\times 100\%$

• p(100): percentage for the score
• f: frequency of the score
• N: total number of participants or observations
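The two formulas above can be sketched in a few lines of Python. This is only an illustration: the `scores` list is made-up data for a class of 20, and `frequency_table` is a hypothetical helper name, not something from the text. The data are chosen so that the score 6 occurs twice, as in the example above.

```python
from collections import Counter

# Hypothetical test scores for a class of 20 students (made up for illustration).
scores = [4, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10]


def frequency_table(values):
    """Return {score: (frequency, proportion, percentage)}, ordered by score."""
    n = len(values)
    counts = Counter(values)
    return {x: (f, f / n, f / n * 100) for x, f in sorted(counts.items())}


table = frequency_table(scores)
for x, (f, p, pct) in table.items():
    print(f"X={x:>2}  f={f}  p={p:.2f}  percentage={pct:.0f}%")
```

For X = 6 this gives f = 2 and p = 2/20 = 0.10, which corresponds to 10%. The proportions of all scores together add up to 1.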

Sometimes, many different scores are possible. In that case, it is better to make a grouped frequency distribution. Here, we make groups of scores instead of only looking at individual values. The groups (or intervals) are called class intervals. Instead of noting each possible length, for example, you make groups of length intervals, such as a group with the interval of 100 to 120 cm and a group with the interval of 121 to 140 cm. For each group, you note the frequency, and the proportion or percentage is computed in the same way as for individual scores.

Example:

$p^{121-140}_{(100)} = \frac{f^{121-140}}{N}\times 100\%$
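As a rough sketch of a grouped frequency distribution, the snippet below counts how many values fall into each class interval and converts the counts to percentages. The `lengths` data and the helper name `grouped_percentages` are assumptions made for illustration. The two intervals are the ones mentioned above, with both bounds treated as inclusive.

```python
# Hypothetical body lengths in cm (made up for illustration).
lengths = [102, 105, 110, 118, 120, 121, 125, 130, 133, 140]

# Class intervals as (lower, upper) bounds, both inclusive.
intervals = [(100, 120), (121, 140)]


def grouped_percentages(values, class_intervals):
    """Return {(lower, upper): (frequency, percentage)} for each class interval."""
    n = len(values)
    result = {}
    for lower, upper in class_intervals:
        f = sum(1 for v in values if lower <= v <= upper)
        result[(lower, upper)] = (f, f / n * 100)
    return result


grouped = grouped_percentages(lengths, intervals)
for (lower, upper), (f, pct) in grouped.items():
    print(f"{lower}-{upper} cm: f={f}, percentage={pct:.0f}%")
```

With these made-up data, each interval contains 5 of the 10 observations, so each accounts for 50%.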

### Graphs

A frequency distribution can be displayed well in a figure, called a graph. An example is a histogram. The horizontal axis is called the x-axis, and the vertical axis is called the y-axis. The categories are displayed on the horizontal axis, and the frequencies are displayed on the vertical axis. To make a histogram, bars are drawn whose height corresponds to the frequency of each category. A bar chart is in principle similar to a histogram, except that the bars are not placed directly next to each other. Values that deviate strongly from the other values are also visible in these graphs. Such values are called outliers and are often (but not always) not useful.

Besides bars, a curve can also be fitted to the obtained data. The most frequently used curve is the normal curve. It is highest in the middle of the distribution and decreases symmetrically on both sides of the middle. The normal distribution is symmetric, but not every distribution looks like this. A bimodal distribution, for example, has two peaks. A distribution with only one peak is called unimodal. A distribution can also be asymmetric, because it is longer on one of its sides. A distribution with a ‘tail’ to the left has a negative skewness, and a distribution with a tail to the right has a positive skewness.

Besides histograms and bar charts, one can also use stem-and-leaf plots. In such plots, each score is subdivided into two parts. The first digit (for example the 1 in 12) is called the stem, and the second digit (for example the 2 in 12) is called the leaf. When you draw the plot, first note all stems. Next, note the leaf of each score next to its stem. A stem-and-leaf plot lets you quickly find individual scores, which may be useful for calculations. This is not possible with a grouped frequency distribution, in which the individual scores are no longer visible.
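A stem-and-leaf plot is simple enough to build by hand, and the sketch below shows the same idea in Python. It assumes two-digit scores, so the tens digit is the stem and the units digit is the leaf. The data and the function name `stem_and_leaf` are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Group two-digit scores by tens digit (stem) and units digit (leaf)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical scores (made up for illustration).
scores = [12, 15, 21, 23, 23, 28, 30, 34]
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each row starts with a stem followed by its leaves, so the row `2 | 1 3 3 8` stands for the scores 21, 23, 23 and 28. Every individual score can still be read back from the plot.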

### Percentiles

Individual scores are called raw scores. However, these scores do not provide much information on their own. For example, if you tell someone you scored 76 points out of 90 on your exam, it is not clear to the other person whether this is good or bad. To be able to interpret such a score, it should be clear how it relates to the other scores in the distribution.

The rank or percentile rank is a number that indicates what percentage of all individuals in the distribution scored below a certain value (in this example: 76 points). The score itself is then called a percentile. The percentile rank thus refers to a percentage, whilst the percentile refers to a score. If you know that your score is in the 90th percentile, that means you scored better than 90% of the people who took the test.

To determine percentiles and percentile ranks, it first has to be examined how many individuals score at or below each value. The results are called cumulative percentages. They show what percentage of individuals score at or below a certain X-value, and they reach 100% at the highest value of X.

An easy way to use percentiles is by means of quartiles. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (thus, the median) and the third quartile (Q3) is the 75th percentile. The distance between the first and third quartile is called the interquartile range (IQR). A distance of 1.5 times the IQR above Q3 or below Q1 is a criterion to identify possible outliers.

All these data can be displayed in a boxplot. The so-called ‘box’ runs from the first to the third quartile, and the median is displayed in the box by a horizontal line. In addition, a vertical line through the box runs from the lowest to the highest observation that is not an outlier. Outliers are displayed with an asterisk above or below the line.
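The following sketch shows how a percentile rank and the 1.5 × IQR criterion could be computed. Several things here are assumptions: the `exam` scores are made up, the function names are hypothetical, and the percentile rank is taken as the percentage of scores strictly below the given score (some textbooks use "at or below" instead). Quartile conventions also differ between textbooks and software. The `inclusive` method of `statistics.quantiles` is just one possible choice.

```python
import statistics


def percentile_rank(values, score):
    """Percentage of values that fall strictly below the given score."""
    below = sum(1 for v in values if v < score)
    return below / len(values) * 100


def outlier_fences(values):
    """Return (Q1, Q3, lower fence, upper fence) using the 1.5 * IQR criterion."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical exam scores (made up); 120 is deliberately extreme.
exam = [52, 58, 61, 64, 67, 70, 73, 76, 79, 120]

rank_76 = percentile_rank(exam, 76)
q1, q3, lower, upper = outlier_fences(exam)
outliers = [v for v in exam if v < lower or v > upper]

print(f"percentile rank of 76: {rank_76:.0f}%")
print(f"Q1={q1}, Q3={q3}, fences=({lower}, {upper}), outliers={outliers}")
```

With these data, seven of the ten scores lie below 76, so its percentile rank is 70%. Only the score 120 falls outside the fences and would be drawn as a separate point in a boxplot.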

## Central tendency

Measures of central tendency indicate where on the scale the distribution is centered. There are three such measures: the mode, the median and the mean. They differ in the amount of data they use.

1. Mode: the mode is used least frequently and is often the least useful. It is simply the most frequently occurring score. If two adjacent scores occur equally often and more often than all other scores, the mean of these two scores is taken as the mode.

2. Median: the score that corresponds to the point below which 50% of all scores fall when the data are ordered numerically. Therefore, the median is also called the 50th percentile. Imagine that we have the scores 4, 6, 8, 9, and 16. Here, the median is 8. In case of an even number of scores, for example 4, 6, 8, 12, 15, and 16, the median falls between 8 and 12. In this case, we take the mean of the two middle scores as the median. Thus, the median is 10 in this case. A useful formula to find the position of the median in the ordered data is that of the median location:

$Median\:location = \frac{(N+1)}{2}$

• N: number of scores

3. Mean: this measure of central tendency is used most frequently, because all scores of the distribution are included. The mean is the sum of the scores divided by the total number of scores. A disadvantage of the mean is that it is influenced by extreme scores. Therefore, the ‘trimmed’ mean is sometimes used, in which a fixed number or percentage of scores at both ends of the distribution is excluded, for example the lowest and highest 10%. As a result, the more extreme scores are left out and the estimate of the mean becomes more stable. Formula of the mean:

$Mean = \frac{\sum x}{N}$

• x: individual score of x
• N: number of scores
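The three measures can be checked with Python's standard `statistics` module, using the example scores from the median description above. The `median_location` and `trimmed_mean` helpers are hypothetical names introduced here. Trimming by a fixed number of scores `k` at each end is only one possible way to define a trimmed mean.

```python
import statistics

odd_scores = [4, 6, 8, 9, 16]
even_scores = [4, 6, 8, 12, 15, 16]


def median_location(n):
    """Position of the median in the ordered data: (N + 1) / 2."""
    return (n + 1) / 2


def trimmed_mean(values, k):
    """Mean after dropping the k lowest and the k highest scores."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(median_location(len(odd_scores)))   # 3.0, so the third ordered score
print(statistics.median(odd_scores))      # 8
print(median_location(len(even_scores)))  # 3.5, so between the 3rd and 4th score
print(statistics.median(even_scores))     # 10.0
print(statistics.mean(odd_scores))        # 8.6
print(trimmed_mean(odd_scores, 1))        # mean of 6, 8 and 9
print(statistics.mode([4, 6, 6, 8]))      # 6
```

The extreme score 16 pulls the mean (8.6) above the median (8). Dropping one score at each end brings the trimmed mean back below 8.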

## Measuring variability

The variability of a distribution refers to the extent to which scores are spread or clustered. Variability provides a quantitative value to the extent of difference between scores. A large value refers to high variability. The aim of measuring variability is twofold:

1. Describing the distance that can be expected between scores;

2. Measuring how representative a score is of the whole distribution.

The range of a measurement is the distance between the highest and the lowest score: the lowest score is subtracted from the highest score. However, the range can give a distorted picture when extreme values are present. The disadvantage of the range is thus that it does not take all values into account, but only the two most extreme ones.
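A small sketch makes the weakness of the range visible. The data and the function name `value_range` are made up for illustration. Adding a single extreme score changes the range drastically, even though all other scores stay the same.

```python
def value_range(values):
    """Range: highest score minus lowest score."""
    return max(values) - min(values)


# Hypothetical scores (made up for illustration).
typical = [4, 6, 8, 9, 16]
with_extreme = typical + [60]

print(value_range(typical))       # 16 - 4 = 12
print(value_range(with_extreme))  # 60 - 4 = 56
```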

## Variance and standard deviation

The standard deviation (SD) is the most frequently used and most important measure of spread. It uses the mean of the distribution as a reference point and is based on the distance between the individual scores and the mean of the data set. With the standard deviation, you can check whether individual scores are in general far from or close to the mean. The standard deviation can best be understood by means of four steps:

1. First, the deviation of each individual score from the mean has to be calculated. The deviation is the difference between an individual score and the mean of the variable. The formula is:

$Deviation\: score = x - \mu$

• x: individual score of x
• μ: mean of the variable

2. In the next step, the mean of the deviation scores is calculated by adding all deviation scores and dividing the sum by the number of deviation scores (N). However, the deviation scores always add up to zero, because the positive and negative deviations cancel each other out. This mean is therefore always zero and says nothing about spread. For that reason, each deviation score is placed between brackets and squared before the mean is computed.

$mean\:of\:the\:deviation\:scores = \frac{\sum{(x-\mu)}}{N} = 0$

• x: individual score of x
• μ: mean of the variable
• N: number of scores

3. Next, the mean of the squared deviations can be computed. This is called the variance. The formula of the variance is:

$\sigma^2 = \frac{\sum{(x-\mu)^{2}}}{N}$

• σ²: variance (the mean of the squared deviations)
• x: individual score of x
• μ: mean of the variable
• N: number of scores

4. Finally, take the square root of the variance. The result is the standard deviation. The final formula for the standard deviation is thus:

$\sigma = \sqrt{\frac{\sum{(x-\mu)^{2}}}{N}}$

• σ: standard deviation
• x: individual score of x
• μ: mean of the variable
• N: number of scores

Often, the variance is a large number that is hard to interpret, because it is expressed in squared units. It is therefore more useful and easier to understand to compute and present the standard deviation, which is expressed in the same units as the original scores.
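The four steps can be followed one by one in a short Python sketch. The `scores` list is made-up data, chosen so that the numbers come out round. The result is compared with `statistics.pvariance` and `statistics.pstdev`, which divide by N just like the formulas above.

```python
import math
import statistics

# Hypothetical scores (made up so that the results are round numbers).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)
mu = sum(scores) / n  # mean of the variable: 40 / 8 = 5.0

# Step 1: deviation of each score from the mean.
deviations = [x - mu for x in scores]

# Step 2: the deviations always sum to zero, so their mean is zero too.
mean_deviation = sum(deviations) / n

# Step 3: variance = mean of the squared deviations.
variance = sum(d ** 2 for d in deviations) / n

# Step 4: standard deviation = square root of the variance.
sd = math.sqrt(variance)

print(mean_deviation, variance, sd)  # 0.0 4.0 2.0

# Cross-check against the standard library (population versions, dividing by N).
assert math.isclose(variance, statistics.pvariance(scores))
assert math.isclose(sd, statistics.pstdev(scores))
```

The variance of 4 is in squared units, whereas the standard deviation of 2 is in the same units as the scores themselves.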

In a sample with n scores of which the mean is known, the first n-1 scores can vary freely, but the last score is then fixed. The sample therefore has n-1 degrees of freedom (in short: df).
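The idea that the last score is fixed can be illustrated with a small sketch. The numbers and the helper name `last_score` are made up. The second half goes slightly beyond the text: it relies on the fact that `statistics.variance` in Python's standard library divides by n - 1, whereas `statistics.pvariance` divides by N as in the formulas above.

```python
import statistics


def last_score(known_scores, sample_mean, n):
    """Return the one remaining score implied by the mean of n scores."""
    return sample_mean * n - sum(known_scores)


# Four scores chosen freely; with a sample mean of 6 and n = 5 the fifth is fixed.
free_scores = [3, 7, 8, 5]
fixed = last_score(free_scores, sample_mean=6, n=5)
print(fixed)  # 30 - 23 = 7

sample = free_scores + [fixed]
sample_variance = statistics.variance(sample)       # divides by n - 1 (df)
population_variance = statistics.pvariance(sample)  # divides by N
print(sample_variance, population_variance)
```

Here the sum of squared deviations is 16, so dividing by the 4 degrees of freedom gives 4, while dividing by all 5 scores gives 3.2.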

### Systematic variance and error variance

The total variance can be subdivided into 1) systematic variance and 2) error variance.

• Systematic variance refers to that part of the total variance that can predictably be related to the variables that the researcher examines.

• Error variance emerges when the behavior of participants is influenced by variables that the researcher does not examine (did not include in his or her study) or by measurement error (errors made during the measurement). For example, in a study on the effect of temperature on aggression, someone's high aggression score may be explained by his or her bad mood instead of the temperature. This form of variance cannot be predicted in the study. The more error variance is present in a data set, the harder it is to determine whether the manipulated variables (independent variables) are actually related to the behavior one wants to examine (the dependent variable). Therefore, researchers try to minimize the error variance in their study.
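One common way to make this subdivision concrete is to split the total sum of squared deviations into a part between conditions and a part within conditions. The sketch below does this for the temperature and aggression example. Note the assumptions: the scores are made up, and treating the between-condition part as systematic and the within-condition part as error is an interpretation borrowed from analysis of variance, not something the text specifies.

```python
def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)


# Hypothetical aggression scores in two temperature conditions (made up).
groups = {"cool": [2, 3, 4, 3], "warm": [5, 6, 7, 6]}

all_scores = [v for scores in groups.values() for v in scores]
grand_mean = mean(all_scores)

# Total variability: every score compared with the grand mean.
ss_total = sum_of_squares(all_scores, grand_mean)

# Systematic part: condition means compared with the grand mean.
ss_systematic = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())

# Error part: every score compared with the mean of its own condition.
ss_error = sum(sum_of_squares(s, mean(s)) for s in groups.values())

print(ss_total, ss_systematic, ss_error)  # 22.0 18.0 4.0
```

The two parts add up to the total (18 + 4 = 22). In this made-up example most of the variability is related to the temperature condition, and only a small part is left unexplained.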
