Chapter 2 - How can data be described and explored?

## How can data be plotted?

In this chapter, we show how data can be structured so that it is easier to interpret. Raw data is just a collection of numbers. One way to add structure is to represent the data in a graph. For example, an experiment is conducted to examine how people retrieve numbers from memory. One, three, or five numbers are shown to the participants. Then the participants are shown a single number and asked to press the red button if that number was not part of the series shown before, and the green button if it was. The researchers measure the reaction times. To structure the data, the frequency distribution of the reaction times could be reported in a table, which indicates how often each reaction time (or interval of reaction times) occurred.

## What are the characteristics of a histogram?

The frequency distribution can be depicted in a figure, for example a histogram. In our example, the vertical axis represents the frequency and the horizontal axis represents the reaction times. To remove random "noise", which probably has no meaning, frequencies can be merged into blocks of 5/100 of a second. The most important trends in the data are preserved this way. When using merged frequencies, also called intervals, the real lower limit and the real upper limit are the decimal values that fall halfway between the top of one interval and the bottom of the next. For example, the interval 35-39 comprises all values between 34.5 (real lower limit) and 39.5 (real upper limit).

To clarify a graph or table, one could also show only the midpoints: the mean of the upper and lower limit of an interval. When using intervals, it is recommended to use between 10 and 12 intervals and to use natural breakpoints wherever possible, for example 0-9, 10-19, and so on. In our example, one participant obtained a value of 125/100 of a second, a very slow reaction time in comparison to the other values. This value is called an outlier, because it deviates strongly from the rest of the data. Often (but not always!), outliers are mistakes in the data.
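The grouping into intervals described above can be sketched in Python. This is only an illustration: the reaction times are made-up values (not the data of the experiment), and the function name `grouped_frequencies` is hypothetical.

```python
from collections import Counter

def grouped_frequencies(scores, width, start):
    """Count integer scores per interval of `width`, beginning at `start`.

    Assumes every score is an integer that is at least `start`.
    Returns tuples (lower, upper, real_lower, real_upper, midpoint, count).
    """
    counts = Counter((s - start) // width for s in scores)
    rows = []
    for k in range(max(counts) + 1):
        lower = start + k * width          # lowest score in the interval
        upper = lower + width - 1          # highest score in the interval
        real_lower = lower - 0.5           # halfway to the previous interval
        real_upper = upper + 0.5           # halfway to the next interval
        midpoint = (lower + upper) / 2     # mean of the two limits
        rows.append((lower, upper, real_lower, real_upper, midpoint,
                     counts.get(k, 0)))
    return rows

# Hypothetical reaction times in hundredths of a second (illustration only).
reaction_times = [36, 38, 41, 43, 44, 47, 52, 55, 39, 35, 125]

rows = grouped_frequencies(reaction_times, width=5, start=35)
for lower, upper, real_lower, real_upper, midpoint, count in rows:
    if count:  # skip empty intervals to keep the printout short
        print(f"{lower}-{upper}  real limits {real_lower}-{real_upper}  "
              f"midpoint {midpoint}  frequency {count}")
```

In this sketch the single score of 125 ends up in an interval of its own, far away from the others, which is how an outlier typically shows up in a grouped frequency table.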

## When are smooth lines used to plot the data?

Histograms make it easy to plot data, but they have some disadvantages. For example, small samples are not depicted clearly, because small changes in the interval width can produce large changes in the shape of the distribution. To better illustrate such data, smooth lines can be used.

### A normal curve

The normal curve is also called the bell curve because of its shape: it is highest in the middle of the distribution and tapers off at both ends. The normal distribution has a specific definition, which will be discussed in more detail in the next chapter.

### Kernel density plots

For a normal curve, only a few characteristics of the data are used, namely the mean and the standard deviation. The actual shape of the data is not taken into account, so individual scores are not considered in the plot. The kernel density plot does almost the opposite: it uses the individual scores instead of the mean and standard deviation. The idea is that each observation might have been slightly different, because every observation contains a certain amount of random error. A score of 80/100 could also have been 79/100 or 85/100. Each score therefore gets its own small distribution of possible scores. The kernel density plot merges all these separate normal distributions into one curve to better represent the true distribution.
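As a rough sketch of this idea, the code below places one small normal curve on every observation and averages them. The scores and the width of the small curves (the `bandwidth`) are assumptions chosen for illustration; statistical software normally selects the bandwidth automatically.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of a normal curve with the given mean and sd at point x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def kernel_density(x, scores, bandwidth):
    """Average of one small normal curve per observation, evaluated at x."""
    return sum(normal_pdf(x, s, bandwidth) for s in scores) / len(scores)

# Hypothetical reaction times (hundredths of a second), illustration only.
scores = [62, 65, 70, 79, 80, 85]
bandwidth = 4.0  # assumed spread of each small curve

# Crude text plot: more '#' characters means a higher estimated density.
for x in range(55, 96, 5):
    density = kernel_density(x, scores, bandwidth)
    print(f"{x:3d} {'#' * round(density * 400)}")
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual scores more closely.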

## What are the benefits of a stem-and-leaf display?

The drawback of histograms is that they make use of intervals and thus lose the actual numerical values of the individual scores. The drawback of frequency distributions is that they retain the individual scores but do not summarize the data sufficiently. The stem-and-leaf display aims to overcome both drawbacks.

To explain this method, we use a hypothetical data set in which we recorded the amount of time (in minutes per week) that each of 100 students spends playing electronic games. Table 1 displays part of the data, namely the scores between 40 and 80 minutes per week.

Table 1.

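| Raw data | Stem | Leaf |
|---|---|---|
| … | 0 | 00000000000233566678 |
| … | 1 | 2223555579 |
| … | 2 | 33577 |
| … | 3 | 22278999 |
| 40 41 41 42 43 43 44 46 46 46 47 48 49 49 | 4 | 01123346667899 |
| 52 54 55 55 57 58 59 59 | 5 | 24557899 |
| 63 67 | 6 | 37 |
| 71 75 75 76 76 78 79 | 7 | 1556689 |
| … | 8 | 34779 |
| … | 9 | 466 |
| … | 10 | 23677 |
| … | 11 | 3479 |
| … | 12 | 2557899 |
| … | 13 | 89 |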

We refer to the tens digits (here the 4, 5, 6, and 7) as the leading digits, also called the most significant digits. These digits form the stem, or the vertical axis, of the display. Of the 14 scores in the 40s, there was one 40, two 41s, one 42, and so on. The units digits (0, 1, 2, and so on) are the trailing digits, or the less significant digits. They form the leaves. From the table, you can deduce how many students played for a specific amount of time. For example, 11 students played 0 minutes per week, 1 student played 2 minutes, and so on. Read together, the rows of the "Leaf" column form a histogram lying on its side.

Sometimes the grouping is too coarse for the purpose. A solution is to make the intervals of the stem smaller. Instead of merging all numbers from 40 to 49, one could display intervals of 40-41, 42-43, 44-45, and so on (see Table 2). The stem column then receives labels such as 4, 4t, and 4f ('t' representing 'two' and 'three', and 'f' representing 'four' and 'five'). The only requirement is that the intervals are of equal size.

Table 2.

| Raw data | Stem | Leaf |
|---|---|---|
| 42 42 42 43 43 43 43 43 | 4t | 22233333 |
| 44 44 44 45 45 | 4f | 44455 |

Stem-and-leaf displays are particularly useful when two different distributions are to be compared. Such a comparison is accomplished by plotting the two distributions on opposite sides of the stem.
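A minimal sketch of how such a display can be built: the tens digit of each score becomes the stem and the units digit is appended to that stem's row of leaves. The function name `stem_and_leaf` is hypothetical; the scores are the raw values between 40 and 80 listed in Table 1.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Map each tens digit (stem) to the string of units digits (leaves)."""
    leaves = defaultdict(str)
    for score in sorted(scores):
        leaves[score // 10] += str(score % 10)
    return dict(leaves)

# The raw scores between 40 and 80 minutes per week from Table 1.
scores = [40, 41, 41, 42, 43, 43, 44, 46, 46, 46, 47, 48, 49, 49,
          52, 54, 55, 55, 57, 58, 59, 59, 63, 67,
          71, 75, 75, 76, 76, 78, 79]

display = stem_and_leaf(scores)
for stem in sorted(display):
    print(f"{stem:2d} | {display[stem]}")
```

Because every leaf is kept, the original scores can be read back from the display, while the length of each row still shows the shape of the distribution.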

## Which terms can be used to describe the data?

A normal distribution is symmetric: it has the same shape on both sides of the center. In addition, it is unimodal, which means that it has only one peak. In contrast, a bimodal distribution has two peaks. Modality refers to the number of peaks of a distribution. When a distribution has a tail on only one side, it is called asymmetric. When a distribution has a tail going out to the left, it is negatively skewed. When a distribution has a tail going out to the right, it is positively skewed.

The last characteristic of a distribution is kurtosis, a measure describing the relative concentration of scores in the center, the upper and lower ends (tails), and between the center and the tails (shoulders) of the distribution.

1. Mesokurtic: a normal distribution (peak in the middle).
2. Platykurtic: a distribution with a flatter curve (more scores between the tails and the center).
3. Leptokurtic: a distribution with a steeper peak and thicker tails (more scores in the center and tails).

The distribution of scores becomes visible only when the sample size is big enough.

## What is the notation system for the field of statistics?

In statistics, there is no standard notation system (yet). In this book, we use a simple notation system. Even though this system may be less precise, it makes the material easier to understand.

### Notation of variables

In general, an uppercase letter, often X or Y, represents a variable as a whole. A subscript then represents an individual value of that variable. To refer to an individual score without selecting a specific score, you can use $X_i$.

### Summation notation

One of the most frequently used symbols is the capital sigma ($\Sigma$), the standard symbol for summation. It literally means: 'sum up whatever follows'. For instance, $\sum X_i$ means that you should sum all values of $X_i$. Be aware that $\sum X^2$ means that you square each X and then add the squares, whereas $(\sum X)^2$ means that you first add all X's and then square this sum. $\sum XY$ means that you add the products of each pair of X and Y.

### Double subscripts

A double subscript can be used to specify which value of X is meant. For instance, in a table it can indicate which row and column you refer to (row i, column j). $X_{23}$ then means row 2, column 3. Notation 1 (formula sheet), $\sum_{i=1}^{2}\sum_{j=1}^{5} X_{ij}$, then means that all values $X_{ij}$ should be summed for i values 1 and 2, and j values 1 to 5.
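The difference between these expressions is easy to check with a few numbers. The values below are hypothetical and serve only to illustrate the notation.

```python
# Hypothetical paired scores, used only to illustrate the notation.
X = [1, 2, 3, 4]
Y = [2, 0, 1, 5]

sum_x = sum(X)                                # sum of X
sum_of_squares = sum(x ** 2 for x in X)       # square each X first, then add
square_of_sum = sum(X) ** 2                   # add all X first, then square
sum_xy = sum(x * y for x, y in zip(X, Y))     # add the products of each pair

# A table with 2 rows and 5 columns. table[i][j] plays the role of X_ij,
# except that Python counts from 0 while the notation counts from 1.
table = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10]]
x_23 = table[1][2]                            # row 2, column 3
double_sum = sum(table[i][j] for i in range(2) for j in range(5))

print(sum_x, sum_of_squares, square_of_sum, sum_xy, x_23, double_sum)
```

Here the sum of squares is 30 while the square of the sum is 100, which shows why the position of the parentheses matters.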

## What are measures of central tendency?

Measures of central tendency, also called measures of location, are a set of measures that reflect where on the scale the distribution is centered. The three major measures of central tendency are: mode, median, and mean. They differ in how much use they make of the data, particularly of extreme values. The mode is based on only a few data points, the median ignores most data, and the mean is calculated based on all data.

1. Mode (Mo): the measure that is used the least. Simply put, it is the most frequently occurring score. If two adjacent scores occur equally often and more often than any other score, the mean of these two scores is taken. When the two most frequent scores are far apart, the distribution is bimodal and it is better to report two modes.

2. Median (Mdn): the score that corresponds to the point at or below which 50% of the scores fall when the data are arranged in numerical order. It is also called the 50th percentile. Suppose we have the scores 4, 6, 8, 9, and 16. Then the median is 8. Suppose, however, that there is an even number of scores, for instance 4, 6, 8, 12, 15, and 16. Then the median falls between 8 and 12. In such a case, the average (10) of the middle two scores is commonly taken as the median.

3. Mean: the most common measure of central tendency. The mean (notation 2 on the formula sheet) is the sum of all scores divided by the total number of scores: $(\sum X)/N$.
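The three measures can be checked with Python's standard `statistics` module. The median examples reuse the scores from the list above; the other data sets are hypothetical.

```python
import statistics

odd_scores = [4, 6, 8, 9, 16]
even_scores = [4, 6, 8, 12, 15, 16]

median_odd = statistics.median(odd_scores)      # the middle score
median_even = statistics.median(even_scores)    # average of the two middle scores
mean_odd = statistics.mean(odd_scores)          # sum of scores / number of scores

# Hypothetical ratings with one clearly most frequent value.
ratings = [2, 3, 3, 3, 4, 5]
mode_ratings = statistics.mode(ratings)

# multimode returns every most frequent value, which helps with bimodal data.
modes_bimodal = statistics.multimode([1, 1, 1, 2, 3, 7, 7, 7])

print(median_odd, median_even, mean_odd, mode_ratings, modes_bimodal)
```

Note that `statistics.multimode` reports both modes separately rather than averaging them, which matches the advice to report two modes for a bimodal distribution.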

## What are the benefits of the mode, median, and mean?

If the distribution is fairly symmetric, the median and mean are (approximately) equal. If the distribution is also unimodal, the mode is also (approximately) equal to the median and mean. In all other cases, one should deliberately choose which measure to use.

Mode: by definition, the mode is a score that actually occurs in the data. This is not (always) true for the mean and median. In addition, the mode represents the largest group of people.

Median: the main advantage of the median is, similar to the mode, that it is not influenced by extreme values. It is therefore a good option when there are extreme values that are not of importance.

Mean: the main drawback of the mean is that it is influenced by extreme values and that its value is not necessarily a score that occurs in the data. The main advantage of the mean is that it can be manipulated algebraically. In addition, it can be used to estimate the population mean: a sample mean is a better estimate of the population mean than a mode or median.

### Trimmed means

Trimmed means are means calculated on data from which a part (at the ends of the tails) has been discarded. For instance, if we have a data set with 100 scores, a 10% trimmed mean means that we simply discard the highest 10 scores and the lowest 10 scores and take the mean of the remaining scores. By trimming, the extreme scores are discarded, resulting in a more stable estimate of the population mean. In the event of extreme scores, the mean of the distribution is "pulled towards" these extremes, and by removing them, the remaining distribution becomes closer to a normal distribution.
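A trimmed mean is simple to compute by hand or in code. The sketch below uses ten hypothetical scores with one extreme value; the function name `trimmed_mean` is an assumption, and the proportion is taken to be smaller than 0.5.

```python
def trimmed_mean(scores, proportion):
    """Mean after dropping `proportion` of the scores from each tail."""
    ordered = sorted(scores)
    k = int(len(ordered) * proportion)      # number of scores cut per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: nine ordinary scores and one extreme score.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]

plain_mean = sum(scores) / len(scores)
mean_10_trimmed = trimmed_mean(scores, 0.10)

print(plain_mean, mean_10_trimmed)
```

In this example the ordinary mean (7.6) is higher than nine of the ten scores, while the 10% trimmed mean (4.25) lies in the middle of the bulk of the data.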

## How can variability be measured?

It is useful to know where the data is concentrated. In addition, it is important to know to what extent individual scores deviate from the mean, median, or mode. This is called dispersion, or variability, around a point. In general, the variability around the mean is examined.

To explain dispersion, we use the following example: two researchers want to examine what makes a face attractive: special features or simply a "general" look. They create pictures on a computer: one face composed from four different faces (set 4), in which special features remain visible, and one face composed from a set of 32 faces (set 32), representing a "general" face. Students assess both pictures on a 1-5 scale. The research shows that set 32 is assessed as most attractive. In addition, it shows that the scores on set 32 lie closer together than the scores on set 4: the scores are more homogeneous. We would like to measure this difference in dispersion.

### Range

The range is a measure of distance between the lowest and highest score. The lowest score of set 4 was 1.20, the highest score was 4.02. Hence, the range is 4.02-1.20 = 2.82. The range is completely dependent on extreme scores, or even outliers, and therefore may give a distorted picture of the variability.

### Interquartile range and other range statistics

The interquartile range (IQR) is an attempt to circumvent the dependency on the extreme scores. The IQR is obtained by discarding the upper 25% and the lower 25% of the distribution and taking the range of what remains. The point that cuts off the lower 25% of the distribution is called the first quartile (Q1). Similarly, the point that cuts off the highest 25% is called the third quartile (Q3). The IQR is the difference between Q3 and Q1. Quartiles are important in boxplots, which will be discussed later on.

A drawback of the IQR is that much data is discarded, and discarding data has to be justified. In principle, any desired percentage of the data can be discarded. In general, one wants to discard scores that are caused by mistakes or unusual events, without removing the meaningful variability from the data. Trimming can be a valuable approach to skewed distributions. Here, a Winsorized sample can be used, meaning that the lowest 10% of the scores are removed and replaced by copies of the lowest remaining score. The same is done for the highest 10%.
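The range, the IQR, and a Winsorized sample can be compared on a small hypothetical data set. In this sketch the quartiles come from `statistics.quantiles`, whose default method may differ slightly from the rule used in a given textbook or program, and `winsorize` is a hypothetical helper that assumes trimmed scores are replaced by the nearest remaining score.

```python
import statistics

def winsorize(scores, proportion):
    """Replace the lowest and highest `proportion` of the scores by copies
    of the nearest remaining score."""
    ordered = sorted(scores)
    k = int(len(ordered) * proportion)      # number of scores replaced per tail
    if k == 0:
        return ordered
    middle = ordered[k:len(ordered) - k]
    return [middle[0]] * k + middle + [middle[-1]] * k

# Hypothetical scores with one extreme value.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

value_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)   # Q1, median, Q3
iqr = q3 - q1
winsorized = winsorize(scores, 0.10)

print(value_range, iqr, winsorized)
```

The extreme score of 50 inflates the range to 49, whereas the IQR (6) and the Winsorized sample are not affected by how extreme that score is.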

### The average deviation

At first glance, the easiest way to compute the variability around the mean is to calculate all deviations from the mean and take their average. However, the positive and negative deviations cancel each other out, so the sum of the deviations, and hence their average, is always zero.

### The mean absolute deviation

The mean absolute deviation (m.a.d.) is the sum of the absolute (thus without the + or - sign) deviations divided by N (the number of scores). However, the mean absolute deviation has rarely been used, because there are more useful measures.
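A short numerical check, using hypothetical scores, shows why the plain average deviation is useless and how the mean absolute deviation avoids the problem.

```python
scores = [1, 3, 4, 6, 11]                       # hypothetical scores
mean = sum(scores) / len(scores)                # 25 / 5 = 5.0

deviations = [x - mean for x in scores]         # -4, -2, -1, 1, 6
average_deviation = sum(deviations) / len(scores)

# Dropping the signs before averaging gives the mean absolute deviation.
mean_absolute_deviation = sum(abs(d) for d in deviations) / len(scores)

print(average_deviation, mean_absolute_deviation)
```

The average deviation is 0 regardless of how spread out the scores are, while the mean absolute deviation (2.8 here) grows with the spread.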

### Variance and standard deviation

The sample variance ($s^2$) offers an alternative approach to the issue of the deviations themselves averaging to zero. The notation for the population variance is $\sigma^2$. The variance uses the fact that a negative number becomes positive when it is squared. All squared deviations are summed and then divided by N-1. It uses N-1 instead of N, because that yields a better estimate of the population variance. Often, a subscript is used to specify to which variable it applies, for instance $s^2_X = \sum (X - \bar{X})^2 / (N - 1)$ (see formula 1 of the formula sheet).

For our example, the variance for set 4 and set 32 is respectively:

$$s^2_X = \frac{(1.20 - 2.64)^2 + (1.82 - 2.64)^2 + \dots + (4.02 - 2.64)^2}{20 - 1} = \frac{8.1569}{19} = 0.4293$$

$$s^2_Y = \frac{(3.13 - 3.26)^2 + (3.17 - 3.26)^2 + \dots + (3.38 - 3.26)^2}{20 - 1} = \frac{0.0903}{19} = 0.0048$$

Squared units result from these calculations. Because these are not easy to interpret, the final step is to take the square root of the variance. This is called the standard deviation (s, σ, or sometimes SD). It is the positive square root of the variance. In our example, the standard deviation for set 4 is 0.66. The standard deviation for set 32 is 0.07. This implies that, roughly speaking, the scores of set 4 deviate on average 0.66 units from the mean. For set 32, the scores deviate, on average, only 0.07 units from the mean.
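The computation can be sketched as follows. Because the full ratings of set 4 and set 32 are not listed here, the function is demonstrated on hypothetical scores; only the reported sums of squared deviations (8.1569 and 0.0903, each with N = 20) are taken from the example.

```python
import math
import statistics

def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by N - 1."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return sum(squared_deviations) / (len(scores) - 1)

scores = [1, 3, 4, 6, 11]                   # hypothetical scores, mean 5
variance = sample_variance(scores)          # (16 + 4 + 1 + 1 + 36) / 4
sd = math.sqrt(variance)
library_variance = statistics.variance(scores)   # also divides by N - 1

# Reproducing the summary figures reported for set 4 and set 32.
variance_set4 = 8.1569 / 19
variance_set32 = 0.0903 / 19
sd_set4 = math.sqrt(variance_set4)
sd_set32 = math.sqrt(variance_set32)

print(variance, round(sd, 2))
print(round(variance_set4, 4), round(sd_set4, 2))
print(round(variance_set32, 4), round(sd_set32, 2))
```

The hand-written function and `statistics.variance` agree, since both divide by N - 1.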

The standard deviation can also be used to indicate how many scores deviate no more than one standard deviation from the mean. For normal distributions, approximately two thirds of the scores fall within one standard deviation above or below the mean. Approximately 95% of the scores fall within two standard deviations of the mean.

Another formula for the variance is given by formula 2 (see formula sheet). Another formula for the standard deviation is given by formula 3 (see formula sheet). These formulas are not used frequently anymore, because they were meant for computation by hand.

When calculating the variance and standard deviation, be aware of extreme scores, because these measures are very sensitive to them. An extreme score has a large deviation from the mean, and because deviations are squared, large deviations are disproportionately represented in the variance.

## When is the coefficient of variation used?

Imagine we have two tests to measure long-term memory. For one of the tests, the data show a mean of 15 and a standard deviation of 3.5. For the other test, the data show a mean of 75 and a standard deviation of 10.5. Which test would you choose? Perhaps the second, because it shows more variance in scores and has a larger standard deviation. However, the standard deviation is based on the deviations from the mean, and because the mean of the second test is higher, the values can more easily deviate from the mean. To assess the standard deviation, we should therefore take the size of the mean into account. To do so, we can use the coefficient of variation (CV) = standard deviation / mean. Multiplying by 100 expresses it as a percentage. The first test has a CV of 23.3. The second test has a CV of 14. Based on these values, we would choose the first test.
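The two coefficients of variation can be verified with a few lines of code. The function name is hypothetical; the means and standard deviations are those of the two memory tests.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_test1 = coefficient_of_variation(3.5, 15)
cv_test2 = coefficient_of_variation(10.5, 75)

print(round(cv_test1, 1), round(cv_test2, 1))
```

Relative to its mean, the first test spreads the scores more widely, even though its standard deviation is smaller in absolute terms.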

## What is an unbiased estimator?

Although we usually work with samples, we aim to infer something about the population. We use statistics (characteristics of samples) to estimate the parameters (characteristics of the population). Statistics are denoted by Roman letters, and parameters by Greek letters. For instance, the population mean is denoted by μ (mu). How good the estimate of a parameter is depends on the choice of statistic. Estimators can be biased or unbiased. In general, we aim to have unbiased estimators.

Suppose we are interested in the mean of a population. If we draw a sample from this population, the sample mean will approximate μ. However, it will not be exactly equal to μ. We could draw an infinite number of samples. In that case, the average of all the sample means will be equal to μ. An estimator whose expected value equals the parameter to be estimated (thus in the event of infinitely many samples from the population) is called an unbiased estimator.

Sample means and variances are unbiased estimators of their parameters. To obtain unbiased sample variances, however, we should divide by N-1 instead of N. One degree of freedom is lost, because μ has to be estimated by the sample mean. This is reflected in the denominator of the fraction.
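Unbiasedness can be illustrated without drawing infinitely many samples by listing every possible sample from a very small population. The sketch below assumes a hypothetical population of three scores and samples of size 2 drawn with replacement, so that all nine samples are equally likely.

```python
from itertools import product

population = [1, 2, 3]                      # tiny hypothetical population
n = 2                                       # sample size

mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def variance(sample, denominator):
    """Sum of squared deviations from the sample mean over `denominator`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / denominator

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

expected_mean = sum(sum(s) / n for s in samples) / len(samples)
expected_var_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
expected_var_n = sum(variance(s, n) for s in samples) / len(samples)

print(mu, expected_mean)
print(sigma2, expected_var_n_minus_1, expected_var_n)
```

Averaged over all samples, the sample mean equals μ and the variance with N-1 in the denominator equals the population variance, while dividing by N gives a value that is too small.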

## When are boxplots used?

A boxplot is, similar to a stem-and-leaf diagram, a way to easily plot data. In an earlier section, we saw that the median can be located by (N+1)/2. To develop a boxplot, we use the median, the first quartile, and the third quartile. The easiest way to locate the quartiles is by using the quartile location:

Quartile location: (median location + 1)/2

Similar to the median location, the quartile location indicates the position of the quartiles in the ordered data. Next to the median, Q1, and Q3, a boxplot uses the IQR (Q3-Q1). The inner fences are the points that lie 1.5 times the IQR below the first quartile and 1.5 times the IQR above the third quartile. Suppose the IQR is 2. Then the inner fences lie 2 x 1.5 = 3 units beyond the quartiles. Adjacent values are those actual values in the data that are no more extreme (no farther from the median) than the inner fences. Suppose the inner fences are at -1 and 7. The adjacent values are then the lowest actual value that is not below -1 (for example 1) and the highest actual value that is not above 7 (for example 7).

To draw a boxplot, we should first create a scale that comprises all values. Then, we should draw a box from Q1 to Q3 with a vertical line inside it at the location of the median. To the left and right of the box, lines called whiskers are drawn from the quartiles to the adjacent values. The points that are more extreme than the adjacent values are depicted outside the whiskers as dots.
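The ingredients of a boxplot can be computed step by step. The sketch below uses hypothetical data chosen so that the inner fences fall at -1 and 7. It also assumes that any fraction in the median location is dropped before the quartile location is computed, and that Q3 is found by counting the same number of positions down from the highest score; other programs may use slightly different rules.

```python
import math

def value_at(ordered, location):
    """Score at a 1-based location; a location ending in .5 averages the
    two neighbouring scores."""
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return (lower + upper) / 2

def boxplot_summary(scores):
    ordered = sorted(scores)
    n = len(ordered)
    median_location = (n + 1) / 2
    # Assumption: the fraction of the median location is dropped here.
    quartile_location = (math.floor(median_location) + 1) / 2
    q1 = value_at(ordered, quartile_location)            # counted from the bottom
    q3 = value_at(ordered, n + 1 - quartile_location)    # counted from the top
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "median": value_at(ordered, median_location),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "adjacent": (inside[0], inside[-1]),             # ends of the whiskers
        "outliers": [x for x in ordered
                     if x < lower_fence or x > upper_fence],
    }

# Hypothetical scores, chosen so that the fences fall at -1 and 7.
scores = [1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 7, 10]
summary = boxplot_summary(scores)
for name, value in summary.items():
    print(name, value)
```

With these data the whiskers run from 1 to 7 and the score of 10 is drawn as a separate dot.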

From the boxplot we can deduce that the central part of the distribution is fairly symmetric: the median line lies approximately in the middle of the box. We can also see that the distribution is positively skewed, because the whisker on the high side of the box is longer than the whisker on the low side. Finally, the boxplot shows four obvious outliers. Boxplots are especially useful for examining the dispersion of data, and they therefore make it easy to compare groups. The position of the box shows where the scores of a group are concentrated.

The statistical software program SPSS can compute measures of central tendency and dispersion. By means of Analyze/Compare means/Means, you'll obtain descriptive statistics. Via Graphs/Interactive/Boxplot you'll obtain a boxplot of the data.

## What are percentiles, quartiles, and deciles?

Next to quartiles, the distribution can also be divided in other ways. If we want a finer gradation of the distribution, we can divide it into tenths, with the first decile cutting off the lowest 10%, the second decile cutting off the lowest 20%, and so on. Finally, percentiles are used to divide the distribution into hundredths. Quartiles, deciles, and percentiles are the three most common examples of a general class of statistics called quantiles, sometimes also called fractiles.

## What is the effect of a linear transformation on data?

It is possible to convert data from inches to centimeters and from degrees Fahrenheit to degrees Celsius. These transformations fall within a set called linear transformations, in which X is multiplied by a certain constant and some constant is added to this:

$X_{new} = bX_{old} + a$ (here, a and b are constants)

To find the mean and variance of the new scale, different formulas are developed. If a constant is added to the data, the same constant is added to the mean (see Formula 4 of the formula sheet). If the data are multiplied (or divided) by a constant, the mean is also multiplied (or divided) by this constant (see Formula 5 of the formula sheet).

When a constant is added to or subtracted from the data, the variance and standard deviation remain the same. Thus, for $X_{new} = X_{old} \pm a$ it means that $s^2_{new} = s^2_{old}$.

Scores that are multiplied (or divided) by a constant result in multiplying (or dividing) the standard deviation by the constant, and the variance by the square of the constant:

$X_{new} = bX_{old}$ leads to $s^2_{new} = b^2 s^2_{old}$ and $s_{new} = b\,s_{old}$

$X_{new} = X_{old}/b$ leads to $s^2_{new} = s^2_{old}/b^2$ and $s_{new} = s_{old}/b$
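These rules can be checked numerically. The scores and the constants b = 3 and a = 7 below are hypothetical and chosen only because they give round numbers.

```python
import statistics

old = [2, 4, 4, 5, 10]                      # hypothetical scores
b, a = 3, 7                                 # assumed constants
new = [b * x + a for x in old]              # X_new = b * X_old + a

mean_old = statistics.mean(old)
mean_new = statistics.mean(new)
var_old = statistics.variance(old)
var_new = statistics.variance(new)
sd_old = statistics.stdev(old)
sd_new = statistics.stdev(new)

print(mean_old, mean_new)                   # new mean = b * old mean + a
print(var_old, var_new)                     # new variance = b^2 * old variance
print(sd_old, sd_new)                       # new sd = b * old sd
```

The added constant a shifts the mean but leaves the variance and standard deviation untouched; only the multiplier b affects the spread.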

### Centering

Data is increasingly being centered. This is obtained by subtracting the sample mean from all observations. The new mean then is 0, but the standard deviation and variance remain the same.

### Reflection

Reflection is commonly used. With reflection, the order of the scale is reversed. In many studies, half of the questions are framed positively and half negatively to prevent people from constantly filling in the same answer. To compare the scores on a 5-point scale, the negative items should be reversed, so a 5 becomes a 1, a 4 becomes a 2, and so on. This is called reflection and is simply obtained by a linear transformation. In our example: $X_{new} = 6 - X_{old}$. This changes the mean, but it does not influence the variance and standard deviation.

### Standardization

An even more common transformation involves creating deviation scores and then dividing all these deviation scores by the standard deviation. Such scores are called standard scores, and the process is called standardization. Standardization expresses each score in standard deviation units. A standard score of 0.46 means that the score lies 0.46 standard deviations above the mean.
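Centering, reflection, and standardization can be compared side by side on a small set of hypothetical answers on a 5-point scale.

```python
import statistics

scores = [1, 2, 4, 5, 5]                    # hypothetical 5-point answers
mean = statistics.mean(scores)
sd = statistics.stdev(scores)

centered = [x - mean for x in scores]       # subtract the sample mean
reflected = [6 - x for x in scores]         # reverse the 5-point scale
z_scores = [(x - mean) / sd for x in scores]  # deviation scores / sd

mean_centered = statistics.mean(centered)
sd_centered = statistics.stdev(centered)
sd_reflected = statistics.stdev(reflected)
mean_z = statistics.mean(z_scores)
sd_z = statistics.stdev(z_scores)

print(mean_centered, sd_centered)           # mean 0, sd unchanged
print(reflected, sd_reflected)              # order reversed, sd unchanged
print(mean_z, sd_z)                         # mean 0, sd 1
```

Centering and reflection leave the standard deviation as it was, whereas standardization rescales the scores so that the mean is 0 and the standard deviation is 1 (apart from rounding error).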

## What is the aim of non-linear transformations?

Linear transformations change the data, but do not alter the form of a distribution. Non-linear transformations, on the other hand, aim to change the form of a distribution. They can change a skewed distribution to be more symmetric, or lower the influence of outliers.

## Exam tickets

• The three most important measures of central tendency are: (1) Mean (average score); (2) Mode (most common score); and (3) Median (middle score).
• If the distribution is fairly symmetric, the median and mean are (approximately) equal.
• If the distribution is skewed, the mean of the distribution is "pulled towards" the extreme scores in the tail. Thus, if the distribution is positively skewed, the mean is to the right of the median. If the distribution is negatively skewed, the mean is to the left of the median (the median is more robust to extreme values).