Summary with the 9th edition of Statistics for Business and Economics by Newbold

## How to describe data graphically? - Chapter 1

Statistics are used in many aspects of our daily lives: to predict or forecast sales of a new product, the weather, grade point averages, and so on. We constantly need to absorb and interpret substantial amounts of data. However, once the data are collected, what should we do with them? How do data impact decision making? Generally, statistics help us to make sense of data. In this first chapter, we will introduce graphical ways of presenting data, that allow one to better understand the data. Examples of such graphical displays are: tables, bar charts, pie charts, histograms, stem-and-leaf displays, and so forth.

### How to make decisions in an uncertain environment?

Oftentimes, decisions are based on limited information. Suppose, for instance, that one is interested in bringing a new product to the market. Before doing so, the manufacturer wants to undertake a market research survey to assess the potential level of demand. While the manufacturer is interested in all potential buyers (population), this group is often too large to analyze. Collecting data for the entire population is impossible or prohibitively expensive. Therefore, a representative subgroup of the population (sample) is needed.

#### Sample and population

A population is defined as the complete set of all items (observations) that one is interested in. The population size is denoted by N and can be very large, at times even infinite. A sample is defined as an observed subset of the population. The sample size is denoted by n.

#### Sampling

There are different ways to obtain a representative subgroup (sample) of the population. This process is called sampling. One common procedure is simple random sampling (SRS), which selects a sample of n objects (individuals) in such a way that each member of the population is chosen purely by chance. The selection of one member does not influence the chance that another member is selected, and each member has an equal chance of being included in the sample. SRS is so common that the adjective simple is often dropped, and the resulting sample is simply called a random sample. A second procedure is systematic sampling. Here, the population list is arranged in some manner unconnected with the subject of interest, and every jth item in the population is selected, where j is the ratio of the population size N to the desired sample size n, that is: j = N / n. The first item to be included is randomly selected. Systematic samples provide a good representation of the population if there is no cyclical variation in the population.
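The two sampling procedures can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the population of 100 numbered items, and the use of the integer part of N / n as the interval are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every item has the same chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every j-th item, starting from a randomly chosen first item."""
    rng = random.Random(seed)
    j = len(population) // n      # sampling interval j = N / n (integer part, assumes n <= N)
    start = rng.randrange(j)      # random first item among the first j items
    return [population[start + i * j] for i in range(n)]

population = list(range(1, 101))  # hypothetical population with N = 100
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print(srs)
print(sys_sample)
```

With N = 100 and n = 10, the systematic sample picks every 10th item, so consecutive selected items are always 10 positions apart.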

#### Parameter and statistic

A parameter is defined as a measure that describes a population characteristic. A statistic is defined as a numerical measure that describes a sample characteristic. For instance, if we measure the average IQ of 500 registered voters, this average is called a statistic. If, for some reason, we are able to calculate the average IQ of the entire population, this resulting average would be called a parameter.

In practice, we are commonly unable to directly measure the parameters of interest. Therefore, we use statistics to gain some understanding of the population values. We must, however, realize that there is always some element of uncertainty involved, as we do not know the exact value of the population parameter. There are two sources of error that influence this uncertainty. First, sampling error is due to the fact that information is available on only a subset of the population members (discussed in more detail in chapter 6, chapter 7, and chapter 8). Second, nonsampling error is unconnected to the sampling procedure used. Examples of nonsampling error are: the population that is sampled is actually not the relevant one; survey participants may give inaccurate or dishonest answers; survey subjects may not respond at all to (certain) questions.

### How to think statistically?

Statistical thinking begins with problem definition:

1. What information is required?
2. What is the population of interest?
3. How should sample members be selected?
4. How should information from the sample members be obtained?

After answering these questions, the next question is how to use sample information to make decisions about the population. For this decision making, both descriptive statistics and inferential statistics are required. Descriptive statistics focus on graphical and numerical procedures; they are used to summarize and process data. Inferential statistics then use the data to make predictions, forecasts, and estimates that support decisions.

### What is a variable and what are the measurement levels of a variable?

A variable is a characteristic of an individual or object. Examples are age and weight.

Variables are either categorical (with responses that belong to groups or categories) or numerical (with responses that belong to a numerical scale). Numerical variables can be subdivided into discrete and continuous. A discrete numerical variable may (but does not have to) have a finite number of values. Discrete numerical variables often come from a counting process, such as the number of students in a class, or the number of credits earned by students. A continuous numerical variable can take on any value within a given range of real numbers. Continuous numerical variables often result from a measurement process rather than a counting process. Examples are weight, length, and the distance between two cities.

Variables can be measured in several ways. The main distinction is between quantitative (in which there is a measurable meaning to the difference in numbers) and qualitative (in which there is no measurable meaning to the difference in numbers). Qualitative data can be further subdivided into nominal and ordinal data. Nominal data are considered the lowest measurement level. The numerical identification is chosen strictly for convenience and does not imply ranking of responses (for instance: country of citizenship or gender). Ordinal data does imply a ranking order of the data (for instance product quality rating, with 1 = poor, 2 = average, 3 = good). Quantitative data can be further subdivided into interval (arbitrary zero) and ratio (absolute zero). For instance, temperature is considered an interval variable (it has an arbitrary null point). Weight is considered a ratio variable (it has an absolute null point).

### How to graphically describe categorical variables?

Categorical variables can be graphically described in several ways. These are introduced briefly in this section.

A frequency distribution is a table that is used to organize data. The left column includes all possible responses of a variable. The right column includes the frequencies, that is, the number of observations for each possible response. One can also obtain a relative frequency distribution by dividing each frequency by the number of observations and multiplying the resulting proportion by 100%.
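As a small sketch, a frequency distribution and its relative counterpart can be built with Python's `collections.Counter`. The list of product ratings is hypothetical and serves only to illustrate the two columns of the table.

```python
from collections import Counter

# Hypothetical categorical responses (product quality ratings)
responses = ["good", "poor", "good", "average", "good", "average", "poor", "good"]

frequencies = Counter(responses)   # frequency per category
n = len(responses)                 # number of observations
relative = {category: 100 * count / n for category, count in frequencies.items()}

# Print the table, most frequent category first
for category, count in frequencies.most_common():
    print(f"{category:<8} {count:>3} {relative[category]:6.1f}%")
```

Listing the categories from most to least frequent is also the ordering used in a Pareto diagram.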

Frequencies can also be displayed by graphs. Commonly used graphs to display frequencies are the bar chart and the pie chart. Unlike in a histogram, the bars in a bar chart do not need to "touch" each other. Each bar represents the frequency of a category. A bar chart is commonly used if one wants to draw attention to the frequency of each category. A pie chart is commonly used if one wants to draw attention to the proportion of frequencies in each category. The "pie" (that is, the circle) represents the total, and the "pieces" (the segments) represent shares (categories) of that total.

A special type of bar chart is a Pareto diagram. A Pareto diagram displays the frequency of defect causes. The bar at the left indicates the most frequent cause. The bars to the right indicate causes with decreasing frequencies. A Pareto diagram is commonly used to separate the "vital few" from the "trivial many".

A cross table (also known as crosstab or contingency table) lists the number of observations for every combination of values for two categorical variables (nominal or ordinal). The combination of all possible values for these two variables defines the number of cells in the table. A cross table with r rows (i.e., the number of categories of the first variable) and c columns (i.e., the number of categories of the second variable) is referred to as an r x c cross table.

### How to graphically describe time series data?

Cross-sectional data are data collected at a single time point. In contrast, time series data are data measured at successive points in time. In other words, a time series is a set of measurements, ordered over time, on a particular quantity of interest. The sequence of the observations in a time series is important. Time series data can be graphically displayed by a line chart, also known as a time-series plot. This is a plot with time on the horizontal axis and the numerical quantity of interest along the vertical axis. Each observation yields one point on the graph. By connecting points that are adjacent in time with straight lines, a time-series plot is produced. Time-series plots thus can be used to graphically display a trend over time, such as the gross domestic product over time, the currency exchange rates (USD to EUR) during a decade, or the federal government receipts and expenditures during the past century.

### How to graphically describe numerical variables?

There are several ways to graphically describe numerical variables.

Just like categorical variables, one can create a frequency distribution for numerical variables. The classes (intervals), however, for a frequency distribution for numerical data are not as easily identifiable as for categorical data. To construct a frequency distribution for numerical data, three rules should be followed:

1. Determine k, that is the number of classes. To do so, one can use the following quick guide to approximate the number of classes:

| Sample size (n) | Number of classes (k) |
| --- | --- |
| Fewer than 50 | 5 - 7 |
| 50 - 100 | 7 - 8 |
| 101 - 500 | 8 - 10 |
| 501 - 1000 | 10 - 11 |
| 1001 - 5000 | 11 - 14 |
| More than 5000 | 14 - 20 |

Although this quick guide offers a rule of thumb, it remains somewhat arbitrary. Often, practice and experience provide the best guidelines. Generally, larger data sets require more classes than smaller data sets. When too few classes are selected, patterns and characteristics of the data may be hidden. When too many classes are selected, some intervals may contain no observations or have very small frequencies.

2. Each class should have the same width, denoted by w. The width is determined by:
w = Class width = (largest observation - smallest observation ) / number of classes
Note that w should always be rounded upward.
3. Classes should be inclusive and nonoverlapping.
In other words, each observation must belong to one and only one class. Suppose the frequency distribution contains the following classes: "age 20 - 30" , "age 30 - 40", and "age 40+". To what category does a person of age 30 belong? It is therefore important to clearly identify the boundaries or endpoints of each class. To avoid overlapping, one could for instance redefine the classes as follows: "age 20, but less than age 30", "age 30 but less than 40", "age 40 and older".
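The three rules above can be sketched as follows. This is an illustration under stated assumptions: the ages are hypothetical, classes are taken as "lower limit, but less than upper limit", and the last class is treated as closed so that the largest observation is never left out when the width divides the range exactly.

```python
import math

def frequency_distribution(data, k):
    """Count observations in k equal-width classes [lower, upper)."""
    low, high = min(data), max(data)       # assumes high > low
    w = math.ceil((high - low) / k)        # class width, rounded upward
    counts = [0] * k
    for x in data:
        # Clamp to the last class so the maximum always belongs to a class
        index = min(int((x - low) // w), k - 1)
        counts[index] += 1
    classes = [(low + i * w, low + (i + 1) * w, counts[i]) for i in range(k)]
    return w, classes

ages = [21, 25, 30, 34, 38, 41, 45, 47, 52, 58, 63, 69]   # hypothetical sample
w, classes = frequency_distribution(ages, k=5)
for lower, upper, count in classes:
    print(f"{lower} but less than {upper}: {count}")
```

Here the range is 69 - 21 = 48, so w = 48 / 5 = 9.6 is rounded upward to 10, and every age falls in exactly one class.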

A cumulative frequency distribution contains the total number of observations whose values are less than the upper limit for a certain class. The cumulative frequencies can be constructed by adding the frequencies of all frequency distribution classes up to and including the present class. In a relative cumulative frequency distribution, these cumulative frequencies are expressed as cumulative proportions or percent.
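A short sketch of the running totals, using hypothetical class frequencies: `itertools.accumulate` adds the frequencies up to and including each class, and dividing by the total gives the relative cumulative frequencies.

```python
from itertools import accumulate

frequencies = [3, 2, 3, 2, 2]              # hypothetical class frequencies
cumulative = list(accumulate(frequencies)) # running totals per class
n = cumulative[-1]                         # total number of observations
relative_cumulative = [100 * c / n for c in cumulative]

for freq, cum, rel in zip(frequencies, cumulative, relative_cumulative):
    print(f"{freq:>3} {cum:>3} {rel:6.1f}%")
```

The relative cumulative percentages are the values that an ogive plots against the upper limit of each class.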

A histogram is a graphical display, consisting of vertical bars constructed on a horizontal line that yields intervals for the variable being displayed. These intervals correspond to the classes in a frequency distribution table. The height of each bar is proportional to the number of observations (the frequency) in that interval. The number of observations can (but does not have to) be displayed above the bars.

An ogive (also known as cumulative line graph) is a line that connects points that are the cumulative percent of observations below the upper limit of each interval in a cumulative frequency distribution.

The shape of a distribution can be measured, among others, via symmetry and skewness. A distribution is called symmetric when the observations are balanced or approximately evenly distributed around its center. A distribution is said to be skewed when the observations are not symmetrically distributed on either side of the center. A distribution is skewed to the right (also known as positively skewed) when it has a tail that extends farther to the right. A distribution is skewed to the left (negatively skewed) when its tail extends farther to the left. Income, for example, is skewed-right, because there is a relatively small proportion of people with a high income. A large proportion of the population receives a modest income and only a small proportion of people receives a (very) high income.

A stem-and-leaf display is a graph that is used for exploratory data analysis (EDA). It provides an alternative to the histogram. The leading digits are displayed in the stems. The final digits are called leaves. The leaves are listed separately for each member of a class. They are displayed in ascending order after each of the stems.
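A stem-and-leaf display is easy to build by hand or in code. The sketch below assumes non-negative whole numbers, uses the tens as stems and the units as leaves, and works on hypothetical exam scores.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (final digit), in ascending order."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 75 -> stem 7, leaf 5
        display[stem].append(leaf)
    return dict(display)

scores = [62, 75, 71, 88, 69, 93, 77, 84, 71, 90]   # hypothetical scores
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```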

### How to graphically display two variables simultaneously?

So far, we mainly discussed graphical displays of a single variable. Graphical displays, however, can also be used to display two variables. One such possibility is offered by a scatter plot. A scatter plot is a graphical display of two numerical variables, often an independent variable (on the x-axis), and a dependent variable (on the y-axis). A scatter plot provides the following information: the range of both variables, the pattern of values over the range, a suggestion as to a possible relationship between the two variables, and an indication of outliers (extreme points). An example of a simple scatter plot between variable X and Y is displayed below.

### What are common data presentation errors?

Unfortunately, when graphically displaying data, errors can be made. Poorly designed graphs can easily distort the truth. Hence, accurate graphic design is of utmost importance. Graphs must be persuasive, clear, and truthful. In this section, some common examples of misleading graphs will be discussed.

Histograms can be misleading. We know that the width of all intervals should be the same. Yet, sometimes, researchers are tempted to construct a frequency distribution with some narrow intervals where the bulk of observations are, and broader ones elsewhere. Such unequal intervals may lead to inaccurate interpretation of the data displayed. In general, we should not construct a histogram with unequal interval widths. This point is included as a warning against deceptive graphs.

A time-series plot can be misleading through the choice of a particular scale of measurement. This scale can be chosen such that it gives the impression of either relative stability or substantial fluctuation over time (depending on what one wants to highlight). Although there is no "correct" choice of scale for any particular time-series plot, one should keep in mind the scale on which the measurements are made. The reader, then, should be aware of this potential influence when interpreting the graph.

### How to describe data graphically? - BulletPoints 1

• A population is defined as the complete set of all items (observations) that one is interested in. The population size is denoted by N and can be very large, at times even infinite. A sample is defined as an observed subset of the population. The sample size is denoted by n.
• A parameter is defined as a measure that describes a population characteristic. A statistic is defined as a numerical measure that describes a sample characteristic. For instance, if we measure the average IQ of 500 registered voters, this average is called a statistic. If, for some reason, we are able to calculate the average IQ of the entire population, this resulting average would be called a parameter.
• Variables can be measured in several ways. The main distinction is between quantitative (in which there is a measurable meaning to the difference in numbers) and qualitative (in which there is no measurable meaning to the difference in numbers). Qualitative data can be further subdivided into nominal and ordinal data. Nominal data are considered the lowest measurement level. The numerical identification is chosen strictly for convenience and does not imply ranking of responses (for instance: country of citizenship or gender). Ordinal data does imply a ranking order of the data (for instance product quality rating, with 1 = poor, 2 = average, 3 = good). Quantitative data can be further subdivided into interval (arbitrary zero) and ratio (absolute zero). For instance, temperature is considered an interval variable (it has an arbitrary null point). Weight is considered a ratio variable (it has an absolute null point).
• The shape of a distribution can be measured, among others, via symmetry and skewness. A distribution is called symmetric when the observations are balanced or approximately evenly distributed around its center. A distribution is said to be skewed when the observations are not symmetrically distributed on either side of the center. A distribution is skewed to the right (also known as positively skewed) when it has a tail that extends farther to the right. A distribution is skewed to the left (negatively skewed) when its tail extends farther to the left. Income, for example, is skewed-right, because there is a relatively small proportion of people with a high income. A large proportion of the population receives a modest income and only a small proportion of people receives a (very) high income.
• To construct a frequency distribution for numerical data, first determine k, the number of classes. Each class should have the same width, denoted by w. The width is determined by:
w = Class width = (largest observation - smallest observation) / number of classes
Note that w should always be rounded upward.

## How to describe data numerically? - Chapter 2

In chapter 1, we discussed how to describe data graphically. In this chapter, we will discuss how to describe data numerically. Furthermore, we will discuss the different numerical measures that can be used for categorical and numerical variables, as well as measures for grouped data, and measures to describe the relationship between two variables.

### What are measures of central tendency and location?

A central question in statistics is whether the data in a sample are centered or located around a particular value. In the first chapter, we discussed graphical displays to examine this. For instance, a histogram gives us a visual picture of the shape of a distribution and provides an idea of whether the data tend to center around some value. In this section, we move on to numerical measures to answer this question of central tendency or location. These measures are called measures of central tendency. Commonly, these measures of central tendency are computed from sample data (statistics) rather than from population data (parameters).

#### Arithmetic mean

The first measure of central tendency is the arithmetic mean, usually referred to as mean, or average. The mean is the sum of the data values divided by the number of observations. If this data set refers to the entire population, the formula for the parameter is:

$\mu = \frac{\sum^{N}_{i=1} x_{i}}{N} = \frac{x_{1}+x_{2}+ \dots +x_{N}}{N}$

with N = population size and Σ means "sum of"

If this data set refers to the sample, then the formula for the statistic is:

$\bar{x} = \frac{\sum^{n}_{i=1} x_{i}}{n}$

with n = sample size.

#### Median

The second measure of central tendency is the median. For the median, we must arrange the data in either increasing or decreasing order. The median, then, is the middle observation. If the number of observations is an even number, the median is the average of the two middle observations. In formula, the median is the value located at the 0.50(n + 1)th ordered position.

#### Mode

The third measure of central tendency is the mode: the most frequently occurring value. If a particular distribution only has one mode, the distribution is called unimodal. If a distribution has two modes, it is called bimodal. For more than two modes, the distribution is called multimodal.
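The three measures can be checked quickly with Python's standard `statistics` module. The six data values are hypothetical; with n = 6, the median sits at ordered position 0.50(6 + 1) = 3.5, so it is the average of the 3rd and 4th observations.

```python
import statistics

data = [3, 5, 5, 6, 8, 9]            # hypothetical sample, already in ascending order

mean = statistics.mean(data)         # sum of the values divided by n
median = statistics.median(data)     # average of the two middle observations
modes = statistics.multimode(data)   # list of all most frequent values

print(mean, median, modes)
```

`multimode` returns a list, so a bimodal or multimodal data set yields more than one value.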

#### Geometric mean

Another measure of central tendency is the geometric mean, given by:

$\bar{x}_{g} = \sqrt[n]{(x_{1}x_{2}...x_{n})} = (x_{1}x_{2}...x_{n})^{1/n}$

The geometric mean is the nth root of the product of n numbers.

The geometric mean rate of return provides the mean percentage return of an investment over time. It is given by:

$\bar{r}_{g} = (x_{1}x_{2}...x_{n})^{1/n}-1$

The geometric mean differs from the arithmetic mean. Suppose we have two observations: 20 and 5. The arithmetic mean = (20+5)/2 = 12.5. The geometric mean = √(20*5) = √100 = 10.
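The sketch below reproduces the 20 and 5 example and adds a hypothetical investment case. It assumes that the values entering the rate-of-return formula are growth factors, that is, 1 plus the rate of return in each period; the function names are chosen for this illustration only.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_return(returns):
    """Mean rate of return per period, from a list of period returns."""
    growth_factors = [1 + r for r in returns]   # assumption: x_i = 1 + return in period i
    return geometric_mean(growth_factors) - 1

print(geometric_mean([20, 5]))                  # example from the text

# Hypothetical investment: +100% in year 1, then -50% in year 2
returns = [1.0, -0.5]
print(sum(returns) / len(returns))              # arithmetic mean of the returns
print(geometric_mean_return(returns))           # geometric mean rate of return
```

In the investment case the arithmetic mean suggests a 25% average return, while the geometric mean rate of return is zero, which matches the fact that the investment ends where it started.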

#### Percentiles and quartiles

Percentiles and quartiles are measures of location: they indicate the location of a value relative to all observations in the data set. For instance, if one told you that you scored in the 96th percentile on your statistics exam, this means that approximately 96% of the students who took this exam scored lower than you and that approximately 4% of the students who took this exam scored higher than you did. Percentiles and quartiles are commonly used to describe large data sets.

There is some disagreement about how to calculate percentiles and quartiles. As a result, slightly different values are found when using different computer software programs (such as SPSS, R, and SAS). In this book, we use the formulas as described below. To find percentiles and quartiles, the data must first be arranged in ascending order.

The Pth percentile is computed as follows: value located in the (P/100)(n+1)th ordered position

Quartiles separate the data set into four quarters. The first quartile (Q1) equals the 25th percentile, and separates approximately the smallest 25% from the rest of the data. The second quartile (Q2) equals the 50th percentile and equals the median. The third quartile (Q3) is the 75th percentile and separates approximately the smallest 75% of the data from the largest 25% of the data. Thus:

Q1 = the value in the 0.25(n+1)th ordered position
Q2 = the value in the 0.50(n+1)th ordered position
Q3 = the value in the 0.75(n+1)th ordered position

This can also be summarized in the five-number summary, which consists of: (1) the minimum; (2) Q1; (3) the median; (4) Q3; (5) the maximum.
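The position rule can be sketched as a small function. Two choices here are assumptions of this sketch rather than something stated above: when the ordered position is not a whole number, the value is found by linear interpolation between the two neighbouring observations, and positions outside 1 to n are clamped to the minimum or maximum. The data are hypothetical.

```python
def percentile(data, p):
    """Value at the (p/100)(n+1)th ordered position, with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    position = (p / 100) * (n + 1)          # 1-based ordered position
    position = min(max(position, 1), n)     # keep the position inside the data
    lower = int(position)                   # whole part of the position
    fraction = position - lower             # fractional part of the position
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [12, 15, 18, 20, 22, 25, 30]         # hypothetical sample, n = 7
five_number_summary = (
    min(data),
    percentile(data, 25),                   # Q1: position 0.25 * 8 = 2
    percentile(data, 50),                   # median: position 0.50 * 8 = 4
    percentile(data, 75),                   # Q3: position 0.75 * 8 = 6
    max(data),
)
print(five_number_summary)
```

Because other software may use a different position rule, the quartiles it reports can differ slightly from these values.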

### Which measure of central tendency should be used when?

To determine whether the mean, median, or mode is most appropriate for the data at hand, we must look at the data structure. One factor that influences this decision is the type of data: categorical or numerical. The mean is appropriate for numerical data. The median and mode are commonly used with categorical data.

The mean is not appropriate for categorical data. Suppose, for instance, that you collected data about country of origin. Each country is assigned an (arbitrary) value. For instance, Germany = 1, the Netherlands = 2, and Belgium = 3. Suppose there are 10 participants from Germany, 5 from the Netherlands, and 5 from Belgium. While we could compute the mean (that is: (10*1 + 5*2 + 5*3) / 20 = 1.75), this number is not meaningful in this context. Similarly, the median (1.5) is not very meaningful here. A better measure in this case is the mode, the most frequently occurring value. In this example, the mode is 1 (i.e., most participants are German).

Next to the type of data, another factor to consider is the presence of outliers. Outliers are observations that are unusually large or unusually small in comparison to the other data observations. The median is not affected by outliers. The mean, however, is affected by outliers. In chapter 1, we already described right-skewed and left-skewed distributions. If there are a lot of unusually large observations (outliers), the mean will tend to move to the right, while the median remains as is. If there are a lot of unusually small observations (outliers), the mean will tend to move to the left, while the median remains as is. Be aware that this does not imply that the median should always be preferred to the mean when the population or sample is skewed. In some situations, the mean is still preferred, even if the data are skewed. Consider, for instance, that a certain company wants to know how much money needs to be budgeted to cover claims. In that case, all observations are important, and the mean is the most appropriate measure of central tendency. If, on the other hand, the company wants to know the most typical claim size, the median is more appropriate.

### What are measures of variability?

Often, measures of central tendency alone are insufficient to describe the data. Different samples may, for instance, have the same mean, yet individuals may vary more from the mean in the first sample than do observations in the second sample. In addition to these measures of central tendency, measures of variability should be provided. In this section, we describe common measures of variability.

#### Range

The range is the difference between the largest and smallest observation(s). The greater the spread of the data from the center of the distribution, the larger the range will be. The range may not be appropriate when there are outliers, as this measure indicates the total spread of the data.

#### Interquartile range

The interquartile range (IQR) is a measure that provides the spread in the middle 50% of the data. The IQR is the difference between the observation at Q3 (the third quartile or the 75th percentile) and the observation at Q1 (the first quartile or the 25th percentile). In formula:

$IQR = Q_{3} - Q_{1}$

#### Box-and-whisker plots

A box-and-whisker plot is a plot that describes the distribution of the data in terms of the five-number-summary. The inner box displays the numbers that span the interquartile range (thus Q1 to Q3). The line that is drawn through the box represents the median. The two whiskers are the line from the minimum to the 25th percentile (Q1) and the line from the 75th percentile (Q3) to the maximum.

#### Variance and standard deviation

The population variance σ² is the sum of the squared differences between each observation and the population mean divided by the population size N. In formula, that is:

$\sigma^{2} = \frac{\sum^{N}_{i=1} (x_{i} - \mu)^{2}}{N}$

The sample variance, s², is the sum of the squared differences between each observation and the sample mean, divided by the sample size minus 1 (that is, n - 1). The sample variance is calculated as:

$s^{2} = \frac{\sum^{n}_{i=1} (x_{i} - \bar{x})^{2}}{n - 1}$

The standard deviation is simply the square root of the variance. That means that the population standard deviation is given by:

$\sigma = \sqrt{\sigma^{2}}$

Next, the sample standard deviation is given by:

$s = \sqrt{s^{2}}$

The standard deviation restores the data to their original measurement unit. Suppose, for instance, the original measurements were in feet. The variance then would be in feet squared, while the standard deviation would be in feet. The standard deviation provides a measure of the average spread around the mean.
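The difference between the population and sample formulas is only the divisor, as the following sketch shows. The eight data values are hypothetical, and the function names are chosen for this illustration.

```python
import math

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical data, mean = 5

sigma_squared = population_variance(data)   # 32 / 8
s_squared = sample_variance(data)           # 32 / 7
sigma = math.sqrt(sigma_squared)            # population standard deviation
s = math.sqrt(s_squared)                    # sample standard deviation
print(sigma_squared, sigma, s_squared, s)
```

The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities.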

#### Coefficient of variation

The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean. It is a measure of relative dispersion. The population coefficient of variation is:

$CV = \frac{\sigma}{\mu} \times 100\% \hspace{5mm} \text{if} \hspace{5mm} \mu > 0$

For the sample coefficient of variation, the formula is as follows:

$CV = \frac{s}{\bar{x}} \times 100\% \hspace{5mm} \text{if} \hspace{5mm} \bar{x} > 0$
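A brief sketch of why relative dispersion is useful: the two hypothetical price series below have the same sample standard deviation but very different means, so their coefficients of variation differ considerably.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    x_bar = statistics.mean(values)
    if x_bar <= 0:
        raise ValueError("the coefficient of variation requires a positive mean")
    return statistics.stdev(values) / x_bar * 100

prices_a = [10, 12, 14]        # hypothetical prices, mean 12, s = 2
prices_b = [100, 102, 104]     # hypothetical prices, mean 102, s = 2

cv_a = coefficient_of_variation(prices_a)
cv_b = coefficient_of_variation(prices_b)
print(f"{cv_a:.2f}% versus {cv_b:.2f}%")
```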

#### Chebyshev's theorem and the empirical rule

Pafnuty Lvovich Chebyshev (1821-1894) was a Russian mathematician who established data intervals for any data set, regardless of the shape of the distribution. That is, for any population with mean μ, standard deviation σ, and k > 1, the percent of observations that fall within the interval [μ ± kσ] is at least 100[1 - (1/k²)]%, where k is the number of standard deviations. The advantage of this approach is that it is applicable to any population. However, for many populations, the percentage of values that fall in any specified range is much higher than the minimum assured by Chebyshev's theorem. In practice, many large populations provide data that are at least approximately symmetric, with many of the data points clustered around the mean. We commonly observe a bell-shaped distribution.

For those large, mounded (bell-shaped) distributions, the following empirical rule of thumb can be applied:

• Approximately 68% of the observations are in the interval μ ± 1σ.
• Approximately 95% of the observations are in the interval μ ± 2σ.
• Almost all (approximately 99.7% of the) observations are in the interval μ ± 3σ.

Suppose, for instance, that the mean score on a statistics exam is 6 with a standard deviation of 1. Approximately 68% of the students then score between 5 and 7. Approximately 95% of the scores fall within the range 4 to 8. And almost all scores fall within the range 3 to 9.
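To see how conservative Chebyshev's bound is compared with the empirical rule, the minimum percentage 100[1 - (1/k²)]% can be computed for a few values of k. The function name is chosen for this illustration.

```python
def chebyshev_minimum_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_minimum_percent(k):.1f}%")
```

For k = 2 the theorem guarantees at least 75%, whereas the empirical rule gives approximately 95% for bell-shaped distributions; for k = 3 the guarantee is about 88.9% against approximately 99.7%.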

#### z-score

The z-score is a standardized value, which indicates the number of standard deviations that a value lies from the mean. A z-score larger than zero indicates that the value is greater than the mean. Vice versa, a z-score below zero indicates that the value is less than the mean. A z-score of zero implies that the value is equal to the mean.

If the population mean μ and population standard deviation σ are known, then, for each value xi (with i = 1, ..., N), the corresponding z-score is calculated as follows:

$z = \frac{x_{i} - \mu}{\sigma}$

To illustrate, consider that a large number of students take a college entrance exam. Suppose the mean score on this exam is 570 with a standard deviation of 40. If we are interested in the z-score for a student who scored 600 on the exam, we can compute the z-score corresponding to this value.

$z = \frac{x_{i} - \mu}{\sigma} = \frac{600 - 570}{40} = 0.75$

This means that the student scored 0.75 standard deviations above the mean. Here, we cannot use the empirical rule, because it only covers intervals of 1, 2, and 3 standard deviations around the mean. We can, however, look up the corresponding probability in the standard normal distribution table (which you will be provided during the exam, see also Table 1 of the book). Looking up a z-score of 0.75 results in p = 0.7734, which means that 77.34% of the scores are lower than the score of this student. Vice versa, 1 - 0.7734 = 0.2266 implies that 22.66% of the students score higher than this student.
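The entrance exam calculation can be verified with the standard library, where `statistics.NormalDist().cdf` plays the role of the standard normal table. As with the table lookup, this assumes that the exam scores are approximately normally distributed.

```python
from statistics import NormalDist

mu, sigma = 570, 40                  # mean and standard deviation of the exam scores
x = 600                              # score of the student

z = (x - mu) / sigma                 # number of standard deviations above the mean
below = NormalDist().cdf(z)          # proportion of scores below this student
above = 1 - below                    # proportion of scores above this student

print(f"z = {z:.2f}, below = {below:.4f}, above = {above:.4f}")
```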

### Which measures to use for grouped data?

In the event of grouped data, other measures are available, such as the weighted mean, the approximate mean, and variance for grouped data. These measures will be discussed in this section.

#### Weighted mean

In some situations, a special type of mean is required, namely the weighted mean. Weighted means are used for example for calculating GPA, determining average stock recommendation, and approximating the mean of grouped data. The weighted mean is given by:

$\bar{x} = \frac{\Sigma w_{i}x_{i}}{n}$

where wi is the weight of the ith observation and n = Σwi.
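As a sketch of the GPA case, the grade points below are weighted by the number of credits of each course. Both lists are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by n, where n is the sum of the weights."""
    n = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / n

grade_points = [4.0, 3.0, 2.0]   # hypothetical grade points per course
credits = [3, 4, 3]              # hypothetical credits, used as weights

gpa = weighted_mean(grade_points, credits)
print(gpa)                       # (12 + 12 + 6) / 10
```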

#### Approximate mean and variance for grouped data

Suppose the data are grouped into K classes with frequencies f1, f2, ..., fK (that is, each class has its own frequency), and that the midpoints of these classes are m1, m2, ..., mK. Then, the sample mean for grouped data is approximated by:

$\bar{x} = \frac{\sum^{K}_{i=1}f_{i}m_{i}}{n}$

where $n = \sum^{K}_{i=1}f_{i}$

The variance for grouped data is given by:

$s^{2} = \frac{\sum^{K}_{i=1}f_{i}(m_{i}-\bar{x})^{2}}{n-1}$
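The two grouped-data formulas can be sketched in Python as follows. The class midpoints and frequencies are hypothetical, and `grouped_mean_and_variance` is an illustrative helper name.

```python
def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate sample mean and variance from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(
        f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)
    ) / (n - 1)
    return mean, variance

# Hypothetical classes 0-10, 10-20, 20-30 with their midpoints and frequencies.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

mean, variance = grouped_mean_and_variance(midpoints, frequencies)
print(f"approximate mean = {mean:.2f}")
print(f"approximate variance = {variance:.2f}")
```

Because every observation in a class is treated as if it sat at the class midpoint, both results are approximations of the values that the raw data would give.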

### Which numerical measures are available to describe a relationship between two variables?

In chapter 1, we discussed graphical ways to describe a relationship between two variables. Now, we move on to numerical measures of this relationship. There are two main numerical measures to this end: covariance and correlation.

#### Covariance

Covariance (Cov) is a measure that indicates the degree of linear relationship between two variables. A positive value indicates a direct or increasing linear relationship. A negative value indicates a decreasing linear relationship between the variables.

The population covariance between X and Y is given by:

$Cov(x,y) = \sigma_{xy} = \frac{\sum^{N}_{i=1}(x_{i} - \mu_{x})(y_{i} - \mu_{y}) }{N}$

where N is the population size, xi and yi are the observed values for populations X and Y, and μx and μy are the population means.

Similarly, the sample covariance is given by:

$Cov(x,y) = s_{xy} = \frac{\sum^{n}_{i=1}(x_{i} - \bar{x})(y_{i} - \bar{y}) }{n-1}$

where n is the sample size, xi and yi are the observed values for samples X and Y, and x and y (with bars above) are the sample means.

Be aware that covariance does not provide a measure of the strength of a relationship between two variables. Instead, covariance is a measure of the direction of a linear relationship between two variables.

#### Correlation

The correlation coefficient is a measure that provides both the direction and the strength of a relationship. The population correlation coefficient (rho) is given by:

$\rho = \frac{Cov(x,y)}{\sigma_{x}\sigma_{y}}$

Similarly, the sample correlation coefficient is given by:

$r = \frac{Cov(x,y)}{s_{x}s_{y}}$

A useful rule of thumb is that there is a relationship between two variables if:

$|r| \geq \frac{2}{\sqrt{n}}$

The correlation coefficient ranges from -1 to +1. A value close to 1 indicates a strong positive linear relationship between the two variables. A value close to -1 indicates a strong negative linear relationship between the two variables. And a correlation coefficient of 0 indicates no linear relationship between the two variables.

Be aware that the correlation coefficient does not imply causation. It may happen that two variables are highly correlated, but that does not mean that one variable causes the other variable.
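To make these measures concrete, the sketch below computes the sample covariance, the sample correlation, and the rule-of-thumb threshold in Python. The paired observations are hypothetical, the helper names are illustrative, and the sample covariance is assumed to use n - 1 in the denominator, the usual convention that matches the sample standard deviations sx and sy.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with n - 1 in the denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
threshold = 2 / math.sqrt(len(x))
passes_rule = abs(r) >= threshold

print(f"s_xy = {s_xy:.3f}, r = {r:.3f}, threshold = {threshold:.3f}")
print("rule of thumb met:", passes_rule)
```

With only five observations the threshold is high (about 0.89), so even a correlation of roughly 0.77 does not satisfy the rule of thumb in this example.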


## How to use probability calculation? - Chapter 3

In practice, business decisions and policies are often based on an implicit or assumed set of probabilities. Often, we cannot be certain about the occurrence of a future event. Yet, if the probability of an event (for instance, whether a legal contract exists) is known, then we have a better chance of making the best possible decision, in comparison to having no idea at all about the occurrence of the event.

### Which definitions and concepts provide structure for defining probabilities?

To provide structure for defining probabilities, in this section, we discuss some important definitions and concepts, such as sample space, outcomes, and events. These are the basic building blocks for defining and calculating probabilities.

#### Random experiment

Probability starts with the concept of a random experiment. A random experiment is a process that leads to two or more outcomes without knowing exactly which outcome will occur. Examples are: tossing a coin (the outcome is either heads or tails), the daily change in an index of stock market prices, and the number of persons admitted to a hospital emergency room during an hour (again, there are two or more outcomes, and the outcome cannot be known in advance).

#### Basic outcomes and sample space

The possible outcomes from a random experiment are referred to as the basic outcomes. The basic outcomes must be defined in such a way that no two outcomes can occur simultaneously. The set of all basic outcomes is called the sample space. Sample space is denoted by the symbol S.

An example of a sample space for a professional baseball batter is provided in Table 1. These probabilities are obtained by studying professional baseball batters' data. There are six outcomes. No two outcomes can occur simultaneously, and one of the six outcomes must occur.

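Table 1: Sample space for a professional baseball batter

| Outcome | Sample space (S) | Probability |
|---|---|---|
| O1 | Safe hit | 0.30 |
| O2 | Walk or hit by pitcher | 0.10 |
| O3 | Strikeout | 0.10 |
| O4 | Groundball out | 0.30 |
| O5 | Fly ball out | 0.18 |
| O6 | Reach base on an error | 0.02 |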

#### Event

Often, we are not interested in the individual outcomes, but in some subset of the basic outcomes. For instance, we might be interested in whether the batter reaches the base safely. Therefore, a subset of three outcomes is of interest: safe hit (0.30), walk or hit by pitcher (0.10), and reach base on an error (0.02). This subset of basic outcomes, then, is called an event. An event, denoted by the symbol E, is a subset of basic outcomes in the sample space. A null event refers to the absence of a basic outcome and is denoted by ∅.

#### Intersection of events and joint probability

Sometimes, we are interested in the simultaneous occurrence of two or more events. In that case, the intersection of events is of interest. Let A and B be two events in the sample space S. Then, the intersection between A and B is denoted by A ∩ B, which refers to the set of all basic outcomes in S that belong to both A and B. In other words, the intersection A ∩ B occurs if, and only if, both event A and event B occur.

The concept of joint probability refers to the probability of the intersection of events A and B. In other words, this is the probability that both events occur. It is possible, however, that the intersection of two events is an empty set. Suppose, for instance, that we add an event C: "batter is out". In that case, the intersection between event A ("batter reaches base safely") and event C ("batter is out") would be an empty set. This implies that A and C are mutually exclusive (i.e., they have no common basic outcomes, and their intersection is said to be the empty set). More generally, the K events E1, E2, ..., EK are said to be mutually exclusive if every pair (Ei, Ej) is a pair of mutually exclusive events.

#### Union

Again, let A and B be two events in sample space S. Their union is the set of all basic outcomes in S that belong to at least one of the two events. The union is denoted by A ∪ B and occurs if, and only if, either A or B or both occur. In general terms, this implies that, given the K events E1, E2, ..., EK, their union E1 ∪ E2 ∪ ... ∪ EK is the set of all basic outcomes that belong to at least one of these K events.

#### Collectively exhaustive

If the union of several events covers the entire sample space S, the events are said to be collectively exhaustive. In general terms, given the K events E1, E2, ..., EK in the sample space S, if E1 ∪ E2 ∪ ... ∪ EK = S, these K events are collectively exhaustive.

#### Complement

Lastly, we define the concept of complement. Let A be an event in the sample space S. The set of basic outcomes of a random experiment that belong to S, but not to A, is called the complement of A, denoted by: Ā.

This implies that events A and complement Ā are mutually exclusive. That is, no basic outcome of a random experiment can belong to both events. Moreover, they are collectively exhaustive: every basic outcome must belong to one or the other.

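Table 2 shows how the sample space is divided when events A and B intersect. Table 3 shows the same division when A and B are mutually exclusive.

Table 2: Intersection of events

| | $B$ | $\bar{B}$ |
|---|---|---|
| $A$ | $A \cap B$ | $A - (A \cap B)$ |
| $\bar{A}$ | $B - (A \cap B)$ | $\bar{A} \cap \bar{B}$ |

Table 3: Mutually exclusive events

| | $B$ | $\bar{B}$ |
|---|---|---|
| $A$ | $\emptyset$ | $A$ |
| $\bar{A}$ | $B$ | $\bar{A} \cap \bar{B}$ |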

### What are the three definitions of probability?

There are three definitions of probability that will be considered in this section: (1) classical probability; (2) relative frequency probability; and (3) subjective probability.

#### 1. Classical probability

Classical probability is the proportion of times that a certain event will occur, assuming that all outcomes in the sample space are equally likely to occur. The probability of such an event A, denoted by P(A), is then defined as:

$P(A) = \frac{N_{A}}{N}$

where NA refers to the number of outcomes that satisfy the condition of event A, and N refers to the total number of outcomes in the sample space. In other words, the probability of event A is obtained by dividing the number of outcomes in the sample space that satisfy the condition of event A by the total number of outcomes in the sample space.

Enumerating all possible outcomes (N) can be very time consuming. Therefore, we can use the following formula to determine the number of combinations of n items taken x at a time:

$C^{n}_{x} = \frac{n!}{x!(n-x)!}$

with 0! = 1.

Suppose we are interested in some number x of objects that are placed in a certain order. Each object may only be placed once. How many different sequences are possible? In this case, we use the following formula:

$x(x-1)(x-2) ... (2)(1) = x!$

where x! is x factorial.

Now, suppose we have n objects from which x ordered boxes are to be filled (with n > x). As in the illustration above, each object may only be used once. The number of possible orderings is referred to as the number of permutations of x objects, chosen from n. The total number of permutations can be obtained as follows:

$P^{n}_{x} = n(n-1)(n-2) \cdots (n-x+1) = \frac{n!}{(n-x)!}$

An example: suppose there are 4 letters: A, B, C, and D. Two letters have to be selected and these have to be arranged in order. Using the formula above with n = 4 and x = 2 yields the following number of permutations:

$P^{4}_{2} = \frac{4!}{(4-2)!} = \frac{4!}{2!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{2 \cdot 1} = \frac{24}{2} = 12$

Thus, there are twelve permutations. These are: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC.

Suppose we are not interested in the number of permutations. Instead, we now are interested in the number of different ways that x objects can be selected from n regardless of the order. This number of possible selections is also known as the number of combinations and can be calculated as follows:

$C^{n}_{x} = \frac{P^{n}_{x}}{x!} = \frac{n!}{x!(n-x)!}$

To illustrate, suppose we are interested in the probability of employee selection. There are 8 candidates who applied for the job, of whom 5 are men and 3 are women. Yet, only 4 candidates can be selected. If every combination of candidates has an equal probability, what is then the probability that no women will be hired?

First, we need to calculate the total number of possible combinations. This is done as follows:

$C^{8}_{4} = \frac{8!}{4!4!} = 70$

Then, if no woman is to be hired, this implies that the four successful candidates must come from the available five men. That means that the number of combinations is as follows:

$C^{5}_{4} = \frac{5!}{4!1!} = 5$

To conclude, if each of the 70 possible combinations is equally likely to be chosen, the probability that one of the 5 all-male combinations is selected is 5/70 = 1/14 ≈ 0.07 (that is, about 7%).
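The counting formulas above can be checked with Python's standard library. This is a minimal sketch that assumes Python 3.8 or later (for `math.perm` and `math.comb`). The variable names are illustrative.

```python
import itertools
import math

# Ordered selections of 2 letters out of 4 (permutations).
letters = "ABCD"
ordered_pairs = ["".join(p) for p in itertools.permutations(letters, 2)]
n_permutations = math.perm(4, 2)

# Unordered selections of 4 candidates out of 8 (combinations).
total_selections = math.comb(8, 4)
all_male_selections = math.comb(5, 4)
p_no_women = all_male_selections / total_selections

print(n_permutations, ordered_pairs)
print(total_selections, all_male_selections, round(p_no_women, 4))
```

Listing the permutations explicitly is only feasible for small n. For larger problems the formulas are used precisely because enumeration becomes impractical.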

#### 2. Relative frequency

A second definition of probability refers to the relative frequency. The relative frequency probability is the limit of the proportion of times that event A occurs in a large number of trials (n). The relative frequency probability can be computed as follows:

$P(A) = \frac{n_{A}}{n}$

where nA refers to the number of A outcomes and n to the total number of trials (or outcomes). The probability is the limit as n becomes large or approaches infinity.

#### 3. Subjective probability

The third definition of probability refers to subjective probability. Subjective probability is an individual's degree of belief about the chance that a particular event will occur. Such subjective probabilities are sometimes used in certain management decision procedures. Subjective probabilities are personal. There is no requirement that different individuals arrive at the same probabilities for the same event.

### What are the three postulates of probabilities?

There are three postulates (rules) that probabilities will be required to obey.

1. If A is any event in the sample space S, then:
$0 \leq P(A) \leq 1$
2. If A is an event in S, and Oi denotes the basic outcomes, then:
$P(A) = \sum_{A} P(O_{i})$
where the summation extends over all the basic outcomes in A.
3. P(S) = 1

In words, this means that: (1) a probability always lies between 0 and 1; (2) the probability of an event is the sum of the probabilities of the basic outcomes it contains, since the basic outcomes are mutually exclusive; and (3) when a random experiment is carried out, "something" has to happen. That is, the sum of all probabilities for all basic outcomes in the sample space is equal to 1.

### What are the probability rules for compound events?

In this section, the rules for compound events are introduced.

#### Complement rule

First, the complement rule is defined as:

$P(\bar{A}) = 1 - P(A)$

This rule is important, as it is sometimes easier to find P(A) than the probability of its complement (or vice versa). Once one of the two is known, the other follows directly from the rule.

#### Addition rule of probabilities

Second, according to the addition rule of probabilities, the probability of the union is defined as:

$P (A ∪ B) = P(A) + P(B) - P(A ∩ B)$

Note that this formula can also be transformed to:

$P (A ∩ B) = P(A) + P(B) - P(A ∪ B)$

#### Conditional probability

Third, suppose we are interested in the probability of A, given that B has occurred. In that case, we are interested in the conditional probability. The conditional probability is denoted by the symbol P(A|B) and can be obtained as follows:

$P(A|B) = \frac{P(A ∩ B)}{P(B)}$

provided that P(B) > 0.

Similarly, the conditional probability of B given that A has occurred can be obtained as:

$P(B|A) = \frac{P(A ∩ B)}{P(A)}$

again, provided that P(A) > 0.

To illustrate, let P(A) = 0.75, P(B) = 0.80, and P(A ∩ B) = 0.65. The conditional probability of event A, given that event B has occurred is:

$P(A|B) = \frac{P(A ∩ B)}{P(B)} = \frac{0.65}{0.80} = 0.8125$

#### The multiplication rule of probabilities

Suppose there are two events: event A and event B. Using the multiplication rule of probabilities, the probability of their intersection can be obtained from the conditional probability as follows:

$P(A ∩ B) = P(A|B) P(B)$

Another way to obtain the probability of their intersection is via:

$P(A ∩ B) = P(B|A) P(A)$

#### Statistical independence

Lastly, we consider the case of statistical independence. Statistical independence is the special case in which the conditional probability of A given B, P(A|B), is the same as the unconditional probability of A, P(A). In formula: P(A|B) = P(A). Equivalently, events A and B are statistically independent if, and only if, P(A ∩ B) = P(A) P(B). In general, this equality does not hold. But when it does, knowing that event B has occurred does not change the probability that event A occurs.
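The rules for compound events can be collected in a short Python sketch. It reuses the values P(A) = 0.75, P(B) = 0.80, and P(A ∩ B) = 0.65 from the conditional probability illustration. The function names are illustrative, and the tolerance used to compare floating-point numbers is an assumption of this sketch.

```python
import math

def union_probability(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

def conditional_probability(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

def are_independent(p_a, p_b, p_a_and_b):
    """Independence holds if and only if P(A and B) = P(A) * P(B)."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=1e-12)

p_a, p_b, p_a_and_b = 0.75, 0.80, 0.65

p_union = union_probability(p_a, p_b, p_a_and_b)
p_a_given_b = conditional_probability(p_a_and_b, p_b)
independent = are_independent(p_a, p_b, p_a_and_b)

print(f"P(A or B) = {p_union:.2f}")
print(f"P(A | B) = {p_a_given_b:.4f}")
print("independent:", independent)
```

Here P(A) P(B) = 0.60 differs from P(A ∩ B) = 0.65, so the two events in the illustration are not independent. Equivalently, P(A|B) = 0.8125 differs from P(A) = 0.75.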

### What are bivariate probabilities?

In this section, we move on to the scenario in which there are two distinct sets of events. We label these sets A1, A2, ..., AH and B1, B2, ..., BK. Now, these two sets of events jointly are called bivariate and their probabilities are called bivariate probabilities. The methods that will be discussed in this section can also be applied to trivariate and higher-level probabilities, but with added complexity.

For bivariate probabilities, the probabilities of the intersections between these two sets, that is P(Ai ∩ Bj), are called joint probabilities. Next, the probabilities for individual events, P(Ai) or P(Bj), are called marginal probabilities. These marginal probabilities can be calculated by summing the joint probabilities in the corresponding row or column.

If every event Ai is statistically independent of every event Bj, then A and B are said to be independent events.

#### Odds

Sometimes, we are interested in communicating probability information. One way to do so is via odds. The odds of a particular event are provided by the ratio of the probability of one event divided by the probability of the complement of that event. That is, the odds in favor of event A are:

$Odds = \frac{P(A)}{1-P(A)} = \frac{P(A)}{P(\bar{A})}$

To illustrate, odds of 2 to 1 in favor of A can be transformed to the probability of A winning:

$\frac{2}{1} = \frac{P(A)}{1-P(A)}$

$P(A) = 2 - 2P(A)$

$3P(A) = 2$

thus P(A) = 2/3 ≈ 0.67
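Solving the odds formula for P(A) in general gives P(A) = odds / (1 + odds). The following Python sketch shows the conversion in both directions. The function names are illustrative.

```python
def probability_to_odds(p):
    """Odds in favor of an event with probability p (requires p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def odds_to_probability(odds):
    """Invert odds = p / (1 - p), which gives p = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

p_win = odds_to_probability(2 / 1)
print(f"P(A) = {p_win:.2f}")
print(f"odds = {probability_to_odds(p_win):.2f}")
```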

#### Overinvolvement ratios

In some situations, it is difficult to obtain the desired conditional probabilities, yet alternative conditional probabilities are available. For instance, the costs of enumerations are too high or some ethical restriction prevents us from directly obtaining the set of probabilities. Based on these alternative probabilities, there are several ways we are still able to obtain the desired probabilities. One such way is via overinvolvement ratios.

The overinvolvement ratio is the ratio of the probability of event A1, conditional on event B1, to the probability of event A1 conditional on event B2, where B1 and B2 are mutually exclusive and complementary. In formula, the overinvolvement ratio is defined as:

$\frac{P(A_{1}|B_{1})}{P(A_{1}|B_{2})}$

If the overinvolvement ratio is greater than 1, this implies that event A1 increases the conditional odds ratio in favor of event B1. That is:

$\frac{P(B_{1}|A_{1})}{P(B_{2}|A_{1})} > \frac{P(B_{1})}{P(B_{2})}$

Suppose we know that 60% of the people who buy our product have seen our advertisement. Yet, only 30% of the people who do not buy our product have seen the advertisement. The ratio of 60% to 30% is the overinvolvement of the event "seen our advertisement".

In the notation above, the events are:

• A1: the person has seen the advertisement
• B1: the person buys the product
• B2: the person does not buy the product

so that P(A1|B1) = 0.60 and P(A1|B2) = 0.30.

The overinvolvement ratio is 0.60/0.30 = 2.0. Since this ratio is greater than 1, we conclude that having seen the advertisement increases the odds in favor of purchase.

### What is Bayes' theorem?

Bayes' theorem was developed, as the name already suggests, by Thomas Bayes (1702-1761). Bayes' theorem offers a tool to determine how probability statements can be adjusted given additional information.

Bayes' theorem follows from the multiplication rule. Now, let A1 and B1 be two events. According to Bayes' theorem then, it applies that:

$P(B_{1}|A_{1}) = \frac{P(A_{1}|B_{1})P(B_{1})}{P(A_{1})}$

and, similarly

$P(A_{1}|B_{1}) = \frac{P(B_{1}|A_{1})P(A_{1})}{P(B_{1})}$

To apply Bayes' theorem one should follow the next four steps:

1. Define from the problem the subset of events.
2. Define the probabilities and conditional probabilities for each of the events defined in step 1.
3. Compute the complements for each of these probabilities defined in step 2.
4. Formally state and apply Bayes' theorem to compute the solution probability.

This will be illustrated with an example. Suppose that a car dealership knows from previous experiences that 10% of the people who walk into the showroom and talk to the salesperson, eventually buy a car. The manager of the showroom wants to increase the chances of success and therefore proposes to offer free dinner for all people who agree to listen to the full presentation of the salesperson. However, some people will do anything to obtain free dinner, even if they are not interested whatsoever in buying a new car. It is therefore important to test the effectiveness of this free dinner plan. An experiment is conducted for six months. It was found that 40% of the people who bought a car had a free dinner. Further, 10% of the people who did not buy a car had a free dinner. Now, the question is twofold: (1) do people who accept the dinner have a higher probability of buying a new car? (2) what is the probability that a person who does not accept a free dinner buys a car?

Step 1. Define the subset of events.

• D1: the customer accepts a free dinner
• D2: the customer does not accept a free dinner
• P1: the customer buys a car
• P2: the customer does not buy a car

Step 2. Define the probabilities and conditional probabilities for each of the events defined in step 1.

• P(P1) = 0.10
• P(D1|P1) = 0.40
• P(D1|P2) = 0.10

Step 3. Compute the complements for each of these probabilities defined in step 2.

• P(P2) = 0.90
• P(D2|P1) = 0.60
• P(D2|P2) = 0.90

Step 4. Apply Bayes' theorem

For the first question, we find that:

$P(P_{1}|D_{1}) = \frac{P(D_{1}|P_{1}) P (P_{1}) }{P(D_{1}|P_{1}) P (P_{1}) + P(D_{1}|P_{2}) P (P_{2}) }$

$= \frac{0.40 * 0.10}{0.40 * 0.10 + 0.10 * 0.90} = 0.308$

This implies that the probability of buying a car is higher given that the customer accepts a free dinner: 0.308, compared with 0.10 for customers overall.

For the second question, we find that:

$P(P_{1}|D_{2}) = \frac{P(D_{2}|P_{1}) P (P_{1}) }{P(D_{2}|P_{1}) P (P_{1}) + P(D_{2}|P_{2}) P (P_{2}) }$

$= \frac{0.60 * 0.10}{0.60 * 0.10 + 0.90 * 0.90} = 0.069$

This shows that people who do not accept the dinner have a lower probability of buying a car.
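The two calculations in step 4 can be sketched in Python as follows. The helper `posterior` is an illustrative name. It assumes a single event and its complement, which matches the buy / does not buy split in this example.

```python
def posterior(prior, likelihood, likelihood_given_complement):
    """P(H | E) by Bayes' theorem for an event H and its complement.

    prior: P(H)
    likelihood: P(E | H)
    likelihood_given_complement: P(E | not H)
    """
    numerator = likelihood * prior
    evidence = numerator + likelihood_given_complement * (1.0 - prior)
    return numerator / evidence

# Values from the car dealership example.
p_buy = 0.10
p_dinner_given_buy = 0.40
p_dinner_given_no_buy = 0.10

p_buy_given_dinner = posterior(p_buy, p_dinner_given_buy, p_dinner_given_no_buy)
p_buy_given_no_dinner = posterior(
    p_buy, 1 - p_dinner_given_buy, 1 - p_dinner_given_no_buy
)

print(f"P(buy | dinner) = {p_buy_given_dinner:.3f}")
print(f"P(buy | no dinner) = {p_buy_given_no_dinner:.3f}")
```

The denominator inside `posterior` is the total probability of the evidence, P(D1) or P(D2), written out over the two purchase outcomes.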


## How to use probability models for discrete random variables? - Chapter 4

In the previous chapter, we introduced the concept of probability to represent situations with uncertain outcomes. In this chapter, we use those ideas to construct probability models for discrete random variables. In the next chapter, we will use those ideas to construct such probability models for continuous random variables. Probability models are widely applied to various business problems. Suppose, for instance, that you know from past experience that 30% of people who enter a car rental store want to rent a van. Today, you have three vans available. Five completely unrelated (random) people enter your rental store. What is the probability that these five people want to rent a total of four or five vans? To answer this question, probability models are useful.

### What is a random variable?

A random variable is a variable that takes on numerical values, which are the results of the outcomes in a sample space generated by a random experiment. Be aware that there is a difference between a random variable (denoted by capital letters, such as X) and the possible values that it can take (denoted by lower case letters, for instance, x).

There are two types of random variables: discrete random variables and continuous random variables. A discrete random variable is a random variable that can take no more than a countable number of values. The possible outcomes are for instance: 1, 2, 3, and so forth. An example of a discrete random variable is the number of customers that want to rent a van. A continuous random variable is a random variable that can take on any value in a certain interval. For continuous random variables we assign probabilities only to a range of values. Examples of continuous random variables are: the yearly income for a family, the length of a phone call to your mother, and the time it takes you to get to work.

### What is a probability distribution function?

Once the probabilities have been calculated, we can form the probability distribution function. The probability distribution function of a discrete random variable X, denoted by P(x), represents the probability that the variable X takes the value x, as a function of x. That is: P(x) = P(X = x) for all values of x.

There are two properties that a probability distribution of a discrete random variable must satisfy:

1. 0 ≤ P(x) ≤ 1 for any value x
In words: the probabilities cannot be negative or exceed 1.
2. The individual probabilities sum to 1, that is:
$\sum_{x} P(x) = 1$
This follows because the events X = x, for all possible values of x, are mutually exclusive and collectively exhaustive.

The cumulative probability distribution F(x0) of a discrete random variable X represents the probability that X does not exceed a certain value, denoted by x0, as a function of x0. In formula, that is:

$F(x_{0}) = P(X \leq x_{0})$

Again, there are two properties that are derived from this distribution:

1. 0 ≤ F(x0) ≤ 1 for every number x0.
2. If x0 and x1 are two numbers with x0 < x1, then F(x0) ≤ F(x1).

In words, this implies that the probability cannot be negative or exceed one and that the probability that a random variable does not exceed a particular number cannot be more than the probability that it does not exceed any larger number.

### What are the properties of discrete random variables?

Although the probability distribution contains all information about the probability properties of some random variable and a visual (graphical) inspection of this distribution certainly provides some information, it is useful to have some summary measures of the characteristics of the probability distribution. These summary measures are discussed in this section.

The expected value E[X] of a discrete random variable X is provided by:

$E[X] = \mu = \sum_{x} xP(x)$

That is, the expected value of a random variable is also called its mean and is denoted by the symbol μ. This notion of expectation is not limited to the random variable itself, but can also be applied to any function of the random variable. In that case, you simply have to replace the symbol x by a function, for instance g(x).

The variance, denoted by σ², is the expectation of the squared deviations about the mean, (X - μ)², which is given by:

$\sigma^{2} = E[(X - \mu)^{2}] = \sum_{x}(x- \mu)^{2}P(x)$

This variance can also be expressed as:

$\sigma^{2} = E[X^{2}] - \mu^{2} = \sum_{x} x^{2}P(x) - \mu^{2}$

Subsequently, the standard deviation is obtained by taking the positive square root of the variance.

Finally, we consider the case of a linear function of a random variable: Y = a + bX. That is, when a random variable X takes on a specific value x, Y must take on the value a + bx. The mean of Y can be derived as follows:

$\mu_{Y} = E[a + bX] = a + b\mu_{X}$

And the variance of Y can be derived as follows:

$\sigma^{2}_{Y} = Var(a + bX) = b^{2}\sigma^{2}_{X}$

so that the standard deviation of Y is equal to:

$\sigma_{Y} = |b| \sigma_{X}$
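The summary measures of this section can be sketched in Python as follows. The probability distribution and the constants a and b are hypothetical values chosen for this example, and `expected_value` is an illustrative helper name.

```python
import math

# Hypothetical probability distribution: value x -> P(x).
distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

def expected_value(dist, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * P(x)."""
    return sum(g(x) * p for x, p in dist.items())

mean_x = expected_value(distribution)
var_x = expected_value(distribution, lambda x: (x - mean_x) ** 2)
var_x_shortcut = expected_value(distribution, lambda x: x ** 2) - mean_x ** 2
sd_x = math.sqrt(var_x)

# Linear function Y = a + bX with hypothetical constants a and b.
a, b = 2, 3
mean_y = a + b * mean_x
var_y = b ** 2 * var_x
sd_y = abs(b) * sd_x

print(f"mean of X = {mean_x:.2f}, variance of X = {var_x:.2f}")
print(f"shortcut variance of X = {var_x_shortcut:.2f}")
print(f"mean of Y = {mean_y:.2f}, variance of Y = {var_y:.2f}, sd of Y = {sd_y:.2f}")
```

Both variance expressions give the same value up to floating-point rounding, and the constant a shifts the mean of Y without affecting its variance.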

### What is a binomial distribution?

#### Bernoulli distribution

Before defining the binomial distribution, it is useful to begin with the Bernoulli model, because this model is the basic building block for the binomial distribution. Suppose we conduct a random experiment with only two possible outcomes. These two outcomes are mutually exclusive and collectively exhaustive. We label these outcomes "success" and "failure", respectively, and let the random variable X take the value 1 for a success and 0 for a failure. Now, let P denote the probability of success, so that 1 - P is the probability of failure. The probability distribution of the random variable can then be defined as follows: P(0) = (1 - P) and P(1) = P. This distribution is called the Bernoulli distribution.

The mean of the Bernoulli distribution can be found as follows:

$\mu_{X} = E[X] = \sum_{x}xP(x) = (0)(1 - P) + (1)P = P$

and the variance of the Bernoulli distribution is given by:

$\sigma^{2}_{X} = E[(X - \mu_{X})^{2}] = \sum_{x} (x-\mu_{X})^{2} P(x) = P(1 - P)$

#### Binomial distribution

The binomial distribution is an important generalization of the Bernoulli distribution in which a scenario with two possible outcomes is repeated several times and the repetitions are independent. Let n be the number of independent repetitions and let x be the number of successes. The number of sequences with x successes in n independent trials is defined as follows:

$C^{n}_{x} = \frac{n!}{x!(n - x)!}$

Next, the binomial distribution for a random variable X = x is defined as follows:

$P(x) = \frac{n!}{x!(n - x)!} P^{x} (1 - P)^{(n - x)} \quad \text{for } x = 0, 1, 2, \ldots, n$

The mean and variance for the binomial distribution can be found by:

$\mu = E[X] = nP$

$\sigma^{2}_{X} = E[(X-\mu_{X})^{2}] = nP (1 - P)$
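The binomial formula can be applied to the van rental question from the chapter introduction, where five independent customers each want a van with probability 0.30. The following is a minimal Python sketch that assumes Python 3.8 or later (for `math.comb`). The function name `binomial_pmf` is illustrative.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Five customers, each wanting a van with probability 0.30.
n, p = 5, 0.30
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

p_four_or_five = pmf[4] + pmf[5]
mean = n * p
variance = n * p * (1 - p)

print(f"P(X >= 4) = {p_four_or_five:.5f}")
print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```

The probability that four or five of the customers want a van is about 0.031, so with three vans available a shortage is unlikely in this scenario.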

### What is a Poisson distribution?

The Poisson distribution was proposed by, as the name already suggests, Simeon Poisson (1781-1840). The Poisson distribution is important for many applications in daily life, including, among others: the number of failures in a computer system during a particular day, the number of customers that arrive at a checkout aisle in your local grocery store during a particular time interval, and the number of replacement orders received by a company during a particular month. As you may have noticed, the Poisson distribution applies to the number of occurrences or successes of a certain event during a given continuous interval. This interval can be divided into a large number of equal subintervals such that the probability of occurrence (success) of an event in any subinterval is very small.

There are three assumptions that apply to the Poisson distribution:

1. The probability of occurrence (success) of an event is constant for all the subintervals.
2. There can be no more than one occurrence (success) in each subinterval.
3. The occurrences (successes) are independent. This means that an occurrence (success) in one interval does not influence the probability of an occurrence (success) in another interval.

The Poisson distribution can be derived directly from the binomial distribution by taking the mathematical limits as P goes to 0 and n goes to infinity. As a result, the parameter λ = nP becomes a constant that specifies the average number of occurrences (successes) for a particular time and/or space interval.

Let P(x) be the probability of x successes over a given time or space, given λ. And let λ be the expected number of successes per time or space unit, with λ > 0. A random variable X is said to follow the Poisson distribution if it has the following probability distribution:

$P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!} \quad \text{for } x = 0, 1, 2, \ldots$

in which e denotes the base for natural logarithms (i.e., e ≅ 2.71828).

The mean of the Poisson distribution is provided by:

$\mu_{X} = E[X] = \lambda$

And the variance of the Poisson distribution is given by:

$\sigma^{2}_{x} = E[(X - \mu_{x})^{2}] = \lambda$

Note that the sum of independent Poisson random variables is also a Poisson random variable. That is, the sum of K independent Poisson random variables, each with mean λ, is a Poisson random variable with mean Kλ. Poisson distributions have two important applications in the modern global economy. First, they are applied to the probability of failure in complex systems and the probability of defective products in large production runs of several hundred thousand to a million units. An example of such a complex system is Federal Express, a large shipping company with a very extensive pickup, classification, shipping, and delivery system for millions of packages per day. Second, the Poisson distribution is also very useful in waiting line or queuing problems, for instance the number of customers waiting in line at a large retail store. These queuing problems are important for management. For instance, if the queue becomes too long, customers may decide to leave the line or may not return for another shopping visit.

Previously, we mentioned that the Poisson distribution is obtained from the binomial distribution with P approaching 0 and n becoming very large. From this, it follows that the Poisson distribution can be used to approximate the binomial distribution when the number of trials n is large and the probability of success P is small, such that λ = nP < 7. In that case, the probability distribution of the approximating distribution is given by:

$P(x) = \frac{e^{-nP}(nP)^{x}}{x!} \hspace{3mm} for \hspace{3mm} x = 0, 1, 2, ...$
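To get a feeling for the quality of this approximation, the exact binomial probabilities can be compared with the Poisson probabilities for λ = nP. The values n = 1000 and P = 0.003 below are hypothetical and chosen only so that n is large, P is small, and nP is below 7.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Exact binomial probability of x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of x occurrences with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 1000, 0.003  # hypothetical: many trials, small success probability
lam = n * p         # approximately 3, below the rule-of-thumb bound of 7

# Print x, the exact binomial probability, and the Poisson approximation.
for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))

# Largest pointwise difference between the two distributions for x = 0, ..., 20.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21))
print("largest difference:", max_gap)
```

For these assumed values the two columns should differ only in the third or fourth decimal place.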

How to decide on the distribution to use? More precisely, when to use the binomial distribution and when the Poisson distribution? This choice often can be made by carefully reviewing the assumptions for the two distributions. For instance, if the problem concerns a small sample of observations, then it is not possible to identify a limiting probability with large n, which implies that the binomial distribution should be used. Moreover, if there is a small sample and the probability of success for a single trial is between 0.05 and 0.95, then there is further support for a binomial distribution. In general, we can state that if the set of cases is very small, say, fewer than 30, then the binomial distribution should be used. However, if the set of cases that could be affected is very large (for instance several thousand), then the Poisson distribution should be used.

### When to use the hypergeometric distribution?

The binomial distribution, which we discussed before, is useful when the items are drawn independently with an equal probability of each item being selected. These assumptions can be met in many real-life applications if a small sample is drawn from a (very) large population. Sometimes, however, we do not have such a large population. Assume, for instance, that we want to select five employees from a group of 15 equally qualified applicants. Here, we have to deal with a small population. If, in addition, we are sampling without replacement, the hypergeometric distribution can be used. The corresponding probability distribution is:

$P(x) = \frac{C^{S}_{x}C^{N-S}_{n-x}}{C^{N}_{n}}$

where N is the population size, S is the number of successes in the population, n is the sample size, and x is the number of successes in the sample. Here, x can take integer values ranging from the larger of 0 and [n - (N - S)] to the smaller of n and S.
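The hypergeometric probabilities can be computed directly from the combinations, with N the population size, S the number of successes in the population, and n the sample size. In the sketch below, N = 15 and n = 5 follow the applicant example, while S = 6 is a hypothetical number of applicants with some characteristic of interest.

```python
import math

def hypergeom_pmf(x: int, N: int, S: int, n: int) -> float:
    """P(x) = C(S, x) * C(N - S, n - x) / C(N, n)."""
    return math.comb(S, x) * math.comb(N - S, n - x) / math.comb(N, n)

N, S, n = 15, 6, 5  # S = 6 is an assumed value for illustration

# Admissible values of x: from max(0, n - (N - S)) to min(n, S).
lo = max(0, n - (N - S))
hi = min(n, S)
dist = {x: hypergeom_pmf(x, N, S, n) for x in range(lo, hi + 1)}

for x, prob in dist.items():
    print(x, round(prob, 4))

mean = sum(x * prob for x, prob in dist.items())
print("sum:", sum(dist.values()), "mean:", mean)
```

The probabilities should sum to 1, and the mean should equal nS/N, which is 2 for these values.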

Note that, if the population is large (typically N > 10,000) and the sample size is small relative to the population (typically less than 1% of N), then the probability of success changes very little from draw to draw, and the binomial distribution is a sufficiently accurate approximation. Hence, under these conditions, the binomial distribution is typically used.

### How to use probability distributions for jointly distributed discrete random variables?

Often in business and economic applications, statistical questions are related to the relationship between variables. For instance, products may have different prices at different quality levels. Age groups may have different preferences for clothing, cars, and music. The percent return on two different stocks may be related, and so forth. Therefore, in this section, we consider the case of two or more possibly related discrete random variables.

Suppose X and Y are a pair of discrete random variables. The joint probability distribution then refers to the probability that simultaneously X takes on the specific value x and Y takes on the specific value y, as functions of respectively x and y. In formula, the joint probability of x and y is denoted as follows:

$P(x,y) = P(X = x ∩ Y = y)$

The probability distributions of respectively x and y are called the marginal probability distributions, denoted by:

$P(x) = \sum_{y} P(x,y)$

and

$P(y) = \sum_{x} P(x,y)$

Subsequently, there are two properties that the joint probability distribution of discrete random variables must satisfy: (1) 0 ≤ P(x,y) ≤ 1 for any pair of values x and y; and (2) the sum of the joint probabilities P(x,y) over all possible pairs of values must be 1.

Next, the conditional probability distribution of a random variable Y, given specified values of the other random variable X, is the collection of conditional probabilities. The conditional probability of y given x is denoted by P(y|x). The conditional probability of x given y is denoted by P(x|y). They can be obtained as follows:

$P(y|x) = \frac{P(x,y)}{P(x)}$

and

$P(x|y) = \frac{P(x,y)}{P(y)}$

If, and only if, the joint probability distribution of X and Y is the product of their marginal probability distributions, the jointly distributed random variables X and Y are said to be independent. In formula, that is:

$P(x,y) = P(x)P(y)$

for all possible pairs of values x and y. Similarly, from this property of independence, it follows that P(y|x) = P(y) and that P(x|y) = P(x).
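The marginal distributions, the conditional distributions, and the independence check can all be read off a table of joint probabilities. The following sketch uses a small made-up joint distribution; the four probabilities are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical joint distribution P(x, y); keys are (x, y) pairs.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# Marginal distributions: sum the joint probabilities over the other variable.
px = defaultdict(float)
py = defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# Conditional distribution P(y | x) = P(x, y) / P(x), keyed as (y, x).
cond_y_given_x = {(y, x): p / px[x] for (x, y), p in joint.items()}

# Independence requires P(x, y) = P(x) * P(y) for every pair (x, y).
independent = all(abs(p - px[x] * py[y]) < 1e-12 for (x, y), p in joint.items())

print("P(x):", dict(px))
print("P(y):", dict(py))
print("P(y=1 | x=0):", cond_y_given_x[(1, 0)])
print("independent:", independent)
```

In this example P(0,0) = 0.10 while P(x=0)P(y=0) = 0.3 × 0.4 = 0.12, so X and Y are not independent.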

Finally, we consider the expectation of a function of two random variables. This was previously done for a single random variable. The expectation of any function g(X,Y) of two random variables X and Y is defined as follows:

$E [g(X,Y)] = \sum_{x} \sum_{y} g(x,y)P(x,y)$

One measure that is of particular importance for linear functions is the covariance. The covariance is a measure of linear association between two random variables. It is computed from the joint probability distribution of the two random variables and is used, together with the variance of each random variable, to calculate the variance of a linear combination. The covariance is denoted by Cov(X,Y), and for discrete random variables it is defined as follows:

$Cov(X,Y) = E[(X - \mu_{X})(Y - \mu_{Y})] = \sum_{x} \sum_{y} (x - \mu_{X})(y - \mu_{Y}) P(x,y)$

If two random variables are statistically independent, the covariance between them is zero. However, be aware that the converse is not necessarily true.

The covariance does not have an upper or lower bound. Therefore, its size is heavily influenced by the scaling of the variables. As a result, it is difficult to use the covariance as a measure of the strength of a linear relationship between two random variables. A related measure, the correlation coefficient, provides an alternative way to measure the strength of a linear relationship between two random variables. The correlation coefficient is bounded with a range from -1 to +1. The correlation between X and Y can be found as follows:

$\rho = Corr(X,Y) = \frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}}$

A correlation value of zero indicates that there is no linear relationship between two variables. Moreover, if the two variables are independent, it follows that their correlation is equal to zero. A positive correlation value indicates that if one variable is high (low), the other variable also has a higher probability of being high (low). A value of one indicates a perfect positive linear relationship. Vice versa, a negative correlation value indicates that if one variable is high (low), the other variable has a higher probability of being low (high). A value of -1 indicates a perfect negative linear relationship.
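Covariance and correlation are both expectations over the joint distribution, so they can be computed with one small helper. The joint probabilities below are again hypothetical, and `expect` is an illustrative helper name rather than something from the text.

```python
import math

# Hypothetical joint distribution P(x, y); keys are (x, y) pairs.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def expect(g):
    """E[g(X, Y)] = sum over x and y of g(x, y) * P(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)

# Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Corr(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)
corr = cov / math.sqrt(var_x * var_y)

print("Cov(X, Y):", round(cov, 4))
print("Corr(X, Y):", round(corr, 4))
```

The covariance here is slightly negative (-0.02), and dividing by the two standard deviations rescales it to a correlation between -1 and +1.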

Finally, some summary results are presented for linear sums and differences of two random variables:

$E[X + Y] = \mu_{X} + \mu_{Y}$

$E[X - Y] = \mu_{X} - \mu_{Y}$

If the covariance between X and Y is 0:

$Var(X + Y) = \sigma^{2}_{X} + \sigma^{2}_{Y}$

and

$Var(X - Y) = \sigma^{2}_{X} + \sigma^{2}_{Y}$

But if the covariance between X and Y is not 0:

$Var(X + Y) = \sigma^{2}_{X} + \sigma^{2}_{Y} + 2Cov(X,Y)$

and

$Var(X - Y) = \sigma^{2}_{X} + \sigma^{2}_{Y} - 2Cov(X,Y)$

These results for the mean and variance of sums and differences are very useful in business applications. They are commonly used to develop a portfolio. Investment managers spend considerable effort in developing investment portfolios that consist of a set of financial instruments, each of which has returns defined by a probability distribution. These portfolios are used to obtain a combined investment that has a given expected return (the mean; expected value) and risk (the variance). Generally, one desires a high return (thus a higher expected value) and a low risk (thus a lower variance). The portfolio market value is denoted by W and is given by the linear function:

$W = aX + bY$

where a is the number of shares in stock A and b is the number of shares in stock B. The mean value for W can be calculated as follows:

$\mu_{W} = E[W] = E[aX + bY] = a\mu_{X} + b\mu_{Y}$

and the variance for W can be obtained as follows:

$\sigma^{2}_{W} = a^{2}\sigma^{2}_{X} + b^{2}\sigma^{2}_{Y} + 2abCov(X,Y)$

or, via the correlation, as follows:

$\sigma^{2}_{W} = a^{2}\sigma^{2}_{X} + b^{2}\sigma^{2}_{Y} + 2abCorr(X,Y)\sigma_{X}\sigma_{Y}$
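As a small worked illustration of the portfolio formulas, the sketch below computes the mean and variance of W = aX + bY. All numbers (share counts, means, standard deviations, and the correlation) are hypothetical assumptions.

```python
import math

a, b = 20, 30                  # assumed numbers of shares in stock A and stock B
mu_x, mu_y = 25.0, 40.0        # assumed mean prices of the two stocks
sigma_x, sigma_y = 9.0, 11.0   # assumed standard deviations of the prices
corr_xy = -0.4                 # assumed correlation between the two prices

# Cov(X, Y) = Corr(X, Y) * sigma_X * sigma_Y
cov_xy = corr_xy * sigma_x * sigma_y

# Mean and variance of the portfolio value W = aX + bY
mu_w = a * mu_x + b * mu_y
var_w = a**2 * sigma_x**2 + b**2 * sigma_y**2 + 2 * a * b * cov_xy
sd_w = math.sqrt(var_w)

# The same portfolio with uncorrelated prices, for comparison
var_w_uncorrelated = a**2 * sigma_x**2 + b**2 * sigma_y**2

print("expected value:", mu_w)
print("variance:", var_w, "standard deviation:", round(sd_w, 2))
print("variance if uncorrelated:", var_w_uncorrelated)
```

With a negative correlation the portfolio variance (93,780) is smaller than in the uncorrelated case (141,300), while the expected value (1,700) is unchanged.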

In the previous chapter, we introduced the concept of probability to represent situations with uncertain outcomes. In this chapter, we use those ideas to construct probability models for discrete random variables. In the next chapter, we will use those ideas to construct such probability models for continuous random variables. Probability models are widely applied to various business problems. Suppose, for instance, that you know from past experience that 30% of people who enter a car rental store want to rent a van. Today, you have three vans available. Five completely unrelated (random) people enter your rental store. What is the probability that these five people want to rent a total of four or five vans? To answer this question, probability models are useful.

## How to use probability models for continuous random variables? - Chapter 5

In the previous chapter, we discussed how to use probability models for discrete random variables. In this chapter, we extend the probability concepts to continuous random variables. Many measures in economics and business fall into this category of continuous random variables, for instance sales, investment, consumption, and costs. Hence, these probability models for continuous random variables are very important and offer an excellent tool for business and economics applications.

### What probability distribution function is used for continuous random variables?

The probability distribution function that is used for continuous random variables is called the cumulative distribution function, denoted by F(x). The cumulative distribution function plays a role analogous to that of the probability distribution function for discrete random variables. It expresses the probability that X does not exceed the value x, as a function of x. In formula, that is:

$F(x) = P(X \leq x)$

If we are interested in the probability that a continuous random variable X falls in a specific range, we take the difference between the cumulative probability at the upper end of this range and the cumulative probability at the lower end of the range. In formula, that is:

$P(a < X < b) = F(b) - F(a)$

For instance, suppose we are interested in the probability that a continuous random variable X falls between 250 and 750, where X is distributed uniformly over the range 0 to 1,000. Then, the cumulative distribution function is F(x) = 0.001x for 0 ≤ x ≤ 1,000. Therefore, the probability that X falls between 250 and 750 is: P(250 < X < 750) = (0.001)(750) - (0.001)(250) = 0.75 - 0.25 = 0.50. To conclude, the probability that a continuous random variable falls between any two values can be expressed in terms of its cumulative distribution function.

To obtain a graphical interpretation of the probability structure for continuous random variables, we can use the probability density function. The probability density function, f(x) of a continuous random variable X has the following properties:

1. f(x) ≥ 0 for all values of x.
2. The area under the probability density function f(x), taken over the entire range of the random variable X, is equal to 1.0. In other words, the total area under the curve f(x) is 1.
3. The area under the curve f(x) to the left of x0 is F(x0), where x0 is any value that the random variable X can take. In other words, the cumulative distribution function F(x0) is the area under the probability density function f(x) from xm up to x0, where xm is the minimum value of the random variable X.
$F(x_{0}) = \int^{x_{0}}_{x_{m}} f(x)dx$
4. Let a and b be two possible values of the continuous random variable X with a < b. Then, the probability that X lies between a and b is the area under the probability density function between these two points.
$P(a \leq X \leq b) = \int_{a}^{b} f(x)dx$

For any uniform random variable defined over the range a to b, the probability density function is provided as follows:

$f(x) = \frac{1}{b-a} \hspace{3mm} for \hspace{3mm} a \leq x \leq b$

with f(x) = 0 otherwise (that is, if x does not fall between a and b).
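The link between the density and the cumulative distribution function can be illustrated numerically for the uniform example on the range 0 to 1,000. The sketch below approximates the area under f(x) between 250 and 750 with a simple midpoint sum and compares it with F(750) - F(250). The helper names and the number of steps are assumptions made for this illustration.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """f(x) = 1 / (b - a) for a <= x <= b, and 0 otherwise."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) = (x - a) / (b - a) inside the range, 0 below it and 1 above it."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def integrate(f, lo: float, hi: float, steps: int = 10_000) -> float:
    """Approximate the area under f between lo and hi with a midpoint sum."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

a, b = 0.0, 1000.0
area = integrate(lambda x: uniform_pdf(x, a, b), 250.0, 750.0)
via_cdf = uniform_cdf(750.0, a, b) - uniform_cdf(250.0, a, b)

print("area under f(x):", round(area, 6))
print("F(750) - F(250):", via_cdf)
```

Both routes should give 0.50, in line with the worked example for this distribution.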

### How to calculate expected values for continuous random variables?

In the previous chapter, we introduced the concept of expected values for discrete random variables. In this chapter, we extend that concept to continuous random variables. Because the probability of any specific value is 0 for a continuous random variable, the expected values for continuous random variables are computed using integral calculus. The expected value of a function g(X) is denoted by E[g(X)] and can be obtained as follows:

$E[g(X)] = \int_{x} g(x)f(x)dx$

The mean of a continuous random variable X is defined as the expected value of X, that is: μX = E[X]. The variance of X can be obtained as the expectation of the squared deviation, that is: σ² = E[(X - μX)²] or, via an alternative expression: σ² = E[X²] - μX². The standard deviation of X is, as always, obtained by taking the square root of the variance.

#### Uniform distribution

For a uniform distribution, we obtain the following properties:

1. $f(x) = \frac{1}{b - a} \hspace{3mm} for \hspace{3mm} a \leq x \leq b$
2. $\mu_{X} = E[X] = \frac{a + b}{2}$
3. $\sigma^{2}_{X} = E[(X - \mu_{X})^{2}] = \frac{(b - a)^{2}}{12}$

The mean and variance are also called the first moment and the second central moment, respectively.

#### Linear functions of random variables

In chapter 4, we showed how to obtain the means and variances for linear functions of discrete random variables. These are the same for continuous random variables, because the derivations make use of the expected value operator. Therefore, the same formulas can be used to obtain the means and variances.

Consider a linear function of a random variable: Y = a + bX. That is, when the random variable X takes on a specific value x, Y must take on the value a + bx. The mean of Y can be derived as follows:

$\mu_{Y} = E[a + bX] = a + b\mu_{X}$

And the variance of Y can be derived as follows:

$\sigma^{2}_{Y} = Var(a + bX) = b^{2}\sigma^{2}_{X}$

so that the standard deviation of Y is equal to:

$\sigma_{Y} = |b| \sigma_{X}$

An important special case of these results is the standardized random variable, which has mean 0 and variance 1:

$Z = \frac{X - \mu_{X}}{\sigma_{X}}$
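To see why the standardized variable has mean 0 and variance 1, it can be written as a linear function Z = a + bX with a = -μX/σX and b = 1/σX. Applying the rules for the mean and variance of a linear function then gives, as a short worked derivation:

$\mu_{Z} = E\left[\frac{X - \mu_{X}}{\sigma_{X}}\right] = -\frac{\mu_{X}}{\sigma_{X}} + \frac{1}{\sigma_{X}}\mu_{X} = 0$

$\sigma^{2}_{Z} = Var\left(\frac{X - \mu_{X}}{\sigma_{X}}\right) = \frac{1}{\sigma^{2}_{X}}\sigma^{2}_{X} = 1$

Note that this derivation only uses expectations, so it holds for any random variable X with a finite, positive variance, not only for normally distributed variables.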

### How to use the normal probability distribution?

The normal probability distribution, denoted by X ~ N(μ, σ²), is the probability distribution that is most often used for economics and business applications. There are many reasons for the popularity of the normal distribution function. First, it closely approximates the probability distributions for a wide range of random variables. Second, distributions of sample means approach a normal distribution given a "large" sample size. Third, computation of probabilities is direct and elegant. Fourth, and most important, the normal probability distribution has resulted in good business decisions for various applications. Formally, the probability density function for a normally distributed random variable X is given by:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-(x - \mu)^{2} / (2\sigma^{2})}$

The normal probability distribution represents a large family of distributions, each with a unique specification for the two parameters (mean and variance). These parameters have a very convenient interpretation. More precisely, the normal distribution is symmetric. Hence, central tendencies are indicated by the mean. In contrast, the variance indicates the distribution width. By selecting values for the mean and variance, we can define a large family of probability density functions. Each is symmetric, but with a different value for the central tendency (mean) and distribution width (variance).

The cumulative distribution function for the normal distribution is given as follows:

$F(x_{0}) = P(X \leq x_{0})$

As for any density function, the total area under the curve is equal to 1.

#### Standard normal distribution

Any normal distribution can be converted to the standard normal distribution, that is, a normal distribution with mean 0 and standard deviation 1. The standard normal distribution is very convenient, because it is easy to interpret regardless of the scale of the raw variables. The standard normal distribution is denoted by Z ~ N(0,1), which implies that the mean equals 0 and the variance (and standard deviation) equals 1. The formal relationship between the standard score, Z, and the raw score X is given by:

$Z = \frac{X - \mu}{\sigma}$

where X is a normally distributed random variable, X ~ N(μ, σ²). The standard score Z allows us to use the standard normal table to compute probabilities associated with any normally distributed random variable. This table is provided in the Appendix, table 1, and will also be provided during the exam. The table gives values of F(z) = P(Z < z) for nonnegative values of z. For negative values, the symmetry of the distribution gives F(-z) = 1 - F(z). For instance, for a Z value of 1.25, the cumulative probability is F(1.25) = 0.8944. That means that the area for Z less than 1.25 is equal to 0.8944. Or, in other words, the probability of a value of 1.25 or lower equals 0.8944. Vice versa, the probability of exceeding Z = 1.25 equals 1 - 0.8944 = 0.1056.
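When no table is at hand, the same cumulative probabilities can be obtained in software. The sketch below uses `NormalDist` from Python's standard `statistics` module to reproduce the table lookup for z = 1.25. The second part, with a mean of 100 and a standard deviation of 20, is a hypothetical example of standardizing a raw score.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Table lookup: F(1.25) = P(Z < 1.25) and the upper-tail probability.
f_125 = std_normal.cdf(1.25)
upper_tail = 1.0 - f_125
print(round(f_125, 4), round(upper_tail, 4))  # 0.8944 0.1056

# Hypothetical raw score: X ~ N(mu=100, sigma=20), probability that X <= 125.
mu, sigma = 100.0, 20.0
z = (125.0 - mu) / sigma      # standard score Z = (X - mu) / sigma
p_x = std_normal.cdf(z)
print(z, round(p_x, 4))
```

Because the raw score 125 corresponds to z = 1.25, the second probability equals the table value as well.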

A commonly used tool to check whether data follow a normal distribution is the normal probability plot. In this plot, the horizontal axis indicates the data points ranked in order from the smallest to the largest. The vertical axis represents the cumulative normal probabilities of the ranked data values if the sample data were obtained from a population whose random variables follow a normal distribution. If the plotted values are close to a straight line, even at the upper and lower limits, this provides solid evidence that the data have a normal distribution. If the data points deviate from a straight line, for instance with large deviations at the extreme high and low values, we can conclude that the distribution is not normal, for example because it is skewed. A skewed distribution is a major concern in statistics, because statistical inference is often based on the assumption of a normal distribution.

### How can the normal distribution be used to approximate the binomial distribution?

Sometimes, when tables are not available, you can approximate the binomial distribution by the normal distribution. In this section, we will show you how this approximation works. By using the normal distribution instead of the binomial distribution, we can reduce the number of different statistical procedures that you need to know to solve business problems.

One way to assess whether the binomial distribution can be approximated by a normal distribution is by means of graphs. A plot may provide visual evidence that the distribution has the same shape as the normal distribution. More specifically, a rule of thumb indicates when the normal distribution can be used as an approximation of the binomial distribution. That is: if the number of trials n is large, such that nP(1-P) > 5, then the distribution of the binomial random variable X can be approximated by a normal distribution with mean nP and variance nP(1-P), so that Z = (X - nP) / √(nP(1-P)) approximately follows the standard normal distribution. Note that nP(1-P) equals the variance of the binomial distribution. If this value is less than 5, the binomial distribution should be used to determine the probabilities. If this value exceeds 5, the normal distribution can be used as an approximation.
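The rule of thumb can be tried out by comparing an exact binomial probability with its normal approximation. In the sketch below, n = 100 and P = 0.4 are hypothetical values for which nP(1-P) = 24 exceeds 5. No continuity correction is applied, since the text above does not introduce one.

```python
import math
from statistics import NormalDist

n, p = 100, 0.4              # hypothetical number of trials and success probability
mean = n * p                 # nP
variance = n * p * (1 - p)   # nP(1 - P), should exceed 5 for the approximation

def binomial_cdf(k: int) -> float:
    """Exact P(X <= k) for a binomial random variable with parameters n and p."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

exact = binomial_cdf(45)
approx = NormalDist(mean, math.sqrt(variance)).cdf(45)

print("nP(1-P):", variance)
print("exact binomial P(X <= 45):", round(exact, 4))
print("normal approximation:", round(approx, 4))
```

The two probabilities should be reasonably close, with the remaining gap shrinking as n grows.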

#### Proportions of random variables

In various applied problems we need to compute probabilities for proportions or percentage intervals. These can be obtained by using a direct extension of the normal distribution approximation for the binomial distribution. Let P be the proportion of a random variable, X be the number of successes, and n be the sample size (total number of trials). Then:

$P = \frac{X}{n}$

Next, the mean and variance of this proportion P can be computed as follows:

$\mu = P$

$\sigma^{2} = \frac{P(1 - P)}{n}$

The resulting mean and variance can be used with the normal distribution to compute the desired probabilities.
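As an illustration of such a computation, the sketch below finds the probability that a sample proportion falls between 0.35 and 0.45. The sample size n = 250 and the proportion 0.40 are hypothetical assumptions.

```python
import math
from statistics import NormalDist

n = 250    # assumed sample size
P = 0.40   # assumed proportion of successes

mu = P                               # mean of the proportion
sigma = math.sqrt(P * (1 - P) / n)   # standard deviation of the proportion

# Normal approximation for the proportion X / n
dist = NormalDist(mu, sigma)
prob = dist.cdf(0.45) - dist.cdf(0.35)

print("standard deviation:", round(sigma, 4))
print("P(0.35 < proportion < 0.45):", round(prob, 4))
```

With these assumed values the standard deviation is about 0.031, so the interval covers roughly 1.6 standard deviations on either side of the mean.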

### How to use the exponential distribution?

The exponential distribution has been found to be very useful for waiting-line and queuing issues. The exponential distribution differs from the standard normal distribution in two important ways: (1) It is restricted to random variables with positive values; and (2) the distribution is not symmetric.

The exponential random variable T(t > 0), which is said to follow the exponential probability distribution, has the following probability density function:

$f(t) = \lambda e^{-\lambda t} \hspace{3mm} for \hspace{3mm} t > 0$

where λ refers to the mean number of independent arrivals per time unit, t refers to the number of time units until the next arrival, and e ≅ 2.71828. The distribution has a mean of 1/λ and a variance of 1/λ².

The cumulative distribution function is as follows:

$F(t) = 1 - e^{-\lambda t} \hspace{3mm} for \hspace{3mm} t > 0$

The probability that the time between arrivals is ta or less is computed as follows:

$P(T \leq t_{a}) = (1 - e^{-\lambda t_{a}})$

The probability that the time between arrivals is between tb and ta is computed as follows:

$P(t_{b} \leq T \leq t_{a}) = (1 - e^{-\lambda t_{a}}) - (1 - e^{-\lambda t_{b}}) = e^{-\lambda t_{b}} - e^{-\lambda t_{a}}$

To illustrate this, suppose the random variable T represents the length of time until the end of a service time or until the next arrival, beginning at an arbitrary time 0. The model assumptions are the same as those for the Poisson distribution. However, be aware that the Poisson distribution provides the probability of X successes or arrivals during a time unit. In contrast, the exponential distribution provides the probability that a success or arrival will occur during a time interval t. Now, suppose the probability density function has λ = 0.2. The probability that an arrival occurs between time 10 and time 20 can be computed as follows:

$P(10 \leq T \leq 20) = (1 - e^{-0.2(20)}) - (1 - e^{-0.2(10)}) = e^{-2} - e^{-4} = 0.1353 - 0.0183 = 0.1170$
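The arrival example with λ = 0.2 can be reproduced with a few lines of code. The function name `exp_cdf` is an illustrative choice.

```python
import math

def exp_cdf(t: float, lam: float) -> float:
    """F(t) = 1 - e^(-lam * t) for t > 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0

lam = 0.2  # mean number of arrivals per time unit, as in the example

# Probability that the next arrival occurs between time 10 and time 20.
p_between = exp_cdf(20, lam) - exp_cdf(10, lam)
print(round(p_between, 4))  # 0.117

# Mean and variance of the time until the next arrival.
mean_time = 1 / lam
var_time = 1 / lam**2
print(mean_time, var_time)
```

The result matches the difference e^(-2) - e^(-4) from the cumulative distribution function.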

### How to model continuous jointly distributed random variables?

In the previous chapter, the concept of jointly distributed variables has been introduced for discrete random variables. In this chapter, we show that many of these concepts and results also apply to continuous random variables. Jointly distributed random variables are very common in economics and business. For instance, the market values of various stock prices are regularly modeled as joint random variables.

Let X1, X2, ..., XK be continuous random variables. Their joint cumulative distribution F(x1, x2, ..., xK) refers to the probability that simultaneously X1 is less than x1, X2 is less than x2, and so forth. In formula, that is: F(x1, x2, ..., xK) = P(X1 < x1 ∩ X2 < x2 ∩ ... ∩ XK < xK). Further, the cumulative distribution functions F(x1), F(x2), ..., F(xK) of the individual random variables are called their marginal distributions. For any value of i, F(xi) is the probability that the random variable Xi does not exceed the specific value xi. Lastly, the random variables are independent if and only if F(x1, x2, ..., xK) = F(x1)F(x2) ... F(xK). Be aware that the notion of independence is the same as for discrete variables. Independence of a set of random variables implies that the probability distribution of any one of these variables is unaffected by the values taken by the others. For example, the assertion that consecutive daily changes in the price of a share of common stock are independent of one another implies that information about past price changes is of no value in assessing what is likely to happen the next day.

Similar to the case of discrete random variables, we have the concept of covariance, which is used to assess linear relationships between pairs of random variables. In addition, the same concept of correlation can be used to assess the strength (and direction) of the relationship between two continuous random variables.

In the previous chapter, we already presented the means and variances for sums and differences of discrete random variables. Here, the same applies to continuous random variables, because results are established using expectations and, therefore, are not affected by the condition of being discrete or continuous.

Lastly, recall that we developed in the previous chapter the mean and variance for linear combinations of discrete random variables. These results also apply for continuous random variables. Again, this is the case because their development is based on operations with expected values and, thus, does not depend on particular probability distributions. These linear combinations are commonly used for investment portfolios. Recall that the risk of an investment is directly related to the variance of the investment value. Be aware that, if the values of the two stock prices are positively correlated, the resulting portfolio will have a larger variance and a higher risk. Yet, if the two stock prices are negatively correlated, the resulting portfolio will have a smaller variance and, thus, a lower risk. This phenomenon is often referred to as hedging.

## How to obtain a proper sample from a population? - Chapter 6

The remainder of this book focuses on various procedures for using statistical sample data to make inferences about statistical populations. However, before being able to conduct these procedures, we first need to properly obtain a sample from the population. This process is also called sampling, and will be the focus of the present chapter.

### What is a simple random sample?

A simple random sample, also simply known as random sample, is chosen by a process that selects a sample of n objects from a population in such a way that each member of the population has the same probability of being selected. Random samples are the ideal; they provide insurance against personal biases that may influence the selection process.

### What are the three advantages of a simple random sample?

Generally, greater accuracy is obtained by using a random sample of the population rather than spending the resources to measure every item. There are three reasons for this. First, it is often very difficult to obtain and measure every item in a population, and, even if it were possible, the cost would be extremely high for a large population. Second, properly selected samples can be used to obtain measured estimates of population characteristics that are quite close to the actual population values. Third, by using the probability distribution of sample characteristics, we can determine the error that is associated with our estimates of population characteristics.

### How can you make inferences about the population (mean)?

To make inferences about the population, we need to know the sampling distribution of the observations and the computed sample statistics. The sampling distribution of the sample mean is the probability distribution of the sample means obtained from all possible samples of the same number of observations drawn from the population. Using this sampling distribution allows us to make inferences about the population mean.

Suppose we have the random variable X. At this point, we cannot determine the shape of the sampling distribution, but we can, however, determine the mean and variance of the sampling distribution. Note that the mean of the sampling distribution of the sample means is the population mean. Let the random variables X1, X2, ..., Xn denote a random sample from a population. The sample mean value of these random variables is obtained as follows:

$\bar{X} = \frac{1}{n} \sum^{n}_{i = 1} X_{i}$

Note that the mean of the sampling distribution is equal to the expected value of the sampling distribution. That is:

$E[\bar{X}] = \mu$

After establishing that the distribution of the sample means is centered around the population mean, we want to determine the variance of the distribution of the sample means. If the population is very large in comparison to the sample size, then the individual sample observations are independent random variables with the same distribution, and the variance of the sample mean is σ²/n. On the other hand, if the sample size n is not a small fraction of the population size N, then the individual sample members are not distributed independently of one another, and it can be shown that the variance of the sample means is as follows:

$Var(\bar{X}) = \frac{\sigma^{2}}{n} * \frac{N - n}{N - 1}$

The term (N - n)/(N - 1) is also known as the finite population correction factor. This term is included for completeness, because almost all real sampling studies use large populations. We now have developed the expressions for the mean and variance of the sampling distribution of the mean of X. Often, the mean and variance define the sampling distribution.

Lastly, if the parent population distribution is normally distributed, and therefore, the sampling distribution of the sample means is also normally distributed, the random variable Z can be obtained as follows:

$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$

which has a standard normal distribution with mean 0 and variance 1.

Often, we would like to know the range within which sample means are likely to occur. To do so, we can use acceptance intervals. An acceptance interval is an interval within which a sample mean has a high probability of occurring, given that we know the population mean and variance. If the sample mean appears to be within that interval, then we can accept the conclusion that the random sample came from the population with the known mean and variance. Hence, acceptance intervals provide an operating rule for process-monitoring applications. These acceptance intervals are based on the mean and variance and use the normal distribution. Thus, assuming that we know the population mean and variance, denoted respectively by μ and σ², we can construct a symmetric acceptance interval as follows:

$\mu \pm z_{\alpha/2}\sigma_{\bar{X}}$

Typically, α is very small (i.e., α < .01). Often, in applications, a small variance is desired. If the sample mean is outside the acceptance interval, this indicates that the population mean may not be μ. In a typical project, engineers will adjust the process so that the variance is small. Once the process has been adjusted so that the variance is small, an acceptance interval for the sample mean (which is called a control interval) is established in the form of a control chart. If the sample mean then is within the control interval, we can conclude that the process is operating properly and that no further action is necessary.
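A symmetric acceptance interval for the sample mean, μ ± z(α/2) times the standard error σ/√n, can be computed as in the sketch below. The process mean, standard deviation, sample size, and α are all hypothetical, and `within_interval` is an illustrative helper.

```python
import math
from statistics import NormalDist

mu, sigma, n = 50.0, 4.0, 16  # assumed population mean, standard deviation, sample size
alpha = 0.005                 # assumed small probability of falling outside the interval

# z value that leaves alpha / 2 in the upper tail of the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the sample mean (large population assumed).
se = sigma / math.sqrt(n)

lower, upper = mu - z * se, mu + z * se

def within_interval(sample_mean: float) -> bool:
    """True if the sample mean lies inside the acceptance interval."""
    return lower <= sample_mean <= upper

print("z:", round(z, 4))
print("interval:", round(lower, 3), round(upper, 3))
print(within_interval(51.0), within_interval(53.0))
```

A sample mean of 51.0 falls inside the interval, so no action would be needed, whereas 53.0 falls outside and would suggest that the process mean may have shifted.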

### What is the central limit theorem?

In the previous section, it was already indicated that the sample mean for a random sample of size n drawn from a population with a normal distribution with mean μ and variance σ² is also normally distributed, with mean μ and variance σ²/n. The central limit theorem shows that, if the sample size is large enough, the mean of a random sample drawn from a population with any probability distribution will be approximately normally distributed with mean μ and variance σ²/n. This is an important result, which enables us to use the normal distribution to compute probabilities for sample means that are obtained from many different populations. In applied statistics the probability distribution for the population is often unknown, and in particular there is no way to be certain that the underlying distribution is normal. The central limit theorem provides a convenient way to model these situations and gives a good approximation of the true distribution.

A concept that is closely related to the central limit theorem is that of the law of large numbers. This law states that, given a random sample of size n from a population, the sample mean will approach the population mean as the sample size n becomes large, regardless of the underlying probability distribution.
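The central limit theorem and the law of large numbers can be checked informally with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population (μ = 1, σ² = 1), the sample size n = 50, the number of repetitions, and the function name are all choices made here, not taken from the book.

```python
import random
from statistics import mean, pvariance


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population; return the sample means."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]


# Exponential(1) population: mu = 1, sigma^2 = 1 (strongly skewed, clearly not normal)
means = simulate_sample_means(n=50, reps=5000)
print(mean(means))       # should be close to mu = 1
print(pvariance(means))  # should be close to sigma^2 / n = 0.02
```

Plotting a histogram of `means` would show a roughly bell-shaped distribution, even though the underlying population is skewed.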

### How to use the sample proportion to obtain inferences about the population proportion?

In chapter 4, we introduced the binomial distribution as the sum of n independent Bernoulli random variables, each with probability of success denoted by P. To characterize the distribution, we need a value of P. In this section, we therefore indicate how we can use the sample proportion to obtain inferences about the population proportion.

Let X be the number of successes in a binomial sample with n observations. Further, let P be the parameter: the proportion of the population members that have the characteristics of interest. The sample proportion is defined as follows:

$\hat{p} = \frac{X}{n}$

That is, p-hat is the mean of a set of independent random variables. The results we developed in the previous sections for sample means apply to this statistic. In addition, the central limit theorem can be used to argue that the probability distribution for p-hat can be modeled as a normally distributed random variable. The sample proportion of successes approaches P as the sample size increases. Hence, we can make inferences about the population proportion using this sample proportion, and the sample proportion becomes more accurate as the sample size increases. However, the difference between the expected number of sample successes (the sample size multiplied by P) and the observed number of successes in the sample might actually increase as the sample size increases.

The sampling distribution of p-hat has mean P. In formula that is:

$E[\hat{p}] = P$

and standard deviation:

$\sigma_{\hat{p}} = \sqrt{ \frac{P(1 - P)}{n} }$

And, if the sample size is large enough, the random variable:

$Z = \frac{ \hat{p} - P }{\sigma_{\hat{p}}}$

is approximately normally distributed. This approximation is good if nP(1 - P) > 5.

As before, we can see that the standard error of the sample proportion (p-hat) decreases as the sample size increases, hence the distribution becomes more concentrated. This is to be expected, because the sample proportion is a sample mean. As the sample size becomes larger, our inferences about the population parameter improve. From the central limit theorem we know that the binomial distribution can be approximated by a normal distribution with corresponding mean and variance. That result also applies to (sample) proportions.
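The normal approximation for the sample proportion can be sketched in a few lines of Python. The function names and the example values (P = 0.4, n = 100, threshold 0.5) are hypothetical and serve only to show how the standard error and the Z statistic above are used.

```python
from math import sqrt
from statistics import NormalDist


def proportion_standard_error(P, n):
    """Standard deviation of the sample proportion: sqrt(P(1 - P) / n)."""
    return sqrt(P * (1 - P) / n)


def prob_sample_proportion_exceeds(threshold, P, n):
    """Normal approximation to P(p_hat > threshold); requires nP(1 - P) > 5."""
    if n * P * (1 - P) <= 5:
        raise ValueError("normal approximation not recommended: nP(1 - P) <= 5")
    z = (threshold - P) / proportion_standard_error(P, n)
    return 1 - NormalDist().cdf(z)


# Hypothetical example: P = 0.4, n = 100, chance that p_hat exceeds 0.5
prob = prob_sample_proportion_exceeds(0.5, P=0.4, n=100)
print(round(prob, 4))
```

Increasing `n` in this sketch shrinks the standard error, which is the concentration effect described above.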

### How to obtain sampling distributions for sample variances?

Now that we have developed sampling distributions for sample means and proportions, it is time to consider sampling distributions for sample variances. Variances are important in many business and economics applications. Often, emphasis in industry is on producing products that satisfy customer quality standards. In doing so, there is a need to measure and reduce population variance. The wider the range of outcomes (variance), the more likely it is that individual products perform below an acceptable standard. Hence, there is a desire to obtain a low variance.

Let x1, x2, ..., xn be a random sample of observations from a population. The quantity

$s^{2} = \frac{1}{n - 1} \sum^{n}_{i = 1} (x_{i} - \bar{x})^{2}$

is called the sample variance. The square root of the sample variance, denoted by s, is called the sample standard deviation. Given a specific random sample, we can compute the sample variance. The sample variance will be different for each random sample, because of differences in sample observations.

When the actual sample size n is a small proportion of the population size N, then: E[s²] = σ². The conclusion that the expected value of the sample variance is equal to the population variance is quite general. Yet for statistical inference, we would like to know more about the sampling distribution. If we can assume that the underlying population distribution is normal, then it can be shown that the sample variance and the population variance are related through a probability distribution known as the chi-square distribution. That is, given a random sample of n observations from a normally distributed population with population variance σ² and resulting sample variance s², it can be shown that:

$\chi^{2}_{(n-1)} = \frac{(n - 1)s^{2}}{\sigma^{2}} = \frac{\sum^{n}_{i = 1} (x_{i} - \bar{x})^{2} }{\sigma^{2}}$

has a chi-square distribution (χ²) with n - 1 degrees of freedom. The distribution is defined only for positive values, because variances cannot be negative. We can characterize a particular member of the family of chi-square distributions by a single parameter referred to as the degrees of freedom, denoted by the symbol v. A chi-square distribution with v degrees of freedom is denoted by $\chi^{2}_{v}$. The mean and variance of this distribution are equal to, respectively, the number of degrees of freedom (v) and twice the number of degrees of freedom (2v). In formula, that is:

$E[\chi^{2}_{v}] = v \hspace{3mm} \text{and} \hspace{3mm} Var(\chi^{2}_{v}) = 2v$

Using these results for the mean and variance of the chi-square distribution, we find that:

$E[s^{2}] = \sigma^{2}$

Further, the variance of the sampling distribution of s² depends on the underlying population distribution. If that population distribution is normal, then

$Var(s^{2}) = \frac{2 \sigma^{4}}{(n - 1)}$
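The two results E[s²] = σ² and Var(s²) = 2σ⁴/(n - 1) can be checked informally by simulation. The following sketch assumes a normal population with σ = 2 and samples of size n = 10; these values, the seed, and the helper names are illustrative choices made here and not part of the book.

```python
import random


def sample_variance(xs):
    """Sample variance s^2 with the n - 1 divisor."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate_sample_variances(n, sigma, reps, seed=1):
    """Return `reps` sample variances computed from normal samples of size n."""
    rng = random.Random(seed)
    return [sample_variance([rng.gauss(0, sigma) for _ in range(n)])
            for _ in range(reps)]


s2_values = simulate_sample_variances(n=10, sigma=2.0, reps=20000)
mean_s2 = sum(s2_values) / len(s2_values)
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / len(s2_values)
print(mean_s2)  # should be close to sigma^2 = 4
print(var_s2)   # should be close to 2 * sigma^4 / (n - 1) = 32 / 9
```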


## How to obtain estimates for a single population? - Chapter 7

### What is the difference between an estimator and an estimate?

To make inferences about the population, we need sample statistics. Here, a distinction has to be made between the terms estimator and estimate. An estimator of a population parameter is a random variable that depends on the sample information. The value of an estimator provides an approximation of the unknown parameter. An estimate is a specific value of that random variable. In other words, an estimator is a function of a random variable and an estimate is a single number. It is a distinction between a process (estimator) and the result of that process (estimate).

For considering the estimation of an unknown parameter, there are two possibilities. First, a single number could be computed from the sample as most representative of the unknown population parameter. This single number is also known as the point estimate. Note that the function corresponding to this is called the point estimator. Be aware that there is no single mechanism for determining a uniquely "best" point estimator in all circumstances. Instead, a set of criteria is available under which particular estimators can be evaluated. Second, a confidence interval can be obtained which provides some degree of confidence that the parameter falls within a specified range.

### Which two properties should be taken into account when searching for an estimator of a population parameter?

When looking for an estimator of a population parameter, two properties should be taken into account.

#### 1. Unbiasedness

The first property that an estimator should possess is unbiasedness. A point estimator is said to be an unbiased estimator of a population parameter if its expected value is equal to that of the parameter, that is, if:

$E[\hat{\theta}] = \theta$

then the point estimator (theta hat) is an unbiased estimator of the population parameter (theta). Note that unbiasedness does not mean that a particular (single) value of theta hat must be exactly the correct value of theta. Instead, an unbiased estimator has the capability of estimating the population parameter correctly on average. That is, averaged over many repeated samples, the point estimates equal the true value of the parameter.

From this, it follows that the bias in the point estimator is defined as the difference between its mean and the population parameter. That is:

$bias(\hat{\theta}) = E(\hat{\theta}) - \theta$

Note that the bias of an unbiased estimator is always zero.

#### 2. Most efficient

Unbiasedness is not the only desired property of an estimator. The second property relates to efficiency. That is, if there are several unbiased estimators of a population parameter, then the unbiased estimator with the smallest variance is said to be the most efficient estimator. This is also called the minimum variance unbiased estimator. Suppose there are two unbiased estimators of θ, denoted by $\hat{\theta}_{1}$ and $\hat{\theta}_{2}$, both based on the same number of sample observations. Then, $\hat{\theta}_{1}$ is said to be more efficient than $\hat{\theta}_{2}$ if the variance of the first is smaller than the variance of the second estimator. Moreover, the relative efficiency of $\hat{\theta}_{1}$ with respect to $\hat{\theta}_{2}$ is the ratio of their variances, that is:

$relative \hspace{1mm} efficiency = \frac{Var(\hat{\theta}_{2})}{Var(\hat{\theta}_{1})}$
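To make relative efficiency concrete, the sketch below estimates it by simulation for two estimators of the mean of a normal population: the sample mean (in the role of the first estimator) and the sample median (the second). The sample size n = 25, the number of repetitions, and the function name are assumptions made for this illustration only.

```python
import random
from statistics import median, pvariance


def relative_efficiency_mean_vs_median(n, reps, seed=2):
    """Estimate Var(median) / Var(mean) for samples from a standard normal population."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(median(sample))
    return pvariance(medians) / pvariance(means)


ratio = relative_efficiency_mean_vs_median(n=25, reps=4000)
print(ratio)  # a value above 1 means the sample mean has the smaller variance
```

For a normal population the ratio comes out above 1, in line with the statement that the sample mean is the more efficient estimator under normality.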

When considering which measure is the most efficient estimator of the population mean, we emphasize the importance of using a normal probability plot. A normal probability plot is used to determine if there is any evidence of nonnormality. That is, if the population deviates from a normal distribution, the sample mean may not be the most efficient estimator of the population mean. Especially when outliers heavily affect the population distribution, the sample mean is less efficient than other measures, such as the median. Properties of selected point estimators are summarized in Table 1.

| Population parameter | Point estimator | Properties |
|---|---|---|
| Mean (μ) | $\bar{x}$ | Unbiased, most efficient (when assuming normality) |
| Mean (μ) | Median | Unbiased (when assuming normality) but not most efficient |
| Proportion (P) | $\hat{p}$ | Unbiased, most efficient |
| Variance (σ²) | $s^{2}$ | Unbiased, most efficient (when assuming normality) |

A problem that frequently occurs in practice is how to choose an appropriate point estimator for a population parameter. This appears to be a difficult issue. Although it is attractive to choose the most efficient of all unbiased estimators, there are estimation problems for which no unbiased estimator is very satisfactory, and there may be cases in which it is not possible to find a minimum variance unbiased estimator. For these situations, selecting the best point estimator is not straightforward and requires substantial mathematical rigor, which goes beyond the scope of this book. Hence, for you, it is sufficient to know that often the best point estimator can be selected by choosing the most efficient of all unbiased estimators.

### How to estimate the confidence interval for the mean of a normal distribution?

#### Population variance known

First, consider the situation in which we assume that a random sample is taken from a population that is normally distributed with an unknown mean and a known variance. Note that this scenario may seem unrealistic, because one rarely knows the population variance.

A confidence interval estimator for a population parameter is a rule for determining an interval that is likely to include the parameter (based on sample information). The corresponding estimate is called a confidence interval estimate. A 95% confidence interval can be interpreted as follows: "If the population is repeatedly sampled and intervals are calculated accordingly, then in the long run, 95% of the intervals would contain the true value of the unknown parameter". The quantity 100(1 - α)% is called the confidence level of the interval. In the example mentioned here, the confidence level is 95%. Note that this is also the most commonly used confidence level in many scientific disciplines.

Suppose a random sample of n observations is drawn from a normal distribution with mean μ and variance σ². If the sample mean is x̅, then a 100(1 - α)% confidence interval for the population mean with known variance is given by:

$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \hspace{3mm} \text{or} \hspace{3mm} \bar{x} \pm ME$

with ME being the margin of error (also known as sampling error) given by:

$z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$

The width then is equal to twice the margin of error, that is: w = 2(ME).

In Table 2, the most commonly used confidence levels and their corresponding values of $z_{\alpha/2}$ are given. This quantity $z_{\alpha/2}$ is also referred to as the reliability factor. It is useful to know the numbers provided in this table by heart.

| Confidence level | 90% | 95% | 98% | 99% |
|---|---|---|---|---|
| α | 0.10 | 0.05 | 0.02 | 0.01 |
| $z_{\alpha/2}$ | 1.645 | 1.96 | 2.33 | 2.58 |
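The reliability factors in Table 2 can be reproduced with Python's standard library, and the same helper gives the confidence interval for the mean with known variance. This is a minimal sketch; the function names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def reliability_factor(confidence):
    """Return z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean of a normal population with known sigma."""
    me = reliability_factor(confidence) * sigma / sqrt(n)  # margin of error
    return x_bar - me, x_bar + me


for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(reliability_factor(level), 3))
```

The printed values agree with Table 2 after rounding to two decimals (three for the 90% level).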

#### Population variance unknown

Second, consider the situation in which we assume that a random sample is taken from a population that is normally distributed with an unknown mean and an unknown variance. This is a more realistic scenario, because, often in practice, we do not know precisely what the population variance is. When the population variance is unknown, we use Student's t distribution rather than the z distribution. Hence, rather than computing Z, we use the following equation:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

As said above, this random variable does not follow a standard normal distribution. Instead, its distribution is a member of a family of distributions called Student's t. Any specific member of this family of distributions is characterized by its number of degrees of freedom, which is associated with the computation of the standard error. The degrees of freedom are denoted by the symbol v. The shape of the Student's t distribution is quite similar to that of the normal distribution. Both distributions have a mean equal to zero. Both probability density functions are symmetric around their mean. However, they differ in dispersion: the density function of the Student's t distribution has a wider dispersion, which is reflected by a larger variance, than the standard normal distribution. This wider dispersion is the result of the extra uncertainty that is caused by replacing the known population standard deviation by its sample estimator. Note that, as the number of degrees of freedom increases, the Student's t distribution becomes increasingly similar to the standard normal distribution.

For each random variable that follows the Student's t distribution, we can define the reliability factor as follows:

$P(t_{v} > t_{v,\alpha/2} ) = \alpha/2$

Similar to the z distribution, we can compute the confidence interval for the population mean with unknown variance, as follows:

$\bar{x} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}$

with again the latter part being the margin of error, that is:

$ME = t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}$
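A minimal sketch of the interval for the mean with unknown variance is given below. Python's standard library has no Student's t quantile function, so the sketch assumes the reliability factor is looked up in a t table and passed in as `t_crit`. The sample values are made up for illustration; t with 4 degrees of freedom and α/2 = 0.025 equals 2.776 in standard tables.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(sample, t_crit):
    """CI for the mean with unknown variance; t_crit is t_{n-1, alpha/2} from a t table."""
    n = len(sample)
    x_bar = mean(sample)
    me = t_crit * stdev(sample) / sqrt(n)  # stdev uses the n - 1 divisor
    return x_bar - me, x_bar + me


# Made-up sample of n = 5 observations, 95% interval with t_{4, 0.025} = 2.776
low, high = t_confidence_interval([10, 12, 9, 11, 13], t_crit=2.776)
print(round(low, 3), round(high, 3))
```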

### How can the margin of error be reduced?

There are three factors that affect the margin of error:

1. The population standard deviation
2. The sample size n
3. The confidence level

These three factors can, thus, be manipulated to reduce the margin of error. First, keeping all other factors constant, the lower the population standard deviation, the smaller the margin of error. However, sometimes, the population standard deviation cannot be reduced. Second, the larger the sample size, the smaller the margin of error. The more information obtained from a population, the more accurate our inference is about the population parameter of interest. Third, keeping all other factors constant, the lower the confidence level (1 - α), the smaller the margin of error. Note, however, that this implies a reduction in the probability that the interval includes the value of the true population parameter. In other words, decreasing the confidence level reduces the margin of error, yet simultaneously reduces the probability that the interval includes the value of the true population parameter.
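The effect of each of the three factors can be seen by varying one input at a time in the known-variance margin of error. The numbers below (σ = 10, n = 100, 95% confidence) are an arbitrary baseline chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, confidence):
    """ME = z_{alpha/2} * sigma / sqrt(n) for a known population standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)


base = margin_of_error(sigma=10, n=100, confidence=0.95)
smaller_sigma = margin_of_error(sigma=5, n=100, confidence=0.95)     # halves the ME
larger_n = margin_of_error(sigma=10, n=400, confidence=0.95)         # also halves the ME
lower_confidence = margin_of_error(sigma=10, n=100, confidence=0.90)
print(base, smaller_sigma, larger_n, lower_confidence)
```

Note that the sample size has to be quadrupled to halve the margin of error, because n enters through its square root.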

### How to estimate the confidence interval for population proportions?

What percent of Dutch students is expected to pursue a doctoral degree? What percent of students is expected to pass the next statistics exam? What proportion of adults is married? In each of these scenarios, the proportion of population members possessing a particular characteristic, is of interest. In this section, we focus on the establishment of confidence intervals for the population proportion.

For large sample sizes, that is if nP(1 - P) > 5, a 100(1 - α)% confidence interval for the population proportion is provided by:

$\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1 - \hat{p})}{n} }$

or, equivalently:

$\hat{p} \pm ME$

with ME being the margin of error, given by:

$ME = z_{\alpha/2} \sqrt{ \frac{\hat{p}(1 - \hat{p})}{n} }$

Similar to the margin of error for the mean, when all other things are kept equal, the larger the sample size (n), the narrower the confidence interval. This shows the increasing precision of the information about the parameter obtained as the sample size becomes larger.
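The interval for a population proportion can be sketched as follows. The function name and the example counts (120 successes out of 400) are hypothetical and only illustrate the formula above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - me, p_hat + me


# Hypothetical survey: 120 of 400 respondents have the characteristic of interest
low, high = proportion_confidence_interval(120, 400)
print(round(low, 4), round(high, 4))
```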

### How to obtain a confidence interval estimation for the variance of a normal distribution?

If the population is normally distributed (and this has been verified), then the random variable

$\chi^{2}_{n-1} = \frac{(n - 1) s^{2}}{\sigma^{2}}$

follows a chi-square distribution with v = n - 1 degrees of freedom. For example, suppose that we are interested in the number that is exceeded with probability 0.05 by a chi-square random variable with 6 degrees of freedom. That is, we are interested in the following:

$P(\chi^{2}_{6} > \chi^{2}_{6, 0.05} ) = 0.05$

Then, using Appendix Table 7, we can find that:

$\chi^{2}_{6, 0.05} = 12.592$

The confidence interval for the population variance, then, is given by:

$LCL = \frac{(n - 1) s^{2}}{\chi^{2}_{n - 1,\alpha/2} } \hspace{3mm} and \hspace{3mm} UCL = \frac{(n - 1) s^{2}}{\chi^{2}_{n - 1,1 - \alpha/2} }$

in which LCL denotes the lower limit and UCL denotes the upper limit of the confidence interval. Be aware that this confidence interval differs from the usual form (which is: sample point estimate +/- the margin of error). Lastly, be aware that it is dangerous to follow this procedure when the population distribution deviates from being normally distributed. The validity of the interval estimator for the population variance depends heavily on the assumption of normality, even more than does that of the interval estimator for the population mean.
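The limits LCL and UCL can be computed directly once the two chi-square cutoff values have been read from a table. The sketch below takes them as arguments, because the standard library has no chi-square quantile function. The sample variance s² = 4 and n = 7 are made up; the cutoffs for 6 degrees of freedom are 12.592 (upper 5%, as in the text) and 1.635 (upper 95%, from a standard table), which together give a 90% interval.

```python
def variance_confidence_interval(s2, n, chi2_upper, chi2_lower):
    """Return (LCL, UCL) for the population variance of a normal population.

    chi2_upper = chi^2_{n-1, alpha/2} and chi2_lower = chi^2_{n-1, 1-alpha/2},
    both taken from a chi-square table.
    """
    lcl = (n - 1) * s2 / chi2_upper
    ucl = (n - 1) * s2 / chi2_lower
    return lcl, ucl


# Hypothetical sample: n = 7 (6 degrees of freedom), s^2 = 4, 90% interval
lcl, ucl = variance_confidence_interval(4.0, 7, chi2_upper=12.592, chi2_lower=1.635)
print(round(lcl, 3), round(ucl, 3))
```

The interval is not symmetric around s², which is why it cannot be written as a point estimate plus or minus a margin of error.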

### How to estimate confidence intervals for finite populations?

In this section, we consider how to estimate confidence intervals for finite populations. Here, the number of sample members is not a negligible proportion of the number of population members. Instead, the sample size is relatively large in comparison to the population size. More precisely, this is the case if n > 0.05N, that is, if the sample size is more than 5% of the population size. We further assume that the sample is sufficiently large for the central limit theorem to apply. In addition, the finite population correction (fpc) factor, (N - n)/(N - 1), should be used.

#### Estimation of the population mean

The sample mean is an unbiased estimator of the population mean (μ). The point estimate of this mean is:

$\bar{x} = \frac{1}{n} \sum^{n}_{i = 1} x_{i}$

The unbiased point estimate for the variance of the sample mean is given by:

$\hat{\sigma}^{2}_{\bar{x}} = \frac{s^{2}}{n} \left( \frac{N - n}{N - 1} \right)$

Lastly, a 100(1 - α)% confidence interval for the population mean is given by:

$\bar{x} \pm t_{n - 1,\alpha/2} \hat{\sigma}_{\bar{x}}$

#### Estimation of the population total

The population total for a finite population, denoted by Nμ, can be estimated by the point estimate Nx̄. The estimated standard error of this estimator is given by:

$N\hat{\sigma}_{\bar{x}} = \frac{Ns}{\sqrt{n}} \sqrt{ (\frac{N - n}{N - 1}) }$

A 100(1 - α)% confidence interval for the population total, Nμ can be obtained as follows:

$N\bar{x} \pm t_{n - 1,\alpha/2} N \hat{\sigma}_{\bar{x}}$

#### Estimation of the population proportion

Lastly, the population proportion for finite populations can be estimated. The sample proportion (p hat) is an unbiased estimator of the population proportion. Next, the point estimate from an unbiased estimation procedure for the variance of the sample proportion is given by:

$\hat{\sigma}^{2}_{\hat{p}} = \frac{\hat{p} (1 - \hat{p})}{n - 1} \left( \frac{N - n}{N - 1} \right)$

Provided that the sample size is large, the 100(1 - α)% confidence interval for the population proportion can be computed as follows:

$\hat{p} \pm z_{\alpha/2} \hat{\sigma}_{\hat{p}}$

where the margin of error (ME) is given by:

$z_{\alpha/2} \hat{\sigma}_{\hat{p}}$
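The finite-population formulas of this section (mean, total, and proportion) share the same correction factor, which the following sketch makes explicit. All function names and the example numbers are assumptions made for illustration; the critical values are passed in by the caller (for instance t with 99 degrees of freedom and α/2 = 0.025 is approximately 1.984).

```python
from math import sqrt


def fpc(N, n):
    """Finite population correction factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)


def finite_mean_interval(x_bar, s, n, N, t_crit):
    """CI for the population mean using the fpc-adjusted standard error."""
    se = sqrt(s ** 2 / n * fpc(N, n))
    return x_bar - t_crit * se, x_bar + t_crit * se


def finite_total_interval(x_bar, s, n, N, t_crit):
    """CI for the population total N * mu: N times the limits for the mean."""
    low, high = finite_mean_interval(x_bar, s, n, N, t_crit)
    return N * low, N * high


def finite_proportion_interval(p_hat, n, N, z):
    """Large-sample CI for the population proportion in a finite population."""
    se = sqrt(p_hat * (1 - p_hat) / (n - 1) * fpc(N, n))
    return p_hat - z * se, p_hat + z * se


# Hypothetical example: N = 1000, n = 100, sample mean 20, sample std. dev. 5
mean_low, mean_high = finite_mean_interval(20, 5, 100, 1000, t_crit=1.984)
print(round(mean_low, 3), round(mean_high, 3))
print(finite_total_interval(20, 5, 100, 1000, t_crit=1.984))
```

When the whole population is sampled (n = N), the correction factor is zero and each interval collapses to the point estimate, as one would expect.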

### How to choose an appropriate sample size for large populations?

So far, we have developed confidence intervals for population parameters on the basis of information provided by a sample. Following this process, we may believe that the resulting confidence interval is too wide, hence yielding an undesirable amount of uncertainty about the parameter of interest. One (convenient) way to obtain a narrower confidence interval with a fixed confidence level is by taking a larger sample. In this section, we consider how an appropriate sample size can be selected for two interval estimation problems. These equations are derived from basic algebra (simply transforming the right-hand and left-hand side of the equation).

#### Sample size for the population mean

First, the sample size for the mean of a normally distributed population with known population variance is:

$n = \frac{z^{2}_{\alpha/2}\sigma^{2}}{ME^{2}}$

Note that, if n is not an integer, the resulting value should be rounded upward to the next whole number in order to guarantee that the confidence interval does not exceed the required width.

#### Sample size for the population proportion

Second, the required sample size for the population proportion can be computed as follows:

$n = \frac{0.25 (z_{\alpha/2})^{2}}{(ME)^{2}}$
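Both sample-size formulas are easy to evaluate in code. In the sketch below, the factor 0.25 is the largest value that P(1 - P) can take (reached at P = 0.5), so the proportion formula gives a sample size that is sufficient whatever the true P is. The function names and example inputs are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def z_value(confidence):
    """Reliability factor z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_mean(sigma, me, confidence=0.95):
    """n = z^2 sigma^2 / ME^2, rounded up to the next whole number."""
    return ceil((z_value(confidence) * sigma / me) ** 2)


def sample_size_proportion(me, confidence=0.95):
    """n = 0.25 z^2 / ME^2, rounded up to the next whole number."""
    return ceil(0.25 * z_value(confidence) ** 2 / me ** 2)


print(sample_size_mean(sigma=10, me=2))   # hypothetical: sigma = 10, ME = 2
print(sample_size_proportion(me=0.03))    # hypothetical: ME of 3 percentage points
```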

### How to choose an appropriate sample size for finite populations?

Often, the resources that are available to the investigator (in terms of time and money), place constraints on what can be achieved. Therefore, in many real-life studies, we are facing a finite population. In this section, we extend the issue of selecting an appropriate sample size to the situation of finite populations. Note that, to compensate for nonresponse or missing data (which is very likely in real experiments), practitioners may add a certain percent (for instance 10%) to the sample size n determined by the equations in this section.

#### Sample size for population mean

The required sample size to estimate the population mean through simple random sampling is:

$n = \frac{N \sigma^{2}}{(N - 1) \sigma^{2}_{\bar{x}} + \sigma^{2}}$

or, equivalently:

$n = \frac{n_{0} N }{n_{0} + (N - 1) }$

where n0 is equal to:

$n_{0} = \frac{z^{2}_{\alpha/2} \sigma^{2} }{ME^{2}}$

Note that it is often more convenient to directly specify the width of the confidence interval for the population mean rather than the desired variance of the sample mean ($\sigma^{2}_{\bar{x}}$). The two are easily converted into one another, since, for example, a 95% confidence interval for the population mean extends approximately $1.96\sigma_{\bar{x}}$ on each side of the sample mean. Similarly, if the object of interest is the population total, the variance of the sample estimator of this quantity is $N^{2}\sigma^{2}_{\bar{x}}$, and a 95% confidence interval for it extends approximately $1.96N\sigma_{\bar{x}}$ on each side of Nx̄.

#### Sample size for population proportion

The required sample size to estimate the population proportion (P) through simple random sampling is:

$n = \frac{NP(1 - P)}{(N - 1) \sigma^{2}_{\hat{p}} + P(1 - P) }$

The largest possible value for this expression (nmax), regardless of the value of P, is given by:

$n_{max} = \frac{0.25N}{(N - 1) \sigma^{2}_{\hat{p}} + 0.25 }$
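The finite-population sample-size formulas can be sketched as follows. The desired variance of the estimator is derived here from a target margin of error as (ME / z)², which is an assumption of this sketch consistent with the definition of n0 above. The function names and example values (N = 1000, z = 1.96) are illustrative.

```python
from math import ceil


def finite_sample_size_mean(N, sigma, me, z):
    """n = n0 * N / (n0 + N - 1) with n0 = z^2 sigma^2 / ME^2, rounded up."""
    n0 = (z * sigma / me) ** 2
    return ceil(n0 * N / (n0 + N - 1))


def finite_sample_size_proportion_max(N, me, z):
    """n_max = 0.25 N / ((N - 1) var_p + 0.25), with var_p = (ME / z)^2, rounded up."""
    var_p = (me / z) ** 2  # desired variance of the sample proportion
    return ceil(0.25 * N / ((N - 1) * var_p + 0.25))


# Hypothetical population of N = 1000 members, 95% confidence (z = 1.96)
print(finite_sample_size_mean(N=1000, sigma=10, me=2, z=1.96))
print(finite_sample_size_proportion_max(N=1000, me=0.03, z=1.96))
```

Compared with the large-population formulas, the required sample sizes are smaller, because each sampled member represents a larger share of a finite population.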


## How to estimate parameters for two populations? - Chapter 8

In the previous chapter, we discussed how to estimate parameters for one population. In this chapter, we extend those concepts to estimate certain parameters for two populations. A common application of statistics deals with the comparison of the difference between two means from normally distributed populations, or the comparison of the difference between two proportions from large populations. For example, a campaign manager for a presidential candidate may want to compare the popularity rating of this candidate in two different regions of the country. Or a chemical company receives shipments from two suppliers and wants to compare the impurity level of the two batches.

### How to develop a confidence interval of the difference between two normal population means (for dependent samples)?

In this section, we discuss how to develop a confidence interval estimation of the difference between two means for normally distributed populations. In doing so, we distinguish between the scenario of dependent samples and the scenario of independent samples.

For dependent samples, the values in one sample are influenced by the values in the other sample. There are two types of dependent samples: matched pairs or measuring the same individual or object twice (for instance before and after an intervention). The latter is also known as repeated measurements. The idea of a matched pair sample is that, apart from the factor under study, the members of these pairs resemble one another as closely as possible so that the comparison of interest can be made directly. For example, in clinical trials, one may be interested in comparing the effectiveness of two medications. Therefore, dependent samples may be selected and the members of each sample may be matched on various factors, such as age or weight.

Suppose there is a random sample of n matched pairs of observations from two normal distributions with means μx and μy. Further, let x1, x2, ..., xn denote the values of the observations from the population with mean μx and let y1, y2, ..., yn denote the matched sample values from the population with mean μy. Let $\bar{d}$ and $s_{d}$ denote the observed sample mean and standard deviation of the n differences di = xi - yi. Now, if the population distribution of the differences is assumed to be normal, then a 100(1 - α)% confidence interval for the difference between two means with dependent samples (μd = μx - μy) is given as follows:

$\bar{d} \pm t_{n-1,\alpha/2} \frac{s_{d}}{\sqrt{n}}$

or, equivalently:

$\bar{d} \pm ME$

with ME:

$ME = t_{n-1,\alpha/2} \frac{s_{d}}{\sqrt{n}}$

The standard deviation of the differences (that is: sd) is given by:

$s_{d} = \sqrt{ \frac{\sum (d_{i} - \bar{d})^{2}}{n - 1}}$

In the confidence interval above, $t_{n-1,\alpha/2}$ is the number for which

$P(t_{n-1} > t_{n-1,\alpha/2}) = \frac{\alpha}{2}$

The random variable tn-1 has a Student's t distribution with (n - 1) degrees of freedom.

An example will be used to illustrate the computations. Suppose that we conducted a clinical trial to compare the effectiveness of two drugs for lowering cholesterol levels. Let these drugs be called drug X and drug Y, respectively. Although clinical trials often are conducted with large samples involving many hundreds or even thousands of participants, we simply illustrate the procedure for dependent samples here with a very small random sample of matched pairs. The gathered data are summarized in Table 8.1.

| Pair | Drug X | Drug Y | Difference (di = xi - yi) |
|---|---|---|---|
| 1 | 29 | 26 | 3 |
| 2 | 32 | 27 | 5 |
| 3 | 31 | 28 | 3 |
| 4 | 32 | 27 | 5 |
| 5 | 30 | (missing) | (missing) |
| 6 | 32 | 30 | 2 |
| 7 | 29 | 26 | 3 |
| 8 | 31 | 33 | -2 |
| 9 | 30 | 36 | -6 |

As you can see in Table 8.1, there are missing data (the value of drug Y is missing for pair 5). Missing data are very common in surveys, clinical trials, and other types of research. Perhaps the individual simply chose to withdraw from the study and hence did not complete the clinical trial. Perhaps the researcher made a mistake and "lost" the data. There are many possible reasons for missing data. Here, in this study of dependent samples, we decided to first delete the observation(s) with missing values. Because we are dealing with a matched pairs sample, the result is that we are left with eight pairs instead of nine pairs of observations. From the table, we can compute the sample mean and sample standard deviation of the differences:

$\bar{d} = 1.625 \hspace{3mm} \text{and} \hspace{3mm} s_{d} = 3.777$

Now, suppose we want to compute the 99% confidence interval. From the Student's t distribution table (see Appendix of the book), it follows that: $t_{n-1,\alpha/2} = t_{7,0.005} = 3.499$. Then, the confidence interval is computed as follows:

$1.625 \pm 3.499 \frac{3.777}{\sqrt{8}}$

The resulting confidence interval has lower limit -3.05 and upper limit 6.30, that is: [-3.05; 6.30]. Because the confidence interval contains the value zero, we cannot conclude that one drug is more effective than the other. More precisely, there are three possibilities: (1) the difference μx - μy could be positive, which suggests that drug X is more effective; (2) the difference μx - μy could be negative, which suggests that drug Y is more effective; (3) the difference μx - μy could be zero, suggesting that drug X and drug Y are equally effective. Recall from basic statistical inference that we cannot conclude here that there is no difference; one can never accept the null hypothesis. We can only state that, based on these data, there is insufficient evidence to conclude that one drug is more effective than the other.
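The computations of this example can be reproduced with a short script. The data are the values from Table 8.1 and the reliability factor 3.499 is the table value quoted in the text; representing the missing observation as `None` is a choice made for this sketch.

```python
from math import sqrt
from statistics import mean, stdev

drug_x = [29, 32, 31, 32, 30, 32, 29, 31, 30]
drug_y = [26, 27, 28, 27, None, 30, 26, 33, 36]  # pair 5 has no value for drug Y

# Drop pairs with a missing value, then take the differences d_i = x_i - y_i
diffs = [x - y for x, y in zip(drug_x, drug_y) if y is not None]

n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)           # sample standard deviation (n - 1 divisor)
t_crit = 3.499               # t_{7, 0.005}, taken from the text
me = t_crit * s_d / sqrt(n)  # margin of error

print(n, d_bar, round(s_d, 3))
print(round(d_bar - me, 2), round(d_bar + me, 2))
```

The script gives eight usable pairs, a mean difference of 1.625, a standard deviation of about 3.777, and limits of about -3.05 and 6.30, matching the values above.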

### How to develop a confidence interval of the difference between two normal population means (for independent samples)?

In this section, we move on to the development of a confidence interval for the situation in which two samples are drawn independently from two normally distributed populations. This implies that the membership of one sample is not influenced by the membership of another sample. In doing so, three situations are considered: (1) both population variances are known; (2) both population variances are unknown but are considered to be equal; (3) both population variances are unknown and cannot be considered to be equal.

#### Scenario 1: both population variances are known

Consider the scenario where two independent samples, not necessarily of equal size, are taken from two normally distributed populations. The sample sizes are denoted by $n_{x}$ and $n_{y}$, the population means by μx and μy, and the population variances by $\sigma^{2}_{x}$ and $\sigma^{2}_{y}$. Let the respective sample means be denoted by x̅ and ȳ. Then, the 100(1 - α)% confidence interval for the difference between the two means of independent samples with known population variances is given as follows:

$(\bar{x} - \bar{y}) \pm z_{\alpha/2} \sqrt{\frac{\sigma^{2}_{x}}{n_{x}} + \frac{\sigma^{2}_{y}}{n_{y}}}$

where the part behind the plus minus sign is also referred to as the margin of error.

#### Scenario 2: both population variances are unknown, but can be considered to be equal

Common sense tells us that is reasonable that, if we do not know the population means, we most likely do not know the population variances either. Sometimes, however, we can assume that the unknown population variances are equal. They are assumed to have a common (unknown) variance, such that σ2 = σ2x = σ2y. Under these circumstances, the confidence interval for the difference between two means, from independent samples, with unknown population variances that are assumed to be equal, is given by:

$(\bar{x} - \bar{y}) \pm t_{n_{x} + n_{y} - 2, \alpha/2} \sqrt{\frac{s^{2}_{p}}{n_{x}} + \frac{s^{2}_{p}}{n_{y}}}$

where s2p is the pooled sample variance, that is given by:

$s^{2}_{p} = \frac{ (n_{x} - 1)s^{2}_{x} + (n_{y} - 1)s^{2}_{y} }{n_{x} + n_{y} - 2}$

Note that here, because the population variances are unknown, we use Student's t distribution rather than the standard normal distribution, with degrees of freedom (df) equal to: df = nx + ny - 2.
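As a minimal sketch of the pooled-variance interval, the Python snippet below takes the critical t value as an input, as if it were read from a Student's t table. The function names and the data are hypothetical; 2.101 is the tabulated value for 18 degrees of freedom and an upper-tail probability of 0.025.

```python
from math import sqrt


def pooled_variance(s2_x, s2_y, n_x, n_y):
    """Pooled sample variance s_p^2."""
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)


def ci_diff_means_pooled(x_bar, y_bar, s2_x, s2_y, n_x, n_y, t_crit):
    """Interval for mu_x - mu_y; t_crit is t_{n_x + n_y - 2, alpha/2} from a table."""
    s2_p = pooled_variance(s2_x, s2_y, n_x, n_y)
    margin = t_crit * sqrt(s2_p / n_x + s2_p / n_y)
    diff = x_bar - y_bar
    return diff - margin, diff + margin


# Hypothetical data: two samples of size 10, so df = 18.
low, high = ci_diff_means_pooled(20.0, 17.0, s2_x=4.0, s2_y=6.0,
                                 n_x=10, n_y=10, t_crit=2.101)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Passing the critical value explicitly keeps the sketch free of third-party dependencies; a statistics package could supply it instead.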

#### Scenario 3: both population variances are unknown and cannot be considered to be equal

Lastly, it may also be the case that the population variances are unknown and that these cannot be considered to be equal either. In that case, the confidence interval for the difference between the two means is given by:

$(\bar{x} - \bar{y}) \pm t_{v,\alpha/2} \sqrt{\frac{s^{2}_{x}}{n_{x}} + \frac{s^{2}_{y}}{n_{y}}}$

with, again, the part behind the plus minus sign being the margin of error. The degrees of freedom are denoted by v.
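The summary does not state how v is obtained. A commonly used choice is the Welch-Satterthwaite approximation; treating it as the intended formula is an assumption here, and the function name is hypothetical. A short Python sketch:

```python
def welch_df(s2_x, s2_y, n_x, n_y):
    """Approximate degrees of freedom v (Welch-Satterthwaite; assumed formula)."""
    a = s2_x / n_x
    b = s2_y / n_y
    return (a + b) ** 2 / (a ** 2 / (n_x - 1) + b ** 2 / (n_y - 1))


# Equal variances and equal sample sizes give v = n_x + n_y - 2.
print(welch_df(4.0, 4.0, 10, 10))
# Strongly unequal variances pull v down towards the smaller n - 1.
print(welch_df(9.0, 1.0, 10, 10))
```

In practice v is usually not an integer; a conservative convention is to round it down before looking up the t value.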

### How to develop a confidence interval of the difference between two population proportions (for large samples)?

In chapter 7, we discussed how to develop a confidence interval for a single population proportion. Here, we extend that approach to the situation of two population proportions. Often, one is interested in comparing two population proportions. For instance, one might want to compare the proportion of residents in one city who indicate that they will vote for a particular presidential candidate to the proportion of residents in another city who indicate that they will vote for the same candidate. When comparing two population proportions, a confidence interval (for large samples) can be obtained as follows:

$(\hat{p}_{x} - \hat{p}_{y}) \pm ME$

where the margin of error (ME) is as follows:

$ME = z_{\alpha/2} \sqrt{ \frac{ \hat{p}_{x} (1 - \hat{p}_{x} ) }{n_{x}} + \frac{ \hat{p}_{y} (1 - \hat{p}_{y} ) }{n_{y}} }$
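A minimal Python sketch of this large-sample interval for the difference between two proportions, using only the standard library. The function name and the sample figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_diff_proportions(p_x, p_y, n_x, n_y, alpha=0.05):
    """Large-sample interval for P_x - P_y based on sample proportions."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    me = z * sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    diff = p_x - p_y
    return diff - me, diff + me


# Hypothetical poll: 60% of 400 voters in city X, 50% of 400 voters in city Y.
low, high = ci_diff_proportions(0.60, 0.50, 400, 400)
print(f"95% CI for P_x - P_y: ({low:.4f}, {high:.4f})")
```

Because the whole interval lies above zero in this example, the data would suggest a higher proportion in city X.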

In the previous chapter, we discussed how to estimate parameters for one population. In this chapter, we extend those concepts to estimate certain parameters for two populations. A common application of statistics deals with the comparison of the difference between two means from normally distributed populations, or the comparison of the difference between two proportions from large populations. For example, a campaign manager for a presidential candidate may want to compare the popularity rating of this candidate in two different regions of the country. Or a chemical company receives shipments from two suppliers and wants to compare the impurity level of the two batches.

## How to develop hypothesis testing procedures for a single population? - Chapter 9

In this chapter, we discuss how to develop hypothesis testing procedures to test the validity of some conjecture or claim about a population by using sample data.

### What are the central concepts of hypothesis testing?

We begin this chapter by providing a general framework to test hypotheses. First, we need to define two alternatives that cover all possible outcomes: the null hypothesis and the alternative hypothesis. Hypothesis testing always starts with the null hypothesis, which is a hypothesis about the parameter of interest. This null hypothesis is maintained unless there is strong evidence against it. If we reject the null hypothesis, then the second hypothesis, called the alternative hypothesis, is accepted. Be aware that the null hypothesis can never be accepted (!); it can only be rejected or maintained. In other words: if we fail to reject the null hypothesis, then either the null hypothesis is correct, or the alternative hypothesis is correct but the test procedure is not strong enough to reject the null hypothesis.

Both the null and the alternative hypothesis might specify a single value. For instance, a null hypothesis may be: H0: μ = 100. Such a hypothesis is also called a simple hypothesis. This can be interpreted as follows: the null hypothesis is that the population parameter μ is equal to a specific value, in this case 100. For this example, a possible alternative hypothesis could be that the population mean exceeds 100, that is: H1: μ > 100. This is an example of a one-sided composite alternative hypothesis. Another possibility would be to test whether the population mean is different from 100 (regardless of whether it is higher or lower). Such a hypothesis is called a two-sided composite alternative hypothesis. In this example, that would be: H1: μ ≠ 100.

After specifying the null hypothesis and the alternative hypothesis and collecting sample data, a decision has to be made regarding the null hypothesis. We either reject the null hypothesis or fail to reject it. Again, the null hypothesis can never be accepted! For many reasons, statisticians prefer to say "we fail to reject the null hypothesis" rather than "we accept the null hypothesis". When we reject the null hypothesis while in fact the null hypothesis is true, this is called a type I error. The probability of rejecting the null hypothesis while in fact the null hypothesis is true is α. This α is also known as the significance level of a test and can be specified by the researcher beforehand. Conversely, the probability of failing to reject the null hypothesis while the null hypothesis is true (thus, making the right decision) is given by 1 - α. Further, when we fail to reject the null hypothesis while in fact the null hypothesis is false, this is called a type II error. The probability of failing to reject the null hypothesis, while in fact the null hypothesis is false, is β. Conversely, the probability of rejecting the null hypothesis while indeed the null hypothesis is false is given by 1 - β. The possible decisions regarding the null hypothesis and the true state of nature are summarized in Table 9.1.

| Decision on H0 | H0 is true | H0 is false |
|---|---|---|
| Fail to reject H0 | Correct decision (1 - α) | Type II error (β) |
| Reject H0 | Type I error (α) | Correct decision (1 - β) |

Finally, another important concept that is used in hypothesis testing is the power of a test. The power is the probability of rejecting H0 when H1 is true. This corresponds to the bottom right cell of Table 9.1, so the power is equal to 1 - β. Note that the power is computed for a particular value of μ; typically, it is different for every value of μ.

### How to test the mean of a normal distribution with population variance known?

In this section, we discuss how to test hypotheses regarding the mean of a normal distribution, when the population variance is known. Assume we want to know whether university students on average have a higher IQ than the mean in the population, that is 100. In this case, we would state our null hypothesis as: H0: μ = μ0 = 100. The alternative hypothesis is: H1: μ > μ0 = 100. The next step is to specify the significance level (α). To test the population mean, we use the sample mean x̅. If the sample mean is substantially larger than μ0 = 100, then we reject the null hypothesis. To obtain an appropriate decision, we use the fact that the standardized random variable

$Z = \frac{\bar{X} - \mu_{0}}{\sigma/\sqrt{n}}$

has a standard normal distribution with mean 0 and variance 1, given that the null hypothesis is true. Now, if α is the probability of a type I error and zα is the value for which P(Z > zα) = α, then we can test the null hypothesis by using the following decision rule:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}} > z_{\alpha}$

From this equation it follows that the significance level α is the probability of rejecting the null hypothesis when in fact the null hypothesis is true. As we mentioned earlier, the researcher may specify the significance level beforehand. It is important to do so before the hypothesis testing procedure is actually conducted, because it may happen that a certain null hypothesis is rejected at a significance level of, for instance, 0.05, but would not have been rejected at the lower 0.01 significance level. Generally, reducing the significance level implies reducing the probability of rejecting a true null hypothesis.

Another procedure for hypothesis testing is related to the p-value. The p-value is the probability of obtaining a value of the test statistic as extreme as or more extreme than the actual value obtained when the null hypothesis is true. In other words, the p-value is the smallest significance level at which the null hypothesis can be rejected given the observed sample statistic. When using the p-value, the following decision rule should be applied: reject H0 if p-value < α. This decision rule results in the same decision as the procedure described earlier. The p-value for a test is computed as follows:

$p\text{-value} = P\left( \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}} \geq z_{p} \mid H_{0}: \mu = \mu_{0} \right)$

where zp refers to the standard normal value that is associated with the smallest significance level at which the null hypothesis can be rejected. The p-value is computed by virtually every statistical computer program and is a very popular tool in many statistical applications. However, be aware of the fact that the p-value is an observed random variable that will be different for each random sample obtained for a statistical test. Hence, two different analysts could obtain their own random samples and sample means from the same population and subsequently compute different p-values. This may lead to different conclusions (when the p-value is close to the significance level).
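Both decision rules (critical value and p-value) can be illustrated with a short Python sketch for the IQ example. The sample mean, the population standard deviation of 15, and the sample size are hypothetical values added for illustration; only μ0 = 100 comes from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_test_upper(x_bar, mu_0, sigma, n, alpha=0.05):
    """Upper-tail test of H0: mu = mu_0 against H1: mu > mu_0, sigma known."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))     # standardized sample mean
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value z_alpha
    p_value = 1 - NormalDist().cdf(z)          # P(Z >= z | H0)
    return z, p_value, z > z_alpha


# Hypothetical sample: 36 students with a mean IQ of 104.
z, p_value, reject = z_test_upper(x_bar=104.0, mu_0=100.0, sigma=15.0, n=36)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Here the test statistic is 1.6, slightly below the 5% critical value of about 1.645, so H0 is not rejected; the p-value of roughly 0.055 leads to the same conclusion.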

### How to test the mean of a normal distribution with population variance unknown?

In this section, we discuss how to test the mean of a normal distribution in the event of unknown population variance. Recall from chapter 7 that we must use the Student's t distribution when the population variance is unknown. Further, this t distribution depends on the degrees of freedom. Here, the degrees of freedom are: df = n - 1. For sample sizes greater than 100, the normal distribution can be used to approximate the Student's t distribution. Here, the decision rule for a one-sided alternative hypothesis (more specifically: H1: μ > μ0) is as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} t = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}} > t_{n-1,\alpha}$

or, equivalently:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \bar{x} > \bar{x}_{c} = \mu_{0} + t_{n-1,\alpha} s/\sqrt{n}$

Note that the ">" sign changes to a "<" sign (and the critical t value becomes negative) if we are testing the alternative hypothesis that a particular value is lower than the value specified by the null hypothesis. For a two-sided alternative hypothesis, we test both ">" and "<", each with significance level α/2. Further, the p-values for these tests are computed in the same manner as was done for the hypothesis tests with population variance known, except that the Student's t value is used rather than the normal Z value.
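The equivalence of the two one-sided decision rules (on the t scale and on the scale of the sample mean) can be checked with a small Python sketch. The critical value is passed in as if read from a t table; the function name and the numbers are hypothetical, and 1.711 is the tabulated value for 24 degrees of freedom and α = 0.05.

```python
from math import sqrt


def t_test_upper(x_bar, mu_0, s, n, t_crit):
    """Upper-tail t test; t_crit is t_{n-1, alpha} taken from a table."""
    se = s / sqrt(n)
    t = (x_bar - mu_0) / se
    x_crit = mu_0 + t_crit * se       # critical value on the x-bar scale
    return t, x_crit, t > t_crit


# Hypothetical sample: n = 25, mean 103, standard deviation 10.
t, x_crit, reject = t_test_upper(x_bar=103.0, mu_0=100.0, s=10.0, n=25, t_crit=1.711)
print(f"t = {t:.3f}, critical mean = {x_crit:.3f}, reject H0: {reject}")
```

Comparing t with the critical t value and comparing the sample mean with the critical mean always give the same decision.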

### How to test the population proportion (for large samples)?

Another important and common issue in business and economics problems involves population proportions. For instance, business executives are interested in the percent market share for their products. Further, government officials are interested in the percentage of people that support a proposed new program. Therefore we devote this section to hypothesis testing for population proportions. Recall from chapters 5 and 6 that we can use the normal distribution as a quite accurate approximation of the distribution of the sample proportion. Let P be the population proportion. Then, the following hypotheses can be formulated: H0: P = P0 and H1: P > P0. From this it follows that the decision rule for a population proportion is as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{ \hat{p} - P_{0} }{\sqrt{P_{0} (1 - P_{0}) /n}} > z_{\alpha}$

For the left-sided alternative H1: P < P0, the rule is mirrored: reject H0 if the test statistic is smaller than -zα.

This will be illustrated with an example. Suppose that a certain company desires to know if shoppers are sensitive to the prices of items sold in the store. A random sample of 802 shoppers is obtained. It appears that 378 of these shoppers were able to state the correct price on an item immediately after putting it into their cart. Now, we want to test at a 7% significance level the null hypothesis that at least half of all shoppers are able to recall the correct price.

First, formulate the null hypothesis and alternative hypothesis: H0: P ≥ P0 = 0.50 and H1: P < 0.50. Next, for this example we obtain the following sample statistics: n = 802 and p̂ = 378/802 = 0.471. The test statistic is computed as follows:

$\frac{ \hat{p} - P_{0} }{\sqrt{P_{0} (1 - P_{0}) /n}} = \frac{0.471 - 0.50}{\sqrt{0.50(1 - 0.50)/802}} = -1.64$

At a 7% significance level (α = 0.07), we find the following critical value: -zα = -1.474. Because the test statistic of -1.64 is lower than -1.474, we can reject the null hypothesis at this 7% significance level and conclude that less than one half of the shoppers can correctly recall the price immediately after putting an item into their supermarket cart.
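The shopper example can be reproduced with a short Python sketch using only the standard library. Note that the code keeps the unrounded sample proportion, which gives a statistic of about -1.62 rather than the -1.64 obtained from the rounded value 0.471, and a computed critical value of about -1.476, slightly different from the tabulated -1.474. The conclusion is the same either way.

```python
from math import sqrt
from statistics import NormalDist

n, successes = 802, 378          # figures from the example
p_0, alpha = 0.50, 0.07

p_hat = successes / n
z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
crit = NormalDist().inv_cdf(alpha)   # lower-tail critical value, -z_alpha
p_value = NormalDist().cdf(z)        # P(Z <= z | H0)
reject = z < crit

print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, critical value = {crit:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject}")
```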

### What are the five properties of the power function?

For all the hypothesis tests that we discussed so far, we have developed decision rules for rejecting the null hypothesis in favor of some alternative hypothesis. In doing so, we repeatedly emphasized that not rejecting the null hypothesis does not tell us whether the null or the alternative hypothesis is true. In fact, not rejecting the null hypothesis leaves the researcher with a lot of uncertainty. Therefore, the power (1 - β) can be used as a measure of the degree of certainty that the null hypothesis will be rejected if in fact the null hypothesis is false. By computing the power of a test for all values of μ included in the alternative hypothesis, a power function can be generated. Such a power function has several useful properties. First, the farther the true mean is from the hypothesized mean, the greater the power of the test (everything else being equal). Second, the smaller the significance level (α) of the test, the smaller the power (everything else being equal). Third, the larger the population variance, the lower the power of the test (everything else being equal). Fourth, the larger the sample size, the greater the power of the test (everything else being equal). Fifth, and lastly, the power of the test equals 0.50 when the true mean is equal to the critical value x̅c, because the probability that a sample mean is above x̅c is then, logically, 0.50.
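These properties can be explored numerically. The following Python sketch computes the power of an upper-tail Z test (population variance known) for several true means; the function name and the setting (μ0 = 100, σ = 15, n = 36) are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail(mu_true, mu_0, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu_0 | true mean is mu_true) for H1: mu > mu_0."""
    se = sigma / sqrt(n)
    x_crit = mu_0 + NormalDist().inv_cdf(1 - alpha) * se   # critical sample mean
    return 1 - NormalDist(mu_true, se).cdf(x_crit)


# Power grows as the true mean moves away from the hypothesized mean.
for mu in (100, 102, 104, 106, 108):
    print(mu, round(power_upper_tail(mu, 100, 15, 36), 3))
```

At a true mean equal to μ0 the function returns α, and at a true mean equal to the critical sample mean it returns 0.5, in line with the fifth property.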

### How to test the variance of a normally distributed population?

In addition to testing the population mean, we can also conduct hypothesis tests regarding the population variance. This is especially important in modern quality-control work, because processes with a substantially large variance may produce defective items. The testing procedures for σ2 are, logically, based on the sample variance, that is s2. It is important to know that the chi-square distribution is used for hypothesis testing regarding the variance. The chi-square distribution for a single population has df = (n - 1) degrees of freedom. The null and alternative hypotheses might for example be: H0: σ2 = σ20 and H1: σ2 > σ20. From this, it follows that the decision rule is formulated as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{(n - 1) s^{2}}{\sigma^{2}_{0}} > \chi^{2}_{n-1,\alpha}$
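A minimal Python sketch of this variance test, with the chi-square critical value supplied as if read from a table. The function name and the numbers are hypothetical; 16.919 is the tabulated upper 5% point for 9 degrees of freedom.

```python
def variance_test_upper(s2, sigma2_0, n, chi2_crit):
    """Test H0: sigma^2 = sigma2_0 against H1: sigma^2 > sigma2_0."""
    stat = (n - 1) * s2 / sigma2_0    # follows chi-square with n - 1 df under H0
    return stat, stat > chi2_crit


# Hypothetical quality-control sample: n = 10, sample variance 2.5, target variance 1.0.
stat, reject = variance_test_upper(s2=2.5, sigma2_0=1.0, n=10, chi2_crit=16.919)
print(f"chi-square statistic = {stat:.2f}, reject H0: {reject}")
```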

## What test procedures are there for testing the difference between two populations? - Chapter 10

In the previous chapter, we discussed how to formulate hypotheses for tests that concern a single population. In this chapter, these concepts are extended to the scenario of testing the differences between two population means, proportions, and variances. By now, it is assumed that the reader is familiar with the hypothesis-testing procedures developed in chapter 9 and concepts related to this matter (such as the null hypothesis, alternative hypothesis, and one- and two-sided composite alternative hypotheses).

### How to test the difference in means between two normally distributed populations (with dependent samples)?

There are various applications in business and economics where we desire to draw conclusions about the difference between two population means, rather than to draw conclusions about the absolute levels of the means. For instance, one might want to compare the output of two different production processes without knowing the population means. Or, one might want to know if one stock market strategy results in a higher profit than another without knowing the population profits. Such questions can be treated effectively by various hypothesis-testing procedures. These procedures are based on different assumptions that are similar to those discussed in the previous chapter.

Assume that a random sample of n matched pairs of observations is obtained from two populations with means μx and μy, respectively. For matched pairs that are positively correlated, the variance of the difference between the sample means

$\bar{d} = \bar{x} - \bar{y}$

will be reduced in comparison to using independent samples, because some of the characteristics of the pairs are similar and, therefore, a part of the variability is removed from the total variability of the differences between the means. To illustrate this, suppose that we are studying human behavior. Usually, the differences between twins (matched pairs) are smaller than the differences between two randomly selected people (independent samples). To put this into general terms, we would prefer, whenever possible, to use matched pairs of observations rather than independent samples when comparing measurements from two populations, because the variance of the difference will be smaller. Moreover, a smaller variance increases the probability of rejecting the null hypothesis when the null hypothesis in fact is false.

The hypothesis testing is fairly similar to the procedure that has been discussed in the previous chapter. That is, the null hypothesis is: H0: μx - μy = 0, or: H0: μx - μy ≤ 0. This is tested against the alternative hypothesis, for instance: H1: μx - μy > 0. Here, d̅ is the mean and sd the standard deviation of the n paired differences. The decision rule is formulated as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{\bar{d}}{s_{d} / \sqrt{n} } > t_{n-1,\alpha}$

for a one-sided alternative hypothesis. If the one-sided alternative hypothesis is left-hand sided, such that H1: μx - μy < 0, we obtain the following decision rule:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{\bar{d}}{s_{d} / \sqrt{n} } < -t_{n-1,\alpha}$

Note that the “>” sign changes to a “<” sign and the t value becomes negative. Lastly, the decision rule for a two-sided alternative hypothesis (H1: μx - μy ≠ 0) is formulated as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} \frac{\bar{d}}{s_{d} / \sqrt{n} } < -t_{n-1,\alpha/2} \hspace{1mm} or \hspace{1mm} \frac{\bar{d}}{s_{d} / \sqrt{n} } > t_{n-1,\alpha/2}$

For all these hypothesis tests, tn-1 is a random variable that follows a Student’s t distribution with (n - 1) degrees of freedom. Moreover, for all these hypothesis tests, we can obtain p-values that can be interpreted as the probability of getting a value that is at least as extreme as the one obtained given the null hypothesis.
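As a small illustration of the matched-pairs procedure, the Python sketch below computes the t statistic from paired observations. The data and the function name are hypothetical, and the critical value 2.132 is the tabulated t value for 4 degrees of freedom and α = 0.05.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """Return (t, df) for the mean of the paired differences x_i - y_i."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Hypothetical matched pairs (for example, output under two processes).
x = [12, 15, 11, 14, 13]
y = [10, 13, 12, 11, 12]
t, df = paired_t_statistic(x, y)

t_crit = 2.132                  # t_{4, 0.05} from a table
reject = t > t_crit             # right-sided alternative: mu_x - mu_y > 0
print(f"t = {t:.3f} with {df} df, reject H0: {reject}")
```

With only five pairs the statistic of about 2.06 falls just short of the critical value, so H0 is not rejected in this example.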

### How to test the difference in means between two normally distributed populations (with independent samples)?

Similar to what has been discussed in the previous chapter, there are three important scenarios for independent samples: (1) population variances known; (2) population variances unknown, but assumed to be equal; and (3) population variances unknown and not assumed to be equal.

#### Scenario 1: population variances known

When the two population variances are known, hypothesis tests of the difference between the two population means can be based on the standard normal Z statistic, using the same procedures as discussed before. Due to the central limit theorem, the results also hold for large sample sizes, even if the populations are not normally distributed. Moreover, if the sample sizes are large (that is: n > 100), the approximation remains quite satisfactory when the sample variances are used in place of the population variances. Similar to other hypothesis tests, we can obtain p-values that can be interpreted as the probability of getting a value that is at least as extreme as the one obtained given the null hypothesis. Note that, because the population variances are known, we can use the standard normal distribution rather than Student’s t distribution.
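The test statistic for this scenario is not written out in this summary. As a sketch, assuming the standard form that mirrors the confidence interval for known variances (D0 is a symbol introduced here for the hypothesized difference, often 0):

$Z = \frac{ (\bar{x} - \bar{y}) - D_{0} }{ \sqrt{ \frac{\sigma^{2}_{x}}{n_{x}} + \frac{\sigma^{2}_{y}}{n_{y}} } }$

For a right-sided alternative H1: μx - μy > D0, the null hypothesis would then be rejected if this value exceeds zα.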

#### Scenario 2: population variances unknown, but assumed to be equal

Often, the population variances are unknown. If, in addition to that, the sample sizes are under 100, we must use the Student’s t distribution. There are, however, some theoretical problems when using the Student’s t distribution to test differences between sample means. Fortunately, these issues can be solved if we can assume that the population variances are equal. In that case, we can use a pooled estimator of the common population variance, which can be computed as follows:

$s^{2}_{p} = \frac{ (n_{x} - 1) s^{2}_{x} + (n_{y} - 1) s^{2}_{y} }{ (n_{x} + n_{y} - 2) }$

with the degrees of freedom equal to: df = nx + ny – 2. From this, it follows that the hypothesis tests can be conducted using the Student’s t statistic for the difference between the two means:

$t = \frac{ (\bar{x} - \bar{y} ) - ( \mu_{x} - \mu_{y} ) }{ \sqrt{ \frac{s^{2}_{p}}{n_{x}} + \frac{s^{2}_{p}}{n_{y}}} }$

The form is thus more or less similar to the Z statistic, which is used when the population variances are known. The only difference here is that the pooled estimator of the variances is used rather than the (known) population variances themselves. Apart from using the Student’s t distribution and the pooled estimator of the variances, the testing procedure is equal to the scenario with known population variances.
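The pooled t statistic can be sketched as follows in Python. The function name, the parameter `d_0` (the hypothesized value of μx - μy under H0) and the data are hypothetical; 1.697 is the tabulated t value for 30 degrees of freedom and α = 0.05.

```python
from math import sqrt


def pooled_t_statistic(x_bar, y_bar, s2_x, s2_y, n_x, n_y, d_0=0.0):
    """Return (t, df) for testing mu_x - mu_y = d_0 with a pooled variance."""
    df = n_x + n_y - 2
    s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / df   # pooled variance
    t = ((x_bar - y_bar) - d_0) / sqrt(s2_p / n_x + s2_p / n_y)
    return t, df


# Hypothetical samples of size 16 each.
t, df = pooled_t_statistic(75.0, 70.0, s2_x=20.0, s2_y=30.0, n_x=16, n_y=16)
reject = t > 1.697              # right-sided test at alpha = 0.05
print(f"t = {t:.3f} with {df} df, reject H0: {reject}")
```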

#### Scenario 3: population variances unknown and not assumed to be equal

When the population variances are unknown and cannot be assumed to be equal either, we arrive at a rather complex situation. There are substantial complexities in the determination of the degrees of freedom for the critical value of the Student’s t distribution. Although these can be computed by hand, they are often computed by a statistical computer program. For the interested reader, we refer to page 401 of the book. After the degrees of freedom are obtained, the procedure is similar to the testing procedure that we discussed before. The only difference is that the sample variances are used, rather than the population variances or a pooled estimator of the variances.

### How to test the difference in proportions between two normally distributed populations (with large samples)?

Recall from chapter 5 that, for large samples (that is: nP0(1 – P0) > 5), proportions can be approximated as normally distributed random variables. As a result, the standard normal distribution with z scores can be used. Suppose there are two independent random samples of size nx and ny. Let the sample proportions of successes be p̂x and p̂y. Now, if the population proportions are unknown but it can be assumed that the population proportions are equal, then the unknown common population proportion P0 can be estimated using a pooled estimator that is defined as follows:

$\hat{p}_{0} = \frac{n_{x} \hat{p}_{x} + n_{y} \hat{p}_{y}}{n_{x} + n_{y}}$
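The summary stops at the pooled estimator and does not show the resulting test statistic. A minimal Python sketch, assuming the usual large-sample Z statistic in which the pooled estimate replaces both unknown proportions in the standard error (this form is an assumption here, and the function name and data are hypothetical):

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes_x, n_x, successes_y, n_y):
    """Return (pooled estimate, z) for testing H0: P_x = P_y."""
    p_x, p_y = successes_x / n_x, successes_y / n_y
    p_0 = (n_x * p_x + n_y * p_y) / (n_x + n_y)          # pooled estimator
    se = sqrt(p_0 * (1 - p_0) / n_x + p_0 * (1 - p_0) / n_y)
    return p_0, (p_x - p_y) / se


# Hypothetical samples: 120 of 200 successes versus 150 of 300.
p_0, z = two_proportion_z(120, 200, 150, 300)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))              # two-sided alternative
print(f"pooled p = {p_0:.2f}, z = {z:.3f}, p-value = {p_value:.4f}")
```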

### How to test the equality of the variances between two normally distributed populations?

In the last section of this chapter, we discuss how to test the equality of variances between two normally distributed populations. Equality of variances is, for instance, the assumption behind the pooled estimator of the common variance (as was discussed earlier in this chapter). Here, we develop a procedure for testing this assumption of equal variances. To perform this test, the F probability distribution is used. Suppose that there are two independent random samples drawn from populations X and Y. Then, the random variable F can be computed as follows:

$F = \frac{s^{2}_{x} / \sigma^{2}_{x}}{s^{2}_{y} / \sigma^{2}_{y}}$

This random variable follows a distribution known as the F distribution. Similar to Student’s t distribution, the F distribution is actually a family of distributions that is characterized by the degrees of freedom. Unlike the Student’s t distribution, however, the F distribution is characterized by degrees of freedom for the numerator and degrees of freedom for the denominator. More precisely, the degrees of freedom for the numerator are equal to (nx - 1) and the degrees of freedom for the denominator are equal to (ny - 1). The critical cutoff point for a particular F value can be found in Appendix Table 9. Now, let the degrees of freedom be denoted by v1 and v2 respectively. Under the null hypothesis of equal population variances, F reduces to the ratio of the sample variances, and the decision rule for a one-sided alternative hypothesis (right-hand side) can be formulated as follows:

$reject \hspace{1mm} H_{0} \hspace{1mm} if \hspace{1mm} F = \frac{s^{2}_{x}}{s^{2}_{y}} > F_{n_{x}-1, n_{y}-1, \alpha}$

Note that α/2 is used as the upper-tail probability for a two-tailed hypothesis test. Similar to all hypothesis tests discussed before, a p-value can be computed, yielding the probability of getting a value at least as extreme as the one obtained given the null hypothesis. Because the F distribution is rather complex, the critical values are commonly computed using a statistical software package.
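A minimal Python sketch of the right-sided F test, with the critical value supplied as if read from an F table. The function name and the sample variances are hypothetical; 3.18 is the tabulated upper 5% point for 9 and 9 degrees of freedom.

```python
def f_test_upper(s2_x, s2_y, f_crit):
    """Right-sided test of equal variances; s2_x is placed in the numerator."""
    f = s2_x / s2_y
    return f, f > f_crit


# Hypothetical samples of size 10 each, so v1 = v2 = 9.
f, reject = f_test_upper(s2_x=12.0, s2_y=4.0, f_crit=3.18)
print(f"F = {f:.2f}, reject H0: {reject}")
```

Even though one sample variance is three times the other, with only ten observations per sample the ratio does not exceed the critical value, so equality of variances is not rejected here.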

## How to conduct a simple regression? - Chapter 11

So far, we have focused on the statistical analysis and inference related to a single variable. In this chapter, we move on to analyzing relationships between multiple variables. In doing so, we assume that the reader is familiar with concepts such as scatter plot, covariance, and correlation (see Chapter 2). The relationship between variables is commonly used for analyzing processes in business and economics. For example, one may be interested in the following: if a developing country increases its fertilizer production by one million tons, how much increase in grain production can be expected? Generally, these relationships can be expressed as Y = f(X), in which the function f can take both linear and nonlinear forms. For now, in this chapter, we only focus on linear relationships using least squares regression.

### What is meant by a least square regression?

Often, a desired functional relationship between two variables X and Y can be approximated using a linear equation, given by:

$Y = \beta_{0} + \beta_{1}X$

in which Y is the dependent variable (also known as endogenous variable) and X is the independent variable (also known as exogenous variable). Further, β0 is the intercept (the value of Y where X is equal to zero) and β1 is the slope of the line (that is: the change in Y for a one-unit change in X). This slope coefficient (β1) is very important for many business and economics applications, because it provides an indication of the change in output (of the endogenous variable) for each unit change in the input (of the exogenous variable). To obtain the best estimates of the intercept and the slope, the available data are used. The estimates are denoted by b0 and b1 and are computed by using least squares regression, a technique that is widely implemented in statistical software packages. In a least squares regression, it is assumed that for each value of X, there is a corresponding mean value of Y that results from the underlying linear relationship between X and Y. The least squares regression line based on sample data is given by:

$\hat{y} = b_{0} + b_{1}x$

where b0 is the y-intercept, which can be computed as follows:

$b_{0} = \bar{y} - b_{1}\bar{x}$

and b1 is the slope of the line, which can be computed as follows:

$b_{1} = \frac{Cov(x,y)}{s^{2}_{x}} = r \frac{s_{y}}{s_{x}}$

Further, it follows from the last equation that the correlation coefficient can be computed as follows:

$r = \frac{Cov(x,y)}{s_{x}s_{y}}$
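The formulas for b0, b1 and r can be checked on a small data set. The Python sketch below uses only the standard library; the function name and the five data points are hypothetical.

```python
from math import sqrt


def least_squares_line(x, y):
    """Return (b0, b1, r) for the simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    b1 = cov_xy / var_x                  # slope: Cov(x, y) / s_x^2
    b0 = y_bar - b1 * x_bar              # intercept
    r = cov_xy / sqrt(var_x * var_y)     # correlation coefficient
    return b0, b1, r


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r = least_squares_line(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x, r = {r:.4f}")
```

For these data the slope also equals r multiplied by the ratio of the standard deviations, as the two expressions for b1 require.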

### What does the linear regression population model look like?

In the section above, we saw that least squares regression is a procedure that provides an estimated model of the linear relationship between an independent (exogenous) variable and a dependent (endogenous) variable. The least square regression thus is an estimate of the population model. This population model can be specified as follows:

$y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}$

where β0 and β1 are the population model coefficients and εi is a random error term. For linear regressions, four assumptions are made. First, it is assumed that the Y's are linear functions of X, plus a random error term. Second, it is assumed that the x values are fixed numbers that are independent of the error terms. Third, the error terms are assumed to be random variables with a mean of zero and a variance of σ2. This property is also known as homoscedasticity, or uniform variance. This will be explained in more detail later in this chapter. Further, we will also describe later in this chapter that the central limit theorem can be used to relax the assumption of a normal distribution. Fourth, it is assumed that the random error terms are not correlated with one another.

Linear regression provides two important outcomes. First, the predicted values (y hat) of the dependent (endogenous) variable as a function of the independent (exogenous) variable. Second, the estimated marginal change in the dependent (endogenous) variable, b1, that results from a one-unit change in the independent (exogenous) variable.

It is important to be aware of the fact that regression results summarize the information contained in the data. They do not "prove" that an increase in X "causes" an increase in Y. In order to be able to draw such conclusions, one needs to combine good statistical analysis with theory.

### How to obtain the least squares coefficient estimators?

Although the population regression line is a useful theoretical construct, we cannot use it in practice. Instead, we need to determine an estimate of this model by using the available data. As mentioned before, the least squares regression procedure can be used for this purpose. The least squares procedure obtains estimates of the linear equation coefficients by minimizing the sum of the squared residuals ei:

$SSE = \sum^{n}_{i = 1} e^{2}_{i} = \sum^{n}_{i = 1} (y_{i} - \hat{y}_{i})^{2}$

Further, the coefficients b0 and b1 are chosen such that the sum of the squared residuals (SSE) is minimized. Differential calculus is used to obtain the coefficient estimators that minimize the SSE. For the interested reader, we refer to the chapter appendix in the book. Basically, what follows are the following equations for the coefficient estimators:

$b_{1} = r \frac{s_{Y}}{s_{X}}$

and

$b_{0} = \bar{y} - b_{1}\bar{x}$

Note that, because the computation of the regression coefficients is challenging, we often use statistical software packages to compute the regression coefficients. While the computation often is assigned to computers, it remains our task to think, analyze, and make recommendations. These estimates are used to obtain an estimation of the underlying population model. However, in order to be able to make inferences about the population, it is required that the four assumptions that were described in the previous section are met. Given these assumptions, it can be shown that the least squares coefficient estimators are unbiased and have minimum variance.

### What is the explanatory power of a linear regression equation?

In this section, we move on to the explanatory power of a linear regression equation. How can we develop measures that indicate how effectively the variable X explains the change of Y? The total variability in a regression analysis (SST) can be partitioned into a component that is explained by the regression equation (SSR) and a component that is due to unexplained error (SSE). In formula, that is: SST = SSR + SSE. These sums of squares can be determined as follows:

$SST = \sum^{n}_{i = 1} (y_{i} - \bar{y})^{2}$

with

$SSE = \sum^{n}_{i = 1} e^{2}_{i}$

and

$SSR = \sum^{n}_{i = 1} (\hat{y}_{i} - \bar{y})^{2} = b^{2}_{1} \sum^{n}_{i = 1} (x_{i} - \bar{x})^{2}$

One commonly used measure to indicate the explanatory power of a linear regression equation is the coefficient of determination, denoted by R2. This coefficient is a ratio of the sum of squares of the variance explained by the regression equation (SSR) divided by the total sum of squares (SST) and hence provides a descriptive measure of the proportion or percent of the total variability that is explained by the regression model. In other words, it is the percent explained variability. In formula, that is:

$R^{2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

The coefficient of determination varies from 0 to 1, in which higher values indicate a better regression (a larger part of the total variability is explained by the regression). However, one should be cautious when making general interpretations of R2, because a high value might result from either a small SSE, a large SST, or both.

There is an important association between the correlation coefficient and R2. More precisely, the coefficient of determination (R2) for simple regression is equal to the simple correlation squared: R2 = r2.

Lastly, the quantity SSE is a measure of the total squared deviation about the estimated regression line, where ei is the residual. The model error variance can be estimated as follows:

$\hat{\sigma}^{2} = s^{2}_{e} = \frac{SSE}{n - 2}$

Note that the division is by (n - 2) instead of (n - 1). This is done, because the simple regression model uses two estimated parameters (b0 and b1) instead of one. In the next section, we will see that this variance estimator forms the basis for statistical inference in regression models.
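The partition SST = SSR + SSE, the coefficient of determination, and the error variance estimate can be checked numerically. The sketch below uses hypothetical data and a hypothetical helper `sums_of_squares`; it computes the slope as the ratio of the cross-product sum to the sum of squares of X, which is algebraically the same as r times sY / sX.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the simple least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx               # algebraically equal to r * s_Y / s_X
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = b1 ** 2 * s_xx
    sse = sum(e ** 2 for e in residuals)
    return sst, ssr, sse

# Hypothetical data, not taken from the book
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst            # coefficient of determination
s2_e = sse / (n - 2)             # estimated model error variance
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.2f}, s_e^2 = {s2_e:.2f}")
```

With these numbers SST is 6, of which 3.6 is explained by the regression and 2.4 is left as error, so R2 equals 0.6.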

### How to make inferences about the population?

Now that we have developed the coefficient estimators, it is time to make inferences about the population model. In doing so, we follow the basic approach that has been discussed in Chapters 7-10. Since yi is normally distributed and b1 is a linear function of independent normal variables, the linear function implies that b1 is also normally distributed. From this property, we can derive the population variances and sample variances as follows:

$\sigma^{2}_{b1} = \frac{\sigma^{2}}{ (n - 1) s^{2}_{x}}$

and an unbiased sample variance estimator:

$s^{2}_{b1} = \frac{s^{2}_{e}}{ (n - 1) s^{2}_{x}}$

Further, it is important to see that the variance of the slope coefficient b1 depends on two quantities: (1) the distance of the points from the regression line, measured by s2e, for which higher values yield greater variance for b1, and; (2) the total deviation of the X values from the mean, measured by (n - 1) s2x, for which higher deviations in the X values and larger sample sizes imply smaller variance for the slope coefficient. From this, it follows that smaller variance estimators of the slope coefficient imply a better regression model. In other words, we would like the variance of the decision variable (X) to be as large as possible.

Earlier in this chapter, we discussed that the equation for the estimated coefficient b1 assumes that the variances of the error terms are uniform or equal over the entire range of the independent variable(s). This property is called homoscedasticity. Sometimes, however, the variances of the error terms are not uniform. This may happen, for example, in annual household consumption, which generally increases with increasing levels of household disposable income, yet with higher incomes, households have greater flexibility between consumption and saving. Hence, a plot of the annual household consumption versus disposable income would show that the data are "fanning out" around a linear trend as disposable income increases. This situation of nonuniform error variances is also referred to as heteroscedasticity.

Now, we move on to the hypothesis tests. To determine if there is a linear relationship between X and Y, we can test the following null hypothesis: H0: β1 = 0 against the alternative hypothesis: H1: β1 ≠ 0. Given that b1 is normally distributed, we can test this hypothesis using the Student's t statistic:

$t = \frac{b_{1} - \beta_{1}}{s_{b_{1}}}$

with (n - 2) degrees of freedom. Further, the central limit theorem can be used to conclude that this result is approximately valid for a wide range of nonnormal distributions if the sample is large enough. For the two-sided alternative stated above, with β1* denoting the hypothesized value of the slope (here β1* = 0), the following decision rule follows:

$reject \hspace{2mm} H_{0} \hspace{2mm} if \hspace{2mm} \left| \frac{b_{1} - \beta^{*}_{1}}{s_{b_{1}}} \right| \geq t_{n-2,\alpha/2}$

If the null hypothesis is rejected, this implies that there is a relationship between X and Y.

Finally, it is convenient to know the following rule of thumb: for a two-tailed test with α = 0.05 and n > 60, a Student's t statistic with an absolute value greater than 2.0 indicates that there is a relationship between the two variables X and Y.

The confidence interval for the population regression slope (β1) is given by:

$b_{1} - t_{n-2,\alpha/2} s_{b_{1}} < \beta_{1} < b_{1} + t_{n-2,\alpha/2} s_{b_{1}}$

where, again, the random variable tn-2 follows a Student's t distribution with (n - 2) degrees of freedom.
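To illustrate the test of H0: β1 = 0 against H1: β1 ≠ 0 and the confidence interval for the slope, here is a small sketch. It assumes the critical value is looked up in a Student's t table and passed in by the user. The function name `slope_inference`, the data, and the tabulated value 3.182 (for 3 degrees of freedom and α/2 = 0.025) are inputs chosen for this illustration.

```python
from math import sqrt

def slope_inference(x, y, t_crit, beta1_null=0.0):
    """Return (t statistic, reject H0?, confidence interval) for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)     # equals (n - 1) * s_x^2
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2_e = sse / (n - 2)                          # error variance estimate
    s_b1 = sqrt(s2_e / s_xx)                      # standard error of b1
    t_stat = (b1 - beta1_null) / s_b1
    reject = abs(t_stat) >= t_crit                # two-sided decision rule
    interval = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return t_stat, reject, interval

# Hypothetical data; n = 5 gives n - 2 = 3 degrees of freedom
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, reject, interval = slope_inference(x, y, t_crit=3.182)
print(f"t = {t_stat:.3f}, reject H0: {reject}")
print(f"95% interval for beta1: ({interval[0]:.3f}, {interval[1]:.3f})")
```

In this tiny sample the interval contains 0, which agrees with the failure to reject the null hypothesis at the 5% level.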

#### F test for simple regression coefficient

Next to testing the slope of the regression by using the Student's t distribution, it is also possible to use the F distribution for this hypothesis test. In fact, exactly the same result will be provided. Further, we will see in Chapter 13 that the F distribution also allows for the opportunity to test the hypothesis that several population slope coefficients are simultaneously equal to zero. For now, however, it is sufficient to know that F = t2b1 and that the F statistic can be computed as follows:

$F = \frac{MSR}{MSE} = \frac{SSR}{s^{2}_{e}}$

The decision rule is as follows:

$Reject \hspace{2mm} H_{0} \hspace{2mm} if \hspace{2mm} F \geq F_{1,n-2,\alpha}$
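The claim that the F statistic equals the squared t statistic of the slope can be verified numerically. The following sketch uses hypothetical data and a hypothetical helper name; it computes both quantities independently from the formulas given in this chapter.

```python
def f_and_t_squared(x, y):
    """Return (F, t^2) for the slope of a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2_e = sse / (n - 2)
    ssr = b1 ** 2 * s_xx
    f_stat = ssr / s2_e                 # MSR / MSE with one numerator df
    t_squared = b1 ** 2 / (s2_e / s_xx) # (b1 / s_b1)^2 under H0: beta1 = 0
    return f_stat, t_squared

# Hypothetical data, not taken from the book
f_stat, t_squared = f_and_t_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"F = {f_stat:.3f}, t^2 = {t_squared:.3f}")
```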

### How can a regression model be used for prediction?

Regression models are a useful tool to compute predictions for the dependent variable, given an assumed future value for the independent variable. Broadly speaking, there are two distinct options of interest:

1. Estimating the actual value that will result for a single observation, yn+1.
2. Estimating the conditional expected value, that is, the average value of the dependent variable when the independent variable is fixed at xn+1.

For the first option, that is estimating the actual value that will result for a single observation, the prediction interval can be computed as follows:

$\hat{y}_{n+1} \pm t_{n-2,\alpha/2} \sqrt{ [ 1 + \frac{1}{n} + \frac{ (x_{n+1} - \bar{x})^{2} }{ \sum^{n}_{i = 1} (x_{i} - \bar{x})^{2} } ] } s_{e}$

And for the second option, that is estimating the conditional expected value or the mean, the confidence interval for predictions is:

$\hat{y}_{n+1} \pm t_{n-2,\alpha/2} \sqrt{ [ \frac{1}{n} + \frac{ (x_{n+1} - \bar{x})^{2} }{ \sum^{n}_{i = 1} (x_{i} - \bar{x})^{2} } ] } s_{e}$

Note that the second equation is similar to the first, with the exception of "1 +" in the square root. From these general forms of prediction, we can see that the wider the interval, the greater the uncertainty surrounding the prediction point. More specifically, we can formulate four observations. First, all other things being equal, the larger the sample size (n), the narrower are both the prediction interval and the confidence interval. Second, all other things being equal, the larger s2e, the wider both the prediction interval and the confidence interval. Third, a large dispersion in the independent variable implies that there is information for a wider range of values for that variable, which allows more precise estimates of the population regression line and, correspondingly, narrower confidence intervals and narrower prediction intervals. Fourth, the larger the value for the quantity (xn+1 - x̅)2, the wider the confidence intervals and prediction intervals.
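The difference between the two intervals can be made concrete by computing their half-widths at two values of xn+1. This is a sketch under stated assumptions: the helper `interval_half_widths` is hypothetical, the standard error of the estimate and the critical value are supplied by the user, and the numbers below are illustrative only.

```python
from math import sqrt

def interval_half_widths(x, s_e, x_new, t_crit):
    """Return (prediction half-width, confidence half-width) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    distance = (x_new - x_bar) ** 2 / s_xx
    prediction = t_crit * sqrt(1 + 1 / n + distance) * s_e  # single observation
    confidence = t_crit * sqrt(1 / n + distance) * s_e      # conditional mean
    return prediction, confidence

# Hypothetical inputs: s_e^2 = 0.8 and t = 3.182 (3 degrees of freedom)
x = [1, 2, 3, 4, 5]
s_e = sqrt(0.8)
at_mean = interval_half_widths(x, s_e, x_new=3, t_crit=3.182)
far_out = interval_half_widths(x, s_e, x_new=5, t_crit=3.182)
print(f"x = 3: prediction {at_mean[0]:.3f}, confidence {at_mean[1]:.3f}")
print(f"x = 5: prediction {far_out[0]:.3f}, confidence {far_out[1]:.3f}")
```

Both half-widths are smallest at the mean of X and grow as xn+1 moves away from it, and the prediction interval is always the wider of the two.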

### How to conduct a correlation analysis?

Correlation coefficients can also be used to study relationships between variables. In chapter 2, we already used the correlation coefficient to describe the relationship between variables. In chapters 4 and 5, we discussed the population correlation. In this chapter, we discuss inference procedures that use the correlation coefficient to study linear relationships between variables.

The sample correlation coefficient r is a useful tool as it provides a descriptive measure of the strength of a linear relationship in a sample. The correlation can also be used to test the hypothesis that there is no linear association in the population between a pair of random variables. That is: H0: ρ = 0. This can be tested against the alternative hypothesis that there is a correlation between the pair of random variables. That is: H1: ρ ≠ 0. Then, the decision rule is:

$reject \hspace{2mm} H_{0} \hspace{2mm} if \hspace{2mm} \frac{r \sqrt{(n - 2)}}{\sqrt{(1 - r^{2})}} < -t_{n-2,\alpha/2} \hspace{2mm} or \hspace{2mm} \frac{r \sqrt{(n - 2)}}{\sqrt{(1 - r^{2})}} > t_{n-2,\alpha/2}$

where tn-2 follows a Student's t distribution with (n - 2) degrees of freedom.

If we set tn-2,α/2 = 2.0, an approximate rule to remember for testing the previous hypothesis that the population correlation is zero can be shown to be:

$|r| > \frac{2}{\sqrt{n}}$
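A short sketch of both the exact test statistic and the rule of thumb is given below. The function names and the values r = 0.3 and n = 100 are hypothetical and serve only as an illustration.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def rule_of_thumb_rejects(r, n):
    """Approximate rule: reject H0 when |r| exceeds 2 / sqrt(n)."""
    return abs(r) > 2 / sqrt(n)

# Hypothetical sample correlation and sample size
r, n = 0.3, 100
t_stat = correlation_t_statistic(r, n)
print(f"t = {t_stat:.3f}")
print(f"rule of thumb rejects H0: {rule_of_thumb_rejects(r, n)}")
```

Here the threshold 2 / √n equals 0.2, so a sample correlation of 0.3 is judged different from zero, and the exact statistic is also well above 2.0.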

### What is the beta measure of financial risk?

In finance, a number of measures have been developed to help investors measure and control financial risk in the development of investment portfolios. Risk can be subdivided into diversifiable risk and nondiversifiable risk. The former, diversifiable risk, is the risk associated with specific firms and industries and includes labor conflicts, new competition, consumer market changes, and several other factors. Diversifiable risk can be controlled by larger portfolio sizes and by including stocks whose returns have negative correlations. The latter, nondiversifiable risk, is the risk associated with the entire economy. Examples are shifts in the economy resulting from business cycles, international crises, and evolving world energy demands. Such factors affect all firms, but do not have the exact same effect on each firm. The effect on individual firms is measured with the beta coefficient. More specifically, the beta coefficient for a specific firm is the slope coefficient that is obtained when the return for a particular firm is regressed on the return for a broad index, such as the S&P 500. This slope coefficient then indicates how responsive the returns for a particular firm are in comparison to the overall market returns. Commonly, the beta coefficient is positive, but in some limited cases a firm's returns will move in the opposite direction compared to the overall economy, yielding a negative beta. If the firm's returns exactly follow the market, then the beta coefficient will be 1. If the firm's returns are more responsive than the market, the beta will be greater than 1. And if the firm's returns are less responsive than the market, then the beta will be less than 1.
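Because the beta coefficient is simply a regression slope, it can be computed with the same least squares formula as before. The sketch below uses invented, noise-free return series in which the firm's return is always 1.5 times the market return, so the expected slope is known in advance; the function name `beta_coefficient` is also hypothetical.

```python
def beta_coefficient(firm_returns, market_returns):
    """Slope from regressing firm returns on market returns."""
    n = len(market_returns)
    m_bar = sum(market_returns) / n
    f_bar = sum(firm_returns) / n
    s_mm = sum((m - m_bar) ** 2 for m in market_returns)
    s_mf = sum((m - m_bar) * (f - f_bar)
               for m, f in zip(market_returns, firm_returns))
    return s_mf / s_mm

# Invented returns for illustration; real data would contain noise
market = [0.01, -0.02, 0.03, 0.00, 0.02]
firm = [1.5 * m for m in market]
beta = beta_coefficient(firm, market)
print(f"beta = {beta:.2f}")
```

A beta of 1.5 means that this hypothetical firm's returns respond more strongly than the market in both directions.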

### Which two factors can influence the estimated regression equation?

Both extreme points and outliers have a great influence on the estimated regression equation compared to other observations. In any applied analysis, either these unusual points are part of the data that represent the process being studied, or they are not. In the former case, these unusual points should be included in the data set. In the latter case, these unusual points should not be included in the data set. In any case, the researcher has to decide what the nature of these unusual points is. Typically, this decision requires a good understanding of the process and proper judgement. The individual points should be examined carefully and their source should be checked. Are they the result of measurement or recording error?

Generally, extreme points are defined as points that have X values that deviate substantially from the X values for the other points. Extreme points have a high leverage (hi), which is defined as follows:

$h_{i} = \frac{1}{n} + \frac{ (x_{i} - \bar{x})^{2} }{\sum^{n}_{i = 1} (x_{i} - \bar{x})^{2} }$

This leverage term increases the standard deviation of the expected value as data points are farther from the mean of X and, thus, leads to a wider confidence interval. There are different cut-off values for the leverage, but one common rule is that points with leverage hi > 3 p/n are identified as "high leverage", in which p is the number of predictors, including the constant. Most software packages use this rule, although Excel uses a different rule.

Outliers are defined as those observations that deviate substantially in the Y direction from the predicted value. Typically, these points are identified by computing the standard residual as follows:

$e_{is} = \frac{e_{i}}{s_{e} \sqrt{1 - h_{i}} }$

Recall that points with high leverage will have a smaller standard error of the residual. This is the case, because points with high leverage are likely to influence the location of the estimated regression line and, therefore, the observed and expected values of Y will be closer.
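The leverage and standardized residual formulas can be applied together, as in the sketch below. The ten data points are hypothetical, with one X value placed far from the others, and the cut-off 3 p/n is used with p = 2 (one predictor plus the constant). The function name is an assumption for this illustration.

```python
from math import sqrt

def leverage_and_standardized_residuals(x, y):
    """Return (leverages h_i, standardized residuals e_is)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    standardized = [e / (s_e * sqrt(1 - h))
                    for e, h in zip(residuals, leverage)]
    return leverage, standardized

# Hypothetical data: the last point is extreme in the X direction
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [2, 3, 5, 4, 6, 7, 8, 8, 10, 20]
leverage, standardized = leverage_and_standardized_residuals(x, y)
p = 2                                   # predictors including the constant
threshold = 3 * p / len(x)
high_leverage = [i for i, h in enumerate(leverage) if h > threshold]
print(f"threshold = {threshold:.2f}, high-leverage indices = {high_leverage}")
```

Only the last observation exceeds the cut-off of 0.6. Its standardized residual divides by a small value of the square root of (1 - hi), which reflects the smaller standard error of the residual at high-leverage points.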

So far, we have focused on the statistical analysis and inference related to a single variable. In this chapter, we move on to analyzing relationships between multiple variables. In doing so, we assume that the reader is familiar with concepts such as scatter plot, covariance, and correlation (see Chapter 2). The relationship between variables is commonly used for analyzing processes in business and economics. For example, one may be interested in the following: if a developing country increases its fertilizer production by one million tons, how much of an increase in grain production can be expected? Generally, these relationships can be expressed as Y = f(X), in which the function f can follow both linear and nonlinear forms. For now, in this chapter, we only focus on linear relationships using least squares regression.

## How to conduct a multiple regression? - Chapter 12

In the previous chapter, simple regression was introduced. A simple regression is a procedure for obtaining a linear equation that predicts a dependent (endogenous) variable as a function of a single independent (exogenous) variable. In practice, however, it is often the case that multiple independent variables jointly affect a dependent variable. Therefore, in this chapter, multiple regression will be discussed, which is a procedure for obtaining a linear equation that predicts a dependent (endogenous) variable as a function of multiple independent (exogenous) variables.

### What are important considerations in developing a multiple regression model?

#### Model specification

A critical step in multiple regression is model specification: the selection of the exogenous variables and the functional form of the model. To select the appropriate independent variables, considerable discussion with people in the company is often needed to determine which variables are most likely to affect the dependent variable.

#### Model objectives

The strategy that is used for model specification is influenced by the model objectives. Broadly speaking, there are two main objectives for regression analysis: (1) prediction of changes in the dependent variable as a function of the independent variables, and; (2) estimation of the marginal effect of each independent variable. Often in economics and business, one is interested in how performance measures are influenced by changes in the independent variables. For instance, how do sales change as a result of price increases and advertising expenditures? How does output change when the amounts of labor and capital are changed? Does infant mortality become lower when health care expenditures and local sanitation are increased? Note that marginal change often is more difficult to estimate, because the independent variables are related not only to the dependent variable, but also to each other. If the latter is the case, it is difficult to determine the individual effect of each independent variable on the dependent variable. Sometimes, both of these aims (i.e., prediction and estimation) are equally important. Usually, however, one of the aims will predominate.

#### Model development

Next, the regression model can be constructed in order to explain variability in the dependent variable of interest. In order to build the model, we want to include the simultaneous and individual influences of the different independent variables. The basic form of a multiple regression population model is as follows:

$y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + ... + \beta_{K}x_{Ki} + \epsilon_{i}$

where the βj terms are the coefficients (i.e., marginal effects) of the independent variables Xj where j = 1, ..., K, given the effects of the other independent variables, and εi is the random error term with a mean of 0 and a variance of σ2.

Similar to the simple regression, the population model is estimated by a sample estimated model, which has the following basic form:

$y_{i} = b_{0} + b_{1}x_{1i} + b_{2}x_{2i} + ... + b_{K}x_{Ki} + e_{i}$

Basically, simple regression is a special (reduced) form of multiple regression in which there is only one predictor variable. As a result of this single predictor variable, the plane is reduced to a line. In multiple regression, this plane is multidimensional. Sometimes (for example, when there are two predictor variables and one dependent variable), three-dimensional graphing may be a useful tool to aid in interpreting the relationship between the variables.

### How to estimate the regression coefficients?

To obtain the regression coefficients, the least squares procedure is used again. The least squares procedure is similar to the one presented in the previous chapter for simple regression, except that the estimators are complicated by the relationships between the multiple independent variables that occur simultaneously with the relationship between the independent variables and the dependent variable. For now, it is enough to know that the estimates of the coefficients and their variances are always obtained by using a computer. You do not have to be able to compute these estimates by hand.

With regard to the assumptions for a standard multiple regression, we can see that there are in total five assumptions. The first four assumptions are essentially the same as those made for simple regression. Only a fifth assumption is added for multiple regression. The fifth assumption is that there is no direct linear relationship between the Xj independent variables. Often, proper model specification ensures that the fifth assumption will not be violated.

Suppose there are two independent variables X1 and X2 and the sample correlation between X1 and the dependent variable Y is known (rx1y) as well as the sample correlation between X2 and Y (rx2y) and the sample correlation between the two independent variables (rx1x2). Further, we know the sample standard deviation for X1 (sx1), the sample standard deviation for X2 (sx2) and the sample standard deviation for Y (sy). In that case, we can obtain the regression coefficients as follows:

$b_{1} = \frac{ s_{y} (r_{x1y} - r_{x1x2}r_{x2y} ) }{s_{x1} (1 - r^{2}_{x1x2}) }$

$b_{2} = \frac{s_{y} (r_{x2y} - r_{x1x2} r_{x1y} ) }{s_{x2} (1 - r^{2}_{x1x2})}$

$b_{0} = \bar{y} - b_{1}\bar{x}_{1} - b_{2}\bar{x}_{2}$

Note that the slope coefficient (b1) not only depends on the correlation between Y and X1, but is also influenced by the correlation between the independent variables as well as the correlation between X2 and Y. If, for some reason, the correlation between the independent variables is equal to 1 in absolute value, the denominator (1 - r2x1x2) is zero and the coefficient estimators will be undefined. This will, however, rarely happen and likely will result only from poor model specification and violation of the fifth assumption of no direct linear relationship between the independent variables.

Lastly, it is important to be aware of the following. In multiple regression, the regression coefficients are conditional coefficients. That is, the estimated coefficient b1 depends on the other independent variables that are included in the model. The only exception to this rule is when two independent variables have a sample correlation of exactly zero. This is, however, a very unlikely event.
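The correlation-based formulas for b1, b2, and b0 above can be tried out on a small example. In this sketch the data are invented and noise-free: y is generated exactly as 1 + 2 x1 + 0.5 x2, so the formulas should recover these three values. All helper names are hypothetical.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def std(v):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def corr(u, v):
    """Sample correlation coefficient."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)
    return cov / (std(u) * std(v))

def two_predictor_coefficients(x1, x2, y):
    """Return (b0, b1, b2) from correlations and standard deviations."""
    r1y, r2y, r12 = corr(x1, y), corr(x2, y), corr(x1, x2)
    s1, s2, sy = std(x1), std(x2), std(y)
    b1 = sy * (r1y - r12 * r2y) / (s1 * (1 - r12 ** 2))
    b2 = sy * (r2y - r12 * r1y) / (s2 * (1 - r12 ** 2))
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2

# Invented data: the two predictors are correlated but not perfectly so
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 0.5 * b for a, b in zip(x1, x2)]
b0, b1, b2 = two_predictor_coefficients(x1, x2, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Note that the simple correlation between x1 and y alone would not give the coefficient 2; the correction for the correlation between x1 and x2 is what makes b1 a conditional coefficient.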

### How to compute the explanatory power of a multiple regression equation?

To explain the changes of a particular dependent variable, several independent variables are used in a multiple regression. The linear function of these independent variables partially explains the variability in the dependent variable. In this section, we develop a measure of the proportion of the variability in the dependent variable that can be explained by the multiple regression model. This procedure is very similar to the one used for a simple regression model.

Similar to a simple regression model, the model variability can be partitioned into two components: SST (total) = SSR (regression) + SSE (error). Here, SST refers to the sum of squares of the total sample variability, SSR refers to the sum of squares of the variability that is explained by the regression and SSE refers to the unexplained variability, that is the sum of squares of the variability of the error.

The coefficient of determination, R2, of the regression equation is, similar to before, defined as the proportion of the total sample variability that is explained by the regression. This coefficient is bounded between zero and one, in which higher numbers indicate a better regression model (more variability of the dependent variable that is explained by the regression model).

$R^{2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

Similar to before, one has to be careful in making inferences from this coefficient. The R2 can be large either because SSE is small (indicating that the points are close to the predicted points) or because SST is large, or both.

As with simple regression, we commonly do not know the population model errors. Therefore, an unbiased estimate of the error variance can be computed as follows:

$s^{2}_{e} = \frac{\sum^{n}_{i = 1} e^{2}_{i}}{n - K - 1} = \frac{SSE}{n - K - 1}$

where K is the number of independent variables in the regression model. By taking the square root of this unbiased estimate of the error variance, we obtain the standard error of the estimate.

For multiple regression, there is a potential problem with the use of this determination coefficient as an overall measure of the quality of a fitted equation. As additional independent variables are added to a multiple regression model, the explained sum of squares (SSR) will increase in essentially all applied situations, even if the additional independent variable is not an important predictor variable. Therefore, the R2 may increase spuriously after one or more predictor variables have been added to the multiple regression model. Under those circumstances, the increased value of the R2 is misleading. To overcome this problem, the adjusted coefficient of determination has been developed, which is defined as follows:

$\bar{R}^{2} = 1 - \frac{SSE/(n - K - 1)}{SST/(n - 1)}$

This adjusted coefficient of determination corrects for the fact that nonrelevant independent variables will result in a (small) reduction in the error sum of squares (SSE). Consequently, the adjusted coefficient of determination offers a better comparison between multiple regression models with different numbers of independent variables.
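The effect of the adjustment can be seen with a small numerical sketch. The sums of squares below are invented: adding a fourth, nearly irrelevant variable lowers SSE only slightly, from 40 to 39.5, with n = 20 and SST = 100.

```python
def r_squared(sse, sst):
    """Coefficient of determination."""
    return 1 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """Adjusted coefficient of determination with K = k predictors."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Invented sums of squares for two nested models
n, sst = 20, 100.0
r2_small = r_squared(40.0, sst)
r2_large = r_squared(39.5, sst)
adj_small = adjusted_r_squared(40.0, sst, n, k=3)
adj_large = adjusted_r_squared(39.5, sst, n, k=4)
print(f"K = 3: R2 = {r2_small:.3f}, adjusted R2 = {adj_small:.3f}")
print(f"K = 4: R2 = {r2_large:.3f}, adjusted R2 = {adj_large:.3f}")
```

In this example R2 rises with the extra variable while the adjusted value falls, which is the situation the adjusted coefficient is designed to reveal.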

Lastly, the coefficient of multiple correlation is a correlation coefficient that indicates the relationship between the predicted value and the observed value of the dependent variable. The coefficient of multiple correlation is defined as:

$R = r(\hat{y},y) = \sqrt{R^{2}}$

As you can see in the above equation, the coefficient of multiple correlation is equal to the square root of the multiple coefficient of determination. Hence, R can be used as another measure of the strength of the relationship between the dependent variable and the several independent variables. It is comparable to the correlation between Y and X in a simple regression equation.

### How to compute confidence intervals and hypothesis tests for individual regression coefficients?

In general, the confidence intervals and hypothesis tests depend on the variance of the coefficients and the probability distribution of the coefficients. The variance of a coefficient estimate is affected by: (1) the sample size; (2) the spread of the independent variables; (3) the correlations between the independent variables, and; (4) the model error term. A higher correlation between the independent variables increases the variance of the coefficient estimators. An important conclusion here is that the variance of the coefficient estimators is conditional on the entire set of the independent variables in the regression model (in addition to the coefficient estimators themselves).

If the assumptions for a standard multiple regression hold and the error terms are normally distributed, then the test statistic can be computed as follows:

$t_{bj} = \frac{b_{j} - \beta_{j}}{s_{b_{j}}}$

with j = 1, 2, ..., K (with K being the number of independent variables). This test statistic follows a Student's t distribution with (n - K - 1 ) degrees of freedom.

Next, the confidence intervals for the βj (for a two-tailed test) can be derived as follows:

$b_{j} - t_{n-K-1,\alpha/2} s_{b_{j}} < \beta_{j} < b_{j} + t_{n-K-1,\alpha/2} s_{b_{j}}$

The most commonly tested null hypothesis is: H0: βj = 0. This test is used to determine if a specific independent variable is conditionally important in a multiple regression model. Often, it is argued that if we cannot reject the conditional hypothesis that the coefficient is 0, then we have to conclude that the variable should not be included in the multiple regression model. Typically, the test statistic for a two-tailed hypothesis test is computed in most regression programs and printed next to the coefficient variance estimate. In addition, a p-value indicating the significance of the hypothesis test is usually included. Using this p-value allows one to conclude whether or not a particular predictor variable is conditionally significant given the other variables in the regression model. Note, however, that the preceding selection procedure ignores the type II error (that is: the population coefficient is not equal to 0, but we fail to reject the null hypothesis). This may occur, for example, because of a large error, a large correlation between independent variables, or both.

### How to compute confidence intervals and hypothesis tests for multiple regression coefficients?

In the previous section, it was shown how to formulate and test a conditional hypothesis to determine if a specific variable coefficient is conditionally significant in a regression model. Sometimes, however, researchers are interested in the effect of the combination of several variables. This issue will be discussed in the present section.

If the null hypothesis is that all regression coefficients are equal to 0 and this hypothesis is true, then the mean square regression

$MSR = \frac{SSR}{K}$

is also a measure of error with K degrees of freedom. From this, the following F ratio results:

$F = \frac{SSR/K}{SSE/(n - K - 1)} = \frac{MSR}{s^{2}_{e}}$

This ratio follows an F distribution with K degrees of freedom for the numerator and (n - K - 1) degrees of freedom for the denominator. If the null hypothesis is true, then both the numerator and denominator provide estimates of the population variance. Similar to before, the computed F value is compared with the critical F value from Appendix Table 9. If the computed F value exceeds the critical F value, then the null hypothesis can be rejected and it can be concluded that at least one coefficient is not equal to 0.
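As a brief illustration of this F ratio, the sketch below computes it from invented sums of squares and compares it with a critical value that the user looks up in an F table. The value 3.24 is the tabulated 5% critical value for 3 and 16 degrees of freedom; the function name and the other numbers are assumptions made for this example.

```python
def overall_f_statistic(ssr, sse, n, k):
    """F ratio for H0: all K slope coefficients are equal to 0."""
    msr = ssr / k                  # mean square regression
    s2_e = sse / (n - k - 1)       # error variance estimate
    return msr / s2_e

# Invented values: SSR = 60, SSE = 40, n = 20 observations, K = 3 predictors
f_stat = overall_f_statistic(ssr=60.0, sse=40.0, n=20, k=3)
f_crit = 3.24                      # from an F table, alpha = 0.05, df = (3, 16)
print(f"F = {f_stat:.2f}, reject H0: {f_stat > f_crit}")
```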

By now, we have developed hypothesis tests for individual regression parameters as well as for all regression parameters together. Next, we develop a hypothesis test for a subset of regression parameters. Suppose that the model contains the Xj variables together with an additional set of variables Zj, and that the null hypothesis states that the coefficients of all Zj variables are equal to 0. If this null hypothesis is true, it indicates that the Zj variables should not be included in the multiple regression model, because they provide no additional explanation regarding the changes of the dependent variable beyond what is already explained by the Xj variables.

### How to obtain predictions for multiple regressions?

One important application of regression models, whether simple or multiple, is to predict or forecast values of the dependent variable, given values for the independent variable(s). As we saw for simple regression in Chapter 11, the confidence interval includes the expected value of Y with probability 1 - α. The prediction interval, in contrast, includes individual predicted values (the expected value of Y plus the random error term). To obtain these intervals, we need to compute estimates of the standard deviations for the expected value of Y and for the individual points. In form, these computations are similar to those shown before. Yet, the estimator equations are much more complicated and beyond the scope of this book. Predicted values, confidence intervals, and prediction intervals are therefore typically computed directly in statistical software, for example, the Minitab regression routine.

### How to modify nonlinear regression models?

So far, we have discussed how regression analysis can be used to estimate linear relationships that predict or estimate a dependent variable as a function of one or more independent variables. Sometimes, however, the relationships between variables are not strictly linear. Therefore, in this section, several procedures are discussed that can be used for modifying certain nonlinear model formats such that multiple regression procedures can be applied. With careful manipulation of nonlinear models, it is possible to use least squares regression.

First, we consider the case of the quadratic function

$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X^{2}_{1} + \epsilon$

that can be transformed into a linear multiple regression model by defining the following new variables:

$z_{1} = x_{1}$

$z_{2} = x^{2}_{1}$

And then specifying the model as:

$y_{i} = \beta_{0} + \beta_{1}z_{1i} + \beta_{2}z_{2i} + \epsilon_{i}$

which is linear in the transformed variables. These transformed quadratic variables can be combined with other variables in a multiple regression model. Inference procedures for transformed variables are equal to those for linear models, which we have discussed before. The coefficients must be combined for interpretation. That is, if we have a quadratic model, then the effect of an independent variable X is indicated by the coefficients of both the linear and quadratic terms. Further, it can be tested whether the quadratic or the original linear model is a better fit for the data.
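The transformation can be sketched in a few lines: create z1 and z2 from x1 and then run an ordinary least squares fit on the new variables. The example below is an illustration under stated assumptions. The data are invented and noise-free (y = 1 + 2 x + 3 x²), and the helpers `solve` and `fit` are a small hand-written normal-equations solver, not the routine of any particular statistics package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(columns, y):
    """Least squares with intercept via the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Invented quadratic relationship without noise
x1 = [0, 1, 2, 3, 4]
y = [1 + 2 * v + 3 * v ** 2 for v in x1]
z1 = x1                          # z1 = x1
z2 = [v ** 2 for v in x1]        # z2 = x1 squared
b0, b1, b2 = fit([z1, z2], y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The fitted model is linear in z1 and z2, yet the effect of a one-unit change in x1 depends on both b1 and b2, which is why the two coefficients must be interpreted together.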

#### Logarithmic transformations

Coefficients for exponential models of the form

$Y = \beta_{0} X^{\beta_{1}}_{1} X^{\beta_{2}}_{2} \epsilon$

can be estimated by first taking the logarithm of both sides in order to obtain an equation that is linear in the logarithms of the variables. In formula, that is:

$\log(Y) = \log(\beta_{0}) + \beta_{1} \log(X_{1}) + \beta_{2} \log(X_{2}) + \log(\epsilon)$

Be aware of the fact that this estimation procedure requires that the random errors are multiplicative in the original exponential model. In other words, the error term is expressed as a percentage increase or decrease rather than by the addition or subtraction of a random error, as we have seen for linear regression models.
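
As a small sketch of the idea, the example below uses a one-variable version of the model, Y = β0·X^β1, with hypothetical noise-free data (β0 = 3, β1 = 0.5). The helper `simple_ols` is an illustrative name. Taking logarithms of both variables turns the problem into a simple linear regression.

```python
import math


def simple_ols(x, y):
    """Intercept and slope of a simple least squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical noise-free data from Y = 3 * X^0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * v ** 0.5 for v in x]

# Regress log(Y) on log(X); the intercept estimates log(beta_0)
intercept, b1 = simple_ols([math.log(v) for v in x], [math.log(v) for v in y])
b0 = math.exp(intercept)
print(round(b0, 4), round(b1, 4))
```

The slope of the log-log regression estimates the exponent directly, while the multiplicative constant is recovered by exponentiating the intercept.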

### How can regression models be used for dummy variables?

So far in our discussion of multiple regression models, we have assumed that the independent variables are fixed values and that these values exist over a range of many different values. It is, however, possible that the independent variable is a dummy variable with only two possible values: 0 and 1. Now, suppose that we have the following multiple linear regression model:

$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}$

When X2 = 0 in this model, then the constant is β0 but when X2 = 1, then the constant is β0 + β2. This shows that the dummy variable shifts the linear relationship between Y and X1 by the value of the coefficient β2. Dummy variables are therefore also called indicator variables.

In addition to their use for testing the intercept, dummy variables can also be used to test for differences in the slope coefficient. This is done by adding an interaction variable. First, the regression model should be expanded as follows:

$Y = \beta_{0} + \beta_{2}X_{2} + (\beta_{1} + \beta_{3}X_{2}) X_{1}$

From this model it can be seen that the slope coefficient of X1 contains two components: β1 and β3X2. When X2 equals 0, then the slope is the usual β1. Yet, when X2 = 1, then the slope is equal to the algebraic sum of β1 + β3. To estimate this model, we need to use the sample estimation model, which is defined by:

$\hat{y} = b_{0} + b_{2}x_{2} + b_{1}x_{1} + b_{3}x_{2}x_{1}$

The resulting regression model is now linear with three variables. The new variable (x1x2) is called the interaction variable. When the dummy variable x2 = 0, then this interaction variable is also zero. But when x2 = 1, then the interaction variable has a value of x1. The coefficient b3 is an estimate of the difference in the coefficient of x1 when x2 = 1 compared to when x2 = 0. The Student's t statistic for β3 can be used to test the following hypothesis:

$H_{0}: \beta_{3} = 0 | \beta_{1} \neq 0, \beta_{2} \neq 0$

$H_{1}: \beta_{3} \neq 0 | \beta_{1} \neq 0, \beta_{2} \neq 0$

If the null hypothesis is rejected, it can be concluded that there is a difference in the slope coefficient for the two subgroups.
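
The meaning of the four coefficients can be sketched with hypothetical data. Because the model contains both the dummy and its interaction with x1, the least squares coefficient estimates coincide with those obtained by fitting a separate line to each subgroup, which is the shortcut used here (the helper `fit_line` and the data are assumptions for illustration; the t test for β3 is not computed).

```python
def fit_line(x, y):
    """Intercept and slope of a simple least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


# Hypothetical noise-free data: group x2 = 0 follows y = 1 + 2x,
# group x2 = 1 follows y = 4 + 3.5x
x_group0 = [1.0, 2.0, 3.0, 4.0]
y_group0 = [1 + 2 * x for x in x_group0]
x_group1 = [1.0, 2.0, 3.0, 4.0]
y_group1 = [4 + 3.5 * x for x in x_group1]

intercept0, slope0 = fit_line(x_group0, y_group0)
intercept1, slope1 = fit_line(x_group1, y_group1)

b0, b1 = intercept0, slope0      # intercept and slope when x2 = 0
b2 = intercept1 - intercept0     # shift in the intercept when x2 = 1
b3 = slope1 - slope0             # shift in the slope when x2 = 1
print(b0, b1, b2, b3)
```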

In the previous chapter, simple regression was introduced. Simple regression is a procedure for obtaining a linear equation that predicts a dependent (endogenous) variable as a function of a single independent (exogenous) variable. In practice, however, it is often the case that multiple independent variables jointly affect a dependent variable. Therefore, in this chapter, multiple regression is discussed, which is a procedure for obtaining a linear equation that predicts a dependent (endogenous) variable as a function of multiple independent (exogenous) variables.

## What other topics are important in regression analysis? - Chapter 13

Generally, the aim of a regression analysis is to use information about the independent variable(s) to explain the behavior of the dependent variable and to derive predictions concerning this dependent variable. Further, the model coefficients can also be used to estimate the rate of change of the dependent variable as the result of changes in an independent variable, conditional on a particular set of other independent variables included in the model remaining fixed. In this chapter, we will discuss a set of alternative specifications. Moreover, we will consider situations in which the basic regression assumptions are violated.

### What are the four phases of model building?

We live in a complex world and no one really believes that we can exactly capture the complexities of economics and business behavior in one or more equations. However, as the famous statistician George Box once said: "All models are wrong, but some are useful". The art of model building recognizes the impossibility of representing all the many individual influences on a dependent variable and tries to select the most influential ones. This process of model building is problem specific. That means that it depends on what is known about the behavior of the variables under study and what data are available.

Model building consists of four stages: (1) model specification; (2) coefficient estimation; (3) model verification, and; (4) interpretation and inference.

The first stage, model specification, comprises the selection of the dependent and independent variables as well as the selection of the algebraic form of the model. In doing so, a specification is sought that provides an adequate representation of the system and process of interest. Theory and accumulated research experience provide the context for the model. The literature should be studied carefully and experts should be consulted. In fact, it may be necessary to do additional research and perhaps involve others who have important insights.

Once the model is specified, it typically involves several unknown coefficients or parameters. In the second stage, therefore, these coefficients or parameters are estimated using sample data. In doing so, confidence intervals are built around the estimates.

In the third stage, the model is verified. In fact, simplifications and assumptions likely occur when translating insights from the model specification into algebraic forms and when selecting data for model estimation. Since some of these simplifications and assumptions might prove untenable, it is important to check the adequacy of the model. More specifically, after estimating a regression equation, we may find that the estimates do not make sense, given what we know about the process. If this is the case, then it is necessary to examine the assumptions, model specification and data. This may lead us to consider a different model specification. Therefore, a "feedback" loop is included in this four-stage procedure, with an arrow from the third stage back to the first stage.

In the fourth and final stage, the model is interpreted and inferences about the population are drawn. Here, it should be recognized that there is always the danger of making wrong conclusions. More specifically, the more severe any specification or estimation errors, the less reliable inferences derived from the estimated model are.

### How can dummy variables be used in experimental design models?

For years, experimental design procedures have been a major area of statistical research and practice. In such experimental design models, dummy variable regression offers a useful tool. Consider, for instance, an experiment with a single outcome variable that contains all the random error. Each experimental outcome is then measured at discrete combinations of the experimental (independent) variables Xj.

Experimental designs differ in a very important way from most of the problems we have considered so far. More precisely, the aim of an experimental design is to identify causes for the changes in the dependent variable. It is thus strongly aimed at the causal relationship, rather than simply identifying any relationship between dependent and independent variables. In doing so, it is important to choose experimental points, defined by independent variables, that provide minimum variance estimators. The order in which the experiments are performed is chosen randomly in order to avoid biases from variables that are not included in the experiment.

In an experimental design, the experimental outcome (Y) is measured at specific combinations of levels for treatment and blocking variables. A treatment variable is a variable whose effect we are interested in estimating with minimum variance. For instance, we may desire to know which of five different production machines provides the highest productivity per hour. For this example, the treatment variable is the production machine, a five-level categorical variable that is represented by four dummy variables. Second, a blocking variable is a variable that is part of the environment. Therefore, the level of such a variable cannot be preselected. However, we still want to include the level of the blocking variable in the model so that we can remove the variability in the outcome variable (Y) that is due to different levels of the blocking variable. A treatment or blocking variable with K levels can be represented by K - 1 dummy variables.
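
The K - 1 rule can be sketched as follows for the five-machine example. The machine labels, the choice of machine "A" as the reference level, and the helper `dummy_code` are assumptions made for illustration.

```python
def dummy_code(values, reference):
    """Return the non-reference levels and one row of 0/1 dummies per observation."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical observations, each produced on one of five machines
machines = ["A", "B", "C", "D", "E", "C", "A"]
levels, rows = dummy_code(machines, reference="A")

print(levels)          # the four dummy variables, one per non-reference machine
for machine, row in zip(machines, rows):
    print(machine, row)
```

The reference machine is the one for which all four dummies are zero, so each dummy coefficient measures the difference between that machine and the reference.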

### What is a lagged value?

When time series are analyzed (i.e., when measurements are taken over time), lagged values of the dependent variable are an important issue. Often in time series data, the dependent variable in time period t is related to the value taken by this dependent variable in an earlier time period, that is $y_{t-1}$. The lagged value then is the value of the dependent variable in this previous time period.
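
As a brief sketch with a hypothetical series, the lagged variable is simply the original series shifted by one period. The first period has no lagged value and therefore drops out of any regression that uses the lag.

```python
# Hypothetical time series y_1, ..., y_5
series = [10.0, 12.0, 11.0, 15.0, 14.0]

# Lagged series: the value in period t is y_{t-1}; period 1 has no predecessor
lagged = [None] + series[:-1]

# Pairs (y_t, y_{t-1}) that remain usable as regression observations
pairs = [(y_t, y_prev) for y_t, y_prev in zip(series, lagged) if y_prev is not None]
print(pairs)
```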

### What is meant by specification bias?

It is a delicate and difficult task to adequately specify a statistical model. Substantial divergence of the model from reality can lead to conclusions that are seriously in error. In formulating a regression model, we implicitly assume that the set of independent variables contains all the quantities that significantly influence the changes of the dependent variable. In reality, however, there are likely to be additional variables that also influence the dependent variable. The joint influence of these factors is captured with the error term. However, a serious problem may occur if an important variable is omitted from the list of independent variables. That is, when important predictor variables are omitted from the model, the least squares estimates of the coefficients in the model are usually biased and the usual inferential statements from the hypothesis tests or confidence intervals can be seriously misleading. Moreover, the effect of the missing variables is instead captured in the error term, which, therefore, is larger. Only in the very rare case in which the omitted variables are completely uncorrelated with the other independent variables, this bias in the estimation of coefficients does not occur.

### What is multicollinearity?

If a linear regression model is correctly specified and all assumptions are met, then the least squares estimates are the best that can be achieved. Sometimes, however, the model is not correctly specified or not all assumptions are met. Suppose data from a competitive product market are used to estimate the relationship between quantity sold and price when the competitor's price is also included. Because both competitors operate in the same market, each tends to adjust its price when the other makes a price adjustment, so the two price variables move closely together. In statistical terms, this example illustrates the situation in which the estimated coefficients are not statistically significant and, therefore, could be misleading even when the actual effect of the independent variable on the dependent variable is rather strong. This example refers to multicollinearity, which is a state of very high intercorrelations among the independent variables. It is a type of disturbance in the data. If multicollinearity is present in the data, statistical inferences about the population may not be reliable.

There are a number of indicators of multicollinearity. The first indicator is that regression coefficients differ substantially from the values indicated by theory or experience, including having the incorrect sign. The second indicator is that coefficients of variables believed to be a strong influence have small Student's t statistics, indicating that their values do not differ from 0. The third indicator is that all the coefficient Student's t statistics are small, indicating no individual effect, and yet the overall F statistic indicates a strong effect for the total regression model. The fourth and last indicator is high correlations between individual independent variables, a strong linear association between one or more of the independent variables and the other independent variables, or a combination of both.

There are three approaches that can be used to correct for multicollinearity. First, remove one or more of the highly correlated independent variables. However, be aware that this might lead to a bias in coefficient estimation. Second, change the model specification, including possibly a new independent variable that is a function of several correlated independent variables. And third, obtain additional data that do not have the same strong correlations between the independent variables.
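
The fourth indicator, a high correlation between individual independent variables, is easy to check. The sketch below uses hypothetical own and competitor prices that move closely together; the helper `correlation` is an illustrative name.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient between two variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical prices: the competitor follows our price almost one for one
own_price = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
competitor_price = [10.5, 11.2, 12.4, 13.1, 14.3, 15.0]

r = correlation(own_price, competitor_price)
print(round(r, 3))
```

A correlation this close to 1 suggests that the data contain little independent information about the separate effects of the two prices.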

### What is heteroscedasticity?

Earlier, we discussed the several assumptions for linear regression analysis and the least squares method. When these assumptions are met, least squares regression offers a powerful set of statistical tools. Yet, when one or more of these assumptions are violated, the estimated regression coefficients can be inefficient. And, more importantly, the inferences drawn from this can be wrong and misleading.

In this and the next section, we discuss the violation of two of these assumptions. First, in this section, we will discuss the violation of uniform variances. Then, in the next section, we will discuss the violation of uncorrelated error terms.

In real applications, it is not so unlikely that the assumption of uniform variances is violated. For instance, suppose we are interested in the factors affecting output from a particular industry. To examine this, data are collected from several different firms. Both output measures and likely predictors are assessed. If these firms have different sizes, then the total output will vary. In addition, it is likely that the larger firms will also have a higher variance in their output measure compared to the smaller firms. This is due to the fact that there are simply more factors affecting the error terms in a large firm than there are in a small firm. Therefore, the error terms are expected to be larger in both positive and negative terms.

Models in which the error terms do not have uniform (i.e., equal) variance are said to show heteroscedasticity. If, on the other hand, the error terms do have uniform variance, the model is said to show homoscedasticity. If heteroscedasticity is present (and thus the assumption of uniform variance is violated), the least squares regression procedure for estimating the regression coefficients is not the most efficient procedure. In addition, the standard procedures for hypothesis testing and deriving confidence intervals are no longer valid.

It is therefore important to test for possible heteroscedasticity. There are several procedures for this. Many common procedures check the assumption of constant error variances against a plausible alternative. It may be found that the size of the error variance is directly related to one of the independent (predictor) variables. Another possibility is that the variances increase with the expected value of the dependent variable. A useful tool for checking heteroscedasticity is to examine graphs, for instance a scatter plot of the residuals versus the independent variables and versus the predicted values from the regression. If the dots are evenly distributed around the horizontal line, there is no (sufficient) evidence for heteroscedasticity. If, on the other hand, the magnitude of the error terms tends to increase (or decrease) with increasing values of the independent variable, this is an indication of heteroscedasticity. Another, more formal, procedure tests the null hypothesis that the error terms all have the same variance against the alternative hypothesis that their variances depend on the expected values. This is done with an auxiliary regression in which the dependent variable is the square of the residuals (i.e., $e^{2}_{i}$) and the independent variable is the predicted value ($\hat{y}_{i}$):

$e^{2}_{i} = a_{0} + a_{1} \hat{y}_{i}$

Now, let $R^{2}$ be the coefficient of determination for this auxiliary regression. Using a significance level of α, the null hypothesis is rejected if $nR^{2}$ is larger than $\chi^{2}_{1, \alpha}$, which is the critical value of the chi-square random variable with 1 degree of freedom and probability of error α, and where n is the sample size.
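
The test can be sketched as follows. The residuals and predicted values are hypothetical (the residuals are constructed to grow with the predicted value), and the critical value 3.841 is the chi-square value for 1 degree of freedom at α = 0.05. For a regression with one independent variable, the coefficient of determination equals the squared correlation, which is what the illustrative helper `r_squared` computes.

```python
def r_squared(x, y):
    """Coefficient of determination of a simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical predicted values and residuals whose size grows with the prediction
predicted = [float(v) for v in range(1, 11)]
residuals = [0.1 * v * (-1) ** i for i, v in enumerate(predicted)]

# Auxiliary regression: squared residuals on predicted values
squared_residuals = [e ** 2 for e in residuals]
n = len(residuals)
r2 = r_squared(predicted, squared_residuals)
statistic = n * r2

critical_value = 3.841   # chi-square, 1 degree of freedom, alpha = 0.05
reject_h0 = statistic > critical_value
print(round(statistic, 3), reject_h0)
```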

### What is the influence of autocorrelated errors?

In this section, we discuss the violation of the assumption of uncorrelated error terms. What is the effect on the regression model if the error terms appear to be correlated from one observation to another? Until this point, we have assumed that the random errors for our model are independent. This may, however, not be the case. Especially in time-series data, the random errors in a model often depend on each other. The behavior of many of the factors under study is often quite similar over several time periods, yielding a high correlation over time. These correlations between error terms from adjacent time periods are very common in models constructed using time-series data. Therefore, it is important in regression models with time-series data to test the hypothesis that the error terms are not correlated with each other. Correlations between first-order error terms through time are called autocorrelated errors. Consider the following equation:

$Corr(\epsilon_{t}, \epsilon_{t-1}) = \rho$

where ρ is the correlation coefficient (range -1 to +1) between the error in time t and the error in the previous time point, that is t - 1. If ρ = 0, this implies that there is no autocorrelation. Values around ρ = 0.3 indicate relatively weak autocorrelations. Values around ρ = 0.90 indicate a quite strong autocorrelation. For errors that are separated by l periods, the autocorrelation can be modeled as follows:

$Corr(\epsilon_{t}, \epsilon_{t-l}) = \rho^{l}$

From this, it can be seen that the correlation decays rapidly as the number of periods of separation grows. In other words, the correlation between errors that are far apart in time is relatively weak, whilst the correlation between errors that are closer to one another in time is possibly quite strong. Now, if we assume that the errors all have the same variance, it is possible to show that the autocorrelation structure is equal to the following model:

$\epsilon_{t} = \rho \epsilon_{t - 1} + u_{t}$

where the random variable $u_{t}$ has mean 0 and constant variance $\sigma^{2}$ and is not autocorrelated. This model is also called the first-order autoregressive model of autocorrelated behavior. Taking a closer look at this equation, it can be seen that the value of the error at time t depends on its value in the previous time point (the strength of that dependence is determined by the correlation coefficient ρ) and on a second random error term $u_{t}$.

The most frequently used test statistic to test for autocorrelation is the Durbin-Watson test, denoted by d. In this test, the null hypothesis is formulated as follows: H0: ρ = 0. This can be tested against the alternative hypothesis: H1: ρ > 0. The test statistic d is calculated as follows:

$d = \frac{ \sum^{n}_{t = 2} (e_{t} - e_{t-1})^{2} }{\sum^{n}_{t=1} e^{2}_{t}}$

where the $e_{t}$ are the residuals when the regression equation is estimated by least squares. The decision rules are as follows: Reject H0 if d < dL. Accept H0 if d > dU. Test inconclusive if dL < d < dU. In this, dL and dU are tabulated for values of n and K and for significance levels of 1% and 5% in Appendix Table 12.

Sometimes, we want to test the null hypothesis against the alternative hypothesis H1: ρ < 0. In that case, the decision rules are as follows: Reject H0 if d > 4 - dL. Accept H0 if d < 4 - dU. Test inconclusive if 4 - dL > d > 4 - dU.

Finally, there is a simple procedure to estimate the serial correlation, that is:

$r = 1 - \frac{d}{2}$
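
A minimal sketch of both formulas, using a hypothetical series of least squares residuals (the helper `durbin_watson` is an illustrative name):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic d for a sequence of residuals e_1, ..., e_n."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly, suggesting positive autocorrelation
residuals = [1.0, 2.0, 1.0, -1.0, -2.0, -1.0, 1.0, 2.0]

d = durbin_watson(residuals)
r = 1 - d / 2          # simple estimate of the serial correlation
print(round(d, 4), round(r, 4))
```

A value of d well below 2 corresponds to a positive estimate r. Whether H0 is rejected still depends on the tabulated bounds dL and dU for the given n and K.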

## How to analyze categorical data? - Chapter 14

Do customers have a preference for a particular McDonald's burger? Do people's preferences for a certain political candidate depend on characteristics such as age, gender or country of origin? Do students at a particular university have a preference for one of the three statistics teachers? These questions are only a few examples of the types of questions that we will address in this chapter. More specifically, in this chapter, the topic of nonparametric tests is discussed. Nonparametric tests are often the appropriate procedure to draw statistical inferences about qualitative (i.e., nominal or ordinal) data or about numerical data for which normality of the population distribution cannot be assumed.

### What test should be conducted when data are generated by a fully specified probability distribution?

First, let us consider the situation in which data are generated by a fully specified probability distribution. The most straightforward test of this type is the goodness-of-fit test. In this test, the null hypothesis about the population specifies the probabilities that a sample observation will fall into each possible category. Next, the sample observations themselves are used to check this hypothesis. If the null hypothesis is true, the observed numbers in each category should be close in value to the expected numbers in each category. In that case, the data are said to provide a close fit to the assumed population distribution of probabilities.

To test this hypothesis, the observed numbers (Oi) are compared to the expected numbers (Ei) in each category using the following decision rule:

$\text{Reject } H_{0} \text{ if } \sum^{K}_{i = 1} \frac{ (O_{i} - E_{i} )^{2}}{E_{i}} > \chi^{2}_{K-1, \alpha}$

where $\chi^{2}_{K-1, \alpha}$ is the number for which

$P(\chi^{2}_{K - 1} > \chi^{2}_{K - 1, \alpha}) = \alpha$

and the random variable $\chi^{2}_{K-1}$ follows a chi-square distribution with K - 1 degrees of freedom. Here, K is the number of categories of the variable. Note that for this hypothesis test to be valid, the sample size must be large enough, with at least five expected observations in each cell.

Suppose that we are interested in the preference university students have for one of the three statistics teachers at the faculty. The null hypothesis is that the students do not have a particular preference, and thus, that the probability for each of the three teachers is equal (that is 1/3). We obtain the following data:

| Category | Teacher A | Teacher B | Teacher C | Total |
|---|---|---|---|---|
| Observed number of objects | 75 | 110 | 115 | 300 |
| Probability (under H0) | 1/3 | 1/3 | 1/3 | 1 |
| Expected number of objects (under H0) | 100 | 100 | 100 | 300 |

To test the null hypothesis, we first need to compute the test statistic. This is done as follows:

$\chi^{2} = \sum^{3}_{i = 1} \frac{(O_{i} - E_{i})^{2} }{E_{i}} = \frac{ (75 - 100)^{2} }{100} + \frac{ (110 - 100)^{2} }{100} + \frac{ (115 - 100)^{2} }{100} = 9.50$

Because there are three categories (teacher A, teacher B, and teacher C), there are K - 1 = 2 degrees of freedom. The associated critical value for a test at the 1% significance level is $\chi^{2}_{2, 0.01} = 9.210$. The test statistic (9.50) exceeds this critical value, so according to the decision rule the null hypothesis can be rejected at the 1% significance level. The data therefore provide strong evidence against the hypothesis that the teachers are equally likely to be preferred by the university students.
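
The teacher example can be reproduced in a few lines. The variable names are illustrative; the counts and the critical value 9.210 are those of the example.

```python
# Observed counts for teachers A, B and C, and equal probabilities under H0
observed = [75, 110, 115]
probabilities = [1 / 3, 1 / 3, 1 / 3]

total = sum(observed)
expected = [total * p for p in probabilities]

# Chi-square goodness-of-fit statistic
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

degrees_of_freedom = len(observed) - 1
critical_value = 9.210   # chi-square, 2 degrees of freedom, alpha = 0.01
reject_h0 = statistic > critical_value
print(round(statistic, 2), degrees_of_freedom, reject_h0)
```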

### How to apply goodness-of-fit tests when the population parameters are unknown?

In the previous section, we assumed that the data were generated by a fully specified probability distribution. In doing so, the null hypothesis in such a test specifies the probability that a sample observation will fall in any category. However, it is often needed to test the hypothesis that the data are generated by a particular distribution, such as the binomial or the Poisson distribution, without assuming the parameters of that distribution to be known. If the population parameters are unknown, the appropriate goodness-of-fit test with estimated population parameters is similar to the one developed in the previous section, except that the number of degrees of freedom for the chi-square random variable is (K - m - 1) where K is the number of categories and m is the number of unknown population parameters.

#### Test for Poisson distribution

Suppose we are testing whether the data are generated by the Poisson distribution. The following frequencies are observed:

| Number of occurrences | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| Observed frequency | 156 | 63 | 29 | 14 |

Now, recall that if the Poisson distribution is appropriate, the probability of x occurrences is:

$P(x) = \frac{e^{-\lambda} \lambda^{x} }{x!}$

where λ is the mean number of occurrences. Even though the population mean is unknown, we can estimate it by considering the sample mean, which is 0.66. From this, it follows that we can estimate the probability for any number of occurrences under the null hypothesis that the population distribution is Poisson. For instance, the probability of 2 occurrences is computed as follows:

$P(2) = \frac{e^{-0.66} (0.66)^{2} }{2!} = \frac{(0.5169)(0.66)^{2}}{2} = 0.1126$

Doing this for all possible number of occurrences yields the following results:

| Number of occurrences | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| Observed frequency | 156 | 63 | 29 | 14 |
| Expected frequency under H0 | 135.4 | 89.4 | 29.5 | 7.7 |

These observed and expected frequencies can be used the same way as before to compute the test statistic for testing the null hypothesis that the population distribution is Poisson.
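
The expected frequencies and the resulting test statistic can be sketched as follows, using the estimated mean 0.66 from the example. The variable names are illustrative. The degrees of freedom are K - m - 1 = 4 - 1 - 1 = 2, because one parameter (λ) was estimated from the data.

```python
import math

observed = [156, 63, 29, 14]     # 0, 1, 2 and 3+ occurrences
lam = 0.66                       # sample mean used as the estimate of lambda
n = sum(observed)

# Poisson probabilities for 0, 1 and 2 occurrences; "3+" takes the remainder
probabilities = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(3)]
probabilities.append(1 - sum(probabilities))

expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

estimated_parameters = 1
degrees_of_freedom = len(observed) - estimated_parameters - 1
print([round(e, 1) for e in expected], round(statistic, 2), degrees_of_freedom)
```

The statistic is then compared with the chi-square critical value for 2 degrees of freedom at the chosen significance level (for example 9.210 at the 1% level).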

#### Test for normal distribution

Now, suppose that we are testing whether the population is normally distributed. For this, we can use the Jarque-Bera test for normality, which can be computed as follows:

$JB = n [ \frac{(skewness)^{2}}{6} + \frac{ (kurtosis - 3)^{2}}{24} ]$

where the population skewness is estimated by

$Skewness = \frac{ \sum^{n}_{i=1} (x_{i} - \bar{x})^{3} }{ns^{3}}$

and the population kurtosis is estimated by

$kurtosis = \frac{ \sum^{n}_{i = 1} (x_{i} - \bar{x})^{4} }{ns^{4}}$

Often, skewness and kurtosis are already included in the standard output of most statistical software packages. If the number of sample observations becomes very large, the JB statistic is known to have (under the null hypothesis that the population distribution is normal) a chi-square distribution with 2 degrees of freedom. Similar to all other hypothesis tests, the null hypothesis is rejected for large values of the test statistic.
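
A minimal sketch of the three formulas is given below. It assumes that s is the sample standard deviation with divisor n - 1 (as returned by `statistics.stdev`), and it uses a tiny hypothetical sample only to show the arithmetic; the chi-square approximation itself requires a very large n.

```python
import statistics


def jarque_bera(data):
    """Jarque-Bera statistic from the skewness and kurtosis estimates above."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)   # assumption: sample standard deviation (n - 1)
    skewness = sum((x - mean) ** 3 for x in data) / (n * s ** 3)
    kurtosis = sum((x - mean) ** 4 for x in data) / (n * s ** 4)
    jb = n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)
    return skewness, kurtosis, jb


# Hypothetical, perfectly symmetric sample
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
skewness, kurtosis, jb = jarque_bera(sample)
print(round(skewness, 4), round(kurtosis, 4), round(jb, 4))
```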

### Which test to use for nonparametric testing with paired or matched samples?

#### The Sign Test

The most commonly used nonparametric test when analyzing data from paired or matched samples is the sign test. The sign test is used, for example, in market research to determine whether consumers prefer one of two products. Because the consumers only name their preference, the data are nominal and lend themselves to nonparametric procedures. In addition, the sign test is also useful for testing the median of a population.

The null hypothesis of the sign test is formulated as follows: H0: P = 0.5. Here, P is the proportion of nonzero observations in the population that are positive. The test statistic S for the sign test is simply the number of pairs with a positive difference, where S has a binomial distribution with P = 0.5 and n = the number of nonzero differences. This value can be tested against the cumulative binomial probability for that value, which can be found in Appendix Table 3.
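
Instead of a table lookup, the cumulative binomial probability can be computed directly. The sketch below assumes hypothetical paired differences and a lower-tail alternative (P < 0.5); the helper `binomial_cdf` is an illustrative name.

```python
import math


def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for a binomial random variable with n trials and probability p."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical paired differences; zeros are discarded
differences = [-2, -1, 3, 0, -4, -1, 2, -3, -2, 0]
nonzero = [d for d in differences if d != 0]

n = len(nonzero)
s = sum(1 for d in nonzero if d > 0)      # number of positive differences

# Lower-tail p-value: probability of S or fewer positive signs under H0
p_value = binomial_cdf(s, n)
print(n, s, round(p_value, 4))
```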

#### The Wilcoxon Signed Rank Test

One disadvantage of the sign test is that it takes only a very limited amount of information into consideration. Namely, it only considers the signs of the differences and ignores the strength of the preferences. When the sample size is small, the sign test may therefore not be the most powerful tool. Instead, the Wilcoxon Signed Rank Test can be used. This test provides a method for incorporating the information about the magnitude of the differences between matched pairs. In doing so, it still is a distribution-free test. It is, however, based on ranks of the observations. First, the pairs for which the difference is 0 are discarded. Then, the absolute differences of the remaining pairs are ranked in ascending order, with ties assigned the average of the ranks they occupy. Next, the sums of the ranks corresponding to positive and to negative differences are computed, and the smaller of these sums is the Wilcoxon signed rank statistic T. In formula, that is:

$T = min(T_{+},T_{-})$

where T+ is the sum of the positive ranks, T- is the sum of the negative ranks and n is the number of nonzero differences. The null hypothesis then is rejected if T is less than or equal to the value in Appendix Table 10.
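
The ranking steps can be sketched as follows with hypothetical paired differences. The helper `average_ranks` is an illustrative name; it assigns tied values the average of the ranks they occupy.

```python
def average_ranks(values):
    """Ranks in ascending order; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1       # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


# Hypothetical paired differences; the zero difference is discarded
differences = [2, -1, 3, 0, -1, 4, 5, -2]
nonzero = [d for d in differences if d != 0]

ranks = average_ranks([abs(d) for d in nonzero])
t_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
t_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
t = min(t_plus, t_minus)
print(len(nonzero), t_plus, t_minus, t)
```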

#### Normal approximation to the Sign Test

Thanks to the central limit theorem, the normal distribution can be used to approximate the binomial distribution if the sample size is large enough. Yet, there is no consensus on the definition of "large". A commonly made suggestion is to use the normal approximation if the sample size exceeds 20. By using a continuity correction factor in the test statistic, we can compensate for approximating discrete data with a continuous distribution and thereby obtain a closer approximation to the p-value. Based on the normal approximation of a binomial distribution, we derive the mean and standard deviation as follows:

$\mu = np = 0.5n$

$\sigma = \sqrt{np(1 - p)} = \sqrt{0.25n} = 0.5 \sqrt{n}$

These can be used to obtain the test statistic as follows:

$Z = \frac{S^{*} - \mu}{\sigma} = \frac{S^{*} - 0.5n}{0.5 \sqrt{n}}$

where S* is the test statistic corrected for continuity, defined as follows:

• For a two-tail test: S* = S + 0.5 (if S < μ) or S* = S - 0.5 (if S > μ)
• For an upper-tail test: S* = S - 0.5
• For a lower-tail test: S* = S + 0.5
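
A short sketch of the upper-tail case, with hypothetical values n = 100 and S = 60. The p-value uses `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

n = 100     # hypothetical number of nonzero differences
s = 60      # hypothetical number of positive differences

mu = 0.5 * n
sigma = 0.5 * math.sqrt(n)

# Upper-tail test: continuity-corrected statistic S* = S - 0.5
s_star = s - 0.5
z = (s_star - mu) / sigma

# Upper-tail p-value from the standard normal distribution
p_value = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_value, 4))
```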

#### Normal approximation to the Wilcoxon Signed Rank Test

Similar to the above section, when the number n of nonzero differences in the sample is large (i.e., n > 20), the normal distribution provides a good approximation of the distribution of the Wilcoxon Signed Rank test statistic T under the null hypothesis that the population differences are centered on zero. More specifically, under this null hypothesis, the Wilcoxon signed rank statistic has mean and variance given by:

$E(T) = \mu_{T} = \frac{n(n + 1)}{4}$

$Var(T) = \sigma^{2}_{T} = \frac{n(n + 1)(2n + 1)}{24}$

Using this information, for large n, the distribution of the random variable Z is approximately normal and can be computed as follows:

$Z = \frac{T - \mu_{T}}{\sigma_{T}}$

This test value can subsequently be compared to the critical value of the standard normal distribution that corresponds with the significance level that is used for the hypothesis test.
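
As a brief numerical sketch, assume hypothetical values n = 25 and T = 100:

```python
import math

n = 25       # hypothetical number of nonzero differences
t = 100.0    # hypothetical Wilcoxon signed rank statistic

mu_t = n * (n + 1) / 4
var_t = n * (n + 1) * (2 * n + 1) / 24
z = (t - mu_t) / math.sqrt(var_t)
print(mu_t, var_t, round(z, 4))
```

Because T is the smaller of the two rank sums, Z is negative or zero, and the null hypothesis is rejected when Z falls far enough into the lower tail.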

### What nonparametric tests can be used for independent random samples?

In the previous section, we considered nonparametric tests for matched pairs or dependent samples. In this section, we move on to nonparametric tests for independent random samples. In doing so, two tests are introduced: the Mann-Whitney U test and the Wilcoxon Rank Sum Test.

#### Mann-Whitney U Test

As the number of sample observations increases, the distribution of the Mann-Whitney U statistic rapidly approaches the normal distribution. The approximation requires that each sample consists of at least ten observations in order to be (somewhat) adequate. In other words, it is required that n1 ≥ 10 and n2 ≥ 10. Further, to test the null hypothesis that the central locations of the two population distributions are equal, it is assumed that, apart from any possible differences in central location, the two population distributions are identical. The two samples are pooled and the observations are ranked in ascending order. The Mann-Whitney U test statistic can then be defined as follows:

$U = n_{1}n_{2} + \frac{n_{1}(n_{1} + 1)}{2} - R_{1}$

where R1 denotes the sum of the ranks of the observations from the first population.

Further, the Mann-Whitney U has the following mean and variance:

$E(U) = \mu_{U} = \frac{n_{1}n_{2}}{2}$

$Var(U) = \sigma^{2}_{U} = \frac{ n_{1}n_{2} (n_{1} + n_{2} + 1)}{12}$

Assuming that each sample consists of at least ten observations, we can find the test statistic as follows:

$Z = \frac{U - \mu_{U}}{\sigma_{U}}$

This test statistic is approximated by the normal distribution and can be contrasted against the critical value found in the Table for the standard normal distribution.
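The computation can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, ties receive average ranks, and no tie correction is applied to the variance.

```python
import math

def average_ranks(values):
    """Ascending ranks (1 = smallest); tied values share the average rank."""
    sorted_vals = sorted(values)
    # first 1-based position of v, plus half the extra positions its ties occupy
    return [sorted_vals.index(v) + 1 + (sorted_vals.count(v) - 1) / 2
            for v in values]

def mann_whitney_z(sample1, sample2):
    """Return (U, Z) using the normal approximation described above."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])                       # rank sum of the first sample
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 * (n1 + n2 + 1) / 12
    return u, (u - mean_u) / math.sqrt(var_u)
```

If every observation in the first sample is smaller than every observation in the second, R1 takes its smallest possible value and U equals n1 n2, its maximum.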

#### Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is quite similar to the Mann-Whitney U test and may possibly even lead to the same results. Sometimes this test may be preferred because of its ease. Similar to before, the test approaches the normal distribution rapidly as the number of sample observations increases. A sample size of at least ten observations in each sample is required for an adequate approximation. Assuming that the null hypothesis is true, the Wilcoxon Rank Sum Test statistic T has mean and variance:

$E(T) = \mu_{T} = \frac{n_{1} (n_{1} + n_{2} + 1 ) }{2}$

$Var(T) = \sigma^{2}_{T} = \frac{ n_{1}n_{2} ( n_{1} + n_{2} + 1 ) }{12}$

Then, for large samples, the distribution of the random variable

$Z = \frac{T - \mu_{T}}{\sigma_{T}}$

can be approximated by the normal distribution. Note, however, that for a large number of ties, the equation of the variance may not be correct.
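To see why the two tests lead to the same conclusion, assume (as is standard, although it is not stated above) that T is the sum of the ranks of the observations from the first sample, so that T = R1. The Mann-Whitney statistic is then a linear function of T, and a short derivation recovers the mean and variance given above:

$U = n_{1}n_{2} + \frac{n_{1}(n_{1} + 1)}{2} - T$

$E(T) = n_{1}n_{2} + \frac{n_{1}(n_{1} + 1)}{2} - E(U) = \frac{n_{1}n_{2}}{2} + \frac{n_{1}(n_{1} + 1)}{2} = \frac{n_{1}(n_{1} + n_{2} + 1)}{2}$

$Var(T) = Var(U) = \frac{n_{1}n_{2}(n_{1} + n_{2} + 1)}{12}$

$\frac{T - \mu_{T}}{\sigma_{T}} = - \frac{U - \mu_{U}}{\sigma_{U}}$

Under this assumption, the two Z values differ only in sign, so a two-sided test gives the same result with either statistic.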

### How can the Spearman Rank correlation be calculated?

The presence of odd extreme observations or other deviances from normality can seriously affect the sample correlation coefficient. More specifically, many tests based on correlation measures rely for their validity on the assumption of normality. Sometimes, however, this assumption of normality is violated. Then, the Spearman Rank correlation offers a solid alternative. The Spearman Rank correlation coefficient is a nonparametric correlation coefficient that is based on the ranks of the observations. The coefficient can be computed as follows:

$r_{s} = 1 - \frac{6 \sum^{n}_{i = 1} d^{2}_{i} }{n (n^{2} - 1) }$

where di refers to the difference between the two ranks of the ith pair of observations. Suppose that we have 17 observations and these are ranked according to variable X and variable Y. The first observation has rank 14 for variable X and rank 2 for variable Y. Then, the difference in ranks for this observation is 14 - 2 = 12. Similar computations are done for each pair of observations, each difference is squared, and the squared differences are summed. As can be seen in the formula, this sum is multiplied by 6. The resulting coefficient can be tested against the critical value, which can be found in Appendix Table 11.
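A minimal Python sketch of this formula is given below. It assumes that the ranks have already been assigned and that there are no ties; the function name is hypothetical.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman rank correlation from two equally long lists of ranks."""
    n = len(rank_x)
    # sum of squared rank differences d_i^2
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
```

Identical rankings give a coefficient of 1 and exactly reversed rankings give -1.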

Do customers have a preference for a particular burger of McDonald's? Do people's preferences for a certain political candidate depend on characteristics such as age, gender or country of origin? Do students at a particular university have a preference for one of the three statistics teachers? These questions are only a few examples of the types of questions addressed in this chapter. More specifically, this chapter discusses nonparametric tests. Nonparametric tests are often the appropriate procedure to draw statistical inferences about qualitative (i.e., nominal or ordinal) data or about numerical data for which normality of the population distribution cannot be assumed.

## How to conduct an analysis of variance? - Chapter 15

### How to conduct a one-way analysis of variance?

Suppose that we are interested in the comparison of K populations, each of which is assumed to have the same variance. From these populations, we draw independent random samples with n1, n2, ..., nK observations. Further, the symbol xij is used to refer to the jth observation in the ith population. Now, the procedure that we are using to test for the equality of population means in this study is also known as the one-way analysis of variance (ANOVA). Why it is called a one-way analysis will become clear when we discuss other analysis of variance models.

In a one-way analysis of variance, the null hypothesis is that the K population means are all equal, given the independent random samples. The alternative hypothesis then states that at least one population mean is different from the other population means. In formal notation, that is:

$H_{0}: \mu_{1} = \mu_{2} = ... = \mu_{K}$

$H_{1}: \mu_{i} \neq \mu_{j} \hspace{2mm} \text{for at least one pair} \hspace{2mm} \mu_{i}, \mu_{j}$

To test these hypotheses, the first step is to calculate the sample means for the K groups of observations. The null hypothesis states that all populations have the same common mean. A logical next step therefore is to develop an estimate of this common population mean. This common mean can be obtained simply as the sum of all the sample values divided by their total number n = n1 + n2 + ... + nK. In other words, it is the mean of the sample means, weighted by the sample sizes.

The next step is to test the equality of population means. This test is based on two types of variability: (1) the within-groups variability and (2) the between-groups variability. The within-groups variability is calculated from the sum of squared deviations of all the observations in a sample from their sample mean. This is done for each sample. The sum of all these calculations then is the total within-groups variability. Similarly, the between-groups variability is calculated by taking the sum of squared deviations of all individual group means from the overall (common) mean. In calculating the total between-groups variability, a weight is assigned to each squared discrepancy. This weight is the number of sample observations in the group. Hence, the most weight is given to the largest sample. Finally, we can also calculate the total sum of squares, which is the sum of squared discrepancies of all the sample observations about their overall mean (thus, not for each sample separately).

In formula, we obtain the following for the within-groups variability (SSW), between-groups variability (SSG) and total variability (SST):

$SSW = \sum^{K}_{i = 1} \sum^{n_{i}}_{j = 1} (x_{ij} - \bar{x}_{i} )^{2}$

$SSG = \sum^{K}_{i = 1} n_{i} (\bar{x}_{i} - \bar{\bar{x}} )^{2}$

$SST = \sum^{K}_{i = 1} \sum^{n_{i}}_{j = 1} (x_{ij} - \bar{\bar{x}} )^{2}$

These concepts are related as follows:

$SST = SSW + SSG$

From the last formula, it can be observed that the total sum of squares can be decomposed into two components: (1) the within-groups variability and (2) the between-groups variability. This provides the basis for the analysis of variance test of equality of group means. More precisely, the ANOVA is based on the assumption that the K populations have the same common variance. If the null hypothesis that the population means are all the same is true, then each of the sums of squares (SSW and SSG) can be used as the basis for an estimate of the common population variance. To obtain these estimates, the sums of squares should be divided by the appropriate number of degrees of freedom.

First, an unbiased estimator of the population variance is obtained by dividing SSW by (n - K). This estimate is called the within-groups mean square, and is given by:

$MSW = \frac{SSW}{n - K}$

Second, another unbiased estimator of the population variance is obtained by dividing SSG by (K - 1). This estimate is called the between-groups mean square, and is given by:

$MSG = \frac{SSG}{K - 1}$

Importantly, if the population means are NOT equal, the between-groups mean square (MSG) does NOT provide an unbiased estimate of the common population variance. Instead, the expected value of the corresponding random variable will exceed the common population variance, because it then also carries information about the squared differences of the true population means. If, however, the null hypothesis is true, then both the MSW and MSG are unbiased estimators of the population variance and it would be reasonable to expect these two values to be quite close to one another. Based on this idea, we can test the null hypothesis of equal population means by considering the ratio of mean squares, given by:

$F = \frac{MSG}{MSW}$

If this F ratio is quite close to 1, there is little cause to doubt the null hypothesis of equal population means. If, however, this ratio is substantially larger than 1, we suspect that the null hypothesis of equal population means is not true. Under the null hypothesis, this random variable follows an F distribution with (K - 1) degrees of freedom in the numerator and (n - K) degrees of freedom in the denominator. Hence, formally, this ratio can be tested against the F distribution with corresponding degrees of freedom. The critical value can be looked up in Appendix Table 9 of the book. Note that this is done under the assumption that the population distributions are normal.

All of the above is summarized in the following table:

| Source of variation | Sum of squares | Degrees of freedom | Mean squares | F ratio |
|---|---|---|---|---|
| Between groups | SSG | K - 1 | MSG = SSG / (K - 1) | MSG / MSW |
| Within groups | SSW | n - K | MSW = SSW / (n - K) | |
| Total | SST | n - 1 | | |
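The entries of the one-way ANOVA table can be computed along the lines of the following Python sketch. The function name and the dictionary keys are hypothetical; the F ratio still has to be compared with the critical value from the F table.

```python
def one_way_anova(groups):
    """One-way ANOVA quantities for a list of samples (lists of numbers)."""
    k = len(groups)                              # number of groups K
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # within-groups: squared deviations from each group's own mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # between-groups: squared deviations of group means, weighted by n_i
    ssg = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    # total: squared deviations of every observation from the overall mean
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)
    msw = ssw / (n - k)
    msg = ssg / (k - 1)
    return {"SSW": ssw, "SSG": ssg, "SST": sst,
            "MSW": msw, "MSG": msg, "F": msg / msw}
```

Computing SST directly, rather than as SSW + SSG, makes it easy to check the decomposition SST = SSW + SSG numerically.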

### How to conduct the multiple-comparison procedure?

If one conducts a one-way analysis of variance (ANOVA) and finds a significant result, the null hypothesis that all population means are equal is rejected. This, however, does not tell us all that much, because it does not provide any information about which population means differ from each other. Hence, the question arises which subgroup means are different from the others. Several procedures have been developed to tackle this multiple-comparison question. All of these, in essence, involve developing intervals that are somewhat wider than those for the two-subgroup case. One such procedure was developed by John Tukey. He used an extended form of the Student's t distribution. The test statistic is the minimum significant difference between the K subgroups, which can be computed as follows:

$MSD(K) = Q \frac{s_{p}}{\sqrt{n}}$

where the factor Q can be found in Appendix Table 13 using the appropriate significance level. Further, sp is the square root of MSW, that is: sp = √(MSW). The resulting MSD value can be used to indicate which subgroup means are different: two subgroup means whose difference exceeds the MSD are regarded as significantly different. This statistic therefore provides a very useful screening device that can be used to extend the results of the one-way analysis of variance.

### What is the Kruskal-Wallis Test?

The Kruskal-Wallis test is a nonparametric alternative to the ANOVA. Like the majority of nonparametric tests, the Kruskal-Wallis test is based on the ranks of the sample observations. The sample values are pooled together and subsequently ranked in ascending order. Next, the sums of the ranks for the K samples are computed, yielding R1, R2, ..., RK. The test statistic W can be computed as follows:

$W = \frac{12}{n(n + 1)} \sum^{K}_{i = 1} \frac{R^{2}_{i}}{n_{i}} - 3(n + 1)$

Under the null hypothesis of equal population means, this test statistic is a random variable that approximately follows the chi-square distribution with (K - 1) degrees of freedom. The test statistic can be compared against the critical value, which can be found in Appendix Table 7 using the corresponding degrees of freedom and significance level.
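The following Python sketch computes W from K samples. The function names are hypothetical; tied observations receive average ranks, and no tie correction is applied to W.

```python
def average_ranks(values):
    """Ascending ranks (1 = smallest); tied values share the average rank."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + 1 + (sorted_vals.count(v) - 1) / 2
            for v in values]

def kruskal_wallis_w(samples):
    """Kruskal-Wallis statistic W for a list of samples."""
    pooled = [x for s in samples for x in s]     # pool all observations
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for s in samples:
        r_i = sum(ranks[start:start + len(s)])   # rank sum R_i of sample i
        total += r_i ** 2 / len(s)
        start += len(s)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)
```

The returned value is compared with the chi-square critical value with K - 1 degrees of freedom, where K is `len(samples)`.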

### How to conduct a two-way analysis of variance?

In some applications, not one but two factors are of interest. Suppose, for instance, that there are three types of cars (say A, B, and C) whose fuel economies we want to compare. We develop an experiment in which six trials are to be run with each type of car. If these trials are conducted using six drivers, each of whom drives a car of all three types, it is possible to extract from the results information about driver variability as well as information about the differences among the three types of cars, because every car type has been tested by every driver. The additional variable (here: drivers) is called a blocking variable. The experiment is said to be arranged in blocks. In this example, the experiment consists of six blocks, one for each driver. If we randomly select one driver to drive type A, one driver to drive type B, and one driver to drive type C, and so on, this type of experimental design is also known as randomized blocks design.

If we in fact have two variables that we want to compare simultaneously, we can conduct a two-way analysis of variance. In doing so, the total sum of squares can be decomposed into not two, but three components: (1) the between-groups sum of squares, (2) the between-blocks sum of squares, and (3) the error sum of squares. Then, SST = SSG + SSB + SSE. Two hypothesis tests can be conducted, one for the null hypothesis that the population group means are all the same, and one for the null hypothesis that the block means are all the same. Everything is summarized in the table below, where K is the number of groups and H is the number of blocks.

| Source of variation | Sum of squares | Degrees of freedom | Mean squares | F ratio |
|---|---|---|---|---|
| Between groups | SSG | K - 1 | MSG = SSG / (K - 1) | MSG / MSE |
| Between blocks | SSB | H - 1 | MSB = SSB / (H - 1) | MSB / MSE |
| Error | SSE | (K - 1)(H - 1) | MSE = SSE / ((K - 1)(H - 1)) | |
| Total | SST | n - 1 | | |
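For one observation per cell, the table entries can be sketched in Python as below. The sums of squares are not spelled out in the text, so this sketch assumes the standard definitions: SSG is H times the sum of squared deviations of the group means from the overall mean, SSB is K times the corresponding sum for the block means, and SSE is what remains of SST. The function name and data layout are hypothetical.

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell.

    table[j][i] is the observation for block j (row) and group i (column).
    """
    h = len(table)            # number of blocks H
    k = len(table[0])         # number of groups K
    n = h * k
    grand = sum(sum(row) for row in table) / n
    group_means = [sum(row[i] for row in table) / h for i in range(k)]
    block_means = [sum(row) / k for row in table]
    ssg = h * sum((m - grand) ** 2 for m in group_means)
    ssb = k * sum((m - grand) ** 2 for m in block_means)
    sst = sum((x - grand) ** 2 for row in table for x in row)
    sse = sst - ssg - ssb     # uses SST = SSG + SSB + SSE
    mse = sse / ((k - 1) * (h - 1))
    return {"SSG": ssg, "SSB": ssb, "SSE": sse, "SST": sst,
            "F_groups": (ssg / (k - 1)) / mse,
            "F_blocks": (ssb / (h - 1)) / mse}
```

In the car example, each row would hold the fuel economy figures of one driver and each column those of one car type.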

Lastly, if there is more than one observation per cell, we extend this approach using the symbol m to denote the number of observations per cell. With several observations per cell, an interaction between groups and blocks can be estimated as well, so the two-way analysis of variance table is extended as follows:

| Source of variation | Sum of squares | Degrees of freedom | Mean squares | F ratio |
|---|---|---|---|---|
| Between groups | SSG | K - 1 | MSG = SSG / (K - 1) | MSG / MSE |
| Between blocks | SSB | H - 1 | MSB = SSB / (H - 1) | MSB / MSE |
| Interaction | SSI | (K - 1)(H - 1) | MSI = SSI / ((K - 1)(H - 1)) | MSI / MSE |
| Error | SSE | KH(m - 1) | MSE = SSE / (KH(m - 1)) | |
| Total | SST | n - 1 | | |

## How to analyze data sets with measurements over time? - Chapter 16

### What is a time series?

In this chapter, we discuss how to analyze data sets that contain measurements over time for different variables. Such data with measurements over time are also called time series. More specifically, a time series is a set of measurements, ordered over time, on a particular quantity of interest. In a time series, the sequence of observations is important. This is different from cross-sectional data, for which a sequence of observations is not important.

### What are the components of a time series?

Most time series consist of four components:

1. Tt: Trend component
2. St: Seasonality component
3. Ct: Cyclical component
4. It: Irregular component

The trend component refers to the tendency many time series have to grow or decrease over time, rather than to remain stable. Often, such a trend stops at a certain time point, and when that occurs, this provides important information for developing forecasts. The seasonal pattern is uniquely defined for each time series. Our treatment of seasonality depends on our objectives. If, for instance, we are interested in the quarterly profits, we might compare the different quarters and include the quarter as a seasonality component in our model. Sometimes, on the other hand, seasonality is a nuisance. It may, for example, be that the analyst requires an assessment of the overall movement in a time series that is not affected by seasonal factors.

Using these four components, we can define a time series as an additive model consisting of the sum of these components:

$X_{t} = T_{t} + S_{t} + C_{t} + I_{t}$

In other circumstances, the time series model may be defined by a multiplicative model, often represented as a logarithmic additive model:

$X_{t} = T_{t} \times S_{t} \times C_{t} \times I_{t}$

$ln(X_{t}) = ln(T_{t}) + ln(S_{t}) + ln(C_{t}) + ln(I_{t})$

### What are moving averages?

It may happen that the irregular component in a time series is so large that it obscures the effect of any underlying component. In that case, any visual interpretation of the time plot is extremely difficult, as the actual plot will appear quite jagged. Hence, it may be beneficial to smooth the plot to achieve a clearer picture. This smoothing can be done by using a moving average. The method of moving averages is built on the idea that any large irregular component at any time point will exert a smaller effect if we average the point with its immediate neighbors. The simplest procedure to obtain such a moving average is to use a simple, centered (2m + 1)-point moving average. This means that we replace each observation (xt) by the average of itself and its m neighbors on either side:

$x^{*}_{t} = \frac{1}{2m + 1} \sum^{m}_{j = -m} x_{t + j}$

with (t = m + 1, m + 2, ..., n - m). Commonly, this moving average is computed by a statistical software program, such as Minitab.
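Although a statistics package is the usual tool, the computation is short enough to sketch in Python. The function name is hypothetical, and the list index 0 corresponds to t = 1.

```python
def centered_moving_average(x, m):
    """Simple centered (2m + 1)-point moving average.

    Returns the smoothed values for t = m + 1, ..., n - m (1-based), so the
    result is 2m elements shorter than the input series x.
    """
    width = 2 * m + 1
    return [sum(x[t - m:t + m + 1]) / width
            for t in range(m, len(x) - m)]
```

For instance, `centered_moving_average([1, 2, 3, 4, 5], 1)` returns `[2.0, 3.0, 4.0]`: the first and last observations have no complete set of neighbors and are dropped.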

These moving averages can in turn be used to compute the seasonal component. More specifically, let xt (t = 1, 2, ..., n) be a seasonal time series of period s (s = 4 for quarterly data, and s = 12 for monthly data). A centered s-point moving average series, x*t is obtained through the following steps, where it is assumed that s is even:

Step 1. Form the s-point moving averages

$x^{*}_{t + 0.5} = \frac{ \sum^{s/2}_{j = - (s/2) + 1} x_{t + j} }{s}$

Step 2. Form the centered s-point moving averages

$x^{*}_{t} = \frac{x^{*}_{t-0.5} + x^{*}_{t + 0.5} }{2}$

The series of centered s-point moving averages can be used to obtain descriptive insight into the structure of a time series. Because it is largely free from seasonality and comprises a smoothing of the irregular component, it is well suited for the identification of a trend and/or cyclical component.
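The two steps can be sketched in Python as follows, assuming that s is even. The function name is hypothetical; element i of the result corresponds to time t = i + s/2 + 1 (1-based).

```python
def centered_seasonal_moving_average(x, s):
    """Centered s-point moving average for an even seasonal period s."""
    # Step 1: plain s-point averages; the k-th one covers x[k], ..., x[k + s - 1]
    # and is centered halfway between two observation times.
    step1 = [sum(x[k:k + s]) / s for k in range(len(x) - s + 1)]
    # Step 2: average each pair of adjacent step-1 values, which centers the
    # result on an actual observation time.
    return [(a + b) / 2 for a, b in zip(step1, step1[1:])]
```

For quarterly data (s = 4) with eight observations, the result contains the four smoothed values for t = 3, 4, 5 and 6.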

There is a seasonal-adjustment approach that is based on the implicit assumption of a stable seasonal pattern over time. This procedure is also known as the seasonal index method. In this procedure, it is assumed that for any seasonal period (e.g., a given month or quarter) the effect of seasonality is to increase or decrease the series by the same percentage each year. This will be illustrated using an example with quarterly data. To assess the influence of seasonality, the original series is expressed as a percentage of the centered 4-point moving average. Suppose that, for the third quarter of the first year, we find that xt = 0.345 and x*t = 0.5075. Then, the following can be obtained:

$100 (\frac{x_{3}}{x^{*}_{3}}) = 100 ( \frac{0.345}{0.5075} ) = 67.98$

These percentages can in turn be used to compute the seasonal index. This is done as follows: for each quarter, take the median of the percentages obtained for that quarter across the years, and then rescale the four medians so that they average 100 (that is, multiply each median by 400 divided by the sum of the four medians). Finally, we can obtain the seasonally adjusted value by dividing 100 by the seasonal index and multiplying that value by the original value. The adjusted value is expressed in the same units as the original series; the further a seasonal index lies from 100, the stronger the influence of seasonality in that quarter.

### What is meant by exponential smoothing?

Simple exponential smoothing is a forecasting method that performs quite effectively in a variety of forecasting applications and forms the basis for some more elaborate forecasting methods. Exponential smoothing is appropriate when the time series is nonseasonal and has no consistent increasing or decreasing trend. The smoothed series can be obtained as follows:

$\hat{x}_{t} = (1 - \alpha) \hat{x}_{t - 1} + \alpha x_{t}$

where α is a smoothing constant whose value is fixed between 0 and 1. Standing at time n, we can obtain predictions of future values xn+h of the series as follows:

$\hat{x}_{n + h} = \hat{x}_{n}$
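A minimal Python sketch of the smoothing recursion is shown below. The starting value is not specified above; the sketch assumes the common choice of setting the first smoothed value equal to the first observation. The function name is hypothetical.

```python
def exponential_smoothing(x, alpha):
    """Simple exponential smoothing of the series x with constant alpha."""
    smoothed = [x[0]]          # assumption: start the recursion at x_1
    for value in x[1:]:
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * value)
    return smoothed
```

The forecast for every future period is simply the last element of the returned list. A larger α puts more weight on recent observations, while a smaller α produces a smoother series.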

### How to use the Holt-Winters Method for nonseasonal series?

Another forecasting method is the Holt-Winters method for nonseasonal series. This method proceeds as follows. First, obtain initial estimates of the level and the trend at t = 2, and then update both estimates recursively for t = 3, 4, ..., n:

$\hat{x}_{2} = x_{2} \hspace{2mm} T_{2} = x_{2} - x_{1}$

$\hat{x}_{t} = (1 - \alpha) (\hat{x}_{t - 1} + T_{t - 1}) + \alpha x_{t}$

$T_{t} = (1 - \beta) T_{t - 1} + \beta (\hat{x}_{t} - \hat{x}_{t - 1} )$

where α and β are smoothing constants whose values are fixed between 0 and 1. Standing at time point n, the prediction of future values can be obtained as follows:

$\hat{x}_{n + h} = \hat{x}_{n} + hT_{n}$

where h is the number of periods in the future.

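The Holt-Winters Method for nonseasonal series will be illustrated using an example. Suppose that we have the following data and use the smoothing constants α = 0.7 and β = 0.6:

| t | xt | $\hat{x}_{t}$ | Tt |
|---|---|---|---|
| 1 | 133 | .. | .. |
| 2 | 155 | .. | .. |
| 3 | 165 | .. | .. |
| 4 | 171 | .. | .. |
| 5 | 194 | .. | .. |
| 6 | 231 | .. | .. |
| 7 | 274 | .. | .. |
| 8 | 312 | .. | .. |
| 9 | 313 | .. | .. |
| 10 | 333 | .. | .. |
| 11 | 343 | .. | .. |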

The initial estimates of level and trend in year 2 are:

$\hat{x}_{2} = x_{2} = 155$

$T_{2} = x_{2} - x_{1} = 155 - 133 = 22$

Because α = 0.7 and β = 0.6, we obtain the following equations:

$\hat{x}_{t} = 0.3 (\hat{x}_{t - 1} + T_{t - 1}) + 0.7x_{t}$

$T_{t} = 0.4T_{t - 1} + 0.6( \hat{x}_{t} - \hat{x}_{t - 1})$

Using these equations we can obtain the following estimates of level and trend for year 3:

$\hat{x}_{3} = 0.3( \hat{x}_{2} + T_{2} ) + 0.7x_{3} = (0.3)(155 + 22) + (0.7)(165) = 168.6$

$T_{3} = 0.4T_{2} + 0.6(\hat{x}_{3} - \hat{x}_{2}) = (0.4)(22) + (0.6)(168.6 - 155) = 16.96$

The estimates of level and trend for year 4 are obtained in a similar manner:

$\hat{x}_{4} = 0.3( \hat{x}_{3} + T_{3} ) + 0.7x_{4} = (0.3)(168.6 + 16.96) + (0.7)(171) = 175.4$

$T_{4} = 0.4T_{3} + 0.6( \hat{x}_{4} - \hat{x}_{3}) = (0.4)(16.96) + (0.6)(175.4 - 168.6) = 10.86$

This can be calculated for each time point. The results of all these calculations are summarized in the table below.

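| t | xt | $\hat{x}_{t}$ | Tt |
|---|---|---|---|
| 1 | 133 | | |
| 2 | 155 | 155 | 22 |
| 3 | 165 | 169 | 17 |
| 4 | 171 | 175 | 11 |
| 5 | 194 | 192 | 14 |
| 6 | 231 | 223 | 25 |
| 7 | 274 | 266 | 36 |
| 8 | 312 | 309 | 40 |
| 9 | 313 | 324 | 25 |
| 10 | 333 | 338 | 18 |
| 11 | 343 | 347 | 13 |

The estimates in the table are rounded to whole numbers.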

Next, we can forecast future values using the following rules. For one period ahead:

$\hat{x}_{n + 1} = \hat{x}_{n} + T_{n}$

and, for two periods ahead:

$\hat{x}_{n + 2} = \hat{x}_{n} + 2T_{n}$

In more general terms, that is:

$\hat{x}_{n + h} = \hat{x}_{n} + hT_{n}$

Now, consider the estimates of level and trend for year 11, which are 347 and 13, respectively. Then, the forecasts for the next two years are given by:

$\hat{x}_{12} = 347 + 13 = 360$

$\hat{x}_{13} = 347 + (2)(13) = 373$
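The whole example can be reproduced with a short Python sketch of the recursion, using the eleven observations and the smoothing constants α = 0.7 and β = 0.6 from the example. The function names are hypothetical, and list index 0 corresponds to t = 1.

```python
def holt_nonseasonal(x, alpha, beta):
    """Level and trend estimates of the nonseasonal Holt-Winters method.

    Returns two lists aligned with x; position 0 (t = 1) is None because
    the recursion only starts at t = 2.
    """
    level = [None, x[1]]            # initial level: x_2
    trend = [None, x[1] - x[0]]     # initial trend: x_2 - x_1
    for t in range(2, len(x)):
        new_level = (1 - alpha) * (level[-1] + trend[-1]) + alpha * x[t]
        new_trend = (1 - beta) * trend[-1] + beta * (new_level - level[-1])
        level.append(new_level)
        trend.append(new_trend)
    return level, trend

def holt_forecast(level, trend, h):
    """Forecast h periods beyond the last observation."""
    return level[-1] + h * trend[-1]

series = [133, 155, 165, 171, 194, 231, 274, 312, 313, 333, 343]
level, trend = holt_nonseasonal(series, alpha=0.7, beta=0.6)
```

Because the code carries unrounded values through the recursion, its forecasts can differ slightly from hand calculations based on the rounded table entries.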

### How to use the Holt-Winters method for seasonal time series?

In this section, we discuss an extension of the Holt-Winters method that allows for seasonality. Commonly in applications, the seasonal factor is believed to be multiplicative, so that, for instance, in dealing with monthly sales figures, we might think of January in terms of a proportion of average monthly sales. Similar to before, the trend component is believed to be additive.

The same symbols that were used in the nonseasonal case are used for seasonal time series. Only one symbol is added, Ft to denote the seasonal factor. For instance, if the time series consists of s time periods per year, the seasonal factor for the corresponding period in the previous year will be Ft-s. Forecasting using the Holt-Winters method for seasonal time series makes use of a set of recursive estimates from the historical series. These estimates utilize a level factor (α), a trend factor (β), and a multiplicative factor (γ). All three factors are bounded between 0 and 1. The recursive estimates are based on the following equations:

$\hat{x}_{t} = (1 - \alpha) (\hat{x}_{t - 1} + T_{t - 1}) + \alpha \frac{x_{t}}{F_{t-s}}$

$T_{t} = (1 - \beta) T_{t-1} + \beta (\hat{x}_{t} - \hat{x}_{t - 1})$

$F_{t} = (1 - \gamma) F_{t - s} + \gamma \frac{x_{t}}{\hat{x}_{t}}$

The computational details are quite involved and best left to a computer. After the initial procedure has generated the level, trend, and seasonal factors from the previous (historical) values, we can use these results to forecast future values h time periods ahead of the last observation (xn) in the "historical" series. This forecast equation is given by:

$\hat{x}_{n+h} = ( \hat{x}_{n} + hT_{n} ) F_{n+h-s}$
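To make the recursion concrete, the following Python sketch performs a single update step and a forecast. It reads the level equation as the previous level plus trend, weighted by (1 - α), plus α times the deseasonalized observation xt / Ft-s. The function and parameter names are hypothetical, and initialization of the level, trend, and seasonal factors is not covered.

```python
def holt_winters_seasonal_step(x_t, level_prev, trend_prev, factor_prev_season,
                               alpha, beta, gamma):
    """One update of level, trend and seasonal factor at time t.

    factor_prev_season is F_{t-s}, the factor of the same season one year ago.
    """
    level = ((1 - alpha) * (level_prev + trend_prev)
             + alpha * x_t / factor_prev_season)
    trend = (1 - beta) * trend_prev + beta * (level - level_prev)
    factor = (1 - gamma) * factor_prev_season + gamma * x_t / level
    return level, trend, factor

def holt_winters_seasonal_forecast(level_n, trend_n, factor_for_season, h):
    """Forecast h periods ahead; factor_for_season is F_{n+h-s}."""
    return (level_n + h * trend_n) * factor_for_season
```

In practice the step function is applied in a loop over the historical series, with the most recent factor for each season kept in a list of length s.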

### What are autoregressive models?

Here, we discuss a different approach to forecasting time series. This approach uses the available data to estimate the parameters of a model of the process that may have generated the time series. One procedure based on this model-building approach is called autoregressive modeling. Essentially, the idea of an autoregressive model is to regard a time series as a series of random variables. As we saw in Chapter 13, we might often for practical purposes be prepared to assume that these random variables all have the same means and variances. It is, however, not very plausible that they are independent of one another in real data. It is, for instance, very likely that sales in adjacent periods are correlated with each other. Such correlations between adjacent periods are sometimes referred to as autocorrelation. In principle, any number of autocorrelation patterns are possible, although some are more likely to arise than others. One very simple autocorrelation pattern arises when the correlation between adjacent values in the time series is some number (say Φ1), that between values two time periods apart is Φ1², and that between values three time periods apart is Φ1³. This autocorrelation structure gives rise to a time-series model of the form:

$x_{t} = \gamma + \phi_{1}x_{t - 1} + \epsilon_{t}$

where γ and Φ1 are fixed parameters, and the random variables εt have mean 0 and a fixed variance for all t and are not correlated with each other. This model is called the first-order autoregressive model. The purpose of the parameter γ is to allow for the possibility that the series xt has some mean other than 0. Otherwise (that is, if the mean is 0), we obtain the model presented in Chapter 13 that has been used to represent autocorrelation in the error terms of a linear regression equation. The parameters of the autoregressive model are estimated by least squares: those parameter values are selected for which the sum of squared errors is a minimum.
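For the first-order model, the least squares estimates have a closed form, because the model is a simple linear regression of xt on xt-1. The Python sketch below illustrates this under that assumption; the function name is hypothetical.

```python
def fit_ar1(x):
    """Least squares estimates (gamma, phi1) of x_t = gamma + phi1 * x_{t-1} + e_t."""
    y = x[1:]       # responses: x_2, ..., x_n
    z = x[:-1]      # regressors: x_1, ..., x_{n-1}
    n = len(y)
    mean_y = sum(y) / n
    mean_z = sum(z) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    phi1 = szy / szz                    # slope of the regression line
    gamma = mean_y - phi1 * mean_z      # intercept
    return gamma, phi1
```

A one-step-ahead forecast is then `gamma + phi1 * x[-1]`.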

## What other sampling procedures are available? - Chapter 17

In some situations, it is preferable to divide the population into subgroups, called strata, so that each individual member of the population belongs to one, and only one, subgroup. This division into strata might be based on a certain characteristic of the population, such as gender or income. In this chapter, we discuss the stratified sampling procedure. We will also briefly discuss other methods of sampling, namely cluster sampling, two-phase sampling, and nonprobabilistic sampling.

### What is stratified sampling?

Stratified sampling is a way of sampling in which the population is broken down into subgroups (strata) and a simple random sample is drawn from each stratum. The only requirement here is that each member of the population belongs to one, and only one, of the strata. In other words, stratified random sampling is the selection of independent random samples from each stratum of the population. In doing so, one attractive possibility that is often used in practice is called proportional allocation, which implies that the proportion of sample members from any stratum is the same as the proportion of population members in the stratum. This can be contrasted with the (less representative) approach of including the same number of participants from each stratum.

Suppose that random samples of nj individuals are taken from the K strata containing Nj individuals (j = 1, 2, ..., K), with sample means x̄j and sample variances s²j. Then, an unbiased estimation procedure for the overall population mean μ results in the following point estimate:

$\bar{x}_{st} = \frac{1}{N} \sum^{K}_{j = 1} N_{j}\bar{x}_{j}$

Next, an unbiased estimation procedure for the variance of the estimator of the overall population mean results in the following point estimate:

$\hat{\sigma}^{2}_{\bar{x}_{st}} = \frac{1}{N^{2}} \sum^{K}_{j = 1} N^{2}_{j} \hat{\sigma}^{2}_{\bar{x}_{j}}$

where

$\hat{\sigma}^{2}_{\bar{x}_{j}} = \frac{ s^{2}_{j} }{n_{j}} \times \frac{ N_{j} - n_{j} }{N_{j} - 1 }$

Assuming that the sample size is large enough, a 100 (1 - α)% confidence interval estimation of the population mean using stratified random samples is obtained from the following:

$\bar{x}_{st} \pm z_{\alpha/2} \hat{\sigma}_{\bar{x}_{st} }$
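The point estimate, its estimated variance, and the confidence interval can be sketched in Python as follows. The function name and the tuple layout are hypothetical; the default z = 1.96 corresponds to a 95% confidence level.

```python
import math

def stratified_mean_ci(strata, z=1.96):
    """Stratified estimate of the population mean with a confidence interval.

    strata: list of tuples (N_j, n_j, xbar_j, s_j) with the stratum size,
            sample size, sample mean and sample standard deviation.
    """
    total_n = sum(nj_pop for nj_pop, _, _, _ in strata)           # N
    mean_st = sum(nj_pop * xbar for nj_pop, _, xbar, _ in strata) / total_n
    # per-stratum variance of the sample mean, with finite population correction
    var_st = sum(nj_pop ** 2 * (s ** 2 / nj) * (nj_pop - nj) / (nj_pop - 1)
                 for nj_pop, nj, _, s in strata) / total_n ** 2
    half_width = z * math.sqrt(var_st)
    return mean_st, var_st, (mean_st - half_width, mean_st + half_width)
```

For example, `stratified_mean_ci([(100, 10, 20.0, 4.0), (300, 30, 30.0, 6.0)])` weights the two stratum means by 100/400 and 300/400, giving a point estimate of 27.5.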

#### Estimation of the population total

Because the population total is the product of the population mean and the number of population members, these procedures can be modified simply to allow the estimation of the population total. This is done by multiplying the estimator of the mean by N (and, correspondingly, its estimated variance by N²).

#### Estimation of the population proportion

Let Pj be the population proportion in the jth stratum and let p̂j be the corresponding sample proportion. In stratified random sampling then, the population proportion can be estimated as follows:

$\hat{p}_{st} = \frac{1}{N} \sum^{K}_{j = 1} N_{j} \hat{p}_{j}$

$\hat{\sigma}^{2}_{\hat{p}_{st}} = \frac{1}{N^{2}} \sum^{K}_{j = 1} N^{2}_{j} \hat{\sigma}^{2}_{\hat{p}_{j}}$

where

$\hat{\sigma}^{2}_{\hat{p}_{j}} = \frac{ \hat{p}_{j} (1 - \hat{p}_{j}) }{n_{j} - 1} \times \frac{ N_{j} - n_{j} }{N_{j} - 1}$

is the estimate of the variance of the sample proportion in the jth stratum. Next, provided that the sample size is large enough, a 100 (1 - α)% confidence interval estimation of the population proportion for stratified random samples can be obtained from the following:

$\hat{p}_{st} \pm z_{\alpha/2} \hat{\sigma}_{\hat{p}_{st}}$

#### Proportional allocation

Assuming that a total of n sample members is to be selected, how many of these sample observations should be allocated to each stratum? As we discussed earlier, a natural choice is proportional allocation in which the proportion of sample members in any stratum is the same as the proportion of population members in that stratum. Thus, for the jth stratum

$\frac{n_{j}}{n} = \frac{N_{j}}{N}$

Transforming this formula, we can see that the sample size for the jth stratum using proportional allocation is given by:

$n_{j} = \frac{N_{j}}{N} \times n$

#### Optimal allocation

If the only aim of a survey is to estimate as precisely as possible an overall population parameter, for instance the mean, total, or proportion, and if enough is known about the population, then it is possible to derive an optimal allocation, which yields the most precise estimator. Using optimal allocation, we can obtain the sample size for the jth stratum for the overall mean or total as follows, where σj denotes the population standard deviation in the jth stratum:

$n_{j} = \frac{ N_{j} \sigma_{j} }{ \sum^{K}_{i = 1} N_{i} \sigma_{i} } \times n$

When we compare this formula to the one obtained for proportional allocation, we can see that optimal allocation allocates relatively more sample effort to strata in which the population variance is highest. This implies that a larger sample size is needed where the greater population variability exists.

Next, using optimal allocation for the population proportion, we can obtain the sample size for the jth stratum as follows:

$n_{j} = \frac{ N_{j} \sqrt{ P_{j} (1 - P_{j}) } }{ \sum^{K}_{i = 1} N_{i} \sqrt{ P_{i} (1 - P_{i}) } } \times n$

When we compare optimal allocation and proportional allocation again, we can see that optimal allocation allocates more sample observations to strata in which the true population proportions are closest to 0.50, because P(1 - P) is largest at P = 0.50.
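Both allocation rules can be sketched in a few lines of Python. The function names are hypothetical, and the results are returned unrounded; in practice they are rounded to whole sample sizes.

```python
def proportional_allocation(stratum_sizes, n):
    """Sample size per stratum, proportional to the stratum size N_j."""
    total = sum(stratum_sizes)
    return [size / total * n for size in stratum_sizes]

def optimal_allocation(stratum_sizes, stratum_sds, n):
    """Sample size per stratum, proportional to N_j * sigma_j.

    For a proportion, pass sqrt(P_j * (1 - P_j)) as the standard deviation.
    """
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [w / total * n for w in weights]
```

With stratum sizes 100 and 300 and n = 40, proportional allocation gives 10 and 30. If the smaller stratum has a standard deviation three times as large (say 6 versus 2), optimal allocation gives 20 and 20 instead, shifting sample effort towards the more variable stratum.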

### What other sampling procedures can be used?

In the final section of this chapter, we briefly discuss some other sampling procedures that are available.

#### Cluster sampling

First, cluster sampling is an attractive approach when a population can conveniently be broken down into relatively small, geographically compact units called clusters. For instance, a city can be subdivided into political wards or residential blocks. Often, this can be achieved even without a complete list of the residents or households in the city. In cluster sampling, a simple random sample of clusters is selected from the population and every individual in each of the sampled clusters is contacted. In other words, a complete census is carried out in each of the selected clusters. As a result, inferences can be made about the population using relatively little prior information. All that is required is a breakdown of the population into identifiable clusters. It is, for instance, not even required to know the total number of population members. It is sufficient to know the number of members in each of the sampled clusters, and these numbers can be determined during the survey itself, because a full census is taken in each sampled cluster. Another, more practical advantage of cluster sampling is that contacting the sample members is relatively inexpensive for the interviewers, as the sample members within a cluster are geographically close to one another.
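The selection step described above can be sketched in Python as follows. The function name `cluster_sample`, the block names, and the data values are hypothetical. The estimate computed at the end is simply the mean over all contacted individuals, used here as an illustrative assumption rather than as the formal estimator.

```python
import random


def cluster_sample(clusters, m, seed=None):
    """Draw a simple random sample of m clusters and take a full census in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # random sample of cluster names
    # Every member of each chosen cluster enters the sample.
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical city split into four residential blocks (values are made up).
clusters = {
    "block_A": [12, 15, 11],
    "block_B": [30, 28],
    "block_C": [9, 10, 8, 11],
    "block_D": [22, 25, 21],
}
sample = cluster_sample(clusters, m=2, seed=1)

# Pool all contacted individuals and compute their mean.
values = [v for members in sample.values() for v in members]
estimate = sum(values) / len(values)
print(sorted(sample), estimate)
```

Only the list of clusters is needed in advance; the cluster sizes are discovered when the census in each selected cluster is carried out.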

Note that cluster sampling is rather different from stratified sampling. Although in both sampling procedures the population is first subdivided into subgroups, the similarity between the two is quite illusory. In stratified random sampling, a sample is taken from every stratum of the population in an attempt to ensure that important segments of the population are given corresponding weight. In cluster sampling, a random sample of clusters is taken, such that some clusters will have no members in the sample. Since population members within clusters are likely to be quite homogeneous, the danger of cluster sampling is that some important subgroups of the population may be either not represented at all or heavily underrepresented in the final sample. Hence, the benefit of cluster sampling (i.e., its convenience) comes at the high cost of additional imprecision in the sample estimates.

#### Two-phase sampling

In many applications, the population cannot be surveyed in a single step. Instead, it is often convenient to first conduct a pilot study in which a relatively small proportion of the sample members are surveyed. The results that are obtained from this pilot study can be analyzed prior to conducting the bulk of the survey. Conducting a survey with two stages, beginning with a pilot study, is called two-phase sampling. An important advantage of this sampling procedure is that it enables the researcher to try out the proposed questionnaire at modest costs. An important disadvantage of this approach, however, is that it can be quite time consuming.

#### Nonprobabilistic sampling methods

So far, all sampling methods we discussed have been probabilistic in nature. Nevertheless, in many practical applications, nonprobabilistic methods are used for selecting sample members. This is primarily done as a matter of convenience. The main drawback of nonprobabilistic sampling methods is that there is no valid way to determine the reliability of the resulting estimates.

In some situations, it is preferable to divide the population into subgroups, called strata, so that each individual member of the population belongs to one, and only one, subgroup. This division into strata might be based on a certain characteristic of the population, such as gender or income. In this chapter, we discuss the stratified sampling procedure. We will also briefly discuss other methods of sampling, namely cluster sampling, two-phase sampling, and nonprobabilistic sampling.
