Summary of Managerial Statistics (Keller) - in 2 parts

This summary is based on the 2013-2014 academic year.

CHAPTER A: BASICS OF STATISTICS
CHAPTER B: GRAPHICAL DESCRIPTIVE TECHNIQUES
CHAPTER D: METHODS OF COLLECTING DATA AND SAMPLING

CHAPTER A: BASICS OF STATISTICS

Statistics is a way to get information from data. There are two main branches of statistics:

Descriptive statistics, which are concerned with methods of organizing, summarizing, and presenting data in a convenient and informative way. Descriptive statistics make use of graphical and numerical techniques to summarize and present data in a clear way. The actual technique used depends on what specific information needs to be extracted.
Inferential statistics, which is a body of methods used to draw conclusions or inferences about characteristics of a population based on sample data (although, a sample that is only a small fraction of the size of the population can lead to correct inferences only a certain percentage of the time).

Statistical inference problems involve three key concepts:

A population is the group of all items of interest to a researcher (note: population does not necessarily refer to a group of people). It is frequently very large and may, in fact, be infinitely large. A descriptive measure of a population is called a parameter. In most applications of inferential statistics the parameter represents the information which is needed.
A sample is a set of data drawn from the population. A descriptive measure of a sample is called a statistic. Statistics are used to make inferences about parameters.
Statistical inference is the process of making an estimate, prediction, or decision about a population based on sample data. In statistical inference there are two measures of reliability:
- the confidence level, which is the proportion of times that an estimating procedure will be correct; and
- the significance level, which measures how frequently the conclusion will be wrong in the long run.

Some basic terms related to the concept of data:

A variable is some characteristic of a population or sample. The name of the variable is usually represented using upper case letters such as X, Y, and Z.
The values of the variable are the possible observations of the variable.
Data are the observed values of a variable.

There are three types of data:

Interval data are real numbers (for instance, incomes and distances). This type of data is also referred to as quantitative or numerical.
The values of nominal data are categories. For instance, answers to questions about marital status produce nominal data. The values are not numbers but instead are words describing the categories. Nominal data are also called qualitative or categorical.
Ordinal data appear to be nominal, but their values are in order. Because the only constraint that is imposed on the choice of codes is that the order must be maintained, any set of codes that are in order can be used.

The critical difference between these three types of data is that the intervals or differences between values of interval data are consistent and meaningful. For instance, the difference between grades of 10 and 8 is the same two-grade difference that exists between 8 and 6. Thus, a researcher can calculate the difference and interpret the results. Because the codes representing ordinal data are arbitrarily assigned except for the order, a researcher cannot calculate and interpret differences.

All calculations are permitted on interval data. A set of interval data is often described by calculating the average. No calculations can be performed on the codes of nominal data, because these codes are completely arbitrary. Thus, calculations based on the codes used to store nominal data are meaningless. All that a researcher can do with nominal data is count the occurrences of each category. The only permissible calculations on ordinal data are ones involving a ranking process.

The data types can be placed in order of the permissible calculations. At the top of the list there is the interval data type (because virtually all computations are allowed). At the bottom of the list there is the nominal data type (because no calculations other than determining frequencies are permitted). In between interval and nominal data lies the ordinal data type. Note: higher-level data types may be treated as lower-level ones. For instance, in universities the grades in a course (interval data) can be converted to letter grades (ordinal data). Lower-level data types cannot be treated as higher-level types.
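The hierarchy of permissible calculations can be illustrated with a small sketch. The observations and the rating codes below are invented for illustration; only the rule that each data type allows certain calculations comes from the text.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical observations, invented for illustration only.
incomes = [32000, 45000, 51000, 38000, 45000]                              # interval data
marital_status = ["single", "married", "married", "divorced", "married"]  # nominal data
# Ordinal data stored as codes that only preserve order (1 = poor ... 4 = excellent).
ratings = [3, 4, 2, 4, 1]

# Interval data: arithmetic such as the average is meaningful.
average_income = mean(incomes)

# Nominal data: the only meaningful operation is counting each category.
status_counts = Counter(marital_status)

# Ordinal data: results based on ranking, such as the median, are meaningful.
median_rating = median(ratings)

print(average_income)
print(status_counts)
print(median_rating)
```

Averaging the rating codes would run without error, but the result would depend on the arbitrary choice of codes, which is why only ranking-based calculations are appropriate for ordinal data.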

The variables whose observations constitute the data are given the same name as the type of data. Thus, for instance, nominal data are the observations of a nominal variable.

CHAPTER B: GRAPHICAL DESCRIPTIVE TECHNIQUES

The only allowable calculation on nominal data is to count the frequency of each value of the variable. The data can be summarized in a table that presents the categories and their counts called a frequency distribution. A relative frequency distribution lists the categories and the proportion with which each occurs. There are two graphical methods which can be used to present a picture of the data:

A bar chart, which is often used to display frequencies; and
a pie chart, which graphically shows relative frequencies.

A bar chart is created by drawing a rectangle representing each category. The height of the rectangle represents the frequency and its base is arbitrary. A pie chart is simply a circle subdivided into slices that represent the categories. It is drawn so that the size of each slice is proportional to the percentage corresponding to that category.
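As a rough sketch of the frequency and relative frequency distributions described above, the following counts hypothetical survey responses and prints a crude text bar chart. The data and variable names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical nominal data: answers to a question about marital status.
responses = ["married", "single", "married", "divorced",
             "single", "married", "widowed", "married"]

# Frequency distribution: each category with its count.
frequency = Counter(responses)

# Relative frequency distribution: each category with its proportion.
n = len(responses)
relative_frequency = {category: count / n for category, count in frequency.items()}

# Text bar chart: the length of each bar represents the frequency.
for category, count in frequency.most_common():
    print(f"{category:<10}{count:>3}  {relative_frequency[category]:.3f}  {'#' * count}")
```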

There are several graphical methods which are used when the data are interval. The most important of these graphical methods is the histogram – it can be used to summarize interval data or to help explain an important aspect of probability.

A frequency distribution for interval data is created by counting the number of observations that fall into each of a series of intervals (classes) that cover the complete range of observations.

Although the frequency distribution provides information about how the numbers are distributed, the information is more easily understood and imparted by drawing a picture or graph. The graph is called a histogram. A histogram is created by drawing rectangles whose bases are the intervals and whose heights are the frequencies.

The number of class intervals selected depends entirely on the number of observations in the data set. The more observations available, the larger the number of class intervals needed to draw a useful histogram. For instance, for fewer than 50 observations, a researcher would normally create between 5 and 7 classes; for more than 50,000 observations, a researcher would normally use between 17 and 20 classes. (More detailed guidelines are presented in Table 2.6 on page 35.)

Alternatively, a researcher can use Sturges' formula, which recommends that the number of class intervals be determined by the following (where log is the base-10 logarithm):

Number of class intervals = 1 + 3.3 log(n)

For instance, if n = 100, number of class intervals = 1 + 3.3 log(100) = 1 + 3.3(2) = 7.6 (which is rounded to 8).
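The worked example above can be checked with a few lines of code. The function name is hypothetical; the formula and the base-10 logarithm follow the example in the text.

```python
import math

def sturges_classes(n):
    """Return 1 + 3.3 * log10(n), the unrounded number of class intervals."""
    return 1 + 3.3 * math.log10(n)

raw = sturges_classes(100)   # approximately 7.6
classes = round(raw)         # rounded to the nearest whole number of classes

print(raw, classes)
```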

The approximate width of the classes is determined by subtracting the smallest observation from the largest and dividing the difference by the number of classes. Thus,

Class width = (Largest observation - Smallest observation) / Number of classes

The result is often rounded to some convenient value. Consequently, the class limits are defined by selecting a lower limit for the first class from which all other limits are determined. The only condition to apply is that the first class interval must contain the smallest observation.
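A minimal sketch of building a frequency distribution for interval data from the class width. Several choices here are assumptions rather than rules from the text: the width is rounded up to a whole number as the "convenient value", the first class starts at the smallest observation, each class includes its lower limit, and the last class also includes its upper limit. The sketch also assumes the observations are not all identical.

```python
import math

def frequency_distribution(observations, number_of_classes):
    """Return (class limits, counts) for the given interval data."""
    smallest, largest = min(observations), max(observations)
    # Approximate class width, rounded up to a whole number (an assumption).
    width = math.ceil((largest - smallest) / number_of_classes)
    lower = smallest  # the first class must contain the smallest observation
    counts = [0] * number_of_classes
    for x in observations:
        # Clamp so that an observation equal to the last upper limit
        # is counted in the last class.
        index = min(int((x - lower) // width), number_of_classes - 1)
        counts[index] += 1
    limits = [(lower + i * width, lower + (i + 1) * width)
              for i in range(number_of_classes)]
    return limits, counts

# Hypothetical observations, invented for illustration.
observations = [2, 5, 7, 8, 11, 12, 14, 15, 18, 22]
limits, counts = frequency_distribution(observations, 4)

# Text histogram: bases are the classes, bar lengths are the frequencies.
for (low, high), count in zip(limits, counts):
    print(f"{low:>3} to {high:<3} {'#' * count}")
```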

The shape of histograms is described on the basis of the following characteristics:

Symmetry - A histogram is said to be symmetric if, when a vertical line down the center of the histogram is drawn, the two sides are identical in shape and size.
Skewness - A skewed histogram is one with a long tail extending to either the right or the left. The one which extends to the right is called positively skewed, and the one which extends to the left is called negatively skewed.
Number of modal classes - A mode is the observation that occurs with the greatest frequency. A modal class is the class with the largest number of observations. A unimodal histogram is one with a single peak. A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal histograms often indicate that two different distributions are present.
Bell shape - A special type of symmetric unimodal histogram is one that is bell shaped (such as the one presented in Figure 2.10 on page 37).

One of the drawbacks of the histogram is that potentially useful information can be lost by classifying the observations. A stem-and-leaf display is a method which partially overcomes this loss. The first step in developing a stem-and-leaf display is to split each observation into two parts, a stem and a leaf. There are several different ways of doing this. For instance, the number 15.6 can be split so that the stem is 15 and the leaf is 6. In this definition the stem consists of the digits to the left of the decimal point and the leaf is the digit to the right of it. Another method defines the stem as 1 and the leaf as 5. In this definition the stem is the number of tens and the leaf is the number of ones. The stem-and-leaf display is similar to a histogram turned on its side. The length of each line represents the frequency in the class interval defined by the stems. The advantage of the stem-and-leaf display over the histogram is that the actual observations can be seen.
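The second splitting rule (stem = number of tens, leaf = number of ones) can be sketched as follows. The scores are invented, and the sketch assumes non-negative whole numbers; stems without any observations are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(observations):
    """Map each stem (tens) to the sorted list of its leaves (ones)."""
    display = defaultdict(list)
    for x in sorted(observations):
        stem, leaf = divmod(x, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical observations, invented for illustration.
scores = [15, 23, 27, 31, 18, 24, 36, 22, 19, 25]
display = stem_and_leaf(scores)

# Each printed line is one stem; its length shows the class frequency.
for stem in sorted(display):
    print(f"{stem} | {''.join(str(leaf) for leaf in display[stem])}")
```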

The frequency distribution lists the number of observations that fall into each class interval. A relative frequency distribution can also be created by dividing the frequencies by the number of observations.

The relative frequency distribution highlights the proportion of the observations that fall into each class. In some situations a researcher may want to determine the proportion of observations that fall below each of the class limits. In such cases he or she needs to create a cumulative relative frequency distribution. Another way of presenting this information is the ogive, which is a graphical representation of the cumulative relative frequencies.
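A short sketch of how relative and cumulative relative frequencies follow from class counts. The counts are hypothetical; the values in `cumulative_relative` are the heights an ogive would plot at each upper class limit.

```python
from itertools import accumulate

# Hypothetical class frequencies for four consecutive classes.
class_counts = [2, 3, 3, 2]
n = sum(class_counts)

# Relative frequency: frequency divided by the number of observations.
relative = [count / n for count in class_counts]

# Cumulative relative frequency: running total of counts divided by n.
cumulative_relative = [total / n for total in accumulate(class_counts)]

print(relative)
print(cumulative_relative)
```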

Data can be also classified in the following way:

cross-sectional data, which are observations measured at the same point in time;
time-series data, which represent measurements at successive points in time.

Time-series data are often graphically depicted on a line chart, which is a plot of the variable over time. It is created by plotting the value of the variable on the vertical axis and the time periods on the horizontal axis.

Techniques applied to single sets of data are called univariate. When a researcher wants to depict the relationship between variables, bivariate methods are required.

A cross-classification table (also called a cross-tabulation table) is used to describe the relationship between two nominal variables. There are several ways to store the data to be used to produce a table and/or a bar or pie chart.

The data are in two columns where the first column represents the categories of the first nominal variable and the second column stores the categories for the second variable. Each row represents one observation of the two variables. The number of observations in each column must be the same.
The data are stored in two or more columns where each column represents the same variable in a different sample or population. To produce a cross-classification table, the number of observations of each category in each column has to be counted.
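For the first storage layout (two columns, one row per observation), a cross-classification table can be sketched by counting pairs of categories. The variables and category names are invented for illustration.

```python
from collections import Counter

# Hypothetical data: each position is one observation of two nominal variables.
occupation = ["blue-collar", "white-collar", "professional",
              "white-collar", "blue-collar", "professional"]
newspaper = ["Post", "Globe", "Globe", "Post", "Post", "Globe"]

# Count how often each combination of categories occurs.
table = Counter(zip(occupation, newspaper))

rows = sorted(set(occupation))
columns = sorted(set(newspaper))

# Print the table: one row per occupation, one column per newspaper.
print(f"{'':<14}" + "".join(f"{c:>7}" for c in columns))
for r in rows:
    print(f"{r:<14}" + "".join(f"{table[(r, c)]:>7}" for c in columns))
```

A `Counter` returns 0 for combinations that never occur, so empty cells need no special handling.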

Researchers often need to know how two interval variables are related. The technique used to describe the relationship between such variables is called a scatter diagram. To draw a scatter diagram a researcher needs data for two variables. In applications where one variable depends to some degree on the other, the dependent variable is labeled Y and the other, called the independent variable, is labeled X. In other cases, where no dependency is evident, the variables are labeled arbitrarily.

To determine the strength of the linear relationship a researcher needs to draw a straight line through the points in such a way that the line represents the relationship. If most of the points fall close to the line it can be said that there is a linear relationship.

If most of the points appear to be scattered randomly with only an impression of a straight line, there is no, or at best, a weak linear relationship (however, there may be some other type of relationship, e.g. a quadratic or exponential one).

Usually, when one variable increases and the other variable also increases, it can be said that there is a positive linear relationship. When the two variables tend to move in opposite directions, the nature of their association is described as a negative linear relationship.

However, if two variables are linearly related, it does not mean that one is causing the other. In fact, a linear relationship observed in the data can never by itself establish that one variable causes the other. Thus, correlation is not causation.
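The direction and strength of a linear relationship are judged visually in this chapter. As an assumption going beyond the text, the sketch below uses the coefficient of correlation as a numerical stand-in: values near +1 or -1 suggest points close to a straight line, and the sign gives the direction. The data are invented.

```python
import math

def correlation(xs, ys):
    """Coefficient of correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical observations, invented for illustration.
advertising = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 22, 26]   # tends to rise with advertising
price = [10, 8, 7, 5, 2]       # tends to fall as advertising rises

r_positive = correlation(advertising, sales)
r_negative = correlation(advertising, price)

print(r_positive, r_negative)
```

A value near +1 or -1 describes how closely the points follow a line. It says nothing about whether one variable causes the other.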

Graphical excellence is a term which applies to techniques that are informative and concise and that communicate information clearly to their viewers.

Graphical excellence is achieved when the following characteristics apply:

The graph presents large data sets concisely and coherently.
The ideas and concepts the researcher wants to deliver are clearly understood by the viewer.
The graph encourages the viewer to compare two or more variables.
The display induces the viewer to address the substance of the data and not the form of the graph.
There is no distortion of what the data reveal.

Researchers should be aware of possible methods of graphical deception. Firstly, a researcher should be careful about graphs without a scale on one axis. Secondly, he or she should also avoid being influenced by a graph’s caption. Additionally, perspective is often distorted if only absolute changes in value, rather than percentage changes, are reported. For instance, a 15% growth in revenues can be made to appear more dramatic by stretching the vertical axis, a technique that involves changing the scale on the vertical axis so that a given euro amount is represented by a greater height than before. As a result, the rise in revenues appears to be greater, because the slope of the graph is visually (but not numerically) steeper. The expanded scale is usually accommodated by employing a break in the vertical axis or by shortening the vertical axis so that the vertical scale starts at a point greater than zero. The effect of making slopes appear steeper can also be created by shrinking the horizontal axis, in which case points on the horizontal axis are moved closer together. Just the opposite effect is obtained by stretching the horizontal axis, that is, spreading out the points on the horizontal axis to increase the distance between them so that slopes and trends appear less steep. Similar illusions can be created with bar charts by stretching or shrinking the vertical or horizontal axis. Another method which is used to create distorted impressions with bar charts is to construct the bars so that their widths are proportional to their heights.

Lastly, a researcher should also be careful about size distortions, particularly in pictograms, which replace the bars with pictures of objects to enhance the visual appeal.

In general, a statistical report should contain the following:

Statement of objectives;
Description of the experiment;
Description of the results; and
Discussion of limitations of the statistical techniques.

CHAPTER D: METHODS OF COLLECTING DATA AND SAMPLING

Data are observed values of a variable. The methods of collecting data include:

direct observation, which produces observational data,
experiments, which produce experimental data, and
surveys.

An important aspect of surveys is the response rate. The response rate is the proportion of all people who were selected who completed the survey. A low response rate can destroy the validity of any conclusion resulting from the statistical analysis.

A personal interview involves an interviewer soliciting information from a respondent by asking prepared questions. A personal interview has the advantage of having a higher expected response rate than other methods of data collection. In addition, there will probably be fewer incorrect responses resulting from respondents misunderstanding some questions (the interviewer can clarify misunderstandings). But the interviewer must also be careful not to say too much, for fear of biasing the response.

A telephone interview is usually less expensive than a personal interview, but it is also less personal and has a lower expected response rate. Unless the issue is of interest, many people will refuse to respond to telephone surveys.

Self-administered surveys are usually mailed to a sample of people. This is an inexpensive method of conducting a survey and is therefore attractive when the number of people to be surveyed is large. But self-administered questionnaires usually have a low response rate and may have a relatively high number of incorrect responses due to respondents misunderstanding some questions.

When designing a questionnaire the following should be taken into account:

the questionnaire should be kept as short as possible to encourage respondents to complete it;
the questions themselves should also be short, as well as simply and clearly worded, to enable respondents to answer quickly, correctly, and without ambiguity;
dichotomous questions (questions with only two possible responses, such as “yes” and “no”) and multiple-choice questions are useful and popular because of their simplicity; however, in the case of a multiple-choice question, for instance, a respondent may feel that none of the choices offered is suitable;
open-ended questions provide an opportunity for respondents to express opinions more fully, but they are time consuming and more difficult to tabulate and analyze;
leading questions should never be used;
it is useful to pretest a questionnaire on a small number of people in order to uncover potential problems (e.g. ambiguous wording);
when preparing the questions, a researcher should think about how he or she intends to tabulate and analyze the responses. It is important to determine whether values (i.e., responses) are solicited for an interval variable or a nominal variable.

Then a researcher needs to determine the type of statistical techniques (descriptive or inferential) which he or she intends to apply to the data to be collected, and note the requirements of the specific techniques to be used.

Statistical inference allows researchers to draw conclusions about a population parameter based on a sample that is quite small in comparison to the size of the population. The sample statistic can come quite close to the parameter it is designed to estimate if the target population (the population about which we want to draw inferences) and the sampled population (the actual population from which the sample has been taken) are the same.

Self-selected samples are almost always biased, because the individuals who participate in them are more keenly interested in the issue than are the other members of the population.

A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen. One way to conduct a simple random sample is to assign a number to each element in the population, write these numbers on individual slips of paper, toss them into a bowl, and draw the required number of slips (the sample size, n) from it. Sometimes the elements of the population are already numbered. In such cases, choosing which sampling procedure to use is simply a matter of deciding how to select from among these numbers.
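When the population elements are already numbered, drawing slips from a bowl can be imitated with a random number generator. This is a minimal sketch: the population size, sample size, and seed are arbitrary choices made for the example.

```python
import random

# Hypothetical population: elements numbered 1 to 1000.
population_ids = list(range(1, 1001))

# A seeded generator makes the example reproducible (the seed is arbitrary).
rng = random.Random(42)

# Draw n = 50 distinct elements; every subset of size 50 is equally likely.
sample = rng.sample(population_ids, 50)

print(sorted(sample))
```

`random.sample` draws without replacement, so no element can appear twice, just as a slip drawn from the bowl is not put back.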

A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum. Examples of criteria for separating a population into strata (and of the strata themselves) may be:

Gender – male / female
Occupation – professional / clerical / other

Any stratification must be done in such a way that the strata are mutually exclusive. Thus, each member of the population must be assigned to exactly one stratum. After the population has been stratified in this way, simple random sampling can be used to generate the complete sample.
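A stratified random sample can be sketched as one simple random sample per stratum. The strata follow the occupation example above, but the stratum sizes, member labels, and the choice to sample 10% of each stratum are assumptions made for illustration.

```python
import random

def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample of sizes[name] members from each stratum."""
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical population, already separated into mutually exclusive strata.
strata = {
    "professional": [f"P{i}" for i in range(200)],
    "clerical": [f"C{i}" for i in range(300)],
    "other": [f"O{i}" for i in range(500)],
}

# Assumed allocation: sample 10% of each stratum.
sizes = {name: len(members) // 10 for name, members in strata.items()}

sample = stratified_sample(strata, sizes, random.Random(1))

for name, members in sample.items():
    print(name, len(members))
```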

A cluster sample is a simple random sample of groups or clusters of elements. Cluster sampling is particularly useful when it is difficult or costly to develop a complete list of the population members (making it difficult and costly to generate a simple random sample). It is also useful whenever the population elements are widely dispersed geographically. However, cluster sampling often increases sampling error, because elements belonging to the same cluster (e.g. households in the same block) are likely to be similar in many respects.
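In a cluster sample the random selection is applied to the groups rather than to the individual elements. The sketch below assumes that every element of a selected cluster is included. The blocks and households are invented.

```python
import random

# Hypothetical clusters: city blocks and the households they contain.
clusters = {
    "block A": ["A1", "A2", "A3"],
    "block B": ["B1", "B2"],
    "block C": ["C1", "C2", "C3", "C4"],
    "block D": ["D1", "D2", "D3"],
}

rng = random.Random(7)  # arbitrary seed for reproducibility

# Simple random sample of clusters, not of households.
chosen_blocks = rng.sample(sorted(clusters), 2)

# Every household in a chosen block enters the sample.
sample = [household for block in chosen_blocks for household in clusters[block]]

print(chosen_blocks, sample)
```

Only a list of blocks is needed to draw the sample, which is the practical advantage when a complete list of households is unavailable.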

The following types of error can arise when a sample of observations is taken from a population:

sampling error, which refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. Sampling error may occur when a researcher makes a statement about a population that is based only on the observations contained in a sample taken from the population.

The difference between the true (unknown) value of the population mean and its estimate, the sample mean, is the sampling error. The only way the expected size of the sampling error can be reduced is to take a larger sample. Given a fixed sample size, the best a researcher can do is to state the probability that the sampling error is less than a certain amount.

nonsampling errors are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly:
- errors in data acquisition, which arise from the recording of incorrect responses. Incorrect responses may be the result of incorrect measurements being taken because of faulty equipment, mistakes made during transcription from primary sources, inaccurate recording of data due to misinterpretation of terms, or inaccurate responses to questions concerning sensitive issues.
- nonresponse error, which refers to error (or bias) introduced when responses are not obtained from some members of the sample. When this happens, the sample observations that are collected may not be representative of the target population, resulting in biased results.
- selection bias, which occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.
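The statement above that a larger sample reduces the expected size of the sampling error can be illustrated with a small simulation. Everything about the setup is an assumption made for the example: a hypothetical population of 10,000 normally distributed values, sample sizes of 10 and 1,000, and 200 repetitions per sample size.

```python
import random

rng = random.Random(0)  # arbitrary seed for reproducibility

# Hypothetical population with a known mean.
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = sum(population) / len(population)

def average_sampling_error(n, repetitions=200):
    """Average absolute difference between sample mean and population mean."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, n)
        sample_mean = sum(sample) / n
        errors.append(abs(sample_mean - population_mean))
    return sum(errors) / repetitions

small_sample_error = average_sampling_error(10)
large_sample_error = average_sampling_error(1000)

print(small_sample_error, large_sample_error)
```

The simulation only addresses sampling error. Nonsampling errors such as selection bias or nonresponse are not reduced by taking a larger sample.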

