Chapter 3. Descriptive Statistics: Numerical Methods
Our goal? Numbers to help us answer simple questions. • What is a typical value? • How variable are the data? • How extreme is a particular value? • Given data on two variables, how closely do they move together?
Measures of Central Tendency • Here are three ways to identify a “typical” observation: • Mean – the arithmetic average • Median – the middlemost value • Mode – the most common value
There are formulas, but . . . One confusing thing about the formulas is all the notation they use. To explain why we need the notation, and why you need to know it, let me remind you of an important distinction . . .
Population vs. Sample • A population is the set of all data that characterize some phenomenon, and a number computed from population data is called a parameter. • A sample is a subset of a population, and a number computed from sample data is called a statistic.
An Example • Population – All registered voters. • Parameter – The fraction of all registered voters who prefer John McCain to Hillary Clinton. • Sample – 2500 voters surveyed by Gallup. • Statistic – The fraction of voters in the poll who prefer McCain to Clinton.
Another Example • Population – All Duracell AA batteries. • Parameter – The average lifetime of all Duracell AA batteries in a particular toy. • Sample – A hundred batteries being tested by the manufacturer. • Statistic – The average lifetime of the 100 tested batteries in the particular toy.
Why is the distinction important? • Sample statistics are very different from population parameters. • Parameters are fixed numbers. • Before the sample is drawn, a statistic depends on which elements happen to be selected, so it is random. • Once a sample is drawn, the statistic is likely to differ from the parameter; for example, 48% of the population but 51% of the sample may prefer Clinton. • Therefore, our notation must clearly distinguish sample statistics from population parameters.
Now let us return . . . To those measures of central tendency.
The Sample Mean • The sample mean is the arithmetic mean of some sample data. • The notation for a sample mean is X-bar. • The notation for sample size is a lower case n.
The Population Mean • The population mean is the arithmetic mean of some population data. • The notation for a population mean is the Greek letter mu. • The notation for population size is an upper case N.
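The formulas these two slides describe did not survive in this transcript. Assuming the standard definitions, which match the notation named above (X-bar and n for a sample, mu and N for a population), they would read:

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\qquad\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} X_i
```

The arithmetic is identical in both cases; only the symbols change, to signal whether the data are a sample or a population.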
And THIS is the summation operator • Here it is just telling us to add the observations.
Don’t be intimidated by the summation operator • It is just shorthand; it saves space. • The summation operator is just the Greek letter Sigma. • Sigma is Greek for S, and S stands for Sum. • S for Sum, meaning “add them all up.”
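To make "add them all up" concrete, here is a minimal Python sketch. The function name `sigma` and the three observations are hypothetical, chosen only for illustration.

```python
def sigma(values):
    """Add the observations one at a time, as the summation operator instructs."""
    total = 0
    for x in values:
        total += x
    return total


observations = [4, 7, 10]  # hypothetical observations, not from the slides
print(sigma(observations))  # 21
```

Python's built-in `sum` does the same job; the loop is spelled out only to show that nothing more mysterious is going on.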
Formulas with summation operators confuse you? • Consult Anderson, Sweeney, and Williams, Appendix C, and memorize the rules listed there, or • Do what I do, which is figure it out as you go along. • For example . . .
Let’s go back to our formulas • Suppose you are given the following data.
Sample or population data? • It depends on the context. • Suppose this data came from asking 13 people the number of computer games they own. • If you are investigating the number of games owned by these particular 13 people, then this is population data. • If you are investigating the number of games owned by a larger group, and these 13 people are members of that group, then this is sample data. • In homework problems, the default is sample data.
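The data table from the slide is missing from this transcript. The sketch below therefore uses a hypothetical list of 13 answers, constructed only so that it reproduces the summary values the deck quotes (13 observations and a mean of 9); it is not the original data.

```python
# Hypothetical data: 13 answers to "how many computer games do you own?"
# Built to match the summaries quoted in the deck; NOT the original data.
games = [2, 3, 3, 4, 5, 7, 8, 9, 11, 13, 14, 17, 21]


def mean(values):
    """Arithmetic mean: add the observations, then divide by how many there are."""
    return sum(values) / len(values)


print(len(games), mean(games))  # 13 9.0
```

Whether the result is called X-bar or mu depends on the context described above, not on the calculation.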
The median in this example • Since the median is the middlemost observation, one way to find it is to order the observations and throw away observations one at a time from each end, until one is left in the middle. • In this case, 8.
Suppose you have an even number of observations! • In that case, you will be left with two numbers in the middle when you have finished eliminating numbers from both the top and bottom. • The median is found by adding those two numbers and dividing by two.
The mode in this example • Here is our data. • Note that the only observation that appears twice is “3”, making it the mode, or most common observation. • But 3 is not a very typical observation, which is why the mode is hardly ever used.
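Counting how often each value occurs is enough to find the mode. The sketch below uses hypothetical data in which 3 is the only repeated value, as the slide describes; the function name is also an assumption.

```python
from collections import Counter


def mode_with_count(values):
    """Return (most common value, number of times it appears)."""
    return Counter(values).most_common(1)[0]


# Hypothetical data, NOT the original slide data.
games = [2, 3, 3, 4, 5, 7, 8, 9, 11, 13, 14, 17, 21]
print(mode_with_count(games))  # (3, 2)
```

With a mean of 9 and a median of 8 for the same list, a mode of 3 shows why the most common value need not be a typical one.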
Mean vs. Median • In 1983, the average starting salary of Rhetoric and Communications majors at the University of Virginia was approximately $35,000 a year, far more than that of other majors in the College of Arts and Sciences. • Can you guess why?
Here is your answer! • Ralph Sampson, a Rhetoric and Communications major, was the first pick in the NBA draft. The Houston Rockets paid him $2,000,000 a year.
Robust Statistics • A statistic is said to be robust if it is not dramatically affected by a small number of extreme observations. • The median is robust, the mean is not. • Therefore the median is usually a better indication of a “typical value.”
How the mean and median differ • If a distribution is not symmetric . . . • And there are a handful of extremely large or small values . . . • The mean will be pulled in the direction of the extreme values. • The Ralph Sampson story illustrates the problem.
Asymmetric, skewed to the right • The median income is marked on the graph, at about $22,000 a year. • The mean is not reported, but it appears to be about $30,000 a year.
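A small numerical sketch of the pull an extreme value exerts. The five ordinary salaries are hypothetical; only the $2,000,000 figure comes from the Ralph Sampson story.

```python
from statistics import mean, median

# Hypothetical starting salaries in dollars (not real survey data).
salaries = [18_000, 20_000, 22_000, 24_000, 26_000]
# The same group plus one $2,000,000 contract, the figure quoted in the story.
with_star = salaries + [2_000_000]

print(mean(salaries), median(salaries))    # both are 22000
print(mean(with_star), median(with_star))  # mean jumps past 350,000; median is 23000.0
```

One observation out of six moves the mean by more than an order of magnitude, while the median shifts by only $1,000.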
Many people use statistics the way a drunk uses a lamp post • For support . . . • Not for Illumination.
And they play games with the mean and median • An incumbent politician will boast of how well the economy is doing, and use mean income numbers as evidence. • The challenger will complain of how badly the economy is doing, and use median income numbers as evidence. • The result: confused voters.
Measures of Variability • Range • Interquartile Range • Variance • Standard Deviation • Coefficient of Variation
The Range • The Range of a data set is just the difference between the biggest and smallest observation. • The Range is easy to compute, but it is not robust, and therefore may be misleading. • As in: “Starting salaries for Rhetoric and Communications majors range from $18,000 to $2,000,000 a year.”
An Example • Let’s use the earlier data to illustrate.
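The worked numbers for this slide are missing from the transcript, so the sketch below computes the range for a hypothetical 13-value data set (not the original slide data) and then shows how a single extreme observation changes it.

```python
def data_range(values):
    """Difference between the biggest and smallest observation."""
    return max(values) - min(values)


# Hypothetical data, NOT the original slide data.
games = [2, 3, 3, 4, 5, 7, 8, 9, 11, 13, 14, 17, 21]
print(data_range(games))          # 19
print(data_range(games + [500]))  # 498: one extreme value dominates the range
```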
Interquartile Range (IQR) • This is the spread of the middle 50% of the observations. • It is defined as Q3 – Q1 • Q3 is the third quartile, or 75th percentile. 75% of all observations are smaller than Q3. • Q1 is the first quartile, or 25th percentile. 25% of all observations are smaller than Q1. • (Q2 is the second quartile, or median.)
How do you find quartiles? • Basically, to find Q3, the 75th percentile, order the data and throw away 3 observations from the bottom for every one from the top. Here, Q3 is 13.
Q1 works the same way • To find Q1, the 25th percentile, order the data and throw away 1 observation from the bottom for every 3 from the top. Here Q1 is 4, so IQR = Q3 – Q1 = 13 – 4 = 9.
I cheated a bit to make it simple • With 13 observations, eliminating observations in this way leaves you with just one observation remaining. • If the number of observations you have is not of the form 4k+1 for some whole number k, there will be two, three, or four observations remaining. • Then you must round or interpolate.
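The throw-away shortcut can be sketched for the easy case of 4k+1 observations. The function name is hypothetical, and the data are hypothetical values chosen to reproduce the Q1 = 4 and Q3 = 13 quoted above.

```python
def quartiles_by_trimming(values):
    """Q1 and Q3 by the throw-away shortcut; only valid for 4k+1 observations."""
    ordered = sorted(values)
    if len(ordered) % 4 != 1:
        raise ValueError("shortcut needs 4k+1 observations; otherwise round or interpolate")
    k = len(ordered) // 4
    q1 = ordered[k]      # k discarded from the bottom, 3k from the top
    q3 = ordered[3 * k]  # 3k discarded from the bottom, k from the top
    return q1, q3


# Hypothetical data (13 = 4*3 + 1 observations), NOT the original slide data.
games = [2, 3, 3, 4, 5, 7, 8, 9, 11, 13, 14, 17, 21]
q1, q3 = quartiles_by_trimming(games)
print(q1, q3, q3 - q1)  # 4 13 9
```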
A, S, & W propose this solution • Arrange data in ascending order. • Compute i = (p/100)n, where p is the percentile you seek and n is the sample size. • If i is an integer, average the ith and (i+1)st observations. • If i is not an integer, round up; the observation in that position is the percentile.
An Example finding Q3 • Here p = 75, n = 6. • Which gives i = (75/100)(6) = 4.5. • Which is not an integer. • So round up to 5. • The 5th observation is 9, so Q3 = 9.
An Example finding Q1 • Here p = 25, n = 8. • Which gives i = (25/100)(8) = 2. • Which IS an integer. • So average the second and third observations. • To get (5+6)/2 = 5.5 • So Q1 = 5.5
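The rule from the two examples above can be sketched in Python. The function name is hypothetical, p is assumed to be a whole number strictly between 0 and 100, and both data sets are hypothetical: they are chosen only so that the 5th of six values is 9 and the 2nd and 3rd of eight values are 5 and 6, as in the examples.

```python
def asw_percentile(values, p):
    """Percentile by the i = (p/100) * n rule; assumes integer p with 0 < p < 100."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)  # i = whole + remainder/100
    if remainder == 0:
        # i is an integer: average the ith and (i+1)st observations (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not an integer: round up to position whole + 1 (1-based).
    return ordered[whole]


six = [3, 5, 6, 8, 9, 10]             # hypothetical data, n = 6
eight = [4, 5, 6, 7, 9, 10, 12, 15]   # hypothetical data, n = 8
print(asw_percentile(six, 75))    # 9
print(asw_percentile(eight, 25))  # 5.5
```

Integer arithmetic (`divmod`) is used for the "is i an integer?" test so that the answer never depends on floating-point rounding.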
But this is an arbitrary convention • Minitab uses a different rule. • In our first example, where we got Q3 = 9, Minitab gets Q3 = 9.25. • In our second example, where we got Q1 = 5.5, Minitab gets Q1 = 5.25.
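The slide does not say what the different rule is. One rule that reproduces both quoted values is linear interpolation at position (n+1)p/100; attributing it to Minitab is an assumption here, not something the slide confirms. The data sets are hypothetical, chosen so that the 5th and 6th of six values are 9 and 10 and the 2nd and 3rd of eight values are 5 and 6.

```python
def interpolated_percentile(values, p):
    """Percentile by linear interpolation at 1-based position (n + 1) * p / 100."""
    ordered = sorted(values)
    position = (len(ordered) + 1) * p / 100
    lower = int(position)          # whole part of the position
    fraction = position - lower    # how far to move toward the next observation
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


six = [3, 5, 6, 8, 9, 10]             # hypothetical data, n = 6
eight = [4, 5, 6, 7, 9, 10, 12, 15]   # hypothetical data, n = 8
print(interpolated_percentile(six, 75))    # 9.25
print(interpolated_percentile(eight, 25))  # 5.25
```

Both rules are defensible; with large data sets the difference between them is usually negligible.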
Variance of a Population • The variance is the average size of a squared deviation about the mean. • Lower-case sigma squared is population variance. • Note the use of mu and N in the formula: all these are population parameters.
Variance of a Sample • Lower-case s-squared denotes sample variance. • Note the use of X-bar and n in the formula: these are sample statistics. • Also note the funky denominator, n-1, where you would expect to see n.
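The two variance formulas referred to on these slides are missing from the transcript. Assuming the standard definitions, which agree with the notation described (sigma squared, mu, and N for a population; s-squared, X-bar, and n-1 for a sample), they would read:

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
\qquad\qquad
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
```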
Why use n-1 with sample data? • A sophisticated explanation is coming in Chapter 7, but think of it as a fudge factor. • Having to compute squared deviations around the sample mean instead of the true population mean makes the numerator too small. • Dividing by n-1 corrects for this.
Example of the Calculation • The heart of the calculation is evaluating the numerator. • Here is our example. • Remember, the mean is 9.
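The fudge factor can be checked exactly on a toy case. The sketch below takes a hypothetical five-value population, lists every possible sample of size 2 drawn with replacement, and averages the two candidate variance formulas over all of them. Exact fractions are used so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]  # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # true population variance


def sum_sq_dev(sample):
    """Sum of squared deviations about the SAMPLE mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 25 ordered samples
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_div_n, avg_div_n_minus_1)  # 2 1 2
```

Dividing by n gives, on average, half the true variance in this toy case; dividing by n-1 hits it exactly.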
Finishing the Variance calculation • Given the sum of squared deviations from the mean, the calculation is as follows: • Divide by n-1 for sample data. • Divide by N for population data.
The Standard Deviation • The variance measures variability in nonsense units; in this case, number of computer games squared. • To correct this, we introduce the standard deviation, which is just the square root of the variance. • The standard deviation can be thought of as the size of a typical deviation from the mean.
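Taking the square root returns the measure to the original units (computer games, not games squared). A minimal sketch using Python's standard `statistics` module on hypothetical data:

```python
import math
import statistics

# Hypothetical data, NOT the original slide data.
games = [2, 3, 3, 4, 5, 7, 8, 9, 11, 13, 14, 17, 21]

sample_sd = math.sqrt(statistics.variance(games))       # variance divides by n-1
population_sd = math.sqrt(statistics.pvariance(games))  # pvariance divides by N

print(round(sample_sd, 2), round(population_sd, 2))
```

`statistics.stdev` and `statistics.pstdev` return the same two numbers directly; the square root is written out here to mirror the definition.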
Coefficient of Variation • Seldom used in this course. • Answers: “The standard deviation is what percent of the average?” • Why is this useful? • An inch more or less in the height of a skyscraper is meaningless. • An inch more or less in the length of your nose is a big deal.
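The skyscraper-versus-nose comparison can be put in numbers. Both data sets below are hypothetical measurements in inches, chosen so that each has a standard deviation of exactly one inch.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


skyscraper_heights = [11_999, 12_000, 12_001]  # hypothetical, inches
nose_lengths = [1, 2, 3]                       # hypothetical, inches

print(coefficient_of_variation(skyscraper_heights))  # well under 0.01 percent
print(coefficient_of_variation(nose_lengths))        # 50.0
```

The same one-inch spread is negligible relative to a 1,000-foot building and enormous relative to a 2-inch nose.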