Introduction to Statistics

Topics 7 - 10 • Nellie Hedrick

Topic 7 – Displaying and Describing Distributions • Center – the center of a distribution is usually the first and most important feature to describe when analyzing data. • Spread (variability, consistency) – how widely the data are spread out is the second important feature. • Shape – the shape of the distribution is the third important feature.

Symmetric and Skewed Distributions • Skewed to the left • Skewed to the right • Symmetric with a single peak • Symmetric with two peaks

Graphical Representations of Data: Quantitative Variables
Stem plot of the data 21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10
Leaves in data order:
Stem | Leaf
1 | 9 8 0
2 | 1 0 2 5 3 2
3 | 1
4 | 0
Leaves sorted:
Stem | Leaf
1 | 0 8 9
2 | 0 1 2 2 3 5
3 | 1
4 | 0
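
The stem plot for this data set can be reproduced in a few lines of Python. This is only a minimal sketch, assuming two-digit whole numbers so that the tens digit is the stem and the ones digit is the leaf; the function name `stem_plot` is made up for this illustration.

```python
from collections import defaultdict

def stem_plot(values):
    """Group whole numbers by tens digit (stem) and sort the ones digits (leaves)."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # assumption: tens digit is the stem
        stems[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(stems.items())}

data = [21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10]
plot = stem_plot(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting the leaves inside each stem is what turns the first (data-order) plot into the second (ordered) plot.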

Activity 7-5 • Exercise 7-10 • Exercise 7-21

Definition • Side-by-side stem plot – a common set of stems is placed in the middle of the display, with leaves branching out to the left and to the right, one side per group. By convention the leaves are ordered from the middle out, from least to greatest. • Histogram – a graphical display similar to a dot plot or stem plot, but more practical for larger data sets. To construct one: • Divide the range of the data into subintervals (bins) of equal length. • Count the number (frequency) of observational units in each subinterval. • Draw a bar over each subinterval whose height represents the frequency or the proportion (relative frequency) of observational units in that subinterval.
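
The histogram steps (equal-length bins, frequencies, relative frequencies) might be sketched as follows. The data are borrowed from the stem plot example; the bin start of 10, the bin width of 10, and the helper name `histogram_counts` are assumptions chosen for illustration, not part of the slides.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width bin beginning at `start`."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1
    return counts

data = [21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10]
counts = histogram_counts(data, start=10, width=10, n_bins=4)
relative = [c / len(data) for c in counts]

# Text version of the histogram: one row per bin, one '#' per observation.
for i, (c, r) in enumerate(zip(counts, relative)):
    low = 10 + 10 * i
    print(f"{low}-{low + 9}: {'#' * c}  frequency={c}, relative frequency={r:.2f}")
```

With a bin width of 10 the bar heights match the leaf counts in the stem plot, which shows how closely the two displays are related.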

Wrap up, Watch out and in Brief • The direction of skew is indicated by the longer tail. • Pay attention to the units of the stem plot. • Pay attention to outliers: identify them and investigate possible explanations for their occurrence. Make sure an outlier is not simply a typing error. • Remember context! Your description of the data should be clear enough for anyone to read and understand. • Remember to label your graphs. • Examine different types of graphs to see which gives the better representation. • Anticipate features of the data by considering the nature of the variable involved.

Topic 8 – Measures of Center • Mean – the ordinary average. It is calculated by adding all the values and dividing the sum by the number of observational units. • Median – the value of the middle observational unit when the observations are sorted from low to high. • With an odd number of observational units, the median is the value in location (n+1)/2. • With an even number of observational units, the median is the average of the middle two values. • Resistant – a measure is resistant if its value is relatively unaffected by the presence of outliers in a distribution. The median is resistant; the mean is not. • Mode – the value that appears most often in a distribution.
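
The location rule for the median, and the idea of resistance, can be illustrated with a short sketch. The data are the example values used in this topic; replacing the largest value with 400 is a made-up outlier for demonstration only, and the function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the value in location (n + 1) / 2, counting from 1
        return ordered[(n + 1) // 2 - 1]
    # even n: the average of the two middle values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data = [19, 20, 21, 22, 22, 23, 25, 31, 40]
with_outlier = data[:-1] + [400]  # hypothetical outlier replacing 40

print("mean:  ", round(mean(data), 2), "->", round(mean(with_outlier), 2))
print("median:", median(data), "->", median(with_outlier))
```

The mean moves a long way when one value becomes extreme, while the median stays where it was.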

Describing Distributions with Numbers Example: 20, 40, 22, 22, 21, 31, 19, 25, 23 • Mean – the average • Median – the middle value • Mode – the most repeated value • Minimum – the smallest value • Maximum – the largest value in the data set

Describing Distributions with Numbers Example: 20, 40, 22, 22, 21, 31, 19, 25, 23 • Mean • Median • Minimum • Maximum • Mode Sort the data: 19 20 21 22 22 23 25 31 40 Mean: the sum of the values is 223, and 223/9 is about 24.8. Median: there are 9 observations, so the median is in location (9 + 1)/2 = 5; the 5th sorted value is 22. Minimum = 19, Maximum = 40, Mode = 22
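
For readers working on a computer rather than by hand, the same summary can be obtained with Python's standard `statistics` module. This is offered as one possible check of the hand calculation, not as the method the slides prescribe; the dictionary name `summary` is illustrative.

```python
import statistics

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]

summary = {
    "mean": statistics.mean(data),      # sum of the values divided by n
    "median": statistics.median(data),  # middle value of the sorted data
    "mode": statistics.mode(data),      # most frequently occurring value
    "minimum": min(data),
    "maximum": max(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```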

Describing Distributions with Numbers Example: 20, 40, 22, 22, 21, 31, 19, 25, 23 • Mean • Median • Minimum • Maximum • Mode TI-83: Press [STAT] [1:Edit] and enter all the data from the example into L1, pressing [ENTER] after each entry. After completing data entry, press [2nd] [QUIT], then [STAT] [CALC] [1:1-Var Stats] [2nd] [L1] [ENTER]. Use the down (or up) arrow key to view all the information.

Median and Mean of a Density Curve [Figure: three density curves labeled symmetric, skewed right, and skewed left, each marking the positions of the mean, median, and mode.]

Wrap up and Warning • Center is a property of a distribution. Mean and median are two ways to measure center; neither one is synonymous with center. Each has its own properties and strengths. • Center is only one aspect of a distribution of data. Measures of center do not tell the whole story. Other important features are spread, shape, clusters and outliers. • The mode applies to categorical as well as quantitative variables. • The notion of center as measured by the mean or median does not make sense for categorical variables.

Exercise 8-7 page 161 • Exercise 8-9 page 161 • Exercise 8-17 page 163

Topic 9 – Measures of Spread • Range – the difference between the maximum and the minimum • Lower quartile (Q1) – the value at the ¼ (25%) location of the sorted data • Upper quartile (Q3) – the value at the ¾ (75%) location of the sorted data • Interquartile range (IQR) – the difference between the upper and lower quartiles
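
These measures can be computed as in the sketch below. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when n is odd); other tools may use slightly different rules. The data are the example values from Topic 8 and the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (needs at least 2 values).

    Assumed convention: when n is odd, the overall median belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1

print("range =", data_range)
print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```

Note that the range and the IQR each come out as a single number, not as an interval.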

Measuring the Spread • Variance (s²) – essentially the average of the squared deviations of the observations from their mean (for a sample, the sum of squared deviations is divided by n - 1). • Standard deviation (s) – the square root of the variance. It measures the spread of a distribution about its mean.
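
As a worked illustration, the sample standard deviation can be computed step by step. The sketch assumes the usual sample formula with n - 1 in the denominator and reuses the example data from Topic 8; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
variance = sample_variance(data)
s = sample_std_dev(data)

print(f"variance = {variance:.3f}")
print(f"standard deviation = {s:.3f}")
```

For these data the variance is about 44.9 and the standard deviation is about 6.7.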

Describing Distributions with Numbers • Be aware that various software packages and calculators might use slightly different rules for calculating quartiles. • It can be tempting to regard the range and IQR as intervals of values, but each should be reported as a single number that measures the spread of the distribution. • Measures of spread apply only to quantitative variables, not categorical ones.

Activity 9-5 page 182 • Exercise 9-12 page 190 • Exercise 9-22 page 193

Watch out • Variability can be a tricky concept to grasp, but it is absolutely fundamental to working with data. • When looking at the distribution of a variable, make sure to focus on variability in the horizontal values (the variable) and not in the heights (frequencies). • A larger number of distinct values in a histogram does not necessarily indicate greater variability. Consider how far the values fall from the center rather than the variety of their exact numerical values.

Mound-Shaped Distribution – Empirical rule (the 68-95-99.7 rule) • 68% of the data fall within one standard deviation of the mean • 95% of the data fall within two standard deviations of the mean • 99.7% of the data fall within three standard deviations of the mean

Attendance at a university's basketball games follows a normal distribution with mean µ = 8,000 and standard deviation σ = 1,000. Use the 68–95–99.7 rule and give your answer as a percent. • Estimate the percentage of games that have between 6,000 and 8,000 people in attendance. • Estimate the percentage of games that have more than 7,000 people in attendance. • Estimate the percentage of games that have fewer than 6,000 people in attendance. • Estimate the percentage of games that have fewer than 8,000 people in attendance. • Estimate the percentage of games that have fewer than 5,000 people in attendance. • Estimate the percentage of games that have more than 10,000 people in attendance.

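Mound-Shaped Distribution – Empirical rule (the 68-95-99.7 rule) • 68% of the data fall within one standard deviation of the mean • 95% of the data fall within two standard deviations of the mean • 99.7% of the data fall within three standard deviations of the mean • Splitting the curve into bands one standard deviation wide: 34% of the data fall in each band between the mean and one standard deviation, 13.5% in each band between one and two standard deviations, 2.35% in each band between two and three standard deviations, and 0.15% in each tail beyond three standard deviations.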
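
The band percentages can be added up to answer questions such as the basketball attendance exercise (mean 8,000, standard deviation 1,000). The sketch below is one way to organize that bookkeeping; the cumulative table is derived from the bands, the function names are made up, and it only handles values that sit a whole number of standard deviations from the mean.

```python
# Approximate percent of the data lying below the point z standard deviations
# from the mean, obtained by adding up the 0.15 / 2.35 / 13.5 / 34 percent bands.
PERCENT_BELOW = {-3: 0.15, -2: 2.5, -1: 16.0, 0: 50.0, 1: 84.0, 2: 97.5, 3: 99.85}

def percent_below(x, mean, sd):
    z = (x - mean) / sd
    if z != int(z) or int(z) not in PERCENT_BELOW:
        raise ValueError("the rule covers only whole standard deviations from -3 to 3")
    return PERCENT_BELOW[int(z)]

def percent_above(x, mean, sd):
    return 100 - percent_below(x, mean, sd)

def percent_between(low, high, mean, sd):
    return percent_below(high, mean, sd) - percent_below(low, mean, sd)

MEAN, SD = 8000, 1000
answers = {
    "between 6,000 and 8,000": percent_between(6000, 8000, MEAN, SD),
    "more than 7,000": percent_above(7000, MEAN, SD),
    "fewer than 6,000": percent_below(6000, MEAN, SD),
    "fewer than 8,000": percent_below(8000, MEAN, SD),
    "fewer than 5,000": percent_below(5000, MEAN, SD),
    "more than 10,000": percent_above(10000, MEAN, SD),
}
for question, percent in answers.items():
    print(f"{question}: {percent}%")
```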

The Standard Normal Distribution • As the 68-95-99.7 rule suggests, all normal distributions share a common property. • Z-score – computing a z-score is the process of standardization. If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is z = (x - µ) / σ.

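Calculating Standard Normal Z Example: Calculate the standard normal value for x = 120, where the mean is µ = 170 and the standard deviation is σ = 30. z = (120 - 170) / 30 ≈ -1.67 [Figure: a normal curve with µ = 170 and σ = 30 with x = 120 marked, next to the standard normal curve with µ = 0 and σ = 1 with z = -1.67 marked.]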
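
The same standardization can be written as a one-line function. This is a small sketch of the z-score calculation for the example above; the function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(120, mean=170, sd=30)
print(round(z, 2))  # -1.67
```

A negative z-score means the observation lies below the mean; here 120 is about 1.67 standard deviations below 170.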

Normal Distribution • Same mean, but different standard deviations (S2 < S1): the curve with the larger standard deviation has the larger spread. [Figure: two normal curves with the same mean, labeled S2 and S1.]

The length of human pregnancies from conception to birth is known to be normally distributed with a mean of 266 days and standard deviation of 16 days. 1. What proportion of pregnancies last between 250 and 282 days? 2. What proportion of pregnancies last between 232 and 282 days?
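
The first interval is exactly one standard deviation on either side of the mean, so the empirical rule applies directly. The second starts at 232 days, which is not a whole number of standard deviations below 266, so a normal table or software is needed. One possible check uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `proportion_between` is made up for this sketch.

```python
from statistics import NormalDist

# Pregnancy length model from the problem: mean 266 days, standard deviation 16 days.
pregnancy = NormalDist(mu=266, sigma=16)

def proportion_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

p1 = proportion_between(pregnancy, 250, 282)
p2 = proportion_between(pregnancy, 232, 282)

print(f"between 250 and 282 days: {p1:.3f}")
print(f"between 232 and 282 days: {p2:.3f}")
```

The first proportion comes out near 0.68, in line with the empirical rule, and the second is roughly 0.82.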

Wrap up • In the study of variability, you see that even if two data sets have similar centers, the spread of the values might differ substantially. • The z-score is a useful tool when you are comparing two or more data sets. It serves as a ruler for measuring distances from the mean in units of standard deviations. • Variability is a property of a distribution; standard deviation and IQR are two ways to measure variability. • The standard deviation, like the mean absolute deviation, can be loosely interpreted as the typical deviation of an observation from the mean.

Topic 10 – More Summary Measures and Graphs • Five-number summary (FNS) – the FNS provides a quick and convenient description of where the four quarters of the data in a distribution fall. It consists of: • Median • Quartiles (Q1, Q3) • Extremes (min, max) • Box plot – the FNS forms the basis for a graph called a box plot. Box plots are especially useful for comparing distributions of a quantitative variable across two or three groups.

Measuring the Center and Spread • Five-number summary • Mean and standard deviation Choosing a Summary • Five-number summary – for a skewed distribution or a distribution with outliers • Mean and standard deviation – for a symmetric distribution

The Five-Number Summary • Maximum • Q3 • Median • Q1 • Minimum [Figure: box plot]

Modified box plot • A modified box plot conveys additional information by treating outliers differently. On these graphs each outlier is marked with a special symbol, and each whisker extends only to the most extreme non-outlier. • We call any observation falling more than 1.5 times the IQR away from the nearer quartile an outlier.
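
The five-number summary and the 1.5 × IQR rule can be sketched as follows. The data are the example values from Topic 8, and the quartiles use one common convention (medians of the lower and upper halves, leaving out the overall median when n is odd); a calculator or software package may give slightly different quartiles. All function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least 2 values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])   # assumed convention: median of the lower half
    q3 = median(ordered[-half:])  # assumed convention: median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def find_outliers(values):
    """Observations more than 1.5 * IQR beyond the nearer quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
fns = five_number_summary(data)
outliers = find_outliers(data)
non_outliers = [x for x in data if x not in outliers]

print("five-number summary:", fns)
print("outliers:", outliers)
print("whiskers end at:", min(non_outliers), "and", max(non_outliers))
```

Under these assumptions the value 40 is flagged as an outlier, so in a modified box plot the upper whisker would stop at 31 and 40 would be drawn as a separate symbol.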

Activity 10-1 page • Exercise 10-22 page 217

Watch out and Wrap up • Box plots can be tricky to read and interpret. A box plot only shows how the data are divided into four pieces, each containing about 25% of the data. • Box plots and modified box plots are nice tools for comparing groups. Make sure to use the same scale for every group.
