Chapter 3

Chapter 3 Summarizing Data

Graphical Methods - 1 Variable • After data are collected, they are sorted into categories/ranges of values so that each individual observation falls in exactly one category/range • Numeric Responses: Break the “range” of values into non-overlapping bins and count the number of units in each bin • Categorical Responses: List all possible categories (with “Other” if needed), and count the number of units in each • Pie Chart: Displays percent in each category/range • Bar Chart: Displays frequency/percent per category • Histogram: Displays frequency/percent per “range”
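The counting step behind a pie chart or bar chart can be sketched in a few lines of Python. This is only an illustration: the responses below are made up, and the names `responses`, `counts` and `percents` are not from the slides.

```python
from collections import Counter

# Made-up categorical responses, for illustration only
responses = ["A", "B", "A", "C", "A", "B", "Other", "A"]

# Frequency per category: each observation falls in exactly one category
counts = Counter(responses)

# Percent per category: what a pie chart (or a percent bar chart) displays
n = len(responses)
percents = {category: 100 * k / n for category, k in counts.items()}

for category, k in counts.most_common():
    print(f"{category:>5}: {k} ({percents[category]:.1f}%)")
```

Because every observation is counted exactly once, the frequencies sum to the sample size and the percents sum to 100.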

Constructing Pie Charts • Select a small number of categories (say 5 or 6 at most) to avoid many narrow “slivers” • If possible, arrange categories in ascending or descending order for categorical variables

Monthly Philly Rainfall 1825-1869 (1/100 in)

Constructing Bar Charts • Put frequencies on one axis (typically vertical, unless many categories) and categories on other • Draw rectangles over categories with height=frequency • Leave spaces between categories

Constructing Histograms • Used for numeric variables, so need Class Intervals • Let Range = Largest - Smallest Measurement • Break range into (say) 5-20 intervals depending on sample size • Make the width of the subintervals a convenient unit, and make “break points” so that no observations fall on them • Obtain Class Frequencies, the number in each subinterval • Obtain Relative Frequencies, proportion in each subinterval • Construct Histogram • Draw bars over each subinterval with height representing class frequency or relative frequency (shape will be the same) • Leave no space between bars to imply adjacency of class intervals
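As a rough sketch of the steps above, the following Python computes class frequencies and relative frequencies and prints a text histogram. The data, the starting point 9.5 and the width 10 are made-up choices, and `class_frequencies` is a hypothetical helper, not something defined in the slides.

```python
import math

def class_frequencies(values, start, width, k):
    """Count values in k adjacent intervals [start + i*width, start + (i+1)*width)."""
    freq = [0] * k
    for v in values:
        i = math.floor((v - start) / width)
        if not 0 <= i < k:
            raise ValueError(f"{v} lies outside the class intervals")
        freq[i] += 1
    return freq

# Made-up measurements, for illustration only
data = [12, 15, 21, 22, 24, 27, 31, 33, 38, 44]

# Break points 9.5, 19.5, ..., 49.5 are chosen so no observation falls on one
freq = class_frequencies(data, start=9.5, width=10, k=4)
rel_freq = [f / len(data) for f in freq]

# Text histogram: bar length is the class frequency
for i, (f, r) in enumerate(zip(freq, rel_freq)):
    low = 9.5 + 10 * i
    print(f"{low:5.1f} to {low + 10:5.1f} | {'#' * f} ({f}, {r:.2f})")
```

Plotting `rel_freq` instead of `freq` rescales the bar heights but leaves the shape unchanged.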

Interpreting Histograms • Probability: Heights of bars over the class intervals are proportional to the “chances” an individual chosen at random would fall in the interval • Unimodal: A histogram with a single major peak • Bimodal: Histogram with two distinct peaks (often evidence of two distinct groups of units) • Uniform: Interval heights are approximately equal • Symmetric: Right and Left portions are same shape • Right-Skewed: Right-hand side extends further • Left-Skewed: Left-hand side extends further

Stem-and-Leaf Plots • Simple, crude approach to obtaining shape of distribution without losing individual measurements to class intervals. Procedure: • Split each measurement into 2 sets of digits (stem and leaf) • List stems from smallest to largest • Line up the corresponding leaves beside their stems, ordered from smallest to largest • If too cramped/narrow, break stems into two groups: low with leaves 0-4 and high with leaves 5-9 • When numbers have many digits, trim off right-most (less significant) digits. Leaves should always be a single digit.
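The procedure can be sketched in Python for two-digit (or longer) non-negative integers, taking the last digit as the leaf. The data are made up and `stem_and_leaf` is a hypothetical helper. Splitting stems into low/high groups is not shown.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its sorted leaves; the leaf is the last digit."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Made-up non-negative integer measurements, for illustration only
data = [23, 35, 21, 47, 38, 29, 35, 41]
plot = stem_and_leaf(data)

# List stems from smallest to largest, including any empty stems in between
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(d) for d in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```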

Time Series Plots • Many datasets represent a single variable measured on a single unit at different time points • When measurements are made at equally spaced time points, goal is often to describe temporal variation • Annual measurements can reveal long-term trends • Sub-annual (weekly, monthly, quarterly) measurements can reveal long-term trends as well as seasonal fluctuations • Plots generally have measurement on vertical axis and time period on horizontal. • Some plots include bars around points to represent fluctuations within that time period

Numerical Descriptive Measures • Numeric summaries of a set of measurements • Measures of Central Tendency describe the “location” or center of a set of measurements • Measures of Variability describe the “spread” or dispersion of a set of measurements • Parameters: Numeric descriptive measures based on Populations of measurements • Statistics: Numeric descriptive measures based on Samples of measurements

Measures of Central Tendency - I • Mode: Most often occurring outcome (typically only of interest for variables taking on only “discrete” values) • Median: Middle value when measurements ordered from smallest to largest • Mean: Sum of all measurements, divided by total number of measurements (equal distribution of total) • In practice, we observe only a sample, and use the sample mean ȳ to estimate the population mean μ
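A small Python sketch of the three measures, using the standard `statistics` module on made-up data (the values are illustrative only, with one large value included on purpose):

```python
import statistics

# Made-up sample with one very large value, for illustration only
data = [2, 3, 3, 5, 7, 10, 40]

mode = statistics.mode(data)      # most often occurring value
median = statistics.median(data)  # middle value of the ordered data
mean = statistics.mean(data)      # sum divided by number of measurements

print(f"mode={mode}, median={median}, mean={mean}")
```

Here the single large value pulls the mean above the median, while the median is unaffected by how large that value is.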

Example - Philadelphia Rainfall Note: The mean is higher than the median because a few very large amounts were observed.

Measures of Central Tendency - II • Outlier: Individual measurement(s) falling far away from others. Can have large effect on mean, not median • Trimmed Mean (TM): Mean that is based on center measurements (deleting extreme measurements). • Mode: For continuous (smooth) distributions, mode is value corresponding to the peak of the frequency curve • Skewness: Shape of the distribution: • Mound-Shaped Distributions: Mode ≈ Median ≈ Mean ≈ TM • Right-Skewed Distributions: Mode < Median < TM < Mean • Left-Skewed Distributions: Mean < TM < Median < Mode
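One way to sketch a trimmed mean in Python is to drop a fixed proportion of the ordered measurements from each end and average the rest. The 10% trimming proportion, the data and the helper name `trimmed_mean` are assumptions made for this illustration.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after deleting the lowest and highest `proportion` of the ordered values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number deleted from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up right-skewed sample, for illustration only
data = [2, 3, 3, 4, 5, 5, 6, 7, 10, 40]

median = statistics.median(data)
tm = trimmed_mean(data, 0.10)
mean = statistics.mean(data)
print(f"median={median}, trimmed mean={tm}, mean={mean}")
```

For this right-skewed sample the results follow the ordering Median < TM < Mean given above.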

Measures of Variability - I • Variability: Magnitude of dispersion in data. • Range: Difference between largest and smallest measurements in a set. • pth-Percentile: Value that has at most p% of measurements below it, and at most (100-p)% above it (0<p<100) • Lower Quartile = 25th Percentile (Q1) • Median = 50th Percentile (Q2) • Upper Quartile = 75th Percentile (Q3) • Interquartile Range: Difference between the upper and lower quartiles (measures the amount of spread in the middle 50% of ordered measurements). IQR = Q3-Q1
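The range, quartiles and IQR can be sketched with Python's `statistics.quantiles`. Software packages use several slightly different percentile conventions. The "inclusive" method below is one of them and is an assumption here, not necessarily the convention behind the numbers on these slides. The data are made up.

```python
import statistics

# Made-up measurements, for illustration only
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_range = max(data) - min(data)

# n=4 splits the ordered data into quarters and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"range={data_range}, Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```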

Measures of Variability - II • Deviation: Distance between an individual measurement and the group mean: y − ȳ • Variance: “Average” squared deviation • Standard Deviation: Square root of the variance (in the data’s units) • Empirical rule (measurements with mound-shaped histogram) • Approximately 68% of measurements lie within 1 SD of mean • Approximately 95% of measurements lie within 2 SD of mean • Virtually all of measurements lie within 3 SD of mean
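A short Python sketch of these quantities on made-up data follows. It shows both the population variance (dividing by N) and the sample variance (dividing by n-1), and a hypothetical helper `fraction_within` for checking how many measurements lie within k standard deviations of the mean. The example data are not mound-shaped, so the fractions need not match the empirical rule.

```python
import statistics

def fraction_within(values, k):
    """Fraction of values within k population SDs of their mean."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mu) <= k * sd]
    return len(inside) / len(values)

# Made-up measurements, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
deviations = [v - mean for v in data]   # these always sum to zero
pop_var = statistics.pvariance(data)    # sum of squared deviations / N
samp_var = statistics.variance(data)    # sum of squared deviations / (n - 1)
pop_sd = statistics.pstdev(data)        # square root of the population variance

print(f"mean={mean}, population variance={pop_var}, population SD={pop_sd}")
print(f"sample variance={samp_var:.3f}")
print(f"within 1 SD: {fraction_within(data, 1):.0%}")
print(f"within 2 SD: {fraction_within(data, 2):.0%}")
```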

Example - Philadelphia Rainfall (Population) Note: 383 (71%) months lie within 1σ of μ and 518 (96%) within 2σ

Boxplots • Graph showing the spread of a set of measurements, highlighting quartiles and outliers. • Constructing a boxplot: • Draw box with top at Q3, bottom at Q1, and line crossing at median (Q2). Height of box is IQR = Q3 - Q1 • Compute “lower inner fence” = Q1-1.5(IQR) = LIF • Compute “upper inner fence” = Q3+1.5(IQR) = UIF • Compute “lower outer fence” = Q1-3.0(IQR) = LOF • Compute “upper outer fence” = Q3+3.0(IQR) = UOF • Draw line from Q3 to the largest y value that does not exceed UIF. Place ‘*’ for any y values between UIF and UOF, ‘o’ for any above UOF • Draw line from Q1 to the smallest y value that is not below LIF. Place ‘*’ for any y values between LIF and LOF, ‘o’ for any below LOF

UIF = 468+1.5(232.25) = 816.375 UOF = 468+3(232.25) = 1164.75
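The fence arithmetic can be checked with a short Python sketch. Q3 = 468 and IQR = 232.25 are taken from the calculation above. Q1 = 235.75 is inferred here as Q3 - IQR and is not stated on the slide. The helper names `fences` and `classify` and the values passed to `classify` are hypothetical.

```python
def fences(q1, q3):
    """Return (LOF, LIF, UIF, UOF) from the lower and upper quartiles."""
    iqr = q3 - q1
    return (q1 - 3.0 * iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr, q3 + 3.0 * iqr)

def classify(y, q1, q3):
    """Label a value with the boxplot symbols: '*' between fences, 'o' beyond outer."""
    lof, lif, uif, uof = fences(q1, q3)
    if y < lof or y > uof:
        return "o"
    if y < lif or y > uif:
        return "*"
    return "inside"

q3 = 468.0
q1 = q3 - 232.25  # assumption: inferred from IQR = Q3 - Q1, not given on the slide

lof, lif, uif, uof = fences(q1, q3)
print(f"LOF={lof}, LIF={lif}, UIF={uif}, UOF={uof}")

# Hypothetical values, only to show how points would be marked
for y in (500, 900, 1200):
    print(y, classify(y, q1, q3))
```

With these quartiles both lower fences are negative, so a measurement that cannot be negative can only be flagged on the upper side.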

Summarizing Data of More than One Variable • Contingency Table: Cross-tabulation of units based on measurements of two qualitative variables simultaneously • Stacked Bar Graph: Bar chart with one variable represented on the horizontal axis, second variable as subcategories within bars • Cluster Bar Graph: Bar chart with one variable forming “major groupings” on horizontal axis, second variable used to make side-by-side comparisons within major groupings (displays all combinations in a factorial experiment) • Scatterplot: Plot with quantitative variables y and x plotted against each other for each unit • Side-by-Side Boxplot: Compares distributions by groups
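A contingency table of counts, and the row percents often reported with it, can be sketched in Python as follows. The (treatment, outcome) pairs are made up and do not come from any study in these slides. All variable names are illustrative.

```python
from collections import Counter

# Made-up (treatment, outcome) pairs, one per unit, for illustration only
units = [
    ("A", "yes"), ("A", "yes"), ("A", "yes"), ("A", "no"),
    ("B", "yes"), ("B", "no"), ("B", "no"), ("B", "no"),
]

# Cross-tabulate: count units for each combination of the two variables
table = Counter(units)
rows = sorted({treatment for treatment, _ in units})
cols = sorted({outcome for _, outcome in units})

# Percent of each outcome within a treatment (row percents)
row_totals = {r: sum(table[(r, c)] for c in cols) for r in rows}
row_percents = {
    (r, c): 100 * table[(r, c)] / row_totals[r] for r in rows for c in cols
}

print("treatment " + " ".join(f"{c:>10}" for c in cols))
for r in rows:
    cells = " ".join(
        f"{table[(r, c)]:>3} ({row_percents[(r, c)]:.0f}%)" for c in cols
    )
    print(f"{r:<9} {cells}")
```

The row percents are the heights of the segments in a stacked bar graph drawn with one bar per treatment.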

Example - Ginkgo and Acetazolamide for Acute Mountain Syndrome Among Himalayan Trekkers • Contingency Table (Counts) • Percent Outcome by Treatment

Scatterplots • Identify the explanatory and response variables of interest, and label them as x and y • Obtain a set of individuals and observe the pair (xi, yi) for each individual. There will be n pairs. • Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots) • Plot the n pairs of points (x,y) on the graph

France August, 2003 Heat Wave Deaths • Individuals: 13 cities in France • Response: Excess Deaths (%) Aug 1/19, 2003 vs 1999-2002 • Explanatory Variable: Change in Mean Temp in period (°C) • Data:

France August, 2003 Heat Wave Deaths: Possible Outlier

Example - Pharmacodynamics of LSD • Response (y) - Math score (mean among 5 volunteers) • Explanatory (x) - LSD tissue concentration (mean of 5 volunteers) • Raw Data and scatterplot of Score vs LSD concentration: Source: Wagner, et al (1968)

Manufacturer Production/Cost Relation X= Amount Produced Y= Total Cost n=48 months (not in order)

Manufacturer Production/Cost Relation
