STAT 101: Day 5 Descriptive Statistics II 1/30/12

STAT 101: Day 5 Descriptive Statistics II 1/30/12. One Quantitative Variable (continued) Quantitative with a Categorical Variable Two Quantitative Variables. Section 2.3, 2.4, 2.5. Professor Kari Lock Morgan Duke University. Clicker Registration.


Presentation Transcript


  1. STAT 101: Day 5 Descriptive Statistics II 1/30/12 • One Quantitative Variable (continued) • Quantitative with a Categorical Variable • Two Quantitative Variables • Section 2.3, 2.4, 2.5 • Professor Kari Lock Morgan • Duke University

  2. Clicker Registration • To register your clicker, just press the letter that appears next to your name, then press the second letter that appears next to your name

  3. What are The Odds That Stats Would Be This Popular? - New York Times, 1/26/12 There are billions of bytes generated daily, not just from the Internet but also from sciences like genetics and astronomy. Companies like Google and Facebook, as well as product marketers, risk analysts, spies, natural philosophers and gamblers are all scouring the info, desperate to find a new angle on what makes us and the world tick. … What no one has are enough people to figure out the valuable patterns that lie inside the data. …

  4. Measures of Center • Median: m = $1,250,000 • Mean: x̄ = $2,210,000 • The mean is “pulled” in the direction of skewness

  5. Standard Deviation • The sample standard deviation, s, measures the spread of a distribution • The larger s is, the more spread out the distribution is • Standard deviation is always ≥ 0 • R: sd()
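
As a minimal sketch of what sd() computes, the following compares it with the usual sample formula (square root of the summed squared deviations from the mean, divided by n - 1). The data values are hypothetical and chosen only for illustration.

```r
# Hypothetical data: hours of study per week for six students
x <- c(4, 8, 10, 12, 15, 23)

n <- length(x)
x_bar <- mean(x)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
s_by_hand <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Built-in function named on the slide
s_builtin <- sd(x)

print(c(by_hand = s_by_hand, builtin = s_builtin))
```

Both values should agree, and neither can be negative because the formula ends in a square root.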

  6. Standard Deviation Both of these distributions are bell-shaped

  7. The 95% Rule • If a distribution is symmetric and bell-shaped, then approximately 95% of the data values will lie within 2 standard deviations of the mean
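
The 95% figure can be checked against an idealised bell curve. The sketch below assumes a normal distribution as the model for "symmetric and bell-shaped"; real data will only match approximately.

```r
# Assumption: the standard normal curve stands in for a
# "symmetric and bell-shaped" distribution
within_2_sd <- pnorm(2) - pnorm(-2)

# Proportion of values within 2 standard deviations of the mean
print(round(within_2_sd, 4))  # 0.9545
```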

  8. The 95% Rule The standard deviation for hours of sleep per night is closest to • ½ • 1 • 2 • 4 • I have no idea

  9. z-score • A z-score is a unit-free measure of the extremity of a data point. It tells us how many standard deviations away from the mean a value is • Values farther from 0 are more extreme • For symmetric, bell-shaped distributions, about 95% of all z-scores fall between -2 and 2

  10. z-score Which is better, an ACT score of 28 or a combined SAT score of 2100? • ACT: mean = 21, sd = 5 • SAT: mean = 1500, sd = 325 • Assume ACT scores and SAT scores have approximately symmetric and bell-shaped distributions (a) ACT score of 28 (b) SAT score of 2100 (c) I don’t know
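
One way to work through this comparison is to convert each score to a z-score using the means and standard deviations given on the slide. The helper name z_score is hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: number of standard deviations a value lies from the mean
z_score <- function(value, mean_value, sd_value) {
  (value - mean_value) / sd_value
}

z_act <- z_score(28, mean_value = 21, sd_value = 5)        # 1.4
z_sat <- z_score(2100, mean_value = 1500, sd_value = 325)  # about 1.85

print(c(ACT = z_act, SAT = z_sat))
```

The SAT score has the larger z-score, so it lies further above its mean in standard deviation units.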

  11. Other Measures of Location • Maximum = largest data value • Minimum = smallest data value • Quartiles: • Q1 = median of the values below m. • Q3 = median of the values above m.
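
The quartile definitions above can be applied literally, as in this sketch on a small hypothetical data set. Note that R's summary() and quantile() interpolate by default, so they can report slightly different quartiles than this definition (here quantile(x, 0.25) gives 4.5 rather than 4).

```r
# Hypothetical data, already sorted for readability
x <- c(2, 4, 5, 7, 8, 10, 12)

m <- median(x)           # 7
q1 <- median(x[x < m])   # median of the values below m: 4
q3 <- median(x[x > m])   # median of the values above m: 10

print(c(Min = min(x), Q1 = q1, m = m, Q3 = q3, Max = max(x)))
```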

  12. Five Number Summary • Five Number Summary: Min, Q1, m, Q3, Max • About 25% of the data lie between each pair of consecutive values • R: summary()

  13. Percentile • The Pth percentile is the value of a quantitative variable which is greater than P percent of the data • We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better • We could also have used percentiles: • ACT score of 28: 91st percentile • SAT score of 2100: 97th percentile
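
These percentiles can be approximated with pnorm(). The sketch assumes both score distributions are exactly normal, which the slides state only approximately, so the results are rough: about 92 for the ACT score and about 97 for the SAT score, close to the percentiles quoted above.

```r
# Assumption: scores follow a normal distribution with the
# means and standard deviations given on the z-score slide
p_act <- pnorm(28, mean = 21, sd = 5)
p_sat <- pnorm(2100, mean = 1500, sd = 325)

# Approximate percentage of scores falling below each value
print(round(100 * c(ACT = p_act, SAT = p_sat), 1))
```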

  14. Five Number Summary • Min = 0th percentile • Q1 = 25th percentile • m = 50th percentile • Q3 = 75th percentile • Max = 100th percentile

  15. Five Number Summary • R: summary(study_hours) gives Min. = 2.00, 1st Qu. = 10.00, Median = 15.00, 3rd Qu. = 20.00, Max. = 69.00 • The distribution of the number of hours you spend studying each week is (a) Symmetric (b) Right-skewed (c) Left-skewed (d) Impossible to tell

  16. Measures of Spread • Range = Max – Min • Interquartile Range (IQR) = Q3 – Q1 • Is the range resistant to outliers? • Yes • No • Is the IQR resistant to outliers? • Yes • No

  17. Outliers • Outliers can be informally identified by looking at a plot, but one rule of thumb for identifying outliers is data values more than 1.5 IQRs beyond the quartiles • A data value is an outlier if it is Smaller than Q1 – 1.5(IQR) or Larger than Q3 + 1.5(IQR)
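
As a worked example of the 1.5 IQR rule, the sketch below uses the study_hours five number summary from the earlier slide (Min 2, Q1 10, Q3 20, Max 69). The function name is_outlier is hypothetical.

```r
# Values taken from the summary(study_hours) output on the earlier slide
min_value <- 2
q1 <- 10
q3 <- 20
max_value <- 69

iqr <- q3 - q1                  # 10
lower_fence <- q1 - 1.5 * iqr   # -5
upper_fence <- q3 + 1.5 * iqr   # 35

# Hypothetical helper: TRUE for values beyond either fence
is_outlier <- function(v) v < lower_fence | v > upper_fence

print(is_outlier(c(min_value, max_value)))  # FALSE TRUE
```

The maximum of 69 hours is above the upper fence of 35, so it counts as an outlier under this rule; the minimum of 2 does not.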

  18. Boxplot • The box spans Q1 to Q3, with a line at the median • Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier • Outliers are plotted as individual points • R: boxplot(study_hours, ylab="Hours spent studying")

  19. Boxplot Which boxplot goes with the histogram of waiting times for the bus? (a) (b) (c)

  20. Summary: One Quantitative Variable • Summary Statistics • Center: mean, median • Spread: standard deviation, range, IQR • Percentiles • 5 number summary • Visualization • Dotplot • Histogram • Boxplot • Other concepts • Shape: symmetric, skewed, bell-shaped • Outliers, resistance • z-scores

  21. Quantitative and Categorical Relationships • Boxplots are particularly useful for comparing distributions of a quantitative variable across different levels of a categorical variable

  22. Side-by-Side Boxplots • Do students whose parents had more of an education have higher GPAs? boxplot(gpa~parent_degree, ylab="GPA", xlab="Parents' Highest Degree")

  23. Side-by-Side Boxplots • Does GPA differ by major?

  24. Side-by-Side Boxplots • Do students who’ve had AP statistics do better in STAT 101? • NO!

  25. Side-by-Side Boxplots

  26. Quantitative Statistics by a Categorical Variable • Any of the statistics we use for a quantitative variable can be looked at separately for each level of a categorical variable • Mean hours per week spent studying by major:
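
One possible way to compute a statistic separately for each level of a categorical variable is tapply(). The data frame below is entirely hypothetical; the table from the original slide is not reproduced here.

```r
# Hypothetical data: major and weekly study hours for six students
students <- data.frame(
  major = c("Economics", "Biology", "Economics", "Biology", "Math", "Math"),
  study_hours = c(12, 20, 16, 14, 18, 22)
)

# Mean study hours within each major
mean_by_major <- tapply(students$study_hours, students$major, mean)

print(mean_by_major)
```

Swapping mean for median, sd, or another summary function gives the corresponding statistic by group.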

  27. Summary: One Quantitative and One Categorical • Summary Statistics • Any summary statistics for quantitative variables, broken down by each level of the categorical variable • Visualization • Side-by-side boxplots

  28. Scatterplot • A scatterplot is a graph of the relationship between two quantitative variables. Each dot represents one case. R: plot(study_hours, gpa)

  29. Direction of Association • A positive association means that values of one variable tend to be higher when values of the other variable are higher • A negative association means that values of one variable tend to be lower when values of the other variable are higher • Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable

  30. Cars Data - Handout • Quantitative Variables: • Weight (pounds) • City MPG • Fuel capacity (gallons) • Page number (in Consumer Reports) • Time to go ¼ mile (in seconds) • Acceleration time from 0 to 60 mph • Relationships • Weight vs. CityMPG • Weight vs. FuelCapacity • PageNum vs. FuelCapacity • Weight vs. QtrMile • Acc060 vs. QtrMile • CityMPG vs. QtrMile

  31. Correlation • The sample correlation, r, measures the strength and direction of linear association between two quantitative variables • sX: sample standard deviation of X • sY: sample standard deviation of Y • R: cor(X,Y)
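
The formula from this slide did not survive in the transcript. A standard definition consistent with sX and sY is the sum of the products of the z-scores of X and Y, divided by n - 1. The sketch below assumes that definition and checks it against cor(); the data values are hypothetical.

```r
# Hypothetical data: car weight (pounds) and city MPG for five cars
X <- c(2500, 3000, 3500, 4000, 4500)
Y <- c(32, 27, 24, 20, 17)

n <- length(X)

# z-scores of each variable, using the sample standard deviations sX and sY
z_x <- (X - mean(X)) / sd(X)
z_y <- (Y - mean(Y)) / sd(Y)

# Assumed definition: r = sum(z_x * z_y) / (n - 1)
r_by_hand <- sum(z_x * z_y) / (n - 1)

# Built-in function named on the slide
r_builtin <- cor(X, Y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```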

  32. Car Correlations • Correlations shown on the six scatterplots: (-.91) (.89) (-.45) (.51) (.99) (-.08) • What are the properties of correlation?

  33. Correlation • -1 ≤ r ≤ 1 • positive association: r > 0 • negative association: r < 0 • no linear association: r ≈ 0 • The closer r is to ±1, the stronger the linear association • r does not depend on the units of measurement • The correlation between X and Y is the same as the correlation between Y and X
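
Two of these properties, unit independence and symmetry, can be checked directly with cor(). The data are hypothetical, and the pounds-to-kilograms factor is used only as an example of a change of units.

```r
# Hypothetical data: car weight (pounds) and city MPG for five cars
X <- c(2500, 3000, 3500, 4000, 4500)
Y <- c(32, 27, 24, 20, 17)

r <- cor(X, Y)

# Changing units (pounds to kilograms) leaves r unchanged
r_kg <- cor(X * 0.4536, Y)

# Swapping the roles of X and Y leaves r unchanged
r_swapped <- cor(Y, X)

print(c(r = r, r_kg = r_kg, r_swapped = r_swapped))
```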
