STAT 101 Exploratory Data Analysis I 1/25/12

STAT 101 Exploratory Data Analysis I 1/25/12. One Categorical Variable Two Categorical Variables One Quantitative Variable – Center. Section 2.1, 2.2. Professor Kari Lock Morgan Duke University. Announcements. Textbooks are here! My office hours: (Old Chemistry 216)


Presentation Transcript


  1. STAT 101 Exploratory Data Analysis I 1/25/12 • One Categorical Variable • Two Categorical Variables • One Quantitative Variable – Center • Section 2.1, 2.2 • Professor Kari Lock Morgan • Duke University

  2. Announcements • Textbooks are here! • My office hours: (Old Chemistry 216) • Wednesday 3-5 pm • Friday 1-3pm • Lecture slides, assignments, labs, etc. will be posted at http://stat.duke.edu/courses/Spring12/sta101.2/ • Complete lecture slides to be posted after each class

  3. The Big Picture Sample Population Sampling Statistical Inference Exploratory Data Analysis

  4. Class Survey Data Data from both STAT 101 classes and STAT 10

  5. Data • In order to make sense of this data, we need ways to summarize and visualize it • Summarizing and visualizing variables and relationships between two variables is often known as exploratory data analysis (also known as descriptive statistics) • The appropriate summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative)

  6. One Categorical Variable • Display the number or proportion of cases that fall in each category “What is your favorite day of the week?”

  7. Frequency Table • A frequency table shows the number of cases that fall in each category: R: table(fav_day)

  8. Proportion • The sample proportion of students in each category is (number of students in that category) / (total number of students)

  9. Proportion • The sample proportion of students in this class who prefer Friday is • Proportion and percent can be used interchangeably: 0.51 or 51%

  10. Relative Frequency Table • A relative frequency table shows the proportion of cases that fall in each category • All the numbers in a relative frequency table sum to 1 R: round(table(fav_day)/209,3)
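
As a minimal sketch of the frequency and relative frequency tables above: the vector fav_day below is a small made-up sample (not the class survey data), so its counts differ from those on the slides. Dividing by length(fav_day) plays the role of dividing by 209 in the slide's command.

```r
# Hypothetical responses; the real survey data are not reproduced here.
fav_day <- c("Friday", "Saturday", "Friday", "Sunday", "Friday",
             "Saturday", "Friday", "Thursday", "Saturday", "Friday")

freq <- table(fav_day)               # number of cases in each category
rel_freq <- freq / length(fav_day)   # proportion of cases in each category

print(freq)
print(round(rel_freq, 3))
print(sum(rel_freq))                 # relative frequencies sum to 1
```

Base R's prop.table(freq) gives the same proportions without typing the sample size by hand.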

  11. Bar Chart/Plot/Graph • In a barplot, the height of the bar corresponds to the number of cases falling in each category R: barplot(table(fav_day))

  12. Pie Chart • In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category R: pie(table(fav_day))

  13. Summary: One Categorical Variable • Summary Statistics • Proportion • Frequency table • Relative frequency table • Visualization • Barplot • Pie chart

  14. Two Categorical Variables • Look at the relationship between two categorical variables • Relationship status • Gender

  15. Two-Way Table • It doesn’t matter which variable is displayed in the rows and which in the columns R: table(gender, relationship)

  16. Two-Way Table What proportion of females in intro stat are in a relationship? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215

  17. Two-Way Table What proportion of intro stat students in a relationship are female? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215

  18. Two-Way Table CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!

  19. Two-Way Table What proportion of intro stat students are in a relationship and female? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215
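
The three questions above differ only in the denominator. The sketch below assumes the counts implied by the answer choices (42 females in a relationship, 151 females, 60 students in a relationship, 215 students in total); the two-way table itself is not in this transcript, and the "Other" column is a hypothetical lumping of all remaining relationship categories.

```r
# Counts inferred from the answer choices on the slides (an assumption).
tab <- matrix(c(42, 18, 109, 46), nrow = 2,
              dimnames = list(gender = c("Female", "Male"),
                              relationship = c("In a relationship", "Other")))

row_prop  <- prop.table(tab, margin = 1)  # divide by row totals (within gender)
col_prop  <- prop.table(tab, margin = 2)  # divide by column totals (within status)
cell_prop <- prop.table(tab)              # divide by the grand total

# Proportion of females who are in a relationship: 42/151
print(row_prop["Female", "In a relationship"])
# Proportion of students in a relationship who are female: 42/60
print(col_prop["Female", "In a relationship"])
# Proportion of all students who are female and in a relationship: 42/215
print(cell_prop["Female", "In a relationship"])
```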

  20. Side-by-Side Bar Chart • The height of each bar is the count in the corresponding cell of the two-way table R: colors = c("pink", "blue"); barplot(table(gender, relationship), beside=TRUE, col=colors, legend=TRUE)

  21. Side-by-Side Bar Chart R: colors = c("red", "green", "blue"); barplot(table(relationship, gender), beside=TRUE, col=colors, legend=TRUE)

  22. Segmented Bar Chart • A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of side-by-side R: barplot(table(relationship, gender), legend=TRUE, col=c("red", "green", "blue"))

  23. Mosaic Plot • The width of each column is proportional to the proportion of cases in that column category, and each column is divided and colored according to the proportions of the row variable within that column category R: mosaicplot(table(Music, Gender), col=c("pink", "blue"))

  24. Mosaic Plot R: colors = c("red", "green", "blue"); mosaicplot(table(gender, relationship), col=colors, legend=TRUE, cex.axis=.7, main="")

  25. Mosaic Plot This tells us… • Most people who are in favor of the new housing model are in (or plan to be in) a selected living group • Most people who are in (or plan to be in) a selected living group are in favor of the new housing model • Both (a) and (b) • Neither (a) nor (b)

  26. Difference in Proportions • A difference in proportions is a difference in proportions for one categorical variable (e.g. proportion for whom “it’s complicated”) calculated for different levels of the other categorical variable (e.g. gender)

  27. Two-Way Table What is the difference in proportions? • 0.833 • 0.066 • –0.003 • 0.057 • 0.047 Calculation: 11/151 – 1/64
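
The expression on the slide can be evaluated directly. This sketch assumes, from the earlier example, that 11 of 151 females and 1 of 64 males answered "it's complicated"; the variable names are illustrative only.

```r
# Proportion answering "it's complicated" within each gender
# (counts read off the slide's expression 11/151 - 1/64).
p_female <- 11 / 151
p_male   <- 1 / 64

diff_prop <- p_female - p_male   # difference in proportions (female minus male)
print(round(diff_prop, 3))
```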

  28. Summary: Two Categorical Variables • Summary Statistics • Two-way table • Difference in proportions • Visualization • Side-by-side bar chart • Segmented bar chart • Mosaic plot

  29. Kidney Stones R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882 • Which treatment is better at removing kidney stones? • (a) Treatment A • (b) Treatment B

  30. Kidney Stones • Which treatment is better at removing small kidney stones? • (a) Treatment A • (b) Treatment B

  31. Kidney Stones • Which treatment is better at removing large kidney stones? • (a) Treatment A • (b) Treatment B

  32. Kidney Stones • Treatment A is more effective for both small and large kidney stones, but the data shows Treatment B to be more effective overall! • How is this possible!?!?

  33. Kidney Stones

  34. Kidney Stones • Treatment A is used more often on large stones, which are harder to treat. • This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered
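
The reversal can be checked numerically. The counts below are the commonly cited success figures from the 1986 kidney stone study; they do not appear in this transcript, so treat them as an assumption about what the slide tables showed.

```r
# Successes and totals by stone size (rows) and treatment (columns).
# Commonly cited figures for the 1986 study; not taken from this transcript.
success <- matrix(c(81, 192, 234, 55), nrow = 2,
                  dimnames = list(size = c("Small", "Large"),
                                  treatment = c("A", "B")))
total <- matrix(c(87, 263, 270, 80), nrow = 2,
                dimnames = dimnames(success))

by_size <- success / total                     # success rate within each stone size
overall <- colSums(success) / colSums(total)   # success rate ignoring stone size

print(round(by_size, 3))   # A is higher in both rows
print(round(overall, 3))   # B is higher overall
print(total["Large", ] / colSums(total))  # share of each treatment's cases that were large
```

The last line shows the confounding: most of Treatment A's cases were large stones, which pulls its overall rate down.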

  35. Slope = # successful / # unsuccessful = odds

  36. Slope = # successful / # unsuccessful = odds

  37. One Quantitative Variable • We’ll look at how to analyze a quantitative variable such as • Times checking Facebook per day • Average hours of sleep per night • Average hours of exercise per week • GPA • Average hours spent on extracurricular activities per week • Number of piercings

  38. Dotplot • In a dotplot, each case is represented by a dot and dots are stacked. • Average number of times checking Facebook per day • Easy way to see each case

  39. Histogram • The height of each bar corresponds to the number of cases within that range of the variable R: hist(exercise)

  40. Histogram • Although they look similar, a histogram is not the same as a bar plot • A bar plot is for categorical data, and the x-axis has no numeric scale • A histogram is for quantitative data, and the x-axis is numeric • For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed • For a quantitative variable, the number of bars in a histogram is up to you (or the software you use), and the appearance can differ with different numbers of bars
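
To illustrate the last point, the sketch below bins the same values two ways. The exercise vector is made up for illustration (it is not the survey variable), and plot = FALSE is used so that only the bin counts are returned.

```r
# Made-up exercise hours per week for 12 students (not the survey data).
exercise <- c(0, 1, 2, 2, 3, 4, 5, 5, 6, 8, 10, 14)

# Same data, two choices of bin width.
h_wide   <- hist(exercise, breaks = seq(0, 15, by = 5), plot = FALSE)
h_narrow <- hist(exercise, breaks = seq(0, 15, by = 1), plot = FALSE)

print(h_wide$counts)     # 3 bars
print(h_narrow$counts)   # 15 bars, many of them empty or short
```

Both histograms account for all 12 cases; only the grouping, and therefore the apparent shape, changes.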

  41. Shape • Symmetric • Right-Skewed (long right tail) • Left-Skewed

  42. Notation • The sample size, the number of cases in the sample, is denoted by n • We often let x or y stand for any variable, and x1, x2, …, xn represent the n values of the variable x • Example: x = Average hours of sleep x1 = 5, x2 = 9, x3 = 7, x4 = 7, …

  43. Mean • The sample mean is the average, and is computed by adding up all the numbers and dividing by the number of cases R: mean()
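
A minimal sketch of the computation described above, using only the first four sleep values listed on the notation slide (the rest of the data are not in this transcript). The name x_bar is an illustrative choice.

```r
# First four values of x = average hours of sleep, from the notation slide.
x <- c(5, 9, 7, 7)
n <- length(x)          # sample size

x_bar <- sum(x) / n     # add up all the values, divide by the number of cases
print(x_bar)
print(mean(x))          # built-in function gives the same result
```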

  44. Median • The sample median is the middle value when the data is ordered • If there are an even number of values, the median is the average of the two middle values • The sample median is denoted as m R: median()
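
The odd and even cases can be sketched as follows. Both vectors are made-up values chosen for illustration, and the manual calculation is shown only to mirror the definition; median() handles both cases itself.

```r
# Hypothetical data: one sample with an odd n, one with an even n.
odd_x  <- c(5, 9, 7)
even_x <- c(5, 9, 7, 8)

# Odd n: the middle value of the ordered data.
m_odd <- sort(odd_x)[(length(odd_x) + 1) / 2]

# Even n: the average of the two middle values of the ordered data.
n <- length(even_x)
m_even <- mean(sort(even_x)[c(n / 2, n / 2 + 1)])

print(c(m_odd, median(odd_x)))
print(c(m_even, median(even_x)))
```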

  45. Outliers • An outlier is a value that is notably different from the other values • Hours spent on extracurricular activities per week
