STAT 101 Exploratory Data Analysis I 1/25/12

One Categorical Variable • Two Categorical Variables • One Quantitative Variable – Center • Section 2.1, 2.2 • Professor Kari Lock Morgan • Duke University

Announcements • Textbooks are here! • My office hours: (Old Chemistry 216) • Wednesday 3-5 pm • Friday 1-3pm • Lecture slides, assignments, labs, etc. will be posted at http://stat.duke.edu/courses/Spring12/sta101.2/ • Complete lecture slides to be posted after each class

The Big Picture Sample Population Sampling Statistical Inference Exploratory Data Analysis

Class Survey Data Data from both STAT 101 classes and STAT 10

Data • To make sense of these data, we need ways to summarize and visualize them • Summarizing and visualizing variables and relationships between two variables is often known as exploratory data analysis (also known as descriptive statistics) • The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative)

One Categorical Variable • Display the number or proportion of cases that fall in each category “What is your favorite day of the week?”

Frequency Table • A frequency table shows the number of cases that fall in each category: R: table(fav_day)

Proportion • The sample proportion of students in each category is $\hat{p} = \dfrac{\text{number of cases in the category}}{\text{total number of cases}}$

Proportion • The sample proportion of students in this class who prefer Friday is • Proportion and percent can be used interchangeably: 0.51 or 51%

Relative Frequency Table • A relative frequency table shows the proportion of cases that fall in each category • All the numbers in a relative frequency table sum to 1 R: round(table(fav_day)/209,3)
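The frequency and relative frequency tables described above can be sketched in R as follows. The `fav_day` vector is a small made-up example (the class survey data are not reproduced here), and `prop.table()` is used as an alternative to dividing by a hard-coded sample size.

```r
# Hypothetical responses; NOT the actual class survey data
fav_day <- c("Friday", "Saturday", "Friday", "Sunday", "Friday",
             "Saturday", "Friday", "Thursday", "Saturday", "Friday")

# Frequency table: number of cases in each category
freq <- table(fav_day)

# Relative frequency table: proportion of cases in each category
rel_freq <- round(prop.table(freq), 3)

print(freq)
print(rel_freq)

# The relative frequencies sum to 1 (up to rounding)
print(sum(rel_freq))
```

Using `prop.table()` keeps the denominator in step with the data, so the table stays correct if cases are added or removed.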

Bar Chart/Plot/Graph • In a barplot, the height of the bar corresponds to the number of cases falling in each category R: barplot(table(fav_day))

Pie Chart • In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category R: pie(table(fav_day))

Summary: One Categorical Variable • Summary Statistics • Proportion • Frequency table • Relative frequency table • Visualization • Barplot • Pie chart

Two Categorical Variables • Look at the relationship between two categorical variables • Relationship status • Gender

Two-Way Table • It doesn’t matter which variable is displayed in the rows and which in the columns R: table(gender, relationship)

Two-Way Table What proportion of females in intro stat are in a relationship? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215

Two-Way Table What proportion of intro stat students in a relationship are female? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215

Two-Way Table CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!

Two-Way Table What proportion of intro stat students are in a relationship and female? • 42/60 • 42/151 • 42/215 • 151/215 • 60/215
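The three questions above differ only in the denominator: all females, all students in a relationship, or all students. A minimal sketch in R, using a small made-up data set rather than the class survey, shows how `prop.table()` with different `margin` values gives each one.

```r
# Hypothetical data; NOT the actual class survey
gender <- c("Female", "Female", "Female", "Female", "Female", "Female",
            "Male", "Male", "Male", "Male")
relationship <- c("Yes", "Yes", "No", "No", "No", "No",
                  "Yes", "No", "No", "No")

tab <- table(gender, relationship)
print(tab)

# Condition on rows: proportion of each gender in each relationship status
by_gender <- prop.table(tab, margin = 1)

# Condition on columns: proportion of each relationship status that is each gender
by_status <- prop.table(tab, margin = 2)

# No conditioning: proportion of all students in each cell
overall <- prop.table(tab)

print(by_gender["Female", "Yes"])  # females in a relationship / all females
print(by_status["Female", "Yes"])  # females in a relationship / all in a relationship
print(overall["Female", "Yes"])    # females in a relationship / all students
```

The same cell count appears in all three numerators; only the total it is divided by changes.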

Side-by-Side Bar Chart • The height of each bar is the count in the corresponding cell of the two-way table R: colors = c("pink", "blue"); barplot(table(gender, relationship), beside=TRUE, col=colors, legend=TRUE)

Side-by-Side Bar Chart R: colors = c("red", "green", "blue"); barplot(table(relationship, gender), beside=TRUE, col=colors, legend=TRUE)

Segmented Bar Chart • A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of side-by-side R: barplot(table(relationship, gender), legend=TRUE, col=c("red", "green", "blue"))

Mosaic Plot • The width of each column reflects the proportion of cases in that category, and each column is divided into colored segments according to the proportions of the other variable within that category R: mosaicplot(table(Music, Gender), col=c("pink", "blue"))

Mosaic Plot R: colors = c("red", "green", "blue"); mosaicplot(table(gender, relationship), col=colors, legend=TRUE, cex.axis=.7, main="")

Mosaic Plot This tells us… • Most people who are in favor of the new housing model are in (or plan to be in) a selected living group • Most people who are in (or plan to be in) a selected living group are in favor of the new housing model • Both (a) and (b) • Neither (a) nor (b)

Difference in Proportions • A difference in proportions compares the proportion for one categorical variable (e.g. proportion for whom “it’s complicated”) across two levels of the other categorical variable (e.g. gender), by subtracting one proportion from the other

Two-Way Table What is the difference in proportions? • 0.833 • 0.066 • –0.003 • 0.057 • 0.047 Calculation: 11/151 – 1/64 ≈ 0.057
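The calculation on this slide can be checked in R. The counts below are taken directly from the expression 11/151 – 1/64. The variable names are illustrative only, and which group each fraction belongs to is an assumption here, since the underlying table is not shown.

```r
# Counts from the slide's expression: 11 out of 151 and 1 out of 64
p_group1 <- 11 / 151
p_group2 <- 1 / 64

# Difference in proportions
diff_prop <- p_group1 - p_group2
print(round(diff_prop, 3))  # 0.057
```

Note that the order of subtraction determines the sign, so it is worth stating which group comes first whenever a difference in proportions is reported.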

Summary: Two Categorical Variables • Summary Statistics • Two-way table • Difference in proportions • Visualization • Side-by-side bar chart • Segmented bar chart • Mosaic plot

Kidney Stones R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882 • Which treatment is better at removing kidney stones? • (a) Treatment A • (b) Treatment B

Kidney Stones • Which treatment is better at removing small kidney stones? • (a) Treatment A • (b) Treatment B

Kidney Stones • Which treatment is better at removing large kidney stones? • (a) Treatment A • (b) Treatment B

Kidney Stones • Treatment A is more effective for both small and large kidney stones, but the data show Treatment B to be more effective overall! • How is this possible!?!?

Kidney Stones

Kidney Stones • Treatment A is used more often on large stones, which are harder to treat. • This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered
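The reversal can be reproduced with a short R sketch. The success counts below are the figures commonly reported from the Charig et al. (1986) study cited above. They do not appear in the text of these slides, so treat them as an assumption to verify against the paper.

```r
# Counts as commonly reported from Charig et al. (1986);
# assumed here, not taken from the slide text
stones <- data.frame(
  treatment = c("A", "A", "B", "B"),
  size      = c("small", "large", "small", "large"),
  success   = c(81, 192, 234, 55),
  total     = c(87, 263, 270, 80)
)

# Success rate within each stone size: A is higher for both sizes
stones$rate <- stones$success / stones$total
print(stones)

# Overall success rate per treatment, ignoring stone size: B is higher
overall <- aggregate(cbind(success, total) ~ treatment, data = stones, FUN = sum)
overall$rate <- overall$success / overall$total
print(overall)
```

The `total` column shows why this happens: most of Treatment A's cases are large stones, while most of Treatment B's cases are small stones, so the overall rates mix the two sizes in very different proportions.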

Slope = # successful / # unsuccessful = odds

One Quantitative Variable • We’ll look at how to analyze a quantitative variable such as • Times checking Facebook per day • Average hours of sleep per night • Average hours of exercise per week • GPA • Average hours spent on extracurricular activities per week • Number of piercings

Dotplot • In a dotplot, each case is represented by a dot and dots are stacked. • Average number of times checking Facebook per day • Easy way to see each case

Histogram • The height of each bar corresponds to the number of cases within that range of the variable R: hist(exercise)

Histogram • Although they look similar, a histogram is not the same as a bar plot • A bar plot is for categorical data, and the x-axis has no numeric scale • A histogram is for quantitative data, and the x-axis is numeric • For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed • For a quantitative variable, the number of bars in a histogram is up to you (or the software you use), and the appearance can differ with different numbers of bars
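To illustrate the last point, the sketch below bins the same values two different ways. The `exercise` values are made up for illustration and are not the class survey data. `plot = FALSE` is used so that `hist()` returns the bin counts instead of drawing; dropping it draws the histogram.

```r
# Hypothetical hours of exercise per week; NOT the class survey data
exercise <- c(0, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 10, 12, 15, 20)

# Same data, two choices of bin width
h_few  <- hist(exercise, breaks = seq(0, 20, by = 10), plot = FALSE)
h_many <- hist(exercise, breaks = seq(0, 20, by = 2),  plot = FALSE)

print(h_few$counts)   # counts for 2 wide bins
print(h_many$counts)  # counts for 10 narrow bins

# Every case is counted once in either histogram
print(sum(h_few$counts) == length(exercise))
print(sum(h_many$counts) == length(exercise))
```

With wide bins the long right tail is hidden inside a single bar, while narrow bins show it clearly, so it is worth trying more than one bin width.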

Shape • Symmetric • Right-Skewed (long right tail) • Left-Skewed

Notation • The sample size, the number of cases in the sample, is denoted by n • We often let x or y stand for any variable, and x1, x2, …, xn represent the n values of the variable x • Example: x = Average hours of sleep x1 = 5, x2 = 9, x3 = 7, x4 = 7, …

Mean • The sample mean is the average, and is computed by adding up all the numbers and dividing by the number of cases R: mean()

Median • The sample median is the middle value when the data is ordered • If there are an even number of values, the median is the average of the two middle values • The sample median is denoted as m R: median()
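The definitions of the mean and median can be worked through by hand in R and compared with the built-in `mean()` and `median()`. This sketch uses only the first four sleep values from the notation example (5, 9, 7, 7); the remaining values are elided on the slide, so the results describe these four values only.

```r
# First four values from the notation example; the rest are not given
x <- c(5, 9, 7, 7)
n <- length(x)

# Mean: add up all the values and divide by the number of cases
mean_by_hand <- sum(x) / n

# Median: order the values, then take the middle one,
# or the average of the two middle values when n is even
x_sorted <- sort(x)
if (n %% 2 == 1) {
  median_by_hand <- x_sorted[(n + 1) / 2]
} else {
  median_by_hand <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
}

print(c(mean_by_hand, mean(x)))      # both 7
print(c(median_by_hand, median(x)))  # both 7
```

Here n is even, so the median is the average of the two middle ordered values (7 and 7).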

Outliers • An outlier is a value that is notably different from the other values • Hours spent on extracurricular activities per week
