Dive into quantitative data analysis with dot plots, means, and standard deviations. Learn to interpret distributions, calculate standard deviations, and create histograms for data visualization.
Data matrix for emails Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received. Quantitative variables
Sample of email data Let’s consider a random sample of 50 emails from the data. Rows 1, 2, 3, and 50 of the sample's data matrix are displayed below.
Dot plot A dot plot provides a case-by-case view of data for one quantitative variable.
Dot plot and the mean • The “placement” of data, as seen in a dot plot or some other representation, is called the distribution of the data. • The mean (also called the average) is a common way to measure the center of the distribution. Mean for data below is 10.704
The mean The sample mean, denoted by $\bar{x}$, can be calculated as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.
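The sample mean formula can be sketched in a few lines of Python. The function name `sample_mean` and the five observations are hypothetical and chosen only for illustration; they are not taken from the email data.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(values) / n


# Hypothetical observations, not taken from the email data.
observations = [2, 4, 4, 5, 10]
mean = sample_mean(observations)
print(mean)  # 5.0
```

Here the observations sum to 25 and n = 5, so the mean is 25 / 5 = 5.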
Population mean and estimation • The population mean is also computed the same way, but denoted by μ (the Greek letter mu). It is often not possible to compute μ because data on the entire population is not available. • The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate is probably not perfect, but if the sample is representative of the population, it is usually a good estimate.
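The idea of a point estimate can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (the integers 1 through 1000), the seed, and the variable names are all hypothetical, and in practice the population mean is usually unknown.

```python
import random

# Hypothetical population: the integers 1 through 1000 (not the email data).
population = list(range(1, 1001))
mu = sum(population) / len(population)  # population mean

rng = random.Random(117)             # fixed seed so the sketch is repeatable
sample = rng.sample(population, 50)  # simple random sample of 50 cases
x_bar = sum(sample) / len(sample)    # sample mean, the point estimate of mu

print(f"population mean mu = {mu}")
print(f"sample mean x_bar  = {x_bar}")
print(f"estimation error   = {x_bar - mu:+.2f}")
```

The sample mean will typically land near, but not exactly on, the population mean; a different seed gives a different estimate.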
Distributions with the same mean Each dot plot displays 124 observations and the distributions all have a mean of 6. What makes them different?
Distributions with the same mean Order these distributions from the least spread out to the most spread out. A. B. C.
Standard Deviation The standard deviation is the typical distance of an observation from the mean. The mean of the distribution is $\bar{x} = 6$ and the sample size is n = 124. The standard deviation is computed as follows: $s = \sqrt{\frac{(x_1 - 6)^2 + (x_2 - 6)^2 + \cdots + (x_{124} - 6)^2}{124 - 1}}$
Standard deviation measures spread The standard deviations of the three distributions are given. A. Std. dev. = 1.361 B. Std. dev. = 2.550 C. Std. dev. = 1.482
The standard deviation • The standard deviation of a sample is denoted by s and can be calculated using the formula given on the previous slide. • The standard deviation of the population is computed in a similar way, except we divide by n instead of n-1. The standard deviation of the population is denoted by σ (the Greek letter sigma).
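A minimal sketch of both calculations follows, assuming plain Python lists of numbers. The function names and the two small data sets are hypothetical; both data sets have a mean of 6, echoing the dot plots that share a mean but differ in spread.

```python
import math


def sample_std(values):
    """Sample standard deviation s: divide the squared deviations by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


def population_std(values):
    """Population standard deviation sigma: divide by n instead of n - 1."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / n)


# Two hypothetical data sets, both with mean 6 but different spread.
tight = [5, 6, 6, 6, 7]
wide = [2, 4, 6, 8, 10]

print(round(sample_std(tight), 3), round(sample_std(wide), 3))
print(round(population_std(tight), 3), round(population_std(wide), 3))
```

The more spread-out data set has the larger standard deviation, and for the same data dividing by n gives a slightly smaller value than dividing by n - 1.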
Histogram A histogram plots binned counts as bars.
Histograms
• A histogram is another way to display the distribution of a quantitative variable.
• Better than a stem-and-leaf plot for larger data sets, but doesn’t retain the actual numerical values.
Basic Steps for Creating a Histogram
• Divide the range of the data (smallest to largest) into classes of equal width. The classes should not overlap.
• Count the number of observations that fall into each class. Recall that the counts are also called frequencies.
• Draw a horizontal axis and mark off the classes along this axis.
• The vertical axis can be the count, the proportion, or the percentage.
• Draw a rectangle (a vertical bar) above each class with the height equal to the count, the proportion, or the percentage.
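The binning and counting steps above can be sketched as follows. The function `histogram_counts`, its parameters, and the ten heights are hypothetical; each class is treated as the interval from its left edge up to, but not including, the next left edge, which is one common convention and an assumption here.

```python
def histogram_counts(values, bin_width, start=None):
    """Count observations in equal-width, non-overlapping classes.

    Each class covers [left, left + bin_width). Returns (left_edge, count) pairs.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if start is None:
        start = min(values)
    if start > min(values):
        raise ValueError("start must not exceed the smallest observation")
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - start) // bin_width)] += 1
    return [(start + i * bin_width, count) for i, count in enumerate(counts)]


# Hypothetical heights in inches, not the MAT 117 data.
heights = [60, 61, 61, 63, 64, 64, 64, 66, 67, 70]

# Draw a text histogram for two different bin widths.
for width in (2, 5):
    print(f"bin width = {width}")
    for left, count in histogram_counts(heights, width, start=60):
        print(f"  [{left}, {left + width}): {'*' * count}")
```

With 2 in. bins the counts are 3, 1, 3, 2, 0, 1, while 5 in. bins collapse the same data into counts of 7, 2, 1, so the choice of bin width changes what the picture suggests.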
Bin width: height of MAT 117 students Bin width can alter the story we get from the histogram. Panels: ½ in. bins, 1 in. bins, 6 in. bins, 33 in. bins.
Shape of a Distribution: Modality Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: To determine modality, step back and imagine a smooth curve over the histogram. Imagine the bars are wooden blocks and you drop a limp spaghetti noodle over them; the shape the spaghetti takes can be viewed as a smooth curve.
Modality: height of MAT 117 students Which bin width most accurately presents the modality? Panels: ½ in. bins, 1 in. bins, 6 in. bins, 33 in. bins.
Shape of a Distribution: Skewness Is the histogram right skewed, left skewed, or symmetric? Note: Histograms are said to be skewed to the side of the long tail.
Shape of a Distribution: Unusual Observations Are there any unusual observations or potential outliers?
Sample of email data How would you describe the shape of the distribution of the number of characters contained in the emails? Unimodal and right skewed, with a potentially unusual observation at 40,000 characters
Percentiles, quartiles, and the median
• The p-th percentile is a value such that p percent of observations fall at or below that value.
Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
• 0-th percentile: minimum
• 25-th percentile: first quartile (Q1)
• 50-th percentile: median, also the second quartile (Q2)
• 75-th percentile: third quartile (Q3)
• 100-th percentile: maximum
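For the example data 1 through 11, these values can be computed with Python's standard `statistics` module. This is a sketch, not the slides' own method: software and textbooks use slightly different conventions for quartiles, so the Q1 and Q3 shown here depend on the interpolation rule chosen.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

minimum = min(data)                # 0-th percentile
median = statistics.median(data)   # 50-th percentile, Q2
maximum = max(data)                # 100-th percentile

# statistics.quantiles with n=4 returns the three quartile cut points.
# The default method is "exclusive"; "inclusive" is another convention.
q1, q2, q3 = statistics.quantiles(data, n=4)
q1_inc, q2_inc, q3_inc = statistics.quantiles(data, n=4, method="inclusive")

print("five-number summary:", minimum, q1, median, q3, maximum)
print("quartiles (exclusive):", q1, q2, q3)
print("quartiles (inclusive):", q1_inc, q2_inc, q3_inc)
```

For these data the default rule gives Q1 = 3 and Q3 = 9, while the inclusive rule gives 3.5 and 8.5. The median is 6 under both.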
Height of female MAT 117 students The figure labels five numbers: Min., Q1, Median, Q3, and Max. We want to graphically represent these five numbers, called the five-number summary. This graph is called a box plot. As you can see, there is a bit more to it than just these five numbers.
Anatomy of the box plot Labeled parts: lower whisker, Q1, median, Q3, upper whisker, and potential outliers.
IQR, whisker, and outliers
• Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range (IQR): IQR = Q3 – Q1
• Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles:
Max upper whisker reach = Q3 + 1.5 × IQR
Max lower whisker reach = Q1 – 1.5 × IQR
• A potential outlier is an observation beyond the maximum reach of the whiskers. It is an observation that appears to be extreme relative to the rest of the data.
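The whisker and outlier rules above can be sketched as follows. The function `box_plot_summary` and the data are hypothetical, and the quartiles come from the default rule of `statistics.quantiles`, which is only one of several conventions.

```python
import statistics


def box_plot_summary(values):
    """Compute the pieces of a box plot using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_reach = q1 - 1.5 * iqr  # maximum lower whisker reach
    upper_reach = q3 + 1.5 * iqr  # maximum upper whisker reach
    inside = [x for x in data if lower_reach <= x <= upper_reach]
    outliers = [x for x in data if x < lower_reach or x > upper_reach]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_reach": lower_reach,
        "upper_reach": upper_reach,
        # Whiskers stop at the most extreme observations within reach.
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "potential_outliers": outliers,
    }


# Hypothetical data with one unusually large observation.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
summary = box_plot_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here Q1 = 3 and Q3 = 9, so IQR = 6 and the whiskers may reach from -6 to 18. The upper whisker stops at 10, the largest observation within reach, and 30 is flagged as a potential outlier.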
Outliers Why is it important to look for outliers? • Identify extreme skew in the distribution. • Identify data collection and entry errors. • Provide insight into interesting features of the data.
Scatterplot A scatterplot provides a case-by-case view of data for two quantitative variables.
Scatterplots: trends Panels: linear trend, nonlinear trend.
Scatterplots: trends (continued) Panels: cluster trend, no apparent trend.