Descriptive Statistics • Summarizing, Simplifying • Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to large data sets. • Describing • Useful for recognizing important characteristics of data • Used in inferential statistics
Important Characteristics of Data • Center – typical data value • Variation – spread in data • Distribution – shape of data distribution • Outliers – problems in data • Time – changes over time?
Graphical Summary Methods • Pie Chart • Useful for qualitative or quantitative data • Bar Chart • Useful for qualitative data • Called a Pareto chart if the bars are ordered by height, tallest first
Graphical Summary Methods • Frequency Histogram • Useful for quantitative data • A “connected bar plot” with bar height proportional to the frequency of the associated value or class (interval of values) • Graphical summary of a frequency distribution (sometimes called a frequency table)
Frequency Distribution (Table) • For Discrete Data: • Lists data values and corresponding counts • Resulting histogram has a bar on each value with height proportional to its count • For Continuous Data: • Data is divided into classes (intervals of values) and the classes are listed along with the corresponding counts
Definitions for Classes • Lower Class Limit – smallest value in a class • Upper Class Limit – largest value in a class • Class Width – distance between consecutive lower (or upper) class limits • Class Mark – midpoint of class (calculated as the mean of the lower and upper class limits) • Class Boundaries – values that eliminate the space between consecutive classes for plotting purposes
Constructing a Frequency Table • Select the class width (w) • Approximated by range divided by the desired number of classes (usually between 5 and 15 in medium-sized data sets) • Select lower class limit for first class • Construct class limits using w as the distance between consecutive lower or upper class limits • Count number of observations in each class.
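The steps above can be sketched in Python. This is a minimal sketch that assumes integer-valued data, starts the first class at the minimum, and rounds the class width up to a whole number; the function name `frequency_table` and the sample `scores` are hypothetical. Because the width is only approximated, the table may end up with one class more than requested.

```python
import math

def frequency_table(data, desired_classes):
    """Return a list of (lower_limit, upper_limit, count) rows for integer data."""
    low, high = min(data), max(data)
    # Class width w: range divided by the desired number of classes, rounded up.
    w = max(1, math.ceil((high - low) / desired_classes))
    table = []
    lower = low  # assumption: the first lower class limit is the minimum
    while lower <= high:
        upper = lower + w - 1  # consecutive lower limits are w apart
        count = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, count))
        lower += w
    return table

scores = [52, 55, 61, 64, 67, 70, 71, 73, 78, 80, 84, 85, 88, 91, 99]
for lower, upper, count in frequency_table(scores, 5):
    print(f"{lower}-{upper}: {count}")
```

Every observation falls in exactly one class, so the counts add up to the number of observations.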
Types of Histograms • Frequency – height of bar is count • Relative Frequency – height of bar is relative frequency (proportion/percentage/probability) • Cumulative Frequency – height of bar is cumulative count • Cumulative Relative Frequency – height of bar is cumulative relative frequency (percentile)
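The four bar heights can all be derived from the class counts. The short sketch below uses hypothetical counts for five classes; the variable names are illustrative only.

```python
from itertools import accumulate

counts = [3, 4, 3, 4, 1]  # hypothetical class counts (frequency histogram heights)
n = sum(counts)

relative = [c / n for c in counts]                  # proportions that sum to 1
cumulative = list(accumulate(counts))               # running total of counts
cumulative_relative = [c / n for c in cumulative]   # running proportion, ends at 1

print(relative)
print(cumulative)
print(cumulative_relative)
```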
Other Types of Graphs • Dotplot • Each value is plotted as a dot along an x-axis. Dots representing equal values are stacked. • Stem-and-Leaf Plot • Each value is separated into a stem (such as the leftmost value or values) and a leaf (such as the rightmost value or values) • Stems are listed in order and leaves are plotted alongside the appropriate stem • Ordered Stem-and-Leaf Plot
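As an illustration of the ordered stem-and-leaf plot, here is a minimal sketch that assumes non-negative integers, uses the ones digit as the leaf, and uses everything to its left as the stem. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of an ordered stem-and-leaf plot (leaf = ones digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves ordered
        leaves[v // 10].append(v % 10)
    lines = []
    # List every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

for line in stem_and_leaf([27, 12, 44, 15, 30, 21, 27]):
    print(line)
```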
Other Types of Graphs • Scatter Diagram or Scatter Plot • Plot of paired (x,y) data with x on the horizontal axis and y on the vertical axis. • Useful for seeing the relationship between x and y • Time-Series Plot • A special scatter diagram which has time plotted on the horizontal axis.
Importance of Knowing the Distribution of Data • Distribution can affect the choice of an appropriate statistic to use. • Distribution can aid in determining the validity of many inferential statistics. • Common data distributions • Bell (normal), bi-modal, right-skewed (chi-squared, exponential), left-skewed
Numerical Summary Methods • Measures of Center (Location) • The middle value or typical observation from a population. • Measures of Variability • The dispersion or spread in the population. • Measures of Relative Standing • The comparative value relative to the population.
Measures of Center • Mean (Arithmetic Mean) • The size of the population is denoted by N. The sample size is denoted by n. • Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ • Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Measures of Center • Median • Middle value in the ordered data for odd n. • Mean of the 2 middle values for even n. • Commonly called the 50th percentile. • The location of the median in the ordered data set is: (n+1)÷2
Measures of Center • Mode • Most common value (occurs most frequently) • Midrange • Midway between the lowest and highest value • Trimmed Mean • Mean of values remaining after an equal number of values are removed from each tail.
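The measures of center can be compared on one small data set. The sketch below uses Python's standard `statistics` module for the mean, median, and mode; `midrange` and `trimmed_mean` are hypothetical helper names written for this example, and the data are made up to include one large value.

```python
import statistics

def midrange(data):
    """Midway between the lowest and highest value."""
    return (min(data) + max(data)) / 2

def trimmed_mean(data, k):
    """Mean of what remains after removing the k smallest and k largest values."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    return statistics.mean(ordered[k:len(ordered) - k])

data = [2, 3, 3, 5, 7, 8, 40]  # hypothetical data with one large value

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))
print("midrange:", midrange(data))
print("trimmed mean (1 from each tail):", trimmed_mean(data, 1))
```

The single large value pulls the mean and midrange upward, while the median and the trimmed mean stay close to the bulk of the data.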
Skewness • SYMMETRIC: Mode = Mean = Median • SKEWED LEFT (negatively): Mean < Median < Mode • SKEWED RIGHT (positively): Mode < Median < Mean
Measures of Variation • Range • Distance between minimum and maximum • Range = Max – Min • The range does not measure the overall variability in the data. A measure is needed which incorporates the variability of every value in the data. One way is to look at the deviations from the mean, $(x_i - \mu)$, for each $x_i$.
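Measures of Variation • Variance • The average squared difference of the observations from the mean. • Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ • Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$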
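Measures of Variation • Standard Deviation • The square root of the average squared difference of the observations from the mean. • Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ • Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$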
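A small sketch of the variance and standard deviation calculations, assuming the usual conventions: the population versions divide by N and the sample versions divide by n - 1. The function names and data are hypothetical.

```python
import math

def population_variance(xs):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Squared deviations from the sample mean, divided by n - 1."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("population variance:", population_variance(data))
print("population std dev:", math.sqrt(population_variance(data)))
print("sample variance:", sample_variance(data))
print("sample std dev:", math.sqrt(sample_variance(data)))
```

The sample variance is always a little larger than the population formula applied to the same values, because it divides by n - 1 instead of n.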
Empirical Rule • For data that is approximately bell-shaped in distribution: • about 68% of data values fall within 1 standard deviation of the mean, • about 95% of data values fall within 2 standard deviations of the mean, • about 99.7% of data values fall within 3 standard deviations of the mean.
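The percentages in the empirical rule are rounded values for an exactly normal distribution. As a quick check, the proportion of a normal distribution within k standard deviations of the mean equals erf(k/√2), which the sketch below evaluates with the standard library. The function name is hypothetical.

```python
import math

def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * normal_proportion_within(k):.2f}%")
```

Real data that is only approximately bell-shaped will match these percentages only roughly.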
The Empirical Rule (applies to bell-shaped distributions) • FIGURE 2-13: bell curve with the horizontal axis marked at x̄ - 3s, x̄ - 2s, x̄ - s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s • 34% of data in each band between x̄ and x̄ ± s (68% within 1 standard deviation) • 13.5% in each band between 1 and 2 standard deviations from the mean (95% within 2 standard deviations) • 2.4% in each band between 2 and 3 standard deviations from the mean (99.7% of data are within 3 standard deviations of the mean) • 0.1% in each tail beyond 3 standard deviations
Chebyshev’s Theorem • For data from any distribution, the proportion (or fraction) of values lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1. • At least 3/4 (75%) of all values lie within 2 standard deviations of the mean. • At least 8/9 (89%) of all values lie within 3 standard deviations of the mean.
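The bound can be checked numerically. The sketch below computes 1 - 1/K² and compares it with the observed proportion in a deliberately skewed, made-up data set, using the population standard deviation of that data set. The function names are hypothetical.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("K must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(data, k):
    """Observed proportion of values within k population standard deviations."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)

skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 30]  # hypothetical right-skewed data
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), proportion_within(skewed, k))
```

The observed proportions are larger than the bounds, which is typical: the theorem guarantees a minimum, not an estimate.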
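Coefficient of Variation • Relates the standard deviation of a data set to its mean • Sample: $CV = \frac{s}{\bar{x}} \times 100\%$ • Population: $CV = \frac{\sigma}{\mu} \times 100\%$ • The CV is useful for comparing relative variation between two or more sets of data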
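A short sketch of comparing relative variation across data sets measured in different units. It assumes the common definition of the sample CV, 100 · s / x̄, and uses made-up height and weight values; the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    """Sample CV as a percentage: 100 * s / mean (assumed definition)."""
    return 100 * statistics.stdev(data) / statistics.mean(data)

heights_cm = [160, 165, 170, 175, 180]  # hypothetical values
weights_kg = [50, 60, 70, 80, 90]       # hypothetical values

print(f"heights CV: {coefficient_of_variation(heights_cm):.1f}%")
print(f"weights CV: {coefficient_of_variation(weights_kg):.1f}%")
```

Because the CV has no units, the two results can be compared directly even though one data set is in centimeters and the other in kilograms.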
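Measures of Relative Position • Standard Score or Z-Score • The number of standard deviations that a value x lies above or below the mean • Sample: $z = \frac{x - \bar{x}}{s}$ • Population: $z = \frac{x - \mu}{\sigma}$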
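A z-score expresses how many standard deviations a value lies from the mean. The sketch below assumes the standard definition, z = (x - mean) / standard deviation, with made-up exam figures; the function name is hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # above the mean
print(z_score(60, 70, 10))  # below the mean
```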
Measures of Relative Position • Order Statistics • The order statistics, denoted by x(1), x(2), …, x(n), are the observed data values ordered from smallest to largest.
Measures of Relative Position • Percentile • The kth percentile (Pk) separates the bottom k% of data from the top (100-k)% of data. • The location of Pk in the order statistics is: $L = \frac{k}{100} \cdot n$
Finding the Value of the kth Percentile (Figure 2-15) • Start: sort the data. (Arrange the data in order of lowest to highest.) • Compute $L = \frac{k}{100} \cdot n$, where n = number of values and k = percentile in question. • Is L a whole number? • Yes: the value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find Pk by adding the Lth value and the next value and dividing the total by 2. • No: change L by rounding it up to the next larger whole number. The value of Pk is the Lth value, counting from the lowest.
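The flowchart's procedure can be sketched as follows. This sketch assumes k is a whole number between 1 and 99 so that L = (k/100)·n can be tested with integer arithmetic; the function name and data are hypothetical.

```python
def percentile_value(data, k):
    """Value of the k-th percentile (integer 0 < k < 100) by the flowchart's rule."""
    ordered = sorted(data)                  # step 1: sort lowest to highest
    # L = (k / 100) * n; divmod avoids floating-point trouble in the whole-number test.
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        # L is whole: average the L-th value and the next one (positions are 1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up and take that value (1-based position whole + 1).
    return ordered[whole]

data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25, 30, 41]
print(percentile_value(data, 25))
print(percentile_value(data, 50))
print(percentile_value(data, 90))
```

Other textbooks and software packages use different interpolation rules, so their percentile values can differ slightly from this one.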
Measures of Relative Position • Quartiles • The quartiles (Q1=P25, Q2=P50 and Q3=P75) separate the data into fourths. • Interquartile Range (IQR) • The distance between the first and third quartiles: IQR=Q3-Q1. • The IQR is a measure of variability which is less affected by outliers than the range, variance and standard deviation.
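To illustrate why the IQR is less affected by outliers than the range, the sketch below computes both for a made-up data set before and after its largest value is replaced by an extreme one. It assumes quartiles are found as P25 and P75 with the rule: if L = (k/100)·n is whole, average the Lth and next values, otherwise round L up. The function names are hypothetical.

```python
def percentile_value(data, k):
    """k-th percentile for integer 0 < k < 100 (whole L: average two values; else round up)."""
    ordered = sorted(data)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]

def iqr(data):
    """Interquartile range: Q3 - Q1, with Q1 = P25 and Q3 = P75."""
    return percentile_value(data, 75) - percentile_value(data, 25)

clean = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25, 30, 41]
with_outlier = clean[:-1] + [410]   # same data, largest value replaced by an outlier

for name, values in (("clean", clean), ("with outlier", with_outlier)):
    print(name, "range =", max(values) - min(values), "IQR =", iqr(values))
```

The range jumps when the outlier is introduced, while the IQR does not change at all.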
Box-and-Whisker Diagram (Boxplot) • Graphical display of the “5 Number Summary” • x(1) = Min • Q1 = P25, Q2 = P50, Q3 = P75 • x(n) = Max • Inner & Outer Fences • Useful for identifying potential outliers in data.
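A sketch of the numbers behind a boxplot. Two assumptions are made that the slide does not state: the fences follow the common convention of 1.5 × IQR (inner) and 3 × IQR (outer) beyond the quartiles, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, whose interpolation can differ slightly from other percentile rules. The function names and data are hypothetical.

```python
import statistics

def boxplot_summary(data):
    """Five-number summary plus inner and outer fences (assumed 1.5 and 3 IQR rules)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "five_number": (min(data), q1, q2, q3, max(data)),
        "inner_fences": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer_fences": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def potential_outliers(data):
    """Values lying outside the inner fences."""
    low, high = boxplot_summary(data)["inner_fences"]
    return [x for x in data if x < low or x > high]

data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25, 30, 95]
summary = boxplot_summary(data)
print(summary["five_number"])
print(summary["inner_fences"], summary["outer_fences"])
print(potential_outliers(data))
```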
Figure 2-17 Boxplots • Panels: Bell-Shaped, Uniform, Skewed
Percentile Rank • If x=Pk and k is the percentile rank of x, then k is approximately equal to: $k \approx \frac{\text{number of values less than } x}{\text{total number of values}} \times 100$
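A minimal sketch of a percentile rank calculation. It assumes the common textbook definition, the number of values less than x divided by the total number of values, times 100; the function name and data are hypothetical.

```python
def percentile_rank(data, x):
    """Approximate percentile rank of x: percentage of values strictly less than x."""
    below = sum(1 for v in data if v < x)
    return 100 * below / len(data)

data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25, 30, 41]
print(round(percentile_rank(data, 18)))  # 7 of the 12 values are below 18
```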
Exploring • Measures of center: mean, median, and mode • Measures of variation: standard deviation and range • Measures of relative location: order statistics, minimum, maximum, percentile • Unusual values: outliers • Distribution: histograms, stem-and-leaf plots, and boxplots