Chap 10: Summarizing Data
10.1: Intro
Univariate or multivariate data (random samples or batches) can be described by procedures that reveal their structure through graphical displays (empirical CDFs, histograms, ...). These displays are to data what PMFs and PDFs are to random variables. Numerical summaries (measures of location and spread), the effects of outliers on these measures, and graphical summaries (boxplots) will also be investigated.
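10.2: CDF-Based Methods
10.2.1: The Empirical CDF (ECDF)
The ECDF is the data analogue of the CDF of a random variable. For a batch $x_1, \dots, x_n$ it is $F_n(x) = \frac{1}{n}\,\#\{x_i \le x\}$, a step function that jumps by $1/n$ at each observation. Its plot conveniently summarizes a data set.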
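The Empirical CDF (cont’d)
Write the ECDF as $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$. For a random sample from $F$, the random variables $I_{(-\infty, x]}(X_i)$ are independent Bernoulli random variables with success probability $F(x)$. Hence $nF_n(x)$ is binomial with parameters $n$ and $F(x)$, so $E[F_n(x)] = F(x)$ and $\mathrm{Var}[F_n(x)] = \frac{1}{n}F(x)[1 - F(x)]$.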
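The ECDF is easy to compute directly from its definition. The following is a minimal sketch, assuming the data fit in a Python list; the function name `ecdf` and the small example batch are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def ecdf(data):
    """Return F_n, where F_n(x) = (number of observations <= x) / n."""
    xs = sorted(data)
    n = len(xs)

    def F_n(x):
        # bisect_right returns how many sorted values are <= x
        return bisect_right(xs, x) / n

    return F_n

# Hypothetical batch of five numbers (illustration only)
F_n = ecdf([3.1, 0.4, 2.2, 2.2, 5.0])
print(F_n(2.2))  # proportion of the batch that is <= 2.2
```

Sorting once and using a binary search keeps each evaluation cheap, which matters when the ECDF is evaluated on a fine grid for plotting.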
10.2.2: The Survival Function
In medical or reliability studies, the data sometimes consist of times to failure or death, and it is then more convenient to use the survival function $S(t) = P(T > t) = 1 - F(t)$ than the CDF. The empirical survival function (ESF) gives the proportion of the data greater than $t$ and is given by $S_n(t) = 1 - F_n(t)$, where $F_n$ is the ECDF. Survival plots (plots of the ESF) may be used to provide information about the hazard function, which may be thought of as the instantaneous rate of mortality for an individual alive at time $t$ and is defined as $h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log S(t)$.
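The Survival Function (cont’d)
From page 149, the delta method (to first order) gives $\mathrm{Var}\{\log[1 - F_n(t)]\} \approx \frac{\mathrm{Var}[1 - F_n(t)]}{[1 - F(t)]^2} = \frac{1}{n}\,\frac{F(t)}{1 - F(t)}$, which shows how unreliable the empirical log-survival function is for large values of $t$: the variance grows without bound as $F(t) \to 1$.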
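To see the variance blow up in the tail, one can plug the empirical survival function into the first-order variance formula $\frac{1}{n}\,F(t)/[1 - F(t)]$. This is a rough sketch: replacing $F(t)$ by its estimate is an assumption made here for illustration, and the function names and the ten-point data set are hypothetical.

```python
import math

def survival(data, t):
    """Empirical survival function: proportion of observations greater than t."""
    return sum(1 for x in data if x > t) / len(data)

def log_survival_var(data, t):
    """Approximate Var(log S_n(t)) = F(t) / (n * (1 - F(t))), with F estimated."""
    n = len(data)
    s = survival(data, t)
    if s == 0:
        return math.inf  # no observations beyond t, log S_n(t) is undefined
    return (1 - s) / (n * s)

# Hypothetical failure times (illustration only)
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for t in (2, 5, 9):
    print(t, survival(times, t), log_survival_var(times, t))
```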
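10.2.3: Q-Q Plots (Quantile-Quantile Plots)
A Q-Q plot is useful for comparing CDFs: it plots the quantiles of one distribution against the quantiles of another. Let $x_p$ and $y_p$ denote the $p$th quantiles of $F$ and $G$.
Additive model: if $G(y) = F(y - h)$, then $y_p = x_p + h$, and the Q-Q plot is a straight line with slope 1 and intercept $h$.
Multiplicative model: if $G(y) = F(y/c)$ with $c > 0$, then $y_p = c\,x_p$, and the Q-Q plot is a straight line through the origin with slope $c$.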
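For two batches of the same size, the points of an empirical Q-Q plot are simply the paired order statistics. The sketch below assumes equal sample sizes (unequal sizes would need interpolated quantiles, which is not attempted here); `qq_points` and the example batches are hypothetical.

```python
def qq_points(x, y):
    """Pair the order statistics of two equal-size batches for a Q-Q plot."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(x), sorted(y)))

# Hypothetical control batch and two transformed versions of it
control = [3.0, 1.0, 2.0, 5.0, 4.0]
shifted = [v + 10.0 for v in control]  # additive model with h = 10
scaled = [2.0 * v for v in control]    # multiplicative model with c = 2

print(qq_points(control, shifted))  # points on a line with slope 1
print(qq_points(control, scaled))   # points on a line through the origin
```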
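10.3: Histograms, Density Curves & Stem-and-Leaf Plots
Kernel PDF estimate: $f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i)$, where $w_h(x) = \frac{1}{h}\,w(x/h)$, $w$ is a smooth, symmetric weight function that integrates to 1 (often the standard normal density), and the bandwidth $h$ controls the smoothness of the estimate.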
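A kernel estimate can be coded in a few lines. This sketch assumes a standard normal weight function and a bandwidth supplied by the user; both are illustrative choices, and the names `gaussian_kernel` and `kde` as well as the sample values are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the weight function w."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(data, h):
    """Return f_h, where f_h(x) = (1 / (n h)) * sum of w((x - X_i) / h)."""
    n = len(data)

    def f_h(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

    return f_h

# Hypothetical batch; a smaller h gives a rougher curve, a larger h a smoother one
f = kde([1.0, 2.0, 4.0], h=0.5)
print([round(f(x), 4) for x in (0.0, 1.0, 2.0, 3.0, 4.0)])
```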
10.4: Location Measures
10.4.1: The arithmetic mean is sensitive to outliers (not robust).
10.4.2: The median is a robust measure of location.
10.4.3: The trimmed mean is another robust measure of location.
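A small numerical comparison makes the robustness claims concrete. The sketch below assumes the common convention of dropping `int(n * alpha)` observations from each end for the trimmed mean; the batch with one wild value is hypothetical.

```python
import statistics

def trimmed_mean(data, alpha):
    """Mean after discarding int(n * alpha) observations from each end."""
    xs = sorted(data)
    n = len(xs)
    k = int(n * alpha)  # number of observations dropped per tail
    if 2 * k >= n:
        raise ValueError("alpha is too large for this sample size")
    return statistics.mean(xs[k:n - k])

# Hypothetical batch with one outlier (illustration only)
batch = [2, 4, 5, 6, 8, 100]
print(statistics.mean(batch))     # pulled far to the right by the outlier
print(statistics.median(batch))   # barely affected
print(trimmed_mean(batch, 0.2))   # drops one value from each end
```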
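Location Measures (cont’d)
The trimmed mean (discard a fixed fraction of the observations from each end) is a natural compromise between the mean (discard no observations) and the median (discard all but one or two observations).
Another compromise between the mean and the median was proposed by Huber (1981), who suggested choosing the estimate of $\mu$ to minimize $\sum_{i=1}^{n} \rho\!\left(\frac{X_i - \mu}{\sigma}\right)$, or equivalently to solve $\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0$ with $\psi = \rho'$. The solution is called an M-estimate. Huber's $\rho$ is quadratic near 0 and linear beyond a cutoff $k$, so $\psi(x) = x$ for $|x| \le k$ and $\psi(x) = k\,\mathrm{sign}(x)$ otherwise.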
10.4.4: M-Estimates (cont’d)
M-estimates coincide with MLEs when $\rho$ is the negative log-density, $\rho(x) = -\log f(x)$, because minimizing $\sum_{i} \rho\!\left(\frac{X_i - \mu}{\sigma}\right)$ is then the same as maximizing the likelihood. The computation of an M-estimate is a nonlinear minimization problem that must be solved by an iterative method (such as Newton-Raphson). Such a minimizer is unique for convex $\rho$. Here, we assume that $\sigma$ is known; in practice, a robust estimate of $\sigma$ (to be seen in Section 10.5) should be used instead.
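One simple way to solve the estimating equation $\sum_i \psi((X_i - \mu)/\sigma) = 0$ is iteratively reweighted averaging, used here in place of Newton-Raphson because it needs no derivatives. This is only a sketch under stated assumptions: $\psi$ is Huber's clipped identity, the cutoff `k = 1.5` is an illustrative default, `sigma` is supplied by the caller, and the data are hypothetical.

```python
import statistics

def huber_psi(u, k):
    """Huber's psi: identity on [-k, k], clipped to +/- k outside."""
    return max(-k, min(k, u))

def huber_location(data, sigma, k=1.5, tol=1e-10, max_iter=200):
    """Approximate solution of sum(psi((x - mu) / sigma)) = 0."""
    mu = statistics.median(data)  # robust starting value
    for _ in range(max_iter):
        weights = []
        for x in data:
            u = (x - mu) / sigma
            # weight psi(u) / u: 1 inside the cutoff, k / |u| outside
            weights.append(1.0 if abs(u) <= k else k / abs(u))
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical batch with one outlier; sigma = 2 is assumed known here
batch = [2, 4, 5, 6, 8, 100]
print(huber_location(batch, sigma=2.0))
```

Observations far from the current estimate receive small weights, so the outlier influences the result only through a bounded term, which is the sense in which the M-estimate sits between the mean and the median.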
10.4.5: Comparison of Location Estimates
Among the location estimates introduced in this section, which one is the best? There is no easy answer. For a symmetric underlying distribution, all four statistics (sample mean, sample median, alpha-trimmed mean, and M-estimate) estimate the center of symmetry. For a nonsymmetric underlying distribution, these four statistics estimate four different population parameters, namely the population mean, the population median, the population trimmed mean, and a functional of the CDF determined by the weight function $\psi$. Idea: run some simulations, compute more than one estimate of location, and pick the winner.
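The simulation idea can be sketched as follows. The two sampling models (a standard normal and a contaminated normal with 10% of the draws from N(0, 10^2)), the sample size, and the number of replications are all assumptions chosen for illustration; the performance measure is the mean squared error about the true center 0.

```python
import random
import statistics

def trimmed_mean(xs, alpha):
    xs = sorted(xs)
    k = int(len(xs) * alpha)  # observations dropped per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def contaminated_normal(rng, n, eps=0.1, wide_sd=10.0):
    """N(0, 1) with probability 1 - eps, otherwise N(0, wide_sd^2)."""
    return [rng.gauss(0.0, wide_sd if rng.random() < eps else 1.0)
            for _ in range(n)]

def standard_normal(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def compare_estimators(sampler, n=20, reps=2000, seed=1):
    """Estimated mean squared error of each estimator (true center is 0)."""
    rng = random.Random(seed)
    estimators = {
        "mean": lambda xs: sum(xs) / len(xs),
        "median": statistics.median,
        "trimmed_10": lambda xs: trimmed_mean(xs, 0.10),
    }
    sq_err = {name: 0.0 for name in estimators}
    for _ in range(reps):
        xs = sampler(rng, n)
        for name, est in estimators.items():
            sq_err[name] += est(xs) ** 2
    return {name: total / reps for name, total in sq_err.items()}

print(compare_estimators(standard_normal))
print(compare_estimators(contaminated_normal))
```

Under the normal model the mean should come out ahead, while under contamination the robust estimates should win, which illustrates why no single estimate is best in all cases.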
10.4.6: Estimating Variability of Location Estimates by the Bootstrap
Using a computer, we can generate (simulate) a large number $B$ of samples of size $n$ from a common known distribution $F$. From each sample, we compute the value of the location estimate $\hat\theta$. The empirical distribution of the resulting values $\theta^*_1, \dots, \theta^*_B$ is a good approximation (for large $B$) to the distribution function of $\hat\theta$. Unfortunately, $F$ is not known in general. The bootstrap idea is to plug in the empirical CDF $F_n$ for $F$ and resample from $F_n$.
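10.4.6: Bootstrap (cont’d)
A sample of size $n$ from $F_n$ is a sample of size $n$ drawn with replacement from the observed data $x_1, \dots, x_n$ that produce $F_n$. Thus, the standard error of $\hat\theta$ is estimated by $s_{\hat\theta} = \sqrt{\frac{1}{B}\sum_{i=1}^{B}(\theta^*_i - \bar\theta^*)^2}$, where $\theta^*_1, \dots, \theta^*_B$ are the values computed from the $B$ bootstrap samples and $\bar\theta^*$ is their average. Read Example A on page 368. The bootstrap distribution can also be used to form approximate confidence intervals and to test hypotheses.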
10.5: Measures of Dispersion
A measure of dispersion (scale) gives a numerical indication of the “scatteredness” of a batch of numbers. The most common measure of dispersion is the sample standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2}$. Like the sample mean, the sample standard deviation is not robust (it is sensitive to outliers). Two simple robust measures of dispersion are the IQR (interquartile range, the difference between the 75th and 25th percentiles) and the MAD (median absolute deviation from the median).
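The three dispersion measures can be compared on a batch with and without an extreme outlier. This is a sketch: quartile conventions differ between textbooks and software, and the "inclusive" method of Python's `statistics.quantiles` is just one choice made here; the batches are hypothetical.

```python
import statistics

def iqr(data):
    """Interquartile range, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

def mad(data):
    """Median absolute deviation from the median."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)

# Hypothetical batches that differ only in the size of the outlier
batch = [2, 4, 5, 6, 8, 100]
worse = [2, 4, 5, 6, 8, 1000]
for b in (batch, worse):
    print(statistics.stdev(b), iqr(b), mad(b))
```

Making the outlier ten times larger inflates the standard deviation but leaves the IQR and the MAD unchanged.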
10.6: Box Plots
Tukey invented a graphical display, the boxplot, that indicates the center of a data set (median), the spread of the data (IQR), and the presence of possible outliers. A boxplot also gives an indication of the symmetry or asymmetry (skewness) of the distribution of the data values. Later, we will see how boxplots can be used effectively to compare batches of numbers.
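The numbers behind a boxplot can be computed without any plotting library. The sketch below assumes the usual 1.5 x IQR rule for flagging possible outliers and the "inclusive" quartile convention; both are common choices rather than the only ones, and `boxplot_stats` and the batch are hypothetical.

```python
import statistics

def boxplot_stats(data, fence=1.5):
    """Quartiles, whisker ends, and points flagged as possible outliers."""
    xs = sorted(data)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    spread = q3 - q1
    low_fence = q1 - fence * spread
    high_fence = q3 + fence * spread
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme point within the lower fence
        "whisker_high": inside[-1],  # most extreme point within the upper fence
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

# Hypothetical batch with one outlier (illustration only)
print(boxplot_stats([2, 4, 5, 6, 8, 100]))
```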
10.7: Conclusion
Several graphical tools were introduced in this chapter as methods of presenting and summarizing data. Some aspects of the sampling distributions of these summaries (assuming a stochastic model for the data) were discussed. Bootstrap methods (for approximating a sampling distribution and its functionals) were also revisited.
Parametric Bootstrap
Example: Estimating a population mean
It is known that explosives used in mining leave a crater that is circular in shape with a diameter that follows an exponential distribution with unknown mean $\mu$. Suppose a new form of explosive is tested. The sample crater diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062 211 269 115 586 983 115 162 70 565 114
It would be inappropriate to use $\bar x \pm t_{19}(0.05)\, s/\sqrt{20}$ as a 90% CI for the population mean via the t-curve (df = 19)
Parametric Bootstrap (cont’d)
because such a CI is based on the normality assumption for the parent population. The parametric bootstrap replaces the exponential population distribution $F$ with unknown mean by the known exponential distribution $F^*$ whose mean is the sample mean, $\bar x = 514.15$. Then resamples of size n = 20 are drawn from this surrogate population. Using Minitab, we can generate B = 1000 such samples of size n = 20 and compute the sample mean of each of these B samples. A bootstrap CI can be obtained by trimming off 5% from each tail of the sorted means. Thus, a parametric bootstrap 90% CI is given by (50th smallest = 332.51, 951st smallest = 726.45).
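The same procedure can be sketched in Python instead of Minitab. The seed and function name are hypothetical, and because the resamples are random the endpoints will differ somewhat from the Minitab values quoted above.

```python
import random

# Crater diameters (cm) from the example
craters = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
           211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

def parametric_bootstrap_ci(data, B=1000, seed=0):
    """90% CI for the mean from B exponential resamples with mean x-bar."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    means = []
    for _ in range(B):
        # expovariate takes the rate parameter, which is 1 / mean
        sample = [rng.expovariate(1.0 / xbar) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = B // 20              # 5% of the resamples in each tail
    lower = means[tail - 1]     # 50th smallest when B = 1000
    upper = means[B - tail]     # 951st smallest when B = 1000
    return lower, upper

print(sum(craters) / len(craters))
print(parametric_bootstrap_ci(craters))
```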
Nonparametric Bootstrap
If we do not assume that we are sampling from a normal population or from a population of some other specified shape, then we must extract all the information about the population from the sample itself. Nonparametric bootstrapping approximates the sampling distribution of our estimate by drawing samples with replacement from the original (raw) data. Thus, a nonparametric bootstrap 90% CI for the population mean $\mu$ is obtained by taking the 5th and 95th percentiles of the resample means $\bar x^*$.
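A nonparametric percentile interval for the crater data can be sketched in the same way. The seed, `B = 1000`, and the function name are assumptions; the percentile positions follow the trimming convention of the parametric example (50th and 951st smallest of 1000 sorted means).

```python
import random

# Crater diameters (cm) from the parametric bootstrap example
craters = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
           211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

def percentile_ci(data, B=1000, seed=0):
    """90% percentile interval for the mean from resamples of the raw data."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample is drawn with replacement from the observed values
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(B))
    tail = B // 20  # 5% of the resamples in each tail
    return means[tail - 1], means[B - tail]

print(percentile_ci(craters))
```

No distributional shape is assumed here, so the interval reflects the skewness of the observed data, including the single very large crater.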