Mathematical Statistics Instructor: Dr. Deshi Ye Course homepage: http://www.cs.zju.edu.cn/people/yedeshi/
Course information • What is this course for? • This course provides an elementary introduction to mathematical statistics with applications. • Topics include: statistical estimation; hypothesis testing; confidence intervals; calculation of a P-value; nonparametric testing; curve fitting; analysis of variance; and factorial experimental design.
Grading • Grades for the course will be based on the following weighting: 1) Class attendance: 10% 2) Homework assignments: 26% 3) Unit quizzes: 24% (12%, 12%) 4) Final exam: 40%
Introduction • Probability theory is devoted to the study of uncertainty and variability. • Statistics can be described as the study of how to make inferences and decisions in the face of uncertainty and variability.
Brief History • Blaise Pascal and Pierre de Fermat: the origins of probability theory are found in their work • concerning a popular dice game, • which established fundamental principles of probability theory • Pierre-Simon Laplace: • Before him, probability was concerned mainly with the analysis of games of chance • Laplace applied probabilistic ideas to many scientific and practical problems
A case study • Visually inspecting data to improve product quality
Population and Sample • Investigations of a physical phenomenon, production process, or manufactured unit share some common characteristics. • Relevant data must be collected. • Unit: the source of each measurement. • A single entity, usually an object or person • Population: the entire collection of units.
Sample • Statistical population: the set of all measurements corresponding to each unit in the entire population of units about which information is sought. • Sample: a sample from a statistical population is the subset of measurements that are actually collected in the course of an investigation.
Ch2: Treatment of data • Outline • Pareto diagrams, dot diagrams • Histograms (Frequency distributions) • Stem-and-leaf display • Box-plot (Quartiles and Percentiles) • The calculation of the mean $\bar{x}$ and standard deviation $s$
Pareto Diagram • For a computer-controlled lathe whose performance was below par, workers recorded the following causes and their frequencies:
Cause                     Frequency
power fluctuations        6
controller not stable     22
operator error            13
worn tool not replaced    2
other                     5
Minitab 14 • 1. Stat -> Quality Tools -> Pareto Chart • 2. Choose "Chart defects table" as follows
Pareto diagram • Pareto diagram: depicts Pareto’s empirical law that any assortment of events consists of a few major and many minor elements. • Typically, two or three elements will account for more than half of the total frequency.
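The ordering behind a Pareto diagram can be sketched in a few lines of Python. This is a minimal illustration using the lathe frequencies from the earlier slide, not the Minitab procedure; the function name `pareto_table` is a hypothetical helper introduced here.

```python
# Causes and frequencies from the lathe example.
causes = {
    "power fluctuations": 6,
    "controller not stable": 22,
    "operator error": 13,
    "worn tool not replaced": 2,
    "other": 5,
}


def pareto_table(freqs):
    """Return (cause, count, cumulative share) rows, largest count first.

    Note: many Pareto charts pin an "other" category to the end;
    this sketch simply sorts every category by frequency.
    """
    total = sum(freqs.values())
    rows = []
    running = 0
    for cause, count in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((cause, count, running / total))
    return rows


table = pareto_table(causes)
for cause, count, cum_share in table:
    print(f"{cause:<24}{count:>4}{cum_share:>8.1%}")
```

For this data the two largest causes (controller not stable, operator error) account for 35 of the 48 recorded events, which is consistent with the empirical law stated above.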
Dot diagram • Observations of the deviations of cutting speed from the target value set by the controller. • Ex.: cutting speed – target speed • 3 6 –2 4 7 4 • In Minitab: stat->dotplots->simple
Dot diagram • This diagram visually summarizes the information that the lathe is generally running fast.
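A text-only dot diagram of the six deviations can be sketched as follows. This is only a rough stand-in for the Minitab plot; the layout (one row per integer value, one dot per observation) is an assumption made for illustration.

```python
from collections import Counter

# Deviations (cutting speed minus target speed) from the example.
deviations = [3, 6, -2, 4, 7, 4]

counts = Counter(deviations)

# One row per integer value between the smallest and largest deviation,
# with one dot for each observation at that value.
for value in range(min(deviations), max(deviations) + 1):
    print(f"{value:>3} | " + "." * counts[value])

# Number of observations above the target (positive deviation).
n_fast = sum(1 for d in deviations if d > 0)
print(f"{n_fast} of {len(deviations)} observations are above target")
```

Five of the six deviations are positive, which is what the diagram shows at a glance.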
Data001. 80 measurements of emissions (in tons) of sulfur oxides from an industrial plant • 15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 9.0 13.2 22.7 9.8 6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7 26.8 • 22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7 19.1 15.2 22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0 18.5 23.0 • 24.6 20.1 16.2 18.0 7.7 13.5 23.5 14.5 14.4 29.6 19.4 17.0 20.8 24.3 22.5 24.6 18.4 18.1 8.3 21.9 12.3 • 22.3 13.3 11.8 19.3 20.0 25.7 31.8 25.9 10.5 15.9 27.5 18.1 17.9 9.4 24.1 20.1 28.5
Frequency distributions • A frequency distribution is a tabular arrangement of data in which the data are grouped into intervals and the number of observations belonging to each interval is counted. • Data presented in this manner are known as grouped data.
Class limit and width • Lower class limit: the smallest value that can belong to a given interval. • Upper class limit: the largest value that can belong to the interval. • Class width: the difference between the upper class limit and the lower class limit. • When designing the intervals to be used in a frequency distribution, it is preferable that the class widths of all intervals be the same.
Variants of frequency distribution • The cumulative frequency distribution is obtained by computing the cumulative frequency, defined as the total frequency of all values less than the upper class limit of a particular interval, for all intervals. • Relative frequency: the ratio of the number of observations in the interval to the total number of observations • The percentage frequency distribution is arrived at by multiplying the relative frequencies of each interval by 100%.
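The four variants can be tallied for DATA001 with a short Python sketch. The class layout is an assumption: seven classes of width 4 starting at 5 (midpoints 7, 11, ..., 31, matching the values entered in the Minitab histogram slide), with each class including its lower boundary.

```python
# DATA001: the 80 sulfur oxide emission measurements (tons).
data001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2, 22.7,
    9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7, 26.8,
    22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7, 19.1, 15.2,
    22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0, 18.5, 23.0,
    24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5, 14.4, 29.6, 19.4,
    17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8, 25.9, 10.5, 15.9, 27.5,
    18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

# Assumed class layout: [5, 9), [9, 13), ..., [29, 33).
LOWER, WIDTH, N_CLASSES = 5, 4, 7

# Class frequencies: count the observations falling in each class.
freq = [0] * N_CLASSES
for x in data001:
    freq[int((x - LOWER) // WIDTH)] += 1

n = len(data001)

# Cumulative frequency: total count below each class's upper boundary.
cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

relative = [f / n for f in freq]        # relative frequencies
percent = [100 * r for r in relative]   # percentage frequencies

for i in range(N_CLASSES):
    lo = LOWER + i * WIDTH
    print(f"[{lo:>2}, {lo + WIDTH:>2})  {freq[i]:>3}  {cumulative[i]:>3}  "
          f"{relative[i]:.4f}  {percent[i]:6.2f}%")
```

With these assumed classes the frequencies are 3, 10, 14, 25, 17, 9, 2, which sum to 80.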
Histogram • The most common form of graphical presentation of a frequency distribution is the histogram. • Histogram: constructed of adjacent rectangles; the heights of the rectangles are the class frequencies, and the bases of the rectangles extend between successive class boundaries.
Graph->histogram->simple • Graph variables: c4 • Edit bars: click the bars in the output figure; under Binning, select midpoint as the interval type and midpoint/cutpoint positions as the interval definition, and then input 7 11 15 19 23 27 31 as illustrated in the following
Density histogram • When a histogram is constructed from a frequency table having classes of unequal lengths, the height of each rectangle must be changed to • Height = relative frequency / width. • The area of the rectangle then represents the relative frequency for the class and the total area of the histogram is 1.
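The height rule can be checked numerically. The sketch below uses a hypothetical unequal-width grouping of DATA001: the 4-ton classes from 5 to 25 (frequencies 3, 10, 14, 25, 17) plus one merged class from 25 to 33 (frequency 9 + 2 = 11). The merge is an assumption made purely to get unequal widths.

```python
# Class boundaries and frequencies; the last class is twice as wide
# as the others (hypothetical merge of the two top classes).
boundaries = [5, 9, 13, 17, 21, 25, 33]
frequencies = [3, 10, 14, 25, 17, 11]

n = sum(frequencies)
widths = [boundaries[i + 1] - boundaries[i] for i in range(len(frequencies))]

# Height = relative frequency / width.
heights = [(f / n) / w for f, w in zip(frequencies, widths)]

# Area of each rectangle = height * width = relative frequency.
areas = [h * w for h, w in zip(heights, widths)]
total_area = sum(areas)

for b, w, h in zip(boundaries, widths, heights):
    print(f"[{b:>2}, {b + w:>2})  width={w}  height={h:.5f}")
print(f"total area = {total_area:.6f}")
```

Without the division by width, the merged class would appear to hold more data per ton than it actually does; dividing keeps the area of each rectangle proportional to its share of the observations.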
Cumulative histogram • 1) Graph->histogram->simple • 2) Data View -> Data Display: check "Symbols" only. Smoother: check "Lowess", and enter "0" for degree of smoothing and "1" for number of steps.
Stem-and-leaf Display • A frequency table records the class limits and the frequency of each class, but the original data points are lost. • Stem-and-leaf: serves the same purpose as a histogram but preserves the original data points. • Example: 10 numbers: • 12, 13, 21, 27, 33, 34, 35, 37, 40, 40
Frequency table
Class limits    Frequency
10 – 19         2
20 – 29         2
30 – 39         4
40 – 49         2
Stem-and-leaf • Stem-and-leaf: each row has a stem, and each digit on a stem to the right of the vertical line is a leaf. • The "stem" is the left-hand column, which contains the tens digits. • The "leaves" are the lists in the right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties. • Key: "4|0" means 40
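As a minimal sketch, the display for the ten example numbers can be built by splitting each value into its tens digit (stem) and ones digit (leaf). The helper name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative two-digit integers.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each tens digit (stem) to the sorted list of ones digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40]
display = stem_and_leaf(numbers)

for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
# prints:
# 1 | 23
# 2 | 17
# 3 | 3457
# 4 | 00
```

The number of leaves on each stem reproduces the class frequencies, while the digits themselves keep the original values recoverable.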
Stem-and-leaf in Minitab • The display has three columns: • The leaves (right) - Each value in the leaf column represents a digit from one observation. • The stem (middle) - The stem value represents the digit immediately to the left of the leaf digit. • Counts (left) - If the median value for the sample is included in a row, the count for that row is enclosed in parentheses. The values for rows above and below the median are cumulative.
Stem-and-leaf for DATA001 • Stem-and-leaf of frequencies N = 80 • Leaf Unit = 1.0
   2   0  67
   6   0  8999
  11   1  00111
  17   1  223333
  24   1  4445555
  32   1  66677777
 (13)  1  8888888999999
  35   2  0000000111
  25   2  222223333
  16   2  4444455
   9   2  66667
   4   2  889
   1   3  1
Ch2.5: Descriptive measures • Mean: the sum of the observations divided by the sample size. • Median: the center, or location, of a set of data. With the observations arranged in ascending or descending order: • If the number of observations is odd, the median is the middle value. • If the number of observations is even, the median is the average of the two middle values.
Example • 15 14 2 27 13 • Mean: $\bar{x} = (15 + 14 + 2 + 27 + 13)/5 = 71/5 = 14.2$ • Ordering the data from smallest to largest: • 2 13 14 15 27 • The median is the third (middle) value, 14
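Sample variance • Deviations from the mean: $x_i - \bar{x}$ • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • Standard deviation s: $s = \sqrt{s^2}$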
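These descriptive measures can be checked on the five-value example (15, 14, 2, 27, 13) with Python's standard `statistics` module. This is a sketch for verification only; `statistics.variance` and `statistics.stdev` use the n - 1 divisor, i.e. the sample versions.

```python
import statistics

sample = [15, 14, 2, 27, 13]

mean = statistics.mean(sample)          # 71 / 5
median = statistics.median(sample)      # middle value of the ordered data
variance = statistics.variance(sample)  # sample variance, divisor n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

# The same variance computed directly from the deviations from the mean.
deviations = [x - mean for x in sample]
variance_by_hand = sum(d * d for d in deviations) / (len(sample) - 1)

print(f"mean = {mean}, median = {median}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.3f}")
```

The squared deviations sum to 314.8, so the sample variance is 314.8 / 4 = 78.7 and the standard deviation is about 8.87.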
Quartiles and Percentiles • Quartiles: values in a given set of observations that divide the data into 4 equal parts. • The first quartile, $Q_1$, is a value that has one fourth, or 25%, of the observations below its value. • The sample 100p-th percentile is a value such that at least 100p% of the observations are at or below this value, and at least 100(1-p)% are at or above this value.
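One common rule that satisfies the percentile definition is sketched below. The rule itself is an assumption, since the slide states only the definition: order the n observations and compute np; if np is not an integer, round it up and take the value in that position; if np is an integer k, average the k-th and (k+1)-th ordered values. The function name `sample_percentile` is hypothetical, and the data are the ten numbers from the stem-and-leaf example.

```python
import math


def sample_percentile(values, p):
    """Sample 100p-th percentile by the 'round up or average' rule (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)
    position = len(ordered) * p
    if position == int(position):
        k = int(position)
        # np is an integer: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # np is not an integer: round up and take that ordered value.
    return ordered[math.ceil(position) - 1]


numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40]

q1 = sample_percentile(numbers, 0.25)   # np = 2.5 -> 3rd ordered value
q2 = sample_percentile(numbers, 0.50)   # np = 5   -> average of 5th and 6th
q3 = sample_percentile(numbers, 0.75)   # np = 7.5 -> 8th ordered value

print(q1, q2, q3)
```

Other software (Minitab included) may interpolate between ordered values instead, so quartiles reported elsewhere can differ slightly from this rule.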
Example • Example in P34:
Boxplots • A boxplot is a way of summarizing the information contained in the quartiles (or on an interval) • Box length = interquartile range = $Q_3 - Q_1$
Modified boxplot • Outlier: an observation that lies too far from a quartile, • namely more than 1.5 × (interquartile range) above the third quartile or below the first quartile. • Modified boxplot: identifies outliers and reduces their effect on the shape of the boxplot.
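The 1.5 × (interquartile range) criterion can be sketched as follows, using the five-value example from the descriptive measures slide. The quartile rule (round np up, or average two neighbours when np is an integer) and the helper names are assumptions made for illustration; with so few points the result is only a demonstration of the arithmetic.

```python
import math


def sample_percentile(values, p):
    """Sample 100p-th percentile by the 'round up or average' rule (0 < p < 1)."""
    ordered = sorted(values)
    position = len(ordered) * p
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(position) - 1]


def find_outliers(values):
    """Return (outliers, lower fence, upper fence) using 1.5 * IQR fences."""
    q1 = sample_percentile(values, 0.25)
    q3 = sample_percentile(values, 0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
    return outliers, lower_fence, upper_fence


sample = [15, 14, 2, 27, 13]
outliers, lower_fence, upper_fence = find_outliers(sample)

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"outliers: {outliers}")
```

Here Q1 = 13 and Q3 = 15, so the interquartile range is 2 and the fences are 10 and 18; the values 2 and 27 fall outside them and would be plotted individually in a modified boxplot.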