
Introduction to Statistics


yeriel

Presentation Transcript


  1. Introduction to Statistics Data description and summary

  2. Statistics • Derived from the word state, reflecting the collection of facts of interest to the state • The art of learning from data • Statistics are no substitute for judgment. • A scientific discipline used to collect, describe, summarize, and analyze data • Descriptive vs. inferential • We usually expect to draw meaningful conclusions from the collected data that go beyond a merely descriptive figure or table • An extrapolative inference: an inductive method, reasoning from the sample to the population

  3. Probability • Some assumptions about the chances of obtaining the different data values are needed to draw logical conclusions • The totality of these assumptions is referred to as a probability model • A deductive approach, reasoning from the model to the data

  4. Statistics vs. probability Source: http:///ocw.mit.edu/OcwWeb/Sloan-School-of-Management/

  5. Data: a set of measurements • Categorical (character) • Nominal, e.g., color: red, green, blue • Binary, e.g., (M, F), (H, T), (0, 1) • Ordinal, e.g., attitude to war: agree, neutral, disagree • Numeric • Discrete, e.g., number of children • Continuous, e.g., distance, time, temperature • Interval, e.g., Fahrenheit/Celsius temperature • Ratio (true zero), e.g., distance, number of children

  6. Concepts about data • Population: The set of all units of interest (finite or infinite). • E.g., all students at NCNU • Sample: A subset/subgroup of the population actually observed. • E.g., students in this room. • Variable: A property or attribute of each unit, • e.g., age, height (a column field within a table) • Observation: Values of all variables for an individual unit (a row record in the table)

  7. Matrix form of raw data [Figure: the sample as a table, with one column per variable and one row per observation]

  8. Properties of measurements • Parameter: • Numerical characteristic of population, defined for each variable, e.g. proportion opposed to war • Statistic: • Numerical function of sample used to estimate population parameter. • Precision: • Spread of estimator of a parameter • Accuracy: • How close estimator is to true value • Bias: • Systematic deviation of estimate from true value

  9. Accuracy vs. Precision Source: http:///ocw.mit.edu/OcwWeb/Sloan-School-of-Management/

  10. Is it a good sample? • Is it a representative sample of the population of interest? • Pre-existing bias? • Unavoidable errors?

  11. Describing data sets • Frequency tables and graphs • Scatter plot, bar/pie chart (for visual appeal) • Relative frequency tables and graphs • Grouped data with • Histograms • Ogive (cumulative frequency), e.g., the Lorenz curve for national wealth distribution • Stem-and-leaf plot • Always plot your data appropriately: try several ways!

  12. Scatter plot [Figure: vertical axis, variable Y or observation; horizontal axis, variable X or observation number]

  13. Line graph (chart)

  14. Bar chart

  15. Relative frequency [Table: relative frequency = frequency / n, e.g., 42/200 = 0.21, with total frequency n = 200]
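A minimal sketch of a relative frequency table, assuming the usual definition (class frequency divided by the sample size n). The `responses` data and the function name `relative_frequencies` are hypothetical; only the 42-out-of-200 case comes from the slide.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {value: (frequency, relative frequency)} for the observations."""
    n = len(values)
    counts = Counter(values)
    return {value: (count, count / n) for value, count in counts.items()}

# Hypothetical sample of n = 200 responses, 42 of which fall in category "A".
responses = ["A"] * 42 + ["B"] * 98 + ["C"] * 60
table = relative_frequencies(responses)

for value, (count, rel) in sorted(table.items()):
    print(f"{value}: frequency = {count}, relative frequency = {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any such table.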

  16. Pie chart

  17. Histogram • Class intervals: a trade-off between too few and too many classes • Class boundaries: left-end inclusion convention • E.g., the interval 20-30 contains all values that are both greater than or equal to 20 and less than 30 • Cf. the right-end inclusion convention (MS Excel) • Pareto histogram: a bar chart with categories arranged from the highest to the lowest frequency
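The left-end inclusion convention can be sketched as follows. This is an illustration only: the function name `bin_counts`, its parameters, and the data are hypothetical, and equal-width classes are assumed.

```python
def bin_counts(values, start, width, n_classes):
    """Count values in the classes [start + i*width, start + (i+1)*width).

    Each class includes its left end and excludes its right end.
    Values outside all classes are ignored.
    """
    counts = [0] * n_classes
    for x in values:
        i = int((x - start) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

# Hypothetical data; note that 30 falls in the class 30-40, not 20-30.
data = [20, 25, 29.9, 30, 31, 40, 49]
counts = bin_counts(data, start=20, width=10, n_classes=3)
print(counts)
```

Under the right-end convention the value 30 would instead be counted in the class 20-30, so the same data can give slightly different histograms in different tools.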

  18. The life hours of lamps

  19. Interpretation of histogram • Area under the histogram represents sample proportion • With too many intervals, the histogram is too jagged (compare the polygon graph) • With too few, it is too smooth • Helps detect the shape of the data distribution • Symmetric or skewed • Unimodal or bimodal • Used only for grouping numerical data

  20. Ogive (cumulative relative frequency graph)

  21. Stem-and-leaf plot • The case of city minimum temperatures • The stem is the tens digit; the leaf is the ones digit • The length of a row of leaves gives the frequency of that stem (interval) • Sort the data from the smallest to the largest before assigning stems and leaves
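A small sketch of a stem-and-leaf assignment, assuming non-negative integer data with the tens digit as stem and the ones digit as leaf. The temperatures below are made up for illustration and are not the slide's data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    plot = defaultdict(list)
    for x in sorted(values):          # sort first, so leaves come out in order
        stem, leaf = divmod(x, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical minimum temperatures (non-negative integers).
temps = [24, 31, 17, 28, 35, 21, 29, 30, 16, 22]
plot = stem_and_leaf(temps)

for stem, leaves in sorted(plot.items()):
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The number of leaves on each row is the frequency of that stem, so the printed plot doubles as a sideways histogram.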

  22. Run chart • For time series data, it is often useful to plot the data in time sequence.

  23. Summarizing data sets • Measures of location & central tendency • Sample mean, sample median, sample mode • Measures of dispersion • Sample variance, sample standard deviation • Sample percentile (quartiles, quantiles) • Box (and whiskers) plots, QQ plots

  24. Mean • Simple average • Weighted average

  25. Median • The middle value when the data are arranged in increasing (or decreasing) order • For an even number of values, the average of the two middle values

  26. Mode • The value that occurs most frequently • If no single value occurs most frequently, all the values that occur at the highest frequency are called modal values.
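The three measures of location can be illustrated with Python's standard `statistics` module. The data, the weights, and the helper `weighted_mean` are hypothetical examples, not taken from the slides.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted average: sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

data = [2, 3, 3, 5, 7, 7, 8]           # hypothetical sample

mean = statistics.mean(data)           # simple average
median = statistics.median(data)       # middle value of the sorted data
modes = statistics.multimode(data)     # every value tied for the highest frequency

# Hypothetical grades 70 and 90 with weights 1 and 3.
grade = weighted_mean([70, 90], [1, 3])

print(mean, median, modes, grade)
```

Here two values tie for the highest frequency, so `multimode` returns both, matching the slide's remark about modal values.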

  27. Skewness • Right (positive) skew: can be adjusted by the log transformation • Left (negative) skew: can be adjusted by the exponential or squared transformation • Exercise: try it and justify it yourselves

  28. A case of bimodal histogram

  29. Mean or median? • Which is the appropriate summary of the center of the data? • Mean: if the data have a symmetric distribution with light tails (i.e., a relatively small proportion of the observations lie away from the center of the data). • Median: if the distribution has heavy tails or is asymmetric. • Extreme values that are far removed from the main body of the data are called outliers. • Outliers have a large influence on the mean but not on the median.
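A quick illustration of how a single outlier moves the mean far more than the median. The numbers are invented for the example.

```python
import statistics

data = [10, 12, 13, 14, 16]        # hypothetical, roughly symmetric sample
with_outlier = data + [100]        # the same sample plus one extreme value

mean_before = statistics.mean(data)
median_before = statistics.median(data)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The mean roughly doubles while the median shifts by half a unit, which is why the median is preferred for heavy-tailed or asymmetric data.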

  30. Sample variance (Check it!)

  31. Linear computation of sample variance if

  32. Sample standard deviation
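The formulas on these three slides did not survive in the transcript. The sketch below assumes the standard definitions: sample variance with divisor n - 1, the computational identity sum((x - mean)^2) = sum(x^2) - n * mean^2, and the standard deviation as the square root of the variance. Reading the "linear computation" slide as the rule that y = a + b*x has variance b^2 times the variance of x is also an assumption. All function names and data are hypothetical.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Same quantity via the identity sum(x^2) - n * mean^2."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

xs = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data, mean 5
a, b = 3, 2                            # hypothetical linear transformation
ys = [a + b * x for x in xs]

print(sample_variance(xs), sample_variance_shortcut(xs), sample_std(xs))
print(sample_variance(ys), b ** 2 * sample_variance(xs))
```

The shift a drops out of the variance entirely, and the scale b enters squared, so the standard deviation of y is |b| times that of x.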

  33. Percentiles, quartiles • The sample 100p percentile (p quantile) is the data value such that at least 100p percent of the data are less than or equal to it and at least 100(1-p) percent are greater than or equal to it. • The sample 25th percentile is called the first quartile, Q1; the sample 50th percentile is called the sample median or the second quartile, Q2; the sample 75th percentile is called the third quartile, Q3.

  34. Finding the sample percentiles • To determine the sample 100p percentile, Xp, of a data set of size n, we need the data value such that (1) at least np of the values are less than or equal to it, and (2) at least n(1-p) of the values are greater than or equal to it. • If np is NOT an integer, round it up to the next integer and take the corresponding ordered value as Xp. • If np is an integer K, average the Kth and (K+1)st ordered values; this average is Xp.
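The two-case rule on this slide can be sketched directly. The function name `sample_percentile` and the test data are hypothetical, and the tolerance check for "np is an integer" is an implementation choice to guard against floating-point round-off. Other software often uses different interpolation rules, so results may differ from library defaults.

```python
import math

def sample_percentile(values, p):
    """Sample 100p percentile by the round-up / average rule, for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(values)
    n = len(xs)
    target = n * p
    k = round(target)
    # Case: np is (numerically) an integer K -> average the Kth and (K+1)st values.
    if 1 <= k < n and math.isclose(target, k):
        return (xs[k - 1] + xs[k]) / 2
    # Case: np is not an integer -> round up and take that ordered value.
    return xs[math.ceil(target) - 1]

eight = [1, 2, 3, 4, 5, 6, 7, 8]       # n = 8, so np is an integer for quartiles
seven = [1, 2, 3, 4, 5, 6, 7]          # n = 7, so np is not an integer

print([sample_percentile(eight, p) for p in (0.25, 0.5, 0.75)])
print([sample_percentile(seven, p) for p in (0.25, 0.5, 0.75)])
```

With p = 0.5 the rule reduces to the usual median: the middle value for odd n and the average of the two middle values for even n.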

  35. Five number summary • The minimum, • The maximum, • and three quartiles, Q1, Q2, Q3

  36. Box (and whiskers) plots • A “box” starts at Q1 and continues to Q3, so the length of the box is the interquartile range (the middle 50% of the distribution) • The value of Q2 is indicated by a vertical line inside the box • A straight line segment (the whiskers) stretching from the smallest to the largest data value (i.e., the range) is drawn on a horizontal axis • Case 1 [Figure: box plot marking Min., Q1, Q2, Q3, Max.]

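  37. Lower fence and upper fence • Case 2 [Figure: box plot marking Min., Q1, Median, Q3, Max., with possible outliers shown as *] • Upper fence = Q3 + 1.5 (Q3 - Q1); the upper whisker extends to the adjacent value, the highest value within the upper fence • Lower fence = Q1 - 1.5 (Q3 - Q1); the lower whisker extends to the adjacent value, the lowest value within the lower fence • Values beyond the fences are possible outliers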
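A minimal sketch of the fence rule, using fences at Q1 - 1.5 (Q3 - Q1) and Q3 + 1.5 (Q3 - Q1). The quartiles are passed in rather than computed, since quartile conventions vary; the function name, the returned keys, and the data are hypothetical.

```python
def box_plot_fences(values, q1, q3):
    """Return fences, whisker ends (adjacent values), and possible outliers."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),    # lowest value within the lower fence
        "whisker_high": max(inside),   # highest value within the upper fence
        "outliers": outliers,
    }

# Hypothetical data; Q1 = 2.5 and Q3 = 6.5 are assumed quartiles for this sample.
data = [1, 2, 3, 4, 5, 6, 7, 30]
summary = box_plot_fences(data, q1=2.5, q3=6.5)
print(summary)
```

The whiskers stop at actual data values, not at the fences themselves, so the upper whisker here ends at 7 and the value 30 is drawn separately as a possible outlier.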

  38. Normal sample distribution • For normal data and large samples • 50% of the data values fall between mean ± 0.67s • 68% of the data values fall between mean ± 1s • 95% of the data values fall between mean ± 2s • 99.7% of the data values fall between mean ± 3s
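These percentages can be checked against the standard normal distribution using `statistics.NormalDist` from the Python standard library. The helper name `central_mass` is hypothetical, and the check applies to the theoretical distribution, whereas a real sample only approximates it.

```python
from statistics import NormalDist

def central_mass(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    z = NormalDist()                   # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

for k in (0.67, 1, 2, 3):
    print(f"within mean ± {k}s: {100 * central_mass(k):.1f}%")
```

The slide's figures are rounded: the exact multiplier for 95% is closer to 1.96 than to 2.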

  39. QQ (normal) plots • Sequentially compare the sample data to the quantiles of the theoretical (normal) distribution • The ith ordered data value is treated as the pth quantile, p = (i - 0.5)/n [Figure: vertical axis, raw data; horizontal axis, quantiles of the standard normal]
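The plotting positions p = (i - 0.5)/n can be turned into QQ-plot coordinates as sketched below. The function name `normal_qq_points` and the data are hypothetical; plotting itself is left out so the example needs only the standard library.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered data value with the standard normal quantile at (i - 0.5)/n."""
    xs = sorted(values)
    n = len(xs)
    z = NormalDist()
    return [(z.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

sample = [4.1, 5.3, 4.8, 6.0, 5.1]     # hypothetical raw data
points = normal_qq_points(sample)

for theoretical, observed in points:
    print(f"{theoretical:+.3f}  {observed}")
```

If the points fall close to a straight line, the sample is consistent with a normal distribution; systematic curvature suggests skewness or heavy tails.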

  40. Paired data sets (X, Y) and the sample correlation coefficient, r

  41. Illustrations of correlation

  42. r vs. linear relation • If the paired data sets x and y possess an exact linear relation, y = a + bx, with b > 0, then r = 1. • If the paired data sets x and y possess an exact linear relation, y = a + bx, with b < 0, then r = -1. • r indicates how close the relation between x and y is to a perfect linear one

  43. Properties of r • |r| ≤ 1 (why? See 2.6.1) • If r is positive, x and y tend to change in the same direction. • If r is negative, x and y tend to change in opposite directions. • Correlation measures association, not causation • Causation requires further conditions: time sequence and the exclusion of alternative explanations • E.g., wealth and health problems both go up with age. Does wealth cause health problems?
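The formula for r did not survive in the transcript. The sketch below assumes the standard (Pearson) sample correlation coefficient, the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and data are hypothetical.

```python
import math

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
increasing = [3 + 2 * x for x in xs]   # y = a + bx with b > 0
decreasing = [3 - 2 * x for x in xs]   # y = a + bx with b < 0
noisy = [2, 1, 4, 3, 5]                # hypothetical, roughly increasing

print(sample_correlation(xs, increasing))
print(sample_correlation(xs, decreasing))
print(sample_correlation(xs, noisy))
```

An exact linear relation gives r = 1 or r = -1 depending on the sign of the slope, while the noisy data give a value strictly between them.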

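  44. Chebyshev’s inequality • Let x̄ and s be the sample mean and sample standard deviation of a data set x_1, ..., x_n, with s > 0 • Set S_k = {i : |x_i - x̄| < ks}, and let N(S_k) be the number of elements of S_k • Then for any k ≥ 1, N(S_k)/n ≥ 1 - (n-1)/(nk²) > 1 - 1/k² (the lower bound)

  45. Proof • (n-1)s² = Σ_i (x_i - x̄)² ≥ Σ_{i ∉ S_k} (x_i - x̄)² ≥ (n - N(S_k)) k²s² • Dividing both sides by nk²s² gives (n-1)/(nk²) ≥ 1 - N(S_k)/n • The next step? • And the upper bound of N(k)/n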
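As usually stated for a sample, Chebyshev's inequality says that the fraction of data values lying within k sample standard deviations of the sample mean is at least 1 - (n-1)/(n k^2), for any k ≥ 1. The sketch below checks this numerically on made-up data; the function name and the data are hypothetical, and the exact form of the bound is an assumption since the slide's formulas were lost.

```python
import math

def chebyshev_check(xs, k):
    """Return (observed fraction within k*s of the mean, Chebyshev lower bound)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    within = sum(1 for x in xs if abs(x - mean) < k * s)
    return within / n, 1 - (n - 1) / (n * k * k)

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]    # hypothetical, deliberately skewed

for k in (1.5, 2, 3):
    observed, bound = chebyshev_check(data, k)
    print(f"k = {k}: observed {observed:.3f} >= bound {bound:.3f}")
```

The bound holds for any data set, whatever its shape, which is why it is much weaker than the percentages quoted for normal data.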

  46. Categorizing the bi-variate data

  47. Simpson’s paradox • Lurking variables excluded from consideration can change or reverse a relation between two categorical variables

  48. Gender bias of graduate admissions (admitted / applicants) • Engineering school: male 30/60, female 10/20 • Art school: male 5/20, female 10/40 • Both schools combined: male 35/80, female 20/60
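The admissions example can be worked through numerically. The assignment of the slide's fractions to schools and genders below is an assumed reading of the flattened table (chosen because the per-school counts then add up to the combined figures 35/80 and 20/60); the variable and function names are hypothetical.

```python
# (admitted, applicants) per school and gender; the labels are an assumed
# reading of the slide's flattened table.
admissions = {
    "Engineering": {"male": (30, 60), "female": (10, 20)},
    "Art": {"male": (5, 20), "female": (10, 40)},
}

def rate(admitted, applicants):
    """Admission rate as a fraction."""
    return admitted / applicants

def combined(gender):
    """Total (admitted, applicants) for one gender across all schools."""
    admitted = sum(school[gender][0] for school in admissions.values())
    applicants = sum(school[gender][1] for school in admissions.values())
    return admitted, applicants

for school, by_gender in admissions.items():
    for gender, (admitted, applicants) in by_gender.items():
        print(f"{school:12s} {gender:6s} {rate(admitted, applicants):.2%}")

for gender in ("male", "female"):
    print(f"{'Combined':12s} {gender:6s} {rate(*combined(gender)):.2%}")
```

Within each school the two admission rates are equal, yet the combined rates differ, because most men applied to the school with the higher admission rate. The school is the lurking variable.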

  49. Homework #1 • Chapter 1: Problems 2, 6 • Chapter 2: Problem 15 (You should use Excel or the software included with the book for the computations.)

  50. Graphical Excellence • “Complex ideas communicated with clarity, precision, and efficiency” • Shows the data • Makes you think about substance rather than method, graphic design, or something else • Many numbers in a small space • Makes large data sets coherent • Encourages the eye to compare different pieces of the data
