
Descriptive statistics

Descriptive statistics. Petter Mostad 2005.09.08. Goal: Reduce data amount, keep ”information”. Two uses: Data exploration: What you do for yourself when you first get the data. Data presentation: Illustrating for others some conclusion with numbers or graphs based on the data.



  1. Descriptive statistics Petter Mostad 2005.09.08

  2. Goal: Reduce data amount, keep ”information” Two uses: • Data exploration: What you do for yourself when you first get the data. • Data presentation: Illustrating for others some conclusion with numbers or graphs based on the data.

  3. Data exploration • Understand description of variables • Find ranges, typical values, distributions of variables • Is the data OK? Meaningful? Outliers? Errors? • How do variables relate to each other? • Is it meaningful? As expected? • Can you form new hypotheses?

  4. Data presentation • Remove superfluous information • Present essential information fairly • Present information efficiently • Make it possible to understand information quickly and simply

  5. Types of variables • Numerical variables • Discrete • Continuous • Categorical variables • Nominal values • Ordinal values

  6. Histograms • Subdivide continuous data into intervals, and display the counts in each interval • The decision about the width of the intervals can influence the result a lot • ”Ogives” (cumulative frequency curves)
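The effect of the interval width can be seen in a small sketch. The function name `histogram_counts` and the age values are made up for illustration; empty intervals are simply not listed.

```python
import math

def histogram_counts(values, bin_width):
    """Count values in half-open intervals [lo, lo + bin_width).

    Intervals are aligned to multiples of bin_width (an arbitrary choice).
    """
    start = math.floor(min(values) / bin_width) * bin_width
    counts = {}
    for v in values:
        index = math.floor((v - start) / bin_width)
        lo = start + index * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

# Made-up ages, for illustration only.
ages = [21, 22, 24, 25, 29, 30, 31, 34, 35, 39, 40, 41, 58]

print(histogram_counts(ages, 10))  # four intervals: the isolated value 58 is visible
print(histogram_counts(ages, 20))  # two intervals: most of the shape is hidden
```

With width 10 the isolated value at 58 gets its own interval; with width 20 it is merged with the ages in the forties.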

  7. Bar charts • Can show variation between categories • Grouped bars can compare variations in different groups • Stacked bars can show proportions, or cumulative effects

  8. Example • Shows changing proportions of 8 types across 24 groups • Groups: coexpressed genes • Types: Types of organisms

  9. Cumulative distributions • Cumulates the proportions up to each level • Can never decrease; goes from 0 to 1 (or 100%)
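A minimal sketch of how the cumulative proportions can be computed, assuming the levels can be sorted. The function name and the data are hypothetical.

```python
from collections import Counter

def cumulative_proportions(values):
    """Return (level, proportion of values <= level) for each observed level."""
    counts = Counter(values)
    n = len(values)
    running = 0
    result = []
    for level in sorted(counts):
        running += counts[level]
        result.append((level, running / n))
    return result

# Made-up ordinal scores, for illustration only.
scores = [1, 2, 2, 3, 3, 3, 4, 4]
for level, proportion in cumulative_proportions(scores):
    print(level, proportion)
```

The proportions never decrease, and the last one is always 1.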

  10. Stem-and-leaf diagrams • A way to show both the distribution of numbers graphically, and the digits involved

      Age in years Stem-and-Leaf Plot

      Frequency    Stem &  Leaf
           2,00       1 .  &
          18,00       2 .  01223444
          28,00       2 .  5667888889999
          39,00       3 .  0000111222233344444
          48,00       3 .  55555666777778888899999
          38,00       4 .  00001111223334444
          39,00       4 .  555677777888889999
          37,00       5 .  0000011223333444
          22,00       5 .  55667789999
          13,00       6 .  011133
           5,00       6 .  6&
           7,00       7 .  03&
           1,00       7 .  &

      Stem width: 10
      Each leaf: 2 case(s)
      & denotes fractional leaves.
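A simplified sketch of the idea for non-negative integers. Unlike the output shown on the slide, it uses one row per stem and one leaf per case. The function name and the data are made up.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    """Group non-negative integers by stem and list their last digits as leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_width].append(v % stem_width)
    return {
        stem: "".join(str(leaf) for leaf in leaves)
        for stem, leaves in sorted(stems.items())
    }

# Made-up ages, for illustration only.
ages = [20, 21, 22, 22, 35, 36, 41]
for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} . {leaves}")
```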

  11. Pie charts • Illustrate percentages or parts of a whole well, for comparison between the parts. • 3D pies, or ”exploded” pies, distort the information more than they clarify it

  12. Pareto diagrams • Focus on the most important (frequent) categories. • Show the cumulative frequencies as each category is included
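The numbers behind a Pareto diagram can be sketched as follows: categories sorted by decreasing count, with a running cumulative proportion. The function name and the defect categories are hypothetical.

```python
from collections import Counter

def pareto_table(categories):
    """Return (category, count, cumulative proportion), most frequent first."""
    counts = Counter(categories).most_common()
    total = sum(count for _, count in counts)
    running = 0
    rows = []
    for category, count in counts:
        running += count
        rows.append((category, count, running / total))
    return rows

# Made-up defect records, for illustration only.
defects = ["scratch"] * 5 + ["dent"] * 3 + ["stain"] * 2
for row in pareto_table(defects):
    print(row)
```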

  13. Numerical summary statistics • (Arithmetic) mean • Median • Mode • Skewness • Outliers • Max, min, range
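Most of these summaries are available in Python's standard `statistics` module. The sketch below uses made-up data with one large value, to show how an outlier pulls the mean above the median; skewness is left out for brevity.

```python
import statistics

def summarize(values):
    """Return some of the basic numerical summaries listed on the slide."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

# Made-up data with one outlier (25), for illustration only.
values = [2, 3, 3, 4, 5, 7, 25]
summary = summarize(values)
print(summary)
```

Here the mean is 7 while the median is 4: the single large value moves the mean but not the median.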

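  14. Arithmetic versus geometric mean Given observations x1, x2, …, xn • Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ (for positive observations) They correspond to each other when the scale is changed by taking logarithms: the logarithm of the geometric mean is the arithmetic mean of log x1, log x2, …, log xn.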
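The correspondence through logarithms can be checked numerically. This is a small sketch with made-up positive data: the geometric mean is computed by averaging on the log scale and transforming back.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """Geometric mean of positive numbers, computed via the log scale."""
    return math.exp(arithmetic_mean([math.log(x) for x in xs]))

# Made-up positive observations, for illustration only.
xs = [1, 10, 100]
print(arithmetic_mean(xs))  # 37.0
print(geometric_mean(xs))   # close to 10, the n-th root of 1 * 10 * 100
```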

  15. Measures of variability • (Sample) variance • (Sample) standard deviation • Coefficient of variation
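A minimal sketch of the three measures, assuming the usual sample definitions (division by n - 1 for the variance, and the coefficient of variation as standard deviation divided by mean). The data are made up.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    return math.sqrt(sample_variance(xs))

def coefficient_of_variation(xs):
    """Standard deviation relative to the mean (meaningful for positive data)."""
    return sample_std(xs) / (sum(xs) / len(xs))

# Made-up observations, for illustration only.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(xs), sample_std(xs), coefficient_of_variation(xs))
```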

  16. Percentiles and quartiles • The x-th percentile is the number p such that x percent of the data are smaller than p. • The first and third quartiles are the 25th and 75th percentiles, respectively. • The inter-quartile range is the difference between the third and first quartiles.
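Quartiles can be sketched with `statistics.quantiles` from the Python standard library. Note that several interpolation conventions exist, so other software may give slightly different values for the same data; the default ("exclusive") method is used here as an assumption, and the data are made up.

```python
import statistics

def quartiles_and_iqr(values):
    """Return (first quartile, third quartile, inter-quartile range)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3, q3 - q1

# Made-up observations, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7]
q1, q3, iqr = quartiles_and_iqr(values)
print(q1, q3, iqr)
```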

  17. Boxplots • ”Box and whisker plots” • Sometimes shows min, 1st quartile, median, 3rd quartile, max • May instead show some outliers separately
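The numbers behind a boxplot can be sketched as a five-number summary plus a list of points flagged as outliers. The slide does not say how outliers are chosen; the rule below (more than 1.5 inter-quartile ranges outside the box) is a common convention and is an assumption here, as are the function names and data.

```python
import statistics

def five_number_summary(values):
    """Return (min, first quartile, median, third quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

def flag_outliers(values, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3 (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Made-up observations with one extreme value, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 30]
print(five_number_summary(values))
print(flag_outliers(values))
```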

  18. Scatterplots • Probably the most useful graphical plot • Can show any kind of connection between variables, not only linear • Can be done for many pairs at a time (matrix plot), or for triplets (3D plot)

  19. Covariance Given paired observations (x1,y1), (x2,y2), …, (xn, yn) • (sample) covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ • Positive when the variables tend to change in the same direction, negative when they tend to change in opposite directions
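A short sketch of the sample covariance, assuming the usual definition with division by n - 1. The paired data are made up to show the sign behaviour.

```python
def sample_covariance(xs, ys):
    """Average product of deviations from the means, with divisor n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Made-up paired observations, for illustration only.
xs = [1, 2, 3, 4]
ys_same_direction = [2, 4, 6, 8]
ys_opposite_direction = [8, 6, 4, 2]

print(sample_covariance(xs, ys_same_direction))      # positive
print(sample_covariance(xs, ys_opposite_direction))  # negative
```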

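  20. Correlation coefficient • Correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of the two variables • Always between -1 and 1 • If exactly equal to 1, then the points are on an increasing line (and on a decreasing line if equal to -1) • Can be a more illustrative measure than covariance, since it does not depend on the units of measurement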
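A sketch of the correlation coefficient as the covariance divided by the product of the standard deviations (the common factor n - 1 cancels, so plain sums are used). The data are made up: a perfectly linear increasing case, a perfectly linear decreasing case, and an increasing but curved case.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up paired observations, for illustration only.
xs = [1, 2, 3, 4]
print(correlation(xs, [2, 4, 6, 8]))   # points on an increasing line
print(correlation(xs, [8, 6, 4, 2]))   # points on a decreasing line
print(correlation(xs, [1, 4, 9, 16]))  # increasing but curved: below 1
```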

  21. Least squares line fitting We can illustrate a trend in the data by fitting a line

  22. Fitting the line • The line is often fitted by minimizing the sum of the squares of the ”errors” (the vertical distances to the line) • We will hear much about regression methods later
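For a single explanatory variable, the line that minimizes the sum of squared vertical distances has a closed form. The following is a minimal sketch with made-up points that happen to lie exactly on a line; the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_of_squared_errors(xs, ys, slope, intercept):
    """The quantity that the fitted line minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up points on the line y = 1 + 2x, for illustration only.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, sum_of_squared_errors(xs, ys, slope, intercept))
```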

  23. Cross tables • When items can be classified using two different categorical variables, we can illustrate counts in a cross table. • If percentages are computed, they must be either relative to the columns or the rows. • In multiway tables, more than two classifying variables are used.
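The difference between row and column percentages can be sketched as follows. The classification labels and counts are made up, and the function names are hypothetical.

```python
from collections import Counter

def cross_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(pairs)

def row_percentages(table):
    """Each cell as a percentage of its row total."""
    totals = Counter()
    for (row, _col), count in table.items():
        totals[row] += count
    return {cell: 100 * count / totals[cell[0]] for cell, count in table.items()}

def column_percentages(table):
    """Each cell as a percentage of its column total."""
    totals = Counter()
    for (_row, col), count in table.items():
        totals[col] += count
    return {cell: 100 * count / totals[cell[1]] for cell, count in table.items()}

# Made-up classification of 80 items, for illustration only.
pairs = (
    [("treated", "recovered")] * 30
    + [("treated", "not recovered")] * 10
    + [("control", "recovered")] * 20
    + [("control", "not recovered")] * 20
)
table = cross_table(pairs)
print(row_percentages(table)[("treated", "recovered")])     # share of treated who recovered
print(column_percentages(table)[("treated", "recovered")])  # share of recovered who were treated
```

The same cell gives 75% relative to its row and 60% relative to its column, which is why a table of percentages must say which one it reports.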

  24. Early example: Napoleon's Russian campaign 1812-1813

  25. DNA sequence logos • Used to show what is conserved, and what varies, at DNA binding sites for some protein • The relative heights of the letters show which bases are conserved • The total height shows the degree of conservation
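The slide does not give the formula for the heights. A common convention, assumed here, measures conservation at each position as 2 bits minus the Shannon entropy of the base frequencies, and splits that total height among the letters in proportion to their frequencies. The sketch below omits the small-sample correction that is sometimes applied, and the aligned sites are made up.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Total letter height in bits: log2(alphabet size) minus the entropy."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(sites):
    """For each aligned position, the height of each observed base."""
    heights = []
    for column in zip(*sites):
        n = len(column)
        info = column_information(column)
        counts = Counter(column)
        heights.append({base: info * c / n for base, c in counts.items()})
    return heights

# Made-up aligned binding sites, for illustration only.
sites = ["TATA", "TATT", "TACA", "TATA"]
for position, column_heights in enumerate(logo_heights(sites), start=1):
    print(position, column_heights)
```

A fully conserved position reaches the maximum of 2 bits, while a position where all four bases are equally frequent has height 0.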

  26. Chernoff faces • A way to visualize about 20 parameters in one figure • Background: We are good at remembering and comparing faces • Features in the face correspond to parameters you want to visualize

  27. Chernoff faces

  28. Use your own creativity! • When exploring data, try to make the kinds of plots that will answer your questions! • When presenting data, think about • simplicity • fairness • efficiency • inventiveness
