Descriptive statistics Petter Mostad 2005.09.08
Goal: Reduce data amount, keep ”information” Two uses: • Data exploration: What you do for yourself when you first get the data. • Data presentation: Illustrating for others some conclusion with numbers or graphs based on the data.
Data exploration • Understand description of variables • Find ranges, typical values, distributions of variables • Is the data OK? Meaningful? Outliers? Errors? • How do variables relate to each other? • Is it meaningful? As expected? • Can you form new hypotheses?
Data presentation • Remove superfluous information • Present essential information fairly • Present information efficiently • Make it possible to understand information quickly and simply
Types of variables • Numerical variables • Discrete • Continuous • Categorical variables • Nominal values • Ordinal values
Histograms • Subdivide continuous data into intervals, and display counts in intervals • The decision about the width of the intervals can influence the result a lot • ”Ogives”
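To make the point about interval width concrete, here is a minimal sketch that counts the same observations with two different widths. The function name `histogram_counts`, its parameters, and the ages are made up for illustration and are not from the slides.

```python
def histogram_counts(values, width, start=None):
    """Count values in consecutive intervals [start + k*width, start + (k+1)*width)."""
    if start is None:
        start = min(values)
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Hypothetical ages, used only to illustrate the effect of the interval width.
ages = [21, 22, 24, 25, 29, 30, 31, 33, 38, 39, 41, 47, 52, 58, 63]

narrow = histogram_counts(ages, width=5, start=20)   # nine intervals
wide = histogram_counts(ages, width=20, start=20)    # three intervals

print(narrow)  # [3, 2, 3, 2, 1, 1, 1, 1, 1]
print(wide)    # [10, 4, 1]
```

With narrow intervals the counts look fairly flat for the younger ages; with wide intervals the same data look strongly concentrated in the first interval.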
Bar charts • Can show variation between categories • Grouped bars can compare variations in different groups • Stacked bars can show proportions, or cumulative effects
Example • Shows changing proportions of 8 types across 24 groups • Groups: coexpressed genes • Types: Types of organisms
Cumulative distributions • Accumulate the proportions up to each level • Can never decrease; go from 0 to 1 (or 100%)
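A cumulative distribution for a small sample can be sketched as follows. The helper `cumulative_proportions` and the data are hypothetical; the sketch reports, for each distinct level, the proportion of observations at or below it.

```python
def cumulative_proportions(values):
    """Return (level, proportion of values <= level) for each distinct level."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    i = 0
    for level in sorted(set(ordered)):
        # Advance past every observation that is at or below this level.
        while i < n and ordered[i] <= level:
            i += 1
        result.append((level, i / n))
    return result

data = [1, 2, 2, 3, 5]  # made-up observations
cdf = cumulative_proportions(data)
for level, proportion in cdf:
    print(level, proportion)
```

The proportions never decrease from one level to the next, and the last one is always 1.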
Stem-and-leaf diagrams • A way to show both the distribution of numbers graphically, and the digits involved

Age in years Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2,00        1 .  &
    18,00        2 .  01223444
    28,00        2 .  5667888889999
    39,00        3 .  0000111222233344444
    48,00        3 .  55555666777778888899999
    38,00        4 .  00001111223334444
    39,00        4 .  555677777888889999
    37,00        5 .  0000011223333444
    22,00        5 .  55667789999
    13,00        6 .  011133
     5,00        6 .  6&
     7,00        7 .  03&
     1,00        7 .  &

 Stem width:  10
 Each leaf:   2 case(s)
 & denotes fractional leaves.
Pie charts • Illustrate percentages or parts of a whole well, for comparison between the parts. • 3D pies, or ”exploded” pies, distort more than they clarify the information
Pareto diagrams • Focus on the most important (frequent) categories. • Show cumulative frequencies when including each category
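The numbers behind a Pareto diagram can be sketched like this: categories sorted by decreasing frequency, with the cumulative proportion reached when each category is included. The function `pareto_table` and the defect categories are invented for illustration.

```python
from collections import Counter

def pareto_table(labels):
    """Rows of (category, count, cumulative proportion), most frequent first."""
    counts = Counter(labels).most_common()  # sorted by decreasing count
    total = sum(count for _, count in counts)
    rows = []
    running = 0
    for label, count in counts:
        running += count
        rows.append((label, count, running / total))
    return rows

# Hypothetical defect categories.
defects = ["scratch"] * 5 + ["dent"] * 3 + ["stain"] * 2

rows = pareto_table(defects)
for row in rows:
    print(row)
```

In this made-up example, the two most frequent categories already account for 80% of all cases.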
Numerical summary statistics • (Arithmetic) mean • Median • Mode • Skewness • Outliers • Max, min, range
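As a small illustration of these summaries, the sketch below uses Python's standard `statistics` module on made-up data with one large value. It only shows how the mean reacts to an outlier while the median and mode do not; it is not a full treatment of skewness.

```python
import statistics

# Made-up observations; the value 40 plays the role of an outlier.
data = [2, 3, 3, 5, 7, 10, 40]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "min": min(data),
    "max": max(data),
    "range": max(data) - min(data),
}

for name, value in summary.items():
    print(name, value)
```

Here the mean (10) lies well above the median (5), which is typical when a few large values stretch the distribution to the right.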
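Arithmetic versus geometric mean Given observations $x_1, x_2, \ldots, x_n$ • Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ They correspond to each other when the scale is changed by taking logarithms: for positive observations, the logarithm of the geometric mean is the arithmetic mean of $\log x_1, \ldots, \log x_n$.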
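The correspondence under logarithms can be checked numerically. In this sketch the function names and the data are chosen for illustration: the geometric mean is computed directly, and again as the exponential of the arithmetic mean of the logarithms.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """n-th root of the product; assumes strictly positive values."""
    return math.prod(xs) ** (1 / len(xs))

data = [1, 10, 100]  # made-up positive observations

# Arithmetic mean on the log scale, transformed back to the original scale.
log_scale_mean = arithmetic_mean([math.log(x) for x in data])
back_transformed = math.exp(log_scale_mean)

print(arithmetic_mean(data))  # 37.0
print(geometric_mean(data))   # approximately 10
print(back_transformed)       # approximately 10
```

The two routes to the geometric mean agree up to floating-point rounding.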
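Measures of variability • (Sample) variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the arithmetic mean • (Sample) standard deviation: $s = \sqrt{s^2}$ • Coefficient of variation: $s / \bar{x}$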
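A minimal sketch of the three measures, using the standard `statistics` module and made-up data. `statistics.variance` and `statistics.stdev` are the sample versions, dividing by n - 1; the coefficient of variation is taken here as the standard deviation divided by the mean.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

mean = statistics.mean(data)
variance = statistics.variance(data)  # sample variance: divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
coeff_var = std_dev / mean            # unit-free; meaningful for positive data

print(mean, variance, std_dev, coeff_var)
```

Because the coefficient of variation has no unit, it can be used to compare the spread of variables measured on different scales.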
Percentiles and quartiles • The xth percentile is the number p such that x percent of the data are smaller than p. • The first and third quartiles are the 25th and 75th percentiles, respectively • The inter-quartile range is the difference between the third and first quartiles.
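Quartiles and the inter-quartile range can be sketched with `statistics.quantiles`. Several conventions exist for interpolating percentiles in small samples, so other software may give slightly different values; the `"inclusive"` method and the data are choices made for this example only.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up observations

# n=4 asks for the three cut points that divide the data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, median, q3)  # 3.0 5.0 7.0
print(iqr)             # 4.0
```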
Boxplots • ”Box and whisker plots” • Sometimes show min, 1st quartile, median, 3rd quartile, max • May instead show some outliers separately
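The numbers a boxplot displays can be sketched as below. The slides do not say how outliers are chosen; this example assumes the common convention of flagging points more than 1.5 inter-quartile ranges outside the quartiles. The function `boxplot_summary` and the data are hypothetical.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention (an assumption)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up observations
box = boxplot_summary(data)
print(box)
```

Setting `whisker` to a very large number makes the whiskers reach the minimum and maximum, which gives the plain five-number version of the plot.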
Scatterplots • Probably the most useful graphical plot • Can show any kind of connection between variables, not only linear • Can be done for many pairs at a time (matrix plot), or for triplets (3D plot)
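Covariance Given paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ • (Sample) covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the arithmetic means • Positive when variables tend to change in the same direction, negative if opposite direction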
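Correlation coefficient • Correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations • Always between -1 and 1 • If exactly equal to 1, then points are on an increasing line • Can be a more illustrative measure than covariance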
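One reason the correlation coefficient can be more illustrative than the covariance is that it does not depend on the measurement scale. The sketch below uses invented function names and data: it computes both quantities by hand, then repeats the calculation after multiplying x by 100.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]  # same data in a different unit

cov_xy = sample_covariance(x, y)
cov_rescaled = sample_covariance(x_rescaled, y)
r_xy = correlation(x, y)
r_rescaled = correlation(x_rescaled, y)

print(cov_xy, cov_rescaled)  # the covariance changes with the unit
print(r_xy, r_rescaled)      # the correlation does not
```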
Least squares line fitting We can illustrate a trend in the data by fitting a line
Fitting the line • The line is often fitted by minimizing the sum of the squares of the ”errors” (the vertical distances to the line) • We will hear much about regression methods later
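For a single explanatory variable, the line that minimizes the sum of squared vertical distances has a closed form. The sketch below uses illustrative names and made-up data: it computes that line and checks that nudging the slope makes the sum of squares larger.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
best = sum_squared_errors(x, y, slope, intercept)
nudged = sum_squared_errors(x, y, slope + 0.1, intercept)

print(slope, intercept)  # approximately 0.6 and 2.2
print(best, nudged)      # the fitted line gives the smaller sum of squares
```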
Cross tables • When items can be classified using two different categorical variables, we can illustrate counts in a cross table. • If percentages are computed, they must be either relative to the columns or the rows. • In multiway tables, more than two classifying variables are used.
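The difference between row and column percentages is easy to see in a small example. The sketch below counts pairs with `collections.Counter`; the categories, the counts, and the helper `percentages` are invented for illustration.

```python
from collections import Counter

# Hypothetical items classified by two categorical variables: (group, outcome).
observations = (
    [("treated", "improved")] * 6
    + [("treated", "not improved")] * 4
    + [("control", "improved")] * 2
    + [("control", "not improved")] * 8
)

table = Counter(observations)  # the cross table of counts

def percentages(table, by):
    """Percentages relative to row totals (by=0) or column totals (by=1)."""
    totals = Counter()
    for key, count in table.items():
        totals[key[by]] += count
    return {key: 100 * count / totals[key[by]] for key, count in table.items()}

row_pct = percentages(table, by=0)
col_pct = percentages(table, by=1)

print(row_pct[("treated", "improved")])  # 60.0: share of the treated who improved
print(col_pct[("treated", "improved")])  # 75.0: share of the improved who were treated
```

The same cell gives two different percentages, so a table of percentages should always state which total it is relative to.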
DNA sequence logos • Used to show what is conserved, and what varies, at DNA binding sites for some protein • Relative heights of letters show which bases are conserved • Total height shows degree of conservation
Chernoff faces • A way to visualize about 20 parameters in one figure • Background: We are good at remembering and comparing faces • Features in the face correspond to parameters you want to visualize
Use your own creativity! • When exploring data, try to make the kinds of plots that will answer your questions! • When presenting data, think about • simplicity • fairness • efficiency • inventiveness