Statistics Achim Tresch Gene Center LMU Munich
Two ways of dealing with uncertainty Andrej Kolmogoroff Pope Benedikt XVI
Topics: I. Descriptive Statistics II. Test theory III. Common tests IV. Bivariate Analysis V. Regression
I. Description • Tables • Figures and graphical presentation • Interpretation
„If you don't know, you have to believe“ (Pan Tau)
„I strongly believe Iraq owns weapons of mass destruction“ (George W. Bush)
What is „data“? A collection of observations of a similar structure.
• Endpoints (variables)
• Cases (samples, observations)
• Realizations (instances, values)
The sample / the sample population ⊆ population
Different Scales of a Variable
Categorial variables have only a finite number of instances: male/female; Mon/Tue/…/Sun
• Nominal data: categorial variables without a given order. E.g. eye color [brown, blue, green, grey]. Special case: binary (= dichotomous) variables (yes/no, 0/1, …)
• Ordinal data: instances are ordered in a natural way. E.g. tumor grade [I, II, III, IV], rank in a contest (1, 2, 3, …)
Continuous variables can take values in an interval of the real numbers. E.g. blood pressure [mmHg], costs [€]
I. Description Problem: It is often difficult to map a variable to an appropriate scale: E.g. happiness, pain, satisfaction, social status, anger -> Check whether your choice of scale is meaningful! 85% shinier hair!
I. Description Description of a categorial variable: Tables
Example: Blood antigens (ABO), n = 188 samples
• Always list absolute frequencies!
• Do not list relative frequencies in percent if the sample size is small (n < 20)
• Do not use decimal digits in percent numbers for n < 300
• „Side effects were observed in 14.2857% of all cases“: nonsense, we conclude that n = 7!
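The "n = 7" remark on the slide above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper name `smallest_sample_size` and the search limit `max_n` are illustrative assumptions. It looks for the smallest sample size for which some count k reproduces the reported percentage after rounding.

```python
def smallest_sample_size(percent, decimals, max_n=1000):
    """Smallest n such that some count k gives k/n (in percent) equal to
    `percent` when rounded to `decimals` digits. Hypothetical helper."""
    target = round(percent, decimals)
    for n in range(1, max_n + 1):
        k = round(percent * n / 100)  # closest possible count for this n
        if round(100 * k / n, decimals) == target:
            return n
    return None  # nothing found up to max_n


print(smallest_sample_size(14.2857, decimals=4))  # 7, i.e. 1 case out of 7
```

Any multiple of 7 would give the same percentage, so the search only yields the smallest sample size consistent with the reported number.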
I. Description Description of a categorial variable: Barplot %
I. Description Description of continuous data: Histogram
I. Description The size of the bins (= width of the bars) is a matter of choice and has to be determined sensibly! Examples shown: 50 bins, 4 bins, 12 bins
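To illustrate how strongly the picture depends on the number of bins, here is a small sketch with equal-width bins. The function `histogram_counts` and the data values are made up for illustration and do not come from the slides.

```python
def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width bins between min and max.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum falls on the right edge, so it goes into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical observations
print(histogram_counts(values, 2))  # [8, 2]
print(histogram_counts(values, 4))  # [3, 5, 1, 1]
```

With two bins the isolated value 9 is merged with 5; with four bins it shows up as a separate bar.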
I. Description Description of continuous data: Density plot Caution: Data will be smoothed automatically. This is very suggestive and blurs discontinuities in a distribution.
I. Description The most important one: The Gaussian (normal) distribution
C. F. Gauss (1777-1855): Roughly speaking, continuous variables that are the (additive) result of a lot of other random variables follow a Gaussian distribution. -> It is often sensible to assume a Gaussian distribution for continuous variables.
Parameters: expectation value, standard deviation
I. Description Measures of Location, Scale and Scatter
Mean: sum of all observations / number of samples
Ex.: observations: 2, 3, 7, 9, 14; sum: 2+3+7+9+14 = 35; # observations: 5; mean: 35/5 = 7
Median: A number M such that at least 50% of all observations are less than or equal to M, and at least 50% are greater than or equal to M. (Q: What if # observations is even?)
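The worked example above can be reproduced with Python's standard `statistics` module. The even-sized data set is an added illustration: with an even number of observations every value between the two middle observations satisfies the definition, and the usual convention (used by `statistics.median`) is to take their average.

```python
from statistics import mean, median, median_low, median_high

observations = [2, 3, 7, 9, 14]      # data from the slide
print(mean(observations))            # 7
print(median(observations))          # 7

even_sized = [2, 3, 7, 9]            # illustrative data, not from the slide
print(median_low(even_sized))        # 3
print(median_high(even_sized))       # 7
print(median(even_sized))            # 5.0, the average of 3 and 7
```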
I. Description Description of Location, Scale and Scatter
Mode: A value for which the density of the variable reaches a local maximum. If there is only one such value, the distribution is called unimodal, otherwise multimodal (special case: bimodal).
(Figure labels: median, mean)
I. Description Distribution Shapes
• Symmetric: Mean ≈ Median
• Skewed to the right: Median << Mean
• Skewed to the left: Mean << Median
I. Description
• The median should be preferred to the mean if
  • the distribution is very asymmetric
  • there are extreme outliers
• The mean is more „precise“ than the median if the distribution is approximately normal
Rule of thumb: If the skewness g of the distribution lies between –1 and +1, the distribution is approx. symmetric.
Right skew: skewness g > 0. Left skew: skewness g < 0.
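The slide does not state how the skewness g is computed. As an assumption, the sketch below uses the common moment coefficient g = m3 / m2^(3/2), where m2 and m3 are the second and third central moments; the data sets are invented for illustration.

```python
from statistics import fmean


def skewness(data):
    """Moment coefficient of skewness (an assumed definition of g)."""
    m = fmean(data)
    m2 = fmean((x - m) ** 2 for x in data)  # second central moment
    m3 = fmean((x - m) ** 3 for x in data)  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # one large outlier on the right
left_skewed = [-x for x in right_skewed]

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skewed),
                   ("left skew", left_skewed)]:
    print(name, round(skewness(data), 2))
```

The sign of the result follows the rule on the slide: positive for a right skew, negative for a left skew, and zero for perfectly symmetric data.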
I. Description How would you describe this distribution?
I. Description Unexpected distributions have unexpected causes! „…it showed a giant boa swallowing an elephant. I painted the inside of the boa to make it visible to the adults. They always need explanations.“ Antoine de Saint-Exupéry, Le petit prince
I. Description More Location Measures
Quantile: A q-quantile Q (0 ≤ q ≤ 1) splits the data into a fraction of q points below or equal to Q and a fraction of 1-q points above or equal to Q.
• 0-quantile = minimum
• 1st quartile = 25%-quantile
• Median = 50%-quantile
• 3rd quartile = 75%-quantile
• 1-quantile = maximum
I. Description The five-point Summary and the Boxplot
I. Description Measures of Variation Interquartile range (IQR): 3rd quartile - 1st quartile Span: Maximum - Minimum
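A short sketch of the five-point summary, the IQR and the span, using the standard library. Note that quartile conventions differ between software packages; the choice of `method="inclusive"` is an assumption made here, and the function name `five_point_summary` is illustrative.

```python
from statistics import quantiles


def five_point_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)


observations = [2, 3, 7, 9, 14]
minimum, q1, med, q3, maximum = five_point_summary(observations)
print(minimum, q1, med, q3, maximum)  # 2 3.0 7.0 9.0 14
print("IQR:", q3 - q1)                # IQR: 6.0
print("Span:", maximum - minimum)     # Span: 12
```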
I. Description Measures of Variation How far do the observations scatter around their „center“ (= measure of location)? (Figure: large variation vs. small variation around the location measure) e.g.: location = median, variation = 3rd quartile – 1st quartile = interquartile range (IQR)
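I. Description Measures of Variation
e.g.: location = median $M$, variation = mean deviation (MD) from $M$: $\mathrm{MD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - M|$
e.g.: location = median $M$, variation = median absolute deviation (MAD) from $M$: $\mathrm{MAD} = \operatorname{median}_i\, |x_i - M|$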
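Both median-based measures of variation can be sketched in a few lines. The function names are illustrative, and the deviations are taken from the median, as on the slide.

```python
from statistics import median


def mean_deviation(data):
    """Mean absolute deviation of the observations from their median."""
    m = median(data)
    return sum(abs(x - m) for x in data) / len(data)


def median_absolute_deviation(data):
    """Median of the absolute deviations from the median (MAD)."""
    m = median(data)
    return median(abs(x - m) for x in data)


observations = [2, 3, 7, 9, 14]                 # median is 7
print(mean_deviation(observations))             # (5+4+0+2+7)/5 = 3.6
print(median_absolute_deviation(observations))  # median of 0,2,4,5,7 = 4
```

Some software multiplies the MAD by a constant to make it comparable to the standard deviation for Gaussian data; that scaling is not applied here.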
I. Description Measures of Variation
e.g.: location = mean $\bar{x}$, variation = mean squared deviation from $\bar{x}$ = variance (v)
Or: variation = square root of the variance = standard deviation (s, std. dev.)
Numbers for Gaussian variables:
• Mean ± s contains ~68% of the data
• Mean ± 2s contains ~95% of the data
• Mean ± 3s contains ~99.7% of the data
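The Gaussian coverage numbers can be verified with `statistics.NormalDist`. The sketch also shows the two common variance conventions (dividing by n or by n - 1), since the transcript does not say which one the slide uses.

```python
from statistics import NormalDist, pvariance, variance, pstdev

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Probability that a Gaussian variable lies within mean ± k·s."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage(k), 4))   # 0.6827, 0.9545, 0.9973

observations = [2, 3, 7, 9, 14]
print(pvariance(observations))  # 18.8  (divide by n)
print(variance(observations))   # 23.5  (divide by n - 1)
print(pstdev(observations))     # square root of 18.8
```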
I. Description Histogram/Density Plot vs. Boxplot The boxplot contains less information, but it is easier to interpret.
I. Description Multiple Boxplots Sample: 2769 schoolchildren
I. Description Summary
• Always report the sample size!
• a) numerical: Median, Q1, Q3, Min., Max. (five-point summary); for symmetric distributions alternatively: mean, standard deviation
• b) graphical: boxplots, histograms and/or density plots
• c) verbal: e.g. „Blood pressure was reduced by 12 mmHg (interquartile range: 8 to 18 mmHg = 10 mmHg), whereas the reduction in the placebo group was only 3 mmHg (IQR: –2 to 4 mmHg = 6 mmHg).“
I. Description Two categorial variables: Cross Tables Data
I. Description Two categorial variables: Cross tables Data values of variable 2 (potential effects) values of variable 1 (potential causes)
I. Description Two categorial variables: Cross tables Data values of variable 2 (potential effects) Each case is one count in the table values of variable 1 (potential causes)
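"Each case is one count in the table" translates directly into code. The sketch below builds a cross table from a list of cases; the category names and the data are hypothetical, as is the helper `cross_table`.

```python
from collections import Counter


def cross_table(cases):
    """Count how often each (variable 1, variable 2) combination occurs."""
    return Counter(cases)


# each case is one (potential cause, potential effect) pair; made-up data
cases = [
    ("treated", "healed"), ("treated", "healed"), ("treated", "not healed"),
    ("placebo", "healed"), ("placebo", "not healed"), ("placebo", "not healed"),
]
table = cross_table(cases)

for cause in ("treated", "placebo"):
    row = [table[(cause, effect)] for effect in ("healed", "not healed")]
    print(cause, row)
# treated [2, 1]
# placebo [1, 2]
```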
I. Description Two categorial variables: Cross tables The most common question is: Are there differences between █ and █? values of variable 2 (potential effects) values of variable 1 (potential causes)
I. Description Two categorial variables: Cross tables Cross Table: n = 80 cases
I. Description Two categorial variables: Cross tables What‘s bad about this table?
I. Description Cross tables: Independent vs. paired data
(Figure: independent data vs. paired data)
Paired data: One object (or two closely related objects) serves for the measurement of two variables of the same kind.
Exercise: The influence of diet on body height is assessed in 1) a study with 100 randomly picked subjects, 2) a study with 50 identical twins that grew up separately. Write down the cross tables. Which study is probably more informative?
I. Description Cross tables: Paired data paired data values of variable 2 values of variable 1
I. Description Cross tables: Paired data A typical question is: Are the observations concordant or discordant? Is there a particularly large number in █ or █? values of variable 2 concordant observations discordant observations values of variable 1
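For paired data, concordant observations lie on the diagonal of the cross table and discordant ones off the diagonal. A minimal sketch with invented twin data (the function `concordance` and the categories are assumptions for illustration):

```python
def concordance(pairs):
    """Return (number of concordant pairs, number of discordant pairs)."""
    concordant = sum(1 for first, second in pairs if first == second)
    return concordant, len(pairs) - concordant


# hypothetical height category of twin 1 and twin 2
twin_pairs = [("tall", "tall"), ("tall", "short"), ("short", "short"),
              ("short", "tall"), ("tall", "tall")]
print(concordance(twin_pairs))  # (3, 2)
```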
I. Description Induction from the sample to the population
• Significance testing: Difference in the sample -> Difference in the population? Probability of a false call?
• Estimation, regression: Measure in the sample -> Measure in the population? Variance? Confidence intervals?
I. Description What allows us to conclude from the sample to the population?
The sample has to be representative (figures about drug abuse of students cannot be generalized to the whole population of Germany).
How is representativity achieved?
• Large sample numbers
• Random recruitment of samples from the population. E.g.: Dial a random phone number. Choose a random name from the register of births. (Advantages/disadvantages?)
• Randomization: Random allocation of the samples to the different experimental groups
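Randomization as described above can be sketched as a shuffle followed by a round-robin split. The function name `randomize`, the subject labels and the fixed seed are illustrative assumptions; a real study would follow its own allocation protocol.

```python
import random


def randomize(subjects, n_groups=2, seed=None):
    """Randomly allocate subjects to n_groups groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # deal the shuffled subjects out like cards, one group after the other
    return [shuffled[i::n_groups] for i in range(n_groups)]


subjects = [f"subject_{i}" for i in range(10)]   # hypothetical subjects
treatment, control = randomize(subjects, n_groups=2, seed=42)
print(len(treatment), len(control))              # 5 5
```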
I. Description Confidence intervals
Point estimate (e.g. % votes for the SPD in the EU elections): 24.3
Interval estimate: (20.5, 29.5)
95%-confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of 95%.
(1 – α)-confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of (1 – α). 1 – α = confidence level, α = error probability
Use confidence intervals with caution!
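The slides do not say how the interval estimate is computed. One common choice for a proportion is the normal-approximation (Wald) interval, sketched below as an assumption. The poll size of 1000 and the count of 243 are invented for illustration, so the result is not meant to reproduce the interval shown on the slide.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, alpha=0.05):
    """Point estimate and (1 - alpha) Wald confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


p, lower, upper = proportion_ci(successes=243, n=1000)  # hypothetical poll
print(f"{100 * p:.1f}% ({100 * lower:.1f}% to {100 * upper:.1f}%)")
```

A smaller α widens the interval, and a larger sample narrows it; the approximation is poor for small samples or proportions close to 0 or 1.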
II. Testing A non-sheep detector Training: Measure the length of all sheep that cross your way
II Testing A non-sheep detector Training: Measure the length of all sheep that cross your way. Determine the distribution of the quantity of interest.
II Testing A non-sheep detector Test phase: For any unknown animal, test the hypothesis that it is a sheep. Measure its length and compare it to the learned length distribution of the sheep. If its length is „out of bounds“, the animal will be called a non-sheep (rejection of the hypothesis). Otherwise, we cannot say much (non-rejection). Not a sheep
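The training and test phases of the non-sheep detector can be sketched as follows. Several things here are assumptions not found on the slides: the sheep lengths are invented, the length distribution is modelled as Gaussian, and "out of bounds" is taken to mean outside the central 95% of that distribution.

```python
from statistics import NormalDist


def train_detector(sheep_lengths, alpha=0.05):
    """Training: learn the length distribution and return (lower, upper) bounds."""
    dist = NormalDist.from_samples(sheep_lengths)   # assumed Gaussian model
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)


def is_non_sheep(length, bounds):
    """Test phase: reject the hypothesis 'sheep' if the length is out of bounds."""
    lower, upper = bounds
    return length < lower or length > upper


sheep_lengths_cm = [95, 100, 102, 105, 110, 98, 101, 104, 99, 106]  # made up
bounds = train_detector(sheep_lengths_cm)
print(is_non_sheep(180, bounds))  # True: called a non-sheep
print(is_non_sheep(103, bounds))  # False: we cannot say much
```

A `False` result does not prove that the animal is a sheep; it only means the hypothesis was not rejected.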
II Testing A non-sheep detector
Advantage of the method: One does not need to know much about sheep.
Disadvantage: It produces errors…
(Figure: a decision boundary separates positive calls from negative calls; the four outcomes are true positives, false positives, false negatives, true negatives)
II Testing Statistical Hypothesis Testing • State a null hypothesis H0(„nothing happens, there is no difference…“) • Choose an appropriate test statistic (the data-derived quantity that finally leads to the decision) This implicitly determines the null distribution (the distribution of the test statistic under the null hypothesis).
II Testing Statistical Hypothesis Testing • State an alternative hypothesis (e.g. „the test statistic is higher than expected under the null hypothesis“) • Determine a decision boundary d. This is equivalent to the choice of a significance level α, i.e. the fraction of false positive calls you are willing to accept. (Figure labels: d, rejection region, acceptance region, α)
II Testing Statistical Hypothesis Testing • Calculate the actual value of the test statistic in the sample, and make your decision according to the prespecified (!) decision boundary d: reject H0 (assume the alternative hypothesis) or keep H0 (no rejection).
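The steps of the testing procedure can be put together in one small example. This is a sketch under stated assumptions, not the specific test used in the lecture: the test statistic is the sample mean, the null distribution is Gaussian with known standard deviation (a one-sided z-test), and all numbers are invented.

```python
from math import sqrt
from statistics import NormalDist, fmean


def one_sided_z_test(sample, mu0, sigma, alpha=0.05):
    """Test H0 'mean = mu0' against the alternative 'mean > mu0'."""
    # null distribution of the test statistic (the sample mean) under H0
    null_dist = NormalDist(mu0, sigma / sqrt(len(sample)))
    # decision boundary d, fixed by alpha before looking at the data
    d = null_dist.inv_cdf(1 - alpha)
    statistic = fmean(sample)
    reject = statistic > d
    return statistic, d, reject


sample = [5.1, 5.3, 4.9, 5.6, 5.4, 5.2, 5.5, 5.0, 5.3]   # hypothetical data
statistic, d, reject = one_sided_z_test(sample, mu0=5.0, sigma=0.3)
print(round(statistic, 3), round(d, 3), reject)
```

Here the sample mean lies above the decision boundary, so H0 is rejected at the chosen level α = 0.05.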
II Testing Good test statistics, bad test statistics d Good statistic Distribution of the test statistic under the null hypothesis Distribution of the test statistic under the alternative hypothesis 0