Biostatistics I Descriptive statistics and some things related to the normal distribution
Outline • Descriptive statistics • Frequency distribution, relative, cumulative, histograms • Measure of central tendency (mean, median, mode) • Deviations and measure of variation • Some thoughts on the normal distribution • Z-score, CV, Confidence interval • Standard error of the mean • How many samples?
Population and sample 1 • Population: a finite number of separate objects defined in space and time • All boats operating in a country’s EEZ in year 2009 • Boats of a particular type operating in a country’s waters in January 2009 • The number of queen conch in a country’s EEZ • Sample: a subset of a population • Usually a sample is an order of magnitude smaller than the population
Population and sample 2 • Use information from a sample to make inferences about the population • The population is unknown; the sample is known • We can only make inferences about the population from the sample if the sample is representative of the population
Frequency distributions 1 • The objective of frequency tabulation is to condense the raw data into a more useful form that allows some visual interpretation of the data. • How can we make a quick summary of the data on the right? • Let's say that the data contain length measurements of 30 fish (n=30) • We can quickly see that the smallest fish is 3.4 cm and the largest is 15.3 cm
Frequency distributions 2 • How it's done: • Decide on the number of classes to include in the frequency distribution. • Here 7 length classes • Find the class width: determine the range of the data, divide the range by the number of classes and round up to the next convenient number. • Range is: 15.3 cm – 3.4 cm = 11.9 cm • 11.9 cm / 7 = 1.7 cm, rounded up to 2 cm • Find the class limits: start with the lowest value, rounded down to a convenient number, as the lower limit of the first class, then add the class width to obtain the lower limit of the second class, etc. • Lowest class limit = 2 cm • Next one: 2 + 2 = 4 cm, etc. • Count the number of fish in each length class, either with pencil and paper or with a computer program.
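The tabulation steps can be sketched in a few lines of Python. This is only an illustration: the 30 lengths below are invented (only the minimum of 3.4 cm and the maximum of 15.3 cm come from the slides), and rounding the lowest value down to a multiple of the class width is an assumption about what "rounded down" means here.

```python
import math

# Hypothetical lengths (cm) of 30 fish. Only the minimum (3.4) and the
# maximum (15.3) come from the slides; every other value is invented.
lengths = [
    3.4, 3.9, 4.5, 5.1, 5.8, 6.2, 6.6, 7.1, 7.4, 7.9,
    8.0, 8.3, 8.7, 9.1, 9.5, 9.8, 10.2, 10.4, 10.9, 11.1,
    11.3, 11.6, 11.9, 12.1, 12.5, 12.8, 13.2, 13.6, 13.9, 15.3,
]

n_classes = 7
data_range = max(lengths) - min(lengths)          # 15.3 - 3.4 = 11.9 cm
class_width = math.ceil(data_range / n_classes)   # 1.7 cm rounded up to 2 cm

# Assumption: "rounded down" means down to a multiple of the class width.
lower_limit = math.floor(min(lengths) / class_width) * class_width

# Lower limits of the classes: 2, 4, ..., 14 cm.
limits = [lower_limit + j * class_width for j in range(n_classes)]

# Count the fish in each class [limit, limit + class_width).
counts = [0] * n_classes
for x in lengths:
    j = int((x - lower_limit) // class_width)
    counts[j] += 1

for limit, count in zip(limits, counts):
    print(f"{limit:>2}-{limit + class_width:<2} cm: {count}")
```

A value that falls exactly on a boundary (8.0 cm, say) is counted in the class that starts there; other conventions are possible.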
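Relative and cumulative frequency • Relative frequency is the proportion of the observations that fall within a class. • Cumulative frequency is the sum of the frequencies of all classes below and including the class indicated. • Relative cumulative frequency is the same sum taken over the relative frequencies, so it reaches 1 in the last class.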
Various ways of displaying frequency • Histogram • Relative frequency • Cumulative frequency • Relative cumulative frequency • How would one verbally describe: 1) the general characteristics of the data? 2) the different forms of presentation of the same data?
Some mathematical bookkeeping (general definition first, then our example) • n: number of measurements; n = 30 fish • Lowest value: Xmin; Xmin = 3.4 cm • Highest value: Xmax; Xmax = 15.3 cm • Range: Xmax – Xmin; 15.3 – 3.4 = 11.9 cm • j: class number; j = 1, 2, …, 7 • Class boundaries: L1, L2, …; 2, 4, …, 16 cm • Class range: dl = Lj+1 – Lj; 4 – 2 = 2 cm • Class midpoint: (Lj + Lj+1)/2; (2 + 4)/2 = 3 cm • nj: number of fish in class j; nj = 2, 3, 5, 6, 7, 6, 1 • Relative frequency: nj / n; 0.067, 0.100, 0.167, …, 0.033 • Cumulative frequency: Hmm …, let's wait for that one; as relative cumulative frequency: 0.067, 0.167, 0.333, …, 1.000
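The bookkeeping can be checked with a short Python sketch that starts from the class counts nj = 2, 3, 5, 6, 7, 6, 1 and the class boundaries 2, 4, …, 16 cm given in the example. The variable names are illustrative only.

```python
from itertools import accumulate

boundaries = [2, 4, 6, 8, 10, 12, 14, 16]   # class boundaries in cm
counts = [2, 3, 5, 6, 7, 6, 1]               # nj: number of fish in class j
n = sum(counts)                              # 30 fish

# Class midpoint: average of the lower and upper boundary of each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Relative frequency: nj / n.
relative = [c / n for c in counts]

# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(counts))

# Relative cumulative frequency: running total divided by n.
relative_cumulative = [c / n for c in cumulative]

for mid, c, rel, cum in zip(midpoints, counts, relative, relative_cumulative):
    print(f"midpoint {mid:4.1f} cm  n={c}  rel={rel:.3f}  cum={cum:.3f}")
```

The last relative cumulative value is always 1, which is a quick check that no observation has been lost.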
Number of classes? • Generally no fewer than 5 and no more than 15 • Depends in part on: • The number of observations: the more observations, the greater the number of classes. • The nature of the data • If the sample is a composite of many different elements we need a high number of classes. But that also means we need a lot of measurements. • Some general guidelines • Number of classes = square root of n • Sturges' rule: number of classes = 1 + 1.44 ln(n), which gives a class width of (Xmax – Xmin) / (1 + 1.44 ln(n))
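Both guidelines are easy to compare numerically. The sketch below assumes that Sturges' rule is read as "number of classes = 1 + 1.44 ln(n)", with the class width following from the range; the function names are made up for this example.

```python
import math

def classes_sqrt(n):
    """Guideline 1: number of classes is roughly the square root of n."""
    return math.sqrt(n)

def classes_sturges(n):
    """Guideline 2 (Sturges' rule): 1 + 1.44 ln(n), i.e. about 1 + log2(n)."""
    return 1 + 1.44 * math.log(n)

n = 30
x_min, x_max = 3.4, 15.3

k_sqrt = classes_sqrt(n)          # about 5.5 classes
k_sturges = classes_sturges(n)    # about 5.9 classes
width_sturges = (x_max - x_min) / k_sturges   # about 2 cm

print(f"sqrt rule:    {k_sqrt:.1f} classes")
print(f"Sturges rule: {k_sturges:.1f} classes, class width {width_sturges:.1f} cm")
```

For the 30 fish both guidelines suggest about 6 classes, close to the 7 classes of 2 cm used in the example.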
Measure of central tendency A value that is supposed to describe the most typical or central point of the measurements
Measure of central tendency • Arithmetic mean • Median • Mode
The arithmetic mean • In mathematical notation: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • n: the total number of measurements • i: the ith measurement • xi: the value of the ith measurement • Note the effect of “outliers” on the mean value • How well does the mean describe the most typical value?
The median • Sort the measurements from lowest to highest (ranked) • Find the median position: (n+1)/2 of the ordered data • The median value: the value of the observation in the median position • Note: if n is an even number the median is the average of the two central values • E.g. 10, 20, 30, 40, 50, 60: median position = (6+1)/2 = 3.5, so Median = (30 + 40)/2 = 35 • Note that the median is not affected by the “outlier”
The mode • Mode = value that occurs most often • Not sensitive to outliers • Problem: there may be no mode or many modes • E.g. 10, 20, 30, 30, 30, 40, 50, 60: Mode = 30
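The three measures can be compared with Python's standard `statistics` module, using the example values from the slides. The data set with an outlier (600 in place of 60) is invented here to show how the mean and the median react differently.

```python
import statistics

values = [10, 20, 30, 40, 50, 60]          # example from the median slide
with_outlier = [10, 20, 30, 40, 50, 600]   # hypothetical: 60 replaced by 600

mean_plain = statistics.mean(values)             # 35
median_plain = statistics.median(values)         # (30 + 40) / 2 = 35
mean_outlier = statistics.mean(with_outlier)     # 125: pulled up by the outlier
median_outlier = statistics.median(with_outlier) # still 35

mode_values = [10, 20, 30, 30, 30, 40, 50, 60]   # example from the mode slide
mode = statistics.mode(mode_values)              # 30

# multimode returns every value tied for most frequent ("many modes").
modes = statistics.multimode([10, 20, 20, 30, 30])  # [20, 30]

print(mean_plain, median_plain, mean_outlier, median_outlier, mode, modes)
```

The outlier moves the mean from 35 to 125 while the median stays at 35.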
Shapes of distributions • Left skewed (tail to the left): Mean < Median < Mode • Symmetrical: Mode = Median = Mean • Right skewed (tail to the right): Mode < Median < Mean
Measure of variability A value that is supposed to describe the distribution of the measurements around the central value
Fractiles: General definitions • Range: difference between the maximum and minimum value • Range = xmax – xmin • Sensitive to outliers • Quartiles: Q1, Q2 and Q3 divide a data set into four equal parts • Q1: 25th percentile • Q2: 50th percentile = Median • Q3: 75th percentile • Interquartile range = Q3 – Q1 • Less sensitive to outliers • Percentiles: P1, P2, …, P99 divide a data set into 100 equal parts • Note relationship: Q1 = P25, Q2 = P50 = Median, Q3 = P75
Fractiles: Box and whisker plots • Note • 25% of observations are ≤ Q1, 50% ≤ Q2, 75% ≤ Q3 • 50% of the observations lie between Q1 and Q3 • Values in the plot: Minimum = 4, Q1 (25th percentile) = 100, Q2 (50th percentile, median) = 200, Q3 (75th percentile) = 400, Maximum = 600 • Range = 600 – 4 = 596 • Interquartile range = 400 – 100 = 300
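The five numbers behind a box and whisker plot can be computed with `statistics.quantiles`. The seven measurements below are hypothetical, chosen so that the summary matches the values quoted on the slide; note also that software packages use slightly different quartile conventions (Python's default is the "exclusive" method), so results for real data may differ a little between tools.

```python
import statistics

# Hypothetical measurements chosen to reproduce the slide's summary values.
data = [4, 100, 150, 200, 300, 400, 600]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

minimum, maximum = min(data), max(data)
data_range = maximum - minimum      # 600 - 4 = 596
iqr = q3 - q1                       # 400 - 100 = 300

print(f"min={minimum} Q1={q1} median={q2} Q3={q3} max={maximum}")
print(f"range={data_range} interquartile range={iqr}")
```

Percentiles follow the same pattern with `n=100`, where the 25th, 50th and 75th cut points coincide with Q1, Q2 and Q3.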
Box and whisker plots and distributions • Panels: left skewed, symmetrical and right skewed distributions, each marked with mode, median and mean • Box and whisker plots give an indication of the central value (the median), the spread of the data and the shape of the distribution
Example of a quartile plot • The upper plot shows the median catch rate (CPUE) as a function of time. • The lower plot shows the median and the interquartile range of the catch rate as a function of time. • What additional information does the lower graph provide?
Example of a percentile plot • P90: 90% of the observations have values less than 19
Deviations from the mean 1 • In mathematical notation: $x_i - \bar{x}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • i: the ith measurement • n: the total number of measurements • xi: the value of the ith measurement
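Deviation from the mean 2 • Deviation from the mean: $x_i - \bar{x}$ • How can we characterize the average deviation? • The plain average of the deviations is always zero, because positive and negative deviations cancel: $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = 0$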
Variance & standard deviation 1 • The variance • Whole population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$ • Sample from population: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ • Standard deviation: square root of the variance • Xi: ith measurement of the variable X • μ: population mean; Xbar: sample mean • σ: population std. deviation; s: sample std. deviation • N: population size; n: sample size
Variance and standard deviation 2 • Do you think that the value of 15.8 is a reasonable measure of the average deviation in the data?
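Variance and standard deviation 3 • Deviations from the mean: 20, 0, 20, 10, 10 • Sum of squared deviations: 400 + 0 + 400 + 100 + 100 = 1000 • Variance: 1000 / (5 – 1) = 250 • Standard deviation: √250 = 15.8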
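The worked example can be reproduced step by step. The five measurements below are hypothetical: they were chosen only so that their deviations from the mean have the sizes 20, 0, 20, 10 and 10 used on the slide, and the sample (n – 1) form of the variance is assumed.

```python
import math
import statistics

# Hypothetical data whose deviations from the mean are -20, 0, +20, -10, +10.
data = [30, 50, 70, 40, 60]
n = len(data)
mean = sum(data) / n                         # 50.0

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)          # 0.0: plain deviations cancel out

squared = [d ** 2 for d in deviations]       # 400, 0, 400, 100, 100
sum_of_squares = sum(squared)                # 1000.0

variance = sum_of_squares / (n - 1)          # sample variance: 1000 / 4 = 250
std_dev = math.sqrt(variance)                # about 15.8

print(f"sum of deviations = {sum_of_deviations}")
print(f"variance = {variance}, standard deviation = {std_dev:.1f}")

# The standard library gives the same sample variance.
assert statistics.variance(data) == variance
```

Squaring removes the sign of each deviation, which is why the squared deviations no longer cancel.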
Coefficient of variation • Measure of relative variation • CV = s / Xbar, the “relative standard deviation” • Expressed as a percentage (%) or a proportion • Can be higher than 100% • Can be used to compare the variation of two or more sets of data
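As a small illustration, the CV can be used to compare two data sets measured on very different scales. The sketch assumes CV = standard deviation / mean; both sets of summary numbers are invented.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative standard deviation, returned as a proportion."""
    return std_dev / mean

# Hypothetical summary statistics for two data sets.
cv_small_fish = coefficient_of_variation(std_dev=10, mean=50)    # 0.2
cv_large_fish = coefficient_of_variation(std_dev=20, mean=200)   # 0.1

print(f"small fish: CV = {cv_small_fish:.0%}")
print(f"large fish: CV = {cv_large_fish:.0%}")
```

The second set has the larger standard deviation (20 against 10) but the smaller relative variation (10% against 20%).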
The normal distribution The normal distributions are a very important class of statistical distributions. All normal distributions are symmetric and have bell-shaped density curves with a single peak.
Common distribution of measurements 1 • Example: 7073 length measurements of Icelandic cod larvae taken in August 2002 (n=7073). • Since we have many fish we can use a length bin of 1 mm to generate a frequency distribution. • Most fish fall within a certain narrow size range • The number of fish of a given length decreases the further one goes from the centre of the distribution. • The distribution is close to symmetrical
Common distribution of measurements 2 • Let's make a rough eyeball drawing through the points • Can we describe this red line mathematically?
Normal distribution • nLi: number in length class Li • dLi: width of length interval
The normal distribution • The model that describes the normal distribution is complex at first sight: $\mathrm{pdf}(X_i) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{(X_i - \bar{X})^2}{2s^2}}$ • pdf: probability density function • Xi: measured variable (here length of fish) • Xbar: the mean • s: the standard deviation
What matters? • What parameters are in the equation? • Xbar is the sample mean • s is the standard deviation • The rest (2, π, e) are constants • The normal distribution is only “controlled” by Xbar and s, often written as: pdf = f(Xbar, s) • In words we say that the normal distribution is a function of Xbar and s.
pdf = f(Xbar, s), keep the mean (Xbar) = 50, change s • The central position (Xbar) remains the same. • The higher the value of s, the greater the spread of the curve. • Q: Is the mean on its own a useful measure?
pdf = f(Xbar, s), keep s=10, change Xbar The shape of the curve remains the same. The mean (Xbar) describes the central location on the x-axis.
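The effect of Xbar and s can be explored with a direct implementation of the normal probability density function, pdf(x) = exp(–(x – Xbar)² / (2s²)) / (s √(2π)). The function name and the particular values of Xbar and s tried below are illustrative choices only.

```python
import math

def normal_pdf(x, mean, sd):
    """Normal probability density at x for a given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / (sd * math.sqrt(2 * math.pi))

# Keep the mean at 50 and change s: the peak stays at x = 50 but gets lower
# (and the curve wider) as s increases.
for sd in (5, 10, 20):
    print(f"s={sd:>2}: density at the mean = {normal_pdf(50, 50, sd):.4f}")

# Keep s = 10 and change the mean: the curve keeps its shape and only shifts,
# so the density at the mean is the same wherever the mean lies.
for mean in (30, 50, 70):
    print(f"Xbar={mean}: density at the mean = {normal_pdf(mean, mean, 10):.4f}")
```

Because the area under the curve is always 1, a wider curve must also be a lower one.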
What line describes the data distribution best? Assume the distribution is normal: Find value of Xbar and s which best describe the data.
Answer: Xbar = 50, s = 10 • In 2002, 7073 larvae were measured. The mean was 50 mm and the standard deviation 10 mm
Can we say anything about probabilities? • Probability = likelihood ≈ relative frequency • In presentations of data analysis we often have statements like: • We expect that 95% of the population is within a certain specified range of the data distribution • E.g. given the sample that I have, I expect that 95% of the fish population is between 30 and 70 mm. • This is sometimes written as: 50 ± 20 mm • How can we say this? • Why do we say this?
Although there are many normal curves, they all share an important property that allows us to treat them in a uniform fashion. • The 68-95-99.7% Rule • All normal density curves satisfy the following property, which is often referred to as the Empirical Rule. • 68% of the observations fall within 1 standard deviation of the mean, that is, between Xbar – s and Xbar + s. • 95% of the observations fall within 2 standard deviations of the mean, that is, between Xbar – 2s and Xbar + 2s. • 99.7% of the observations fall within 3 standard deviations of the mean, that is, between Xbar – 3s and Xbar + 3s. • Thus, for a normal distribution, almost all values lie within 3 standard deviations of the mean.
Note that these values are approximations: • For example, according to the normal probability density function, 95% of the data fall within 1.96 standard deviations of the mean. • Using 2 standard deviations is a convenient approximation.
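The percentages in the rule, and the reason for the 1.96, can be verified numerically. For a normal distribution the fraction of values within k standard deviations of the mean equals erf(k / √2), where erf is the error function from Python's `math` module; the helper name `coverage` is made up for this sketch.

```python
import math

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"within {k} s of the mean: {coverage(k):.4f}")

# Applied to the larvae example (Xbar = 50 mm, s = 10 mm, n = 7073):
n = 7073
print(f"expected within 40-60 mm: {round(0.68 * n)} larvae")
print(f"expected within 1.96 s:   {round(0.95 * n)} larvae")
```

The output shows 0.6827, 0.9500, 0.9545 and 0.9973, so 2 standard deviations actually cover slightly more than 95% of the data.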
What does 1.96 standard deviations mean? (±2s) • In 2002, 7073 larvae were measured. The mean was 50 mm and the standard deviation 10 mm • 95% of all the measurements (6719 larvae) fall within 1.96 standard deviations of the mean (50 ± 19.6 mm, roughly 30-70 mm), given that the data follow a normal distribution.
But what does 1 standard deviation mean? (±1s) • In 2002, 7073 larvae were measured. The mean was 50 mm and the standard deviation 10 mm • 68% of all the measurements (4810 larvae) fall within 1 standard deviation of the mean (40-60 mm), given that the data follow a normal distribution
The Z score • In statistics the Z score is defined as: $Z_i = \frac{X_i - \bar{X}}{s}$ • Hmm ..., have we seen this formula before?
The meaning of the Z-score • The Z-score standardizes the deviation of a measurement from the mean relative to the standard deviation. • The Z-score value is a multiplier, indicating how many standard deviations a particular measurement is from the mean.
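A minimal sketch of the standardization, assuming Z = (X – Xbar) / s and the larvae summary values Xbar = 50 mm and s = 10 mm; the individual lengths are picked only for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

mean, sd = 50.0, 10.0                 # larvae example: Xbar = 50 mm, s = 10 mm
lengths = [30, 40, 50, 60, 70]        # illustrative lengths in mm

z_scores = [z_score(x, mean, sd) for x in lengths]

for x, z in zip(lengths, z_scores):
    print(f"{x} mm -> Z = {z:+.1f}")
```

A length of 30 mm is 2 standard deviations below the mean (Z = –2) and a length of 60 mm is 1 standard deviation above it (Z = +1), whatever the original unit of measurement.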
The Z scores of our data • Two panels: n against Length and n against Z score, with ±1s and ±2s marked
Cumulative relative distribution of Z scores • The graph shows that –1 s is the 16th percentile and +1 s the 84th percentile. • Thus 84 – 16 = 68% of the data lie within ± 1 s of the mean
Cumulative relative distribution of Z scores • The shape of this graph and the values of Z and pdf are the same for any normally distributed data, irrespective of the number of measurements (n) and the value of the mean and standard deviation • If we have a mean and a standard deviation from a sample and we assume that the data are normally distributed, we can say what the probability is that the next measurement we take is less than a certain Z value. • E.g. Xbar = 100 mm, s = 20 mm. How likely is it that the next measurement sampled is at or below: • 60: Z score = (60 – 100)/20 = –2, about 2.5% probability • 120: Z score = (120 – 100)/20 = +1, about 84% probability
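The probabilities for the example with Xbar = 100 mm and s = 20 mm can be computed from the stated mean and standard deviation with `statistics.NormalDist`, whose `cdf` method gives the probability of a value at or below x. This is a sketch that assumes the measurements really are normally distributed.

```python
from statistics import NormalDist

mean, sd = 100.0, 20.0                 # Xbar = 100 mm, s = 20 mm
dist = NormalDist(mu=mean, sigma=sd)

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

for x in (60, 120):
    z = z_score(x)
    p_below = dist.cdf(x)              # probability of a value <= x
    print(f"{x} mm: Z = {z:+.1f}, P(next measurement <= {x}) = {p_below:.3f}")
```

The exact value for Z = –2 is about 0.023; the 2.5% quoted from the 95% rule is a rounded approximation. For Z = +1 the result is about 0.841, in line with the 84th percentile read from the cumulative graph.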
Standard error (standard deviation of the means) • Standard error, SE = s / √n, is an estimate of the standard deviation of the means. • We are effectively using the present sample to estimate what the likely distribution of the means would be if we were to take repeated samples from the population. • The standard error is thus a value that can be used to estimate the confidence interval of the population mean from the sample mean, given the distribution of the data • We assume that the means are normally distributed • Note: the standard deviation is an estimate of the dispersion of the individual observations around the mean of a sample
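As a sketch of how the standard error is used, the code below takes the larvae summary values (n = 7073, Xbar = 50 mm, s = 10 mm) and forms an approximate 95% confidence interval for the mean as Xbar ± 1.96 SE. It assumes SE = s / √n and normally distributed means; the function name is illustrative.

```python
import math

def standard_error(sd, n):
    """Estimated standard deviation of the sample mean."""
    return sd / math.sqrt(n)

n, mean, sd = 7073, 50.0, 10.0         # larvae example
se = standard_error(sd, n)             # about 0.12 mm

# Approximate 95% confidence interval for the population mean.
low = mean - 1.96 * se
high = mean + 1.96 * se

print(f"standard deviation: {sd} mm (spread of individual larvae)")
print(f"standard error:     {se:.3f} mm (uncertainty of the mean)")
print(f"95% CI of the mean: {low:.2f} to {high:.2f} mm")

# The standard error shrinks with the square root of the sample size.
for size in (10, 100, 1000):
    print(f"n={size:>4}: SE = {standard_error(sd, size):.2f} mm")
```

Individual larvae range over roughly 30-70 mm, but with 7073 measurements the mean itself is pinned down to within about a quarter of a millimetre.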