Hss2381a – stats and stuff

Hss2381a – stats and stuff The Normal Curve, part 1

No class on Thursday!

Interdisciplinary Journal of Health Sciences • WANTED: Seeking applicants for the 2011-2012 editorial team • Students in both the English and French HSS streams are encouraged to apply. • Send an email expressing your interest in the position to IJHS@hssuottawa.ca, with your resume attached. • Successful candidates will be invited to a panel interview. • Deadline to apply: Wednesday, September 28th, 2011

Last time…. • We covered measures of central tendency: • Mode • Median • Mean • And two measures of variability: • Range • Interquartile Range

Two More Measures of Variability • Standard deviation • Variance

The Standard Deviation • Standard deviation (SD or σ): An index that conveys how much, on average, scores in a distribution vary • SDs are based on deviation scores (x), calculated by subtracting the mean from each person’s original score: x = X − M

Standard Deviation Interpretation • In a normal distribution, a fixed percentage of cases lie within certain distances from the mean:

Example • We weigh 10 students and collect their weight in pounds: • 110 120 130 140 150 150 160 170 180 190 • What is the mean (M)? 150 • For the lightest person, their weight is the mean − 40 • For the heaviest person, their weight is the mean + 40

What’s a deviation? • A “deviation” is how much each data point deviates from the mean • So for X1 (110) the deviation is -40 • And for X10 (190) the deviation is +40 • So what’s a “standard deviation”? • It’s some sort of measure of how much the “typical” data point deviates from the mean

Let’s go back to our data… • Mean = 150 • Deviations: -40 -30 -20 -10 0 0 10 20 30 40 • Sum of deviations = 0

Defining Standard Deviation • The sum of all deviation scores in a distribution always = 0 • So, to compute SDs, deviation scores must be squared (x²) before being summed • SD equation: SD = √(Σx² ÷ (N − 1))

Standard Deviation (cont’d) • Weights (pounds): • 110 120 130 140 150 150 160 170 180 190 • Deviation scores (x) for M = 150: -40 -30 -20 -10 0 0 10 20 30 40 • Squared deviation scores (x²): 1600 900 400 100 0 0 100 400 900 1600 • Sum of squared deviation scores: 1600+900+400+100+0+0+100+400+900+1600 = 6000 • SD = √(6000 ÷ (N − 1)) • SD = √(6000 ÷ 9) = 25.82
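The calculation above can be checked with a short Python sketch. The variable names are illustrative, and the standard library's `statistics.stdev` (which also divides by N − 1) is used only as a cross-check.

```python
import math
import statistics

# Weights in pounds, from the example
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

n = len(weights)
mean = sum(weights) / n                      # M = 150
deviations = [x - mean for x in weights]     # x = X - M
squared = [d ** 2 for d in deviations]       # x squared
sum_of_squares = sum(squared)                # 6000

sd = math.sqrt(sum_of_squares / (n - 1))     # divide by N - 1, then take the root
print(round(sd, 2))                          # 25.82

# Cross-check against the standard library's sample SD
print(math.isclose(sd, statistics.stdev(weights)))  # True
```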

A little bit about notation • σ (“sigma”) = standard deviation in the reference population • s (lower case “s”) = standard deviation in the sample • The textbook uses “SD” for both

Standard Deviation Interpretation • Provides a “standard”: the SD indicates the average amount of deviation of scores from the mean • Tells you how wrong, on average, the mean is as a summary of the overall distribution • An SD provides valuable information when the distribution is normal: • Nearly all scores (about 99.7%) fall within three SDs above or below the mean in a normal distribution

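Standard Deviation Interpretation (cont’d) • In a normal distribution, a fixed percentage of cases lie within certain distances from the mean: • About 68% within 1 SD of the mean • About 95% within 2 SDs of the mean • About 99.7% within 3 SDs of the mean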

SDs and Individual Scores • A person who scores one SD below the mean has a higher score than 16% of the cases (2.3% + 13.6%) • A person who scores one SD above the mean has a higher score than 84% of the cases (50.0% + 34.1%)
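As a rough check on these percentages, the sketch below uses Python's `statistics.NormalDist` for a standard normal curve (mean 0, SD 1). The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Share of cases scoring below one SD under the mean, and below one SD over it
below_minus_one = std_normal.cdf(-1)
below_plus_one = std_normal.cdf(1)
print(f"{below_minus_one:.1%}")  # 15.9%
print(f"{below_plus_one:.1%}")   # 84.1%

# Share of cases within k SDs of the mean, for k = 1, 2, 3
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(k, f"{share:.1%}")
```

The slide's 16% and 84% are these values rounded to whole percentages.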

Standard Deviation: Advantages • Takes all data into account in describing variability • Is more stable as a measure of variability than the range or IQR • Lends itself to computation of other measures often used in inferential statistics • Is helpful in interpreting individual scores when data are distributed approximately normally

Standard Deviation: Disadvantages • Can be influenced by extreme scores • Not as “intuitive” or as easy to interpret as the range

Variance • An important variability concept in inferential statistics, but not used descriptively • The variance = SD² • In the earlier example, SD² = 25.82² = 666.67 • Not easily interpreted because it is not in the units of the original data; it is in units squared (here, pounds squared)

More about notation • σ (“sigma”) = standard deviation in the reference population • s (lower case “s”) = standard deviation in the sample • σ² (“sigma squared”) = variance in the reference population • s² = variance in the sample

Formulae for Variance • Population variance: σ² = Σ(X − μ)² ÷ N • Sample variance: s² = Σ(X − M)² ÷ (N − 1)
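To see how the two versions differ in practice, here is a small sketch using the weights from the earlier example. It relies on the standard library's `statistics.variance` (divides by N − 1) and `statistics.pvariance` (divides by N); treating the ten students as a whole population is only an assumption made for comparison.

```python
import statistics

# Weights in pounds, from the earlier example
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

sample_var = statistics.variance(weights)        # sum of squares / (N - 1)
population_var = statistics.pvariance(weights)   # sum of squares / N

print(round(sample_var, 2))             # 666.67
print(round(float(population_var), 2))  # 600.0
```

The sample variance is always a little larger, because it divides the same sum of squares by a smaller number.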

Measurement Scales and Descriptive Statistics

Relative Standing • Central tendency and variability indexes describe a distribution • There are also descriptive statistics to describe individual scores—i.e., their relative standing or position in a distribution: • Percentile ranks • Standard scores

Percentiles • A percentile is one one-hundredth of a distribution • Quartiles divide a distribution into quarters • Deciles divide a distribution into tenths • Each percentile, quartile, etc. can be determined in relation to a score in a distribution

Percentile Rank • A percentile rank is the location of a given score in the distribution: it communicates what percentage of cases fall at or below that value • Score → What percentile rank? • Percentile → What score?
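The "score → percentile rank" direction can be sketched in a few lines of Python. The function name `percentile_rank` is hypothetical, and the data are the ten weights from the earlier example.

```python
def percentile_rank(scores, value):
    """Percentage of cases that fall at or below the given value."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

# Weights in pounds, from the earlier example
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

print(percentile_rank(weights, 150))  # 60.0
print(percentile_rank(weights, 190))  # 100.0
```

Six of the ten students weigh 150 pounds or less, so a weight of 150 has a percentile rank of 60.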

Percentiles and Outliers • Outliers are often defined in relation to percentiles • There are: • Mild outliers • Extreme outliers

NOT what we’re talking about

“An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.” (Grubbs, via Wikipedia) • In this course (as per the textbook), an outlier is a value that lies more than 1.5 times the IQR below Q1 or above Q3

Outliers: Formal Definition • A mild outlier is a score that lies between 1.5 and 3.0 times the IQR below Q1 or above Q3 • An extreme outlier is a score that lies more than 3.0 times the IQR below Q1 or above Q3

Box Plots • A box plot (or box-and-whiskers plot) is a graphic depiction of a distribution that shows the median, the IQR, and the outer limits of values not considered outliers • Outlying cases can be shown on the box plot, with identifying information (e.g., an ID number)

Traditionally…

But for the purposes of this course (due to the textbook’s insistence)… The extent of the boxplot is NOT the range, but rather those data points that are NOT outliers

Box Plots (cont’d) • Bottom of “box” shows Q1 • Top of “box” shows Q3 • Horizontal line in box shows median • “Whiskers” show outer limits of what is NOT an outlier • In SPSS, a circle O indicates value and ID of a mild outlier • An asterisk * is for an extreme outlier

Box Plot Illustration – p52 • Textbook Heart Rate Data: • Q1 = 62 • Q2 = 66 = Median • Q3 = 68 • “Whiskers” limits: 53, 77 • Mild outliers: • 50 (#106), 45 (#105) • Extreme outliers: • 40 (#104), 90 (#103), • 95 (#102), 100 (#101)
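The outlier rules can be applied to these heart rate values with a short sketch. The function name `classify` is hypothetical, and how to treat a score sitting exactly on a boundary (exactly 1.5 or 3.0 times the IQR away) is an assumption here, since the slides do not say.

```python
def classify(value, q1, q3):
    """Label a score using the IQR-based outlier rules."""
    iqr = q3 - q1
    # Distance outside the box: below Q1 or above Q3 (0 if inside the box)
    distance = max(q1 - value, value - q3, 0)
    if distance > 3.0 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 62, 68  # textbook heart rate data, so IQR = 6

for heart_rate in (53, 77, 50, 45, 40, 90, 95, 100):
    print(heart_rate, classify(heart_rate, q1, q3))
```

With an IQR of 6, the mild cut-offs fall at 62 − 9 = 53 and 68 + 9 = 77 (the whisker limits), and the extreme cut-offs at 62 − 18 = 44 and 68 + 18 = 86.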

Box Plots Versus Histograms • Outliers can be seen in histograms, but box plots give more useful information about degree of extremity and ID numbers

(Stolen from wikipedia)

Standard Scores • Also called z-score or z-statistic or z-value or normal score • Is a measure of how far an observation is from the mean of its distribution • The z-score only has meaning if you know the parameters of the reference population • i.e.: μ and σ

Standard Scores • Standard scores—another index of “relative standing” helpful in interpreting raw scores • A standard score (also called a z score) is a score expressed in standard deviation units, in relative distance from the mean

Standard Scores (cont’d) • Standard score equation: • z = (X – M) ÷ SD • That is, the mean is subtracted from an individual score, then divided by the SD • For example: • M = 100, SD = 25, X = 125, z = 1.0 • M = 100, SD = 25, X = 50, z = -2.0
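The equation translates directly into code. A minimal sketch, with a hypothetical function name `z_score`:

```python
def z_score(x, mean, sd):
    """Express a raw score in SD units relative to the mean."""
    return (x - mean) / sd

print(z_score(125, 100, 25))  # 1.0
print(z_score(50, 100, 25))   # -2.0
```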

How is this useful? • Very useful in standardized testing (like MCAT, GRE, SAT, etc) • Allows us to: • Calculate the probability of a score occurring within a normal distribution • Compare two scores that are from different normal distributions

Calculating a Probability Using a z-score • For a normally distributed variable (such as MCAT scores in Canada), 95% of observations fall between z = -1.96 and z = +1.96, i.e. within 1.96 SDs of the mean.
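The link between 1.96 and 95% can be checked with Python's `statistics.NormalDist`, used here as a stand-in for a printed z table. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Share of observations between z = -1.96 and z = +1.96
share_within = std_normal.cdf(1.96) - std_normal.cdf(-1.96)
print(round(share_within, 3))  # 0.95

# Going the other way: which z leaves 2.5% in the upper tail?
z_upper = std_normal.inv_cdf(0.975)
print(round(z_upper, 2))  # 1.96
```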

Example • We know that the LSAT score in Canada is normally distributed. The mean mark is 60% and the SD is 15. So…. • What is the lowest mark among those who were in the top 10% of performers? • (Why? Because law schools will only take the top 10% and need to know what mark to make their cut-off)

Example • We know that the LSAT score in Canada is normally distributed. The mean mark is 60% and the SD is 15. So…. • The top 10% of performers begins at z = 1.282 • Cut-off mark: X = M + (z × SD) = 60 + (1.282 × 15) = 79.23% • We get the “1.282” by looking it up in a table, or using a z-score calculator http://www.fourmilab.ch/rpkp/experiments/analysis/zCalc.html
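Instead of a table or an online calculator, the same lookup can be sketched with Python's `statistics.NormalDist`. The mean of 60 and SD of 15 come from the example; the variable names are illustrative.

```python
from statistics import NormalDist

# The z-score that separates the top 10% from the bottom 90%
z_cutoff = NormalDist().inv_cdf(0.90)
print(round(z_cutoff, 3))  # 1.282

# The same cut-off expressed as a mark: mean 60, SD 15
lsat = NormalDist(mu=60, sigma=15)
cutoff_mark = lsat.inv_cdf(0.90)
print(round(cutoff_mark, 1))  # 79.2

# Equivalent by hand: mark = mean + z * SD
by_hand = 60 + z_cutoff * 15
```

So the lowest mark among the top 10% of performers is a little over 79%.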

Using z-scores to compare tests • A student is in two classes, English and Math. • She got 70% in English and 70% in Math and wants to know which class she’s doing better in • Why isn’t the answer automatically “English”?

Using z-scores to compare tests • A student is in two classes, English and Math.

Using z-scores to compare tests Since these scores are from two different distributions, we need to standardise them into z-scores so that they can be directly compared. This gives us:

Using z-scores to compare tests • How do we interpret this? • z = 0.67 means that the student performed 0.67 SDs above the mean in both classes. • This makes her above average in both classes, and she’s doing equally well in both. • (If we use a z-score calculator, we’d find out that z = 0.67 means that she’s in the top 25.1% of the class.)
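The "top 25.1%" figure can be reproduced with Python's `statistics.NormalDist` in place of an online z-score calculator. This is a sketch that assumes the class marks are normally distributed.

```python
from statistics import NormalDist

z = 0.67

# Share of the class scoring above a z-score of 0.67
share_above = 1 - NormalDist().cdf(z)
print(f"{share_above:.1%}")  # 25.1%
```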

Standard Scores (cont’d) • Standard scores have a mean of 0.0 and an SD of 1.0 • But z scores can be transformed mathematically to have any mean and SD • Most typical: • Mean = 500, SD = 100 (e.g., GRE, SAT) • Mean = 100, SD = 15 (e.g., IQ tests) • Mean = 50, SD = 10 (called T scores)
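The transformation is just "new mean plus z times new SD". A minimal sketch, with a hypothetical function name `rescale`:

```python
def rescale(z, new_mean, new_sd):
    """Convert a z score to a scale with the given mean and SD."""
    return new_mean + z * new_sd

# A score one SD above the mean, on two common scales
print(rescale(1.0, 500, 100))  # 600.0  (GRE/SAT-style scale)
print(rescale(1.0, 100, 15))   # 115.0  (IQ-style scale)

# A score two SDs below the mean on the IQ-style scale
print(rescale(-2.0, 100, 15))  # 70.0
```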

The Normal Distribution • Central Limit Theorem: • Under “mild” conditions, the sum (or mean) of a large number of independent random variables will be distributed approximately “normally”, whatever the distribution of the individual variables • For fun, go to: • http://www.math.csusb.edu/faculty/stanton/probstat/clt.html • This is an “applet” that you keep clicking on. It produces a graph of a random variable. You will see that it always ends up being a Normal curve
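If the applet is unavailable, a rough simulation gives the same impression. The sketch below is only an illustration under stated assumptions: it draws from a uniform distribution (which is flat, not bell-shaped), averages 30 draws at a time, and checks whether about 68% of those averages land within one SD of their centre, as a normal curve would predict. The sample size of 30 and the 10,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a uniform (flat, non-normal) distribution."""
    return statistics.fmean(random.random() for _ in range(n))

# Repeat the experiment many times and collect the sample means
means = [sample_mean(30) for _ in range(10_000)]

centre = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal curve, roughly 68% of values lie within 1 SD of the mean
share_within_1sd = sum(abs(m - centre) <= spread for m in means) / len(means)
print(round(centre, 2), round(share_within_1sd, 2))
```

Plotting a histogram of `means` would show the familiar bell shape, even though each individual draw is uniform.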
