Intro to Statistics for the Behavioral Sciences PSYC 1900

Lecture 3: Central Tendency and Dispersion

Measures of Central Tendency • Numerical values that refer to the center of a distribution • Used to provide a “best descriptor” of the score for a sample • Usefulness or quality of the measure depends on shape of distribution • Mode, Median, and Mean

The Mode • Defined as the most common or frequent score • The value with the highest point on a frequency distribution of a variable • 3,4,1,5,7,1,2,3,1,1,6,1,7,2 • The mode = 1

The Mode • If two adjacent points occur with equal and greatest frequency, the mode can be considered the average of these two. • For example, if 3 and 4 tie for the highest frequency, Mode = 3.5

The Mode • If the two most frequent points are equal in frequency but not adjacent, the distribution is bimodal. • Of course, binning might merge them into a single mode by smoothing out error/noise. • In practice, “bimodal” usually means the two peaks are substantially separated.
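The three mode slides can be sketched in a few lines of Python. This is only an illustration: the helper name `modes` and the tied data set are made up for the example, and only the first list of scores comes from the slides.

```python
from collections import Counter

def modes(scores):
    """Return every score that ties for the highest frequency, sorted."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

# Scores from the first mode slide: 1 is the most frequent value.
slide_scores = [3, 4, 1, 5, 7, 1, 2, 3, 1, 1, 6, 1, 7, 2]
slide_modes = modes(slide_scores)

# Hypothetical scores (not from the lecture) where 3 and 4 tie.
tied_scores = [2, 3, 3, 3, 4, 4, 4, 5]
tied_modes = modes(tied_scores)

# Adjacent tie: report the average of the two tied scores.
averaged_mode = sum(tied_modes) / len(tied_modes)

print(slide_modes, tied_modes, averaged_mode)
```

Returning a list rather than a single number keeps the tie visible, so the choice between averaging two adjacent modes and calling the distribution bimodal stays with the analyst.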

The Median • Score that corresponds to the point at or below which 50% of scores fall • The “middle” number in a ranking of the data • Median Location • Mdn location = (N+1)/2 • If we have 11 numbers, the mdn location is: • (11+1)/2 = 6 • 1,1,2,3,3,3,4,4,5,5,6 • Mdn = 3

The Median • What about: 1,1,2,3,3,3,4,4,5,5,6,6 • Mdn location = (12+1) / 2 = 6.5 • Mdn = 3.5 • When the median location falls between points, the median is defined as the average of those two points.
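The median-location rule can be checked against both data sets from the slides. The sketch below assumes a helper named `median_by_location` (a name chosen here, not one from the lecture) that averages the two neighbouring scores whenever the location ends in .5.

```python
import math

def median_by_location(scores):
    """Return (median location, median) using location = (N + 1) / 2."""
    ordered = sorted(scores)
    location = (len(ordered) + 1) / 2
    # Locations are 1-based; a .5 location falls between two scores.
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return location, (lower + upper) / 2

odd_n = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6]       # N = 11
even_n = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6]   # N = 12

odd_result = median_by_location(odd_n)
even_result = median_by_location(even_n)
print(odd_result, even_result)
```

When N is odd, the floor and ceiling of the location are the same position, so the same score is averaged with itself and the middle value is returned unchanged.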

Median: Histogram vs. Stem and Leaf
Stem-and-Leaf Plot
Frequency    Stem &  Leaf
    2.00        1 .  00
    1.00        2 .  0
    3.00        3 .  000
    2.00        4 .  00
    2.00        5 .  00
    2.00        6 .  00
Stem width: 1.00
Each leaf: 1 case(s)

The Mean • The average value • The sum of the scores divided by the number of scores • 2,4,5,9,11 • (2+4+5+9+11)=31; 31/5=6.2

Relations Among Measures of Central Tendency • When the distributions are symmetric, the three measures will generally correspond. • When the distributions are asymmetric, they will often diverge.

The Mode: Advantages & Disadvantages • Mode is the most commonly occurring score. • Always appears in the data; mean and median may not. • Most likely score to occur. • Useful for nominal data; mean and median are not. • When might the mode be useful?

Loaded Dice
Frequency    Stem &  Leaf
   11.00        1 .  00000000000
    1.00        2 .  0
    2.00        3 .  00
    3.00        4 .  000
    4.00        5 .  0000
    5.00        6 .  00000
    6.00        7 .  000000
    5.00        8 .  00000
    4.00        9 .  0000
    3.00       10 .  000
    2.00       11 .  00
    1.00       12 .  0
• The mode is your best bet. • Median is not the highest probability. • Mean does not even occur in sample.
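To see the three claims on the loaded-dice slide in numbers, the frequencies from the display can be expanded into individual outcomes. This is a small sketch using Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Outcome -> frequency, read off the loaded-dice display.
frequencies = {1: 11, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5,
               7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}

# Expand the table into one list entry per observed roll.
rolls = [value for value, count in frequencies.items() for _ in range(count)]

mode_value = max(frequencies, key=frequencies.get)  # most frequent outcome
median_value = median(rolls)                         # middle of the 47 rolls
mean_value = mean(rolls)                             # 263 / 47

print(mode_value, median_value, round(mean_value, 2))
print("mean occurs in sample:", mean_value in frequencies)
```

The mode lands on the loaded outcome, the median lands on an outcome that is less frequent than the mode, and the mean is a fractional value that no roll can produce.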

Disadvantages of The Mode • Mode can vary depending on how data are grouped/binned • May not be representative of entire distribution • Loaded Dice Example • Rare events (e.g., most frequent is zero) • Tells us nothing about cause of nonzero events

Advantages & Disadvantages of the Mean and Median • Let me tell you a story . . . • Better known as: ALWAYS look at your data distributions

Men, Women, Evolution, & Sex • Is there a gender difference in the number of desired partners? • Evolutionary psychologists say “yes” due to an asymmetry in minimum parental investment needs. • Data appeared to support this

Men, Women, Evolution, & Sex • Mean # partners in next 30 years: • Men = 7.69; Women = 2.78 • You can’t blame men; it’s in their nature! • Yes? No? Any ideas?

Means versus Medians • These folks never considered the form of their data (or did they?) • Without winsorization, men’s mean = 64

Means: Men = 7.69; Women = 2.78 • Medians and Modes = 1

Advantages & Disadvantages of the Mean and Median • Mean • Subject to bias by extreme values • May provide a value for central tendency that does not exist in the data set • Major benefit is historical use and ability to be manipulated algebraically • Most mathematical equations depend on it • When assumptions are met, it is quite valid • Median • Not influenced by extreme values (e.g., salaries, home values) • Not as amenable to algebraic manipulation and use
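A quick way to see the mean's sensitivity to extreme values is to add one extreme score and recompute both statistics. The sketch below borrows the scores 1,2,3,4,4,5,6,7 and the extreme value 80 that the range slide uses later in this lecture; the variable names are chosen here for illustration.

```python
from statistics import mean, median

scores = [1, 2, 3, 4, 4, 5, 6, 7]
with_extreme = scores + [80]   # same scores plus one extreme value

mean_before, median_before = mean(scores), median(scores)
mean_after, median_after = mean(with_extreme), median(with_extreme)

# How far did each statistic move when the extreme score was added?
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```

A single extreme score roughly triples the mean while the median does not move at all, which is the pattern behind the means-versus-medians story above.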

Measures of Variability/Dispersion • The degree to which individual data points are distributed around the mean • Provide a measure of how representative the mean is of the scores

Several Measures • Range • Distance from lowest to highest values • 1,2,3,4,4,5,6,7; Range = 7-1 = 6 • Suffers from sensitivity to extremes • 1,2,3,4,4,5,6,7,80; Range = 80-1 = 79 • Interquartile Range • Range of the middle 50% of scores • Less dependent on extreme values • Trimmed samples and statistics
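The two range examples above, plus an interquartile range for comparison, can be sketched as follows. One assumption to flag: quartile conventions differ between textbooks and software, and this sketch simply uses the default method of Python's `statistics.quantiles`, which may not match the hinge-based definition used for box plots.

```python
from statistics import quantiles

def value_range(scores):
    """Distance from the lowest to the highest score."""
    return max(scores) - min(scores)

def interquartile_range(scores):
    """Q3 - Q1 using statistics.quantiles' default quartile method."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1

scores = [1, 2, 3, 4, 4, 5, 6, 7]
with_extreme = scores + [80]

range_before, range_after = value_range(scores), value_range(with_extreme)
iqr_before = interquartile_range(scores)
iqr_after = interquartile_range(with_extreme)

print(range_before, range_after)
print(iqr_before, iqr_after)
```

The range jumps from 6 to 79 when the extreme score is added, while the interquartile range barely changes because it only looks at the middle half of the scores.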

Average Deviation • Conceptually clear • How far individual scores deviate from the mean on average • Problem is that the average deviation from the mean is, by definition, zero • 1,2,3,3,4,5 • Deviations: -2,-1,0,0,1,2 • Average Deviation = 0

The Variance • Solves the problem that deviations sum to zero • Variance is defined as the average of the squared deviations about the mean • Squares of negative numbers are positive • Divide by N-1, not N • Sample Variance is used to estimate Population Variance
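The step from deviations to variance can be traced with the scores from the average-deviation slide (1,2,3,3,4,5). This is a minimal sketch; the variable names are chosen here, and `statistics.variance` is used only as a cross-check because it also divides by N-1.

```python
import statistics

scores = [1, 2, 3, 3, 4, 5]
n = len(scores)
mean = sum(scores) / n

# Raw deviations cancel out, so their sum (and average) is zero.
deviations = [x - mean for x in scores]
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the squared deviations no longer cancel.
squared_deviations = [d ** 2 for d in deviations]
sample_variance = sum(squared_deviations) / (n - 1)

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(scores)

print(deviations, sum_of_deviations)
print(sample_variance, library_variance)
```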

The Variance • Data: 1,2,3,3,4,4,4,5,6 • Volunteer?

Standard Deviation • Square root of the variance • Roughly, the typical deviation of a score from the mean • Returns the measure to the original (unsquared) units of the scores
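As a worked check, the variance and standard deviation of the exercise data 1,2,3,3,4,4,4,5,6 can be computed step by step. This sketch uses plain Python with `statistics.stdev` as a cross-check; the variable names are chosen here for illustration.

```python
import math
import statistics

scores = [1, 2, 3, 3, 4, 4, 4, 5, 6]
n = len(scores)
mean = sum(scores) / n                              # 32 / 9

sum_squared_dev = sum((x - mean) ** 2 for x in scores)
sample_variance = sum_squared_dev / (n - 1)         # divide by N-1
standard_deviation = math.sqrt(sample_variance)     # back to original units

# Cross-check against the standard library's sample standard deviation.
library_sd = statistics.stdev(scores)

print(round(mean, 3), round(sample_variance, 3), round(standard_deviation, 3))
```

The variance is in squared score units, while the standard deviation is in the same units as the scores themselves, which is why it is usually the one reported.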

Computational Formulae • Algebraic manipulations are less clear conceptually but easy to use
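The formula itself did not survive on this slide, so the sketch below assumes the usual textbook computational form, in which the sum of squared deviations is obtained as the sum of the squared scores minus (sum of scores) squared over N. Both function names are chosen here, and the data are the exercise scores from the variance slide.

```python
import math

def variance_definitional(scores):
    """Sum of squared deviations about the mean, divided by N-1."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def variance_computational(scores):
    """Same quantity from two running totals: sum of X and sum of X squared."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)

scores = [1, 2, 3, 3, 4, 4, 4, 5, 6]
by_definition = variance_definitional(scores)
by_computation = variance_computational(scores)

print(by_definition, by_computation)
print(math.isclose(by_definition, by_computation))
```

The computational version never forms the individual deviations, which made it convenient for hand calculation; the definitional version is the one that shows what variance means.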

Mean and Variance as Estimators • These descriptive statistics are used to estimate parameters

Bias in Sample Variance • If we calculated the average squared deviation of the sample (as opposed to dividing by N-1), the variance would be a biased estimate of the population variance. • Bias: A property of a statistic whose long-range average is not equal to the parameter it estimates.

Bias in Sample Variance • Why does using N produce bias? • Expected value is the long range avg. of a statistic over repeated samples.

Applet Example

Correcting the bias • Multiply the average squared deviation (the divide-by-N variance) by the constant N/(N-1)
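The bias can be verified exactly, without simulation, by listing every possible sample from a small population and averaging the statistic over all of them. The sketch below assumes a made-up population (the faces of a fair die) and samples of size 3 drawn with replacement; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def sum_squared_dev(sample):
    """Sum of squared deviations about the sample mean, as an exact fraction."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
pop_mean = Fraction(sum(population), len(population))
pop_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

n = 3
# Every equally likely sample of size n drawn with replacement (6**3 = 216).
samples = list(product(population, repeat=n))

# Expected value = average of the statistic over all possible samples.
expected_div_n = sum(sum_squared_dev(s) / n for s in samples) / len(samples)
expected_div_n_minus_1 = sum(sum_squared_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_variance, expected_div_n, expected_div_n_minus_1)
print(expected_div_n * Fraction(n, n - 1) == pop_variance)
```

Dividing by N gives an expected value that falls short of the population variance by exactly the factor (N-1)/N, so multiplying by N/(N-1), which is the same as dividing by N-1 in the first place, removes the bias.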

Box-and-Whisker Plots • Graphical representations of dispersion • Quite useful to quickly visualize nature of variability and extreme scores

Box-and-Whisker Plots • First find the median location and mdn • Find the quartile locations • Medians of the upper and lower half of distribution • Quartile location = (mdn location + 1) / 2 • These are termed the “hinges” • Note: drop fractional values of mdn location • Hinges bracket interquartile range (IQR) • Hinges serve as top and bottom of box

Box-and-Whisker Plots • Find the H-spread • Range between the two quartiles (hinges) • Simply the IQR • Area inside box in plot • Draw the whiskers • Lines from each hinge to the farthest point that lies no more than 1.5 × H-spread from that hinge • Outliers • Points beyond whiskers • Denoted with asterisks
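The box-plot recipe from these two slides can be followed step by step in code. This is a sketch under stated assumptions: the function names are chosen here, a location ending in .5 is taken to mean the average of the two neighbouring scores, and the data are the scores with one extreme value (1,2,3,4,4,5,6,7,80) from the range slide.

```python
import math

def value_at(ordered, location):
    """Score at a 1-based location; a .5 location averages its two neighbours."""
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return (lower + upper) / 2

def box_plot_summary(scores):
    ordered = sorted(scores)
    median_location = (len(ordered) + 1) / 2
    # Drop any fractional part of the median location before halving again.
    quartile_location = (math.floor(median_location) + 1) / 2
    lower_hinge = value_at(ordered, quartile_location)
    upper_hinge = value_at(ordered[::-1], quartile_location)  # count from the top
    h_spread = upper_hinge - lower_hinge
    low_limit = lower_hinge - 1.5 * h_spread
    high_limit = upper_hinge + 1.5 * h_spread
    inside = [x for x in ordered if low_limit <= x <= high_limit]
    return {
        "median": value_at(ordered, median_location),
        "hinges": (lower_hinge, upper_hinge),
        "h_spread": h_spread,
        "whiskers": (min(inside), max(inside)),  # farthest points within limits
        "outliers": [x for x in ordered if x < low_limit or x > high_limit],
    }

summary = box_plot_summary([1, 2, 3, 4, 4, 5, 6, 7, 80])
print(summary)
```

With these scores the box runs from 3 to 6, the whiskers reach 1 and 7, and 80 is flagged as the lone outlier that would be drawn with an asterisk.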

Box-and-Whisker Plots
Stem-and-Leaf Plot
Frequency    Stem &  Leaf
    2.00        0 .  11
    3.00        0 .  223
    3.00        0 .  445
    6.00        0 .  667777
    3.00        0 .  889
    1.00 Extremes    (>=15)
Stem width: 10.00
Each leaf: 1 case(s)

Example
