Distributions & Descriptive statistics Dr William Simpson Psychology, University of Plymouth. Defining and measuring variables. Independent & dependent variables. Independent variable : something we manipulate in an experiment Dependent variable : something we measure
Distributions & Descriptive statistics • Dr William Simpson • Psychology, University of Plymouth
Independent & dependent variables • Independent variable: something we manipulate in an experiment • Dependent variable: something we measure • By manipulating the IV, we expect to produce a change in the DV
Scales of measurement • variables classified according to type of scale • type of analysis depends on type of scale • Worst to best: Nominal, ordinal, interval, ratio
Nominal • Nominal data: assign categorical labels to observations • Not really measurement • E.g. male/female; married/single/widowed/divorced • Numbers on football jerseys
Ordinal • Ordinal data: values can be ranked (ordered). Categorical but rankable • E.g. small, medium, large; movie rating 1-5; Likert scale • Can only be ranked. Rating scale is not like cm. The diff between one pair of adjacent ratings is not nec the same as the diff between another pair
Averaging a response of "strongly agree" (5) with two responses of "disagree" (2) would give us a mean of 3, but what is the meaning of that number?
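A minimal sketch of this arithmetic in R, using only the three responses from the example. The variable names are illustrative. The numeric mean can be computed, but it treats the gaps between categories as equal, which an ordinal scale does not guarantee; the median uses only rank information.

```r
# One "strongly agree" (5) and two "disagree" (2), as in the example
responses <- c(5, 2, 2)

likert_mean <- mean(responses)      # arithmetic mean: (5 + 2 + 2) / 3
likert_median <- median(responses)  # middle value after sorting

print(likert_mean)
print(likert_median)
```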
Interval • Interval data: ordinary measurement, e.g. temperature • Unlike ordinal data, we can say the diff between 1 & 2 deg C is same as diff between 4 & 5 deg
Ratio • Ordinary measurements, but with an absolute, non-arbitrary zero point • E.g. weight, length: any scale must start at zero • deg C: not ratio, because 0 arbitrarily set at freezing pt of water
Discrete & continuous variables • variables measured on interval & ratio scales are further identified as either: • discrete – Integers, no intermediate values. E.g. #Smarties in a box • continuous - measurable to any level of accuracy. E.g. Weight of Smarties contents
We have a pile of scores • Not all scores are equally likely • How were scores distributed?
Subjects were timed (in sec) while completing a problem-solving task: • 7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2
Stem & leaf • Two components: the stem and the leaf • In problem-solving example, stem = ones, leaf = tenths • Stems range between 5 and 9
7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2 • 5|98 • 6|821 • 7|6347 • 8|1182 • 9|2 • Key: 9|2 means 9.2
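The display above can be rebuilt by hand in R as a rough sketch: split each score into its ones digit (stem) and tenths digit (leaf), then paste the leaves for each stem together. The variable names are illustrative; R's built-in stem() produces a similar display but may choose a different layout.

```r
# Problem-solving times (sec) from the example
times <- c(7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2)

stems <- floor(times)                  # ones digit
leaves <- round((times - stems) * 10)  # tenths digit; round() guards against floating-point error

# Collapse the leaves belonging to each stem, keeping the order of the data
stem_leaf <- tapply(leaves, stems, paste, collapse = "")

# Print one "stem|leaves" row per stem
cat(paste(names(stem_leaf), stem_leaf, sep = "|"), sep = "\n")
```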
Heights in cm: 154, 143, 148, 139, 143, 147, 153, 162, 136, 147, 144, 143, 139, 142, 143, 156, 151, 164, 157, 149, 146 • Put 2 digits in stem; split stems 0-4, 5-9 • 13|969 • 14|334323 • 14|87796 • 15|431 • 15|67 • 16|24 • Key: 13|6 means 136
GSR values: 23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09 • Round each value to one decimal place • 23|3 • 24|188 • 25|0369 • 26|33 • 27|1 • Key: 23|3 means 23.3
Histogram • Alternative way to look at distribution • It is like a version of stem-and-leaf turned 90 deg
Example • Time to complete task (min): • 8 2 6 12 9 14 1 7 7 9 11 8 12 10 5 7 10 9 10 11 4 8 2 11 10 11 13 13 14 11 13 10 12 13 5 16 11 17 10 6 13 11 5 9 12 14 8 2 12 4
Sort scores into about 10 or so bins (similar to stem in stem-and-leaf)
Decide on sensible bins • Count the number of observations in each bin (length of each leaf in stem-and-leaf) • This number in each bin is called the frequency
time     frequency
0-1      1
2-3      3
4-5      5
6-7      5
8-9      8
10-11    13
12-13    10
14-15    3
16-17    2
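A frequency table like this can be tallied in R with cut() and table(). This is a sketch under the assumption that the bins are the two-minute intervals listed in the table; the names edges, bin_labels and freq are illustrative.

```r
# Time to complete task (min), from the example
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11,
       4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
       10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)

# Bin edges 0, 2, ..., 18; right = FALSE makes each bin [lower, upper),
# so integer scores 0-1 fall in the first bin, 2-3 in the second, and so on
edges <- seq(0, 18, by = 2)
bin_labels <- paste(edges[-length(edges)], edges[-1] - 1, sep = "-")

bins <- cut(x, breaks = edges, right = FALSE, labels = bin_labels)
freq <- table(bins)  # number of observations in each bin
print(freq)
```

Passing the same edges to hist(x, breaks = edges, right = FALSE) should draw a histogram with matching bins.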
This table is then used to make the histogram • Histogram is bar chart with frequency on y axis and score on x axis • Sometimes done other ways, e.g. connect the dots (frequency distrib polygon)
[Figure: histogram with Time (min) on the x axis and Frequency on the y axis]
In R:
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17, 10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)
hist(x)     # histogram
stem(x)     # stem-and-leaf display
boxplot(x)  # box plot
Probability distributions • Histogram is estimate of true probability distribution • Many theoretical probability distributions exist • Basis of statistical models used to make inferences about population
Binomial distribution • Binomial distribution is a discrete distribution • the binomial distribution applies when: • there is a series of n trials (e.g., 10 coin tosses) • only 2 possible outcomes per trial • outcomes are mutually exclusive (head or tail) • outcome of each trial independent of others
The binomial distribution gives the chance of getting each total number of ‘successes’ after doing all the (binary) trials of the expt • E.g. it gives the chance of getting 1, 2, or 3 girls after giving birth to 6 children • p = p(success) = p(girl) = 0.5 each trial • q = p(failure) = p(boy) = 1-p = 0.5 • n = number of trials = 6
Prob distribution where n = 6 and the prob of each outcome is 0.5 on each trial looks like: • [Figure: probability (y axis) against number of girls (x axis)]
For any probability distribution, the y-axis is given by a formula • For the binomial, it looks like this: $P(k) = \binom{n}{k} p^k q^{n-k}$ • k successes in n trials; $\binom{n}{k}$ is the binomial coefficient • you don’t need to know it
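In practice R evaluates the binomial probabilities directly. A minimal sketch for the six-children example (n = 6, p = 0.5), using the built-in dbinom() and checking it against the standard binomial formula written out with choose(); the variable names are illustrative.

```r
n <- 6    # number of trials (children)
p <- 0.5  # probability of a girl on each trial
k <- 0:n  # every possible number of girls

# Built-in binomial probabilities
prob <- dbinom(k, size = n, prob = p)

# The same values from the formula: choose(n, k) * p^k * q^(n - k), with q = 1 - p
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

print(data.frame(girls = k, probability = prob))
```

The probabilities sum to 1, and barplot(prob, names.arg = k) would draw the distribution.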
Normal distribution • Continuous probability distribution • Every probability distribution’s y-axis is given by a formula • For normal distribution, the y-axis (probability density) is:
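For reference, the standard form of the normal density is given below. The symbols are assumptions not defined in the slides: $\mu$ is the mean and $\sigma$ the standard deviation of the distribution.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

In R, dnorm(x, mean, sd) evaluates this density.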
We have a pile of scores • Have made stem-and-leaf, histogram • Want to summarise further: descriptive statistics
1. Centre (location) • What is the ‘typical’ score? If you were to make a prediction for a new score, what would it be?
a) Mean (average) • Mean = sum(x)/n
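A short sketch of this formula in R, applied to the problem-solving times from the stem-and-leaf example; the variable names are illustrative. The hand calculation should agree with the built-in mean().

```r
# Problem-solving times (sec) from the earlier example
times <- c(7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2)

n <- length(times)              # number of scores
manual_mean <- sum(times) / n   # Mean = sum(x)/n

print(manual_mean)
print(mean(times))              # built-in equivalent
```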
Mean as balance point • Imagine that each observation is a toy block • Place the blocks on a ruler; the position (1, 2, etc inches) represents the value • The balance point is the mean
1 2 2 3 • 1 2 2 5 • 1 2 2 9 • Mean is pulled towards extreme observation (outlier)
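The pull of the outlier can be checked numerically. A minimal sketch in R using the three sets of scores above; the list name samples is illustrative.

```r
# The three sets of scores; only the largest value changes
samples <- list(c(1, 2, 2, 3), c(1, 2, 2, 5), c(1, 2, 2, 9))

# Mean of each set: it rises as the extreme score moves further out
sample_means <- sapply(samples, mean)
print(sample_means)
```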
b) Median • Median is middle score; 50th percentile • useful when extreme scores (outliers) lie in one tail of distribution (skewed)
Calculate the median • Sort scores • If odd n, median is middle value • If even n, median is mean of 2 middle values • 25 13 9 18 1 -> 1 9 13 18 25; med=13 • 25 13 9 18 -> 9 13 18 25 • Median= (13+18)/2 = 15.5
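The steps above can be sketched as a small R function. manual_median is a hypothetical helper written for illustration; R's built-in median() does the same job.

```r
manual_median <- function(scores) {
  s <- sort(scores)  # step 1: sort the scores
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # even n: mean of the 2 middle values
  }
}

odd_result <- manual_median(c(25, 13, 9, 18, 1))
even_result <- manual_median(c(25, 13, 9, 18))
print(odd_result)
print(even_result)
```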
Median and outliers • 1 2 2 3 • 1 2 2 5 • 1 2 2 9 • Median = 2 in all cases
c) Mode • Mode is most frequently occurring score • Mean should really be used only for interval/ratio data. Mode good otherwise • E.g. mean movie rating – not really sensible. Mode sensible • Sometimes no unique mode exists (e.g. bimodal)
Bimodality can be due to mixture of two different populations (e.g. male and female)
[Figure: histogram of time to complete task (min), with Time (min) on the x axis and Frequency on the y axis] • Mean = 9.36 • Median = 10 • Mode = 11
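mean(x)
median(x)
# R has no built-in function for the statistical mode, so define one
Mode <- function(x) {
  ux <- unique(x)                        # distinct scores
  ux[which.max(tabulate(match(x, ux)))]  # score with the highest count
}
Mode(x)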
Likert scale • e.g. Brief Psychiatric Rating scale (BPRS) • Interview + observations of patient's behaviour over preceding 2–3 days • Each item scored 0-7
Suppose we have a new treatment • Does it reduce anxiety? • Define “anxiety” as score on Q2
We use BPRS on lots of patients • Compare treatment and placebo • How? Find mean(treatment) vs mean(placebo)?
The numbers 0-7 are not really numbers! • They have only rank (order) info • Ordinal