Review for Final Exam

Review for Final Exam GEOG090 - Quantitative Methods in Geography Spring 2006

GEOG 090 – Quantitative Methods in Geography • The Scientific Method • Exploratory methods (descriptive statistics) • Confirmatory methods (inferential statistics) • Mathematical Notation • Summation notation • Pi notation • Factorial notation • Combinations

Summation Notation: Components • In $\sum_{i=1}^{n} x_i$, the upper limit n refers to where the sum of terms ends • $x_i$ indicates what we are summing up • Σ indicates we are taking a sum • The lower limit i = 1 refers to where the sum of terms begins

Summation Notation: Compound Sums • We frequently use tabular data (or data drawn from matrices), with which we can construct sums of both the rows and the columns (compound sums), using the subscript i to denote the row index and the subscript j to denote the column index • [Figure: table with rows indexed by i and columns indexed by j]
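The row, column, and compound sums described above can be sketched in a few lines of Python. The 2 x 3 table here is made up purely for illustration and is not course data.

```python
# Hypothetical 2 x 3 table: rows are indexed by i, columns by j.
table = [
    [1, 2, 3],
    [4, 5, 6],
]

# Row totals: for each fixed row i, sum over the column index j.
row_sums = [sum(row) for row in table]

# Column totals: for each fixed column j, sum over the row index i.
col_sums = [sum(col) for col in zip(*table)]

# Compound (double) sum over both i and j.
grand_total = sum(sum(row) for row in table)

print(row_sums, col_sums, grand_total)
```

Summing the row totals or the column totals gives the same grand total, which is a quick check on a compound sum done by hand.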

Pi Notation • Whereas the summation notation refers to the addition of terms, the product notation applies to the multiplication of terms • It is denoted by the capital Greek letter Π (pi), and is used in the same way as the summation notation

Factorial • The factorial of a positive integer, n, is equal to the product of the first n positive integers • Factorials are denoted by an exclamation point (n!) • There is also a convention that 0! = 1 • Factorials are not defined for negative integers or nonintegers

Combinations • Combinations refer to the number of possible outcomes that particular probability experiments may have • Specifically, the number of ways that r items may be chosen from a group of n items is denoted by: $C(n, r)$ or $\binom{n}{r} = \frac{n!}{r!\,(n - r)!}$
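As a quick check on the factorial and combination definitions, the sketch below computes the number of ways to choose r items from n using the standard formula n! / (r! (n - r)!) and compares it with Python's built-in `math.comb`. The values n = 5 and r = 2 are arbitrary illustrative choices.

```python
import math

n, r = 5, 2  # hypothetical values, chosen only for illustration

# Combinations from the factorial definition: n! / (r! * (n - r)!)
from_factorials = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

# The standard library computes the same quantity directly (Python 3.8+).
direct = math.comb(n, r)

# The convention 0! = 1 is built in.
zero_factorial = math.factorial(0)

print(from_factorials, direct, zero_factorial)
```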

C. Scales of Measurement • The data used in statistical analyses can be divided into four types: 1. The Nominal Scale 2. The Ordinal Scale 3. The Interval Scale 4. The Ratio Scale • As we progress through these scales, the types of data they describe have increasing information content

Which one is better: mean, median, or mode? • The mean is valid only for interval data or ratio data. • The median can be determined for ordinal data as well as interval and ratio data. • The mode can be used with nominal, ordinal, interval, and ratio data • Mode is the only measure of central tendency that can be used with nominal data

Which one is better: mean, median, or mode? • It also depends on the nature of the distribution • [Figure: four example distributions: multi-modal, unimodal symmetric, and two unimodal skewed]

Which one is better: mean, median, or mode? • It also depends on your goals • Consider a company that has nine employees with salaries of 35,000 a year, and their supervisor makes 150,000 a year • What if you are a recruiting officer for the company that wants to make a good impression on a prospective employee? • The mean is (35,000*9 + 150,000)/10 = 46,500 I would probably say: "The average salary in our company is 46,500" using mean Source: http://www.shodor.org/interactivate/discussions/sd1.html
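The salary example can be reproduced with Python's `statistics` module. The sketch below uses only the figures given on the slide and shows how far the mean sits from the median and mode when one salary is much larger than the rest.

```python
import statistics

# Nine employees at 35,000 and one supervisor at 150,000 (figures from the slide).
salaries = [35_000] * 9 + [150_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the supervisor
median_salary = statistics.median(salaries)  # the middle of the ordered salaries
mode_salary = statistics.mode(salaries)      # the most common salary

print(mean_salary, median_salary, mode_salary)
```

The mean is the only one of the three that moves away from 35,000, which is why the recruiting officer would prefer to quote it.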

Measures of Dispersion • Range • Variance • Standard deviation • Interquartile range • z-score • Coefficient of variation
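A minimal sketch of these measures, using a small made-up sample. It assumes the sample (n - 1) versions of the variance and standard deviation; quartile conventions differ between textbooks and software, so the interquartile range may not match a hand calculation exactly.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = statistics.mean(data)
value_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)      # sample standard deviation

# Quartiles; the default method is one of several conventions in use.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Coefficient of variation: dispersion relative to the mean.
cv = std_dev / mean

# z-scores: each observation's distance from the mean in standard deviations.
z_scores = [(x - mean) / std_dev for x in data]

print(value_range, variance, std_dev, iqr, cv)
```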

Further Moments of the Distribution • There are further statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance • 1st moment - Mean (describes central value) • 2nd moment - Variance (describes dispersion) • 3rd moment - Skewness (describes asymmetry) • 4th moment - Kurtosis (describes peakedness)

How to Graphically Summarize Data? • Histograms • Box plots

Functions of a Histogram • The function of a histogram is to graphically summarize the distribution of a data set • The histogram graphically shows the following: 1. Center (i.e., the location) of the data 2. Spread (i.e., the scale) of the data 3. Skewness of the data 4. Kurtosis of the data 5. Presence of outliers 6. Presence of multiple modes in the data

Box Plots • We can also use a box plot to graphically summarize a data set • A box plot represents a graphical summary of what is sometimes called a “five-number summary” of the distribution: minimum, 25th percentile, median, 75th percentile, and maximum • The interquartile range (IQR) spans the 25th to the 75th percentile • [Figure: box plot with these values labeled; Rogerson, p. 8]

How To Assign Probabilities to Experimental Outcomes? • There are numerous ways to assign probabilities to the elements of sample spaces • Classical method assigns probabilities based on the assumption of equally likely outcomes • Relative frequency method assigns probabilities based on experimentation or historical data • Subjective method assigns probabilities based on the assignor’s judgment or belief

Probability-Related Concepts • An event – Any phenomenon you can observe that can have more than one outcome (e.g., flipping a coin) • An outcome – Any unique condition that can be the result of an event (e.g., flipping a coin: heads or tails), a.k.a. a simple event or sample point • Sample space – The set of all possible outcomes associated with an event • e.g., flip a coin – heads (H) and tails (T) • e.g., flip a coin twice – HH, HT, TH, TT

Probability-Related Concepts • Associated with each possible outcome in a sample space is a probability • Probability is a measure of the likelihood of each possible outcome • Probability measures the degree of uncertainty • Each of the probabilities is greater than or equal to zero, and less than or equal to one • The sum of probabilities over the sample space is equal to one
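The two-flip sample space and the probability rules above can be checked with a short sketch. It assumes a fair coin, so every outcome is equally likely; that assumption is ours and is labeled in the code.

```python
from fractions import Fraction
from itertools import product

# Sample space for flipping a coin twice: HH, HT, TH, TT.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

# Assumption: a fair coin, so each outcome gets the same probability.
prob = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

# Each probability lies between 0 and 1, and together they sum to 1.
total = sum(prob.values())

# Probability of "at least one head": add up the outcomes that contain an H.
p_at_least_one_head = sum(p for outcome, p in prob.items() if "H" in outcome)

print(sample_space, total, p_at_least_one_head)
```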

Probability Rules • Rules for combining multiple probabilities • A useful aid is the Venn diagram - depicts multiple probabilities and their relations using a graphical depiction of sets • The rectangle that forms the area of the Venn diagram represents the sample (or probability) space, which we have defined above • Figures that appear within the sample space (e.g., A and B) are sets that represent events in the probability context, and their area is proportional to their probability (full sample space = 1)

Probability Mass Function • Example: # of malls in cities • p(X = xi) for the four values of xi: 1/6 = 0.167, 1/6 = 0.167, 1/6 = 0.167, 3/6 = 0.5 • [Plot: p(xi) against xi] • This plot uses thin lines to denote that the probabilities are massed at discrete values of this random variable

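• The probability of a continuous random variable X falling within an arbitrary interval [a, b] is given by: $P(a \le X \le b) = \int_a^b f(x)\,dx$ • Simply calculate the shaded area under f(x) between a and b; if we know the density function, we could use calculus • [Figure: density curve f(x) with the area between a and b shaded]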

Discrete Probability Distributions • The Uniform Distribution • The Binomial Distribution • The Poisson Distribution • Each is appropriately applied in certain situations and to particular phenomena

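[Figure: probability mass function and cumulative distribution function of the discrete uniform distribution, defined piecewise for x < a, a ≤ x ≤ b, and x > b] Source: http://en.wikipedia.org/wiki/Uniform_distribution_(discrete)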

The Uniform Distribution • Example – Predict the direction of the prevailing wind with no prior knowledge of the weather system’s tendencies in the area • We would have to begin with the idea that P(xNorth) = 1/4, P(xEast) = 1/4, P(xSouth) = 1/4, P(xWest) = 1/4 until we had an opportunity to sample and find out some tendency in the wind pattern based on those observations • [Plot: P(xi) = 0.25 for each of E, N, S, W]

The Binomial Distribution – Example • Naturally, we can plot the probability mass function produced by this binomial distribution: • xi: P(xi) • 0: 0.4096 • 1: 0.4096 • 2: 0.1536 • 3: 0.0256 • 4: 0.0016 • [Plot: P(xi) against xi = 0, 1, 2, 3, 4]
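The tabulated probabilities can be reproduced from the binomial formula. The slide does not state the parameters, so n = 4 trials and success probability p = 0.2 are assumptions inferred from the values themselves; with those choices the sketch below matches the table.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Assumption: n and p are inferred from the tabulated values, not given on the slide.
n, p = 4, 0.2

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(pmf):
    print(x, round(prob, 4))
```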

The Poisson Distribution • Poisson distribution: $P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$ • The shape of the distribution depends strongly upon the value of λ, because as λ increases, the distribution becomes less skewed, eventually approaching a normal-shaped distribution as it gets quite large • We can evaluate P(x) for any value of x, but large values of x will have very small values of P(x)
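A small sketch of the Poisson probability mass function, P(x) = e^(-λ) λ^x / x!. The value λ = 2 is a hypothetical choice for illustration; the loop shows how quickly P(x) shrinks for large x.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 2.0  # hypothetical value of lambda, chosen only for illustration

probs = [poisson_pmf(x, lam) for x in range(20)]

for x in (0, 1, 2, 5, 10, 15):
    print(x, probs[x])
```

Rerunning with a larger `lam` spreads the probabilities more symmetrically around the mean, which is the less-skewed shape the slide describes.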

Source: http://en.wikipedia.org/wiki/Normal_distribution

Finding the P(x) for Various Intervals (a > 0) • 1. P(Z ≥ a) = (table value): the table gives the value of P(x) in the tail above a • 2. P(Z ≤ a) = [1 – (table value)]: the total area under the curve = 1, and we subtract the area of the tail • 3. P(0 ≤ Z ≤ a) = [0.5 – (table value)]: the total area under the curve = 1, thus the area above 0 is equal to 0.5, and we subtract the area of the tail

Finding the P(x) for Various Intervals (a < 0; look up |a| in the table) • 4. P(Z ≤ a) = (table value): the table gives the value of P(x) in the tail below a, equivalent to P(Z ≥ a) when a is positive • 5. P(Z ≥ a) = [1 – (table value)]: this is equivalent to P(Z ≤ a) when a is positive • 6. P(a ≤ Z ≤ 0) = [0.5 – (table value)]: this is equivalent to P(0 ≤ Z ≤ a) when a is positive

Finding the P(x) for Various Intervals • 7. P(a ≤ Z ≤ b) if a < 0 and b > 0 • = (0.5 – P(Z < a)) + (0.5 – P(Z > b)) = 1 – P(Z < a) – P(Z > b) • or = [0.5 – (table value for a)] + [0.5 – (table value for b)] = [1 – {(table value for a) + (table value for b)}] • With this set of building blocks, you should be able to calculate the probability for any interval using a standard normal table
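The table-based building blocks can be checked against a direct calculation. In the sketch below, `upper_tail` is a hypothetical helper that stands in for a printed z-table giving the area in the tail above a non-negative value, and the limits a = -1 and b = 2 are made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def upper_tail(a):
    """Stand-in for a tail-area z-table: P(Z >= a) for a >= 0."""
    return 1.0 - Z.cdf(a)

a, b = -1.0, 2.0  # hypothetical interval with a < 0 and b > 0

# Case 7: 1 - [(table value for a) + (table value for b)],
# where the table value for a negative a is looked up at |a|.
p_from_table = 1.0 - (upper_tail(abs(a)) + upper_tail(b))

# Direct calculation from the cumulative distribution function.
p_direct = Z.cdf(b) - Z.cdf(a)

print(p_from_table, p_direct)
```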

The Central Limit Theorem • Suppose we draw a random sample of size n ($x_1, x_2, x_3, \ldots, x_{n-1}, x_n$) from a population random variable that is distributed with mean µ and standard deviation σ • Do this repeatedly, drawing many samples from the population, and then calculate the mean $\bar{x}$ of each sample • We will treat the $\bar{x}$ values as another distribution, which we will call the sampling distribution of the mean

The Central Limit Theorem • Given a distribution with a mean μ and variance $\sigma^2$, the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance $\sigma^2/n$ as n, the sample size, increases • The amazing and counterintuitive thing about the central limit theorem is that no matter what the shape of the original (parent) distribution, the sampling distribution of the mean approaches a normal distribution
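A small simulation can make the theorem concrete. The sketch below assumes an exponential parent distribution with mean 1 and variance 1 (a strongly skewed shape, chosen only for illustration), draws many samples of size n = 30, and compares the mean and variance of the sample means with μ and σ²/n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # size of each sample
num_samples = 2000  # number of repeated samples

# Assumed parent distribution: exponential with rate 1 (mean 1, variance 1).
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)     # should be close to mu = 1
var_of_means = statistics.variance(sample_means)  # should be close to 1 / n

print(mean_of_means, var_of_means, 1 / n)
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped distribution even though the parent distribution is far from normal.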

Confidence Intervals for the Mean • More generally, a (1 – α)*100% confidence interval around the sample mean is: $\bar{x} \pm z_\alpha \frac{\sigma}{\sqrt{n}}$ • Where $z_\alpha$ is the value taken from the z-table that is associated with a fraction α of the weight in the tails (and therefore α/2 is the area in each tail) • $\sigma/\sqrt{n}$ is the standard error, and $z_\alpha \sigma/\sqrt{n}$ is the margin of error
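A worked sketch of a 95% confidence interval for the mean, built as sample mean ± z × (σ / √n) with α/2 in each tail. The sample mean, σ, and n below are hypothetical numbers, and the z value is computed rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, for illustration only.
x_bar, sigma, n = 50.0, 10.0, 25
alpha = 0.05  # 95% confidence level

# z value that leaves alpha / 2 in each tail (about 1.96 for alpha = 0.05).
z = NormalDist().inv_cdf(1 - alpha / 2)

standard_error = sigma / sqrt(n)
margin_of_error = z * standard_error

ci = (x_bar - margin_of_error, x_bar + margin_of_error)
print(ci)
```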

Hypothesis Testing • One-sample tests • One-sample tests for the mean • One-sample tests for proportions • Two-sample tests • Two-sample tests for the mean

Hypothesis Testing 1. State the null hypothesis, H0 2. State the alternative hypothesis, HA 3. Choose α, our significance level 4. Select a statistical test, and find the observed test statistic 5. Find the critical value of the test statistic 6. Compare the observed test statistic with the critical value, and decide to accept or reject H0

Hypothesis Testing - Errors • Accept H0 when H0 is true: Correct decision (1 – α) • Accept H0 when H0 is false: Type II Error (β) • Reject H0 when H0 is true: Type I Error (α) • Reject H0 when H0 is false: Correct decision (1 – β)

p-value • The p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis H0 is true • It is the probability of wrongly rejecting the null hypothesis if it is in fact true • It is equal to the significance level of the test for which we would only just reject the null hypothesis

p-value • p-value vs. significance level • Small p-values → the null hypothesis is unlikely to be true • The smaller it is, the more convincing is the rejection of the null hypothesis

One-Sample t-Tests • Data: Acidity data has been collected for a population of ~6000 lakes in Ontario, with a mean pH of μ = 6.69 and σ = 0.83. A group of 27 lakes in a particular region of Ontario with acidic conditions is sampled and is found to have a mean pH of $\bar{x}$ = 6.16 and s = 0.60. • Research question: Are the lakes in that particular region more acidic than the lakes throughout Ontario?
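The observed test statistic for this example can be computed directly from the slide's numbers. The sketch assumes the usual one-sample form t = (sample mean − μ) / (s / √n) with n − 1 degrees of freedom; the critical value still has to come from a t-table at the chosen significance level.

```python
from math import sqrt

# Values from the slide.
mu = 6.69     # mean pH of all Ontario lakes
x_bar = 6.16  # sample mean pH in the region
s = 0.60      # sample standard deviation
n = 27        # number of lakes sampled

standard_error = s / sqrt(n)
t_obs = (x_bar - mu) / standard_error  # negative: the sample mean is below mu
df = n - 1

print(round(t_obs, 2), df)
```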

One-Sample Tests for Proportions • Data: A citywide survey finds that the proportion of households that own cars is p0 = 0.2. We survey 50 households and find that 16 of them own a car (p = 16/50 = 0.32) • Research question: Is the proportion of households in our survey that have a car different from the proportion found in the citywide survey?
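A sketch of the test statistic for this example, assuming the common large-sample form z = (p − p0) / √(p0 (1 − p0) / n), which the slide does not spell out. Because the research question asks whether the proportions are "different", the p-value below is two-tailed.

```python
from math import sqrt
from statistics import NormalDist

# Values from the slide.
p0 = 0.2     # citywide proportion
n = 50       # households surveyed
owners = 16  # households in the survey that own a car

p_hat = owners / n
standard_error = sqrt(p0 * (1 - p0) / n)  # uses p0, the value under H0
z_obs = (p_hat - p0) / standard_error

# Two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))

print(round(z_obs, 3), round(p_value, 4))
```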

Two-Sample t-tests • Variances are equal (homoscedasticity) • $t_{test} = \frac{|\bar{x}_1 - \bar{x}_2|}{s_p \sqrt{(1/n_1) + (1/n_2)}}$ • Pooled estimate of the standard deviation: $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ • df = n1 + n2 - 2
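The pooled two-sample calculation can be sketched as follows. The two groups are made-up numbers for illustration; the code pools the two sample variances into one standard deviation and then forms the t statistic with n1 + n2 − 2 degrees of freedom.

```python
import statistics
from math import sqrt

# Hypothetical samples, for illustration only.
group1 = [5.1, 4.9, 6.0, 5.5, 5.8]
group2 = [4.2, 4.8, 4.5, 5.0]

n1, n2 = len(group1), len(group2)
s1_sq = statistics.variance(group1)  # sample variance of group 1
s2_sq = statistics.variance(group2)  # sample variance of group 2

# Pooled estimate of the standard deviation (equal variances assumed).
sp = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

# Observed test statistic and its degrees of freedom.
mean_diff = abs(statistics.mean(group1) - statistics.mean(group2))
t_obs = mean_diff / (sp * sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(sp, t_obs, df)
```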
