STAT 231 Winter 2011
Introduction • Niall MacGillivray • 2B Actuarial Science
Agenda • 7:05 – 7:15 Data Types and Transformations • 7:15 – 7:35 PPDAC • 7:35 – 8:10 Data Summaries • 8:10 – 8:15 Bivariate Risk Measures • 8:15 – 8:40 Probability Models • 8:40 – 9:00 Likelihood Functions and MLEs
What is Statistics? Statistics is the science of design and collection of data used to draw conclusions about a larger population.
Data Types • Discrete: countable (whole-number) values • e.g. Number of students in Stat 231 born in 1991 • Continuous: measured data on the real number line • e.g. Ages of Stat 231 students • Categorical: non-numerical, pre-determined categories • e.g. Months of birth of Stat 231 students • Binary: categorical data with exactly two categories • e.g. Born in 1991? (yes/no)
Data Types continued • Ordinal: data that has an underlying order • e.g. Final Stat 230 grades of students in Stat 231 • Grouped/Frequency: numerical counts of occurrences in each category • e.g. Number of Pure Math/Act Sci/Stats students in Stat 231 • A dataset is a collection of data • Can include several different data types
Transformations • Transforming data from one form to another using a transformation function can simplify the data and/or resolve comparison issues • Transformation types: • Monotone increasing: preserves ranking, i.e. ranks of {x1,x2,...,xn} = ranks of {F(x1),F(x2),...,F(xn)} • Monotone decreasing: reverses rankings • Affine: a linear rescaling plus a shift (y = ax + b) • Coding: categorical data to numerical data • Ranking: ordering data from smallest to largest
Example 1 If the temperature at which a certain compound melts is a random variable with mean value 120°C and standard deviation 2°C, what are the mean temperature and standard deviation measured in °F? (Hint: °F = 1.8°C + 32).
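A minimal sketch of the Example 1 calculation, using the affine rules E(aX + b) = aE(X) + b and SD(aX + b) = |a|·SD(X); the numbers are the ones given in the example.

```python
# Affine transformation of mean and standard deviation (Example 1 sketch).
a, b = 1.8, 32          # degrees F = 1.8 * degrees C + 32
mean_c, sd_c = 120, 2   # given: mean 120 C, standard deviation 2 C

mean_f = a * mean_c + b   # E(aX + b) = a*E(X) + b  -> 248 F
sd_f = abs(a) * sd_c      # SD(aX + b) = |a|*SD(X)  -> 3.6 F

print(mean_f, sd_f)
```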
Problem • “A clear statement of what we are trying to achieve” • Key Terms: • Unit: individual in the population • Variate: characteristic of a unit • Attribute: characteristic of the population • The problem is defined in terms of attributes of the population
Aspect • Aspects (type of problem) • Descriptive (exploring a target population attribute) • What is the average age of death for smokers in Canada? • What are the average marks for STAT 230 and STAT 231? • Causative (linking explanatory and response variates) • Does smoking lead to lung cancer? • Does a high mark in STAT 230 indicate the individual will get a high mark in STAT 231? • Predictive (predicting value of response variate) • Given that a male, age 30, smokes, what is the predicted age of mortality? • If I know an individual’s mark in STAT 230, can I predict his mark in STAT 231?
Population • Target Pop. (units we want to investigate) • University Students • Study Pop. (units which could have been selected) • Laurier Students • Sample (units actually selected) • Laurier Students selected for the study • Subsets • Sample is a subset of study population • Study population not necessarily a subset of target population
Error and Plan • Study Error (Study vs. Target) • Possible consequence: making the wrong conclusion about our target population • Sample Error (Sample vs. Study) • Is present because we use a subset to make a conclusion on a larger population • Can only be reduced, but never eliminated • Plan: how we execute the study • Experimental vs. Observational plans
Example 2 PROBLEM: An auto manufacturer wants to know the average distance cars registered in Ontario go between oil changes. PLAN: Canadian Tire is asked to collect data on the distance driven since the last oil change for all cars registered in Ontario whose oil they change during the last week in February. If the odometer reading at the last oil change is not available, a car will not be included in the sample.
Data • After we’ve collected data, it’s important to summarize it in a form that is clear and concise • Potential Issues: • Outliers: extreme observations • Bias: systematic error from improper data collection • Missing observations: values that are suspicious or unavailable are sometimes omitted, leaving gaps in the dataset
Our Collected Data Observed Data: Ages of 12 individuals randomly selected from a room. { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } Sample Size: n = 12
Averages { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } Measures of Average • Mean • Arithmetic: x̄ = (x1 + x2 + ... + xn)/n • Geometric: (x1·x2·...·xn)^(1/n) • Median • Q2: 50% of the data lies above, 50% lies below • Mode • The most frequently occurring data point(s)
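A quick check of these summaries for the sample above (a sketch; the log-based geometric mean is used only to mirror the definition):

```python
import statistics as stats
from math import exp, log

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

arithmetic_mean = stats.mean(ages)                            # ~ 20.58
geometric_mean = exp(sum(log(x) for x in ages) / len(ages))   # (x1*...*xn)^(1/n)
median = stats.median(ages)                                   # (19 + 20)/2 = 19.5
mode = stats.mode(ages)                                       # 16, the most frequent value

print(arithmetic_mean, geometric_mean, median, mode)
```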
Pie Charts • Frequency: # of occurrences • Relative Frequency: proportion of occurrences
Histograms • Frequency Histogram • Height (area) of each bar is the # of occurrences within each interval • Relative Frequency Histogram • Height (area) of each bar is the proportion of occurrences within each interval • Determining an interval size • (Max – Min)/desired # of intervals
Frequency Histogram { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
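A minimal sketch of how the interval rule and the frequency counts could be computed for this sample; the choice of 4 intervals is only for illustration.

```python
import numpy as np

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

k = 4                                         # desired number of intervals (illustrative)
width = (max(ages) - min(ages)) / k           # (40 - 4)/4 = 9
edges = [min(ages) + i * width for i in range(k + 1)]   # [4, 13, 22, 31, 40]

counts, _ = np.histogram(ages, bins=edges)    # frequency histogram heights: [1, 6, 4, 1]
rel_freq = counts / len(ages)                 # relative frequency histogram heights
print(edges, counts, rel_freq)
```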
Example 3 Relative Frequency Histogram Estimate the number of electronic components in the sample which took at least 8 hours to fail, if there was a total of 300 items in the sample.
Cumulative Frequency Plot (CDF) • X-axis: data points • Y-axis: sum of the relative frequencies of all data points up to x
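A sketch of the cumulative relative frequencies for the same sample, assuming the plot simply steps up by 1/n at each sorted data point (for tied values, the last printed entry is the cumulative value).

```python
import numpy as np

ages = np.sort([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])
cum_rel_freq = np.arange(1, len(ages) + 1) / len(ages)   # cumulative relative frequency

for x, F in zip(ages, cum_rel_freq):
    print(x, F)          # e.g. the cumulative relative frequency at 19 is 6/12 = 0.5
```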
Lorenz Curves • A CDF-style plot used to illustrate income inequality • Shows the percentage (y%) of total income held by the poorest x% of households • 45-degree line: line of perfect equality (LPE) • Gini Coefficient = (area between the Lorenz curve and the LPE) / (area between the LPE and the line of perfect inequality, LPI)
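A minimal sketch (not from the course notes) of how a Lorenz curve and Gini coefficient could be computed for a small list of household incomes; the trapezoid-rule approximation and the function name `gini` are my own choices.

```python
import numpy as np

def gini(incomes):
    """Gini coefficient from the discrete Lorenz curve (trapezoid-rule sketch;
    assumes incomes are non-negative)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    cum_income = np.insert(np.cumsum(x), 0, 0.0) / x.sum()   # Lorenz curve y-values
    cum_people = np.linspace(0, 1, len(x) + 1)               # poorest x% of households
    area_under_lorenz = np.trapz(cum_income, cum_people)
    # Area between LPE and Lorenz curve, divided by area between LPE and LPI (= 1/2).
    return (0.5 - area_under_lorenz) / 0.5

print(gini([10, 20, 30, 40]))    # mild inequality
print(gini([0, 0, 0, 100]))      # strong inequality for 4 households -> 0.75
```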
Model of Tipping Points • How many people will do something, given how many other people are expected to do it • Can be illustrated using a modified Lorenz curve • Equilibria: points intersecting the 45° line • Stable Equilibria: points at which small deviations from equilibria will result in a return to equilibria, regardless of the direction of deviation • Unstable Equilibria: tipping points at which small deviations from equilibria will not result in a return to equilibria
Example 4 (from Asst. 1) 100 students are in a class. Let N = the actual number of students clapping and NE be the number of students expected to clap. The relationship between N and NE is given as follows: N = 0.5NE if NE <= 20 N = 2NE – 30 if 20 < NE <= 50 N = 0.5NE + 45 if 50 < NE <= 90 N = 90 if NE >= 90 Illustrate this graphically. Equilibria? Stable Equilibria? Tipping Points?
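A short sketch that encodes the piecewise relationship and scans for equilibria (points where N = NE, i.e. where the curve meets the 45° line); checking the slope on either side of each equilibrium then suggests 0 and 90 are stable and 30 is the tipping point.

```python
def clapping(n_e):
    """Number who actually clap, N, given the number expected to clap, NE."""
    if n_e <= 20:
        return 0.5 * n_e
    elif n_e <= 50:
        return 2 * n_e - 30
    elif n_e <= 90:
        return 0.5 * n_e + 45
    return 90

# Equilibria are the points where N = NE.
equilibria = [ne for ne in range(0, 101) if clapping(ne) == ne]
print(equilibria)   # [0, 30, 90]
```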
Variability and Spread { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } • Sample Variance: s² = Σ(xi − x̄)² / (n − 1) • Population Variance: σ² = Σ(xi − μ)² / n • Percentile • The p-th percentile is the data point located at position number (p/100)·(n + 1) • Use linear interpolation if necessary • Interquartile Range (IQR) = Q3 (75th percentile) – Q1 (25th percentile)
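A sketch of these measures for the sample above; the percentile helper follows the (p/100)(n+1) positioning rule stated here, and the names are my own.

```python
ages = sorted([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])
n = len(ages)
xbar = sum(ages) / n

sample_var = sum((x - xbar) ** 2 for x in ages) / (n - 1)   # ~ 74.8

def percentile(data, p):
    """p-th percentile of sorted data at position (p/100)*(n+1), with linear interpolation."""
    pos = (p / 100) * (len(data) + 1)
    lo = int(pos)                     # whole part of the position
    if lo < 1:
        return data[0]
    if lo >= len(data):
        return data[-1]
    return data[lo - 1] + (pos - lo) * (data[lo] - data[lo - 1])

q1, q3 = percentile(ages, 25), percentile(ages, 75)   # 16 and 24.75
iqr = q3 - q1                                         # 8.75
print(sample_var, q1, q3, iqr)
```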
Box and Whisker Plot Steps: • Calculate Q1, Q2 (median), Q3, and IQR • Draw a horizontal line representing the scale of measurement, and a box spanning Q1 to Q3, with a line drawn at Q2 • Calculate outlier boundaries (dotted lines): • lower fence = Q1 – 1.5*IQR, upper fence = Q3 + 1.5*IQR • Mark any outliers with a * or o on the graph • Draw whiskers connecting the largest and smallest measurements (upper/lower adjacent values) that are not outliers to the box
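Carrying on with the same sample, a quick sketch of the fence and outlier calculations; 40 ends up flagged as an outlier, while 4 does not because it sits above the lower fence.

```python
# Continues from the quartiles computed in the previous sketch.
q1, q3 = 16, 24.75
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr     # 2.875
upper_fence = q3 + 1.5 * iqr     # 37.875

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]
outliers = [x for x in ages if x < lower_fence or x > upper_fence]   # [40]
whisker_lo = min(x for x in ages if x >= lower_fence)                # 4  (lower adjacent value)
whisker_hi = max(x for x in ages if x <= upper_fence)                # 28 (upper adjacent value)
print(outliers, whisker_lo, whisker_hi)
```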
QQ Plots • Theoretical Quantiles • Quartiles, percentiles, etc. of a known distribution • e.g. the 95th theoretical quantile is the value a with P(X ≤ a) = 0.95 • Sample Quantiles • Quartiles, percentiles, etc. of the observed sample • 2 uses of QQ plots • Sample vs. Theoretical Quantiles (points near the 45° line = good fit) • Sample vs. Sample Quantiles (points near a straight line = similar distributions)
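A minimal sketch of a sample-vs.-theoretical QQ plot against a Gaussian model; the use of SciPy's probplot is my choice, not something from the slides, and its reference line plays the role of the 45° line once the sample is standardized.

```python
from scipy import stats
import matplotlib.pyplot as plt

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

# Sample vs. theoretical (Gaussian) quantiles; points close to the reference
# line suggest the Gaussian model is a reasonable fit.
stats.probplot(ages, dist="norm", plot=plt)
plt.show()
```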
Measures of Association • Relative Risk of event A, comparing when event B occurs versus when it does not: RR = P(A | B) / P(A | not B) • RR > 1: positive association between A and B • Association does not imply causation!
Example 6 Given the following frequency table for individuals grouped according to whether they smoke or not and their education level: Calculate the relative risk of smoking if a person has a PhD education.
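As a sketch of the calculation only: the counts below are hypothetical, since the frequency table from Example 6 is not reproduced in these notes.

```python
# Hypothetical 2x2 counts (NOT the Example 6 table): education level vs. smoking.
smokers_phd, total_phd = 10, 100          # P(smoke | PhD) = 0.10
smokers_other, total_other = 60, 300      # P(smoke | not PhD) = 0.20

relative_risk = (smokers_phd / total_phd) / (smokers_other / total_other)
print(relative_risk)   # 0.5 -> negative association between smoking and a PhD in this made-up table
```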
Measures of Association • Correlation Coefficient: ρ = Cov(X, Y) / (σX·σY) • Sample version: r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)²·Σ(yi − ȳ)²] • Measures the linear relationship between two random variables • ρ > 0: positive correlation; vice-versa for ρ < 0 • |ρ| = 1: X and Y are linearly related
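A small sketch of the sample correlation; the paired marks below are invented purely for illustration.

```python
import numpy as np

# Hypothetical paired data: STAT 230 vs. STAT 231 marks for six students.
stat230 = [62, 70, 75, 81, 88, 93]
stat231 = [58, 66, 72, 80, 85, 90]

r = np.corrcoef(stat230, stat231)[0, 1]   # sample correlation coefficient
print(r)                                  # close to 1 -> strong positive linear relationship
```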
Example 7 • In this example the point (47, 41) is called an influential outlier: an outlying observation whose removal would noticeably change the fitted relationship (e.g. the correlation)
Time Series Graphs • The explanatory variate is time • The response variate is the measured variable of interest at time t • Neighbouring points are joined by straight lines rather than being left as a simple scatter plot • Time series graphs can be used to look at trends, seasonal patterns, etc.
Statistical Science • Statistics is the science of design and collection of data used to draw conclusions about a larger population. • When we collect this data, we’re always going to have uncertainty • We fit our data to known probability models to quantify these uncertainties
Terminology • Descriptive Statistics (Chapter 1) • Tools and techniques used to describe certain attributes of a population • Graphs, charts, numerical summaries • Statistical Inference (Rest of Course) • A problem solving method using data to draw general conclusions on a population
Statistical Inference • Estimation Problems • After collection of data, we fit the data to probability models • Using the collected data, form estimates for the parameters of the models • Hypothesis Testing • Accepting or rejecting a statement about the target population
Probability Models • Random Variables • Represent what we’re going to measure in our experiment • Realizations • Represent the actual data we’ve collected from our experiment
Probability Functions • CDF: F(x) = Σ f(t) over t ≤ x (discrete) or F(x) = ∫ f(t) dt from −∞ to x (cts.) • E[g(X)] = Σ g(x)f(x) (disc.) or ∫ g(x)f(x) dx (cts.) • Var(X) = E(X²) – [E(X)]² • E(aX + b) = aE(X) + b • Var(aX + b) = a²Var(X) • P(a ≤ X ≤ b) = Σ f(x) over a ≤ x ≤ b (discrete) or ∫ f(x) dx from a to b (cts.)
Example 8 A random variable X has a continuous probability model with a cumulative distribution function (cdf) Give an expression for the expected value of Do not evaluate any sums or integrals.
Probability Models • Uniform (discrete/cts. data over a specified range) • Discrete: P(X=x) = 1/(b−a+1), a ≤ x ≤ b • Continuous: f(x) = 1/(b−a), a ≤ x ≤ b • No unknown parameters to estimate once the range [a, b] is specified • Binomial (binary data) • Fixed number of trials (n) and fixed probability (π) of success on each (Bernoulli) trial • P(X=x; π) = (n choose x) π^x (1−π)^(n−x); x = 0,1,…,n • Bernoulli = Bin(1, π)
Probability Models • Multinomial (discrete grouped/frequency data) • P(X1=x1,…,Xk=xk; π1,…,πk) = n!/(x1!···xk!) · π1^x1 ··· πk^xk, where 0 ≤ xi ≤ n, Σxi = n, Σπi = 1 • Poisson (discrete count data) • Events occur at a constant rate (λ) • P(X=x; λ) = λ^x e^(−λ)/x!; x = 0,1,2,… • Exponential (continuous data) • Waiting time between events occurring at rate λ • f(x; λ) = λe^(−λx); x > 0
Gaussian Distribution and the CLT • Gaussian Distribution • f(x; μ, σ) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²)), −∞ < x < ∞ • If Y ~ G(μ,σ), then Z = (Y − μ)/σ ~ G(0,1) • If Y1,…,Yn are independent G(μ1,σ1),…,G(μn,σn): • ΣaiYi ~ G(Σaiμi, √(Σai²σi²)) • Central Limit Theorem (CLT) • For any iid RVs W1,W2,…,Wn with mean μ and s.d. σ: if W̄ = (1/n)ΣWi, then E(W̄) = μ, SD(W̄) = σ/√n, and (W̄ − μ)/(σ/√n) is approximately G(0,1) for large n
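A small simulation sketch of the CLT; the exponential choice, seed, and sample sizes are arbitrary. Sample means of non-Gaussian data behave approximately like G(μ, σ/√n).

```python
import numpy as np

rng = np.random.default_rng(2011)

# Averages of n iid Exponential(rate 1) variables (mean 1, sd 1) should be
# approximately G(1, 1/sqrt(n)) for large n.
n, reps = 50, 10_000
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)

print(means.mean())          # close to mu = 1
print(means.std(ddof=1))     # close to sigma/sqrt(n) = 1/sqrt(50), about 0.141
```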
Example 9 We are given that non-diabetics have glucose levels represented by a random variable which follows a G(5.31, 0.58) distribution. Diabetics have glucose levels represented by a random variable which follows a G(11.74, 3.5) distribution. When taking a test, if the person’s glucose level measures higher than 6.5, they will be diagnosed as diabetic. • If a person is diabetic, what is the probability that he/she is diagnosed correctly? • What is the probability that a non-diabetic is diagnosed as diabetic?
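A worked sketch of Example 9 using SciPy, assuming (as in the course notation) that the second parameter of G(·,·) is the standard deviation.

```python
from scipy.stats import norm

# (a) P(diagnosed diabetic | diabetic) = P(Y > 6.5) for Y ~ G(11.74, 3.5)
p_correct = 1 - norm.cdf(6.5, loc=11.74, scale=3.5)          # about 0.933
# (b) P(diagnosed diabetic | non-diabetic) = P(Y > 6.5) for Y ~ G(5.31, 0.58)
p_false_positive = 1 - norm.cdf(6.5, loc=5.31, scale=0.58)   # about 0.020
print(p_correct, p_false_positive)
```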
Response Model • Problem: what is μ, the average of the attribute of interest in the target population? • We will use our collected data to estimate μ • Let Y be a random variable that represents the measured response variate • Y = μ + R, where R ~ G(0, σ) • Hence Y ~ G(μ, σ) • μ is the systematic (fixed) part, while R is the random (variable) part
Maximum Likelihood Estimation • Binomial: π̂ = x/n, where x = # of successes • Response model: μ̂ = ȳ = (y1 + … + yn)/n, where yi is the ith realization • Maximum Likelihood Estimation • A procedure used to determine a parameter estimate given any model
Maximum Likelihood Estimation • First, we assume the data we collect will follow a distribution • Before collection, the sample is a set of random variables • {Y1, Y2, …, Yn} • After collection, the sample is a set of realizations • {y1, y2, …, yn} • We know the distribution of Yi (up to unknown parameters), hence we know its PDF/PMF
Likelihood Function • The Likelihood Function: L(θ) = ∏ f(yi; θ), θ ∈ Ω, where f(yi; θ) is the probability (discrete) or probability density (continuous) of the ith observation • Likelihood: the probability of observing the dataset you have • We want to choose the estimate of the parameter θ that gives the largest such probability • Ω is the parameter space, the set of possible values for θ • Relative Likelihood: R(θ) = L(θ)/L(θ̂)
MLE Process • Step One: Define the likelihood function L(θ) • Step Two: Define the log likelihood function l(θ) = ln[L(θ)] • Step Three: Take the derivative with respect to θ • Step Four: Set the derivative equal to zero and solve to arrive at the maximum likelihood estimate • Step Five: Plug in data values (if given) to arrive at a numerical maximum likelihood estimate
Examples 10/11 Discrete: What is the MLE of a geometric distribution with pmf ? Assume you draw k realizations from your sample. Continuous: Given Y ~ Exp(θ), with realizations y1,y2,…yn , find the maximum likelihood estimate of θ. What is the MLE for the realizations {3, 2, 1, 4}?
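A sketch of the continuous part of Example 11, done numerically so the MLE steps are visible in code. Whether Exp(θ) means rate θ (as on the earlier slide, f(y; θ) = θe^(−θy)) or mean θ changes the answer, so both closed forms are noted; the rate parameterization is assumed below.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([3, 2, 1, 4])

# Log likelihood under the rate parameterization: l(theta) = n*ln(theta) - theta*sum(y).
# Under the mean parameterization the MLE would instead be ybar = 2.5.
def neg_log_lik(theta):
    return -(len(y) * np.log(theta) - theta * y.sum())

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
print(result.x)              # about 0.4, matching the closed form below
print(len(y) / y.sum())      # closed-form MLE for the rate: n / sum(y) = 4/10 = 0.4
```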