
STAT 231 MIDTERM 1 Fall 2010

STAT 231 MIDTERM 1 Fall 2010. Introduction. Jeffrey Baer 3B Actuarial Science Work terms at Manulife and Towers Watson Waterloo SOS President, May 2009 – Aug 2010. Agenda. 8:05 – 8:15 Data Types and Transformations 8:15 – 8:35 PPDAC 8:35 – 9:10 Data Summaries


Presentation Transcript


  1. STAT 231 MIDTERM 1 Fall 2010

  2. Introduction • Jeffrey Baer • 3B Actuarial Science • Work terms at Manulife and Towers Watson • Waterloo SOS President, May 2009 – Aug 2010

  3. Agenda • 8:05 – 8:15 Data Types and Transformations • 8:15 – 8:35 PPDAC • 8:35 – 9:10 Data Summaries • 9:10 – 9:15 Bivariate Risk Measures • 9:15 – 9:40 Probability Models • 9:40 – 10:00 Likelihood Functions and MLEs

  4. What is Statistics? Statistics is the science of design and collection of data used to draw conclusions about a larger population.

  5. Data Types • Discrete: countable (whole numbers), finite • e.g. Number of students in Stat 231 born in 1991 • Continuous: measured data using real number line • e.g. Age of Stat 231 students • Categorical: non-numerical, pre-determined categories • e.g. Months of birth of Stat 231 students • Binary: categorical data with two categories • e.g. Born in 1991?

  6. Data Types continued • Ordinal: data that has an underlying order • e.g. Final Stat 230 grades of students in Stat 231 • Grouped/Frequency: numerical, # of occurrences in a category • e.g. Number of Pure Math/Act Sci/Stats students in Stat 231 • A Dataset is a collection of data • Can include several different data types

  7. Transformations • Transforming data from one form to another using a transformation function can simplify data and/or solve comparison issues • Transformation types: • Monotone increasing: preserves ranking, i.e. ranks of {x1,x2,...,xn} = ranks of {F(x1),F(x2),...,F(xn)} • Monotone decreasing: reverses ranking • Affine: linear transformation (y = Ax + B) • Coding: categorical data to numerical data • Ranking: ordering data from smallest to largest

  8. Example 1 If the temperature at which a certain compound melts is a random variable with mean value 120°C and standard deviation 2°C what are the mean temperature and standard deviation measured in °F? (Hint: °F = 1.8°C + 32).
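
Example 1 can be checked with the affine rule: for y = Ax + B the mean becomes A·mean + B and the standard deviation becomes |A|·sd (the shift B does not affect spread). A minimal Python sketch; the function name `affine_mean_sd` is illustrative and not from the slides:

```python
def affine_mean_sd(mean, sd, a, b):
    """Mean and SD of y = a*x + b, given the mean and SD of x."""
    # The shift b moves the mean only; the scale a stretches both.
    return a * mean + b, abs(a) * sd


# Melting point: mean 120 °C, SD 2 °C, converted with °F = 1.8 °C + 32.
mean_f, sd_f = affine_mean_sd(120.0, 2.0, 1.8, 32.0)
print(mean_f, sd_f)
```

This gives a mean of about 248 °F and a standard deviation of about 3.6 °F.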

  9. PPDAC

  10. Problem • “A clear statement of what we are trying to achieve” • Key Terms: • Unit: individual in the population • Variate: characteristic of a unit • Attribute: characteristic of the population • The problem is defined in terms of attributes of the population

  11. Aspect • Aspects (type of problem) • Descriptive (exploring a target population attribute) • What is the average age of death for smokers in Canada? • What are the average marks for STAT 230 and STAT 231? • Causative (linking explanatory and response variates) • Does smoking lead to lung cancer? • Does a high mark in STAT 230 indicate the individual will get a high mark in STAT 231? • Predictive (predicting value of response variate) • Given that a male, age 30, smokes, what is the predicted age of mortality? • If I know an individual’s mark in STAT 230, can I predict his mark in STAT 231?

  12. Population • Target Pop. (units we want to investigate) • University Students • Study Pop. (units which could have been selected) • Laurier Students • Sample (units actually selected) • Laurier Students selected for the study • Subsets • Sample is a subset of study population • Study population not necessarily a subset of target population

  13. Error and Plan • Study Error (Study vs. Target) • Possible consequence: making the wrong conclusion about our target population • Sample Error (Sample vs. Study) • Is present because we use a subset to make a conclusion on a larger population • Can only be reduced, but never eliminated • Plan: how we execute the study • Experimental vs. Observational plans

  14. Example 2 PROBLEM: An auto manufacturer wants to know the average distance cars registered in Ontario go between oil changes. PLAN: Canadian Tire is asked to collect data on the distance driven since the last oil change for all cars registered in Ontario whose oil they change during the last week in February. If the odometer reading at the last oil change is not available, a car will not be included in the sample.

  15. Data • After we’ve collected data, it’s important to summarize it in a form that is clear and concise • Potential Issues: • Outliers: extreme observations • Bias: systematic error from improper data collection • Missing observations: suspicious -> omitted

  16. Our Collected Data Observed Data: Ages of 12 individuals randomly selected from a room. { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } Sample Size: n = 12

  17. Averages { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } Measures of Averages • Mean • Arithmetic: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Geometric: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ • Median • Q2: 50% of the data lies above, 50% lies below • Mode • The most frequently occurring data point(s)
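
As a quick check of these measures on the slide's dataset, here is a short sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean` and `multimode`):

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

arithmetic_mean = statistics.mean(ages)            # sum of values divided by n
geometric_mean = statistics.geometric_mean(ages)   # n-th root of the product
median = statistics.median(ages)                   # mean of the two middle values when n is even
modes = statistics.multimode(ages)                 # list of all most-frequent values

print(arithmetic_mean, geometric_mean, median, modes)
```

For this dataset the median is 19.5 and the only mode is 16; the geometric mean comes out below the arithmetic mean, as it does for any positive data that are not all equal.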

  18. Pie Charts • Frequency: # of occurrences • Relative Frequency: proportion of occurrences

  19. Histograms • Frequency Histogram • Height (area) of each bar is the # of occurrences within each interval • Relative Frequency Histogram • Height (area) of each bar is the proportion of occurrences within each interval • Determining an interval size • (Max – Min)/desired # of intervals

  20. Frequency Histogram { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
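
The interval-size rule and the bar heights can be sketched as follows. The choice of 4 intervals is an assumption made for illustration; the slides do not say how many were used.

```python
ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]
k = 4  # desired number of intervals (assumed, not from the slides)

lo, hi = min(ages), max(ages)
width = (hi - lo) / k  # (Max - Min) / desired # of intervals

counts = [0] * k
for x in ages:
    # Left-closed intervals; the maximum is placed in the last interval.
    idx = min(int((x - lo) / width), k - 1)
    counts[idx] += 1

relative = [c / len(ages) for c in counts]  # heights of the relative frequency histogram
print(width, counts, relative)
```

The frequencies always sum to n and the relative frequencies to 1, which is a useful sanity check when reading counts off a histogram.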

  21. Example 3 Relative Frequency Histogram Estimate the number of electronic components in the sample which took at least 8 hours to fail, if there was a total of 300 items in the sample.

  22. CDF: Cumulative Frequency Plot • X-axis: data points • Y-axis: sum of all relative frequencies for data points up to x

  23. Lorenz Curves • CDF plot used to illustrate income inequality • Shows percentage (y%) of total income held by poorest x% of households • 45-degree line: line of perfect equality (LPE) • Gini coefficient: (area between Lorenz curve and LPE) / (area between LPE and the line of perfect inequality, LPI)

  24. Model of Tipping Points • How many people will do something, given how many other people are expected to do it • Can be illustrated using a modified Lorenz curve • Equilibria: points intersecting the 45° line • Stable Equilibria: points at which small deviations from equilibria will result in a return to equilibria, regardless of the direction of deviation • Unstable Equilibria: tipping points at which small deviations from equilibria will not result in a return to equilibria

  25. Example 4 (from Asst. 1) 100 students are in a class. Let N = the actual number of students clapping and NE be the number of students expected to clap. The relationship between N and NE is given as follows: • N = 0.5NE if NE <= 20 • N = 2NE – 30 if 20 < NE <= 50 • N = 0.5NE + 45 if 50 < NE <= 90 • N = 90 if NE > 90 • Illustrate this graphically. Equilibria? Stable Equilibria? Tipping Points?
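
One way to check an answer to Example 4 numerically is to look for whole-number values where N = NE and then test which side of the 45° line the curve lies on nearby. This is a sketch under the assumption that NE takes integer values from 0 to 100; the function names are illustrative.

```python
def actual_clappers(ne):
    """Piecewise relationship between expected (NE) and actual (N) clappers."""
    if ne <= 20:
        return 0.5 * ne
    if ne <= 50:
        return 2 * ne - 30
    if ne <= 90:
        return 0.5 * ne + 45
    return 90.0


# Equilibria: points where the curve meets the 45-degree line (N = NE).
equilibria = [ne for ne in range(0, 101) if actual_clappers(ne) == ne]


def is_stable(ne, step=1):
    """Stable if a small deviation in either direction is pushed back toward ne."""
    # Just below ne, N must exceed NE (pushes up); just above, N must fall short of NE.
    below_ok = ne == 0 or actual_clappers(ne - step) > ne - step
    above_ok = ne == 100 or actual_clappers(ne + step) < ne + step
    return below_ok and above_ok


stable = [ne for ne in equilibria if is_stable(ne)]
tipping_points = [ne for ne in equilibria if not is_stable(ne)]
print(equilibria, stable, tipping_points)
```

Under these assumptions the curve crosses the 45° line at 0, 30 and 90; the crossings at 0 and 90 are stable, and 30 is the tipping point because the curve has slope 2 (greater than 1) there.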

  26. Variability and Spread { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } • Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ • Population Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ • Percentile • The p-th percentile is the data point located at position number (p/100)*(n + 1) in the sorted data • Use linear interpolation if necessary • Interquartile Range (IQR) = Q3 (75th percentile) – Q1 (25th percentile)
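
The two variance formulas differ only in the divisor (n – 1 versus n). A brief sketch with the standard `statistics` module, treating the 12 ages first as a sample and then, purely for comparison, as a whole population:

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]
n = len(ages)

sample_var = statistics.variance(ages)       # divides the sum of squares by n - 1
population_var = statistics.pvariance(ages)  # divides the sum of squares by n

# Both are built from the same sum of squared deviations.
print(sample_var, population_var, sample_var * (n - 1) / n)
```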

  27. How to find Percentiles
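
A minimal sketch of the position rule (p/100)*(n + 1) with linear interpolation. The helper name `percentile` is illustrative, and clamping positions that fall outside 1..n to the nearest end of the data is an assumption, since the slides do not cover that case.

```python
def percentile(sorted_data, p):
    """p-th percentile using position (p/100)*(n + 1) and linear interpolation."""
    n = len(sorted_data)
    pos = (p / 100) * (n + 1)      # 1-based position in the sorted data
    pos = min(max(pos, 1), n)      # assumption: clamp to the ends of the data
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part, used for interpolation
    if lower >= n:
        return float(sorted_data[-1])
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)


ages = sorted([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])
q1, q2, q3 = (percentile(ages, p) for p in (25, 50, 75))
print(q1, q2, q3, q3 - q1)
```

For example, Q3 sits at position 0.75 × 13 = 9.75, three quarters of the way from the 9th value (24) to the 10th (25), which gives 24.75.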

  28. Box and Whisker Plot Steps: • Calculate Q1, Q2 (median), Q3, and IQR • Draw a horizontal line representing scale of measurement, and a box surrounding Q1 and Q3, with a line drawn for Q2 • Calculate outlier boundaries (dotted lines): • lower fence = Q1 – 1.5*IQR, upper fence = Q3 + 1.5*IQR • Mark any outliers with a * or o on the graph • Draw whiskers connecting the largest and smallest measurements (upper/lower adjacent values) that are not outliers to the box

  29. Example 5 Draw a Box and Whisker Plot for the dataset { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
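
The numbers needed to draw this plot can be computed as below. The sketch uses `statistics.quantiles` with `method="exclusive"`, which places the i-th of m sorted points at i/(m + 1) and so matches the (n + 1) position rule for quartiles; Python 3.8 or later is assumed.

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

# Quartiles under the (n + 1) position rule.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="exclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in ages if x < lower_fence or x > upper_fence]
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)  # adjacent values

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```

With this dataset the fences are 2.875 and 37.875, so 40 is marked as an outlier while 4 is not, and the whiskers run from 4 to 28.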

  30. QQ Plot • Theoretical Quantiles • Quartiles, percentiles, etc. of known distribution • 95th theoretical quantile: the value α with P(X <= α) = 0.95 • Sample Quantiles • 2 uses of QQ plots • Sample vs. Theoretical Quantile (45° line = good fit) • Sample vs. Sample Quantile (straight line = similar distribution)

  31. Measures of Association • Relative Risk (of event A provided event B occurs or does not occur): $RR = \frac{P(A \mid B)}{P(A \mid \bar{B})}$ • RR > 1: positive association between A and B • Association does not imply causation!
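
Relative risk compares the proportion with A among units where B occurs to the proportion with A among units where B does not occur. A small sketch from a two-way frequency table; the counts below are hypothetical and chosen only for illustration, and the function name is not from the slides.

```python
def relative_risk(a_and_b, total_b, a_and_not_b, total_not_b):
    """Estimated P(A | B) / P(A | not B) from frequency-table counts."""
    risk_given_b = a_and_b / total_b
    risk_given_not_b = a_and_not_b / total_not_b
    return risk_given_b / risk_given_not_b


# Hypothetical counts: 30 of 100 units in group B have A,
# and 15 of 100 units outside group B have A.
rr = relative_risk(30, 100, 15, 100)
print(rr)
```

Here the relative risk is about 2, so A is twice as frequent when B occurs; a value below 1 would indicate a negative association.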

  32. Example 6 Given the following frequency table for individuals grouped according to whether they smoke or not and their education level: Calculate the relative risk of smoking if a person has a PHD education.

  33. Measures of Association: Correlation Coefficient • $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$ • Measures linear relationship between two random variables • ρ > 0: positive correlation; ρ < 0: negative correlation • |ρ| = 1: X and Y are perfectly linearly related
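
For paired data, the sample version of this ratio can be sketched as follows. The common divisor (n or n – 1) cancels between the covariance and the two standard deviations, so it is omitted. The data are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squares; the 1/(n - 1) factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_line_up = correlation(x, [3, 5, 7, 9, 11])    # exact line, positive slope
r_line_down = correlation(x, [10, 8, 6, 4, 2])  # exact line, negative slope
r_noisy = correlation(x, [2, 1, 4, 3, 5])       # positive but not perfect
print(r_line_up, r_line_down, r_noisy)
```

Points lying exactly on a line give a correlation of 1 or –1 (up to rounding), depending on the sign of the slope.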

  34. Example 7 • (47, 41) is called an influential outlier

  35. Time Series Graphs • The explanatory variate is time • The response variate is the measured variable of interest at time t • Neighbouring points are joined by straight lines rather than a simple scatter plot • Time series graphs can be used to look at trends, seasonal patterns, etc.

  36. Statistical Science • Statistics is the science of design and collection of data used to draw conclusions about a larger population. • When we collect this data, we’re always going to have uncertainty • We fit our data to known probability models to quantify these uncertainties

  37. Terminology • Descriptive Statistics (Chapter 1) • Tools and techniques used to describe certain attributes of a population • Graphs, charts, numerical summaries • Statistical Inference (Rest of Course) • A problem solving method using data to draw general conclusions on a population

  38. Statistical Inference • Estimation Problems • After collection of data, we fit the data to probability models • Using the collected data, form estimates for the parameters of the models • Hypothesis Testing • Accepting or rejecting a statement about the target population

  39. Probability Models • Random Variables • Represent what we’re going to measure in our experiment • Realizations • Represent the actual data we’ve collected from our experiment

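  40. Probability Functions • CDF: $F(x) = P(X \le x) = \sum_{t \le x} f(t)$ (discrete) or $\int_{-\infty}^{x} f(t)\,dt$ (cts.) • $E[g(X)] = \sum_{x} g(x) f(x)$ (discrete) or $\int_{-\infty}^{\infty} g(x) f(x)\,dx$ (cts.) • Var(X) = E(X^2) – [E(X)]^2 • E(aX + b) = aE(X) + b • Var(aX + b) = a^2 Var(X) • $P(a \le Y \le b) = \sum_{a \le y \le b} f(y)$ (discrete) or $\int_{a}^{b} f(y)\,dy$ (cts.)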
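
The discrete versions of these identities can be verified exactly on a small example. The sketch below assumes a fair six-sided die as the distribution, which is an illustration and not part of the slides, and uses `fractions.Fraction` to avoid rounding.

```python
from fractions import Fraction

# Assumed example: a fair die, f(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expect(g):
    """E[g(X)] = sum of g(x) * f(x) over the support."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect(lambda x: x)
var = expect(lambda x: x * x) - mean ** 2          # Var(X) = E(X^2) - [E(X)]^2

a, b = 3, 2
mean_t = expect(lambda x: a * x + b)               # should equal a*E(X) + b
var_t = expect(lambda x: (a * x + b) ** 2) - mean_t ** 2  # should equal a^2 * Var(X)

prob_2_to_4 = sum(p for x, p in pmf.items() if 2 <= x <= 4)  # P(2 <= X <= 4)
print(mean, var, mean_t, var_t, prob_2_to_4)
```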

  41. Example 8 A random variable X has a continuous probability model with a cumulative distribution function (cdf) Give an expression for the expected value of Do not evaluate any sums or integrals.

  42. Probability Models • Binomial (binary data) • Fixed number of trials (n) and fixed probability (π) of success on each (Bernoulli) trial • $P(X=x; n, \pi) = \binom{n}{x}\pi^{x}(1-\pi)^{n-x}$ ; x = 0,1,…,n • Poisson (discrete data) • Events occur at a constant rate (λ) • $P(X=x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$ ; x = 0,1,2,… • Exponential (continuous data) • Waiting time between events occurring at rate λ • $f(x; \lambda) = \lambda e^{-\lambda x}$ ; x > 0
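
The three models can be written as small functions using their standard forms. The function names and the parameter values in the example are illustrative; `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(x, n, pi):
    """P(X = x) for n independent trials with success probability pi."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) when events occur at constant rate lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def exponential_pdf(x, lam):
    """Density of the waiting time between events occurring at rate lam."""
    return lam * math.exp(-lam * x) if x > 0 else 0.0


# Sanity checks: probabilities over the whole support should sum to 1.
binomial_total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
poisson_total = sum(poisson_pmf(x, 2.5) for x in range(100))
print(binomial_total, poisson_total, exponential_pdf(1.0, 2.0))
```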

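  43. Gaussian Distribution and CLT Gaussian Distribution • $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ • If Y ~ G(μ,σ), then $Z = \frac{Y-\mu}{\sigma}$ ~ G(0,1) • If Y1,Y2,…,Yn are independent G(μ1,σ1), G(μ2,σ2), … , G(μn,σn): • $\sum_{i=1}^{n} Y_i \sim G\left(\sum_{i=1}^{n}\mu_i,\ \sqrt{\sum_{i=1}^{n}\sigma_i^2}\right)$ Central Limit Theorem (CLT) • For any iid RVs W1,W2,…,Wn with mean μ and s.d. σ: • If $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$, then $E(\bar{W}) = \mu$ and $SD(\bar{W}) = \sigma/\sqrt{n}$ • $\frac{\bar{W}-\mu}{\sigma/\sqrt{n}} \sim G(0,1)$ approximately, for large n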

  44. Example 9 We are given that non-diabetics have glucose levels represented by a random variable which follows a G(5.31, 0.58) distribution. Diabetics have glucose levels represented by a random variable which follows a G(11.74, 3.5) distribution. When taking a test, if the person’s glucose level measures higher than 6.5, they will be diagnosed as diabetic. • If a person is diabetic, what is the probability that he/she is diagnosed correctly? • What is the probability that a non-diabetic is diagnosed as diabetic?
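
Both parts of Example 9 reduce to an upper-tail probability at the cut-off 6.5. A sketch using `statistics.NormalDist` (Python 3.8 or later), reading G(μ, σ) as mean μ and standard deviation σ, as in the standardization Z = (Y – μ)/σ:

```python
from statistics import NormalDist

threshold = 6.5
diabetic = NormalDist(mu=11.74, sigma=3.5)
non_diabetic = NormalDist(mu=5.31, sigma=0.58)

# A diabetic is diagnosed correctly when the measured level exceeds the threshold.
p_correct = 1 - diabetic.cdf(threshold)

# A non-diabetic is misdiagnosed when the measured level exceeds the threshold.
p_false_positive = 1 - non_diabetic.cdf(threshold)

print(round(p_correct, 4), round(p_false_positive, 4))
```

The results are roughly 0.93 and 0.02, corresponding to z-values of about –1.50 and 2.05 in a standard normal table.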

  45. Response Model • Problem: what is μ, the average of the attribute of interest in the target population? • We will use our collected data to estimate μ • Let Y be a random variable that represents the measured response variate • Y = μ + R, where R ~ G(0, σ) • Hence Y ~ G(μ, σ) • μ is systematic (no risk), while R is random (variable)

  46. Maximum Likelihood Estimation • Binomial: $\hat{\pi} = x/n$ ; x = # of successes, n = # of trials • Response: $\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ ; yi is the ith realization • Maximum Likelihood Estimation • A procedure used to determine a parameter estimate given any model

  47. Maximum Likelihood Estimation • First, we assume our data collected will follow a distribution • Before we collect the sample -> random variables • {Y1, Y2, …, Yn} • After we collect the sample -> realizations • {y1, y2, …, yn} • We know the distribution of Yi (with unknown parameters), hence we know the PDF/PMF

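  48. Likelihood Function • The Likelihood Function: $L(\theta) = \prod_{i=1}^{n} f(y_i; \theta)$, θ ∈ Ω, for independent observations, where f is the pmf (discrete) or the pdf (continuous) • Likelihood: the probability of observing the dataset you have • We want to choose an estimate of the parameter θ that gives the largest such probability • Ω is the parameter space, the set of possible values for θ • Relative Likelihood: $R(\theta) = L(\theta)/L(\hat{\theta})$, where $\hat{\theta}$ is the maximum likelihood estimate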

  49. MLE Process • Step One: Define the likelihood function L(θ) • Step Two: Define the log likelihood function ln[L(θ)] • Step Three: Take the derivative with respect to θ • Step Four: Set the derivative equal to zero and solve for θ to arrive at the maximum likelihood estimate • Step Five: Plug in data values (if given) to arrive at a numerical maximum likelihood estimate

  50. Examples 10/11 Discrete: What is the MLE of a geometric distribution with pmf ? Continuous: Given Y ~ Exp(θ), with realizations y1,y2,…yn , find the maximum likelihood estimate of θ. What is the MLE for the realizations {3, 2, 1, 4}?
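
For the continuous example, the closed-form answer can be cross-checked numerically. The slide does not state how Exp(θ) is parameterized, so this sketch assumes θ is the mean, f(y; θ) = (1/θ)e^(–y/θ). Setting the derivative of the log likelihood to zero then gives the sample mean as the estimate. If θ were instead the rate, as in f(x; λ) = λe^(–λx), the estimate would be the reciprocal of the sample mean.

```python
import math

data = [3, 2, 1, 4]


def log_likelihood(theta, ys):
    """ln L(theta) assuming f(y; theta) = (1/theta) * exp(-y/theta) (theta = mean)."""
    n = len(ys)
    return -n * math.log(theta) - sum(ys) / theta


# Closed form: d/dtheta ln L = -n/theta + sum(y)/theta^2 = 0  =>  theta = mean of y.
theta_hat = sum(data) / len(data)

# Numerical check: maximise ln L over a grid of candidate values.
grid = [i / 1000 for i in range(1, 10001)]  # 0.001, 0.002, ..., 10.0
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_hat, theta_grid, 1 / theta_hat)
```

Under the mean parameterization the estimate for {3, 2, 1, 4} is 2.5; under the rate parameterization it would be 0.4.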
