220 likes | 335 Views
70-208: Regression. Lecture 2: Prob & Stats review (continued) + Intro to Linear Patterns. Question of the Day. I believe my final grade in this class will be: A B C D. Administrative. Piazza Have you signed up? Excel/Stats review session Here (1185) at 12:15 Don’t come late
E N D
70-208: Regression Lecture 2: Prob & Stats review (continued) + Intro to Linear Patterns
Question of the Day I believe my final grade in this class will be: • A • B • C • D
Administrative • Piazza • Have you signed up? • Excel/Stats review session • Here (1185) at 12:15 • Don’t come late • Clickers: • In the review session today at 12:15 • Sunday morning 10:30 – 11am in Adriana’s office • Slides: • Will be posted online after lecture. If not, email me. • Problem Set 1 due Monday at 9am. • First Quiz next Wednesday: 20 min.
Central Limit Theorem • Why assume Normality? The CLT tells us that we’re often OK: • The probability distribution of a mean (or sum) of IID random variables of tends to a Normal distribution (asymptotically) • Several versions of the CLT but we won’t go through the proofs here (they can be a little nasty) • So why are we OK? • Observed data are often the accumulation of many small factors (e.g., the value of the stock market depends on many investors, or scores on an exam)
Standardizing a Variable:z-scores • What’s a z-score? • Transforms a variable to standard deviation units away from the mean. Centered at 0. • Why would we use it?
Probabilities and Percentiles • What is P(X = 600)? • What is P(X >= 600)?
Percentiles • The lifetimes (in km) of a certain brand of automobile tires is a normally distributed random variable, • X ~ N(μ=40,000 km, σ=2000 km) • In a shipment of 3000 tires how many tires are expected to have a lifetime that is less than or equal to 35,000 miles? • E(# of tires) = P(X <= 35000) * 3000 • So how do we calculate P( X <= 35000)? • Z-scores. Or very easy in Excel: NORM.DIST() • norm.dist(x, μ, σ, Cumulative?) • norm.dist(35000, 40000, 2000, TRUE) = .0062 • E (# tires) = .0062 * 3000 = 18.6 = 19 • What if it were strictly less than 35,000 miles?
Quantile Plots • A visual check on Normality • Why wouldn’t just looking at the density or histogram work? • Sometimes skew, etc, is easy to see but often it isn’t unless you look at a quantile plot • We’ll see lots of quantile plots this term. Review Chp 12.4 If data track the diagonal line, you can safely assume it’s a Normal distribution.
Fun with data http://www.qatar.cmu.edu/~gasper/regression/ /Data: diamonds.xls • Topics: • Naming ranges / editing names • Check min, max, average • Data Analysis add-in • Frequency() – ctrl+shift+enter; it’s an “array function” • Also Histogram from the Data Analysis add-in • StatTools
Scatterplots Data: diamonds.xls • Produce a scatterplot with weight on the x-axis and price on the y-axis. Topics: • Be careful about proceeding too quickly with Excel • It tries to do a lot for you but can get things wrong (often). Double check! • Excel will default to the left-most column being the x-axis. Not always what you want. • Can change the layout of the excel sheet (easier but not recommend) • Goto select data and check/change the X values and Y values. (best practice – always check what the x- and y- axes are) • Double check the units of the axis and the units of the data! • Scatterplot in StatTools? Very easy
On to the fun stuff… Other topics to review: • Confidence Intervals • Difference between the “empirical rule” and the precise interval using the t-distribution. • Random Sampling • Correlation • t-tests • “Regression” has a bad reputation. People think it’s hard • It’s not. It’s easy. In fact, it’s really really easy. • The problem with regression is that it’s too easy. • But there are many aspects where we need to be careful.
Linear Patterns • This course is about linear regression models. • Equation of a line: y = mx + b • We’ll typically write it: • Where y is the response (or dependent) variable and x and the explanatory (or independent) variable. And y “hat” is the fitted (or predicted) value. • Our objective is to find a the optimal β0and β0 from the data.
Linear Patterns • What is the correlation between Price and Weight? • .66
Linear Patterns Now how we need to find a line (slope and intercept) that optimally fits the data
Residuals Residual: vertical deviations from the data points to the line ( ). I.e., residual = actual – fitted
Best Fitting Line: OLS Ordinary Least Squares (OLS) • The best fitting line collectively makes the squares of residuals as small as possible (the choice of ¯0 and ¯1minimizes the sum of the squared residuals).
Quickly summarizing data: Pivot tables Data: bmw_lease • You’re probably familiar with contingency tables, aka a cross tabulation (cross tab). Pivot tables can be used for this but also much more. Data questions • How many observations over all years are there for a BMW ci? (value field settings) • What is the average price of the BMW 325 over all years and types? What about just the ci type? (value field settings) • What is the average price of the 325 ci in 2005, in the data? (report filters)
Probability and Decision Problems: Solver • Has anyone used Excel’s Solver? • You might in Optimization. • It’s an add-in; you should also have the data analysis add-in Homework problems: • Some questions on the homework are similar to the problem done in class. • Some are a little different (that’ll generally be the case) • You can use z-scores, but can also use Solver: What x (number of miles) solves: P( miles <= x) = 0.01? • Solve for x in norm.dist(x, mu, sigma, TRUE) = 0.01 • Also check out norm.inv()
Regression (by hand) Data: diamonds.xls Regress price on weight: i.e., price is what we’re predicting (or the dependent variable) and weight is the explanatory variable (independent var). I.e, for all data points, i, find a line through them: PredictedPricei= β0+ β1* weighti. So how do we find the slope and intercept?
Regression using Solver We’ll use Solver to find β0 + β1: • Guess some starting values for β0and β1 • Use those and the line eq to give you a fitted (predicted) Price • Calculate the Residual and Residual^2 • Calculate the sum of the Residual^2 • Use solver to minimize the sum of the residuals^2 by changing β0 and β1 • (solver isn’t perfect and might be off – hence we won’t use it for regression. And it’s also a pain.)
Regression using Excel Doing regression by hand is good to do. Once or twice. • In the Data Analysis add-in Excel has a built in function that makes it much easier • If you’re a Mac person (I am) there is software you can download to give you Solver and the Data Analysis add-in. • Also very easy (probably easier) to do it with StatTools.