1 / 22

70-208: Regression

70-208: Regression. Lecture 2: Prob & Stats review (continued) + Intro to Linear Patterns. Question of the Day. I believe my final grade in this class will be: A B C D. Administrative. Piazza Have you signed up? Excel/Stats review session Here (1185) at 12:15 Don’t come late

oistin
Download Presentation

70-208: Regression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 70-208: Regression Lecture 2: Prob & Stats review (continued) + Intro to Linear Patterns

  2. Question of the Day I believe my final grade in this class will be: • A • B • C • D

  3. Administrative • Piazza • Have you signed up? • Excel/Stats review session • Here (1185) at 12:15 • Don’t come late • Clickers: • In the review session today at 12:15 • Sunday morning 10:30 – 11am in Adriana’s office • Slides: • Will be posted online after lecture. If not, email me. • Problem Set 1 due Monday at 9am. • First Quiz next Wednesday: 20 min.

  4. Central Limit Theorem • Why assume Normality? The CLT tells us that we’re often OK: • The probability distribution of a mean (or sum) of IID random variables of tends to a Normal distribution (asymptotically) • Several versions of the CLT but we won’t go through the proofs here (they can be a little nasty) • So why are we OK? • Observed data are often the accumulation of many small factors (e.g., the value of the stock market depends on many investors, or scores on an exam)

  5. Standardizing a Variable:z-scores • What’s a z-score? • Transforms a variable to standard deviation units away from the mean. Centered at 0. • Why would we use it?

  6. Probabilities and Percentiles • What is P(X = 600)? • What is P(X >= 600)?

  7. Percentiles • The lifetimes (in km) of a certain brand of automobile tires is a normally distributed random variable, • X ~ N(μ=40,000 km, σ=2000 km) • In a shipment of 3000 tires how many tires are expected to have a lifetime that is less than or equal to 35,000 miles? • E(# of tires) = P(X <= 35000) * 3000 • So how do we calculate P( X <= 35000)? • Z-scores. Or very easy in Excel: NORM.DIST() • norm.dist(x, μ, σ, Cumulative?) • norm.dist(35000, 40000, 2000, TRUE) = .0062 • E (# tires) = .0062 * 3000 = 18.6 = 19 • What if it were strictly less than 35,000 miles?

  8. Quantile Plots • A visual check on Normality • Why wouldn’t just looking at the density or histogram work? • Sometimes skew, etc, is easy to see but often it isn’t unless you look at a quantile plot • We’ll see lots of quantile plots this term. Review Chp 12.4 If data track the diagonal line, you can safely assume it’s a Normal distribution.

  9. Fun with data http://www.qatar.cmu.edu/~gasper/regression/ /Data: diamonds.xls • Topics: • Naming ranges / editing names • Check min, max, average • Data Analysis add-in • Frequency() – ctrl+shift+enter; it’s an “array function” • Also Histogram from the Data Analysis add-in • StatTools

  10. Scatterplots Data: diamonds.xls • Produce a scatterplot with weight on the x-axis and price on the y-axis. Topics: • Be careful about proceeding too quickly with Excel • It tries to do a lot for you but can get things wrong (often). Double check! • Excel will default to the left-most column being the x-axis. Not always what you want. • Can change the layout of the excel sheet (easier but not recommend) • Goto select data and check/change the X values and Y values. (best practice – always check what the x- and y- axes are) • Double check the units of the axis and the units of the data! • Scatterplot in StatTools? Very easy

  11. On to the fun stuff… Other topics to review: • Confidence Intervals • Difference between the “empirical rule” and the precise interval using the t-distribution. • Random Sampling • Correlation • t-tests • “Regression” has a bad reputation. People think it’s hard • It’s not. It’s easy. In fact, it’s really really easy. • The problem with regression is that it’s too easy. • But there are many aspects where we need to be careful.

  12. Linear Patterns • This course is about linear regression models. • Equation of a line: y = mx + b • We’ll typically write it: • Where y is the response (or dependent) variable and x and the explanatory (or independent) variable. And y “hat” is the fitted (or predicted) value. • Our objective is to find a the optimal β0and β0 from the data.

  13. Linear Patterns • What is the correlation between Price and Weight? • .66

  14. Linear Patterns Now how we need to find a line (slope and intercept) that optimally fits the data

  15. Linear Patterns

  16. Residuals Residual: vertical deviations from the data points to the line ( ). I.e., residual = actual – fitted

  17. Best Fitting Line: OLS Ordinary Least Squares (OLS) • The best fitting line collectively makes the squares of residuals as small as possible (the choice of ¯0 and ¯1minimizes the sum of the squared residuals).

  18. Quickly summarizing data: Pivot tables Data: bmw_lease • You’re probably familiar with contingency tables, aka a cross tabulation (cross tab). Pivot tables can be used for this but also much more. Data questions • How many observations over all years are there for a BMW ci? (value field settings) • What is the average price of the BMW 325 over all years and types? What about just the ci type? (value field settings) • What is the average price of the 325 ci in 2005, in the data? (report filters)

  19. Probability and Decision Problems: Solver • Has anyone used Excel’s Solver? • You might in Optimization. • It’s an add-in; you should also have the data analysis add-in Homework problems: • Some questions on the homework are similar to the problem done in class. • Some are a little different (that’ll generally be the case) • You can use z-scores, but can also use Solver: What x (number of miles) solves: P( miles <= x) = 0.01? • Solve for x in norm.dist(x, mu, sigma, TRUE) = 0.01 • Also check out norm.inv()

  20. Regression (by hand) Data: diamonds.xls Regress price on weight: i.e., price is what we’re predicting (or the dependent variable) and weight is the explanatory variable (independent var). I.e, for all data points, i, find a line through them: PredictedPricei= β0+ β1* weighti. So how do we find the slope and intercept?

  21. Regression using Solver We’ll use Solver to find β0 + β1: • Guess some starting values for β0and β1 • Use those and the line eq to give you a fitted (predicted) Price • Calculate the Residual and Residual^2 • Calculate the sum of the Residual^2 • Use solver to minimize the sum of the residuals^2 by changing β0 and β1 • (solver isn’t perfect and might be off – hence we won’t use it for regression. And it’s also a pain.)

  22. Regression using Excel Doing regression by hand is good to do. Once or twice. • In the Data Analysis add-in Excel has a built in function that makes it much easier • If you’re a Mac person (I am) there is software you can download to give you Solver and the Data Analysis add-in. • Also very easy (probably easier) to do it with StatTools.

More Related