Learn linear regression modeling with R programming in a lab setting. Includes basics, data cleaning, regression models, and more. Prerequisites include programming experience and basic stats knowledge.
Sample slides for use with “Linear Regression Using R: An Introduction to Data Modeling,” David J. Lilja, University of Minnesota Libraries Publishing, 2016 (z.umn.edu/lrur). These slides are provided to help others in teaching the material presented in the above textbook. You are welcome to use this material in your own course lectures. This material is licensed under the Creative Commons Attribution-Noncommercial 4.0 License. You are free to: Share -- copy and redistribute the material in any medium or format; Adapt -- remix, transform, and build upon the material. Under the following terms: Attribution -- You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. NonCommercial -- You may not use the material for commercial purposes.
Overview • Learn to develop linear regression models using a statistically valid approach • Learn basics of the R statistical programming language • Meet once per week in a computer lab setting • Textbook: Linear Regression Using R: An Introduction to Data Modeling • z.umn.edu/lrur
Class Topics • Intro to R programming environment • Sanity checking and cleaning data • One-factor regression models • Model quality and residual analysis • Multi-factor regression models • Backward elimination process • Prediction • Training and testing
Prerequisites and Grading • Prerequisites • Some programming experience • Some familiarity with basic statistics • Grades based on 8 lab assignments • 6 labs from text • 1 kaggle “contest” • Final project of your choice • Class attendance • I encourage you to work in pairs • But turn in your own work for grading
Lab Assignments • Lab 1 – Basics of R • Lab 2 – Data cleaning • Lab 3 – One-factor models • Lab 4 – Multi-factor models • Lab 5 – Cross data set predictions • Lab 6 – Training and testing • Lab 7 – Kaggle • Lab 8 – Final project
1. Introduction to R • R is a programming language developed for statistical computing • Open source: www.r-project.org • Complete run-time, interactive environment for manipulating and modeling data • Interpreted language • Many built-in functions • Object-oriented • Matrices and vectors are basic operand types • Excellent graphical tools
To do • Download book from z.umn.edu/lrur • Start R • Refer to Chapter 1 • Do the operations in the box on p. 5 • Download and complete Lab 1
2. Manipulating Data • First understand your data • How is it structured? • How is it organized? • What column headings are available? • Features of the data? • How many unique data points (rows)? • Missing values? • “NA” in R • Some functions drop NA values automatically (e.g. lm()) • Others return NA unless told explicitly to skip them • mean(x, na.rm=TRUE)
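A minimal sketch of the NA behavior mentioned above. The vector `x` is made up for illustration and is not part of the CPU DB data.

```r
# Hypothetical vector with one missing value (not from the CPU DB data)
x <- c(2.5, 3.0, NA, 4.5)

mean_default <- mean(x)                # NA: the missing value propagates
mean_clean   <- mean(x, na.rm = TRUE)  # drops the NA before averaging
n_missing    <- sum(is.na(x))          # number of NA entries

print(mean_default)
print(mean_clean)
print(n_missing)
```

Counting the NA entries first, as in the last line, is a quick way to decide whether dropping them is reasonable.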
Example Data – CPU DB • Design characteristics and measured performance for 1500+ processors • Clock frequency • Technology parameters • Number of transistors • Die size • Feature size • Channel length • Parallelism parameters • Number of cores, threads • Memory parameters • I-cache and D-cache sizes • Performance = SPEC CPU integer and floating-point
Data Frames – Fundamental R Object • read-data.R • int92.dat, fp92.dat, … • head(x) • Header plus first few rows
Accessing a data frame • int92.dat[15, 12] • row 15, column 12 • int92.dat["71", "perf"] • row labeled "71", column labeled "perf" • int92.dat[ , "clock"] • every row in column labeled "clock" • int92.dat[36, ] • every column in row 36 • nrow(int92.dat), ncol(int92.dat) • number of rows, columns
Example functions • min(int92.dat[, "perf"]) • max(int92.dat[, "perf"]) • mean(int92.dat[, "perf"]) • sd(int92.dat[, "perf"])
Alternative notation - $ • min(int92.dat$perf) • max(int92.dat$perf) • mean(int92.dat$perf) • sd(int92.dat$perf)
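The access patterns on the last few slides can be tried without the CPU DB files. The sketch below uses a small made-up data frame called `cpu` as a stand-in for int92.dat; its values and row labels are invented.

```r
# Hypothetical stand-in for int92.dat; values and row labels are invented
cpu <- data.frame(
  clock = c(100, 125, 150, 200),
  cores = c(1, 1, 1, 1),
  perf  = c(50.2, 61.7, NA, 98.4),
  row.names = c("70", "71", "72", "73")
)

cpu[2, 3]             # row 2, column 3
cpu["71", "perf"]     # row labeled "71", column labeled "perf"
cpu[, "clock"]        # every row in column "clock"
cpu[3, ]              # every column in row 3
nrow(cpu); ncol(cpu)  # number of rows, columns

# Bracket and $ notation select the same column
same_column <- identical(cpu[, "perf"], cpu$perf)
max_perf <- max(cpu$perf, na.rm = TRUE)  # na.rm needed because perf has an NA
```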
Attach dataframe to local workspace • attach(int92.dat) • min(perf) • max(perf) • mean(perf) • sd(perf) • detach(int92.dat) • attach(fp00.dat) • … … …
To do • Download CPU DB and read-data.R from z.umn.edu/lrur • Load CPU DB data into dataframes in R • See Section 2.4 • Try accessing dataframes using examples in Section 2.5
3. Sanity Checking and Cleaning Data • Data is never perfect • Missing values (NA) • Invalid values – e.g. recorded incorrectly • Outliers – real or errors? • Sanity checking • Does data look reasonable? • Is it in a useful format? • Data cleaning • Putting it into proper format for analysis • Often the biggest job of the data analysis task!
Things to try • Compute relevant values per column • Mean, max, min, standard deviation • Sort columns looking for outliers, unusual patterns • Find fraction of NA values • Are there enough values for analysis? • Use table() to find distribution of values • Other approaches unique to your data
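The checks listed above could look roughly like this. The data frame `cpu` and its values are assumptions made for the example, with one deliberately suspicious clock value.

```r
# Hypothetical data frame; column names mimic the CPU DB, values are invented
cpu <- data.frame(
  clock = c(100, 125, 150, 200, 5000),  # 5000 looks like an outlier
  cores = c(1, 1, 2, 1, NA),
  perf  = c(50.2, NA, 75.0, 98.4, NA)
)

# Relevant values per column, skipping NA entries
col_stats <- sapply(cpu, function(col) {
  c(mean = mean(col, na.rm = TRUE),
    min  = min(col, na.rm = TRUE),
    max  = max(col, na.rm = TRUE),
    sd   = sd(col, na.rm = TRUE))
})

# Fraction of NA values in each column
na_fraction <- colMeans(is.na(cpu))

# Sorting a column pushes outliers to the ends
sorted_clock <- sort(cpu$clock)

# Distribution of values, with NA counted as its own category
core_counts <- table(cpu$cores, useNA = "ifany")

print(col_stats)
print(na_fraction)
print(sorted_clock)
print(core_counts)
```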
Analyzing Data is Messy • No right or wrong answers • Must apply your judgment based on experience • Many options, tools available • Need to try variations to see what works for your individual needs and data set • R is interactive environment • Use it to experiment • What works • What doesn’t
To do • Read Chapter 2 • Download and complete Lab 2
4. One-factor Regression • y = a0 + a1x1 • y = output • a0 and a1 = regression coefficients • x1 = input data • Given many (x1, y) pairs, our goal is to compute a0 and a1
Does it look linear? • plot() • Try example in Section 3.1
Linear model function • lm() • int00.lm <- lm (perf ~ clock) • y = a0 + a1x1 • y = perf • x1 = clock • Try example in Section 3.2 • a0 = 51.8 • a1 = 0.586 • perf = 51.8 + 0.586 * clock
Plotting the regression line • plot(clock,perf) • abline(int00.lm)
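The fit-and-plot sequence above can be rehearsed on synthetic data. In this sketch `demo.dat` and its numbers are invented, and the model is given the data frame through `data =` rather than `attach()`, which is a stylistic choice and not the slides' approach.

```r
set.seed(1)

# Synthetic stand-in for the clock and perf columns (not CPU DB data)
clock <- seq(500, 3000, by = 100)
perf  <- 50 + 0.6 * clock + rnorm(length(clock), sd = 100)
demo.dat <- data.frame(clock, perf)

# Fit perf = a0 + a1 * clock
demo.lm <- lm(perf ~ clock, data = demo.dat)
a0 <- coef(demo.lm)[["(Intercept)"]]
a1 <- coef(demo.lm)[["clock"]]
print(c(a0 = a0, a1 = a1))

# Scatter plot with the fitted line drawn on top
plot(demo.dat$clock, demo.dat$perf, xlab = "clock", ylab = "perf")
abline(demo.lm)
```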
Behind the Code • For i = 1 to n (xi, yi) pairs • yi = a0 + a1xi + ei • ei = residual = (actual) - (predicted by line) • Minimize sum-of-squares of residuals
Behind the Code • Set partial derivatives to zero to find minima • Two equations, two unknowns • Solve for a0, a1
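The slide's equations did not survive extraction. The standard least-squares derivation for a one-factor model, written with the a0, a1, xi, yi notation used here, is sketched below; it may differ in layout from the original slide.

```latex
\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial a_0} &= -2 \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right) = 0 \\
\frac{\partial\,\mathrm{SSE}}{\partial a_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - a_0 - a_1 x_i \right) = 0 \\
a_1 &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad a_0 = \bar{y} - a_1 \bar{x}
\end{aligned}
```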
5. Model quality • Section 3.3 • summary(int00.lm) • Residuals = actual value – fitted value • above/below line = positive/negative residual • Expect median of residuals ≈ 0 • Expect residuals normally distributed around 0 • Expect min, max approx same distance from 0
Residuals:
    Min      1Q  Median      3Q     Max
-634.61 -276.17  -30.83   75.38 1299.52
Model quality – coefficients • Standard error • Expect std err to be 5-10x smaller than the coefficient value • Small value → small variability • Pr(>|t|) = probability of seeing a t value at least this large in magnitude if the true coefficient were 0 • Loosely, the chance that the coefficient is not relevant to the model • Called the “significance value” • Also called the p-value • Small value → coefficient is significant to model • Typically want < 0.05 or 0.1
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709   53.31513   0.971    0.332
clock        0.58635    0.02697  21.741   <2e-16 ***
Model quality – visual indicators • *** : 0 < p ≤ 0.001 • ** : 0.001 < p ≤ 0.01 • * : 0.01 < p ≤ 0.05 • . : 0.05 < p ≤ 0.1 • [blank] : 0.1 < p ≤ 1 • Quick visual indicator of the significance of that coefficient.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709   53.31513   0.971    0.332
clock        0.58635    0.02697  21.741   <2e-16 ***
Model quality – residuals • Residual standard error • Measure of total variation in residual values • Expect 1st and 3rd quartiles ≈ 1.5 × standard error
Residuals:
    Min      1Q  Median      3Q     Max
-634.61 -276.17  -30.83   75.38 1299.52
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709   53.31513   0.971    0.332
clock        0.58635    0.02697  21.741   <2e-16 ***
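The numbers that summary() prints can also be pulled out programmatically, which helps when checking many models. The sketch below uses synthetic data; `x`, `y`, and `fit` are illustrative names, not objects from the textbook.

```r
set.seed(2)

# Synthetic one-factor data (invented for illustration)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)
fit <- lm(y ~ x)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
print(coef_table)

# How many times larger each coefficient is than its standard error
ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]

# p-values, residual quartiles, and residual standard error
p_values        <- coef_table[, "Pr(>|t|)"]
resid_quartiles <- quantile(resid(fit), c(0.25, 0.5, 0.75))
rse             <- summary(fit)$sigma

print(ratio)
print(p_values)
print(resid_quartiles)
print(rse)
```

The ratio computed here is the absolute value of the t value column, so the "5-10x smaller" rule of thumb is another way of reading that column.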
Model quality – R2 • Degrees of freedom = (number of observations used to generate model) – (number of coefficients) • E.g. 256 observations used to compute 2 coefficients → 254 degrees of freedom • R2 = fraction of total variation explained by the model, 0 ≤ R2 ≤ 1 • Adjusted R2 – discuss later for multi-factor models • F-statistic – discuss later for multi-factor models
R2 – A Little Deeper • R2 = coefficient of determination = Fraction of total variation explained by model
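As a rough check of the definition, R2 can be computed by hand as 1 - SSE/SST and compared with the value summary() reports. The data below is synthetic and the variable names are assumptions for the example.

```r
set.seed(3)

# Synthetic data (invented for illustration)
x <- runif(40, 0, 10)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)       # variation left unexplained by the model
sst <- sum((y - mean(y))^2)    # total variation in y
r2_manual <- 1 - sse / sst     # fraction of variation explained
r2_lm     <- summary(fit)$r.squared

# Degrees of freedom: 40 observations - 2 coefficients
df_resid <- df.residual(fit)

print(c(manual = r2_manual, lm = r2_lm))
print(df_resid)
```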
Residual analysis • yf = a0 + a1xi = f(xi) • fitted(int00.lm) computes yf = f(xi) for all xi • resid(int00.lm) computes yi – yf for all outputs yi • Residuals = (actual value) – (fitted value) • plot(fitted(int00.lm), resid(int00.lm)) • Expect randomly distributed around 0 • Note vertical lines • Different perf (yi) with same clock (xi)
Residual analysis – QQ plot • Quantile vs. quantile (QQ) plot • qqnorm(resid(int00.lm)) • qqline(resid(int00.lm)) • Expect the points to follow a straight line • Both the residual plot and QQ plot suggest that a one-factor model is insufficient for this data.
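To see what an insufficient model looks like in these plots, one option is to fit a straight line to data that is deliberately curved. The example below is synthetic and only illustrates the two plotting calls; it is not the int00 data.

```r
set.seed(4)

# Synthetic data with a curved relationship (invented for illustration)
x <- seq(1, 10, length.out = 60)
y <- 2 + 0.5 * x^2 + rnorm(60)

# Deliberately too-simple straight-line model
fit <- lm(y ~ x)
res <- resid(fit)   # actual value minus fitted value

# Residuals vs fitted values: look for a pattern instead of random scatter
plot(fitted(fit), res, xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)

# QQ plot: points that bend away from the line suggest non-normal residuals
qqnorm(res)
qqline(res)
```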
To do • Read Chapter 3 • Download and complete Lab 3
6. Multi-factor regression • y = a0 + a1x1 + a2x2 + … + anxn • Steps: • Visualize the data • Identify potential predictors • Apply backward elimination process • Perform residual analysis to check quality of the model
Visualize the data • pairs(int00.dat, gap=0.5) • Note nperf is normalized version of perf • 0 ≤ nperf < 100
Identify potential predictors • Use smallest number of predictors necessary to make good predictions • Too many or redundant predictors overfit the model • Builds random noise into model • Close fit for that data set, but does not generalize • Adding predictors never decreases R2 • But not necessarily a better model • May simply be better at modeling the random noise
Adjusted R2 • n = # observations • m = # predictors in the model • Adjusted R2 increases if • Adding a new predictor increases previous model’s R2 by more than we would expect from random fluctuations • If adjusted R2 increases, new predictor improved the model • Adding new predictors is not always helpful • Adjusted R2 will then stay the same or decrease
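The slide's formula was lost in extraction, so the sketch below relies on the adjusted R2 that R itself reports and checks it against the usual textbook form, 1 - (1 - R2)(n - 1)/(n - k - 1), where k counts predictors excluding the intercept. That counting convention is an assumption and may differ from the m defined on the slide. The data and the `noise` predictor are synthetic.

```r
set.seed(5)

# Synthetic data: y depends on x1 only; `noise` is unrelated by construction
n <- 100
x1    <- rnorm(n)
noise <- rnorm(n)
y     <- 4 + 3 * x1 + rnorm(n)

fit1 <- lm(y ~ x1)           # one useful predictor
fit2 <- lm(y ~ x1 + noise)   # same model plus an irrelevant predictor

r2  <- c(summary(fit1)$r.squared,     summary(fit2)$r.squared)
adj <- c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)

# Manual adjusted R2 for fit2; k = predictors excluding the intercept
k <- 2
adj_manual <- 1 - (1 - r2[2]) * (n - 1) / (n - k - 1)

print(rbind(r2 = r2, adjusted = adj))
print(adj_manual)
```

Plain R2 cannot go down when a predictor is added, which is why the comparison between models is made on the adjusted value.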
Identify potential predictors • Start with all available predictors (columns) • Use knowledge of the system to: • Eliminate predictors that are not meaningful: • E.g. Thermal design power (TDP), index number, … • Predictors with only a few entries – e.g. L3 cache • Add functions of existing predictors that could be useful • E.g. Previous research suggests CPU performance increases as the sqrt(cache size) • Add am·xm^(1/2) terms • Must still include first-degree terms
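A function of an existing predictor can be written directly in the model formula. The sketch below adds a square-root term next to the first-degree term; the `cache` and `perf` values are synthetic and the column names are assumptions, not the CPU DB headings.

```r
set.seed(6)

# Synthetic data in which performance grows with sqrt(cache size)
cache <- runif(80, 8, 1024)
perf  <- 10 + 2 * sqrt(cache) + rnorm(80)

# Keep the first-degree term and add the square-root term
fit <- lm(perf ~ cache + sqrt(cache))

term_names <- names(coef(fit))
print(term_names)
print(coef(summary(fit)))
```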
Backward elimination • 1. Generate model with all potential predictors: int00.lm <- lm(nperf ~ clock + threads + cores + transistors + …) • 2. Use summary(int00.lm) to find each predictor’s significance level • 3. Drop predictor_i with least significance (largest p-value) if that p-value is > threshold (typically want p ≤ 0.05 or 0.1) and recompute the model: int00.lm <- update(int00.lm, .~. - predictor_i) • 4. Repeat steps 2-3 until all p-values ≤ threshold
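The slides apply these steps by hand, one update() call at a time. As an illustration only, the same loop could be automated roughly as follows. The data frame `dat`, its column names, and the `junk1`/`junk2` predictors are invented, and the loop assumes plain numeric predictors so that coefficient names match term names.

```r
set.seed(7)

# Synthetic data: nperf depends on clock and cores; junk1 and junk2 are noise
n <- 200
dat <- data.frame(clock = rnorm(n), cores = rnorm(n),
                  junk1 = rnorm(n), junk2 = rnorm(n))
dat$nperf <- 5 + 3 * dat$clock + 2 * dat$cores + rnorm(n)

threshold <- 0.05

# Step 1: model with all potential predictors
fit <- lm(nperf ~ clock + cores + junk1 + junk2, data = dat)

repeat {
  # Step 2: p-value of each predictor (intercept excluded)
  p <- coef(summary(fit))[, "Pr(>|t|)"]
  p <- p[names(p) != "(Intercept)"]

  # Step 4: stop once every remaining p-value is at or below the threshold
  if (length(p) == 0 || max(p) <= threshold) break

  # Step 3: drop the least significant predictor and refit
  worst <- names(p)[which.max(p)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

kept <- names(coef(fit))
print(kept)
```

Doing the steps manually, as in the slides, has the advantage that each intermediate summary() gets looked at before a predictor is dropped.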
Residual analysis • The same as for a one-factor model • plot(fitted(int00.lm), resid(int00.lm)) • Expect to see values uniformly distributed around 0
Residual analysis – QQ plot • qqnorm(resid(int00.lm)) • qqline(resid(int00.lm)) • Expect to see values follow the line.
A Little Deeper on p-values • H0 = Null hypothesis: the given coefficient is equal to 0 • i.e., it is not significant to the model • Type I Error: rejecting the null hypothesis when it is, in fact, true • i.e., keeping a coefficient as significant when it actually is not • Want this probability to be small • If p ≤ α → reject null hypothesis • Small p-value → predictor is likely to be meaningful • α is called the significance level
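For readers who want to connect the p-value to the t value column, the two-sided p-value that lm() reports can be recomputed from the t distribution. The data and the choice of alpha = 0.05 below are assumptions for the example.

```r
set.seed(8)

# Synthetic data (invented for illustration)
x <- rnorm(30)
y <- 1 + 0.4 * x + rnorm(30)
fit <- lm(y ~ x)
tab <- coef(summary(fit))

# Two-sided p-value for H0: coefficient of x equals 0
t_val    <- tab["x", "t value"]
p_manual <- 2 * pt(-abs(t_val), df = df.residual(fit))
p_lm     <- tab["x", "Pr(>|t|)"]

alpha     <- 0.05              # significance level (assumed here)
reject_h0 <- p_manual <= alpha

print(c(manual = p_manual, lm = p_lm))
print(reject_h0)
```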
When Things Go Wrong • In matrix form: y = Xa + e • X = all input factors that were measured • As before, minimize SSE • Set partial derivatives to zero: a = (X^T X)^(-1) X^T y • X^T X must be invertible; else see Sec. 4.6 • a = regression coefficients
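A small sketch of the matrix view, under the assumption that the slide refers to the usual normal equations. It solves them directly, compares the result with lm(), and then shows what lm() returns when a predictor column never varies, so that X^T X is not invertible. All data and names are synthetic.

```r
set.seed(9)

# Synthetic data (invented for illustration)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Normal equations: solve (X'X) a = X'y
a   <- solve(t(X) %*% X, t(X) %*% y)
fit <- lm(y ~ x1 + x2)
print(cbind(normal_eq = as.vector(a), lm = unname(coef(fit))))

# A constant predictor duplicates the intercept column, so X'X is singular;
# lm() reports NA for the coefficient it cannot determine
ones    <- rep(1, n)
fit_bad <- lm(y ~ x1 + ones)
print(coef(fit_bad))
```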
Example: int92 (Sec. 4.6) • All predictors mostly NA, adjusted R2 = 0.9813 • Drop threads, cores mostly NA, R2 = 0.9813