Lecture 22: Thurs., April 1 • Outliers and influential points for simple linear regression • Multiple linear regression • Basic model • Interpreting the coefficients
Outliers and Influential Observations • An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, in the y direction, or in the direction of the scatterplot (i.e., far from the overall linear pattern). For regression, the outliers of concern are those in the x direction and those in the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual. • An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential. • The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.
Outliers Example • Does the age at which a child begins to talk predict a later score on a test of mental ability? • gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive Score (y), an ability test taken much later. • Child 18 is an outlier in the x direction and potentially influential. Child 19 is an outlier in the direction of the scatterplot. • To assess whether a point is influential, fit the least squares line with and without the point (excluding the row to fit the line without the point) and see how much of a difference it makes. • Child 18 is influential.
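The with/without comparison above can be sketched in a few lines of Python. The numbers below are illustrative, not the actual gesell.JMP values; the last point plays the role of a Child-18-style outlier in the x direction:

```python
import numpy as np

# Illustrative data (NOT the actual gesell.JMP values): age at first
# word (x) vs. ability score (y), with one x-outlier as the last point.
x = np.array([15.0, 26, 10, 9, 15, 20, 18, 11, 8, 42])
y = np.array([95.0, 71, 83, 91, 102, 87, 93, 100, 104, 80])

def fit_line(x, y):
    """Least squares slope and intercept via a degree-1 polynomial fit."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

slope_all, _ = fit_line(x, y)
# Refit excluding the last observation (the outlier in the x direction).
slope_without, _ = fit_line(x[:-1], y[:-1])

print(f"slope with point:    {slope_all:.2f}")
print(f"slope without point: {slope_without:.2f}")
```

A large change in the slope when the point is excluded is exactly what "influential" means for the least squares line.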
Will You Take Mercury With Your Fish? • Too much mercury in one's body results in memory loss, depression, irritability and anxiety – the "mad hatter" syndrome. • Rivers and oceans contain small amounts of mercury, which can accumulate in fish over their lifetimes. • The concentration of mercury in fish tissue can be obtained at considerable expense by catching fish and sending samples to a lab for analysis. • It is important to understand the relationship between mercury concentration and measurable characteristics of a fish, such as length and weight, in order to develop safety guidelines about how much fish to eat.
Data Set • mercury.JMP contains data from a study of largemouth bass in the Wacamaw and Lumber rivers in North Carolina. At several stations along each river, a group of fish were caught, weighed, and measured. In addition, a fillet from each fish caught was sent to the lab so that the tissue concentration of mercury could be determined for each fish. • We want to predict Y = mercury concentration in fish tissue (parts per million) based on X1 = length (centimeters) and X2 = weight (grams).
Multiple Regression Model • Multiple regression seeks to estimate the mean of Y given multiple explanatory variables X1,…,Xp, denoted by μ{Y|X1,…,Xp}. • Assumptions of the ideal multiple linear regression model: • μ{Y|X1,…,Xp} = β0 + β1X1 + … + βpXp (linearity) • SD(Y|X1,…,Xp) = σ (constant variance) • The distribution of Y for each subpopulation X1,…,Xp is normal. • The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
Multiple Regression Model: Another Representation • Data: We observe (xi1,…,xip, yi) for i = 1,…,n. • Ideal multiple regression model: yi = β0 + β1xi1 + … + βpxip + ei, where • ei has a normal distribution with mean = 0 and SD = σ • e1,…,en are independent • ei = "error" = error from predicting yi by its subpopulation mean μ{Y|xi1,…,xip}
Estimation of Multiple Linear Regression Model • The coefficients β0, β1,…,βp are estimated by choosing b0, b1,…,bp to make the sum of squared prediction errors as small as possible, i.e., choose b0, b1,…,bp to minimize Σi [yi − (b0 + b1xi1 + … + bpxip)]². • Predicted value of y given x1,…,xp: ŷ = b0 + b1x1 + … + bpxp. • σ = SD(Y|X1,…,Xp), estimated by σ̂ = root mean square error (RMSE).
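The least squares criterion above can be computed directly. This sketch uses made-up fish measurements (not the actual mercury.JMP data) and NumPy's least squares solver, which finds the b minimizing the sum of squared prediction errors:

```python
import numpy as np

# Hypothetical fish measurements (NOT the actual mercury.JMP data):
# predict mercury (ppm) from length (cm) and weight (g).
length = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
weight = np.array([250.0, 420.0, 650.0, 950.0, 1300.0, 1700.0])
mercury = np.array([0.4, 0.6, 0.7, 1.0, 1.1, 1.4])

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones_like(length), length, weight])

# lstsq returns b = (b0, b1, b2) minimizing sum((y - X @ b)**2).
b, *_ = np.linalg.lstsq(X, mercury, rcond=None)

y_hat = X @ b                        # predicted values b0 + b1*x1 + b2*x2
sse = np.sum((mercury - y_hat) ** 2) # the minimized criterion
print("coefficients:", b)
print("sum of squared errors:", sse)
```

Any other choice of coefficients yields a larger sum of squared errors, which is the defining property of the least squares estimates.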
Multiple Linear Regression in JMP • Analyze, Fit Model • Put response variable in Y • Click on explanatory variables and then click Add under Construct Model Effects • Click Run Model.
Residuals and Root Mean Square Error from Multiple Regression • Residual for observation i = yi − ŷi = yi − (b0 + b1xi1 + … + bpxip). • Root mean square error = √[Σi (residual i)² / (n − p − 1)]. • As with simple linear regression, under the ideal multiple linear regression model: • Approximately 68% of predictions of a future Y based on X1,…,Xp will be off by at most σ̂ (one RMSE). • Approximately 95% of predictions of a future Y based on X1,…,Xp will be off by at most 2σ̂ (two RMSEs).
Interpreting the Coefficients • β1 = increase in the mean of Y associated with a one unit (1 cm) increase in length, holding weight fixed. • β2 = increase in the mean of Y associated with a one unit (1 gram) increase in weight, holding length fixed. • The interpretation of a multiple regression coefficient depends on what other explanatory variables are in the model. • See handout.
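The "holding the other variable fixed" interpretation can be verified numerically: with the other variable held constant, a one-unit change in length shifts the prediction by exactly the length coefficient. The coefficient values here are illustrative, not from the actual mercury.JMP fit:

```python
# Illustrative fitted coefficients (NOT from the actual mercury.JMP fit):
b0, b1, b2 = -1.0, 0.05, -0.0002  # intercept, length coef, weight coef

def predict(length_cm, weight_g):
    """Predicted mean mercury concentration for given length and weight."""
    return b0 + b1 * length_cm + b2 * weight_g

# Increase length by 1 cm while holding weight fixed at 800 g:
# the intercept and weight terms cancel, leaving exactly b1.
change = predict(41.0, 800.0) - predict(40.0, 800.0)
print(change)
```

If weight were instead allowed to change along with length (as heavier fish tend to be longer), the observed change in prediction would mix the two effects, which is why the coefficient's meaning depends on what else is in the model.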