Topics 26 - 28. Relationship in Data.
Relationship in Data Topic 26 Graphical Displays of Association
Activity 26-1: House Prices – Page 570 • Scatter Plot - Graphical display of the data (Page 570) • Horizontal axis – Explanatory variable • Vertical axis – Response variable • Association - Two variables display an association if knowing the value of one variable is useful in predicting the value of the other variable (Page 571) • Three aspects of the association between quantitative variables: (Page 571) • Direction (Positive or Negative) • Strength (Strong, Moderate, or Weak) • Form (Linear or Curved)
More • A categorical variable can be incorporated into a scatter plot by constructing a labeled scatter plot, which assigns different labels to the dots based on the category of the observational unit. • For example, you might indicate observations coming from males with the label M and from females with the label F. • Remember that an observed association does not imply that a cause-and-effect relationship exists between the two variables.
Relationship in Data Topic 27 Correlation Coefficient
Correlation • Correlation measures the degree of linear association between two quantitative variables. • But even when two variables display a nonlinear relationship, the correlation between them still might be quite high when there is a strong increasing or decreasing trend. • With these data, the relationship is clearly curved and not linear, and yet the correlation is still fairly high. Do not assume from a high correlation coefficient that the relationship between the variables must be linear. • Always look at a scatter plot, in conjunction with the correlation coefficient, to assess the form (linear or not) of the association.
Correlation Coefficient (r) • No matter how close a correlation coefficient (r) is to 1, and no matter how strong the association between two variables, a cause-and-effect conclusion cannot necessarily be drawn from observational data. • There are far more plausible explanations for why countries with lots of televisions per thousand people tend to have long life expectancies. For example, the technological sophistication of the country is related to both the number of televisions and life expectancy.
Correlation Coefficient (r) • The correlation coefficient (r) is a number that measures the direction and strength of linear association between two quantitative variables. • A correlation coefficient is a number! In fact, it is a number between -1 and 1, inclusive. • Always examine a scatter plot in addition to calculating a correlation coefficient. A clear nonlinear relationship can have a small (close to zero) correlation, and a correlation can be close to -1 or 1 even if the relationship follows a curve or other nonlinear pattern.
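Both warnings on this slide can be checked numerically. The sketch below uses made-up data (not from the textbook) and a hand-written helper named `correlation`, which is an illustrative name rather than anything from the course materials. A symmetric U-shaped pattern gives r = 0 even though y is completely determined by x, while the increasing half of the same curve gives r close to 1 even though the form is curved.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: a perfect U shape (y = x squared), symmetric about x = 0.
xs_symmetric = [-3, -2, -1, 0, 1, 2, 3]
ys_symmetric = [x ** 2 for x in xs_symmetric]
r_symmetric = correlation(xs_symmetric, ys_symmetric)

# Made-up data: the same curve, but only its increasing half.
xs_increasing = [0, 1, 2, 3, 4, 5, 6]
ys_increasing = [x ** 2 for x in xs_increasing]
r_increasing = correlation(xs_increasing, ys_increasing)

print(f"symmetric curve:  r = {r_symmetric:.3f}")
print(f"increasing curve: r = {r_increasing:.3f}")
```

In both cases only a scatter plot would reveal the curved form, which is why the slide says to examine the plot alongside r.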
Correlation Coefficient (r) • The slope, or steepness, of the points in a scatter plot is unrelated to the value of the correlation coefficient. • If the points fall on a perfectly straight line with a positive slope, then the correlation coefficient equals 1.0 whether that slope is very steep or not steep at all. • What matters for the magnitude of the correlation is how closely the points concentrate around a line, not the steepness of a line.
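A small numerical check of this point, using made-up data and an illustrative helper named `correlation` (an assumption, not part of the course materials): two perfectly straight lines with very different slopes both give r = 1, while a scattered pattern with a steep trend gives a smaller r.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

# Made-up data: points exactly on a steep line and on a shallow line.
steep_line = [10 * x + 2 for x in xs]
shallow_line = [0.1 * x + 2 for x in xs]

# Made-up data: a steep overall trend, but the points scatter around it.
scattered = [10, 30, 20, 50, 40]

r_steep = correlation(xs, steep_line)
r_shallow = correlation(xs, shallow_line)
r_scattered = correlation(xs, scattered)

print(f"steep line:     r = {r_steep:.3f}")
print(f"shallow line:   r = {r_shallow:.3f}")
print(f"scattered data: r = {r_scattered:.3f}")
```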
Correlation Coefficient (r) • Before calculating the correlation, you need to enable the diagnostics option • Press 2nd, 0 and scroll down to find DiagnosticOn, then press ENTER twice. • Enter the data in L1 and L2 in the calculator • Go to STAT, EDIT • Run the least squares regression • Go to STAT, CALC, 8: LinReg(a+bx) • Enter L1, L2 (the lists where you entered the data) • Press ENTER to calculate the correlation coefficient (r) and/or the coefficient of determination (r²)
Relationship in Data Topic 28 Least Squares Regression
Linear Equation • The equation of a generic line can be written as ŷ = a + bx (in algebra class, y = mx + b) • where y denotes the response variable. The terms “least squares line” and “regression line” are used interchangeably. • x denotes the explanatory variable (also called the predictor variable). • For example, if x represents foot length and y represents height, it is good form to use the variable names in the equation. • a = y-intercept • b = slope of the line (so b here plays the role of m in the algebra-class form) • The caret on the y (read as “y-hat”) indicates that its values are predicted, not actual, heights.
Residuals • One way to measure the “fit” of a line is to calculate the residuals for all of the observational units. • A residual is the difference between the observed y-value and the y-value predicted by your line for the corresponding x-value. • In other words, the residual is the signed vertical distance from an observation to the regression line.
Regression Line • One of the primary uses of regression is prediction. • You can use the regression line to predict the value of the y-variable for a given value of the x-variable simply by plugging that value of x into the equation of the regression line. • This process is equivalent to finding the y-value of the point on the regression line corresponding to the x-value of interest. • The most common criterion for determining the “best” line is to minimize the sum of squared residuals (SSE). • The line that achieves the exact minimum value of the sum of the squared residuals is called the least squares line, or the regression line. • Remember to provide measurement units when reporting predictions. In other words, be clear that the predicted height is in inches, not centimeters or any other units.
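As a minimal sketch of the residual calculation, assume a hypothetical line ŷ = 40 + 1.0x relating foot length to height. The line and the three observations below are invented for illustration and do not come from the textbook activity.

```python
# Hypothetical line y-hat = a + b*x (coefficients invented for illustration).
a = 40.0  # y-intercept
b = 1.0   # slope

# Made-up observations: foot lengths (x) and observed heights (y).
foot_lengths = [24.0, 26.0, 28.0]
heights = [65.0, 66.0, 67.0]

# Predicted height for each observed foot length.
predicted = [a + b * x for x in foot_lengths]

# Residual = observed minus predicted, in that order.
residuals = [y - y_hat for y, y_hat in zip(heights, predicted)]

for x, y, y_hat, e in zip(foot_lengths, heights, predicted, residuals):
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}: observed={y}, predicted={y_hat}, residual={e} ({position} the line)")
```

A positive residual means the observation lies above the line, a negative residual means it lies below, and zero means the point is exactly on the line.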
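The least squares criterion and the prediction step can be sketched as follows. The data are made up, and the helper names `least_squares` and `sse` are illustrative assumptions. The sketch fits the line, then checks that nudging either coefficient makes the sum of squared residuals larger.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
sse_best = sse(a, b, xs, ys)

# Any other line should do worse on this criterion.
sse_steeper = sse(a, b + 0.1, xs, ys)
sse_shifted = sse(a + 0.1, b, xs, ys)

# Prediction: plug the x-value of interest into the fitted equation.
y_hat = a + b * 3.5

print(f"least squares line: y-hat = {a:.2f} + {b:.2f} x")
print(f"SSE: best={sse_best:.2f}, steeper={sse_steeper:.2f}, shifted={sse_shifted:.2f}")
print(f"predicted y at x = 3.5: {y_hat:.2f} (report in the units of y)")
```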
Interpolation • Interpolation means trying to predict the response variable for values of the explanatory variable within those contained in the data.
Extrapolation • Extrapolation means trying to predict the response variable for values of the explanatory variable beyond those contained in the data. • When you have no information about the behavior of the data outside the values contained in your dataset (e.g., you have no reason to believe the relationship between height and foot length remains roughly linear beyond these values), extrapolation is not advisable.
An observation is considered influential if removing it from the dataset substantially changes the least squares regression equation. • Typically, observations that have extreme explanatory (x) variable values (far below or far above the sample mean x-bar) have more potential to be influential.
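One way to make the distinction concrete is to check whether the x-value of interest falls inside the range of observed x-values before predicting. The function name `prediction_kind` and the foot-length values are hypothetical, chosen only for illustration.

```python
def prediction_kind(x_new, observed_xs):
    """Classify a prediction at x_new as interpolation or extrapolation."""
    if min(observed_xs) <= x_new <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

# Made-up foot lengths for illustration.
foot_lengths = [24.0, 25.5, 26.0, 27.5, 29.0]

inside = prediction_kind(26.5, foot_lengths)   # within the observed range
outside = prediction_kind(35.0, foot_lengths)  # beyond the observed range

print(inside, outside)
```

A check like this only flags the risk. Whether the linear pattern continues outside the data is a question the data themselves cannot answer.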
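Influence can be illustrated by refitting the line with one observation removed and comparing slopes. The data below are made up, and `slope` and `slope_without` are illustrative helper names. The point with x = 20 sits far from the other x-values, and removing it changes the slope far more than removing a point near the middle.

```python
from statistics import mean

def slope(xs, ys):
    """Slope b of the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def slope_without(points, index):
    """Least squares slope after dropping the observation at `index`."""
    kept = [p for i, p in enumerate(points) if i != index]
    xs, ys = zip(*kept)
    return slope(xs, ys)

# Made-up data: the last point has an extreme x-value.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5), (20, 0)]
all_xs, all_ys = zip(*points)

b_full = slope(all_xs, all_ys)
b_drop_extreme = slope_without(points, 5)  # remove (20, 0)
b_drop_middle = slope_without(points, 2)   # remove (3, 5)

print(f"all points:          b = {b_full:.3f}")
print(f"without (20, 0):     b = {b_drop_extreme:.3f}")
print(f"without (3, 5):      b = {b_drop_middle:.3f}")
```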
The Coefficient of Determination • The coefficient of determination is equal to the square of the correlation coefficient, so it is denoted by r² (where r represents the correlation coefficient). • r² does not represent the proportion of points that fall on the line, or the proportion of the y-variable that is explained by the x-variable. • Rather, r² is the proportion of the variability in the y-variable that is explained by the least squares line with the x-variable. • Of course, when writing your interpretation in a given context, use the variable names rather than generic x and y labels.
You have not yet considered how to calculate the slope and intercept coefficients of the least squares line. Let the equation of a generic least squares line be ŷ = a + bx. • The most convenient expressions for calculating the intercept and slope coefficients for the least squares line involve the means and standard deviations of the two variables, along with the correlation coefficient between them. • It turns out the slope can be calculated from b = r × (s_y / s_x), where s_x and s_y are the standard deviations of x and y. • The intercept coefficient can then be calculated from a = y-bar − b × x-bar.
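The interpretation of r² as a proportion of variability can be checked directly: for the least squares line, r² equals 1 − SSE/SST, where SST is the total sum of squared deviations of y from its mean. The sketch below uses made-up data, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Route 1: square the correlation coefficient.
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Route 2: proportion of variability in y explained by the least squares line.
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variability
sst = syy                                                  # total variability in y
explained_proportion = 1 - sse / sst

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"1 - SSE/SST = {explained_proportion:.4f}")
```

For these made-up values, 60% of the variability in y is explained by the least squares line with x, which is not the same as saying 60% of the points fall on the line.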
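The two formulas on this slide can be applied step by step from summary statistics. This is a minimal sketch on made-up data. It uses the sample standard deviation (`statistics.stdev`) for s_x and s_y and computes r by hand.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Summary statistics: means and sample standard deviations.
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation coefficient r.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / sqrt(sxx * syy)

# Slope and intercept from the slide's formulas.
b = r * s_y / s_x      # b = r * (s_y / s_x)
a = y_bar - b * x_bar  # a = y-bar - b * x-bar

print(f"r = {r:.4f}, s_x = {s_x:.4f}, s_y = {s_y:.4f}")
print(f"least squares line: y-hat = {a:.2f} + {b:.2f} x")
```

Because a = y-bar − b × x-bar, the fitted line always passes through the point (x-bar, y-bar).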
Transformation • When a straight line is not the best mathematical model for a relationship, you can often transform one or both variables to make the association more linear. • A transformation is a mathematical function applied to a variable, re-expressing that variable on a different scale. • Common transformations include logarithm, square root, and other powers. • Often trial and error is needed to select the transformation that establishes a linear relationship.
Exercise 28-6: Airfares – Page 634 • Enter the data in L1 and L2 in the calculator • Go to STAT, EDIT • Run the least squares regression • Go to STAT, CALC, 8: LinReg(a+bx) • Enter L1, L2 (the lists where you entered the data) • Press ENTER to calculate the regression line
Review • Notice that the word “coefficient” appears often here. Be especially careful not to confuse a slope coefficient with a correlation coefficient. • A common theme in statistical modeling is to think of each data point as being composed of two parts: • the part that is explained by the model (often called the fit) and • the “leftover” part (often called the residual) that is the result either of chance variation or of variables you have not yet considered or measured. • In the context of least squares regression, the fitted value for an observation is simply the y-value that the regression line would predict for the x-value of that observation (i.e., the fitted value is ŷ). • The residual is the difference between the actual y-value and the fitted value ŷ (residual = actual − fitted), so the residual measures the signed vertical distance from the observed y-value to the regression line.
Review • Be sure to subtract in the correct order (observed minus predicted) when calculating a residual. (Remember that points above the line have positive residuals.) • Never take a prediction very seriously if it results from extrapolating well beyond the actual data. • Remember not to generalize from the sample data to a larger population unless the sample was drawn randomly or you have some other reason to believe the sample is representative of the population.
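As a sketch of how a transformation can straighten a relationship, the made-up data below grow exponentially (y doubles with each unit of x). Taking the logarithm of y puts the points on a straight line, which shows up as a correlation of 1 on the transformed scale. The helper `correlation` is an illustrative name, and the logarithm is only one of the transformations the slide lists.

```python
from math import log, sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data with exponential growth: y = 3 * 2**x.
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

# Re-express y on the logarithmic scale.
log_ys = [log(y) for y in ys]

r_raw = correlation(xs, ys)
r_log = correlation(xs, log_ys)

print(f"x vs y:      r = {r_raw:.4f}")
print(f"x vs log(y): r = {r_log:.4f}")
```

With real data the improvement is rarely this clean, so a scatter plot of the transformed variables is still the way to judge whether the form has become linear.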