Understanding regression
A regression is an average • Experiment: Imagine that you are looking at people coming through a door. Imagine also that you had “metric eyes” (rather like Superman’s x-ray vision) and could accurately estimate the height of each person as they passed through. After 10 people had gone through the door, what would be the best prediction for the height of the eleventh person? • Answer – the average • This is why the “average” is also called the “expected value.”
The expected value of the height of the 11th is the average of the previous 10.
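The door experiment can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the ten heights are made-up values, and the helper name `sum_squared_error` is hypothetical. It shows that the average is the guess with the smallest total squared error over the people already seen, which is the sense in which it is the "best" prediction.

```python
from statistics import mean

# Hypothetical heights (cm) of the first 10 people through the door.
heights = [172, 165, 180, 158, 169, 175, 162, 171, 168, 160]

# The prediction for the 11th person is the average of the first 10.
prediction = mean(heights)


def sum_squared_error(guess, observed):
    """Total squared distance between one guess and every observed height."""
    return sum((h - guess) ** 2 for h in observed)


# Any guess other than the average does worse on squared error.
for guess in (prediction - 1, prediction, prediction + 1):
    print(guess, sum_squared_error(guess, heights))
```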
Imagine that as you are estimating the height of the persons coming through the door, you also note their gender. Information on gender improves our ability to predict height.
Regression • Two basic purposes: • Explanation • Prediction • Regression is an efficient way to analyze the structure of the data. • A regression model is a sentence that expresses the average or expected value of something (a person’s height) in terms of several dimensions at once (multivariate analysis).
The regression sentence • The regression equation may be read as a sentence that summarizes the simultaneous influence of independent variables (causes or drivers) on a single dependent variable (effect or outcome). • Here is a simple, single-variable model: Height = 165 + 5D (D = 1 for a man and 0 for a woman). The 5 is the regression coefficient on D. • The regression sentence: The predicted (expected) height for people coming through the door is 165 cm, plus 5 cm if that person is a man. • In other words: Women have an expected height of 165 cm and men have an expected height of 170 cm.
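A short sketch of why a regression on a single dummy variable is just a pair of averages. The six heights are invented so that women average 165 cm and men average 170 cm, matching the model above; the variable names are illustrative assumptions. The slope and intercept come from the standard one-variable least-squares formulas.

```python
from statistics import mean

# Hypothetical data chosen so the group averages match the slide's model.
women = [160, 165, 170]   # average 165 cm
men = [166, 170, 174]     # average 170 cm

d = [0] * len(women) + [1] * len(men)   # D = 1 for a man, 0 for a woman
height = women + men

# One-variable least squares: slope = cov(D, height) / var(D).
d_bar, h_bar = mean(d), mean(height)
slope = (
    sum((di - d_bar) * (hi - h_bar) for di, hi in zip(d, height))
    / sum((di - d_bar) ** 2 for di in d)
)
intercept = h_bar - slope * d_bar

# The intercept is the women's average; the slope is the men-women gap.
print(f"Height = {intercept:.0f} + {slope:.0f}D")
```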
Adding variables Adding more variables conditions our prediction (expectation) for the height of people. Typical variables could include: • number of litres of milk consumed per week • income of parents ($’000s) • kilometres above sea level at birth
Height = 100 + 15L • [Figure: scatter plot with fitted line. Vertical axis: height (cm), with the line meeting the axis at 100. Horizontal axis: number of litres of milk consumed each week, marked from 0 to 5.] • For every litre consumed, height increases 15 cm. • No milk consumption implies an expected height of 100 cm. • Someone who drinks 20 litres of milk each week has an expected height of 400 cm.
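The milk model can be evaluated directly, as in the sketch below. The function names are hypothetical, and the observed range of 0 to 5 litres is an assumption read off the figure's axis marks rather than something the slide states. The point is only that the 400 cm figure comes from applying the line far outside the data that produced it.

```python
# Assumption: the plotted data span roughly 0 to 5 litres per week.
OBSERVED_RANGE = (0, 5)


def expected_height(litres_per_week):
    """Height = 100 + 15L, the fitted line from the slide."""
    return 100 + 15 * litres_per_week


def in_observed_range(litres_per_week):
    """True when the input lies inside the range the line was fitted on."""
    low, high = OBSERVED_RANGE
    return low <= litres_per_week <= high


for litres in (0, 5, 20):
    print(litres, expected_height(litres), in_observed_range(litres))
```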
Regression sentences • An earnings regression simply relates expected earnings to several variables. Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED (Y = annual income) • “Expected annual income for the sample is $6,000 plus 200.5 times AGE plus 1000.5 times years of education.” • A 30-year-old with 12 years of education can expect to earn: $6,000 + 200.5(30) + 1000.5(12) = $24,021 • For every additional year of education, expected annual income increases by $1,000.50, holding age constant. The 1000.5 is the regression coefficient on YEARS_ED.
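The earnings sentence translates directly into a function. This is a sketch using the coefficients shown on the slide; the function and parameter names are illustrative.

```python
def expected_income(age, years_ed):
    """Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED (annual income in dollars)."""
    return 6000 + 200.5 * age + 1000.5 * years_ed


# The slide's worked example: a 30-year-old with 12 years of education.
income = expected_income(age=30, years_ed=12)

# One more year of education, with age held fixed, adds the coefficient.
education_effect = expected_income(30, 13) - expected_income(30, 12)

print(income, education_effect)
```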
Example - LMAPD impact analysis • Wanted to associate labour market programming with outcomes • Wanted to assess the presence and intensity of programming • Built a regression sentence that expressed this relationship: Hours Worked = a_1 + a_2 Female + a_3 Aboriginal + … + a_(k-1) Employ Inter. + a_k # Employ Inter. • Output appears more complicated, but follows the same principles.
Ex. LMAPD: Estimating VR counselling hours (LMAPD VRhours) • Admin data includes the total amount the VR program spent on services for a particular client, but not the cost of VR counselling. • To estimate VR counselling costs per client, 281 VR clients with currently active VR counsellors were selected. • VR counsellors were given a short questionnaire that included the following question, to be answered for each VR client: “On average, over the entire time that you have been this client’s counsellor, how many hours per month did you spend on this client’s case?”
Ex. LMAPD VRhours • Surveys for 270 clients were returned. • Information from the surveys was merged with the administrative data. • The next step was to run a regression on the sample of 270 VR clients to calculate coefficients for the independent variables (from the admin data), and then to use those coefficients to estimate VR counselling costs for the entire sample of VR clients (n=1,062).
Ex. LMAPD VRhours • Dependent variable: Average monthly time in hours spent by VR counsellors on the clients’ files (survey question) • Independent variables: • Demographic: gender, Aboriginal status, minority status, age, disability type • Service data: urban/rural service delivery region, organization that delivered services
Ex. LMAPD VRhours:Independent variables • Variables in parentheses (X) are the excluded dummy variables from the regression. • Types of variables: • Continuous • Mutually exclusive dummy variable • Not mutually exclusive dummy variable
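A minimal sketch of the two kinds of dummy variable. The category names and helper functions are hypothetical illustrations, not the study's actual coding. For mutually exclusive categories, one category is excluded and serves as the reference group; for categories that are not mutually exclusive, each gets its own 0/1 flag and several can be 1 at once.

```python
def encode_exclusive(value, categories, reference):
    """One 0/1 dummy per category, leaving out the excluded reference."""
    return {c: int(value == c) for c in categories if c != reference}


def encode_non_exclusive(values, categories):
    """One 0/1 dummy per category; more than one may equal 1."""
    return {c: int(c in values) for c in categories}


# Hypothetical mutually exclusive variable: service region, urban excluded.
region = encode_exclusive("rural", ["urban", "rural"], reference="urban")

# Hypothetical non-exclusive variable: a client may have several flags set.
flags = encode_non_exclusive({"physical", "hearing"},
                             ["physical", "hearing", "visual"])

print(region, flags)
```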
Ex. LMAPD VRhours: Coefficients • Aboriginal status is associated with fewer hours per month (-1.14). • Minority status is associated with 3.98 more hours of VR counselling per month. • Rural clients logged slightly more hours in counselling than urban clients (0.17, not statistically significant). • Those with physical and hearing disabilities require substantial support.
Ex. LMAPD VRhours: Regression sentence • VRhours = 2.05 + 0.1 fg - 1.14 ab + 3.98 m - 0.01 ag + 0.2 cd + 5.43 phd + 1.08 psd + 6.34 hd + 0.58 vd - 0.61 ld + 0.17 r - 6.06 smd - 5.16 cpa + 0.61 cnib • Can now use the estimated coefficients and the independent variable values for all 1,062 VR participants to calculate the estimated number of VR hours required for each client.
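Applying the estimated coefficients to a client record can be sketched as follows. The coefficients and abbreviations are copied from the regression sentence above; the example client is invented, and reading `ab` as Aboriginal status, `r` as rural and `ag` as age in years is an assumption based on the coefficient values discussed earlier.

```python
INTERCEPT = 2.05
COEFFICIENTS = {
    "fg": 0.1, "ab": -1.14, "m": 3.98, "ag": -0.01, "cd": 0.2,
    "phd": 5.43, "psd": 1.08, "hd": 6.34, "vd": 0.58, "ld": -0.61,
    "r": 0.17, "smd": -6.06, "cpa": -5.16, "cnib": 0.61,
}


def predicted_vr_hours(client):
    """Intercept plus coefficient times value; missing variables count as 0."""
    return INTERCEPT + sum(
        coef * client.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical client: ab = 1, r = 1, ag = 40, every other variable 0.
example_client = {"ab": 1, "r": 1, "ag": 40}
hours = predicted_vr_hours(example_client)
print(round(hours, 2))
```

Running the same function over every record in the administrative data would give one estimate per client.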
Assessing the quality of a regression • Goodness of fit (R²) measures the proportion of the variation in Y explained by the model. R² varies between 0 (low) and 1 (high).
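Goodness of fit can be computed from actual and predicted values using the usual definition, one minus the ratio of unexplained to total variation. The sketch below uses invented numbers, and the function name is illustrative.

```python
from statistics import mean


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_total = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_residual / ss_total


# Hypothetical actual values and model predictions.
y = [1, 2, 3, 4]
y_hat = [1.1, 1.9, 3.2, 3.8]

fit = r_squared(y, y_hat)

# Predicting the plain average for everyone explains none of the variation.
baseline = r_squared(y, [mean(y)] * len(y))

print(round(fit, 2), baseline)
```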
Assessing the quality of a regression • Statistical significance • The larger the coefficient relative to its standard error, the more confident we are that it is not zero. • The lower the standard error, the more confident we are that we have measured the effect reliably. • The coefficient divided by its standard error is the t value. • The rule of 2 is applied again as a “t” test: a t value above about 2 in absolute value indicates a coefficient that is statistically different from zero at roughly the 5% level. • Y = 6,000 + 20.5 AGE + 100.5 YEARS_ED, with t values of (2.5) for the constant, (3.8) for AGE, and (1.2) for YEARS_ED. • Computer output reports t values (as above), standard errors, p values, and a host of other diagnostics.
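The rule of 2 is easy to apply mechanically. The sketch below uses the three t values reported with the earnings model on this slide; the function names are illustrative, and the standard error of 5.0 in the last lines is a made-up value used only to show the division.

```python
def t_value(coefficient, standard_error):
    """The t value is the coefficient divided by its standard error."""
    return coefficient / standard_error


def passes_rule_of_two(t):
    """Rule of thumb: |t| above 2 suggests the coefficient is not zero."""
    return abs(t) > 2


# t values reported under the earnings model on the slide.
reported_t = {"constant": 2.5, "AGE": 3.8, "YEARS_ED": 1.2}
significant = {name: passes_rule_of_two(t) for name, t in reported_t.items()}
print(significant)

# Hypothetical standard error, purely to illustrate the calculation.
example_t = t_value(20.5, 5.0)
print(example_t)
```

Under this rule the constant and AGE pass, while YEARS_ED does not.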
Traffic accidents and photo radar for Canada’s largest cities • [Figure: scatter plot. Vertical axis: number of traffic deaths from accidents. Horizontal axis: number of photo radar installations.] Photo radar and traffic safety • Model 1: Deaths = A + B (Number of installations). (The test is whether B is positive.) • Model 2: Deaths = A + B (Year) + C (D), where D = 0 (year < 2000) and D = 1 (year > 2001). (The test is whether C is negative.)
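Model 2 can be sketched with a small least-squares routine. Everything numeric here is synthetic: the death counts are generated from known values (A = 500, B = -2, C = -30) purely to show that the regression recovers the step change C. Year is measured as years since 1994 to keep the arithmetic well behaved, and the `ols` helper is an illustrative implementation of the normal equations, not the software used in the original analysis.

```python
def ols(rows, y):
    """Least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Years before 2000 (D = 0) and after 2001 (D = 1), as in the slide.
years = list(range(1994, 2000)) + list(range(2002, 2008))
rows, deaths = [], []
for year in years:
    t = year - 1994                 # years since 1994 (rescaled Year)
    d = 1 if year > 2001 else 0     # policy dummy
    rows.append([1, t, d])          # columns: constant, Year, D
    deaths.append(500 - 2 * t - 30 * d)   # synthetic outcome, no noise

a, b, c = ols(rows, deaths)
print(round(a, 3), round(b, 3), round(c, 3))
```

A negative estimate of C is the step change the test looks for, over and above the trend captured by B.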
Regression variables • Dependent (Outcome) • Independent (Causal) • Context (age, gender, ethnicity) • Driver (policy) • Policy can be measured directly ($, person years) or as a change in state (dummy variable).
Building a regression model • Identify the dependent (effect or outcome) variable(s). • What are the independent (causal) variables? • Are there policy impacts? • How are these to be measured?