Regression analysis • Objective: • Investigate the interplay of quantitative variables • Identify relations between a dependent variable and one or several independent variables • Make predictions based on observed data • Dependent variable: the variable whose values shall be explained • Independent variables: variables that have an impact on the dependent variable
Regression analysis Linear regression
Linear regression • Here it is assumed that the influence of the independent variables on the dependent variable is linear • One distinguishes between: • Simple linear regression: explanation of a dependent variable by one independent variable • Multiple linear regression: explanation of a dependent variable by several independent variables
Regression analysis Simple linear regression
Linear regression • The variables education and income are considered, where one variable (education) can be assumed to have an impact on the other (income) • Dependent variable Y = (y1, …, yn): income • Independent variable X = (x1, …, xn): education
Linear regression • Basic idea of linear regression: find a straight line which optimally describes the correlation between the two variables
Linear regression • Linear regression model (R: lm(y~x)): Y = β0 + β1·X + ε • Here, β0 is called the (axis) intercept, β1 is the slope, X is the predictor variable and ε is the residual • The residual is the difference between the regression line and the measured values Y • Here, Ŷ = β0 + β1·X is called the estimate of Y and we have: ε = Y − Ŷ
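A minimal R sketch of this model; the education/income values below are simulated for illustration and are not the course data set:

```r
# Simple linear regression Y = beta0 + beta1 * X + epsilon, fitted with lm()
set.seed(1)
education <- rnorm(100, mean = 12, sd = 3)                      # years of education (simulated)
income    <- 20000 + 2500 * education + rnorm(100, sd = 5000)   # income (simulated)

fit <- lm(income ~ education)
coef(fit)           # estimated intercept beta0 and slope beta1
head(fitted(fit))   # Yhat, the estimates of Y
head(resid(fit))    # residuals epsilon = Y - Yhat
```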
Linear regression • Objective: • Estimate the coefficients β0, β1 such that the model fits the data optimally • Prediction of Y values • The straight line shall be chosen in such a way that the squared distances between the values predicted by the model and the empirically observed values are minimized • We want: Σ εi² = Σ (yi − ŷi)² → min
Linear regression • We will obtain estimates of the coefficients, which are also called regression coefficients: • β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² • β̂0 = ȳ − β̂1·x̄ • β̂0 and β̂1 are the least squares (LSQ) estimates of β0 and β1
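The same estimates can be reproduced directly from these formulas; a sketch reusing the simulated education/income data from the sketch above:

```r
# Least-squares estimates "by hand", compared against lm()
x <- education; y <- income
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)    # agrees with coef(lm(income ~ education))
```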
Linear regression • In case the εi are normally distributed we obtain for β0 and β1 the confidence intervals • β̂0 ± t(n−2, 1−α/2) · se(β̂0) and β̂1 ± t(n−2, 1−α/2) · se(β̂1) • where se(β̂0) and se(β̂1) are the respective standard errors of the estimates
Linear regression • We obtain a t-test for the null hypothesis H0: β1 = 0 against the alternative H1: β1 ≠ 0 • Test statistic: T = β̂1 / se(β̂1) • Reject H0 if |T| > t(n−2, 1−α/2)
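In R, the standard errors, the t-test for β1 = 0 and the confidence intervals can be read from the fitted object; a short sketch continuing the simulated fit from above:

```r
summary(fit)$coefficients    # estimates, standard errors, t values and p-values
confint(fit, level = 0.95)   # 95% confidence intervals for beta0 and beta1
```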
Regression analysis Multiple linear regression
Multiple linear regression • Now multiple independent variables X1, …, Xm are considered. A sample of size n now consists of the values (yi, xi1, …, xim), i = 1, …, n • Hence: yi = β0 + β1·xi1 + … + βm·xim + εi • Here, β0 is the intercept, the βj, j = 1, …, m, are the unknown regression coefficients and the εi are the residuals • Matrix notation: Y = Xβ + ε, where X is the n × (m+1) design matrix
Linear regression • Estimation of the regression coefficients is again performed with the least squares method. After extensive calculus one obtains: β̂ = (X'X)⁻¹ X'Y • Estimation of the error variance σ² is obtained according to: σ̂² = Σ ε̂i² / (n − m − 1), where ε̂ = Y − Xβ̂ • The estimation process is computationally demanding (matrix inversion is needed!) and has to be done by computers
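A sketch of the closed-form solution in R, with two simulated predictors (names and values are invented), checked against lm():

```r
set.seed(2)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                            # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))           # same estimates as lm()

m <- 2                                           # number of independent variables
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - m - 1)   # estimate of the error variance
```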
Linear regression • In case the εi are normally distributed we obtain an F-test for the null hypothesis H0: β1 = β2 = … = βm = 0 against the alternative H1: βj ≠ 0 for at least one j • We want to test whether the overall model is significant • Overall F-test statistic: F = (R²/m) / ((1 − R²)/(n − m − 1)) • Reject H0 if F > F(m, n − m − 1, 1 − α)
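summary() of an lm fit reports this overall F-test; it can also be computed from R², as sketched here by continuing the simulated two-predictor example above:

```r
sm <- summary(lm(y ~ x1 + x2))
sm$fstatistic                          # F value with numerator/denominator degrees of freedom
R2 <- sm$r.squared
(R2 / m) / ((1 - R2) / (n - m - 1))    # reproduces the F value from R^2
```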
Regression analysis Logistic regression
Logistic regression • Until now, the dependent variable y was continuous. Now, we consider the situation where there are just two possible outcome values. An example is the dichotomous trait y = "affection status" with the values "1 = affected by the disease" and "0 = not affected" (healthy) • We want to predict the probability that an individual has the value 1 (= is affected) • The range of possible values is [0,1] • => Linear regression cannot be used since the dependent variable is nominal • => Instead, logistic regression is used
Logistic regression • Example of binary logistic regression: • Sample with information on survival of the sinking of the Titanic • Question: was the chance to survive dependent on sex?
Logistic regression • The odds ratio is used to model the chance to survive: • We consider the ratio of the odds of survival for women and the odds of survival for men • The OR of 10.14 indicates that the odds of survival were about 10 times as high for women as for men • From here until slide 22: details for specialists • => A regression on the logarithmic odds (so-called logit) that the 0/1-coded dependent variable takes on the value 1
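A quick sketch of how such an odds ratio can be computed from a 2x2 table in R; the built-in Titanic table (pooled over class and age) is used as an assumed stand-in for the slide's sample and gives roughly the quoted value:

```r
tab <- margin.table(Titanic, margin = c(2, 4))   # margins: 2 = Sex, 4 = Survived
tab                                              # 2x2 table of counts
odds_female <- tab["Female", "Yes"] / tab["Female", "No"]
odds_male   <- tab["Male",   "Yes"] / tab["Male",   "No"]
odds_female / odds_male                          # odds ratio, roughly 10.1
```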
Logistic regression • Logarithmic odds are well suited for regression analysis since their values are in (−∞, +∞) and since they are symmetric around 0 • Regression equation: logit(p) = ln(p / (1 − p)) = β0 + β1·x • The left-hand side is the logarithmic odds that the 0/1-coded variable takes on the value 1; the right-hand side is known from linear regression • Now, the probability that the dependent variable takes on the value 1 given the value x can be computed as: P(Y=1|X=x) = exp(β0 + β1·x) / (1 + exp(β0 + β1·x)) • Estimation of the regression coefficients is now done with maximum-likelihood estimation
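A tiny R sketch of the logit transformation and its inverse (the probabilities used are arbitrary):

```r
p <- c(0.1, 0.5, 0.9)
qlogis(p)          # log(p / (1 - p)): -2.197  0.000  2.197 -- symmetric around 0
plogis(qlogis(p))  # inverse transformation exp(x) / (1 + exp(x)) recovers p
```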
Logistic regression • Probabilities depend on β0 and β1 • Interpretation of the parameters: • β0 determines the probability P(Y=1|X=0) at the value X=0 of the independent variable X: P(Y=1|X=0) = exp(β0) / (1 + exp(β0)) • The larger β0, the larger the probability P(Y=1|X=0) (for the plotted curves, β1 was set to 1)
Logistic regression • β1 determines the slope of the probability function, and thereby how strongly differences in the independent variable X influence differences in the conditional probabilities • β1 > 0: the conditional probability of category 1 ("affected") is a monotonically increasing function of X • β1 < 0: the conditional probability of category 1 ("affected") is a monotonically decreasing function of X • β1 = 0: X and Y are independent (for the plotted curves, β0 was set to 0)
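A sketch of how β0 and β1 shape the curve P(Y=1|X=x) = exp(β0 + β1·x) / (1 + exp(β0 + β1·x)); the chosen parameter values are arbitrary:

```r
x <- seq(-4, 4, by = 0.1)
p <- function(b0, b1, x) plogis(b0 + b1 * x)   # exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
plot(x,  p(0,  1, x), type = "l", ylab = "P(Y = 1 | X = x)")  # b0 = 0, b1 = 1
lines(x, p(2,  1, x), lty = 2)   # larger b0: larger P(Y = 1 | X = 0)
lines(x, p(0, -1, x), lty = 3)   # b1 < 0: monotonically decreasing in x
lines(x, p(0,  0, x), lty = 4)   # b1 = 0: flat curve, X has no influence on Y
```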
Logistic regression • Titanic: • Coding: male=1, female=0, survival=1, non-survival=0 • Compute the logarithmic odds according to logit(p) = β0 + β1·x (in R: glm(y~x, family=binomial("logit"))): • Female (x=0): logit = β0 • Male (x=1): logit = β0 + β1 • The logit coefficients are difficult to interpret, therefore they are transformed back by p = exp(logit) / (1 + exp(logit)) • This back-transformation yields the estimated survival probability for females and for males
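A sketch of this fit in R, again using the built-in Titanic table as an assumed stand-in for the slide's sample; note that R derives the reference category from the factor levels, which may differ from the male=1/female=0 coding above:

```r
# Survival ~ Sex on the built-in Titanic table, pooled over class and age
tit <- subset(as.data.frame(Titanic), Freq > 0)     # columns: Class, Sex, Age, Survived, Freq
fit_tit <- glm(Survived ~ Sex, weights = Freq,
               family = binomial("logit"), data = tit)
coef(fit_tit)               # intercept = logit for the reference sex, slope = difference in logits
exp(coef(fit_tit))          # back-transformed: odds for the reference sex and the odds ratio (~10.1)
plogis(coef(fit_tit)[1])    # survival probability for the reference sex
plogis(sum(coef(fit_tit)))  # survival probability for the other sex
```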
Logistic regression • "Wald test" for the null hypothesis β = 0 • Wald test statistic for a single coefficient: W = (β̂ / se(β̂))² • Reject the null hypothesis if W > χ²(p, 1−α), where p is the number of degrees of freedom (= the number of independent variables tested) • Titanic example: • Wald test for β0: Note: β0 is typically not of interest. Here, it is significant because it detects that there were many more men than women on the Titanic. • Wald test for β1: • => The null hypothesis can be rejected
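In R, summary() of the fitted model reports a Wald test for each coefficient as a z value (the square of a z value is the corresponding 1-df chi-square Wald statistic); a sketch continuing the fit_tit object from the previous sketch:

```r
summary(fit_tit)$coefficients                  # Estimate, Std. Error, z value (Wald), Pr(>|z|)
summary(fit_tit)$coefficients[, "z value"]^2   # 1-df chi-square version of the Wald statistics
```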
Note • So far, the logistic regression example could have been computed with the chi-square test for 2x2 tables. The advantage of the logistic regression is that it can be extended to multiple independent variables and that the independent variables can be continuous.
Logistic regression • Logistic regression with multiple predictor variables (in R: glm(y~x1+x2, family=binomial("logit"))): • Multiple predictor variables can also be analyzed as a "cross-classification" • Example: Does lung function depend on air pollution and smoking? • Dependent variable: lufu = lung function test, "normal"=1, "not normal"=0 • Independent variables: • LV = degree of air pollution, "no"=0, "yes"=1 • Smoking: "no"=0, "yes"=1
Logistic regression • Data:
Logistic regression • Logistic regression with R yields:
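The original data table and R output are not reproduced above; the following sketch uses invented counts (all numbers are made up purely for illustration) to show what the two-predictor call and its coefficient table look like:

```r
# Hypothetical aggregated data: counts of "normal" / "not normal" lung function
# for each combination of air pollution (LV) and smoking -- the counts are invented.
lufu_data <- data.frame(
  LV         = c(0, 0, 1, 1),      # degree of air pollution: no = 0, yes = 1
  smoking    = c(0, 1, 0, 1),      # smoking: no = 0, yes = 1
  normal     = c(80, 60, 55, 30),  # lufu = 1 (normal lung function)
  not_normal = c(20, 40, 45, 70)   # lufu = 0 (not normal)
)

# A two-column response cbind(successes, failures) is an equivalent way to pass 0/1 data to glm()
fit_lufu <- glm(cbind(normal, not_normal) ~ LV + smoking,
                family = binomial("logit"), data = lufu_data)
summary(fit_lufu)$coefficients   # estimates, standard errors, Wald z values, p-values
exp(coef(fit_lufu))              # odds ratios for air pollution and smoking
```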