Logistic regression: Who survived Titanic?
The sinking of Titanic • Titanic sank on the night of April 14th 1912 with 2228 souls on board; 705 survived. • A dataset of 1309 passengers has survived. • Who survived?
The data • Sibsp is the number of siblings and/or spouses accompanying the passenger • Parch is the number of parents and/or children accompanying the passenger • Some values are missing • Can we predict who will survive Titanic II?
Analyzing the data in a (too) simple manner • Associations between factors without considering interactions
Could we use multiple linear regression to predict survival?
Logit transformation is modeled linearly • The logistic function
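In standard notation, the two statements on this slide can be sketched as follows. The symbols p (probability of survival), z (linear predictor), β₀ (intercept), and β₁…β_k (regression coefficients for the explanatory variables x₁…x_k) are assumed here for illustration and are not taken from the slide.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-z}}
```

The left-hand form is the logit transformation that is modeled linearly; the right-hand form is the logistic function that maps z back to a probability between 0 and 1.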
The sigmoidal curve • The intercept basically just shifts the curve along the input axis • Large regression coefficient → risk factor strongly influences the probability • Positive regression coefficient → risk factor increases the probability
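A minimal sketch of how the intercept and the regression coefficient change the curve. The function name `logistic` and all numeric values are hypothetical, chosen only to illustrate the shape: changing the intercept moves the curve along the input axis, a larger coefficient makes it steeper, and a negative coefficient makes the probability fall as the input rises.

```python
import math

def logistic(x, intercept=0.0, coef=1.0):
    """Probability from the logistic function for z = intercept + coef * x."""
    z = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only to illustrate the shape of the curve.
xs = [-4, -2, 0, 2, 4]
settings = [
    ("baseline", 0.0, 1.0),
    ("intercept +2", 2.0, 1.0),
    ("larger coefficient", 0.0, 3.0),
    ("negative coefficient", 0.0, -1.0),
]
for label, b0, b1 in settings:
    probs = [round(logistic(x, b0, b1), 3) for x in xs]
    print(f"{label:22s}", probs)
```

With an intercept of +2 and a coefficient of 1, the probability of 0.5 is reached at x = -2 instead of x = 0, which is the shift described above.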
Logistic regression of the Titanic data • Summary of data • Coding of the dependent variable • Coding of the categorical explanatory variable: • First class: 1 • Second class: 2 • Third class: reference
Logistic regression of the Titanic data • A fit of the null-model, basically just the intercept. Usually not interesting. • The overall probability of survival is 500/1309 = 0.382. The cutoff is 0.5, so all passengers are classified as non-survivors. • Basically tests whether the null-model is sufficient. It almost certainly is not. • Shows that survival is related to pclass (which is not in the null-model)
Logistic regression of the Titanic data • Omnibus test: Uses the likelihood ratio (LR) to test whether adding the pclass variable makes the model better. It did! But only better than the null-model, so no surprise. • Model Summary: Other measures of the goodness of fit. • Classification table: By including pclass, 67.7% of the passengers were correctly categorized. • Variables in the equation: The first line repeats that pclass has a significant effect on survival. B is the fitted logistic parameter. Exp(B) is the odds ratio, so the odds of survival for first-class passengers are 4.7 (3.6-6.3) times those of passengers on third class (reference class)
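The null-model numbers can be reproduced by hand. The sketch below uses only the two counts given on the slide, and relies on the fact that the fitted intercept of an intercept-only logistic model equals the log-odds of the overall event rate. Variable names are illustrative.

```python
import math

survivors = 500      # from the slide
passengers = 1309    # from the slide

p_survive = survivors / passengers     # overall survival probability
odds = p_survive / (1 - p_survive)     # odds of survival
intercept = math.log(odds)             # null-model intercept (log-odds)

# With a 0.5 cutoff every passenger is predicted "did not survive",
# so the null model is right exactly for the non-survivors.
predicted_survive = p_survive >= 0.5
accuracy = (passengers - survivors) / passengers

print(f"p = {p_survive:.3f}, intercept = {intercept:.3f}, accuracy = {accuracy:.1%}")
```

The resulting accuracy is the baseline against which the classification tables of the larger models should be judged.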
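The link between B and Exp(B) can be sketched as follows. The helper `odds_ratio` is hypothetical, and it assumes a Wald-type 95% confidence interval, which is an assumption about how the interval was computed. The coefficient is back-calculated from the Exp(B) of 4.7 quoted above; the standard error is a made-up value, since the slide text does not report it.

```python
import math

def odds_ratio(b, se, z_crit=1.96):
    """Return (odds ratio, lower, upper) for a logistic coefficient b
    with standard error se, using a Wald-type interval."""
    return (math.exp(b),
            math.exp(b - z_crit * se),
            math.exp(b + z_crit * se))

# b is back-calculated from the Exp(B) of 4.7 on the slide; the standard
# error is a hypothetical value, not taken from the SPSS output.
b = math.log(4.7)
se = 0.14
or_, lower, upper = odds_ratio(b, se)
print(f"B = {b:.3f}, Exp(B) = {or_:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Because the interval is computed on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio.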
Logistic regression of the Titanic data now adding family relations • ‘3 or more’ is set as the reference group by SPSS
Logistic regression of the Titanic data now adding family relations • The model correctly classifies 79.1% of the passengers
Logistic regression of the Titanic data now adding family relations • Basically all factors seem to affect the probability of survival.
How was it with age? • Linear associations are easy to model, because the factor enters the predictive value directly. • But it does not really look linear; maybe a third-order polynomial? • Three new factors for age are calculated: the first, second, and third order of the age divided by the standard deviation.
How was it with age? • The third-order age factor did not add significantly to the model. • By adding the third-order polynomial, the model correctly categorizes 79.4% of the passengers vs 79.1% before. • ParChild is no longer a significant factor and can be omitted from the model
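The three age factors could be built roughly like this. It is a sketch under assumptions: the helper `age_terms` is hypothetical, the ages are made-up values rather than Titanic data, and centering on the mean before dividing by the standard deviation is inferred from the later worked example (where a 25-year-old enters as -5/14.41), not stated on this slide. Missing ages would have to be handled before this step.

```python
import statistics

def age_terms(ages):
    """Return (a, a**2, a**3) per passenger, where a is the standardized age."""
    mean = statistics.mean(ages)
    sd = statistics.stdev(ages)      # sample standard deviation
    terms = []
    for age in ages:
        a = (age - mean) / sd        # centering on the mean is an assumption
        terms.append((a, a ** 2, a ** 3))
    return terms

# Hypothetical ages, not taken from the Titanic dataset.
example_ages = [2, 18, 25, 30, 41, 58]
for age, (a1, a2, a3) in zip(example_ages, age_terms(example_ages)):
    print(f"age {age:2d}: {a1:+.2f} {a2:+.2f} {a3:+.2f}")
```

Each of the three columns would then enter the logistic regression as its own explanatory variable.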
Using the model to predict survival • Omitting the second and third order age and ParChild factors • What is the probability that a 25 year old woman accompanied only by her husband holding a second class ticket would survive Titanic? • z = -3.929 - 0.589*(-5)/14.41 + 1.718 + 2.552 + 0.926 = 1.4714
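The slide text stops at the linear predictor z; turning it into a probability takes one more step through the logistic function. The sketch below copies the numeric terms from the slide as they stand. Which factor each constant belongs to is not stated there, so they are kept unlabeled, and the variable names are illustrative.

```python
import math

# Terms copied from the slide; which factor each constant belongs to
# is not stated there, so they are kept as an unlabeled list.
intercept = -3.929
age_term = -0.589 * (-5) / 14.41
other_terms = [1.718, 2.552, 0.926]

z = intercept + age_term + sum(other_terms)
p = 1.0 / (1.0 + math.exp(-z))     # logistic function

print(f"z = {z:.4f}, p = {p:.3f}")
```

This reproduces z = 1.4714 and gives a predicted survival probability of about 0.81, well above the 0.5 cutoff, so the model classifies this passenger as a survivor.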
Analysing interaction of selected factors • pclass * sex, age * sex, pclass * Siblings/Parents • But the model does not converge…
Analysing interaction of selected factors • Collapsing the sibling/spouse number eradicated their mutual interaction