
Titanic

Analytic model to predict survival in the Titanic disaster. By Varun Kadekar, vjkadekar@gmail.com.



  1. Titanic: Analytic model to predict survival in the Titanic disaster. By Varun Kadekar, vjkadekar@gmail.com

  2. Contents
     • Problem Description
     • Data Exploration
     • Dependent Variable Dependency
     • Solution Approach
     • Final Logit Equation
     • Validation

  3. Problem Description
     • The goal is to predict the probability of survival from the data available.
     • The dataset train.csv holds the training data and test.csv holds the validation data.
     • The analysis made on train.csv must be applied to the validation data to check the correctness of the solution.

  4. Data Exploration/Preparation
     • Check for outliers in the data. (The data looked good except for a few missing values in the Age column.)
     • Treat the missing values in the Age column by substituting the mean Age, computed separately for each Sex.
     • If the Sex of a row with missing Age is 'female', substitute the mean Age of the female passengers on the ship, which is approximately 28. Likewise, for male passengers the substituted value is approximately 31.
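
The imputation rule on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the author's actual code: the function name `fill_missing_age` and the toy rows are hypothetical, and only the column names `Sex` and `Age` come from the dataset.

```python
from statistics import mean

def fill_missing_age(rows):
    """Replace a missing Age with the mean Age of passengers of the same Sex."""
    # Collect the known ages separately for each sex.
    known = {}
    for row in rows:
        if row["Age"] is not None:
            known.setdefault(row["Sex"], []).append(row["Age"])
    means = {sex: mean(ages) for sex, ages in known.items()}
    # Build new rows so the input list is left unchanged.
    return [
        {**row, "Age": means[row["Sex"]] if row["Age"] is None else row["Age"]}
        for row in rows
    ]

# Toy rows made up for illustration (not taken from train.csv).
rows = [
    {"Sex": "female", "Age": 20.0},
    {"Sex": "female", "Age": 36.0},
    {"Sex": "female", "Age": None},
    {"Sex": "male", "Age": 30.0},
    {"Sex": "male", "Age": None},
]
filled = fill_missing_age(rows)
```

In this toy input the missing female age becomes the female mean and the missing male age becomes the male mean. The sketch assumes every Sex value has at least one known Age; otherwise the lookup in `means` would fail.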

  5. Dependent Variable Dependency
     • The correlation between Pclass and Survival shows that more passengers from the higher class survived and more passengers from the lower class died.
     • Stats are given below and charts on the next slide.
     • We can observe that 372 out of 491 passengers in Pclass 3 did not survive the accident, while more than 50% of passengers from the higher class survived.
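
A per-class tally like the one quoted on this slide could be computed as follows. This is a minimal sketch: the function name `survival_by_class` is hypothetical, and the rows are synthetic, built only to reproduce the Pclass 3 figures stated above (372 deaths out of 491, hence 119 survivors).

```python
from collections import Counter

def survival_by_class(rows):
    """Return {pclass: (survived, died, survival_rate)} for each class present."""
    counts = Counter((row["Pclass"], row["Survived"]) for row in rows)
    summary = {}
    for pclass in sorted({p for p, _ in counts}):
        survived = counts[(pclass, 1)]
        died = counts[(pclass, 0)]
        summary[pclass] = (survived, died, survived / (survived + died))
    return summary

# Synthetic rows matching the slide's Pclass 3 counts: 372 died, 491 - 372 = 119 survived.
rows = [{"Pclass": 3, "Survived": 0}] * 372 + [{"Pclass": 3, "Survived": 1}] * 119
summary = survival_by_class(rows)
survived, died, rate = summary[3]
```

With these counts the Pclass 3 survival rate is 119/491, roughly 24%, which is the complement of the 372 deaths reported on the slide.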

  6. Dependent Variable Contd… • The chart below gives a clear picture of the number of survivors against the number of deaths for each Pclass.

  7. Dependent Variable Contd… • The correlation between Age and survival shows that more passengers below the age of 10 survived, and that the percentage of survival falls as age increases.

  8. Dependent Variable Contd… • The correlation between Sex and survival shows that more men died.

  9. Solution Approach
     • The correlation analysis showed that only the following variables have a significant impact: Pclass, SibSp, Age and Sex.
     • Age is a continuous variable, so it needs to be changed to a categorical variable. Here is the approach I took: Age Bucket is '0' if 0 < Age <= 10; Age Bucket is '1' if 10 < Age <= 30; Age Bucket is '2' if Age is greater than 30.
     • Sex is changed from a character variable to a numeric one: Sex_Num is '0' if Sex is 'female', else Sex_Num is '1'.
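
The two derived variables on this slide can be sketched as small helper functions. The function names `age_bucket` and `sex_num` are hypothetical; the treatment of the boundary ages 10 and 30 is taken from the ranges given on the Final Logit Equation slide (0 < age <= 10, 10 < age <= 30).

```python
def age_bucket(age):
    """Map a continuous age to the categorical Age Bucket (0, 1 or 2)."""
    if age <= 10:
        return 0   # children: 0 < age <= 10
    if age <= 30:
        return 1   # 10 < age <= 30
    return 2       # age greater than 30

def sex_num(sex):
    """Encode Sex as a number: 0 for 'female', 1 otherwise."""
    return 0 if sex == "female" else 1

# A few illustrative values, including the bucket boundaries.
examples = [(age, age_bucket(age)) for age in (4, 10, 10.5, 30, 45)]
```

The sketch does not validate its input, so a missing or non-positive age should be handled before calling `age_bucket`.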

  10. Final Logit Equation
     • The logit model, run on the dependent variable with the independent variables explained in the previous slide, gives the logit equation below for the probability of survival.
     • PClass --> value of PClass in the input file.
     • age_buck --> '0' if 0 < age <= 10; '1' if 10 < age <= 30; '2' if 30 < age < 100.
     • SibSp --> value from the input file.
     • Sex_numeric --> a derived variable: '1' if Sex in the input is 'male', else '0'.

     Prob of Survival = exp(M) / (1 + exp(M))
     where M = 4.7905 + (PClass)*(-1.1010) + (age_buck)*(-0.7365) + (SibSp)*(-0.3584) + (Sex_numeric)*(-2.6210)
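
To make the equation concrete, here is a small Python sketch that evaluates it. The coefficients are copied from the slide; the function name `survival_probability` and the two example passengers are hypothetical.

```python
import math

# Intercept and coefficients as given on the slide.
INTERCEPT = 4.7905
COEF = {
    "PClass": -1.1010,
    "age_buck": -0.7365,
    "SibSp": -0.3584,
    "Sex_numeric": -2.6210,
}

def survival_probability(pclass, age_buck, sibsp, sex_numeric):
    """Evaluate Prob of Survival = exp(M) / (1 + exp(M)) for one passenger."""
    m = (
        INTERCEPT
        + COEF["PClass"] * pclass
        + COEF["age_buck"] * age_buck
        + COEF["SibSp"] * sibsp
        + COEF["Sex_numeric"] * sex_numeric
    )
    # 1 / (1 + exp(-M)) is algebraically the same as exp(M) / (1 + exp(M)).
    return 1.0 / (1.0 + math.exp(-m))

# Hypothetical passengers: a first-class woman aged 25 and a third-class man aged 40,
# both travelling without siblings or spouse (SibSp = 0).
p_woman = survival_probability(pclass=1, age_buck=1, sibsp=0, sex_numeric=0)
p_man = survival_probability(pclass=3, age_buck=2, sibsp=0, sex_numeric=1)
```

For these two inputs M is 2.9530 and -2.6065, giving probabilities of about 0.95 and about 0.07. Since every coefficient is negative, a higher class number, an older age bucket, more siblings or spouses aboard, and being male each lower the predicted probability.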

  11. Validation
     • The logit equation was applied to the validation dataset, test.csv.
     • The chart below shows the probability of survival. The model appears to have predicted the probability of survival correctly.
     • We can observe that where the model predicted a probability of survival of more than 90%, those passengers did in reality survive.
     • As the predicted probability of survival falls, we can observe that more passengers actually died.
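
One way to run the check described on this slide is to group passengers into bands of predicted probability and compare each band with the observed survival rate. The sketch below assumes that actual outcomes are available alongside the predictions; the function name `survival_rate_by_band` and the toy values are hypothetical, not results from the presentation.

```python
def survival_rate_by_band(probs, actual, n_bands=10):
    """Return {band_index: (count, observed_survival_rate)} for equal-width bands."""
    totals = {}
    for p, y in zip(probs, actual):
        # Band 0 covers [0.0, 0.1), band 9 covers [0.9, 1.0] when n_bands is 10.
        band = min(int(p * n_bands), n_bands - 1)
        count, survived = totals.get(band, (0, 0))
        totals[band] = (count + 1, survived + y)
    return {
        band: (count, survived / count)
        for band, (count, survived) in sorted(totals.items())
    }

# Toy predictions and outcomes, made up for illustration.
probs = [0.95, 0.92, 0.55, 0.52, 0.08, 0.05, 0.03]
actual = [1, 1, 1, 0, 0, 0, 1]
bands = survival_rate_by_band(probs, actual)
```

If the model is well calibrated, the observed rate in each band should rise with the band index, which is the pattern the slide reports for the top band (above 90%) and the lower ones.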

  12. Validation
     • Additional validation proof is attached below. In the Excel sheet below, column P shows the probability of survival predicted by the model.
     • Column O shows the actual Survival variable from the myfirstforest.csv dataset.

  13. Thank you… Varun Kadekar, vjkadekar@gmail.com
