Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke Seidenberg School of Computer Science and Information Systems Pace University White Plains, NY, US shawn@cicoria.com, {js20454w,mm42526w,lc18948w}@pace.edu.
Classification of Titanic Passenger Data and Chances of Surviving the Disaster: Data Mining with Weka and Kaggle Competition Data
Background • Titanic disaster – April 15, 1912 • 1,502 of the 2,224 passengers and crew perished [2] • Researchers still try to identify what determined the chance of survival [2,3]
Kaggle.com • Crowdsourcing and competitions for analytics and data mining • Online presence • Example competition: General Electric (GE) offering $200,000 [1]
Weka: Waikato Environment for Knowledge Analysis • Open source tool • http://www.cs.waikato.ac.nz/ml/weka/ • Collection of machine learning algorithms and analytical tools • Cross-platform – Java-based • Primary authors – researchers at the University of Waikato, NZ
Basic Premise • Which classes of passengers had the greatest impact on survivability in the Titanic disaster? • Sex • Cabin class • Point of departure • Age
Source Data • Kaggle (Kaggle.com) • Titanic Disaster Competition • https://www.kaggle.com/c/titanic-gettingStarted • Used Test Data set
Data Set – Coaxing for Weka • Original Data • Data Modifications
Final Data Format • Final CSV • ARFF Format
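The CSV-to-ARFF step can be sketched roughly as follows. This is a minimal illustration, not the conversion the authors actually performed: the helper `csv_to_arff`, the relation name, the choice of nominal columns, and the sample rows are all assumptions made for the example. It also assumes values contain no spaces or commas, which would need quoting in ARFF.

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal_columns):
    """Convert CSV text to an ARFF string (hypothetical helper).

    Columns named in nominal_columns become nominal attributes whose
    values are collected from the data; all others are declared numeric.
    Empty cells are written as '?', the ARFF missing-value marker.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if name in nominal_columns:
            values = sorted({row[i] for row in data if row[i] != ""})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(cell if cell != "" else "?" for cell in row))
    return "\n".join(lines) + "\n"

# Made-up rows for illustration only; not taken from the Kaggle file.
sample = "Survived,Pclass,Sex,Age\n1,1,female,30\n0,3,male,\n"
print(csv_to_arff(sample, "titanic", {"Survived", "Pclass", "Sex"}))
```

Declaring the class column (here `Survived`) as nominal rather than numeric matters in practice, since J48 only accepts a nominal class attribute.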
J48 Classifier • C4.5-based • 81% correct classification • 42nd in Kaggle if submitted • Information gain: the amount of information gained by knowing the value of an attribute = (entropy of the distribution before the split) − (entropy of the distribution after it) • Claude Shannon, American mathematician and scientist, 1916–2001
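The information-gain definition above can be worked through in a few lines. The sketch below computes plain information gain exactly as the slide defines it; C4.5 itself normalizes this quantity into a gain ratio, which is not shown. The function names and the four toy rows are assumptions for illustration, not the competition data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted
    entropy of the target after splitting on `attribute`."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Toy rows, invented for the example: Sex separates the classes
# perfectly, Pclass does not separate them at all.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]
print(information_gain(rows, "Sex", "Survived"))
print(information_gain(rows, "Pclass", "Survived"))
```

A tree learner of this family places the attribute with the highest score at the root, which is consistent with Sex appearing at the top of the J48 tree.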
J48 Tree Diagram • Sex largest impact • Cabin Class • Departure point
Simple K Means Clustering • Sex had clear clustering impact
Simple K Means Clustering • Cabin Class showed significant clustering • 3rd class not so great
Simple K Means Clustering • Age group • Hard to distinguish any clustering • Lowest influencer in J48
Simple K Means • Point of Departure • Southampton seems significant • We didn’t identify if departure was associated with Cabin Class – another study needed.
Simple K Means • Point of Departure vs. Survived • Instances colored by class (1st, 2nd, 3rd) • Shows strong association between embarkation point and class
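One simple way to check an association like the one between departure point and cabin class is a contingency table of counts. The sketch below is an assumption about how such a check could be done, not something the authors ran. The `crosstab` helper and the six rows are invented for illustration; the column names `Embarked` and `Pclass` and the port codes S, C, Q follow the Kaggle file.

```python
from collections import Counter

def crosstab(rows, row_key, col_key):
    """Count rows for every (row_key value, col_key value) pair."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_vals = sorted({r[row_key] for r in rows})
    col_vals = sorted({r[col_key] for r in rows})
    return {rv: {cv: counts[(rv, cv)] for cv in col_vals} for rv in row_vals}

# Invented rows, for illustration only.
passengers = [
    {"Embarked": "S", "Pclass": 3},
    {"Embarked": "S", "Pclass": 3},
    {"Embarked": "S", "Pclass": 1},
    {"Embarked": "C", "Pclass": 1},
    {"Embarked": "C", "Pclass": 1},
    {"Embarked": "Q", "Pclass": 3},
]
table = crosstab(passengers, "Embarked", "Pclass")
for port, by_class in table.items():
    print(port, by_class)
```

If the counts in each row concentrate in different columns, the two attributes are associated; roughly proportional rows suggest they are not.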
Conclusions and Summary • Sex clearly had the most significant impact on the survival rate • J48 classifier ~ 81% correctly classified instances • Kaggle competition 43rd place.
Finally • Weka is powerful; however: • It requires significant coaxing of the data into a more amenable format • At first, we had chosen baseball statistics • Became overwhelmed • Baseball statistics were tossed out very late in our project • Kaggle to the rescue • Stumbled upon this dataset • Simple manipulation produced a compatible ARFF format • Demonstrated which classes of passengers had the greatest impact on survivability.
References [1] GE, “Flight Quest Challenge,” Kaggle.com. [Online]. Available: https://www.gequest.com/c/flight2-main. [Accessed: 13-Dec-2013]. [2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013]. [3] Wiki, “Titanic.” [Online]. Available: http://en.wikipedia.org/wiki/Titanic. [Accessed: 13-Dec-2013]. [4] Kaggle, “Data Science Community.” [Online]. Available: http://www.kaggle.com/. [Accessed: 13-Dec-2013]. [5] “Weka 3: Data Mining Software in Java.” [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 13-Dec-2013]. [6] Wikipedia, “C4.5 Algorithm,” Wikimedia Foundation. [Online]. Available: http://en.wikipedia.org/wiki/C4.5_algorithm. [Accessed: 13-Dec-2013].