
Demographics and Weblog Hackathon – Case Study


Presentation Transcript


  1. Demographics and Weblog Hackathon – Case Study 5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate. Learn by Doing

  2. http://www.meetup.com/HandsOnProgrammingEvents/

  3. Data Mining Hackathon

  4. Funded by Rapleaf • With Motley Fool’s data • App note for Rapleaf/Motley Fool • Template for other hackathons • Did not use AWS; ran R on individual PCs • Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 attendees. Venue was free

  5. Getting more subscribers

  6. Headline Data, Weblog

  7. Demographics

  8. Cleaning Data • training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv • Feature Engineering (loading/cleaning sketch below) • Github:
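
  A minimal loading/cleaning sketch in R, under stated assumptions: the join key "uid" and the label column "subscriber" are placeholder names (the slides don't give them); the categorical predictor names come from the importance table on slide 14.

      train <- read.csv("training.csv")        # 201,000 training records
      demo  <- read.delim("demographics.tsv")  # tab-separated, sparse fields
      entry <- read.delim("entry.tsv")         # 100k weblog entries
      # headlines.tsv (811 MB) is too large to read naively; process it in chunks.

      # Join demographics onto the training labels by a shared user id.
      train <- merge(train, demo, by = "uid", all.x = TRUE)

      # Tree methods split on factors directly, so cast the categorical fields.
      for (col in c("loc", "home", "marital", "sex", "prop", "child", "own"))
        train[[col]] <- as.factor(train[[col]])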

  9. Ensemble Methods • Bagging, Boosting, randomForests (sketch below) • Overfitting • Stability (small changes in the data can produce large changes in predictions) • Previously none of these worked at scale • Small-scale results use R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.)
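
  A minimal R sketch of the three ensemble families named above, assuming the `train` data frame from the previous sketch with a 0/1 `subscriber` label (an assumed column name):

      library(randomForest)  # bagging plus random feature selection
      library(gbm)           # stochastic gradient boosting
      library(ipred)         # plain bootstrap aggregation

      rf  <- randomForest(as.factor(subscriber) ~ ., data = train, ntree = 500)
      bst <- gbm(subscriber ~ ., data = train, distribution = "bernoulli",
                 n.trees = 2000, shrinkage = 0.01, interaction.depth = 3)
      bag <- bagging(as.factor(subscriber) ~ ., data = train, nbagg = 50)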

  10. ROC Curves (Binary Classifiers Only!)

  11. Paid Subscriber ROC curve, AUC ~61%
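
  One way to draw such a curve with the pROC package, reusing the hypothetical `bst` model and `subscriber` label from the sketches above:

      library(pROC)
      p <- predict(bst, newdata = train, n.trees = 2000, type = "response")
      r <- roc(train$subscriber, p)  # ROC is defined for binary classifiers only
      plot(r)                        # the curve shown on this slide
      auc(r)                         # area under the curve, e.g. ~0.61 here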

  12. Boosted Regression Trees Performance • training data ROC score = 0.745 • cv ROC score = 0.737; se = 0.002 • 5.5% below the winning score, without doing any data processing • Random guessing scores 0.50; at 0.737 we are 0.237 (23.7 points) better than random
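
  A hedged sketch of how the training vs. cross-validated ROC (AUC) scores can be computed: with cv.folds > 1, gbm stores held-out-fold predictions in $cv.fitted (on the link scale, which is fine for AUC because AUC is rank-based). The settings here are illustrative, not the winning parameters.

      library(gbm); library(pROC)
      bst <- gbm(subscriber ~ ., data = train, distribution = "bernoulli",
                 n.trees = 3000, shrinkage = 0.01, interaction.depth = 3,
                 cv.folds = 5)
      auc(roc(train$subscriber,
              predict(bst, train, n.trees = 3000, type = "response")))  # cf. 0.745
      auc(roc(train$subscriber, bst$cv.fitted))                         # cf. 0.737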

  13. Contribution of predictor variables

  14. Predictive Importance • Friedman’s measure: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model. Also a measure of sparsity in the data • Fit (partial dependence) plots show one variable’s effect with the other model variables averaged out • Relative influence (%):

      1   pageV     74.0567852
      2   loc       11.0801383
      3   income     4.1565597
      4   age        3.1426519
      5   residlen   3.0813927
      6   home       2.3308287
      7   marital    0.6560258
      8   sex        0.6476549
      9   prop       0.3817017
      10  child      0.2632598
      11  own        0.2030012
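
  The table above is what gbm's relative-influence summary reports (the percentages sum to 100), and the fit plots are its partial-dependence plots. Continuing the earlier sketch:

      summary(bst)                    # relative influence per predictor, sums to 100%
      plot(bst, i.var = "pageV",      # fit plot: pageV effect with the
           n.trees = 3000)            # other variables averaged out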

  15. Behavioral vs. Demographics • Demographics are sparse • Behavioral weblogs are the best source; most sites aren’t using this information correctly • There is no single correct answer: trial and error on features • The features are more important than the algorithm • Linear vs. Nonlinear

  16. Fitted Values (Crappy)

  17. Fitted Values (Better)

  18. Predictor Variable Interaction • Adjusting variable interactions

  19. Variable Interactions

  20. Plot Interactions: age, loc
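
  A two-variable partial-dependence plot like the age/loc panel can be drawn with plot.gbm (for models fitted via dismo::gbm.step, gbm.interactions() will also rank pairwise interaction strengths):

      plot(bst, i.var = c("age", "loc"), n.trees = 3000)  # joint age x loc effect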

  21. Trees vs. other methods • The multiple levels visible in the fitted values suit trees well. Do other variables match this pattern? Simplify the model or add more features; iterate to a better model • No math required: an analyst can read these plots

  22. Number of Trees

  23. Data Set Number of Trees
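
  The tree-count curves on these slides correspond to gbm.perf, which plots training vs. held-out deviance against the number of trees and returns the optimum under the chosen method:

      best.cv  <- gbm.perf(bst, method = "cv")   # requires cv.folds > 1 at fit time
      best.oob <- gbm.perf(bst, method = "OOB")  # cheaper, but tends to pick too few trees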

  24. Hackathon Results

  25. Weblogs only: 68.15%, ~18 points better than random

  26. Demographics add 1%

  27. AWS Advantages • Running multiple instances with different algorithms and parameters using R • To do: add a tutorial, install Screen, fix R GUI bugs • http://amazonlabs.pbworks.com/w/page/28036646/FrontPage

  28. Conclusion • Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing • Tuning using visualization: three parameters to tune, tc (tree complexity), lr (learning rate), and number of trees; two of the three weren’t covered here (see the sketch below) • This isn’t reproducible in Hadoop/Mahout or any open-source code I know of • Other use cases: predicting which item will sell (eBay), search engine ranking • Careful with MR paradigms: Hadoop MR != Couchbase MR
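
  For reference, the tc/lr/tree-count tuning loop the conclusion mentions is what dismo::gbm.step automates: it grows the ensemble stepwise at a fixed tree complexity and learning rate and picks the cross-validation-optimal tree count. Predictor names come from the importance table on slide 14; the `subscriber` label is an assumed name.

      library(dismo)
      m <- gbm.step(data = train,
                    gbm.x = c("pageV", "loc", "income", "age", "residlen",
                              "home", "marital", "sex", "prop", "child", "own"),
                    gbm.y = "subscriber",
                    family = "bernoulli",
                    tree.complexity = 3,   # tc
                    learning.rate = 0.01,  # lr
                    bag.fraction = 0.5)    # tree count selected internally by CV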
