Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista

Presentation Transcript


  1. Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista

  2. Agenda: Context, Problem definition, Examples, Summary

  3. Context

  4. 17,293,822,600,000,000,000 bytes [1]. 15 exabytes = 1.5 million times the size of all books in the Library of Congress [2]. [1] Rick Smolan, Jennifer Erwitt, The Human Face of Big Data, 2012, ISBN-10: 1454908270. [2] Peter Lyman, Hal R. Varian, "How Much Information?" (2000-10-18).
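A quick check of the arithmetic on this slide, as a minimal sketch: the byte count matches 15 exabytes only under the binary definition (2^60 bytes per exabyte); under the decimal definition (10^18 bytes) it comes to roughly 17.3. The constant and function names are illustrative and do not come from the talk.

```python
# Byte count quoted on the slide.
TOTAL_BYTES = 17_293_822_600_000_000_000


def to_binary_exabytes(n_bytes: int) -> float:
    """Convert bytes to binary exabytes (1 EiB = 2**60 bytes)."""
    return n_bytes / 2**60


def to_decimal_exabytes(n_bytes: int) -> float:
    """Convert bytes to decimal exabytes (1 EB = 10**18 bytes)."""
    return n_bytes / 10**18


if __name__ == "__main__":
    print(f"binary:  {to_binary_exabytes(TOTAL_BYTES):.2f}")
    print(f"decimal: {to_decimal_exabytes(TOTAL_BYTES):.2f}")
```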

  5. Cost of storing every single book ever written (~130 million books [4]): 1984, US$1 billion; 2014, US$3,000 [3]. [3] A history of storage cost, Matthew Komorowski, 2009. [4] There are 130 Million Books in the World, How Many Have You Read?, Wallace Yovetich, 2009.

  6. Cost of processing power [6]: 1996, ASCI Red supercomputer (6,000 Pentium Pro processors), $67,000,000; 2014, Xbox One, $399. [6] The history of supercomputers, Sebastian Anthony, 2012.

  7. Information is only useful if it's accessible… 1989: Tim Berners-Lee writes his initial proposal for the web. August 1991: first website from CERN goes online, including the first index. Concepts. Circa 1992: index discontinued. Research.

  8. All 29 websites! Web – circa 1992

  9. “If you notice something incorrect or have any comment which you don't think is a FAQ, feel free to mail me” Phone +1 (617)253 5702, fax +1 (617)258 8682, email: timbl@w3.org

  10. http://www. History behind

  11. Web started growing and there was a need to search on it

  12. Archie, circa 1990, by Alan Emtage and Peter J. Deutsch. It simply contacted a list of FTP archives on a regular basis and stored their listings locally. Search functionality used Unix grep.

  13. 24 Years Later…

  14. 2 trillion queries per year 2.8 billion Users

  15. The indexable web is ~40 trillion pages. A couple of weeks to read... 5,700 web pages per person.

  16. This is just 1 search (we make 2 trillion searches per year) A lot more time to complete a search…

  17. Agenda: Context, Problem definition, Examples, Summary

  18. Problem definition: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior. What can we learn from what people are searching for? Search focus: relevance and performance.

  19. Agenda: Context, Problem definition, Examples, Summary

  20. Examples: Breaking news, Wake-up time, Drug interactions, Seasonal flu

  21. Breaking News Detection

  22. Breaking News Detection • Daily traffic follows a very stable pattern • We build a model to predict query volume on a per-minute basis • If there are no rare events, predicting query volume during the day is very accurate • The model works, with some variation, at the country, state, or city level

  23. Anomaly detection problem: we compare the daily traffic against the prediction and measure how much they deviate. Spike location: Boston. Z-score: +7.
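The comparison of actual traffic against a prediction can be sketched as follows. The talk does not say which forecasting model is used, so this sketch assumes the simplest possible baseline: the historical mean and standard deviation of query volume for each time slot. The function names, the threshold, and the toy numbers are all hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """history: list of past days, each a list of query counts per time slot.

    Returns one (mean, standard deviation) pair per slot, computed across days.
    """
    slots = zip(*history)  # regroup the counts by time slot
    return [(mean(s), stdev(s)) for s in slots]


def z_scores(baseline, today):
    """Deviation of today's volume from the baseline, in standard deviations."""
    return [
        (actual - mu) / sigma if sigma > 0 else 0.0
        for (mu, sigma), actual in zip(baseline, today)
    ]


def find_spikes(baseline, today, threshold=5.0):
    """Return (slot index, z-score) for every slot at or above the threshold."""
    return [(i, z) for i, z in enumerate(z_scores(baseline, today)) if z >= threshold]


# Toy data: four ordinary days and one day with a surge in slot 3.
HISTORY = [
    [100, 120, 150, 130, 110],
    [98, 123, 148, 133, 108],
    [102, 118, 152, 128, 112],
    [100, 119, 150, 129, 110],
]
TODAY = [101, 121, 149, 190, 111]

if __name__ == "__main__":
    for slot, z in find_spikes(fit_baseline(HISTORY), TODAY):
        print(f"slot {slot}: z-score {z:+.1f}")
```

A per-slot mean is only a stand-in for the prediction; a real model would also have to account for day of week, holidays, and growth, and would be fitted separately per country, state, or city.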

  24. Wake up time

  25. Wake-up time methodology: We calculated the time at which we receive 50% of daily peak traffic from each metro area in its local time zone. The 25 cities follow the same general curve across all seven days of the week. While the patterns are the same, we did see a 43-minute shift between the earliest risers and the latest risers. 7:10 7:28 6:43 6:55 7:15 7:32
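The 50%-of-peak rule can be sketched as below. This is one reading of the methodology, assuming the crossing is searched for between the overnight low and the daily peak, so that late-evening traffic just after midnight is not mistaken for the morning ramp. The function names and the hourly toy data are hypothetical.

```python
def wake_up_minute(volumes, slot_minutes=1):
    """Minutes after local midnight at which traffic first reaches 50% of the
    daily peak, scanning forward from the overnight low.

    volumes: query counts per time slot for one local day, starting at midnight.
    """
    peak = max(volumes)
    peak_idx = volumes.index(peak)
    # Overnight low: the quietest slot before the daily peak.
    trough_idx = min(range(peak_idx + 1), key=lambda i: volumes[i])
    target = 0.5 * peak
    for i in range(trough_idx, peak_idx + 1):
        if volumes[i] >= target:
            return i * slot_minutes


def as_clock(minutes):
    """Format minutes after midnight as H:MM."""
    return f"{minutes // 60}:{minutes % 60:02d}"


# Toy hourly traffic for one metro area (24 slots of 60 minutes each).
HOURLY = [40, 25, 15, 10, 12, 20, 45, 70, 85, 95, 100, 98,
          96, 97, 95, 93, 92, 94, 96, 99, 97, 85, 70, 55]

if __name__ == "__main__":
    print(as_clock(wake_up_minute(HOURLY, slot_minutes=60)))
```

With per-minute slots instead of hourly ones, the same function gives minute-level estimates of the kind shown on the slide.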

  26. San Francisco

  27. New York City

  28. Wake up time during the week At what time do we wake up during the week? 7:06 7:05 6:48 7:10 7:01 Monday Thursday Friday Wednesday Tuesday

  29. Detecting Seasonal Influenza Using Search Logs

  30. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza. Epidemics of seasonal influenza are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year. References: Polgreen, P. M., Chen, Y., Pennock, D. M. & Forrest, N. D. Using internet searches for influenza surveillance. Clinical Infectious Diseases 47, 1443–1448 (2008). Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant. Detecting influenza epidemics using search engine query data.

  31. How does it work? CDC publishes national and regional data from these surveillance systems on a weekly basis, typically with a 1-2 week reporting lag. Source: Detecting influenza epidemics using search engine query data, Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant.

  32. Controversy

  33. Signal is definitely relevant. Model can be improved: “all models are wrong but some are useful” (George Box). We need to be careful of [all data] [no-science] approaches. This is NOT a failure for Big Data.

  34. All data no-science ? Discussion Article by Chris Anderson , Wired Magazine, 2008 [13] “… faced with massive data, this approach to science —hypothesis, model, test — is becoming obsolete” “The new availability of huge amounts of data [...] offers a whole new way of understanding the world. Correlation supersedes causation” “There is now a better way. Petabytes allow us to say: Correlation is enough.” “With enough data, the numbers speak for themselves.” [13] http://edge.org/3rd_culture/anderson08/anderson08_index.html

  35. All data, no science approach. 0.81 0.82 Correlation between Flu Trends and guns-related queries. Correlation between CDC flu data and Les Misérables-related queries.
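A small synthetic illustration of why such correlations are easy to find: any two series that both peak in winter correlate strongly, whether or not they are related. The two series below are made up (they are not the data behind this slide), and `pearson` is a hand-written helper rather than a library call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


WEEKS = range(104)  # two full years of weekly data

# Hypothetical flu activity: a yearly cycle peaking in week 0 (midwinter).
flu_activity = [50 + 40 * math.cos(2 * math.pi * w / 52) for w in WEEKS]

# Hypothetical query volume for an unrelated topic that is also popular in
# winter, peaking three weeks later.
unrelated_queries = [200 + 30 * math.cos(2 * math.pi * (w - 3) / 52) for w in WEEKS]

if __name__ == "__main__":
    print(f"r = {pearson(flu_activity, unrelated_queries):.2f}")
```

The only thing the two series share is the calendar, which is why a high correlation by itself says little about whether a query is a meaningful flu signal.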

  36. “Torture the data enough and it will confess..” Ronald Coase

  37. Fooled by randomness

  38. Signal is definitely relevant. Model can be improved: “all models are wrong but some are useful” (George Box). We need to be careful of [all data] [no-science] approaches. This is NOT a failure for Big Data.

  39. Detecting Adverse Drug Interactions

  40. Context: Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market. In the US alone, adverse drug events cause thousands of deaths annually, and their associated medical treatment costs billions of dollars.

  41. Detecting Adverse Drug Interactions: testing the impact of a drug. For each drug, the FDA requires randomized controlled experiments before the drug is released, in order to understand its impact.

  42. Interactions What are interactions?

  43. Web-scale pharmacovigilance: listening to signals from the crowd. Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz. Hypothesis: Internet users may provide early clues about adverse drug events via their online information-seeking.

  44. Test case scenario: paroxetine (an antidepressant) and pravastatin (a cholesterol-lowering drug). The interaction between the two was reported to cause hyperglycemia. Hyperglycemia, or high blood sugar, is a condition in which an excessive amount of glucose circulates in the blood plasma. Source: Web-scale pharmacovigilance: listening to signals from the crowd, Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz.

  45. Methodology: By examining words used in user queries, they sought evidence that searches from people exploring both pravastatin and paroxetine over time (using logs from 2010) would have a higher rate of including hyperglycemia-associated words than searches from people looking up only one of the drugs. Source: Web-scale pharmacovigilance: listening to signals from the crowd, Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz.
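The group comparison in this methodology can be sketched as below. This is not the authors' code: the symptom term list, the plain substring matching, and the toy log are all illustrative assumptions.

```python
# Illustrative stand-ins for hyperglycemia-associated words (not the study's list).
SYMPTOM_TERMS = ("hyperglycemia", "high blood sugar", "blurry vision")


def classify(queries):
    """Return (searched paroxetine, searched pravastatin, searched a symptom term)."""
    text = " ".join(q.lower() for q in queries)
    return (
        "paroxetine" in text,
        "pravastatin" in text,
        any(term in text for term in SYMPTOM_TERMS),
    )


def symptom_rates(logs):
    """logs: mapping of user id to that user's queries over the study period.

    Returns, per group, the share of users who also searched a symptom term.
    """
    counts = {"both": [0, 0], "paroxetine only": [0, 0], "pravastatin only": [0, 0]}
    for queries in logs.values():
        paroxetine, pravastatin, symptom = classify(queries)
        if paroxetine and pravastatin:
            group = "both"
        elif paroxetine:
            group = "paroxetine only"
        elif pravastatin:
            group = "pravastatin only"
        else:
            continue  # user searched for neither drug
        counts[group][0] += 1
        counts[group][1] += int(symptom)
    return {g: (hits / users if users else 0.0) for g, (users, hits) in counts.items()}


# Made-up log, for illustration only.
TOY_LOGS = {
    "u1": ["paroxetine side effects", "pravastatin dosage", "high blood sugar symptoms"],
    "u2": ["pravastatin and paroxetine together", "blurry vision causes"],
    "u3": ["paroxetine dosage", "pravastatin generic"],
    "u4": ["paroxetine withdrawal"],
    "u5": ["paroxetine dosage"],
    "u6": ["pravastatin side effects", "hyperglycemia"],
    "u7": ["pravastatin vs atorvastatin"],
    "u8": ["pravastatin price"],
    "u9": ["football scores"],
}

if __name__ == "__main__":
    for group, rate in symptom_rates(TOY_LOGS).items():
        print(f"{group}: {rate:.2f}")
```

A real analysis would also need a significance test on the difference between groups and some control for users who search symptom terms for unrelated reasons.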

  46. Results: The figure shows that people who search for both paroxetine and pravastatin over the 12-month period are more likely to perform searches on the terms associated with hyperglycemia. The study shows that signals concerning drug interactions can be mined directly from search logs, and it confirms the findings of laboratory studies as well as prior known associations. Source: Web-scale pharmacovigilance: listening to signals from the crowd, Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz.

  47. Agenda: Context, Problem definition, Examples, Summary

  48. Summary

  49. Using Data Science on Internet Search Behavior as a Proxy for Human Behavior. Search logs are a very powerful data set that can be used not only to improve the relevance of search results, but also as a unique data source to solve other problems. This is only a small subset of those problems; we believe it is the tip of the iceberg of this data source's potential. We live in an amazing era, and it is too soon to realize how big the impact of the web on humankind is.

  50. We are living in this era. It is too soon to realize how big the impact of the internet on humankind is. We are at an inflection point in the history of the world.
