
My cautionary personal note on Data

Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. Lewis Shepherd, Microsoft Institute for Advanced Technology in Government.


Presentation Transcript


  1. Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. Lewis Shepherd, Microsoft Institute for Advanced Technology in Government

  2. My cautionary personal note on Data: “If all others accepted the lie which the Party imposed - if all records told the same tale - then the lie passed into history and became truth. 'Who controls the past' ran the Party slogan, 'controls the future: who controls the present controls the past.'” George Orwell, Nineteen Eighty-Four

  3. Murray Feshbach, Demographer & Revolutionary Spark • Following many years of continuous decline, infant mortality in the Soviet Union started inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974. The TsSU continued to print the infant mortality series for a few years after the alarming reversal of the long-term trend, but it stopped open publication of the data in 1975. • Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980 depicting the deteriorating state of public health in the USSR and--with what later proved to be an accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet Union was continuing to rise. • The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial changes in public health policies. • [Full publication of] Infant mortality rates were not resumed until twelve years later in Narodnoye Khozyaystvo, 1987. • The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of bad news. In the case of infant mortality, as in many similar cases, the data on adverse developments were simply deleted from the open literature. • It took an alarming and well-publicized American report to alert higher authorities to the critical situation and to introduce remedies. Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007

  4. Tim O’Reilly, Government as a Platform Evangelist, on “The World’s 7 Most Powerful Data Scientists” • Elizabeth Warren: The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Finance Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer’s apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight. • …when she was working on the Consumer Finance Protection Board, she was thinking hard about what role technology could play in building a truly 21st century regulatory agency, and in my books, that will have to mean what I've been calling "algorithmic regulation." Forbes.com / G+ / Nov. 3, 2011 (emphasis added) https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1

  5. Tim O’Reilly, Government as a Platform Evangelist, on “The World’s 7 Most Powerful Data Scientists” • My feeling is that someone who is likely to have a major influence on regulating the data scientists on Wall Street is a good person to put on a list like this. Yes, I do want them regulated, and this was a way of giving Elizabeth Warren a push. I do think that if anyone will help stand up for the rest of us, she will. And I wanted a chance to plant a few ideas about how that regulation ought to happen (algorithmically, in the same way that Google manages search quality.) Blog Comment / Nov. 4, 2011 (emphasis added) http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604

  6. Breaking down Data Barriers: Semantic Knowledge for Commodity Computing • Evelyne Viegas, Microsoft Research, USA • Li Ding, Rensselaer Polytechnic Institute • Natasa Milic-Frayling, Microsoft Research, UK • Haixun Wang, Microsoft Research, Asia • Kuansan Wang, Microsoft Research, USA

  7. Vision – Enable Next Generation Experiences by working with academia, stakeholders from industry, government, and consumers/innovators to make sense of data. DATA > INFORMATION > KNOWLEDGE > INTELLIGENCE

  8. Data/Information • To help explore the data value chain, Microsoft’s collaborations provide access to data that enables: • Innovation – With access to real-world data, researchers can uncover new analyses or research directions based on shared assets and explore new questions • Science – With wider use of data, experiments can be repeated and data misrepresentations or faulty results avoided • Training – Real-world, large-scale data is a powerful tool for training the next generation of data analysts and researchers • Cloud-based services: Web Language and Query Language Models • Used to research topics such as human speech, spelling, information extraction, learning, and machine translation.

  9. It’s a data-driven world • Spell Checking • Machine Translation • Search queries + click through • Online games skill matching • … Data logs record behaviours more reliably than demographic studies or surveys for studying and predicting trends. • (Banko and Brill, 2001) – the effectiveness of statistical NLP techniques is highly susceptible to the size of the data used to develop them • (Norvig, 2008) – it is the size of the data, not the sophistication of the algorithms, that ultimately plays the central role in modern NLP

  10. It’s a Data-Driven World: Data has become a first-class citizen

  11. Data for Open Innovation - Challenges: With web users becoming producers of information, leaving the footprint of their lives in digital trails, it is becoming easier for “data snoopers” to reconstruct the identity of an individual or an organization by cross-linking information from different sources.
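The cross-linking risk described on this slide can be illustrated with a minimal sketch. Everything here is hypothetical: the field names (`zip`, `birth_year`, `gender`), the records, and the `link_records` helper are invented for illustration and do not come from the presentation. The idea is only that a "pseudonymous" log and a public directory sharing a few quasi-identifiers can be joined to recover a name.

```python
def link_records(log, directory, keys=("zip", "birth_year", "gender")):
    """Pair pseudonymous log entries with directory names when the
    quasi-identifiers in `keys` match exactly one directory entry."""
    index = {}
    for person in directory:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for rec in log:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the user
            matches.append((rec["user_id"], candidates[0]["name"]))
    return matches


# Hypothetical data: no real people or records.
log = [
    {"user_id": 1001, "zip": "99901", "birth_year": 1944, "gender": "F"},
    {"user_id": 1002, "zip": "99902", "birth_year": 1980, "gender": "M"},
]
directory = [
    {"name": "Alice Example", "zip": "99901", "birth_year": 1944, "gender": "F"},
    {"name": "Bob Sample", "zip": "99902", "birth_year": 1980, "gender": "M"},
    {"name": "Carl Sample", "zip": "99902", "birth_year": 1980, "gender": "M"},
]
linked = link_records(log, directory)
```

In this toy data only user 1001 is linked, because two directory entries share the second user's attributes. The fewer people who share a combination of attributes, the easier the reconstruction becomes.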

  12. A Face Is Exposed for Searcher No. 4417749. “Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears,” said Lauren Weinstein, a privacy advocate. The New York Times, Aug 2006. Thelma Arnold's identity was betrayed by the records of her Web searches.

  13. Web N-gram Services • Access to up to petabytes of real-world data • Leading technology in Search, Machine Translation, Speech, Learning, … http://research.microsoft.com/web-ngram

  14. Web N-Gram in Public Beta • Search engines rely on unigram body … • Web data has structure … and that counts (e.g. Body, Title, Anchor) • Rich context/meta-data ignored • Users form ‘query’ • Reference: Exploring Web Scale Language models for Search Query Processing, in WWW’2010
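As a rough illustration of why document structure "counts", the sketch below keeps separate n-gram counts for each stream of a page (body, title, anchor) instead of pooling all text. It is a toy, in-memory stand-in: the `ngram_counts` helper and the sample document are assumptions made for this example, and it does not use or reproduce the Web N-gram service's actual interface.

```python
from collections import Counter, defaultdict


def ngram_counts(docs, n=2):
    """Count n-grams separately per stream.

    `docs` is a list of dicts mapping a stream name (e.g. "body",
    "title", "anchor") to its text. Tokenization is a naive
    lowercase whitespace split.
    """
    counts = defaultdict(Counter)
    for doc in docs:
        for stream, text in doc.items():
            tokens = text.lower().split()
            for i in range(len(tokens) - n + 1):
                counts[stream][tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical document with three streams.
docs = [{
    "title": "web scale data",
    "body": "web scale data is big data",
    "anchor": "web data",
}]
counts = ngram_counts(docs)
```

Here the bigram ("web", "data") appears only in the anchor stream, so a model built per stream can distinguish how link text is phrased from how titles or body text are phrased.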

  15. Application Examples using Web N-gram Services

  16. Word Breaking
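The slide itself carries no detail, so the following is only a sketch of one common way to do word breaking with a unigram language model: dynamic programming over all split points, choosing the segmentation with the highest total log-probability. The `TOY_COUNTS` table, the unseen-word penalty, and the function names are assumptions for illustration; a real system would draw probabilities from a web-scale model such as the n-gram service mentioned above.

```python
import math

# Hypothetical unigram counts standing in for a web-scale language model.
TOY_COUNTS = {
    "data": 50, "driven": 20, "world": 40, "a": 100, "at": 30,
    "dat": 1, "drive": 15, "n": 2, "word": 25,
}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(word):
    """Log-probability of a word; unseen words get a penalty that grows with length."""
    count = TOY_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def word_break(text, max_len=12):
    """Split `text` into the most probable word sequence under the unigram model."""
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            score, words = best[start]
            piece = text[start:end]
            candidates.append((score + log_prob(piece), words + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


segments = word_break("datadrivenworld")
```

With these toy counts, "datadrivenworld" is split into "data", "driven", "world", because alternatives such as "dat" + "a" or "drive" + "n" score lower.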

  17. Multi-word Tag Cloud from Government Dataset Titles • Single Tag Cloud • Multi Tag Cloud • Ref: Dr. Li Ding, Rensselaer Polytechnic Institute

  18. Query Segmentation Body: Title: Anchor:

  19. Big Data and Machine Learning to the rescue of • Machine Translation • Audio/Speech • Motion/Gestures

  20. Text: Paraphrasing in English http://labs.microsofttranslator.com/thesaurus/

  21. Sentence: “many are dismayed by his behaviour”

  22. Audio: Search Over Audio http://www.msravs.com/audiosearch_demo/ http://labs.microsofttranslator.com/thesaurus/

  23. Meaning of Utterances: Search Over Audio http://www.msravs.com/audiosearch_demo/

  24. Gestures: Kinect SDK http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk

  25. It’s now a Knowledge World: From Patterns to Meanings

  26. Semantics as the study of Meaning • Data semantics – extract and map from structured and semi-structured sources into ontologies • Lexical semantics – identify/learn concepts, roles from sentences (e.g. Powerset; MindNet) • Statistical semantics – discover meaning from patterns of use (e.g. concept similarity) • Computational semantics – automate the process of constructing and reasoning with meaning representations • Semantic web – linked data via URI, common graph structure with RDF, inferences via ontologies and OWL • Formal semantics – in linguistics? in logic?
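The "Semantic web" bullet (a common graph structure plus inference via ontologies) can be sketched without any RDF library. The snippet below is an assumption-laden toy: triples are plain Python tuples, the `ex:` names are invented, and `infer_types` implements only one inference rule (type propagation along `rdfs:subClassOf`), which is far less than what RDFS or OWL reasoners provide.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the triples plus every rdf:type fact implied by subclass links.

    Rule applied until nothing new is derived:
    (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for s, p, o in facts if p == SUBCLASS_OF}
        for s, p, o in list(facts):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass:
                if sub == o and (s, RDF_TYPE, sup) not in facts:
                    facts.add((s, RDF_TYPE, sup))
                    changed = True
    return facts


# Hypothetical graph: names are illustrative, not from a real ontology.
graph = [
    ("ex:Picasso", RDF_TYPE, "ex:Painter"),
    ("ex:Painter", SUBCLASS_OF, "ex:Artist"),
    ("ex:Artist", SUBCLASS_OF, "ex:Person"),
]
closure = infer_types(graph)
```

Starting from a single stated type, the closure also contains the facts that the subject is an `ex:Artist` and an `ex:Person`, which is the kind of meaning an ontology adds on top of the raw graph.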

  27. Probase: A Knowledge Base for Text Understanding http://research.microsoft.com/en-us/projects/probase/

  28. Probase has a big concept space (comparison of Probase, Freebase, and Cyc)

  29. Probase vs. Freebase: Uncertainty

  30. What’s in your mind when you see the word ‘apple’?

  31. When the machine sees ‘apple’ and ‘pear’ together

  32. Probase Internals artist painter Born Died … Movement Picasso 1881 1973 … Cubism art created by painting Year Type … Guernica … 1937 Oil on Canvas

  33. Probase search

  34. Interim Product: Academic Search http://academic.research.microsoft.com/

  35. Zentity 2.0 – Research Output Platform. A semantic computing platform to store and expose relationships between digital assets. New Features: • Pivot Viewer (de facto browser) • Open Data Protocol • Default web UI with CSS support and custom ASP.Net controls • Flexible data model enables many scenarios and can be easily extended over time. http://research.microsoft.com/zentity/

  36. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags

  37. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
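A tag co-occurrence graph of the kind named on these slides can be built by counting how often two tags appear on the same photo. The sketch below is a minimal version under stated assumptions: the `cooccurrence_edges` helper, the `min_count` threshold, and the sample tags are hypothetical, and no Flickr API is involved.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(photo_tags, min_count=2):
    """Return {(tag_a, tag_b): weight} for tag pairs seen together on
    at least `min_count` photos. Pairs are stored in sorted order so
    that (a, b) and (b, a) count as the same undirected edge."""
    counts = Counter()
    for tags in photo_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical tag lists, one per photo.
photos = [
    ["beach", "sunset", "sea"],
    ["beach", "sea"],
    ["sunset", "city"],
]
edges = cooccurrence_edges(photos)
```

The resulting weighted edges are the pattern; deciding that "beach" and "sea" belong to one topic is the interpretation step the slides go on to discuss.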

  38. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr. Flickr users who commented on Marc_Smith’s photos (more than 4 times)

  39. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr. Flickr users who commented on Marc_Smith’s photos (more than 4 times)
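The selection behind these two slides (users who commented on one person's photos more than 4 times) amounts to a count and a threshold. The sketch below assumes a hypothetical input format of (commenter, photo owner) pairs and an invented helper name; the sample comments are fabricated for illustration only.

```python
from collections import Counter


def frequent_commenters(comments, owner, more_than=4):
    """Count comments left on `owner`'s photos by other users and keep
    those who commented strictly more than `more_than` times."""
    counts = Counter(
        commenter
        for commenter, photo_owner in comments
        if photo_owner == owner and commenter != owner
    )
    return {user: n for user, n in counts.items() if n > more_than}


# Hypothetical (commenter, photo_owner) pairs.
comments = (
    [("ann", "Marc_Smith")] * 5
    + [("bob", "Marc_Smith")] * 4
    + [("ann", "someone_else")]
)
active = frequent_commenters(comments, "Marc_Smith")
```

Only "ann" passes the threshold here; "bob" has exactly 4 comments and is excluded, which shows how sensitive the resulting graph is to the cutoff chosen.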

  40. Semantics of Network Patterns: NodeXL http://nodexl.codeplex.com • INTRODUCTION • TECHNIQUES AND METRICS • USER RESEARCH • PRODUCT GROUP ENGAGEMENT • FURTHER WORK • Twitter NodeXL Graph “Bing” at 2:30 AM Monday, July 12, 2010

  41. From Pattern to Meaning: Email • Validation of pattern analysis requires human input. • Meaning can be considered globally accepted or strictly contextual, generally understood or individually constructed.

  42. Summary • The challenge is not so much in the standards for representations (isn’t this just still syntax?) and pattern discovery but really in the interpretation and validation of that interpretation. • ‘Meaning’ has different connotations in different contexts. • The challenge is in determining and addressing the right level of granularity.

  43. Thank you • Evelyne Viegas, Microsoft Research, USA • Li Ding, Rensselaer Polytechnic Institute • Natasa Milic-Frayling, Microsoft Research, UK • Haixun Wang, Microsoft Research, Asia • Kuansan Wang, Microsoft Research, USA • Lewis Shepherd, lewiss@microsoft.com, @lewisshepherd
