
Pharos Summer School Fundamentals of Social Applications

Pharos Summer School Fundamentals of Social Applications. June 2009 Avaré Stewart stewart@l3s.de http://www.l3s.uni-hannover.de/~stewart/pharos/. Roadmap. Part I: Overview Social Applications current shortcomings, solutions Part II : Information Extraction (IE)


Presentation Transcript


  1. Pharos Summer School Fundamentals of Social Applications June 2009 Avaré Stewart stewart@l3s.de http://www.l3s.uni-hannover.de/~stewart/pharos/

  2. Roadmap • Part I: Overview Social Applications • current shortcomings, solutions • Part II : Information Extraction (IE) • tasks, techniques, tools • Part III: Evaluation • Part IV: IE & IR Applications in Context

  3. Overview of Social Applications

  4. The Social Applications Phenomenon The social applications phenomenon today is driven by Social Media Social Media: • information content of the “citizen journalist”: user-generated content • a popular way for people to connect in the online world, in personal & business relationships

  5. What's the Social Media Hype? Capitalize on Social Processes Diffusion / Cascade • Coverage: • Reach small or large audiences • Breaks publication barriers • Business / Advertisement • Repeated Visiting: with the best links, readers will come back • Information Gathering / Sharing: • Cut the time you spend looking • Link economy is real… Give some, get some • Dynamic Content: not the endpoint of a conversation, but the beginning… • Social Intervention / Detection • Rumors, fads, infectious disease The core concepts of social media Espoo, April 2007

  6. The Many Faces of Social Applications Domain: • Music, politics, cycling, medicine Media Type: • Video: YouTube, Daily Motion, Facebook Services: • meeting people • expressing a point of view • serendipitous discovery

  7. What Are Some Limitations with Social Applications?

  8. Social Networking Divide Where's the “Social” Web? • Social sites intentionally seek distinction • Problem: • sheer number: redundancy, overlap: • type of media, resources • topics • Overlaps exist, but are not tapped for the benefit of those who actually constitute the social networking ecosystem The so-called Social Web is, ironically, divided

  9. Open Social Networking (OSN) Aspects of an Open Social Network • Unified Data Spaces • Personal Identity Unification • Unified Applications

  10. Unified Data Spaces Linking Open Data Cloud Music-Social Network Bibliographic Encyclopedic BioMedical http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

  11. Personal Identity Unification • OpenID : a single digital • Retaggr : social media profile card • Geek Chart : graphical profile - pie chart • DandyID : collect online profiles in one place • FriendFeed : real-time aggregator, consolidates the updates from sites

  12. Unified Applications Multi-Site APIs: common API for social applications across multiple websites • OpenSocial • Data Portability Project Single-Site APIs: partner / interact programmatically • YouTube Data API: videos • Spinn3r: indexing blogosphere • etc.

  13. Pharos Scenario Bloggers Who Don’t Tag Taggers Who Don’t Blog ??? Social Network Divide

  14. Missing Link: Cross-Tagging Exploit the tag assertions made by users of one social site to personalize the experience for users of another, comparable site

  15. Overview: Cross Tagging Better Browsing Better Search Better Recommendations Cross-Tagging for Personalized Open Social Networking, Stewart, Diaz, Balby Marinho, 2008

  16. What More Can We Do with Social Applications?

  17. Social Media Communities & Content • Social media: examined primarily for its popularity in connecting people (Espoo, April 2007) • In Pharos: examine blogs for improved, personalized information access

  18. Complex Information Needs & Social Media Search • Polarity, opinion • Meme and themes • Related, multi-lingual resources • Entities: people, organizations, etc. • Relationships between entities • Event: who, what, where, when, how

  19. Events ? ... Momentum is Shifting • Industry: • Complex Event Processing (CEP) • Event correlation: • Event Filtering , Event Aggregation • Event Masking, Root Cause Analysis • Research: • Event detection • Associations • De-duplicate Humans think in terms of events and entities Events - natural abstraction of real world

  20. Information Retrieval, Meet Information Extraction ... from Blogs IR • Information Extraction IE : • a subarea of Natural Language Processing (NLP) • Needed to solve complex (event-driven) information needs • hard, because natural language is complex, vague and ambiguous, i.e.: unstructured • potentially harder, for blogs & informal sources IE Social Media

  21. Anatomy of a Blog Rich Source for Personalized Information Author Archive Tag Content Trackback Permalink Comment Timestamp Blogroll Title Feed

  22. Part II: Information Extraction Tasks, Techniques and Tools

  23. What is Information Extraction ?

  24. Unstructured Data • Encoded in a way that makes it difficult for computers to immediately interpret • Multiple languages, across multiple documents

  25. Why Information Extraction? • Large amount of unstructured or semistructured information • Web pages, email, news articles, call-center text records, business reports, annotations, spreadsheets, research papers, blogs, tags, instant messages (IM), … • High impact applications • Business intelligence, personal information management, Web communities, Web search and advertising, scientific data management, e-government, medical records management, … • Open ended and growing rapidly • Information Extraction: • Superimpose formal meaning on unstructured information • Elicit facts and relationships • Feed database/knowledgebase

  26. Why? ... Information is Locked Away... Events, Facts, Relationships Inaccessible data .... growing and sophisticated needs ... growing

  27. What is Information Extraction (IE) ? • ...isolates relevant text fragments, extracts relevant information from the fragments, and pieces together the targeted information in a coherent framework • ... build systems that find and link relevant information while ignoring extraneous and irrelevant information • Cowie and Lehnert, 1996 p.81 IE is used to get some information out of unstructured data

  28. Information Extraction: e.g. Disaster Unstructured Text Information Extraction (IE) System Structured Text

  29. Information Extraction: Major Tasks • Segmentation • Tokenization, Sentence Splitting • Classification • POS Tagging, Lemmatization, Disambiguation, … • Entity Detection • Association • Noun Phrase Chunking • Parsing • Relationship Detection • Normalization & Deduplication • Anaphora Resolution • Normalization of Formats, Schema • Record Linkage, Record Deduplication • Mention Tracking

  30. What are the Components and Tasks of an Information Extraction System?

  31. General View of IE System Training Phase Deployment Phase INPUT: Source Text INPUT: Training corpus External Knowledge Preprocessing Preprocessing Thesaurus Acquisition Learning Extraction Grammar Extraction Ontology Knowledge Base OUTPUT: Structured Information Feedback Information Extraction, Moens 06

  32. Common IE Tasks: Preprocessing & Recognition

  33. Ex: Text Normalization AVIAN INFLUENZA, HUMAN (101): EGYPT, 79TH, 80TH CASES ***************************************************** A ProMED-mail post <http://www.promedmail.org> ProMED-mail is a program of the International Society for Infectious Diseases http://www.isid.org Date: Mon 8 Jun 2009 Source: Egyptian Chronicles [edited] <http://egyptianchronicles.blogspot.com/2009/06/h5n1-follow-up-no80.html> Clean junk formatting • Transformed to make it consistent • Performed before text is processed
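
The clean-up step on this slide can be sketched in a few lines of Python. This is only an illustration under assumed rules (drop decorative asterisk runs, unwrap URLs in angle brackets, collapse whitespace); the function name `normalize` and the exact rules are assumptions, not taken from the slides.

```python
import re

def normalize(text: str) -> str:
    """Clean junk formatting so later steps see consistent text."""
    text = re.sub(r"\*{3,}", " ", text)                   # drop decorative asterisk rules
    text = re.sub(r"<(https?://[^>\s]+)>", r"\1", text)   # unwrap <http://...> URLs
    text = re.sub(r"\s+", " ", text)                      # collapse runs of whitespace
    return text.strip()

sample = "EGYPT, 79TH, 80TH CASES ***** A ProMED-mail post <http://www.promedmail.org>"
cleaned = normalize(sample)
print(cleaned)
```

Running normalization before any other processing means the splitter and tokenizer never have to deal with layout debris.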

  34. Sentence Splitting • Segments text into sentences • Required for the tagger • Domain- and application-independent He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded. The computer must tell which of the dots denote an actual sentence boundary
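
A minimal rule-based splitter for the example sentence might look like the sketch below. It assumes two heuristics that are not from the slides: a boundary needs punctuation followed by whitespace and an uppercase letter, and a small hand-made list of titles (`TITLES`) never ends a sentence. Real splitters use larger abbreviation lists or trained models.

```python
import re

# Hypothetical list of abbreviations that are followed by a name, not a new sentence.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        last_token = text[start:m.end()].split()[-1]
        if last_token in TITLES:
            continue  # "Mr. White": the dot belongs to the title
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded."
for s in split_sentences(example):
    print(s)
```

The dots in "4p.m." are skipped because a lowercase word follows, while "D.C." is accepted as a boundary because an uppercase word follows and it is not in the title list. Both rules are easy to break, which is why this step is harder than it looks.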

  35. Tokenization • Words are not always surrounded by whitespace: • Tokenization / Word Segmentation: • Numbers, punctuation, symbols • string of contiguous alphanumeric characters with space on either side? Abbreviations such as etc. and Calif. A text-based medium. • White space not indicating a word break: Phone: 0171 378 0647 San Francisco Ditto: in spite of
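
To make the difference from plain whitespace splitting concrete, here is a small regex tokenizer. The pattern is an assumption chosen for illustration: it keeps hyphenated words and numbers together and emits punctuation as separate tokens.

```python
import re

# number | word (optionally hyphenated) | single punctuation mark
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print("A text-based medium.".split())      # whitespace only: 'medium.' keeps its dot
print(tokenize("A text-based medium."))    # punctuation becomes its own token
print(tokenize("Phone: 0171 378 0647"))    # the phone number still falls apart
```

The sketch deliberately does not solve the harder cases from the slide: it separates the dot from abbreviations such as "etc." and "Calif.", and it cannot tell that "0171 378 0647", "San Francisco" or "in spite of" should each be a single unit.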

  36. Parts of Speech (POS) • POS: category / class • Words in same class have similar syntactic behavior • Ex: Noun: person, place, thing, animal • Ex: verbs express action

  37. Ex: Penn Treebank POS Tagset

  38. Chunking • Words are organized into groups • Phrases: word groupings, clumped as a unit
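
As a rough illustration of chunking, the sketch below groups already tagged words into noun phrases. It assumes the input carries Penn Treebank tags and uses one crude rule that is not from the slides: an optional determiner (DT), any adjectives (JJ), and at least one noun (NN*).

```python
def np_chunks(tagged):
    """Group (word, tag) pairs into simple noun-phrase chunks."""
    chunks, current = [], []

    def flush():
        # Keep the group only if it contains at least one noun.
        if any(tag.startswith("NN") for _, tag in current):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag == "DT":
            flush()                      # a determiner always opens a new chunk
            current.append((word, tag))
        elif tag == "JJ" or tag.startswith("NN"):
            current.append((word, tag))
        else:
            flush()                      # any other tag closes the chunk
    flush()
    return chunks

tagged = [("The", "DT"), ("girl", "NN"), ("wore", "VBD"), ("a", "DT"),
          ("new", "JJ"), ("red", "JJ"), ("dress", "NN"), (".", ".")]
print(np_chunks(tagged))
```

Chunking stops short of a full parse: it finds flat word groups without deciding how they attach to each other.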

  39. Parsing • Labeled syntactic tree corresponding to the interpretation of the sentence • Resolution of syntactic ambiguities

  40. Sense Disambiguation Time flies like an arrow Fruit flies like a banana

  41. What are Some Basic Recognition Tasks?

  42. IE Recognition Tasks (timeline) • MUC-1 (1987) • MUC-2 (1989) • MUC-3 (1991) • MUC-4 (1992) • MUC-5 (1993) • MUC-6 (1995) • MUC-7 (1998) • ACE Pilot Event (1999) • ACE (2002) • ACE + Text Analysis Conference (TAC) (2009)

  43. Named Entity Recognition (NE) • recognition of entity names: • people, organizations • place names • temporal expressions & numerical expressions
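
A toy recognizer shows the two simplest ingredients of named entity recognition: a name list (gazetteer) for people, organizations and places, and a pattern for temporal expressions. The gazetteer entries and the date pattern below are hypothetical and far too small for real use; practical systems combine such resources with context rules or learned models.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {"Egypt": "LOCATION", "ProMED-mail": "ORGANIZATION"}

# Dates written like "8 Jun 2009".
DATE = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"
)

def tag_entities(text):
    spans = [(m.start(), m.group(), "DATE") for m in DATE.finditer(text)]
    for name, label in GAZETTEER.items():
        # Word boundaries keep "Egypt" from matching inside "Egyptian".
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((m.start(), name, label))
    # Return entities in the order they appear in the text.
    return [(surface, label) for _, surface, label in sorted(spans)]

print(tag_entities("ProMED-mail reported new cases in Egypt on 8 Jun 2009."))
```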

  44. Co-reference Resolution (CO) • Identify chains of noun phrases that refer to the same object • Scope: • Within a document • Across documents John saw Mary. The girl was very beautiful; she wore a new red dress. • Types: • Pronominal: ’they’, ’it’, ’he’, ’hers’, ’themselves’, etc. resolve to: proper nouns, common nouns, other pronouns

  45. Proper Noun Coreference • Names of people, places, products and companies referred to in many different variations. 3M Minnesota Mining and Manufacturing 3M Corp. NYC New York N.Y.C New York City Ref: Coreference as a Foundation for Link Analysis over Free Text
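
One simple way to group the name variants on this slide is to normalize each mention and then look it up in an alias table. The sketch below assumes a hand-built table (`ALIASES`) and a short list of corporate suffixes; both are hypothetical, and this is not the method of the cited reference.

```python
# Hypothetical resources: corporate suffixes to ignore and known long forms.
SUFFIXES = {"corp", "inc", "ltd", "co"}
ALIASES = {
    "minnesota mining and manufacturing": "3m",
    "new york": "nyc",
    "new york city": "nyc",
}

def canonical(name):
    """Map a mention to a canonical key: lowercase, no dots, no suffixes."""
    key = name.lower().replace(".", "")
    key = " ".join(w for w in key.split() if w not in SUFFIXES)
    return ALIASES.get(key, key)

def group_mentions(mentions):
    chains = {}
    for mention in mentions:
        chains.setdefault(canonical(mention), []).append(mention)
    return chains

mentions = ["3M", "Minnesota Mining and Manufacturing", "3M Corp.",
            "NYC", "New York", "N.Y.C", "New York City"]
print(group_mentions(mentions))
```

Normalization alone links "N.Y.C" to "NYC" and "3M Corp." to "3M"; the long forms only join their chains because the alias table lists them, which is where most of the real effort goes.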

  46. Other Coreference Types • Apposition: • noun phrases, side by side • one defines or modifies the other John Smith, chairman of General Electric, resigned yesterday. • Predicate Nominal: • noun phrase is main predicate of a sentence • subject and predicate nominal connected by a linking verb (copula) John is the finest juggler in the world.

  47. Template Element Construction (TE) • Specified classes and attributes of entities: • person : name (name variants), • title, nationality, • description in the text • subtype

  48. Template Relation Construction (TR) • Two-slot template representing a binary relation: • e.g., employee_of, product_of, location_of • pointers to template elements Fei-Yu Xu 08
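
The two-slot idea can be pictured as a small data structure: template elements describe entities, and a template relation holds a relation name plus two pointers to elements. The class and field names below are assumptions made for illustration, not the MUC template format; the example entities come from the apposition sentence on slide 46.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TemplateElement:
    entity_id: str                  # identifier that relations point to
    entity_type: str                # e.g. "person", "organization"
    name: str
    name_variants: List[str] = field(default_factory=list)
    title: str = ""

@dataclass
class TemplateRelation:
    relation: str                   # e.g. "employee_of"
    arg1: str                       # entity_id of the first slot
    arg2: str                       # entity_id of the second slot

person = TemplateElement("E1", "person", "John Smith", title="chairman")
company = TemplateElement("E2", "organization", "General Electric")
works_for = TemplateRelation("employee_of", person.entity_id, company.entity_id)
print(works_for)
```

Storing identifiers instead of copies of the entities keeps each element in one place, so several relations can refer to the same person or organization.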

  49. Scenario Template Production (ST) • information involving several relations or events: • Joint venture • Partners • Products • Profits Fei-Yu Xu 08

  50. Can We Extract Temporal Expressions?
