Pharos Summer School Fundamentals of Social Applications. June 2009 Avaré Stewart stewart@l3s.de http://www.l3s.uni-hannover.de/~stewart/pharos/. Roadmap. Part I: Overview Social Applications current shortcomings, solutions Part II : Information Extraction (IE)
Pharos Summer School Fundamentals of Social Applications June 2009 Avaré Stewart stewart@l3s.de http://www.l3s.uni-hannover.de/~stewart/pharos/
Roadmap • Part I: Overview Social Applications • current shortcomings, solutions • Part II : Information Extraction (IE) • tasks, techniques, tools • Part III: Evaluation • Part IV: IE & IR Applications in Context
The Social Applications Phenomenon The social applications phenomenon today is driven by social media. Social Media: • information content of the “citizen journalist”: user-generated content • a popular way for people to connect in the online world, in personal and business relationships
What's the Social Media Hype? Capitalize on Social Processes: Diffusion / Cascade • Coverage: • Reach small or large audiences • Break publication barriers • Business / Advertisement • Repeated visiting: with the best links, readers will come back • Information Gathering / Sharing: • Cut the time you spend looking • The link economy is real: give some, get some • Dynamic content: not the endpoint of a conversation, but the beginning • Social Intervention / Detection • Rumors, fads, infectious disease (Source: The core concepts of social media, Espoo, April 2007)
The Many Faces of Social Applications Domain: • Music, politics, cycling, medicine Media Type: • Video: YouTube, Dailymotion, Facebook Services: • meeting people • expressing a point of view • serendipitous discovery
Social Networking Divide Where's the “Social” Web? • Social sites intentionally seek distinction • Problem: • sheer number of sites, with redundancy and overlap in: • type of media, resources • topics • Overlaps exist, but remain untapped to the benefit of those who actually constitute the social networking ecosystem The so-called Social Web is, ironically, divided
Open Social Networking (OSN) Aspects of an Open Social Network • Unified Data Spaces • Personal Identity Unification • Unified Applications
Unified Data Spaces Linking Open Data Cloud Music-Social Network Bibliographic Encyclopedic BioMedical http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Personal Identity Unification • OpenID: a single digital identity • Retaggr: social media profile card • Geek Chart: graphical profile (pie chart) • DandyID: collect online profiles in one place • FriendFeed: real-time aggregator that consolidates the updates from sites
Unified Applications Multi-Site APIs: a common API for social applications across multiple websites • OpenSocial • Data Portability Project Single-Site APIs: partner / interact programmatically • YouTube Data API: videos • Spinn3r: indexing the blogosphere • etc.
Pharos Scenario Bloggers Who Don’t Tag Taggers Who Don’t Blog ??? Social Network Divide
Missing Link: Cross-Tagging Exploit the tag assertions made by users of one social site to personalize the experience for users of another, comparable site
Overview: Cross-Tagging • Better Browsing • Better Search • Better Recommendations (Cross-Tagging for Personalized Open Social Networking, Stewart, Diaz, Balby Marinho, 2008)
Social Media Communities & Content • Social media: examined primarily for its popularity in connecting people (Espoo, April 2007) • In Pharos: examine blogs for improved, personalized information access
Complex Information Needs & Social Media Search • Polarity, opinion • Meme and themes • Related, multi-lingual resources • Entities: people, organizations, etc. • Relationships between entities • Event: who, what, where, when, how
Events? ... Momentum is Shifting • Industry: • Complex Event Processing (CEP) • Event correlation: • Event Filtering, Event Aggregation • Event Masking, Root Cause Analysis • Research: • Event detection • Associations • De-duplication Humans think in terms of events and entities. Events are a natural abstraction of the real world.
Information Retrieval, Meet Information Extraction ... from Blogs • Information Extraction (IE): • a subarea of Natural Language Processing (NLP) • needed to solve complex (event-driven) information needs • hard, because natural language is complex, vague and ambiguous, i.e. unstructured • potentially harder for blogs and other informal sources (Diagram: IR, IE, Social Media)
Anatomy of a Blog Rich Source for Personalized Information Author Archive Tag Content Trackback Permalink Comment Timestamp Blogroll Title Feed
Part II: Information Extraction Tasks, Techniques and Tools
Unstructured Data • Encoded in a way that makes it difficult for computers to interpret immediately • Multiple languages, across multiple documents
Why Information Extraction? • Large amount of unstructured or semistructured information • Web pages, email, news articles, call-center text records, business reports, annotations, spreadsheets, research papers, blogs, tags, instant messages (IM), … • High impact applications • Business intelligence, personal information management, Web communities, Web search and advertising, scientific data management, e-government, medical records management, … • Open ended and growing rapidly • Information Extraction: • Superimpose formal meaning on unstructured information • Elicit facts and relationships • Feed database/knowledgebase
Why? ... Information is Locked Away... Events, Facts, Relationships • Inaccessible data ... growing • Sophisticated needs ... growing
What is Information Extraction (IE)? • ... isolates relevant text fragments, extracts relevant information from the fragments, and pieces together the targeted information in a coherent framework • ... build systems that find and link relevant information while ignoring extraneous and irrelevant information (Cowie and Lehnert, 1996, p. 81) IE is used to get targeted information out of unstructured data
Information Extraction, e.g. Disaster: Unstructured Text → Information Extraction (IE) System → Structured Text
Information Extraction: Major Tasks • Segmentation • Tokenization, Sentence Splitting • Classification • POS Tagging, Lemmatization, Disambiguation, … • Entity Detection • Association • Noun Phrase Chunking • Parsing • Relationship Detection • Normalization & Deduplication • Anaphora Resolution • Normalization of Formats, Schema • Record Linkage, Record Deduplication • Mention Tracking
What are the Components and Tasks of an Information Extraction System?
General View of IE System (diagram labels) • Training Phase: INPUT: Training corpus, Preprocessing, Learning, Extraction Grammar • Deployment Phase: INPUT: Source Text, Preprocessing, Extraction, OUTPUT: Structured Information • External Knowledge: Thesaurus, Ontology, Knowledge Base • Acquisition, Feedback (Information Extraction, Moens 06)
Ex: Text Normalization • Clean junk formatting • Text is transformed to make it consistent • Performed before the text is processed Example input: AVIAN INFLUENZA, HUMAN (101): EGYPT, 79TH, 80TH CASES ***************************************************** A ProMED-mail post <http://www.promedmail.org> ProMED-mail is a program of the International Society for Infectious Diseases http://www.isid.org Date: Mon 8 Jun 2009 Source: Egyptian Chronicles [edited] <http://egyptianchronicles.blogspot.com/2009/06/h5n1-follow-up-no80.html>
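A minimal sketch of the cleaning step, assuming the "junk" consists of decoration lines, angle-bracketed URLs and irregular whitespace. The function name normalize_text and the exact rules are illustrative assumptions, not the pipeline used in Pharos.

```python
import re

def normalize_text(raw):
    """Clean a raw post: unwrap <url>, collapse whitespace, drop decoration lines."""
    cleaned = []
    for line in raw.splitlines():
        # Unwrap URLs written as <http://...> (assumed convention of the source).
        line = re.sub(r"<(https?://[^>\s]+)>", r"\1", line)
        # Collapse runs of whitespace into a single space.
        line = re.sub(r"\s+", " ", line).strip()
        # Skip empty lines and lines made only of separator characters.
        if not line or re.fullmatch(r"[*=\-_]{3,}", line):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

raw = (
    "AVIAN INFLUENZA,  HUMAN (101)\n"
    "*****\n"
    "A ProMED-mail post\n"
    "<http://www.promedmail.org>\n"
    "\n"
    "Date:   Mon 8 Jun 2009"
)
print(normalize_text(raw))
```

Running normalization first means every later step (sentence splitting, tokenization, tagging) sees text in one consistent shape.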
Sentence Splitting • Segments text into sentences • Required for the tagger • Domain- and application-independent Example: He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded. The computer must tell which of the dots denote an actual sentence boundary.
Tokenization • Tokenization / Word Segmentation: splits text into words, numbers, punctuation, symbols • Is a word a string of contiguous alphanumeric characters with space on either side? • Words are not always surrounded by whitespace: abbreviations such as etc. and Calif.; hyphenated forms such as: A text-based medium. • White space does not always indicate a word break: Phone: 0171 378 0647; San Francisco; Ditto: in spite of
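A minimal sketch of why the dots are hard, assuming two simple heuristics: a period does not end a sentence after a known title (the hypothetical TITLES set) or when the next word starts in lowercase. Real splitters use larger abbreviation lists or trained models.

```python
import re

# Hypothetical list of titles that never end a sentence.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def naive_split(text):
    """Split after every ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

def split_sentences(text):
    """Split only where the punctuation plausibly ends a sentence."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1]
        next_char = text[match.end():match.end() + 1]
        if last_token in TITLES:       # "Mr. White" is not a boundary
            continue
        if not next_char.isupper():    # "4p.m. in" is not a boundary
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded."
print(naive_split(text))
print(split_sentences(text))
```

The naive version cuts the example into five fragments; the heuristic version returns the two intended sentences, but it would still fail when a title really does end a sentence.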
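A minimal sketch of a tokenizer that goes beyond whitespace splitting, assuming a small hand-made abbreviation list (ABBREVIATIONS is hypothetical). It keeps abbreviations and hyphenated words whole and separates punctuation and symbols; multiword units such as "in spite of" or phone numbers are not handled.

```python
import re

# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"etc.", "Calif."}

# A word (optionally hyphenated) or a single non-space, non-word symbol.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)                     # keep "Calif." whole
        else:
            tokens.extend(TOKEN_RE.findall(chunk))   # split off punctuation
    return tokens

print(tokenize("A text-based medium in Calif. costs $5."))
```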
Parts of Speech (POS) • POS: category / class • Words in same class have similar syntactic behavior • Ex: Noun: person, place, thing, animal • Ex: verbs express action
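A minimal sketch of a lookup tagger, assuming a tiny hand-written lexicon (LEXICON, with Penn Treebank style tags) and a few fallback rules. It is only meant to show what a tagger outputs; it assigns each word a single tag regardless of context, so "flies" is tagged as a verb even in "Fruit flies like a banana".

```python
# Hypothetical toy lexicon: word -> most frequent tag.
LEXICON = {
    "the": "DT", "a": "DT", "an": "DT",
    "like": "IN", "in": "IN",
    "flies": "VBZ", "called": "VBD",
    "time": "NN", "arrow": "NN",
}

def tag(tokens):
    """Assign one tag per token by lexicon lookup, then simple fallbacks."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            label = LEXICON[lower]
        elif token[:1].isupper():
            label = "NNP"   # unknown capitalized word: guess proper noun
        elif lower.endswith("ly"):
            label = "RB"    # guess adverb
        elif lower.endswith("ed"):
            label = "VBD"   # guess past-tense verb
        else:
            label = "NN"    # default: common noun
        tagged.append((token, label))
    return tagged

print(tag(["Time", "flies", "like", "an", "arrow"]))
print(tag(["Fruit", "flies", "like", "a", "banana"]))
```

Statistical taggers resolve such cases by looking at the neighbouring words and tags instead of each word in isolation.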
Chunking • Words are organized into groups • Phrases: word groupings, clumped as a unit
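A minimal sketch of noun phrase chunking over already tagged words, assuming Penn Treebank style tags and one simple pattern: an optional determiner, any adjectives, then one or more nouns. The function name np_chunks and the tag sets are illustrative assumptions.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}
NP_TAGS = NOUN_TAGS | {"DT", "JJ"}

def np_chunks(tagged):
    """Group runs of determiner/adjective/noun tokens into noun phrases."""
    chunks, current = [], []

    def flush():
        # Keep the run only if it actually contains a noun.
        if any(tag in NOUN_TAGS for _, tag in current):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag not in NP_TAGS:
            flush()              # a verb, preposition, etc. ends the phrase
            continue
        if tag == "DT" and current:
            flush()              # a new determiner starts a new phrase
        current.append((word, tag))
    flush()
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("saw", "VBD"),
          ("a", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(np_chunks(tagged))
```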
Parsing • Labeled syntactic tree corresponding to the interpretation of the sentence • Resolution of syntactic ambiguities
Sense Disambiguation Time flies like an arrow Fruit flies like a banana
IE Recognition Tasks (timeline): MUC-1 (1987), MUC-2 (1989), MUC-3 (1991), MUC-4 (1992), MUC-5 (1993), MUC-6 (1995), MUC-7 (1998), ACE Pilot Event (1999), ACE (2002), ..., ACE + Text Analysis Conference (TAC) (2009)
Named Entity Recognition (NE) • recognition of entity names: • people, organizations • place names • temporal expressions & numerical expressions
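A minimal sketch of rule-based named entity recognition, assuming two hand-written patterns (titles followed by a capitalized name, and dates such as "8 Jun 2009") plus a tiny gazetteer. PATTERNS, GAZETTEER and find_entities are hypothetical names; real systems combine much larger resources with learned models.

```python
import re

# Hypothetical patterns for persons and temporal expressions.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+\b")),
    ("DATE", re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}\b")),
]

# Hypothetical gazetteer: known names and their entity type.
GAZETTEER = {"Egypt": "LOCATION", "General Electric": "ORGANIZATION"}

def find_entities(text):
    """Return (mention, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    return [(mention, label) for _, mention, label in sorted(found)]

print(find_entities("Mr. Green visited Egypt on 8 Jun 2009."))
```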
Co-reference Resolution (CO) • Identify chains of noun phrases that refer to the same object • Scope: • Within a document • Across documents • Example: John saw Mary. The girl was very beautiful; she wore a new red dress. • Types: • Pronominal: ’they’, ’it’, ’he’, ’hers’, ’themselves’, etc., which resolve to proper nouns, common nouns, or other pronouns
Proper Noun Coreference • Names of people, places, products and companies are referred to in many different variations. • 3M / Minnesota Mining and Manufacturing / 3M Corp. • NYC / New York / N.Y.C / New York City Ref: Coreference as a Foundation for Link Analysis over Free Text
Other Coreference Types • Apposition: • noun phrases, side by side • one defines or modifies the other • Example: John Smith, chairman of General Electric, resigned yesterday. • Predicate Nominal: • noun phrase is the main predicate of a sentence • subject and predicate nominal are connected by a linking verb (copula) • Example: John is the finest juggler in the world.
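A minimal sketch of matching name variants, assuming two simple rules: names are equal after removing periods, or one name is the acronym of the other. The helper names are hypothetical. Variants such as "3M" for "Minnesota Mining and Manufacturing", or "New York" for "New York City", are not covered by these rules and would need an alias table or further heuristics.

```python
SKIP_WORDS = {"and", "of", "the"}

def normalize(name):
    """Remove periods and collapse whitespace: 'N.Y.C' -> 'NYC'."""
    return " ".join(name.replace(".", "").split())

def initials(name):
    """First letters of the content words: 'New York City' -> 'NYC'."""
    return "".join(w[0].upper() for w in name.split()
                   if w.lower() not in SKIP_WORDS)

def same_entity(a, b):
    """Guess whether two mentions are variants of the same name."""
    na, nb = normalize(a), normalize(b)
    if na.casefold() == nb.casefold():
        return True
    for short, long_form in ((na, nb), (nb, na)):
        # Acronym rule: only applies when the long form has several words.
        if len(long_form.split()) > 1 and short.upper() == initials(long_form):
            return True
    return False

print(same_entity("N.Y.C", "New York City"))
print(same_entity("NYC", "N.Y.C"))
print(same_entity("NYC", "New York"))
```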
Template Element Construction (TE) • Specified classes and attributes of entities: • person : name (name variants), • title, nationality, • description in the text • subtype
Template Relation Construction (TR) • Two-slot template representing a binary relation: • e.g., employee_of, product_of, location_of • pointers to template elements Fei-Yu Xu 08
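A minimal sketch of how template elements and a two-slot template relation could be represented, assuming plain Python dataclasses. The class and field names are illustrative assumptions and do not reproduce the official MUC template format.

```python
from dataclasses import dataclass, field

@dataclass
class TemplateElement:
    """An entity with its class and attributes (TE)."""
    entity_type: str                       # e.g. "person", "organization"
    name: str
    variants: list = field(default_factory=list)
    descriptor: str = ""                   # description found in the text

@dataclass
class TemplateRelation:
    """A binary relation whose two slots point to template elements (TR)."""
    relation: str                          # e.g. "employee_of"
    arg1: TemplateElement
    arg2: TemplateElement

# "John Smith, chairman of General Electric, resigned yesterday."
person = TemplateElement("person", "John Smith", descriptor="chairman")
company = TemplateElement("organization", "General Electric")
employment = TemplateRelation("employee_of", person, company)

print(employment.relation, employment.arg1.name, employment.arg2.name)
```

Because the relation stores references to the elements rather than copies, a name variant added to an element later is visible through every relation that points to it.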
Scenario Template Production (ST) • information involving several relations or events: • Joint venture • Partners • Products • Profits Fei-Yu Xu 08
Can We Extract Temporal Expressions?