Natural Language Processing for Underground Communications. Dan Klein MURI Kickoff, 11/20/2009. Underground Communications. Example Data. Underground Communications. Example Data, Manual Extraction. Processing: Information Extraction. Observation Graphs. http://www.rossmail.ru/offline.htm.
Natural Language Processing for Underground Communications Dan Klein MURI Kickoff, 11/20/2009
Underground Communications Example Data
Underground Communications Example Data, Manual Extraction
Observation Graphs
http://www.rossmail.ru/offline.htm
http://www.f-mail.ru/kontact/
http://www.spam-reklama.ru/contact.html
http://www.fax-reklama.ru/contact.html
Underlying Entities and Relations (Extraction Goal)
Employee: Person: Person 9876; Product: 5621; Role: Developer
Referral: From: Person 2133; To: Person 1211; Product: 3319
Person 1211: Alias: Steakcap; ICQ: 598199837; Location: France
Person 2133: Alias: Thunderelvi; ICQ: 787659871; Location: USA
Person 9876: Alias: Zakar; ICQ: 234150301; Email: zakar@e-...
Product 3319: Type: FB Harvester; Contact: 709-324-0989
Product 5621: Type: Spam Sender; Contact: 495-210-4423
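The entity and relation records on this slide can be sketched as a small typed schema. This is only an illustrative sketch, not the representation used in the project: the class names, field names, and the `developers_of` helper are assumptions introduced here, and the truncated email address is kept exactly as it appears on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    person_id: int
    alias: str
    icq: Optional[str] = None
    location: Optional[str] = None
    email: Optional[str] = None


@dataclass(frozen=True)
class Product:
    product_id: int
    type: str
    contact: str


@dataclass(frozen=True)
class Employee:  # relation: a person works on a product in some role
    person_id: int
    product_id: int
    role: str


@dataclass(frozen=True)
class Referral:  # relation: one person refers another to a product
    from_person: int
    to_person: int
    product_id: int


# Values copied from the slide; the email is truncated in the source.
people = {
    1211: Person(1211, "Steakcap", icq="598199837", location="France"),
    2133: Person(2133, "Thunderelvi", icq="787659871", location="USA"),
    9876: Person(9876, "Zakar", icq="234150301", email="zakar@e-..."),
}
products = {
    3319: Product(3319, "FB Harvester", "709-324-0989"),
    5621: Product(5621, "Spam Sender", "495-210-4423"),
}
employees = [Employee(9876, 5621, "Developer")]
referrals = [Referral(2133, 1211, 3319)]


def developers_of(product_id: int) -> list:
    """Hypothetical helper: aliases of people recorded as developers of a product."""
    return [
        people[e.person_id].alias
        for e in employees
        if e.product_id == product_id and e.role == "Developer"
    ]
```

Keeping relations (Employee, Referral) separate from entities (Person, Product) mirrors the slide's layout: relations only carry identifiers, so a later deduplication step could merge two Person records without rewriting every relation.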
Discourse Structure sign deliver vote
An Entity Reference Model Our Existing Approach
Adding Semantic Knowledge Our Current Work America Online: company
Does it Work? Evaluation: Reference
[Chart: MUC F1 (cluster similarity). Systems: Baseline, Bengtson & Roth 08, Preliminary Current Work. Group labels: Unsupervised, Supervised, Unsupervised.]
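The evaluation slide reports MUC F1, which scores how well predicted entity clusters match reference clusters. A minimal sketch of the standard link-based MUC computation follows; it is not the evaluation code behind the slide, and the function names and the four-mention example are assumptions made for illustration.

```python
def _muc_ratio(key, response):
    """Fraction of the links in `key` clusters that `response` preserves."""
    mention_to_cluster = {}
    for idx, cluster in enumerate(response):
        for mention in cluster:
            mention_to_cluster[mention] = idx

    preserved = 0
    total = 0
    for cluster in key:
        # Count how many pieces the other clustering splits this cluster into.
        # A mention missing from the other clustering counts as its own piece.
        pieces = set()
        unaligned = 0
        for mention in cluster:
            if mention in mention_to_cluster:
                pieces.add(mention_to_cluster[mention])
            else:
                unaligned += 1
        preserved += len(cluster) - (len(pieces) + unaligned)
        total += len(cluster) - 1
    return preserved / total if total else 0.0


def muc_f1(key, response):
    """Return (precision, recall, F1) under the MUC link-based metric."""
    recall = _muc_ratio(key, response)
    precision = _muc_ratio(response, key)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: one reference entity with four mentions,
# which the system splits into two clusters.
key = [{"m1", "m2", "m3", "m4"}]
response = [{"m1", "m2"}, {"m3", "m4"}]
precision, recall, f1 = muc_f1(key, response)
```

In this example every predicted link is correct, so precision is 1, while one of the three reference links is missing, so recall is 2/3.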
What’s Coming Up Cross-Document Identity
Subsequent Goals
Underlying Entities and Relations
Employee: Person: Person 9876; Product: 5621; Role: Developer
Referral: From: Person 2133; To: Person 1211; Product: 3319
Person 1211: Alias: Steakcap; ICQ: 598199837; Location: France
Person 2133: Alias: Thunderelvi; ICQ: 787659871; Location: USA
Person 9876: Alias: Zakar; ICQ: 234150301; Email: zakar@e-...
Product 3319: Type: FB Harvester; Contact: 709-324-0989
Product 5621: Type: Spam Sender; Contact: 495-210-4423
Summary
• Goal: systems which simultaneously extract and dedupe
• Train in an unsupervised / discovery manner
• Requires: both new statistical machinery and good models of underlying domain structure (transactions, etc.)
• Requires: processing domain-specific language (domain adaptation, grammar induction)
• Evaluation: are the entities and relations correct?
• First steps: measure general approach on newswire, etc., where we know the right answers
• Also: evaluate on underground network data
• Near term: increased accuracy in identity resolution, begin to extract simple relations, better basic analysis
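To make the "dedupe" and identity-resolution goals concrete, here is a deliberately simple deterministic baseline: records are merged whenever they share an exact identifier such as an ICQ number. This is not the unsupervised statistical approach the talk describes; it is a union-find sketch, and the record layout, the `resolve_identities` name, and the extra aliases and email address in the example are all hypothetical.

```python
from collections import defaultdict


def resolve_identities(records, keys=("icq", "email")):
    """Group record indices that share any exact identifier value."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of first record carrying it
    for i, record in enumerate(records):
        for field in keys:
            value = record.get(field)
            if value is None:
                continue
            if (field, value) in first_seen:
                union(i, first_seen[(field, value)])
            else:
                first_seen[(field, value)] = i

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical observations; only the two ICQ numbers come from the slides.
records = [
    {"alias": "Zakar", "icq": "234150301"},
    {"alias": "zakar_dev", "icq": "234150301", "email": "z@example.org"},
    {"alias": "Steakcap", "icq": "598199837"},
    {"alias": "zk", "email": "z@example.org"},
]
groups = resolve_identities(records)
```

Here the first, second, and fourth records end up in one group (linked through a shared ICQ number and then a shared email), and the third stays alone. Exact matching of this kind fails when identifiers are misspelled or absent, which is one reason a statistical model of identity is attractive.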