Research Directions in Statistical Natural Language Processing
26 June 2003
Claire Cardie, Lillian Lee
The Trend

Conference    Some ML        No ML
1992 ACL      24% (8/34)     76%
1994 ACL      35% (14/40)    65%
1996 ACL      39% (16/41)    61%
1999 ACL      60% (41/69)    40%
2001 NAACL    87% (27/31)    13%
Role of Statistical and Machine Learning methods
• tool
  • automate the construction of NLP systems
  • avoid the need for large linguistic knowledge bases
• portability
  • move to new domain quickly
  • reduce the need for expertise in computational linguistics
• robustness
  • handle ungrammatical or unexpected text
  • missing domain knowledge
New Directions
• Natural language components
• Real-world applications
• Methods
Progression of NL learning tasks
[figure: chart; axis label: # of papers]
New Directions
• Natural language components
  • Semantic interpretation
  • Discourse level understanding
  • NL Generation
• Applications
• Methods
Example: Noun Phrase Coreference
Identify all noun phrases that refer to the same entity

John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-services company’s president...

Best results: F-measure of 70.4 (MUC-6) and 63.4 (MUC-7) [Ng & Cardie, 2002]
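The coreference results above are reported as F-measures. As a reminder of how that number combines precision and recall, here is a minimal sketch. It assumes the usual balanced F-measure (beta = 1); the slide does not state the weighting, and the precision and recall values in the example are made up for illustration, not taken from Ng & Cardie.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F-measure (an assumption here;
    the slide does not say which weighting was used).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing is correct
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical scores on a 0-1 scale, for illustration only.
example_score = f_measure(0.8, 0.6)
print(round(100 * example_score, 1))
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F-measure by trading recall away for precision or the reverse.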
New Directions
• Natural language components
  • Semantic interpretation
  • Discourse level understanding
  • NL Generation
• Real-world applications
  • Deeper semantic analysis
  • Non-factual, non-event-based domains
• Methods
Text Categorization
Input: Document
Is the document about
• plants? No
• sports? No
• health and fitness? No
• corporate acquisitions? YES
• …
• stock market? No
Sentiment Classification
Input: Document
Is the overall sentiment in the document
• positive?
• negative?
In general, sentiment classification appears to be harder than categorizing by topic. [E.g., Pang, Lee, Vaithyanathan, 2002; Turney, 2002]
Information Extraction
[figure: text collection → Information Extraction System → filled templates, each with the slots Who, What, Where, When, How]
IE System for Natural Disasters

Document no.: ABC19980530.1830.0342
Date/time: 05/30/1998 18:35:42.49

PAKISTAN MAY BE PREPARING FOR ANOTHER TEST
Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale, centered in a remote part of the country. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.

• Disaster Type: earthquake
  • location: Afghanistan
  • date: 05/30/1998
  • magnitude: 6.9
  • epicenter: a remote part of the country
  • damage:
    • human-effect:
      • victim: Thousands of people
      • number: Thousands
      • outcome: dead
    • physical-effect:
      • object: entire villages
      • outcome: damaged
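The filled template above is essentially a nested record. A minimal sketch of how such an output might be represented in code follows. The class and field names (`DisasterTemplate`, `Effect`, `target`) are hypothetical and chosen only to mirror the slot names on the slide; the slot values are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Effect:
    """One damage entry; `target` stands for the victim or object slot."""
    target: str
    outcome: str
    number: Optional[str] = None  # only the human-effect entry has a number


@dataclass
class DisasterTemplate:
    """Hypothetical container mirroring the slots shown on the slide."""
    disaster_type: str
    location: str
    date: str
    magnitude: float
    epicenter: str
    human_effects: List[Effect] = field(default_factory=list)
    physical_effects: List[Effect] = field(default_factory=list)


# Slot values copied from the earthquake example.
template = DisasterTemplate(
    disaster_type="earthquake",
    location="Afghanistan",
    date="05/30/1998",
    magnitude=6.9,
    epicenter="a remote part of the country",
    human_effects=[
        Effect(target="Thousands of people", outcome="dead", number="Thousands")
    ],
    physical_effects=[Effect(target="entire villages", outcome="damaged")],
)

print(template.location, template.magnitude)
```

Keeping the human and physical effects as lists leaves room for documents that report several victims or damaged objects, which a single flat record could not hold.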
Opinion-oriented Extraction, Summarization, and QA

The Annual Human Rights Report of the US State Department has been strongly criticized and condemned by many countries. Though the report has been made public for 10 days, its contents, which are inaccurate and lacking good will, continue to be commented on by the world media. Many countries in Asia, Europe, Africa, and Latin America have rejected the content of the US Human Rights Report, calling it a brazen distortion of the situation, a wrongful and illegitimate move, and an interference in the internal affairs of other countries. Recently, the Information Office of the Chinese People's Congress released a report on human rights in the United States in 2001, criticizing violations of human rights there. The report quoting data from the Christian Science Monitor, points out that the murder rate in the United States is 5.5 per 100,000 people. In the United States, torture and pressure to confess crime is common. Many people have been sentenced to death for crime they did not commit as a result of an unjust legal system. …

[Wiebe et al., 2003]
Opinion-oriented Extraction
[figure: diagram with the labels <Chinese HR report>, <many countries>, <writer>, <HR report>, and <USA>, and three ATTITUDE frames: polarity:neg strength:medium; polarity:neg strength:high; polarity:neg strength:medium]
[Cardie et al., 2003]
New Directions
• Natural language components
  • Semantic interpretation
  • Discourse level understanding
  • NL Generation
• Real-world applications
  • Deeper semantic interpretation
  • Non-factual, non-event-based domains
• Methods
  • Weakly supervised and unsupervised learning algorithms
Why Weakly Supervised Learning?
• Statistical methods have transformed the field of NLP
• Very good performance on increasing numbers/types of problems in NLP
• Thus far, the most successful statistical and ML algorithms are supervised learning algorithms
• Require large amounts of training data that has been annotated with the “correct” answers
• Corpus annotation bottleneck
Weakly Supervised Methods
• Approaches
  • Co-training [E.g. Blum & Mitchell, 1998; Pierce & Cardie, 2001; Steedman et al. 2003]
  • Self-training [E.g. Banko & Brill, 2001]
  • Multi-level bootstrapping [E.g. Riloff & Jones, 1999]
  • Transductive induction [E.g. Joachims, 1998]
  • Active learning [E.g. Cohn et al. 1994; Lewis & Catlett 1994; Schohn & Cohn, 2000]
  • Task-specific mostly-unsupervised methods [Ando & Lee 2000, 2003]
Learn from a small amount of labeled data (expensive) and a large amount of unlabeled data (cheap)
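To make "a small amount of labeled data and a large amount of unlabeled data" concrete, here is a minimal sketch of generic self-training, one of the approaches listed above. It is not the procedure from any of the cited papers. The one-dimensional nearest-centroid classifier, the margin-based confidence test, and all names (`centroids`, `predict`, `self_train`, `min_margin`) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Labeled = List[Tuple[float, str]]


def centroids(labeled: Labeled) -> Dict[str, float]:
    """Train the toy model: the mean feature value of each class."""
    sums: Dict[str, Tuple[float, int]] = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict(model: Dict[str, float], x: float) -> Tuple[str, float]:
    """Return (label, margin); margin is the distance gap between the
    nearest and second-nearest centroid, used here as confidence."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(labeled: Labeled, unlabeled: List[float],
               min_margin: float, max_rounds: int = 10):
    """Repeatedly retrain, then move confidently predicted unlabeled
    examples (with their predicted labels) into the training set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled)
        confident, rest = [], []
        for x in unlabeled:
            label, margin = predict(model, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to add, so further rounds change nothing
        labeled.extend(confident)
        unlabeled = rest
    return centroids(labeled), labeled


# Two hand-labeled seeds plus four unlabeled points (toy data).
model, grown = self_train([(0.0, "neg"), (10.0, "pos")],
                          [1.0, 9.0, 4.0, 6.5], min_margin=4.0)
print(model, len(grown))
```

The confidence threshold is the main design choice: set too low, early mistakes are fed back as training data; set too high, the unlabeled pool is never used.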
Active Learning
• Minimize number of examples a human annotator must label [Cohn et al. 1994]
• Process examples in order of usefulness
• Usefulness = uncertainty [Lewis & Catlett 1994]
[figure: active learning compared with randomly selected instances; 211,000 manually labeled instances]
[Schohn & Cohn, 2000; fig. from Pierce & Cardie, 2003]
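The idea of processing examples in order of usefulness, with usefulness taken to be uncertainty, can be sketched as a selection step over an unlabeled pool. This is a minimal illustration under assumptions: the classifier is binary and exposes a probability for the positive class, and the names `uncertainty`, `select_for_labeling`, and `toy_model` are hypothetical. The toy logistic model stands in for whatever classifier is actually being trained.

```python
import math
from typing import Callable, List, Sequence


def uncertainty(p_positive: float) -> float:
    """1.0 when the model is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(2.0 * p_positive - 1.0)


def select_for_labeling(pool: Sequence[float],
                        predict_proba: Callable[[float], float],
                        batch_size: int) -> List[float]:
    """Return the batch_size pool examples the model is least sure about;
    these are the ones sent to the human annotator next."""
    ranked = sorted(pool,
                    key=lambda x: uncertainty(predict_proba(x)),
                    reverse=True)
    return ranked[:batch_size]


def toy_model(x: float) -> float:
    """Stand-in classifier: logistic function of a single feature."""
    return 1.0 / (1.0 + math.exp(-x))


pool = [-4.0, -0.2, 0.1, 3.0, 1.5]
chosen = select_for_labeling(pool, toy_model, batch_size=2)
print(chosen)  # [0.1, -0.2]: the two points closest to the decision boundary
```

In a full loop, the chosen examples would be labeled by the annotator, added to the training set, and the model retrained before the next selection round.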
Mostly-Unsupervised Learning
Japanese, Chinese, Thai, ...: no spaces between words
Combining simple statistics from unsegmented Japanese newswire yields results rivaling grammar-based approaches. [Ando & Lee 2000, 2003]
New Directions
• Natural language components
  • Semantic interpretation
  • Discourse level understanding
  • NL Generation
• Real-world applications
  • Deeper semantic interpretation
  • Non-factual, non-event-based domains
• Methods
  • Weakly supervised and unsupervised learning algorithms