Overview of Statistical NLP
IR Group Meeting, March 7, 2006
Outline
• Some basic/important NLP problems
• Topics that have recently attracted much interest
• NLP research groups
• Discussion on the relation between NLP and IR
IR Group Meeting -- NLP
Levels of Analysis in NLP (from Dan Roth’s CS598)
• Morphology
  • How words are constructed
• Syntax
  • Structural relations between words
• Semantics
  • The meaning of words and of combinations of words
• Pragmatics
  • How is a sentence used? What’s its purpose?
• Discourse (sometimes distinguished as a subfield of Pragmatics)
  • Relationships between sentences; global context
Some NLP Problems
• N-gram Models
• Word Sense Disambiguation
• Lexical Acquisition
• (POS) Tagging
• (Syntactic) Parsing
• Semantic Role Labeling (Semantic Parsing)
• Named Entity Recognition
• Textual Entailment
• …
N-gram Models
• The task: to estimate P(w_n | w_1, …, w_{n-1})
• Approaches:
  • Maximum likelihood estimation
  • Various smoothing methods
• Applications:
  • Automatic speech recognition
  • Spelling correction
  • Handwriting recognition
  • Statistical machine translation
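The two estimation approaches above can be illustrated with a minimal bigram sketch. It assumes whitespace-tokenized sentences, a toy corpus invented for illustration, and add-one (Laplace) smoothing as one simple instance of the "various smoothing methods"; the function names are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences, adding boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def p_mle(w, prev, unigrams, bigrams):
    """Maximum likelihood estimate: count(prev, w) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, w)] / unigrams[prev]

def p_add_one(w, prev, unigrams, bigrams):
    """Add-one smoothing: every bigram gets one extra pseudo-count."""
    vocab_size = len(unigrams)  # assumption: vocabulary = observed types, incl. markers
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus (illustrative only).
CORPUS = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
UNIGRAMS, BIGRAMS = train_bigram(CORPUS)

print(p_mle("cat", "the", UNIGRAMS, BIGRAMS))      # seen bigram
print(p_mle("ran", "dog", UNIGRAMS, BIGRAMS))      # unseen bigram: MLE gives 0
print(p_add_one("ran", "dog", UNIGRAMS, BIGRAMS))  # smoothing gives it some mass
```

The unseen bigram shows why smoothing matters: the MLE assigns it probability zero, which would zero out the probability of any sentence containing it.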
Word Sense Disambiguation (WSD)
• The task: to determine which of the senses of an ambiguous word is involved in a particular use of the word
• Approaches:
  • Supervised:
    • Log-linear models
    • Information-theoretic
    • Memory-based learning (kNN)
  • Dictionary-based:
    • Sense definitions
    • Thesauri
    • Translations in a second language
  • Unsupervised:
    • Clustering using the EM algorithm
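As an illustration of the dictionary-based approach using sense definitions, here is a minimal sketch of a simplified Lesk-style method: pick the sense whose definition shares the most content words with the context. The sense inventory, the stopword list, and the function names are all made up for this example and are not taken from any real dictionary.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "that", "and", "to", "or", "from"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, sense_definitions):
    """Return the sense whose definition overlaps most with the context words."""
    context_set = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context_set & content_words(definition))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical two-sense inventory for "bank".
BANK_SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a river or lake",
}

print(simplified_lesk("she went to the bank to deposit money", BANK_SENSES))
print(simplified_lesk("they fished from the bank of the river", BANK_SENSES))
```

Exact word matching is a deliberate simplification: "deposit" and "deposits" do not match here, so a practical version would at least stem or lemmatize.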
Word Sense Disambiguation (WSD)
• Accuracy:
  • Word-specific
  • Easy words: > 90%
  • Hard words: 50-70%
• Applications:
  • Statistical machine translation
  • Information retrieval
Lexical Acquisition
• The task: to develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora
• Examples:
  • Verb subcategorization
  • Prepositional phrase attachment disambiguation
  • Selectional preferences
  • Semantic similarity
Semantic Similarity
• The task: to acquire a relative measure of similarity between two words
• Approaches:
  • Vector space measures (document space, word space, modifier space, etc.)
  • Probabilistic measures (KL-divergence, etc.)
• Applications:
  • Information retrieval (query expansion)
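Both families of measures can be sketched in a few lines. The example below assumes words are represented as sparse co-occurrence vectors (for the vector space measure) or as probability distributions over context words (for KL-divergence); the counts and distributions are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def kl_divergence(p, q):
    """D(p || q) in bits. Assumes q[x] > 0 wherever p[x] > 0 (no smoothing here)."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical word-space vectors: co-occurrence counts with context words.
CAR = {"drive": 4.0, "road": 3.0, "engine": 2.0}
TRUCK = {"drive": 3.0, "road": 4.0, "load": 1.0}
BANANA = {"eat": 5.0, "peel": 2.0}

print(cosine(CAR, TRUCK))   # shares contexts, so similarity is high
print(cosine(CAR, BANANA))  # no shared contexts, so similarity is 0

# Hypothetical context distributions for two words.
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}
print(kl_divergence(P, Q), kl_divergence(Q, P))  # not symmetric
```

Note that KL-divergence is a dissimilarity (0 means identical distributions) and is asymmetric, whereas cosine is a symmetric similarity.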
POS Tagging
• The task: labeling each word in a sentence with its appropriate part of speech
• Major approaches
  • HMM
  • Transformation-based
  • Advantages: speed and storage
• Other approaches
  • Neural networks, decision trees, memory-based learning, maximum entropy models
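For the HMM approach, tagging amounts to finding the most probable tag sequence under a model with transition and emission probabilities, which the Viterbi algorithm does efficiently. The sketch below is a minimal first-order version; the three-tag model and all its probabilities are invented for illustration, and it assumes a non-empty sentence.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Hypothetical toy model (probabilities are illustrative, not estimated from data).
TAGS = ["DT", "NN", "VB"]
START_P = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS_P = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT_P = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VB": {"barks": 0.6, "walk": 0.4},
}

print(viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P))
```

A practical tagger would work in log space to avoid underflow on long sentences and would smooth the emission probabilities for unknown words.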
POS Tagging
• Accuracy:
  • 95-97%
  • Achieved only when the application text and the training text are from a similar source
• Applications
  • For higher-level NLP tasks: partial parsing, parsing, NER, etc.
  • “…the best lexicalized probabilistic parsers are now good enough that they perform better starting with untagged text and doing the tagging themselves, rather than using a tagger as preprocessor.” (Charniak 1997)
(Syntactic) Parsing
• The task: to find the most likely syntactic parse tree of a sentence
• Approaches:
  • Probabilistic context-free grammar (PCFG)
    • Supervised
    • Unsupervised
  • Lexicalized models
  • Dependency-based models
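To make "most likely parse tree" concrete for the PCFG case, here is a minimal sketch of probabilistic CKY for a grammar in Chomsky normal form. The grammar, its probabilities, and the function name are hypothetical; in a supervised setting the rule probabilities would instead be estimated from a treebank.

```python
from collections import defaultdict

def cky_best_parse(words, lexical, binary, root="S"):
    """Return (probability, tree) of the best parse, or None if there is none."""
    n = len(words)
    # chart[(i, j)][A] = (probability, back-pointer) for the best A over words[i:j]
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = (p, w)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = p * chart[(i, k)][B][0] * chart[(k, j)][C][0]
                        if prob > chart[(i, j)].get(A, (0.0, None))[0]:
                            chart[(i, j)][A] = (prob, (k, B, C))
    if root not in chart[(0, n)]:
        return None

    def build(i, j, A):
        """Rebuild the tree as nested tuples by following back-pointers."""
        _, bp = chart[(i, j)][A]
        if isinstance(bp, str):
            return (A, bp)
        k, B, C = bp
        return (A, build(i, k, B), build(k, j, C))

    return chart[(0, n)][root][0], build(0, n, root)

# Hypothetical toy grammar in Chomsky normal form.
LEXICAL = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.5, ("VB", "chased"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0, ("VP", "VB", "NP"): 1.0}

print(cky_best_parse("the dog chased the cat".split(), LEXICAL, BINARY))
```

The chart is filled bottom-up over increasingly long spans, so the running time is cubic in sentence length times the number of binary rules.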
(Syntactic) Parsing
• Accuracy:
  • Charniak 1997: recall 0.875, precision 0.874
  • Collins 1997: recall 0.881, precision 0.886
• Applications:
  • For other NLP tasks such as semantic role labeling and relation extraction
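Assuming the recall and precision figures above are labeled bracket scores (the usual way constituency parsers are evaluated), they can be computed by comparing the labeled spans of the predicted tree against those of the gold tree. The sketch below uses sets of (label, start, end) triples; the two bracket sets are invented for illustration, and real scorers handle details such as duplicate brackets and punctuation that are ignored here.

```python
def bracket_scores(gold, predicted):
    """Return (precision, recall) over labeled brackets given as (label, start, end)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold and predicted brackets for a five-word sentence.
GOLD = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
PREDICTED = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 2, 5), ("PP", 3, 5)]

precision, recall = bracket_scores(GOLD, PREDICTED)
print(precision, recall)  # 3 of 5 predicted are correct; 3 of 4 gold are found
```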
Semantic Role Labeling
• The task: to identify the predicate-argument structures in sentences
• Approaches:
  • Supervised learning
• Accuracy:
  • Best ~70% (CoNLL 04 shared task)
• Applications:
  • Information extraction
  • Question answering
Textual Entailment
• The task: given two text fragments, to recognize whether the meaning of one text is entailed by (can be inferred from) the other text
• Approaches:
  • Word overlap
  • Statistical lexical relations
  • Syntactic matching
  • Logical inference
• Accuracy:
  • ~0.56, best ~0.60 (PASCAL Challenge 05)
• Applications:
  • Question answering
  • Multi-document summarization
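The word overlap approach is the simplest of these and can be sketched as follows: treat the hypothesis as entailed if a large enough fraction of its words also appear in the text. The threshold value, the function names, and the example sentences are assumptions made for illustration.

```python
def overlap_score(text, hypothesis):
    """Fraction of distinct hypothesis words that also occur in the text."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    if not hyp_words:
        return 0.0
    return len(hyp_words & text_words) / len(hyp_words)

def entails(text, hypothesis, threshold=0.8):
    """Predict entailment when the overlap reaches a (hypothetical) threshold."""
    return overlap_score(text, hypothesis) >= threshold

TEXT = "john bought a red car in boston yesterday"
print(entails(TEXT, "john bought a car"))  # every hypothesis word is in the text
print(entails(TEXT, "john sold a car"))    # "sold" is missing, overlap is 3/4
```

This baseline ignores word order, negation, and paraphrase, which is consistent with the modest accuracy reported above.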
Tools
• Brill Tagger
• Charniak Parser
• Collins Parser
• MiniPar
• Semantic Parser
• ASSERT Parser
• CCG’s demo
Corpora
• WordNet
• Penn Treebank (Sample)
• PropBank
• FrameNet
Other Tasks
• Automatic Speech Recognition
• Natural Language Generation
• Automatic Summarization
• …
Recent topics
• Unsupervised and semi-supervised approaches
  • Knowledge acquisition bottleneck
• Semantic role labeling
  • Improve the performance of SRL
  • Use the results for other tasks
• Relation extraction
• WSD
• Parsing
• Statistical machine translation
  • Word alignment
NLP Research Groups
• USC/ISI
• Stanford
• UPenn
• Johns Hopkins
• UIUC
• …