Introduction to NLP Tools 09/23/2003
Motivation • Machine Translation • From English to French • What’s needed?
Motivation Cont’d (1) • Syntactic parser • Part-Of-Speech Tagger • Example: NP -> adj noun • Morphological Analyzer • Examples: “tools” -> “tool”; “Who is he?” -> “Who is he ?” • Semantic Analyzer • Word sense disambiguation (“wash dishes”) • Choose the correct translation
Motivation Cont’d (2) • Lexicons • Information about each word: How many senses does it have? What are its possible translations? • Corpus • Useful for training a tool • Useful for evaluation
Outline • Lexicons • Text corpora • Morphological tools • Part-Of-Speech(POS) taggers • Syntactic parsers • Semantic knowledge bases and semantic parser • Speech tools
Lexicons • Definition • A repository for words • Lexicons in LDC(Linguistic Data Consortium) • creating and sharing linguistic resources: data, tools and standards. • CELEX • WordNet
CELEX • Dutch Center for Lexical Information • Lexical databases of English, Dutch and German • 21,000 nouns, 8,000 adjectives and 6,000 verbs • English: • English Orthography, Lemmas • English Phonology, Lemmas • English Morphology, Lemmas • English Syntax, Lemmas • English Frequency, Lemmas • English Orthography, Wordforms • English Phonology, Wordforms • English Morphology, Wordforms • English Frequency, Wordforms • English Corpus Types • English Frequency, Syllables
WordNet • A database of lexical relations • Inspired by current psycholinguistic theories of human lexical memory • Synset: a set of synonyms, representing one underlying lexical concept • Example: • fool {chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug} • Relations link the synsets: hypernym, Has-Member, Member-Of, Antonym, etc.
WordNet Cont’d • Example
pu-erh.cs.utexas.edu$ wn bike -partn
Part Meronyms of noun bike
2 senses of bike
Sense 1
motorcycle, bike
    HAS PART: mudguard, splashguard
Sense 2
bicycle, bike, wheel
    HAS PART: bicycle seat, saddle
    HAS PART: bicycle wheel
    HAS PART: chain
    HAS PART: coaster brake
    HAS PART: handlebar
    HAS PART: mudguard, splashguard
    HAS PART: pedal, treadle, foot lever
    HAS PART: sprocket, sprocket wheel
• Example
pu-erh.cs.utexas.edu$ wn bike
Information available for noun bike
    -hypen Hypernyms
    -hypon, -treen Hyponyms & Hyponym Tree
    -synsn Synonyms (ordered by frequency)
    -partn Has Part Meronyms
    -meron All Meronyms
    -famln Familiarity & Polysemy Count
    -coorn Coordinate Sisters
    -simsn Synonyms (grouped by similarity of meaning)
    -hmern Hierarchical Meronyms
    -grepn List of Compound Words
    -over Overview of Senses
Information available for verb bike
    -hypev Hypernyms
    -hypov, -treev Hyponyms & Hyponym Tree
    -synsv Synonyms (ordered by frequency)
    -famlv Familiarity & Polysemy Count
    -framv Verb Frames
    -simsv Synonyms (grouped by similarity of meaning)
    -grepv List of Compound Words
    -over Overview of Senses
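The synset and HAS PART structure in the `wn bike -partn` output can be pictured as a small graph. The following is only a toy in-memory sketch: the `Synset` class and the helper names are hypothetical, not WordNet's actual storage format or API, and the data is copied by hand from the output above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Synset:
    """A set of synonyms standing for one lexical concept (toy model)."""
    words: Tuple[str, ...]
    has_part: List["Synset"] = field(default_factory=list)


# Data copied from the `wn bike -partn` output; the layout is an assumption.
MUDGUARD = Synset(("mudguard", "splashguard"))
MOTORCYCLE = Synset(("motorcycle", "bike"), [MUDGUARD])
BICYCLE = Synset(
    ("bicycle", "bike", "wheel"),
    [
        Synset(("bicycle seat", "saddle")),
        Synset(("bicycle wheel",)),
        Synset(("chain",)),
        Synset(("coaster brake",)),
        Synset(("handlebar",)),
        MUDGUARD,  # shared with the motorcycle sense
        Synset(("pedal", "treadle", "foot lever")),
        Synset(("sprocket", "sprocket wheel")),
    ],
)
SYNSETS = [MOTORCYCLE, BICYCLE]


def senses(word: str) -> List[Synset]:
    """Return every synset that contains the word, one per sense."""
    return [s for s in SYNSETS if word in s.words]


def part_meronyms(word: str) -> List[List[Tuple[str, ...]]]:
    """For each sense of the word, list the synsets reachable via HAS PART."""
    return [[part.words for part in s.has_part] for s in senses(word)]


if __name__ == "__main__":
    for number, parts in enumerate(part_meronyms("bike"), start=1):
        print(f"Sense {number}: {len(parts)} part(s)")
```

Because a relation links synsets rather than individual words, the same `MUDGUARD` synset is shared by both senses of "bike".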
Corpus • Definition • Collections of text and speech • LDC • Penn Treebank • DSO • Hansard
Some of the Top Corpora from LDC • TIPSTER • Information Retrieval, Data Extraction datasets • TIPSTER project, TREC project • TIMIT Acoustic-Phonetic Continuous Speech Corpus • A corpus of read speech designed to: • Provide speech data for the acquisition of acoustic-phonetic knowledge • Support the development and evaluation of automatic speech recognition systems • ECI (European Corpus Initiative Multilingual Corpus): multilingual electronic text corpus • NTIMIT • A phonetically balanced, continuous-speech, telephone-bandwidth speech database
Penn Treebank • A collection of corpora • Tagged with POS, syntactic roles, predicate/argument structure, dysfluency annotation • How are they made • Hand correction of the output of an errorful automatic process • 3 million words • 1 million words tagged with predicate/argument structure for extracting semantic knowledge
Penn Treebank Cont’d • Corpora • Wall Street Journal • ATIS (Air Travel Information System) • Brown Corpus • IBM Manual Sentences • Library of America Texts: Mark Twain, Henry Adams, Herman Melville ... • MUC-3 Messages • Example (parsed):
( (S (NP-SBJ Rally 's)
     (VP operates
         and
         franchises
         (NP (NP (QP about 160)
                 fast-food restaurants)
             (PP-LOC throughout
                     (NP the U.S))))
• Example (POS-tagged):
Seeking/VBG to/TO block/VB
[ the/DT investors/NNS ]
from/IN buying/VBG
[ more/JJR shares/NNS ]
./.
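The `word/TAG` notation in the tagged example is easy to read programmatically. Below is a minimal sketch, assuming tokens are whitespace-separated and that the square brackets only mark noun-phrase chunks and can be skipped; `parse_tagged` is a hypothetical helper, not part of any Treebank distribution.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG' tokens into (word, tag) pairs, ignoring chunk brackets."""
    pairs = []
    for token in text.split():
        if token in ("[", "]"):
            # Assumption: brackets delimit chunks and carry no tag of their own.
            continue
        # Split on the LAST slash so that a token such as './.' still works.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    sentence = (
        "Seeking/VBG to/TO block/VB [ the/DT investors/NNS ] "
        "from/IN buying/VBG [ more/JJR shares/NNS ] ./."
    )
    for word, tag in parse_tagged(sentence):
        print(f"{word}\t{tag}")
```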
DSO • Word Sense Corpus • Contains sentences in which about 192,800 word occurrences have been tagged with WordNet senses • Taken from the Brown corpus and the Wall Street Journal corpus • 121 nouns and 70 verbs
Hansard • Official records (Hansards) of the 36th Canadian Parliament, in both English and French • 1.3 million pairs of aligned English and French sentences • Example • Comme il est 14 h 30, la Chambre s'ajourne jusqu'à lundi prochain, à 11 heures, conformément au paragraphe 24(1) du Règlement. • It being 2.30 p.m., the House stands adjourned until Monday next at 11 a.m., pursuant to Standing Order 24(1). • Useful for Machine Translation
Morphological Tools • PC-KIMMO • A two-level morphological parser • Porter Stemmer • Penn Treebank Tokenizer • Separates a document into words • “dog?” -> “dog ?”
Porter Stemmer • Simple algorithm that uses a set of cascaded rewrite rules • Example • ATIONAL -> ATE (relational -> relate) • Stem: • The main morpheme of the word, supplying the main meaning • Fast • Used very widely in Information Retrieval • Run the stemmer on the keywords and on the words in the documents
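The tokenization step (“dog?” -> “dog ?”) can be sketched with a single regular expression. This is only an illustration of the idea, not the Penn Treebank tokenizer itself, which also handles contractions, abbreviations, quotes and other cases; the punctuation set chosen here is an assumption.

```python
import re
from typing import List

# Assumption: only these punctuation marks are split off from words.
_PUNCT = re.compile(r"([?!.,;:])")


def tokenize(text: str) -> List[str]:
    """Put spaces around punctuation, then split on whitespace."""
    return _PUNCT.sub(r" \1 ", text).split()


if __name__ == "__main__":
    print(" ".join(tokenize("Who is he?")))
    print(" ".join(tokenize("dog?")))
```

A rule this simple would wrongly split abbreviations such as "U.S.", which is one reason real tokenizers carry longer rule lists.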
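The rewrite-rule idea can be sketched for one step of the cascade. The code below covers only a handful of suffix rules in the style of the ATIONAL -> ATE example, guarded by the algorithm's "measure" condition (the stem must contain at least one vowel-consonant sequence). It is a partial sketch, not a full Porter stemmer; the function names and the selection of rules are choices made here for illustration.

```python
VOWELS = "aeiou"

# A small subset of suffix rules, longest suffixes first.
STEP_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("tional", "tion"),
    ("izer", "ize"),
]


def is_consonant(word: str, i: int) -> bool:
    """'y' counts as a vowel when it follows a consonant, else as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def apply_rules(word: str) -> str:
    """Apply the first matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in STEP_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # suffix matched but the stem is too short
    return word


if __name__ == "__main__":
    for w in ("relational", "conditional", "rational", "digitizer"):
        print(w, "->", apply_rules(w))
```

The measure check is what keeps "rational" from being reduced to "rate": after removing the suffix, the stem "r" contains no vowel-consonant sequence.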
Part-Of-Speech(POS) Taggers • Part-Of-Speech: noun, verb, pronoun, etc. • Brill’s Tagger • HMM Tagger • MXPOST
Brill’s Tagger • Transformation-Based Learning (TBL) tagger • /projects/nlp/brill-pos-tagger • First labels every word with its most likely tag • Then uses learned TBL rules to correct mistakes • Example: • Change NN to VB when the previous tag is TO
HMM Tagger • Also called Maximum Likelihood Tagger • Xerox PARC's HMM tagger: ftp://parcftp.xerox.com/pub/tagger/ • Chooses the tag sequence with the highest probability given the words seen.
MXPOST: Maximum Entropy POS Tagger • A Maximum Entropy Model is a framework that integrates many information sources (called features) for classification • Each candidate tag is a class • Given features of the word (the surrounding words, morphological features, surrounding tags, etc.), decide which class it belongs to.
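The two phases described above can be sketched as follows. The tiny lexicon and the single rule are hand-written stand-ins: a real TBL tagger learns both the most likely tags and the transformation rules from a tagged corpus, and nothing here reflects the actual file formats of Brill's tagger.

```python
from typing import Dict, List, Tuple

# Hypothetical most-likely-tag lexicon (normally estimated from a corpus).
MOST_LIKELY_TAG: Dict[str, str] = {
    "he": "PRP",
    "is": "VBZ",
    "expected": "VBN",
    "to": "TO",
    "race": "NN",
}

# Each rule: (from_tag, to_tag, required_previous_tag).
RULES: List[Tuple[str, str, str]] = [("NN", "VB", "TO")]


def initial_tags(words: List[str]) -> List[str]:
    """Phase 1: give every word its most likely tag (unknown words get NN)."""
    return [MOST_LIKELY_TAG.get(w.lower(), "NN") for w in words]


def apply_rules(tags: List[str], rules: List[Tuple[str, str, str]]) -> List[str]:
    """Phase 2: apply each transformation rule, in order, left to right."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def tag(words: List[str]) -> List[Tuple[str, str]]:
    return list(zip(words, apply_rules(initial_tags(words), RULES)))


if __name__ == "__main__":
    print(tag("He is expected to race".split()))
```

In this toy run, "race" starts out as NN (its most likely tag in the lexicon) and the rule changes it to VB because the previous tag is TO.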
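One standard way to find the most probable tag sequence under an HMM is the Viterbi algorithm. The sketch below uses a hand-made three-tag model; all probabilities are invented for illustration and have nothing to do with the Xerox tagger's trained parameters.

```python
from typing import Dict, List

TAGS = ["DT", "NN", "VB"]

# Invented toy parameters: P(first tag), P(next tag | tag), P(word | tag).
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "bark": 0.2, "barks": 0.3},
    "VB": {"barks": 0.7, "bark": 0.2, "dog": 0.1},
}


def viterbi(
    words: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the tag sequence maximizing P(tags, words) under the HMM."""
    # best[i][t] = (probability of the best path ending in t at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
```

A practical tagger would work with log probabilities to avoid underflow on long sentences and would smooth the emission table for unseen words.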
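The classification step can be sketched as a log-linear model: each (feature, tag) pair has a weight, and the tag probabilities are a softmax over the summed weights. The feature templates and the hand-set weights below are assumptions for illustration only; a real maximum entropy tagger learns its weights from training data, and this is not MXPOST's actual feature set.

```python
import math
from typing import Dict, List, Tuple


def features(words: List[str], i: int, prev_tag: str) -> List[str]:
    """Hypothetical feature templates: word, suffix, previous tag, previous word."""
    word = words[i].lower()
    feats = [f"word={word}", f"suffix={word[-2:]}", f"prev_tag={prev_tag}"]
    if i > 0:
        feats.append(f"prev_word={words[i - 1].lower()}")
    return feats


def tag_distribution(
    feats: List[str],
    weights: Dict[Tuple[str, str], float],
    tags: List[str],
) -> Dict[str, float]:
    """P(tag | features) = exp(sum of active weights) / normalizer."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


# Hand-set weights, purely for illustration.
WEIGHTS = {
    ("word=race", "NN"): 1.0,
    ("prev_tag=TO", "VB"): 2.0,
}

if __name__ == "__main__":
    words = ["to", "race"]
    dist = tag_distribution(features(words, 1, "TO"), WEIGHTS, ["NN", "VB"])
    print(max(dist, key=dist.get), dist)
```

The point of the framework is that overlapping evidence (here, the word itself favoring NN and the previous tag favoring VB) is combined in a single score per class.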
Syntactic Parsers • Collins’ Parser • XTAG • MXPOST: Maximum Entropy Parser
Collins’ Parser • Context-free Grammar • Uses frequencies to resolve ambiguities • To get an idea of how this kind of parser works: Web-based Chart parser
XTAG • An on-going project to develop a wide-coverage grammar for English • Uses a lexicalized Tree Adjoining Grammar (TAG) formalism • Mildly context-sensitive grammar • Consists of a parser, an X-windows grammar development interface and a morphological analyzer • /projects/nlp/xtag/
Semantic Knowledge Bases and Semantic Parser • Analyze what the text says • WordNet • Penn Treebank • Web-based Semantic Parser
WordNet • Represents lexical relations • Useful in word sense disambiguation
Penn Treebank Predicate: fool(Kris)
Semantic Parser • A web-based chart parser enriched with semantic constraints • Example: • Input: My dog has fleas. • Output: has(my(dog),fleas)
Speech Tools • ISIP • EPOS • CSLU Toolkit
ISIP • ISIP(Institute for Signal and Information Processing) public domain speech recognition system • Open research software • Online courses, tutorials, dictionaries, databases • Build your own speech recognition system
EPOS • a language independent rule-driven Text-to-Speech (TTS) system • supports several main speech generation algorithms
CSLU Toolkit • Basic framework and tools for people to build, investigate and use interactive language systems • Speech recognition, natural language understanding, speech synthesis and facial animation technologies • Easy to use; has spread from higher education into homes