A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, by Kenneth Ward Church
HyungSuk Won, NLP Lab., CSE, POSTECH
Introduction
• Test: a 400-word sample => 99.5% correct
• Most errors are attributable to defects in the lexicon
• Remarkably few errors are related to the inadequacies of the extremely over-simplified grammar (a trigram model)
• One might have thought that n-gram models were not adequate for the task, since it is well known that they are inadequate for determining grammaticality, e.g. long-distance dependencies
• BUT, for the tagging application, the n-gram approximation may be acceptable, since long-distance dependencies do not seem to be very important
• Leech, Garside and Atwell: 96.7% on the LOB corpus, using a bigram model modified with heuristics to cope with more important trigrams
1. How Hard is Lexical Ambiguity?
• Ordinary people who do not work in computational linguistics have a strong intuition that lexical ambiguity is not an important problem
• In contrast, CL experts regard lexical ambiguity as a major issue
• Time flies like an arrow
• Flying planes can be dangerous: practically any content word can be used as a noun, verb or adjective, and local context is not always adequate to disambiguate
• However, as Marcus notes, most texts are not actually that hard
• "Garden paths": The horse raced past the barn fell
• After seeing sentences like these, people are apt to conclude that assigning a single part of speech to each word is hopeless
2. Lexical Disambiguation Rules
• Fidditch's lexical disambiguation rule:
(defrule n+prep! ">[**n+prep]!=n[npstarters]")
; a preposition is more likely than a noun before a noun phrase
; this type of rule can be captured with bigram and trigram statistics (easier to obtain than Fidditch-type disambiguation rules)
• If the parser does not use frequency information, then every possibility in the dictionary must be given equal weight => parsing becomes very difficult (ex.) the Holy See
• Dictionaries tend to focus on what is possible, not on what is likely (according to Webster's Seventh New Collegiate Dictionary, every word is ambiguous)
Continued
• (ex.) I see a bird
[NP [N city] [N school] [N committee] [N meeting]]
[NP [N I] [N see] [N a] [N bird]]
[S [NP [N I] [N see] [N a]] [VP [V bird]]]
• With equal weights, "I see a bird" can be bracketed as a four-noun compound, just like "city school committee meeting", or even with "bird" as the verb
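A tiny illustration, not from the paper, of how a Fidditch-style rule like the one above falls out of trigram statistics; the frequencies and function name below are invented for the example:

# Invented trigram frequencies over part-of-speech tags: a preposition (IN)
# before a noun phrase (AT NN) is far more common than a noun (NN) there.
trigram_freq = {("IN", "AT", "NN"): 5200,
                ("NN", "AT", "NN"): 12}

def prefer_tag(candidate_tags, next_two_tags):
    # pick the candidate whose trigram with the two following tags is most frequent
    return max(candidate_tags, key=lambda t: trigram_freq.get((t,) + next_two_tags, 0))

print(prefer_tag(["IN", "NN"], ("AT", "NN")))  # -> IN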
3. The Proposed Method
(PPSS: pronoun, NP: proper noun, VB: verb, UH: interjection, IN: preposition, AT: article, NN: noun)
• lexical probability: the probability of observing part of speech t given word w, estimated as Prob(t | w) = freq(t & w) / freq(w)
• contextual probability: the probability of observing part of speech X given the two following parts of speech Y and Z, estimated as Prob(X | Y Z) = freq(X Y Z) / freq(Y Z)
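A minimal sketch of how these two probabilities could be estimated from a tagged corpus; the corpus format and function names are assumptions, not the paper's code:

from collections import Counter

def estimate_probabilities(tagged_corpus):
    # tagged_corpus: a list of (word, tag) pairs, e.g. [("I", "PPSS"), ("see", "VB"), ...]
    word_freq, word_tag_freq = Counter(), Counter()
    bigram_freq, trigram_freq = Counter(), Counter()
    tags = [t for _, t in tagged_corpus]
    for word, tag in tagged_corpus:
        word_freq[word] += 1
        word_tag_freq[(word, tag)] += 1
    for i in range(len(tags) - 1):
        bigram_freq[(tags[i], tags[i + 1])] += 1
    for i in range(len(tags) - 2):
        trigram_freq[(tags[i], tags[i + 1], tags[i + 2])] += 1

    def lexical_prob(tag, word):
        # Prob(tag | word) = freq(tag & word) / freq(word)
        return word_tag_freq[(word, tag)] / word_freq[word] if word_freq[word] else 0.0

    def contextual_prob(x, y, z):
        # Prob(X | Y Z) = freq(X Y Z) / freq(Y Z), where Y and Z follow X
        return trigram_freq[(x, y, z)] / bigram_freq[(y, z)] if bigram_freq[(y, z)] else 0.0

    return lexical_prob, contextual_prob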
Continued
• A search is performed to find the assignment of part-of-speech tags to words that maximizes the product of the lexical and contextual probabilities
• Procedure (enumerating hypotheses from the end of "I see a bird" toward the beginning):
("NN")
("AT" "NN") ("IN" "NN")
("VB" "AT" "NN") ("VB" "IN" "NN") ("UH" "AT" "NN") ("UH" "IN" "NN")
=> PPSS VB IN NN, NP VB IN NN, PPSS UH IN NN, NP UH IN NN score less well than the hypotheses below and are pruned; since the contextual scoring function has a limited window of three parts of speech, they can never recover
("PPSS" "VB" "AT" "NN") ("NP" "VB" "AT" "NN") ("PPSS" "UH" "AT" "NN") ("NP" "UH" "AT" "NN")
=> the UH hypotheses are pruned for the same reason
("" "PPSS" "VB" "AT" "NN") ("" "NP" "VB" "AT" "NN")
• finally, ("" "" "PPSS" "VB" "AT" "NN")
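A rough sketch of this search, assuming the lexical_prob and contextual_prob estimators from the previous sketch plus a word_tags dictionary mapping each word to its candidate tags (both assumptions); it runs right to left as in the example, and since the scoring window is only three tags, it needs to keep only the best hypothesis for each pair of leading tags:

def tag_sentence(words, word_tags, lexical_prob, contextual_prob):
    # hypotheses: (first two tags of the suffix) -> (score, tag sequence for the suffix)
    best = {(): (1.0, [])}
    for word in reversed(words):
        new_best = {}
        for tag in word_tags[word]:
            for score, seq in best.values():
                s = score * lexical_prob(tag, word)
                if len(seq) >= 2:
                    s *= contextual_prob(tag, seq[0], seq[1])  # tag given the next two tags
                new_seq = [tag] + seq
                key = tuple(new_seq[:2])  # longer context cannot change future scores
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, new_seq)
        best = new_best
    return max(best.values())[1]

# e.g. tag_sentence(["I", "see", "a", "bird"],
#                   {"I": ["PPSS", "NP"], "see": ["VB", "UH"],
#                    "a": ["AT", "IN"], "bird": ["NN"]},
#                   lexical_prob, contextual_prob)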
4. Parsing Simple Non-recursive Noun Phrases Stochastically
• Similar stochastic methods are applied to locate simple noun phrases with very high accuracy
• Stochastic parser
input: a sequence of parts of speech
processing: insert brackets corresponding to the beginning and end of noun phrases
output: [A/AT former/AP top/NN aide/NN] to/IN [Attorney/NP ...] ...
• (ex.) the candidate bracketings of NN VB are: NN VB, [NN] VB, [NN VB], [NN] [VB], NN [VB]
Continued
• AT (article), NN (singular noun), NNS (non-singular noun), VB (uninflected verb), IN (preposition)
• These probabilities were estimated from about 40,000 words (11,000 noun phrases) of training material selected from the Brown Corpus
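A toy sketch of scoring candidate bracketings with such probabilities; the bracket_prob table (a distribution over open bracket, close bracket, or neither between two adjacent parts of speech) and its values are illustrative assumptions, and the exhaustive enumeration stands in for the paper's actual search:

from itertools import product

# Assumed probabilities of "[", "]", or no bracket ("-") between adjacent tags;
# "" marks the sentence boundary, and unseen events get a small default.
bracket_prob = {
    ("", "NN"):   {"[": 0.9, "-": 0.1},
    ("NN", "VB"): {"]": 0.8, "-": 0.2},
    ("VB", ""):   {"-": 0.95, "]": 0.05},
}

def score(tags, mask):
    # mask[i] == 1 iff tags[i] is inside a noun phrase; a bracket is opened
    # or closed wherever the mask changes between adjacent positions
    p = 1.0
    padded = [("", 0)] + list(zip(tags, mask)) + [("", 0)]
    for (t1, m1), (t2, m2) in zip(padded, padded[1:]):
        event = "[" if m2 > m1 else "]" if m1 > m2 else "-"
        p *= bracket_prob.get((t1, t2), {}).get(event, 0.01)
    return p

def best_bracketing(tags):
    # exhaustively try every in/out assignment (fine for short inputs)
    return max(product([0, 1], repeat=len(tags)), key=lambda m: score(tags, m))

print(best_bracketing(["NN", "VB"]))  # -> (1, 0), i.e. [NN] VB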
5. Smoothing Issues
• Sparse-data problems (Zipf's Law) must be alleviated
• Words that never appear in the training corpus: consult a conventional dictionary => add 1 to the frequency count of each possibility listed in the dictionary
• Proper nouns vs. other capitalized words => capitalized words with small frequency counts (< 20) were thrown out of the lexicon (ex.) Act/NP
• 1. add 1 for the proper-noun possibility
(ex.) fall ((1 "JJ") (65 "VB") (72 "NN"))
Fall ((1 "NP") (1 "JJ") (65 "VB") (72 "NN"))
• 2. prepass: label words as proper nouns if they are "adjacent to" other capitalized words
(ex.) White House, States of the Union
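A small sketch of the two lexicon adjustments described above; the entry format and function name are assumptions:

def smoothed_entry(word, corpus_counts, dictionary_tags):
    # corpus_counts: {tag: freq} observed for this word in the training corpus
    # dictionary_tags: parts of speech a conventional dictionary lists as possible
    counts = dict(corpus_counts)
    for tag in dictionary_tags:
        counts[tag] = counts.get(tag, 0) + 1    # add 1 to each dictionary possibility
    if word[:1].isupper():
        counts["NP"] = counts.get("NP", 0) + 1  # add 1 for the proper-noun possibility
    return counts

print(smoothed_entry("Fall", {"JJ": 1, "VB": 65, "NN": 72}, []))
# -> {'JJ': 1, 'VB': 65, 'NN': 72, 'NP': 1}, matching the Fall example above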