CS626-460: Language Technology for the Web/Natural Language Processing. Pushpak Bhattacharyya, CSE Dept., IIT Bombay. Constituent Parsing and Algorithms (with major contributions from Dr. Rajat Mohanty).
Syntax • Syntax is the study of the combination of words into phrases, clauses and sentences. • Syntax describes how sentences and their constituents are structured.
Grammar • A finite set of rules that • generates all and only the sentences of a language • assigns an appropriate structural description to each one.
Grammatical Analysis Techniques • Two main devices: • Breaking up a string: sequential, hierarchical, transformational • Labeling the constituents: morphological, categorial, functional
Hierarchical Breaking up and Categorial Labeling • Poor John ran away. • [S [NP [A Poor] [N John]] [VP [V ran] [Adv away]]]
Hierarchical Breaking up and Functional Labeling • Immediate Constituent (IC) Analysis • Construction types in terms of the function of the constituents: • Predication (subject + predicate) • Modification (modifier + head) • Complementation (verbal + complement) • Subordination (subordinator + dependent unit) • Coordination (independent unit + coordinator)
An Example • In the morning, the sky looked much brighter. • [S [Modifier [Subordinator In] [DU [Modifier the] [Head morning]]] [Head [Subject [Modifier the] [Head sky]] [Predicate [Verbal looked] [Complement [Modifier much] [Head brighter]]]]]
Noun Phrases • John: [NP [N John]] • the student: [NP [Det the] [N student]] • the intelligent student: [NP [Det the] [AdjP intelligent] [N student]]
Noun Phrase • his first five PhD students: [NP [Det his] [Ord first] [Quant five] [N PhD] [N students]]
Noun Phrase • the five best students of my class: [NP [Det the] [Quant five] [AP best] [N students] [PP of my class]]
Verb Phrases • can sing: [VP [Aux can] [V sing]] • can hit the ball: [VP [Aux can] [V hit] [NP the ball]]
Verb Phrase • can give a flower to Mary: [VP [Aux can] [V give] [NP a flower] [PP to Mary]]
Verb Phrase • may make John the chairman: [VP [Aux may] [V make] [NP John] [NP the chairman]]
Verb Phrase • may find the book very interesting: [VP [Aux may] [V find] [NP the book] [AP very interesting]]
Prepositional Phrases • in the classroom: [PP [P in] [NP the classroom]] • near the river: [PP [P near] [NP the river]]
Adjective Phrases • intelligent: [AP [A intelligent]] • very honest: [AP [Degree very] [A honest]] • fond of sweets: [AP [A fond] [PP of sweets]]
Adjective Phrase • very worried that she might have done badly in the assignment: [AP [Degree very] [A worried] [S’ that she might have done badly in the assignment]]
A segment of English Grammar • S’ → (C) S • S → {NP/S’} VP • VP → (AP+) V (AP+) ({NP/S’}) (AP+) (PP+) (AP+) • NP → (D) (AP+) N (PP+) • PP → P NP • AP → (AP) A
PSG Parse Tree • John wrote those words in the Book of Proverbs. • [S [NP [PropN John]] [VP [V wrote] [NP those words] [PP [P in] [NP [NP the Book] [PP of Proverbs]]]]]
Penn Treebank • John wrote those words in the Book of Proverbs. • (S (NP-SBJ (NP John)) (VP wrote (NP those words) (PP-LOC in (NP (NP-TTL (NP the Book) (PP of (NP Proverbs)))))))
PSG Parse Tree • Official trading in the shares will start in Paris on Nov 6. • [S [NP [NP [AP [A Official]] [N trading]] [PP [P in] [NP the shares]]] [VP [Aux will] [V start] [PP in Paris] [PP on Nov 6]]]
Penn POS Tags • Official trading in the shares will start in Paris on Nov 6. [ Official/JJ trading/NN ] in/IN [ the/DT shares/NNS ] will/MD start/VB in/IN [ Paris/NNP ] on/IN [ Nov./NNP 6/CD ]
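The word/TAG notation on this slide is easy to read programmatically. The following is a minimal sketch, assuming tokens are separated by whitespace and that the square brackets only mark chunks; the function name `parse_tagged` is hypothetical and not part of any Penn Treebank tool.

```python
def parse_tagged(text):
    """Split a 'word/TAG' string into (word, tag) pairs, skipping [ ] chunk marks."""
    pairs = []
    for token in text.split():
        if token in ("[", "]"):
            continue  # chunk brackets carry no word or tag
        # rpartition splits on the last '/', so a word containing '/' stays intact
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = ("[ Official/JJ trading/NN ] in/IN [ the/DT shares/NNS ] will/MD "
          "start/VB in/IN [ Paris/NNP ] on/IN [ Nov./NNP 6/CD ]")
pairs = parse_tagged(tagged)
for word, tag in pairs:
    print(f"{word:10s} {tag}")
```

Splitting on the last slash is a design choice made here so that tokens such as fractions would still be handled; the slide itself does not say how such cases are encoded.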
Penn Treebank • Official trading in the shares will start in Paris on Nov 6. • ( (S (NP-SBJ (NP Official trading) (PP in (NP the shares))) (VP will (VP start (PP-LOC in (NP Paris)) (PP-TMP on (NP (NP Nov 6)))))))
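Bracketed Treebank strings like the one above can be turned back into a tree with a short recursive reader. This is a sketch under stated assumptions: labels and words contain no parentheses, a tree is represented as a `(label, children)` tuple, and the names `parse_brackets` and `leaves` are hypothetical. The example string is a shortened fragment of the "John wrote those words" sentence, not the full annotation.

```python
def parse_brackets(text):
    """Read one bracketed tree; returns (label, children), with words as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            word = tokens[pos]          # a leaf word
            pos += 1
            return word
        pos += 1                        # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]         # node label such as NP-SBJ
            pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                        # skip ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the closing bracket")
    return tree


def leaves(tree):
    """Collect the words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


tree = parse_brackets("(S (NP-SBJ (NP John)) (VP wrote (NP those words)))")
print(tree[0], leaves(tree))
```

Error handling is deliberately minimal: a string with a missing closing bracket simply runs off the end of the token list.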
Penn POS Tag Set • Adjective: JJ • Adverb: RB • Cardinal Number: CD • Determiner: DT • Preposition: IN • Coordinating Conjunction: CC • Subordinating Conjunction: IN • Singular Noun: NN • Plural Noun: NNS • Personal Pronoun: PRP • Proper Noun: NNP • Verb base form: VB • Modal verb: MD • Verb (3sg Pres): VBZ • Wh-determiner: WDT • Wh-pronoun: WP
A Fragment of English Grammar • S → NP VP • VP → V NP • NP → NNP | ART N • NNP → Ram • V → ate | saw • ART → a | an | the • N → rice | apple | movie
Derivation • S is a special symbol called the start symbol.
S => NP VP (rewrite S)
=> NNP VP (rewrite NP)
=> Ram VP (rewrite NNP)
=> Ram V NP (rewrite VP)
=> Ram ate NP (rewrite V)
=> Ram ate ART N (rewrite NP)
=> Ram ate the N (rewrite ART)
=> Ram ate the rice (rewrite N)
• Multiple choice points: at several steps more than one rule could apply (for example, NP can be rewritten as NNP or as ART N).
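The derivation can be replayed mechanically. The sketch below encodes the grammar fragment as a Python dictionary and always rewrites the leftmost nonterminal; the list of rule indices stands for the choices made at each choice point. The names `GRAMMAR` and `leftmost_derivation` and the dictionary encoding are assumptions made for illustration.

```python
# Grammar fragment from the slides: each nonterminal maps to its alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["NNP"], ["ART", "N"]],
    "NNP": [["Ram"]],
    "V": [["ate"], ["saw"]],
    "ART": [["a"], ["an"], ["the"]],
    "N": [["rice"], ["apple"], ["movie"]],
}


def leftmost_derivation(choices, start="S"):
    """Rewrite the leftmost nonterminal once per entry in `choices`.

    Each entry is the index of the alternative to use for that nonterminal.
    Returns every sentential form, starting with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # position of the leftmost symbol that still has rewrite rules
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        steps.append(" ".join(form))
    return steps


steps = leftmost_derivation([0, 0, 0, 0, 0, 1, 2, 0])
for step in steps:
    print("=>", step)
```

Changing an index in the choice list picks a different alternative at that choice point, which is exactly where a parser has to search.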
Two Strategies: Top-Down & Bottom-Up • Top-down: Start with S and generate the sentence. • Bottom-up: Start with the words in the sentence and use the rewrite rules backwards to reduce the sequence of symbols to S. • The previous slide showed the top-down strategy.
Bottom-Up Derivation
Ram ate the rice
=> NNP ate the rice (rewrite Ram)
=> NNP V the rice (rewrite ate)
=> NNP V ART rice (rewrite the)
=> NNP V ART N (rewrite rice)
=> NP V ART N (rewrite NNP)
=> NP V NP (rewrite ART N)
=> NP VP (rewrite V NP)
=> S
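Using the rewrite rules backwards can be sketched as a search over reductions: wherever the right-hand side of a rule occurs, replace it by the left-hand side, and backtrack if the result cannot reach S. This is an illustrative brute-force sketch, not an efficient parser; `RULES` and `reduce_to_start` are hypothetical names, and the order in which reductions are found may differ from the order on the slide.

```python
# (left-hand side, right-hand side) pairs for the grammar fragment.
RULES = [
    ("S", ("NP", "VP")), ("VP", ("V", "NP")),
    ("NP", ("NNP",)), ("NP", ("ART", "N")),
    ("NNP", ("Ram",)), ("V", ("ate",)), ("V", ("saw",)),
    ("ART", ("a",)), ("ART", ("an",)), ("ART", ("the",)),
    ("N", ("rice",)), ("N", ("apple",)), ("N", ("movie",)),
]


def reduce_to_start(form, start="S", seen=None):
    """Return the list of forms leading from `form` to (start,), or None."""
    seen = set() if seen is None else seen
    form = tuple(form)
    if form == (start,):
        return [form]
    if form in seen:
        return None                     # already explored without success
    seen.add(form)
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == rhs:
                reduced = form[:i] + (lhs,) + form[i + n:]
                rest = reduce_to_start(reduced, start, seen)
                if rest is not None:
                    return [form] + rest
    return None


path = reduce_to_start("Ram ate the rice".split())
for form in path:
    print("=>", " ".join(form))
```

The `seen` set keeps the search from revisiting a sentential form that has already failed, which matters once the grammar or the sentence grows.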
Parsing Algorithm • A procedure that “searches” through the grammatical rules to find a combination that generates a tree which stands for the structure of the sentence.
Top-Down Parsing (using A*) • DFS on the AND-OR graph • Data structures: • Open List (OL): Nodes to be expanded • Closed List (CL): Expanded Nodes • Input List (IL): Words of sentence to be parsed • Moving Head (MH): Walks over the IL
Trace of Top-Down Parsing • Input List (IL): Ram ate the rice • MH is given as the next unread word • * indicates a ‘useless’ expansion
Initial Condition (T0): OL = S; CL = (empty); MH at Ram
T1: OL = NP VP; CL = S; MH at Ram
T2: OL = NNP ART N VP; CL = S NP; MH at Ram
T3: OL = ART N VP; CL = S NP NNP; MH at ate (Ram is the portion of input consumed)
T4: OL = N VP; CL = S NP NNP ART*; MH at ate
T5: OL = VP; CL = S NP NNP ART* N*; MH at ate
T6: OL = V NP; CL = S NP NNP ART* N* VP; MH at ate
T7: OL = NP; CL = S NP NNP ART* N* VP V; MH at the
T8: OL = NNP ART N; CL = S NP NNP ART* N* VP V NP; MH at the
T9: OL = ART N; CL = S NP NNP ART* N* VP V NP NNP*; MH at the
T10: OL = N; CL = S NP NNP ART* N* VP V NP NNP* ART; MH at rice
T11: OL = (empty); CL = S NP NNP ART* N* VP V NP NNP* ART N; MH at the end of IL
Successful Termination: OL empty AND MH at the end of IL.
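The trace can be approximated in code. The following is a minimal sketch of depth-first top-down parsing with backtracking (plain DFS, not A*), assuming the grammar fragment above and a one-tag-per-word lexicon. Unlike the slides, which put both alternatives for NP on a single open list and mark the failed ones with *, this sketch keeps each alternative as a separate search state, so the closed list it returns contains only the nodes on the successful path. `top_down_parse`, `GRAMMAR`, and `LEXICON` are hypothetical names.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["NNP"], ["ART", "N"]],
}
LEXICON = {"Ram": "NNP", "ate": "V", "saw": "V", "a": "ART", "an": "ART",
           "the": "ART", "rice": "N", "apple": "N", "movie": "N"}


def top_down_parse(words, start="S"):
    """Return the closed list of the successful path, or None if no parse exists."""
    # Each state: (open list, moving-head position, closed list).
    agenda = [((start,), 0, ())]
    while agenda:
        open_list, head, closed = agenda.pop()      # LIFO pop gives depth-first search
        if not open_list:
            if head == len(words):                  # OL empty AND MH at end of IL
                return list(closed)
            continue
        symbol, rest = open_list[0], open_list[1:]
        if symbol in GRAMMAR:
            # Push alternatives in reverse so the first rule is tried first.
            for rhs in reversed(GRAMMAR[symbol]):
                agenda.append((tuple(rhs) + rest, head, closed + (symbol,)))
        elif head < len(words) and LEXICON.get(words[head]) == symbol:
            # Preterminal matches the word under the head: consume it.
            agenda.append((rest, head + 1, closed + (symbol,)))
        # Otherwise the state is a dead end and is simply dropped (backtracking).
    return None


result = top_down_parse("Ram ate the rice".split())
print(result)
```

This terminates on the fragment because the grammar has no left-recursive rule; a rule such as NP → NP PP would make a naive top-down search loop.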
Bottom-Up Parsing Basic idea: • Refer to words from the lexicon. • Obtain all POSs for each word. • Keep combining until S is obtained. (to be continued)
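The three steps above can be sketched with a small chart: look up every POS for each word, then keep combining adjacent spans until S covers the whole sentence. This is a CKY-style recognizer written for illustration, and the slides do not say which bottom-up algorithm follows. The extra N reading for "saw" is an assumption added here to show a word with more than one POS; `LEXICON`, `UNARY`, `BINARY`, and `recognize` are hypothetical names.

```python
from itertools import product

# All POS tags per word; the N reading of "saw" is an added assumption.
LEXICON = {"Ram": {"NNP"}, "ate": {"V"}, "saw": {"V", "N"}, "a": {"ART"},
           "an": {"ART"}, "the": {"ART"}, "rice": {"N"}, "apple": {"N"},
           "movie": {"N"}}
UNARY = {"NNP": {"NP"}}                              # NP -> NNP
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("ART", "N"): {"NP"}}


def close_unary(labels):
    """Add every label reachable through single-child rules."""
    result = set(labels)
    frontier = list(labels)
    while frontier:
        for parent in UNARY.get(frontier.pop(), ()):
            if parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def recognize(words, start="S"):
    """True if the grammar derives `words` from `start`."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):                 # step 1 and 2: lexicon lookup
        chart[i, i + 1] = close_unary(LEXICON.get(word, set()))
    for width in range(2, n + 1):                    # step 3: combine shorter spans
        for i in range(n - width + 1):
            j = i + width
            labels = set()
            for k in range(i + 1, j):
                for left, right in product(chart[i, k], chart[k, j]):
                    labels |= BINARY.get((left, right), set())
            chart[i, j] = close_unary(labels)
    return n > 0 and start in chart[0, n]


print(recognize("Ram saw the movie".split()))
```

Storing a set of labels per span lets the ambiguous word keep both readings until a later combination rules one out.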