The Rule-based Parser of the NLP Group of the University of Torino
Leonardo Lesmo
Dipartimento di Informatica and Centro di Scienze Cognitive, Università di Torino, Italy
Email: lesmo@di.unito.it

Goals
• Wide-coverage tool
• Domain-independence
• Extensibility to semantics

Approach
• Manually developed rules
• Two phases: chunking and subcategorization
• Procedural analysis of conjunctions and procedural identification of verbal dependents

TULE (Turin University Linguistic Environment)
Input: Text
1. TOKENIZER: splits the text into words, numbers, punctuation marks. Resource: token automaton. Output: tokens.
2. DICTIONARY LOOKUP: extracts all lexical interpretations of each token. Resources: morphological dictionary, suffix tables. Output: sets of lexical items.
3. POS TAGGER: chooses one lexical interpretation. Resource: tagging rules. Output: lexical items.
4. DEPENDENCY PARSER: establishes the connections between lexical items. Resources: parsing rules, verbal caseframes. Output: parse tree.
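
To make the first stage concrete, here is a minimal Python sketch of a tokenizer that splits text into words, numbers and punctuation marks. It is only a stand-in: the slides say TULE uses a token automaton, whose details are not given, so the regular expression and the names `TOKEN_RE` and `tokenize` are assumptions made for illustration.

```python
import re

# Assumed token classes, tried in this order: numbers (with optional
# decimal/thousand separators), words, single punctuation marks.
TOKEN_RE = re.compile(
    r"(?P<number>\d+(?:[.,]\d+)*)|(?P<word>\w+)|(?P<punct>[^\w\s])"
)


def tokenize(text):
    """Return a list of (token, kind) pairs; whitespace is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token, kind in tokenize("Slitta a Tirana la decisione."):
        print(kind, token)
```

The later stages (dictionary lookup, tagging, parsing) would consume this list; they depend on TULE's own resources and are not sketched here.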

The grammar
• Rule-based dependency grammar
• Chunking (non-verbal groups) + verbal subcategorization frames
• Output: a projective tree represented as pointers to parents, including some null elements (understood items, e.g. pro-drop, and traces)

Parser Architecture
Input: Lexical items
1. CHUNKING: splits the text into groups of strictly connected words. Resource: chunking rules. Output: chunked text.
2. ANALYSIS OF CONJUNCTIONS: connects chunks linked by conjunctions, to form larger chunks. Resource: procedural preference rules 1. Output: chunked text.
3. SEGMENTATION: determines the dependents of verbs. Resource: procedural preference rules 2.
4. VERBAL ATTACHMENT: determines the role (arc labels) of the verbal dependents. Resources: lexical items, verb classes, verbal caseframes. Output: parse tree.

An example
Example: Slitta a Tirana la decisione sullo stato di emergenza. (The decision on the state of emergency has been delayed in Tirana)

Lexical items, each followed by its parse tree info [parent;arc label]:
1 Slitta (SLITTARE VERB MAIN IND PRES INTRANS 3 SING) [0;TOP-VERB]
2 a (A PREP MONO) [1;PREP-RMOD]
3 Tirana (TIRANA NOUN PROPER F SING ££CITY) [2;PREP-ARG]
4 la (IL ART DEF F SING) [1;VERB-SUBJ]
5 decisione (DECISIONE NOUN COMMON F SING DECIDERE INTRANS) [4;DET+DEF-ARG]
6 sullo ((SU PREP MONO) 6.10 (IL ART DEF M SING)) 6: [5;PREP-RMOD] 6.10: [6;PREP-ARG]
7 stato (STATO NOUN COMMON M SING) [6.10;DET+DEF-ARG]
8 di (DI PREP MONO) [7;PREP-RMOD]
9 emergenza (EMERGENZA NOUN COMMON F SING) [8;PREP-ARG]
10 . (#\. PUNCT) [1;END]

Parse tree (partial, as drawn on the slide):
1: Slitta
  Prep-rmod: 2: a
    Prep-arg: 3: Tirana
  Verb-subj: 4: la
    Det+def-arg: 5: decisione
      Prep-rmod: 6: su
        Prep-arg: 6.10: lo
          stato di emergenza
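
The parent-pointer encoding in the example above can be turned back into a tree with a few lines of code. The following Python sketch copies the [parent;arc label] pairs from the example and prints an indented tree; the function names `children_index` and `render` are illustrative and not part of TULE.

```python
from collections import defaultdict

# (id, word, parent id, arc label), copied from the example sentence.
# Ids are kept as strings because of the sub-token id "6.10".
ROWS = [
    ("1", "Slitta", "0", "TOP-VERB"),
    ("2", "a", "1", "PREP-RMOD"),
    ("3", "Tirana", "2", "PREP-ARG"),
    ("4", "la", "1", "VERB-SUBJ"),
    ("5", "decisione", "4", "DET+DEF-ARG"),
    ("6", "su", "5", "PREP-RMOD"),
    ("6.10", "lo", "6", "PREP-ARG"),
    ("7", "stato", "6.10", "DET+DEF-ARG"),
    ("8", "di", "7", "PREP-RMOD"),
    ("9", "emergenza", "8", "PREP-ARG"),
    ("10", ".", "1", "END"),
]


def children_index(rows):
    """Map each parent id to the ids of its dependents, in sentence order."""
    kids = defaultdict(list)
    for node_id, _word, parent, _label in rows:
        kids[parent].append(node_id)
    return kids


def render(rows, root_parent="0"):
    """Return the tree as indented lines, starting from the children of id 0."""
    info = {row[0]: row for row in rows}
    kids = children_index(rows)
    lines = []

    def walk(node_id, depth):
        _, word, _, label = info[node_id]
        lines.append("  " * depth + f"{node_id}: {word} [{label}]")
        for child in kids.get(node_id, []):
            walk(child, depth + 1)

    for top in kids.get(root_parent, []):
        walk(top, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render(ROWS)))
```

Storing only the parent id and arc label per item keeps the output flat (one line per lexical item) while still determining the whole tree.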

Chunking Rules
• Chunking rules are grouped in packets.
• Each packet is associated with a lexical category and describes the possible “chunkable” dependents of words of that category.
• “Chunkable” means that the dependent is handled during chunking (e.g. auxiliaries, but not arguments of verbs).

Example: Puoi dirmi che spettacoli di cabaret posso vedere domani? (Can you tell me what cabaret shows I can see tomorrow?)

Chunked text (word/tag notation; square brackets mark chunks, with the chunk type after the closing bracket):
Puoi/V-modal-2nd-sing-pres dir/V-inf [mi/Pron-1st-dative]Pron [che/Adj-interr spettacoli/Noun [di/Prep cabaret/Noun]P-group]N-group posso/V-modal-1st-sing-pres vedere/V-inf [domani/Adv]A-group ?

A chunk rule
(NOUN                                   ; packet (governing word)
  common                                ; feature (constrains applicability)
  (precedes                             ; position of dep
    (ADJ qualif T (#\- #\' #\"))        ; (and possible words separating head from dep)
    (ADJ ((type qualif) (agree)))       ; category of possible dep (and constraints on it)
    ADJC+QUALIF-RMOD))                  ; label of connecting arc
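
As a rough illustration of how such a rule might be applied, the Python sketch below encodes the same five pieces of information (packet, feature, position, dependent category with constraints, arc label) and attaches a qualifying adjective to the common noun that follows it when the two agree. The data classes, the gender/number reading of `agree`, and the sample phrase "la bella casa" are assumptions; the handling of separator characters is left out.

```python
from dataclasses import dataclass


@dataclass
class Item:
    word: str
    cat: str       # lexical category, e.g. "NOUN"
    subtype: str   # e.g. "common", "qualif"
    gender: str
    number: str


@dataclass
class ChunkRule:
    head_cat: str    # packet (governing word)
    head_type: str   # feature constraining applicability
    position: str    # "precedes": the dependent comes before the head
    dep_cat: str
    dep_type: str
    agree: bool
    label: str


RULE = ChunkRule("NOUN", "common", "precedes", "ADJ", "qualif", True,
                 "ADJC+QUALIF-RMOD")


def agrees(a, b):
    # Assumption: agreement means same gender and number.
    return a.gender == b.gender and a.number == b.number


def apply_rule(rule, items):
    """Return (dependent index, head index, label) arcs licensed by the rule."""
    arcs = []
    for i, head in enumerate(items):
        if (head.cat, head.subtype) != (rule.head_cat, rule.head_type):
            continue
        j = i - 1 if rule.position == "precedes" else i + 1
        if not 0 <= j < len(items):
            continue
        dep = items[j]
        if (dep.cat, dep.subtype) != (rule.dep_cat, rule.dep_type):
            continue
        if rule.agree and not agrees(head, dep):
            continue
        arcs.append((j, i, rule.label))
    return arcs


PHRASE = [
    Item("la", "ART", "def", "F", "SING"),
    Item("bella", "ADJ", "qualif", "F", "SING"),
    Item("casa", "NOUN", "common", "F", "SING"),
]

if __name__ == "__main__":
    print(apply_rule(RULE, PHRASE))
```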

Conjunctions
• When a coordinating conjunction is found, all following and preceding chunks are collected
• All pairs are built, and the best one is chosen according to criteria based on structural similarity and distance
• Special treatment for verbs

Example: Ho incontrato Marco e Lucia e li ho salutati (I met Marco and Lucia and I greeted them)

Chunked text (word/tag notation; square brackets mark chunks):
Ho/V-aux incontrato/V-main [Marco/Noun-Proper]Noun e/Conj-coord [Lucia/Noun-Proper]Noun e/Conj-coord [li/Pron-pers]Pron ho/V-aux salutati/V-main
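
The pair-selection step can be sketched as follows. The slides only say that the criteria are based on structural similarity and distance; the concrete score used here (1 for equal chunk types, minus a small penalty per position of distance from the conjunction) is an assumption chosen for illustration, and the special treatment of verbs is not modelled.

```python
from itertools import product


def best_pair(chunks, conj_index, distance_weight=0.1):
    """Pick the (left, right) chunk indices to coordinate around a conjunction.

    chunks is a list of (text, chunk type) pairs; conj_index is the position
    of the coordinating conjunction in that list.
    """
    left = [i for i in range(len(chunks)) if i < conj_index]
    right = [j for j in range(len(chunks)) if j > conj_index]

    def score(pair):
        i, j = pair
        # Assumed similarity: identical chunk types count as similar.
        similarity = 1.0 if chunks[i][1] == chunks[j][1] else 0.0
        distance = (conj_index - i) + (j - conj_index)
        return similarity - distance_weight * distance

    candidates = list(product(left, right))  # all pairs are built
    if not candidates:
        return None
    return max(candidates, key=score)        # the best one is chosen


CHUNKS = [
    ("Ho incontrato", "V"),
    ("Marco", "Noun"),
    ("e", "Conj-coord"),
    ("Lucia", "Noun"),
]

if __name__ == "__main__":
    print(best_pair(CHUNKS, conj_index=2))
```

With this scoring, similarity dominates distance, so a farther chunk of the same type is preferred over a nearer chunk of a different type.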

Segmentation
• For each verb (going from left to right):
• Look for possible dependents (on its right and left)
• On the left, the search is blocked by the previous verb
• On the right, some “barriers” are defined to stop the search (for instance, a subordinating conjunction acts as a barrier)

Segmented text (word/tag notation; square brackets mark chunks, braces mark the search span of a verb):
Puoi/V-modal-2nd-sing-pres { dir/V-inf [mi/Pron-1st-dative]Pron { [che/Adj-interr spettacoli/Noun [di/Prep cabaret/Noun]P-group]N-group posso/V-modal-1st-sing-pres { vedere/V-inf [domani/Adv]A-group ? } }}}
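
The search described above can be sketched in Python as a scan around each verb. This only collects candidate dependents; choosing among them is the job of the preference rules, which are not modelled. The barrier set, the `V-` prefix test for verbs and the function name are assumptions.

```python
BARRIERS = {"Conj-subord"}  # assumed: subordinating conjunctions stop the search


def is_verb(cat):
    return cat.startswith("V-")


def candidate_dependents(chunks):
    """For each verb index, return (left candidates, right candidates).

    chunks is a list of (text, category) pairs in sentence order.
    """
    result = {}
    for i, (_text, cat) in enumerate(chunks):
        if not is_verb(cat):
            continue
        left = []
        for j in range(i - 1, -1, -1):
            if is_verb(chunks[j][1]):    # blocked by the previous verb
                break
            left.append(j)
        right = []
        for j in range(i + 1, len(chunks)):
            if chunks[j][1] in BARRIERS:  # a barrier stops the search
                break
            right.append(j)
        result[i] = (sorted(left), right)
    return result


CHUNKS = [
    ("Puoi", "V-modal"),
    ("dir", "V-inf"),
    ("mi", "Pron"),
    ("che spettacoli di cabaret", "N-group"),
    ("posso", "V-modal"),
    ("vedere", "V-inf"),
    ("domani", "A-group"),
]

if __name__ == "__main__":
    for verb, (left, right) in candidate_dependents(CHUNKS).items():
        print(CHUNKS[verb][0], left, right)
```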

Verbal Subcategorization
The subcategorization classes:
[Figure: hierarchy of subcategorization classes rooted at verbs. Classes shown: nosubj-verbs, ssubj-inf-verbs, subj-verbs, obj-verbs, indobj-verbs, empty-modal, basic-trans, modal, trans, trans-indobj. Dictionary verbs shown attached to classes: bisognare (need), camminare (walk), dovere (must), potere (can).]

Example subcategorization class definitions:

(subj-verbs (intrans) (verbs)
  ; *** verbs with a subject. Definition of subject
  (verb-subj
    ((noun (agree))
     (art (agree))
     (pron (not (word quale) (type relat)) (case lsubj) (agree))
     (adj (type (indef demons deitt interr poss)) (agree))
     (num (agree))
     (prep (word in) (down (cat pron) (type indef)) (agree)))))

(ssubj-inf-verbs () (verbs)
  ; *** verbs with an inf-verb sentential subject
  (verb-subj ((verb (mood infinite) (agree)))))

(empty-modal () (no-subj-verbs)
  ; *** modals without subject
  (verb-indcompl-modal ((verb (mood infinite)))))
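
One way to read these definitions is as a class hierarchy in which each class lists its parent classes and adds or overrides case definitions. The Python sketch below follows that reading, which is an assumption: it treats the third field of each definition as the parent list, reduces each case to the list of allowed categories, and adds a hypothetical class `toy-trans` with a hypothetical case `verb-obj` purely to show inheritance. The parent of `no-subj-verbs` is also assumed.

```python
# Class table: parents and the cases each class defines itself.
CLASSES = {
    "verbs": {"parents": [], "cases": {}},
    "no-subj-verbs": {"parents": ["verbs"], "cases": {}},  # parent assumed
    "subj-verbs": {
        "parents": ["verbs"],
        "cases": {"verb-subj": ["noun", "art", "pron", "adj", "num", "prep"]},
    },
    "ssubj-inf-verbs": {
        "parents": ["verbs"],
        "cases": {"verb-subj": ["verb"]},
    },
    "empty-modal": {
        "parents": ["no-subj-verbs"],
        "cases": {"verb-indcompl-modal": ["verb"]},
    },
    # Hypothetical class, not taken from the slides.
    "toy-trans": {
        "parents": ["subj-verbs"],
        "cases": {"verb-obj": ["noun"]},
    },
}


def inherited_cases(name, classes=CLASSES):
    """Collect the cases of a class, parents first so children can override."""
    merged = {}
    for parent in classes[name]["parents"]:
        merged.update(inherited_cases(parent, classes))
    merged.update(classes[name]["cases"])
    return merged


if __name__ == "__main__":
    print(inherited_cases("toy-trans"))
```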

Transformations
• Basic class (e.g. trans)
• Transformed classes (e.g. trans, trans+passivization, trans+infinitivization, trans+prodrop, trans+passivization+infinitivization, ...)

Example transformation:

(infinitivization
  replacing (subj-verbs)
  (is-inf-form tr-verb v-casefr)
  (cancel-case s-subj))
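
A minimal Python sketch of how derived case frames might be generated from a basic class is given below. It models only the `cancel-case` action of the infinitivization example. Reading `replacing (subj-verbs)` as "applies to classes that descend from subj-verbs" is an assumption, the `is-inf-form` condition is not modelled, and the case name `s-obj` in the sample frame is hypothetical.

```python
def cancel_case(frame, case):
    """Return a copy of the case frame without the given case."""
    return {name: filler for name, filler in frame.items() if name != case}


# Each transformation: its name, the class it applies to, the case it cancels.
TRANSFORMATIONS = [
    {"name": "infinitivization", "replacing": "subj-verbs", "cancel": "s-subj"},
]


def derive_frames(class_name, ancestors, frame, transformations=TRANSFORMATIONS):
    """Return the basic frame plus one derived frame per applicable transformation."""
    derived = {class_name: dict(frame)}
    for t in transformations:
        if t["replacing"] in ancestors:
            derived[f"{class_name}+{t['name']}"] = cancel_case(frame, t["cancel"])
    return derived


# Sample basic frame; "s-obj" is a hypothetical case name.
TRANS_FRAME = {"s-subj": "noun", "s-obj": "noun"}
FRAMES = derive_frames("trans", {"verbs", "subj-verbs"}, TRANS_FRAME)

if __name__ == "__main__":
    for name, frame in FRAMES.items():
        print(name, frame)
```

Composite names such as trans+passivization+infinitivization would come from applying transformations to already derived frames; that chaining is left out here.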

Some statistics
• Chunking rules
  Total: 295 rules
  Common: 250 rules
  English: 34 rules
  Italian: 7 rules
  Spanish + Catalan: 4 rules
• Base subcategorization
  Total: 118 classes
  Abstract: 21 classes
  plus verbal locutions
  Italian: 40 classes
  English: 1 class
• Derived surface case frames
  2653 case frames

Conclusions
• Test of the parser on other languages, using the same grammar augmented with extra rules (see previous slide)
• Partial use of semantic information (about 400 words classified according to a semantic taxonomy)
• The parser has been used in a project involving spoken and written linguistic interaction with a user. It has been interfaced with a repository of semantic knowledge to build a meaning representation.