Explore morphological disambiguation and syntactic analysis of the Estonian language using the Constraint Grammar framework. Learn about results, applications, and future work in parsing Estonian text. Developed at the Institute of Cybernetics at Tallinn Technical University.
Parsing Estonian with Constraint Grammar Kaili Müürisep Institute of Cybernetics at Tallinn Technical University
Outline • Background • Constraint Grammar framework • Morphological disambiguation • Syntactic analysis • Results • Applications • Future work
Background • Project started in 1995/96 • Two grammar-writers: • morphological disambiguation - Tiina Puolakainen • syntax - Kaili Müürisep
Constraint Grammar • proposed by Fred Karlsson in 1990 (University of Helsinki) • employs surface-near, dependency-oriented syntax • rule-based • integrates morphological disambiguation and shallow syntactic analysis
CG - Parsing Scheme • Input text • Morphological analysis • Identification of clause boundaries • Morphological disambiguation • Determination of syntactic functions • Analysed text
Morphologically analyzed sentence
Eesti
    Eesti+0 //_S_ prop sg gen #cap //    Estonia
    Eesti+0 //_S_ prop sg nom #cap //
    eesti+0 //_G_ #cap //    Estonian
vanimad
    vanim+d //_A_ super pl nom //    oldest
asukad
    asukas+d //_S_ com pl nom //    dwellers
saabusid
    saabu+sid //_V_ main indic impf ps2 sg ps af #Intr //    arrived
    saabu+sid //_V_ main indic impf ps3 pl ps af #Intr //
siia
    siia+0 //_D_ //    here
    siig+0 //_S_ com sg gen //    whitefish
pärast
    pärast+0 //_D_ //    afterwards
    pärast+0 //_K_ post #gen //    after
    pärast+0 //_K_ pre #part //
    pärane+t //_A_ pos sg part //
    pära+st //_S_ com sg el //    residue or stern
viimast
    viimane+t //_A_ pos sg part //    last
    vii+mast //_V_ main sup ps el #NGP-P //    take, lead
...
jääaega
    jää_aeg+0 //_S_ com sg adit //    ice-age
    jää_aeg+0 //_S_ com sg part //
$.
    . //_Z_ Fst //
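The analyser output above can be read into a simple data structure before any constraints are applied. The following is a minimal sketch, assuming a layout with one word form per unindented line followed by indented readings of the shape `lemma+ending //_POS_ features //`. The names `Reading`, `Cohort` and `parse_cohorts` are hypothetical and not part of the original system; punctuation readings without a `+` are simply skipped here.

```python
import re
from dataclasses import dataclass, field

# Assumed reading shape: lemma+ending //_POS_ feature feature ... //
READING_RE = re.compile(
    r"^(?P<lemma>\S+)\+(?P<ending>\S+)\s+//_(?P<pos>[A-Z])_\s*(?P<feats>.*?)\s*//"
)

@dataclass
class Reading:
    lemma: str
    ending: str
    pos: str
    feats: list

@dataclass
class Cohort:
    """A word form together with all of its candidate readings."""
    word: str
    readings: list = field(default_factory=list)

def parse_cohorts(text):
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: a reading of the most recent word form.
            match = READING_RE.match(line.strip())
            if match and cohorts:
                cohorts[-1].readings.append(
                    Reading(match["lemma"], match["ending"],
                            match["pos"], match["feats"].split())
                )
        else:
            # Unindented line: a new word form.
            cohorts.append(Cohort(line.strip()))
    return cohorts

SAMPLE = """\
siia
    siia+0 //_D_ //
    siig+0 //_S_ com sg gen //
jääaega
    jää_aeg+0 //_S_ com sg adit //
    jää_aeg+0 //_S_ com sg part //
"""

cohorts = parse_cohorts(SAMPLE)
for cohort in cohorts:
    print(cohort.word, [(r.lemma, r.pos, r.feats) for r in cohort.readings])
```

A word is morphologically ambiguous exactly when its cohort holds more than one reading, which is the situation the disambiguation constraints are meant to resolve.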
Morphologically disambiguated sentence
Eesti
    Eesti+0 //_S_ prop sg gen #cap //
vanimad
    vanim+d //_A_ super pl nom //
asukad
    asukas+d //_S_ com pl nom //
saabusid
    saabu+sid //_V_ main indic impf ps3 pl ps af #Intr //
siia
    siia+0 //_D_ //
pärast
    pärast+0 //_K_ pre #part //
viimast
    viimane+t //_A_ pos sg part //
jääaega
    jää_aeg+0 //_S_ com sg part //
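To illustrate how a single constraint prunes readings, here is a minimal sketch of a REMOVE-style rule. The rule itself is hypothetical and is not taken from the actual Estonian grammar: it discards a second person singular verb reading when the preceding word is an unambiguous plural nominative noun. Readings are modelled as plain tag sets, and the helper names are assumptions. The one principle borrowed from Constraint Grammar is that a rule never removes the last remaining reading.

```python
def remove_reading(cohorts, i, target, context):
    """Drop readings of cohort i that match `target` if `context` holds.

    The last remaining reading is never removed.
    """
    if not context(cohorts, i):
        return
    kept = [r for r in cohorts[i]["readings"] if not target(r)]
    if kept:
        cohorts[i]["readings"] = kept

def is_ps2_sg_verb(reading):
    return {"V", "ps2", "sg"} <= reading

def prev_is_unambiguous_pl_nom_noun(cohorts, i):
    if i == 0:
        return False
    prev = cohorts[i - 1]["readings"]
    return len(prev) == 1 and {"S", "pl", "nom"} <= prev[0]

# Toy input based on "asukad saabusid" from the example sentence.
sentence = [
    {"word": "asukad", "readings": [{"S", "com", "pl", "nom"}]},
    {"word": "saabusid", "readings": [
        {"V", "main", "indic", "impf", "ps2", "sg"},
        {"V", "main", "indic", "impf", "ps3", "pl"},
    ]},
]

remove_reading(sentence, 1, is_ps2_sg_verb, prev_is_unambiguous_pl_nom_noun)
print(sentence[1]["readings"])
```

A real grammar applies many such rules repeatedly, so that a reading removed by one rule can make the context of another rule unambiguous.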
After adding syntactic labels
Eesti
    Eesti+0 //_S_ prop sg gen #cap // **CLB @OBJ @ADVL @NN>
vanimad
    vanim+d //_A_ super pl nom // @ADVL @AN> @<AN @PRD
asukad
    asukas+d //_S_ com pl nom // @SUBJ @PRD @OBJ @NN> @<NN @ADVL @<Q
saabusid
    saabu+sid //_V_ main indic impf ps3 pl ps af #Intr // @+FMV
siia
    siia+0 //_D_ // @ADVL @AD> @<AD
pärast
    pärast+0 //_K_ pre #part // @ADVL @PN> @<PN
viimast
    viimane+t //_A_ pos sg part // @AN> @<AN @ADVL
jääaega
    jää_aeg+0 //_S_ com sg part // @SUBJ @OBJ @ADVL @<Q @NN> @<NN @<P
Syntactically analyzed sentence
Eesti
    Eesti+0 //_S_ prop sg gen #cap // **CLB @NN>
vanimad
    vanim+d //_A_ super pl nom // @AN>
asukad
    asukas+d //_S_ com pl nom // @SUBJ
saabusid
    saabu+sid //_V_ main indic impf ps3 pl ps af #Intr // @+FMV
siia
    siia+0 //_D_ // @ADVL
pärast
    pärast+0 //_K_ pre #part // @ADVL
viimast
    viimane+t //_A_ pos sg part // @AN>
jääaega
    jää_aeg+0 //_S_ com sg part // @<P
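The step from the fully labelled sentence to the analysed one can be pictured as a series of context-dependent choices among candidate labels. The sketch below shows one SELECT-style rule under stated assumptions: the rule is hypothetical (not copied from the real grammar) and keeps only `@<P` on a partitive noun when a preposition governing the partitive stands to its left with nothing but adjectives in between. The token layout and function names are illustrative.

```python
def select_label(tokens, i, label, context):
    """Keep only `label` on token i if it is a candidate and `context` holds."""
    if label in tokens[i]["labels"] and context(tokens, i):
        tokens[i]["labels"] = [label]

def governed_by_partitive_preposition(tokens, i):
    if not {"S", "part"} <= tokens[i]["tags"]:
        return False
    j = i - 1
    # Scan leftwards, skipping intervening adjectives.
    while j >= 0 and "A" in tokens[j]["tags"]:
        j -= 1
    return j >= 0 and {"K", "pre", "#part"} <= tokens[j]["tags"]

# "pärast viimast jääaega" with the candidate labels from the labelled sentence.
tokens = [
    {"word": "pärast", "tags": {"K", "pre", "#part"},
     "labels": ["@ADVL", "@PN>", "@<PN"]},
    {"word": "viimast", "tags": {"A", "pos", "sg", "part"},
     "labels": ["@AN>", "@<AN", "@ADVL"]},
    {"word": "jääaega", "tags": {"S", "com", "sg", "part"},
     "labels": ["@SUBJ", "@OBJ", "@ADVL", "@<Q", "@NN>", "@<NN", "@<P"]},
]

select_label(tokens, 2, "@<P", governed_by_partitive_preposition)
print(tokens[2]["word"], tokens[2]["labels"])
```

The leftward scan over adjectives mirrors the kind of unbounded context condition that Constraint Grammar rules can express, while the labels of the other two words are left for further rules to resolve.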
Actually ...
Eesti @NN>
vanimad @AN>
asukad @SUBJ
saabusid
    saabu+sid //_V_ main indic impf ps3 pl ps af #Intr // @+FMV
siia
    siia+0 //_D_ // @ADVL
pärast
    pärast+0 //_K_ pre #part // @ADVL
viimast
    viimane+t //_A_ pos sg part // @AN>
    vii+mast //_V_ main sup ps el #NGP-P // @ADVL
jääaega
    jää_aeg+0 //_S_ com sg part // @<P @OBJ
    jää_aeg+0 //_S_ com sg adit // @ADVL
Morphological disambiguation • The morphological analyser of Estonian assigns adequate morphological descriptions to about 99% of the tokens in a text. • In morphologically analysed Estonian text, over 45% of all words are ambiguous and have 2-15 readings. • More than 1125 constraints • 85-90% of words become morphologically unambiguous, and the error rate of the disambiguator is less than 2%.
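Figures such as the share of ambiguous words and the error rate can be computed from a corpus with a manually checked reference analysis. The slides do not define the measures exactly, so the sketch below assumes common definitions: a word is ambiguous if it has more than one reading, and it counts as an error if the correct reading is no longer among the remaining ones. The data are toy values, not results from the project.

```python
def ambiguity_rate(cohorts):
    """Share of words that have more than one reading."""
    return sum(len(readings) > 1 for readings in cohorts) / len(cohorts)

def unambiguous_rate(cohorts):
    """Share of words that have exactly one reading left."""
    return 1.0 - ambiguity_rate(cohorts)

def error_rate(cohorts, gold):
    """Share of words whose correct (gold) reading has been removed."""
    wrong = sum(g not in readings for readings, g in zip(cohorts, gold))
    return wrong / len(cohorts)

# Toy data: each inner list holds the readings of one word.
before = [["gen", "nom", "G"], ["nom"], ["ps2 sg", "ps3 pl"], ["D", "S gen"]]
after = [["gen"], ["nom"], ["ps3 pl"], ["D", "S gen"]]
gold = ["gen", "nom", "ps3 pl", "D"]

print("ambiguous before:", ambiguity_rate(before))
print("unambiguous after:", unambiguous_rate(after))
print("error rate after:", error_rate(after, gold))
```

Under these definitions a disambiguator can trade one measure against the other: leaving more words ambiguous keeps the error rate low, which matches the cautious behaviour reported on the slide.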
Morphological disambiguation (2) • The major ambiguities are between: • The adjectival and verbal readings of participles • The nominative, genitive, partitive and short illative cases of a noun • The adposition, adverb and noun readings • Some coincidences: sai (white bread, got), viis (five, melody, carried), tee (tea, road, do!), või (butter, or, may), tuli (fire/light, came)
Morphological disambiguation (3) The most difficult task is to disambiguate between the nominative, genitive, partitive and short illative cases: (1) maailma-GEN juhtivad majandusriigid 'the leading economic states of the world' (2) maailma-PART juhtivad majandusriigid 'the economic states leading the world' (3) maailma-ILLAT juhtivad majandusriigid 'the economic states leading into the world'
Determination of syntactic functions • 27 syntactic tags (subject, object, adverbial etc) • no direct connection between attribute and head professori (@NN>) nahast (@NN>) portfell professor-GEN leather-ELAT portfolio • > 1300 rules • 83-90% of words become syntactically unambiguous • Correctness is 96.5 - 98.5%
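Because an attribute label only points towards its head, the labels alone do not say which word the head is. The following sketch makes that concrete for the example phrase: it decodes the direction from the position of the arrow in the label and lists every noun in that direction as a possible head. This is an illustration only, not a component of the parser; the function names are hypothetical, and the head noun is given an empty label because the slide shows none for it.

```python
def head_direction(label):
    """Return 'left', 'right' or None depending on where the label's arrow points."""
    if label.startswith("@<"):
        return "left"
    if label.endswith(">"):
        return "right"
    return None

def candidate_heads(tokens, i):
    """List all nouns lying in the direction the label of token i points to."""
    direction = head_direction(tokens[i]["label"])
    if direction == "right":
        span = tokens[i + 1:]
    elif direction == "left":
        span = tokens[:i]
    else:
        return []
    return [t["word"] for t in span if t["pos"] == "S"]

phrase = [
    {"word": "professori", "pos": "S", "label": "@NN>"},
    {"word": "nahast", "pos": "S", "label": "@NN>"},
    {"word": "portfell", "pos": "S", "label": ""},  # label not shown on the slide
]

for i, token in enumerate(phrase):
    print(token["word"], "->", candidate_heads(phrase, i))
```

For professori two candidate heads remain (nahast and portfell), which is exactly the information the shallow analysis leaves open.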
Syntactic disambiguation - problems • Adverbial versus adverbial attributes • Mees sai siiski pidada ühendust mobiiltelefoniga (@ADVL @NN> @<NN) Kosovos sõdivate poegadega. • Man could still keep connection with_mobile-phone in_Kosovo fighting with_sons. • Object in genitive or attribute • Ta asetas mantli (gen @OBJ @NN>) tooli (gen @OBJ @NN>) seljatoele (@ADVL @<NN). • He put coat-GEN chair-GEN back-ALLAT. • 'He put the coat onto the back of a chair.'
Syntactic disambiguation - errors • One clause divides the other into two parts: • Seega oli samm, mille astus Eesti, palju pikem ja otsustavam. • 'Thus the step that Estonia took was much longer and more decisive.' • Ellipsis • Determination of apposition, quantifiers
Applications • Automatic summary generator • Noun phrase detector • Linguistic research • Promising fields of applications: • Information retrieval • Text-to-speech synthesis • Grammar and style checker • Machine translation, translation aids
Future work • Improvement of lexicon, integration of analyser with semantic database • Bigger training corpus • Use of statistical methods • Improvement of tag set • Deeper structure • Prototypes of practical applications