Russian Module for NooJ: design and implementation. Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s Nooj software. NOOJ Conference Inalco, Paris June 16th, 2012. Vincent BÉNET INALCO CREE Recherche assistée par ordinateur.
Russian Module for NooJ: design and implementation
Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software
NooJ Conference, Inalco, Paris, June 16th, 2012
Vincent BÉNET, INALCO, CREE, Recherche assistée par ordinateur
Russian Module for NooJ: design and implementation
• Design of the linguistic resources
• Description of the realization: dictionaries / paradigms / grammars
• Work left to be done…
Writing lexical resources for the Russian language
• Build dictionaries from texts
• Create one "small" dictionary and many grammars for derivational forms:
  раб + а (slave)
  раб + от + а + ть (to work)
  за + раб + от + к + а (salary)
• Complete one "big" existing dictionary and create many grammars
Writing lexical resources for the Russian language
Zalizniak’s grammatical dictionary: 96,000 entries
• a complete dictionary, in inverted alphabetical order, with full grammatical annotation
• Example entry ("to obtain, to reach"):
  достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
  dostigat’ ipf nt 1a$3 (dostignut’//dostich’) has a passive form
  (ipf = imperfective, nt = intransitive)
Writing lexical resources for the Russian language
Problems encountered:
• The classification is complete, but some tags are absent (V, N…)
• The classification is based on accent markers
• Many informal, unclassified annotations have been added
The problem of accent markers was postponed. Zalizniak’s dictionary was re-sorted, and its classification was modified, simplified and completed for computer use.
The design of lexical resources for the Russian language consisted of:
1. creating grammatical tags
2. recoding the dictionary with these tags
3. sorting the dictionary (inverted alphabetical order for each word)
4. fixing a list of paradigm models named after a sample word (karta instead of zh1a)
5. writing paradigms
6. dealing with the ë / e problem
7. allocating models to the words
8. verifying the results
9. testing with texts
10. correcting and proofreading
Writing lexical resources for Russian
1. Creating tags and properties
Categories: N, A, V, ADV…
V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Type = Mvt ;
V_Morph = Pvb | Simp | Sufx | PvbSufx ;
V_SsAsp = Det | Indet ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Constr = intr | tr | sja ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
A_Forme = fc | fl | adv ;
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
A_Deg = Comp | Sup ;
ADV_Deg = Comp ;
Writing lexical resources for Russian
2. Recoding the dictionary
3. Sorting the dictionary to obtain inverted alphabetical ordering
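Inverted alphabetical ordering can be illustrated with a minimal sketch, assuming it means comparing words from their last letter backwards. The function name `inverse_sort` and the sample word list are illustrative only and are not part of the module. The comparison uses Unicode code points, which follow the Russian alphabet except for ё, so a production sort would need a proper collation.

```python
def inverse_sort(words):
    """Sort words by their reversed spelling (inverted alphabetical order).

    Assumption: plain code-point comparison is good enough for the
    illustration; the letter ё would need special handling.
    """
    return sorted(words, key=lambda w: w[::-1])


# Sample words taken from the paradigm model names used in this deck.
sample = ["карта", "книга", "собака", "улица", "неделя"]
print(inverse_sort(sample))
```

Words that share an ending end up next to each other, which is what makes this ordering convenient when assigning inflectional models.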
Writing lexical resources for Russian
4. Paradigm model list
#j1a=karta    #jo1a=korova
#j2a=nedelja  #jo2a=boginja
#j3a=kniga    #jo3a=sobaka
#j4a=tuča     #jo4a=kassirša
#j5a=ulica    #jo5a=volčica
#j6a=statuja  #jo6a=feja
#j7a=linija   #jo7a=furija
5. Writing paradigms
карта = <E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + <B>е/Pr+f+s
  + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;
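To make the paradigm notation concrete, here is a minimal Python sketch that expands such a rule into inflected forms. It is not NooJ's own implementation. It assumes, as the forms in the карта paradigm suggest, that `<E>` leaves the lemma unchanged, `<B>` deletes the last letter, and `<Bn>` deletes the last n letters before the suffix is appended. The names `apply_command`, `inflect` and `KARTA` are hypothetical.

```python
import re


def apply_command(lemma, command):
    """Apply one command such as '<E>', '<B>у' or '<B4>озьму' to a lemma."""
    m = re.fullmatch(r"<(E|B(\d*))>(.*)", command)
    if m is None:
        raise ValueError(f"unsupported command: {command!r}")
    suffix = m.group(3)
    if m.group(1) == "E":            # <E>: empty operation, keep the lemma
        return lemma + suffix
    n = int(m.group(2) or 1)         # <B> deletes 1 letter, <Bn> deletes n
    if n > len(lemma):
        raise ValueError(f"cannot delete {n} letters from {lemma!r}")
    return lemma[:len(lemma) - n] + suffix


def inflect(lemma, paradigm):
    """Expand the right-hand side of a paradigm into (form, tags) pairs.

    Assumption: alternatives are separated by '+' or '|' surrounded by
    whitespace, while '+' inside a tag list has no surrounding whitespace.
    """
    body = paradigm.strip().rstrip(";").strip()
    forms = []
    for item in re.split(r"\s+[+|]\s+", body):
        command, _, tags = item.partition("/")
        forms.append((apply_command(lemma, command), tags))
    return forms


KARTA = ("<E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + "
         "<B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + "
         "<B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;")

for form, tags in inflect("карта", KARTA):
    print(form, tags)
```

Printing the expansion of every model against one sample word is a cheap way to spot a wrong suffix or a wrong deletion count before the model is allocated to thousands of entries.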
Writing lexical resources for Russian
5. Paradigm for verbs
взять = <E>/Inf
  | <B4>озьму/1+s+Pre | <B4>озьмешь/2+s+Pre | <B4>озьмет/3+s+Pre | <B4>озьмем/1+p+Pre | <B4>озьмете/2+p+Pre
  | <B4>озьмёшь/2+s+Pre | <B4>озьмёт/3+s+Pre | <B4>озьмём/1+p+Pre | <B4>озьмёте/2+p+Pre | <B4>озьмут/3+p+Pre
  | <B2>л/m+s+Pa | <B2>ла/f+s+Pa | <B2>ло/n+s+Pa | <B2>ли/p+Pa
  | <B4>озьми/2+s+Imp | <B4>озьмите/2+p+Imp
  | <B2>в/Ger | <B2>вши/Ger
  | <B2>вший/Prtp+Pa+Act+m+s+Im | <B2>вший/Prtp+Pa+Act+m+s+Vi | <B2>вшего/Prtp+Pa+Act+m+an+s+Vi | <B2>вшего/Prtp+Pa+Act+m+s+Ro | <B2>вшему/Prtp+Pa+Act+m+s+Da | <B2>вшим/Prtp+Pa+Act+m+s+Tv | <B2>вшем/Prtp+Pa+Act+m+s+Pr
  | <B2>вшая/Prtp+Pa+Act+f+s+Im | <B2>вшую/Prtp+Pa+Act+f+s+Vi | <B2>вшей/Prtp+Pa+Act+f+s+Ro | <B2>вшей/Prtp+Pa+Act+f+s+Da | <B2>вшей/Prtp+Pa+Act+f+s+Tv | <B2>вшею/Prtp+Pa+Act+f+s+Tv | <B2>вшей/Prtp+Pa+Act+f+s+Pr
  | <B2>вшее/Prtp+Pa+Act+n+s+Im | <B2>вшее/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Ro | <B2>вшему/Prtp+Pa+Act+n+s+Da | <B2>вшим/Prtp+Pa+Act+n+s+Tv | <B2>вшем/Prtp+Pa+Act+n+s+Pr
  | <B2>вшие/Prtp+Pa+Act+p+Im | <B2>вшие/Prtp+Pa+Act+p+Vi | <B2>вших/Prtp+Pa+Act+an+p+Vi | <B2>вших/Prtp+Pa+Act+p+Ro | <B2>вшим/Prtp+Pa+Act+p+Da | <B2>вшими/Prtp+Pa+Act+p+Tv | <B2>вших/Prtp+Pa+Act+p+Pr
  | <B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тый/Prtp+Pa+Pss+m+s+Vi | <B2>того/Prtp+Pa+Pss+m+an+s+Vi | <B2>того/Prtp+Pa+Pss+m+s+Ro | <B2>тому/Prtp+Pa+Pss+m+s+Da | <B2>тым/Prtp+Pa+Pss+m+s+Tv | <B2>том/Prtp+Pa+Pss+m+s+Pr
  | <B2>тая/Prtp+Pa+Pss+f+s+Im | <B2>тую/Prtp+Pa+Pss+f+s+Vi | <B2>той/Prtp+Pa+Pss+f+s+Ro | <B2>той/Prtp+Pa+Pss+f+s+Da | <B2>той/Prtp+Pa+Pss+f+s+Tv | <B2>тою/Prtp+Pa+Pss+f+s+Tv | <B2>той/Prtp+Pa+Pss+f+s+Pr
  | <B2>тое/Prtp+Pa+Pss+n+s+Im | <B2>тое/Prtp+Pa+Pss+n+s+Vi | <B2>того/Prtp+Pa+Pss+n+s+Ro | <B2>тому/Prtp+Pa+Pss+n+s+Da | <B2>тым/Prtp+Pa+Pss+n+s+Tv | <B2>том/Prtp+Pa+Pss+n+s+Pr
  | <B2>тые/Prtp+Pa+Pss+p+Im | <B2>тые/Prtp+Pa+Pss+p+Vi | <B2>тых/Prtp+Pa+Pss+an+p+Vi | <B2>тых/Prtp+Pa+Pss+p+Ro | <B2>тым/Prtp+Pa+Pss+p+Da | <B2>тыми/Prtp+Pa+Pss+p+Tv | <B2>тых/Prtp+Pa+Pss+p+Pr
  | <B2>т/Prtp+Pa+Pss+m+s+fc | <B2>та/Prtp+Pa+Pss+f+s+fc | <B2>то/Prtp+Pa+Pss+n+s+fc | <B2>ты/Prtp+Pa+Pss+p+fc ;
Writing lexical resources for Russian
6. The problem of the letters ë / e (partially solved: two entries or two paradigms)
Two entries:
ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач
Two sets of forms in one paradigm:
жевать = <E>/Inf
  | <B5>ую/1+s+Pre | <B5>уёшь/2+s+Pre | <B5>уёт/3+s+Pre | <B5>уём/1+p+Pre | <B5>уёте/2+p+Pre
  | <B5>уешь/2+s+Pre | <B5>ует/3+s+Pre | <B5>уем/1+p+Pre | <B5>уете/2+p+Pre | <B5>уют/3+p+Pre
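For comparison, a different technique from the one used in the module is to fold ё to е at lookup time instead of duplicating entries or forms. The sketch below is only an illustration of that alternative; the function names `fold_yo`, `build_index` and `lookup` are hypothetical, and the approach loses the distinction between words that differ only by ё / е (for example все / всё).

```python
import unicodedata


def fold_yo(word):
    """Return the word with ё/Ё replaced by е/Е.

    NFC normalization first turns a decomposed 'е' + combining diaeresis
    into the single letter 'ё', so that both spellings fold the same way.
    """
    word = unicodedata.normalize("NFC", word)
    return word.replace("ё", "е").replace("Ё", "Е")


def build_index(entries):
    """Index (lemma, info) pairs under their ё-folded spelling."""
    index = {}
    for lemma, info in entries:
        index.setdefault(fold_yo(lemma), []).append((lemma, info))
    return index


def lookup(index, token):
    """Find entries for a token, whether or not it is written with ё."""
    return index.get(fold_yo(token), [])


entries = [("ёжик", "N+m+an+FLX=бульдог"), ("ёж", "N+m+an+FLX=богач")]
index = build_index(entries)
print(lookup(index, "ежик"))   # token written without ё
print(lookup(index, "ёжик"))   # token written with ё
```

The trade-off is that folding keeps the dictionary half the size for affected words but adds ambiguity wherever ё is distinctive, whereas the two-entry solution keeps both spellings explicit.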
Writing lexical resources for Russian
7. Allocating models to words
8. Verifying paradigms
abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist
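One simple check during verification is that every model named after `FLX=` actually exists in the paradigm file. The following sketch assumes the entry layout shown on this slide (lemma, comma, category, then `+`-separated properties). The function names and the set of known paradigms are hypothetical and serve only as an illustration.

```python
def parse_entry(line):
    """Split an entry such as 'abak,N+m+inan+FLX=čajnik' into its parts."""
    lemma, _, info = line.strip().partition(",")
    features = info.split("+")
    paradigm = None
    properties = []
    for feature in features[1:]:
        if feature.startswith("FLX="):
            paradigm = feature[len("FLX="):]
        else:
            properties.append(feature)
    return {
        "lemma": lemma,
        "category": features[0],
        "properties": properties,
        "paradigm": paradigm,
    }


def missing_paradigms(lines, known):
    """Return the sorted paradigm names used by entries but not defined."""
    used = {parse_entry(line)["paradigm"] for line in lines}
    return sorted(name for name in used if name not in known)


lines = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abazin,N+m+an+FLX=artist",
    "abaz,N+m+inan+FLX=zavod",
    "abak,N+m+inan+FLX=čajnik",
    "abbat,N+m+an+FLX=artist",
]
# Hypothetical situation: only two models have been written so far.
print(missing_paradigms(lines, known={"zavod", "artist"}))
```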
Writing lexical resources for Russian
9. Testing with Russian texts:
• "The Nose" by Gogol
• "The Gambler" by Dostoevsky
• "The Prisoner of the Caucasus" by Tolstoy
• "The Lady with the Dog" by Chekhov
• "Short stories" by Harms
Writing lexical resources for Russian
10. Correcting errors:
• bad encoding (mixed Latin/Cyrillic letters that look alike): A B E K M H O P C y X, as in MOCKBA
• errors in paradigms
• bad allocation of models to words
• mobile vowel / palatalization
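Mixed Latin/Cyrillic spellings are invisible to the eye but easy to detect mechanically. The sketch below is one possible check, not the procedure used for the module: it classifies each letter by its Unicode character name and flags words that contain both scripts. The function names are hypothetical, and a word typed entirely in Latin lookalikes is not flagged by this test.

```python
import unicodedata


def scripts(word):
    """Return the set of scripts ('Cyrillic', 'Latin', 'Other') in a word."""
    found = set()
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
        else:
            found.add("Other")
    return found


def is_mixed(word):
    """True if the word contains both Cyrillic and Latin letters."""
    return {"Cyrillic", "Latin"} <= scripts(word)


# Escapes are used so that the script of every letter is unambiguous.
moskva = "\u041c\u043e\u0441\u043a\u0432\u0430"      # all Cyrillic
corrupted = "\u041c\u043e\u0441\u043a\u0432" + "a"   # final letter is Latin
print(is_mixed(moskva), is_mixed(corrupted))
```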
Improving lexical resources
• Increase the number of different models?
  The aim is to avoid generating unexpected or incongruous forms, or failing to recognize existing forms: Читав? (Čitav?) Пиша? (Piša?) Счастие? (Ŝastie?)
• Suppress word entries and/or forms?
  - useless words that are a source of unnecessary ambiguity: the names of letters (а, б, в, и, к, о, с, у, я), archaic unused words
  - repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)
Available lexical resources for Russian
1 compiled basic dictionary containing:
• 1 dictionary of 45,000 nouns (350 paradigms)
• 1 dictionary of 20,000 adjectives (50 paradigms)
• 1 dictionary of 25,000 verbs (600 paradigms)
• 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1,600 adverbs, parenthetical words etc…
Additional compiled dictionaries (optional use):
• 1 dictionary of proper nouns (cities, countries, rivers… first names with diminutives)
• 1 dictionary of substantivized adjectives
Writing Russian grammars for NooJ
Designing disambiguation grammars for:
• grammatical agreement between adjectives & nouns
• case usage with numerals
• case usage with prepositions
• case usage with verbs
Designing grammars to locate syntagms:
• date and time expressions
• adverbial phrases of time, place…
• idiomatic structures (my name is…, I’m … old)
• verbs of motion
Writing Russian grammars for NooJ
Syntactic grammar for Russian
Annotating and disambiguating texts
The text with its ambiguities:
Verifying grammars
The text after disambiguation with the grammar for « NA »:
Russian grammars for NooJ
All these grammars need improvement:
• They are very sensitive to syntactic order: they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard).
• There are no grammars (yet):
  - to disambiguate adverbs / adjectives
  - to disambiguate adjectives / nouns
  - to disambiguate conjunctions / interjections
To get reliable resources for the Russian language, the work left to be done is to design and implement:
• a data bank of verified and annotated texts
• efficient syntactic grammars
• semantic tagging
• unified or harmonized tags across languages (Slavic, Romance, Germanic etc.) to allow further multilingual processing
Russian Module for NooJ http://www.nooj4nlp.net/pages/russian.html
Russian Module for NooJ: design and implementation Спасибо за внимание Thank you for your attention Merci de votre attention NOOJ Conference Inalco June 16th, 2012 vincent.benet@inalco.fr INALCO