Optimizing Word Sketches for a Large-Scale Lexicographic Project • Vladimír Benko • Comenius University, Faculty of Education • Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics • Vladimir.Benko@fedu.uniba.sk
Slovník súčasného slovenského jazyka • (Dictionary of the Contemporary Slovak Language) • A long-term project • First presented: EURALEX 1992, Helsinki • Real compilation started 1996 • First volume (A–G) published 2006 (appeared January 2007) • Second volume (H–L) to appear December 2010 • Third volume (M–P1): 75% compiled
Infrastructure • 1996: • one PC per room, MS-DOS • Novell Server • some PCs at home, mostly without Internet connection • today: • dual/triple screen PC for every lexicographer • 4 servers (2 for dictionary projects, 2 for corpora) • PC at home, Internet connection
Slovník súčasného slovenského jazyka • Lexical data: pure text + lightweight markup language • (similar to Wikipedia Markup) • "headword" (bold) • 'example' (italics) • |label| (smaller print) • [*reference] (smaller print) • {structure} (sense numbers, idiom indicators) • !identification line • ?comment line
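The delimiters listed above lend themselves to a very small renderer. The following is only a sketch under stated assumptions, not the project's actual tooling: the function names `render_inline` and `render_entry`, the choice of HTML tags, and the treatment of `!` and `?` lines (identification kept as an HTML comment, comments dropped) are all hypothetical.

```python
import re
from html import escape

# One alternation per delimiter named on the slide; exactly one group matches.
TOKEN = re.compile(
    r'"(?P<headword>[^"]+)"'
    r"|'(?P<example>[^']+)'"
    r"|\|(?P<label>[^|]+)\|"
    r"|\[\*(?P<reference>[^\]]+)\]"
    r"|\{(?P<structure>[^}]+)\}"
)

# Hypothetical mapping of markup kinds to HTML tags (bold, italics, smaller print).
TAGS = {
    "headword": "b",
    "example": "i",
    "label": "small",
    "reference": "small",
    "structure": "span",
}


def render_inline(text):
    """Convert the inline markup of one line to HTML, escaping everything else."""
    out, pos = [], 0
    for m in TOKEN.finditer(text):
        out.append(escape(text[pos:m.start()], quote=False))
        kind = m.lastgroup
        tag = TAGS[kind]
        out.append(f"<{tag}>{escape(m.group(kind), quote=False)}</{tag}>")
        pos = m.end()
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


def render_entry(source):
    """Render a whole entry; '!' and '?' line handling is an assumption."""
    rendered = []
    for line in source.splitlines():
        if line.startswith("?"):
            continue  # comment line: assumed not to be printed
        if line.startswith("!"):
            rendered.append(f"<!-- id: {line[1:].strip()} -->")
            continue
        rendered.append(render_inline(line))
    return "\n".join(rendered)
```

A single-pass tokenizer is used so that one delimiter never rewrites the output of another. Text containing apostrophes inside examples would need a stricter pattern than this sketch provides.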
Corpora • 5 M corpus since 1998 • 20 M corpus in 2000 • Slovak National Corpus since 2003 • at present: • 550 M (60 % newspapers and journals) • web corpus (87 M, growing) • WSE since 2007 • now: version 4 of Slovak Sketch Grammar
Word Sketch Rules • *DUAL • =modifier/modifié • 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM" • 1:"NOM" ("ADJ" "KON")? 2:"ADJ" • 1:"V.*" 2:"ADV" • 2:"ADV" 1:"V.*"
Word Sketch Rule Names (CNC, “A” Style) • is_subj_of • has_subj • is_obj4_of • has_obj4 • a_modifier • modifies • prec_prep • coord • gen1 • gen2
Word Sketch Rule Names (CNC, “A” Style) • KW is_subj_of CL • KW has_subj CL • KW is_obj4_of CL • KW has_obj4 CL • CL is a_modifier of KW • KW modifies CL • CL is prec_prep of KW • KW & CL are coord'ed • CL is gen1 case • KW is gen2 case
Word Sketch Rule Names (“A” Style) • Rule names motivated syntactically (named by syntactic function) • Keyword/Collocate position (usually) not indicated • Keyword/Collocate PoS implied • Some relationships difficult to name • Transparent for basic relationships • Difficult to extend • Precision preferred over Recall
Word Sketch Rule Names (“V” Style) • *DUAL • =a_modifier/modifies • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"] • *DUAL • =Aj X/X Nn • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]
Word Sketch Rule Names (“V” Style) • *DUAL • =Aj X/X Nn • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"] • =Aj X • 2:[tag="A.*"] []{0,2} 1:[] • =X Nn • 1:[] []{0,2} 2:[tag="N.*"]
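To make the effect of the *DUAL rule concrete, here is a minimal sketch of how the pattern `2:[tag="A.*"] []{0,2} 1:[tag="N.*"]` pairs tokens and how each match feeds two differently named relations. It is not the Sketch Engine implementation: the functions `match_adj_noun` and `dual_sketch`, the one-letter tags, and the use of word forms instead of lemmas are assumptions made for illustration.

```python
import re


def match_adj_noun(tokens, max_gap=2):
    """Return (noun, adjective) pairs for 2:[tag="A.*"] []{0,2} 1:[tag="N.*"].

    tokens is a list of (word, tag) tuples; the tagset is hypothetical.
    """
    pairs = []
    for i, (adj, adj_tag) in enumerate(tokens):
        if not re.fullmatch(r"A.*", adj_tag):
            continue
        # Allow 0..max_gap arbitrary tokens between adjective and noun.
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break
            noun, noun_tag = tokens[j]
            if re.fullmatch(r"N.*", noun_tag):
                pairs.append((noun, adj))
    return pairs


def dual_sketch(tokens):
    """File every match under both names of the DUAL rule '=Aj X/X Nn'."""
    sketch = {}
    for noun, adj in match_adj_noun(tokens):
        # Noun as keyword X: the adjective is the collocate ("Aj X").
        sketch.setdefault((noun, "Aj X"), []).append(adj)
        # Adjective as keyword X: the noun is the collocate ("X Nn").
        sketch.setdefault((adj, "X Nn"), []).append(noun)
    return sketch


# Fragment of the sample sentence used later in the slides.
tokens = [("novým", "A"), ("exodom", "N"), ("vystrašených", "A"), ("Rwanďanov", "N")]
print(match_adj_noun(tokens))
```

On this fragment the gap `[]{0,2}` also pairs "novým" with "Rwanďanov", which is the kind of noise a recall-oriented rule accepts and leaves to the salience ranking.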
Word Sketch Rule Names (“V” Style) • (UC = upper case, LC = lower case) • Keyword: X (UC) • Collocate: Vb, Aj, Av, … (UC+LC) • Collocate: Y (UC) • Keyword/Collocate modifier/restriction: sgX (LC+UC), usually in UNARY rules • Secondary Collocate: %s (LC), in TRINARY rules
Word Sketch Rule Names (SNC, “V” Style) • (BINARY) • Vb X / X Vb • Av X / X Av • Nm X / X Nm • Aj X / X Aj • Y X / X Y • Pp X / X Pp • SYMMETRIC • X , X • X Cj X • TRINARY • pp X Y, …
Word Sketch Rule Names (SNC, “V” Style) • UNARY • sgX, plX • nomX, genX, datX, accX, vocX, locX, insX • pX, cX, sX • 1pX, 2pX, 3pX • SbX, ...
Word Sketch Rule Names (“V” Style) • Rule names motivated collocationally (named by PoS of Keyword/Collocate) • Keyword/Collocate position indicated explicitly • Keyword/Collocate PoS indicated (usually) explicitly • All relationships can be named uniformly • Name of syntactic function not present • Easily extensible • Recall preferred over Precision
Special Treatment: Reflexive Verbs • Reflexivity of verbs in Slovak: • Reflexive formant sa or si in the vicinity of a verb, which can be regarded as • a) Lexical morpheme (“inherent” reflexivity) • b) Reflexive pronoun (“proper” reflexivity or reciprocity) • c) Grammatical formant (reflexive form of a non-reflexive verb)
Special Treatment: Reflexive Verbs • In dictionaries: • Case (a) implies the creation of a new headword (either in a common entry with the non-reflexive form of the respective verb, or in an entry of its own) • Case (b) may generate a new headword, or be indicated in another way (e.g. within the example zone), depending on the type and size of the dictionary • Case (c) is a syntactic phenomenon that dictionaries usually do not treat systematically
Special Treatment: Reflexive Verbs • Reflexives in SSSJ: always in separate entries • holiť sa -lí sa -lia sa hoľ sa! -lil sa -liac sa -liaci sa -lený -lenie sa nedok. • {1} (ø; čím) rezaním odstraňovať zo svojej tváre (al. častí tela) chlpy: musí sa denne h.; h. sa namokro; h. sa v podpazuší; Husto zarastal a zavčasu sa holil. [Š. Žáry]; A ráno sa holí mojou žiletkou. [M. Zelinka]; Sám sa holiť nemohol, lebo sa mu od ťažkej roboty triasli ruky. [B. Šikula]; h. sa strojčekom, žiletkou, britvou; holí sa každé ráno; Dlho som sa neholil, narástla mi brada a fúzy. [P. Jaroš] • {2} dávať si odstraňovať chlpy z tváre (obyč. britvou): otec sa už roky holí u toho istého holiča; • opak. holievať sa -va sa -vajú sa -val sa; dok. -> oholiť sa
Special Treatment: Reflexive Verbs • To be able to separate Word Sketches for the reflexive and non-reflexive forms of a verb, we need • (1) Secondary segmentation: splitting sentences into smaller chunks • (2) Secondary markup: indicating reflexivity for verbs • (3) Use of the secondary markup in Word Sketch rules
Secondary Segmentation and Markup • <s>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov, predovšetkým Hutuov, • ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti, • avizovaného na koniec tohto mesiaca.</s>
Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci</s0> • <s0>a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov,</s0> • <s0>ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>
Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov, • ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>
Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali r0 pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov, • ktorí sa boja r1 odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>
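The marking step shown above can be approximated as follows. This is a simplified sketch, not the procedure actually used for the corpus: it assumes that "in the vicinity" means "inside the same <s0> chunk", that verbs carry a tag starting with "V", and that chunks arrive as lists of (word, tag) tuples. The function name `mark_reflexivity` and the tags are hypothetical.

```python
# Reflexive formants named in the slides.
REFLEXIVE_FORMANTS = {"sa", "si"}


def mark_reflexivity(chunks):
    """Attach r1/r0 to every verb, depending on a formant in the same chunk.

    chunks: list of chunks, each a list of (word, tag) tuples.
    Returns the same structure with (word, tag, flag) tuples; flag is
    'r1', 'r0', or None for tokens that are not verbs.
    """
    marked = []
    for chunk in chunks:
        has_formant = any(word.lower() in REFLEXIVE_FORMANTS for word, _ in chunk)
        flag = "r1" if has_formant else "r0"
        marked.append(
            [(word, tag, flag if tag.startswith("V") else None) for word, tag in chunk]
        )
    return marked


# Two shortened chunks from the sample sentence; tags are illustrative only.
chunks = [
    [("cez", "E"), ("víkend", "N"), ("varovali", "V")],
    [("ktorí", "P"), ("sa", "R"), ("boja", "V"), ("odchodu", "N")],
]
for chunk in mark_reflexivity(chunks):
    print([f"{word}/{flag}" for word, _, flag in chunk if flag])
```

Under these assumptions "varovali" is marked r0 and "boja" r1, matching the slide. A chunk with several verbs and a single formant would mark all of them r1, which shows why the quality of the secondary segmentation matters for this approach.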
Optimizing: Some Minor Issues • Choosing optimal browser • Mozilla Firefox for dual screen display • Google Chrome for dual window display • Default Word Sketch parameters • minimal frequency 4 • minimal salience –2.0 • no collocation clustering • minimal unary score –20.0
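The default thresholds above act as a simple filter on candidate collocates. The sketch below only illustrates that idea: the class `SketchDefaults`, the function `filter_collocates`, the tuple layout, and the sample numbers are invented for this example, and treating both limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchDefaults:
    # Values taken from the slide; field names are hypothetical.
    min_frequency: int = 4
    min_salience: float = -2.0


def filter_collocates(collocates, defaults=SketchDefaults()):
    """Keep (word, frequency, salience) tuples that reach both thresholds."""
    return [
        (word, freq, salience)
        for word, freq, salience in collocates
        if freq >= defaults.min_frequency and salience >= defaults.min_salience
    ]


# Made-up candidates, not real corpus figures.
candidates = [
    ("britvou", 12, 5.1),
    ("denne", 3, 4.0),      # dropped: frequency below 4
    ("a", 900, -3.5),       # dropped: salience below -2.0
    ("žiletkou", 4, -2.0),  # kept: both limits treated as inclusive
]
print(filter_collocates(candidates))
```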
Optimizing: Some Minor Issues • Default Screen Layout • fixed order of tables • 4 columns only (easier to print) • 32 lines per table (to fit the screen) • font selection: Georgia (set in browser)
Infrastructure • 2 servers (eugen & samo)* • Debian, Ubuntu • Apache, Lighttpd • hot backup • three “gates”: • stable • beta (Sandbox) • alpha (Rockbox) • common authentication • ________ • * Eugen Jóna (1909–1985), Eugen Pauliny (1912–1983), Samuel Czambel (1856–1909) ... Slovak linguists