
Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko






Presentation Transcript


  1. Optimizing Word Sketches for a Large-Scale Lexicographic Project • Vladimír Benko • Comenius University, Faculty of Education • Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics • Vladimir.Benko@fedu.uniba.sk

  2. Slovník súčasného slovenského jazyka • (Dictionary of the Contemporary Slovak Language) • A long-term project • First presented: EURALEX 1992, Helsinki • Real compilation started 1996 • First volume (A–G) published 2006 (appeared January 2007) • Second volume (H–L) to appear December 2010 • Third volume (M–P1): 75% compiled

  3. Infrastructure • 1996: • one PC per room, MS-DOS • Novell Server • some PCs at home, mostly without Internet connection • today: • dual/triple screen PC for every lexicographer • 4 servers (2 for dictionary projects, 2 for corpora) • PC at home, Internet connection

  4. Slovník súčasného slovenského jazyka • Lexical data: pure text + lightweight markup language • (similar to Wikipedia Markup) • "headword" (bold) • 'example' (italics) • |label| (smaller print) • [*reference] (smaller print) • {structure} (sense numbers, idiom indicators) • !identification line • ?comment line
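A minimal sketch of how the lightweight markup listed on this slide might be tokenized. Only the delimiters come from the slide; the function name `parse_entry_line`, the token kind names and the output format are hypothetical, and the real SSSJ source format may differ in detail (nesting, escaping, and apostrophes inside examples are not handled here).

```python
import re

# Inline delimiters as listed on the slide; the kind names are illustrative.
INLINE = re.compile(
    r'"(?P<headword>[^"]*)"'             # "headword"   (bold)
    r"|'(?P<example>[^']*)'"             # 'example'    (italics)
    r"|\|(?P<label>[^|]*)\|"             # |label|      (smaller print)
    r"|\[\*(?P<reference>[^\]]*)\]"      # [*reference] (smaller print)
    r"|\{(?P<structure>[^}]*)\}"         # {structure}  (sense numbers etc.)
)


def parse_entry_line(line):
    """Return a list of (kind, text) pairs for one line of entry source."""
    # Whole-line constructs are recognized by their first character.
    if line.startswith("!"):
        return [("identification", line[1:].strip())]
    if line.startswith("?"):
        return [("comment", line[1:].strip())]

    tokens = []
    pos = 0
    for m in INLINE.finditer(line):
        # Anything between two marked-up spans is kept as plain text.
        plain = line[pos:m.start()].strip()
        if plain:
            tokens.append(("text", plain))
        kind = m.lastgroup  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    tail = line[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens
```

Keeping the source as plain text with a handful of delimiters means a parser of this size is enough to recover the typographic roles, which is presumably part of the appeal of such a format.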

  5. Corpora • 5 M corpus since 1998 • 20 M corpus in 2000 • Slovak National Corpus since 2003 • at present: • 550 M (60 % newspapers and journals) • web corpus (87 M, growing) • WSE since 2007 • now: version 4 of Slovak Sketch Grammar

  6. Word Sketch Rules • *DUAL

  7. Word Sketch Rules • *DUAL • 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM" • 1:"NOM" ("ADJ" "KON")? 2:"ADJ" • 1:"V.*" 2:"ADV" • 2:"ADV" 1:"V.*"

  8. Word Sketch Rules • *DUAL • =modifier/modifié • 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM" • 1:"NOM" ("ADJ" "KON")? 2:"ADJ" • 1:"V.*" 2:"ADV" • 2:"ADV" 1:"V.*"

  9. Word Sketch Rule Names (CNC, “A” Style) • is_subj_of • has_subj • is_obj4_of • has_obj4 • a_modifier • modifies • prec_prep • coord • gen1 • gen2

  10. Word Sketch Rule Names (CNC, “A” Style) • KW is_subj_of CL • KW has_subj CL • KW is_obj4_of CL • KW has_obj4 CL • CL is a_modifier of KW • KW modifies CL • CL is prec_prep of KW • KW&CL are coord'ed • CL is gen1 case • KW is gen2 case

  11. Word Sketch Rule Names (“A” Style) • Rule names motivated syntactically (named by syntactic function) • Keyword/Collocate position (usually) not indicated • Keyword/Collocate PoS implied • Some relationships difficult to name • Transparent for basic relationships • Difficult to extend • Precision preferred over Recall

  12. Word Sketch Rule Names (“V” Style) • *DUAL • =a_modifier/modifies • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

  13. Word Sketch Rule Names (“V” Style) • *DUAL • =a_modifier/modifies • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"] • *DUAL • =Aj X/X Nn • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

  14. Word Sketch Rule Names (“V” Style) • *DUAL • =Aj X/X Nn • 2:[tag="A.*"] []{0,2} 1:[tag="N.*"] • =Aj X • 2:[tag="A.*"] []{0,2} 1:[] • =X Nn • 1:[] []{0,2} 2:[tag="N.*"]
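The effect of the rule `2:[tag="A.*"] []{0,2} 1:[tag="N.*"]` under `*DUAL` with the name `=Aj X/X Nn` can be illustrated with a rough Python sketch. This is not the Sketch Engine implementation: the function names, the toy tags (`AJ`, `NN`, `VB`) and the sample tokens are assumptions, tag patterns are assumed to match the whole tag, and sentence boundaries and scoring are ignored.

```python
import re


def match_adj_noun(tokens, max_gap=2):
    """tokens: list of (word, tag) pairs.

    Returns (collocate, keyword) pairs for the pattern
    2:[tag="A.*"] []{0,2} 1:[tag="N.*"], i.e. (adjective, noun).
    """
    pairs = []
    for i, (w2, t2) in enumerate(tokens):
        if not re.fullmatch(r"A.*", t2):
            continue
        # []{0,2}: up to max_gap arbitrary tokens between the labelled positions
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            w1, t1 = tokens[j]
            if re.fullmatch(r"N.*", t1):
                pairs.append((w2, w1))
    return pairs


def dual_sketch(pairs, first_name="Aj X", second_name="X Nn"):
    """Record every pair under both names, as a *DUAL rule would.

    X stands for the keyword: a noun keyword lists its adjectives under
    "Aj X", an adjective keyword lists its nouns under "X Nn".
    """
    sketch = {}
    for adj, noun in pairs:
        sketch.setdefault((noun, first_name), []).append(adj)
        sketch.setdefault((adj, second_name), []).append(noun)
    return sketch


if __name__ == "__main__":
    toy = [("nový", "AJ"), ("veľký", "AJ"), ("exodus", "NN"), ("začal", "VB")]
    for key, collocates in dual_sketch(match_adj_noun(toy)).items():
        print(key, collocates)
```

The sketch shows why the “V” style names are easy to read off the rule: the name states the collocate's PoS and its position relative to the keyword X, without committing to a syntactic function.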

  15. Word Sketch Rule Names (“V” Style) • Keyword: X (UC) • Collocate: Vb, Aj, Av, … (UC+LC) • Collocate: Y (UC) • Keyword/Collocate modifier/restriction: sgX (LC+UC) (usually in UNARY rules) • Secondary Collocate: %s (LC) (in TRINARY rules)

  16. Word Sketch Rule Names (SNC, “V” Style) • (BINARY) • Vb X / X Vb • Av X / X Av • Nm X / X Nm • Aj X / X Aj • Y X / X Y • Pp X / X Pp • SYMMETRIC • X , X • X Cj X • TRINARY • pp X Y, …

  17. Word Sketch Rule Names (SNC, “V” Style) • UNARY • sgX • plX • nomX • genX • datX • accX • vocX • locX • insX • pX • cX • sX • 1pX • 2pX • 3pX • SbX, ...

  18. Word Sketch Rule Names (“V” Style) • Rule names motivated collocationally (named by PoS of Keyword/Collocate) • Keyword/Collocate position indicated explicitly • Keyword/Collocate PoS indicated (usually) explicitly • All relationships can be named uniformly • Name of syntactic function not present • Easily extensible • Recall preferred over Precision

  19. Special Treatment: Reflexive Verbs • Reflexivity of verbs in Slovak: • Reflexive formant sa or si in the vicinity of a verb, which can be regarded as • a) Lexical morpheme (“inherent” reflexivity) • b) Reflexive pronoun (“proper” reflexivity or reciprocity) • c) Grammatical formant (reflexive form of a non-reflexive verb)

  20. Special Treatment: Reflexive Verbs • In dictionaries: • (a) case implies creation of a new headword (either in a common entry with the non-reflexive form of the respective verb, or in an entry of its own) • (b) case may generate a new headword, or be indicated in another way (e.g. within the example zone); it depends on the type and size of the dictionary • (c) case is a syntactic phenomenon; dictionaries usually do not treat it in a systematic way

  21. Special Treatment: Reflexive Verbs • Reflexives in SSSJ: always in separate entries • holiť sa -lí sa -lia sa hoľ sa! -lil sa -liac sa -liaci sa -lený -lenie sa nedok. • {1} (ø; čím) rezaním odstraňovať zo svojej tváre (al. častí tela) chlpy: musí sa denne h.; h. sa namokro; h. sa v podpazuší; Husto zarastal a zavčasu sa holil. [Š. Žáry]; A ráno sa holí mojou žiletkou. [M. Zelinka]; Sám sa holiť nemohol, lebo sa mu od ťažkej roboty triasli ruky. [B. Šikula]; h. sa strojčekom, žiletkou, britvou; holí sa každé ráno; Dlho som sa neholil, narástla mi brada a fúzy. [P. Jaroš] • {2} dávať si odstraňovať chlpy z tváre (obyč. britvou): otec sa už roky holí u toho istého holiča; • opak. holievať sa -va sa -vajú sa -val sa; dok. -> oholiť sa

  22. Reflexive verbs (Slovak Orthographic Dictionary)

  23. Reflexive verbs (Slovak Orthographic Dictionary)

  24. Special Treatment: Reflexive Verbs • To be able to separate Word Sketches for the reflexive and non-reflexive forms of a verb, we need • (1) Secondary segmentation: splitting sentences into smaller chunks • (2) Secondary markup: indicating reflexivity for verbs • (3) Use of secondary markup in Word Sketch rules

  25. Secondary Segmentation and Markup • <s>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov, predovšetkým Hutuov, • ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti, • avizovaného na koniec tohto mesiaca.</s>

  26. Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci</s0> • <s0>a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov,</s0> • <s0>ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>

  27. Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov, • ktorí sa boja odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>

  28. Secondary Segmentation and Markup • <s0>Francúzski vojenskí dôstojníci • a humanitní pracovníci na juhozápade • cez víkend varovali_r0 pred novým exodom • vystrašených Rwanďanov,</s0> <s0>predovšetkým Hutuov, • ktorí sa boja_r1 odchodu francúzskych vojakov • dozerajúcich na poriadok v oblasti,</s0> • <s0>avizovaného na koniec tohto mesiaca.</s0>
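The r0/r1 markup on this slide can be approximated with a simple heuristic, sketched below. The slides do not spell out the actual procedure, so everything here is an assumption: a verb is marked r1 when the formant sa or si occurs anywhere in the same `<s0>` chunk and r0 otherwise, tags are toy values, and the name `mark_reflexivity` is hypothetical.

```python
# Reflexive formants named in the slides.
REFLEXIVE_FORMANTS = {"sa", "si"}


def mark_reflexivity(chunk):
    """chunk: list of (word, tag) pairs from one <s0> segment.

    Returns (word, tag, mark) triples; mark is "r1" or "r0" for verbs
    (toy tags starting with "V") and None for every other token.
    """
    # Assumed heuristic: the formant only has to occur in the same chunk.
    has_formant = any(word.lower() in REFLEXIVE_FORMANTS for word, _ in chunk)
    marked = []
    for word, tag in chunk:
        if tag.startswith("V"):
            marked.append((word, tag, "r1" if has_formant else "r0"))
        else:
            marked.append((word, tag, None))
    return marked


if __name__ == "__main__":
    chunk_a = [("cez", "E"), ("víkend", "N"), ("varovali", "V"), ("pred", "E")]
    chunk_b = [("ktorí", "P"), ("sa", "R"), ("boja", "V"), ("odchodu", "N")]
    for chunk in (chunk_a, chunk_b):
        print([(w, m) for w, _, m in mark_reflexivity(chunk) if m])
```

Restricting the search to one chunk is what makes the secondary segmentation matter: with whole sentences as the unit, the sa in the relative clause would also mark varovali as reflexive.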

  29. Optimizing: Some Minor Issues • Choosing optimal browser • Mozilla Firefox for dual screen display • Google Chrome for dual window display • Default Word Sketch parameters • minimal frequency 4 • minimal salience –2.0 • no collocation clustering • minimal unary score –20.0
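As an illustration of what the default thresholds on this slide do, the following sketch filters a list of collocate candidates. The row format, the function name `filter_collocates` and the sample figures are invented for the example, and treating both minima as inclusive is an assumption.

```python
# Default values taken from the slide; the dictionary keys are illustrative.
DEFAULTS = {"min_freq": 4, "min_salience": -2.0}


def filter_collocates(rows, min_freq=DEFAULTS["min_freq"],
                      min_salience=DEFAULTS["min_salience"]):
    """Keep rows whose frequency and salience reach the given minima.

    rows: list of dicts with "collocate", "freq" and "salience" keys.
    """
    return [
        row for row in rows
        if row["freq"] >= min_freq and row["salience"] >= min_salience
    ]


if __name__ == "__main__":
    # Invented sample data, only to show the effect of the thresholds.
    candidates = [
        {"collocate": "a", "freq": 12, "salience": 5.3},
        {"collocate": "b", "freq": 3, "salience": 7.1},    # below min_freq
        {"collocate": "c", "freq": 40, "salience": -2.5},  # below min_salience
        {"collocate": "d", "freq": 4, "salience": -2.0},   # on the boundary
    ]
    print([row["collocate"] for row in filter_collocates(candidates)])
```

A negative salience minimum keeps candidates that a stricter threshold would drop, which fits the stated preference for recall over precision.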

  30. Optimizing: Some Minor Issues • Default Screen Layout • fixed order of tables • 4 columns only (easier to print) • 32 lines per table (to fit the screen) • font selection: Georgia (set in browser)

  31. Infrastructure • 2 servers (eugen & samo)* • Debian, Ubuntu • Apache, Lighttpd • hot backup • three “gates”: • stable • beta (Sandbox) • alpha (Rockbox) • common authentication • ________ • * Eugen Jóna (1909–1985), Eugen Pauliny (1912–1983), Samuel Czambel (1856–1909) ... Slovak linguists
