ScanLex: Automatic generation of bilingual wordlists
Netordbogen Oslo meeting, 9 January 2006
Anders Nøklestad, Janne Bondi Johannessen
Procedure
• Use parallel corpora
• Use Uplug (http://uplug.sourceforge.net/) to automatically align sentences
• Use the Wortschatz project system from the University of Leipzig (http://wortschatz.uni-leipzig.de/) to find words that co-occur across language pairs
  - (With lots of help from Chris Biemann, University of Leipzig)
• For each word, select the translation with the most significant co-occurrence
• Use Viggo Kann's program to convert the lists to his Netordbog XML format
• Convert the lists to HTML with an XML stylesheet
Languages
• Norwegian Bokmål
• Norwegian Nynorsk
• Swedish
• Danish
• Icelandic
• English
• Not enough data for Faroese
Example: Sentence pair from the OPUS corpus
• (src)="s26"> Dette værktøj vil hjælpe dig med grafisk at indstille serveren for CUPS-printsystemet . De tilgængelige muligheder er grupperet i relaterede emner og du kan hurtigt få adgang til dem gennem ikonvisningen til venstre .
• (trg)="s26"> Dette verktøyet hjelper deg å konfigurere CUPS-printersystemet på en grafisk måte . De mulige valgene er klassifisert hierarkisk , og kan finnes fort ved hjelp av treet til venstre .
Wortschatz format
• Dette@da værktøj@da vil@da hjælpe@da dig@da med@da grafisk@da at@da indstille@da serveren@da for@da CUPS-printsystemet@da .@da De@da tilgængelige@da muligheder@da er@da grupperet@da i@da relaterede@da emner@da og@da du@da kan@da hurtigt@da få@da adgang@da til@da dem@da gennem@da ikonvisningen@da til@da venstre@da .
• Dette@nb verktøyet@nb hjelper@nb deg@nb å@nb konfigurere@nb CUPS-printersystemet@nb på@nb en@nb grafisk@nb måte@nb .@nb De@nb mulige@nb valgene@nb er@nb klassifisert@nb hierarkisk@nb ,@nb og@nb kan@nb finnes@nb fort@nb ved@nb hjelp@nb av@nb treet@nb til@nb venstre@nb .
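The conversion from an aligned sentence pair to the language-tagged format can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: the function name `tag_tokens` is hypothetical, and the sketch assumes the input is already tokenized with spaces and simply tags every token, punctuation included.

```python
def tag_tokens(sentence: str, lang: str) -> str:
    """Append '@<lang>' to every whitespace-separated token.

    Hypothetical helper: assumes the sentence is already tokenized
    (punctuation separated by spaces), as in the OPUS example.
    """
    return " ".join(f"{token}@{lang}" for token in sentence.split())


# Shortened versions of the Danish / Norwegian Bokmål example pair.
danish = "Dette værktøj vil hjælpe dig ."
bokmaal = "Dette verktøyet hjelper deg ."

print(tag_tokens(danish, "da"))
print(tag_tokens(bokmaal, "nb"))
```

Tagging each token with its language keeps the two vocabularies apart, so that identical strings such as "Dette" in Danish and Bokmål are counted as different words.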
Corpora
• For all languages:
  • The KDE subcorpus of the OPUS corpus (http://logos.uio.no/opus)
• For Norwegian Bokmål, Icelandic, and English:
  • The EEA agreement
  • Gathered from various web sites
Corpus sizes
• Norwegian Bokmål: 378,308 tokens
• Norwegian Nynorsk: 299,944 tokens
• Swedish: 1,070,290 tokens
• Danish: 726,964 tokens
• Icelandic: 208,143 tokens
• English: 1,610,779 tokens
Co-occurrence computation
For words A and B, occurring a and b times, respectively, in a corpus of n bi-sentences, and occurring together in k of those bi-sentences, the significance of their co-occurrence is given as:
[equation missing from the source]
where λ = ab/n
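For orientation, a Poisson-based measure commonly cited for the Wortschatz system is sig(A, B) = (λ - k ln λ + ln k!) / ln n, with λ = ab/n. Whether this is exactly the equation shown on the slide is an assumption; the sketch below only illustrates how such a measure could be computed, and the function name is hypothetical.

```python
import math


def cooccurrence_significance(a: int, b: int, k: int, n: int) -> float:
    """Poisson-style co-occurrence significance (assumed form, see lead-in).

    a, b: number of bi-sentences containing word A / word B
    k:    number of bi-sentences containing both words
    n:    total number of bi-sentences in the corpus
    """
    if a <= 0 or b <= 0 or k <= 0 or n <= 1:
        raise ValueError("a, b and k must be positive and n must exceed 1")
    lam = a * b / n  # expected number of joint occurrences under independence
    # math.lgamma(k + 1) equals ln(k!) without computing a huge factorial.
    return (lam - k * math.log(lam) + math.lgamma(k + 1)) / math.log(n)


# Two words that each occur 10 times in 1000 bi-sentences:
# always together (k=10) versus together only once (k=1).
print(cooccurrence_significance(10, 10, 10, 1000))
print(cooccurrence_significance(10, 10, 1, 1000))
```

The more often two words co-occur relative to what λ predicts for independent words, the larger the value becomes.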
Word pairs prior to deduplication
• 13968 scanlex_dais.txt
• 37032 scanlex_danb.txt
• 60554 scanlex_dann.txt
• 76385 scanlex_dasv.txt
• 73009 scanlex_enda.txt
• 60014 scanlex_enis.txt
• 99003 scanlex_ennb.txt
• 62525 scanlex_ennn.txt
• 69164 scanlex_ensv.txt
• 56532 scanlex_isnb.txt
• 15105 scanlex_isnn.txt
• 17878 scanlex_issv.txt
• 338349 scanlex_nbnn.txt
• 108687 scanlex_nbsv.txt
• 57473 scanlex_nnsv.txt
Most significant candidates for “indstille”
+--------------+------------------------+-----+
| w1           | w2                     | sig |
+--------------+------------------------+-----+
| indstille@da | konfigurere@nb         | 807 |
| indstille@da | Her@nb                 | 779 |
| indstille@da | kan@nb                 | 657 |
| indstille@da | du@nb                  | 522 |
| indstille@da | sesjonsbehandleren@nb  | 287 |
| indstille@da | utseendet@nb           | 225 |
| indstille@da | panelet@nb             | 189 |
| indstille@da | panelets@nb            | 168 |
| indstille@da | oppgavelinje@nb        | 153 |
| indstille@da | sette@nb               | 108 |
| indstille@da | -konfigurasjonen@nb    |  61 |
| indstille@da | Windows-filsystemer@nb |  61 |
| indstille@da | cookie@nb              |  61 |
| indstille@da | lines@nb               |  61 |
| indstille@da | mode@nb                |  61 |
| indstille@da | videomodusen@nb        |  61 |
| indstille@da | SMB@nb                 |  60 |
| indstille@da | XFree86@nb             |  60 |
| indstille@da | opp@nb                 |  58 |
| indstille@da | til@nb                 |  57 |
+--------------+------------------------+-----+
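Selecting, for each word, the candidate with the most significant co-occurrence could look roughly like this. It is a sketch under the assumption that the candidates are available as (w1, w2, sig) rows like those in the table; the function name `best_translations` is hypothetical and this is not the project's own code.

```python
def best_translations(rows):
    """Keep, for every source word, the candidate with the highest score.

    rows: iterable of (w1, w2, sig) tuples, as in the candidate table.
    Returns a dict mapping w1 to (w2, sig). On ties the first row wins.
    """
    best = {}
    for w1, w2, sig in rows:
        if w1 not in best or sig > best[w1][1]:
            best[w1] = (w2, sig)
    return best


# The top rows from the candidate table for "indstille".
rows = [
    ("indstille@da", "konfigurere@nb", 807),
    ("indstille@da", "Her@nb", 779),
    ("indstille@da", "kan@nb", 657),
    ("indstille@da", "du@nb", 522),
]

print(best_translations(rows))
```

For the table above this keeps konfigurere@nb and drops the remaining candidates.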
Results
• The lists show the most significantly co-occurring word pairs
• Word lists located at http://omilia.uio.no/scanlex/ws/
Future
• Convert the bilingual word pair lists to a multilingual list
• Supplement with word pairs from Lexin
• Expand and improve the wordlists through more and bigger parallel corpora
  • (The Bible? ENPC?)
• Faroese