totale: Multilingual Tokenisation, Tagging and Lemmatisation
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
JRC Workshop, 26-27 September 2005

Overview of the talk
• Introduction
• The totale pipeline
• Training totale
• Annotating JRC-ACQUIS-sl
• Conclusions
Tomaž Erjavec: Multilingual Tokenisation, Tagging & Lemmatisation

Introduction
• Hypothesis: to exploit the JRC-ACQUIS efficiently, its texts need to be linguistically pre-processed
• Pre-processing normalizes (reduces) the data and gives other tools more features to work with

Example
Input text:
2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

TOKEN      TYPE      LEMMA      MSD
--------------------------------------
2.         TOK_ENUM  2.         Rmp
(a)        TOK_ENUM  (a)        Rmp
Where      TOK       where      Cs
an         TOK       a          Di
exporter   TOK       exporter   Ncns
has        TOK       have       Vaip3s
declared   TOK       declare    Vmps
goods      TOK       good       Ncnp
packaged   TOK       package    Vmis
using      TOK       use        Vmpp
automatic  TOK       automatic  Afp
systems    TOK       system     Ncnp
for        TOK       for        Sp
bagging    TOK       bag        Vmpp
,          PUN
canning    TOK       can        Vmpp
,          PUN
bottling   TOK       bottle     Vmpp
,          PUN
etc.       TOK_ABBR  etc.       Rmp

• MSD and LEMMA are context dependent
• MSD is useful for any syntactically oriented further processing (PoS filtering)
• LEMMA is useful for reducing the lexical space (easier searches)
• The task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!

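The two uses named on this slide, PoS filtering on the MSD and working with lemmas instead of word-forms, can be sketched as follows. This is a minimal Python illustration (totale itself is a Perl program); the function names and the assumption that the tabular output is whitespace-separated with columns in the order TOKEN, TYPE, LEMMA, MSD are mine, and only a few rows of the example are included.

```python
# Hypothetical sketch: rows copied from the example slide, assumed to be
# whitespace-separated TOKEN TYPE LEMMA MSD (punctuation rows lack LEMMA and MSD).
SAMPLE = """\
Where TOK where Cs
an TOK a Di
exporter TOK exporter Ncns
has TOK have Vaip3s
declared TOK declare Vmps
goods TOK good Ncnp
systems TOK system Ncnp
, PUN
"""

COLUMNS = ("token", "type", "lemma", "msd")


def parse_rows(text):
    """Turn tabular output into a list of dicts, padding missing columns with None."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        fields += [None] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def lemmas_with_category(rows, category):
    """PoS filtering: keep lemmas whose MSD starts with the given category letter."""
    return [r["lemma"] for r in rows if r["msd"] and r["msd"].startswith(category)]


rows = parse_rows(SAMPLE)
noun_lemmas = lemmas_with_category(rows, "N")  # first MSD letter N = noun
print(noun_lemmas)
```

Searching over `noun_lemmas` rather than the surface tokens means that "goods" and "good" fall together, which is the reduction of the lexical space the slide refers to.
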
Nagging doubts
• Normalization loses information
• Annotation introduces errors and bias
• Evaluation for IE is inconclusive
• Unsupervised methods!
Still…

Wanted
A tool that would take text in any language and
• tokenise,
• PoS tag and
• lemmatise it.
It should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models (and work under Linux!)

What is out there
• Component software: tokenisers, taggers, (stemmers)
• FS/RE environments: INTEX, CLARK
• Various LT workbenches, the best known being GATE
• Alas: Java, time investment, history

Linguistic annotation with totale
• Multilingual tokenisation, tagging and lemmatisation
• Perl program with a simple pipeline architecture
• Input is plain UTF-8 text
• Output is a list of annotated tokens
• Several output formats (tabular, XML)

Example use
$ totale -l en
Doctor, can you help?
^D
<TEXT>
Doctor  TOK       doctor  Ncfs
,       PUN
can     TOK       can     Voip
you     TOK       you     Pp2
help    TOK       help    Vmn
?       PUN_TERM
<S/>
</TEXT>

Totale building blocks
[Diagram: boxes labelled Perl, mlToken, TnT and CLOG, and three boxes labelled "Multilingual resources"]

Tokenisation in totale
• Perl module mlToken.pm (Camelia Ignat, JRC)
• Multilingual, with resource files for supported languages (also default rules)
• Splits text into tokens, marks token type
• Marks paragraph and sentence boundaries
• Modelled on mtSeg

Tagging in totale
• Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
• Used the tri-gram tagger TnT
• TnT is trainable and fast, has an unknown-word guessing module, and can accommodate the large morphosyntactic tagsets of various EU languages
• Uses (and induces from an annotated corpus) a lexicon with ambiguity classes and a tri-gram file

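What "induces from an annotated corpus" amounts to can be illustrated with a small counting sketch. The Python below is not TnT's algorithm or file format; it only shows, under my own naming and with invented toy sentences, how a lexicon with ambiguity classes (the set of tags seen for each word) and tag tri-gram counts fall out of a tagged corpus.

```python
from collections import Counter, defaultdict


def induce_model(sentences):
    """Count word/tag pairs and tag tri-grams in a tagged corpus.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (lexicon, trigrams), where lexicon maps word -> Counter of tags.
    """
    lexicon = defaultdict(Counter)
    trigrams = Counter()
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            lexicon[word][tag] += 1
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams


def ambiguity_class(lexicon, word):
    """All tags observed for a word, in sorted order."""
    return sorted(lexicon[word])


# Toy data, invented for illustration (tags borrowed from the example slides).
corpus = [
    [("Doctor", "Ncfs"), ("can", "Voip"), ("you", "Pp2"), ("help", "Vmn")],
    [("a", "Di"), ("can", "Ncns"), ("has", "Vaip3s")],
]
lexicon, trigrams = induce_model(corpus)
print(ambiguity_class(lexicon, "can"))
```

A tagger then has to choose, for each occurrence of an ambiguous word such as "can", the tag from its ambiguity class that best fits the surrounding tag sequence.
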
Lemmatisation in totale
• Used CLOG, which learns first-order decision lists (+ list of exceptions)
• Learns lemmatisation rules for each MSD
• CLOG produces Prolog programs, but these are converted into Perl
Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.

Example CLOG rule
    # Lemmatisation rule for one MSD (Afcfda, judging by the sub name).
    # The first matching suffix pattern rewrites the ending; "???" means no rule matched.
    sub SUB_afcfda {
        my $w = $_[0];
        my $lem;
        if    ($w =~ /^(.*)svetlejši$/) { $lem = $1 . "svetel" }
        elsif ($w =~ /^(.*)polnejši$/)  { $lem = $1 . "poln" }
        elsif ($w =~ /^(.*)bši$/)       { $lem = $1 . "b" }
        elsif ($w =~ /^(.*)elejši$/)    { $lem = $1 . "el" }
        elsif ($w =~ /^(.*)ivejši$/)    { $lem = $1 . "iv" }
        elsif ($w =~ /^(.*)anejši$/)    { $lem = $1 . "an" }
        elsif ($w =~ /^(.*)kejši$/)     { $lem = $1 . "ek" }
        elsif ($w =~ /^(.*)tejši$/)     { $lem = $1 . "t" }
        elsif ($w =~ /^(.*)ižji$/)      { $lem = $1 . "izek" }
        elsif ($w =~ /^(.*)enejši$/)    { $lem = $1 . "en" }
        elsif ($w =~ /^(.*)rejši$/)     { $lem = $1 . "er" }
        elsif ($w =~ /^(.*)nejši$/)     { $lem = $1 . "en" }
        else                            { $lem = "???" }
        return $lem;
    }

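The rule on this slide is an ordered decision list: patterns are tried top to bottom and the first matching word ending is rewritten. The same logic can be written as a table-driven sketch in Python. The table and function names are hypothetical, and this is a re-expression of the slide's rule for illustration, not code produced by CLOG.

```python
# Suffix rewrites in the same order as the slide's rule ("š" and "ž" are the
# Slovene letters in these endings). Order matters: specific endings come first.
RULES_AFCFDA = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]


def lemmatise(word, rules):
    """Apply the first rule whose suffix matches; return "???" if none does."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return "???"


print(lemmatise("svetlejši", RULES_AFCFDA))
print(lemmatise("nižji", RULES_AFCFDA))
```

The ordering is what makes the exceptions work: "polnejši" is caught by its own entry and becomes "poln", whereas the general "nejši" rule further down would have produced "polen".
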
Training totale with MULTEXT-East resources
• Learning totale tagging and lemmatisation models
• MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
• Covers mainly Central and Eastern European languages
• Freely available for research use from http://nl.ijs.si/ME/V3/
• Used the MSD-tagged “1984” corpus (100k words) for tagger training
• Used the MSD lexica (15k lemmas) for lemmatiser training

Currently supported languages
• English
• Slovene
• Czech
• Romanian
• Serbian
• Estonian
• Hungarian

Processing JRC’s ACQUIS-sl with totale
• sl.tar.gz 03-Sep-2005 03:51 34.4M
• sl/slcelex_*.xml = 144M, 7772 files
• Wrapper Perl program, for each file:
  • extract text (all <P>s except first)
  • | totale -l sl -f XML |
  • substitute contents of original <P>s with annotated ones
  • validate against DTD
• 72 hrs on asterix, but the 10 s startup time per file alone accounts for 7772 × 10 s = 77,720 s ≈ 21.6 hrs

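The startup-overhead figure on this slide is simple arithmetic, worked through here in Python. The file count, per-file startup time and total run time are taken from the slide; the variable names and the derived share of the run are my own illustration.

```python
FILES = 7772        # number of sl/slcelex_*.xml files (from the slide)
STARTUP_S = 10      # totale startup time per invocation, in seconds (from the slide)
TOTAL_HOURS = 72    # total processing time reported on the slide

startup_seconds = FILES * STARTUP_S          # one totale start per file
startup_hours = startup_seconds / 3600
startup_share = startup_hours / TOTAL_HOURS  # fraction of the run spent starting up

print(startup_seconds, round(startup_hours, 1), round(startup_share, 2))
```

Under these figures, starting totale once per file accounts for a bit under a third of the total run time.
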
The problem of titles
• Dual role of titles: as text and as name of the document
• Should they contain P at all?
• Many titles are untranslated; an experiment with TextCat gave:
  4,964  sl
  1,663  en  “Ni na razpolago v slovenskem jeziku” (“Not available in Slovene”)
  1,074  en
     59  sl or en
     12  en or sl
• Also cases like “ODLOCBA t. 1346/2001/ES …”
• So the titles were not processed

Quantitative results: elements

Lexical analysis
Extracted the MULTEXT lexicon from the corpus:
…
  8  rafinacija    rafinacija    Ncfsn
  2  rafinacije    rafinacija    Ncfpa
 40  rafinacije    rafinacija    Ncfsg
  2  rafinacije15  rafinacije15  Mc---d
 26  rafinaciji    rafinacij     Npmpn
  9  rafinaciji    rafinacija    Ncfsl
 17  rafinacijo    rafinacija    Ncfsa
…
Number of lexical entries: 381,068
Different word-forms: 221,876
Different lemmas: 154,241
Different MSDs: 970

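Extracting such a lexicon from an annotated corpus amounts to counting distinct (word-form, lemma, MSD) triples, and the four totals are then the sizes of the corresponding sets. The Python sketch below shows this under my own naming; the toy corpus reuses triples from the slide, but its frequencies are invented and are not the slide's counts.

```python
from collections import Counter


def build_lexicon(tokens):
    """Count each (word_form, lemma, msd) triple seen in the annotated corpus."""
    return Counter(tokens)


def lexicon_stats(lexicon):
    """Sizes of the lexicon and of its word-form, lemma and MSD inventories."""
    return {
        "entries": len(lexicon),
        "word_forms": len({w for w, _, _ in lexicon}),
        "lemmas": len({l for _, l, _ in lexicon}),
        "msds": len({m for _, _, m in lexicon}),
    }


# Toy corpus: triples taken from the slide, frequencies invented.
corpus = (
    [("rafinacije", "rafinacija", "Ncfsg")] * 3
    + [("rafinacije", "rafinacija", "Ncfpa")] * 2
    + [("rafinaciji", "rafinacija", "Ncfsl")]
    + [("rafinaciji", "rafinacij", "Npmpn")]
)
lexicon = build_lexicon(corpus)
stats = lexicon_stats(lexicon)
print(stats)
```

Because the triples come from automatic annotation, tagging and lemmatisation errors (such as the lemma "rafinacij" above) inflate the number of entries and lemmas.
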
Some problems
• Complex tokenisation: over 15% “weird” words:
  priloge.opomba     priloge.opomba     Ncfsn
  who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
  zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi
• Weak tagging model (likes verbs!):
  3  anion   anion     Ncmsa--n
  4  anion   anion     Ncmsn
  1  anion   anion     Npmsn
  3  anion   anion     Vmp--smp
  6  aniona  anion     Ncmsg
  8  anione  anion     Ncmpa
  1  anioni  anioen    Afpmsny
  1  anioni  anion     Ncmpn
  1  anioni  anioni    Vmp--pmp
  1  anioni  anioniti  Vmip3s--n

Conclusions
• Presented processing with totale on ACQUIS-sl and a quick evaluation
• Further work:
  • methodology of semi-manual annotation (model tweaking)
  • “lexical priming” in totale
• Translations and collocates