Romanian Language Technology and Resources go to Europe
Dan Cristea
Alexandru Ioan Cuza University of Iasi
Romanian Academy – Institute of Computer Science
dcristea@info.uaic.ro
Romanian language
• According to Ethnologue – Languages of the World (SIL)
• Spoken in: Romania (22 million), Moldavia (2.7 million), Serbia and Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc.
• Native speakers: 24 million, plus 4 million as a second language
• Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian)
• Linguistic lineage: Indo-European > Italic > Romance > Eastern
• Dialects: Istro-Romanian (Croatia), Macedo-Romanian (Greece), Megleno-Romanian (Greece)
• Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rhaeto-Romance, 71% with Spanish
• Other influences: Slavic, Hungarian, Turkish, etc.
LT Days, Luxembourg, 14-15 Jan, 2009
Roots of NLP in Romania
• Since 1900: linguistics & lexicography research (in the Academy and the universities)
• 1960: early trials of Machine Translation; after that, no financing for more than 45 years
• 1980s: first NLP models and systems
  • semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc.
• Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)
Schools of Computational Linguistics and NLP
• Master level:
  • Iasi (UAIC-FII, since 2001), University of Bucharest
• PhD level:
  • Bucharest (RACAI), Iasi (UAIC-FII)
  • 6 PhD theses will be defended this year
• Summer schools, international and national conferences
  • EUROLAN, since 1993, the second most significant in Europe (after ESSLLI)
  • SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences
  • ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language
• Alumni:
  • >30 PhDs and PhD students doing LT all over the world
Romanian NLP research roadmap
• Bucharest
  • Romanian Academy, RACAI (acad. Dan Tufis)
    • 10 researchers (3 PhDs): Romanian resources, language independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services
  • Romanian Academy, Institute of Linguistics (acad. Marius Sala)
    • lexicography, corpora of old Romanian texts
  • University of Bucharest
    • formal models, resources
  • Technical University of Bucharest & Military Academy
    • speech processing (prof. Corneliu Burileanu, prof. Olteanu)
Romanian NLP research roadmap
• Iasi
  • Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group)
    • 8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects
    • resources, language independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces
  • Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu)
    • 4 PhDs, 8 researchers
    • speech processing and resource building, tools and annotated resources in written language processing
  • Romanian Academy, Institute of Philology
    • lexicography, old manuscripts (including in old Cyrillic)
Progressing through competition
• Word Alignment (Ro-En):
  • RACAI 2003, 2005: ranked first
• Question Answering (CLEF - Ro, En):
  • RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2
  • UAIC 2008: Ro-Ro 1/2
• Answer Validation Exercise (CLEF - En):
  • UAIC 2007: 1/7, 2008: 1/7
• Anaphora Resolution Exercise (En):
  • UAIC 2007: ranked first
• Textual Entailment (En):
  • UAIC 2007: 2-way task – 3/26, 3-way task – 4/10
  • UAIC 2008: 2-way task – 2/26, 3-way task – 1/13
Tools (some as Web services)
• Morphological and POS tagger (En/Ro)
• Lemmatizer (En/Ro)
• Dependency Linker (En/Ro)
• Sentence splitting (En/Ro)
• Spell checker (Ro)
• Word aligner (En-Ro)
• Anaphora resolver (En/Ro)
• Discourse parser (En/Ro)
• Summarisation (En/Ro)
• Q&A (En/Ro)
• SMT (En-Ro-En, En-Gr-En, En-Sl-En)
• Definitions extractor (En/Ro)
• Information Retrieval (Ro Wikipedia)
Resources
• Ro WordNet aligned with Princeton En WN (ILI)
  • the second largest in the world (55,000 synsets)
• Mono and multilingual corpora
  • various RO classical novels (about 3,000,000 words)
  • richest annotation: Orwell’s “1984” (110,000 words)
  • tagged, lemmatized, chunked, word-aligned (XCES):
    • Semcor (En, Ro): 1,000,000 words
    • Ev. Zilei (En, Ro): 1,000,000 words
    • Acquis Communautaire (22 languages), Ro: 30,832,212 words
    • Wikipedia-Ro (fragment): 3,405,324 words
• Dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR)
• Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.
Participation in projects
• European past:
  • ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (Copernicus), FF-POIROT (FP5), BalkaNet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro, www.info.uaic.ro/~dcristea)
• European active:
  • CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions)
  • FlareNet: Nicoletta’s speech
  • ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling
National support
• Language Technology and preservation of national heritage are national priorities in the Ro research plan
• Massive financing over the last 2 years (compared to previous years)…
Institute of Cultural Memory (since 1978)
• Under the Ministry of Culture and Arts (dir. Dan Matei)
• Digitisation of the Ro literature
The Thesaurus Dictionary of Romanian – since 1913
STAR: statistical & phrase-based translation @ RACAI
• A follow-up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr)
• Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl
The CLARIN challenge: help HSS to use LT tools and resources
• ALPE: a model of anchoring specifications of NLP applications on XML annotation schemas (standards)
  • build a pipeline/parallel architecture without any need to program
  • just input your own file and indicate the form of the output
  • use the federation of tools as bricks for new applications
  • as in cooking: the more ingredients you have, the more recipes you can choose from
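The idea of building a pipeline "without any need to program" can be illustrated with a small sketch. It assumes that every tool is described only by the annotation schema it reads and the one it writes, so that a pipeline is a path through a graph of schemas. The tool names, the schema names and the `plan_pipeline` function are hypothetical and are not taken from ALPE itself, which the slide describes only in outline.

```python
from collections import deque

# Hypothetical tool registry: (tool name, schema it reads, schema it writes).
# These entries are illustrative assumptions, not the actual ALPE inventory.
TOOLS = [
    ("sentence_splitter", "raw", "sentences"),
    ("tokenizer", "sentences", "tokens"),
    ("pos_tagger", "tokens", "pos"),
    ("lemmatizer", "pos", "lemmas"),
    ("np_chunker", "pos", "chunks"),
]


def plan_pipeline(source, target, tools=TOOLS):
    """Return the shortest list of tool names turning `source` into `target`.

    Breadth-first search over annotation schemas; returns None when no
    chain of registered tools connects the two schemas.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        schema, path = queue.popleft()
        for name, reads, writes in tools:
            if reads == schema and writes not in seen:
                new_path = path + [name]
                if writes == target:
                    return new_path
                seen.add(writes)
                queue.append((writes, new_path))
    return None


if __name__ == "__main__":
    # The user only states the input form and the desired output form.
    print(plan_pipeline("raw", "chunks"))
```

Adding a new tool to the registry adds an edge to the graph, so every new "brick" can only enlarge the set of pipelines that can be planned, which is the point of the cooking analogy.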
What to do until standards are approved?
• Explosion of formats → difficulty of standardisation
• Standards are like laws: they help to organise society, but they also reduce freedom
• Standards usually come late
• We are in a hurry to do things instantly
• Invent heuristics able to guess the semantics of new formats
• ‘Compute’ wrappers to transform non-standard input into standard
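One possible reading of "guessing the semantics of a new format" is sketched below. It assumes that annotations arrive as records with unknown field names, and that a field can be recognised by how much its values overlap with sample values of a known standard attribute. The sample sets, the threshold and the function names are assumptions made for illustration; the slide does not specify any particular heuristic.

```python
# Hypothetical sample values for each standard attribute (illustrative only).
STANDARD_SAMPLES = {
    "pos": {"NOUN", "VERB", "ADJ", "ADV", "DET", "PRON"},
    "lemma": {"be", "go", "house", "good", "the"},
}


def guess_mapping(records, samples=STANDARD_SAMPLES, threshold=0.5):
    """Guess which unknown field corresponds to which standard attribute.

    A field is mapped to the standard attribute whose sample values cover
    the largest share of the field's observed values, provided that share
    reaches `threshold`.
    """
    values = {}
    for record in records:
        for field, value in record.items():
            values.setdefault(field, set()).add(value)

    mapping = {}
    for field, seen in values.items():
        best, best_score = None, 0.0
        for standard, known in samples.items():
            score = len(seen & known) / len(seen)
            if score > best_score:
                best, best_score = standard, score
        if best is not None and best_score >= threshold:
            mapping[field] = best
    return mapping


def wrap(records, mapping):
    """The computed wrapper: rename recognised fields, keep the rest as-is."""
    return [{mapping.get(f, f): v for f, v in r.items()} for r in records]


if __name__ == "__main__":
    data = [
        {"cat": "NOUN", "base": "house", "w": "houses"},
        {"cat": "VERB", "base": "go", "w": "went"},
    ]
    mapping = guess_mapping(data)
    print(mapping)
    print(wrap(data, mapping))
```

The sketch does not resolve the case where two unknown fields are guessed to be the same standard attribute; a real wrapper generator would need a tie-breaking rule.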
Thank you!