
Romanian Language Technology and Resources go to Europe



Presentation Transcript


  1. Romanian Language Technology and Resources go to Europe Dan Cristea Alexandru Ioan Cuza University of Iasi Romanian Academy – Institute of Computer Science dcristea@info.uaic.ro

  2. Romanian language • According to Ethnologue – Languages of the World (SIL) • Spoken in: Romania (22 million), Moldavia (2.7 million), Serbia, Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc. • Native speakers: 24 million, plus 4 million as a second language • Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian) • Linguistic lineage: Indo-European > Italic > Romance > Eastern • Dialects: Istro-Romanian (Croatia), Macedo-Romanian (Greece), Megleno-Romanian (Greece) • Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rhaeto-Romance, 71% with Spanish • Other influences: Slavic, Hungarian, Turkish, etc. LT Days, Luxembourg, 14-15 Jan, 2009

  3. Roots of NLP in Romania • Since 1900: linguistics & lexicography research (in the Academy and the universities) • 1960: early trials of Machine Translation; after that – no financing for more than 45 years • 1980s: first NLP models and systems • semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc. • Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)

  4. Schools of Computational Linguistics and NLP • Master level: • Iasi (UAIC-FII, since 2001), University of Bucharest • PhD level: • Bucharest (RACAI), Iasi (UAIC-FII) • 6 PhD theses will be defended this year • Summer schools, international and national conferences • EUROLAN, since 1993, second in significance in Europe (after ESSLLI) • SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences • ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language • Alumni: • >30 PhDs and PhD students doing LT all over the world

  5. Romanian NLP research roadmap • Bucharest • Romanian Academy, RACAI (acad. Dan Tufis) • 10 researchers (3 PhDs): Romanian resources, language-independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services • Romanian Academy, Institute of Linguistics (acad. Marius Sala) • lexicography, corpora of old Romanian texts • University of Bucharest • formal models, resources • Technical University of Bucharest & Military Academy • speech processing (prof. Corneliu Burileanu, prof. Olteanu)

  6. Romanian NLP research roadmap • Iasi • Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group) • 8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects • resources, language-independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces • Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu) • 4 PhDs, 8 researchers • speech processing and resource building, tools and annotated resources in written language processing • Romanian Academy, Institute of Philology • lexicography, old manuscripts (including in old Cyrillic)

  7. Progressing through competition • Word Alignment (Ro-En): • RACAI 2003, 2005: ranked first • Question Answering (CLEF - Ro, En): • RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2 • UAIC 2008: Ro-Ro 1/2 • Answer Validation Exercise (CLEF - En) • UAIC 2007: 1/7, 2008: 1/7 • Anaphora Resolution Exercise (En): • UAIC 2007: ranked first • Textual Entailment (En): • UAIC 2007: 2-way task – 3/26, 3-way task – 4/10 • UAIC 2008: 2-way task – 2/26, 3-way task – 1/13

  8. Tools (some as Web services) • Morphological and POS tagger (En/Ro) • Lemmatizer (En/Ro) • Dependency Linker (En/Ro) • Sentence splitting (En/Ro) • Spell checker (Ro) • Word aligner (En-Ro) • Anaphora resolver (En/Ro) • Discourse parser (En/Ro) • Summarisation (En/Ro) • Q&A (En/Ro) • SMT (En-Ro-En, En-Gr-En, En-Sl-En) • Definitions extractor (En/Ro) • Information Retrieval (Ro Wikipedia)

  9. Resources • Ro WordNet aligned with Princeton En WN (ILI) • the second largest in the world (55,000 synsets) • Mono and multilingual corpora • various RO classical novels (about 3,000,000 words) • richest annotation: Orwell’s “1984” (110,000 words) • tagged, lemmatized, chunked, word-aligned (XCES) • Semcor (En, Ro): 1,000,000 words • Ev. Zilei (En, Ro): 1,000,000 words • Acquis Communautaire (22 languages), Ro: 30,832,212 words • Wikipedia-Ro (fragment): 3,405,324 words • Dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR) • Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.

  10. Participation in projects • European past: • ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (Copernicus), FF-POIROT (FP5), BalkaNet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro, www.info.uaic.ro/~dcristea) • European active: • CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions) • FLaReNet: Nicoletta’s speech • ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling

  11. National support • Language Technology and preservation of national heritage – national priorities in the Ro research plan • Massive financing over the last 2 years (compared to previous years)…

  12. Institute of Cultural Memory (since 1978) • Under the Ministry of Culture and Arts (dir. Dan Matei) • Digitisation of the Ro literature

  13. The thesaurus Dictionary of Romanian – since 1913

  14. STAR: statistical & phrase-based translation @ RACAI • A follow-up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr) • Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl

  15. The CLARIN challenge: help HSS to use LT tools and resources • ALPE: a model of anchoring specifications of NLP applications on XML annotation schemas (standards) • build a pipeline/parallel architecture without any need to program • just input your own file and indicate the form of the output • use the federation of tools as bricks for new applications • as in cooking: the more ingredients you have, the more recipes you can choose from
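The idea on slide 15 of stating only the input and the desired output form, and letting the system assemble the tools, can be illustrated with a small sketch. This is not the ALPE implementation: the tool registry, the layer names and the `plan_pipeline` function are all hypothetical, and the slide does not say how ALPE searches for a chain. The sketch assumes each tool needs some annotation layers and adds one, and uses a breadth-first search to find a shortest chain.

```python
from collections import deque

# Hypothetical tool registry (an assumption for illustration only):
# tool name -> (annotation layers it requires, annotation layer it adds).
TOOLS = {
    "sentence_splitter": (frozenset(), "sentences"),
    "tokenizer": (frozenset({"sentences"}), "tokens"),
    "pos_tagger": (frozenset({"tokens"}), "pos"),
    "lemmatizer": (frozenset({"tokens", "pos"}), "lemmas"),
    "np_chunker": (frozenset({"pos"}), "chunks"),
}


def plan_pipeline(have, want, tools=TOOLS):
    """Return a shortest list of tool names that turns a file annotated
    with the layers in `have` into one containing the layers in `want`,
    or None if the available tools cannot produce them."""
    start = frozenset(have)
    goal = frozenset(want)
    queue = deque([(start, [])])  # (layers present so far, tools applied)
    seen = {start}
    while queue:
        layers, plan = queue.popleft()
        if goal <= layers:
            return plan
        for name in sorted(tools):  # sorted for a deterministic result
            needs, adds = tools[name]
            if needs <= layers and adds not in layers:
                nxt = layers | {adds}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None


if __name__ == "__main__":
    # Raw text in, lemmatized text out: the user names only the two ends.
    print(plan_pipeline([], ["lemmas"]))
```

Adding a tool to the registry enlarges the set of reachable outputs without changing the planner, which is the point of the cooking analogy on the slide.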

  16. What to do until standards are approved? • Explosion of formats → difficulty of standardisation • Standards are like laws: they help to organise society, but they also reduce freedom • Standards usually come late • We are in a hurry to do things instantly • Invent heuristics able to guess the semantics of new formats • ‘Compute’ wrappers to transform non-standard input into standard

  17. Thank you!
