CKL: Center for Computational Linguistics • Project MŠMT LC536 (LC05) • Univerzita Karlova v Praze, ÚFAL MFF • Západočeská univerzita Plzeň, KKY FAV • Masarykova Univerzita Brno, FI • Ústav pro jazyk český AV ČR Praha • http://www.centrumkomputacnilingvistiky.cz
Center’s Advisory Board Meeting • 31.1.2011 • MFF UK, Malostranské nám. 25, Room S1, 4th floor • 10:00 Introduction to the Center, history, results (Jan Hajic) • 10:25 Charles University research and results (Jan Hajic) • 10:40 Break • 11:00 Institute for Czech Language research and results (Karel Oliva) • 11:15 Masaryk University research and results (Karel Pala) • 11:30 University of West Bohemia research and results (Pavel Ircing)
The Center • Goals: • Research in all areas of computational linguistics and speech • Close cooperation in speech and language • Create annotated data • Algorithms and SW tools for NL analysis and generation • Create and integrate lexical resources
History of the Center • Former Center for Computational Linguistics (program MŠMT LN) • 2000-2004 • UK, ÚJČ, ZČU: fundamental research type (B) • Now: Center for Computational Linguistics • (again) fundamental research, MŠMT LC • Masaryk University in Brno added, now 4 sites
The Center: some figures • Budget and timeframe • 2.9 mil. €, 2005-2009[-2011] (6 yrs + 9 mos) • Staffing (2010): • 1 PI (professor) • 7 Co-PIs and key persons (full/assoc. prof.) • 11 postdocs (Ph.D.) • 9 of them graduated with CKL support • 24 graduate students • Reduced to about 2/3 for 2011
The sites (1) • UK Praha (ÚFAL MFF / Charles University) • Formal language theory and algorithms • SW tools for NLU / NLG • Raw, Annotated data (incl. parallel) • ZČU Plzeň, KKY FAV (University of West Bohemia in Pilsen) • Speech recognition and TTS • Data collection and annotation
The sites (2) • MU Brno, FI, NLP lab (Masaryk University) • Lexical issues • Lexical databases, incl. SW • ÚJČ AV ČR (Institute of the Czech Language, Academy of Sciences of the CR) • Digitization of historical data • Lexical databases
2005 • Start of work, after some “gap” • Apr. 1, 2005 – three months vacuum • [Got back the name…] • Reduced budget for 2005 (300k €) • Durable equipment / future computing cluster • Cooperation: • EU grant proposals • continuing work on Malach (U.S.) • Start of the PIRE NSF project (JHU, Brown Univ.)
2006 • First full year • Prague Dependency Treebank v2.0 finished (published at LDC) • Speech reconstruction project (UK, specification with PIRE/JHU) • Lexical issues (UK, MU, ÚJČ) • Speech (ASR, TTS - ZČU) • IR – CLEF test collection, CLEF shared task, 1st part • Digitization of historical material (ÚJČ) • Start of EU Integrated project „Companions“: UK, ZČU • More international cooperation: EU, USA (JHU, Brown, Univ. of Pennsylvania) • Organization of Treebanks and Linguistics Theories, Dec. 2006 (UK) • 40 „results” in the government database („RIV”)
2007 • Mid-project • Lexical resources, new Czech language lexical database (MU+ÚJČ) • Added more students for English work, translation • English annotation specification, annotation (ZČU, UK) • Integration of ASR and TTS with NLU/NLG (UK, ZČU) • In the “Companions” project • SW tools for analysis and generation • Speech, language (UK, MU, ZČU) • International collaboration • EU (3 projects 6th FP: UK, UK+ZČU), USA (UK, UK+ZČU) • Local organisation of ACL 2007 and EMNLP 2007 • Still (2011) holds record in attendance (~1100 participants) • 66 results in “RIV” (16 journals, 39 in-proc., 5 SW/data etc.)
2008 • Slightly modified goals (stress on MT) • Lexical resources (MU, UK, ÚJČ) • SW tools • Semantics • detection of plagiarism (MU) • NLU (UK, MU), NLG (UK) • New algorithms for ASR • Prosody, language modeling, speech reconstruction • Data acquisition, annotation, corpus tools • Research (incl. data annotation) for machine translation • The TectoMT SW and data platform • Theoretical formal linguistics, language usage • Results (RIV): 64: 13 journal art., 32 in-proc., 5 books, 5 SW tools/data resources etc.
2009 • Should have been the last year of CKL… • Application for extension for 2010-11 • Granted for 2010 • Research: English data, MT, ASR, Dialog • Work on the parallel Czech-English treebank (PTB) • Companions project: integration work • Tight cooperation between UK and ZČU • PIRE project – workshops, students from US at UK • Euromatrix EU project on MT extended (-2012) • Organization of the CoNLL 2009 shared task • Organization of session at FET 2009 (EU conference) • Results: 62, journals: 8, in-proc.: 42, 3 books etc.
2010 • Last fully-funded year: ext. to 2011 granted in Nov. • Continuation of research along the same lines • Wrap-up in data annotation: PCEDT, PDTSx • Departures of people due to uncertainty • International cooperation: • Companions project finished (Nov. 2010) • PIRE continuing towards 2011, EuromatrixPlus renewed (UK) • New projects in 2010: • Univ. of Pennsylvania – discourse representation, annotation (UK) • Khresmoi (EU IP) – medical IR and IE, UK • Faust (STREP, machine translation, UK) • META-NET network of excellence in MT / data sharing • Chairing the ACL 2010 conference (Uppsala, Sweden) • Results (prelim.): ~60 (12 journal articles, ~40 in-proc.)
Quantitative Summary of Results • RIV 2005-2009 (2010 pending) • 274 records (+ ~ 60 in 2010) • Mostly papers in proceedings of conferences and workshops • ACL, EACL, NAACL, Coling, CoNLL; workshops • > 95% international, > 85% abroad • Some journal articles • LNCS, IEEE Transactions, LRE, Czech ling. Journals (PBML, SaS – now in WoS) • Software and data • Mostly „open source“; training, shared task (evaluation)
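As a quick consistency check on the figures above, the per-year RIV counts quoted on the yearly slides (2006: 40, 2007: 66, 2008: 64, 2009: 62) can be compared with the 274 records reported for 2005-2009. The sketch below is only arithmetic on those quoted numbers. It assumes the yearly counts and the total were tallied the same way. The remainder is an implied figure, not one reported in the slides, and it may reflect 2005 results or later corrections to the database.

```python
# Per-year RIV counts as quoted on the yearly slides (no figure is given for 2005).
reported = {2006: 40, 2007: 66, 2008: 64, 2009: 62}

# Total for 2005-2009 as quoted on the summary slide.
TOTAL_2005_2009 = 274


def tally(per_year, total):
    """Return (sum of the per-year counts, records not covered by them)."""
    covered = sum(per_year.values())
    return covered, total - covered


covered, remainder = tally(reported, TOTAL_2005_2009)
print(f"covered by yearly slides: {covered}, not itemized: {remainder}")
```

The remainder is whatever the yearly slides do not itemize; it should not be read as an official 2005 count.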
Most valued publications • Papers • Semi-supervised POS tagging (EACL 2009) • Best results in POS tagging so far, incl. English • Now taggers available in 5 languages • Extension of HVS Semantic Parser by Allowing Left-Right Branching (ICASSP 2008) • New result, drawing from S. Young’s work • Large-scale Semantic Networks: Annotation and Evaluation • NAACL 2009; in cooperation with Google Research (Zurich, K. Hall) • CoNLL 2009 Shared Task, CoNLL 2009 • Overall task and system description • Book • Valenční slovník českých sloves (Valency Lexicon of Czech Verbs, Karolinum Press) • Electronic version available
Most valued data • Corpora (language databases, publicly available) • Prague Dependency Treebank 2.0, Linguistic Data Consortium 2006 • Prague Czech-English Dependency Treebank, to appear in 2011 • Penn Treebank & translation to Czech, with PDT-style semantic annotation • Czech Wordnet 1.0 (ELRA, 2008) • Sign Language, Audiovisual (ELRA, 2008) • Test / shared task collections • CLEF 2006, 2007 • Multilingual cross-language search competitions • Machine Translation Open Competition – EuroMatrix/Plus 2006-10 • Czech-English, German, French, Italian, Hungarian, Spanish • CoNLL Shared Task 2007, 2009 • Dep. parsing, semantic role labeling (unified for 7 languages)
Most valued SW tools • Software • Corpus manager (client/server) Bonito/Manatee • Worldwide use: ČNK, SNK; Hu, Hr, GB • Word Sketch Engine • Commercial use (Lexical Computing) • ComPOST • State-of-the-art POS tagger (Cz, En, Dutch, Swedish, Icelandic) • Syntactic dependency parser „MST“ (Czech) • With Univ. of Pennsylvania • Improved Czech ASR and Emotional TTS • Used in the Companions project • NLG and Dialogue Manager w/knowledge base • Also for the Companions project • The TectoMT SW and data handling platform • MT, dialogue systems (now any NLU/NLG processing -> “Treex”)
The Center provided… • Material benefits • 3/4 of budget: personnel (mainly graduate students) • Generous travel money • Small equipment • Durable equipment – clusters (30-200 CPUs) • Only in 2005/6 – need for renewal • Small indirect costs (< 12%, contribution of inst.) • “intangible” benefits • (Sub)teams, even across institutions, flexible assignment of people to projects, • dissertations, one assoc. professor promotion
The Center had to work under certain “restrictions” • Employment of graduate students, postdocs, supervision of graduate students • Now at all four sites (2009: 10/4/9/1) • Requirement: at least one site… → Check • Requirement: Participation of students (Bc./Mgr./Ph.D.) • Total: 41 students → Check • 7 nationalities • Students - after graduation - went to (e.g.)… • Petr Němec (UK): TextKernel, Hol.; Kiril Ribarov (UK): ČEZ • Jan Romportl, Aleš Pražák: SpeechTech (spinoff, ZČU) • Vladimír Kadlec (MU Brno): Acision (GB) • Petr Pajas (UK): Google (Zurich) • Václav Novák (UK): Ministry of Interior, then a small startup • Former CKL (LN, 00-04): M. Čmejrek, J. Cuřín (UK): IBM Research (Yorktown, Prague)
“Restrictions” (cont’d) • Requirement: integration to EU “research space” • 9 EU projects, 6th and 7th FP • All types: IP, STREP, NoE; SSA, Dig. Libraries • Companions (IP) - ZČU, UK; • Khresmoi (IP) - UK • EuroMatrix, EuroMatrixPlus, Faust (STREP) - UK • Flarenet, META-NET (NoE) - UK • Clarin (SSA) - UK, MU, ÚJČ; • KYOTO (Dig. Libraries) - MU • USA • Malach (till 2007; UK, ZČU): USC, JHU, IBM, UMD • PIRE: speech recognition and machine translation (UK, indirectly ZČU): JHU, Brown Univ. • Discourse: Univ. of Pennsylvania • Treebanking: Univ. of Colorado → Check
EU Project „Companions“ • Goal • Intelligent conversational companion • Over photographs (Cz), „how was your day“ (En) • Technologies • ASR, emotional TTS • Natural language understanding, NL generation • Naturalness of dialogue: „user studies“ / „evaluation“ • CKL • UK/ZČU: ASR, TTS, NLU, NLG, Dialogue management
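Semantic annotation (UK) • Example sentence: „Některé kontury problému se však po oživení Havlovým projevem zdají být jasnější.“ (“However, after being revived by Havel’s speech, some contours of the problem seem to be clearer.”)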
PDT 2.0: Annotation layers • „Byl by šel do lesa“ (“he would have gone to the forest”) • Linked layers of annotation • Stand-off annotation • Scheme (Relax NG) • z-layer
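The idea of linked, stand-off layers can be sketched in a few lines. The sketch below is not the PDT data format: the class name, the ID scheme (`w1`, `m1`, `a1`, `t1`) and the `refs` field are hypothetical, the lemmas are simplified, and only the w-, m-, a- and t-layers are modelled. It assumes, as a simplification, that each unit stores only references to units on the layer below, so that the surface words live in exactly one place and a deep-layer node such as the verb can cover several surface tokens ("Byl by šel").

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One annotation unit; `refs` holds IDs of units on the layer below."""
    uid: str
    label: str
    refs: list = field(default_factory=list)


# Layers ordered from deepest to surface; names follow the slide's terminology.
ORDER = ["t", "a", "m", "w"]


def build_layers():
    """Build a toy stand-off annotation of 'Byl by šel do lesa'."""
    words = ["Byl", "by", "šel", "do", "lesa"]
    lemmas = ["být", "být", "jít", "do", "les"]  # simplified lemmas
    w = {f"w{i}": Unit(f"w{i}", form) for i, form in enumerate(words, 1)}
    m = {f"m{i}": Unit(f"m{i}", lem, [f"w{i}"]) for i, lem in enumerate(lemmas, 1)}
    a = {f"a{i}": Unit(f"a{i}", words[i - 1], [f"m{i}"]) for i in range(1, 6)}
    # On the deep layer, auxiliaries and the preposition have no node of their own;
    # they are only referenced from the content-word node they belong to.
    t = {
        "t1": Unit("t1", "jít", ["a1", "a2", "a3"]),
        "t2": Unit("t2", "les", ["a4", "a5"]),
    }
    return {"w": w, "m": m, "a": a, "t": t}


def surface_forms(layers, layer_name, uid):
    """Follow references down to the w-layer and return the covered word forms."""
    unit = layers[layer_name][uid]
    if layer_name == "w":
        return [unit.label]
    below = ORDER[ORDER.index(layer_name) + 1]
    forms = []
    for ref in unit.refs:
        forms.extend(surface_forms(layers, below, ref))
    return forms


layers = build_layers()
print(surface_forms(layers, "t", "t1"))
print(surface_forms(layers, "t", "t2"))
```

Keeping references one-directional (each layer points only downward) is what makes the annotation stand-off: a higher layer can be added or revised without touching the layers beneath it.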
Speech reconstruction (UK, ZČU) • Goal: „Translation“ of a spoken utterance into a standard written sentence • Spoken: SEM NEMOH SEM TO JIM DÁT TEN VOBRAZ (‘m couldn’t ‘m that them give the paintin’) • Reconstructed: Ten obraz jsem jim nemohl dát. (I could not give them the painting.) • Annotation • Generation (?)
Speech Reconstruction Annotation • Edited transcript • All changes allowed • Manual annotation • Large data • Malach data • Companions proj. dialogues (> 100h)
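To make “all changes allowed” concrete, the sketch below compares the spoken transcript from the speech reconstruction example with its edited version, token by token. It is only an illustration, not the annotation tooling used in the project: the function names are hypothetical, and the comparison is a plain bag-of-tokens difference that ignores word order, which is itself one of the things the editor is free to change.

```python
from collections import Counter

# Example pair taken from the speech reconstruction slide.
SPOKEN = "SEM NEMOH SEM TO JIM DÁT TEN VOBRAZ"
RECONSTRUCTED = "Ten obraz jsem jim nemohl dát."


def tokens(text):
    """Lowercase, split on whitespace and strip sentence punctuation."""
    return [t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")]


def compare(spoken, reconstructed):
    """Return tokens kept, dropped from the transcript, and newly introduced."""
    s, r = Counter(tokens(spoken)), Counter(tokens(reconstructed))
    kept = s & r     # tokens present in both versions
    dropped = s - r  # fillers, repetitions, non-standard forms
    added = r - s    # standard forms supplied by the editor
    return kept, dropped, added


kept, dropped, added = compare(SPOKEN, RECONSTRUCTED)
print("kept:   ", sorted(kept.elements()))
print("dropped:", sorted(dropped.elements()))
print("added:  ", sorted(added.elements()))
```

Only three of the eight spoken tokens survive unchanged in this example, which is why the slide frames reconstruction as a kind of translation rather than light clean-up.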