Research on Human Language Technologies in Flanders Walter Daelemans (Ed.) CNTS Language Technology Group University of Antwerp walter.daelemans@ua.ac.be
Prehistory Computational Linguistics • Early isolated PhDs • Willy Martin (Leuven, 1970): Analysis of a vocabulary by means of a computer • Luc Steels (Antwerpen, 1977): Aspects of a Modular Theory of Language • Early International MT involvement (Leuven) • EUROTRA (early eighties until 1994), METAL (mid-eighties until 1992) • Dutch language pairs
Prehistory Computational Linguistics • Droste PhDs: F. G. Droste (1969) Translating with the computer, possibilities and problems. • Frank Van Eynde (Leuven, 1985): Meaning, translatability, and machine translation • Geert Adriaens (Leuven, 1986): Process Linguistics: The Theory and Practice of a Cognitive-Scientific Approach to Natural Language Understanding • Walter Daelemans (Leuven, 1987): Studies in Language Technology: an object-oriented model of Dutch morphophonology and its applications • First groups start late eighties, early nineties • Leuven CCL (Van Eynde) • Antwerp CNTS (Daelemans, Gillis)
Prehistory Speech Technology • Early PhDs • Jean-Pierre Martens (Gent, 1982). Quality degrading aspects of filtered speech. • Luc Vanhove (Gent, 1984). Study and improvement of the linear prediction vocoder. • Werner Verhelst (Brussels, 1985). Short-time cepstra and LPC analysis-synthesis of speech. • Dirk Van Compernolle (Stanford, 1985). Speech Processing Strategies for a Multichannel Cochlear Prosthesis. • First speech technology groups start up in the mid 80’s • Gent (ELIS): Marc Vanwormhoudt, Jean-Pierre Martens • Leuven (ESAT): Dirk Van Compernolle • Brussels (ETRO): Oscar Steenhaut, Werner Verhelst
Prehistory Speech Technology • First collaborations on national level • 1983-1989: First IWONL-IRSIA project on speech analysis / synthesis (with UGent, VUB, UCL, Bell Telephone, FPMS, ACEC, Philips, correlative systems) • 1987-1992: National stimulation program on artificial intelligence (with KULeuven, VUB, UCL, ULB, FPMS, UGent)
Start of an organized field • Research Initiative on Dutch speech and language technology (1993-1994) • Flemish ministry of science and technology • To “improve and strengthen the position of Dutch” • Approx. 1 million euro • Speech recognition research, corpora (CoGen, ANNO), pronunciation lexicon (Fonilex) • In preparation for a long-term research programme on speech and text translation to and from Dutch
Start of an organized field • Computational Linguistics in Flanders (CLIF), http://clif.esat.kuleuven.be • Sponsor: Flemish National Science Foundation • Scientific Research community • 1995-2010 (2 renewals), 12,500 euro per year • Goals • Strengthen the integration of fundamental research on language and speech technology in Flanders to establish multidisciplinary, fundamental and applied research of natural language and Dutch in particular • Facilitate research activities of the participating research groups, improving (re-)usability of data for spoken and written language • Positive effects • Acts as a de facto spokesman of the academic research community towards government, the Dutch Language Union, etc. • Fruitful environment for cooperation • All the main research groups are represented • International advisory board
Start of an organized field • Flanders Language Valley campus • Ypres, the centre of Europe • Literally arising around Lernout & Hauspie Speech Products • 1985 founded, 1995 NASDAQ, 2001 bankrupt • 125 million euro investment capacity in FLV Fund • 1995-2005 (liquidation) • “Favorable place of business for HLT companies” • CELE Research group (Kristiina Jokinen, Dirk Frimout) • University of Antwerp cooperation long term research on CAM Brain Machine • (turned out to be short term research)
From CGN to STEVIN • What happened to the “long-term research programme on speech and text translation to and from Dutch”? • CLIF recommendation: extend and valorize the “short term research projects” • Opportunities for cooperation with The Netherlands • CGN (Spoken Dutch Corpus) • 1998-2004 • 10 million words, linguistically annotated and linked to the signal • Tools, protocols, interface • From spontaneous to read speech • 5 million euro, 1/3 Flanders
From CGN to STEVIN • The position of Dutch in Language and Speech Technology (Bouma & Schuurman, 1998) • Conclusion: many weak spots and omissions in the available infrastructure • Advice: • Install Dutch-Flemish platform (coordinated by NTU) • Stimulate both fundamental and applied research • Set up an inter-university HLT education program in Flanders and reinforce the existing Dutch programs • Action plan for Dutch in language and speech technology: priorities for basic resources (Daelemans & Strik, 2002) • Prepared the contents and priorities for STEVIN. • STEVIN program (1/3 Flanders) • 2004-2011, 11.4 million euro
Others • Research within companies • Language and Computing • Nuance • … • Karel de Grote university college (Antwerp) • Readability, subtitling, … • Lessius university college (Leuven) • Terminology extraction and management, translation tools • Erasmus university college (Brussels) • Terminology, translation tools, corpora
Funding situation in Flanders • Flemish HLT research funding situation is reasonably good (except for basic research) • Basic research grants (hard to get) • FWO PhD and postdoc mandates • FWO research projects • (VNC Dutch-Flemish research projects) • University funding (Special Research Fund) • TOP, GOA, IUAP, … • Application-oriented research • IWT PhD projects • IWT SBO / GBOU / STWW projects • TETRA projects • European Research (Framework Programs)
Research situation in Flanders • The joint research groups cover a large part of the field of language and speech technology research • Speech recognition and speech synthesis • MT, QA, Information Extraction, Summarization, Information Retrieval, Ontology and Terminology Learning • Machine Learning / statistical methods, knowledge-based / linguistic methods, hybrid methods • Text analysis (from morphology via syntax to semantics and discourse) • Corpus development and annotation • Less well-represented • Text generation from meaning representations • (Spoken) dialogue systems • Multimodality
ELIS-DSSP Jean-Pierre Martens Electronics & Information Systems University of Ghent https://speech.elis.ugent.be/
ELIS-DSSP • Embedding • research group of dept. Electronics & Information Systems (ELIS) • Key dates • 1982: first PhD • 1986: first Flemish aid for speech impaired persons, working with speech synthesis • 1988: speech synthesis technology sold to L&H • 1997: creation of spin-off company (Technology & Integration) in domain of alternative communication with speech technology • Main research themes in speech technology • auditory model based speech and music analysis • acoustic and lexical modelling for ASR • speech segmentation and labelling as a pre-processing step in speech transcription systems • objective assessment of disordered speech
ELIS-DSSP • Software development • auditory model embedding AMPEX pitch extractor • monophonic melody extractor • real-time audio indexing system comprising the isolation of speech intervals, speaker turn detection, gender and speaker clustering • AUTONOMATA grapheme-to-phoneme conversion toolkit supporting the development of error recovery from baseline system • Tool for disordered speech intelligibility assessment • Resource development • CoGeN (Corpus Gesproken Nederlands): ELIS + ESAT • FONILEX (Phonetic Lexicon): together with UA, CCL • CGN (Spoken Dutch Corpus): project leader for Flanders • COST-278 multilingual broadcast news database
Main research results • improved LVCSR by means of data-driven pronunciation variation modeling (ACCENT: FWO) • real-time audio segmentation algorithm that came out first in a multilingual evaluation campaign (ATRANOS: IWT, COST278: EU) • improved spontaneous speech recognition by the proper treatment of disfluencies (ATRANOS: IWT) • reliable prediction of disordered speech intelligibility by means of phonological features (SPACE: IWT) • improved proper name recognition by means of a phonological feature model (SPACE: IWT) • improved proper name synthesis by means of special purpose g2p converters that can be trained on very little transcribed data using a g2p-p2p approach (AUTONOMATA: STEVIN) • improved LVCSR by means of a data-driven compound composition and decomposition strategy (NBEST: STEVIN)
ETRO-DSSP Werner Verhelst Electronics & Informatics University of Brussels http://www.etro.vub.ac.be/Research/DSSP/dssp.htm
Laboratory for Digital Speech and Audio Processing – ETRO-DSSP • Embedding • research group of dept. Electronics & Informatics, ETRO of the Vrije Universiteit Brussel • Key dates • 1985: first PhD • 1988-1991: collaboration with Institute for Perception Research, The Netherlands • 2004: member of Interdisciplinary Institute for Broadband Telecommunication (IBBT) • 2006: joint research group for audiovisual speech processing with Northwestern Polytechnic University, Xi’an, China, and start of FWO-WOG Audiovisuele systemen • Main research themes in speech technology • speech modification • speech enhancement • expressive speech analysis and synthesis • audiovisual speech analysis and synthesis
Main development work • Software development • system for automatic synchronization of studio dialogs with lip movements in video and film postproduction (IWOIB – EOS) • speech synthesis for feedback in reading tutor software (IWT – SPACE) • audiovisual text to speech synthesis system (Flemish and English) • sound management system for public address systems (IWT + ESAT + Televic)
Main development work • Resource development • Audiovisual recording studio • Database with multi-sensor speech recordings (EU-SAFIR with Thales and Voice Insight) • Database for Flemish unit selection TTS • Audiovisual database with emotional speech (new project)
Main research results • window and sampling effects in short-time cepstra of voiced speech (IWONL) • improved autocorrelative pitch detection with adaptive sign clipping (IWONL) • improved voicing source model for vocoders (VUB) • the WSOLA algorithm and its use for robust natural sounding time scaling (IWT) • perceptual speech and audio modeling with damped sinusoids (IRMUT – IWT with ESAT) • least squares theory and design of optimal noise shaping filters for speech and audio requantization (SMS4PA – IWT with ESAT and Televic) • first cross-database study for expressive speech classification (VIN - IBBT) • improved speech recognition in noisy environments with bone conducting microphones (SAFIR - FP6) • improved spelling and syllabification modes in text to speech synthesis (SPACE – IWT)
ESAT/PSI-Spraak Patrick Wambacq, Hugo Van hamme, Dirk Van Compernolle Electrical Engineering, Center for Processing Speech and Images University of Leuven http://www.esat.kuleuven.be/psi/spraak
ESAT/PSI-Spraak • Speech processing research at K.U.Leuven (Dept. Electrical Engineering, Center for Processing Speech and Images) since 1987 • Focus on speech recognition and its applications, using in-house developed large vocabulary continuous speech recognition system • 3 staff members: Dirk Van Compernolle, Hugo Van hamme, Patrick Wambacq, ≈ 10 researchers (some postdocs) • Extensive computing facilities • Cooperations at national and international level through research projects with both academia and industry, for fundamental and applied research • Current coordinator of CLIF research community
ESAT/PSI-Spraak: research themes • ASR novel architectures (episodic, hybrid, layered approaches) • ASR robustness (noise, spontaneous speech, speaker variability, …) • speech modeling and representation • computational models of human language acquisition • applications: CALL, clinical applications, indexing, subtitling, … • tools and corpora for ASR
ESAT/PSI-Spraak: FLaVoR • FLaVoR: “Flexible Large Vocabulary Recognition”: IWT funded, Oct. 2002 - Sept. 2006, with CNTS, University of Antwerp • Frustrated by the inflexibility of the traditional monolithic ASR architecture, we set out to • Incorporate Linguistic Knowledge Sources • That allow for efficient modeling of morphologically productive languages • That allow for modeling linguistic phenomena that are not well dealt with in a traditional left-to-right architecture • That allow for the modeling of both short and long term dependencies
ESAT/PSI-Spraak: FLaVoR • Through a Modular Recognizer Architecture • That assures a better reusability of components • That relies on a high degree of independence between acoustic and linguistic processing • That allows for a faster decoder and hence makes computational resources available for the more complex linguistic modeling
ESAT/PSI-Spraak: SPACE • SPACE: “SPeech Algorithms for Clinical and Educational applications”, IWT funded, Mar. 2005 - Feb. 2009, with ELIS-UGent, DSSP-VUB, ORTHO-KULeuven, COM-UAntwerp • Main goals: • evaluate user’s speech in educational and clinical applications • improve speech recognition and speech synthesis technologies to better support these applications: • provide accurate classification of uttered phonemes • provide corrective auditory feedback • particular focus points: children’s speech, disfluencies, mispronunciations, deviant speech (e.g. speech of the deaf, dysarthria) • demonstrate the benefits of speech technology based tools for these applications, with involvement of experienced user groups
ESAT/PSI-Spraak: SPRAAK • SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit: STEVIN project Dec. 2005 - June 2008, building on 15 years of in-house software development • make a state-of-the-art LVCSR system available to the research community (free for research purposes): • modular toolkit (plug&play) for research on speech recognition, allowing researchers to focus on one particular aspect only and forget about the rest, with access to deep internals of the system (using low-level API) • recognizer with simple interface, usable by non-experts through high-level API • http://www.spraak.org
Centrum voor Computerlinguïstiek Frank Van Eynde Faculty of Arts University of Leuven http://www.ccl.kuleuven.be/
Centrum voor Computerlinguïstiek • founded in 1991 at the Faculty of Arts of K.U.Leuven • building on the expertise that had been acquired in the 80s in the framework of the machine translation projects Eurotra and Metal • part of the research unit ‘Dutch, German and Computational Linguistics’ since 2005 • member of ELSNET since 1993 and of CLIF since 1995 • main objectives: (1) acquiring funds for carrying out research in formal and computational linguistics and its application in natural language processing; (2) teaching, training and dissemination
Centrum voor Computerlinguïstiek • formal syntax and semantics (Head-driven Phrase Structure Grammar) • corpus annotation (tagging, treebanks, semantic annotation) • machine translation • multilingual information retrieval • teaching at K.U.Leuven and at international summer schools (ESSLLI, ELSNET, EMLS, LOT) • host of ESSLLI-90, TMI-95, CLAW-96, ELSNET-97, CLIN-98, EMLS-02, HPSG-04, CLIN-07 • http://www.ccl.kuleuven.be/
Machine Translation: METIS-II (EU/FP6) 2004-2007 • Successor of Metis I (2003-2004) • Hybrid System DU-EN • Low Resources (no full parser, no parallel data) • BLEU scores about the same as SMT with IBM1 trained on Europarl • Succeeded by PaCo-MT (NTU) (2008-2011) • Hybrid system FR <-> DU <-> EN • Full Resources
Corpus Annotation: CGN / D-Coi / Lassy (STEVIN) • Series of joint Flemish/Dutch projects • spoken Dutch (CGN 1998-2004), ±10M • written Dutch (D-Coi '05-'06, Lassy '06-'09), ±50M • PoS labels, lemmata, treebank • Parts manually corrected (e.g. 1M treebank in Lassy) • Succeeded by SoNaR 2008-2011, ±500M • semantic labels (corrected) for 1M (NE, coreference, semantic roles, spatiotemporal)
CNTS Language Technology Group Walter Daelemans, Steven Gillis Faculty of Arts University of Antwerp http://www.cnts.ua.ac.be
CNTS Center for Dutch Language and Speech • Founded in 1992 to promote research in Dutch (corpus) linguistics, psycholinguistics, and computational linguistics • Research Center of the department of Linguistics (Faculty of Arts) • Member of Elsnet, CLIF, Flarenet, Clarin, Pascal, CIL, … • Co-founded ACL SIG on Computational Language Learning (SIGNLL) and the associated CoNLL conference series and CoNLL shared tasks series • Resources and software development • Corpora (CGN, COREA, Knack-2002, …) • TiMBL, Tadpole (with ILK, Tilburg University) • Memory-Based Shallow Parser (MBSP) • Spin-off: www.textkernel.nl
Research Topics • Computational Psycholinguistics • Computational models of human language acquisition and processing (phonology, morphology, syntax) • Machine Learning of Language • Memory-Based Learning; ML methodology • ML-based Text Analysis • Phonological and morphological analysis, Prosody and grapheme-to-phoneme, POS tagging, chunking, grammatical relations, pp-attachment, named-entity recognition, semantic role labeling, word sense disambiguation, coreference resolution, … • LT Applications • Biomedical information extraction; Summarization and sentence simplification; Ontology extraction from text; Question Answering • NL interface to graphical design packages, serious gaming, computational stylometry • African Language Technology
Biomedical Text Mining • BioMinT (EU FP5, Quality of Life, 2003-2005) • Information Retrieval and Information Extraction tool for human curators of SWISSPROT (protein database) • With SIB Geneva, University of Manchester, PharmaDM, University of Vienna, University of Geneva • Results CNTS • Adaptation of the Memory-Based Shallow Parser to biomedical language (tagger, tokenizer, NER, grammatical relations) • Biograph (University of Antwerp GOA project, 2007-2010), http://www.biograph.be • Ranking genes implicated in diseases (schizophrenia, Alzheimer) using heterogeneous data (including text) • With data mining group and molecular biology group of University of Antwerp • Better text mining engine including analysis of modality and negation for better biomedical information extraction
Computational Stylometry • Computational techniques for stylometry (FWO project, 2007-2010) • Goals • Develop feature construction, feature selection and supervised and unsupervised machine learning techniques for • Authorship, gender, date, and personality attribution from text • Stylistic analysis of literary texts • Develop and make available a tool and benchmark datasets • http://www.cnts.ua.ac.be/stylometry
iTEC Piet Desmet Faculty of Arts University of Leuven, Campus Kortrijk http://www.itec-research.be
iTEC Interdisciplinary research on Technology, Education & Communication: • Computer-assisted language learning • Corpora & digital libraries • Language testing • Authoring systems
Recent projects • Lingu@tic: language learning environment for Dutch & French based on video extracts (www.kuleuven-kortrijk.be/linguatic) • Medi@tic: database of video extracts for learners of Dutch & French (www.kuleuven-kortrijk.be/mediatic) • DPC: Dutch Parallel Corpus (10 million words; Dutch, English & French) (www.kuleuven-kortrijk.be/dpc)
Lingu@tic • Development of a free language learning environment (Dutch & French) based on authentic broadcast video extracts • Use of half-open exercises (e.g. translation exercises with alternative answers) • www.franel.eu
Medi@tic • Development of a database of learning objects • Repository of free authentic video materials • Management tool for the description of audio and video assets • Exploration tool for selecting video materials which can be integrated in CALL applications
DPC • Annotated, sentence-aligned corpus • 10 million words, NL-FR and NL-EN • Quality control • Compatible with D-Coi