TTC project
EC project and progress presentation, 28 May 2010
Terminology Extraction, Translation Tools and Comparable Corpora, 2010-2012
www.ttc-project.eu
ICT 2009.2.2. Language-Based Interaction
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 248005.
Introduction & Objective
• Introduction
  • Central role of linguistic resources for translation applications
  • Specialized languages
  • Multilingual terminologies
• Objective
  • Producing multilingual terminologies from comparable corpora for translation applications
TTC presentation – 28/05/2010
Concepts 1/2
• Parallel corpora: an original text and its manual translation into one or more languages
  Restrictions: sparse data
  • only available for some language pairs, usually with English as one of the two [French-English Hansards (Germann 2001)]
  • only for a few specific domains [Bible (Resnouf et al., 1999), Europarl (Koehn, 2005), JRC-Acquis (Steinberger et al. 2006)]
• Comparable corpora
  • [EAGLES 1996] A comparable corpus is one which selects similar texts in more than one language or variety.
  • [Bowker, Pearson 2002, p.93] “sets of texts in different languages, that are not translations of each other”
Concepts 2/2
• Monolingual terminology extraction: automating the extraction of terms from corpora in specialized domains
  • Single-word terms (SWTs)
  • Multi-word terms (MWTs)
• Alignment through lexical context analysis
  • [Grefenstette, 1994, p. 279] « First-order affinities describe what other words are likely to be found in the immediate vicinity of a given word »
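The "first-order affinities" in the Grefenstette quotation can be illustrated with a small sketch that counts, for each word, the words found in its immediate vicinity. The function name `context_vectors`, the `window` parameter, and the toy corpus are assumptions made for this illustration; the slides do not say how the TTC tools compute lexical contexts.

```python
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it.

    `sentences` is a list of token lists. This is an illustrative sketch,
    not the TTC project's actual context extractor.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


# Hypothetical toy corpus from the renewable-energy domain.
corpus = [
    ["wind", "turbine", "blade", "design"],
    ["offshore", "wind", "turbine", "farm"],
]
vectors = context_vectors(corpus, window=1)
print(vectors["turbine"].most_common())
```

With a window of one word on each side, "turbine" co-occurs twice with "wind" and once each with "blade" and "farm". Real systems typically replace raw counts with an association measure, which is outside the scope of this sketch.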
Bilingual terminology mining chain
[Diagram: web harvesting of documents → source documents / target documents → terminology extraction (both sides) → lexical context extraction (both sides) → lexical alignment process, with a bilingual dictionary, terms to be translated, and candidate translations]
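The lexical alignment step of the chain can be sketched in the usual way for comparable corpora: the context vector of a source term is projected into the target language through the bilingual dictionary, then compared with the context vectors of target terms. The function names, the choice of cosine similarity, and the French-English toy data below are assumptions for illustration only; the slides do not specify the similarity measure or data used by TTC.

```python
import math
from collections import Counter


def translate_vector(vector, dictionary):
    """Project a source-language context vector into the target language.

    Context words absent from the seed dictionary are dropped.
    """
    translated = Counter()
    for word, weight in vector.items():
        for target_word in dictionary.get(word, []):
            translated[target_word] += weight
    return translated


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(source_vector, target_vectors, dictionary):
    """Return target terms ordered from most to least similar context."""
    projected = translate_vector(source_vector, dictionary)
    scored = [(cosine(projected, vec), term) for term, vec in target_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored]


# Hypothetical data: context of the French term "éolienne" and two English candidates.
seed_dictionary = {"vent": ["wind"], "pale": ["blade"], "énergie": ["energy"]}
source_context = {"vent": 3, "pale": 2, "énergie": 1}
target_contexts = {
    "battery": {"energy": 3, "charge": 2},
    "turbine": {"wind": 2, "blade": 2, "energy": 1},
}
ranking = rank_candidates(source_context, target_contexts, seed_dictionary)
print(ranking)
```

The ranked list plays the role of the "candidate translations" in the diagram; the quality of the result depends heavily on the coverage of the seed dictionary.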
Applications
• Machine translation tools (MT tools)
• Computer-assisted translation tools (CAT tools)
• Multilingual content management tools
• Terminology management tools
Objectives
• Compiling comparable corpora
• Candidate term extraction
• Defining and combining different strategies for term alignment
• Development of an open platform for use in MT and CAT tools
• Demonstrating on MT and CAT tools
Comparable corpora
• « Web as a corpus » approach: successful for general-language corpus compilation
• Objective: compilation of specialized-language corpora
• Methods: monolingual / interlingual comparability
• Outputs: topical web crawler (M24)
Term extraction
• Single-word terms (SWT) / multi-word terms (MWT)
• Statistical and symbolic approaches
• Objectives
  • evaluation of resources for term extraction SWT/MWT performance
  • variations of MWT
  • extraction of context data
• Outputs
  • sets of extraction tools / rule sets for variants (M24)
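One common way to combine a symbolic and a statistical approach is to match part-of-speech patterns over tagged text (symbolic) and then count how often each matched sequence occurs (statistical). The sketch below assumes that reading; the pattern list, the tag names, and the function `extract_candidates` are hypothetical and are not taken from the TTC rule sets.

```python
from collections import Counter

# Hypothetical POS patterns for two-word term candidates.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]


def extract_candidates(tagged_sentences, patterns=PATTERNS):
    """Count lemma sequences whose POS tags match one of the patterns.

    `tagged_sentences` is a list of sentences, each a list of (lemma, pos) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(sentence) - n + 1):
                span = sentence[i:i + n]
                if tuple(pos for _, pos in span) == pattern:
                    counts[" ".join(lemma for lemma, _ in span)] += 1
    return counts


# Hypothetical lemmatized, POS-tagged input.
tagged = [
    [("renewable", "ADJ"), ("energy", "NOUN"), ("source", "NOUN")],
    [("the", "DET"), ("renewable", "ADJ"), ("energy", "NOUN")],
]
candidates = extract_candidates(tagged)
print(candidates.most_common())
```

Here "renewable energy" is found twice and "energy source" once. A frequency threshold or a termhood measure would normally be applied afterwards to filter the candidates.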
Term alignment
• Contextual analysis: 60% on TOP20 for single terms
• Objective
  • To improve the contextual analysis of SWTs
  • To reach a score for MWTs close to the score for SWTs
• Methods
  • Lexical / contextual / corpora strategies
• Outputs
  • Neo-classical MWT detection component (M18)
  • Compositional translation component (M24)
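The "60% on TOP20" figure is most plausibly a top-N score: the share of test terms for which a correct translation appears among the first 20 ranked candidates. That interpretation is an assumption, since the slide does not define the measure. Under it, the score could be computed along these lines (the function name and the toy data are illustrative only):

```python
def top_n_accuracy(ranked, reference, n=20):
    """Share of reference terms with a correct translation in their top-n candidates.

    `ranked` maps a source term to its ordered candidate list;
    `reference` maps a source term to the set of accepted translations.
    """
    if not reference:
        return 0.0
    hits = 0
    for term, gold in reference.items():
        top = set(ranked.get(term, [])[:n])
        if gold & top:  # at least one accepted translation was proposed
            hits += 1
    return hits / len(reference)


# Hypothetical French-to-English ranking output and reference list.
ranked_output = {
    "vent": ["wind", "air"],
    "pale": ["shovel", "blade"],
    "réseau": ["net"],
}
gold_standard = {
    "vent": {"wind"},
    "pale": {"blade"},
    "réseau": {"network", "grid"},
}
print(top_n_accuracy(ranked_output, gold_standard, n=1))
print(top_n_accuracy(ranked_output, gold_standard, n=2))
```

On this toy data one term out of three is correct at rank 1 and two out of three within the top 2, which shows why results are usually reported for several values of N.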
Open Platform
• Several tool suites for aligned corpora (Itools, Giza++)
• Objective
  • handling and exploiting comparable corpora
• Outputs
  • Terminology tool suite for comparable corpora
  • Open terminology management tool
Participants
Main impacts
• Better language coverage
  • 5 distinct language families
  • 7 targeted languages: Chinese, English, French, German, Latvian, Russian and Spanish
  • 12 pairs of languages: Zh-En, Zh-Fr, En-Fr, En-De, En-Lv, En-Ru, En-Es, Fr-De, Fr-Ru, Fr-Es, De-Es, Lv-Ru
• Expected resources
  • Domain-specific resources for renewable energy and computer science (focus on mobile technologies) in 7 languages
  • Comparable corpora, lemmatized and POS-tagged (M24)
  • Rule sets for recognizing term variants, for term inflection and morphological analysis (M24)
  • Bilingual aligned terminologies (M27)
Work planning
• Project duration: 36 months
WP1 – Requirements & Specifications (UN-LINA)
• Task 1.1 – requirements analysis (by Syllabs)
  • Online survey, advertised among translator and localization communities; received 139 answers, mainly to/into EN
  • 74% use translation software (TRADOS leads, with 17% of the respondents)
  • Reasons for not using MT: price, translation quality, not suitable for specific domains
  • Users' wishes for collection of corpora
    • Search functions
    • Automatic updating (crawl)
    • Frequency lists
    • Annotation function
    • Collaborative tool: share with others
    • Formats: sentence-per-line plain text, TMX format
  • Terminology extraction tools
WP1 – Requirements & Specifications (UN-LINA)
• Task 1.2 – functional specifications (by UN-LINA)
  • Toward a data model
  • Integration level: a functional approach
  • Functional capabilities:
    • Linguistic analysis stage: raw token, lemma, pos, offset, ...
    • Extractor and aligner stages: the analysis must continue, taking into account the information needed in the final output formats (e.g. for terms: TBX or TMF-compliant format)
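The linguistic analysis fields listed above (raw token, lemma, pos, offset) suggest a per-token record. A minimal sketch of such a record follows; the class name `Token`, the split of "offset" into `begin`/`end` character positions, the lookup-based `annotate` helper, and the fallback tag "X" are all assumptions for illustration and do not describe the actual TTC data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    raw: str    # surface form as it appears in the text
    lemma: str  # base form
    pos: str    # part-of-speech tag
    begin: int  # character offset of the first character
    end: int    # character offset just past the last character


def annotate(text, lexicon):
    """Split on whitespace and attach lemma, POS and offsets from a toy lexicon.

    Words missing from `lexicon` keep their lowercased form as lemma and get
    the placeholder tag "X". A real pipeline would use a tagger and lemmatizer.
    """
    tokens = []
    cursor = 0
    for raw in text.split():
        begin = text.index(raw, cursor)  # search from the end of the previous token
        end = begin + len(raw)
        lemma, pos = lexicon.get(raw.lower(), (raw.lower(), "X"))
        tokens.append(Token(raw, lemma, pos, begin, end))
        cursor = end
    return tokens


# Hypothetical lexicon and input.
toy_lexicon = {
    "wind": ("wind", "NOUN"),
    "turbines": ("turbine", "NOUN"),
    "rotate": ("rotate", "VERB"),
}
sample = "Wind turbines rotate"
tokens = annotate(sample, toy_lexicon)
for token in tokens:
    print(token)
```

Keeping offsets alongside the lemma and tag means later stages (extractor, aligner) can always point back to the exact span in the source document.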
WP1 – Requirements & Specifications (UN-LINA)
• Task 1.3 – definition of data exchange format (by IMS)
  • The exchange format has to be as simple as possible for internal purposes
  • Output format will be TBX / TMX
  • Consultation with UN-LINA:
    • on data categories used in TTC tools
    • on UIMA-based processing formats
    • → First outline specification of processing formats
  • Consultation with users (Sogitec, Tilde):
    • on data categories used in input tools/resources
    • on data categories used in CAT/MT
    • → First outline specification of semantic interoperability under exchange
  • Mapping of requirements onto ISO formats
    • → First outline specification of exchange format
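To give an idea of what a terminology entry in a TBX-style output could look like, the sketch below serializes one bilingual entry with Python's standard `xml.etree.ElementTree`. The `termEntry` / `langSet` / `tig` / `term` nesting follows the general shape of TBX as commonly documented, but this fragment is simplified, omits the document root and header, and has not been validated against the standard; the helper name `term_entry` and the sample terms are assumptions.

```python
import xml.etree.ElementTree as ET


def term_entry(entry_id, terms_by_lang):
    """Build a simplified TBX-like entry: one language section per term.

    Not a complete or validated TBX document; illustration only.
    """
    entry = ET.Element("termEntry", {"id": entry_id})
    for lang, term in terms_by_lang.items():
        lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
        tig = ET.SubElement(lang_set, "tig")
        ET.SubElement(tig, "term").text = term
    return entry


# Hypothetical aligned term pair.
entry = term_entry("t1", {"en": "wind turbine", "fr": "éolienne"})
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

A production exporter would add the required header metadata and any agreed data categories (for example part of speech or domain), and would validate the result against the schema chosen by the project.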
WP2: Corpora compilation (by UL-CTS)
• Use a pre-existing corpus to evaluate comparability
• First version (not fully functional) of the crawler by September 2010
• Issues with multilingualism, for example gender management
• Corpus contents depend on the type of document (news, reports, blogs…) and also differ according to targeted audience, authority, content vs linkerie, and region
WP8 Dissemination (by TILDE)
• T8.1: Website: www.ttc-project.eu, collaborative platform
• T8.2: Workshop in July
• T8.3: Poster and leaflet presented during LREC 2010 in Malta
• T8.4: IPR guidelines deliverable submitted to EC
• T8.5: Dissemination by every channel:
  • conferences (“Applied Linguistics in Science and Education”, 25-26 March 2010, Saint-Petersburg, Russia; LREC 2010, Malta; EAMT Conference 2010, Saint Raphaël; 14th EURALEX International Congress, 6-10 July 2010, Leeuwarden/Ljouwert, The Netherlands)
  • social media (LinkedIn, professional communities…)