ELSE: Evaluation in Language and Speech Engineering
January 1998 - April 1999
ELSE Participants
• MIP - U. of Odense (Denmark)
• UDS (Germany)
• U. di Pisa (Italy)
• EPFL (Switzerland)
• XRCE (France)
• U. of Sheffield (United Kingdom)
• Limsi (CNRS) (France)
• CECOJI (CNRS) (France)
• ELRA & ELSNET
Comparative Technology Evaluation Paradigm
• Successfully used by US DARPA (since 1984)
• Used on a shorter scale in Europe (SQALE, GRACE…)
• Choose a task and a system or component
• Gather participants
• Organize the campaign (protocols / metrics / data)
• Mandatory when the technology is insufficient: MT, IR, summarization… (cf. speech recognition in the 1980s)
Knowledge gained from evaluation campaigns
• Knowledge shared by participants in workshops:
• How to get the best results?
• Advantages / disadvantages of each methodology
• Funding agencies (DARPA / others):
• Assess the level of technology / applications
• Progress vs. investment
• Set priorities
Knowledge gained from evaluation campaigns
• Industry:
• Compare with the state of the art (developers)
• Select technologies (integrators)
• Easier market intelligence (SMEs)
• Consider applications (end users)
A powerful tool
• Go deeper into the conceptual background: metrics, protocols...
• Contrastive evaluation scheme
• Accompanies research: a problem-solving approach
• Of interest to both the speech and NL communities
Resources & evaluation by-products
• Training and test data
• Must be of high quality (used in testing)
• Evaluation toolkits
• Expensive, hence of interest to all
• Of interest to remote users (other domains, other countries):
• Compare with the state of the art
• Induce participation in evaluation campaigns
• Measure progress
Relationship with usage-oriented evaluation
• Technology evaluation:
• Generic task
• Attracts enough participants
• Close enough to a practical application
• Usage evaluation:
• Specific application / specific language
• User-satisfaction criteria
Relationship with usage-oriented evaluation
• Technology insufficient: no application
• Technology sufficient: possible applications
• Usage evaluation requires a larger effort than technology evaluation
• Technology evaluations (tens of them): generic, organized centrally
• Usage evaluations (thousands of them): specific, organized by each application developer / user
Relationship with long-term research
• Different objectives / time scales
• Meeting points placed in the future
• LTR: a high-risk but high-profit investment
ELSE results
• What does ELSE propose?
• An abstract architecture (generic IR/IE) with profiling, querying and presentation components (see the sketch below)
• Control tasks that:
1) can easily be performed by a human
2) allow arbitrary composite functionality
3) come with a formalism for describing task results
4) use measures that are easy to understand
• Six tasks, or one global task, to start with...
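The slides name the three architecture components but not their interfaces, so the following is a minimal illustrative Python sketch only; every class and method name below is a hypothetical assumption, not the ELSE specification.

```python
# Hypothetical sketch of the ELSE abstract architecture (generic IR/IE).
# The three components -- profiling, querying, presentation -- come from
# the slide; their interfaces are illustrative assumptions.
from abc import ABC, abstractmethod


class Profiler(ABC):
    """Filters a document stream against a standing user profile."""

    @abstractmethod
    def filter(self, documents: list[str]) -> list[str]:
        """Return the documents matching the stored profile."""


class QueryEngine(ABC):
    """Answers ad-hoc queries against an indexed collection."""

    @abstractmethod
    def search(self, query: str) -> list[str]:
        """Return documents ranked by relevance to the query."""


class Presenter(ABC):
    """Renders retrieved material for the end user (e.g. a summary)."""

    @abstractmethod
    def render(self, documents: list[str]) -> str:
        """Produce a user-facing presentation of the results."""
```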
Six control tasks to start with...
1. Broadcast News Transcription (a scoring sketch follows below)
2. Cross-Lingual IR / IE
3. Text-to-Speech Synthesis
4. Text Summarization
5. Language Model Evaluation
6. Word Annotation (POS, lemma, syntactic roles, senses, etc.)
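The slides do not spell out a metric for task 1, but transcription campaigns of this kind are conventionally scored by word error rate: the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch:

```python
# Illustrative word error rate (WER) scorer: Levenshtein distance over
# words, normalized by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```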
...or one global task to start with...
• "TV News on Demand" (NOD), inspired by BBN's "Rough'n'Ready":
• segments radio and TV broadcasts
• combines several recognition techniques (speaker ID, OCR, speech transcription, named entities, etc.)
• detects topics
• summarizes
• searches/browses and retrieves information
(see the pipeline sketch below)
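The slide describes NOD as a chain of processing stages. A minimal sketch of such a pipeline, under the assumption that each stage enriches a shared record and hands it to the next; the stage names and data shapes are hypothetical, not the Rough'n'Ready design:

```python
# Hypothetical sketch of the "TV News on Demand" processing chain.
from typing import Callable

Stage = Callable[[dict], dict]


def run_pipeline(broadcast: dict, stages: list[Stage]) -> dict:
    """Apply each processing stage to the accumulating record in order."""
    for stage in stages:
        broadcast = stage(broadcast)
    return broadcast


# Placeholder stages mirroring the slide: segmentation, recognition
# (speaker ID, OCR, transcription, named entities), topic detection,
# summarization, and indexing for search/browse.
def segment(b): return {**b, "segments": ["seg1", "seg2"]}
def recognize(b): return {**b, "transcripts": ["..."], "speakers": ["..."]}
def detect_topics(b): return {**b, "topics": ["elections"]}
def summarize(b): return {**b, "summary": "..."}
def index(b): return {**b, "searchable": True}


result = run_pipeline({"source": "evening_news.mpg"},
                      [segment, recognize, detect_topics, summarize, index])
```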
Multilingualism
• 15 countries
• Two possible solutions:
1) a cross-lingual functionality requirement, or
2) all participants evaluate on two languages: their own, plus one common pivot language (English?)
Results computation
• Multidimensional evaluation (multiple mixed evaluation criteria)
• Baseline performance (contrastive), as sketched below
• Dual result computation (quality)
• Reproducible (an automated evaluation toolkit is needed)
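A minimal sketch of contrastive reporting against a shared baseline, assuming a generic higher-is-better score; the function and system names are illustrative assumptions:

```python
# Illustrative contrastive reporting: every system score is reported
# relative to a shared baseline, so results stay comparable.
def contrastive_report(scores: dict[str, float], baseline: str) -> None:
    base = scores[baseline]
    for system, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        delta = 100.0 * (score - base) / base  # relative change in %
        print(f"{system:12s} {score:6.3f}  ({delta:+.1f}% vs {baseline})")


contrastive_report({"baseline": 0.62, "site_A": 0.71, "site_B": 0.58},
                   "baseline")
```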
Language resources
• Human-built reference data (cost + consistency checks + guidelines)
• Minimal size (chunk-selective evaluation; see the sampling sketch below)
• Minimal quality requirement
• Representativity of language phenomena
• Reusable & multilingual
• By-products of evaluation become evaluation resources
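One reading of "chunk-selective evaluation" is scoring only a small test subset chosen to cover the phenomena of interest. A sketch under that assumption, with hypothetical phenomenon labels and a simple greedy strategy:

```python
# Illustrative chunk selection: pick a small set of test chunks that
# still covers every language phenomenon we want represented.
def select_chunks(chunks: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: chunks maps chunk id -> phenomena it exhibits."""
    needed = set().union(*chunks.values())
    chosen = []
    while needed:
        # Take the chunk covering the most still-uncovered phenomena.
        best = max(chunks, key=lambda c: len(chunks[c] & needed))
        chosen.append(best)
        needed -= chunks[best]
    return chosen


corpus = {
    "chunk1": {"anaphora", "negation"},
    "chunk2": {"negation"},
    "chunk3": {"ellipsis", "anaphora"},
}
print(select_chunks(corpus))  # ['chunk1', 'chunk3']
```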
Actors in the infrastructure
• European Commission
• ELRA
• Evaluators
• Participants (EU / non-EU)
• Language resource producers
• Research
• Industry
• Citizens
• Users & customers
Need for a permanent infrastructure?
• Problems with the Call for Proposals mechanism:
• Limited duration (FPs) / cost shared by participants
• A permanent organization would handle:
• General policy / strategy / ethical aspects
• Scoring software
• Label attribution / quality assurance & control
• Production of language resources (dev, test)
• Distribution of language resources (ELRA)
• Cross-over between FPs
Evaluation in the Call for Proposals
• Evaluation campaigns: 2 years
• Proactive scheme: select topics (research / industry), e.g. TV News on Demand, or several tasks (BNT, CLIM, etc.)
• Reactive scheme: select projects, identify generic technologies among projects (clusters?), contract resources out of project budgets, a posteriori negotiation
Multilinguality
• Each participant should address at least two languages (their own + the common language)
• One language common to all participants:
• compare technologies on the same language/data
• compare languages on the same technology
• English: spoken by many people, large market, cooperation with the USA
• Up to 4 languages for each consortium
• Other languages in future actions
Proactive vs. reactive?
• ELSE's views:
• Proactive scheme
• Single consortium
• Permanent organization (association + agency)
• English as the common language
Estimated cost
• 100% EC funding for infrastructure organization and language resources
• Participants: share of system development costs
• Reactive: extra funding for evaluation
• Proactive:
• 600 k€ on average per topic (3.6 M€ total)
• 90 k€ organization
• 180 k€ LR production
• 300 k€ participants (up to 10)
• 30 k€ supervision by the permanent organization
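As a check on the figures above: 90 + 180 + 300 + 30 = 600 k€ per topic, and six topics (matching the six control tasks) × 600 k€ = 3.6 M€, the quoted total.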
Questions?
• Are you interested in the concept?
• Would you be interested in participating?
• Would you be interested in providing data?
• Would you be ready to pay to participate?
• Would you be ready to pay for access to the results (and by-products, e.g. data and tools) of an evaluation?
• Would you be interested in paying for specific evaluation services?