XMELLT
Cross-lingual Multi-word Expression Lexicons for Language Technology
Multilingual Information Access and Management International Research Co-operation
Nancy Ide
Department of Computer Science, Vassar College
Participants
• Department of Computer Science, Vassar College
• International Computer Science Institute, University of California, Berkeley
• Department of Computer Science, New York University
• Computing Research Laboratory, New Mexico State University
Framework
• Planning project
  • one-year time frame
• Originally submitted as a joint NSF-EU project with additional European partners
  • Istituto di Linguistica Computazionale, CNR, Pisa
  • Institut für Maschinelle Sprachverarbeitung, Stuttgart
  • LexiQuest, Paris
Overall goal
• define a core international infrastructure to support the creation of a multi-lingual multi-word expression lexicon incorporating both morpho-syntactic and semantic information
Specific aims
• determine the type and dimensions of information to serve the needs of critical NLP applications
• specify an overall architecture for a joint software and lingware development project
Aims...
• Explore the possibilities for recognizing and acquiring multi-word lexical units from corpora by means of partial parsing, statistics, etc.
• Outline a collaborative project to acquire and represent multi-word lexical entries for multiple languages
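The slides mention statistics as one route to finding multi-word units in corpora but do not name a measure. As an illustration only, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one common association measure; the choice of PMI, the function name `pmi_bigrams`, the `min_count` threshold, and the toy corpus are all assumptions, not part of the XMELLT plan.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the
        # probability expected if the two words were independent.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first: these are candidate multi-word units.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy corpus, for illustration only.
corpus = (
    "she will take a walk then take a walk again "
    "he will take a decision and take a break"
).split()
candidates = pmi_bigrams(corpus)
```

A real acquisition pipeline would likely combine such scores with the partial parsing the slide mentions, so that only syntactically plausible candidates (for example verb plus noun phrase) are ranked.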
Motivation
• Multi-word constructions are extremely frequent in language
  • ~30% of the lexical stock
• Existing resources do not adequately treat multi-word expressions
Limitations
• constructed for a particular system or application
• incorporate tailored information (e.g., primarily syntax with little semantics)
• not reusable
• most devoted to a single language and/or approach
Limitations...
• not flexible or expandable to multiple languages
• MT systems' lexicons are typically little more than "translation memories"
• No interface among single-word entries, multi-word entries, syntax, and semantics
XMELLT Approach
• Broad view of multi-word expressions
  • idioms, compounds, collocations, co-occurrence patterns
• focus on linking of individual language lexicons
• individual words and multi-word expressions
• different types of multi-word expressions
  • e.g., English noun-noun vs Romance noun-PP
Considerations
• internal variation
• sub-categorization properties
• idiosyncratic constraints on inflection
• meaning (non-)compositionality
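To make the four considerations concrete, here is a minimal sketch of how a single lexicon entry might record them. The class `MweEntry` and every field name are hypothetical, chosen for illustration; the slides do not specify a record layout.

```python
from dataclasses import dataclass, field


@dataclass
class MweEntry:
    """Hypothetical record for one multi-word expression in one language."""

    language: str            # language code, e.g. "en"
    components: list[str]    # component lemmas in canonical order
    pattern: str             # surface type, e.g. "N-N" or "N-PP"
    # Internal variation: may other words be inserted between components?
    allows_insertion: bool = False
    # Sub-categorization properties of the expression as a whole.
    subcat_frame: str = ""
    # Idiosyncratic constraints on inflection, keyed by component lemma.
    frozen_inflection: dict[str, str] = field(default_factory=dict)
    # Meaning (non-)compositionality.
    compositional: bool = True


# Example entry for an English idiom (field values are illustrative).
kick_the_bucket = MweEntry(
    language="en",
    components=["kick", "the", "bucket"],
    pattern="V-Det-N",
    allows_insertion=False,
    subcat_frame="intransitive",
    frozen_inflection={"bucket": "singular"},
    compositional=False,
)
```

Keeping each consideration as its own field, instead of folding everything into free text, is one way to let both syntactic and semantic components of an NLP system query the same entry.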
Encoding Model
• Compatible and integrated with existing and de facto standards
  • e.g., EAGLES, PAROLE/SIMPLE, NOMLEX
Activities
• Assessment of existing lexical resources for multi-word expressions
  • Delivery of survey
Activities...
• Creation of a small set of sample entries
  • add lexical information on support verb constructions to 50 nouns drawn from NOMLEX for English, Italian, German, and French
  • create lexical entries for 50 N-N English constructs from the PAROLE/SIMPLE lexicons and corresponding constructs in Italian, German, and French
Activities...
• Develop preliminary specifications for structuring and encoding multi-lingual, multi-word expression lexicons
  • required linguistic information
  • harmonized data architecture and encoding format
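As a rough illustration of what a harmonized encoding could look like, the sketch below serializes one cross-lingual link between an English noun-noun compound and its Romance noun-PP counterparts. The element and attribute names (`mweLink`, `entry`, `concept`, `lang`, `pattern`) are invented for this example and are not taken from EAGLES, PAROLE/SIMPLE, or NOMLEX.

```python
import xml.etree.ElementTree as ET


def build_link(concept_id, entries):
    """Build a hypothetical <mweLink> element grouping equivalent
    multi-word expressions from several languages.

    `entries` is a list of (language code, pattern, surface form) tuples.
    """
    link = ET.Element("mweLink", {"concept": concept_id})
    for lang, pattern, form in entries:
        entry = ET.SubElement(link, "entry", {"lang": lang, "pattern": pattern})
        entry.text = form
    return link


link = build_link(
    "credit_card",
    [
        ("en", "N-N", "credit card"),
        ("it", "N-PP", "carta di credito"),
        ("fr", "N-PP", "carte de crédit"),
    ],
)
# encoding="unicode" makes tostring return a str instead of bytes.
xml_text = ET.tostring(link, encoding="unicode")
```

Linking entries through a shared identifier, instead of pairwise translations, is one design that scales to more than two languages; whether the project adopts it is an open question here.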
Activities...
• Exploration of techniques for automatic acquisition
  • Months 1-6: Survey of acquisition techniques, typology of MWE
  • Months 7-12: Design of architecture for MWE acquisition
Project information
• Start date: June (?)
• Web site: http://www.cs.vassar.edu/~ide/XMELLT.html
• Contact: Nancy Ide (PI), Department of Computer Science, Vassar College, ide@cs.vassar.edu