LEXUS: A flexible web-based Lexicon Tool Interacting with ISO Data Category Registry
Peter Wittenburg, Marc Kemps-Snijders
MPI for Psycholinguistics
Outline • background – problem • MPI motivation = NLP motivation • playing LEGO • ISO TC37/SC4 • Data Categories • Lexical Markup Framework • LEXUS Tool Mark • Demo Mark • Outlook
Background
• 7 DOBES teams and 12 different lexica (structures, purposes)
• simple spreadsheet: Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography
• a little more complex, incl. 1:N relations: stem orthography, sense *, lexical sub-entry *, sense nr, sense, gram cat, gram subcat, Engl. Transl., example *, orthography, Engl. Transl., [T|pr] nr
• small part of a complex lexicon structure: entry-type = [stem|idiom|lexical word], head, outer-body-L *, headword, citation form, homograph no, phonetic form, inner-body-L, grammar, gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation *, cross-ref *, sense number, variety, meaning, etymology, table, example *, comment *, picture/photo *, housekeeping *
• at top level there are 4 different entry types (only one is shown)
Problem
• have to use one archival lexicon representation format based on XML
• have to build one archival exploitation framework
• however, we receive lexica
  • in various character encodings
  • in all sorts of formats (various XML, SBX, CHAT, even Word)
  • in various structures
  • with different terminologies (lexical attributes, values)
• how to do cross-lexical searches?
• how to do lexical merging, linking and comparison?
• how to solve lexicon-corpus interaction?
• etc.
• in NLP the same problems
  • lack of standards
  • lack of re-usability
  • lack of interoperability
• you knew this already, didn't you?
Why not play LEGO?
• a concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure
• building blocks: components (sub-schemas) and data categories (lexical attributes, linguistic concepts)
• example: component "sense" (1:1) with data categories sense nr, gram cat, engl trans, and a sub-component "examples" (1:N) with data categories ortho, engl trans, gloss
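The LEGO idea can be sketched in a few lines of Python. This is only an illustration, not the LEXUS data model: the class names `Component` and `DataCategory`, the `outline` helper and the exact nesting of the example are assumptions read off the slide.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataCategory:
    """A lexical attribute / linguistic concept, e.g. 'gram cat'."""
    name: str


@dataclass
class Component:
    """A sub-schema that groups data categories and nested components."""
    name: str
    cardinality: str = "1:1"  # "1:1" or "1:N" (assumed notation)
    children: List[Union["Component", DataCategory]] = field(default_factory=list)


def outline(node, depth=0):
    """Return the schema tree as a list of indented lines."""
    if isinstance(node, Component):
        label = f"{node.name} [{node.cardinality}]"
    else:
        label = node.name
    lines = ["  " * depth + label]
    for child in getattr(node, "children", []):
        lines.extend(outline(child, depth + 1))
    return lines


# Example schema as read from the slide (nesting is an assumption).
sense = Component("sense", "1:1", [
    DataCategory("sense nr"),
    DataCategory("gram cat"),
    DataCategory("engl trans"),
    Component("examples", "1:N", [
        DataCategory("ortho"),
        DataCategory("engl trans"),
        DataCategory("gloss"),
    ]),
])

print("\n".join(outline(sense)))
```

Because a component may contain other components, any of the lexicon structures shown on the Background slide could in principle be assembled from these two kinds of bricks.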
What else: Relations
• need various types of relations between attributes and units in value strings
• each relation can be associated with features, i.e. relations can be seen as components in their own right
• actually, component association is a relation of a special type
• example entries:
  • bank: breite Sitzgelegenheit (something broad to sit on)
  • sitzgelegenheit: etwas um zu sitzen (something to sit on)
  • schmal: gegenteil zu breit (contrary to broad)
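A minimal sketch of relations that carry features and can point at a unit inside a value string. The names `Unit` and `Relation`, the attribute labels, and the relation types "cross-reference" and "antonym" are hypothetical; the slide only shows the German example entries.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Unit:
    entry: str      # headword of the lexical entry
    attribute: str  # attribute whose value string contains the unit
    text: str       # the unit itself (a whole value or a word in it)


@dataclass
class Relation:
    source: Unit
    target: Unit
    # Features make the relation a small component in its own right.
    features: Dict[str, str] = field(default_factory=dict)


seat_in_bank = Unit("bank", "definition", "Sitzgelegenheit")
seat_entry = Unit("sitzgelegenheit", "headword", "sitzgelegenheit")
broad_in_bank = Unit("bank", "definition", "breite")
narrow_entry = Unit("schmal", "headword", "schmal")

relations = [
    Relation(seat_in_bank, seat_entry, {"type": "cross-reference"}),
    Relation(narrow_entry, broad_in_bank, {"type": "antonym"}),
]


def relations_from(entry):
    """All relations whose source unit belongs to the given entry."""
    return [r for r in relations if r.source.entry == entry]


for r in relations_from("bank"):
    print(r.source.text, "->", r.target.entry, r.features)
```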
What else: Inheritance
• just one example: to reduce typing
• b’ang: common attributes, particular attributes
• boeb’ang: common attributes, particular attributes
• goeb’ang: common attributes, particular attributes
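One way to read the inheritance slide is that common attributes are typed once and reused by related entries. The sketch below assumes that boeb’ang and goeb’ang inherit from b’ang; that link, the `inherits` key and all attribute names and values are placeholders, not data from the DOBES lexica.

```python
def resolve(entry, lexicon):
    """Merge attributes along the 'inherits' chain; particular values win."""
    parent = entry.get("inherits")
    merged = dict(resolve(lexicon[parent], lexicon)) if parent else {}
    merged.update(entry["attributes"])
    return merged


# Placeholder attributes: "common 1/2" stand for shared information,
# "particular" for what differs per entry.
lexicon = {
    "b’ang": {"attributes": {"common 1": "c1", "common 2": "c2",
                             "particular": "p0"}},
    "boeb’ang": {"inherits": "b’ang", "attributes": {"particular": "p1"}},
    "goeb’ang": {"inherits": "b’ang", "attributes": {"particular": "p2"}},
}

print(resolve(lexicon["boeb’ang"], lexicon))
```

Only the particular attributes have to be entered for the derived entries; the common ones are looked up when the entry is displayed or exported.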
What else: conditions (operations)
• just one example from DOBES, driven by lexemtype
  • if lexemtype = “stem | idiom | lexical word”: head, sense nr, outer-body-L, meaning, etc.
  • if lexemtype = “auxil | inflect affix”: sense nr, meaning, effect, categorial effect, etc.
• probably better examples around
• if value(X) then modify constraints(Y)
• etc.
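The rule "if value(X) then modify constraints(Y)" can be sketched as a lookup from the value of lexemtype to the components an entry may contain. The component lists are copied loosely from the slide and are illustrative only; `RULES` and `allowed_components` are hypothetical names.

```python
# Each rule: (set of lexemtype values, components allowed for them).
RULES = [
    ({"stem", "idiom", "lexical word"},
     ["head", "sense nr", "outer-body-L", "meaning"]),
    ({"auxil", "inflect affix"},
     ["sense nr", "meaning", "effect", "categorial effect"]),
]


def allowed_components(lexemtype):
    """Return the components permitted for an entry of this lexemtype."""
    for types, components in RULES:
        if lexemtype in types:
            return components
    raise ValueError(f"no rule for lexemtype {lexemtype!r}")


print(allowed_components("idiom"))
print(allowed_components("auxil"))
```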
ISO TC37/SC4 – the solution? • ISO TC37/SC4 is about standardization in language resource (LR) management • central is the data category registry (DCR) • basically a flat list of linguistic concepts • will contain is_a relations that are part of the concept definition, e.g. “transitive_verb” is_a “verb” • with proper definitions and conceptual space (value range) • request for filling the DCR (metadata, morphology, syntax, …) • looking for abstract models (frameworks) • for lexica • for annotation structures • for semantic annotations • for syntactic annotations • …
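A flat registry with is_a links can be sketched as a plain dictionary. Only the pair “transitive_verb” is_a “verb” comes from the slide; the entry `part_of_speech` and the function `is_a` are hypothetical additions used to show that the relation is followed transitively.

```python
# Flat list of concepts; each maps to its is_a parent, if any.
DCR = {
    "transitive_verb": "verb",   # from the slide
    "verb": "part_of_speech",    # hypothetical, for illustration
}


def is_a(concept, ancestor, registry):
    """Follow is_a links upward; True if `ancestor` is reached."""
    seen = set()
    while concept in registry and concept not in seen:
        seen.add(concept)  # guards against accidental cycles
        concept = registry[concept]
        if concept == ancestor:
            return True
    return False


print(is_a("transitive_verb", "verb", DCR))
print(is_a("verb", "transitive_verb", DCR))
```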
Underlying Model
• Data element concept: complex datcat, e.g. /Gender/
• Conceptual domain: set of simple datcats, e.g. /masculine/, /feminine/, /neuter/ (the Dutch system is different)
• Data element: XML object, e.g. implemented as an XML attribute named ‘gen’
• Value domain: list of values, e.g. m, f, n
• XML schema declaration, example instance: <w lemme="vert" gen="f">verte</w>
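The mapping from the value domain (m, f, n) back to the conceptual domain can be checked with the standard library XML parser. This is a sketch under the assumption that the attribute is called `gen` as on the slide; the function name `gender_of` and the lookup table are illustrative.

```python
import xml.etree.ElementTree as ET

# Value domain of the XML attribute 'gen' mapped to simple datcats.
GENDER = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}


def gender_of(xml_fragment):
    """Return the simple datcat encoded by the 'gen' attribute."""
    elem = ET.fromstring(xml_fragment)
    code = elem.get("gen")
    if code not in GENDER:
        raise ValueError(f"'gen' value {code!r} is outside the value domain")
    return GENDER[code]


print(gender_of('<w lemme="vert" gen="f">verte</w>'))
```

A lexicon for a language with a different gender system would keep the complex datcat /Gender/ but select a different set of simple datcats and values.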
Lexical Markup Framework: General Model
• metamodel + data category selection → lexical model
Metamodel
• made of lexical layers
• lexical layers are made of lexical components (or components)
• basis for modeling purposes is UML
• there will be an XML-schema based instantiation
Core Model
• Lexical DB (1..1) has Global Info (1..1)
• Lexical DB (1..1) has Lexical Entry (0..n)
• Lexical Entry (1..1) has Sense (0..n)
• Lexical Entry (1..1) has Form (0..n)
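The core model can be written down as a handful of Python dataclasses. This is a sketch and not the LMF XML-schema instantiation: the class names follow the slide, while the field names (`language`, `orthography`, `gloss`) and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    orthography: str


@dataclass
class Sense:
    gloss: str


@dataclass
class LexicalEntry:
    forms: List[Form] = field(default_factory=list)    # 0..n
    senses: List[Sense] = field(default_factory=list)  # 0..n


@dataclass
class GlobalInfo:
    language: str


@dataclass
class LexicalDB:
    global_info: GlobalInfo                                   # 1..1
    entries: List[LexicalEntry] = field(default_factory=list)  # 0..n


db = LexicalDB(GlobalInfo(language="German"))
db.entries.append(
    LexicalEntry(forms=[Form("bank")],
                 senses=[Sense("something broad to sit on")])
)
print(len(db.entries), db.entries[0].forms[0].orthography)
```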
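Extended Model
• core model as before: Lexical DB (1..1) has Global Info (1..1) and Lexical Entry (0..n); Lexical Entry (1..1) has Sense (0..n) and Form (0..n)
• added components: Morphology, Paradigm, Inflexion (cardinalities shown in the diagram: 1..1, 1..n, 1..1, 1..1, 0..n, 0..1)
• data categories attached in the diagram: /lemma/, /POS/, /gender/, /key form/, /orthography/, /variant for/, /orthography/, /gender/, /number/, /tense/, /person/, /mood/, /identifier/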
Proposed Extensions (still ongoing discussions)
• Lexical Entry with Sense and Form as in the core model
• additional components shown in the diagram: Syntactic family, Semantic frame, Semantic formula, Construct set, Syntactic construct, Semantic argument, Syntactic position
• associations are drawn with cardinalities 1..1 and 0..n
What will LMF be? • description of the general model (metamodel + data category selection, DCS) • DCs have to be ISO 11179/12620/… compliant • Core model • including component building, relations, conditions, inheritance • Extension mechanism • Proposed but not normative extensions (morphology, syntax, …) • XML-schema based instantiation • currently version 5 of the Draft Proposal • ISO/TC 37/SC 4 N130 Rev.5 • Date: 2005-03-19 • Working draft of ISO WD 24613:2005 • web-site: http://www.tc37sc4.org/
Goal LEXUS • To provide a framework capable of handling diverse lexicon structures and formats. • LEXUS is based upon the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework. • LEXUS is the first test and reference implementation of LMF. • Increase interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF) Workshop ‘Lexical Databases and digital tools’ Nijmegen April 2004
Current Status • supports the full LMF core model • allows for flexible creation of structures and content • supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF) • allows for dynamic editing of structures and content • supports use of multimedia content • import of existing lexica (Shoebox, CHAT) • export (Shoebox, LMF XML) • customizable layout
Current Status • user authentication • personal workspace for creating and editing lexica • merging facilities • simple and advanced search
Current Status (Technical) • Implemented in Java using Open Source components • Uses Spring to ‘wire’ the application • Modular approach avoiding ‘hard’ links • Uses Hibernate as the persistence framework • Allows use of multiple databases (PostgreSQL, MySQL, …) • Uses Tomcat as servlet container
Logging onto the application Users must authenticate before using the application.
User workspace Each user has his/her own personal workspace where private lexica are stored
Lexicon creation New lexica may be created…
Lexicon import New lexica may be imported from a lexical resource…
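To give an idea of what importing a Shoebox lexicon involves, here is a minimal parser for backslash-marked records. It is not the LEXUS importer: the function `parse_shoebox` and the sample text are hypothetical, and the sketch assumes that each record starts with the MDF marker `\lx` and that every field fits on one line.

```python
def parse_shoebox(text, record_marker="lx"):
    """Split Shoebox-style text into records of (marker, value) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # blank lines and continuation lines are skipped here
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            records.append([])  # a new record starts
        if records:  # ignore header fields before the first record
            records[-1].append((marker, value.strip()))
    return records


# Made-up sample using the MDF markers \lx (lexeme), \ps (part of
# speech) and \ge (English gloss).
sample = (
    "\\lx bank\n"
    "\\ps n\n"
    "\\ge something broad to sit on\n"
    "\n"
    "\\lx schmal\n"
    "\\ge contrary to broad\n"
)

for record in parse_shoebox(sample):
    print(record)
```

Each marker would then be mapped to a data category of the target lexicon schema, which is where the choice of data category registry matters.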
Lexicon structure The LMF core model can be identified in this simple structure. Components and data categories can be identified using different icons. All may be dynamically created or modified.
Lexicon structure Representation of a more complex structure. By selecting a node in the tree, the content of a component or data category is shown and may be modified.
Data category selection Data categories can easily be selected from data category registries.
Lexical entry overview Overview of lexical entries. By selecting a lexical entry the details will be revealed.
Lexical entry details Details of a lexical entry. Entry structure modifications are bound to schema definition, e.g. cardinality.
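Binding entry edits to the schema can be sketched as a cardinality check before a component is added or removed. The cardinality notation (`1..1`, `0..n`) follows the model slides; the schema contents and the function names are hypothetical.

```python
def parse_cardinality(spec):
    """Turn '0..n' or '1..1' into (minimum, maximum); None means unbounded."""
    low, high = spec.split("..")
    return int(low), (None if high == "n" else int(high))


def can_add(schema, counts, component):
    """True if one more instance of `component` stays within the schema."""
    _, high = parse_cardinality(schema[component])
    return high is None or counts.get(component, 0) < high


def can_remove(schema, counts, component):
    """True if removing one instance keeps the required minimum."""
    low, _ = parse_cardinality(schema[component])
    return counts.get(component, 0) > low


# Hypothetical entry schema and the current instance counts of one entry.
schema = {"headword": "1..1", "Sense": "0..n"}
counts = {"headword": 1, "Sense": 3}

print(can_add(schema, counts, "Sense"))
print(can_add(schema, counts, "headword"))
print(can_remove(schema, counts, "headword"))
```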
Lexical entry details Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).
Lexical entry details Example of uploading a video file.
Lexical entry details Viewing multimedia content.
Alternative entry view Alternative views are provided which may be customized in look and feel.
Synchronization of lexica (Main Lexicon → Personal Workspace) Lexica may be copied to and modified in the personal workspace.
Synchronization of lexica (Personal Workspace → Main Lexicon) Lexica may be synchronized with the main lexicon.
Synchronization of lexica When synchronizing lexica, the user is notified of structural changes and is in total control of the synchronization process.
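A rough sketch of how structural changes could be detected and left to the user to accept. Schemas are modeled here as sets of path strings, which is an assumption; `structural_changes`, `synchronize` and the `confirm` callback are hypothetical names and do not describe the LEXUS implementation.

```python
def structural_changes(main_schema, personal_schema):
    """Compare two schemas given as collections of component paths."""
    main, personal = set(main_schema), set(personal_schema)
    return {
        "added": sorted(personal - main),
        "removed": sorted(main - personal),
    }


def synchronize(main_schema, personal_schema, confirm):
    """Return the new main schema; `confirm` lets the user decide."""
    changes = structural_changes(main_schema, personal_schema)
    if (changes["added"] or changes["removed"]) and not confirm(changes):
        return set(main_schema)  # user declined, main lexicon unchanged
    return set(personal_schema)


main = {"Lexical Entry/Form/orthography", "Lexical Entry/Sense/gloss"}
personal = {"Lexical Entry/Form/orthography", "Lexical Entry/Sense/gloss",
            "Lexical Entry/Sense/example"}

print(structural_changes(main, personal))
print(synchronize(main, personal, confirm=lambda changes: False) == main)
```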
Future directions • Support for various types of relations • Import of data from other sources • Support for other Data Category Registries, e.g. GOLD • Integration with MPI archive • Integration with exploitation tools (ELAN, ANNEX) • Miscellaneous user requests
Example lexical structure
• used in the TEOP project within DOBES
• components and attributes shown: Stem orthography, Sense nr, Sense *, sense, Lexical subentry, Gram cat, Gram subcat, orthography, Engl. Transl., Engl. Transl., Example *, [T/pr] nr
• the * sign stands for 1:n relations of sub-structures