Hypermedia Lexica and Lexicon Metadata. The MetaLex model in the ModeLex project Dafydd Gibbon U Bielefeld Europe E-MELD Workshop, Detroit, August 2002. Overview.
Hypermedia Lexica and Lexicon Metadata: The MetaLex model in the ModeLex project. Dafydd Gibbon, U Bielefeld, Europe. E-MELD Workshop, Detroit, August 2002
Overview • Metalex goals • Background: DATR, HyprLex, Speech, Language Documentation • Metalex design: theory and practice • Lexical documents & metadocuments • Lexical objects, properties, structures • Metalex implementation • Ivory Coast encyclopaedia project • Ega documentation model project • The Modelex (multimodal lexicon) project • Ivory Coast + Nigeria documentation curriculum project • Extending Metalex • Modalities & submodalities • Data-driven lexicography • Data structures & algorithms: trees, lattices; induction, inference
Metalex goals: background • General objectives: • Versatile, high-quality spoken language lexicography • Motivated balance of high-tech + low-tech • Good resources are data-driven and theory-informed • Specific project objectives: • DATR/ILEX: formal lexicon theory and implementation • VerbMobil: integrated HyprLex dissemination model • HyprLex encyclopaedia model for Ivory Coast languages • Ega endangered language documentation model • Modelex - theory and design of multimodal lexica • Ivory Coast and Nigeria curricula for language documentation
Metalex design: data and theory • Data-driven data + metadata acquisition: • Systematic metatext derived from and supporting ... • Computational fieldwork • Induction of lexica • Theory-informed data + metadata acquisition: • Integrated Lexicon (ILEX) consisting of ... • Abstract Lexicon (ALEX) - "theory" in the mathematical sense • Object Lexicon (OLEX) - "model" in the mathematical sense
Metalex design: data • Data-driven acquisition: • Computational fieldwork • Portable metadatabase with restricted vocabulary and general metatext, and • Definition of and support for transcription + annotation • Portable support for scenarios, scripts • Portable support for lexicon processing • Induction of lexica • Lexicon tools for • Extraction of macrostructural elements (lexeme elements) • Induction of microstructural information (media concordance, POS, ...) • Induction of mesostructural regularities and subregularities (grammar, ...)
Metalex design: theory • Theory-informed formalisation: • Abstract Lexicon (ALEX) - "theory" in the mathematical sense • Decomposition (componential A-V description) • Generalisation (inheritance) • Composition (multilinear operations) • Object Lexicon (OLEX) - "model" in the mathematical sense • XML archiving and dissemination formats • object-relational database acquisition and processing formats • = Integrated Lexicon (ILEX)
Metalex implementation: architecture • Data model ∩ Theory = shared lexicon architecture: • Macrostructure: declarative and procedural components • Lexicon architecture: relational, inheritance, text, ... • Lexical objects: entry types • Lexical access: fact query, semasiological / onomasiological indexing • Mesostructure: • Generalisations: grammar, phonetics, cultural background, ... • Composition of lexicon object types: idioms, words, morphemes, ... • Lexical access: inferential query • Microstructure: • Lexical entry (article, lemma structure - atom, string, tree, ...) • Types of lexical information - standardly: "lexicon model"
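The two access directions named on this slide can be illustrated with a small sketch. It assumes that semasiological access means going from a form to its meanings and onomasiological access means going from a meaning to its forms. The entry tuples reuse the Anouman glosses from the field lexicon slide; the function name `build_indexes` and the tuple layout are hypothetical, not part of Metalex.

```python
from collections import defaultdict

# Toy (entry id, gloss) pairs; the glosses come from the Anouman example,
# the tuple layout is an assumption made for this sketch.
ENTRIES = [
    ("Anouman_1", "bird"),
    ("Anouman_2", "grandchild"),
    ("Anouman_3", "yesterday"),
]


def build_indexes(entries):
    """Build a form -> glosses index and a gloss -> entry ids index."""
    semasiological = defaultdict(list)   # form to meanings
    onomasiological = defaultdict(list)  # meaning to forms
    for entry_id, gloss in entries:
        headword = entry_id.split("_")[0]  # strip the homograph number
        semasiological[headword].append(gloss)
        onomasiological[gloss].append(entry_id)
    return dict(semasiological), dict(onomasiological)


sema, onoma = build_indexes(ENTRIES)
print(sema["Anouman"])
print(onoma["bird"])
```

Both indexes are plain fact queries over the macrostructure; the inferential queries mentioned under mesostructure are not covered by this sketch.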
Metalex implementation: microstructure • Microstructure specification philosophy: • Anybody can specify any kind of unpredictable detail • Questionnaire / Experiment / Corpus / Archive dependence • Lexicon architecture: relational, inheritance, text, ... • Intelligent (semi-)automatic classification, not fixed attributes • Theory-informed coarse grouping is possible • Media attributes: visual, auditory, tactile, ... • Meaning attributes: definition, gloss, lexical relations, ... • Composition attributes: context/category, parts, operations • Use attributes: style, register, concordance, media illustrations, ... • Micrometadata attributes: lexicographer DB indices, source (e.g. fieldwork metadata) DB indices, modification, ...
Metalex implementation: fieldwork metadata source (1) Situation dimension • participant: fieldworker, partners, contacts • channel: modalities, media • locale: indoor/outdoor, spatial configuration • temporal: date, time, calendar event • functional: affiliation, role, occasion; observation (prompt, metadata management) Language dimension • affiliation • discourse level: discourse type, genre + prosody • phrase level: recursive phrasal categories/relations + prosody • word level: clitics, inflexion, word formation + prosody
Metalex implementation: fieldwork metadata source (2) Technical dimension • physical characteristics of participants: age, sex, health • physical characteristics of locale: indoor/outdoor, spatial configuration, temporal sequence, date (season), time (of day) • audio: mike type, position, room; A/D; channels, fsample, resolution; formats • video: camera & microphone type, analogue/digital; filters, lenses; audio; formats • other sensors: laryngograph, airflow, data glove, ... Metalinguistic dimension • empirical method: introspection, experiment, corpus elicitation • materials: questionnaire, experiment layout, corpus scenario • metadata specification: index, metatext type, metacatalogue type
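A fieldwork metadata record with a restricted vocabulary might look roughly like the following sketch. The two controlled vocabularies are copied from the slides above; the class name `FieldworkMetadata`, the choice of fields, and all sample values are assumptions for illustration and do not reproduce the actual Metalex database schema.

```python
from dataclasses import dataclass, field

# Restricted vocabularies taken from the slides.
EMPIRICAL_METHODS = {"introspection", "experiment", "corpus elicitation"}
LOCALE_TYPES = {"indoor", "outdoor"}


@dataclass
class FieldworkMetadata:
    """Hypothetical record covering a few of the metadata dimensions."""
    fieldworker: str                 # situation: participant
    locale: str                      # situation/technical: indoor or outdoor
    date: str                        # situation: temporal
    method: str                      # metalinguistic: empirical method
    audio: dict = field(default_factory=dict)  # technical: free-form details

    def __post_init__(self):
        # Reject values outside the restricted vocabularies.
        if self.locale not in LOCALE_TYPES:
            raise ValueError(f"unknown locale: {self.locale!r}")
        if self.method not in EMPIRICAL_METHODS:
            raise ValueError(f"unknown empirical method: {self.method!r}")


# Sample values are invented for the sketch.
record = FieldworkMetadata(
    fieldworker="fieldworker-01",
    locale="outdoor",
    date="2002-03-15",
    method="corpus elicitation",
    audio={"channels": 2, "fsample": 44100, "resolution": 16},
)
print(record)
```

Keeping the vocabulary check in the record itself is one way to make the same validation available on any device that holds the data; general metatext would go into additional free-text fields.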
Metalex implementation: fieldwork metadata entry tool • LREC 2002, Workshop on Portability Issues
Metalex implementation: fieldwork metadata entry tool • HanDBase DBMS for PalmOS
Metalex objects, in conjunction with work in ISLE CLWG (Computational Lexicon Working Group) (see Gibbon in reading list) LEXICON: • { < Macrostructure > , < Mesostructure > } • Macrostructure: Ordering( {ENTRY, ...} ) • Mesostructure: < FrontmatterMetadata, Descriptions > ENTRY: • < Microstructure, HousekeepingMetadata >
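The object definitions on this slide can be read as nested record types. The sketch below is one possible rendering in Python dataclasses; the class and field names mirror the slide, while the `order_by` attribute, the use of plain dictionaries for the components, and the sample values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """ENTRY: < Microstructure, HousekeepingMetadata >"""
    microstructure: dict
    housekeeping: dict = field(default_factory=dict)


@dataclass
class Mesostructure:
    """Mesostructure: < FrontmatterMetadata, Descriptions >"""
    front_matter: dict
    descriptions: dict = field(default_factory=dict)


@dataclass
class Lexicon:
    """LEXICON: { < Macrostructure >, < Mesostructure > }"""
    entries: list
    mesostructure: Mesostructure
    order_by: str = "lemma"  # assumed: microstructure field used for ordering

    def macrostructure(self):
        # Macrostructure: Ordering( {ENTRY, ...} )
        return sorted(self.entries, key=lambda e: e.microstructure[self.order_by])


lexicon = Lexicon(
    entries=[
        Entry({"lemma": "Anouman_2", "E-gloss": "grandchild"}),
        Entry({"lemma": "Anouman_1", "E-gloss": "Bird"}),
    ],
    mesostructure=Mesostructure(front_matter={"title": "Demo lexicon"}),
)
print([e.microstructure["lemma"] for e in lexicon.macrostructure()])
```

Treating the macrostructure as a computed ordering rather than a stored list is a design choice of the sketch: the same entry set can then be re-ordered for a different access type without copying entries.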
The LEXICON object Front Matter Metadata: • Bibliographical: creator, publisher, title, date, ... • Medium / format: paper, CD-ROM/DVD, web, ... Macrostructure type: • access: semasiological/onomasiological, • n-lingual/langue(s), • special: taxonomy (thesaurus), concordance • structure, e.g. tabular: f(type,attrib)=value
The ENTRY object: metadata Entry Metadata: (see Gibbon & al. in reading list) • Entry type (wrt macrostructure specification): • encyclopaedic • multiword unit, word, ... • Microstructure data model specification: • entry structure: flat, tree, graph (net), ... • data categories (DC) specification (attribute, field, information type) • DC groups - structural skeleton • DCs • DC substructure - homography, homophony, polysemy ...
The ENTRY object: DC groups Media ("surface"): • acoustic (phonetic, earcon, sonification), visual (orthography, icon, gesture, ...) Composition (structure): • part (e.g. morphology for words), context (e.g. POS, subcat for words) Meaning (definition, illustration): • semantic (components, relations, senses, ontology) • pragmatic (speech act, dialogue, disfluency, ...) Use: typically: media (e.g. audio) concordance, ... Metadata: lexicographer, ...
The ENTRY object: DCs Countless Data Category models: (see reading list) • every existing dictionary • linguistic "types of lexical information" • several European projects (GENELEX, MULTILEX, ACQUILEX, ...) • ISO terminology norms (cf. MARTIF etc. ...)
The ENTRY object: DC structures Computationally relevant properties of fields: • type (atomic, complex: tree, string, xyz-formatted text) • character encoding spec.: ASCII, Unicode, xyz • tree (or other graph/net): • finite depth • flat, disjunctive tree • recursive graph (net) • table, non-tree graph, anchor/link/index structure • generated text: • print, hypertext (compiled vs. dynamic, i.e. generated on the fly)
Metalex microstructure application Media ("surface"): • phonemic & tonemic transcription (SAMPA ASCII - still waiting for Unicode...) Composition (structure): • morphemic substructure, category & subcategory Meaning (definition, illustration): • glosses (English, French, German) • definitions, senses, relations, components; audio-visual illustration Use: genres; examples (e.g. concordance link); free text notes Metadata: first record; last field
Metalex field lexicon microstructure Anouman_1: • Media attributes: • Phonemic tier: `an'U~m`'a~ • Skeletal tier: VNVNV • Tonal tier: L H LH • Signal tier: Audio • Meaning attributes: • F-gloss: Oiseau • E-gloss: Bird • G-gloss: Vogel • Definition: avis • Homophone full: Anouman_2: grandchild • Homophone phonemic: Anouman_3: yesterday • Use: • < Concordance pointer > • Genre: narrative • Metadata: • Lexicographer: S. Adouakou • Source: Bielefeld-Anyi-Corpus, Adaou village, CI • Date: March 2002
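Since XML is named as the archiving and dissemination format for the object lexicon, the entry above could be serialised along the following lines. This is only a sketch: the element and attribute names (`entry`, `group`, `dc`, `name`) and the function `entry_to_xml` are invented here and are not the Metalex XML vocabulary. The field values are copied from the slide, with the phonemic and signal tiers left out for brevity.

```python
import xml.etree.ElementTree as ET

# Part of the Anouman_1 entry, grouped by DC group as on the slide.
ANOUMAN_1 = {
    "media": {"skeletal": "VNVNV", "tonal": "L H LH"},
    "meaning": {"F-gloss": "Oiseau", "E-gloss": "Bird", "G-gloss": "Vogel"},
    "use": {"genre": "narrative"},
    "metadata": {"lexicographer": "S. Adouakou", "date": "March 2002"},
}


def entry_to_xml(entry_id, groups):
    """Serialise one entry as <entry><group><dc>...</dc></group></entry>."""
    entry = ET.Element("entry", id=entry_id)
    for group_name, fields in groups.items():
        group = ET.SubElement(entry, "group", name=group_name)
        for field_name, value in fields.items():
            dc = ET.SubElement(group, "dc", name=field_name)
            dc.text = value
    return ET.tostring(entry, encoding="unicode")


xml_text = entry_to_xml("Anouman_1", ANOUMAN_1)
print(xml_text)
```

Using generic `group` and `dc` elements with a `name` attribute, rather than one element type per data category, keeps the format open to categories that were not foreseen, in line with the microstructure specification philosophy.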
Metalex portable lexical database Relational database: • Metalex specs flattened • structure re-constitution via Metalex specs • HanDBase for PalmOS • Features: • standard full RelDBMS • XML, CSV, text export • export/import via GSM • inexpensive (wrt laptop) • stylus, keyboard, sync input • lightweight • low power consumption • inconspicuous in use • interfaces to Scheme, C
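Flattening a structured entry into relational rows and reconstituting it afterwards can be sketched as below. The row layout (entry id, slash-separated path, value) and the function names are assumptions for illustration; the slides do not say how the Metalex specs are actually flattened for HanDBase. The sketch also assumes that field names contain no "/" and that there are no empty groups.

```python
def flatten(entry_id, node, prefix=()):
    """Turn a nested entry into flat (entry_id, path, value) rows."""
    rows = []
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            rows.extend(flatten(entry_id, value, path))
        else:
            rows.append((entry_id, "/".join(path), value))
    return rows


def reconstitute(rows):
    """Rebuild the nested entries from flat rows using the stored paths."""
    entries = {}
    for entry_id, path, value in rows:
        node = entries.setdefault(entry_id, {})
        *parents, leaf = path.split("/")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return entries


entry = {
    "meaning": {"F-gloss": "Oiseau", "E-gloss": "Bird", "G-gloss": "Vogel"},
    "use": {"genre": "narrative"},
}
rows = flatten("Anouman_1", entry)
for row in rows:
    print(row)
restored = reconstitute(rows)
```

The path column plays the role of the structural specification here: a flat table can be exported as CSV or text, and the tree is recoverable as long as the paths travel with the values.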
Metalex extension: The Modelex project: "Theory and Design of Multimodal Lexica" Goals: • Data-driven, theory-informed lexicon models • Formal properties of abstract data models for multimodal lexica • Interpretation of abstract data models in XML • Integration of parallel annotation lattices for modalities and submodalities • Development of a prototype multimodal lexicon
Modelex: gesture annotation Time Aligned Signal Corpus System (Java, GPL) Jan-Torsten Milde, U Bielefeld TASX annotator: • Phonological tier • ToBI tiers • Gesture tier • Speech Act tier Anyi, Ega, German
Model-theoretic compilation in ILEX: INTERPRETATION ( ALEX ) = OLEX
Metalex in the Modelex project: Multimodal concordance as microstructure DC. Prototype: http://www.spectrum.uni-bielefeld.de/langdoc/PAX/
Metalex in the Modelex project: underspecified ALEX microstructure for gesture coordinates
Hand:
    <parts> == "Palm" "Digit"
    <vector> == "<name>" <coord "<name>">
    <coord> == "<x1>" "<y1>" "<x2>" "<y2>"
    <> == .
Palm:
    <parts> == <vector>
    <name> == palm
    <width> == pw
    <height> == ph
    <x1 fore> == <x1>
    <x1 middle> == ( <x1> + ( <x2> - <x1> ) / 3 )
    <x1 ring> == ( <x1> + ( <x2> - <x1> ) * 2 / 3 )
    <x1 pinky> == <x2>
    <x1> == px1
    <y1> == py1
    <x2> == ( <x1> + <width> )
    <y2> == ( <y1> + <height> )
    <> == Hand .
Metalex in the Modelex project: fully specified ALEX microstructure for gesture coordinates
Hand:<parts> =
    palm px1 py1 ( px1 + pw ) ( py1 + ph )
    thumb px1 py1 ( px1 - lt ) py1
    fore px1 py1 px1 ( py1 - lf )
    middle ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) ( py1 - lm )
    ring ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) ( py1 - lr )
    pinky ( px1 + pw ) py1 ( px1 + pw ) ( py1 - lp )
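To see what the fully specified description yields once the symbolic parameters receive values, the expressions can be evaluated directly. The sketch below transcribes the six coordinate expressions above into Python; it is not a DATR interpreter, and the function name `hand_parts` and all numeric values are invented for illustration.

```python
def hand_parts(px1, py1, pw, ph, lt, lf, lm, lr, lp):
    """Evaluate the (x1, y1, x2, y2) vector of each hand part.

    Parameters follow the symbols in the description: palm origin
    (px1, py1), palm width pw and height ph, and the digit lengths
    lt, lf, lm, lr, lp (thumb, fore, middle, ring, pinky).
    """
    x2 = px1 + pw                              # right edge of the palm
    middle_x = px1 + (x2 - px1) / 3            # one third across the palm
    ring_x = px1 + (x2 - px1) * 2 / 3          # two thirds across the palm
    return {
        "palm": (px1, py1, x2, py1 + ph),
        "thumb": (px1, py1, px1 - lt, py1),
        "fore": (px1, py1, px1, py1 - lf),
        "middle": (middle_x, py1, middle_x, py1 - lm),
        "ring": (ring_x, py1, ring_x, py1 - lr),
        "pinky": (x2, py1, x2, py1 - lp),
    }


# Invented sample values; any consistent unit (e.g. pixels) would do.
parts = hand_parts(px1=10, py1=50, pw=30, ph=40, lt=15, lf=25, lm=30, lr=27, lp=20)
for name, vector in parts.items():
    print(name, vector)
```

With these values the fore finger starts at the left palm edge, the pinky at the right edge, and the middle and ring fingers at one third and two thirds of the palm width, which is the regularity the underspecified description states once and the fully specified one spells out per digit.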
Metalex: conclusion & prospects User complexity: • demands an open, data-driven approach Domain: • demands a theory-informed approach • with computational acquisition & inference Data-driven and theory-informed lexica • are possible (METALEX) • need integrated model-theoretic approach (ILEX): INTERPRETATION (ALEX) = OLEX • a formal problem remains: differing complexity of trees (archive): simulation of other graphs via semantics only annotation lattices (data), tables (lexica): regular relations if non-recursive, indexed grammars if recursive?