Data models and the (blind ?) query of lexical resources

Work in progress • Laurent Romary, Inria & HUB

Argument • Querying language resources requires precise knowledge of the underlying representation model • Given the over-expressiveness of the TEI, we need complementary (possibly local – aka crystals) models to enforce reference constructs • The TEI dictionary is a good use case to start, given the existence of LMF as an underlying model

Querying semi-structured data • Characteristics • Order (sequences or sets) • Recursion (depth-free?) • Typing (local or structural) • Schema-driven or not • Select-from-where and DB models • Traditionally: work on the link between database models and the corresponding query structures (ERM) • Recent proposals in the semi-structured document community (the Abiteboul school; cf. the work around XQuery; D. Florescu) • Importance of paths (set semantics) • Role of patterns (e.g. pairs, sequences, car.cdr etc.) as typing mechanisms • Very little work on data with low conformance to a reference model • Data with low predictability • Objective: limiting unpredictability in the lexical domain

Lexical data is a messy field • From full-form lexica for NLP to encyclopaedic dictionaries • Legacy unstructured/unpredictable data • Shaped by available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica) • Two core reference traditions/models/serialisations • Onomasiological • Concept to term, as is the case for most terminological databases • ISO 16642 (TMF – Terminological Markup Framework) • Natural serialisation in ISO 30042 (TBX) • Semasiological • Word to sense, as implemented in traditional dictionaries • ISO 24613 (LMF – Lexical Markup Framework) • Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI offers just the background we need

TEI – a wealth of possibilities, reflecting messiness • Orphan grammatical descriptors [corrected!] • Orphan sense descriptors • <def>, <usg>, <cit> etc. can occur outside a sense • Multiple elements to provide the “same” information • E.g. <gloss> vs. <cit> • General issues • Free text can occur everywhere • Existence and usage of large-coverage TEI classes • E.g. content model of <sense>: (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*

model.global • model.global.edit [addSpan damageSpan delSpan gap space] • model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline] • model.global.spoken [incident kinesic pause shift vocal writing] • model.milestoneLike [anchor cb fw gb lb milestone pb] • model.noteLike [note witDetail] • figure metamark notatedMusic

Model - principles • General modelling strategy from ISO/TC 37 (cf. Object Management Group) • Meta-model • General, underlying model that informs current practice • Data categories • Provide the elementary descriptors to instantiate models • Possibly registered/standardised/reused from ISOcat • Any serialisation isomorphic with a given model is acceptable • “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services

LMF-TEI meta-model components (simplified) • [Diagram: Lexicon, Lexical entry (<entry>), Form (<form>), Morphology (<gramGrp>) and Sense (<sense>), linked by associations with 1..1 and 0..n cardinalities]

Main data categories • [Diagram: the same Lexicon, Lexical entry, Form, Morphology and Sense components, annotated with the data categories below] • /part of speech/ (<pos>) • /inflexional class/ (<itype>) • /gender/ (<gen>) • /number/ (<number>) • /case/ (<case>) • /person/ (<per>) • /tense/ (<tns>) • /mood/ (<mood>) • /orthography/ (<orth>) • /pronunciation/ (<pron>) • /hyphenization/ (<hyph>) • /syllabification/ (<syll>) • /stress pattern/ (<stress>) • /definition/ (<def>) • /example/ (<eg>) • /usage/ (<usg>) • /etymology/ (<etym>)

Examples of constraints • Forbid the usage of <entryFree>, <dictScrap> (status of <hom>, <superEntry> to be determined) • Systematic use of a grammatical container (<gramGrp>) for all grammatical features • Limit the usage of <gramGrp> to <entry>, <form>, <sense>, <cit> • Only allow semantic descriptors in <form> (usage constraints), <sense> and <cit> (for contextualising an example) • … cf. Budin et al., 2012, Romary & Wegstein, 2012, Romary, 2013
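The first three constraints above can be checked mechanically. The following is a minimal sketch only: it assumes un-namespaced TEI fragments, it treats the grammatical elements listed under "Main data categories" as the features that must sit inside <gramGrp>, and the names `check_constraints` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumptions taken from the constraint list; real TEI files carry the
# TEI namespace, which this sketch ignores for brevity.
FORBIDDEN = {"entryFree", "dictScrap"}
GRAMGRP_PARENTS = {"entry", "form", "sense", "cit"}
GRAM_FEATURES = {"pos", "itype", "gen", "number", "case", "per", "tns", "mood"}

def check_constraints(xml_text):
    """Return a list of readable violations found below the root element."""
    root = ET.fromstring(xml_text)
    violations = []
    # ElementTree keeps no parent pointers, so visit (parent, child) pairs.
    for parent in root.iter():
        for child in parent:
            if child.tag in FORBIDDEN:
                violations.append(f"forbidden element <{child.tag}>")
            if child.tag == "gramGrp" and parent.tag not in GRAMGRP_PARENTS:
                violations.append(f"<gramGrp> not allowed in <{parent.tag}>")
            if child.tag in GRAM_FEATURES and parent.tag != "gramGrp":
                violations.append(
                    f"<{child.tag}> outside <gramGrp> (in <{parent.tag}>)")
    return violations

# Hypothetical fragment: one orphan <pos> and one <entryFree>.
SAMPLE = """<body>
  <entry>
    <form><orth>chat</orth></form>
    <pos>noun</pos>
    <sense><gramGrp><gen>m</gen></gramGrp><def>small feline</def></sense>
  </entry>
  <entryFree>chats: see chat</entryFree>
</body>"""

for violation in check_constraints(SAMPLE):
    print(violation)
```

In practice such rules would more likely live in a customised TEI schema (ODD) or in Schematron; the script only illustrates what the constraints rule out.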

Paths • Baseline when using (XML) TEI documents: • XPath: entry/sense/usg[@type='geo'] • Issues • Model-agnostic and serialisation-specific • Model-based query language (component-data-category (CDC) path) • Pointing to explicit components and data categories • $lexicalEntry.$sense.geographicalUsage • A CDC path can be checked for compatibility with the model • We can consider the compiled set of all paths compatible with the model: the CDC graph • Natural interface with DB/faceting environments such as Elasticsearch
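Checking a CDC path against the model can be sketched as a walk over a small graph. Everything below is an assumption made for illustration: the `MODEL` dictionary, the attachment of each data category to a component, the recursive sense, and the function names are not taken from LMF itself.

```python
# Hypothetical, simplified model: each component lists the components and
# the data categories it may contain.
MODEL = {
    "lexicon": {"components": {"lexicalEntry"}, "categories": set()},
    "lexicalEntry": {"components": {"form", "morphology", "sense"},
                     "categories": set()},
    "form": {"components": set(),
             "categories": {"orthography", "pronunciation"}},
    "morphology": {"components": set(),
                   "categories": {"partOfSpeech", "gender", "number"}},
    "sense": {"components": {"sense"},  # senses may nest (assumption)
              "categories": {"definition", "example",
                             "geographicalUsage", "subjectField"}},
}

def parse_cdc_path(path):
    """Split '$lexicalEntry.$sense.geographicalUsage' into typed steps."""
    steps = []
    for token in path.split("."):
        if token.startswith("$"):
            steps.append(("component", token[1:]))
        else:
            steps.append(("category", token))
    return steps

def is_compatible(path, model=MODEL):
    """True if every step of the CDC path is licensed by the model."""
    steps = parse_cdc_path(path)
    if steps[0][0] != "component" or steps[0][1] not in model:
        return False
    current = steps[0][1]
    for position, (kind, name) in enumerate(steps[1:], start=1):
        if kind == "component":
            if name not in model[current]["components"]:
                return False
            current = name
        else:
            # A data category is a leaf: it must be allowed here and be last.
            is_last = position == len(steps) - 1
            return is_last and name in model[current]["categories"]
    return True

print(is_compatible("$lexicalEntry.$sense.geographicalUsage"))
print(is_compatible("$lexicalEntry.$form.geographicalUsage"))
```

Unlike the XPath baseline, the check never mentions an element name, so the same path could be validated for any serialisation of the model.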

Queries • Retrieval of a specific entry given a constraint on the form • Token to word-form mapping • $lexicalEntry.$form[orthography='chats'] • Retrieval of a sense from an entry given additional constraints • $lexicalEntry.$sense*[subjectField='nautical'] • Search for all entries having specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs • … • Extraction of all (or some of the) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples • …
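The first query above can be run against a TEI serialisation by translating each CDC step into an element step. This is a sketch under stated assumptions: the two mapping tables, the `to_xpath` helper and the sample entries are hypothetical, only the `$component[category='value']` form is handled (not the `*` operator), and the TEI namespace is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from model names to TEI element names.
COMPONENT_TAG = {"lexicalEntry": "entry", "form": "form", "sense": "sense"}
CATEGORY_TAG = {"orthography": "orth", "definition": "def"}

# One step: $component, optionally followed by [category='value'].
STEP = re.compile(r"\$(\w+)(?:\[(\w+)='([^']*)'\])?$")

def to_xpath(cdc_query):
    """Translate a restricted CDC query into an ElementTree path.

    Limitation of this sketch: values containing '.' are not supported,
    because steps are split on '.'.
    """
    parts = []
    for step in cdc_query.split("."):
        match = STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported step: {step}")
        component, category, value = match.groups()
        xpath_step = COMPONENT_TAG[component]
        if category is not None:
            xpath_step += f"[{CATEGORY_TAG[category]}='{value}']"
        parts.append(xpath_step)
    return ".//" + "/".join(parts)

SAMPLE = """<body>
  <entry n="chat">
    <form type="lemma"><orth>chat</orth></form>
    <form type="inflected"><orth>chats</orth></form>
  </entry>
  <entry n="chien">
    <form type="lemma"><orth>chien</orth></form>
  </entry>
</body>"""

xpath = to_xpath("$lexicalEntry.$form[orthography='chats']")
hits = ET.fromstring(SAMPLE).findall(xpath)
print(xpath, [form.get("type") for form in hits])
```

The translation table is the only serialisation-specific part; the query itself stays at the level of the model.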

Signatures • Objective: characterizing the data as compliant with a given model (M) • Identification of queryable data (D) • Principle • SM: Construction of a compiled graph of components and data categories allowed by a model (component-DC graph) • SD: Construction of the compiled graph of CDC paths from the data • SD must be a subset of SM
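The subset test can be illustrated with both signatures stored as sets of graph edges, which keeps a recursive model (a sense inside a sense) finite. The model, the observed paths and the function names below are assumptions for illustration only.

```python
# Hypothetical, simplified model (same shape as a component-DC graph).
MODEL = {
    "lexicalEntry": {"components": {"form", "sense"}, "categories": set()},
    "form": {"components": set(), "categories": {"orthography"}},
    "sense": {"components": {"sense"}, "categories": {"definition"}},
}

def model_signature(model):
    """S_M: every (parent, child) edge the model allows."""
    edges = set()
    for component, spec in model.items():
        for child in spec["components"]:
            edges.add(("$" + component, "$" + child))
        for category in spec["categories"]:
            edges.add(("$" + component, category))
    return edges

def data_signature(cdc_paths):
    """S_D: every (parent, child) edge that occurs in the observed paths."""
    edges = set()
    for path in cdc_paths:
        steps = path.split(".")
        edges.update(zip(steps, steps[1:]))
    return edges

# Hypothetical CDC paths observed in two data sets.
clean = ["$lexicalEntry.$form.orthography",
         "$lexicalEntry.$sense.$sense.definition"]
messy = clean + ["$lexicalEntry.definition"]  # orphan sense descriptor

S_M = model_signature(MODEL)
print(data_signature(clean) <= S_M)   # compliant: S_D is a subset of S_M
print(data_signature(messy) <= S_M)   # not compliant
print(data_signature(messy) - S_M)    # the offending edges
```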

Silent data • Scenario: querying multiple dictionaries of various types • E.g. presence of full-form lexica for which queries about <sense> do not apply • Identifying all paths from the model which are not realized in the data • SM - SD
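With signatures held as sets of CDC paths, the silent part of each resource is a plain set difference, and a query can be routed only to the resources that realize its path. The path sets and resource names below are invented for illustration.

```python
# Hypothetical model signature S_M, flattened to CDC paths.
S_M = frozenset({
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.subjectField",
})

# Hypothetical data signatures S_D, one per resource.
SIGNATURES = {
    "full_form_lexicon": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$morphology.partOfSpeech",
    },
    "general_dictionary": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$morphology.partOfSpeech",
        "$lexicalEntry.$sense.definition",
    },
}

def silent_paths(s_m, s_d):
    """Paths allowed by the model but never realized in the data."""
    return s_m - s_d

def answering_resources(path, signatures):
    """Names of the resources in which the queried path is realized."""
    return sorted(name for name, s_d in signatures.items() if path in s_d)

for name, s_d in SIGNATURES.items():
    print(name, sorted(silent_paths(S_M, s_d)))

print(answering_resources("$lexicalEntry.$sense.definition", SIGNATURES))
```

A query about senses is then never sent to the full-form lexicon, rather than returning an empty result that could be misread as "no such sense".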

Noisy data • TEI-encoded data which do not fulfil LMF compliance • Checking process • Compiling all paths found in the data as a CDC graph • Comparison with the CDC paths allowed by the model • Note that the data can still be queried • Depending on the semantics, with lower recall and precision
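The checking process can be sketched by compiling the paths directly from a TEI fragment and comparing them with the model. The tag mapping, the flat (non-recursive) `S_M` and the sample entry are assumptions; tags without a mapping are kept with a `?` prefix so that they show up as noise.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI elements to components / data categories.
TAG_TO_STEP = {
    "entry": "$lexicalEntry", "form": "$form", "gramGrp": "$morphology",
    "sense": "$sense", "orth": "orthography", "pos": "partOfSpeech",
    "def": "definition", "usg": "usage",
}

# Hypothetical model signature, flattened to paths (no nested senses).
S_M = {
    "$lexicalEntry",
    "$lexicalEntry.$form", "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology", "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.$sense", "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.usage",
}

def data_paths(element, prefix=()):
    """Yield the CDC path of `element` and of every element below it."""
    step = TAG_TO_STEP.get(element.tag, "?" + element.tag)
    path = prefix + (step,)
    yield ".".join(path)
    for child in element:
        yield from data_paths(child, path)

# Hypothetical entry with a <def> placed outside any <sense>.
SAMPLE = """<entry>
  <form><orth>chat</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <def>small domestic feline</def>
  <sense><usg type="geo">Canada</usg></sense>
</entry>"""

S_D = set(data_paths(ET.fromstring(SAMPLE)))
noisy = S_D - S_M       # realized in the data, not licensed by the model
queryable = S_D & S_M   # still safe to query through the model

print(sorted(noisy))
print(sorted(queryable))
```

Queries restricted to `queryable` keep their model semantics; the orphan definition would be missed by `$lexicalEntry.$sense.definition`, which is one way recall drops on noisy data.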

What’s next • The issue of querying language resources should be accompanied by an enforcement of models • Integration within a language resource query language agenda (bringing in semi-structured database specialists) • Going blind? • Procedures for identifying compatibilities between queries and data • Data quality check • Recommendations for DARIAH & CLARIN • LMF additional part? • Not just a technical issue…

Trend: TEI reaching out to new communities • Bringing back existing communities of practice
