Work in progress: Data models and the (blind?) query of lexical resources. Laurent Romary, Inria & HUB
Argument • Querying language resources requires precise knowledge of the underlying representation model • Given the over-expressiveness of the TEI, we need complementary (possibly local – aka crystals) models to enforce reference constructs • The TEI dictionary is a good use case to start, given the existence of LMF as an underlying model
Querying semi-structured data • Characteristics • Order (sequences or sets) • Recursion (depth-free?) • Typing (local or structural) • Schema-driven or not • Select-from-where and DB models • Traditionally: work on the link between database models and the corresponding query structures (ERM) • Recent proposals in the semi-structured document community (the Abiteboul school; cf. the work around XQuery; D. Florescu) • Importance of paths (set semantics) • Role of patterns (e.g. pairs, sequences, car.cdr etc.) as typing mechanisms • Very little work on data with low conformance to a reference model • Low-predictability data • Objective: limiting unpredictability in the lexical domain
Lexical data is a messy field • From full-form lexica for NLP to encyclopaedic dictionaries • Legacy unstructured/unpredictable data • Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica) • Two core reference traditions/models/serialisations • Onomasiological • Concept to term; as is the case for most terminological databases • ISO 16642 (TMF – Terminological Markup Framework) • Natural serialisation in ISO 30042 (TBX) • Semasiological • Word to sense; as implemented in traditional dictionaries • ISO 24613 (LMF – Lexical Markup Framework) • Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI just offers the background we need
TEI – a wealth of possibilities, reflecting messiness • Orphan grammatical descriptors [corrected!] • Orphan sense descriptors • <def>, <usg>, <cit> etc. can occur outside a sense • Multiple elements to provide the “same” information • E.g. <gloss> vs. <cit> • General issues • Free text can occur everywhere • Existence and usage of large-coverage TEI classes • E.g. <sense>: (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*
model.global • model.global.edit [addSpan damageSpan delSpan gap space] • model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline] • model.global.spoken [incident kinesic pause shift vocal writing] • model.milestoneLike [anchor cb fw gb lb milestone pb] • model.noteLike [note witDetail] • figure metamark notatedMusic
Model - principles • General modeling strategy from ISO/TC 37 (cf. Object Management Group) • Meta-model • General, underlying model that informs current practice • Data categories • Provide the elementary descriptors to instantiate models • Possibly registered/standardised/reused from ISOcat • Any serialization isomorphic with a given model is acceptable • “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services
LMF-TEI meta-model components (simplified) • [Diagram: Lexicon, Lexical entry, Form, Morphology and Sense components linked by 1..1 and 0..n cardinalities; TEI elements shown: <entry>, <form>, <gramGrp>, <sense>]
Main data categories • [Diagram: the same meta-model (Lexicon, Lexical entry, Form, Morphology, Sense) annotated with data categories] • /part of speech/ (<pos>) • /inflexional class/ (<itype>) • /gender/ (<gen>) • /number/ (<number>) • /case/ (<case>) • /person/ (<per>) • /tense/ (<tns>) • /mood/ (<mood>) • /orthography/ (<orth>) • /pronunciation/ (<pron>) • /hyphenization/ (<hyph>) • /syllabification/ (<syll>) • /stress pattern/ (<stress>) • /definition/ (<def>) • /example/ (<eg>) • /usage/ (<usg>) • /etymology/ (<etym>)
Examples of constraints • Forbid the usage of <entryFree>, <dictScrap> (status of <hom>, <superEntry> to be determined) • Systematic use of a grammatical container (<gramGrp>) for all grammatical features • Limit the usage of <gramGrp> to <entry>, <form>, <sense>, <cit> • Only allow semantic descriptors in <form> (usage constraints), <sense> and <cit> (for contextualizing an example) • … cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
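A minimal sketch of how three of these constraints could be checked mechanically. It assumes un-namespaced TEI-like input (real TEI files carry the TEI namespace), and the helper name `check_constraints`, the rule labels and the sample entry are all hypothetical. The list of grammatical features is taken from the data categories slide.

```python
import xml.etree.ElementTree as ET

# Assumption: un-namespaced input; rule labels below are illustrative only.
FORBIDDEN = {"entryFree", "dictScrap"}
GRAMGRP_PARENTS = {"entry", "form", "sense", "cit"}
GRAM_FEATURES = {"pos", "itype", "gen", "number", "case", "per", "tns", "mood"}


def check_constraints(xml_text):
    """Return (rule, element, parent) tuples, one per violation, in document order."""
    root = ET.fromstring(xml_text)
    violations = []
    for parent in root.iter():
        # Constraint 1: some elements are not allowed at all.
        if parent.tag in FORBIDDEN:
            violations.append(("forbidden", parent.tag, None))
        for child in parent:
            # Constraint 3: <gramGrp> only below a fixed set of parents.
            if child.tag == "gramGrp" and parent.tag not in GRAMGRP_PARENTS:
                violations.append(("misplaced-gramGrp", child.tag, parent.tag))
            # Constraint 2: grammatical features always inside <gramGrp>.
            if child.tag in GRAM_FEATURES and parent.tag != "gramGrp":
                violations.append(("feature-outside-gramGrp", child.tag, parent.tag))
    return violations


SAMPLE = """<body>
  <entry>
    <form><orth>chat</orth><gen>m</gen></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense><def>small domesticated feline</def></sense>
  </entry>
  <entryFree>loose text</entryFree>
</body>"""

for violation in check_constraints(SAMPLE):
    print(violation)
```

On the sample, the orphan <gen> inside <form> and the <entryFree> element are reported; the constraint on semantic descriptors is left out of the sketch.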
Paths • Baseline when using (XML) TEI documents: • XPath: entry/sense/usg[@type='geo'] • Issues • Model-agnostic, serialization-specific • Model-based query language (component-data-category (CDC) path) • Pointing to explicit components and data categories • $lexicalEntry.$sense.geographicalUsage • A CDC path can be checked for compatibility with the model • We can consider the compiled set of all paths compatible with the model: the CDC graph • Natural interface with DB/faceting environments such as Elasticsearch
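One way to picture the relation between a CDC path and its serialization-specific counterpart is a small compiler from one to the other. The sketch below is an assumption throughout: the `MODEL` graph, the `TEI` mapping table and the function names are hypothetical, and only the two paths quoted on the slide come from the source.

```python
# Hypothetical fragment of a model: which components / data categories
# may appear directly below each component.
MODEL = {
    "$lexicalEntry": {"$form", "$sense", "partOfSpeech"},
    "$form": {"orthography", "pronunciation"},
    "$sense": {"$sense", "definition", "geographicalUsage"},
}

# Hypothetical mapping of each step to a TEI serialization.
TEI = {
    "$lexicalEntry": "entry",
    "$form": "form",
    "$sense": "sense",
    "partOfSpeech": "gramGrp/pos",
    "orthography": "orth",
    "pronunciation": "pron",
    "definition": "def",
    "geographicalUsage": "usg[@type='geo']",
}


def is_compatible(cdc_path):
    """True if every step of the path is licensed by the model graph."""
    steps = cdc_path.split(".")
    if steps[0] != "$lexicalEntry":
        return False
    return all(child in MODEL.get(parent, set())
               for parent, child in zip(steps, steps[1:]))


def to_xpath(cdc_path):
    """Compile a model-compatible CDC path into a TEI XPath expression."""
    if not is_compatible(cdc_path):
        raise ValueError(f"path not allowed by the model: {cdc_path}")
    return "/".join(TEI[step] for step in cdc_path.split("."))


print(to_xpath("$lexicalEntry.$sense.geographicalUsage"))
print(is_compatible("$lexicalEntry.$form.geographicalUsage"))
```

The compatibility check runs against the model only, so a query can be rejected before any data is touched; a different serialization would only require a different mapping table.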
Queries • Retrieval of a specific entry given a constraint on the form • Token to word-form mapping • $lexicalEntry.$form[orthography='chats'] • Retrieval of a sense from an entry given additional constraints • $lexicalEntry.$sense*[subjectField='nautical'] • Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs • … • Extraction of all (or part of all) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples • …
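The first two query types can be sketched against a toy TEI fragment. Several points are assumptions rather than part of the proposal: the sample entries are invented, the input is un-namespaced, /subjectField/ is taken to be serialized as <usg type="dom">, and the `*` after `$sense` is read as "a sense at any nesting depth".

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
  <entry n="chat">
    <form type="lemma"><orth>chat</orth></form>
    <form type="inflected"><orth>chats</orth></form>
    <sense><def>cat</def></sense>
  </entry>
  <entry n="pont">
    <form type="lemma"><orth>pont</orth></form>
    <sense>
      <def>bridge</def>
      <sense><usg type="dom">nautical</usg><def>deck of a ship</def></sense>
    </sense>
  </entry>
</body>"""


def entries_with_form(root, orth):
    """Entries having at least one form with the given orthography."""
    return [entry for entry in root.iter("entry")
            if any(o.text == orth for o in entry.findall("form/orth"))]


def senses_in_field(root, field):
    """Senses at any depth whose own subject field equals `field`."""
    return [sense
            for entry in root.iter("entry")
            for sense in entry.iter("sense")
            if any(u.text == field for u in sense.findall("usg[@type='dom']"))]


root = ET.fromstring(SAMPLE)
print([e.get("n") for e in entries_with_form(root, "chats")])
print([s.findtext("def") for s in senses_in_field(root, "nautical")])
```

Both functions hard-code the TEI serialization; in the approach argued for here, that part would be derived from the CDC path rather than written by hand.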
Signatures • Objective: characterizing the data as compliant with a given model (M) • Identification of queryable data (D) • Principle • SM: Construction of a compiled graph of components and data categories allowed by a model (component-DC graph) • SD: Construction of the compiled graph of CDC paths from the data • SD must be a subset of SM
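A sketch of the signature comparison, under one simplifying assumption that is not in the source: each signature is stored as the set of edges (parent, child) of the component-DC graph rather than as full paths, which keeps recursive senses finite. The `NAMES` table, the model edges and the two sample entries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI element names to component / data-category names.
NAMES = {
    "entry": "$lexicalEntry", "form": "$form", "sense": "$sense",
    "orth": "orthography", "def": "definition", "usg": "usage",
}

# S_M: edges of the component-DC graph allowed by the (toy) model.
S_M = {
    ("$lexicalEntry", "$form"), ("$lexicalEntry", "$sense"),
    ("$form", "orthography"),
    ("$sense", "$sense"), ("$sense", "definition"), ("$sense", "usage"),
}


def data_signature(xml_text):
    """S_D: edges actually realized in the data (unknown elements kept as <tag>)."""
    root = ET.fromstring(xml_text)
    edges = set()
    for parent in root.iter():
        if parent.tag not in NAMES:
            continue
        for child in parent:
            edges.add((NAMES[parent.tag], NAMES.get(child.tag, f"<{child.tag}>")))
    return edges


CLEAN = ("<entry><form><orth>chat</orth></form>"
         "<sense><def>cat</def><sense><def>tomcat</def></sense></sense></entry>")
ORPHAN = "<entry><form><orth>pont</orth></form><def>bridge</def></entry>"

print(data_signature(CLEAN) <= S_M)    # compliance test: S_D subset of S_M
print(data_signature(ORPHAN) - S_M)    # edges that break compliance
```

The second sample contains an orphan <def> directly under <entry>, so its signature is not a subset of S_M.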
Silent data • Scenario: querying multiple dictionaries of various types • e.g. presence of full-form lexica for which queries about <sense> do not apply • Identifying all paths from the model which are not realized in the data • SM - SD (set difference)
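With signatures at hand, silent paths and query routing reduce to set operations. The sketch below assumes signatures are already available as sets of CDC paths; the dictionary names, the paths and the helper names are invented for illustration.

```python
# Hypothetical model signature S_M as a set of CDC paths.
S_M = frozenset({
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.partOfSpeech",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.subjectField",
})

# Hypothetical data signatures S_D, one per dictionary.
SIGNATURES = {
    "full_form_lexicon": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.partOfSpeech",
    },
    "learner_dictionary": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$sense.definition",
    },
}


def silent_paths(s_d):
    """Paths allowed by the model but never realized in the data: S_M - S_D."""
    return S_M - s_d


def targets(cdc_path):
    """Dictionaries for which a query on this path can return anything."""
    return sorted(name for name, s_d in SIGNATURES.items() if cdc_path in s_d)


print(sorted(silent_paths(SIGNATURES["full_form_lexicon"])))
print(targets("$lexicalEntry.$sense.definition"))
```

A query about senses is then simply not sent to the full-form lexicon, instead of coming back empty without explanation.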
Noisy data • TEI-encoded data which do not fulfil LMF compliance • Checking process • Compiling all paths present in the data as a CDC graph • Comparison with the CDC paths allowed by the model • Note that the data can still be queried • Depending on the semantics, with lower recall and precision
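The checking process can be sketched by counting how often each path occurs in the data and splitting the counts into model-compliant and noisy ones. For brevity the sketch keeps TEI element names as path steps instead of mapping them to components and data categories; the model paths, the sample entries and the function name are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical set of leaf paths allowed by the model, in TEI element names.
S_M = {"entry/form/orth", "entry/gramGrp/pos", "entry/sense/def", "entry/sense/usg"}

SAMPLE = """<body>
  <entry>
    <form><orth>chat</orth></form>
    <pos>noun</pos>
    <sense><def>cat</def></sense>
  </entry>
  <entry>
    <form><orth>pont</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <def>bridge</def>
  </entry>
</body>"""


def leaf_paths(elem, prefix=""):
    """Yield the path from `elem` down to each leaf element below it."""
    here = f"{prefix}/{elem.tag}" if prefix else elem.tag
    if len(elem) == 0:
        yield here
    for child in elem:
        yield from leaf_paths(child, here)


root = ET.fromstring(SAMPLE)
counts = Counter(p for entry in root.iter("entry") for p in leaf_paths(entry))

queryable = {p: n for p, n in counts.items() if p in S_M}
noisy = {p: n for p, n in counts.items() if p not in S_M}

print(queryable)
print(noisy)
```

The counts give a rough picture of where the non-compliant material sits (here an orphan <pos> and an orphan <def>), while the compliant paths remain queryable.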
What’s next • The issue of querying language resources should be accompanied by an enforcement of models • Integration within a language resource query language agenda (bringing in semi-structured database specialists) • Going blind? • Procedures for identifying compatibilities between queries and data • Data quality check • Recommendations for DARIAH & CLARIN • LMF additional part? • Not just a technical issue…
Trend: TEI reaching out to new communities • Bringing back existing communities of practice