
Data models and the (blind ?) query of lexical resources

Work in progress. Data models and the (blind ?) query of lexical resources. Laurent Romary, Inria & HUB. Argument: querying language resources requires precise knowledge of the underlying representation model.


Presentation Transcript


  1. Work in progress • Data models and the (blind ?) query of lexical resources • Laurent Romary, Inria & HUB

  2. Argument • Querying language resources requires precise knowledge of the underlying representation model • Given the over-expressiveness of the TEI, we need complementary (possibly local – aka crystals) models to enforce reference constructs • The TEI dictionary is a good use case to start with, given the existence of LMF as an underlying model

  3. Querying semi-structured data • Characteristics • Order (sequences or sets) • Recursion (depth-free?) • Typing (local or structural) • Schema-driven or not • Select-from-where and DB models • Traditionally: work on the link between database models and the corresponding query structures (ERM) • Recent proposals in the semi-structured document community (the Abiteboul school => cf. peri-XQuery work; D. Florescu) • Importance of paths (set semantics) • Role of patterns (e.g. pairs, sequences, car.cdr etc.) as typing mechanisms • Very little work on data with low conformance to a reference model • Data with low predictability • Objective: limiting unpredictability in the lexical domain

  4. Lexical data is a messy field • From full-form lexica for NLP to encyclopaedic dictionaries • Legacy unstructured/unpredictable data • Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica) • Two core reference traditions/models/serialisations • Onomasiological • Concept to term; as is the case for most terminological databases • ISO 16642 (TMF – Terminological Markup Framework) • Natural serialisation in ISO 30042 (TBX) • Semasiological • Word to sense; as implemented in traditional dictionaries • ISO 24613 (LMF – Lexical Markup Framework) • Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI offers just the background we need

  5. TEI – a wealth of possibilities, reflecting messiness • Orphan grammatical descriptors [corrected!] • Orphan sense descriptors • <def>, <usg>, <cit> etc. can occur outside a sense • Multiple elements to provide the “same” information • E.g. <gloss> vs. <cit> • General issues • Free text can occur everywhere • Existence and usage of large-coverage TEI classes • E.g. <sense>: (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*

  6. model.global • model.global.edit [addSpan damageSpan delSpan gap space] • model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline] • model.global.spoken [incident kinesic pause shift vocal writing] • model.milestoneLike [anchor cb fw gb lb milestone pb] • model.noteLike [note witDetail] • figure metamark notatedMusic

  7. Model – principles • General modelling strategy from ISO/TC 37 (cf. Object Management Group) • Meta-model • General, underlying model that informs current practice • Data categories • Provide the elementary descriptors to instantiate models • Possibly registered/standardised/reused from ISOcat • Any serialisation isomorphic with a given model is acceptable • “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services

  8. LMF-TEI meta-model components (simplified) • [Diagram] Components: Lexicon, Lexical entry (<entry>), Form (<form>), Morphology (<gramGrp>), Sense (<sense>); the links between components are labelled with 1..1 and 0..n cardinalities

  9. Main data categories • [Diagram] Same meta-model components as on the previous slide (Lexicon, Lexical entry, Form, Morphology, Sense), annotated with data categories • /part of speech/ (<pos>) • /inflexional class/ (<itype>) • /gender/ (<gen>) • /number/ (<number>) • /case/ (<case>) • /person/ (<per>) • /tense/ (<tns>) • /mood/ (<mood>) • /orthography/ (<orth>) • /pronunciation/ (<pron>) • /hyphenation/ (<hyph>) • /syllabification/ (<syll>) • /stress pattern/ (<stress>) • /definition/ (<def>) • /example/ (<eg>) • /usage/ (<usg>) • /etymology/ (<etym>)

  10. Examples of constraints • Forbid the usage of <entryFree>, <dictScrap> (status of <hom>, <superEntry> to be determined) • Systematic use of a grammatical container (<gramGrp>) for all grammatical features • Limit the usage of <gramGrp> to <entry>, <form>, <sense>, <cit> • Only allow semantic descriptors in <form> (usage constraints), <sense> and <cit> (for contextualizing an example) • … cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
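
Constraints of this kind can be checked mechanically. The following is a minimal sketch, not the author's tooling: it covers only the first three rules above, and it assumes namespace-free element names for brevity (real TEI documents use the namespace http://www.tei-c.org/ns/1.0). The names `FORBIDDEN`, `GRAMGRP_PARENTS`, `GRAM_FEATURES` and `check_constraints` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed rule tables, taken from the constraints listed on the slide.
FORBIDDEN = {"entryFree", "dictScrap"}
GRAMGRP_PARENTS = {"entry", "form", "sense", "cit"}
GRAM_FEATURES = {"pos", "itype", "gen", "number", "case", "per", "tns", "mood"}


def check_constraints(root):
    """Return (tag, message) pairs for every violation, in document order."""
    violations = []
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter():
        parent = parent_of.get(el)
        parent_tag = parent.tag if parent is not None else None
        if el.tag in FORBIDDEN:
            violations.append((el.tag, "forbidden element"))
        if el.tag == "gramGrp" and parent_tag not in GRAMGRP_PARENTS:
            violations.append((el.tag, f"not allowed inside <{parent_tag}>"))
        if el.tag in GRAM_FEATURES and parent_tag != "gramGrp":
            violations.append((el.tag, "grammatical feature outside <gramGrp>"))
    return violations


# Invented sample data: <pos> sits directly in <form>, and <entryFree> is used.
SAMPLE = """
<body>
  <entry>
    <form><orth>chat</orth><pos>noun</pos></form>
    <gramGrp><gen>m</gen></gramGrp>
    <sense><def>small domesticated feline</def></sense>
  </entry>
  <entryFree>chat, n. m.</entryFree>
</body>
"""

for tag, message in check_constraints(ET.fromstring(SAMPLE)):
    print(f"<{tag}>: {message}")
```

On the sample, the orphan `<pos>` and the `<entryFree>` are reported, while the `<gramGrp>` directly under `<entry>` passes.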

  11. Paths • Baseline when using (XML) TEI documents: • XPath: entry/sense/usg[@type='geo'] • Issues • Model-agnostic – serialization-specific • Model-based query language (component-data-category (CDC) path) • Pointing to explicit components and data categories • $lexicalEntry.$sense.geographicalUsage • A CDC path can be checked for compatibility with the model • We can consider the compiled set of all paths compatible with the model: the CDC graph • Natural interface with DB/faceting environments such as ElasticSearch
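
To make the contrast between the two kinds of path concrete, here is a small sketch that checks a CDC path against a model and then compiles it into a serialization-specific XPath. Everything in it is an assumption made for illustration: the slides do not define the CDC syntax beyond the example above, and the model tables, the TEI mapping (including `usg[@type='dom']` for a subject field) and the function names are hypothetical.

```python
# Hypothetical model: which components nest, and which data categories
# each component may carry.
CHILDREN = {
    "lexicon": {"lexicalEntry"},
    "lexicalEntry": {"form", "morphology", "sense"},
    "sense": {"sense"},
}
DATA_CATEGORIES = {
    "form": {"orthography", "pronunciation"},
    "morphology": {"partOfSpeech", "gender", "number"},
    "sense": {"definition", "example", "geographicalUsage", "subjectField"},
}

# Hypothetical TEI serialization of components and data categories.
COMPONENT_XPATH = {
    "lexicalEntry": "entry",
    "form": "form",
    "morphology": "gramGrp",
    "sense": "sense",
}
CATEGORY_XPATH = {
    "orthography": "orth",
    "pronunciation": "pron",
    "partOfSpeech": "pos",
    "gender": "gen",
    "number": "number",
    "definition": "def",
    "example": "eg",
    "geographicalUsage": "usg[@type='geo']",
    "subjectField": "usg[@type='dom']",
}


def parse_cdc_path(path):
    """Split '$a.$b.category' into (['a', 'b'], 'category')."""
    *components, category = path.split(".")
    if not components or category.startswith("$"):
        raise ValueError(f"expected components followed by a data category: {path}")
    if not all(c.startswith("$") for c in components):
        raise ValueError(f"components must be prefixed with '$': {path}")
    return [c[1:] for c in components], category


def is_compatible(path):
    """True if each component nests in the previous one and the final
    component allows the requested data category."""
    components, category = parse_cdc_path(path)
    for parent, child in zip(components, components[1:]):
        if child not in CHILDREN.get(parent, set()):
            return False
    return category in DATA_CATEGORIES.get(components[-1], set())


def to_xpath(path):
    """Compile a compatible CDC path into its (assumed) TEI XPath."""
    components, category = parse_cdc_path(path)
    steps = [COMPONENT_XPATH[c] for c in components]
    steps.append(CATEGORY_XPATH[category])
    return "/".join(steps)


print(is_compatible("$lexicalEntry.$sense.geographicalUsage"))  # True
print(is_compatible("$lexicalEntry.$form.geographicalUsage"))   # False
print(to_xpath("$lexicalEntry.$sense.geographicalUsage"))
# entry/sense/usg[@type='geo']
```

The query is written once against the model; only the two mapping tables would change for a different serialization.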

  12. Queries • Retrieval of a specific entry given a constraint on the form • Token to word-form mapping • $lexicalEntry.$form[orthography='chats'] • Retrieval of a sense from an entry given additional constraints • $lexicalEntry.$sense*[subjectField='nautical'] • Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs • … • Extraction of all (or some) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples • …
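
As a rough illustration of what the first two queries amount to on TEI data, the sketch below evaluates them with Python's standard `xml.etree.ElementTree`. The sample lexicon is invented, element names are namespace-free for brevity, the subject field is assumed to be serialized as `<usg type="dom">`, and `$sense*` is read here as "a sense at any depth", which is an interpretation rather than something the slides state.

```python
import xml.etree.ElementTree as ET

# Invented sample lexicon with two entries.
LEXICON = """
<body>
  <entry>
    <form type="lemma"><orth>chat</orth></form>
    <form type="inflected"><orth>chats</orth></form>
    <sense><def>small domesticated feline</def></sense>
  </entry>
  <entry>
    <form type="lemma"><orth>bord</orth></form>
    <sense><usg type="dom">nautical</usg><def>side of a ship</def></sense>
    <sense><def>edge</def></sense>
  </entry>
</body>
"""


def entries_with_form(root, orthography):
    """Entries having at least one form with the given orthography
    (the token to word-form mapping case)."""
    return [
        entry
        for entry in root.iter("entry")
        if any(orth.text == orthography for orth in entry.findall("form/orth"))
    ]


def senses_with_subject_field(entry, field):
    """Senses of an entry, at any depth, marked with the given subject field."""
    return [
        sense
        for sense in entry.iter("sense")
        if any(
            usg.get("type") == "dom" and usg.text == field
            for usg in sense.findall("usg")
        )
    ]


root = ET.fromstring(LEXICON)

for entry in entries_with_form(root, "chats"):
    print(entry.findtext("form/orth"))  # lemma of the matching entry

for entry in root.iter("entry"):
    for sense in senses_with_subject_field(entry, "nautical"):
        print(sense.findtext("def"))
```

Both functions hard-code TEI element names, which is exactly the serialization dependence that a model-based path language is meant to hide.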

  13. Signatures • Objective: characterizing the data as compliant with a given model (M) • Identification of queryable data (D) • Principle • S_M: construction of a compiled graph of components and data categories allowed by a model (component-DC graph) • S_D: construction of the compiled graph of CDC paths from the data • S_D must be a subset of S_M (S_D ⊆ S_M)
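
The signature principle can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: paths are approximated by tuples of namespace-free TEI element names rather than true component/data-category names, the model table `MODEL` is hypothetical, and recursion (senses within senses) is cut off at an arbitrary `max_depth` so that the compiled signature stays finite.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: allowed children for each element below <entry>.
MODEL = {
    "entry": {"form", "gramGrp", "sense"},
    "form": {"orth", "pron"},
    "gramGrp": {"pos", "gen", "number"},
    "sense": {"def", "usg", "sense"},
}


def model_signature(root="entry", max_depth=4):
    """S_M: every path the model allows, up to max_depth steps."""
    paths = set()

    def walk(path):
        paths.add(path)
        if len(path) < max_depth:
            for child in MODEL.get(path[-1], ()):
                walk(path + (child,))

    walk((root,))
    return paths


def data_signature(xml_text, root="entry"):
    """S_D: every path actually realised below an entry in the data."""
    paths = set()

    def walk(element, path):
        paths.add(path)
        for child in element:
            walk(child, path + (child.tag,))

    for entry in ET.fromstring(xml_text).iter(root):
        walk(entry, (root,))
    return paths


def is_compliant(xml_text):
    """Data complies with the model when S_D is a subset of S_M."""
    return data_signature(xml_text) <= model_signature()


# Invented samples: the second one has <pos> directly under <entry>.
COMPLIANT = (
    "<body><entry><form><orth>chat</orth></form>"
    "<gramGrp><pos>noun</pos></gramGrp>"
    "<sense><def>feline</def></sense></entry></body>"
)
NON_COMPLIANT = (
    "<body><entry><form><orth>chat</orth></form>"
    "<pos>noun</pos></entry></body>"
)

print(is_compliant(COMPLIANT))      # True
print(is_compliant(NON_COMPLIANT))  # False
```

One consequence of the depth cut-off is that data nested more deeply than `max_depth` would be reported as non-compliant, so the bound has to be chosen with the data in mind.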

  14. Silent data • Scenario: querying multiple dictionaries of various types • E.g. presence of full-form lexica for which queries about <sense> do not apply • Identifying all paths from the model which are not realized in the data • S_M − S_D (set difference)
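
With signatures already compiled, silent paths reduce to a set difference, and the same signatures can be used to decide which dictionaries a given query can be sent to. The sketch below assumes precomputed signatures written as dotted strings; the two dictionaries and all names are invented for illustration.

```python
# Assumed model signature S_M, written as dotted paths.
S_M = {
    "entry",
    "entry.form",
    "entry.form.orth",
    "entry.gramGrp",
    "entry.gramGrp.pos",
    "entry.sense",
    "entry.sense.def",
    "entry.sense.usg",
}

# Assumed data signatures S_D for two invented resources.
SIGNATURES = {
    "full_form_lexicon": {
        "entry",
        "entry.form",
        "entry.form.orth",
        "entry.gramGrp",
        "entry.gramGrp.pos",
    },
    "general_dictionary": set(S_M),
}


def silent_paths(model_signature, data_signature):
    """Paths allowed by the model but never realised in the data."""
    return model_signature - data_signature


def dictionaries_for(query_path, signatures):
    """Names of the resources in which the queried path is realised."""
    return sorted(
        name for name, signature in signatures.items() if query_path in signature
    )


print(sorted(silent_paths(S_M, SIGNATURES["full_form_lexicon"])))
# ['entry.sense', 'entry.sense.def', 'entry.sense.usg']
print(dictionaries_for("entry.sense.def", SIGNATURES))
# ['general_dictionary']
```

A query about senses can then skip the full-form lexicon altogether instead of returning an empty result that looks like a genuine absence.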

  15. Noisy data • TEI-encoded data which are not LMF-compliant • Checking process • Compiling all paths realized in the data as a CDC graph • Comparison with the CDC paths allowed by the model • Note that the data can still be queried • Depending on the semantics of the query, with lower recall and precision
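
The checking process can be sketched as the reverse set difference, with occurrence counts to show how much of the data falls outside the model. This is only an illustration: the model signature is an assumed toy set of dotted paths, element names are namespace-free, and the "covered share" is a rough proxy introduced here, not a measure defined in the slides.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Assumed model signature S_M, written as dotted paths.
S_M = {
    "entry",
    "entry.form",
    "entry.form.orth",
    "entry.sense",
    "entry.sense.def",
}


def path_counts(xml_text, root="entry"):
    """Count how often each path occurs below the entries of the data."""
    counts = Counter()

    def walk(element, path):
        counts[path] += 1
        for child in element:
            walk(child, f"{path}.{child.tag}")

    for entry in ET.fromstring(xml_text).iter(root):
        walk(entry, root)
    return counts


def noise_report(counts, model_signature):
    """Return the noisy paths (in the data, not in the model) with their
    counts, and the share of nodes lying on model-compatible paths."""
    noisy = {p: n for p, n in counts.items() if p not in model_signature}
    total = sum(counts.values())
    covered = total - sum(noisy.values())
    return noisy, covered / total


# Invented sample: the second entry has an orphan <def> outside <sense>.
DATA = """
<body>
  <entry><form><orth>chat</orth></form><sense><def>feline</def></sense></entry>
  <entry><form><orth>bord</orth></form><def>edge</def></entry>
</body>
"""

noisy, share = noise_report(path_counts(DATA), S_M)
print(noisy)
print(round(share, 3))
```

In this sample a query on `entry.sense.def` still runs, but it misses the definition "edge" stored in the orphan `<def>`, which is the loss of recall the slide refers to.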

  16. What’s next • Work on querying language resources should go hand in hand with the enforcement of models • Integration within a language resource query language agenda (bringing in semi-structured database specialists) • Going blind? • Procedures for identifying compatibilities between queries and data • Data quality check • Recommendations for DARIAH & CLARIN • LMF additional part? • Not just a technical issue…

  17. Trend: TEI reaching out to new communities • Bringing back existing communities of practice
