Relations between Data Categories

Relations between Data Categories. Jan Odijk, CLARIN-NL/UiL-OTS January 8, 2010 MPI, Nijmegen. Relations between Data Categories. Data Categories & ISOCAT Relations for Search Data Category Sets Relations for Mapping DCs Structured Elements Mapping & Structure Conclusions.

  1. Relations between Data Categories • Jan Odijk, CLARIN-NL/UiL-OTS • January 8, 2010 • MPI, Nijmegen

  2. Relations between Data Categories • Data Categories & ISOCAT • Relations for Search • Data Category Sets • Relations for Mapping DCs • Structured Elements • Mapping & Structure • Conclusions

  3. Defining the Topic: Data Categories • Data Categories (DC) are defined in a Registry • (Some) Data Categories are made part of a standard • (Contribution to) Semantic Interoperability by • Using the standardized DCs, or • Mapping one’s own DC to a standardized DC

  4. Interoperability by Mapping DCs • Example (simple, naïve): • LR1: uses DC myDC • LR2: uses DC yourDC • Standard has ISODC • myDC <=> ISODC • yourDC <=> ISODC • Therefore: myDC <=> yourDC <=> hisDC • ISOCAT standardized DCs serve as a pivot (cf. interlingua, IL) • With an IL, 2n mappings are needed instead of n*(n-1) • Gain when n>3
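
The mapping counts on this slide can be checked with a minimal sketch. The function names and the `to_pivot` table are illustrative assumptions, not part of ISOCAT; mappings are counted as directed, which is what n*(n-1) implies.

```python
def direct_mappings(n: int) -> int:
    """Directed mappings needed when every DC set is mapped to every other one."""
    return n * (n - 1)


def pivot_mappings(n: int) -> int:
    """Directed mappings needed when every DC set is mapped to and from a pivot."""
    return 2 * n


# Hypothetical mapping table: each local DC name points at its pivot DC.
to_pivot = {"myDC": "ISODC", "yourDC": "ISODC"}


def equivalent(dc_a: str, dc_b: str) -> bool:
    """Treat two local DCs as equivalent if both map to the same pivot DC."""
    return dc_a in to_pivot and to_pivot[dc_a] == to_pivot.get(dc_b)


# Print n, the pairwise count, and the pivot count for a few small n.
for n in range(2, 7):
    print(n, direct_mappings(n), pivot_mappings(n))
```

At n=3 both counts are 6, so the pivot only pays off from n=4 onwards, which matches "gain when n>3".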

  5. Relations for Searching (Odijk 2009) • Find closely related DCs in ISOCAT • Grammatical Relation is used in the definition of transitive • Grammatical Relation is not a DC in ISOCAT • Grammatical Function is a DC in ISOCAT (DC-1296) • syntacticFunction is a DC in ISOCAT (DC-1507) • “Syntactic function” is a DC in ISOCAT (DC-2244) • Dependency is a DC in ISOCAT (DC-2323) • Problem: • How do I find alternative names for the same concept in ISOCAT? • How do I find closely related DCs in ISOCAT? • It currently requires a linear manual search, even across different profiles! • Grouping closely related concepts together would help • => e.g. in (multiple) trees, implemented by relations between DCs
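
A minimal sketch of how relations between DCs could replace the linear manual search. The DC identifiers and names are the ones listed on the slide; the `RELATED` links between them and the function name are hypothetical, since ISOCAT does not provide such relations.

```python
from collections import deque

# DC identifiers and names as listed on the slide.
NAMES = {
    "DC-1296": "Grammatical Function",
    "DC-1507": "syntacticFunction",
    "DC-2244": "Syntactic function",
    "DC-2323": "Dependency",
}

# Hypothetical "closely related" links; these are assumptions for illustration.
RELATED = [
    ("DC-1296", "DC-1507"),
    ("DC-1507", "DC-2244"),
    ("DC-2244", "DC-2323"),
]


def related_dcs(start):
    """Return all DCs reachable from start via 'related' links, excluding start."""
    neighbours = {}
    for a, b in RELATED:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {start})


for dc in related_dcs("DC-1296"):
    print(dc, NAMES[dc])
```

With such links in place, finding one of the four DCs is enough to reach the other three.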

  6. Data Category Sets • For each coherent data category set a DC must exist to identify it, e.g. in the value domain of the DC morphosyntacticTagSet: • STTS • Penn tagSet • CGN tagSet • ISOCAT must represent/group them as a set • Data Category Selections (DCS) appear suited for this • They should be reusable by anyone • But no PIDs are provided for DCS

  7. Implicit Semantics: ‘Mime’-like approach • Pragmatic option • Resource/Tool 1 specifies: tagSet=STTS • Resource/Tool 2 specifies: tagSet=STTS • Match is found => interoperability • Semantics of STTS is left implicit • Identity of semantics suffices • Occurs often, is simple and must be supported

  8. Semantics Explicit: Mapping: where? • Option 1: Directly in an XML file schema • “PID can be embedded in the schemata of linguistic resources” • http://www.csc.fi/english/pages/neeri09/workshop/materials/windhouwer.pdf, slide 8 • Will that allow complex mappings as given above? • Option 2: In separate files • Needed for commonly used coherent subsets (e.g. Penn Treebank Tagset, STTS, CGN Tagset, etc.) • To avoid duplication, inconsistency, etc. • Is that possible now?

  9. Relations for Mapping DCs: • Option 1: • myDC, yourDC outside ISOCAT • ISODC inside ISOCAT • All ISOCAT DCs are part of the DC IL • Option 2: • myDC, yourDC inside ISOCAT • ISODC inside ISOCAT • Only a subset of ISOCAT DCs is part of the DC IL

  10. Relations for Mapping DCs: • Option 1 is the most natural, but • Option 2 is desirable for members of de facto standard data category sets • Mapping between ISOCAT DCs can be implemented by relations between ISOCAT DCs

  11. Structured Elements (1) • ISOCAT has no provisions for structured elements except for • Strings (sequences of characters) • Regular expressions (REs) over strings • But many are actually in use: • Attribute-Value Pairs (AV-Pairs) • Attribute is a DC • Value must be • of attribute DC type and • from attribute DC Conceptual Domain • Records/AV matrices • Which AV-Pairs are possible/mandatory for noun, verb, etc.

  12. Structured Elements (2) • Lists • E.g. HPSG SUBCAT attribute: [NPnom, NPacc] • Trees/Tree Models • E.g. DUELME database (Dutch Multiword Expressions) • SAID (LDC2003T10) • Treebanks • Structured categories as in Categorial Grammar • np\s/np, np/np, etc.

  13. Structured Elements (3) • Sets • E.g. set of verb patterns in Rosetta • Subcat patterns in Alpino: • {intransitive, transitive, pc_pp(aan)} (breien ‘to knit’) • Parameterized values • E.g. Alpino: pc_pp(aan) • i.e. prepositional complement of syntactic category PP with aan as head

  14. Mapping & Structure (1) • Mapping of DCs is actually mapping of DC combinations • => often requires structure • Structures are also needed if there is to be a pivot • Examples • Combination: Atomic DC => A-V pair combination: • ISOCAT => Rosetta • Transitive => • thetavp=vp120 & synvps=[synNP] & caseAssigner=True

  15. Mapping & Structure (2) • Penn Treebank => ISOCAT • JJR => • partOfSpeech=adjective & degree=comparative • STTS Tagset => ISOCAT • VVIMP => • partOfSpeech=verb & main verb & mood=imperative
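
The two tag mappings above can be sketched as lookups from an atomic tag to a combination of attribute-value pairs. This is only an illustration: the table and function names are assumptions, and `verbType` is a hypothetical attribute name, since the slide only says "main verb" without naming the attribute.

```python
# Each tag maps to a combination of attribute-value pairs, not to a single DC.
PENN_TO_FEATURES = {
    "JJR": {"partOfSpeech": "adjective", "degree": "comparative"},
}

STTS_TO_FEATURES = {
    # "verbType" is a hypothetical attribute name; the slide only says "main verb".
    "VVIMP": {"partOfSpeech": "verb", "verbType": "main", "mood": "imperative"},
}


def to_features(tag, table):
    """Return a copy of the feature combination for a tag, or None if unmapped."""
    features = table.get(tag)
    return dict(features) if features is not None else None


print(to_features("JJR", PENN_TO_FEATURES))
print(to_features("VVIMP", STTS_TO_FEATURES))
```

Returning a copy keeps callers from accidentally modifying the shared mapping table.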

  16. Mapping & Structure (3) • List: Atomic DC => List: • ISOCAT, Alpino => HPSG • Transitive => • [NPnom, NPacc] • Combination => parameterized value • Rosetta => Alpino • synPREPNP in synvps & prepkey1=aan => • pc_pp(aan) • (in fact: subcats ∪= {pc_pp(aan)})
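
A rough sketch of the Rosetta-to-Alpino case, assuming (hypothetically) that a Rosetta entry is available as a dictionary with the keys `synvps` and `prepkey1` used on the slide. Neither function is part of Rosetta or Alpino; they only illustrate the parameterized value and the set union.

```python
import re


def parse_param_value(value):
    """Split 'pc_pp(aan)' into ('pc_pp', 'aan'); plain values get parameter None."""
    m = re.fullmatch(r"(\w+)\((\w+)\)", value)
    return (m.group(1), m.group(2)) if m else (value, None)


def rosetta_to_alpino(entry, subcats=None):
    """Add pc_pp(<prep>) to the subcat set if the entry has synPREPNP and a prepkey1."""
    result = set(subcats or ())
    if "synPREPNP" in entry.get("synvps", []) and entry.get("prepkey1"):
        result |= {f"pc_pp({entry['prepkey1']})"}  # the union step from the slide
    return result


entry = {"synvps": ["synPREPNP"], "prepkey1": "aan"}  # hypothetical entry layout
print(sorted(rosetta_to_alpino(entry, {"intransitive", "transitive"})))
print(parse_param_value("pc_pp(aan)"))
```

The result is a set because the mapping adds one pattern to an existing subcat set instead of replacing it.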

  17. Mapping & Structure (4) • Union: German Adjectives • Morphosyntactic features: • Gender (3), Case (4), • Number (2), Declension type (3) • In theory 3*4*2*3=72 distinctions • Gender is neutralized in plural • So: 3*4*1*3 + 4*1*3=36+12=48 distinctions • Only 5 forms are used; the corresponding tags are eForm, erForm, esForm, emForm, enForm
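
The counts on this slide can be reproduced by enumeration. The value labels below are abbreviations chosen for this sketch (the slide gives only the number of values per feature), and gender neutralization is modelled by replacing the gender with `None` in the plural.

```python
from itertools import product

GENDERS = ["m", "f", "n"]
CASES = ["nom", "gen", "dat", "acc"]
NUMBERS = ["sg", "pl"]
DECLENSIONS = ["str", "mixed", "weak"]

# All theoretical feature combinations.
all_combos = list(product(GENDERS, CASES, NUMBERS, DECLENSIONS))

# Neutralize gender in the plural, then deduplicate via a set.
distinct = {
    (gender if number == "sg" else None, case, number, decl)
    for gender, case, number, decl in all_combos
}

print(len(all_combos), len(distinct))
```

The singular contributes 3*4*3=36 combinations and the plural 4*3=12, which gives the 48 stated on the slide.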

  18. Mapping & Structure (4) • Map enForm to a union of combinations of morphosyntactic features: • enForm => • m sg acc str ∨ m sg gen str ∨ n sg gen str ∨ dat pl str ∨ • dat sg mixed ∨ gen sg mixed ∨ pl mixed ∨ m sg acc mixed ∨ • dat sg weak ∨ gen sg weak ∨ pl weak ∨ m sg acc weak • (using underspecification for gender in some cases)
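
One way to read the enForm mapping is as a disjunction of partial feature specifications, where a missing feature means "underspecified". The sketch below encodes the twelve alternatives from the slide; the attribute names (`gender`, `number`, `case`, `decl`) and the function names are assumptions made for illustration.

```python
# The twelve alternatives from the slide; an absent key is underspecified.
ENFORM = [
    {"gender": "m", "number": "sg", "case": "acc", "decl": "str"},
    {"gender": "m", "number": "sg", "case": "gen", "decl": "str"},
    {"gender": "n", "number": "sg", "case": "gen", "decl": "str"},
    {"case": "dat", "number": "pl", "decl": "str"},
    {"case": "dat", "number": "sg", "decl": "mixed"},
    {"case": "gen", "number": "sg", "decl": "mixed"},
    {"number": "pl", "decl": "mixed"},
    {"gender": "m", "number": "sg", "case": "acc", "decl": "mixed"},
    {"case": "dat", "number": "sg", "decl": "weak"},
    {"case": "gen", "number": "sg", "decl": "weak"},
    {"number": "pl", "decl": "weak"},
    {"gender": "m", "number": "sg", "case": "acc", "decl": "weak"},
]


def matches(full, partial):
    """True if every feature named in the partial spec has the same value in full."""
    return all(full.get(key) == value for key, value in partial.items())


def is_enform(full):
    """True if the fully specified bundle satisfies at least one alternative."""
    return any(matches(full, alternative) for alternative in ENFORM)


print(is_enform({"gender": "f", "number": "sg", "case": "dat", "decl": "weak"}))
print(is_enform({"gender": "f", "number": "sg", "case": "nom", "decl": "weak"}))
```

Mapping in the other direction, from a feature bundle to a form tag, then amounts to testing the bundle against each tag's list of alternatives.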

  19. Mapping & Structure (5) • Conclusion: • One often cannot map DCs in isolation • But must map a whole entry (record) to a new entry (set of entries) • Entry = a combination of Data Categories • Lexicon entry or • Annotated text entry • Or even more complex: • multiple entries => multiple sets of entries

  20. Mapping & Structure (6) • Questions: • Additional means are needed to provide structures as pivot • Does LMF provide part of this for lexicons? • Does anything exist for ‘entries’ in text corpora? • How can we specify relations between combinations of DCs?

  21. Conclusions • Relations between DCs are needed for grouping synonymous/closely related DCs (=> Easier Search) • De facto standard DC sets must be included in ISOCAT • Cf. Erhard Hinrichs 2009 • A subset of ISOCAT DCs should be marked as members of the IL • Data Category Selections need a PID • Mapping requires relations between DC combinations • Mapping via IL requires • Standardized lexicon entry model • Standardized annotated text entry model
