Relations between Data Categories. Jan Odijk, CLARIN-NL/UiL-OTS January 8, 2010 MPI, Nijmegen. Relations between Data Categories. Data Categories & ISOCAT Relations for Search Data Category Sets Relations for Mapping DCs Structured Elements Mapping & Structure Conclusions.
Relations between Data Categories Jan Odijk, CLARIN-NL/UiL-OTS January 8, 2010 MPI, Nijmegen
Relations between Data Categories • Data Categories & ISOCAT • Relations for Search • Data Category Sets • Relations for Mapping DCs • Structured Elements • Mapping & Structure • Conclusions
Defining the Topic: Data Categories • Data Categories (DC) are defined in a Registry • (Some) Data Categories are made part of a standard • (Contribution to) Semantic Interoperability by • Using the standardized DCs, or • Mapping one’s own DC to a standardized DC
Interoperability by Mapping DCs • Example (simple, naïve): • LR1: uses DC myDC • LR2: uses DC yourDC • Standard has ISODC • myDC <=> ISODC • yourDC <=> ISODC • Therefore: myDC <=> yourDC <=> hisDC • ISOCAT standardized DCs serve as a pivot (cf. interlingua, IL) • With an IL, 2n mappings are needed instead of n*(n-1) • Gain when n>3
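The pivot arithmetic on this slide can be checked with a small sketch. It assumes directed mappings (one per direction between any two resources) and uses the slide's own placeholder names myDC, yourDC and ISODC; the function names are hypothetical and only illustrate the idea.

```python
def direct_mappings(n: int) -> int:
    """Directed pairwise mappings needed among n data category sets."""
    return n * (n - 1)


def pivot_mappings(n: int) -> int:
    """Mappings needed when every set maps to and from a single pivot."""
    return 2 * n


def compose_via_pivot(to_pivot: dict, from_pivot: dict, dc: str):
    """Map a local DC to another resource's DC by going through the pivot.

    Returns None when either step of the mapping is missing.
    """
    pivot_dc = to_pivot.get(dc)
    return from_pivot.get(pivot_dc)


# Placeholder mappings taken from the slide's example.
lr1_to_iso = {"myDC": "ISODC"}
iso_to_lr2 = {"ISODC": "yourDC"}
result = compose_via_pivot(lr1_to_iso, iso_to_lr2, "myDC")

# (pivot, direct) mapping counts for n = 2..5
counts = {n: (pivot_mappings(n), direct_mappings(n)) for n in range(2, 6)}
```

At n=3 both approaches need 6 mappings, and from n=4 on the pivot is cheaper (8 versus 12), which matches "Gain when n>3".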
Relations for Searching (Odijk 2009) • Find closely related DCs in ISOCAT • Grammatical Relation is used in the definition of transitive • Grammatical Relation is not a DC in ISOCAT • Grammatical Function is a DC in ISOCAT (DC-1296) • syntacticFunction is a DC in ISOCAT (DC-1507) • “Syntactic function” is a DC in ISOCAT (DC-2244) • Dependency is a DC in ISOCAT (DC-2323) • Problem: • How do I find alternative names for the same concept in ISOCAT? • How do I find closely related DCs in ISOCAT? • This currently requires a linear manual search…, even across different profiles! • Grouping closely related concepts together would help • e.g. in (multiple) trees, implemented by relations between DCs
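A minimal sketch of how relations between DCs could support this kind of search. The DC identifiers and names are those listed on the slide; the relation types (`sameAs`, `closelyRelatedTo`) and the specific relation instances are assumptions made for illustration, not something ISOCAT provides.

```python
from collections import defaultdict, deque

# Names and identifiers as listed on the slide.
DC_NAMES = {
    "DC-1296": "Grammatical Function",
    "DC-1507": "syntacticFunction",
    "DC-2244": "Syntactic function",
    "DC-2323": "Dependency",
}

# Hypothetical relation instances (assumed for this sketch only).
RELATIONS = [
    ("DC-1296", "sameAs", "DC-1507"),
    ("DC-1507", "sameAs", "DC-2244"),
    ("DC-2244", "closelyRelatedTo", "DC-2323"),
]


def related(start, relations, allowed=None):
    """Return all DCs reachable from `start` over the given relations.

    Relations are treated as symmetric. If `allowed` is given, only
    relation types in that set are followed.
    """
    graph = defaultdict(set)
    for a, rel, b in relations:
        if allowed is None or rel in allowed:
            graph[a].add(b)
            graph[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


synonyms = related("DC-1296", RELATIONS, allowed={"sameAs"})
neighbourhood = related("DC-1296", RELATIONS)
```

With such relations in place, a query starting from any one of the DCs reaches the others in a single traversal instead of a linear manual search.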
Data Category Sets • For each coherent data category set, a DC must exist to identify it, e.g. in the value domain of the DC morphosyntacticTagSet: • STTS • Penn tagSet • CGN tagSet • ISOCAT must represent/group them as a set • Data Category Selections (DCS) appear suited for this • They should be reusable by anyone • But no PIDs are provided for DCS
Implicit Semantics: ‘Mime’-like approach • Pragmatic option • Resource/Tool 1 specifies: tagSet=STTS • Resource/Tool 2 specifies: tagSet=STTS • Match is found => interoperability • Semantics of STTS is left implicit • identity of semantics suffices • Occurs often, is simple and must be supported
Semantics Explicit: Mapping: where? • Option 1: Directly in an XML file schema • “PID can be embedded in the schemata of linguistic resources” • http://www.csc.fi/english/pages/neeri09/workshop/materials/windhouwer.pdf, slide 8 • Will that allow complex mappings as given above? • Option 2: In separate files • Needed for commonly used coherent subsets (e.g. Penn Treebank Tagset, STTS, CGN Tagset, etc.) • To avoid duplication, inconsistency, etc. • Is that possible now?
Relations for Mapping DCs: • Option 1: • myDC, yourDC outside ISOCAT • ISODC inside ISOCAT • All ISOCAT DCs are part of the DC IL • Option 2: • myDC, yourDC inside ISOCAT • ISODC inside ISOCAT • Only a subset of ISOCAT DCs is part of the DC IL
Relations for Mapping DCs: • Option 1 is most natural, but • Option 2 is desirable for members of de facto standard data category sets • Mapping between ISOCAT DCs can be implemented by relations between ISOCAT DCs
Structured Elements (1) • ISOCAT has no provisions for structured elements except for • Strings (sequences of characters) • REs (regular expressions) over strings • But many structured elements are actually in use: • Attribute Value Pairs (AV-Pairs) • Attribute is a DC • Value must be • of the attribute DC's type and • from the attribute DC's Conceptual Domain • Records/AV matrices • Which AV-Pairs are possible/mandatory for noun, verb, etc.
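The AV-pair constraint on this slide (the value must come from the attribute DC's conceptual domain) can be sketched as a simple check. The attribute names and domain values below are illustrative assumptions, not an extract from ISOCAT.

```python
# Hypothetical conceptual domains for two attribute DCs (assumed values).
CONCEPTUAL_DOMAIN = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "degree": {"positive", "comparative", "superlative"},
}


def is_valid_av_pair(attribute: str, value: str) -> bool:
    """True if the attribute is known and the value is in its domain."""
    domain = CONCEPTUAL_DOMAIN.get(attribute)
    return domain is not None and value in domain


def invalid_pairs(record: dict) -> list:
    """Return the AV-pairs of a record (AV matrix) that fail the check."""
    return [(a, v) for a, v in record.items() if not is_valid_av_pair(a, v)]


good_record = {"partOfSpeech": "adjective", "degree": "comparative"}
bad_record = {"partOfSpeech": "adjective", "degree": "imperative"}
```

A record is then just a set of AV-pairs; which pairs are possible or mandatory per part of speech would need an additional schema on top of this check.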
Structured Elements (2) • Lists • E.g. HPSG SUBCAT attribute: [NPnom, NPacc] • Trees/Tree Models • E.g. DUELME database (Dutch Multiword Expressions) • SAID (LDC2003T10) • Treebanks • Structured categories as in Categorial Grammar • np\s/np, np/np, etc.
Structured Elements (3) • Sets • E.g. set of verb patterns in Rosetta • Subcat patterns in Alpino: • {intransitive, transitive, pc_pp(aan)} (breien ‘to knit’) • Parameterized values • E.g. Alpino: pc_pp(aan) • i.e. prepositional complement of syntactic category PP with aan as head
Mapping & Structure (1) • Mapping of DCs is actually mapping of DC combinations • often requires structure • Structures are also needed if there is to be a pivot • Examples • Combination: Atomic DC => A-V pair combination: • ISOCAT => Rosetta • Transitive => thetavp=vp120 & synvps=[synNP] & caseAssigner=True
Mapping & Structure (2) • Penn Treebank => ISOCAT • JJR => partOfSpeech=adjective & degree=comparative • STTS Tagset => ISOCAT • VVIMP => partOfSpeech=verb & main verb & mood=imperative
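The two tag mappings on this slide can be written down as a lookup from an atomic tag to a combination of attribute-value pairs. This is only a sketch: the slide gives "main verb" without an attribute name, so the attribute `verbType` is an assumption, as are the function names.

```python
# Tag -> feature combination, following the slide's two examples.
# "verbType" is a hypothetical attribute name for the slide's "main verb".
TAG_TO_FEATURES = {
    ("PennTreebank", "JJR"): {
        "partOfSpeech": "adjective",
        "degree": "comparative",
    },
    ("STTS", "VVIMP"): {
        "partOfSpeech": "verb",
        "verbType": "main",
        "mood": "imperative",
    },
}


def to_pivot(tagset: str, tag: str) -> dict:
    """Expand an atomic tag into its pivot feature combination."""
    return dict(TAG_TO_FEATURES[(tagset, tag)])


def matches(tagset: str, tag: str, query: dict) -> bool:
    """True if the tag's feature combination satisfies every query pair."""
    features = to_pivot(tagset, tag)
    return all(features.get(attr) == value for attr, value in query.items())


jjr = to_pivot("PennTreebank", "JJR")
```

Once tags are expanded this way, a search for partOfSpeech=verb can match tags from different tag sets without a direct tag-to-tag mapping.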
Mapping & Structure (3) • List: Atomic DC => List: • ISOCAT, Alpino => HPSG • Transitive => [NPnom, NPacc] • Combination => parameterized value • Rosetta => Alpino • synPREPNP in synvps & prepkey1=aan => pc_pp(aan) • (in fact: subcats ∪= {pc_pp(aan)})
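A sketch of the combination-to-parameterized-value case, using the attribute names from the slide (synvps, prepkey1) and the Alpino value pc_pp(aan). The dictionary layout of a Rosetta entry and both function names are assumptions made for illustration.

```python
import re


def parse_parameterized(value: str):
    """Split a value like 'pc_pp(aan)' into its name and parameter tuple.

    Plain values such as 'transitive' are returned with an empty tuple.
    """
    match = re.fullmatch(r"(\w+)\(([^()]*)\)", value)
    if match is None:
        return value, ()
    args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
    return match.group(1), args


def rosetta_to_alpino(entry: dict):
    """Map 'synPREPNP in synvps & prepkey1=X' to the Alpino value pc_pp(X).

    Returns None when the entry does not have the required combination.
    """
    if "synPREPNP" in entry.get("synvps", []) and "prepkey1" in entry:
        return f"pc_pp({entry['prepkey1']})"
    return None


# Hypothetical entry layout; the mapped value is added to a set of
# subcat patterns, as in the slide's 'subcats ∪= {pc_pp(aan)}'.
subcats = {"intransitive", "transitive"}
mapped = rosetta_to_alpino({"synvps": ["synPREPNP"], "prepkey1": "aan"})
if mapped is not None:
    subcats |= {mapped}
```

The point of the sketch is that the target value is computed from two source attributes together, so neither attribute can be mapped in isolation.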
Mapping & Structure (4) • Union: German Adjectives • Morphosyntactic features: • Gender (3), Case (4), • Number (2), Declension type (3) • In theory 3*4*2*3=72 distinctions • Gender is neutralized in the plural • So: 3*4*1*3 (singular) + 4*1*3 (plural) = 36+12 = 48 distinctions • Only 5 forms are used, with the corresponding tags eForm, erForm, esForm, emForm, enForm
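The counts on this slide can be reproduced by enumerating the feature combinations. The abbreviations below (m/f/n, nom/acc/gen/dat, sg/pl, str/mixed/weak) are assumed labels for the slide's 3 genders, 4 cases, 2 numbers and 3 declension types.

```python
from itertools import product

GENDERS = ("m", "f", "n")
CASES = ("nom", "acc", "gen", "dat")
NUMBERS = ("sg", "pl")
DECLENSIONS = ("str", "mixed", "weak")

# All theoretically possible combinations: 3 * 4 * 2 * 3.
full = list(product(GENDERS, CASES, NUMBERS, DECLENSIONS))

# Neutralize gender in the plural: plural bundles that differ only in
# gender collapse into one bundle with gender left unspecified (None).
neutralized = {
    (None if number == "pl" else gender, case, number, declension)
    for gender, case, number, declension in full
}

singular = [b for b in neutralized if b[2] == "sg"]
plural = [b for b in neutralized if b[2] == "pl"]
```

The enumeration gives 72 combinations in total, and 36 singular plus 12 plural bundles after neutralization, i.e. 48 distinctions to be covered by 5 tags.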
Mapping & Structure (4, continued) • Map enForm to a union (disjunction) of combinations of morphosyntactic features: • enForm => • m sg acc str ∨ m sg gen str ∨ n sg gen str ∨ dat pl str ∨ • dat sg mixed ∨ gen sg mixed ∨ pl mixed ∨ m sg acc mixed ∨ • dat sg weak ∨ gen sg weak ∨ pl weak ∨ m sg acc weak • (using underspecification for gender in some cases)
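The union for enForm can be represented as a list of underspecified feature bundles, where a missing attribute means "any value". The disjuncts below are the twelve listed on the slide; the attribute names and the helper functions are assumptions for this sketch.

```python
# Each dict is one disjunct; attributes left out are underspecified.
EN_FORM = [
    {"gender": "m", "number": "sg", "case": "acc", "declension": "str"},
    {"gender": "m", "number": "sg", "case": "gen", "declension": "str"},
    {"gender": "n", "number": "sg", "case": "gen", "declension": "str"},
    {"number": "pl", "case": "dat", "declension": "str"},
    {"number": "sg", "case": "dat", "declension": "mixed"},
    {"number": "sg", "case": "gen", "declension": "mixed"},
    {"number": "pl", "declension": "mixed"},
    {"gender": "m", "number": "sg", "case": "acc", "declension": "mixed"},
    {"number": "sg", "case": "dat", "declension": "weak"},
    {"number": "sg", "case": "gen", "declension": "weak"},
    {"number": "pl", "declension": "weak"},
    {"gender": "m", "number": "sg", "case": "acc", "declension": "weak"},
]


def subsumes(pattern: dict, bundle: dict) -> bool:
    """True if every attribute fixed in the pattern has that value in the bundle."""
    return all(bundle.get(attr) == value for attr, value in pattern.items())


def has_tag(disjuncts: list, bundle: dict) -> bool:
    """True if the fully specified bundle matches at least one disjunct."""
    return any(subsumes(pattern, bundle) for pattern in disjuncts)


dat_sg_weak_f = {"gender": "f", "number": "sg", "case": "dat", "declension": "weak"}
nom_sg_str_m = {"gender": "m", "number": "sg", "case": "nom", "declension": "str"}
```

Here `has_tag(EN_FORM, dat_sg_weak_f)` holds through the gender-underspecified disjunct "dat sg weak", while the strong masculine nominative singular matches no disjunct.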
Mapping & Structure (5) • Conclusion: • One often cannot map DCs in isolation • But must map a whole entry (record) to a new entry (or set of entries) • Entry = a combination of Data Categories • lexicon entry or • annotated text entry • Or even more complex: • multiple entries => multiple sets of entries
Mapping & Structure (6) • Questions: • Additional means are needed to provide structures as pivot • Does LMF provide part of this for lexicons? • Does anything exist for ‘entries’ in text corpora? • How can we specify relations between combinations of DCs?
Conclusions • Relations between DCs are needed for grouping synonymous/closely related DCs (=> Easier Search) • De facto standard DC sets must be included in ISOCAT • Cf. Erhard Hinrichs 2009 • A subset of ISOCAT DCs should be marked as members of the IL • Data Category Selections need a PID • Mapping requires relations between DC combinations • Mapping via IL requires • Standardized lexicon entry model • Standardized annotated text entry model