A Revised Data Model for the ISO Data Category Registry
Marc Kemps-Snijders, Menzo Windhouwer, Peter Wittenburg, Sue Ellen Wright
TKE 2008
Overview
• The ISO Data Category Registry
  • Syntax
  • ISOcat
• The DCIF model
• Revisions to the data model
• Revisions to the DCIF
• Conclusions and future work
The ISO Data Category Registry (DCR)
• Data categories (DCs):
  • “The result of the specification of a given data field”
  • In practice the specification contains field names, definitions and constraints, and it is expressed in and for various languages
• ISO/TC 37 is revising ISO 12620
  • 12620: contains a fixed list of DCs
  • 12620.2: describes the data model and procedures for the DCR
Syntax http://syntax.inist.fr/
• Implemented by LORIA as a proof of concept
• Contains around 1700 data categories
• However, further adoption by the community requires improved user interface features and fully developed functionalities
ISOcat http://www.isocat.org/
• Under construction by the Max Planck Institute for Psycholinguistics in close collaboration with ISO/TC 37
• Built on the experience gathered with Syntax
Overview
• The ISO Data Category Registry
• The DCIF model
  • The need for a revision
  • Unified Modeling Language
• Revisions to the data model
• Revisions to the DCIF
• Conclusions and future work
Syntax DCIF model
• Data Category Interchange Format (DCIF)
  • A model based on the metamodel of the Terminological Markup Framework (TMF, ISO 16642)
  • Expressed in the Generic Mapping Tool (GMT) XML vocabulary
The DCIF model [diagram]
The need for a revision
• The DCIF is an interchange format, not a data model
• Ideally, ISO 12620.2 will describe both a semantically rich data model and an interchange format
• Experience with Syntax, ongoing work in ISO/TC 37 and the design of ISOcat also revealed problems with the model
• DCR model ≠ term model
Unified Modeling Language
• To express more of the semantics, UML class diagrams are used to describe the revised data model
• When constraints can’t be expressed graphically, they are added as Object Constraint Language (OCL) notes
Overview
• The ISO Data Category Registry
• The DCIF model
• Revisions to the data model
  • Simple and complex DCs
  • Object and working languages
  • Duplication of the English description
  • Types of conceptual domains
  • Sharing simple data categories
• Revisions to the DCIF
• Conclusions and future work
Simple and complex DCs
• Complex DCs have a conceptual domain:
  • Open
  • Closed: a finite list of values represented by simple DCs
• The DCIF model doesn’t distinguish between simple and complex DCs
  • Closed DCs: recognized by their conceptual domain
  • Open DCs can’t be distinguished from simple DCs
Simple and complex DCs
[UML class diagram with the classes: Data Category, Description Section, Simple Data Category, Complex Data Category, Conceptual Domain]
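The distinction drawn on this slide can be sketched in Python. This is a minimal illustration, not the ISOcat implementation: the field names (`identifier`, `definition`, `values`) and the example categories are assumptions, and an empty value list is used here as a stand-in for an open domain.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptionSection:
    # Hypothetical field; the slide only names the class.
    definition: str = ""


@dataclass
class DataCategory:
    identifier: str
    description: DescriptionSection = field(default_factory=DescriptionSection)


@dataclass
class SimpleDataCategory(DataCategory):
    """A single value; it has no conceptual domain of its own."""


@dataclass
class ConceptualDomain:
    # Assumption for this sketch: an empty list stands for an open domain,
    # a non-empty list for a closed domain made of simple DCs.
    values: List[SimpleDataCategory] = field(default_factory=list)

    @property
    def is_closed(self) -> bool:
        return bool(self.values)


@dataclass
class ComplexDataCategory(DataCategory):
    domain: ConceptualDomain = field(default_factory=ConceptualDomain)


# Illustrative categories, not taken from the registry.
noun = SimpleDataCategory("noun")
part_of_speech = ComplexDataCategory("partOfSpeech", domain=ConceptualDomain([noun]))
lemma = ComplexDataCategory("lemma")  # open domain
```

Because simple and complex DCs are separate classes, an open complex DC such as `lemma` can be told apart from a simple DC by its type alone, which the flat DCIF model did not allow.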
Object and working languages
• The language section (LS) provides information on the DC in the context of a specific language:
  • Names and descriptions in various working languages
  • Constraints on the conceptual domains for various object languages
• It’s easy to get confused by the use of one LS for two purposes
Object and working languages
[UML class diagram with the classes: Data Category, Simple Data Category, Complex Data Category, Description Section, Conceptual Domain, Language Section (multiplicity *), Linguistic Section (multiplicity *)]
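A small sketch of the split into two kinds of section, under assumed names: `LanguageSection` describes the DC in a working language, while `LinguisticSection` restricts its values for an object language. The field names, the `values_for` helper and the grammatical gender example are illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class LanguageSection:
    """Describes the DC *in* a working language."""
    name: str
    definition: str = ""


@dataclass
class LinguisticSection:
    """Restricts the DC's values *for* an object language."""
    allowed_values: FrozenSet[str]


@dataclass
class ComplexDataCategory:
    identifier: str
    value_domain: FrozenSet[str]
    # Both maps are keyed by a language code (an assumption of this sketch).
    language_sections: Dict[str, LanguageSection] = field(default_factory=dict)
    linguistic_sections: Dict[str, LinguisticSection] = field(default_factory=dict)

    def values_for(self, object_language: str) -> FrozenSet[str]:
        """Return the values that apply to the given object language."""
        section = self.linguistic_sections.get(object_language)
        if section is None:
            return self.value_domain
        return self.value_domain & section.allowed_values


# Illustrative data only.
gender = ComplexDataCategory(
    "grammaticalGender",
    frozenset({"masculine", "feminine", "neuter"}),
    language_sections={"fr": LanguageSection("genre grammatical")},
    linguistic_sections={"fr": LinguisticSection(frozenset({"masculine", "feminine"}))},
)
```

Here the French name lives in a language section and the French value restriction in a linguistic section, so the two purposes no longer share one structure.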
Duplication of the English description
• The working language of the description section is English
• Alternative descriptions in various languages can be added in their respective language sections
• Getting the English description has become an exception:
  • For English, look in the description section
  • For all other languages, look in their language section
Duplication of the English description
[UML class diagram with the classes: Data Category, Description Section, Language Section (multiplicity *)]
OCL note: self.language section -> one( language = “en” and definition -> notEmpty() )
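The OCL note on this slide requires exactly one language section that is English and has a definition. A rough Python equivalent follows, assuming a `LanguageSection` with `language` and `definition` string fields (both names are assumptions).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LanguageSection:
    language: str
    definition: str = ""


def has_exactly_one_english_definition(sections: Iterable[LanguageSection]) -> bool:
    """Mirror OCL's one(...): exactly one element satisfies the condition."""
    # Whitespace-only definitions are treated as empty here, which is slightly
    # stricter than OCL's notEmpty() and is a choice made for this sketch.
    matches = [s for s in sections if s.language == "en" and s.definition.strip()]
    return len(matches) == 1
```

With this rule in place, English is handled like every other language: its description is found in a language section, and the model guarantees that one exists.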
Types of conceptual domains
• Open data categories can be restricted to a specific data type
• However, the DCIF model has no mechanism for assigning those data type constraints
• The need has arisen to add arbitrary constraints in addition to the data type
Types of conceptual domains
[UML class diagram with the classes: Complex Data Category, Linguistic Section, Conceptual Domain, Constrained Data Category, Constrained Linguistic Section, Schema Specific Domain, Open Data Category, Closed Linguistic Section, Open Conceptual Domain, Closed Data Category, Value Domain, Simple Data Category]
Types of conceptual domains
• Constraints can be expressed in any language:
  • Object Constraint Language (OCL)
  • Semantic Web Rule Language (SWRL)
  • Schematron
  • …
• The DCR can’t be expected to interpret all these languages, so during standardization special attention has to be paid to the coherence of the constraints
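One way to picture a constrained domain is as a data type plus a list of constraints that the registry stores verbatim, each tagged with the language it is written in. The class names, the `"string"` data type and the constraint expressions below are placeholders invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Constraint:
    language: str    # e.g. "OCL", "SWRL", "Schematron"
    expression: str  # stored as text; the registry does not interpret it


@dataclass
class ConstrainedDomain:
    data_type: str
    constraints: List[Constraint] = field(default_factory=list)

    def in_language(self, language: str) -> List[str]:
        """Return the expressions written in one constraint language."""
        return [c.expression for c in self.constraints if c.language == language]


# Placeholder expressions, for illustration only.
domain = ConstrainedDomain(
    data_type="string",
    constraints=[
        Constraint("OCL", "self.size() = 2"),
        Constraint("Schematron", "string-length(.) = 2"),
    ],
)
```

Nothing in this structure checks that the two expressions agree with each other, which is why coherence has to be reviewed during standardization.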
Sharing simple data categories
• The DCR contains many duplicate simple DCs; to prevent this, closed DCs should be able to share them
• Sharing is allowed in the current model, but for some corner cases the model isn’t expressive enough:
  • DCs can be members of multiple profiles, and this membership is stored globally
  • When simple DCs are shared, profile-specific value domains can accidentally merge
Sharing simple data categories
• Input
  • Profile: p1
    • Complex DC: /Ca/ with Simple DC: /Sa/, Simple DC: /Sb/
    • Complex DC: /Cb/ with Simple DC: /Sa/
  • Profile: p2
    • Complex DC: /Ca/ with Simple DC: /Sb/, Simple DC: /Sc/
    • Complex DC: /Cb/ with Simple DC: /Sb/
• Stored
  • Complex DC: /Ca/
    • Belongs to profile p1 and p2
    • Value domain: {/Sa/, /Sb/, /Sc/}
  • Complex DC: /Cb/
    • Belongs to profile p1 and p2
    • Value domain: {/Sa/, /Sb/}
  • Simple DC: /Sa/
    • Belongs to profile p1
  • Simple DC: /Sb/
    • Belongs to profiles p1 and p2
  • Simple DC:
    • Belongs to profile p2
• Output
  • Profile: p1
    • Complex DC: /Ca/ with Simple DC: /Sa/, Simple DC: /Sb/
    • Complex DC: /Cb/ with Simple DC: /Sa/, Simple DC: /Sb/
  • Profile: p2
    • Complex DC: /Ca/ with Simple DC: /Sb/, Simple DC: /Sc/
    • Complex DC: /Cb/ with Simple DC: /Sb/
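The Input, Stored and Output panels of this example can be reproduced with a short simulation. It assumes, as a simplification, that the registry keeps one global value domain per complex DC and one global set of profiles per DC; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# profile -> complex DC -> simple DCs in its value domain
Profiles = Dict[str, Dict[str, Set[str]]]


def store_globally(profiles: Profiles) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Flatten per-profile input into global value domains and memberships."""
    value_domain: Dict[str, Set[str]] = defaultdict(set)
    membership: Dict[str, Set[str]] = defaultdict(set)
    for profile, complex_dcs in profiles.items():
        for complex_dc, simple_dcs in complex_dcs.items():
            value_domain[complex_dc] |= simple_dcs
            membership[complex_dc].add(profile)
            for simple_dc in simple_dcs:
                membership[simple_dc].add(profile)
    return dict(value_domain), dict(membership)


def export_profile(profile: str, value_domain, membership) -> Dict[str, Set[str]]:
    """Rebuild one profile from the global store."""
    return {
        dc: {s for s in values if profile in membership[s]}
        for dc, values in value_domain.items()
        if profile in membership[dc]
    }


INPUT: Profiles = {
    "p1": {"/Ca/": {"/Sa/", "/Sb/"}, "/Cb/": {"/Sa/"}},
    "p2": {"/Ca/": {"/Sb/", "/Sc/"}, "/Cb/": {"/Sb/"}},
}
value_domain, membership = store_globally(INPUT)
OUTPUT = {p: export_profile(p, value_domain, membership) for p in INPUT}
```

In `OUTPUT`, /Cb/ in profile p1 contains both /Sa/ and /Sb/ although the input listed only /Sa/: /Sb/ entered the global value domain of /Cb/ through p2 and also belongs to p1, so it leaks into the p1 view.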
Sharing simple data categories
[UML class diagram with the classes: Closed Data Category, Profile, Value Domain, Simple Data Category]
OCL note: self.conceptual domain -> isUnique(profile) and self.descriptionsection.profiles -> forAll(…) and self.conceptual domain -> forAll(…) and self.conceptual domain -> forAll(…) and self.linguistic section -> forAll(…)
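A minimal sketch of the idea behind this diagram: a closed DC keeps a separate value domain per profile, and at most one per profile, which corresponds to the `isUnique(profile)` part of the OCL note. The elided `forAll(…)` conditions are not modelled, and the class and method names are assumptions.

```python
from typing import Dict, FrozenSet, Iterable


class ClosedDataCategory:
    """Closed DC whose value domains are kept per profile."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier
        self._value_domains: Dict[str, FrozenSet[str]] = {}

    def add_value_domain(self, profile: str, simple_dcs: Iterable[str]) -> None:
        # Counterpart of isUnique(profile): one value domain per profile.
        if profile in self._value_domains:
            raise ValueError(f"{self.identifier} already has a value domain for {profile}")
        self._value_domains[profile] = frozenset(simple_dcs)

    def values_for(self, profile: str) -> FrozenSet[str]:
        return self._value_domains.get(profile, frozenset())


cb = ClosedDataCategory("/Cb/")
cb.add_value_domain("p1", ["/Sa/"])
cb.add_value_domain("p2", ["/Sb/"])
```

With the same input as in the merge example, /Cb/ now returns only /Sa/ for p1 and only /Sb/ for p2, even though /Sb/ is itself a member of both profiles.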
Revisions to the DCIF
• The revised data model now forms the basis of a revision of the DCIF, the interchange format
• The attempt to be TMF compliant has been abandoned, as a DCR isn’t a termbase
• The DCIF is now a hierarchical simplification of the network of classes
• A new DCIF XML vocabulary allows comprehensive serialization of the data model
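To illustrate what a hierarchical simplification of a class network can look like, the sketch below writes a closed DC as an XML tree in which shared simple DCs appear as references rather than nested copies. The element and attribute names (`dataCategory`, `valueDomain`, `value`, `ref`) are invented for this example and are not the actual DCIF vocabulary.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def serialize_closed_dc(identifier: str, value_domains: Dict[str, Iterable[str]]) -> str:
    """Serialize one closed DC; all element names are hypothetical."""
    root = ET.Element("dataCategory", {"id": identifier, "type": "closed"})
    for profile in sorted(value_domains):
        domain = ET.SubElement(root, "valueDomain", {"profile": profile})
        for simple_dc in sorted(value_domains[profile]):
            # A shared simple DC is pointed to by identifier, so the network
            # of classes collapses into a tree.
            ET.SubElement(domain, "value", {"ref": simple_dc})
    return ET.tostring(root, encoding="unicode")


xml_text = serialize_closed_dc("/Cb/", {"p1": ["/Sa/"], "p2": ["/Sb/"]})
```

Parsing `xml_text` back with `ET.fromstring` yields one `valueDomain` element per profile, each listing its simple DCs by reference.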
DCIF component hierarchy [diagram]
Conclusions
• The revised data model allows a clearer specification of the data category
• The model now explicitly captures constraints that were originally only stated in the text
• This will enhance the usability and stability of the new DCR implementation, ISOcat
Current and future work
• The current ISOcat alpha release supports this revised data model and DCIF
• The latest ISO 12620.2 draft, describing the revised data model and DCIF, is currently under ballot
• The switch from Syntax to ISOcat is scheduled for late 2008 or early 2009
Data Category Registry: defining widely accepted linguistic concepts
Visit http://www.isocat.org/