Semantic Extensions to Domain-Specific Markup Languages
Aparna Varde, Elke Rundensteiner, Murali Mani, Mohammed Maniruzzaman and Richard D. Sisson Jr.
Worcester Polytechnic Institute (WPI), Worcester, Massachusetts, USA
CCCT-04
Introduction
• XML, the eXtensible Markup Language: widespread standard for storing and publishing data.
• Domain-specific markup languages are designed with XML tag sets.
• Standardization bodies extend these to include additional semantics.
• Aspects such as domain knowledge and XML constraints are important.
• Focus of paper: generic issues in extending markup languages.
Domain-specific markup language
• Medium of communication for potential users of the domain.
• Users: industries, consumers, universities, research organizations, publishers etc.
• Follows XML syntax.
• Encompasses the semantics of the domain.
• Examples
  • MML: Medical Markup Language
  • MatML: Materials Science Markup Language
Figure: the markup language as a medium of communication among industries, publishers, consumers, research organizations and universities.
MML: Medical Markup Language
• Creates standards for medical data to be stored and accessed worldwide.
• MML module contents, e.g., “basic clinic information”, “surgery record information”.
• Used by primary care physicians, general surgeons etc.
• Specific information in sub-areas such as “ophthalmology” cannot be stored with these modules.
• Thus there is a need for more semantics in MML.
Motivation for extension to markup languages
• Just as ophthalmology is a specific sub-area of the medical domain, other domains have their own specifics.
• Why not define a new markup language for each aspect?
• Typically, basic information in the generic language needs cross-referencing, e.g., basic surgical details in ophthalmology.
• Common information should not be stored twice.
• Hence it is advisable to extend the existing markup language with additional semantics.
Extending the Materials Science Markup Language, MatML
• MatML: Materials Science Markup Language.
• XML for materials property data.
• Heat Treating: controlled heating and cooling of materials to achieve desired mechanical and thermal properties.
• Need to include semantics of Heat Treating in MatML.
• At WPI, a Heat Treating extension to MatML is proposed.
• Several issues, both domain-specific and XML-related, are crucial here.

    <MatML_doc>
      <Material>
        <BulkDetails> … </BulkDetails>
        <ComponentDetails> … </ComponentDetails>
        …
      </Material>
    </MatML_doc>
General issues in extending any markup language
• Steps essential in markup language extension.
• Desired language features.
• XML schema constraints.
• Retrieval using XQuery.
Steps essential in markup language extension
• Understand domain semantics.
• Model the data.
• Conduct interviews.
• Define the ontology.
• Reiterate the ontology.
• Outline the initial schema.
• Revise the schema based on critical reviews.
1. Understand domain semantics
• Acquire domain knowledge: terminology, processes, entities etc.
• This helps determine the essential tags needed to store data in the domain.
• Study the existing markup language in detail.
• This is to understand where exactly it needs extension.
2. Model the data
• Build a data model after studying the domain.
• Use techniques such as Entity-Relationship diagrams.
• Thus represent domain entities, their properties and relationships.
Figure: Subset of E-R diagram for Heat Treating.
3. Conduct interviews
• Needs of potential users are important.
• These help determine the entities and attributes in the extension.
• Users: industries, universities, research organizations, publishers etc.
• Domain experts can identify the needs of users.
• Hence, interview the domain experts.
4. Define the ontology
• Ontology serves as the established vocabulary for the domain.
• Hence defining the ontology is important before proceeding with design.
• Issues
  • Synonyms: two or more words with the same meaning, e.g., in the financial domain, “salary” and “income”.
  • Homographs: one word with multiple meanings, e.g., “share” in the financial domain could refer to “sharing of assets” or “shares in the stock market”.
• Clarify such terms with reference to context through the ontology.
5. Reiterate the ontology
• Once the ontology is established, it is useful to have another round of discussions with domain experts.
• These discussions may lead to further clarifications.
• Example: remove existing entities or create new ones, based on terminology.
• The ontology needs to be altered accordingly.
• Use the revised ontology for schema design.
Figure: High-level ontology for Heat Treating.
6. Outline the initial schema
• Schema provides structure, i.e., defines the grammar for the markup language.
• Once the data model and ontology are approved by domain experts, outline the initial schema.
• Adhere to the syntax of the original markup language so that the schema can be accommodated as an extension.
Figure: Partial snapshot of schema for Heat Treating extension to MatML.
7. Revise the schema based on critical reviews
• Initial schema serves as a medium of communication between designers and users.
• It is subject to further changes until domain experts are satisfied.
• Schema revision may involve several iterations.
• Some of these include discussions with standards bodies.
• For the proposed extension to be accepted as a worldwide standard, it must be approved by experts & standards bodies.
Desired language features
• Avoid redundancy.
• Make information non-ambiguous.
• Provide easy interpretability of data.
• Capture domain constraints in the schema.
1. Avoid redundancy
• The markup language extension should avoid duplication of storage.
• Data stored in the original markup language should be cross-referenced in the extension.
• Example
  • In the medical domain, there should be cross-referencing between “basic clinic information” in the original language and “ophthalmological details” in the extension.
• Schema should be structured accordingly.
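One possible way to express such a cross-reference in XML Schema is the built-in `xsd:ID` / `xsd:IDREF` pair. The fragment below is only a sketch under stated assumptions: every element and attribute name in it (`PatientRecord`, `BasicClinicInformation`, `OphthalmologicalDetails`, `id`, `clinicInfoRef`) is hypothetical and is not taken from MML or from the proposed extension.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- All names below are hypothetical; they are not taken from MML -->
  <xsd:element name="PatientRecord">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Stored once, as in the original language -->
        <xsd:element name="BasicClinicInformation">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:string">
                <xsd:attribute name="id" type="xsd:ID" use="required"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
        <!-- The extension points back to the stored data instead of repeating it -->
        <xsd:element name="OphthalmologicalDetails">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:string">
                <xsd:attribute name="clinicInfoRef" type="xsd:IDREF" use="required"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A validating parser checks that each `xsd:IDREF` value matches some `xsd:ID` value in the same document, so a dangling cross-reference is reported as an error.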
2. Make information non-ambiguous
• Domain terminology, its semantics, and aspects such as synonyms / homographs are significant.
• The schema design should adhere to the ontology to avoid ambiguity.
• Annotations should be included within the schema to enhance clarity.
• Example:
  • For spectacle prescriptions in ophthalmology, include meanings of terms “myope” and “hypermetrope” in schema as annotations.
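As a rough illustration of the annotation idea, XML Schema lets a definition travel with the schema through `xsd:annotation` and `xsd:documentation`. The element name `RefractiveStatus` and the choice to model the two terms as enumeration values are assumptions made for this sketch, not part of any published schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "RefractiveStatus" is a hypothetical element name -->
  <xsd:element name="RefractiveStatus">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="myope">
          <xsd:annotation>
            <xsd:documentation>A person with myopia (near-sightedness).</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
        <xsd:enumeration value="hypermetrope">
          <xsd:annotation>
            <xsd:documentation>A person with hypermetropia (far-sightedness).</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
```

Annotations do not affect validation; they only document the intended meaning for readers and tools.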
3. Provide easy interpretability of data
• Data is stored using markup language tags.
• Readers should be able to interpret this data without much reference to the literature.
• Thus the schema design should be organized accordingly.
• Example:
  • In science and engineering domains, experimental conditions should be stored close to results to enhance readability.
4. Capture domain constraints in the schema
• Certain requirements imposed by the domain need to be captured in the schema.
• This is done through XML schema constraints.
• Some constraints
  • Primary key: To uniquely identify an entity.
  • Choice: To declare mutually exclusive elements.
    • Example: In the financial domain, a person could be either “insolvent” (bankrupt) or “asset-holder” but not both.
XML schema constraints
• Sequence constraint.
• Disjunction constraint.
• Key constraint.
• Occurrence constraint.
1. Sequence constraint
• To declare a list of elements in order.
• Enclose elements in <xsd:sequence> tags.
• Example:
  • In Heat Treating extension, element “QuenchConditions” must occur before “Results”.
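A minimal sketch of how the sequence constraint might be written follows. Only `QuenchConditions` and `Results` come from the slide; the enclosing `Experiment` element and the `xsd:string` content types are placeholders assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "Experiment" is a hypothetical container element -->
  <xsd:element name="Experiment">
    <xsd:complexType>
      <!-- Children must appear in exactly this order -->
      <xsd:sequence>
        <xsd:element name="QuenchConditions" type="xsd:string"/>
        <xsd:element name="Results" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An instance that places `Results` before `QuenchConditions` would fail validation against this fragment.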
2. Disjunction constraint
• To declare mutually exclusive elements, i.e., only one of them can exist.
• Enclose elements in <xsd:choice> tags.
• Example:
  • In Heat Treating, a part can be made by “Casting” OR “Powder Metallurgy”, not both.
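The disjunction could be sketched as below. The container element `Part`, the spelling `PowderMetallurgy` as a tag name, and the `xsd:string` content types are assumptions for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "Part" is a hypothetical container element -->
  <xsd:element name="Part">
    <xsd:complexType>
      <!-- Exactly one of the two children may appear -->
      <xsd:choice>
        <xsd:element name="Casting" type="xsd:string"/>
        <xsd:element name="PowderMetallurgy" type="xsd:string"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With the default occurrence of one on `xsd:choice`, an instance containing both children, or neither, is invalid.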
3. Key constraint
• To declare an attribute to be a primary key, i.e., it must be unique and non-null.
• Indicate the attribute as type “xsd:ID” and its use as “required”.
• Example:
  • In Heat Treating, the name of the cooling medium (quenchant) is crucial because the purpose of the experiments is to categorize the quenchants. Hence the quenchant name serves as the primary key.
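A short sketch of the key constraint, assuming a hypothetical `Quenchant` element that carries its name in an attribute called `name`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "Quenchant" and its "name" attribute are hypothetical names -->
  <xsd:element name="Quenchant">
    <xsd:complexType>
      <!-- xsd:ID makes the value unique within the document;
           use="required" makes it mandatory (non-null) -->
      <xsd:attribute name="name" type="xsd:ID" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Two points are worth keeping in mind with this approach: `xsd:ID` values must be unique across the whole document, not just among quenchants, and they must be valid XML names, so a value containing spaces or starting with a digit would be rejected.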
4. Occurrence constraint
• To declare minimum and maximum permissible occurrences of an element.
• Indicate “minOccurs = x” and “maxOccurs = y” where “x” and “y” denote the minimum and maximum occurrences respectively.
• Value “maxOccurs = unbounded” means no upper bound on the number of occurrences.
• Value “minOccurs = 0” means that the element need not be stored even once.
• Example:
  • In Heat Treating, Cooling Rate must be recorded at a minimum of 8 points in an experiment and there is no upper bound for it. The maximum number of graphs stored per experiment is 3 and it is not necessary that at least one graph be stored.
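The Heat Treating example above could be written roughly as follows. The tag names `Experiment`, `CoolingRate` and `Graph`, and their content types, are assumed for illustration; the occurrence bounds are the ones stated in the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Element names and content types are hypothetical -->
  <xsd:element name="Experiment">
    <xsd:complexType>
      <xsd:sequence>
        <!-- At least 8 cooling-rate points, no upper bound -->
        <xsd:element name="CoolingRate" type="xsd:decimal"
                     minOccurs="8" maxOccurs="unbounded"/>
        <!-- Graphs are optional, with at most 3 per experiment -->
        <xsd:element name="Graph" type="xsd:string"
                     minOccurs="0" maxOccurs="3"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

When `minOccurs` and `maxOccurs` are omitted, both default to 1.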
Retrieval using XQuery
• Encourage users to store data in a case-sensitive manner.
• Use tags to enhance querying efficiency.
1. Encourage users to store data in a case-sensitive manner
• XQuery is case-sensitive.
• Hence it is useful to emphasize consistent use of case when storing data using the markup language.
• This facilitates retrieval using XQuery.
2. Use tags to enhance querying efficiency
• It is possible to anticipate a typical user query in a domain.
• Thus it is advisable to add a level of abstraction for faster retrieval of information.
• Example:
  • In Heat Treating, a user is likely to retrieve name details of a quenchant without its property details.
  • Hence place tags <NameDetails> and <PropertyDetails> around quenchant information.
  • Thus the entire path of the quenchant need not be traversed for name details.
• This enhances querying efficiency.
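The instance fragment below sketches what such grouping might look like. Only `<NameDetails>` and `<PropertyDetails>` are named in the slide; the `Quenchant` wrapper, the child elements and their values are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "Quenchant", "Name", "Property" and all values are hypothetical -->
<Quenchant>
  <!-- A query for names only needs this subtree,
       e.g. a path such as /Quenchant/NameDetails -->
  <NameDetails>
    <Name>ExampleQuenchant</Name>
  </NameDetails>
  <!-- Property data is grouped separately and can be skipped -->
  <PropertyDetails>
    <Property>ExampleProperty</Property>
  </PropertyDetails>
</Quenchant>
```

Because element names and string comparisons in XQuery are case-sensitive, a path written as `/quenchant/namedetails` would not match this fragment.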
Conclusions
• Aspects of extending domain-specific markup languages are discussed here.
• These include motivation for extension, steps in extension, language features, XML constraints and retrieval considerations.
• Extension to MatML is proposed at the Center for Heat Treating Excellence (CHTE), WPI, to include Heat Treating semantics.
• Paper summarizes general issues in extending domain-specific markup languages.
Acknowledgments
• Database Systems Research Group in Department of Computer Science at WPI.
• Quenching Research Team in Department of Materials Science at WPI.
• Center for Heat Treating Excellence and its member companies.