Semi-Structured Data Models

Semi-Structured Data Models By Chris Bennett

Semi-Structured Data • What is it? • Data whose structure is not necessarily determined in advance (often implicit in the data) • Descriptive, not prescriptive • Self-describing and flexible in structure • Where does it come from? • When the data cannot be (or simply is not) modeled naturally or usefully using a standard data model • Merging multiple data sources, sparse user annotations, rapidly evolving schemas specific to given communities • Raw data is often semi-structured • Frequently a product of a rapidly evolving schema • Examples • HTML, XML, BibTeX, integrated data sources, etc.

Semi-Structured Data • This is great – infinite flexibility!! Is there a catch? Always a tradeoff… • In this case, retrieval and query performance can suffer greatly compared to more structured data models

Semi-Structured Data So we know what it is – how do we… • Model it? • Directed labeled graphs • Query it? • Many proposals, all include regular path expressions…Lorel, XML Query… • Store it? • Big challenge Haystack Model

Semi-Structured Data Models • What do they do? • Provide a common framework • In effect, they add some structure • Why? • Semi-structured data is often irregular or missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully known • Standardize information exchange • Data verification (both internal and external) • Examples • OEM, XML DTD, XML Schema…

OEM – Object Exchange Model • Developed at Stanford (mid 90s) • Precursor to today’s accepted semi-structured data formats (XML) • Each object is a quadruple: (label, type, value, object-ID) • Main feature – self-describing • Requires a good bit of human intervention, though
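The (label, type, value, object-ID) quadruple can be sketched in a few lines of Python. This is only an illustration: the class name `OEMObject`, the `&n` object-ID format, the type names, and the sample book are assumptions made for the example, not part of the OEM definition given on the slide.

```python
from dataclasses import dataclass
from itertools import count
from typing import Union

_next_id = count(1)  # hypothetical object-ID generator


@dataclass
class OEMObject:
    label: str                    # describes what the object means
    type: str                     # e.g. "string", "integer", or "set" for a complex object
    value: Union[str, int, list]  # atomic value, or a list of sub-objects
    oid: str = ""                 # filled in automatically when left empty

    def __post_init__(self):
        if not self.oid:
            self.oid = f"&{next(_next_id)}"


# A complex object whose value is a set of labeled sub-objects (made-up sample data).
book = OEMObject("book", "set", [
    OEMObject("title", "string", "Database Systems"),
    OEMObject("year", "integer", 1995),
])

labels = [child.label for child in book.value]
```

Because every object carries its own label, a consumer can interpret `book` without consulting a separate schema, which is what "self-describing" means here.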

Object-Oriented Model versus OEM • OEM is an information exchange model (does not specify object storage issues) • OEM is much simpler (supports object nesting…omits classes, methods, inheritance) • Uses labels in place of schema

Advantages of OEM • Simple model makes transforming and merging data simpler • Advanced features can be “emulated” (implies human intervention) • More suitable for heterogeneity • Hindsight: Extreme heterogeneity mandates more than a little human intervention without some structure

Components of OEM • Query Language • OEM-QL – typical SELECT-FROM-WHERE • Translator • Translates OEM-QL queries for a specific data source and translates the results back • Mediator • Collects the work of translators, then merges and/or combines the results into OEM structures

OEM-QL SELECT – FROM – WHERE • Adaptation of an SQL-like language for OO models • SELECT fetch-expression FROM object WHERE condition • Expressions in the SELECT and WHERE clauses use the notion of a path, which describes a traversal through an object using sub-object structure and labels

OEM-QL • Example: SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no • The ? matches any label • Returns the topic of books for which an internal call number exists • The question mark lets the user say that the intermediate “node” in the path through the object can have any name
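A rough way to see how the wildcard path behaves is to emulate it over nested Python dictionaries. The sketch below is not OEM-QL itself: the helper names `follow` and `query`, the sample data, and the choice to bind the `?` step to the same object in both clauses are assumptions made for illustration.

```python
def follow(obj, path):
    """Yield every value reached from obj by following path; '?' matches any label."""
    if not path:
        yield obj
        return
    if not isinstance(obj, dict):
        return
    head, rest = path[0], path[1:]
    for label, child in obj.items():
        if head == "?" or head == label:
            yield from follow(child, rest)


# Made-up sample data: dict keys play the role of OEM labels.
root = {"biblio": {
    "book1": {"topic": "databases", "internal-call-no": 42},
    "book2": {"topic": "graphs"},
}}


def query(root):
    """Emulate: SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no"""
    topics = []
    for entry in follow(root, ["biblio", "?"]):
        # The WHERE clause only tests that the path exists under this entry.
        if any(True for _ in follow(entry, ["internal-call-no"])):
            topics.extend(follow(entry, ["topic"]))
    return topics


result = query(root)
```

Only `book1` has an internal call number, so only its topic is returned; the label of the intermediate node (`book1`, `book2`) never has to be known to the person writing the query.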

XML DTD – Document Type Definition • Let there be (a little) more structure… • DTDs define the legal building blocks of an XML document. • A DTD defines the document structure with a list of legal elements and/or attributes, and it can be declared inline or external to the XML document.

XML DTD Example
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>

XML DTD Advantages • An application can use a standard DTD to verify that the data it receives from the outside world is valid. • It is flexible: occurrence indicators control how often a child element may appear • + -- at least one occurrence • * -- zero or more occurrences • ? -- zero or one occurrence • Example: <!ELEMENT note (to+, from, heading, body*)>
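The occurrence indicators behave much like regular-expression quantifiers. As a hedged illustration only, the Python sketch below turns a flat, comma-separated content model such as (to+, from, heading, body*) into a regular expression over child element names. It is not a DTD validator: choice groups, nesting, and #PCDATA are not handled, and the function names are hypothetical.

```python
import re


def model_to_regex(model):
    """Turn a flat content model such as 'to+, from, body*' into a regex over child names."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        # Split off a trailing occurrence indicator, if there is one.
        name, suffix = (item[:-1], item[-1]) if item[-1] in "+*?" else (item, "")
        parts.append(f"(?:{re.escape(name.strip())} ){suffix}")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check whether a sequence of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(sequence) is not None


model = "to+, from, heading, body*"
ok_repeated_to = children_match(model, ["to", "to", "from", "heading"])
ok_with_bodies = children_match(model, ["to", "from", "heading", "body", "body"])
missing_to = children_match(model, ["from", "heading"])
```

Here `to+` demands at least one `to` child, while `body*` allows the body to be absent or repeated.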

DTD Drawbacks • What about constraints? • DTDs do not offer much help in constraining the value of a particular attribute or element (only the use of markup) • Automated processing of XML documents requires more rigorous and comprehensive facilities in this area. • Requirements are for constraints on how the component parts of an application fit together, the document structure, attributes, data typing, and so on.

XML Schema • Well-formed is not enough! Let there be more structure! • XML Schema is an XML-based alternative (and ultimate successor) to DTDs • Schemas express shared vocabularies and allow machines to carry out rules made by people. • They provide a means for defining the structure, content and semantics of XML documents

Successor to DTDs • XML Schemas: • Are extensible to future additions • Are richer and more useful than DTDs • Are written in XML • Support data types • Support namespaces

XML Schema Advantages • Better validation, restriction, and type conversion • Extensible – reuse and modify existing data types, reference multiple schemas

XML Schema Details Defines… • Elements that can appear in a document • Attributes that can appear in a document • Which elements are child elements • Order of child elements • Number of child elements • Whether an element is empty or can include text • Data types for elements and attributes • Default and fixed values for elements and attributes

XML Schema Components • Primary components: • Simple type definitions, complex type definitions, attribute declarations, and element declarations • The secondary components, which must have names, are as follows: • Attribute group definitions, identity-constraint definitions, model group definitions, and notation declarations • Finally, the "helper" components provide small parts of other components; they are not independent of their context: • Annotations, model groups, particles, wildcards, attribute uses

XML Namespaces (W3C Documentation) • Collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names • XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set

XML Schema Example W3C XML Schema Primer (examples)
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <element name="street" type="string"/>
    </sequence>
  </complexType>
</schema>
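Because a schema is itself an XML document, ordinary XML tooling can read it. The Python sketch below uses the standard library's `xml.etree.ElementTree` to list, for each complex type in the purchase-order schema, its child elements and whether each is optional. The function name `summarize` and the shape of its result are assumptions for the example; it inspects the schema but does not validate instance documents against it.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"  # namespace of the schema vocabulary

schema_text = """
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <element name="street" type="string"/>
    </sequence>
  </complexType>
</schema>
"""


def summarize(schema_xml):
    """Map each complex type name to a list of (child name or ref, is_optional) pairs."""
    root = ET.fromstring(schema_xml)
    summary = {}
    for ctype in root.findall(f"{XSD}complexType"):
        children = []
        for el in ctype.findall(f"{XSD}sequence/{XSD}element"):
            name = el.get("name") or el.get("ref")
            optional = el.get("minOccurs") == "0"
            children.append((name, optional))
        summary[ctype.get("name")] = children
    return summary


summary = summarize(schema_text)
```

The summary shows what the schema states declaratively: a purchase order has a `shipTo` and a `billTo` address in that order, followed by an optional comment.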

Querying Semi-Structured Data • Keys: • Semi-structured data modeled on directed graphs • User cannot have full knowledge of data structure, but we should exploit what structure we do know exists • Examples • Lorel • Developed at Stanford (1997) as part of the Lore (lightweight object repository) project • XPath • W3C standard • Language for addressing parts of an XML document

Lore System Stanford Link • Successor to OEM • Fully functional DBMS for XML with: • Declarative query language, multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery • Novel features include: • DataGuides • Management of external data • Proximity search

Lore – Novel Features • DataGuides • Structural summary of all paths in that database • Used by query optimizer to exploit known structure • Manage External Data • Proximity Search • Ranks database objects based on their proximity to other objects • Measure proximity based on distances in the graph linking the objects together
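The idea of a structural summary of all paths can be sketched by collecting every distinct label path in a small nested structure. This is a simplified, tree-only illustration and not Lore's DataGuide algorithm, which works on graphs; the function name `label_paths` and the sample guide data are assumptions made for the example.

```python
def label_paths(obj, prefix=()):
    """Return the set of distinct label paths reachable from obj."""
    paths = set()
    if isinstance(obj, dict):
        for label, child in obj.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(obj, list):
        # Several sub-objects under one label all contribute to the same path.
        for item in obj:
            paths |= label_paths(item, prefix)
    return paths


# Made-up sample database with irregular restaurant entries.
guide = {"restaurant": [
    {"name": "Saigon", "address": {"zipcode": "94301"}},
    {"name": "Chef Chu", "price": "cheap"},
]}

summary = sorted(".".join(p) for p in label_paths({"Guide": guide}))
```

Each path appears once in the summary no matter how many objects share it, so an optimizer (or a user browsing the database) can tell, for example, that `Guide.restaurant.address.zipcode` exists somewhere without scanning the data.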

Lorel – Lore Query Language • Based on OQL • Provides powerful path traversal operators • Makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data • Permits flexible form of declarative navigational access • Particularly suited to when details of structure are not known

Lorel – Coercion Rules

Lorel Example • Find the names and zip codes of all “cheap” restaurants • select Guide.restaurant.name, Guide.restaurant(.address)?.zipcode where Guide.restaurant.% grep "cheap" • The ? after (.address) means the address step is optional in the path expression • The % matches any subobject of restaurant • The comparison operator grep returns true if the string "cheap" appears anywhere in the subobject value
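To make the optional step and the % wildcard concrete, the query can be emulated by hand over nested Python dictionaries. This is a sketch under stated assumptions rather than Lorel semantics in full: it assumes all three path expressions range over the same restaurant object, it only greps atomic string values, and the helper names and sample data are invented for the example.

```python
def subobjects(obj):
    """Yield the immediate subobject values of obj (what % ranges over)."""
    for value in obj.values():
        yield from value if isinstance(value, list) else [value]


def cheap_restaurants(guide):
    rows = []
    for r in guide.get("restaurant", []):
        # where Guide.restaurant.% grep "cheap": some atomic subobject contains the string
        if not any(isinstance(v, str) and "cheap" in v for v in subobjects(r)):
            continue
        # (.address)? : zipcode may sit directly under restaurant or under address
        holders = [r] + ([r["address"]] if isinstance(r.get("address"), dict) else [])
        zips = [h["zipcode"] for h in holders if "zipcode" in h]
        rows.append((r.get("name"), zips))
    return rows


# Made-up sample data with the zip code stored at two different depths.
guide = {"restaurant": [
    {"name": "A", "price": "cheap", "address": {"zipcode": "11111"}},
    {"name": "B", "price": "cheap", "zipcode": "22222"},
    {"name": "C", "price": "expensive", "zipcode": "33333"},
]}

rows = cheap_restaurants(guide)
```

Restaurants A and B are both found even though their zip codes are stored differently, which is the kind of irregularity the optional path step is meant to absorb.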

Lorel – Another example select X.name from John.name JN, John.child X, X.name XN where JN == XN • “Retrieve the children of John bearing his name” • == expects atomic values so they are coerced Rewritten: select X.name from John.child X where John.name == X.name

Lorel – Constructing Results • Select-from-where in Lorel has the same semantics as in SQL: the result is a bag (multiset), or a set if ‘distinct’ is used • The result is always a collection of OEM objects (duplicate elimination is by OID) • For each assignment of the variables in the from clause that passes the condition of the where clause, a value is generated according to the expressions in the select clause • Results can refer to existing database objects or to new objects created by coercion

Lorel – Data Updates • Create and delete database names • Delete is implicit when object becomes unreachable • Create a new atomic or complex object • Modify the value of an existing atomic or complex object • Bulk load an OEM database

Lorel – Updates cont’d… • Assigning names to objects • Name myFavorite := element (select Guide.Restaurant where Guide.Restaurant.name = "Saigon") • Creating objects • new_oem (int, 5) • new_oem (complex, struct(a:{new_oem(int,5)}, b:{X,Y}))

XPath Features • XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax • Provides basic facilities for manipulation of strings, numbers and booleans • XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values

XPath – How It Works W3C XPath Information • XPath models an XML document as a tree of nodes • Root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, comment nodes • Evaluation occurs with respect to a “context” which consists of: • a node (the context node) • a pair of non-zero positive integers (the context position and the context size) • a set of variable bindings • a function library • the set of namespace declarations in scope for the expression

XPath – How It Works (cont’d) • Location path – selects a set of nodes relative to the context node • An expression that is a location path results in a node set • Examples of location paths • Includes functions for node sets, strings, numbers, etc…

XPath – Generic Example Simple: employee[@secretary and @assistant] Selects all the employee children of the context node that have both a secretary attribute and an assistant attribute W3C School Examples
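A comparable selection can be tried with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. That subset does not accept `and` inside a predicate, so the sketch below chains two predicates instead, which selects the same elements here. The sample document and its names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the root element acts as the context node.
doc = ET.fromstring("""
<staff>
  <employee name="Ann" secretary="Bo" assistant="Cy"/>
  <employee name="Dee" secretary="Ed"/>
  <employee name="Flo"/>
</staff>
""")

# Chained predicates stand in for employee[@secretary and @assistant].
matches = doc.findall("employee[@secretary][@assistant]")
names = [e.get("name") for e in matches]
```

Only the first employee carries both attributes, so it is the only one selected; an engine with full XPath support would accept the original `and` form directly.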
