Data Warehousing/Mining Comp 150 DW Semistructured Data

Data Warehousing/MiningComp 150 DW Semistructured Data Instructor: Dan Hebert

Semistructured Data • Everything that has no rigid schema • Schema is contained within the data (self-describing), OR • No separate schema, OR • Schema exists but places only loose constraints on data • Emerged as an important topic for a variety of reasons • Many data sources like WWW which we would like to treat as databases but cannot for the lack of schema • Desirable to have an extremely flexible format for data exchange between disparate databases • May want to view structured data as semistructured data for the purpose of browsing

Motivation • Some data really is unstructured/semistructured • World Wide Web, • Data exchange formats • Some exotic database management systems, e.g., ACeDB, popular with biologists • Data integration • Browsing

Motivation - World Wide Web • Why do we want to treat the Web as a database? • To maintain integrity • To query based on structure (as opposed to content) • To introduce some “organization”. • But the Web has no structure. The best we can say is that it is an enormous graph.

Motivation - Data Formats • Much (probably most) of the world’s data is in data formats • These are formats defined for the interchange and archiving of data • Data formats vary in generality. ASN.1 and XDR are quite general • Scientific data formats tend to be “fixed schemas” • The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation

Motivation - Data Integration • Goal is to integrate all types of information, including unstructuredinformation • Irregular, missing information, structure not fully known, dynamic schema evolution, etc. • Traditional data models and languages not well suited • Cannot accommodate heterogeneous data sets (different types and structures), etc. • Difficult to build software that will easily convert between two disparate models • OEM (Object Exchange Model) • Semistructured data model from TSIMMIS project at Stanford • Internal data structure for exchange of data between DBMSs • Used by other systems: e.g., Windows 95 registry, Lotus Notes

Motivation - Browsing • To query a database one needs to understand the schema. • However schemas have opaque terminology and the user may want to start by querying the data with little or no knowledge of the schema. • Where in the database is the string “Casablanca” to be found? • Are there integers in the database greater than 216 ? • What objects in the database have an attribute name that starts with “act”? • While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.

The Model • Represent data as some kind of graph-like or tree-like model • Cycles are allowed but usually refer to them as trees • Several different approaches with minor differences (easy to convert) • Data on labels or edges, nodes carry information or not • Straightforward to encode relational and object-oriented databases • Issue: object identity

Querying Semistructured Data • There are (at least) three approaches to this problem • Add arbitrary features to SQL or to your favorite query language • Find some principled approach to programs that are based on the type of the data • Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure

The “Extend SQL” Approach • In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures • It is the approach taken in the design of UnQL and also of Lorel • Looks very similar to OQL (path expressions)

Example select Entry.Movie.Title from DB where Entry.Movie.Director...

Syntax Issues • Need (path) variables to tie paths and edges together • Paths of arbitrary length • “Find all strings in db” • “Find whether “Allen” acted in “Casablanca” • Need regular expresions to constrain paths • Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets

Underlying Computational Strategy • Model graph as a relational database and use relational query language. • Database large relation (node-id, label, node-id) • Used by Stanford group in LORE/LOREL • Complications • Labels are from heterogeneous set of types, need more than one relation • Additional relations if info to be stored in nodes • Various navigation issues

Semistructured Data - Case StudyObject Exchange Model

OEM Features • Common model for heterogeneous information exchange, self-describing • Each object: OID Label Type Value • OID= unique identifier or NULL • Label= character string descriptor • Type= atomic data type or set • Value= atomic value or set of object references • “Help pages” for labels • Query language OEM-QL

Representing Semistructured Data Using OEM Label <collection, {b1, a1, ...}> b1: <book, {t, a}> t: <title, “Database and ...”> a: <author, {n, p}> n: <name, “Jeff Ullman”> p: <picture, “/gifs/ullman.gif”> a1: <article, {v, w, x}> v: <author, “Gio Wiederhold”> w: <title, “Mediators in the …”> x: <journal, “IEEE Computer”> Set Value Memory Addresses Atomic Value ...

An OEM Query Language: OEM-QL • Logic-based language for OEM • Match object patterns, generate variable bindings, construct new OEM objects from existing ones • Get articles published in “IEEE Computer” • P :- • P:<articles {<journal “IEEE Computer”>}> • Get titles of books by “Jeff Ullman” • <answer_title T> :- • <book {<author “Jeff Ullman”> <title T>}>

Semistructured Data - Case StudyWWW Extraction

Problem • Lots of valuable information on the Web • irregular structure • highly dynamic • Embedded in HTML • Limited query facilities

Data Extraction Tool • Flexible, easy to use • Accommodate virtually any HTML source • Interface with existing system, e.g., data warehouse, user interface for querying Query Data Warehouse World Wide Web Extractor WH Integrator Specification

Approach • Extract Web data into OEM format • Query using OEM-QL • Python-based, configurable parser • Declarative description of HTML source • location of data on page • how to package data into OEM • “Regular expression”-like syntax • Human intelligence rather than A.I.

[ “variable(s)”, “source”, “pattern” ] Extractor Specification Consists of commands of the form:

HTML Source File <HTML> <HEAD> . . . <TABLE> <TR> <TH><I> header 1</I></TH> <TH><I> header 2</I></TH> <TH><I>header 3</I></TH> </TR> <TR> <TD> text 1</TD> <TD><A HREF=http://www.stuff/> text 2 </A></TD> <TD> text 3</TD> </TR> . . . </TABLE> . . . </BODY> </HTML>

Specification File [ [“root”, “get('http://www.example.test/')”, “#” ], [“__tempvar1”, “root”, “*<table>#</table>*” ], [“__tempvar2”, “split (__tempvar1,’</tr>’)”, “#” ], [“rows”, “__tempvar2[1:-1]”, “#” ], [“header1,header2_url,header2,header3”, “rows”, “*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”] ]

<rootcomplex { <rowscomplex { <header1string “text 1”> <header2_urlstring “http://www.stuff”> <header2string “text 2” <header3string “text 3”> }> <rowscomplex { }> }> ... ... Result OEM Object

Basic Syntax:Variable • variable(l:p:t) • optional parameters for specification of corresponding OEM object • l: label name • t: type • p: parent object • _variable • temporary data structure, does not appear as OEM object

Basic Syntax: Source • split(variable,token) • creates a list with multiple elements using token as the element separator • get(URL) • obtain contents of HTML file at address URL

Basic Syntax: Patterns • token1#token2 • match and store current input (between tokens) • token1*token2 • match, don’t store current input (between tokens)

Syntactic Sugar • Functions for extracting commonly used HTML constructs • extract_table(variable),pattern • split_table_row(variable) • split_table_column(variable) • extract_list(variable),pattern • split_list(variables)

Advanced Features • Customization of output • structure, label names, data type, ... • Extraction across multiple HTML pages • Graceful recovery from parse errors • resume parsing using next input from source • Multiple patterns in single command • follow different parse tree depending on structure in source

Sample Extraction Scenario . . .

Extracted OEM Data OEM-QL query: <city C {<high H> < low L>}> :- <temperature {<city_temp {<country “Germany”> <city C> <high_today H> <low_today L>}>}>

Evaluation • Better than • writing programs • YACC, PERL, etc. • A.I. • Can do better • GUI tool to simplify the generation of extractor specification • Machine learning or data mining techniques to automatically infer structure...

Data Warehousing/Mining Comp 150 DW Semistructured Data