Integrating Structured & Unstructured Data

Goals • Identify some applications that have a crucial requirement for the integration of unstructured and structured data • Identify key technical issues in integrating unstructured and structured data • Identify potential approaches

Definitions (simplified) • Structured object: • <oid, {<name, value>}> • Unstructured object: • <oid, {word}> • <oid, unknown/complex structure> • Semi-structured object: • <oid, {<name, value>}, {word}> • <name, value> pairs may be: • Given (e.g. author, title, etc.) • Extracted (e.g. Date, Zipcode, etc.) • Inferred (e.g. Topic)
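
The three object kinds above can be written down as plain record types. This is a minimal sketch, not part of the original slides: the class names, the `origin` map (recording whether a pair was given, extracted, or inferred), and the helper `promote` are all hypothetical, and the five-digit ZIP code pattern is an assumed stand-in for a real extractor.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredObject:
    oid: str
    attributes: dict  # {<name, value>}


@dataclass
class UnstructuredObject:
    oid: str
    words: list  # {word}


@dataclass
class SemiStructuredObject:
    oid: str
    attributes: dict  # {<name, value>}
    words: list       # {word}
    # Hypothetical bookkeeping: name -> "given" | "extracted" | "inferred"
    origin: dict = field(default_factory=dict)


def promote(obj: UnstructuredObject, given: dict) -> SemiStructuredObject:
    """Turn an unstructured object into a semi-structured one.

    `given` pairs are copied as-is; a ZIP code is extracted from the words
    if one matches the assumed five-digit pattern.
    """
    attributes = dict(given)
    origin = {name: "given" for name in given}
    for word in obj.words:
        if re.fullmatch(r"\d{5}", word):
            attributes["zipcode"] = word
            origin["zipcode"] = "extracted"
            break
    return SemiStructuredObject(obj.oid, attributes, list(obj.words), origin)


doc = UnstructuredObject("o1", "office moved to San Jose 95120 last year".split())
semi = promote(doc, {"author": "A. Writer"})
```

Inferred pairs (e.g. Topic) would be added the same way, with the origin recorded as "inferred".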

Representative Applications • BPI: Messages - unstructured • Web Applications: unstructured pages • Corporate Portals: • DSS involving a combination of simulation with a database system • News syndication: author etc + story • Call centers: customer interaction + structured component of complaint • Mail system/document systems • Tourist information system • Product catalogs/engineering spec sheets • Patents/chemistry documents • Matching legal documents (with cross citations) with building codes --- representative

Key Technical Issues • Query language & data model • Sharp vs fuzzy / complete vs best-effort • Boolean vs similarity queries (relationship to “value”) • Integration strategies • Loose vs. tight coupling architectures (many possibilities) • Search engine into DBMS or DBMS into search engine • Late & early binding (warehousing vs virtual) • Integration vs articulation (union vs intersection) • Feature extraction from unstructured data • Role of metadata & integrity constraints • Inconsistency of data sources • Priority rules for mediation • Management & data organization issues • Version management, freshness, security • Continuous queries over streams
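
The "Boolean vs similarity queries" issue can be made concrete with a small contrast. The sketch below is illustrative only: a Boolean query returns exactly the documents that contain every term, while a similarity query returns a ranked, best-effort list. Jaccard overlap is an assumed scoring choice here, not one named in the slides, and the sample documents are made up.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def boolean_query(docs, terms):
    """Sharp semantics: a document qualifies only if it contains all terms."""
    required = {t.lower() for t in terms}
    return [oid for oid, text in docs.items() if required <= tokenize(text)]


def similarity_query(docs, terms, k=3):
    """Fuzzy semantics: rank every document by Jaccard overlap with the terms."""
    query = {t.lower() for t in terms}
    scored = []
    for oid, text in docs.items():
        words = tokenize(text)
        union = query | words
        score = len(query & words) / len(union) if union else 0.0
        scored.append((oid, score))
    # Highest score first; ties broken by object id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


docs = {
    "d1": "privacy preserving data mining",
    "d2": "data mining survey",
    "d3": "query optimization",
}
terms = ["privacy", "data", "mining"]
exact = boolean_query(docs, terms)
ranked = similarity_query(docs, terms)
```

With these inputs the Boolean query keeps only `d1`, while the similarity query still returns `d2` with a lower score, which is the "complete vs best-effort" distinction in miniature.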

Structured: People(firstname, lastname, company, location) • Semi-structured: Papers(title, {authors}, text) • Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II • Search reviews using Join(People.<fn,ln>, Papers.authors).keywords
Q2: Folks in Almaden and Watson working on the same topic • Join of Papers.text followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson • Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews • Infer sentiment of a review and interesting joins
Q5: Current research topics in Almaden • Join People and Papers followed by clustering
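
As one possible reading of Q1, the sketch below joins People with Papers.authors, takes keywords from the joined papers, and uses them to search the reviews. Everything concrete in it is an assumption for illustration: the sample rows are invented, the paper title serves as the search keyword, the topic filter is a plain substring test on Papers.text, and the function name `q1` is hypothetical.

```python
people = [
    {"firstname": "Ann", "lastname": "Lee", "company": "IBM", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Ray", "company": "IBM", "location": "Watson"},
]
papers = [
    {"title": "Schema Mapping Basics", "authors": ["Ann Lee"],
     "text": "information integration across sources"},
    {"title": "Fast Indexing", "authors": ["Bob Ray"],
     "text": "index structures for search"},
]
reviews = [
    "Schema Mapping Basics is a clear and useful paper.",
    "Fast Indexing lacks experiments.",
]


def q1(people, papers, reviews, location, topic):
    """Reviews of papers on `topic` written by authors at `location`."""
    # Structured side: select people by location and build full names.
    names = {
        f"{p['firstname']} {p['lastname']}"
        for p in people
        if p["location"] == location
    }
    # Join People.<fn,ln> with Papers.authors, filtered by topic in the text.
    joined = [
        paper for paper in papers
        if names & set(paper["authors"]) and topic.lower() in paper["text"].lower()
    ]
    # Keywords from the join result drive the search over unstructured reviews.
    keywords = [paper["title"].lower() for paper in joined]
    return [r for r in reviews if any(k in r.lower() for k in keywords)]


almaden_reviews = q1(people, papers, reviews, "Almaden", "information integration")
```

Exact name matching is the weakest part of this sketch: real author strings vary in spelling, so the join itself would likely need to be a similarity join.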

Combining Scores • DB: • Aggarwal, Watson, s1 • Agarwal, Almaden, s2 • Agrawal, Almaden, s3 • IR: • Sigmod 00 paper, r2 • PODS 01 papers, r1 • KDD00 paper, r3
Query: Papers on privacy & data mining by Agarwal in Watson
[Figure: Query, Chopper, DB and IR, Combiner, Result]
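
The combiner step can be sketched as follows. The slide leaves the scores symbolic (s1..s3, r1..r3) and does not say how they are combined, so this example makes several labeled assumptions: the numeric scores are invented, the `authorship` table linking papers to DB rows is hypothetical, and a weighted linear sum is just one simple combination rule.

```python
def combine(db_scores, ir_scores, authorship, w_db=0.5):
    """Merge DB match scores and IR relevance scores into one ranking.

    db_scores:  DB row -> score in [0, 1] (how well the person matches)
    ir_scores:  paper  -> score in [0, 1] (how well the text matches)
    authorship: paper  -> DB row of its author (hypothetical link table)
    """
    results = []
    for paper, author in authorship.items():
        if paper in ir_scores and author in db_scores:
            score = w_db * db_scores[author] + (1 - w_db) * ir_scores[paper]
            results.append((paper, author, score))
    # Best combined score first.
    results.sort(key=lambda row: -row[2])
    return results


# Invented values standing in for s1..s3 and r1..r3.
db_scores = {
    "Aggarwal, Watson": 0.9,
    "Agarwal, Almaden": 0.6,
    "Agrawal, Almaden": 0.5,
}
ir_scores = {
    "Sigmod 00 paper": 0.7,
    "PODS 01 papers": 0.9,
    "KDD00 paper": 0.4,
}
authorship = {
    "Sigmod 00 paper": "Agrawal, Almaden",
    "PODS 01 papers": "Aggarwal, Watson",
    "KDD00 paper": "Agarwal, Almaden",
}
ranking = combine(db_scores, ir_scores, authorship)
```

Note that no DB row matches "Agarwal in Watson" exactly; each row is a near miss on spelling or location, which is why the DB side needs scores at all rather than a yes/no answer.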

Query Processing
[Figure: two diagrams, each with the components Query, Chopper & Router, DB, IR, and Result]
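
A chopper and router could look roughly like the sketch below: the query is split into a structured part for the DB and a keyword part for the IR engine, and each part is sent to its backend. The `name:value` query syntax, the function names, and the callable backends are all assumptions made for illustration; the slides only name the components.

```python
def chop(query):
    """Split a query into structured predicates and free-text keywords.

    Tokens of the assumed form name:value become predicates; all other
    tokens are treated as keywords.
    """
    predicates, keywords = {}, []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            predicates[name] = value
        else:
            keywords.append(token)
    return predicates, keywords


def route(query, db_backend, ir_backend):
    """Send each part of the chopped query to the matching backend."""
    predicates, keywords = chop(query)
    db_result = db_backend(predicates) if predicates else None
    ir_result = ir_backend(keywords) if keywords else None
    return db_result, ir_result


# Stand-in backends that just echo what they were asked.
def fake_db(predicates):
    return sorted(predicates.items())


def fake_ir(keywords):
    return " ".join(keywords)


parts = chop("author:Agarwal location:Watson privacy data mining")
routed = route("author:Agarwal location:Watson privacy data mining", fake_db, fake_ir)
```

The two results would then go to a combiner, or one result could be fed into the other backend as a filter, depending on how tightly the systems are coupled.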

Approaches (1) • Query Languages • XML-based extensions for queries • W3C working group on XQuery considering extension for full text • XXL (Weikum), XIRQL (Fuhr) • Specialized languages for highly structured data (e.g. chemical molecules)? • Graph-based models & languages (RDF, Protégé – Stanford) • Extended relational (e.g. SQL/MM) • Inverse queries on business events • Reasoning systems • Statistical approaches (approximate / data mining)

Approaches (2) • Pluses of tight coupling • Enforcement of ontologies, schemas • Security, management, query optimization, integrity constraints • Negatives of tight coupling • Does not address federation issues/autonomy • Pluses of loose coupling • Flexibility • Negatives of loose coupling And the dinner bell rings …

Concluding Remarks • We need further discussion on issues and approaches during the rest of the workshop
