Welcome to CPSC 534B: Information Integration

Laks V.S. Lakshmanan laks@cs.ubc.ca Rm. 315

Course Objectives • Most applications of information technology require effective and efficient management of information. • Information may reside anywhere – not just in DBs. • Information can be heterogeneous. • Information of interest may not all be in one place. • Information Integration (II). • II is an enabler for a whole class of new applications.

Course Objectives (contd.) • Key technologies: • RDBMS • Heterogeneous database systems • View integration and management • Semistructured data and XML (data on the web) • Main goal: learn about key concepts, techniques, algorithms, languages, and abstractions that make II possible. And have some fun.

Tentative Schedule Basic Tools (GOFDB) • Week of Jan. 5: Overview/review of FOL. • Jan. 12: Review of relational algebra, calculus, datalog, SQL, integrity constraints. • Jan. 19: Query containment and equivalence: conjunctive queries; queries with negation & aggregation.

Tentative Schedule Integration Take 1 – Global Info. Systems • Jan. 26: Integration models – Global As View and Local As View; query answering using views (an application).
II Take 2 – Dealing with heterogeneity • Feb. 2: SchemaLog and SchemaSQL. • Feb. 9: Schema Integration & Matching. • Feb. 16: Break!

Tentative Schedule (contd.) II Take 3 – Dropping (rigid) structure • Feb. 23: Intro to semistructured data and XML (data model); XPath & Tree Pattern Queries. • Mar. 1: XPath (contd.); XQuery. • Mar. 8: XQuery (contd.); TAX algebra / structural join algorithms. • Mar. 15: XML storage: native and relational. • Mar. 22: XML + Information Retrieval.

Tentative Schedule (contd.) II Take 4 – Semantic Web (The final frontier?) • Mar. 29: Semantic Web and II • Project Talks and demos: April 5 onward.

Marking Scheme • Assignments 45% • Project 55% • Reading papers • Critiquing them • Innovating • Implementing • Reporting and presenting • Projects can involve teams of 2-3 people (subject to approval). • Each team to include  1 MCS student.

Suggested Project Themes • Ideas/suggestions offered throughout the course, so be attentive! • Data cleaning: key step required in data integration. • Mining DTD/schema for XML docs: what you do when you must deal with XML data with no accompanying DTD/schema. • XML schema integration: different XML data sources may follow different DTD/schemas. How do you provide a unified integrated view to the user?

Project Themes (contd.) • XML query containment/equivalence: given queries (in XQuery or XPath), can we rewrite them into more efficient ones, possibly using DTDs or integrity constraints? • XML query operator evaluation algorithms: develop cost models and cost-based physical optimization strategies. • XML and data security: how do you ensure queries are evaluated securely, without divulging anything you are not supposed to?

Project Themes (contd.) • XML and Information Retrieval: effective ways of querying documents marked up using XML (e.g., Shakespeare’s plays); how do you combine IR and database-style XML querying? • Data integration issues for biology: scientific data tends to be heterogeneous. How do you meet the data integration challenges there? • Query Answering using Views for XML: extend the QAV technology developed for RDBMS to XML querying.

Project Themes (contd.) • Detecting similarity between XML documents: develop notions of similarity between XML docs and implement algorithm(s) for detecting similarity • Ranking answers to keyword search queries over XML data: develop and implement algorithms for ranking answers, based on “quality” of match • XML interop: leverage semantic web and ontologies for matching schemas (XML or relational) and develop/implement algorithms for answering cross-queries

Project Themes (contd.) • Explore higher-order logics for tree (XML) querying: example candidates are HiLog and (extensions of) SchemaLog. [can be purely conceptual or part conceptual and part implementation.]
