340 likes | 348 Views
Learn about the evolution of data warehousing to information integration, including structured and unstructured data, and the existing solutions for integrating both types of information.
E N D
A Journey from Data Warehousing to Active Information Integration Mukesh Mohania IBM India research Lab mkmukesh@in.ibm.com
Outline • What is Data Warehousing (DW)? • What is Information Integration (II)? • Existing Solutions • From DW II • Event based (Active) Information Integration • Context-Oriented Information Integration
Structured and Unstructured Information • Information content in an enterprise can be structured or unstructured • Structured Content: payroll, sales orders, invoice, customer profiles, etc. • Unstructured Content:: emails, reports, web-pages, complaints, information on sales, customers, competitors, products, suppliers and people, etc. • According to recent estimates, structured content < 20%, unstructured content > 80% • Historically, the structured and unstructured data management technologies have evolved separately Artificial separation between these two “kinds” of information • Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both the structured and unstructured content
From Data Warehouse to Information Integration • Business needs • Support for decision making, based on • Historical or point-in-time view of the business • Aligned across different departments • Technical limitations • OLTP systems and performance must be protected • Historical and summary data stores needed • Reconciliation of data in different systems is slow • Ad hoc query performance needs to be optimized Data warehousing was first driven by a need for consistent business information from disparate systems.
Mid-1980s: Data Warehousing • Data warehouse • Reconciling disparate data • Single version of the truth • Historical record • Characteristics • Information not needed immediately • Structured data • Unidirectional data flow • Trusted sources
Business needs Support for decision making, based on • Historical or point-in-time view of the business, • as well as near real-time views Aligned across different departments The need for consistency extends to more immediate business information
Mid-1990s: Operational Data Store • Operational data store • Near real-time • Reconciling a subset of data • Characteristics • Immediate and historical info needs • Structured data • Partial bi-directional data flow • Increasing technical metadata
The vision: Comprehensive integration of information • Integrated information • Real-time knowledge • Integrating all information • Characteristics • Immediate and historical information needs • Fully merged informational & operational needs • Structured and unstructured data • Bi-directional data flow • Caching reduces data flow • Complete business & technical metadata VLDB 2006, SIGMOD 2007, PODS 2007
Structured and Unstructured Information Integration:A Brief Background on Existing Solutions Existing solutions can be classified in terms of the query paradigm used: • Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02]) • Relational data exposed to search engine as virtual text documents • Query both structured and unstructured information using keywords • SQL Query Based Solutions (SQL LIKE predicate, DB2 NetSearch Extender) • Text data exposed to relational engine as virtual tables with text columns • Query both structured and unstructured information using SQL • Provide SQL primitives to search text in table columns using a set of keywords
Keyword Query Based Solution: DB2 ESE DB2 Enterprise Search Extender Keyword Query
Keyword Query Based Solution:DbXplorer/BANKS [ICDE02] DbXplorer/ BANKS Keyword Query Search Engine Keyword Query
Keyword Query Based Solutions: Summary • Advantage: Simplicity! • Disadvantages • Less expressive (as compared to SQL) • How to ask for the information related to the five best performing stocks in the past week? • Need to specify a set of keywords that succinctly encodes the information need • Not always easy
SQL Query Based Solution:Standard SQL LIKE Predicate SELECT stocks.price, docs.text FROM stocks, docs WHERE (stocks.name = ‘IBM’ AND docs.text LIKE ‘% IBM %’) OR (stocks.name = ‘ORCL’ AND docs.text LIKE ‘% ORCL %’) DB2 UDB / DB2 Information Integrator
SQL Query Based Solution:Net Search Extender SELECT stocks.price, docs.text FROM stocks, docs WHERE (stocks.name = ‘IBM’ AND CONTAINS(docs.text, “IBM”)) OR (stocks.name = ‘ORCL’ AND CONTAINS(docs.text, “ORCL”)) DB2 UDB / DB2 Information Integrator CONTAINS(…) Net Search Extender
SQL Query Based Solutions: Summary • Advantages: • More expressive – can specify more involved and sophisticated queries • Disadvantages: • The unstructured data is still queried using keywords • Need to specify a set of keywords that succinctly encodes the information need • Not always easy • The SQL query and the embedded keyword query encode the same information need • Redundant effort • Association of documents with tuples (local context), not with the entire result (global context) • Same documents get attached to “IBM” when “IBM” is queried with “ORCL” as when “IBM” is queried with “DELL”
Data Stream • A data stream is a sequence of data items X1, X2, …, Xn, coming continuously from single or multiple sources where random access to data is not allowed. Data Stream Characteristics Strongly regular: strongly periodic (inclusive zero time interval between two data items), only one type of data, schema can be derived or conforms schema. Weakly regular: weakly periodic (follows some time interval), mixed types of data but follows the order, schema can be derived. Irregular: aperiodic, types of data unknown, no order, schema cannot be derived.
Active functionalities over streaming data • Provides real-time functionalities that is needed in several advanced applications. • Alert a doctor when the blood pressure of a patient goes below X, heart beats less than Y and ECG touches Z. • Sell all my INTC stocks at the higher trading price exchange if the price difference at any time between two exchanges is more than 2%. • Cancel my tomorrow’s flight if there is a terrorist attack in the region of flying. • Events can be defined on composition of data streams that can trigger some pre-defined actions (notification and alert, database change, etc.) • Context can be associated with the events • INTC was trading higher at NASDAQ at 9:32 AM since CEO of INTC rang the opening bell.
Active Rules • An active rule is composed of three components: • Event (E): Monitor - Detect - Evaluate • Condition (C): Derive - Analyze - Evaluate • Action (A): Collaborate - Integrate - Effect • Events: • customer-event(1st purchase, new subscription, etc.) • time-based event (birthday, retirement, etc.) • product-based (launch of a new product, decline in sales, etc.) • calendar-based (Christmas, Diwali, etc.)
Rule Examples • If the value of a transaction is less than $2 and this transaction is made by credit card, notify to the fraud detection system and send a notification to the customer service representative for calling the credit card holder immediately to check the validity of the transaction. • If a customer has made at least 3 transactions or the total value of all transactions is more than $2000 during Christmas holidays, then offer 10% discount to the user between January 10-January 31. • If the duration of a telephone call exceeds by more than 40 minutes, then send a notification to the fraud detection system. • If less than 10% of the stock is sold in a retail store by the end of the week, offer a 20% discount on the non-luxury items for the next week.
Business Logic/Process Active Functionalities Feedback Adaptor Adaptor Adaptor Adaptor Connectors Monitor Monitor Monitor Monitor Data Stream DB MDB Web Data Sources Architecture
Workflow Execution Rules DB Metadata Rule Execution Engine Information Integrator and Decision Analysis Event Composer and Detector Information Flow Engine Integration Hub
Motivation Non-relational Query Relational Query RDBMS CM System (20% of enterprise data) (80% of enterprise data) DB2 Result Retrieved data and documents Unstructured Data (Content) Management Structured Data Management Current Scenario: Isolated Management of Structured and Unstructured Information Needed: Consolidated Management
Motivation Relational Query Content Query Broker RDBMS CM System (100% of enterprise data) Main result: Relational Result Addition: Relevant Documents Main result: Retrieved documents Addition: Relevant database fragment Consolidated Management A broker enables consolidation of information stored in Relational and Non-relational (CM) systems
Problem To enhance structured and unstructured data retrieval through symbiotic consolidation of related information. Specifically: • Enhance structured data retrieval by associating additional documents relevant to the user context with the query result • Enhance document contents by associating additional information derived from structured data Structured data = relations, schema-based (XML) documents Unstructured data = schema-less (free-flow) documents, web-pages
Solution Overview “Get the 3 companies with max price variation” DB2 Information Integrator SELECT name, max(price) - min(price) FROM stocks GROUP BY name ORDER BY 2 FETCH FIRST 3 ROWS ONLY SCORE “Doctype:Patents” DB2 Enterprise Search Extender “IBM” “ORCL” “MSFT” “Database” “Software” “Doctype:Patents” CIKM 2005 Best Paper Award
Solution Overview DB2 Information Integrator SELECT name, max(price) - min(price) FROM stocks GROUP BY name ORDER BY 2 FETCH FIRST 3 ROWS ONLY SQL Query Result SCORE “Doctype:Patents” DB2 Enterprise Search Extender “IBM” “ORCL” “MSFT” “Database” “Software” SQL Query Context “Doctype:Patents”
Main Idea • Specify information need in terms of SQL over the structured database • Additional information needs specified using “directives” (optional) • Automatically synthesize the “context” of the SQL query from its result and the known semantic dependencies in the structured data • Use this context and the directives to retrieve the unstructured data
Overall Architecture User Interface/Application Query Result + Relevant Documents Query + Directives Metadata Repository (Criollo) Query Handler SCORE Query Result + Context Relevant Documents Query Metadata mapping Context Handler Metadata Modified Query Result ModifiedQuery Context + Directives Metadata DB2 II Enterprise Search CM CM CM CM Structured Data Source Unstruct Data Source Metadata
Applications • Financial: Customer-centric investment account and risk assessment documents • Health: Patient specific report and medical articles • Telecommunications and Manufacturing: Defect statistics and engineering specifications • Marketing: Customer transaction history and marketing documents
Research problems in Active Information Integration • How to model the Active Rules? Is it just an ECA or something more? • How to define events and rules along dimensional and fact tables in data warehousing? • Do we need a different data modeling schemes for Active/real-time data warehousing? • How to monitor the data sources for active integration? • What data need to be materialized for computing the exact change at data warehouse site? • How to provide keyword search on data warehousing considering the data is exposed as business objects to users? • How to translate keyword query into more semantic based query for OLAP analysis? How to handle uncertainty in keyword based queries? • Single data indexing mechanism for both unstructured and structured data? • New architectures for Active/Real-time Data warehousing – handling large number of queries in almost sub-seconds.
Conclusions • Integration Evolution • 1980 Federated Data Integration • 1985 Data warehouse • 1995 Operational data store • 1999 Client information integration • 2003 Information integration • Active Information Integration Approaches • Rule based Information Integration • Co-relating structured and unstructured data