260 likes | 422 Views
Querying Web-Sources within a Data Federation. Lynn Wu 1 , Aykut Firat 2 , Tarik Alatovic 3 , Stuart Madnick 1 1 MIT Sloan School of Management 2 Northeastern University 3 INSEAD International Conference on Information Systems (ICIS) December 11, 2006. Motivating Scenario. You want:
E N D
Querying Web-Sources within a Data Federation Lynn Wu1, Aykut Firat2, Tarik Alatovic3, Stuart Madnick1 1MIT Sloan School of Management 2Northeastern University 3INSEAD International Conference on Information Systems (ICIS) December 11, 2006
Motivating Scenario You want: • The current stock quotes of all companies listed on the Stock Exchange • that are in the biotechnology industry. • And you want to see each of the stock quotes in all the major currencies.
Good News All of the necessary information is available (and for free) on the Web … Stock price for any company Listing of companies in an industry Conversion between any two currencies So what’s the problem?
Process – Part 1 Web sites are not like Relational (SQL) databases. Must go step-by-step: first find all the biotech companies.
Process – Part 2 Then must find the stock price of each, one-by-one. 237 Biotech firms ADH ACOR ACHN ABPI ADLS ANX ACAD
Process – Part 3 Then must convert stock price of each, one-by-one.
General Scenario • Users often have to browse through many websites and collect and process a lot of information manually. • Wouldn’t it be great if you could get all the stock quotes in the biotech industry using one query? select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry='Biotechnology')
Why is this so difficult? • Websites have various capability restrictions. • Web sites do not accept general queries (e,g., SQL). • Assuming they somehow accepted general queries, there are still problems. For example: • select price from yahooF This is not answerable as Yahoo! Finance requires at least one ticker at a time to get the stock quote. • select exchanged, expressed, rate, date from olsen where expressed='USD' and date= '12/10/06' Must specify both currencies.
Existing Solutions • Commercial databases can incorporate heterogeneous data sources through the use of wrappers: • However, there is no general-purpose wrapper that can query the entire Web. • Need to construct one wrapper per website. • This is our focus – how can these be improved ? • Other options: • Using highly expressive context-free grammars to express the capability restrictions • Has not been used widely in commercial systems due to their complexity.
Query: Select ..from s1,s2,s3 … Federation Engine IBM DB2 Wrapper: Request - Reply Protocol Wrapper Wrapper for S1 Wrapper for S2 Wrapper for S3 Capability Capability Capability Handler Handler Handler Data Data Data Extraction Extraction Extraction Web Sources S1-website S2-website S3-website How does a Federated database system handle the problem? Example: IBM DB2 For web sites (S1, S2, S3), each wrapper must be custom crafted.
Research Contribution • Offer a complete, practical, and scalable solution to easily incorporate websites into a data federation. • Abstract wrapper components into separate reasoning engines. • Capability reasoning engine for query planning and execution • Data extraction engine
Query: Select ..from s1,s2,s3 … Federation Engine IBM DB2 Wrapper: Request - Reply Protocol Wrapper Wrapper for S1 Wrapper for S2 Wrapper for S3 Capability Capability Capability Handler Handler Handler Data Data Data Extraction Extraction Extraction Web Sources S1-website S2-website S3-website Our Solution Three-Layered Architecture— with capability declaration Two-Layered Architecture—current IBM solution Query: Select ..from s1,s2,s3 Query: Select ..from s1,s2,s3 … … Federation Engine Federation Engine IBM DB2 IBM DB2 Wrapper: Request Wrapper: Request - - Reply Protocol Reply Protocol C C apability apability Wrapper, Wrapper, Query Query R R ecord ecord Capability Capability planning planning CR CR Declaration Declaration for S1 for S1 Engine Engine with with CR CR for S2 for S2 capability capability CR CR declaration declaration for S3 for S3 D D ata ata Data Data E E xtraction xtraction Data Data Extraction Extraction DE DE Spec Files Spec Files for S1 for S1 Extraction Extraction Engine Engine Engine Engine DE DE for S2 for S2 DE DE for S3 for S3 Web Web Sources Sources S1-website S2-website S3-website
Adding a web source is simple. • Define the data extraction rules. • Define the capability record. No procedural coding involved at all !
Data Extraction: Cameleon Engine • Extract data from web pages using declarative specifications that extract specific fields within a website. • Can answer rudimentary queries involving only a single website. Input param Regular expression identifying the region and extracts the price Example data extraction rules for Yahoo! Finance
Cameleon Studio tool enables quick creation and testing of the data extraction rules
Capability Record • For Yahoo Finance!, we have two attributes of interest. • Cameleon extracts data and form a table format • Capability Record Must provide one (and only one) Ticker at a time (some sites allow up to 50 Tickers at a time). relation(‘YahooF’, [[‘Ticker’, string, bound(1)], [‘Price’, number, free]], ['=']) Price is value returned. Can only use equality (=) operator. relation(olsen, [['Exchanged',string, bound(1)], ['Expressed',string, bound(1)], ['Rate',number, free], ['Date',string, bound(1)]], ['=']). relation(‘companytable’, [[‘Industry’, string, bound(1)], [‘CompanyTicker’, string, free]], ['='])
DB2 XML Wrapper (Adapted from IBM). IBM DB2 • Uses wrapper to access non-relational data sources. • DB2 first decomposes the original query into query fragments and then sends them to wrappers. • Wrapper sends the result back to DB2 which then assembles the final results.
select price * 1.3 from YahooF where ticker in (‘GE’, ‘IBM’, ‘MSFT’); Query Fragment HXP: Price Table: YahooF Predicates: ticker in (‘GE’, ‘IBM’, ‘MSFT’) Request HXP: Price Table: YahooF Predicate: ticker = ‘IBM’ HXP: Price Table: YahooF Predicate: ticker = ‘MSFT’ HXP: Price Table: YahooF Predicate: ticker = ‘GE’ Wrapper plan 1 Wrapper plan 2 Wrapper plan 3 Request-Reply-Compensate protocol example Request-Reply-Compensate Protocol
Query Planning • Now we have a capability record defined. • Add a secondary mini query planner that is designed specifically to work with capability records. • Can answer queries involving multiple web sources. • Specify a query execution order of query fragments. • Independent query fragments are executed first. • Followed by dependent query fragments that can uses the prior results.
Independent query fragment SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = BIOTECHNOLOGY’ AND COMPANYTICKER < 'AD') Depends on the previous query fragment SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>] Our Solution • Example 1 • Find all the stock quotes of biotech companies. SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN (SELECT COMPANYTICKER FROM COMPANYTABLE WHERE INDUSTRY='BIOTECHNOLOGY' AND COMPANYTICKER <'AD'))
Independent query fragment SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = BIOTECHNOLOGY AND COMPANYTICKER < AD Depends on the previous query fragment SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>] Example Query COMPANYTICKER INDUSTRY --------------------------------------------- ACAD Biotechnology ACAM Biotechnology ACOR Biotechnology ACEL Biotechnology SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACAD SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACAM SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACOR SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACEL TICKER PRICE ------------------------------------------- ACAD 14.90 ACAM 6.51 ACOR 5.10 ACEL 3.18
Example 2 • Now you want the stock price in all major currencies. select yahooF.ticker, yahooF.price * exchange.rate, exchange.curency from • (select ticker, price from yahooF • where ticker IN (select companyticker from companytable • where industry=‘biotechnology’) • (select currency, olsen.rate from • (select currency from currency_map • where currency <> ‘USD') currency_map, • (select exchanged, 'USD', rate, ‘12/10/06' • from olsen where expressed= 'USD' and date=‘12/10/06') olsen • where currency_map.currency = olsen.exchanged and • currency_map.currency <> 'USD ') as exchange
Example 2 Get all the exchange rates against the USD on Dec 10 2006 select olsen.rate, from (select currency, from currency_map where currency <> ‘USD')currency_map, (select exchanged, ‘USD', rate, ‘12/10/06' from olsen where expressed=‘USD' and date=‘12/10/06') olsen, where currency_map.currency = olsen.exchanged and currency_map.currency <> ‘USD' Query fragment 1 Query fragment 2
Query fragment 2 Capability record (select exchanged, ‘USD', rate, ’12/10/06' from olsen where expressed=‘USD' and date=’12/10/06’) olsen relation(olsen, [['Exchanged',string, bound(1)], ['Expressed',string, bound(1)], ['Rate',number, free], ['Date',string, bound(1)]], ['=']). select olsen.rate from (select currency from currency_map where currency <> 'USD') currency_map, (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed= 'USD' and date='12/10/06') olsen, where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD' (select exchanged, 'USD', rate, ’12/10/2006' from olsen where expressed= 'USD' and date='12/10/06' and exchanged in (select currency from currency_map where currency<>’USD’)) Modified Query fragment 2
TICKER PRICE ------------------------------------------- ACAD 14.90 ACAM 6.51 ACOR 5.10 Currency rate ---------------------------------------- AUD 1.46 CAD 1.32 HKD 7.72 YPY 113.00 select ticker, price * exchange.rate, exchanged.currency TICKER PRICE($) PRICExRATE CURRENCY ------------------------------------------------------------------------------------------------------------- ACAD 14.90 21.754 AUD ACAD 14.90 19.668 CAD ACAD 14.90 115.028 HKD ACAD 14.90 1683.7 YPY ACAM 6.51 9.505 AUD ACAM 6.51 8.593 CAD ACAM 6.51 50.257 HKD ACAM6.51 735.63 YPY ACOR 5.10 7.446 AUD ACOR 5.10 6.732 CAD ACOR 5.10 39.372 HKD ACOR 5.10 576.3 YPY
Conclusion • Three-layered architecture for querying web sources. • Instead of burying capability handling in each wrapper, we created a generic capability handler. • Using this capability handler, adding a web source to a federated database is as simple as declaring the extraction rules and capability record for the source. • This was implemented and successfully tested. • This makes millions of semi-structured web sites into useful “databases.”