Pi-Web Join in a Web Warehouse S S Bhowmick, S K Madria, W-K Ng, E-P Lim Nanyang Technological University Singapore
Presentation Overview • Search engines & web query systems • Research objectives • Current research • WHOWEDA • Pi Web Join • Benefits of Pi-Web Join • Summary DASFAA’ 99
“If you build it, they will come” • The WWW is chaotic • It is increasingly difficult to locate information • Related data are scattered in a piecemeal fashion • Data, data everywhere... but how to find it?
Limitations of Search Engines • Do not exploit hyperlinks • No efficient document management • Query results cannot be further manipulated
Current Web Research • Web query systems: • W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog • Semistructured data: • LOREL, UnQL, WebOQL • Website management system: • STRUDEL
Context of this research • Build a web warehouse • Web data access • Historical web data • Information over time • Web data manipulation • Efficient visualization of web information • Maintenance of web data • Web data mining • Overcome existing limitations
WHOWEDA - What? • WareHouse Of Web Data • Subject - oriented • Integrated • Temporal • Granularity - Lower, higher • Some summary • Not updatable • Alternative information sources
Web Information Coupling System • A system to couple and manipulate related web information • Web data model • Web objects • Web algebra
Web Objects • Node: url, title, format, size, date, text • Link: source-url, target-url, label, link-type • Web tuple - Set of nodes and links • Web table - Collection of web tuples • Web schema
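The web objects listed above can be sketched as plain data classes. This is a minimal illustration, not WHOWEDA's actual representation: the attribute names follow the slide, while the default values, the `"global"` link type, and the `/aids` URL are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Attributes named on the slide: url, title, format, size, date, text
    url: str
    title: str
    format: str = "html"   # assumed default
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    # Attributes named on the slide: source-url, target-url, label, link-type
    source_url: str
    target_url: str
    label: str
    link_type: str = "global"   # assumed value, the slide does not list link types


@dataclass(frozen=True)
class WebTuple:
    # A web tuple is a set of nodes and links
    nodes: frozenset
    links: frozenset


home = Node(url="http://www.panacea.org/", title="Panacea")
aids = Node(url="http://www.panacea.org/aids", title="AIDS")  # hypothetical URL
link = Link(home.url, aids.url, label="List of Diseases")

# A web table is modeled here as a list (collection) of web tuples
diseases = [WebTuple(frozenset({home, aids}), frozenset({link}))]
```

Frozen data classes are hashable, which lets nodes and links sit inside sets, matching the "set of nodes and links" wording.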
Web Schema • Metadata in the warehouse • Structural ‘summary’ of web table • Coupling of related information begins with a query graph • Query graph -> Web schema • Ordered 4-tuple: • Set of node variables • Set of link variables • Connectivities • Predicates
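The ordered 4-tuple can be written down directly as a small record. The sketch below is one possible encoding under stated assumptions: connectivities are modeled as (source node variable, link variable, target node variable) triples, predicates are kept as free-form strings, and the link variable names `e1` to `e3`, the graph shape, and the predicate syntax are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebSchema:
    node_vars: frozenset        # set of node variables
    link_vars: frozenset        # set of link variables
    connectivities: frozenset   # assumed form: (source var, link var, target var)
    predicates: tuple           # assumed form: free-form condition strings

    def is_well_formed(self) -> bool:
        # Every connectivity must refer only to declared variables
        return all(
            src in self.node_vars and lnk in self.link_vars and dst in self.node_vars
            for src, lnk, dst in self.connectivities
        )


# Hypothetical encoding of a schema rooted at http://www.panacea.org/
diseases_schema = WebSchema(
    node_vars=frozenset({"x", "z", "q", "p"}),
    link_vars=frozenset({"e1", "e2", "e3"}),
    connectivities=frozenset({("x", "e1", "z"), ("z", "e2", "q"), ("z", "e3", "p")}),
    predicates=(
        'x.url = "http://www.panacea.org/"',
        'e2.label contains "treatment"',
        'e3.label contains "evaluation"',
    ),
)

# A schema whose connectivity mentions an undeclared node variable
broken_schema = WebSchema(
    node_vars=frozenset({"x"}),
    link_vars=frozenset({"e1"}),
    connectivities=frozenset({("x", "e1", "z")}),
    predicates=(),
)
```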
Example 1 • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/ • Web table: Diseases
Figure: Query Graph (Web Schema) for Example 1, rooted at http://www.panacea.org/. Node variables shown: x, z, q, p. Labels shown: Disease List, symptoms, treatment, evaluation.
Figure: A web tuple in “Diseases”, starting at http://www.panacea.org/. Nodes shown: x0, z1, q1, p2. Text shown: List of Diseases, AIDS, Symptoms list, Treatment list, Evaluation, Elisa Test.
Example 2 • Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/ • Web table: Drugs
Figure: Query Graph (Web Schema) of “Drugs”, rooted at http://www.panacea.org/. Node variables shown: a, b, d, k. Labels shown: List of Diseases, Drug list, Side effects, Uses.
Figure: A web tuple in “Drugs”, starting at http://www.panacea.org/. Nodes shown: a0, b1, d1, k1. Text shown: List of Diseases, AIDS, Drug list, Indavir, Side effects, Side effects of Indavir, Use, Uses of Indavir.
Web Algebra • Formal foundation of data representation and manipulation in a web warehouse • Web operators: • Information access operator • Information manipulation operators • Web schema operators • Data visualization operators
Global Coupling - Information Access • To couple related information from the Web (ER 98) • Match portions of the web that satisfy the web schema • Input is a query graph • Output is a web table
Web Project • Eliminate irrelevant nodes from web tuples • Based on project conditions • Set of node variables • Start node variable and end-node variable • Node variable and depth of links • Used to isolate data of interest in a web table, allowing subsequent web queries to run over a smaller, more structured web table
Web Project • May create duplicate web tuples (web bag) • Duplicate web tuples are not removed automatically • Duplicates are retained to support knowledge discovery (FODO 98)
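A small sketch of how a web project can turn a web table into a web bag. It simplifies heavily: a web tuple is reduced to a dict from node variable to URL, only the "set of node variables" condition is implemented, and that set is read as the variables to keep (an assumption, since the slide does not say which way the condition works). All URLs below the site root are hypothetical.

```python
def web_project(table, keep):
    """Keep only the node variables in `keep`. Duplicates are NOT removed."""
    return [
        {var: url for var, url in web_tuple.items() if var in keep}
        for web_tuple in table
    ]


# Two tuples that differ only in the treatment node q (hypothetical URLs)
diseases = [
    {
        "x": "http://www.panacea.org/",
        "z": "http://www.panacea.org/aids",
        "q": "http://www.panacea.org/aids/treatment-a",
    },
    {
        "x": "http://www.panacea.org/",
        "z": "http://www.panacea.org/aids",
        "q": "http://www.panacea.org/aids/treatment-b",
    },
]

# Dropping q leaves two identical tuples: the result is a web bag
projected = web_project(diseases, keep={"x", "z"})
```

Returning a list rather than a set is what preserves the duplicates.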
Figure: A web project on “Diseases”, starting at http://www.panacea.org/. Nodes shown: x0, z1, p2. Text shown: List of Diseases, AIDS, Symptoms list, Evaluation.
Web Join Operator • Information manipulation operator (DEXA’ 98) • Manipulate information residing in a web warehouse to derive additional information • Harness useful, composite information from two web tables • Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries
Joinable Nodes • Nodes participating in the web join process • Expressed as a pair • Each node in the pair should have identical contents
Web Join • Combine two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes • Joinable nodes may be identified from the schemas of the two web tables • URLs of the joinable nodes are identical (assuming that the last modification date is the same)
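The join rule above can be sketched as a nested loop over the two web tables. This is an illustrative simplification, not the operator as implemented in WHOWEDA: web tuples are dicts from node variable to URL, the two tables are assumed to use disjoint variable names, modification dates are ignored, and "whenever there exist joinable nodes" is read as "at least one joinable pair has identical URLs". The pair `("x", "a")` and all URLs other than the site root are chosen for the example only.

```python
def web_join(left, right, joinable):
    """Concatenate a left and a right web tuple when some joinable pair
    (left_var, right_var) points at the same URL."""
    joined = []
    for lt in left:
        for rt in right:
            if any(lv in lt and lt[lv] == rt.get(rv) for lv, rv in joinable):
                # Variable names are assumed disjoint, so nothing is overwritten
                joined.append({**lt, **rt})
    return joined


diseases = [
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/aids"},
]
drugs = [
    {"a": "http://www.panacea.org/", "d": "http://www.panacea.org/indavir/side-effects"},
    {"a": "http://other.example/", "d": "http://other.example/drug"},  # no match
]

joined = web_join(diseases, drugs, joinable=[("x", "a")])
```

Only the first `drugs` tuple shares a URL with the `diseases` tuple, so the joined table contains a single concatenated tuple.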
Figure: Joined schema, rooted at http://www.panacea.org/. Node variables shown: x, z, q, p, b, d, k. Labels shown: Disease List, symptoms, treatment, evaluation, Drug list, Side effects, Uses.
Figure: Joined Tuple, starting at http://www.panacea.org/. Nodes shown: x0, z1, q1, p2, b1, d1, k1. Text shown: List of Diseases, AIDS, Symptoms list, Treatment list, Evaluation, Elisa Test, Drug list, Indavir, Side effects, Side effects of Indavir, Use, Uses of Indavir.
Motivation of Pi-web Join • Quite often the web join operation couples irrelevant nodes • In a complex web query with several web join operations, the size of the resultant web table can become very large, with many “contaminated” nodes • Pi-web join resolves this limitation by eliminating “contaminated” nodes • Reduces the size of the joined web table
Pi-web Join • Web join followed by web project • The projection conditions are specified by the user: conditions are similar to web project • We do not eliminate the joinable nodes • By retaining the joinable nodes we preserve the correlation between the information captured from two web tables • Pi-web join may result in a web bag
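Read as "web join followed by web project, with joinable nodes always retained", the operator can be sketched in a few lines. The same simplifications apply as for any toy version: web tuples are dicts from node variable to URL, the projection condition is a set of node variables to keep (an assumed reading), and the variable pair and sub-page URLs are hypothetical.

```python
def pi_web_join(left, right, joinable, keep):
    """Web join followed by web project; joinable node variables are never dropped."""
    retained = set(keep)
    for lv, rv in joinable:
        retained.update((lv, rv))   # joinable nodes survive the projection

    result = []
    for lt in left:
        for rt in right:
            if any(lv in lt and lt[lv] == rt.get(rv) for lv, rv in joinable):
                merged = {**lt, **rt}                      # web join step
                result.append(                             # web project step
                    {v: u for v, u in merged.items() if v in retained}
                )
    return result   # a list, so duplicate tuples (a web bag) are possible


diseases = [
    {
        "x": "http://www.panacea.org/",
        "z": "http://www.panacea.org/aids",
        "q": "http://www.panacea.org/aids/treatment",
    },
]
drugs = [
    {
        "a": "http://www.panacea.org/",
        "d": "http://www.panacea.org/indavir/side-effects",
        "k": "http://www.panacea.org/indavir/uses",
    },
]

pi_joined = pi_web_join(diseases, drugs, joinable=[("x", "a")], keep={"z", "d"})
```

Here `q` and `k` are projected away, while `x` and `a` stay even though they are not in `keep`, which is what preserves the correlation between the two source tables.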
Example 3 • Produce a list of diseases with their symptoms and side-effects starting from the web site at http://www.panacea.org/
Procedure • Perform web join on “Diseases” and “Drugs” • Project node variables b, k, q, p, node variables between a and q, node variables between b and k, node variables between b and d
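Figure: Pi-joined schema, rooted at http://www.panacea.org/. Node variables shown: x, z, d. Labels shown: Disease List, symptoms, Side effects.
Figure: Pi-joined Tuple, starting at http://www.panacea.org/. Nodes shown: x0, z1, d1. Text shown: List of Diseases, AIDS, Symptoms list, Side effects of Indavir.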
Benefits of Pi-web Join • Minimizes the amount of data transmitted over the network in distributed web join processing • Reduces the storage cost associated with a joined web table • Reduces the cognitive overhead associated with locating relevant nodes • Improves completeness of the schema by removing unbound nodes and links
Summary • Motivation • Introduced WHOWEDA • Web project & Web Join • Pi Web Join • For more information: www.cais.ntu.edu.sg:8000/~whoweda
Web Bags • A web table in which identical web tuples exist • Created by the web project operation • Multiplets: each collection of identical web tuples • Used for structure-based knowledge discovery (FODO 98): • Visible nodes • Luminous nodes • Luminous paths
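Multiplets can be found by grouping identical web tuples and counting them. The sketch below assumes web tuples are dicts from node variable to URL and treats two tuples as identical when those dicts are equal; it does not attempt the visible or luminous node analysis, which the slides only name. The sub-page URLs are hypothetical.

```python
from collections import Counter


def multiplets(bag):
    """Return (web tuple, count) for every tuple occurring more than once."""
    counts = Counter(frozenset(web_tuple.items()) for web_tuple in bag)
    return [(dict(key), n) for key, n in counts.items() if n > 1]


bag = [
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/aids"},
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/aids"},
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/cancer"},
]

found = multiplets(bag)
```

Converting each dict to a `frozenset` of its items makes it hashable, so `Counter` can group equal tuples regardless of key order.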