Detecting and Representing Relevant Page-Level Web Deltas Sanjay Kumar Madria Department of Computer Science Purdue University West Lafayette, IN 47907 skm@cs.purdue.edu
Current Situation of W3 • The Web allows information to change at any time and in any way • Two forms of changes • Existence • Structure and content modification • A change replaces its antecedents and leaves no trace of the previous document
Problems of Change Management • Problem: • Detecting, Representing and Querying these changes • The problem is challenging • Typical database approaches to detect changes based on triggering mechanisms are not usable • Information sources typically do not keep track of historical information in a format that is accessible to the outside user
Motivating Example • Assume that there is a web site at www.panacea.gov • Provides information related to drugs used for various diseases
Motivating Example • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) • information related to side effects and uses of drugs used for various diseases and • changes to this information at the page level compared to its previous version
Structure of www.panacea.gov • Web page at www.panacea.gov contains a list of diseases • Each link of a particular disease points to a web page containing a list of drugs used for prevention and cure of the disease • Hyperlinks associated with each drug point to documents containing a list of various issues related to a particular drug (description, manufacturers, clinical pharmacology, uses, side-effects etc) • From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug
A Snapshot as on 15th Jan [Figure: link structure of www.panacea.gov. Diseases: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence. Drugs: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject, with their Side effects and Uses pages.]
Some Changes • 25th January • Links related to Diabetes are removed • A new link containing information related to Parkinson’s Disease is added • Information related to issues, side-effects and uses of various drugs for Cancer is also modified
A Partial Snapshot as on 25th Jan [Figure: partial link structure of www.panacea.gov showing Parkinson’s Disease, Cancer and Diabetes, the drug Tolcapone, and Side effects and Uses pages.]
Some Changes • 30th January • Links related to Impotence are modified • Previously provided by www.pfizer.com • Now by www.panacea.gov • Inter-linked structure of the Web pages related to Caverject is also modified • Information about Viagra, a new drug for Impotence, is added
A Partial Snapshot as on 30th Jan [Figure: partial link structure of www.panacea.gov showing Impotence, the drugs Caverject, Vasomax and Viagra, and Side effects and Uses pages.]
Some Changes • 8th February • Link structure of Heart Disease is modified • Label Heart Disease is modified to Heart Disorder • Content of the pages dealing with side-effects and uses of Hirudin is updated • Inter-linked document structure of Niacin is modified • Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed
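On 8th February [Figure: partial link structure of www.panacea.gov showing Heart disorder and Alzheimer’s Disease, the drugs Hirudin and Niacin, and Side effects and Uses pages.]
A Snapshot as on 15th Feb [Figure: link structure of www.panacea.gov. Diseases: AIDS, Alzheimer’s Disease, Cancer, Heart disease, Parkinson’s Disease, Impotence. Drugs: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject, with Side effects and Uses pages.]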
Objectives • Web deltas - Changes to web information • Detecting and representing relevant page-level web deltas • changes that are relevant to the user’s query, not any arbitrary changes or web deltas • Restricted to page level • Detect those documents • which are added to the site • which are deleted from the site • which have undergone content or structural modification • Determine how these delta documents are related to one another and to other documents relevant to the user’s query
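The three kinds of page-level delta listed above can be illustrated with a minimal sketch. It assumes each snapshot is just a mapping from URL to a last-modification stamp; the function name `page_level_delta`, the URLs and the dates are hypothetical and do not come from the deck, and the sketch ignores how documents relate to the user's query.

```python
from typing import Dict, NamedTuple, Set


class PageDelta(NamedTuple):
    added: Set[str]      # URLs present only in the new snapshot
    deleted: Set[str]    # URLs present only in the old snapshot
    modified: Set[str]   # URLs in both snapshots whose stamp differs


def page_level_delta(old: Dict[str, str], new: Dict[str, str]) -> PageDelta:
    """Compare two snapshots that map URL -> last-modification stamp."""
    old_urls, new_urls = set(old), set(new)
    added = new_urls - old_urls
    deleted = old_urls - new_urls
    modified = {url for url in old_urls & new_urls if old[url] != new[url]}
    return PageDelta(added, deleted, modified)


# Hypothetical URLs and stamps, loosely following the running example.
old_snapshot = {
    "www.panacea.gov/diabetes": "day-10",
    "www.panacea.gov/cancer": "day-10",
    "www.panacea.gov/aids": "day-10",
}
new_snapshot = {
    "www.panacea.gov/parkinsons": "day-25",
    "www.panacea.gov/cancer": "day-25",
    "www.panacea.gov/aids": "day-10",
}
delta = page_level_delta(old_snapshot, new_snapshot)
```

Here the Diabetes page is reported as deleted, the Parkinson's page as added, and the Cancer page as modified, while the unchanged AIDS page appears in none of the three sets.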
The WHOWEDA Project • WHOWEDA: A WareHouse of WEb DAta • To design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web • Data model: WHOM (WareHouse Object Model)
Overview of WHOM • Our web warehouse can be conceived of as a collection of web tables • A set of web tuples and a set of web schemas represents a web table • A web tuple is a directed graph containing nodes and links and satisfies a web schema • Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks • Tree representation • Web algebra containing web operators to manipulate web tables • Global Coupling, Web Select, Web Join etc.
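A rough sketch of how a web tuple and a web table might be held in memory is given here. The class names, fields, URLs and the particular links between nodes are illustrative assumptions; the deck only states that a web tuple is a directed graph of nodes and links carrying content, metadata and structural information, and web schemas are left out of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    node_id: str          # e.g. "a0", as in the example web tables
    url: str              # metadata
    last_modified: str    # metadata
    title: str = ""       # a small stand-in for document content


@dataclass(frozen=True)
class Link:
    source: str           # node_id of the source document
    target: str           # node_id of the target document
    label: str = ""       # anchor text of the hyperlink


@dataclass
class WebTuple:
    """A directed graph of nodes (documents) and links (hyperlinks)."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_link(self, source: str, target: str, label: str = "") -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the tuple")
        self.links.append(Link(source, target, label))

    def successors(self, node_id: str) -> List[str]:
        return [link.target for link in self.links if link.source == node_id]


@dataclass
class WebTable:
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# Hypothetical tuple: the link layout and URLs are assumptions for illustration.
t = WebTuple()
t.add_node(Node("a0", "www.panacea.gov", "day-10"))
t.add_node(Node("b0", "www.panacea.gov/aids", "day-10", "drug list"))
t.add_node(Node("d0", "www.panacea.gov/aids/indavir/side-effects", "day-10", "side effects"))
t.add_link("a0", "b0", "AIDS")
t.add_link("b0", "d0", "Indavir")
drugs = WebTable("Drugs", [t])
```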
Overview of our approach • Step 1: Two snapshots of old and new relevant data are coupled from the Web using the global web coupling operation and materialized in two web tables. • Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables • Result is joined, left outer joined and right outer joined web tables • Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables. • Elaborate on these steps...
Step 1: Retrieving snapshots of Web data using Global Web Coupling
Web Query Specification • Features: • Draw a web query as a directed connected acyclic graph (also called a coupling query) • Query can also be specified in text form • Specify search conditions on the nodes and edges of the graph • Performed by the global web coupling operator
Coupling Query • Set of node variables Xn • Each variable represents a set of Web documents • Set of link variables Xl • Each variable represents a set of hyperlinks • Set of connectivities C in DNF defined over node and link variables • To specify the hyperlink structure of the documents • Set of predicates P defined over some of the node and link variables • Specify metadata, content or structural conditions • Set of coupling query predicates Q • Conditions on execution of the query
Example • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov • information related to side effects and uses of drugs used for various diseases • Result of the query is stored in the form of web table
Coupling Query • Xn = {a, b, d, k} • Xl = { - } • P = {p1, p2, p3, p4} • p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov” • p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list” • p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses” • p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”
Coupling Query • C = k1 AND k2 AND k3 • k1 = a < - > b • k2 = b < -{1, 6} > d • k3 = b < -{1, 3} > k • Q = {q1} • q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
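The query components Xn, Xl, P, C and Q from the two slides above can be written down as plain data, as in the sketch below. The class and field names are hypothetical, and reading `{1, 6}` as "a path of between 1 and 6 links" is an interpretation assumed here rather than something the slides state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predicate:
    variable: str    # node or link variable the predicate constrains
    kind: str        # METADATA, CONTENT or COUPLING_QUERY
    attribute: str   # e.g. "url" or "html.body.title"
    operator: str    # e.g. EQUALS or NON-ATTR-CONT
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int = 1   # assumed meaning of the first number in {1, 6}
    max_links: int = 1   # assumed meaning of the second number in {1, 6}

    def allows(self, path_length: int) -> bool:
        return self.min_links <= path_length <= self.max_links


@dataclass(frozen=True)
class CouplingQuery:
    node_variables: Tuple[str, ...]            # Xn
    link_variables: Tuple[str, ...]            # Xl
    predicates: Tuple[Predicate, ...]          # P
    connectivities: Tuple[Connectivity, ...]   # C, here a single conjunction
    query_predicates: Tuple[Predicate, ...]    # Q


query = CouplingQuery(
    node_variables=("a", "b", "d", "k"),
    link_variables=(),
    predicates=(
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTR-CONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTR-CONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTR-CONT", "side effects"),
    ),
    connectivities=(
        Connectivity("a", "b"),          # k1
        Connectivity("b", "d", 1, 6),    # k2
        Connectivity("b", "k", 1, 3),    # k3
    ),
    query_predicates=(
        Predicate("b", "COUPLING_QUERY", "polling_frequency", "EQUALS", "30 days"),
    ),
)
```

Under the assumed reading, `query.connectivities[1].allows(4)` is true because a path of four links from b to d lies within the `{1, 6}` bound.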
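Pictorial Representation [Figure: coupling query graph. Node a (www.panacea.gov) links to node b (“drug list”); b connects to d (“side effects”) through an edge labelled {1, 6} and to k (“uses”) through an edge labelled {1, 3}.]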
Web Table Drugs (15th Jan) • a0 b0 u0 d0 k0 (AIDS, Indavir) • a0 b0 u1 d1 k1 (AIDS, Ritonavir) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b5 d12 k12 (Alzheimer’s Disease, Ibuprofen) • a0 b3 d4 k5 (Diabetes, Albuterol) • a0 b4 u4 u5 u6 d5 k6 (Impotence, Vasomax) • a0 b4 u7 d6 u8 k7 (Impotence, Caverject) • a0 b2 u2 d3 k3 (Heart Disease, Hirudin)
Web Table New Drugs (15th Feb) • a0 b0 u0 d0 k0 (AIDS, Indavir) • a0 b0 u1 d1 k1 (AIDS, Ritonavir) • a0 b2 u2 d3 k3 (Heart Disorder, Hirudin) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b2 u3 d7 k7 (Heart Disorder, Niacin) • a0 b4 u9 d8 k8 (Impotence, Vasomax) • a0 b4 u7 d6 k7 (Impotence, Caverject) • a0 b4 u12 d9 k9 (Impotence, Viagra) • a0 b6 u10 d10 k10 (Parkinson’s Disease, Tolcapone)
Web Join • Information composition operator • Combines two web tables into a single web table under certain conditions • Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes • Two nodes are joinable if they are identical • Two nodes are identical if their URLs and last modification dates are the same • The joined web tuple is stored in a different web table
Web Join • Join web tables Drugs and New Drugs • Nodes which have not undergone any changes are the joinable nodes in these two web tables. • Content modified nodes, new nodes and deleted nodes cannot be joinable nodes
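The joinability rule above (same URL and same last modification date) can be sketched as follows. This is a simplified illustration, not the WHOWEDA operator itself: a web tuple is reduced to a mapping from node id to a `(url, last_modified)` pair, links are ignored, a joined tuple is kept as a triple rather than a concatenated graph, and all URLs and stamps are hypothetical.

```python
from typing import Dict, List, Set, Tuple

NodeKey = Tuple[str, str]        # (url, last_modified)
WebTuple = Dict[str, NodeKey]    # node id -> identifying key
Joined = Tuple[WebTuple, WebTuple, Set[NodeKey]]


def joinable_nodes(t1: WebTuple, t2: WebTuple) -> Set[NodeKey]:
    """Nodes are identical when URL and last modification date both match."""
    return set(t1.values()) & set(t2.values())


def web_join(left: List[WebTuple], right: List[WebTuple]) -> List[Joined]:
    """Pair every left tuple with every right tuple that shares a joinable node."""
    joined: List[Joined] = []
    for t1 in left:
        for t2 in right:
            shared = joinable_nodes(t1, t2)
            if shared:
                joined.append((t1, t2, shared))
    return joined


# Hypothetical data: the root page a0 changed between snapshots, the AIDS
# pages did not, Diabetes disappeared and Parkinson's Disease is new.
drugs = [
    {"a0": ("www.panacea.gov", "day-10"),
     "b0": ("www.panacea.gov/aids", "day-02"),
     "d0": ("www.panacea.gov/aids/indavir/side-effects", "day-02")},
    {"a0": ("www.panacea.gov", "day-10"),
     "b3": ("www.panacea.gov/diabetes", "day-02")},
]
new_drugs = [
    {"a0": ("www.panacea.gov", "day-25"),
     "b0": ("www.panacea.gov/aids", "day-02"),
     "d0": ("www.panacea.gov/aids/indavir/side-effects", "day-02")},
    {"a0": ("www.panacea.gov", "day-25"),
     "b6": ("www.panacea.gov/parkinsons", "day-25")},
]
result = web_join(drugs, new_drugs)
```

Only the two AIDS tuples join, on nodes b0 and d0; the modified root page a0 is not joinable, so the Diabetes and Parkinson's tuples stay out of the joined table.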
Joined web table [Figure: six joined web tuples, numbered (1) to (6): (1) Indavir, AIDS; (2) Ritonavir, AIDS; (3) Indavir with Ritonavir, AIDS; (4) Niacin, Heart Disorder with Hirudin, Heart Disease; (5) Caverject, Impotence; (6) Hirudin, Heart Disease with Heart Disorder.]
Types of web tuples • Web tuples in which all the nodes are joinable • Results of joining two versions of web tuples that have remained unchanged during the transition • Web tuples in which • some of the nodes are joinable nodes • remaining nodes are the result of insertion, deletion or modification operations [Figure: joined web tuple (5), Caverject, Impotence.]
Types of web tuples • Tuples in which • some of the nodes are joinable nodes • out of the remaining nodes, some are the result of insertion, deletion or modification and • the rest remained unchanged during the transition [Figure: joined web tuple (3), Indavir and Ritonavir, AIDS.]
Outer Web Join • Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table • Outer web join enables us to identify them • Left outer web join • Right outer web join
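A small sketch of how dangling web tuples might be collected is shown here. It assumes that the left (right) outer web join result holds only the dangling tuples of the left (right) web table, which is what the example tables in this deck suggest; the tuple representation, URLs and stamps are hypothetical.

```python
from typing import Dict, List, Tuple

NodeKey = Tuple[str, str]        # (url, last_modified)
WebTuple = Dict[str, NodeKey]    # node id -> identifying key


def dangling(table: List[WebTuple], other: List[WebTuple]) -> List[WebTuple]:
    """Tuples of `table` that share no identical node with any tuple of `other`."""
    return [
        t for t in table
        if not any(set(t.values()) & set(o.values()) for o in other)
    ]


def left_outer_web_join(left: List[WebTuple], right: List[WebTuple]) -> List[WebTuple]:
    return dangling(left, right)


def right_outer_web_join(left: List[WebTuple], right: List[WebTuple]) -> List[WebTuple]:
    return dangling(right, left)


# Hypothetical old and new snapshots; the root page a0 was modified.
drugs = [
    {"a0": ("www.panacea.gov", "day-10"),
     "b0": ("www.panacea.gov/aids", "day-02")},
    {"a0": ("www.panacea.gov", "day-10"),
     "b3": ("www.panacea.gov/diabetes", "day-02")},
]
new_drugs = [
    {"a0": ("www.panacea.gov", "day-25"),
     "b0": ("www.panacea.gov/aids", "day-02")},
    {"a0": ("www.panacea.gov", "day-25"),
     "b6": ("www.panacea.gov/parkinsons", "day-25")},
]
left_only = left_outer_web_join(drugs, new_drugs)
right_only = right_outer_web_join(drugs, new_drugs)
```

In this data the Diabetes tuple dangles on the left and the Parkinson's tuple dangles on the right, which matches the intuition that left outer results point at deletions and right outer results at insertions.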
Right Outer Web Join • a0 b4 u9 d8 k8 (Impotence, Vasomax) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b4 u12 d9 k9 (Impotence, Viagra) • a0 b6 u10 d10 k10 (Parkinson’s Disease, Tolcapone)
Types of web tuples • New web tuples which are added during the transition • These tuples contain some new nodes, and the content of the remaining nodes has changed • Tuples in which all the nodes have undergone content modification • Tuples which existed before and in which some of the nodes are new and the content of the remaining nodes has changed
Left Outer Web Join • a0 b3 d4 k5 (Diabetes, Albuterol) • a0 b4 u4 u5 u6 d5 k6 (Impotence, Vasomax) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b5 d12 k12 (Alzheimer’s Disease, Ibuprofen)
Types of web tuples • Web tuples which are deleted during the transition • These tuples do not occur in the new web table • Tuples in which all the nodes have undergone content modification • Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed
Overview • Input • Joined, left outer joined and right outer joined web tables • Output • Set of delta web tables