Detecting and Representing Relevant Web Deltas in WHOWEDA

Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, madrias@umr.edu • Based on IEEE ICDCS’00 and IEEE TKDE (under minor revision)

Current Situation of W3 • The Web allows information to change at any time and in any way • Two forms of changes • Existence • Structure and content modification • A changed document replaces its antecedents, leaving no trace of the previous document

Problems of Change Management • Problems: • Detecting, representing and querying these changes • The problem is challenging • Typical database approaches that detect changes through triggering mechanisms are not usable • No access rights, no support for triggers • Information sources typically do not keep historical information in a format that is accessible to outside users

Applications • Provides the framework for • Web Site Administrator • Trend analysis and Mining • E-commerce • Customers of E-commerce Web Site • Competitive Intelligence : Product and Price comparisons • Notification Services (with PDA)

Objectives • Web deltas - Changes to web data • Detecting and representing relevant page-level web deltas • changes that are relevant to user’s query, not any arbitrary changes or web deltas • Restricted to page level • Detect those documents • which are added to the site • deleted from the site • those documents which have undergone content or structural modification • How these delta documents are related to one another and with other documents relevant to the user’s query

Related Work • Lore (Stanford): change management (SIGMOD’97 and ICDE’98) • Contrast • OEM based, not applied on the Web • WebCQ (Georgia Tech) • Needs a set of URLs • No interdocument changes • Htmldiff (AT&T) • Input: two versions • Output: marked-up copy highlighting the changes • Contrast • Difficult to browse in case of a large file • Ours is based on a query, not on any arbitrary change

Change Mgmt in DBMS • Two Approaches • Snapshot collection at times t1, t2,….. • Snapshot deltas, D and Ds at time t1, t2,….. • Contrast – we use snapshot delta approach, but with semi-structured data

Motivating Example • Assume that there is a web site at www.panacea.gov • Provides information related to drugs used for various diseases • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) • information related to side effects and uses of drugs used for various diseases and • changes to this information at the page level compared to its previous version

Structure of www.panacea.gov • www.panacea.gov contains a list of diseases • The link for each disease points to a web page containing a list of drugs used for prevention and cure of the disease • Hyperlinks associated with each drug point to documents containing a list of various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects etc.) • From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug

A Snapshot as on 15th Jan (figure; diseases: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence; drugs: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject; pages: Uses, Side effects)

A Partial Snapshot as on 25th Jan (figure; labels: www.panacea.gov, Parkinson’s Disease, Tolcapone, Cancer, Diabetes, Uses, Side effects; annotations: “New Link”, “update”)

A Partial Snapshot as on 30th Jan (figure; labels: www.panacea.gov, Impotence, Caverject, Vasomax, Viagra, Uses, Side effects)

On 8th February (figure; labels: www.panacea.gov, Heart disorder, Alzheimer’s Disease, Hirudin, Niacin, Uses, Side effects)

A Snapshot as on 15th Feb (figure; diseases: AIDS, Alzheimer’s Disease, Cancer, Heart disease, Parkinson’s Disease, Impotence; drugs: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject; pages: Uses, Side effects)

Types of Changes • Insert Node • Delete Node • Update Node (update contents) • Insert Link – same as either Insert node or update node • Delete Link – same as either delete node or update node • Update link – same as update node

WHOWEDA* Project Key Objectives • Design a suitable data model to store web data, called WHOM (Warehouse of Object Model) • Development of web algebra and query language to extract and manipulate web data • Change Management of Web data • Development of knowledge discovery and web mining tools • *Joint project with NTU, Singapore

Overview of WHOM • Collection of web tables • Set of web tuples and a set of web schemas represents a web table • Web tuple - directed graph containing nodes and links and satisfies a web schema • Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks • Tree representation (Can handle XML) • Web algebra containing web operators to manipulate web tables • Global Coupling, Web Select, Web Join etc.

Step 1: Retrieving Snapshots of Web Data Using Coupling Query Graph • Example • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov • information related to side effects and uses of drugs used for various diseases • The result of the query is stored in the form of a web table

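Pictorial Representation (figure: coupling query graph; node a is www.panacea.gov and links to node b, “drug list”; b links to node d, “side effects”, with label {1, 6}, and to node k, “uses”, with label {1, 3})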

Coupling Query • Set of node variables Xn, Xn = {a, b, d, k} • Each variable represents a set of Web documents • Set of link variables Xl, Xl = { - } • Each variable represents a set of hyperlinks • Set of predicates P defined over some of the node and link variables • P = {p1, p2, p3, p4} • p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov” • p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list” • p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses” • p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”

Coupling Query (continued) • Set of connectivities C defined over node and link variables • To specify the hyperlink structure of the documents • Specify metadata, content or structural conditions • C = k1 AND k2 AND k3 • k1 = a < - > b • k2 = b < -{1, 6} > d • k3 = b < -{1, 3} > k • Set of coupling query predicates Q • Conditions on execution of the query • Q = {q1} • q1(G) = COUPLING_QUERY:: G:polling_frequency EQUALS “30 days”
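The coupling query above can be written down as plain data. The following is a minimal sketch in Python, not WHOWEDA's actual representation (which the slides do not show). The class names `Predicate`, `Connectivity` and `CouplingQuery`, and the helper `undeclared_variables`, are hypothetical. The sketch stores the `{1, 6}` and `{1, 3}` labels as an uninterpreted pair of bounds.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Predicate:
    variable: str   # node variable the predicate constrains, e.g. "a"
    kind: str       # "METADATA" or "CONTENT"
    path: str       # attribute or tag path, e.g. "url" or "html.body.title"
    operator: str   # "EQUALS" or "NON-ATTR-CONT"
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    # (1, 6) for "-{1, 6}"; None for a plain "-" link (assumed encoding)
    bounds: Optional[Tuple[int, int]] = None


@dataclass
class CouplingQuery:
    node_vars: Set[str]
    predicates: List[Predicate]
    connectivities: List[Connectivity]
    polling_frequency: str

    def undeclared_variables(self) -> Set[str]:
        """Variables used in P or C that are missing from Xn."""
        used = {p.variable for p in self.predicates}
        for c in self.connectivities:
            used.update((c.source, c.target))
        return used - self.node_vars


drugs_query = CouplingQuery(
    node_vars={"a", "b", "d", "k"},
    predicates=[
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTR-CONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTR-CONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTR-CONT", "side effects"),
    ],
    connectivities=[
        Connectivity("a", "b"),
        Connectivity("b", "d", (1, 6)),
        Connectivity("b", "k", (1, 3)),
    ],
    polling_frequency="30 days",
)
```

Keeping Xn, P, C and Q as separate fields mirrors the four parts of the query on the slides; the consistency check is only a convenience added for this sketch.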

Web Table Drugs (15th Jan) • a0 b0 u0 d0 k0 (disease: AIDS; drug: Indavir) • a0 b0 u1 d1 k1 (disease: AIDS; drug: Ritonavir) • a0 b1 d2 k2 (disease: Cancer; drug: Beta Carotene) • a0 b5 d12 k12 (disease: Alzheimer’s Disease; drug: Ibuprofen)

Web Table Drugs (15th Jan) • a0 b3 d4 k5 (disease: Diabetes; drug: Albuterol) • a0 b4 u4 u5 u6 d5 k6 (disease: Impotence; drug: Vasomax) • a0 b4 u7 d6 u8 k7 (disease: Impotence; drug: Caverject) • a0 b2 u2 d3 k3 (disease: Heart Disease; drug: Hirudin)

Web Table New Drugs (15th Feb) • a0 b0 u0 d0 k0 (disease: AIDS; drug: Indavir) • a0 b0 u1 d1 k1 (disease: AIDS; drug: Ritonavir) • a0 b2 u2 d3 k3 (disease: Heart Disorder; drug: Hirudin) • a0 b1 d2 k2 (disease: Cancer; drug: Beta Carotene)

Web Table New Drugs (15th Feb) • a0 b2 u3 d7 k7 (disease: Heart Disorder; drug: Niacin) • a0 b4 u9 d8 k8 (disease: Impotence; drug: Vasomax) • a0 b4 u7 d6 k7 (disease: Impotence; drug: Caverject)

Web Table New Drugs (15th Feb) • a0 b4 u12 d9 k9 (disease: Impotence; drug: Viagra) • a0 b6 u10 d10 b6 k10 (disease: Parkinson’s Disease; drug: Tolcapone)

Storage of Web Objects • Warehouse node pool: distinct nodes; each node has a node-id and version-ids • Warehouse document pool: actual documents • Web table pool • Table node pool: type identifier name that the node and link represent in the schema, link-id, version-ids, URL of the node, target node-id, label, and link type of the link • Web tuple pool: ids of all the nodes and links belonging to a web tuple • Web schema pool: stores the web schema and coupling query

Step 2: Performing Web Join, Left and Right Outer Web Join • Web Join • Combine two web tables by concatenating two web tuples whenever there exist joinable nodes • Two nodes are joinable if they are identical • Two nodes are identical if the URL and last modification date of the nodes are the same • The joined web tuple is stored in a different web table
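The identity test above (same URL and same last modification date) can be sketched as a set intersection. This is an illustrative sketch only. The `Node` type, the `joinable_nodes` function, the page URLs below www.panacea.gov and the dates are all hypothetical, and it assumes each web tuple is simply a Python tuple of nodes.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str
    last_modified: str  # last modification date of the document


WebTuple = Tuple[Node, ...]


def joinable_nodes(old_table: List[WebTuple], new_table: List[WebTuple]) -> Set[Node]:
    """Nodes that are identical (same URL, same last modification date) in both tables."""
    old_nodes = {node for web_tuple in old_table for node in web_tuple}
    new_nodes = {node for web_tuple in new_table for node in web_tuple}
    # NamedTuple equality compares both fields, so the intersection keeps
    # exactly the nodes whose URL and date agree across the two versions.
    return old_nodes & new_nodes


# Hypothetical data: the AIDS page changed between the two snapshots.
root = Node("www.panacea.gov", "2000-01-10")
aids_jan = Node("www.panacea.gov/aids", "2000-01-10")
aids_feb = Node("www.panacea.gov/aids", "2000-02-01")

drugs = [(root, aids_jan)]
new_drugs = [(root, aids_feb)]
shared = joinable_nodes(drugs, new_drugs)
```

Here only the root page is joinable, because the AIDS page carries a different modification date in the second table.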

Web Join • Join web tables Drugs and New Drugs • Nodes which have not undergone any changes are the joinable nodes in these two web tables. • Content modified nodes, new nodes and deleted nodes cannot be joinable nodes
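A naive version of the join can be sketched as a pairwise comparison: two web tuples are concatenated whenever they share at least one identical node. This is an assumption-laden sketch, not the algorithm from the slides, whose details are not reproduced in this transcript. `Node`, `web_join`, the page URLs and the dates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    url: str
    last_modified: str


WebTuple = Tuple[Node, ...]


def web_join(old_table: List[WebTuple], new_table: List[WebTuple]) -> List[WebTuple]:
    """Concatenate every (old, new) pair of web tuples that shares a joinable node."""
    joined = []
    for old in old_table:
        for new in new_table:
            if set(old) & set(new):        # at least one identical node
                joined.append(old + new)   # concatenated web tuple
    return joined


# Hypothetical data: the root and the Indavir page were modified, the AIDS
# page was not, Diabetes disappeared and Parkinson's Disease was added.
root_jan = Node("www.panacea.gov", "2000-01-10")
root_feb = Node("www.panacea.gov", "2000-01-25")
aids = Node("www.panacea.gov/aids", "2000-01-10")
indavir_jan = Node("www.panacea.gov/aids/indavir", "2000-01-10")
indavir_feb = Node("www.panacea.gov/aids/indavir", "2000-02-01")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-10")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")

drugs = [(root_jan, aids, indavir_jan), (root_jan, diabetes)]
new_drugs = [(root_feb, aids, indavir_feb), (root_feb, parkinsons)]
joined_table = web_join(drugs, new_drugs)
```

Only the two AIDS tuples are joined, through the unchanged AIDS page. A node shared by every tuple, such as an unchanged root, would make this naive version join every pair, so a practical implementation would need to restrict which nodes are compared.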

Joined web table (figure: joined web tuples (1) Indavir, (2) Ritonavir, (3) Indavir joined with Ritonavir; disease: AIDS; node ids a0, b0, u0, u1, d0, d1, k0, k1)

Joined Web Table (figure: joined web tuples (4) Niacin, Heart Disorder, joined with Hirudin, Heart Disease; (5) Caverject, Impotence)

Joined Table (figure: joined web tuple (6) Hirudin, joining the Heart Disease and Heart Disorder versions; node ids a0, b2, u2, d3, k3)

Types of web tuples • Web tuples in which all the nodes are joinable • Result of joining two versions of a web tuple that has remained unchanged during the transition • Web tuples in which • some of the nodes are joinable nodes • remaining nodes are the result of insertion, deletion or modification operations • (figure: joined web tuple (5), Caverject, Impotence)

Types of web tuples (continued) • Tuples in which • some of the nodes are joinable nodes • out of the remaining nodes, some are the result of insertion, deletion or modification and • the others remained unchanged during the transition, but may be joinable with other nodes • (figure: joined web tuple (3), Indavir joined with Ritonavir, AIDS)

Algorithm for Computing joinable nodes

Algorithm of web join

Algorithm of web join (continued)

Outer Web Join • Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table • Outer web join enables us to identify them • Left outer web join • Right outer web join
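Dangling web tuples can be found with a simple filter. The sketch below assumes that a tuple participates in the join when it shares at least one identical node (same URL and last modification date) with some tuple of the other table. `Node`, `dangling_tuples`, the page URLs and the dates are hypothetical, and the old table is taken as the left operand and the new table as the right operand.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    url: str
    last_modified: str


WebTuple = Tuple[Node, ...]


def dangling_tuples(table: List[WebTuple], other: List[WebTuple]) -> List[WebTuple]:
    """Web tuples of `table` that share no identical node with any tuple of `other`."""
    other_nodes = {node for web_tuple in other for node in web_tuple}
    return [web_tuple for web_tuple in table if not (set(web_tuple) & other_nodes)]


# Hypothetical data: Diabetes was removed and Parkinson's Disease was added.
root_jan = Node("www.panacea.gov", "2000-01-10")
root_feb = Node("www.panacea.gov", "2000-01-25")
aids = Node("www.panacea.gov/aids", "2000-01-10")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-10")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")

drugs = [(root_jan, aids), (root_jan, diabetes)]        # old (left) web table
new_drugs = [(root_feb, aids), (root_feb, parkinsons)]  # new (right) web table

left_outer = dangling_tuples(drugs, new_drugs)   # only in the old table
right_outer = dangling_tuples(new_drugs, drugs)  # only in the new table
```

With this data the Diabetes tuple is left-dangling and the Parkinson's Disease tuple is right-dangling, while both AIDS tuples take part in the join.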

Types of web tuples (Right Outer) • New web tuples which are added during the transition • These tuples contain some new nodes, and the content of the remaining nodes has changed • Tuples in which all the nodes have undergone content modification • Tuples which existed before and in which some of the nodes are new and the content of the remaining nodes has changed

Web Table New Drugs (15th Feb) • a0 b0 u0 d0 k0 (disease: AIDS; drug: Indavir) • a0 b0 u1 d1 k1 (disease: AIDS; drug: Ritonavir) • a0 b2 u2 d3 k3 (disease: Heart Disorder; drug: Hirudin) • a0 b1 d2 k2 (disease: Cancer; drug: Beta Carotene)

Web Table New Drugs (15th Feb) • a0 b2 u3 d7 k7 (disease: Heart Disorder; drug: Niacin) • a0 b4 u9 d8 k8 (disease: Impotence; drug: Vasomax) • a0 b4 u7 d6 k7 (disease: Impotence; drug: Caverject)

Web Table New Drugs (15th Feb) • a0 b4 u12 d9 k9 (disease: Impotence; drug: Viagra) • a0 b6 u10 d10 b6 k10 (disease: Parkinson’s Disease; drug: Tolcapone)

Types of web tuples (Left Outer) • Web tuples which are deleted during the transition • These tuples do not occur in the new web table • Tuples in which all the nodes have undergone content modification • Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed

Web Table Drugs (15th Jan) • a0 b1 d2 k2 (disease: Cancer; drug: Beta Carotene) • a0 b5 d12 k12 (disease: Alzheimer’s Disease; drug: Ibuprofen) • a0 b0 u0 d0 k0 (disease: AIDS; drug: Indavir) • a0 b0 u1 d1 k1 (disease: AIDS; drug: Ritonavir)

Web Table Drugs (15th Jan) • a0 b3 d4 k5 (disease: Diabetes; drug: Albuterol) • a0 b4 u4 u5 u6 d5 k6 (disease: Impotence; drug: Vasomax) • a0 b4 u7 d6 u8 k7 (disease: Impotence; drug: Caverject) • a0 b2 u2 d3 k3 (disease: Heart Disease; drug: Hirudin)

Algorithm of outer web join

Algorithm of outer web join (continued)

Step 3: Generating Delta Web Tables • Input • Joined, left outer joined and right outer joined web tables • Output • Set of delta web tables

Delta Web Tables • Encapsulate the relevant changes that have occurred in the Web with respect to a user’s query • Three types • Delta+ web table • Contains a set of tuples containing new nodes inserted during transition • Delta- web table • Set of web tuples containing nodes removed during the transition • Delta-M web table • Set of web tuples representing the previous and current sets of modified nodes

Steps for Generation • Phase 1: Delta Nodes Identification Phase • Nodes which are added, deleted or modified during the transition are identified • Input: Old and new versions of the web tables and a set of joinable nodes from the joined web table • Output: • Nodes which exist in the new web table but not in the old web table are the new nodes • Nodes which exist in the old web table but not in the new one are the deleted nodes • Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
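The three output rules of Phase 1 translate directly into set operations. The sketch below assumes that a node "exists" in a web table when its URL appears there, so that two versions of the same page can be matched; that matching rule, the `Node` type, the function name `identify_delta_nodes`, the page URLs and the dates are all hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str
    last_modified: str


WebTuple = Tuple[Node, ...]


def identify_delta_nodes(
    old_table: List[WebTuple],
    new_table: List[WebTuple],
    joinable: Set[Node],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the URLs of (new, deleted, modified) nodes."""
    old_urls = {node.url for web_tuple in old_table for node in web_tuple}
    new_urls = {node.url for web_tuple in new_table for node in web_tuple}
    joinable_urls = {node.url for node in joinable}

    added = new_urls - old_urls                        # only in the new table
    deleted = old_urls - new_urls                      # only in the old table
    modified = (old_urls & new_urls) - joinable_urls   # in both, but not joinable
    return added, deleted, modified


# Hypothetical data: the root page changed, Diabetes went away,
# Parkinson's Disease appeared and the AIDS page stayed the same.
root_jan = Node("www.panacea.gov", "2000-01-10")
root_feb = Node("www.panacea.gov", "2000-01-25")
aids = Node("www.panacea.gov/aids", "2000-01-10")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-10")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")

drugs = [(root_jan, aids), (root_jan, diabetes)]
new_drugs = [(root_feb, aids), (root_feb, parkinsons)]

added, deleted, modified = identify_delta_nodes(drugs, new_drugs, joinable={aids})
```

The joinable set is passed in rather than recomputed, matching the stated input of this phase.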
