
WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 skm@cs.purdue.edu

www.is.a.mess

WWW • collection of multimedia documents in the form of web pages connected via hyperlinks.

Characteristics of WWW • WWW is a set of directed graphs • data in the WWW has a heterogeneous nature • unstructured versus structured information • no central authority to manage information • Dynamic versus static information • Web information discoveries - search engines

As the WWW grows, the more chaotic it becomes • Web is a fast-growing, distributed, non-administered global information resource • WWW allows access to text, image, video, sound and graphic data • more business organizations creating web servers • more chaotic environment to locate information of interest • lost in hyperspace syndrome

Does it affect the corporate world? • Lack of credibility of data • Different sites with different data • Same site different data • Historical information is not available • Previous versions of web data • How does web data change with time • Summarization over time • Data to information • Reduction in productivity • Analysis is manual

How users find web sites • Indexes and search engines 75 • UseNet newsgroups 44 • Cool lists 27 • New lists 24 • Listservers 23 • Print ads 21 • Word-of-mouth and e-mail 17 • Linked web advertisement 4

Limitations of Search Engines • Do not exploit hyperlinks • search is limited to string matching • Queries are evaluated on archived data rather than up-to-date data; no indexing on current data • low accuracy • replicated results • no further manipulation possible

Limitations of Search Engines • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery

Current Research Projects • Web Query System • W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog • Semistructured Data • LOREL, UnQL, WebOQL • Website Management System • STRUDEL • Web Warehouse - WHOWEDA

WHOWEDA - Key Objectives • Design a suitable data model to represent web information • Development of web algebra and query language • Maintenance of Web data • Development of knowledge discovery and web mining tools • Web warehouse

WHOWEDA - What? • WareHouse Of Web Data • Subject - oriented • Integrated • Temporal • Granularity - Lower, higher • Some summary • Not updatable • Alternative information sources

What is a Web Warehouse? • Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making • A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses

WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda • A WareHouse Of WEb DAta • Web Information Coupling Model (WICM) • Web Objects • Web Schema • Web Information Coupling Algebra • Web Information Maintenance • Web Mining and Knowledge discovery

[Figure: WHOWEDA architecture diagram. Labels: User, WWW, Warehouse Concept Mart, Web Querying & Analysis Component, Web Information Mining System, Web Information Coupling System, Web Information Maintenance System, Web Warehouse, Web Marts.]

[Figure: WHOWEDA component diagram. Labels: User, WWW, Web Query & Display, Warehouse Concept Mart, Web Warehouse, Pre-processing, Data Visualization, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Schema Tightness, Schema Search, Schema Match, Web Select, Web Project, Web Join, Web Union, Web Intersection.]

Web Objects • Node - url, title, format, size, date, text • Link - source-url, target-url, label, link-type • Web tuple • Web table • Web schema • Web database
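The objects listed above could be modeled roughly as follows. This is a minimal sketch, not WHOWEDA's implementation: only the node and link attribute names come from the slide, while the class names, the layout of `WebTuple` and `WebTable`, the `link_type` values, and the sample URL `http://www.panacea.org/issues` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Metadata and content of one web document (attribute names from the slide)."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two documents (attribute names from the slide)."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # assumed values, e.g. "local" or "global"


@dataclass
class WebTuple:
    """Assumed layout: a small directed graph of nodes and links."""
    nodes: list[Node] = field(default_factory=list)
    links: list[Link] = field(default_factory=list)


@dataclass
class WebTable:
    """Assumed layout: a named collection of web tuples."""
    name: str
    tuples: list[WebTuple] = field(default_factory=list)


# Hypothetical sample data: two pages joined by one local link.
home = Node(url="http://www.panacea.org/", title="Panacea")
issues = Node(url="http://www.panacea.org/issues", title="Issues")
link = Link(source_url=home.url, target_url=issues.url,
            label="Issues", link_type="local")
table = WebTable(name="Diseases",
                 tuples=[WebTuple(nodes=[home, issues], links=[link])])
```

A web database would then simply be a collection of such tables together with their schemas.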

Web Schema • Metadata in the warehouse • Structural ‘summary’ of web table • Information Coupling using a Query graph • Query graph ->Web schema • directed graph represented by Ordered 4-tuple: • Set of node variables • Set of link variables • Connectivities • Predicates

[Figure: structure of the Information Square web site. Labels: Information Square's homepage; Headline article 1 to Headline article n; News@TCS (list of video files); List of links to local news; Local news 1 to Local news k; News specials; List of links to world news; World news 1 to World news t; Airport info.]

[Figure: schema graph for the Information Square example. Variables: x, y, z, w, e, f, g, h. Predicates shown: target_url CONTAINS "article"; label CONTAINS "Local News"; target_url CONTAINS "newshub/specials"; url CONTAINS "local"; label CONTAINS "World News"; url CONTAINS "world"; url CONTAINS "headlines".]

[Figure: Information Square example. Labels: Information Square's homepage; Headline article 1; List of links to local news; Local news 1; News specials; List of links to world news; World news 1.]

Schema - example • Node variables: Xn = { x, y, z, w } • Link variables: Xl = { e, f, g, h } • Connectivities: C = { x<e>y and x<fg->z and x<fh->w } • The symbol represents an anonymous node variable, a node variable not restricted by any predicate.

Predicates • P = { x.url = "http://www.mediacity.com.sg/i-square", • y.url CONTAINS "headlines", • e.target_url CONTAINS "article", • f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world" }
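To make the ordered 4-tuple concrete, the sketch below encodes part of the example schema (the x, e, y portion) and checks a candidate binding against its predicates. It is an illustration under stated assumptions: the `Predicate` and `WebSchema` classes, the dictionary-based binding, and the article URL are hypothetical, connectivities are kept as opaque strings rather than parsed, and only the EQUALS and CONTAINS operators from the slides are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    variable: str   # node or link variable, e.g. "x" or "e"
    attribute: str  # e.g. "url", "label", "target_url"
    op: str         # "EQUALS" or "CONTAINS"
    value: str

    def holds(self, obj: dict) -> bool:
        """Evaluate the predicate on the attribute dict bound to its variable."""
        actual = obj.get(self.attribute, "")
        if self.op == "EQUALS":
            return actual == self.value
        if self.op == "CONTAINS":
            return self.value in actual
        raise ValueError(f"unsupported operator: {self.op}")


@dataclass
class WebSchema:
    node_vars: set        # Xn
    link_vars: set        # Xl
    connectivities: list  # C, kept as opaque strings in this sketch
    predicates: list      # P

    def satisfied_by(self, binding: dict) -> bool:
        """binding maps each variable name to a dict of attribute values."""
        return all(
            p.variable in binding and p.holds(binding[p.variable])
            for p in self.predicates
        )


schema = WebSchema(
    node_vars={"x", "y"},
    link_vars={"e"},
    connectivities=["x<e>y"],
    predicates=[
        Predicate("x", "url", "EQUALS", "http://www.mediacity.com.sg/i-square"),
        Predicate("y", "url", "CONTAINS", "headlines"),
        Predicate("e", "target_url", "CONTAINS", "article"),
    ],
)

# Hypothetical binding; the article URL is invented for the example.
article_url = "http://www.mediacity.com.sg/i-square/headlines/article1.html"
binding = {
    "x": {"url": "http://www.mediacity.com.sg/i-square"},
    "y": {"url": article_url},
    "e": {"target_url": article_url},
}
matches = schema.satisfied_by(binding)
```

Checking that the bound nodes and links are actually connected as the connectivities require is left out here; a fuller version would parse C and verify the path structure as well.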

Query Graph - Example 1 • A query graph is the same as a schema except that it has one more parameter to control the results returned. • Informally, it is a directed connected graph consisting of nodes and links, with keywords imposed on them. • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/ • Web table Diseases

[Figure: query graph for the Diseases web table, starting at http://www.panacea.org/. Variables: x, y, z, w, e, f, g, p, q. Labels: Issues, List of Diseases, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation.]

[Figure: a sample web tuple of Diseases, starting at http://www.panacea.org/. Instances: x0, y1, z1, w1, e1, f1, g1, p2, q1. Labels: Issues, List of Diseases, AIDS, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Elisa Test.]

Example 2 • Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/ • Web table Drugs

[Figure: query graph for the Drugs web table, starting at http://www.panacea.org/. Variables: a, b, c, d, k, r, s. Labels: Issues, List of Diseases, Drug list, Side effects, Use, Uses.]

[Figure: a sample web tuple of Drugs, starting at http://www.panacea.org/. Instances: a0, b1, c1, d1, k1, r1, s1. Labels: Issues, List of Diseases, AIDS, Drug list, Indavir, Side effects, Side effects of Indavir, Use, Uses of Indavir.]

Query Language • Starting from the CS department home page at NTU, find all documents that are linked through paths of length at most two containing only local links, and have the keyword "database" in their text.

COUPLE WEBTABLE W FROM WWW SUCH THAT
NODE i, j IN WWW AND LINK e, f, g IN WWW AND i<e|f,g>j
WHERE i.url EQUALS "http://www.ntu.edu.sg"
AND j.text CONTAINS "database"
AND f.link-type EQUALS local
AND g.link-type EQUALS local;
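The effect of such a query can be illustrated on a tiny in-memory graph. The sketch below is not the WHOWEDA query processor: the function name `couple`, the `PAGES` and `LINKS` structures, and every URL other than the start page are hypothetical. It follows the prose description (every link on the path must be local), whereas the statement above constrains only f and g.

```python
# Hypothetical toy web: page text keyed by URL, and typed links between pages.
HOME = "http://www.ntu.edu.sg"
PAGES = {
    HOME: "School home page",
    HOME + "/research": "Research groups: database, networks",
    HOME + "/people": "Faculty list",
    HOME + "/people/db": "Members of the database group",
    "http://example.org/db": "An external database tutorial",
}
LINKS = [
    (HOME, HOME + "/research", "local"),
    (HOME, HOME + "/people", "local"),
    (HOME + "/people", HOME + "/people/db", "local"),
    (HOME, "http://example.org/db", "global"),
]


def couple(start_url, keyword, max_len=2):
    """Return link paths of length <= max_len from start_url that use only
    local links and end at a page whose text contains the keyword."""
    results = []
    frontier = [(start_url, [])]  # (current page, path taken so far)
    for _ in range(max_len):
        next_frontier = []
        for url, path in frontier:
            for src, dst, kind in LINKS:
                if src != url or kind != "local":
                    continue
                new_path = path + [(src, dst)]
                if keyword in PAGES.get(dst, ""):
                    results.append(new_path)  # one web tuple per matching path
                next_frontier.append((dst, new_path))
        frontier = next_frontier
    return results


paths = couple(HOME, "database")
targets = [path[-1][1] for path in paths]
```

Each returned path plays the role of one web tuple in the resulting web table W; the external page is skipped because it is reached over a non-local link.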

Web Algebra • Formal foundation of data representation and manipulation in a web warehouse • Web operators: • Information access operator • Information manipulation operators • Web schema operators • Data visualization operators

Information access operator • Global Web Coupling

Information Manipulation • Web select • Web project • Local web coupling • Web join • Web cartesian product • Web union • Web intersect

Web Select • Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities • Input is select Schema • Output is a web table satisfying the select schema

Select the tuples of W1 that contain world news about Indonesia since May 1, 1998. • σ_Ms(W1) where Ms = < Xsn, Xsl, Cs, Ps >, Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }

Xn' = { x, y, z, w }, Xl' = { e, f, g, h } • C' = { x<e>y and x<fg->z and x<fh->w } • P' = { x.url = "http://www.mediacity.com.sg/i-square", x.date > "1May1998", • e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world", • w.text CONTAINS "Indonesia" }
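The tuple-filtering part of this web select can be sketched as below. This is an illustration only: representing a web tuple as a dictionary from variable names to attribute dictionaries, the function name `web_select`, and the sample tuples are all assumptions, and the sketch does not compute the output schema (the extended predicate set P').

```python
from datetime import date


def web_select(web_table, conditions):
    """Keep the web tuples for which every selection condition holds."""
    return [t for t in web_table if all(cond(t) for cond in conditions)]


# Selection conditions from the example: x.date > 1 May 1998 and
# w.text CONTAINS "Indonesia".
conditions = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

# Hypothetical contents of W1; only the attributes used by the conditions
# are filled in.
W1 = [
    {"x": {"date": date(1998, 5, 20)}, "w": {"text": "Indonesia: talks resume"}},
    {"x": {"date": date(1998, 4, 2)}, "w": {"text": "Indonesia: currency falls"}},
    {"x": {"date": date(1998, 5, 21)}, "w": {"text": "World Cup preview"}},
]

selected = web_select(W1, conditions)
```

Only the first tuple survives: the second is too old and the third does not mention Indonesia.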

Web Information Coupling System • A database system to couple related web information • Global web Coupling and Local Web Coupling

Global Coupling - Information Access • To integrate data from the Web • To create historical data • To couple related information from the WWW satisfying a query graph • Operator to create web tables • From web with no schema to web table with web schema

Why local web coupling? • Directly querying the WWW to gather this information is an expensive and repetitive affair • Web documents containing similar information can reside in different web tables in a web warehouse • A mechanism is needed to gather this similar information by additional manipulation of the materialized web tables

Local Web Couple operator • Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.

Local Web Couple operator • The web couple operator is basically a web cartesian product followed by web select: • We denote web couple by the symbol:

Web Coupling

M2 = < Xn", Xl", C", P" > for W2 • Xn" = { s, t, u }, Xl" = { k, l, m, n }, • C" = { s<kl>t and s<mn>u }, • P" = { s.url = "http://www.asia1.com.sg/straitstimes/", • k.label = "REGION", • l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea*.html", • m.label = "WORLD", • n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld*.html" }

Web couple of W1 and W2 under the condition θ, where • θ = (x.date = s.date) & (w.text CONTAINS "Indonesia") & (t.text CONTAINS "Indonesia")

Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, h, k, l, m, n }, C* = { x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u } • P* = { x.url = "http://www.mediacity.com.sg/i-square", e.target_url CONTAINS "article", • f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world", • s.url = "http://www.asia1.com.sg/straitstimes/", • k.label = "REGION", l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea*.html", • m.label = "WORLD", • n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld*.html", • x.date = s.date, • w.text CONTAINS "Indonesia", • t.text CONTAINS "Indonesia" }

Local Web Coupling • Initiated explicitly by the user • User provides the pair of node variables and the keyword set based on which coupling is to be performed • Coupling nodes in each pair of web tuples in the input web tables must satisfy one of the coupling conditions

Construction of coupled table • First perform a web cartesian product on the two web tables • For each web tuple in the resultant web table • the specified instances of node variables are inspected to determine whether the web tuple satisfies the coupling compatibility condition(s)
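The two construction steps above can be sketched as a cartesian product followed by a filter. The slides do not spell out the coupling compatibility conditions, so this sketch assumes one simple condition: both nodes of a user-supplied pair contain at least one of the given keywords. The function names, the dictionary representation of web tuples (with variable names assumed disjoint between the two tables), and the sample data are hypothetical.

```python
from itertools import product


def web_cartesian_product(table_a, table_b):
    """Combine every tuple of table_a with every tuple of table_b.
    Assumes the two tables use disjoint variable names."""
    return [{**a, **b} for a, b in product(table_a, table_b)]


def local_web_couple(table_a, table_b, node_pairs, keywords):
    """Keep combined tuples in which some node pair shares a keyword
    (assumed coupling condition, see the text above)."""
    coupled = []
    for combined in web_cartesian_product(table_a, table_b):
        for var_a, var_b in node_pairs:
            text_a = combined[var_a]["text"]
            text_b = combined[var_b]["text"]
            if any(k in text_a and k in text_b for k in keywords):
                coupled.append(combined)
                break  # one satisfied pair is enough for this tuple
    return coupled


# Hypothetical contents of the two web tables, reduced to the coupled nodes.
W1 = [
    {"w": {"text": "Indonesia unrest continues"}},
    {"w": {"text": "Markets rally in Europe"}},
]
W2 = [
    {"t": {"text": "Regional report: Indonesia"}},
    {"t": {"text": "Weather outlook"}},
]

result = local_web_couple(W1, W2, node_pairs=[("w", "t")], keywords=["Indonesia"])
```

Of the four combined tuples, only the one pairing the two Indonesia pages is kept. Materializing the full product first mirrors the slide's description; filtering while iterating over the product would avoid building the intermediate table.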
