WHOWEDA is a web warehouse that allows direct querying and analysis of web data, enabling organizations and individuals to extract value from their web information assets. It provides a subject-oriented, integrated, and time-variant repository of web data for decision making.
WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 skm@cs.purdue.edu copy-right@sanjay madria
www.is.a.mess
WWW • Huge, widely distributed, heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.
Characteristics of WWW • WWW is a set of directed graphs • Data in the WWW has a heterogeneous nature • Unstructured versus structured information • No central authority to manage information • Dynamic versus static information • Web information discovery - search engines
As the WWW grows, the more chaotic it becomes • The Web is a fast-growing, distributed, non-administered global information resource • WWW allows access to text, image, video, sound and graphic data • More business organizations are creating web servers - e-commerce • More chaotic environment in which to locate information of interest • Lost-in-hyperspace syndrome
WWW data - Does it affect the corporate world? • Lack of credibility of data • Different sites with different data • Same site different data • Historical information is not available • Previous versions of web data • How does web data change with time • Summarization over time • Data to information • Reduction in productivity • Analysis is manual
How users find web sites • Indexes and search engines 75 • UseNet newsgroups 44 • Cool lists 27 • New lists 24 • Listservers 23 • Print ads 21 • Word-of-mouth and e-mail 17 • Linked web advertisement 4
Limitations of Search Engines • Do not exploit hyperlinks (Google has recently begun to) • Search is limited to string matching • Keyword-oriented search • Queries are evaluated on archived data rather than up-to-date data; no indexing on current data • Low accuracy • Replicated results • No further manipulation possible
Limitations of Search Engines (continued) • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery
Current Research Projects • Web Query Systems • W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, XML-QL • Semistructured Data • LOREL, UnQL, WebOQL • Website Management System • STRUDEL • Web Warehouse - WHOWEDA
WHOWEDA - Key Objectives • Design of a suitable data model to represent web information • Development of a web algebra and query language • Maintenance of web data • Development of knowledge discovery and web mining tools • Web warehouse
WHOWEDA - What? • WareHouse Of Web Data • Subject - oriented • Integrated • Temporal • Granularity - Lower, higher • Some summary • Not updatable • Alternative information sources
Web Warehouse? • Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making • A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses
WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda • A WareHouse Of WEb DAta • Web Information Coupling Model (WICM) • Web Objects • Web Schema • Web Information Coupling Algebra • Web Information Maintenance • Web Mining and Knowledge Discovery
[Figure: WHOWEDA architecture. Components shown: User; WWW; Web Querying & Analysis Component; Web Information Mining System; Web Information Coupling System; Web Information Maintenance System; Web Warehouse; Web Marts; Warehouse Concept Mart.]
[Figure: WHOWEDA component diagram. Components shown: User; WWW; Web Query & Display; Warehouse Concept Mart; Web Warehouse; Pre-processing; Global Web Coupling; Global Web Manipulation; Global Ranking; Local Web Coupling; Local Web Manipulation; Local Ranking; Schema Tightness; Schema Search; Schema Match; Data Visualization; Web Select; Web Project; Web Join; Web Union; Web Intersection.]
Web Objects • Node - url, title, format, size, date, text • Link - source-url, target-url, label, link-type • Web tuple • Web table • Web schema • Web database
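The web objects listed above can be pictured as plain data records. The following is a minimal sketch, not WHOWEDA's actual implementation: the field names follow the node and link attributes on the slide, while the class names, the treatment of a web tuple as a small graph of nodes and links, and the example URL for the Issues page are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """A web document, with the node attributes listed on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink, with the link attributes listed on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # e.g. "local"; the set of allowed values is assumed


@dataclass
class WebTuple:
    """Assumed shape: a small directed graph of nodes and links."""
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)


@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


home = Node(url="http://www.panacea.org/", title="Home")
# Hypothetical URL, used only to have a second node to link to.
issues = Node(url="http://www.panacea.org/issues", title="Issues")
link = Link(home.url, issues.url, label="Issues", link_type="local")
diseases = WebTable("Diseases", [WebTuple([home, issues], [link])])
```

A web database would then simply be a collection of such web tables together with their schemas.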
Web Schema • Metadata in the warehouse • Structural ‘summary’ of a web table • Information coupling using a query graph • Query graph -> Web schema • Directed graph represented by an ordered 4-tuple: • Set of node variables • Set of link variables • Connectivities • Predicates
[Figure: structure of the Information Square site. Labels: Information Square's homepage; Headline article 1 ... Headline article n; News@TCS; Local news 1 ... Local news k (list of video files); List of links to local news; News specials; World news 1 ... World news t; Airport info; List of links to world news.]
[Figure: web schema (query graph) for the site. Node variables x, y, z, w; link variables e, f, g, h. Annotations: target_url CONTAINS "article"; label CONTAINS "Local News"; target_url CONTAINS "newshub/specials"; url CONTAINS "local"; label CONTAINS "World News"; url CONTAINS "world"; url CONTAINS "headlines".]
[Figure: labels shown: Information Square's homepage; Headline article 1; List of links to local news; Local news 1; News specials; World news 1; List of links to world news.]
Schema - example • Node variables: Xn = { x, y, z, w } • Link variables: Xl = { e, f, g, h } • Connectivities: C = { x<e>y and x<fg->z and x<fh->w } • The symbol # represents an unbound node variable or link variable, that is, a variable not restricted by any predicate. • "-" represents one unbound link • "-+" represents more than one unbound link
Predicates • P = { x.url = "http://www.mediacity.com.sg/i-square", • y.url CONTAINS "headlines", • e.target_url CONTAINS "article", • f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world" }
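To make the 4-tuple concrete, the schema and predicates above can be written down as data and checked against one candidate binding of variables to pages and links. This is a rough sketch under stated assumptions: the `WebSchema` class, the `holds` and `satisfies` helpers, the case-sensitive reading of CONTAINS, and every URL in `binding` other than the i-square home page are hypothetical. Connectivities are stored as opaque strings and are not evaluated here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (variable, attribute, operator, value); only "=" and "CONTAINS" are handled.
Predicate = Tuple[str, str, str, str]
Binding = Dict[str, Dict[str, str]]


@dataclass
class WebSchema:
    node_vars: List[str]
    link_vars: List[str]
    connectivities: List[str]  # kept as text; not interpreted in this sketch
    predicates: List[Predicate]


def holds(pred: Predicate, binding: Binding) -> bool:
    """Evaluate one predicate against the attributes bound to its variable."""
    var, attr, op, value = pred
    actual = binding.get(var, {}).get(attr, "")
    if op == "=":
        return actual == value
    if op == "CONTAINS":
        return value in actual  # case-sensitive substring test (assumed)
    raise ValueError(f"unsupported operator: {op}")


def satisfies(schema: WebSchema, binding: Binding) -> bool:
    """True when every predicate of the schema holds for the binding."""
    return all(holds(p, binding) for p in schema.predicates)


BASE = "http://www.mediacity.com.sg/i-square"
schema = WebSchema(
    node_vars=["x", "y", "z", "w"],
    link_vars=["e", "f", "g", "h"],
    connectivities=["x<e>y", "x<fg->z", "x<fh->w"],
    predicates=[
        ("x", "url", "=", BASE),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
        ("f", "target_url", "CONTAINS", "newshub/specials"),
        ("g", "label", "CONTAINS", "Local News"),
        ("z", "url", "CONTAINS", "local"),
        ("h", "label", "CONTAINS", "World News"),
        ("w", "url", "CONTAINS", "world"),
    ],
)

# Hypothetical pages and links, invented only to exercise the predicates.
binding: Binding = {
    "x": {"url": BASE},
    "y": {"url": BASE + "/headlines/article1.html"},
    "e": {"target_url": BASE + "/headlines/article1.html"},
    "f": {"target_url": BASE + "/newshub/specials/index.html"},
    "g": {"label": "Local News"},
    "z": {"url": BASE + "/local/1.html"},
    "h": {"label": "World News"},
    "w": {"url": BASE + "/world/1.html"},
}
matches = satisfies(schema, binding)
```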
Query Graph - Example 1 • Query graph - same as a schema except that it has one more parameter to control the results returned. • Informally, it is a directed connected graph consisting of nodes, links and keywords imposed on them. • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/ • Web table: Diseases
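[Figure: query graph for the web table Diseases, rooted at http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g. Labels: Issues; List of Diseases; Symptoms list; Symptoms; Evaluation; Treatment list; Treatment.]
[Figure: an example web tuple of Diseases for AIDS, rooted at http://www.panacea.org/. Identifiers shown: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels: Issues; List of Diseases; AIDS; Symptoms list; Symptoms; Evaluation; Elisa Test; Treatment list; Treatment.]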
Example 2 • Produce a list of drugs, and their uses and side effects, starting from the web site at http://www.panacea.org/ • Web table: Drugs
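[Figure: query graph for the web table Drugs, rooted at http://www.panacea.org/. Variables shown: a, b, c, d, k, r, s. Labels: Issues; List of Diseases; Drug list; Use; Uses; Side effects.]
[Figure: an example web tuple of Drugs for Indavir, rooted at http://www.panacea.org/. Identifiers shown: a0, b1, c1, d1, k1, r1, s1. Labels: Issues; List of Diseases; AIDS; Drug list; Indavir; Use; Uses of Indavir; Side effects; Side effects of Indavir.]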
Query Language • Starting from the CS dept. home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and that have "database" in their text.
COUPLE WEBTABLE W FROM WWW
SUCH THAT NODE I, J IN WWW AND LINK e, f, g IN WWW AND I<e|f,g>J
WHERE I.url EQUALS "http://www.ntu.edu.sg"
AND J.text CONTAINS "database"
AND f.link-type EQUALS local
AND g.link-type EQUALS local;
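The intent of such a query can be illustrated on a toy in-memory graph. The sketch below is not how WHOWEDA evaluates COUPLE statements; it is a plain breadth-first traversal that follows only links typed "local", stops after `max_hops` links, and keeps pages whose text contains a keyword. The function name `couple`, the `max_hops` parameter, and every URL other than http://www.ntu.edu.sg are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy "web": url -> (page text, [(target url, link type), ...]).
Web = Dict[str, Tuple[str, List[Tuple[str, str]]]]


def couple(web: Web, start: str, keyword: str, max_hops: int) -> List[str]:
    """URLs reachable from start over 1..max_hops local links whose text has keyword."""
    found: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, hops = queue.popleft()
        if hops == max_hops:
            continue  # path-length bound reached; do not expand further
        for target, link_type in web.get(url, ("", []))[1]:
            if link_type != "local" or target in seen:
                continue  # skip non-local links and pages already visited
            seen.add(target)
            if keyword in web.get(target, ("", []))[0]:
                found.append(target)
            queue.append((target, hops + 1))
    return found


START = "http://www.ntu.edu.sg"
A, B, C = START + "/a", START + "/b", START + "/c"  # hypothetical pages
EXT = "http://example.org/"  # hypothetical external page

toy_web: Web = {
    START: ("CS home page", [(A, "local"), (EXT, "global")]),
    A: ("database research group", [(B, "local")]),
    B: ("courses on database systems", [(C, "local")]),
    C: ("database lab", []),
    EXT: ("database vendor", []),
}

result = couple(toy_web, START, "database", max_hops=2)
```

With `max_hops=2`, page C is excluded because it is three links away, and the external page is excluded because its link is not local.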
Web Algebra • Formal foundation of data representation and manipulation in a web warehouse • Web operators: • Information access operator • Information manipulation operators • Web schema operators • Data visualization operators
Information access operator • Global Web Coupling
Information Manipulation Operators • Web select • Web project • Web join • Web cartesian product • Web union • Web intersect • Local web coupling
Web Select • Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities • Input is a select schema • Output is a web table satisfying the select schema
Select the W1 tuples that contain world news about Indonesia since May 1, 1998. • σ_Ms(W1), where Ms = < Xsn, Xsl, Cs, Ps >, Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
Xn' = { x, y, z, w }, Xl' = { e, f, g, h } • C' = { x<e>y and x<fg->z and x<fh->w } • P' = { x.url = "http://www.mediacity.com.sg/i-square", x.date > "1May1998", • e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world", • w.text CONTAINS "Indonesia" }
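The select step in this example can be sketched as a filter over web tuples. This is an illustrative approximation only: web tuples are reduced to dictionaries keyed by node variable, the two predicates mirror Ps (a date after 1 May 1998 on x, and "Indonesia" in the text of w), and the function name `web_select` and all sample tuples are invented for the example.

```python
from datetime import date
from typing import Callable, Dict, List

# A toy web tuple: node variable -> attribute dictionary.
WebTuple = Dict[str, Dict[str, object]]
Predicate = Callable[[WebTuple], bool]


def web_select(table: List[WebTuple], predicates: List[Predicate]) -> List[WebTuple]:
    """Keep the tuples for which every select predicate holds."""
    return [t for t in table if all(p(t) for p in predicates)]


# Ps from the slide, written as Python callables.
select_predicates: List[Predicate] = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

# Invented sample data standing in for web table W1.
W1: List[WebTuple] = [
    {"x": {"date": date(1998, 5, 20)}, "w": {"text": "Indonesia talks resume"}},
    {"x": {"date": date(1998, 4, 2)}, "w": {"text": "Indonesia election news"}},
    {"x": {"date": date(1998, 6, 1)}, "w": {"text": "World Cup preview"}},
]

selected = web_select(W1, select_predicates)
```

Only the first sample tuple passes both predicates. The schema of the result is the input schema with the select predicates added, which is what the P' set above shows.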
Web Information Coupling System • A database system to couple related web information • Global web Coupling and Local Web Coupling
Global Coupling - Information Access • To integrate data from the Web • To create historical data • To couple related information from the WWW satisfying a query graph • Operator to create web tables • From web with no schema to web table with web schema
Why local web coupling? • Directly querying the WWW to gather this information is an expensive and repetitive affair • Web documents containing similar information can reside in different web tables in a web warehouse • A mechanism to gather this similar information by additional manipulation of the materialized web tables
Local Web Couple operator • Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.
Local Web Couple operator • The web couple operator is basically a web cartesian product followed by web select: • We denote web couple by the symbol: copy-right@sanjay madria
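The "cartesian product followed by select" reading of local web coupling can be sketched as follows. This is a toy approximation under several assumptions: web tuples are reduced to dictionaries from node variable to page title, "similar information" is taken to mean equal titles after trimming and lowercasing, and the caller names which pairs of node variables to compare. The function names, the `coupling_pairs` parameter, and the assignment of titles to the variables y, z, b, and d are hypothetical; the titles themselves come from the Diseases and Drugs examples.

```python
from itertools import product
from typing import Dict, List, Tuple

WebTuple = Dict[str, str]  # node variable -> page title (toy representation)
Pair = Tuple[WebTuple, WebTuple]


def similar(a: str, b: str) -> bool:
    """Assumed similarity test: equal after trimming and lowercasing."""
    return a.strip().lower() == b.strip().lower()


def web_cartesian_product(t1: List[WebTuple], t2: List[WebTuple]) -> List[Pair]:
    """Every combination of one tuple from each table."""
    return list(product(t1, t2))


def local_web_couple(
    t1: List[WebTuple],
    t2: List[WebTuple],
    coupling_pairs: List[Tuple[str, str]],
) -> List[Pair]:
    """Cartesian product, then keep pairs with at least one similar node pair."""
    return [
        (a, b)
        for a, b in web_cartesian_product(t1, t2)
        if any(similar(a[u], b[v]) for u, v in coupling_pairs)
    ]


diseases = [
    {"y": "AIDS", "z": "Symptoms of AIDS"},
    {"y": "Cancer", "z": "Symptoms of Cancer"},
    {"y": "Diabetes", "z": "Symptoms of Diabetes"},
]
drugs = [
    {"b": "AIDS", "d": "Indavir"},
    {"b": "AIDS", "d": "Ritonavir"},
    {"b": "Cancer", "d": "Letrozole"},
    {"b": "Heart Disorder", "d": "Beta Carotene"},
]

coupled = local_web_couple(diseases, drugs, coupling_pairs=[("y", "b")])
summary = [(a["y"], b["d"]) for a, b in coupled]
```

Materializing the full product is fine for a sketch but grows with the product of the table sizes; a real system would presumably index the coupling nodes instead.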
Web Coupling
Example 1 • Produce a list of diseases and their symptoms starting from the web site at http://www.panacea.org/ • Web table: Diseases
[Figure: web schema or query graph of "Diseases", rooted at http://www.panacea.org/. Variables shown: x, y, z, e. Labels: Issues; List of Diseases; symptoms.]
[Figure: web table "Diseases" with four web tuples rooted at http://www.panacea.org/ (x0), one each for AIDS, Diabetes, Cancer, and Lung Disease, with pages Symptoms of AIDS, Symptoms of Diabetes, Symptoms of Cancer, and Symptoms of Lung Diseases. Identifiers shown: y0-y3, z0-z3, e0-e3. Labels: Issues; List of Diseases; symptoms.]
Example 2 • Produce a list of drugs and their side effects starting from the web site at http://www.panacea.org/ • Web table: Drugs
[Figure: web schema or query graph of "Drugs", rooted at http://www.panacea.org/. Variables shown: a, b, c, d, r. Labels: Issues; List of Diseases; Drug list; Side effects.]
[Figure: web table "Drugs" with four web tuples rooted at http://www.panacea.org/ (a0): Indavir (AIDS), Ritonavir (AIDS), Letrozole (Cancer), and Beta Carotene (Heart Disorder), with pages Side effects of Indavir, Side effects of Ritonavir, Side effects of letrozole, and Side effects of Beta Carotene. Identifiers shown: b1, b2, b4; c1-c4; d1-d4; r1-r4. Labels: Issues; List of Diseases; Drug list; Side effects.]