The WWW as a Database: WWW Query Languages
Curtis Dyreson
James Cook University (Townsville, Australia)
Aalborg University
Outline
• searching the WWW
• search engines
• WWW query languages
• WebSQL
• WWW graph
• cost
• Jumping Spider
• hybrid
Searching the WWW
• search engines
• Altavista, Infoseek, 2100 others!
• static architecture
• robot: periodic, slow, non-uniform coverage
• index: keywords to URLs, fast, ranking algorithm
• example query: Lecture notes on trees in a data structures course.
A Search Engine Index
[Figure: index entries for the keywords ‘data structures’, ‘lecture notes’, and ‘trees’, each pointing to the pages that contain it]
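
The keyword-to-URL index on this slide can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any real engine is built: the URLs and page texts are invented, the helper names `build_index` and `search` are hypothetical, and ranking is left out.

```python
from collections import defaultdict

# Hypothetical toy corpus: URLs and texts are invented for illustration.
PAGES = {
    "http://example.edu/cs/ds/": "data structures course home",
    "http://example.edu/cs/ds/notes/": "lecture notes for data structures",
    "http://example.edu/cs/ds/notes/trees.html": "lecture notes on trees",
}

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs of single pages that contain every query keyword."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the posting sets
    return result

INDEX = build_index(PAGES)
```

In this toy corpus `search(INDEX, "lecture notes trees")` finds the trees page, while `search(INDEX, "data structures trees")` finds nothing, because no single page holds all three words even though the concept spans several linked pages.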
WWW Query Languages
• search engines index single pages
• multi-page concepts
• hunting strategy
• search engine to nearby page
• manual search
• WWW query languages: WebSQL, W3QS, WebLog
WWW Graph Structure
• large (650K servers, 350M pages)
• dynamic, cyclic
• link = edge, page = node
WebSQL
• SQL-like
• search engine to find pages
• path expression (regular expression of links)
• text manipulation predicates
    SELECT <attribute list>
    FROM <document list>
    WHERE <predicate>;
WebSQL From Clause
• from clause collects a set of documents
• unstructured - primitive schema
    Document[URL, text, link to URL, modify date]
• MENTIONS - retrieve from search engine
    DOCUMENT x SUCH THAT x MENTIONS ‘data structures’
• full example
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y->* z
    WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
WebSQL From Clause
• path expression finds related documents
• URL
    DOCUMENT x SUCH THAT “http://www.cs.auc.dk”
• local link: ->
    DOCUMENT y SUCH THAT x -> y
• global link: =>
    DOCUMENT y SUCH THAT x => y
WebSQL From Clause
• at most one link: ?
    DOCUMENT y SUCH THAT x ->(->)? y
• any number of links: *
    DOCUMENT y SUCH THAT x ->* y
• alternation: |
    DOCUMENT y SUCH THAT x (=> | ->*) y
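
To make the link operators concrete, here is a rough Python sketch of how `->`, `=>`, `->(->)?`, and `->*` could be evaluated over a small in-memory link graph. It assumes that a local link stays on the same host and a global link leaves it; the graph, the URLs, and the helper names (`link_kind`, `step`, `one_or_two`, `star`) are all hypothetical, and this is not the WebSQL implementation.

```python
from urllib.parse import urlparse

# Hypothetical toy link graph: URL -> URLs it links to.
LINKS = {
    "http://a.example/": ["http://a.example/ds/", "http://b.example/"],
    "http://a.example/ds/": ["http://a.example/ds/trees.html", "http://a.example/"],
    "http://a.example/ds/trees.html": [],
    "http://b.example/": ["http://b.example/x.html"],
    "http://b.example/x.html": [],
}

def link_kind(src, dst):
    """Assumption: '->' (local) means same host, '=>' (global) means another host."""
    return "->" if urlparse(src).netloc == urlparse(dst).netloc else "=>"

def step(urls, kind):
    """Follow exactly one link of the given kind from every URL in urls."""
    return {dst for src in urls for dst in LINKS.get(src, [])
            if link_kind(src, dst) == kind}

def one_or_two(start, kind="->"):
    """Evaluate 'x ->(->)? y': one link, optionally followed by a second."""
    first = step({start}, kind)
    return first | step(first, kind)

def star(start, kind="->"):
    """Evaluate 'x ->* y': zero or more links; the seen set handles cycles."""
    seen = {start}
    frontier = {start}
    while frontier:
        frontier = step(frontier, kind) - seen
        seen |= frontier
    return seen
```

For example, `step({"http://a.example/"}, "=>")` yields only the page on the other host, while `star("http://a.example/")` stays on `a.example` and terminates despite the cycle back to the start page.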
WebSQL From Clause: Example
    FROM Document X SUCH THAT X MENTIONS ‘Java’,
         Document Y SUCH THAT X -> | ->-> Y
[Figure: pages that mention ‘Java’ and the pages reached from them by one or two local links]
WebSQL From Clause
• path expression limits search space
• local link, search limited to local machine
• global link, can go anywhere
• =>* would search all of WWW
• pre-analysis, filtering
• even three to four local links infeasible
WebSQL Where Clause
• like SQL
• CONTAINS, text search of retrieved document
• can push CONTAINS into navigation
    WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;
WebSQL Query
• Find lecture notes on trees in a data structures course.
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y->* z
    WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
[Figure sequence: evaluating the query on the WWW graph, with nodes labeled ‘data structures’, ‘lecture notes’, and ‘trees’: (1) data structures -> lecture notes, (2) lecture notes ->* trees, (3) Result]
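
The evaluation steps in the figures above can be traced with a small Python sketch. It is a sketch under stated assumptions only: the toy web, its URLs, and the helper names are invented, MENTIONS is approximated by a substring scan instead of a real search engine call, and a local link is assumed to mean "same host".

```python
from urllib.parse import urlparse

# Hypothetical toy web: URL -> (page text, outgoing links).
WEB = {
    "http://a.example/ds/": ("data structures course",
                             ["http://a.example/ds/notes/", "http://b.example/"]),
    "http://a.example/ds/notes/": ("lecture notes",
                                   ["http://a.example/ds/notes/trees.html",
                                    "http://a.example/ds/notes/lists.html"]),
    "http://a.example/ds/notes/trees.html": ("binary trees", []),
    "http://a.example/ds/notes/lists.html": ("linked lists", []),
    "http://b.example/": ("trees and shrubs", []),
}

def text(url):
    return WEB[url][0]

def local_links(url):
    """Links that stay on the same host (assumed meaning of '->')."""
    host = urlparse(url).netloc
    return [dst for dst in WEB[url][1] if urlparse(dst).netloc == host]

def local_closure(url):
    """All pages reachable by zero or more local links ('->*')."""
    seen, stack = {url}, [url]
    while stack:
        for dst in local_links(stack.pop()):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def run_query():
    """x MENTIONS 'data structures', x -> y, y ->* z, with the two CONTAINS tests."""
    results = set()
    for x in WEB:
        if "data structures" not in text(x):   # stand-in for MENTIONS
            continue
        for y in local_links(x):                # x -> y
            if "lecture notes" not in text(y):  # y CONTAINS 'lecture notes'
                continue
            for z in local_closure(y):          # y ->* z
                if "trees" in text(z):          # z CONTAINS 'trees'
                    results.add(z)
    return results
```

On this toy web the query returns only the trees page under the lecture notes; the page about trees on the other host is never reached because the path expression uses local links only.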
WebSQL Architecture • Java implementation
WWW Query Language - Drawbacks
• dynamic architecture
• O(k**p)
  - p is length of path expression
  - k is branching factor
• a priori knowledge of topology
• back links are a problem
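
A rough way to see how quickly dynamic navigation becomes expensive is to count the pages that must be fetched. The sketch below assumes, purely for illustration, that every page has the same number of links and that no page is reached twice; the function name `pages_visited` is hypothetical.

```python
def pages_visited(branching_factor, path_length):
    """Count pages fetched when following a path of `path_length` links
    from one start page, if every page has `branching_factor` links."""
    # One start page, then branching_factor**d pages at each depth d.
    return sum(branching_factor ** depth for depth in range(path_length + 1))
```

With 10 links per page, a path of three links touches 1,111 pages and a path of four links touches 11,111, each fetched over the network at query time.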
Jumping Spider - a Hybrid
• like a search engine
  - static architecture
  - keyword searches
• like a WWW query language
  - uses modified WWW graph
  - one kind of path expression
Kinds of Links
• content refinement queries are common
• heuristic: information in subdirectories is refined
• different kinds of links
  - back: subdirectory to parent
  - down: parent directory to subdirectory
  - side: unrelated directories
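
The three kinds of links can be told apart from the URLs alone. The Python sketch below is one possible reading of the slide, not the Jumping Spider code: it treats any ancestor directory as "parent" and any descendant as "subdirectory", labels links to another host as side links, and returns a fourth label, "same", for links within one directory, a case the slide does not cover. The function names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def directory(url):
    """Return (host, directory path ending in '/') for a URL."""
    parts = urlparse(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)  # drop the file name
        if not path.endswith("/"):
            path += "/"
    return parts.netloc, path

def link_kind(src, dst):
    """Classify a link as 'down', 'back', 'side', or 'same' (assumed extra case)."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"   # assumption: other hosts count as unrelated
    if src_dir == dst_dir:
        return "same"   # not covered by the slide
    if dst_dir.startswith(src_dir):
        return "down"   # directory to one of its subdirectories
    if src_dir.startswith(dst_dir):
        return "back"   # subdirectory to an enclosing directory
    return "side"       # unrelated directories
```

Because every directory path ends in `/`, the prefix test respects directory boundaries: `/ds2/` is not treated as a subdirectory of `/ds/`.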
[Figure sequence: evaluating the query on the modified WWW graph, with nodes labeled ‘data structures’, ‘lecture notes’, and ‘trees’: (1) data structures -> lecture notes, (2) lecture notes -> trees]
Analysis
• search engine index
  - adds a pertinent index
• pertinent index
  - O(n log n) to O(n**2) space
  - all URLs that can reach this URL
  - tree-like, so should be close to O(n log n)
• more intersections
• implemented in Perl 5
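
One way to picture the pertinent index is as a precomputed map from each URL to every URL that can reach it, so that a query becomes a series of set intersections instead of live navigation. The Python sketch below assumes the modified graph has already been built; the graph, the keyword index, and the function names are invented for illustration, and the actual Perl implementation may differ.

```python
from collections import defaultdict

# Hypothetical modified graph: URL -> URLs it links to.
EDGES = {
    "/ds/": ["/ds/notes/"],
    "/ds/notes/": ["/ds/notes/trees.html", "/ds/notes/lists.html"],
    "/ds/notes/trees.html": [],
    "/ds/notes/lists.html": [],
    "/botany/": ["/botany/trees.html"],
    "/botany/trees.html": [],
}

# Hypothetical keyword index: phrase -> URLs containing it.
KEYWORDS = {
    "data structures": {"/ds/"},
    "lecture notes": {"/ds/notes/"},
    "trees": {"/ds/notes/trees.html", "/botany/trees.html"},
}

def build_pertinent_index(edges):
    """For each URL, collect all URLs that can reach it."""
    reverse = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].add(src)
    pertinent = {}
    for url in edges:
        seen, stack = set(), [url]
        while stack:
            for parent in reverse[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        pertinent[url] = seen
    return pertinent

def chain_query(phrases, keywords, pertinent):
    """Evaluate 'p1 -> p2 -> ...' by intersecting pertinent sets."""
    current = set(keywords.get(phrases[0], set()))
    for phrase in phrases[1:]:
        current = {url for url in keywords.get(phrase, set())
                   if pertinent[url] & current}
    return current

PERTINENT = build_pertinent_index(EDGES)
```

Here `chain_query(["data structures", "lecture notes", "trees"], KEYWORDS, PERTINENT)` keeps the trees page under the lecture notes and drops the botany page, without fetching any page at query time. The stored sets are what drive the space cost: a deep chain of pages pushes it toward the quadratic end, a shallow tree keeps it small.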
Related Work
• WWW query languages
  - WebSQL (Arocena et al. - WWW6 ’97)
  - W3QS (Konopnicki and Shmueli - VLDB ’95)
  - WebLog (Lakshmanan et al. - RIDE ’96)
  - AKIRA (Lacroix et al. - ER ’97)
• Indexes that already use directories
  - Infoseek
  - WebGlimpse (Manber et al. - Usenix ’97)
• Semi-structured data models - many