
The WWW as a Database: WWW Query Languages



Presentation Transcript


  1. The WWW as a Database: WWW Query Languages
     Curtis Dyreson, James Cook University (Townsville, Australia), Aalborg University

  2. Outline
     • searching the WWW
     • search engines
     • WWW query languages
     • WebSQL
     • WWW graph
     • cost
     • Jumping Spider
     • hybrid

  3. Searching the WWW
     • search engines
     • AltaVista, Infoseek, 2100 others!
     • static architecture
     • robot: periodic, slow, non-uniform coverage
     • index: keywords to URLs, fast, ranking algorithm
     • example query: Lecture notes on trees in a data structures course.

  4-9. A Search Engine Index (figure build; labels: trees, data structures, lecture notes)
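The "index: keywords to URLs" idea can be sketched in a few lines. This is a minimal illustration, not the design of any engine named above; the three pages, their URLs and their text are invented. It also hints at why the example query is hard: no single page contains all of its keywords.

```python
from collections import defaultdict

# Hypothetical pages; URLs and text are invented for illustration.
PAGES = {
    "http://a.example/ds/": "data structures course home",
    "http://a.example/ds/notes/": "lecture notes for data structures",
    "http://a.example/ds/notes/trees.html": "trees binary trees and heaps",
}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(PAGES)
# One page holds both words, so the index answers this directly.
print(sorted(search(INDEX, "lecture notes")))
# The keywords are spread over three pages, so no single page matches.
print(sorted(search(INDEX, "trees data structures lecture notes")))
```

The second query returns nothing because the index is built per page, which is the gap the rest of the talk addresses.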

  10. WWW Query Languages
      • search engines index single pages
      • multi-page concepts
      • hunting strategy
      • search engine to nearby page
      • manual search
      • WWW query languages: WebSQL, W3QS, WebLog

  11. WWW Graph Structure
      • large (650K servers, 350M pages)
      • dynamic, cyclic
      • link = edge, page = node

  12. WebSQL
      • SQL-like
      • search engine to find pages
      • path expression (regular expression of links)
      • text manipulation predicates
          SELECT <attribute list>
          FROM <document list>
          WHERE <predicate>;

  13. WebSQL From Clause
      • from clause collects a set of documents
      • unstructured - primitive schema
      • MENTIONS - retrieve from search engine
          DOCUMENT x SUCH THAT x MENTIONS 'data structures'

  14. WebSQL From Clause
      • from clause collects a set of documents
      • unstructured - primitive schema
          Document[URL, text, link to URL, modify date]
      • MENTIONS - retrieve from search engine
          SELECT z.URL
          FROM DOCUMENT x SUCH THAT x MENTIONS 'data structures',
               DOCUMENT y SUCH THAT x -> y,
               DOCUMENT z SUCH THAT y ->* z
          WHERE y CONTAINS 'lecture notes'
            AND z CONTAINS 'trees';

  15. WebSQL From Clause
      • path expression finds related documents
      • URL
      • local link: ->
      • global link: =>
          DOCUMENT x SUCH THAT "http://www.cs.auc.dk"
          DOCUMENT y SUCH THAT x -> y
          DOCUMENT y SUCH THAT x => y

  16. WebSQL From Clause
      • at most one link: ?
      • any number of links: *
      • alternation: |
          DOCUMENT y SUCH THAT x ->(->)? y
          DOCUMENT y SUCH THAT x ->* y
          DOCUMENT y SUCH THAT x (=> | ->*) y
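Since a path expression is a regular expression over links, one way to picture its evaluation is to label each link as local (`L`, same host) or global (`G`, different host) and match the label string of a path against an ordinary regex. The sketch below does that on an invented link graph. The encoding (`->` as `L`, `=>` as `G`), the `max_len` bound and all URLs are assumptions made for illustration; this is not how the WebSQL system itself is implemented.

```python
import re
from urllib.parse import urlsplit

# Hypothetical link graph: URL -> list of linked URLs.
LINKS = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/y"],
    "http://a.example/y": [],
    "http://b.example/": ["http://b.example/z"],
    "http://b.example/z": [],
}


def link_kind(src, dst):
    """'L' for a local link (same host, '->'), 'G' for a global link ('=>')."""
    return "L" if urlsplit(src).netloc == urlsplit(dst).netloc else "G"


def reachable(start, pattern, max_len=4):
    """URLs reachable from start by a path whose L/G labels fully match pattern.

    Paths longer than max_len links are not explored, which also stops cycles.
    """
    regex = re.compile(pattern)
    found = set()
    stack = [(start, "")]
    while stack:
        url, labels = stack.pop()
        if regex.fullmatch(labels):
            found.add(url)
        if len(labels) < max_len:
            for nxt in LINKS.get(url, []):
                stack.append((nxt, labels + link_kind(url, nxt)))
    return found


START = "http://a.example/"
print(sorted(reachable(START, "LL?")))   # x ->(->)? y
print(sorted(reachable(START, "L*")))    # x ->* y
print(sorted(reachable(START, "G|L*")))  # x (=> | ->*) y
```

Note that `->*` also matches the empty path, so the start page itself is in the result for the last two patterns.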

  17-20. WebSQL From Clause: Example (figure build; label: Java)
          FROM Document X SUCH THAT X MENTIONS 'Java',
               Document Y SUCH THAT X -> | ->-> Y

  21. WebSQL From Clause
      • path expression limits search space
      • local link, search limited to local machine
      • global link, can go anywhere
      • =>* would search all of WWW
      • pre-analysis, filtering
      • even three to four local links infeasible
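A rough count shows why even three or four links become infeasible. The sketch assumes every page has the same number of outgoing links `k` and that no page is reached twice, so it is an upper bound rather than a measurement; the value `k = 10` is an arbitrary choice for illustration.

```python
def pages_fetched(k, p):
    """Upper bound on pages fetched when following every path of up to p links,
    assuming each page has k outgoing links and no page is shared between paths.
    The count includes the starting page."""
    return sum(k ** depth for depth in range(p + 1))


# With 10 links per page, each extra link in the path multiplies the work by about 10.
for p in range(1, 5):
    print(p, pages_fetched(10, p))
```

Every fetched page is a network round trip at query time, so the growth in `p` matters far more here than it would for an in-memory graph.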

  22. WebSQL Where Clause
      • like SQL
      • CONTAINS, text search of retrieved document
      • can push CONTAINS into navigation
          WHERE y CONTAINS 'lecture notes' AND y.length < 4000;

  23. WebSQL Query
      • Find lecture notes on trees in a data structures course.
          SELECT z.URL
          FROM DOCUMENT x SUCH THAT x MENTIONS 'data structures',
               DOCUMENT y SUCH THAT x -> y,
               DOCUMENT z SUCH THAT y ->* z
          WHERE y CONTAINS 'lecture notes'
            AND z CONTAINS 'trees';

  24-27. data structures -> lecture notes (figure build; labels: data structures, lecture notes)

  28-30. lecture notes ->* trees (figure build; labels: trees, data structures, lecture notes)

  31. Result (figure; labels: trees, data structures, lecture notes)
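The three steps of the query (find pages mentioning "data structures", follow one link to a "lecture notes" page, then follow any number of links to a "trees" page) can be traced on a toy site. Everything in this sketch is invented for illustration: the pages, their text and links, and the simple substring test standing in for MENTIONS and CONTAINS.

```python
# Hypothetical site; paths, text and links are invented for illustration.
PAGES = {
    "/ds/": {"text": "data structures course", "links": ["/ds/notes/"]},
    "/ds/notes/": {"text": "lecture notes",
                   "links": ["/ds/notes/trees.html", "/ds/"]},
    "/ds/notes/trees.html": {"text": "trees and heaps", "links": []},
    "/misc/trees.html": {"text": "trees in the park", "links": []},
}


def mentions(phrase):
    """Stand-in for MENTIONS: pages whose text contains the phrase."""
    return [url for url, page in PAGES.items() if phrase in page["text"]]


def closure(start):
    """Pages reachable from start by zero or more links (the ->* step)."""
    seen, todo = {start}, [start]
    while todo:
        for nxt in PAGES[todo.pop()]["links"]:
            if nxt not in seen:  # the seen set keeps cycles from looping forever
                seen.add(nxt)
                todo.append(nxt)
    return seen


def query():
    """x MENTIONS 'data structures', x -> y, y ->* z, with the two CONTAINS tests."""
    results = set()
    for x in mentions("data structures"):
        for y in PAGES[x]["links"]:
            if "lecture notes" not in PAGES[y]["text"]:
                continue
            for z in closure(y):
                if "trees" in PAGES[z]["text"]:
                    results.add(z)
    return results


print(sorted(query()))
```

The page `/misc/trees.html` also contains "trees" but is not reachable from the lecture notes page, so it is excluded.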

  32. WebSQL Example

  33. WebSQL Architecture
      • Java implementation

  34. WWW Query Language - Drawbacks
      • dynamic architecture
      • O(k**p)
        - p is length of path expression
        - k is branching factor
      • a priori knowledge of topology
      • back links are a problem

  35. Jumping Spider - a Hybrid
      • like a search engine
        - static architecture
        - keyword searches
      • like a WWW query language
        - uses modified WWW graph
        - one kind of path expression

  36. Kinds of Links
      • content refinement queries are common
      • heuristic: information in subdirectories is refined
      • different kinds of links
        - back: subdirectory to parent
        - down: parent directory to subdirectory
        - side: unrelated directories
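The three kinds of links can be told apart from the URLs alone by comparing directory paths. The sketch below is one possible reading, with two assumptions that the slide does not state: a link within the same directory counts as a down link, and a link to a different host counts as a side link. The function names and URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def directory(url):
    """Return (host, directory path) of a URL, e.g. ('a.example', '/ds/notes')."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # A path ending in '/' names a directory; otherwise drop the file name.
    dir_path = path if path.endswith("/") else posixpath.dirname(path)
    return parts.netloc, dir_path.rstrip("/") or "/"


def classify(src, dst):
    """Classify the link src -> dst as 'down', 'back' or 'side'."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"  # assumption: other hosts are unrelated directories
    # A trailing slash makes '/ds/' a prefix of '/ds/notes/' but not of '/dsx/'.
    src_prefix = src_dir.rstrip("/") + "/"
    dst_prefix = dst_dir.rstrip("/") + "/"
    if dst_prefix.startswith(src_prefix):
        return "down"  # assumption: same directory counts as down
    if src_prefix.startswith(dst_prefix):
        return "back"
    return "side"


print(classify("http://a.example/ds/index.html",
               "http://a.example/ds/notes/trees.html"))
print(classify("http://a.example/ds/notes/trees.html",
               "http://a.example/ds/index.html"))
print(classify("http://a.example/ds/a.html",
               "http://a.example/ai/b.html"))
```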

  37. Re-using the WWW Graph

  38. Directory Trees

  39. Down Links

  40. Back Links

  41. Eliminate Back Links

  42. Transitive Closure of Down Links

  43. Plus a Side Link
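The restructuring steps named on the preceding slides (eliminate back links, take the transitive closure of down links, add a side link) can be sketched on a small labeled edge list. The exact path shape used here, any number of down links, then at most one side link, then any number of down links, is an assumption about what "plus a side link" means; the edges and paths are invented.

```python
# Hypothetical labeled edges (source, target, kind).
EDGES = [
    ("/ds/", "/ds/notes/", "down"),
    ("/ds/notes/", "/ds/notes/trees.html", "down"),
    ("/ds/notes/", "/ds/", "back"),
    ("/ds/notes/trees.html", "/ai/search.html", "side"),
    ("/ai/search.html", "/ai/", "back"),
]


def successors(edges, kind):
    """Adjacency map restricted to edges of one kind."""
    succ = {}
    for src, dst, edge_kind in edges:
        if edge_kind == kind:
            succ.setdefault(src, set()).add(dst)
    return succ


def down_closure(down, start):
    """Pages reachable from start via zero or more down links."""
    seen, todo = {start}, [start]
    while todo:
        for nxt in down.get(todo.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def jumping_spider_targets(edges, start):
    """Pages linked from start in the restructured graph.

    Assumed shape: down*, then at most one side link, then down*.
    Back links are never followed.
    """
    down = successors(edges, "down")
    side = successors(edges, "side")
    first = down_closure(down, start)
    result = set(first)
    for page in first:
        for jump in side.get(page, ()):
            result |= down_closure(down, jump)
    result.discard(start)
    return result


print(sorted(jumping_spider_targets(EDGES, "/ds/")))
```

Because the closure is computed ahead of time, a query follows a single restructured link instead of walking a path of unknown length.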

  44-46. data structures -> lecture notes (figure build; labels: data structures, lecture notes)

  47-48. lecture notes -> trees (figure build; labels: data structures, trees, lecture notes)

  49. Analysis
      • search engine index
        - adds a pertinent index
      • pertinent index
        - O(n log n) to O(n**2) space
        - all URLs that can reach this URL
        - tree-like, so should be close to O(n log n)
      • more intersections
      • implemented in Perl 5
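A minimal sketch of how a pertinent index could answer `data structures -> lecture notes -> trees` with lookups and set intersections only. Both indexes are hand-written, hypothetical data, and the function names are invented; this illustrates the idea and is not the Perl 5 implementation.

```python
# Hypothetical keyword index: phrase -> URLs containing it.
KEYWORD_INDEX = {
    "data structures": {"/ds/"},
    "lecture notes": {"/ds/notes/", "/ai/notes/"},
    "trees": {"/ds/notes/trees.html", "/ai/notes/trees.html"},
}

# Hypothetical pertinent index: URL -> all URLs that can reach it.
PERTINENT_INDEX = {
    "/ds/": set(),
    "/ds/notes/": {"/ds/"},
    "/ds/notes/trees.html": {"/ds/", "/ds/notes/"},
    "/ai/notes/": {"/ai/"},
    "/ai/notes/trees.html": {"/ai/", "/ai/notes/"},
}


def refine(sources, keyword):
    """Pages containing keyword that at least one page in sources can reach."""
    return {url for url in KEYWORD_INDEX.get(keyword, set())
            if PERTINENT_INDEX.get(url, set()) & sources}


def content_refinement(keywords):
    """Evaluate k1 -> k2 -> ... -> kn; each step costs one extra intersection."""
    if not keywords:
        return set()
    current = set(KEYWORD_INDEX.get(keywords[0], set()))
    for keyword in keywords[1:]:
        current = refine(current, keyword)
    return current


print(sorted(content_refinement(["data structures", "lecture notes", "trees"])))
```

No page is fetched at query time, which is the trade the slide describes: extra index space and more intersections in exchange for a static architecture.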

  50. Related Work
      • WWW query languages
        - WebSQL (Arocena et al. - WWW6 ’97)
        - W3QS (Konopnicki and Shmueli - VLDB ’95)
        - WebLog (Lakshmanan et al. - RIDE ’96)
        - AKIRA (Lacroix et al. - ER ’97)
      • Indexes that already use directories
        - Infoseek
        - WebGlimpse (Manber et al. - Usenix ’97)
      • Semi-structured data models
        - many
