WebOQL

WebOQL: A Web Object Query Language

Overview
• The data model supports abstractions for modeling record-based data, structured documents, and hypertexts.
• Supports querying small databases represented as documents (such as catalogs).
• Supports restructuring single pages (for example, converting a large page into smaller pages).
• Supports restructuring sets of pages (for example, creating an index page containing a hyperlink to each page, and adding to each page a hyperlink to the index page).
• Supports restructuring the content of a web site in order to present the same content in another view.

Data Model
The WebOQL data model introduces the hypertree: a tree-based data model representing structured documents that contain hyperlinks. Hypertrees are ordered, arc-labeled trees with two kinds of arcs, internal and external.
• Internal arcs represent structured objects.
• External arcs represent references (links); they cannot have descendants, and their records must contain a ‘URL’ field.
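
The hypertree definition above can be sketched in Python. This is only an illustration, not WebOQL's actual implementation: the class name `Arc`, its fields, and the example.org URL are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    """One arc of a hypertree: a record label plus an ordered subtree."""
    label: dict                                   # the record labelling the arc
    subtree: list = field(default_factory=list)   # ordered child arcs
    external: bool = False                        # True for references (links)

    def __post_init__(self):
        # External arcs must carry a URL field and cannot have descendants.
        if self.external and "URL" not in self.label:
            raise ValueError("external arc requires a 'URL' field")
        if self.external and self.subtree:
            raise ValueError("external arc cannot have descendants")


# A hypertree is represented here as the ordered list of arcs leaving the root.
students = Arc({"Group": "students"}, [
    Arc({"Name": "moshe", "Sem": 5}),
    Arc({"Name": "arik", "Sem": 8}),
])
# Hypothetical URL, used only to make the example complete.
link = Arc({"Label": "seminar in www", "URL": "http://example.org/s.html"},
           external=True)
tree = [students, link]
```

Because labels are plain dictionaries, arcs with different fields can sit side by side in the same tree.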

Data Model Example
[Figure: a hypertree with internal arcs [Group: students], [Group: professors], [Name: oded. Seniority: 8], [Name: moshe. Sem: 5], [Name: arik. Sem: 8] and external arcs [Label: arik home page. URL: www…/index.html], [Label: seminar in www. URL: www…/s.html], [Label: databases. URL: www…/index.html], [Label: moshe home page. URL: www…/index.html]]

Data Model
Hypertrees are a useful data structure because they support three important abstractions:
• Collections
• Nesting
• Ordering
The notion of reference, which is central to the structure of the web, is captured through the distinction between internal and external arcs. Because nodes are untyped, a tree can hold heterogeneous records on its arcs.

Data Abstractions
• WEB: a pair (t, F), where t is a hypertree (the schema) and F : URLs → Hypertrees is the browsing function.
• PAGE: F(u), where u is a URL.

Tree operators
Definitions:
• Tails: the tails of a tree t are the trees obtained by chopping prefixes off t.
• Simple trees: the simple trees of t are the trees composed of one arc stemming from the root of t together with its subtree.
• Subtrees: the subtrees of t are the trees at the ends of the arcs that stem from the root of t.
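
A minimal sketch of the three definitions, assuming a tree is represented as an ordered list of `(label, subtree)` pairs. The representation and the function names are choices made for this example. The sample tree follows the figure on the later slide as far as it can be read.

```python
def tails(t):
    """Trees obtained by chopping prefixes (0, 1, 2, ... leading arcs) off t."""
    return [t[i:] for i in range(len(t))]


def simple_trees(t):
    """One tree per root arc: the arc together with its subtree."""
    return [[arc] for arc in t]


def subtrees(t):
    """The trees found at the ends of the root arcs."""
    return [sub for _label, sub in t]


# Sample tree: three root arcs, the last one without descendants.
t = [
    ({"Label": 1}, [({"A": 1}, []), ({"A": 2}, [])]),
    ({"Label": 2}, [({"B": 1}, [])]),
    ({"Label": 3}, []),
]
```

With this tree, `tails(t)` has three members, and the last subtree is the empty tree.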


Tree operators
Concatenate: Tree1 + Tree2
Connects two trees by their roots.
[Figure: trees t1, t2, and t1 + t2, with arcs labelled [label1: a], [label1: a1], [label1: a2], [label1: b], [label1: c], [label1: c1], [label1: c2]]

Tree operators
Hang: [ Arc1 / Tree1 ]
Hangs the tree from a new arc.
[Figure: t1 with arcs [label1: a1], [label1: a2]; [ label1: a / t1 ] hangs t1 from the new arc [label1: a]]

Tree operators
Prime: Tree’
The first subtree of the argument.
[Figure: t1 with arcs [label1: a] (above [label1: a1], [label1: a2]) and [label1: b]; t1’ consists of [label1: a1], [label1: a2]]

Tree operators
Head: Tree & [x]
The first x simple trees of the argument; if x is not specified, only the first simple tree.
[Figure: t1 with arcs [label1: a] (above [label1: a1], [label1: a2]) and [label1: b]; t1& is the simple tree [label1: a] with [label1: a1], [label1: a2]]
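
The four operators can be sketched over the same kind of list-of-pairs representation. This is an assumed representation for illustration, and the function names are not WebOQL syntax. The example builds the list-plus-hyperlink tree used later in the slides.

```python
def concat(t1, t2):
    """Tree1 + Tree2: join two trees at their roots, keeping arc order."""
    return t1 + t2


def hang(label, t):
    """[label / t]: hang tree t from a single new arc."""
    return [(label, t)]


def prime(t):
    """t': the first subtree of t (the empty tree if t has no arcs)."""
    return t[0][1] if t else []


def head(t, x=1):
    """t & x: the first x simple trees of t (only the first by default)."""
    return t[:x]


def li(text):
    # A leaf arc tagged LI; an empty list stands for "no descendants".
    return hang({"Tag": "LI", "Text": text}, [])


items = concat(concat(li("First Child"), li("Second Child")), li("Third Child"))
page = concat(hang({"Tag": "UL"}, items),
              hang({"Url": "http://a.b.c.", "Label": "Click Here"}, []))
```

Here `prime(page)` yields the three LI arcs and `head(page)` yields the UL simple tree alone.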

[Figure: a tree t with arcs [Label:1], [Label:2], [Label:3], [A:1], [A:2], [B:1], shown together with its tails, its simple trees, and its subtrees (one of which is null)]

HANG
[Label: “papers from smith”, Format: “ps.Z” / q1]
[Figure: the arc [Label: Papers from smith, Format: ps.Z] with the result of q1 below it: [Title: Recent…, Url: http://…], [Title: Are…, Url: http://www…]]
HANG + concatenate
[Tag: “UL” / [Tag: “LI”, Text: “First Child”] + [Tag: “LI”, Text: “Second Child”] + [Tag: “LI”, Text: “Third Child”]] + [Url: “http://a.b.c.”, Label: “Click Here”]
[Figure: the arc [Tag: UL] with the three [Tag: LI] arcs below it, followed by the external arc [Url: “http://a.b.c.”, Label: “Click Here”]]

Tree operators
Peek: Arc.field
Extracts a field from an arc’s label; e.g. Example.Group can have the value ‘students’. If the field does not exist, the value ‘nil’ is returned.
IsField: Arc?field
Tests for the presence of a field in an arc’s label; e.g. Example?Group evaluates to true, while Example?Name evaluates to false.
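
A small sketch of Peek and IsField, assuming an arc label is a Python dictionary and that `None` stands in for WebOQL's nil. Both are assumptions of this example.

```python
def peek(label, field_name):
    """Arc.field: the field's value, or None (standing in for nil) if absent."""
    return label.get(field_name)


def is_field(label, field_name):
    """Arc?field: True when the label has the field."""
    return field_name in label


example = {"Group": "students"}
```

With this label, `peek(example, "Group")` gives `"students"` and `is_field(example, "Name")` gives `False`.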

• Page: a hypertree that has an associated URL identifying it.
• Web: a collection of interrelated pages.
• Each external arc of a page is a link in the web.
• Schema: a web can optionally have a distinguished page that provides an entry point to the web.

No Schema: one must know the URL of one or more pages.
[Figure: pages http://a.b.c./three.html, http://a.b.c./one.html, http://a.b.c./two.html]

WebOQL query: Web → Web
[Figure: a web with a schema and the pages http://a.b.c./three.html, http://a.b.c./one.html, http://a.b.c./four.html, http://a.b.c./two.html]

<UL> <LI> First Child <LI> Second Child <LI> Third Child </UL> <A HREF=“http://a.b.c.”> Click Here </A >

Tree representing an HTML document consisting of a list and a hyperlink
[Figure: arcs [Tag: LI, Text: First Child], [Tag: LI, Text: Second Child], [Tag: LI, Text: Third Child], and [Url: http://a.b.c., Label: Click here]]
• Trees are ordered
• Arcs are labelled with records rather than atomic values

Paper Database: CS papers
[Figure: arcs [group:DBMS], [group:Card], [group:ProgLang], [Title:Recent Authors:Smith Publications:Tech], [Title:Are…… Authors:Smith Publications:ACM], [Label:Abstract Url: www…], [Label:Full Papers Url: www…]]

SELECT - FROM - WHERE
This familiar query language construct is the main construct of WebOQL queries.
select [y.Label, y.URL] (the query to evaluate)
from x in example, y in x! (definition of variables)
where x.Seniority = 8 (a boolean condition)

SELECT - FROM - WHERE
For each instantiation of the variables in the from clause, check the condition in the where clause; if it is true, evaluate the query in the select clause and append it to the result.
Result: [Label: seminar in www. URL: www…/s.html] [Label: databases. URL: www…/index.html]
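
The evaluation rule above amounts to nested loops. The sketch below runs a hypothetical query, `select [y.Name] from x in example, y in x’ where y.Sem = 8`, over a list-of-pairs tree. The query, the helper names, and the representation are assumptions chosen for illustration, not the slide's own query.

```python
def simple_trees(t):
    return [[arc] for arc in t]


def prime(t):
    return t[0][1] if t else []


def peek(simple_tree, field_name):
    # Field of the (single) root arc of a simple tree; None plays the role of nil.
    return simple_tree[0][0].get(field_name) if simple_tree else None


example = [
    ({"Group": "students"}, [({"Name": "moshe", "Sem": 5}, []),
                             ({"Name": "arik", "Sem": 8}, [])]),
    ({"Group": "professors"}, [({"Name": "oded", "Seniority": 8}, [])]),
]


def run_query(tree):
    result = []
    for x in simple_trees(tree):              # from x in example
        for y in simple_trees(prime(x)):      #      y in x'
            if peek(y, "Sem") == 8:           # where y.Sem = 8
                # select [y.Name]: build a one-arc tree and append it
                result += [({"Name": peek(y, "Name")}, [])]
    return result
```

The arc for oded is skipped because it has no Sem field, so the comparison sees nil.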

select [y.Title, y.Publication] from x in csPapers, y in x’
Missing data: if a paper has no Publication field, its value in the result is undefined.

Compute a listing of the papers’ publication data grouped by title.
select [x.Title / select [z.Publication] from y in csPapers, z in y’ where x.Title = z.Title] from w in csPapers, x in w’

• Schema: a distinguished hypertree.
• Browsing function: maps strings (URLs) to hypertrees. It defines a graph in which the nodes are pages and there is an arc from node a to node b if the page at node a contains an external arc whose url attribute is the URL of the page at node b.
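
The graph induced by the browsing function can be sketched as follows. The browsing function F is modelled as a dictionary from URL to tree, and any childless arc with a `Url` field is treated as external. Both are simplifying assumptions of this example.

```python
def external_urls(tree):
    """Yield the Url of every external arc at any depth, in document order."""
    for label, sub in tree:
        if "Url" in label and not sub:
            yield label["Url"]
        else:
            yield from external_urls(sub)


def link_graph(F):
    """Map each page URL to the URLs of the pages it links to within the web."""
    return {u: [v for v in external_urls(t) if v in F] for u, t in F.items()}


F = {
    "http://a.b.c./one.html": [
        ({"Label": "two", "Url": "http://a.b.c./two.html"}, []),
    ],
    "http://a.b.c./two.html": [
        ({"Tag": "UL"}, [({"Label": "one", "Url": "http://a.b.c./one.html"}, [])]),
    ],
    "http://a.b.c./three.html": [],
}
graph = link_graph(F)
```

Links to URLs outside the dictionary are dropped here, since F as modelled does not map them to a page.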

Analogy with relational databases
• Hypertrees ↔ relations
• Webs ↔ databases
• Schema of a web ↔ catalog of a database

select [x.Tag] from x in browse(“http://www.cs.toronto.edu”)
Result: [Tag: body] [Tag: head]

SFW creates a web
select [y.Title, y’.URL] as schema from x in csPapers, y in x’ where y.Authors ~ “smith”
• Create a web page with URL “Group Names” whose content is the list of group names (assume that there is no such page in the current web):
select [x.Group] as “Group Names” from x in csPapers

• Create several pages, one for each research group (using the group name as URL). Each page contains the publications of the corresponding group:
select x’ as x.Group from x in csPapers

Data Model
• Records as labels on arcs
• Internal and external arcs
[Figure: hypertree of a city overview page, with internal arcs [Tag: H1, Text: City Overview…], [Tag: UL, Text: one of the…], [Tag: LI, Text: If you are interested…], [Tag: LI, Text: One of the…], [Tag: LI, Text: All the hotels…], [Tag: XYZ, Text: If you are…], [Tag: XYZ, Text: …], [Tag: XYZ, Text: Contains…], [Tag: XYZ, Text: One of the…] and external arcs [Label: Theatres Online, Url: http://www…, Base: http://www…, Text: This page contains...], [Label: Sports Zone, Url: http://www…, Base: http://www…, Text: Sports Zone…], [Label: All the Hotels, Url: http://www…, Base: http://www…, Text: These are all…]]

Query: list elements containing “ticket”
doc := “http://www.citynet.com/overview.html”;
[Tag: “UL” / select y from y in doc!’ where y’.text ~ “ticket”]
[Figure: the result, a [Tag: UL] arc above two [Tag: LI] arcs, which carry [Tag: XYZ, Text: …], [Label: Theatres Online, Url: http://www…, Base: http://www…, Text: This page contains...], [Label: Sports Zone, Url: http://www…, Base: http://www…, Text: Sports Zone…], [Tag: XYZ, Text: If you are…], [Tag: XYZ, Text: One of the…]]

Web restructuring Using these tree operators we have shown how a tree can be restructured. To restructure a web we must have a function which maps one web to another. The new web has some hypertree as its schema while the browsing function is an extension of the old web’s browsing function - targets URLs which were not previously targeted. The way it is done in WebOQL is by using the AS clause.

Web restructuring
In general, the select clause of WebOQL has the form:
select q1 as s1, q2 as s2, …, qn as sn
Each si can be either the keyword schema or a string query. An as clause whose target is schema defines the schema of the web.
Example: [Title: y.Group] as schema
Result: [Title: students] [Title: professors]

Web restructuring
An as clause whose target evaluates to a string defines a page, and the string is treated as the URL of that page.
Example: [x.Name] as y.Group
Result: the page students, containing [Name: moshe] [Name: arik]

Web restructuring After a web is created there are two possibilities : either query it further (restructure it) or return it to the host application. If we want to return the web to the host application for the sake of showing it to a browser then we must format the pages in an HTML compliant way. This is easily done by restructuring it using HTML tags as labels.

Document restructuring
Web documents are a prime example of semi-structured data: they have no fixed schema and can exhibit various irregularities. In an HTML document, most tags may appear any number of times or not at all. WebOQL uses a wrapper that creates an abstract syntax tree (AST) from an arbitrary HTML document. This is straightforward because HTML markup tags reflect the logical relationships between the information items.
Example: <UL> <LI> item 1. </LI> <LI> item 2. </LI> <LI> item 2. </LI> </UL>
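
A rough idea of such a wrapper can be given with Python's standard `html.parser` module. This is a sketch under strong assumptions, not WebOQL's wrapper: it expects every tag to be explicitly closed, ignores attributes, and stores text in a `Text` field of the enclosing arc.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a list-of-(label, subtree) tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = []             # the resulting tree
        self.stack = [self.root]   # subtrees of the currently open tags
        self.labels = []           # labels of the currently open tags

    def handle_starttag(self, tag, attrs):
        label = {"Tag": tag.upper()}
        subtree = []
        self.stack[-1].append((label, subtree))
        self.stack.append(subtree)
        self.labels.append(label)

    def handle_endtag(self, tag):
        # Simplification: assume the end tag closes the innermost open tag.
        if self.labels:
            self.stack.pop()
            self.labels.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.labels:
            label = self.labels[-1]
            label["Text"] = label.get("Text", "") + text


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


ast = parse("<UL> <LI> item 1. </LI> <LI> item 2. </LI> </UL>")
```

Real pages often omit closing tags such as `</LI>`, which this sketch does not repair. A practical wrapper would need to handle that.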

Document restructuring
Navigation patterns: in the examples so far, the variables in the queries ranged over the simple trees of the queried tree. On the WWW, however, variables may range over several linked subtrees whose structure is not fully known to us.
select [x.text] from x in “someone’s.html” via ^*[Tag = “H2”]
• ^ is a record predicate that is true for every internal arc.
• [Tag = “H2”] is a record predicate that is true for every arc that has an ‘H2’ tag.

Document restructuring
Navigation patterns (continued):
select [x.text] from x in “someone’s.html” via >*[not(Tag = “H2”)]
• > is a record predicate that is true for every external arc.
• [not(Tag = “H2”)] is a record predicate that is true for every arc that does not have an ‘H2’ tag.

Document restructuring
Navigation patterns: when the navigation pattern is omitted, the query is treated as if it had a navigation pattern that always evaluates to true. Variables are instantiated left to right, in either depth-first or breadth-first order. The default is breadth-first; to use depth-first order, the keyword viadfs is used instead of via.
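
The difference between the two visiting orders can be illustrated on a small tree. The sketch only shows the order in which arcs would be reached. It does not implement pattern matching, and the function names and the list-of-pairs representation are assumptions of this example.

```python
from collections import deque


def arcs_bfs(tree):
    """Tags of all arcs in breadth-first, left-to-right order."""
    queue = deque(tree)
    order = []
    while queue:
        label, sub = queue.popleft()
        order.append(label.get("Tag"))
        queue.extend(sub)   # children are visited after the current level
    return order


def arcs_dfs(tree):
    """Tags of all arcs in depth-first, left-to-right order."""
    order = []
    for label, sub in tree:
        order.append(label.get("Tag"))
        order.extend(arcs_dfs(sub))   # descend before moving to the next sibling
    return order


doc = [
    ({"Tag": "H1"}, [({"Tag": "H2"}, [])]),
    ({"Tag": "UL"}, [({"Tag": "LI"}, [])]),
]
```

On this tree the breadth-first order is H1, UL, H2, LI, while the depth-first order is H1, H2, UL, LI.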

select q1 as s1, q2 as s2, q3 as s3, …, qm as sm
where the qi are queries and each si is either a string query or the keyword schema.
• Generate a web consisting of a page for each research group, containing the title and authors of all its publications, and an index page that lists all the groups and provides links to their pages:
newWeb := select unique [Name: x.Group, Url: x.Group] as schema, [y.Title, y.Authors] as x.Group from x in csPapers, y in x’
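
What this query produces can be sketched as a schema tree plus a dictionary of pages. The sketch assumes that `unique` removes duplicate arcs from the result, and the sample titles and authors are placeholders invented for the example.

```python
def new_web(cs_papers):
    schema, pages = [], {}
    for group_label, papers in cs_papers:            # from x in csPapers
        group = group_label["Group"]
        for paper_label, _sub in papers:             #      y in x'
            # [Name: x.Group, Url: x.Group] as schema, kept once ('unique')
            entry = ({"Name": group, "Url": group}, [])
            if entry not in schema:
                schema.append(entry)
            # [y.Title, y.Authors] as x.Group: append to the group's page
            arc = ({"Title": paper_label.get("Title"),
                    "Authors": paper_label.get("Authors")}, [])
            pages.setdefault(group, []).append(arc)
    return schema, pages


# Placeholder data, not taken from the slides.
cs_papers = [
    ({"Group": "Card Punching"}, [
        ({"Title": "T1", "Authors": "A1", "Publication": "P1"}, []),
        ({"Title": "T2", "Authors": "A2"}, []),
    ]),
    ({"Group": "Prog. Lang"}, [({"Title": "T3", "Authors": "A3"}, [])]),
]
schema, pages = new_web(cs_papers)
```

The string that each as clause evaluates to (here the group name) becomes the key under which the page is stored, mirroring its role as the page URL.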

[Figure: the resulting web. “as schema” yields the index arcs [Name: Card Punching, Url: Card Punching], [Name: Prog. Lang, Url: Prog. Lang], [Name: …, Url: …]; “as x.Group” yields the pages Prog. Lang. and Card Punching, whose arcs are [Titles: Assembly Lan…, Authors: John,…], [Titles: Cobol…, Authors: James J], [Titles: Recent…, Authors: Smith], [Titles: Arc…, Authors: Smith]]

NewerWeb < newWeb
select [ Tag: “H3”, Text: y.Title ] +
[ Tag: “BR”, Text: y.Publication ] +
[ Tag: “BR”, Text: y.Authors ] +
[ Tag: “P” ]
as x.Name
from x in schema, y in x.Name
|
select [ Tag: “H2”, Text: “Publications of the” * x.Name * “ Group” ] + x.Name +
[ Tag: “A”, Label: “To Index”, Url: “http://a.b.c/Index of Projects.html” ]
as “http://a.b.c/” * x.Name * “.html”
from x in schema

|
select [ Url: “http://a.b.c/Index of Projects.html” ] as schema,
[ Tag: “H2”, Text: “Index of Projects” ] +
[ Tag: “UL” /
select [ Tag: “LI” /
[ Tag: “A”, Label: x.Name,
Url: “http://a.b.c/” * x.Url * “.html”
]
]
from x in schema
] as “http://a.b.c/Index of Projects.html”

Index Page
<H2> Index of Projects </H2>
<UL>
<LI> <A HREF = “http://a.b.c./cardpunching.html”> Card Punching </A> </LI>
<LI> <A HREF = “http://a.b.c./programminglanguages.html”> Programming Languages </A> </LI>
<LI> …..
</UL>

Group Pages
<H2> Publications of the Card Punching group </H2>
<H3> recent Discoveries in Card Punching </H3>
<BR> Technical Report TROIS
<BR> Peter Smith, John Brown
<P>
<H3> Are Magnetic Media Better ? </H3>
<BR> ACM TOCP Vol 3 No. (1942) pp.2337
<BR> Peter Smith, John Brown
<P>
<A HREF=“http://a.b.c./IndexnProject.html”> To index </A>

Navigation Pattern
• [not(Tag = “A”)]* : a path of any length composed of arcs that do not have an attribute Tag with value “A”.
• [Tag = “LI”] [Tag = “A”] : a path of length 2.
• ^*> : all paths in a tree that lead from the root to an external arc.
select [x.url] from x in “http://a.b.c./index.html” via [not(Tag = “Table”)]*>
All the external arcs in the document pointed to by “http://a.b.c./index.html” that do not occur within a table.
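
The last pattern, `[not(Tag = “Table”)]*>`, can be approximated by a recursive walk: follow arcs whose Tag is not the banned one, and collect the external arcs reached. This is a hand-written sketch for that single pattern, not a general pattern matcher. A childless arc with a `Url` field is assumed to be external, and the link URLs in the sample are placeholders.

```python
def external_arcs_outside(tree, banned_tag):
    """Urls of external arcs reachable without passing through banned_tag."""
    found = []
    for label, sub in tree:
        if "Url" in label and not sub:
            found.append(label["Url"])            # '>' : an external arc ends the path
        elif label.get("Tag") != banned_tag:      # [not(Tag = banned_tag)]*
            found.extend(external_arcs_outside(sub, banned_tag))
    return found


doc = [
    ({"Tag": "UL"}, [({"Label": "a", "Url": "http://a.b.c./one.html"}, [])]),
    ({"Tag": "Table"}, [({"Label": "b", "Url": "http://a.b.c./two.html"}, [])]),
    ({"Label": "c", "Url": "http://a.b.c./three.html"}, []),
]
urls = external_arcs_outside(doc, "Table")
```

The link inside the Table arc is not reported, because every path to it passes through an arc that fails the predicate.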

Select [x.url,x.text] From x in “http://a.b.c./root.html” Via (^*[Labled “Next’’]>)*

Architecture
[Figure: components API, Query Engine, Wrapper Manager, and Wrappers over a DBMS, a File System, and Web1 ... Web k; the diagram is labelled with query, web, URL, and tree]
