Declaratively Producing Data Mash-ups
Sudarshan Murthy (1), David Maier (2)
(1) Applied Research, Wipro Technologies
(2) Department of Computer Science, Portland State University
http://www.sixml.org

Mash-ups
• Web applications that combine information from multiple sources [Wikipedia]
• A mash-up does not need to be a web app
  • Data that includes or transcludes content from multiple sources
• In either case, a source is likely only a fragment
• This work is about data mash-ups
  • In this talk, a mash-up is an XML document

Portland State University Campus Map
• 45 markers, 53 landmarks
  • Marker: Balloon on map
  • Landmark: Building, department, …
• Information from 188 fragments in 58 web pages
• Fragments selected manually
http://sparce.cs.pdx.edu/cmap/

Portland Metro Food Markets
• 154 markers, 154 landmarks
• 154 fragments harvested programmatically from 4 MS Word documents
• Developed for the Oregon Department of Agriculture
http://sparce.cs.pdx.edu/Declaratively Producing Data Mash-ups/oda-1.1/

An HTML Review Report

Problem Areas
• Development
  • Getting data from heterogeneous fragments
  • Might use a DBMS, yet must still code operators such as sort, join, and aggregate for external data
• Execution
  • When to get external data, and how much to get?
• Design: Expressing that
  • A part comes from an external fragment
  • A part is data (such as page number) which cannot be “selected” in the source

Outline
• Introduction
• The conceptual approach
• Sixml: Condensed mash-ups
• Sixml DOM: Reconstituted mash-ups
• Sixml Navigator: Formatted mash-ups
• Evaluation
• Summary
• Discussion

Superimposed Information (SI)
• SI is new data and structure overlaid on existing base information
• Mark: A reference to an external fragment
• Benefits
  • Multiple, simultaneous organizations
  • Make new connections among base fragments
  • Preserve context
Heterogeneous sources: Word, Excel, PDF, HTML, …

The Mash-up Production Process
[Figure: three-stage pipeline drawing on services, DBMSs, and documents]
• Collect and Classify → Condensed mash-up: Collect marks, add new data and structure
• Extract and Combine → Reconstituted mash-up: Extract data from marks and combine with added data
• Transform → Formatted mash-up: Format reconstituted data for display and other purposes

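The three stages can be sketched in a few lines of Python. This is only an illustration of the data flow, not the authors' implementation: the function names, the dictionary layout, the `SOURCES` table, and the address `doc1#p3` are all hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for external base documents, keyed by a made-up address.
SOURCES = {"doc1#p3": "provide ... system"}


def collect(comment: str, address: str) -> dict:
    """Stage 1: a condensed mash-up holds added data plus a mark (a reference only)."""
    return {"comment": comment, "mark": address}


def reconstitute(condensed: dict, fetch: Callable[[str], str]) -> dict:
    """Stage 2: resolve the mark and combine the fetched excerpt with the added data."""
    return {"comment": condensed["comment"], "excerpt": fetch(condensed["mark"])}


def format_mashup(reconstituted: dict) -> str:
    """Stage 3: render the reconstituted data for display."""
    return f'{reconstituted["comment"]}: "{reconstituted["excerpt"]}"'


condensed = collect("Contradicts prior work", "doc1#p3")
reconstituted = reconstitute(condensed, SOURCES.__getitem__)
formatted = format_mashup(reconstituted)
print(formatted)
```

Note that the condensed form never stores the excerpt itself; the external content is pulled in only at the second stage.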
SI, Bi-level Information, Mash-ups
• A condensed mash-up is SI
  • Links mash-up parts to external fragments
  • Relates to mash-up design: Sixml
• A reconstituted mash-up and a formatted mash-up are both bi-level information
  • SI plus reconstituted parts
  • Relates to runtime mash-up manipulation and execution: Sixml DOM and Sixml Navigator

Sixml
• A mash-up specification language
• SI represented as XML; Sixml is XML
• A condensed mash-up is encoded as a Sixml document
• A mark association is encoded as an XML element of a type we define
• Associate marks with six kinds of content
• Validated using standard schema constructs
• Uniform and comprehensible serialization

Sixml Mark Associations

Plain element, no marks:
<Comment excerpt="">
  Contradicts prior work
</Comment>

With an EMark:
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

With an AMark and an EMark:
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

With a TMark, an AMark, and an EMark:
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

With sixml:valueSource on the AMark:
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

• By default, the text excerpt is assigned at run time, but it is possible to declare that the value should be something other than the excerpt
• The mark association names shown here are the same as the type names, but custom names are possible (with both static and dynamic typing)

Sixml Mark Descriptors

<Comment excerpt="" xmlns:sixml="…" xmlns:xsi="…">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor xsi:type="sixml:XPointer">
      <pointer>http://www.w3.org/#element(/1/2)</pointer>
    </sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor xsi:type="sixml:SPARCE">
      <Agent>OfficeAgents.MSWord</Agent>
      <Doc location="c:\abc.doc" />
      <Subdoc startChar="45" endChar="53" />
    </sixml:Descriptor>
  </sixml:EMark>
</Comment>

• Any internal structure is OK
• An implementation specific to an xsi:type interprets the structure

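One way to picture "an implementation specific to an xsi:type" is a registry of handlers keyed by the descriptor type. The sketch below uses Python's standard ElementTree and is not the Sixml implementation: the handler functions, the `HANDLERS` table, and the returned dictionaries are hypothetical, the xsi namespace URI is the standard XML Schema instance namespace, and the sixml namespace URI is taken from the mark association examples. It also assumes the SPARCE descriptor has `Doc` and `Subdoc` elements with `location`, `startChar`, and `endChar` attributes, and it matches the type by local name only rather than resolving the QName prefix.

```python
import xml.etree.ElementTree as ET

SIXML_NS = "http://schema.sixml.org"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Raw string so that the backslash in the Windows path is kept literally.
DOC = r"""
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor xsi:type="sixml:XPointer">
      <pointer>http://www.w3.org/#element(/1/2)</pointer>
    </sixml:Descriptor>
  </sixml:TMark>
  <sixml:EMark>
    <sixml:Descriptor xsi:type="sixml:SPARCE">
      <Agent>OfficeAgents.MSWord</Agent>
      <Doc location="c:\abc.doc" />
      <Subdoc startChar="45" endChar="53" />
    </sixml:Descriptor>
  </sixml:EMark>
</Comment>
"""


def read_xpointer(desc):
    """Hypothetical handler: an XPointer descriptor carries a single pointer."""
    return {"kind": "XPointer", "pointer": desc.findtext("pointer")}


def read_sparce(desc):
    """Hypothetical handler: a SPARCE descriptor names an agent and a character range."""
    sub = desc.find("Subdoc")
    return {
        "kind": "SPARCE",
        "agent": desc.findtext("Agent"),
        "doc": desc.find("Doc").get("location"),
        "start": int(sub.get("startChar")),
        "end": int(sub.get("endChar")),
    }


# Registry keyed by the local part of xsi:type (a simplification of QName resolution).
HANDLERS = {"XPointer": read_xpointer, "SPARCE": read_sparce}


def interpret(desc):
    """Dispatch a descriptor element to the handler registered for its xsi:type."""
    type_name = desc.get(f"{{{XSI_NS}}}type")
    return HANDLERS[type_name.split(":")[-1]](desc)


root = ET.fromstring(DOC)
results = [interpret(d) for d in root.iter(f"{{{SIXML_NS}}}Descriptor")]
for r in results:
    print(r)
```

Because the generic code never looks inside a descriptor, a new fragment type would only need a new entry in the registry.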
Sixml DOM
• Extends W3C XML DOM to easily manipulate Sixml documents
  • Using DOM can be tedious and inefficient
• Automatic and lazy reconstitution
  • Detects mark associations and interprets attributes such as sixml:valueSource
• Developer uses only the DOM interface
• Access to descriptors and “context” of external fragments

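Lazy reconstitution can be illustrated with a small wrapper that resolves a marked attribute only when it is first read and then caches the value. This is a sketch under stated assumptions, not the Sixml DOM API: the `LazyElement` class, its `get_attribute` method, the `fake_fetch` resolver, the `author` attribute, and the descriptor text `hypothetical-address` are all invented for the example.

```python
import xml.etree.ElementTree as ET

SIXML_NS = "http://schema.sixml.org"


class LazyElement:
    """Hypothetical wrapper: attributes targeted by an AMark are fetched on first access."""

    def __init__(self, element, fetch):
        self._element = element
        self._fetch = fetch      # callable that resolves a descriptor to a value
        self._cache = {}         # attribute name -> reconstituted value
        # Map each AMark's target attribute to its descriptor; nothing is fetched yet.
        self._amarks = {
            mark.get("target"): mark.find(f"{{{SIXML_NS}}}Descriptor")
            for mark in element.findall(f"{{{SIXML_NS}}}AMark")
        }

    def get_attribute(self, name):
        if name not in self._amarks:
            return self._element.get(name)   # ordinary attribute, no mark involved
        if name not in self._cache:
            self._cache[name] = self._fetch(self._amarks[name])
        return self._cache[name]


calls = []


def fake_fetch(descriptor):
    """Stand-in for retrieving an excerpt from an external document."""
    calls.append(descriptor.text)
    return "provide ... system"


DOC = """<Comment excerpt="" author="reviewer" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>hypothetical-address</sixml:Descriptor>
  </sixml:AMark>
</Comment>"""

comment = LazyElement(ET.fromstring(DOC), fake_fetch)
fetches_before_access = len(calls)
first = comment.get_attribute("excerpt")
second = comment.get_attribute("excerpt")
print(first, fetches_before_access, len(calls))
```

The caller only ever asks for an attribute by name, which mirrors the point that the developer uses only the DOM interface.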
Run-time Representation

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

[Figure: DOM tree]

Generating a Sixml DOM Tree
[Figure: Sixml DOM tree, annotated “Value reconstituted” and “Descriptor is not a child”]
• A mark association is “attached” to its target, but is not a child
• The DOM interface suffices to access the reconstituted mash-up

Context Information
• Information retrieved from the context of an external fragment
• An xsi:type-specific implementation determines (statically or dynamically) what is in context

<sixml:Context>
  <Content>
    <Text>provide ... system</Text>
  </Content>
  <Presentation>
    <FontName>Times New Roman</FontName>
    <FontSize>11</FontSize>
  </Presentation>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>

Programming with Sixml DOM
1. procedure WriteComment(SixmlElement c)
2.   XmlElement ctxt = c.markAssociations[0].Context
3.   XmlNode page = ctxt.getElementsByTagName("Page")[0]
4.   Writeln("Page: ", page.firstChild.nodeValue)
5.   Writeln("Excerpt: ", c.getAttribute("excerpt"))
6.   Writeln("Comment: ", c.firstChild.nodeValue)
• Only Lines 1 and 2 use the Sixml DOM interface
• Lines 2–4 get the page number; Line 5 gets the reconstituted excerpt; Line 6 gets the comment text

Sixml Navigator
• Alternative to the traditional path navigator
• Extends XDM so that Sixml documents can be declaratively queried using existing languages and query processors
• Also applies to XPath 1.0 and XSLT 1.0
• Performs automatic and lazy reconstitution

XDM Extensions
• Allow child elements for any kind of node with which a mark may be associated
• Make a mark association a child of its target node
• Represent a mark descriptor and context as children of a mark association
• These extensions allow reuse of existing query languages and processors

An Extended-XDM Tree
[Figure: extended-XDM tree]

Queries over Bi-level Information
• With Comment as current node, get the comment text
  ./text()
• Get excerpt of commented region
  ./@excerpt
• Get page number of commented region
  ./sixml:EMark/sixml:Context/Placement/Page

<sixml:Context>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>

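The three queries can be tried against a hand-written document that mimics what an extended-XDM view might expose, with the mark association and its context appearing as children of the Comment. This sketch uses Python's standard ElementTree rather than the Sixml Navigator; the `EXTENDED_VIEW` document is an assumption built from the slide fragments, and the excerpt value is borrowed from the context example for illustration. ElementTree supports only a subset of XPath, so the `./text()` and `./@excerpt` queries are expressed with `.text` and `.get()`, while the page-number path is used unchanged.

```python
import xml.etree.ElementTree as ET

NS = {"sixml": "http://schema.sixml.org"}

# Hand-written stand-in for a reconstituted, extended-XDM view of one Comment.
EXTENDED_VIEW = """<Comment excerpt="provide ... system" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:EMark>
    <sixml:Context>
      <Placement>
        <Page>3</Page>
      </Placement>
    </sixml:Context>
  </sixml:EMark>
</Comment>"""

comment = ET.fromstring(EXTENDED_VIEW)

# Equivalent of ./text(): the text node before the first child element.
comment_text = comment.text.strip()

# Equivalent of ./@excerpt: read the attribute directly.
excerpt = comment.get("excerpt")

# Same path as on the slide; prefixes are resolved through the NS mapping.
page = comment.findtext("./sixml:EMark/sixml:Context/Placement/Page", namespaces=NS)

print(comment_text, excerpt, page, sep=" | ")
```

A full XPath 1.0 processor running over the navigator would accept all three expressions as written on the slide.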
Implementation and Usage
• Element types for Sixml mark associations defined in XML Schema
• Sixml DOM and Sixml Navigator in C# on the .NET Framework
• Sixml DOM implemented by extending DOM and by revising DOM
  • Three implementations of Sixml DOM: 2 extensions (MS and Mono), 1 revision (Mono)
• Sixml, Sixml DOM, and Sixml Navigator used in mash-ups for several applications

Experimental Data
• 8 mash-ups
  • 4 each from 2 apps; different scale factors
  • File size: 200 KB to 26.1 MB
  • #Docs referenced: 18 to 426
  • #Mark associations: 1.9K to over 311K
• 3 traditional XML documents
  • File size: 484 KB to 113.7 MB
  • Tree depth: 4, 8, 16

Evaluation Summary
• Sixml DOM
  • Saves time over DOM when accessing mark associations
  • When accessing SI, savings decrease as the amount of SI increases
  • It is better to use DOM to access large traditional XML documents
• Sixml Navigator
  • Saves time over traditional navigator for both mark associations and SI

Summary
• A mash-up has three forms: condensed, reconstituted, and formatted
• Sixml, Sixml DOM, and Sixml Navigator support the three forms, respectively
• Sixml makes it easier to specify mash-ups; Sixml DOM and Navigator provide a more efficient means of manipulating mash-ups
• The XML Schema instance documents and the source code are on www.sixml.org

Our Mash-up Framework
[Figure: layered architecture with Client Application; XSLT and XQuery Processors; XPath Processor; Sixml, Sixml DOM, Sixml Navigator; SPARCE, Bulk Accessor, Cloaker]
• SPARCE: Reference and retrieve fragments of arbitrary types
• Bulk Accessor: Efficiently retrieve a large number of fragments
• Cloaker: Hide data to improve query expression and execution

Bi-level Query Processors
• Sixml Navigator uses Sixml DOM internally: Does not construct extended-XDM trees
• Existing query processors use the Sixml Navigator instead of using the traditional navigator

Mark Creation
[Figure labels: Clipboard · Superimposed Application · Mark Manager · Superimposed Info · Descriptors Repository · S1 · M4]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>

Activation and Context Retrieval
[Figure labels: Context Manager · Base Application · Superimposed Application · Base Info · Mark Manager · Superimposed Info · Descriptors Repository · S1 · M4]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>

About Context
[Figure: context of a PDF mark and of a PowerPoint mark]
• Context information is modeled as a hierarchical property set