Dissemination and Synchronization for Mobility (and Beyond)

Michael Franklin, UC Berkeley • MDM Tutorial, 7 January 2001

Outline • Dissemination vs. Synchronization • Architectural Concepts • Types of nodes • Data Delivery Mechanisms • User Profiles • Data Dissemination • DBIS Toolkit, XFilter, Continuous Queries • Synchronization • for PDAs: Palm HotSync, Edison, SyncML • Data Recharging • Consistency for weakly connected devices • Wrap Up

Intro: Data Dissemination • disseminate • 1. To scatter widely, as in sowing seed. • 2. To spread abroad, promulgate. disseminate information • In a data management context, this refers to the proactive distribution of relevant data to users. • Examples: • News feeds, stock tickers, event broadcasts, SPAM, …

Intro: Data Synchronization • synchronize • 1. To cause to occur or operate with exact coincidence in time or rate. • 2. To cause to occur or operate at the same time as something else. • In a data management context this refers to making base data and device-cached data consistent. • Examples: • Palm HotSync, Email (?), disconnected operation

Discussion • From the definitions, you might think that the two concepts are completely unrelated, but are they? • Examples: • Email Lists/On-line communities • Groupware apps such as shared calendars • AvantGo • What are the essential characteristics that distinguish one from the other? • How related? How different?

Tutorial Goals • To identify common infrastructure to support large-scale data distribution: dissemination and synchronization. • To describe recent and on-going research in supporting dissemination. • To describe existing synchronization protocols and future directions for them. • To outline avenues for continuing research and infrastructure development.

2. Architectural Concepts • Dissemination and Sync are inherently distributed; • Both require a network architecture. • A key concept is that of an Overlay Network • “application-level” network built on top of Internet protocols; interacts with the “regular” Internet. • May use both public and private communication links. • Exploits “Data Centers” deployed around the world. • Content Routing can be done at the application level, so it can be based on application and data semantics. • Caching, Prefetching, Staging, etc. can be done transparently. • E.g., CDNs such as Akamai, FastForward Networks

Architecture (continued) • We will focus on three key aspects of such architectures: • Types of nodes in the system. • Options for data delivery mechanisms. • Representation of data needs and preferences through user profiles.

i) Types of Nodes • Clients • Interact with end user, may cache data and updates • Client Proxies • Deal with disconnection, provide network interface • Data Sources • The ultimate repositories for data • Intermediaries (“Information Brokers”) • Provide storage/caching, application level routing • value added data processing • communications transducing

Network Components • [Figure: Clients, Client Proxies, Information Brokers, and Data Sources connected through the Internet; arrows labeled profile, query, and response.]

ii) Data Delivery Options • There are many ways to move data between sources and receivers: • Pull vs. Push • Does the data move because the receiver asked for it or because the source decided to send it? • Periodic vs. Aperiodic • Does the data move according to a predefined schedule or is movement event/demand driven? • Unicast vs. 1 to N • Does the data go to a single receiver or many? • Reliability Guarantees • best effort, guaranteed once, transactional…
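The first three dimensions above can be enumerated mechanically. The following is a minimal sketch, assuming each dimension is a two-valued choice; the class name `DeliveryMechanism` and the function `all_mechanisms` are illustrative and do not come from the tutorial. Reliability guarantees are left out because the slide lists them as an open-ended range.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DeliveryMechanism:
    """One point in the delivery design space (hypothetical helper type)."""
    initiation: str  # "pull": receiver asks; "push": source decides to send
    schedule: str    # "aperiodic": event/demand driven; "periodic": on a schedule
    fanout: str      # "unicast": one receiver; "1-to-n": many receivers


def all_mechanisms():
    """Enumerate every combination of the three two-valued dimensions."""
    return [
        DeliveryMechanism(i, s, f)
        for i, s, f in product(
            ("pull", "push"), ("aperiodic", "periodic"), ("unicast", "1-to-n")
        )
    ]


for mechanism in all_mechanisms():
    print(mechanism)
```

Treating the dimensions as independent fields makes it easy to attach a different mechanism to each link in a system.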

Data Delivery Mechanisms [Franklin & Zdonik, OOPSLA 97] • [Figure: taxonomy tree. Pull and Push each split into Aperiodic and Periodic, and each of those into Unicast and 1-to-n. Leaf examples shown: request/response, on-demand broadcast, polling, polling w/ snoop, Email lists, publish/subscribe, Personalized News, Broadcast disks.] • Dimensions are largely orthogonal – all combinations are potentially useful.

Network Transparency • A fundamental principle for systems design: • Type of a link matters only to nodes on each end. • [Figure: Sources, Brokers, Clients.]

iii) User Profiles • An expression of a user’s (or group of users’) data interests and priorities. • Must be Declarative: • Query languages enabled modern database systems. • Profile languages will enable next generation information management. • Sources: • users • learned (implicitly or through feedback) • hybrid • collaborative/clustering approaches

Why are Profiles Needed? • Necessary for push-based dissemination • how else to know what to send to user? • Useful for optimizing data synchronization • can precompute data to be transferred to user • can identify potential hot spots • Also can be used for data management • Caching • Staging at brokers and proxies • Prefetching • Precomputation of customized data views

Profile Contents Three main components: 1) Domain Specification: content-based, declarative specifications of user interests (read “queries”). 2) Utility Specification: Specifications of user priorities and dependencies among data items and requirements for resolution, freshness, ordering, etc. 3) User Context information: where, when, who, what. Useful for tailoring data delivery to users based on their current and future needs.

Example Profile
WHERE <article>
        <subject> Database </>
        <title> $t </>
        <year> $y </>
        <conference> $c </>
      </> ELEMENT AS $X IN (www.cs.*.edu/*/$S), $S conforms to “bib.dtd”
CONSTRUCT $X
UTILITY ( $X ) (10 * ( $c = “SIGMOD” OR $c = “VLDB”)) + (8 * ( $c = “EDBT” OR $c = “ICDE”)) + (100 * ( $a = “Gray”)) - (2001 - $y)
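To make the UTILITY expression concrete, here is a minimal sketch that evaluates it for one matched article. It assumes a true condition counts as 1 and a false one as 0, and that `$a` refers to the article's author, which the WHERE clause above does not bind. The function name `utility` and its parameters are illustrative only.

```python
def utility(conference, year, author=None, current_year=2001):
    """Score one article under the example profile's UTILITY clause.

    Assumption: each boolean condition contributes its weight when true
    and 0 when false; `author` stands in for the unbound $a variable.
    """
    score = 0
    if conference in ("SIGMOD", "VLDB"):
        score += 10
    if conference in ("EDBT", "ICDE"):
        score += 8
    if author == "Gray":
        score += 100
    # Older papers lose one point per year of age.
    score -= current_year - year
    return score


print(utility("SIGMOD", 2000))        # a one-year-old SIGMOD paper
print(utility("ICDE", 1999, "Gray"))  # a two-year-old ICDE paper by Gray
```

A broker could use such scores to rank which matching items to deliver, cache, or prefetch first.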

Summary So Far • Despite initial impressions, Dissemination and Synchronization are closely related. • A common infrastructure can support both. • Basis is an overlay network with application-level routing, transparent caching, staging, etc. • Nodes are clients, proxies, brokers, and sources. • Various data delivery mechanisms combined via network transparency. • User profiles are the key to push-based delivery, precomputation, and network data management.

3. Data Dissemination • A huge amount of dynamic data and the ubiquity of information services create demand for timely dissemination of data to a large set of consumers: • Stock and sport tickers • Personalized news delivery • Traffic information systems • Software distribution • Asymmetric (server to devices) data flow/usage dictates system architecture. • Selective Dissemination of Information (SDI) • the right data to the right people at the right time

Dissemination Topics • The DBIS Toolkit • XFilter: efficient routing and filtering of XML documents. • Related database technologies: triggers and continuous queries.

Dissemination-Based Information Systems (DBIS) • Outgrowth of “Broadcast Disks” project, SIGMOD 95 (Acharya et al.) • Framework proposed OOPSLA 97 (Franklin & Zdonik) • Toolkit description/demo SIGMOD 99 (Altinel et al.) • XML-based Profile system (XFilter) in VLDB 00 (Altinel & Franklin) • Profile learning techniques in ICDE 00 (Cetintemel, Franklin, Giles) • Now part of “Data Centers” NSF ITR Project with Stan Zdonik @ Brown & Mitch Cherniack @ Brandeis - focus on profile-based data management

DBIS Framework • The DBIS Framework is based on three fundamental principles: 1) No one data delivery mechanism is best for all situations (e.g., apps, workloads, topologies). 2) Network Transparency: Must allow different mechanisms for data delivery to be applied at different points in the system. 3) Topology, routing, and delivery mechanism should vary adaptively in response to system changes. • Goal is to provide a library of components from which to construct dissemination apps.

DBIS Example • An example: [Figure: a DB Server and three proxy caches; links labeled “Unicast pull” and “1-to-n push”; annotation: “Can vary dynamically”.]

DBIS Toolkit • Data Source Library – wraps data sources to encapsulate communication and convert data. • Client Library – encapsulates comm., converts queries and profiles, monitors and filters data. • Information Broker – primary component of the DBIS. Handles communication transducing, caching, scheduling, profile management and matching. • Catalog Manager (master) • Real-Time Performance Monitoring Tool and Control Panel.

DBIS Components

[Figure: Information Broker architecture. Internal components: Data Source Manager, Broker Manager, Profile Manager, Mapper, Cache, HD, Broadcast Manager, Scheduler, Client Manager, Network Manager. External parties: Data Sources, Other Information Brokers, IB Master, Clients, Broadcast Medium. Labeled flows: Data Source Registration, Data Items, Pull Requests, Profiles / Pull Requests, Forwarded Profiles, Decomposed Profiles / Profile Updates, Catalog Updates, Filtered Data, Acknowledgement (Tune information).]

More on Brokers • Brokers are middleware components that can act as both clients and servers. • Must support data caching • Needed to convert pushed-data to pulled-data • Also allows implementation of hierarchical caching • Profile Management • Profiles needed for push • Allow informed data management: prefetch, staging, etc. • Profile Matching • No profile language sufficient for all applications. • Need an API for adding app-specific profiling

DBIS Toolkit

DBIS Research Issues • Each data delivery mechanism has unique aspects • Broadcast Disks - scheduling, caching, prefetching, updates, error handling,… • On-demand Broadcast - scheduling, data staging • Publish/Subscribe - large-scale filtering, channelization • Security/Fault-tolerance/Reliability • End-to-End network design and control • Fundamental performance tradeoffs • Profile Languages and Processing

XFilter: XML Document Filtering • Provides efficient filtering (routing) of XML documents against many XPath profiles by: • Representation of XPath queries as Finite State Machines (FSMs) • Sophisticated FSM indexing and processing • Enhancements to avoid “query” skew • Accepts any XML document (no DTDs needed) • Implemented in the DBIS-Toolkit and as a stand-alone library • Developed by Mehmet Altinel for his Ph.D. work, published in [Altinel & Franklin, VLDB 2000]

Why XML-Based SDI? • XML is becoming the dominant format for data exchange on the Internet • XML provides structural and semantic cues • Query languages for XML have been developed • The combination of XML encoding and expressive query languages allows the creation of highly focused and accurate profiles

An XML-Based SDI System • The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles. • [Figure: Data Sources, XML Conversion, XML Documents, Filter Engine, User Profiles, Filtered Data, Users.]

XPath as a Profile Language • W3C recommendation (used for path expressions in XSLT and XPointer) • Has the right level of expressiveness for SDI • Operates on a single document at a time • Can address any node in an XML document using hierarchical relationships, wildcards and element node filters • In XFilter, we use XPath to describe predicates over entire documents • If the result contains at least one element of a document, then the document satisfies the XPath expression

Important XPath Features • Parent/Child (‘/’) and Ancestor/Descendant (‘//’): /catalog/product//msrp • Wildcards (match any single element): /catalog/*/msrp • Element Node Filters to further refine the nodes: • Filters can contain nested path expressions • //product[price/msrp < 300]/name (filter applied to the product element node)
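The three example expressions can be tried on a small document. This is a sketch using Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. Paths are therefore written relative to the root `catalog` element, and the numeric filter `price/msrp < 300` is applied in Python because ElementTree predicates do not support `<`. The sample catalog is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element names follow the slide's examples.
CATALOG = """
<catalog>
  <product><name>Widget</name><price><msrp>250</msrp></price></product>
  <product><name>Gadget</name><price><msrp>450</msrp></price></product>
  <bundle><msrp>600</msrp></bundle>
</catalog>
"""
root = ET.fromstring(CATALOG)  # root is the <catalog> element

# /catalog/product//msrp : msrp at any depth below a product child
descendant_msrp = [e.text for e in root.findall("./product//msrp")]

# /catalog/*/msrp : msrp exactly two levels below catalog, any parent name
wildcard_msrp = [e.text for e in root.findall("./*/msrp")]

# //product[price/msrp < 300]/name : the filter is evaluated per product node
cheap_names = [
    p.findtext("name")
    for p in root.iter("product")
    if float(p.findtext("price/msrp")) < 300
]

# As a profile, an expression is satisfied when its result is non-empty.
document_matches = bool(cheap_names)
print(descendant_msrp, wildcard_msrp, cheap_names, document_matches)
```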

XFilter Architecture • [Figure: User Profiles (XPath Queries, e.g. /a/b[c/d]/e, //d/*/*/e, /b/e, /a//b/c, //b/d/*/e, /c/*/d//e) enter the XPath Parser, which produces Path Nodes for the Query Index and Profile Info for the Profile Base. XML Documents enter the SAX-based XML Parser, whose Element Events drive the Filter Engine. The Filter Engine reports Successful Queries; the output is Successful Profiles & Filtered Data.]

XML Parsing and Filtering • Event-based XML Parsing using SAX API • XML documents are converted to a linear sequence of events that drive the execution of the filter • Callback functions are implemented to deal with the different events • Start Element • Element Data • End Element
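The three callbacks map directly onto Python's standard `xml.sax` API. The sketch below records the linear event sequence together with the current nesting level, which a filter would need for level checks. The class name `EventRecorder` and the tuple layout are illustrative choices, not XFilter's actual interfaces.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Turn a document into a flat list of (event, value, level) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.level = 0  # depth of the element currently open (root = 1)

    def startElement(self, name, attrs):      # Start Element
        self.level += 1
        self.events.append(("start", name, self.level))

    def characters(self, content):            # Element Data
        text = content.strip()
        if text:  # skip whitespace-only chunks between tags
            self.events.append(("data", text, self.level))

    def endElement(self, name):                # End Element
        self.events.append(("end", name, self.level))
        self.level -= 1


def parse_events(xml_text):
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


for event in parse_events("<a><b>hi</b><c/></a>"):
    print(event)
```

Note that a SAX parser may deliver long character data in several chunks, so a production handler would buffer text until the end-element event.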

Filter Engine • Tricky aspects of the XPath language: • Checking the order of elements in the queries • Handling wildcards and descendant operators • Evaluating filters that are applied to element nodes (Nested path expressions) • Solution: • Convert each XPath query into a Finite State Machine (FSM) • A profile is considered to be satisfied when its final state is reached • Index the states of FSMs for efficient evaluation

FSM Representation • Each element node is a state • A state is represented using a Path Node structure (Contains information to process current state): • Compare the level of element name in input document with the level value of the path node • Evaluate the element node filter if there is any • Locate next path nodes for the state change in the FSM representation • Calculate the level values of next states using relative distance values (in terms of levels) stored in the path nodes • Not generated for wildcard (“*”) nodes

Path Node Decomposition • Query: / a / * / b // c[@att1 = ‘500’] / d • Path Node 1 (a): Rel Dist = NA, Level = 1 • Path Node 2 (b): Rel Dist = 2, Level = ? • Path Node 3 (c): Rel Dist = NA, Level = Any, with Filter Expression • Path Node 4 (d): Rel Dist = 1, Level = ? • [Figure: trace over the document fragment <a> <x> <b> <y> <c att1 = 500> <d/> </c> … : a matches at level 1, b at level 3, c at any level subject to its filter expression, and d at level 6, at which point the query is satisfied.]

Handling Multiple Queries • Key insight for scalable SDI: • Index the queries instead of the data • Hash table based on the element names in the queries • Each node contains two lists of path nodes: • Candidate List: Stores the path nodes that represent current state of each query • Wait List: Stores the path nodes that represent the future states • State transition is represented by promoting a path node from the Wait List to the Candidate List • Initial distribution of path nodes has a significant impact on performance
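The Candidate List / Wait List mechanics can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the XFilter implementation: path nodes are supplied already decomposed, element node filters are ignored, `None` encodes a "?" level, and `-1` encodes "any level". The names `QueryIndex`, `PathNode`, and `filter` are hypothetical. Promoted nodes are removed again when the element that triggered the promotion closes, so a later sibling cannot match by accident.

```python
from collections import defaultdict, namedtuple

ANY = -1  # "Level = Any": the element may occur at any depth
PathNode = namedtuple("PathNode", "element rel_dist level")  # level None = "?"


class QueryIndex:
    def __init__(self):
        self.queries = {}                   # query id -> list of PathNode
        self.candidate = defaultdict(list)  # element -> [(qid, pos, level)]
        self.wait = defaultdict(list)       # element -> [(qid, pos)]

    def add_query(self, qid, nodes):
        self.queries[qid] = nodes
        # The first path node is the current state; the rest are future states.
        self.candidate[nodes[0].element].append((qid, 0, nodes[0].level))
        for pos in range(1, len(nodes)):
            self.wait[nodes[pos].element].append((qid, pos))

    def filter(self, events):
        """events: iterable of ("start" | "end", element_name)."""
        matched, undo, depth = set(), [], 0
        for kind, name in events:
            if kind == "start":
                depth += 1
                promoted = []
                # Iterate over a copy: promotion may append to this same list.
                for qid, pos, level in list(self.candidate.get(name, ())):
                    if level not in (ANY, depth):
                        continue  # level check failed
                    nodes = self.queries[qid]
                    if pos + 1 == len(nodes):
                        matched.add(qid)  # final state reached
                        continue
                    nxt = nodes[pos + 1]
                    nxt_level = ANY if nxt.level == ANY else depth + nxt.rel_dist
                    entry = (qid, pos + 1, nxt_level)
                    self.candidate[nxt.element].append(entry)  # promote
                    promoted.append((nxt.element, entry))
                undo.append(promoted)
            else:
                # Leaving the element: retract the promotions it caused.
                for element, entry in undo.pop():
                    self.candidate[element].remove(entry)
                depth -= 1
        return matched


index = QueryIndex()
# Q1 = /a/b//c and Q2 = //b/*/c/d, decomposed by hand.
index.add_query("Q1", [PathNode("a", None, 1), PathNode("b", 1, None),
                       PathNode("c", None, ANY)])
index.add_query("Q2", [PathNode("b", None, ANY), PathNode("c", 2, None),
                       PathNode("d", 1, None)])

# Events for <a><b><z><c><d/></c></z></b></a>
doc = [("start", t) for t in "abzcd"] + [("end", t) for t in "dczba"]
print(sorted(index.filter(doc)))
```

In this sketch the Wait List is kept only to mirror the slide's structure; promotion looks up the next path node through the per-query list.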

Examples • Path nodes are listed as Query Id-Position (Rel Dist, Level) • Q1 = / a / b // c: Q1-1 (NA, 1); Q1-2 (1, ?); Q1-3 (NA, -1) • Q2 = // b / * / c / d: Q2-1 (NA, -1); Q2-2 (2, ?); Q2-3 (1, ?) • Q3 = / * / a / c // d: Q3-1 (NA, 2); Q3-2 (1, ?); Q3-3 (NA, -1) • Q4 = b / d / e: Q4-1 (NA, -1); Q4-2 (1, ?); Q4-3 (1, ?) • Q5 = / a / * / * / c // e: Q5-1 (NA, 1); Q5-2 (3, ?); Q5-3 (NA, -1)
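The Rel Dist and Level values in these examples follow a few simple rules, sketched below. The function `decompose` is a hypothetical helper that handles only plain paths (no element node filters, no wildcard directly after `//`). It encodes "NA" and "?" as `None` and "any level" as `-1`; those encodings are choices made for this sketch.

```python
import re
from typing import NamedTuple, Optional

ANY = -1  # level value for "any depth"


class PathNode(NamedTuple):
    element: str
    position: int
    rel_dist: Optional[int]  # None stands for "NA"
    level: Optional[int]     # None stands for "?"; ANY (-1) for any level


def decompose(xpath):
    """Split a simple XPath into path nodes; wildcards produce no node."""
    steps = re.findall(r"(//|/)?([^/]+)", xpath.replace(" ", ""))
    # Absolute depth is known only while we descend from the root via '/'.
    depth = 0 if steps and steps[0][0] == "/" else None
    nodes, dist, after_descendant = [], 0, False
    for axis, name in steps:
        if axis == "//":
            after_descendant, depth = True, None
        dist += 1  # levels since the previous path node
        if depth is not None:
            depth += 1
        if name == "*":
            continue  # a wildcard only adds to the distance
        position = len(nodes) + 1
        if not nodes or after_descendant:
            # First node, or node reached via '//': no relative distance.
            level = ANY if depth is None else depth
            nodes.append(PathNode(name, position, None, level))
        else:
            # Level is resolved at run time as previous level + rel_dist.
            nodes.append(PathNode(name, position, dist, None))
        dist, after_descendant = 0, False
    return nodes


for query in ("/ a / b // c", "// b / * / c / d", "/ * / a / c // d",
              "b / d / e", "/ a / * / * / c // e"):
    print(query, decompose(query))
```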

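Query Index Construction • Element Hash Table (CL: Candidate List, WL: Wait List) • a: CL = Q1-1, Q3-1, Q5-1; WL = empty • b: CL = Q2-1, Q4-1; WL = Q1-2 • c: CL = empty; WL = Q1-3, Q2-2, Q3-2, Q5-2 • d: CL = empty; WL = Q2-3, Q3-3, Q4-2 • e: CL = empty; WL = Q4-3, Q5-3 • z: CL = empty; WL = empty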

Enhanced Algorithms • Drawbacks of the “Basic” approach: • Query skew: hot elements are likely to have very long Candidate Lists • Unnecessary evaluations of queries for which the input document contains only a subset of the required element names • Two enhancement strategies: • List Balance • Prefiltering

List Balance Algorithm • When adding an FSM to the Query Index, select a “pivot” Path Node whose element has the shortest Candidate List length • Treat the pivot node as the initial state of the FSM • Attach the portion of FSM that precedes the pivot node as a prefix • Evaluate the prefix as a precondition by using a stack of traversed element nodes in the XML document
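Pivot selection can be sketched as follows. This is a minimal illustration, assuming each query is given as the ordered element names of its path nodes, that queries are added one at a time, and that ties are broken in favor of the earliest path node. The tie-breaking rule and the names `choose_pivot` and `place_queries` are assumptions of this sketch.

```python
from collections import Counter


def choose_pivot(elements, cl_lengths):
    """Index of the path node whose element has the shortest Candidate List.

    min() returns the first minimal index, so ties go to the earliest node.
    """
    return min(range(len(elements)), key=lambda i: cl_lengths[elements[i]])


def place_queries(queries):
    """Return {query id: (pivot element, prefix elements)}."""
    cl_lengths = Counter()  # element -> current Candidate List length
    placement = {}
    for qid, elements in queries.items():
        pivot = choose_pivot(elements, cl_lengths)
        cl_lengths[elements[pivot]] += 1  # pivot becomes the initial state
        placement[qid] = (elements[pivot], elements[:pivot])
    return placement


# Path node elements of Q1..Q5 from the earlier examples (wildcards omitted).
queries = {
    "Q1": ["a", "b", "c"],
    "Q2": ["b", "c", "d"],
    "Q3": ["a", "c", "d"],
    "Q4": ["b", "d", "e"],
    "Q5": ["a", "c", "e"],
}
for qid, (pivot, prefix) in place_queries(queries).items():
    print(qid, "pivot:", pivot, "prefix:", prefix)
```

With these inputs every Candidate List ends up with exactly one entry, in contrast to the basic scheme, where element a carries three.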

FSMs in List Balance • Path nodes are listed as Query Id-Position (Rel Dist, Level); nodes that precede the pivot are moved into the Prefix • Q1 = / a / b // c: no prefix; Q1-1 (NA, 1); Q1-2 (1, ?); Q1-3 (NA, -1) • Q2 = // b / * / c / d: no prefix; Q2-1 (NA, -1); Q2-2 (2, ?); Q2-3 (1, ?) • Q3 = / * / a / c // d: Prefix = a; Q3-1 (1, ?); Q3-2 (NA, -1) • Q4 = b / d / e: Prefix = b; Q4-1 (1, ?); Q4-2 (1, ?) • Q5 = / a / * / * / c // e: Prefix = a, c; Q5-1 (NA, -1)

Query Index in List Balance • Element Hash Table (CL: Candidate List, WL: Wait List) • a: CL = Q1-1; WL = empty • b: CL = Q2-1; WL = Q1-2 • c: CL = Q3-1; WL = Q1-3, Q2-2 • d: CL = Q4-1; WL = Q2-3, Q3-2 • e: CL = Q5-1; WL = Q4-2 • z: CL = empty; WL = empty

Prefiltering • Implemented as an initial pass that is performed before processing the queries • Based on Yan’s [Yan 94] Key Based algorithm • Each input XML document is parsed twice • In the first pass: • Match the element names for each query with the document • In the second pass: • Consider only the queries that passed the first step • Selectivity of the Prefiltering step determines its benefit.
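The idea behind the first pass can be sketched with plain set containment: a query stays in play only if every element name it mentions occurs somewhere in the document. This is a simplified stand-in for illustration, not the key-based algorithm of [Yan 94] itself; `prefilter` and its arguments are hypothetical names.

```python
def prefilter(document_elements, query_elements):
    """First pass: keep queries whose element names all occur in the document.

    document_elements: iterable of element names seen while parsing
    query_elements: {query id: iterable of element names used by the query}
    """
    present = set(document_elements)
    return [
        qid
        for qid, names in query_elements.items()
        if set(names) <= present  # subset test; order and depth are ignored
    ]


queries = {
    "Q1": ["a", "b", "c"],
    "Q2": ["b", "c", "d"],
    "Q4": ["b", "d", "e"],
}
# Element names collected during the first parse of <a><b><z><c/></z></b></a>
survivors = prefilter(["a", "b", "z", "c"], queries)
print(survivors)  # only these queries go on to the second, structural pass
```

Survivors are only candidates: the second pass still has to check ordering, levels, and filters.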

Nested Path Expressions • Element Node Filters may contain other XPath queries • Nested query is treated like a separate query • For relative execution, initial state of nested query is activated after parent element node is satisfied. • If result not available, assume true and “mark” for later re-evaluation. • Example: Q1 = / a // b[ c / d = 100] / e • [Figure: Q1 has path nodes a, b, e; the nested path c / d in the filter on b is a separate query Q2 with path nodes c, d.]

Performance Evaluation • Experimental Environment • NITF DTD is used to generate input documents and queries (Contains 158 elements organized in 7 levels with 588 attributes) • IBM’s XML Generator is used to create input documents • We implemented a similar XPath query generator • Workload Parameters to Examine • Scalability of the algorithms • Different document and query settings

Scalability Experiments (Max. Depth = 5, No Wildcards, No filters)
