260 likes | 373 Views
Publish/Subscribe Systems with Distributed Hash Tables and Languages from IR. Christos Tryfonopoulos & Manolis Koubarakis Intelligent Systems Lab Dept. of Electronic & Computer Engineering Technical University of Crete, Greece http://www.intelligence.tuc.gr. Overview. Motivation
E N D
Publish/Subscribe Systems with Distributed Hash Tables and Languages from IR Christos Tryfonopoulos & Manolis Koubarakis Intelligent Systems Lab Dept. of Electronic & Computer Engineering Technical University of Crete, Greece http://www.intelligence.tuc.gr
Overview • Motivation • Distributed resource sharing • The DHTrie protocols • Local filtering algorithms • Conclusions
Motivation • Resource sharing is at the core of today’s computing (Web, P2P, Grid). • One-time as well as continuous querying functionality is needed. • Data models and languages based on Information Retrieval are useful for annotating and querying resources. • Many nice technologies to build on (e.g., overlay networks, agents etc.)
Related work • Distributed information retrieval • p-Search, PlanetP, [Li et.al. 2003], [Cohen et.al. 2003], [Reynolds et.al. 2002], … • Publish/subscribe • Non DHT-based • SIFT, SIENA, Le Suscribe, Gryphon, P2P-DIET, … • DHT-based • Scribe, Bayeux (topic-based) • [Tam et.al. 2003], [Pietzuch et.al. 2003], [Terpstra et.al. 2003], [Triantafillou et.al. 2004] (content-based)
Distributed resource (file) sharing • Two kinds of basic functionality are expected: • One-time querying • A user poses a query “I want photos of Euro 2004 champions”. • The system returns a list of pointers to matching resources. • Publish/subscribe • A user posts a continuous query to receive a notification when a photo of “Euro 2004 champions” is published. • The system notifies the subscriber with a pointer to the peer that published the video clip.
Distributed resource sharing • One-time query scenario
Distributed resource sharing • Publish/subscribe scenario
Achievements in the context of DIET • Languages and data models from IR (emphasis on textual information). • Efficient filtering algorithms. • The system P2P-DIET • A super-peer based P2P system. • Implemented on top of the lightweight mobile agent platform DIET Agents. DIET project: www.dfki.uni-kl.de/IVSWEB/DIET DIET Agents: http://diet-agents.sourceforge.net/ P2P-DIET: http://www.intelligence.tuc.gr/p2pdiet • Current work: • Solve the pub/sub problem using ideas from DHTs.
Distributed Hash Tables (DHTs) • Created to solve the object location problem in a distributed (dynamic) network of nodes. • Support only one operation: Given a key, map the key onto a node • Many existing systems (Chord, CAN, Pastry, Tapestry, P-Grid, DKS, Viceroy, …). • Needs logarithmic number of messages to locate a node.
Data model… • Publications are attribute-value pairs (A,s), where A is a named attribute and s is a text value. • An example of a publication in model AWP {(AUTHOR, “John Smith”), (TITLE, “Information dissemination in P2P systems”), (ABSTRACT, “In this paper we show …”)}
…and query language • Examples of continuous queries in model AWP
Distributed resource sharing revisited • Publish/subscribe scenario
The DHTrie protocols • Subscribing with a continuous query • Assume query q of the form: • Then for a random attribute Ai and a random word wj contained in either si or wpi , we create the string Aiwj and use it as the key to forward the query to peer with ID = H(Aiwj).
The DHTrie protocols (cont’d) • Publishing a resource • Assume a publication p of the form: • Obtain a list of peer IDs by hashing string Aiwj for all words, and all attributes in p (necessary to ensure correctness). Use indirect message passing and the DHT infrastructure to forward the message. • The receiver node, contacts neighbors included in the recipients list, removes them from it and forwards the message.
Direct message passing • Traditional way to handle a message forwarding to more than one recipients. • Send a lookup() message for each recipient. • For k recipients we need O(k log(N)) lookup messages. • Multicast techniques not applicable, since group of peers to be contacted is not known a priori.
Indirect message passing • Incorporate recipient list into message • Avoid asking the same routing question more than once • Opportunistic forwarding • Increase in message size due to: • publication size • process publication (remove stopwords, stemming) • use inverted (and compressed) index • receipient list size • use gap compression (avoid peer IDs)
The DHTrie protocols • Notifying interested subscribers • To find all matching queries in a peer, we use filtering algorithm BestFitTrie. [Tryfonopoulos, Koubarakis, Drougas, SIGIR 2004] • Once all matching queries are found, a notification message is created and forwarded to peers using indirect message passing.
Filtering algorithms at each super-peer • Query clustering algorithm BestFitTrie • Data structure is a hash table of tries • Hash table is used for fast access to trie roots • We search for the best place to store query q, in two phases: • Best position trie-wise • Best position forest-wise • Matching procedure examines only tries with roots contained in the incoming document
PrefixTrie: Prefix-based clustering (handle a query as a sequence of words) BestFitTrie: Set-based clustering (handle a query as a set of words) Filtering algorithms at each super-peer
BestFitTrie 1M PrefixTrie 1M BestFitTrie 3M PrefixTrie 3M Filtering algorithms at each super-peer
Other interesting issues • Load balancing • Frequency of occurrence of words may overload certain peers. • Index queries under infrequent words. • Use controlled replication. • Word frequency computation • Also useful in other types of queries (VSM). • Global vs Local ranking schemes. • Propose a hybrid ranking scheme, with updating and estimation mechanisms.
Thank you Funding sources: IST/FET project DIET (www.dfki.uni-kl.de/IVSWEB/DIET) IST/FET project Evergrow (http://www.evergrow.org) Heraclitus Ph.D. Fellowship Program (Greece)