Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) Ryan Huebsch Joe Hellerstein, Nick Lanham, Boon Thau Loo, Timothy Roscoe, Scott Shenker, Ion Stoica p2p@db.cs.berkeley.edu UC Berkeley, CS Division Intel Berkeley Research 4/14/03
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Information in the Wide Area • The Internet enables massive information collection and dissemination • Lots of data is naturally distributed • Files – MP3s (Napster, Kazaa, AudioGalaxy) • Logs – software, network, virus, usage • Instant Messaging/Blogs – (AIM, LiveJournal) • Sensor data – (TinyDB, IrisNet) • It is hard to find exactly what you want • We need Internet-scale query processors to find the data the user wants
What is an Internet Scale Query Processor? • Environment/Requirements • Thousands to millions of nodes • Set of nodes constantly changing • Streaming data, streaming queries • In situ data processing • Easy to use; the set of queries is not static • No central authority or administrator • So what about Oracle, DB2, SQL Server? What about Distributed Hash Tables (DHTs)? Can these solve the problem?
Databases are Cool, but… • Relational databases bring a declarative interface to the user/application • Ask for what you want, not how to get it • The database community is not new to parallel and distributed systems • Parallel: centralized, one administrator, one point of failure • Distributed: did not catch on; complicated, never really scaled above 100’s of machines • “VLDB” currently means 100’s of machines; the Internet needs many more
P2P DHTs are Cool, but… • Lots of effort is put into making DHTs • Decentralized control • Scalable (thousands → millions of nodes) • Reliable (every imaginable failure) • Secure (anonymity, encryption, etc.) • Efficient (fast access with minimal state) • Load balancing, and others • Still only a hash table interface: put and get • Hard (but not impossible) to build real applications using only the basic primitives
Databases + P2P DHTs: Marriage Made in Heaven? • Well, databases carry a lot of other (expensive) baggage • ACID transactions • Consistency above all else • So we just want to unite the query processor with DHTs • DHTs + Relational Query Processing = PIER • Bring complex queries to DHTs → a foundation for real applications
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Architecture • DHT is divided into 3 modules • We’ve chosen one way to do this, but this may change with time and experience • Goal is to make each module simple and replaceable • PIER has one primary module • Add-ons can make it look more database-like
Architecture: DHT: Routing • lookup(key) → ipaddr • join(landmarkNode) • leave() • CALLBACK: locationMapChange() • Very simple interface • Plug in any routing algorithm here: CAN, Chord, Pastry, Tapestry, Kademlia, SkipNet, Viceroy, etc.
Architecture: DHT: Routing (Version 2 of API, in progress) • lookup(key) → ipaddr • route(key, msg) • join(landmarkNode) • leave() • CALLBACK: locationMapChange() • CALLBACK: inRoute(key, msg) • Very simple interface • Plug in any routing algorithm here: CAN, Chord, Pastry, Tapestry, Kademlia, SkipNet, Viceroy, etc.
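A minimal Java sketch of the routing-layer API listed above. The interface name, byte[] keys, and InetSocketAddress return type are illustrative assumptions, not PIER's actual class signatures:

```java
// Hypothetical sketch of the routing-layer API described on the slide above.
// Names and types are illustrative, not PIER's actual classes.
import java.net.InetSocketAddress;

public interface RoutingLayer {
    /** Map a key to the node currently responsible for it. */
    InetSocketAddress lookup(byte[] key);

    /** Route an opaque message toward the node responsible for key (v2 API). */
    void route(byte[] key, byte[] msg);

    /** Join the overlay via a known landmark node; leave gracefully. */
    void join(InetSocketAddress landmarkNode);
    void leave();

    /** Callback: the set of keys this node is responsible for has changed. */
    void onLocationMapChange(Runnable callback);

    /** Callback: a routed message is passing through this node (v2 API). */
    void onInRoute(InRouteHandler handler);

    interface InRouteHandler {
        /** Return true to keep forwarding the message, false to consume it here. */
        boolean inRoute(byte[] key, byte[] msg);
    }
}
```

Any overlay (CAN, Chord, Pastry, ...) would sit behind an interface of roughly this shape, which is what makes the routing layer pluggable.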
Architecture: DHT: Storage • store(key, item) • retrieve(key) → item • remove(key) • Currently we use a simple in-memory storage system; no reason a more complex one couldn’t be used
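The storage API above is small enough to render directly; the Java types below are assumptions for illustration, not PIER's code:

```java
// Trivial rendering of the storage API above; an in-memory HashMap-backed
// implementation would satisfy it, as would something more durable.
import java.util.Optional;

public interface StorageManager {
    void store(byte[] key, byte[] item);
    Optional<byte[]> retrieve(byte[] key);
    void remove(byte[] key);
}
```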
Architecture: DHT: Provider • get(ns, rid) → item • put(ns, rid, iid, item, lifetime) • renew(ns, rid, iid, lifetime) → success? • multicast(ns, item) • unicast(ns, rid, item) • lscan(ns) → items • CALLBACK: newData(ns, item) • Connects the pieces, and provides the “DHT” interface
Architecture: DHT: Provider (Version 2 of API, in progress) • get(ns, rid) → item • put(ns, rid, iid, item, lifetime) • renew(ns, rid, iid, lifetime) → success? • multicast(ns, item) • unicast(ns, rid, item, route?) • lscan(ns) → items • CALLBACK: newData(ns, item) • Connects the pieces, and provides the “DHT” interface
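A hedged Java rendering of the provider interface above; the method names mirror the slide, but the parameter types (Duration lifetimes, a placeholder Tuple type, a Consumer-based newData registration) are assumptions:

```java
// Hypothetical sketch of the provider API; signatures are illustrative only.
import java.time.Duration;
import java.util.List;
import java.util.function.Consumer;

public interface Provider {
    Tuple get(String namespace, byte[] resourceId);
    void put(String namespace, byte[] resourceId, byte[] instanceId,
             Tuple item, Duration lifetime);
    boolean renew(String namespace, byte[] resourceId, byte[] instanceId,
                  Duration lifetime);
    void multicast(String namespace, Tuple item);
    void unicast(String namespace, byte[] resourceId, Tuple item);

    /** Scan all items stored locally under a namespace. */
    List<Tuple> lscan(String namespace);

    /** Register the newData callback: fires when an item arrives in the namespace. */
    void onNewData(String namespace, Consumer<Tuple> callback);

    /** Placeholder type for stored items. */
    interface Tuple {}
}
```

In PIER, relations map naturally onto namespaces (ns), with rid identifying a tuple's key and iid distinguishing instances; the lifetime makes storage soft-state.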
Architecture: PIER • Currently, PIER consists only of the relational execution engine • Executes a pre-optimized query plan • Query plan is a box-and-arrow description of how to connect basic operators together • selection, projection, join, group-by/aggregation, and some DHT-specific operators such as rehash • Traditional DBs use an optimizer + catalog to take SQL and generate the query plan; those are “just” add-ons to PIER • Research!
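As a rough illustration of the "box-and-arrow" idea, the sketch below wires a selection operator over a scan by hand, with no parser or optimizer; the Operator interface and pull-style output() method are simplifications, not PIER's actual execution model:

```java
// Minimal sketch of wiring operators into a pre-optimized plan.
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

interface Operator { List<String[]> output(); }

class Scan implements Operator {
    private final List<String[]> tuples;                 // e.g. the result of an lscan
    Scan(List<String[]> tuples) { this.tuples = tuples; }
    public List<String[]> output() { return tuples; }
}

class Selection implements Operator {
    private final Operator child;
    private final Predicate<String[]> predicate;
    Selection(Operator child, Predicate<String[]> predicate) {
        this.child = child; this.predicate = predicate;
    }
    public List<String[]> output() {
        return child.output().stream().filter(predicate).collect(Collectors.toList());
    }
}

// Wiring the boxes: a selection over a scan. A join or rehash operator
// would slot in the same way, e.g.
//   Operator plan = new Selection(new Scan(localTuples), t -> t[0].equals("snort"));
```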
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Joins: The Core of Query Processing • A relational join can be used to: • Calculate the intersection of two sets • Correlate information • Find matching data • Goal: • Get tuples that have the same value for a particular attribute(s) (the join attribute(s)) to the same site, then append tuples together • Algorithms come from the existing database literature, with minor adaptations to use the DHT
Joins: Symmetric Hash Join (SHJ) • Algorithm for each site • (Scan) Use two lscan calls to retrieve all data stored at that site from the source tables • (Rehash) Put a copy of each eligible tuple with the hash key based on the value of the join attribute • (Listen) Use newData to see the rehashed tuples • (Compute) Run a standard one-site join algorithm on the tuples as they arrive • Scan/Rehash steps must be run on all sites that store source data • Listen/Compute steps can be run on fewer nodes by choosing the hash key differently
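The Listen/Compute steps can be sketched as follows: each rehashed tuple that arrives (e.g. via newData) first probes the hash table built for the opposite relation, then is inserted into its own. This is a self-contained toy; the string-array tuples and the fromR flag are assumptions, not PIER's representation:

```java
// Toy sketch of the Listen/Compute half of SHJ at one site.
import java.util.*;

public class SymmetricHashJoin {
    private final Map<String, List<String[]>> rTable = new HashMap<>();
    private final Map<String, List<String[]>> sTable = new HashMap<>();
    private final List<String[]> results = new ArrayList<>();

    /** Called for every rehashed tuple delivered to this site. */
    public void onArrival(String joinKey, String[] tuple, boolean fromR) {
        Map<String, List<String[]>> mine  = fromR ? rTable : sTable;
        Map<String, List<String[]>> other = fromR ? sTable : rTable;
        // Probe the opposite relation first, emitting any matches ...
        for (String[] match : other.getOrDefault(joinKey, List.of())) {
            String[] joined = new String[tuple.length + match.length];
            System.arraycopy(tuple, 0, joined, 0, tuple.length);
            System.arraycopy(match, 0, joined, tuple.length, match.length);
            results.add(joined);
        }
        // ... then remember this tuple for tuples that arrive later.
        mine.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(tuple);
    }

    public List<String[]> results() { return results; }
}
```

Because every arrival both probes and builds, no relation has to arrive first, which is what makes the algorithm suitable for streaming, rehashed input.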
Joins: Fetch Matches (FM) • Algorithm for each site • (Scan) Use lscan to retrieve all data from ONE table • (Get) Based on the value for the join attribute, issue a get for the possible matching tuples from the other table • Note, one table (the one we issue the gets for) must already be hashed/stored on the join attribute • Big picture: • SHJ is put based • FM is get based
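A minimal sketch of the get-based Fetch Matches strategy, assuming a synchronous DHT get passed in as a function; the names, tuple layout, and namespace argument are illustrative:

```java
// Toy sketch of Fetch Matches: scan R locally, get matching S tuples by key.
import java.util.*;
import java.util.function.BiFunction;

public class FetchMatchesJoin {
    /**
     * localR : tuples of R stored at this site (obtained via lscan in PIER).
     * dhtGet : stands in for a DHT get on the namespace where S is already
     *          hashed/stored by its join attribute.
     */
    public List<String[]> join(List<String[]> localR, int rJoinCol,
                               BiFunction<String, String, List<String[]>> dhtGet,
                               String sNamespace) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : localR) {
            String key = r[rJoinCol];
            for (String[] s : dhtGet.apply(sNamespace, key)) {   // fetch candidates
                String[] joined = new String[r.length + s.length];
                System.arraycopy(r, 0, joined, 0, r.length);
                System.arraycopy(s, 0, joined, r.length, s.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```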
Joins: Additional Strategies • Bloom Filters • Bloom filters can be used to reduce the amount of data rehashed in the SHJ • Symmetric Semi-Join • Run an SHJ on the source data, projected down to only the hash key and join attributes • Use the results of this mini-join as the source for two FM joins to retrieve the other attributes for tuples that are likely to be in the answer set • Big Picture: • Trade off bandwidth (extra rehashing) against latency (time to exchange filters)
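To illustrate the Bloom-filter idea, the toy filter below summarizes the join keys one site holds for a relation; after filters are exchanged, a site rehashes only tuples whose keys might appear on the other side. The two hash functions and the fixed filter size are arbitrary assumptions, not PIER's implementation:

```java
// Toy Bloom filter over join keys, used to prune rehashing in SHJ.
import java.util.BitSet;

public class JoinKeyBloomFilter {
    private static final int BITS = 1 << 16;     // fixed size, chosen arbitrarily
    private final BitSet bits = new BitSet(BITS);

    public void add(String joinKey) {
        bits.set(h1(joinKey));
        bits.set(h2(joinKey));
    }

    /** False positives are possible, false negatives are not: safe for pruning. */
    public boolean mightContain(String joinKey) {
        return bits.get(h1(joinKey)) && bits.get(h2(joinKey));
    }

    private static int h1(String s) { return Math.floorMod(s.hashCode(), BITS); }
    private static int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, BITS); }
}
```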
Naïve Group-By/Aggregation • A group-by/aggregation can be used to: • Split data into groups based on a value • Compute Max, Min, Sum, Count, etc. • Goal: • Get tuples that have the same value for a particular attribute(s) (the group-by attribute(s)) to the same site, then summarize the data (aggregation)
Naïve Group-By/Aggregation • At each site • (Scan) lscan the source table • Determine which group each tuple belongs in • Add the tuple’s data to that group’s partial summary • (Rehash) for each group represented at the site, rehash the summary tuple with a hash key based on the group-by attribute • (Combine) use newData to get partial summaries; combine and produce the final result after a specified time, number of partial results, or rate of input • Hierarchical Aggregation: can add multiple layers of rehash/combine to reduce fan-in • Subdivide groups into subgroups by randomly appending a number to the group’s key
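A self-contained sketch of the Scan and Combine steps for a COUNT/SUM-style aggregate; the DHT rehash of the per-group partials is abstracted away, and the field positions and Partial type are assumptions for illustration:

```java
// Toy sketch of local partial aggregation and final combining.
import java.util.*;

public class NaiveAggregation {
    /** Partial summary a site produces per group before rehashing it. */
    public static final class Partial {
        public long count;
        public double sum;
        public Partial merge(Partial other) { count += other.count; sum += other.sum; return this; }
    }

    /** (Scan): fold locally stored tuples into one partial per group. */
    public Map<String, Partial> localPartials(List<String[]> localTuples,
                                              int groupCol, int valueCol) {
        Map<String, Partial> partials = new HashMap<>();
        for (String[] t : localTuples) {
            Partial p = partials.computeIfAbsent(t[groupCol], k -> new Partial());
            p.count += 1;
            p.sum += Double.parseDouble(t[valueCol]);
        }
        return partials;   // each entry is then rehashed on its group key
    }

    /** (Combine): at the site owning a group key, merge partials as they arrive. */
    public Partial combine(Collection<Partial> arrived) {
        Partial total = new Partial();
        for (Partial p : arrived) total.merge(p);
        return total;
    }
}
```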
Naïve Group-By/Aggregation [Diagram: application overlay tree from the Sources up to the Root; each message may take multiple hops, and fewer nodes participate at each level]
Smarter Aggregation • Naïve method has multiple hops in the overlay network between each level • Idea: aggregate along the overlay path from the source to the root • Every node with data is a leaf • Aggregate local data and route towards a predetermined node • Nodes along the path (which may also be leaves) intercept these messages (via the new API callback) • These nodes wait for a period of time, aggregating all the messages they see, then continue routing towards the root
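A sketch of the in-path combining step, assuming an inRoute-style callback from the version 2 routing API: partial aggregates heading toward the root are intercepted, merged for a short wait window, then forwarded as one message. The class and method names are illustrative, not PIER's:

```java
// Toy sketch of in-path aggregation driven by an inRoute-style callback.
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;

public class InPathAggregator<P> {
    private P pending;                              // partials merged so far
    private boolean timerArmed;
    private final Timer timer = new Timer(true);
    private final long waitMillis;
    private final BinaryOperator<P> merge;
    private final Consumer<P> forwardTowardRoot;

    public InPathAggregator(long waitMillis, BinaryOperator<P> merge,
                            Consumer<P> forwardTowardRoot) {
        this.waitMillis = waitMillis;
        this.merge = merge;
        this.forwardTowardRoot = forwardTowardRoot;
    }

    /** Called when a routed partial aggregate passes through this node. */
    public synchronized boolean inRoute(P partial) {
        pending = (pending == null) ? partial : merge.apply(pending, partial);
        if (!timerArmed) {                          // start one flush timer per window
            timerArmed = true;
            timer.schedule(new TimerTask() { public void run() { flush(); } }, waitMillis);
        }
        return false;                               // consume: do not forward the original
    }

    private synchronized void flush() {
        if (pending != null) forwardTowardRoot.accept(pending);
        pending = null;
        timerArmed = false;
    }
}
```

The waitMillis parameter is exactly the long-wait/short-wait trade-off discussed on the next slides: a longer window means fewer messages but slower responses.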
Smarter Aggregation [Diagram: sources send messages toward the Root; along the overlay route, intermediate nodes combine the messages they intercept before forwarding them on to the Root]
Smarter Aggregation • Sounds like TAG? What ideas can we borrow? • Details • How to choose the root • When do nodes send their data • Long wait: fewer messages, slow response • Short wait: more messages, fast response • How do the different DHTs perform? • Chord: binomial tree • CAN: k-ary tree • For reliability, what happens when we choose multiple roots: are the trees diverse enough? • Lots of questions… few answers right now
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Going from a DHT → Query Processor → Application • Correlation, Intersection → Joins • Summarize, Aggregation, Compress → Group-By/Aggregation • Probably not as efficient as a custom-designed solution for a single particular problem • Common infrastructure for fast application development/deployment
Network Monitoring (Nick, Ryan, Timothy, Brent) • Lots of data, naturally distributed, almost always summarized → aggregation • Intrusion detection usually involves correlating information from multiple sites → join • Data comes from many sources • nmap, snort, ganglia, firewalls, web logs, etc. • PlanetLab is our natural test bed • Current Status • Connecting PIER to the various data sources • Determining what the interesting queries are
Enhanced File Searching (Boon) • First step: take over Gnutella • Just make PlanetLab look like an ultrapeer (or several) on the outside, but run PIER on the inside • Queries that intersect document lists → Join • For efficiency use Bloom Filters → Bloom Join • Objectives • Generate (positive?) publicity for PlanetLab • Generate interesting workloads that will stress PIER (and DHTs) • Study the effectiveness of ultrapeers on recall and bandwidth reduction • Determine whether a PIER/Gnutella hybrid network would scale while maintaining sufficiently high recall of results
Enhanced File Searching (Boon) • Long term: value-added services • Better searching, utilizing all of the MP3 ID tags • Reputations • Combine with network monitoring data to better estimate download times • Current status • Currently able to run Gnutella traces using PIER over Millennium and PlanetLab • Next step is to integrate with LimeWire, the popular open-source Gnutella client • Expect a "live" system by the end of summer
Network Services • i3-style Network services • Mobility and Multicast • Sender is a publisher • Receiver(s) issue a continuous query looking for new data • Service Composition • Services issue a continuous query for data looking to be processed • After processing data, they publish it back into the network • Looking forward: More complex services • Composition of e-service providers, e.g. via SOAP • Dataflow programs in the wide area
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Codebase • Approximately 17,600 lines of NCSS Java code • The same code (overlay components/PIER) runs on the simulator or over a real network without changes • Runs simple simulations with up to 10k nodes • Limiting factor: 2GB addressable memory for the JVM (in Linux) • Runs on Millennium and PlanetLab with up to 64 nodes • Limiting factor: available/working nodes & setup time • Code: • Basic implementations of Chord and CAN • Selection, projection, joins (4 methods), and naïve aggregation • Non-continuous queries
Seems to Scale [Chart: simulations of one SHJ join, comparing a warehousing approach with full parallelization]
Codebase (The Other Story) • It’s Java! • Network is slow • A C program can get 340 Mbps between two Millennium nodes • Java (nio library) gets about 100 Mbps without serialization • With serialization, about 5 Mbps! • Memory management • Use object pools to save on object creation/garbage collection • Allowed fast prototyping; “lessons” were postponed • Code was written to generate graphs for papers • To make it work for REAL people: • Error handling • User interfaces • Use real data, real queries
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Future Research • Most of the current work has been developing the infrastructure → enables the juicy research: • Smart hierarchical aggregations • Query optimization (adaptive?) and catalogs • Improved resilience (answer quality) • Routing, storage and layering • Range predicates • Continuous queries over streams • Semi-structured data (XML) • Applications, Applications, Applications…
Outline • Motivation • General Architecture • Brief look at the Algorithms • Potential Applications • Current Status • Future Research • Conclusion
Conclusion • Distributed data needs a distributed query processor • DHTs are too simple, databases too complex • PIER occupies a point in the middle of this new design space • Infrastructure for real applications • Current algorithms are simple; let’s see how far that goes • For real efficiency and reliability, more research into algorithms is needed • Still some work needed to get things running in the real world