Querying the Internet with PIER(PIER = Peer-to-peer Information Exchange and Retrieval) Ryan Huebsch† Joe Hellerstein†, Nick Lanham†, Boon Thau Loo†, Scott Shenker‡, Ion Stoica† p2p@db.cs.berkeley.edu †UC Berkeley, CS Division ‡International Computer Science Institute, Berkeley CA VLDB 9/12/03
What is Very Large? Depends on Who You Are • Database community: single site, clusters, and distributed systems of 10’s – 100’s of nodes • Network community: Internet scale, 1000’s – millions of nodes • Challenge: How to run DB-style queries at Internet scale!
What are the Key Properties? • Lots of data that is: • Naturally distributed (where it’s generated) • Centralized collection undesirable • Homogeneous in schema • Data is more useful when viewed as a whole • This is the design space we have chosen to investigate.
Who Needs Internet Scale? Example 1: Filenames • Simple ubiquitous schemas: • Filenames, Sizes, ID3 tags • Born from early P2P systems such as Napster, Gnutella, AudioGalaxy, etc. • Content is shared by “normal” non-expert users… home users • Systems were built by a few individuals ‘in their garages’ → low barrier to entry
Example 2: Network Traces • Schemas are mostly standardized: • IP, SMTP, HTTP, SNMP log formats • Network administrators are looking for patterns within their site AND with other sites: • DoS attacks cross administrative boundaries • Tracking virus/worm infections • Timeliness is very helpful • Might surprise you how useful it is: • Network bandwidth on PlanetLab (world-wide distributed research test bed) is mostly filled with people monitoring the network status
Our Challenge • Our focus is on the challenge of scale: • Applications are homogeneous and distributed • Already have significant interest • Provide a flexible framework for a wide variety of applications • Other interesting issues we will not discuss today: • Heterogeneous schemas • Crawling/Warehousing/etc • These issues are complementary to the issues we will cover in this talk
Four Design Principles (I) • Relaxed Consistency • ACID transactions severely limit the scalability and availability of distributed databases • We provide best-effort results • Organic Scaling • Applications may start small, without a priori knowledge of size
Four Design Principles (II) • Natural habitat • No CREATE TABLE/INSERT • No “publish to web server” • Wrappers or gateways allow the information to be accessed where it is created • Standard Schemas via Grassroots software • Data is produced by widespread software providing a de facto schema to utilize
[Architecture: Declarative Queries → Query Plan → Overlay Network → Physical Network]
Recent Overlay Networks: Distributed Hash Tables (DHTs) • What is a DHT? • Take an abstract ID space, and partition it among a changing set of computers (nodes) • Given a message with an ID, route the message to the computer currently responsible for that ID • Can store messages at the nodes • This is like a “distributed hash table” • Provides a put()/get() API • Cheap maintenance when nodes come and go
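To make the put()/get() API concrete, below is a minimal in-memory sketch of the idea (hypothetical names and hashing scheme, not PIER’s or any real DHT’s code): a key is hashed into the abstract ID space and routed to the node currently responsible for that portion of the space.

```java
import java.util.*;

// Minimal sketch of a DHT's put()/get() API. Names and the routing scheme are
// illustrative only: a real DHT (e.g., CAN or Chord) routes over the network
// and repartitions the ID space as nodes join and leave.
class ToyDHT {
    private final List<Map<String, List<String>>> nodes = new ArrayList<>();

    ToyDHT(int numNodes) {
        for (int i = 0; i < numNodes; i++) nodes.add(new HashMap<>());
    }

    // Map the key into the abstract ID space, then to the node owning that ID.
    private Map<String, List<String>> route(String key) {
        int id = key.hashCode() & 0x7fffffff;
        return nodes.get(id % nodes.size());
    }

    void put(String key, String value) {
        route(key).computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<String> get(String key) {
        return route(key).getOrDefault(key, Collections.emptyList());
    }
}
```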
Recent Overlay Networks: Distributed Hash Tables (DHTs) [Figure: a 2-D ID space with corners (0,0), (0,16), (16,0), (16,16) partitioned among nodes; data with Key = (15,14) is routed to and stored at the node responsible for that portion of the space.]
Recent Overlay Networks: Distributed Hash Tables (DHTs) • Lots of effort is put into making DHTs better: • Scalable (thousands → millions of nodes) • Resilient to failure • Secure (anonymity, encryption, etc.) • Efficient (fast access with minimal state) • Load balanced • etc.
PIER’s Three Uses for DHTs • Single elegant mechanism with many uses: • Search: Index • Like a hash index • Partitioning: Value (key)-based routing • Like Gamma/Volcano • Routing: Network routing for QP messages • Query dissemination • Bloom filters • Hierarchical QP operators (aggregation, join, etc) • Not clear there’s another substrate that supports all these uses
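As a sketch of the partitioning use, the code below (hypothetical names, reusing the ToyDHT sketch above) publishes each local tuple into the DHT keyed on its join attribute, so R and S tuples with equal keys rendezvous at the same node, in the spirit of Gamma/Volcano-style exchange.

```java
import java.util.*;

// Sketch of DHT-based value (key) partitioning: each node pushes its local
// tuples into the DHT keyed on the join attribute, so matching R and S tuples
// land on the same rendezvous node. The table name is prepended so the receiver
// can tell the two inputs apart. Illustrative only; not PIER's operator code.
class Rehash {
    static void publishByJoinKey(ToyDHT dht, String tableName,
                                 List<String[]> localTuples, int joinCol) {
        for (String[] tuple : localTuples) {
            String key = tuple[joinCol];                          // value-based routing key
            dht.put(key, tableName + ":" + String.join(",", tuple));
        }
    }
}
```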
Metrics • We are primarily interested in 3 metrics: • Answer quality (recall and precision) • Bandwidth utilization • Latency • Different DHTs provide different properties: • Resilience to failures (recovery time) → answer quality • Path length → bandwidth & latency • Path convergence → bandwidth & latency • Different QP join strategies: • Symmetric Hash Join, Fetch Matches, Symmetric Semi-Join, Bloom Filters, etc. • Big picture: trade off bandwidth (extra rehashing) against latency
Experimental Setup • A simple join query: SELECT R.pkey, S.pkey, R.pad FROM R, S WHERE R.num1 = S.pkey AND R.num2 > constant1 AND S.num2 > constant2 AND f(R.num3, S.num3) > constant3 • Table R has 10 times more tuples than S • num1, num2, num3 are uniformly distributed • R.num1 is chosen such that 90% of R tuples have exactly one matching tuple in S, while 10% have no matching tuple • R.pad is used to ensure all result tuples are 1KB • Join strategy: Symmetric Hash Join
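For reference, here is a simplified, single-node sketch of a symmetric (pipelined) hash join of the kind used in this experiment, assuming R and S have already been rehashed on the join key; the class and tuple layout are illustrative, not PIER’s actual implementation.

```java
import java.util.*;

// Symmetric hash join on one rendezvous node: each arriving tuple is inserted
// into its own side's hash table and immediately probed against the other
// side's, so results stream out as tuples arrive rather than after one input
// has been fully read.
class SymmetricHashJoin {
    private final Map<String, List<String[]>> rTable = new HashMap<>();
    private final Map<String, List<String[]>> sTable = new HashMap<>();
    private final List<String[]> results = new ArrayList<>();

    void onRTuple(String[] r, String key) { insertAndProbe(r, key, rTable, sTable, true); }
    void onSTuple(String[] s, String key) { insertAndProbe(s, key, sTable, rTable, false); }

    private void insertAndProbe(String[] t, String key,
                                Map<String, List<String[]>> buildSide,
                                Map<String, List<String[]>> probeSide,
                                boolean tIsR) {
        buildSide.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
        for (String[] match : probeSide.getOrDefault(key, Collections.emptyList())) {
            String[] r = tIsR ? t : match;
            String[] s = tIsR ? match : t;
            results.add(concat(r, s));   // emit joined tuple (other predicates applied upstream)
        }
    }

    private static String[] concat(String[] a, String[] b) {
        String[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}
```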
Does This Work for Real? Scale-up Performance [Figure: scale-up results with 1MB of source data per node, on a real network deployment and in simulation.]
Current Status of PIER • PIER currently runs box-and-arrow dataflow graphs • We are working on query optimization • GUI (called Lighthouse) for query entry and result display • Designed primarily for application developers • User inputs an optimized physical query plan
Future Research • Routing, Storage and Layering • Catalogs and Query Optimization • Hierarchical Aggregations • Range Predicates • Continuous Queries over Streams • Sharing between Queries • Semi-structured Data
Symmetric Semi-Join (SSJ) • Both R and S are projected down to their join keys and tuple IDs before rehashing, to save bandwidth • The complete R and S tuples are then fetched in parallel to improve latency
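A rough outline of the idea, under the same assumptions as the earlier sketches (hypothetical names, reusing ToyDHT): only small (join key, tuple ID) projections travel through the DHT, and the full tuples are fetched in parallel once the matching keys are known.

```java
import java.util.*;

// Outline of Symmetric Semi-Join: Step 1 rehashes small projections, Step 2
// pairs up matching tuple IDs at the rendezvous node, Step 3 (not shown)
// fetches the full R and S tuples in parallel from their source nodes and
// applies the remaining predicates. Illustrative only.
class SymmetricSemiJoin {
    // Step 1: publish only (joinKey, table:tupleId) into the DHT.
    static void publishProjection(ToyDHT dht, String table, String joinKey, String tupleId) {
        dht.put(joinKey, table + ":" + tupleId);      // far smaller than the full tuple
    }

    // Step 2: at the rendezvous node, pair up R and S tuple IDs sharing the key.
    static List<String[]> matchIds(ToyDHT dht, String joinKey) {
        List<String[]> pairs = new ArrayList<>();
        for (String r : dht.get(joinKey)) {
            if (!r.startsWith("R:")) continue;
            for (String s : dht.get(joinKey)) {
                if (s.startsWith("S:")) pairs.add(new String[]{r, s});
            }
        }
        return pairs;
    }
}
```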