
A Scalable Information Management Middleware for Large Distributed Systems

Presentation Transcript


  1. A Scalable Information Management Middleware for Large Distributed Systems Praveen Yalagandula HP Labs, Palo Alto Mike Dahlin, The University of Texas at Austin

  2. Trends • Large wide-area networked systems • Enterprise networks • IBM • 170 countries • > 330000 employees • Computational Grids • NCSA Teragrid • 10 partners and growing • 100-1000 nodes per site • Sensor networks • Navy Automated Maintenance Environment • About 300 ships in US Navy • 200,000 sensors in a destroyer [3eti.com]

  9. Research Vision • Wide-area Distributed Operating System • Goals: • Ease building applications • Utilize resources efficiently • [Figure: the OS provides services such as Information Management, Data Management, Security, Monitoring, Scheduling, …]

  10. Information Management • Most large-scale distributed applications monitor, query, and react to changes in the system • Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing, … • A general information management middleware • Eases design and development • Avoids repetition of the same task by different applications • Provides a framework to explore tradeoffs • Optimizes system performance

  11. Contributions – SDIMS Scalable Distributed Information Management System • Meets key requirements • Scalability • Scale with both nodes and information to be managed • Flexibility • Enable applications to control the aggregation • Autonomy • Enable administrators to control flow of information • Robustness • Handle failures gracefully

  12. SDIMS in Brief • Scalability • Hierarchical aggregation • Multiple aggregation trees • Flexibility • Separate mechanism from policy • API for applications to choose a policy • A self-tuning aggregation mechanism • Autonomy • Preserve organizational structure in all aggregation trees • Robustness • Default lazy re-aggregation upon failures • On-demand fast re-aggregation

  13. Outline • SDIMS: a general information management middleware • Aggregation abstraction • SDIMS Design • Scalability with machines and attributes • Flexibility to accommodate various applications • Autonomy to respect administrative structure • Robustness to failures • Experimental results • SDIMS in other projects • Conclusions and future research directions

  15. Attributes • Information at machines • Machine status information • File information • Multicast subscription information • ……

  16. Aggregation Function • Defined for an attribute • Given values for a set of nodes • Computes aggregate value • Examples • Total users logged in the system • Attribute – numUsers • Aggregation function – summation
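As a rough illustration (in Python, not the SDIMS interface itself), the numUsers example above could use a plain summation over the values reported by a virtual node's children:

    # Sketch of an aggregation function for a 'numUsers'-style attribute:
    # given the values reported by a virtual node's children, return their sum.
    def sum_users(child_values):
        return sum(v for v in child_values if v is not None)

    # e.g. three machines report 2, 0, and 5 logged-in users
    print(sum_users([2, 0, 5]))   # -> 7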

  17. Aggregation Trees • Aggregation tree • Physical machines are leaves • Each virtual node represents a logical group of machines • Administrative domains • Groups within domains • Each virtual node is simulated by some machines • Aggregation function f for attribute A • Computes the aggregated value A_i for a level-i subtree • A_0 = value stored locally at the physical node, or NULL • A_i = f(A_{i-1}^0, A_{i-1}^1, …, A_{i-1}^{k-1}) for a virtual node with k children • [Figure: four machines a, b, c, d; the two level-1 virtual nodes hold A_1 = f(a,b) and f(c,d); the root holds A_2 = f(f(a,b), f(c,d))]
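A minimal sketch of the bottom-up computation defined above, assuming a subtree is represented as nested lists with leaf values at the physical machines (the representation and the helper name aggregate are illustrative only):

    # Compute the aggregate A_i of a subtree by applying the aggregation
    # function f to the children's aggregates; a leaf holds A_0, the value
    # stored locally at that physical machine (or None).
    def aggregate(node, f):
        if not isinstance(node, list):                  # leaf = physical machine
            return node                                 # A_0
        return f([aggregate(child, f) for child in node])

    # Root value f(f(a,b), f(c,d)) for leaves a, b, c, d, with f = min
    tree = [[0.3, 0.6], [0.1, 0.7]]
    print(aggregate(tree, min))                         # -> 0.1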

  18. Example Queries • Job scheduling system • Find the least loaded machine • Find a (nearby) machine with load < 0.5 • File location system • Locate a (nearby) machine with file “foo”

  19. Example – Machine Loads • Attribute: "minLoad" • Value at a machine M with load L is (M, L) • Aggregation function: MIN_LOAD(set of tuples) • [Figure: minLoad aggregation tree with leaves (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); level-1 values (A, 0.3) and (C, 0.1); root value (C, 0.1)]

  20. Example – Machine Loads • Query: Tell me the least loaded machine. • Attribute: "minLoad" • Value at a machine M with load L is (M, L) • Aggregation function: MIN_LOAD(set of tuples) • [Figure: same minLoad tree as above; the root value (C, 0.1) answers the query]

  21. Example – Machine Loads • Query: Tell me a (nearby) machine with load < 0.5. • Attribute: "minLoad" • Value at a machine M with load L is (M, L) • Aggregation function: MIN_LOAD(set of tuples) • [Figure: same minLoad tree as above; the query can be answered from the nearest subtree whose aggregate load is below 0.5]
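A possible shape for the MIN_LOAD function over (machine, load) tuples; the exact signature used by SDIMS may differ, so treat this as a sketch:

    # MIN_LOAD sketch: keep the (machine, load) tuple with the smallest load
    # among the values reported by the children.
    def min_load(tuples):
        tuples = [t for t in tuples if t is not None]
        return min(tuples, key=lambda t: t[1]) if tuples else None

    print(min_load([("A", 0.3), ("B", 0.6)]))              # -> ('A', 0.3)
    print(min_load([("A", 0.3), ("C", 0.1), ("D", 0.7)]))  # -> ('C', 0.1)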

  22. Example – File Location • Attribute: "fileFoo" • Value at a machine with id machineId • machineId if file "Foo" exists on the machine • null otherwise • Aggregation function: SELECT_ONE(set of machine ids) • [Figure: fileFoo aggregation tree; machines B and C hold the file, the other leaves are null; each internal node selects one non-null id, and the root value is B]

  23. Example – File Location • Query: Tell me a (nearby) machine with file "Foo". • Attribute: "fileFoo" • Value at a machine with id machineId • machineId if file "Foo" exists on the machine • null otherwise • Aggregation function: SELECT_ONE(set of machine ids) • [Figure: same fileFoo tree as above; the query is answered from the nearest subtree with a non-null aggregate]
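SELECT_ONE only has to return some non-null child value; a hedged sketch (a real implementation might prefer the nearest machine rather than simply the first):

    # SELECT_ONE sketch: return any one non-null machine id from the children.
    def select_one(machine_ids):
        for mid in machine_ids:
            if mid is not None:
                return mid
        return None

    print(select_one([None, "B"]))   # -> 'B'
    print(select_one(["C", None]))   # -> 'C'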

  24. Outline • SDIMS: a general information management middleware • Aggregation abstraction • SDIMS Design • Scalability with machines and attributes • Flexibility to accommodate various applications • Autonomy to respect administrative structure • Robustness to failures • Experimental results • SDIMS in other projects • Conclusions and future research directions

  25. Scalability • To be a basic building block, SDIMS should support • Large number of machines (> 10^4) • Enterprise and global-scale services • Applications with a large number of attributes (> 10^6) • File location system: each file is an attribute → large number of attributes

  26. Scalability Challenge • Single tree for aggregation • Astrolabe, SOMO, Ganglia, etc. • Limited scalability with attributes • Example: file location • [Figure: one aggregation tree carrying every file attribute; leaf nodes hold subsets such as f1, f2 and f4, f5, while the root must aggregate all of f1, f2, …, f7]

  27. Scalability Challenge • Single tree for aggregation • Astrolabe, SOMO, Ganglia, etc. • Limited scalability with attributes • Example: file location • Instead: automatically build multiple trees for aggregation • Aggregate different attributes along different trees • [Figure: same single-tree example as above]

  28. Building Aggregation Trees • Leverage Distributed Hash Tables • A DHT can be viewed as multiple aggregation trees • Distributed Hash Tables (DHT) • Supports hash table interfaces • put (key, value): inserts value for key • get (key): returns values associated with key • Buckets for keys distributed among machines • Several algorithms with different properties • PRR, Pastry, Tapestry, CAN, CHORD, SkipNet, etc. • Load-balancing, robustness, etc.
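A toy, single-process stand-in for the put/get interface mentioned above (a real DHT such as Pastry or Chord partitions the buckets across machines, which this sketch omits):

    # Toy DHT: the hash-table interface, with all buckets held locally.
    class ToyDHT:
        def __init__(self):
            self.buckets = {}

        def put(self, key, value):
            self.buckets.setdefault(key, []).append(value)

        def get(self, key):
            return self.buckets.get(key, [])

    dht = ToyDHT()
    dht.put("fileFoo", "machineB")
    print(dht.get("fileFoo"))   # -> ['machineB']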

  29. DHT - Overview • Machine IDs and keys: long bit vectors • Owner of a key = machine with ID closest to the key • Bit correction for routing • Each machine keeps O(log n) neighbors • [Figure: eight machines with 5-bit IDs (00001, 00110, 01001, 01100, 10010, 10111, 11000, 11101); get(11111) is routed by bit correction to the owner, 11101]
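A rough sketch of the ownership rule, assuming "closest" means longest shared prefix (ties broken arbitrarily); real DHTs refine this with per-hop bit correction and O(log n) routing tables:

    # Return the machine whose ID shares the longest prefix with the key.
    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def owner(key, machine_ids):
        return max(machine_ids, key=lambda m: common_prefix_len(m, key))

    machines = ["00001", "00110", "01001", "01100",
                "10010", "10111", "11000", "11101"]
    print(owner("11111", machines))   # -> '11101'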

  30. DHT Trees as Aggregation Trees • [Figure: the routes from all machines (000 … 111) toward key 11111 form a tree; leaves are the physical machines and the internal virtual nodes are 1xx, 11x, and 111]

  31. DHT Trees as Aggregation Trees • Mapping from virtual nodes to real machines • [Figure: same tree for key 11111; each virtual node (1xx, 11x, 111) is simulated by one of the real machines in its subtree]

  32. DHT Trees as Aggregation Trees • [Figure: two aggregation trees over the same machines, one for key 11111 (virtual nodes 1xx, 11x, 111) and one for key 00010 (virtual nodes 0xx, 00x, 000)]

  33. DHT Trees as Aggregation Trees • Aggregate different attributes along different trees • hash("minLoad") = 00010 → aggregate minLoad along the tree for key 00010 • [Figure: the trees for keys 11111 and 00010, as above]

  34. Scalability • Challenge: • Scale with both machines and attributes • Our approach • Build multiple aggregation trees • Leverage well-studied DHT algorithms • Load-balancing • Self-organizing • Locality • Aggregate different attributes along different trees • Aggregate attribute A along the tree for key = hash(A)
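The attribute-to-tree mapping can be shown in a few lines; the hash function and key length here are arbitrary choices for illustration, not the ones SDIMS uses:

    import hashlib

    # Derive a (here, 5-bit) aggregation-tree key by hashing the attribute
    # name; different attributes map to different trees, so no single tree
    # has to carry every attribute.
    def tree_key(attribute, bits=5):
        digest = hashlib.sha1(attribute.encode()).digest()
        value = int.from_bytes(digest[:2], "big") >> (16 - bits)
        return format(value, "0{}b".format(bits))

    print(tree_key("minLoad"))   # a 5-bit key such as '00010' (actual value depends on the hash)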

  35. Outline • SDIMS: a general information management middleware • Aggregation abstraction • SDIMS Design • Scalability with machines and attributes • Flexibility to accommodate various applications • Autonomy to respect administrative structure • Robustness to failures • Experimental results • SDIMS in other projects • Conclusions and future research directions

  36. Flexibility Challenge • When to aggregate? On reads, or on writes? • Attributes have different read-write ratios • [Figure: attributes such as File Location, Total Mem, and CPU Load sit at different points on a read-write-ratio spectrum running from #reads >> #writes to #writes >> #reads; the best policy shifts accordingly, from aggregating on writes, through partial aggregation on writes, to aggregating on reads; existing systems (Astrolabe, Ganglia, DHT-based systems, Sophia, MDS-2) are shown at fixed points on this spectrum]

  37. Flexibility Challenge • When to aggregate? On reads, or on writes? • Attributes have different read-write ratios • Single framework – separate mechanism from policy → allow applications to choose any policy → provide a self-tuning mechanism • [Figure: same read-write-ratio spectrum as above]

  38. API Exposed to Applications • Install • Update • Probe • Install: an aggregation function for an attribute • Function is propagated to all nodes • Arguments up and down specify an aggregation policy • Update: the value of a particular attribute • Aggregation performed according to the chosen policy • Probe: for an aggregated value at some level • If required, aggregation is done • Two modes: one-shot and continuous
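To make the call sequence concrete, here is a hedged, single-node stand-in for the three calls; the argument names (up, down, level, mode) follow the slide, but the exact signatures are assumptions, and the DHT-wide propagation is omitted:

    # Minimal single-node mock of the Install / Update / Probe interface.
    class MiniSDIMS:
        def __init__(self):
            self.funcs, self.policies, self.values = {}, {}, {}

        def install(self, attribute, function, up="all", down=0):
            self.funcs[attribute] = function        # would propagate to all nodes
            self.policies[attribute] = (up, down)   # aggregation policy knobs

        def update(self, attribute, node, value):
            self.values.setdefault(attribute, {})[node] = value

        def probe(self, attribute, level="root", mode="one-shot"):
            return self.funcs[attribute](list(self.values[attribute].values()))

    def min_load(ts):
        ts = [t for t in ts if t is not None]
        return min(ts, key=lambda t: t[1]) if ts else None

    sdims = MiniSDIMS()
    sdims.install("minLoad", min_load, up="all", down=0)   # Update-Up-style policy
    sdims.update("minLoad", node="A", value=("A", 0.3))
    sdims.update("minLoad", node="C", value=("C", 0.1))
    print(sdims.probe("minLoad"))                          # -> ('C', 0.1)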

  39. Flexibility • Policy settings: • Update-Local: Up=0, Down=0 • Update-Up: Up=all, Down=0 • Update-All: Up=all, Down=all

  43. Self-tuning Aggregation • Some apps can forecast their read-write rates • What about others? • Cannot or do not want to specify • Spatial heterogeneity • Temporal heterogeneity • Shruti: dynamically tunes aggregation • Keeps track of read and write patterns

  44. Shruti – Dynamic Adaptation • [Figure: an aggregation tree with nodes labeled R and A, operating under the Update-Up policy (Up=all, Down=0)]

  45. Shruti – Dynamic Adaptation • A lease-based mechanism • Any updates are forwarded until the lease is relinquished • [Figure: the same tree as above (Update-Up, Up=all, Down=0), now with a lease set up between the nodes labeled R and A]

  46. Shruti – In Brief • On each node • Tracks updates and probes • Both local and from neighbors • Sets and removes leases • Grants a lease to a neighbor A • When it gets k probes from A while no updates happen • Relinquishes the lease from a neighbor A • When it gets m updates from A while no probes happen
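A rough per-neighbor sketch of that bookkeeping; the thresholds k and m and the counter-reset rule are assumptions made only for illustration:

    # Shruti-style lease bookkeeping for one neighbor A, as seen by one node.
    class LeaseBookkeeping:
        def __init__(self, k=3, m=3):
            self.k, self.m = k, m          # k probes to grant, m updates to relinquish
            self.probes_since_update = 0
            self.updates_since_probe = 0

        def on_probe_from_A(self):
            self.probes_since_update += 1
            self.updates_since_probe = 0
            # Grant A a lease after k probes with no intervening updates.
            return self.probes_since_update >= self.k

        def on_update_from_A(self):
            self.updates_since_probe += 1
            self.probes_since_update = 0
            # Give up the lease held from A after m updates with no probes.
            return self.updates_since_probe >= self.m

    b = LeaseBookkeeping(k=2, m=2)
    print(b.on_probe_from_A(), b.on_probe_from_A())   # -> False True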

  47. Flexibility • Challenge • Support applications with different read-write behavior • Our approach • Separate mechanism from policy • Let applications specify an aggregation policy • Up and Down knobs in the Install interface • Provide a lease-based self-tuning aggregation strategy

  48. Outline • SDIMS: a general information management middleware • Aggregation abstraction • SDIMS Design • Scalability with machines and attributes • Flexibility to accommodate various applications • Autonomy to respect administrative structure • Robustness to failures • Experimental results • SDIMS in other projects • Conclusions and future research directions

  49. Administrative Autonomy • Systems spanning multiple administrative domains • Allow a domain administrator to control information flow • Prevent an external observer from observing the domain's information • Prevent external failures from affecting operations inside the domain • Challenge: DHT trees might not conform to the administrative domain boundaries • [Figure: machines A, B, C, D in different domains; a plain DHT tree can route one domain's data through a machine outside that domain]

  50. Administrative Autonomy • Our approach: Autonomous DHTs • Two properties • Path locality • Path convergence • Together these ensure that the virtual nodes aggregating a domain's data are hosted on machines in that domain • [Figure: machines A, B, C, D; with an Autonomous DHT, each domain's aggregation stays within the domain]
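As a loose illustration of the path-locality idea only (not the actual Autonomous DHT algorithm), next-hop selection can prefer in-domain neighbors whenever one of them still makes progress toward the key:

    # Prefer an in-domain neighbor that increases the shared prefix with the
    # key, so routes between machines in a domain tend to stay in the domain.
    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def next_hop(key, current_id, neighbors, domain_of, local_domain):
        progress = common_prefix_len(current_id, key)
        candidates = [n for n in neighbors if common_prefix_len(n, key) > progress]
        in_domain = [n for n in candidates if domain_of[n] == local_domain]
        pool = in_domain or candidates
        return max(pool, key=lambda n: common_prefix_len(n, key)) if pool else None

    domains = {"0100": "A", "0111": "A", "1100": "B"}
    print(next_hop("0110", "0100", ["0111", "1100"], domains, "A"))   # -> '0111'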
