A Scalable Distributed Information Management System (SDIMS)

A Scalable Distributed Information Management System (SDIMS) P. Yalagandula, M. Dahlin cs.utexas.edu SIGCOMM 2004

Outline • Introduction • Goal : Aggregation • Innovation • Flexibility • Scalability • Robustness • Implementation • Evaluation • Conclusions

Introduction • Why SDIMS? • Monitoring, querying, and reacting to changes are core components of applications such as system management, service placement, and data sharing and caching. • An SDIMS in a networked system would provide a distributed operating system backbone and facilitate the development and deployment of new distributed services.

Introduction (cont.) • Fundamental • Hierarchical aggregation • A node accesses detailed views of nearby information and summary views of global information. • A hierarchical system aggregates information through reduction trees.

Introduction (cont.) • An SDIMS should have four properties. • Scalability • Flexibility • Administrative isolation • Robustness

Scalability • SDIMS should accommodate large numbers of nodes. • SDIMS should allow applications to install and monitor large numbers of data attributes.

Flexibility • SDIMS should accommodate a range of applications and attributes. • Read-dominated attributes (rarely change) • Number of CPUs • Write-dominated attributes (change often) • Number of processes • SDIMS should leave the policy decision of tuning replication to applications.

Administrative isolation • Nodes can be arranged in an organizational or administrative hierarchy. • Domain-based control. • Monitor • Query

Robustness • SDIMS should adapt to reconfigurations in a timely fashion when nodes fail or disconnect. • SDIMS should provide mechanisms so that applications can trade off the cost of adaptation against the consistency level of aggregated results when reconfigurations occur.

Related Works • Astrolabe • A single logical aggregation tree that mirrors a system administrative hierarchy. • A general interface for installing new aggregation functions. • An unstructured gossip protocol for disseminating information and replicating all aggregated attribute values for a sub-tree to all nodes in the sub-tree.

Related Works (cont.) • Any node can answer queries by using local information. • Not scalable. (replication) • Not flexible. (type of attribute) • Solution: P2P (DHT)

Tree • For each level in the hierarchy, the agent maintains a record with the list of child zones (and their attributes), and which child zone represents its own zone (self).

Gossip protocol • Periodically, each agent selects some other agent at random and exchanges state information with it. • If the two agents are in the same zone, the state exchanged relates to MIBs in that zone. • If the two agents are in different zones, they exchange state associated with the MIBs of their least common ancestor zone.
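The zone-selection rule in the gossip slide can be sketched as follows. This is a minimal illustration, not Astrolabe's implementation: the slash-separated agent paths and the function names `least_common_ancestor_zone` and `gossip_scope` are assumptions made for the example.

```python
def least_common_ancestor_zone(zone_a: str, zone_b: str) -> str:
    """Return the deepest zone that contains both zones (hypothetical path format)."""
    parts_a = [p for p in zone_a.split("/") if p]
    parts_b = [p for p in zone_b.split("/") if p]
    common = []
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common.append(a)
    return "/" + "/".join(common)


def gossip_scope(agent_a: str, agent_b: str) -> str:
    """Zone whose MIBs two agents exchange.

    An agent path is assumed to be its zone path followed by the agent name,
    e.g. "/univ/cs/host1" lives in zone "/univ/cs".
    """
    zone_a = agent_a.rsplit("/", 1)[0] or "/"
    zone_b = agent_b.rsplit("/", 1)[0] or "/"
    # Same zone: the common ancestor is the zone itself.
    # Different zones: the common ancestor is the least common ancestor zone.
    return least_common_ancestor_zone(zone_a, zone_b)
```

With these assumptions, two hosts under "/univ/cs" gossip about "/univ/cs", while a host in "/univ/cs" and one in "/univ/ee" gossip about "/univ".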

Related Works (cont.) • DHT • SkipNet, CAN, Pastry, Chord, Tapestry

Problem • How to scalably map different attributes to different aggregation trees in a DHT mesh? {physical network vs. overlay network} • How to provide flexibility in the aggregation to accommodate different application requirements? {flexible API for installing and controlling the system}

Problem (cont.) • How to adapt a DHT mesh to attain the administrative isolation property? {virtual organization} • How to provide robustness without unstructured gossip and total replication? {cache; pre-computing or on-demand re-aggregation}

Aggregation Abstraction

Aggregation Abstraction • Each physical node in the system is a leaf in the tree. • An internal non-leaf node, which we call a virtual node, is simulated by one or more physical nodes at the leaves of the sub-tree for which the virtual node is the root.

Aggregation Abstraction (cont.) • Each physical node has local data stored as a set of (attributeType, attributeName, value) tuples. • The system associates an aggregation function $f_{type}$ with each attribute type.

Aggregation Abstraction (cont.) • Each level-i sub-tree $T_i$ in the system has an aggregate value $V_{i,type,name}$ for each (attributeType, attributeName) pair. • The aggregate value for a level-i sub-tree $T_i$ is the aggregation function for the type, $f_{type}$, computed across the aggregate values of each of $T_i$'s k children: $V_{i,type,name} = f_{type}(V^{0}_{i-1,type,name}, V^{1}_{i-1,type,name}, \ldots, V^{k-1}_{i-1,type,name})$
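The recursive definition of the aggregate value can be illustrated with a small sketch. The `Node` class, the `aggregate` function, and the example attribute ("count", "numCPUs") are hypothetical names chosen for this illustration; a real SDIMS node would compute these values in a distributed fashion rather than by walking an in-memory tree.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

AttrKey = Tuple[str, str]  # (attributeType, attributeName)


@dataclass
class Node:
    # Leaves (physical nodes) hold local data; internal nodes are virtual.
    local: Dict[AttrKey, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node, key: AttrKey,
              f_type: Callable[[List[float]], float]) -> Optional[float]:
    """Aggregate value of the sub-tree rooted at `node` for one attribute key."""
    if not node.children:
        # Level 0: the aggregate is the leaf's own local value, if any.
        return node.local.get(key)
    # Level i: apply f_type across the children's level-(i-1) aggregates.
    child_values = [aggregate(child, key, f_type) for child in node.children]
    present = [v for v in child_values if v is not None]
    return f_type(present) if present else None


# Four physical nodes grouped under two level-1 virtual nodes and one root.
key = ("count", "numCPUs")
leaves = [Node(local={key: n}) for n in (2, 4, 8, 2)]
root = Node(children=[Node(children=leaves[:2]), Node(children=leaves[2:])])
total = aggregate(root, key, sum)  # total CPUs across the whole tree
```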

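Aggregation Abstraction (cont.) • Example of $f_{type}$ • $Avg(V_1, \ldots, V_n) = \frac{1}{n}\sum_{i=1}^{n} V_i$: incorrect (does not satisfy the property) • $SUM(V_1, \ldots, V_n) = \sum_{i=1}^{n} V_i$: correct • Aggregation functions must satisfy the hierarchical computation property.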
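A short numeric check shows why a plain average fails the hierarchical computation property while a sum does not. The helper names below are illustrative only; carrying (sum, count) pairs is one common way to recover an average from aggregates that do compose, and is shown here as an assumption rather than as the slide's own method.

```python
from typing import List, Tuple


def avg(values: List[float]) -> float:
    return sum(values) / len(values)


def merge_sum_count(pairs: List[Tuple[float, int]]) -> Tuple[float, int]:
    """Combine (sum, count) pairs; this operation composes across tree levels."""
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))


group_a = [1.0, 2.0, 3.0]
group_b = [10.0]

# Averaging the averages of unequal groups gives the wrong answer.
flat_avg = avg(group_a + group_b)                # 16 / 4
nested_avg = avg([avg(group_a), avg(group_b)])   # (2 + 10) / 2

# Summing the sums gives the same answer as summing everything at once.
flat_sum = sum(group_a + group_b)
nested_sum = sum([sum(group_a), sum(group_b)])

# (sum, count) pairs compose, and the average is derived at the end.
total, count = merge_sum_count([(sum(group_a), len(group_a)),
                                (sum(group_b), len(group_b))])
avg_from_pairs = total / count
```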

Aggregation Abstraction (cont.) • [Figure: aggregation tree; labels: node, virtual node]

Innovation • Flexibility • Scalability • Administrative isolation • Robustness

Flexibility • Operation API • Install • Update • Probe

Install Operation • The Install operation installs an aggregation function in the system.

Probe Operation • Used to force a reconfiguration and update all caches.

Probe Operation (cont.) • When node A issues a continuous probe at level l for an attribute, updates for the attribute at any node in the subtree of A’s level-l ancestor are aggregated up to level l and propagated down along the path from the ancestor to A.

Update and Probe Operation

Update and Probe Operation (cont.)

Update Operation API • Update-Up-k-Down-j: an update is aggregated up to the kth level, and the aggregate value of a node at level l (l ≤ k) is propagated downward for j levels.
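One way to read the up-k/down-j parameters is as a per-level propagation plan along the path from a leaf to the root. The sketch below is an interpretation of the slide under stated assumptions, not the SDIMS API: `ALL`, `propagation_plan`, and `root_level` are hypothetical names, and the leaf is taken to be level 0.

```python
from typing import Dict, Optional

ALL = None  # hypothetical sentinel meaning "all levels"


def propagation_plan(root_level: int, up: Optional[int],
                     down: Optional[int]) -> Dict[int, int]:
    """Map each level where an aggregate is recomputed to the lowest level
    that receives that aggregate on the way back down."""
    top = root_level if up is ALL else min(up, root_level)
    plan = {}
    for level in range(top + 1):
        # A node at `level` pushes its aggregate down at most `down` levels,
        # and never below the leaves at level 0.
        depth = level if down is ALL else min(down, level)
        plan[level] = level - depth
    return plan
```

For a tree whose root is at level 4, `propagation_plan(4, 3, 1)` aggregates up to level 3 and pushes each aggregate one level down, while `propagation_plan(4, ALL, 0)` aggregates all the way to the root and pushes nothing down.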

Operation API • [Figure: Update-Up-k-Down-j on a tree with Level-0 through Level-4; labels: k, l, j]

Dynamic Adaptation • A SDIMS implementation can dynamically adjust its up/down strategies for an attribute based on its measured read/write frequency.

Scalability • SDIMS defines the aggregation abstraction to mesh with its underlying scalable DHT system. • SDIMS refines the basic DHT abstraction to form an Autonomous DHT (ADHT) to achieve the administrative isolation properties

Mapping to DHT

Mapping to DHT • Aggregating an attribute along an aggregation tree corresponds to using DHTtree_k, where k = hash(attribute type, attribute name). • Different attributes will be aggregated along different trees.
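The mapping from an attribute to its tree key can be sketched as a hash over the (attribute type, attribute name) pair. The choice of SHA-1, the separator byte, and the function name `attribute_key` are assumptions for illustration; the slides do not say which hash function SDIMS uses.

```python
import hashlib


def attribute_key(attr_type: str, attr_name: str, bits: int = 32) -> int:
    """Hash (attribute type, attribute name) to a `bits`-wide DHT key.

    SHA-1 yields 160 bits, so `bits` must be at most 160. The NUL separator
    keeps ("ab", "c") and ("a", "bc") from hashing to the same input.
    """
    data = f"{attr_type}\x00{attr_name}".encode("utf-8")
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)


# Each attribute gets its own key k, and therefore its own DHTtree_k.
cpu_key = attribute_key("count", "numCPUs")
proc_key = attribute_key("count", "numProcesses")
```

Because the key depends on both parts of the pair, the aggregation load for different attributes is spread over different trees and different root nodes.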

Administrative isolation • For security • Updates and Probes are not accessible outside the domain • For availability • Queries for values in a domain are not affected by failures of nodes in other domains • For efficiency • Domain-scoped queries can be simple and efficient.

Administrative isolation • Autonomous DHT • Path Locality: Search paths should always be contained in the smallest possible domain. • Path Convergence: Search paths for a key from different nodes in a domain should converge at a node in that domain.

Administrative isolation • [Figure: plain DHT; labels: Domain univ., Domain dept., L0: host, L2: univ.; the isolation property is violated] • [Figure: Autonomous DHT; labels: Domain dept., Domain univ., L0: host, L2: dept.]

Robustness • ADHT • Distributed Computing (?) • Aggregation Management Layer (AML) • Lazy re-aggregation • On-demand Re-aggregation • Replication in Space

Two-layer architecture: ADHT and AML • The ADHT layer informs the AML layer about reconfigurations in the network. • NewParent • FailedChild • NewChild

Implementation DifferentOverlay(?)

MIB • Child MIBs containing raw aggregate values gathered from children. • Reduction MIB containing locally aggregated values across this raw information • Ancestor MIB containing aggregate values scattered down from ancestors.
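The three MIB kinds can be pictured as per-attribute state held by a node. The class below is a minimal sketch under assumptions: the name `AttributeMIBs`, its fields, and its methods are hypothetical, and it tracks a single attribute key in one process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AttributeMIBs:
    aggregate_fn: Callable[[List[float]], float]
    # Child MIBs: raw aggregate values gathered from each child.
    child_mib: Dict[str, float] = field(default_factory=dict)
    # Reduction MIB: the locally aggregated value across the child MIBs.
    reduction_mib: Optional[float] = None
    # Ancestor MIB: aggregate values scattered down from ancestors, by level.
    ancestor_mib: Dict[int, float] = field(default_factory=dict)

    def on_child_update(self, child_id: str, value: float) -> float:
        """Store a child's value and recompute the local reduction."""
        self.child_mib[child_id] = value
        self.reduction_mib = self.aggregate_fn(list(self.child_mib.values()))
        return self.reduction_mib

    def on_ancestor_update(self, level: int, value: float) -> None:
        """Cache an aggregate pushed down from an ancestor at `level`."""
        self.ancestor_mib[level] = value


mibs = AttributeMIBs(aggregate_fn=sum)
mibs.on_child_update("child-1", 2)
mibs.on_child_update("child-2", 3)
```

A probe for a higher level could then be answered from `ancestor_mib` when a cached value is present, which is the kind of read/write trade-off the up/down parameters control.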

Implementation • [Figure: labels: parent, child]

Implementation (cont.) • Attribute key: used for retrieving data by the aggregation function. • (attribute type, attribute name)

Implementation (cont.) • A node acts • as a leaf for all attribute keys. • as a level-1 subtree root for keys whose hash matches the node’s ID in b prefix bits. • as a level-i subtree root for keys whose hash matches the node’s ID in the initial i * b bits. • as the system’s global root for attribute keys whose hash matches the node’s ID in more prefix bits than any other node.
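The prefix-matching rule for node roles can be sketched with integer IDs. The function names and the fixed `id_bits` width are assumptions for this example; deciding the global root additionally requires comparing against every other node's ID, which this sketch does not do.

```python
def common_prefix_bits(a: int, b: int, id_bits: int) -> int:
    """Number of leading bits shared by two `id_bits`-wide identifiers."""
    diff = a ^ b
    if diff == 0:
        return id_bits
    # The highest set bit of the XOR marks the first position that differs.
    return id_bits - diff.bit_length()


def highest_root_level(node_id: int, key_hash: int, b: int, id_bits: int) -> int:
    """Highest level i at which the node is a subtree root for this key.

    Level 0 means the node acts only as a leaf; level i requires a match
    in the initial i * b bits.
    """
    return common_prefix_bits(node_id, key_hash, id_bits) // b


# 8-bit IDs with b = 2: a 4-bit shared prefix makes the node a level-2 root.
level = highest_root_level(0b10110000, 0b10111111, b=2, id_bits=8)
```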

Evaluation • Update only the node's own MIB • Update the MIBs of all nodes • Up-All, Down-0 • Monitored attribute changes rarely • Monitored attribute changes often

Evaluation (cont.) • The session size (domain size) is set to 8 and the branching factor is set to 16. • [Figure: axes: message size, nodes]

Evaluation (cont.) Bf: Branch Factor Average path length to root

Evaluation (cont.) Bf: Branch Factor
