Data Grid Technologies Sathish Vadhiyar Sources/Credits: Technical papers listed in references
Problem Motivation • Replication to deal with faults and provide scheduling flexibility. • Given a file that is partitioned into blocks that are replicated throughout a wide-area file system, how can a client retrieve the file with the best performance? • Various algorithms
Basic Downloading Algorithm • The client opens a thread to each server containing the file • A block size is chosen • Each thread selects a different block to download and all threads start downloading • When a thread finishes a block, it chooses a new block that is not currently being downloaded by any other thread • Adaptive: servers with higher bandwidth to the client end up serving more blocks • Selecting the block size is tricky
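The adaptive behaviour of the basic algorithm can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_basic_download` and the per-server `seconds_per_block` figures are hypothetical, each server has a fixed per-block time, and network effects are ignored.

```python
import heapq

def simulate_basic_download(num_blocks, seconds_per_block):
    """Simulate one download thread per server (hypothetical helper).

    seconds_per_block maps a server name to an assumed fixed time per block.
    An idle thread always takes the next block nobody else is downloading.
    Returns {server: number_of_blocks_served}.
    """
    next_block = 0
    served = {server: 0 for server in seconds_per_block}
    events = []  # min-heap of (finish_time, server)

    # All threads start at time 0, each on a different block.
    for server in sorted(seconds_per_block):
        if next_block < num_blocks:
            heapq.heappush(events, (seconds_per_block[server], server))
            next_block += 1

    # Whenever a thread finishes, it picks up the next unclaimed block.
    while events:
        now, server = heapq.heappop(events)
        served[server] += 1
        if next_block < num_blocks:
            heapq.heappush(events, (now + seconds_per_block[server], server))
            next_block += 1
    return served
```

With one server three times faster than the other, the faster one ends up serving most of the blocks without any explicit bandwidth estimate.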
Aggressive Redundancy • To provide fault tolerance and to improve download time • A redundancy factor, R • The client downloads each block simultaneously from R servers • Only one copy is kept: whichever returns first
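A minimal sketch of aggressive redundancy for a single block, assuming a thread pool stands in for the per-server download threads. `fetch_block` is a hypothetical placeholder that sleeps instead of using the network, and `server_delays` is made-up input.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_block(server, delay, block_id):
    """Hypothetical stand-in for a network fetch: sleep, then return data."""
    time.sleep(delay)
    return server, f"block-{block_id}".encode()

def download_redundantly(block_id, server_delays, r):
    """Request the same block from the first r servers; keep the first reply."""
    servers = list(server_delays.items())[:r]
    pool = ThreadPoolExecutor(max_workers=r)
    futures = [pool.submit(fetch_block, s, d, block_id) for s, d in servers]
    # as_completed yields futures in completion order, so the first one wins.
    winner = next(as_completed(futures))
    # Do not block on the slower copies; their results are simply ignored.
    pool.shutdown(wait=False)
    return winner.result()
```

The slower requests still run to completion in the background here; a real client would presumably cancel or abandon those connections.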
Progress-Driven Redundancy • Retry a download only when it is progressing slowly • Progress number: P; redundancy factor: R • Each block is assigned a download number, initialized to 0 • When a thread attempts to download a block, it increments the block’s download number
Progress-Driven Redundancy (Continued) • For selecting a new block to download • If there is a block B whose download number < R, and if there are P blocks after B whose downloads have completed, then select B • Else select next block whose download number is zero
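The selection rule can be sketched as a small function. The assumptions here are mine, not the slides': "after B" is read as "blocks with a higher index than B", B itself must still be incomplete, and the name `select_block` and its list-based arguments are hypothetical.

```python
def select_block(download_number, completed, p, r):
    """Pick the next block for an idle thread (illustrative sketch).

    download_number[i]: how many download attempts block i has had.
    completed[i]: True once block i has been fully downloaded.
    Returns a block index, or None if nothing is left to start.
    """
    n = len(download_number)

    # Rule 1: retry a lagging block B if it has fewer than r attempts
    # and at least p later blocks have already completed.
    for b in range(n):
        if completed[b] or download_number[b] >= r:
            continue
        if sum(completed[b + 1:]) >= p:
            return b

    # Rule 2: otherwise take the next block that nobody has attempted yet.
    for b in range(n):
        if download_number[b] == 0:
            return b
    return None
```

The caller is expected to increment `download_number[b]` for the returned block before starting the download.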
Fastest1 • Another approach • For downloading a block, choose a server that has minimum value of time*(l+1) • time – predicted time to download a block when there is no contention. Obtained from NWS numbers before download is initiated. • l – number of threads currently downloading from the server
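The Fastest1 choice reduces to a one-line minimization. In this sketch, `predicted_time` plays the role of the NWS forecasts and `active_threads` the per-server thread counts; both dictionaries and the function name `pick_server` are hypothetical.

```python
def pick_server(predicted_time, active_threads):
    """Return the server minimizing time * (l + 1).

    predicted_time[s]: forecast seconds to fetch one block from s, uncontended.
    active_threads[s]: number of threads currently downloading from s (l).
    """
    return min(
        predicted_time,
        key=lambda s: predicted_time[s] * (active_threads.get(s, 0) + 1),
    )
```

For example, a server forecast at 2 s that already has one active thread scores 4, so a 3 s server with no active threads is preferred.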
Multiple clients • This situation arises when parallel data for a computation on parallel clients has to be selected from the available replica server locations • More challenges: a download decision by one client can affect download performance on other clients, so this impact needs to be predicted • Periodic network monitoring has to be augmented by measurements corresponding to current downloads
Collective Download algorithm • Each client connects to a server only once, even if some of the data it fetches belongs to other clients (download phase) • The clients then redistribute the data among themselves (redistribution phase) • Widely followed in parallel I/O • Especially useful when clients and servers are on opposite sides of a WAN: multiple WAN latencies are avoided at the cost of a less expensive redistribution phase
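The two phases can be sketched as a planning step. The slides do not say how servers are matched to clients, so this sketch simply assumes each server is handled by one client in round-robin order; `plan_collective_download`, `needs`, and `block_server` are hypothetical names.

```python
def plan_collective_download(needs, block_server):
    """Plan a download phase and a redistribution phase (illustrative only).

    needs: {client: set of block ids the client wants}.
    block_server: {block id: server holding that block}.
    Returns (download, transfers) where download[client][server] lists the
    blocks fetched in one connection and transfers holds
    (from_client, to_client, block) tuples for the redistribution phase.
    """
    clients = sorted(needs)
    servers = sorted(set(block_server.values()))
    # Assumption: server i is contacted only by client i mod len(clients).
    handler = {srv: clients[i % len(clients)] for i, srv in enumerate(servers)}

    download = {client: {} for client in clients}
    transfers = []
    for client in clients:
        for block in sorted(needs[client]):
            server = block_server[block]
            fetcher = handler[server]
            blocks = download[fetcher].setdefault(server, [])
            if block not in blocks:
                blocks.append(block)
            if fetcher != client:
                # The block was fetched on behalf of another client.
                transfers.append((fetcher, client, block))
    return download, transfers
```

Each server is contacted once over the WAN, and the cross-client traffic in `transfers` stays on the (presumably cheaper) client-side network.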
Replica Placement Strategies • Replica placement questions • When should replicas be created? • Which files should be replicated? • Where should replicas be placed? • The model assumes that data is produced in tier-1 (root) and there are storage spaces at various tiers (levels of hierarchy) • Clients that request data form the leaves of the hierarchy
Placement strategies • Best client • Each storage node maintains a history of the number of requests for each file it contains • If the number of requests for a file exceeds a threshold, the node creates a replica of the file at the client node that generated the most requests for that file (the best client) • The request history for the file is then cleared
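A minimal sketch of the best-client bookkeeping on one storage node. The class name `StorageNode`, the per-file counting, and the choice to return the chosen client instead of actually copying a file are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class StorageNode:
    """Tracks per-file request history and picks the best client."""

    def __init__(self, threshold):
        self.threshold = threshold
        # history[filename][client] = number of requests seen so far
        self.history = defaultdict(Counter)

    def record_request(self, filename, client):
        """Record one request; return the best client if a replica is due."""
        counts = self.history[filename]
        counts[client] += 1
        if sum(counts.values()) > self.threshold:
            best_client = counts.most_common(1)[0][0]
            # Clear the request details for this file after replicating.
            del self.history[filename]
            return best_client
        return None
```

A caller would create the replica at the returned client whenever `record_request` returns something other than `None`.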
Strategies … • Cascading replication • Analogy to a 3-tiered fountain • Once a threshold for a file is exceeded at the root, a replica is created at the next level on the path to the best client, and so on down the hierarchy • Geographical locality is exploited • Plain caching: done at the client • Caching plus Cascading Replication
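For cascading replication, the key step is finding the node one level below the current holder on the path to the best client. The sketch below assumes the hierarchy is given as a hypothetical `parent` map (child to parent, root mapped to `None`); the function name is also an assumption.

```python
def next_node_toward(holder, client, parent):
    """Return the child of `holder` on the path from `holder` to `client`.

    parent: {node: parent node}, with the root mapped to None.
    Raises ValueError if `client` is not in the subtree below `holder`.
    """
    node = client
    # Walk up from the client until we reach a direct child of the holder.
    while parent.get(node) is not None:
        if parent[node] == holder:
            return node
        node = parent[node]
    raise ValueError("client is not below holder")
```

Applying the function repeatedly, each time the threshold is exceeded at the current holder, moves the replica one tier closer to the best client.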
Strategies… • Fast Spread • A replica of the file is stored at each node along its path to the client • Replica selection – closest replica • Replica replacement – least popular file with oldest age is replaced. Popularity logs are cleared periodically
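Fast spread and its replacement policy can be sketched together. The slides do not define how popularity and age combine, so this sketch assumes the victim is the file with the lowest popularity, with ties broken by the oldest storage time; all names (`store_replica`, `fast_spread`, `stores`, `popularity`) are hypothetical.

```python
def store_replica(store, capacity, filename, now, popularity):
    """Store a replica on one node, evicting a file if the node is full.

    store: {filename: time the replica was stored} for this node.
    Returns the evicted filename, or None if nothing was evicted.
    """
    if filename in store:
        return None
    evicted = None
    if len(store) >= capacity:
        # Assumed policy: least popular first, oldest replica breaks ties.
        evicted = min(store, key=lambda f: (popularity.get(f, 0), store[f]))
        del store[evicted]
    store[filename] = now
    return evicted

def fast_spread(path, stores, capacity, filename, now, popularity):
    """Leave a replica at every node on the path from the source to the client."""
    for node in path:
        store_replica(stores[node], capacity, filename, now, popularity)
```

Because every node on the path keeps a copy, storage fills quickly, which is why a replacement policy is needed at all.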
Findings • Best-client performs worst for random access patterns and improves for access patterns with some geographical locality • Fast spread works much better than cascading for random data access • Bandwidth savings are greater with fast spread than with cascading • Fast spread has high storage requirements
Sources / References / Credits • Algorithms for High Performance, Wide-Area Distributed File Downloads. J.S. Plank, S. Atchley, Y. Ding and M. Beck. Parallel Processing Letters, vol. 13, no. 2, pp. 207-224, June 2003. • Downloading Replicated Wide-Area Files – a Framework and Empirical Evaluation. R.L. Collins and J.S. Plank. NCA 2004. • Identifying Dynamic Replication Strategies for a High-Performance Data Grid. K. Ranganathan and I. Foster. Grid 2002.
Sources / References / Credits • Grid-Based Galaxy Morphology Analysis for the National Virtual Observatory. Ewa Deelman, Raymond Plante, Carl Kesselman, Gurmeet Singh, Mei-Hui Su, Gretchen Greene, Robert Hanisch, Niall Gaffney, Antonio Volpicelli, James Annis, Vijay Sekhri, Fermi Tamas Budavari, Maria Nieto-Santisteban, William O'Mullane, David Bohlender, Tom McGlynn, Arnold Rots, Olga Pevunova. Supercomputing 2003. • Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey. James Annis, Yong Zhao, Jens Voeckler, Michael Wilde, Steve Kent, Ian Foster. SC 2002.
Sources / References / Credits • Kavitha Ranganathan and Ian Foster, Decoupling Computation and Data Scheduling in Distributed Data Intensive Applications, Proceedings of the 11th International Symposium for High Performance Distributed Computing (HPDC-11), Edinburgh, July 2002.