A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments. Athanasia Asiki, Katerina Doka, Ioannis Konstantinou, Antonis Zissimos and Nectarios Koziris National Technical University of Athens School of Electrical and Computer Engineering
A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments Athanasia Asiki, Katerina Doka, Ioannis Konstantinou, Antonis Zissimos and Nectarios Koziris National Technical University of Athens School of Electrical and Computer Engineering Computing Systems Laboratory e-mail: {nasia, katerina, ikons, azisi, nkoziris}@cslab.ece.ntua.gr
Abstract • “A service-oriented architecture of a generic middleware platform, which provides the required services for efficient content storage, search and retrieval in a distributed environment.” • Algorithms from Peer-to-Peer computing are introduced in a grid environment to ensure scalability, fault tolerance and data availability despite node arrivals and departures • The system consists of heterogeneous resources belonging to different Virtual Organizations
Outline • Introduction • Challenges and requirements • Overall architecture • A multidimensional indexing scheme • The Distributed Replica Location Service • GridTorrent protocol
A brief introduction • Grid computing • Remotely located, disjoint and diverse processing and data storage facilities are integrated under a common software architecture (middleware) • Resources are connected to a shared network and provide the necessary software-level services to be remotely used and administered • Heterogeneity of resources • Rules and policies define the sharing of resources • P2P computing • Oriented towards the sharing of large amounts of data • Files are stored in a dynamic set of peers, which may join or leave the network • Absence of centralized structures
Our approach • Main idea: exploitation of Peer-to-Peer techniques to build a service-oriented infrastructure for data management and search • Features of the proposed system: • A powerful metadata search mechanism supporting both point and range queries • A data transfer mechanism for efficient storage and retrieval of data in distributed and heterogeneous resources • A distributed Replica Location Service to keep track of file locations • Motivation: the design of the proposed architecture has been largely motivated by the requirements posed by the Gredia research project (http://www.gredia.eu/?Page=home)
Challenges and requirements • The data needs to be partitioned among the nodes according to a strategy ensuring: • load balance • effective query processing • Absence of centralized structures to provide an overall view of the system • Efficient query routing is required so that: • A small number of nodes process the query • The number of exchanged messages remains relatively small • Data locality shall be preserved, namely related data should be kept on the same node if feasible • New advanced features should not affect the maintenance cost of the overlay • The large size of data along with limitations in storage capacity and network bandwidth should be considered and a more flexible structure should be adopted • Metadata and data will be stored in different overlays
Architecture (1) • Three different overlays will be implemented: • Metadata overlay • DRLS overlay • Storage overlay • The Metadata overlay and the DRLS overlay are implemented with the required extensions to the Kademlia DHT • The Storage overlay comprises a distributed repository • Kademlia DHT • PING, STORE, FIND_NODE and FIND_VALUE RPCs • A hash function is used to assign keys to values stored in the DHT • Distance among points in the key space is defined by an XOR metric
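The XOR metric on the slide above can be illustrated with a minimal sketch. The 160-bit key size and SHA-1 follow the usual Kademlia convention and are assumptions here, and the names `key_for`, `xor_distance` and `closest_nodes` are hypothetical helpers, not part of the presented system.

```python
import hashlib

ID_BITS = 160  # assumption: SHA-1 sized identifiers, as in common Kademlia deployments


def key_for(value: bytes) -> int:
    """Hash a value to an integer key in [0, 2**ID_BITS)."""
    return int.from_bytes(hashlib.sha1(value).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: the bitwise XOR of two identifiers, read as an integer."""
    return a ^ b


def closest_nodes(key: int, node_ids: list[int], k: int = 3) -> list[int]:
    """Return the k node identifiers closest to the key under the XOR metric."""
    return sorted(node_ids, key=lambda node_id: xor_distance(node_id, key))[:k]
```

A value would be stored on the nodes returned by `closest_nodes(key_for(value), ...)`; a lookup for the same key converges on the same nodes because the distance is symmetric.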
Architecture (2) • A “data file” is described by a predefined set of attributes included in its metadata file • Upload a file • The file is assigned a unique identifier • The unique identifier is included in the “metadata file” • The “data file” is uploaded to the Storage Overlay by the GridTorrent mechanism • The “metadata file” is inserted in the Metadata Overlay • The physical location(s) where the file is stored is (are) inserted in the DRLS overlay
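The upload steps above can be sketched with three in-memory dictionaries standing in for the three overlays. Everything here is illustrative: the class `InMemoryGrid`, the use of a random UUID as the unique identifier, and the `gtp://site/file_id` path layout are assumptions, and the plain dictionary write replaces the actual GridTorrent transfer.

```python
import uuid


class InMemoryGrid:
    """Toy stand-in for the Metadata, DRLS and Storage overlays (not distributed)."""

    def __init__(self):
        self.storage = {}   # Storage overlay: (site, file_id) -> file contents
        self.metadata = {}  # Metadata overlay: file_id -> metadata attributes
        self.drls = {}      # DRLS overlay: file_id -> list of physical locations

    def upload(self, data: bytes, attributes: dict, site: str) -> str:
        file_id = uuid.uuid4().hex                 # 1. assign a unique identifier
        meta = dict(attributes, file_id=file_id)   # 2. record it in the metadata file
        self.storage[(site, file_id)] = data       # 3. store the data file (GridTorrent stand-in)
        self.metadata[file_id] = meta              # 4. insert the metadata file
        location = f"gtp://{site}/{file_id}"       # hypothetical path layout
        self.drls.setdefault(file_id, []).append(location)  # 5. register the location
        return file_id
```

Keeping the three maps separate mirrors the design choice on the slide: the small metadata record and the location list can be indexed and replicated independently of the bulky data file.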
Architecture (3) • Search procedure • The search mechanism is applied in the Metadata overlay • A user can search the metadata files according to his / her criteria and select the data file(s) of interest • The physical location(s) of the “data file” replicas are returned by the DRLS overlay • The GridTorrent protocol downloads a file exploiting sharing properties in order to boost aggregate performance
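A minimal sketch of the search procedure, again with dictionaries in place of the overlays. The sample records, attribute names and the functions `search` and `locate` are made up for illustration; the final download step is left out.

```python
# Hypothetical contents of the Metadata overlay: file identifier -> attributes.
metadata_overlay = {
    "f1": {"author": "alice", "year": 2007},
    "f2": {"author": "bob", "year": 2008},
}

# Hypothetical contents of the DRLS overlay: file identifier -> replica locations.
drls_overlay = {
    "f1": ["gtp://site-a.example.org/data/f1"],
    "f2": ["gtp://site-a.example.org/data/f2", "gtp://site-b.example.org/data/f2"],
}


def search(criteria):
    """Return identifiers whose metadata satisfies every (attribute -> predicate) pair."""
    return [
        file_id
        for file_id, meta in metadata_overlay.items()
        if all(attr in meta and pred(meta[attr]) for attr, pred in criteria.items())
    ]


def locate(file_id):
    """Return the known physical locations of a file's replicas."""
    return list(drls_overlay.get(file_id, []))
```

For example, `search({"year": lambda y: y >= 2008})` selects the matching identifiers, and `locate` then yields the replica locations a download client would contact.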
Space Filling Curves • Continuously map a compact interval onto a d-dimensional space • Partitioning of the d-dimensional space into 2^(kd) cells (2^k intervals per dimension), which in turn are mapped through the Space Filling Curve to 2^(kd) points of a single dimension • Recursive nature (generation of the curve) • Preservation of locality: points that are close in the 1-dimensional space are mapped to points that are close together in the d-dimensional space
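The slide does not name a specific curve. As a simple stand-in, the sketch below uses the Z-order (Morton) curve, which maps each of the 2^(kd) cells to a distinct 1-dimensional index by interleaving coordinate bits. The function name `z_order_index` is hypothetical, and Z-order preserves locality less well than, for instance, a Hilbert curve.

```python
def z_order_index(cell, k):
    """Map a d-dimensional cell to a 1-D index in [0, 2**(k*d)).

    `cell` holds one integer coordinate per dimension, each in [0, 2**k).
    Bit `b` of dimension `dim` lands at position b*d + dim of the result.
    """
    d = len(cell)
    for coord in cell:
        if not 0 <= coord < (1 << k):
            raise ValueError("coordinate outside the 2**k grid")
    index = 0
    for bit in range(k):
        for dim, coord in enumerate(cell):
            index |= ((coord >> bit) & 1) << (bit * d + dim)
    return index
```

With d = 2 and k = 2 the sixteen cells receive the indices 0 to 15 exactly once, so the mapping is a bijection between cells and points on the line.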
Multidimensional indexing • The set of d attributes chosen to be indexed form a d-dimensional space • Each combination of attribute values is mapped to a point of the d-dimensional space • The key of a metadata file is produced by the Space Filling curve • Query processing • Clusters of the Space Filling curve that answer the query are defined • Lookup for the specified clusters • Load balance • Virtual servers: a virtual server is an entity that owns an interval of the identifier space • Each physical node contains multiple virtual servers • When a physical node is overloaded with respect to available storage space or bandwidth, it may move one or more of its virtual servers to another, underloaded physical node
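Query processing as described above can be sketched as follows: a range query selects a box of cells, each cell is mapped to its curve index, and consecutive indices are merged into clusters that are then looked up. The sketch assumes a Z-order curve (the slides do not name the curve) and enumerates cells by brute force, which is only practical for small grids; `z_order_index` and `query_clusters` are hypothetical names.

```python
from itertools import product


def z_order_index(cell, k):
    """Interleave the k-bit coordinates of a cell into one curve index."""
    d = len(cell)
    index = 0
    for bit in range(k):
        for dim, coord in enumerate(cell):
            index |= ((coord >> bit) & 1) << (bit * d + dim)
    return index


def query_clusters(ranges, k):
    """Turn a range query into clusters of consecutive curve indices.

    `ranges` holds one inclusive (low, high) pair of cell coordinates per
    dimension. Returns a sorted list of inclusive (first, last) index pairs.
    """
    cells = product(*(range(low, high + 1) for low, high in ranges))
    indices = sorted(z_order_index(cell, k) for cell in cells)
    clusters = []
    for index in indices:
        if clusters and index == clusters[-1][1] + 1:
            clusters[-1][1] = index          # extend the current cluster
        else:
            clusters.append([index, index])  # start a new cluster
    return [tuple(cluster) for cluster in clusters]
```

A point query is the special case where every range has low equal to high, which produces a single one-element cluster. Fewer clusters mean fewer lookups, which is why the locality of the chosen curve matters.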
DRLS (1) • A Distributed Replica Location Service • A DHT whose key-value pairs hold the mappings from the unique identifier of a file to its Physical File Names (PFNs) • Problems • DHTs are designed to store read-only files and cannot handle mutable data • The replication strategy results in key-value pairs being stored in nodes that are close to the ID of the key and cached around the network • The exact location(s) of a key-value pair at a given moment cannot be returned and the update of all replicas is not ensured • PFN mappings for a given Logical File Name (LFN) change frequently
DRLS (2) • Solution • Every lookup always queries all nodes responsible for a specific key-value pair • The lookup procedure does not stop at the first returned value; peers that do not reply with a value are considered not up to date • When all available results are returned, the querying node compares the results based on a predefined version vector (indicating the latest update of the value) • Updates are propagated to the nodes that were found responsible for storage but are not yet up to date with the latest value
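The lookup-and-update behaviour above can be sketched with each responsible node modelled as a dictionary. As a simplification, a single integer version counter replaces the version vector mentioned on the slide; the function name `lookup_and_repair` and the data layout are assumptions.

```python
from typing import Optional

Entry = tuple[int, list[str]]  # (version, PFNs); the integer version is a simplification


def lookup_and_repair(key: str, nodes: list[dict[str, Entry]]) -> Optional[list[str]]:
    """Query every responsible node, pick the newest value, and update stale nodes."""
    replies = [node.get(key) for node in nodes]   # do not stop at the first reply
    known = [reply for reply in replies if reply is not None]
    if not known:
        return None                               # no node holds the key
    latest = max(known, key=lambda entry: entry[0])
    for node, reply in zip(nodes, replies):
        # A node with no value or an older version is considered not up to date.
        if reply is None or reply[0] < latest[0]:
            node[key] = (latest[0], list(latest[1]))
    return list(latest[1])
```

The cost of this design is that every lookup contacts all responsible nodes, in exchange for returning the most recent mapping and repairing lagging replicas as a side effect.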
The data transfer mechanism of the Storage overlay • Implementation of BitTorrent designed to interface and integrate with GridFTP • A GridTorrent client is able to request file fragments from other GridTorrent clients holding the file or connect to GridFTP servers • Effectiveness and scalability, even under extreme load and flash crowd conditions • Optimized replica selection based on the rarest-first policy • BitTorrent • A peer-to-peer protocol that allows clients to download files from multiple sources while uploading them to other users at the same time
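The rarest-first policy mentioned above can be sketched as follows: among the pieces a client still needs, it requests the one held by the fewest peers. The function `rarest_first` and the representation of peer bit fields as sets of piece indices are assumptions made for illustration, and ties are broken by the lowest piece index for determinism.

```python
from collections import Counter


def rarest_first(needed, peer_pieces):
    """Pick the needed piece with the fewest holders.

    `needed` is a set of piece indices still missing locally.
    `peer_pieces` maps a peer name to the set of piece indices it holds.
    Returns (piece_index, sorted_holders), or None if no peer has a needed piece.
    """
    counts = Counter(piece for pieces in peer_pieces.values() for piece in pieces)
    candidates = [piece for piece in needed if counts[piece] > 0]
    if not candidates:
        return None
    piece = min(candidates, key=lambda p: (counts[p], p))
    holders = sorted(peer for peer, pieces in peer_pieces.items() if piece in pieces)
    return piece, holders
```

Fetching scarce pieces first spreads them to more peers early, which helps keep the whole file available when peers depart.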
GridTorrent • Queries the DRLS periodically • The DRLS returns to the GridTorrent peer: • PFN of replicas identified by the GridTorrent URL gtp://site.fully.qualified.domain.name/path/to/file • File size, hashes of pieces, size of each piece • GridTorrent client • The two involved peers initiate communication by exchanging the BitTorrent bit field message, informing each other of the pieces they possess • A request message for blocks is issued • GridFTP server • The client issues a GridFTP partial get message for the data within the specific block it intends to download
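Two small pieces of the exchange above lend themselves to a sketch: splitting a `gtp://` URL into host and path, and working out which byte range of the file a block corresponds to, as a partial get would need. The helper names and the (offset, length) return convention are assumptions; no actual GridFTP or BitTorrent call is made.

```python
from urllib.parse import urlsplit


def parse_gtp_url(url):
    """Split a GridTorrent URL into (host, path)."""
    parts = urlsplit(url)
    if parts.scheme != "gtp":
        raise ValueError("not a gtp:// URL")
    return parts.hostname, parts.path


def block_byte_range(piece_index, piece_size, block_offset, block_length, file_size):
    """Return the (offset, length) in the file covered by a block of a piece.

    The length is clamped so that the last block does not run past the end of the file.
    """
    start = piece_index * piece_size + block_offset
    if start >= file_size:
        raise ValueError("block starts beyond the end of the file")
    end = min(start + block_length, file_size)
    return start, end - start
```

The piece size and file size used in the calculation are the values the DRLS returns alongside the replica locations, so a client can derive the byte range without contacting the server first.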
Conclusions • The paper proposes a service-oriented architecture for efficient search and retrieval of annotated content • Different overlays are implemented and Peer-to-Peer techniques are introduced in the Grid environment • The design of the platform ensures scalability and fault-tolerance • An extensible architecture is presented favoring the integration with other systems and the development of Grid applications