Data Grid, Cloud and Vertical RDBMS. Presenter: Dipesh Gautam. Overview. Introduction Why Data Grid? High Level View Design Considerations Data Grid Services Topology Grids and Cloud Convergence of Grid and Cloud Vertical RDBMS Benefits of column-oriented layout. Introduction.
Data Grid, Cloud and Vertical RDBMS Presenter: Dipesh Gautam
Overview • Introduction • Why Data Grid? • High Level View • Design Considerations • Data Grid Services • Topology • Grids and Cloud • Convergence of Grid and Cloud • Vertical RDBMS • Benefits of column-oriented layout
Introduction • Data Grid: an architecture or set of services that gives individuals or groups of users the ability to access and transact large amounts of geographically distributed data. • The data may be replicated throughout the grid, outside the original administrative domain of the data. • The integration between users and the data is handled and controlled by the data grid middleware.
Why Data Grid? • Large dataset size • Geographic distribution of users and resources • Computationally intensive analysis • No other architecture allows these technologies to be applied in large-scale application domains
Design Considerations • Mechanism Neutrality • Designed to be as independent as possible of low-level mechanisms • Defines interfaces that encapsulate the peculiarities of specific storage systems • Compatibility with Grid Infrastructure • Takes advantage of fundamental Grid infrastructure • Compatible with lower-level Grid mechanisms • Uniformity of Information Infrastructure • The same data model and interface are used to access the grid's metadata
Data Grid Services • Middleware provides the following services: • Universal namespace • Data transport service • Data access service • Data replication service • Resource management system (RMS)
Why Universal namespace? • A number of systems and networks are connected within a grid • Separate systems within the grid use different file naming conventions • Physical file names alone do not solve the problem of locating the data • The universal namespace provides logical file names • The Storage Resource Broker provides a service to map between logical and physical file names • When a logical file name is requested, all matching physical file names are returned and the end user chooses the appropriate replica
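The logical-to-physical mapping described on this slide can be sketched as a small in-memory catalog. This is only an illustration: the class name `ReplicaCatalog`, the `lfn://` prefix and the host names are hypothetical and do not reflect the Storage Resource Broker's actual interface.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy logical-to-physical name map (hypothetical; not the SRB API)."""

    def __init__(self):
        self._replicas = defaultdict(list)

    def register(self, logical_name, physical_name):
        # Record one more physical copy of a logical file.
        if physical_name not in self._replicas[logical_name]:
            self._replicas[logical_name].append(physical_name)

    def lookup(self, logical_name):
        # Return every known physical location; the caller picks one.
        return list(self._replicas.get(logical_name, []))


catalog = ReplicaCatalog()
catalog.register("lfn://experiment/run42.dat",
                 "gsiftp://site-a.example.org/data/run42.dat")
catalog.register("lfn://experiment/run42.dat",
                 "gsiftp://site-b.example.org/store/0042.bin")
replicas = catalog.lookup("lfn://experiment/run42.dat")
```

Here `replicas` holds both physical names, and choosing between them is left to the end user, as the slide describes.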
Data Transport Service • Middleware service for data transfer • Treating each requested transfer as atomic makes the service fault tolerant • Data transfer is resumed after each interruption until all requested data is received • Possible recovery strategies: • Restarting the entire transmission from the beginning • Resuming from the point of interruption, e.g. GridFTP resends data from the last acknowledged byte instead of restarting the entire transfer • Provides a service for low-level access and connections between hosts for file transfer • Provides I/O functions that let users see remote files as if they were local to their system • Provides a high-level abstraction of data access and transfer between different systems, hiding the complexity and presenting the data to the user as a unified source
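The "resume from the point of interruption" strategy can be sketched as a loop that only advances past bytes the receiver has accepted. This is a simplified illustration, not GridFTP itself: `transfer_with_resume`, `FlakyReceiver`, the chunk size and the failure schedule are all hypothetical.

```python
def transfer_with_resume(source, send_chunk, chunk_size=4, max_attempts=10):
    """Deliver all of `source`, resuming from the last acknowledged byte."""
    acked = 0      # number of bytes the receiver has confirmed
    failures = 0
    while acked < len(source):
        chunk = source[acked:acked + chunk_size]
        try:
            send_chunk(acked, chunk)
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise
            continue  # retry from `acked`, not from byte 0
        acked += len(chunk)
    return acked


class FlakyReceiver:
    """Hypothetical receiver that drops the connection on chosen calls."""

    def __init__(self, fail_on_calls):
        self.buffer = bytearray()
        self.calls = 0
        self.fail_on_calls = set(fail_on_calls)

    def receive(self, offset, chunk):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError("link dropped")
        if offset != len(self.buffer):
            raise ValueError("chunk arrived out of order")
        self.buffer.extend(chunk)


receiver = FlakyReceiver(fail_on_calls={2, 5})
payload = b"grid data payload"
sent = transfer_with_resume(payload, receiver.receive)
```

Despite two simulated interruptions, only the interrupted chunks are sent again; the bytes acknowledged earlier are never retransmitted.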
Data access service • Works with the data transport service to provide security, access control and management of data transfers within the grid • Provides security services to authenticate users • Provides authorization services to control access, ranging from simple file permissions to Access Control Lists (ACLs) and Role-Based Access Control • Provides encryption services to protect the confidentiality of data in transit (e.g. SSL)
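As a rough illustration of how an ACL can be combined with role-based access control, the sketch below denies by default and grants an action only when the user's role on a file allows it. The role names, users, and the `is_authorized` helper are hypothetical.

```python
# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "reader": {"read"},
    "curator": {"read", "write"},
}

# Hypothetical per-file ACL: which role each user holds on that file.
FILE_ACL = {
    "lfn://experiment/run42.dat": {"alice": "curator", "bob": "reader"},
}


def is_authorized(user, logical_name, action):
    # Unknown files, users and roles are denied by default.
    role = FILE_ACL.get(logical_name, {}).get(user)
    return action in ROLE_PERMISSIONS.get(role, set())


alice_can_write = is_authorized("alice", "lfn://experiment/run42.dat", "write")
bob_can_write = is_authorized("bob", "lfn://experiment/run42.dat", "write")
```

Authentication and encryption would sit in front of this check in a real deployment; they are left out here to keep the example small.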
Data replication service • Why replication? • Scalability • Fast access • User collaboration • Replicas are often placed close to the sites where users need them • Replication is controlled by a replica management system • The replica management system determines the need for replicas based on requests • Replicas are kept up to date by propagating changes made at one node to all the other nodes in the grid
Replica update • Centralized model: a single master replica updates all others • Decentralized model: all peers update each other • The topology of node placement influences the update strategy
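The two update models can be contrasted with a minimal sketch. The `Replica` class and both functions are hypothetical; real systems also need conflict handling and failure recovery, which are omitted here.

```python
class Replica:
    """Hypothetical replica node holding key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []  # only used by the decentralized model

    def apply(self, key, value):
        self.data[key] = value


def centralized_update(master, replicas, key, value):
    # The single master applies the change, then pushes it to every replica.
    master.apply(key, value)
    for replica in replicas:
        replica.apply(key, value)


def decentralized_update(origin, key, value):
    # Any peer may start an update; it floods along peer links,
    # visiting each node once.
    seen = set()
    pending = [origin]
    while pending:
        node = pending.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        node.apply(key, value)
        pending.extend(node.peers)
    return seen


# Centralized: one master, two followers.
master, r1, r2 = Replica("master"), Replica("r1"), Replica("r2")
centralized_update(master, [r1, r2], "run42", "v1")

# Decentralized: three peers linked in a ring a -> b -> c -> a.
a, b, c = Replica("a"), Replica("b"), Replica("c")
a.peers, b.peers, c.peers = [b], [c], [a]
reached = decentralized_update(b, "run42", "v2")
```

In the decentralized case the set of nodes reached depends on how the peers are linked, which is one way the topology influences the update strategy.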
Replica Placement • Static replication • Uses a fixed set of replica nodes with no dynamic changes to the files being replicated • Dynamic replication • Based on the popularity of data • If the number of requests exceeds the replication threshold, a replica is placed on the server that directly serves the client, provided that storage is available • Replicas that have a null access value are deleted dynamically • Adaptive replication • The dynamic threshold is computed from the request arrival rates of clients over a period of time • Replicas with a lower threshold that were not created in the current replication interval can be removed • Fair-share replication • Based on the access load and storage load of candidate servers • The server with the lowest access load is selected, because replicating onto a server with a higher access load degrades performance for all clients • Among candidate servers with the same access load, the one with the lowest storage load is selected • Many other replica placement strategies exist
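The placement rules on this slide can be sketched as three small functions. The names are hypothetical, and the adaptive threshold is assumed to be a plain mean of request counts per interval, since the slide only says it is computed from request arrival rates.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    access_load: int   # e.g. requests currently being served
    storage_load: int  # e.g. storage units in use


def should_replicate(request_count, threshold, storage_available):
    # Dynamic replication: replicate only popular data, and only if there is room.
    return request_count > threshold and storage_available


def adaptive_threshold(arrival_counts):
    # ASSUMPTION: mean requests per interval over the observation window.
    return sum(arrival_counts) / len(arrival_counts)


def fair_share_choice(candidates):
    # Lowest access load wins; ties are broken by lowest storage load.
    return min(candidates, key=lambda s: (s.access_load, s.storage_load))


servers = [Server("a", 10, 5), Server("b", 3, 9), Server("c", 3, 2)]
chosen = fair_share_choice(servers)
threshold = adaptive_threshold([4, 6, 8])
```

Servers `b` and `c` tie on access load, so `c` is chosen for its lower storage load.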
Resource management system (RMS) • Core functionality of the data grid • Manages all actions related to storage resources • Fulfils user and application requests for data resources based on the type of request and policies • Schedules the creation of replicas • Enforces policy and security across data grid resources, including authentication, authorization and access, and supports interoperation between systems with different administrative policies • Enforces system fault tolerance and stability requirements
Topology • Various topologies have been used to address the needs of the scientific community • Four major types of topologies • Federation topology • Monadic topology • Hierarchical topology • Hybrid topology
Federation Topology • Allows each institution to retain control over its data • The institution that receives a request from an authorized institution determines whether to send data to the requesting institution • The federation can be loosely or tightly integrated • Preferred by institutions that wish to share data from already existing systems
Monadic topology • All the collected data is fed into a central repository • Central repository responds to all queries for data • No replicas in the topology • This topology is well suited when all access to the data is local or within a single region with high speed connectivity
Hierarchical Topology • Suited to collaborations in which data from a single source is distributed to multiple locations around the world
Hybrid Topology • Any combination of the other topologies • Suited to researchers working on projects who want to share their results to further research by making them readily available for collaboration
Grids and Clouds • Grid • Grid refers to distributed computing in science and engineering • In grid computing, virtual organizations share computer resources over a network • Scientific research, collaboration • Shares local resources • Heterogeneous, real resources • Geographically distributed, locally owned and managed • Cloud • Cloud refers to a computer network in the context of network management • In cloud computing, anybody can access data and compute services over the internet • Web services, business apps • Makes huge data centers available • Homogeneous, virtualized resources • Geographically distributed, centrally owned and managed
Convergence of Grid and Cloud • Users should consider interoperability standards among the service providers of both grid and cloud • Interoperating clouds look like a grid
Vertical RDBMS • Column-Oriented DBMS • Stores data column-wise instead of row-wise • In a row-oriented DBMS the values in each row are serialized and stored in memory as: 1, Smith, Joe, 40000; 2, Jones, Mary, 50000; 3, Johnson, Cathy, 44000; • In a column-oriented DBMS the columns are serialized as: 1, 2, 3; Smith, Jones, Johnson; Joe, Mary, Cathy; 40000, 50000, 44000;
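The two serializations on this slide can be reproduced with a short sketch. The function names are hypothetical, and a real DBMS stores binary pages rather than text, so this only illustrates the ordering of values.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_oriented(rows):
    # All values of one row are kept together.
    return "; ".join(", ".join(str(v) for v in row) for row in rows) + ";"


def serialize_column_oriented(rows):
    # zip(*rows) transposes the table so all values of one column are kept together.
    columns = zip(*rows)
    return "; ".join(", ".join(str(v) for v in col) for col in columns) + ";"


row_layout = serialize_row_oriented(rows)
column_layout = serialize_column_oriented(rows)
```

`row_layout` and `column_layout` match the two strings shown on the slide.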
Benefits of Column-Oriented layout • Efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of columns • Efficient for writing a column when new values of that column are supplied for all rows at once • Suited to Online Analytical Processing (OLAP)-like workloads, which involve a smaller number of highly complex queries over all the data, possibly terabytes in size
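The first benefit can be made concrete by counting how many stored values each layout has to read to total one column. This is a simplified sketch with hypothetical function names; it assumes a row store fetches whole records and ignores paging, compression and indexes.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
columns = {
    "id": [1, 2, 3],
    "last_name": ["Smith", "Jones", "Johnson"],
    "first_name": ["Joe", "Mary", "Cathy"],
    "salary": [40000, 50000, 44000],
}


def total_salary_row_store(rows):
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # ASSUMPTION: the whole record is fetched
        total += row[3]
    return total, values_read


def total_salary_column_store(columns):
    salaries = columns["salary"]  # only the salary column is touched
    return sum(salaries), len(salaries)


row_result = total_salary_row_store(rows)
column_result = total_salary_column_store(columns)
```

Both layouts give the same total, but under these assumptions the row store reads 12 values and the column store reads 3. The gap grows with the number of columns the query does not need.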
References • http://en.wikipedia.org/wiki/Data_grid • http://www.globus.org/toolkit/about.html • Martin Antony Walker, Grids and Clouds, http://www.ogf.org/OGF25/materials/1500/Grids+and+Clouds+OGF25+MAW.pdf • http://staff.science.uva.nl/~adam/courses/2004/documents/Course-DataGrid.ppt • http://en.wikipedia.org/wiki/Column-oriented_DBMS