Analysis of Cloud Data Management Systems

Student: Miro Szydlowski • Supervisor: Prof. Mehmet Orgun • Date: 11.11.11

INTRODUCTION • Relational Database Management Systems • Distributed Databases • NoSQL • Cloud Data Stores 1/22

Presentation Plan • Origins of Database Management Systems • Rise to power • ACID qualities • Problems and Solutions • Consequences of being popular • Partitioning, Replication, Load Balancing • Distributed Database Management Systems • Challenges of the Connected World • Cloud Computing • Definition, Types • Place of DBMS in the Cloud • Cloud Data Management Systems • CAP, BASE, NoSQL and a few other concepts • NoSQL by implementation type • Example: Amazon SimpleDB • Which one to choose? 2/22

Database Management Systems “…set of software programs that control the organisation, storage, management and data retrieval” Database Models: • Hierarchical • Network • Relational • Object-relational 3/22

Origins of Relational Database Management Systems • 1970: E. F. Codd's relational model (IBM Research); early implementations followed, including Ingres at the University of California, Berkeley • In the following 20 years RDBMS became not only accepted but essential, and came to be considered the only solution for enterprise data storage • Why? • Data normalisation • Metadata reuse • User Views <-> Community View <-> Storage • SQL! • Guarantees data integrity - ACID 4/22

ACID • Atomicity • Consistency • Isolation • Durability • Provides consistent state of the database • …but at a cost 5/22
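As an illustration of the atomicity and consistency properties listed above, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper and the balance rule are assumptions made up for this example; they are not part of the presentation.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first, so a later failure has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance (consistency).
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
    except sqlite3.IntegrityError:
        return False  # the credit above was rolled back (atomicity)
    return True

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK, rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer leaves both balances untouched, which is the "consistent state" the slide refers to. The "cost" is that the database must track and coordinate every such transaction.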

Problems and Solutions • Very successful solution, but the businesses were growing… • Data volume • Data warehousing, business intelligence • Mergers and acquisitions • WWW • New Solutions: • Partitioning • Hardware • Horizontal • Vertical • Replication • Multi-master • Master-Slave • Load Balancing • …and finally • Distributed Database Management Systems • …but the challenges kept coming… 6/22
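The difference between horizontal and vertical partitioning can be sketched in a few lines of Python. The `customers` data and both helper functions are hypothetical, chosen only to show the two ways of splitting a table.

```python
customers = [
    {"id": 1, "name": "Ann", "country": "AU", "notes": "prefers email"},
    {"id": 2, "name": "Bo", "country": "PL", "notes": "VIP"},
    {"id": 3, "name": "Cy", "country": "AU", "notes": ""},
]

def horizontal_partition(rows, column):
    """Split by rows: each partition holds whole rows sharing a column value."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    return parts

def vertical_partition(rows, key, column_groups):
    """Split by columns: each partition keeps the key plus one group of columns."""
    return [
        [{c: row[c] for c in (key, *group)} for row in rows]
        for group in column_groups
    ]

by_country = horizontal_partition(customers, "country")
core, extra = vertical_partition(customers, "id", [("name", "country"), ("notes",)])
```

Horizontal partitions can live on different servers and be queried independently; vertical partitions must be rejoined on the key whenever a query needs columns from more than one group.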

Challenges of the Connected World • Search Engines • Mobile Devices • Business-To-Business (Web Services) • Stream Processing • Data Warehousing • Directory Services • Current example: 2011 Twitter statistics: • 1 billion Tweets per week • 140 million Tweets per day on average • 177 million Tweets sent on March 11, 2011 • Current record: 6,939 Tweets per second (TPS), set 4 seconds after midnight in Japan on New Year’s Day • New Solutions needed ASAP 7/22

What is Cloud Computing? • Lots of definitions, one of them below: • “…a pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption” (James Staten) • Automation • Virtualization • Scalability • Pay-as-you-go pricing model 8/22

Cloud Computing Types • By Deployment Type • By Service Type • Cloud Data Management Systems? IaaS or PaaS 9/22

Dark Cloud • Beginning of the 21st century – open critique of relational database management systems: • Too complex for an average user • Can’t cope with data volumes • Relational mapping is overkill • One size doesn’t fit all – we want to prioritize some features • Why do we need to build the ORM? • Distributed RDBMSs are fake! • Scalability! • Why don’t we re-engineer and rebuild instead of constantly ‘patching’ RDBMS? 10/22

CAP and BASE • Eric Brewer, at the ACM Symposium on Principles of Distributed Computing (PODC) in 2000, made a statement: • It is impossible to guarantee all three qualities of a “shared-data system” at once: • Consistency • Availability • Partition Tolerance • …so – pick any two! • Since we can’t guarantee ACID, let’s BASE our systems on another principle: • Basically Available • Soft State • Eventually Consistent • These two ideas changed the approach to database design… • …and gave birth to the ‘NoSQL’ movement 11/22
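"Eventually consistent" can be made concrete with a toy simulation. This is only a sketch under simplifying assumptions: a single primary copy, in-memory replicas, and replication triggered by hand. The class and method names are invented for the example.

```python
class EventuallyConsistentStore:
    """Toy store: writes hit the primary at once, replicas catch up later."""

    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # writes accepted but not yet copied to replicas

    def write(self, key, value):
        # The write is acknowledged immediately (basically available).
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        # Reads are served by a replica, which may lag behind (soft state).
        return self.replicas[replica_index].get(key)

    def replicate(self):
        # Apply the backlog in order; afterwards all copies agree.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(num_replicas=2)
store.write("user:1", "Ann")
stale = store.read(0, "user:1")   # replica has not seen the write yet
store.replicate()
fresh = store.read(0, "user:1")   # replica has now converged
```

Between `write` and `replicate` a reader can observe old data. That window is the price paid for answering writes without waiting for every replica.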

A few new concepts • Hash-based partitioning • a certain property of each entity is used to calculate a hash value, which determines which database server stores the entity • ‘Shared nothing’ architecture • a cluster of independent machines that communicate over a high-speed network • Sharding • splitting up a database across multiple machines • MapReduce • not a database system, but a programming framework • every job sent is divided into two parts: a ‘Map’ and a ‘Reduce’ 12/22
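Two of the concepts above lend themselves to a short sketch: hash-based partitioning (choosing a shard from a hash of the key) and the map/reduce split of a job. The function names, the choice of SHA-256 and the word-count job are assumptions for illustration; a real framework runs the map and reduce steps in parallel across the cluster.

```python
import hashlib
from collections import defaultdict

def shard_for(key, num_shards):
    """Hash-based partitioning: a stable hash of the key picks the server."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def map_phase(document):
    """'Map': emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(word, counts):
    """'Reduce': combine all values emitted for one key."""
    return word, sum(counts)

def map_reduce(documents):
    """Run map on every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_phase(document):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

shard = shard_for("customer-42", num_shards=4)
word_counts = map_reduce(["a b a", "b c"])
```

Note that plain modulo placement moves most keys when `num_shards` changes; consistent hashing is a common way to limit that reshuffling.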

NoSQL Movement • Main objection: unnecessary complexity of relational databases • Motto: “select the right tool for the job” • “Tool in the box” approach • Principles of NoSQL data stores: • Built for performance • Built for real scalability • Built for high availability • Typically use a very specific data access pattern • Either schemaless or implementing very simple schemas • Weak consistency guarantees • Declarative query language (such as SQL) replaced with simple APIs 13/22

NoSQL Databases by Implementation Type • Key/Value Stores • BigTable • Document-based • Columnar • (also graph, object-oriented, distributed object stores and dozens of others…) 14/22

Key/Value Stores • Data is stored as key/value pairs • Basic APIs – Put/Get/Remove • Scalability: Sharding or Replicating data items • Advantages: Performance and scalability • Best For: High-performance systems that deal with one type of object • Examples: HBase, SimpleDB, Cassandra • Potential Issues: Data integrity has to be enforced by the application; supports only one type of query (lookup by key) 15/22
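The Put/Get/Remove interface is small enough to sketch in full. This in-memory class is a hypothetical stand-in, not the API of any of the products named above; it shows why the only available query is a lookup by key.

```python
class KeyValueStore:
    """Minimal in-memory key/value store exposing only put, get and remove."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # The store never inspects the value; integrity is the caller's job.
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def remove(self, key):
        """Delete the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False

store = KeyValueStore()
store.put("session:abc", {"user": "ann", "cart": [3, 7]})
session = store.get("session:abc")
removed = store.remove("session:abc")
missing = store.get("session:abc")
```

Because every operation touches exactly one key, the keys can be spread over many machines without any cross-machine coordination.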

‘BigTable’ Databases • Named after Google’s ‘BigTable’ implementation • Each row can have a different set of columns • A row can have thousands of columns • Records can have multiple fields • Records are indexed by [row-key, column-key, timestamp] • Usually sharded • Advantages: Highly optimized for write operations, highly scalable, (quoted) extremely even performance • Examples: Google Analytics, Google Docs, Microsoft Azure Tables • Potential Issues: Lack of text search; very difficult to import and export data – queries time out after 30 sec 16/22
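The [row-key, column-key, timestamp] index can be pictured as a sparse, versioned map. The class below is a simplified sketch of that data model only; its names are invented and it says nothing about how a real implementation stores or shards the cells.

```python
class SparseTable:
    """Cells addressed by (row key, column key, timestamp)."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row_key, column_key, timestamp, value):
        self._cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value, or the newest at or before `timestamp`."""
        versions = self._cells.get((row_key, column_key), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def columns(self, row_key):
        """Each row may have its own, different set of columns."""
        return sorted(c for (r, c) in self._cells if r == row_key)

table = SparseTable()
table.put("page:home", "title", 1, "Welcome")
table.put("page:home", "title", 5, "Welcome back")
table.put("page:home", "lang", 1, "en")
table.put("page:faq", "title", 2, "FAQ")

latest = table.get("page:home", "title")
older = table.get("page:home", "title", timestamp=3)
```

Rows that lack a column simply have no cell for it, so wide and uneven rows cost nothing in storage.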

Document Databases • Completely schemaless • All document data is stored in the document itself • Documents are usually encoded in JSON, BSON or XML • Scalability: good, implemented through asynchronous replication • Advantages: client application can store data in its final form; support for custom views • Examples: CouchDB, MongoDB, Terrastore • Best For: wikis, blogs, document management systems • Potential Issues: They don’t actually outperform RDBMS; not well supported 17/22
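A rough sketch of the document model: each document is a self-contained JSON value with no shared schema, and a "view" is a function applied to every document. The helper names and sample documents are assumptions for illustration and do not mirror any particular product's API.

```python
import json

documents = {}  # document id -> JSON text

def save(doc_id, document):
    """Store the document exactly as the application shaped it."""
    documents[doc_id] = json.dumps(document)

def load(doc_id):
    return json.loads(documents[doc_id])

def view(map_fn):
    """Custom view: collect the (key, value) pairs emitted per document."""
    pairs = []
    for doc_id in documents:
        pairs.extend(map_fn(load(doc_id)))
    return sorted(pairs, key=lambda pair: pair[0])

# Two documents with different fields; no schema has to be declared.
save("post-1", {"type": "post", "title": "Hello", "tags": ["intro", "news"]})
save("page-1", {"type": "wiki", "title": "Setup", "revision": 4})

def titles_by_tag(doc):
    """Emit one (tag, title) pair per tag; documents without tags emit nothing."""
    return [(tag, doc["title"]) for tag in doc.get("tags", [])]

tag_index = view(titles_by_tag)
```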

Columnar Databases • ‘Between’ SQL and NoSQL – can use SQL syntax, but use wide columns • Each column is stored separately, at a different disk location • Scalability and Performance: both good because rows and columns can be split across multiple nodes: rows – sharding, columns – column groups • Advantages: great when you need data aggregation • Examples: Vertica, HBase • Best At: Data warehousing, data mining • Potential Issues: Not great at handling complex relationships; better than RDBMS only when the row size is big and few columns of a single row are required 18/22
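Why column storage helps aggregation can be seen by pivoting a small row-oriented table into per-column lists. The `sales` data and the `to_columns` helper are hypothetical, and the sketch assumes every row has the same fields.

```python
sales = [
    {"order_id": 1, "region": "EU", "amount": 120, "comment": "gift wrap"},
    {"order_id": 2, "region": "US", "amount": 80, "comment": ""},
    {"order_id": 3, "region": "EU", "amount": 200, "comment": "express"},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(sales)

# An aggregate over one attribute reads a single contiguous column
# and never touches the wide 'comment' values.
total_amount = sum(columns["amount"])

# Rebuilding a complete row has to visit every column, which is why
# fetching whole rows is the weak spot of this layout.
second_row = {name: values[1] for name, values in columns.items()}
```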

Example: Amazon SimpleDB • Data Store Type: Entity-Attribute-Value • Data Model: Document Store/Big Table • Cloud Type: Platform as a Service • The data model is based on domains, items, attributes and values: • Domains are collections of items that are described by attribute-value pairs • Domains are currently limited to 10 GB each, and each account is limited to 100 domains • Doesn’t have the concept of a schema – everything is a string • Designed for reads rather than writes • Updates are made to the central database ONLY and then distributed to ‘slaves’ • Client interface: SOAP and REST • Availability: multiple geographically distributed copies of each data item • Scalability: Great • Pay-as-you-go model: Clients are charged for data storage, data transfer and machine utilization • Potential Issues: eventual consistency, no data types or constraints 19/22
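One practical consequence of "everything is a string" is that comparisons are lexicographic, so numbers need zero-padding to sort and filter correctly. The sketch below models a domain as plain Python dictionaries; the helper names are invented for the example and are not the SimpleDB client API.

```python
def put_item(domain, item_name, attributes):
    """Store an item; every attribute value is converted to a string."""
    domain[item_name] = {name: str(value) for name, value in attributes.items()}

def select(domain, attribute, predicate):
    """Return names of items whose attribute value satisfies the predicate."""
    return sorted(
        name for name, attrs in domain.items()
        if attribute in attrs and predicate(attrs[attribute])
    )

def pad(number, width=6):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    return str(number).zfill(width)

naive = {}
put_item(naive, "book", {"price": 10})
put_item(naive, "pen", {"price": 9})
# String comparison: "10" sorts before "5", so the book is missed.
naive_hits = select(naive, "price", lambda value: value > "5")

padded = {}
put_item(padded, "book", {"price": pad(10)})
put_item(padded, "pen", {"price": pad(9)})
padded_hits = select(padded, "price", lambda value: value > pad(5))
```

With no data types or constraints in the store, conventions like this padding have to be enforced by every client application.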

Summary – RDBMS or NoSQL? It depends… • if you have a low-volume, medium-complexity suite of applications, don’t change it – this is what RDBMSs are good at • if your data is normalized and relies on joins – don’t move to schemaless NoSQL • if you’re looking for an off-the-shelf system and don’t want to get involved in customized development – choose RDBMS • if your problem can’t be resolved using RDBMS [e.g. you have serious scalability issues] and you’re determined to fix it at any cost – go ‘NoSQL’ • if you have access to sufficient quantities of sufficiently smart people – choose NoSQL 20/22

Summary – RDBMS or NoSQL? ‘choose the right tool for the job’ 21/22

Questions? 22/22
