Analysis of Cloud Data Management Systems Student: Miro Szydlowski Supervisor: Prof. Mehmet Orgun Date: 11.11.11
INTRODUCTION
[Diagram: Relational Database Management Systems, Distributed Databases, NoSQL, Cloud Data Stores, ?]
1/22
Presentation Plan
• Origins of Database Management Systems
  • Rise to power
  • ACID qualities
• Problems and Solutions
  • Consequences of being popular
  • Partitioning, Replication, Load Balancing
  • Distributed Database Management Systems
• Challenges of the Connected World
• Cloud Computing
  • Definition, Types
  • Place of DBMS in the Cloud
• Cloud Data Management Systems
  • CAP, BASE, NoSQL and a few other concepts
  • NoSQL by implementation type
  • Example: Amazon SimpleDB
• Which one to choose?
2/22
Database Management Systems
“…set of software programs that control the organisation, storage, management and data retrieval”
Database Models:
• Hierarchical
• Network
• Relational
• Object-relational
3/22
Origins of Relational Database Management Systems
• 1970: relational model proposed by E. F. Codd at IBM; early implementations followed, including Ingres at the University of California, Berkeley
• Over the following 20 years, became not only accepted and essential, but was considered the only solution for enterprise data storage
• Why?
  • Data normalisation
  • Metadata reuse
  • User Views <-> Community View <-> Storage
  • SQL!
  • Guarantees data integrity - ACID
4/22
ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Provides consistent state of the database
• …but at a cost
5/22
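Atomicity can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper and the sample balances are assumptions made up for this illustration; they are not part of the presentation. The second transfer fails halfway through, and the database returns to its previous consistent state.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic unit (hypothetical helper)."""
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        # The debit may violate the CHECK constraint; the credit above is then undone.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds: both updates are committed
try:
    transfer(conn, "alice", "bob", 500)  # fails: neither update survives
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The "cost" mentioned on the slide shows up here as the bookkeeping the engine must do so that a half-finished transaction can always be undone.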
Problems and Solutions
• Very successful solution, but the businesses were growing…
  • Data volume
  • Data warehousing, business intelligence
  • Mergers and acquisitions
  • WWW
• New Solutions:
  • Partitioning
    • Hardware
    • Horizontal
    • Vertical
  • Replication
    • Multi-master
    • Master-Slave
  • Load Balancing
  • …and finally: Distributed Database Management Systems
…but the challenges kept coming…
6/22
Challenges of the Connected World
• Search Engines
• Mobile Devices
• Business-To-Business (Web Services)
• Stream Processing
• Data Warehousing
• Directory Services
• Current example: 2011 Twitter statistics:
  • 1 billion Tweets per week
  • 140 million Tweets per day on average
  • 177 million Tweets sent on March 11, 2011
  • Current record: 6,939 Tweets per second (TPS), set 4 seconds after midnight in Japan on New Year’s Day
• New solutions needed ASAP
7/22
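To put the Twitter figures on a common scale, the short calculation below converts them to tweets per second. The input numbers come from the slide; the variable names are only illustrative. The daily average works out to roughly 1,600 tweets per second, so the record peak is a bit more than four times the average load.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_week = 1_000_000_000    # from the slide
tweets_per_day_avg = 140_000_000   # from the slide
record_tps = 6_939                 # from the slide

avg_tps = tweets_per_day_avg / SECONDS_PER_DAY   # average tweets per second
weekly_per_day = tweets_per_week / 7             # weekly total expressed per day
peak_to_avg = record_tps / avg_tps               # how far the peak exceeds the average

print(f"average load: {avg_tps:,.0f} tweets/s")
print(f"1 billion per week is about {weekly_per_day / 1e6:,.0f} million per day")
print(f"record peak is about {peak_to_avg:.1f}x the average")
```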
What is Cloud Computing?
• Lots of definitions, one of them below:
  “…a pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption” (James Staten)
• Automation
• Virtualization
• Scalability
• Pay-as-you-go pricing model
8/22
Cloud Computing Types
• By Deployment Type
• By Service Type
• Cloud Data Management Systems? IaaS or PaaS
9/22
Dark Cloud
• Beginning of the 21st century – open critique of relational database management systems:
  • Too complex for an average user
  • Can’t cope with data volumes
  • Relational mapping is overkill
  • One size doesn’t fit all – we want to prioritize some features
  • Why do we need to build the ORM?
  • Distributed RDBMSs are fake!
  • Scalability!
• Why don’t we re-engineer and rebuild instead of constantly ‘patching’ RDBMS?
10/22
CAP and BASE
• Eric Brewer, at an ACM symposium in 2000, made a statement:
• It is impossible for a “shared-data system” to provide all three of these qualities at once:
  • Consistency
  • Availability
  • Partition Tolerance
• …so – pick any two!
• Since we can’t guarantee ACID, let’s BASE our systems on another principle:
  • Basically Available
  • Soft State
  • Eventually Consistent
• These two ideas changed the approach to database design…
• …and gave birth to the ‘NoSQL’ movement
11/22
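"Eventually consistent" can be made concrete with a toy simulation. The `Primary` and `Replica` classes below are hypothetical and exist only for this sketch; real systems replicate over a network and handle failures, which is ignored here. A write is acknowledged before the replica has seen it, so a read from the replica is briefly stale and becomes correct once replication catches up.

```python
from collections import deque


class Replica:
    """Read-only copy that lags behind the primary (hypothetical)."""

    def __init__(self):
        self.data = {}


class Primary:
    """Accepts writes and ships them to replicas asynchronously (hypothetical)."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # changes not yet applied on the replicas

    def put(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # acknowledged before replicas see it

    def replicate(self):
        # Stand-in for the background replication process.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
primary = Primary([replica])

primary.put("user:1", "Miro")
stale = replica.data.get("user:1")   # replica has not caught up yet
primary.replicate()
fresh = replica.data.get("user:1")   # replica now agrees with the primary

print(stale, fresh)
```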
A few new concepts
• Hash-based partitioning
  • a certain property of each entity is used to calculate a hash value, which determines which database server stores the entity
• ‘Shared nothing’ architecture
  • a cluster of independent machines that communicate over a high-speed network
• Sharding
  • splitting up a database across multiple machines
• MapReduce
  • not a database system, but a programming framework
  • every job sent is divided into two parts: a ‘Map’ and a ‘Reduce’
12/22
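Two of these concepts can be sketched in a few lines of Python. The function names `shard_for` and `map_reduce`, the choice of SHA-256 and the sample data are all assumptions for illustration; a production system would distribute the work across machines rather than run it in one process. `shard_for` picks a server from a hash of the entity's key, and `map_reduce` runs a job as a 'Map' step followed by a 'Reduce' step.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Hash-based partitioning: map an entity key to a shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def map_reduce(records, mapper, reducer):
    """Run `mapper` on every record, group its output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Sharding: spread user keys over three hypothetical servers.
NUM_SHARDS = 3
user_keys = [f"user:{n}" for n in range(10)]
shards = [[] for _ in range(NUM_SHARDS)]
for user_key in user_keys:
    shards[shard_for(user_key, NUM_SHARDS)].append(user_key)

# MapReduce: the classic word count.
lines = ["to be or not to be"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(shards)
print(counts)
```

Simple modulo placement like this moves most keys when the number of shards changes, which is one reason real stores tend to use more elaborate placement schemes.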
NoSQL Movement
• Their main objection: unnecessary complexity of relational databases
• Motto: “select the right tool for the job”
• “Tool in the box” approach
• Principles of NoSQL data stores:
  • Built for performance
  • Built for real scalability
  • Built for high availability
  • Typically use a very specific data access pattern
  • Either schemaless or implementing very simple schemas
  • Weak consistency guarantees
  • Declarative query language (such as SQL) replaced with simple APIs
13/22
NoSQL Databases by Implementation Type
• Key/Value Stores
• BigTable
• Document-based
• Columnar
• (also graph, object-oriented, distributed object stores and dozens of others…)
14/22
Key/Value Stores
• Data is stored as key/value pairs
• Basic APIs – Put/Get/Remove
• Scalability: sharding or replicating data items
• Advantages: performance and scalability
• Best For: high-performance systems that deal with one type of object
• Examples: HBase, SimpleDB, Cassandra
• Potential Issues: data integrity has to be enforced by the application; only one type of query is supported
15/22
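The Put/Get/Remove interface is small enough to sketch in full. The `KeyValueStore` class below is a hypothetical in-memory stand-in, not the API of any product named on the slide. It also shows the limitation listed under potential issues: the only supported query is a lookup by key, and the store does nothing to validate what is put into it.

```python
class KeyValueStore:
    """Hypothetical in-memory key/value store exposing only Put/Get/Remove."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # No schema and no constraints: any value is accepted as-is.
        self._items[key] = value

    def get(self, key, default=None):
        # The single query type: exact lookup by key.
        return self._items.get(key, default)

    def remove(self, key):
        """Delete the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", {"user": "miro", "cart": ["book"]})

session = store.get("session:42")
removed = store.remove("session:42")
missing = store.get("session:42")

print(session, removed, missing)
```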
‘BigTable’ Databases
• Named after Google’s ‘BigTable’ implementation
• Each row can have a different set of columns
• A row can have thousands of columns
• Records can have multiple fields
• Records are indexed by [row-key, column-key, timestamp]
• Usually sharded
• Advantages: highly optimized for write operations, highly scalable, (quoted) extremely even performance
• Examples: Google Analytics, Google Docs, Microsoft Azure Tables
• Potential Issues: lack of text search; very difficult to import and export data – queries time out after 30 sec
16/22
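The [row-key, column-key, timestamp] index can be pictured as a nested map. The `SparseTable` class and the sample rows below are assumptions for illustration only and say nothing about how Google's implementation stores data. The sketch shows two properties from the slide: rows may carry different sets of columns, and each cell keeps timestamped versions.

```python
from collections import defaultdict


class SparseTable:
    """Hypothetical in-memory table indexed by (row key, column key, timestamp)."""

    def __init__(self):
        # row key -> {column key: [(timestamp, value), ...] sorted by timestamp}
        self._cells = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        versions = self._cells[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._cells.get(row, {}).get(column, [])
        matches = [v for ts, v in versions if timestamp is None or ts <= timestamp]
        return matches[-1] if matches else None

    def columns(self, row):
        """Column keys present in this row; other rows may have different ones."""
        return sorted(self._cells.get(row, {}))


table = SparseTable()
table.put("com.example", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example", "anchor:home", "Home", timestamp=1)
table.put("org.other", "contents:html", "<html>other</html>", timestamp=1)

latest = table.get("com.example", "contents:html")
older = table.get("com.example", "contents:html", timestamp=1)
print(latest, older, table.columns("com.example"), table.columns("org.other"))
```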
Document Databases
• Completely schemaless
• All document data is stored in the document itself
• Documents are usually encoded in JSON, BSON or XML
• Scalability: good, implemented through asynchronous replication
• Advantages: the client application can store data in its final form; custom views are supported
• Examples: CouchDB, MongoDB, Terrastore
• Best For: wikis, blogs, document management systems
• Potential Issues: they don’t actually outperform RDBMS; not well supported
17/22
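A rough sketch of the document model, using Python's standard `json` module. The `DocumentStore` class, its `view` method and the sample documents are hypothetical and do not reproduce the API of CouchDB, MongoDB or Terrastore. Documents with different fields live side by side, and a "view" is just a function applied to every document.

```python
import json


class DocumentStore:
    """Hypothetical schemaless store that keeps each document as a JSON string."""

    def __init__(self):
        self._docs = {}

    def save(self, doc_id, document):
        # The whole document is serialised and stored as one unit.
        self._docs[doc_id] = json.dumps(document)

    def load(self, doc_id):
        return json.loads(self._docs[doc_id])

    def view(self, map_fn):
        """Apply `map_fn` to every document; keep results that are not None."""
        results = []
        for doc_id in sorted(self._docs):
            value = map_fn(json.loads(self._docs[doc_id]))
            if value is not None:
                results.append((doc_id, value))
        return results


store = DocumentStore()
store.save("post-1", {"type": "blog", "title": "Hello", "tags": ["intro"]})
store.save("page-1", {"type": "wiki", "title": "Home", "body": "Welcome"})

# A custom view: titles of blog posts only.
blog_titles = store.view(lambda doc: doc["title"] if doc.get("type") == "blog" else None)
print(blog_titles)
```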
Columnar Databases
• ‘Between’ SQL and NoSQL – can use SQL syntax, but use wide columns
• Each column is stored separately, at a different disk location
• Scalability and Performance: both good, because rows and columns can be split across multiple nodes: rows – sharding, columns – column groups
• Advantages: great when you need data aggregation
• Examples: Vertica, HBase
• Best At: data warehousing, data mining
• Potential Issues: not great at handling complex relationships; better than RDBMS only when rows are large and only a few columns of each row are required
18/22
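Why column storage suits aggregation can be seen by pivoting a small table. The sample sales rows and the `to_columns` helper are made up for this sketch. Once each column is held separately, summing one column touches only that column's values instead of every field of every row.

```python
# Row-oriented layout: each record holds all of its fields together.
rows = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 200},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregation reads a single column and ignores the rest.
total_amount = sum(columns["amount"])
print(columns)
print(total_amount)
```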
Example: Amazon SimpleDB
• Data Store Type: Entity-Attribute-Value
• Data Model: Document Store/Big Table
• Cloud Type: Platform as a Service
• The data model is based on domains, items, attributes and values:
  • Domains are collections of items that are described by attribute-value pairs
  • Domains are currently limited to 10 GB each, and each account is limited to 100 domains
• Doesn’t have the concept of a schema – everything is a string
• Designed for reads rather than writes
• Updates are made to the central database ONLY and distributed to ‘slaves’
• Client interface: SOAP and REST
• Availability: multiple geographically distributed copies of each data item
• Scalability: great
• Pay-as-you-go model: clients are charged for data storage, data transfer and machine utilization
• Potential Issues: eventual consistency, no data types or constraints
19/22
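The domain/item/attribute model and the "everything is a string" rule can be mimicked locally. The `Domain` class, its methods and the `pad` helper below are hypothetical and are not the SimpleDB client API. The sketch shows one practical consequence of having no data types: comparisons are lexicographic, so numbers have to be zero-padded by the application to sort correctly.

```python
class Domain:
    """Hypothetical in-memory stand-in for a SimpleDB-style domain."""

    def __init__(self, name):
        self.name = name
        self.items = {}  # item name -> {attribute name: string value}

    def put_item(self, item_name, **attributes):
        item = self.items.setdefault(item_name, {})
        for attribute, value in attributes.items():
            item[attribute] = str(value)  # no data types: everything is a string

    def select_at_least(self, attribute, minimum):
        """Names of items whose attribute is >= `minimum`, compared as strings."""
        return sorted(
            name
            for name, item in self.items.items()
            if attribute in item and item[attribute] >= minimum
        )


def pad(number, width=5):
    """Zero-pad a number so that string order matches numeric order."""
    return f"{number:0{width}d}"


naive = Domain("products_naive")
naive.put_item("book", price=9)
naive.put_item("laptop", price=1200)
# Lexicographic comparison: "9" sorts after "100", so the book is wrongly included.
naive_result = naive.select_at_least("price", "100")

padded = Domain("products_padded")
padded.put_item("book", price=pad(9))
padded.put_item("laptop", price=pad(1200))
padded_result = padded.select_at_least("price", pad(100))

print(naive_result, padded_result)
```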
Summary – RDBMS or NoSQL? It depends…
• if you have a low-volume, medium-complexity suite of applications, don’t change it – this is what RDBMSs are good at
• if your data is normalized and relies on joins – don’t move to schemaless NoSQL
• if you’re looking for an off-the-shelf system and don’t want to get involved in customized development – choose RDBMS
• if your problem can’t be resolved using RDBMS [e.g. you have serious scalability issues] and you’re determined to fix it at any cost – go ‘NoSQL’
• if you have access to sufficient quantities of sufficiently smart people – choose NoSQL
20/22
Summary – RDBMS or NoSQL?
‘choose the right tool for the job’
21/22
Questions? 22/22