: what’s all the buzz about?

randi
Presentation Transcript


  1. : what’s all the buzz about?

  2. http://nosql-database.org/
     Next generation databases are:
     • Non-relational
     • Distributed
     • Open-source
     • Horizontally scalable
     Often more characteristics: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data

  3. List of NoSQL databases [122+]
     • Wide Column Store / Column Families: HBase, Cassandra, Hypertable, Cloudata, Cloudera, Amazon SimpleDB
     • Document Stores: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB
     • Key Value / Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, Keyspace, Berkeley DB, MemcacheDB, Faircom C-Tree, Mnesia, LightCloud, Hibari, HamsterDB, STSdb, Pincaster, RaptorDB
     • Eventually Consistent Key Value Stores: Amazon Dynamo, Voldemort, Dynomite, KAI
     • Graph Databases: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink, Virtuoso, VertexDB, FlockDB
     • Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, Caching, ZODB, NEO, PicoLisp, Sterling
     • More and more databases

  4. So what’s wrong with relational databases?

  5. Main principles of RDBMS
     • SQL
     • ACID
       • Atomic: “all or nothing”.
       • Consistent: data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together.
       • Isolated: transactions executing concurrently will not become entangled with each other.
       • Durable: once a transaction has succeeded, the changes will not be lost.
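The "all or nothing" property on the slide above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` helper and the `balances` helper are hypothetical names chosen for illustration, not part of the presentation. The `CHECK` constraint makes the second update fail when funds are insufficient, and the whole transaction is then rolled back.

```python
import sqlite3

# Hypothetical in-memory schema, used only to illustrate atomicity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100)")
conn.execute("INSERT INTO account VALUES ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when src has too little money.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))
```

Calling `transfer(conn, "alice", "bob", 150)` credits bob first, fails on the debit, and leaves both balances unchanged, which is the atomic behaviour the slide describes.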

  6. Shortcomings of RDBMS
     • Transactions under heavy load
     • Complexities of vertical scaling
     • 2 phase commit (2PC) protocol

  7. Sharding
     “If you can’t split it, you can’t scale it” (Randy Shoup, distinguished architect, eBay)
     Sharding approaches:
     • Feature-based sharding, or functional segmentation
     • Key-based sharding
     • Lookup table
     • Shared-nothing, or Cassandra-like sharding
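Two of the approaches listed on the sharding slide, key-based sharding and the lookup table, can be sketched as follows. This is an illustrative sketch only: the shard names, the `LOOKUP_TABLE` contents and both function names are assumptions, and the hash-modulo scheme is one common way to do key-based sharding rather than the method of any particular product.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to servers.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def key_based_shard(key, shards=SHARDS):
    """Key-based sharding: hash the key and take it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# Lookup-table sharding: an explicit directory records where each key lives.
LOOKUP_TABLE = {"customer:42": "shard-2"}


def lookup_shard(key, table=LOOKUP_TABLE, default="shard-0"):
    """Return the shard recorded for `key`, or a default for unknown keys."""
    return table.get(key, default)
```

The trade-off is that the hash needs no extra state but reshuffles most keys when the shard count changes, while the lookup table lets individual keys be moved at the cost of maintaining and querying the directory.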

  8. The real question is not “What’s wrong with relational databases?” but rather, “What problem do you have?”

  9. Brewer’s CAP Theorem: Consistency, Availability, Partition Tolerance

  10. Brewer’s CAP Theorem
      • Consistency + Availability: Relational: MySQL, Oracle, MSSQL
      • Availability + Partition Tolerance: Amazon Dynamo derivatives: Cassandra, Voldemort, Riak, CouchDB
      • Consistency + Partition Tolerance: Neo4j, Google Bigtable and its derivatives: MongoDB, Redis, Hypertable

  11. in 50 words or less
      Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

  12. Cassandra case studies

  13. Cassandra outlines
      • BASE (Basically Available, Soft-state, Eventual consistency) and not ACID (Atomicity, Consistency, Isolation, Durability)
      • Distributed and decentralized
      • Elastic scalability
      • High availability and fault tolerance
      • Tunable consistency

  14. Use cases for Cassandra
      • Large deployments
      • Lots of writes, statistics and analysis
      • Geographical distribution
      • Evolving applications

  15. Writes
      Write path: Write → Commit log and Memtable → (threshold reached) → SSTable, SSTable
      • No reads
      • No seeks
      • Fast
      • Sequential disk access
      • Atomic within a column family
      • Any node
      • Always writable (hinted hand-off)
      • ≈ 0.2 ms
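The write path on the slide above (commit log, memtable, flush to an SSTable once a threshold is reached) can be sketched in a few lines. This is a toy single-node model, not Cassandra's implementation: the class name `WritePathSketch`, the entry-count threshold and the in-memory lists standing in for disk files are all assumptions made for illustration.

```python
class WritePathSketch:
    """Toy model of a log-structured write path on a single node."""

    def __init__(self, memtable_threshold=3):
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # in-memory map of the most recent writes
        self.sstables = []     # immutable, sorted snapshots of past memtables
        self.memtable_threshold = memtable_threshold

    def write(self, key, value):
        # 1. Append to the commit log: sequential access, no reads, no seeks.
        self.commit_log.append((key, value))
        # 2. Update the memtable in memory.
        self.memtable[key] = value
        # 3. Flush once the (assumed) entry-count threshold is reached.
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
```

Because a write only appends to the log and updates memory, it never has to read existing data first, which is why the slide lists "No reads" and "No seeks".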

  16. Reads
      Read path: Read → Memtable, SSTable, SSTable
      • Bloom filter to determine whether a provided key may be in the SSTable
      • Index for quick reads
      • Any node
      • Read repair
      • ≈ 15 ms
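The Bloom filter mentioned on the reads slide answers "is this key possibly in the SSTable?" without touching the disk. Below is a minimal sketch of the data structure; the bit-array size, the number of hash functions and the use of SHA-256 are arbitrary assumptions, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A `False` answer lets a read skip that SSTable entirely, while a `True` answer still requires checking the index and the data.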

  17. The tenets of column-oriented model

  18. Column Family \ Column
      Column: a name-value pair (also contains a timestamp for conflict resolution on the server side).
      • column name : byte[]
      • column value : byte[]
      • timestamp : long
      Column Family: a container for columns sorted by their names. Column Families are referenced and sorted by row keys.
      • row key → column name 1 : column value 1, ..., column name n : column value n
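The column structure above, a name, a value and a timestamp used for conflict resolution, can be sketched as a small record type with a "newest timestamp wins" rule. The names `Column` and `resolve` are hypothetical, and the tie-breaking choice (keep the first argument on equal timestamps) is an assumption made only to keep the sketch deterministic.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: bytes       # column name : byte[]
    value: bytes      # column value : byte[]
    timestamp: int    # timestamp : long, supplied with the write


def resolve(a: Column, b: Column) -> Column:
    """Pick between two versions of the same column: newest timestamp wins."""
    # Assumption: on equal timestamps the first argument is kept.
    return a if a.timestamp >= b.timestamp else b
```

With this rule, replicas that received writes in a different order still converge on the same value once they compare timestamps.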

  19. Super Column Family \ Super Column
      Super Column: a sorted associative array of columns.
      • super column name → column name 1 : column value 1, ..., column name n : column value n
      Super Column Family: a container for super columns sorted by their names. Like Column Families, Super Column Families are referenced and sorted by row keys.
      • row key → super column name 1 (column name 1 : column value 1, ..., column name n1 : column value n1), ..., super column name m (column name 1 : column value 1, ..., column name nm : column value nm)

  20. Addressing Column Family
      • row key → column name 1 : column value 1, ..., column name n : column value n
      • Four-dimensional hash
      • [Keyspace][ColumnFamily][Key][Column]
      Addressing Super Column Family
      • row key → super column name 1 (column name 1 : column value 1, ..., column name n1 : column value n1), ..., super column name m (column name 1 : column value 1, ..., column name nm : column value nm)
      • Five-dimensional hash
      • [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
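The four- and five-dimensional addressing above maps naturally onto nested dictionaries. The sketch below is only a mental model of the addressing scheme: the keyspace, column family, row key and column names are made up, and a real cluster additionally keeps columns sorted and distributes rows across nodes.

```python
# Hypothetical data laid out as [Keyspace][ColumnFamily][Key][...].
store = {
    "Blog": {                                   # keyspace
        "Users": {                              # column family
            "jsmith": {                         # row key
                "email": "jsmith@example.com",  # column name : column value
            },
        },
        "Posts": {                              # super column family
            "jsmith": {                         # row key
                "post-1": {                     # super column
                    "title": "Hello",           # sub-column name : value
                },
            },
        },
    },
}


def get_column(store, keyspace, column_family, key, column):
    """Four-dimensional lookup: [Keyspace][ColumnFamily][Key][Column]."""
    return store[keyspace][column_family][key][column]


def get_subcolumn(store, keyspace, column_family, key, super_column, sub_column):
    """Five-dimensional lookup with the extra [SuperColumn] level."""
    return store[keyspace][column_family][key][super_column][sub_column]
```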

  21. Cassandra client options
      • Thrift (12 different languages)
      • Avro (data serialization system)
      • Java:
        • Hector: http://github.com/rantav/hector (abstraction over Thrift)
        • Pelops: http://github.com/s7/scale7-pelops (abstraction over Thrift)
        • CQL: JDBC driver for Cassandra, starting from version 0.8 (SQL-like language)
        • Hector JPA: https://github.com/riptano/hector-jpa (ORM client)
        • Cassandrelle: http://demoiselle.sf.net/component/demoiselle-cassandra/ (documentation ???)
        • Kundera: http://code.google.com/p/kundera/ (buggy ???)
      • Python: Pycassa, Telephus
      • Grails: grails-cassandra
      • .NET: Aquiles, FluentCassandra
      • Ruby: Cassandra
      • PHP: phpcassa, SimpleCassie

  22. Cassandra \ RDBMS query differences
      • No update query
      • Record-level atomicity on writes
      • No duplicate keys
      • Basic write properties: consistency level (ZERO, ANY, ONE, QUORUM, ALL)
      • Basic read properties: consistency level (ONE, QUORUM, ALL)
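The consistency levels on this slide trade latency against how many replicas must respond. A small sketch of the arithmetic: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts together exceed the replication factor. The function names are hypothetical, and treating ZERO and ANY as requiring zero replica acknowledgements is a simplifying assumption (ANY can be satisfied by a stored hint rather than a replica).

```python
def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Replica acknowledgements needed at a given consistency level."""
    required = {
        "ZERO": 0,  # fire and forget (assumption: no acknowledgement awaited)
        "ANY": 0,   # assumption: a hint may satisfy the write, so no replica is guaranteed
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

For example, with a replication factor of 3, QUORUM reads combined with QUORUM writes always overlap on at least one replica, while ONE combined with ONE does not.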

  23. Integrating Hadoop
      Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large amounts of data in a distributed way.
      • Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
      • Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.
      Other Hadoop-related projects at Apache include:
      • Cassandra™: a scalable multi-master database with no single points of failure.
      • Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
      • Mahout™: a scalable machine learning and data mining library.
      • Pig™: a high-level data-flow language and execution framework for parallel computation.
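The MapReduce model named on the Hadoop slide can be illustrated with a single-process word count. This sketch deliberately does not use the Hadoop API; the function names are hypothetical and the "shuffle" step is a plain in-memory grouping, whereas Hadoop runs the same three phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call depends only on its own document and each reduce call only on its own key, both phases can be spread over many machines, which is what makes the model suit large data sets.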

  24. The end. Questions?
