Prof. Dr. Stefan Edlich: NoSQL in the Cloud

nosqlberlin.de nosqlfrankfurt.de nosql powerdays

http://nosql-database.org

NoSQL is specialization!
• Big Data
• Massive write performance
• Fast KV access
• Write availability
• Flexible schema (migration) + flexible datatypes
• Easier maintainability, administration and operations
• No single point of failure
• Programmer ease of use

Theory?!
• Map/Reduce → Map/Reduce successors!
• ACID / BASE & CAP → P (a partition) is usually not present!
• Consistent Hashing → basis of scalable K/V stores
• MVCC → non-blocking advantages
• Vector Clocks → [122:1] [147:2|122:1] [97:3|147:2|122:1]
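Consistent hashing, named above as the basis of scalable K/V stores, can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular store: the class name `HashRing`, the node names and the choice of 64 virtual nodes per server are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 only spreads keys, it is not used for security)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop all ring points that belong to this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The point of the ring: when a node joins, only the keys that now fall in front of the new node's points change owner; every other key stays where it was.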

Google Protocol Buffers =>

Apache Avro!
• JSON
• Binary data transfer
• automatic RPC generation
• no code generation
• Client + server exchange the schema when it changes
Definitely evaluate it!

Data models

Column Family
Document DBs
Key/Value DBs: Voldemort, Chordless, Scalaris, Dynamo / Dynomite
Graph DBs
Others: db4o, Versant, Objectivity, Gemstone, Progress, Mark Logic, EMC Momentum, Tamino, GigaSpaces, Hazelcast, Terracotta, …

Cassandra HBase SimpleDB

+ scaling = new node + replication + configuration (r, w) - documentation - queries
+ stress-free SaaS solution + transparent scaling - UTF-8 string - data is stored at Amazon +- no tuning / config
+ scaling = new node + community + API - replication - setup, optimization, maintenance
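The "(r, w)" configuration above refers to how many replicas must answer a read (R) and acknowledge a write (W). A minimal sketch of the usual quorum rule for Dynamo-style stores follows, assuming N replicas per key; the function names are made up for this example and do not belong to any product's API.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Number of replicas that may be down while reads / writes can still reach their quorum."""
    return {"reads": n - r, "writes": n - w}
```

With N = 3, the setting (r=2, w=2) overlaps and survives one failed replica for both reads and writes, while (r=1, w=1) is faster but may return stale data.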

Document Databases

any JS client, no middleware! DB + web server + evolving app

2nd round += $6.5 million

• not normalized (duplicates, delete orphans, ...)
• (crash-prone for a configurable time window) (journaling)
• eventually consistent
• real scaling only via sharding
• (not yet kill -9 proof)

67 GB Index Data  EC2 Node 66 GB EC2 Node 66 GB 11 hours + 1 day off

+ not normalized
+ schema agility
+ excellent documentation
+ speed (memory-mapped files)
+ installation + save = 28 sec!
+ arbitrary indexes
+ MapReduce
+ rich query language
+ GridFS (instead of HDFS)
+ simple replication (master-slave / replica sets)

db.system.indexes.find();
db.friends.getIndexes();
db.friends.ensureIndex({friend: 1});
db.friends.ensureIndex({friend: 1, zip: 1}); // compound
db.friends.find({friend: "Mario", zip: "13755"}).explain();
Queries:
age: {$gt: 10}
food: {$all: ["pizza", "noodles"]}
$gt, $lt, $lte, $ne, $in, $nin, $mod, $all, $size, $exists, $type, $or, $elem, $elemMatch, regexp, ...
NoSQL query lock-in?!
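To make the semantics of a few of these operators concrete, here is a small in-memory matcher. It is a sketch under assumptions, not MongoDB's query engine: it covers only top-level fields and a subset of the operators listed above, and the names `matches` and `_OPERATORS` are invented for the example.

```python
_MISSING = object()  # marks a field that is absent from the document

# Subset of the operators from the slide; comparing mismatched types raises TypeError.
_OPERATORS = {
    "$gt": lambda v, a: v > a,
    "$lt": lambda v, a: v < a,
    "$lte": lambda v, a: v <= a,
    "$ne": lambda v, a: v != a,
    "$in": lambda v, a: v in a,
    "$nin": lambda v, a: v not in a,
    "$all": lambda v, a: isinstance(v, list) and all(x in v for x in a),
    "$size": lambda v, a: isinstance(v, list) and len(v) == a,
}


def matches(doc: dict, query: dict) -> bool:
    """Return True if the document satisfies every field condition of the query."""
    for field, cond in query.items():
        value = doc.get(field, _MISSING)
        if isinstance(cond, dict):
            for op, arg in cond.items():
                if op == "$exists":
                    ok = (value is not _MISSING) == bool(arg)
                elif value is _MISSING:
                    # A missing field only satisfies the negative operators.
                    ok = op in ("$ne", "$nin")
                else:
                    ok = _OPERATORS[op](value, arg)
                if not ok:
                    return False
        elif value is _MISSING or value != cond:
            return False  # plain value means equality
    return True
```

A query such as `{"age": {"$gt": 10}}` is then just a nested dictionary, which hints at the lock-in question: the query language is tied to this document-shaped notation.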

Evolving schema
Migration architecture patterns:
A) Blacklist
try { ... } catch (FirstException | SecondException ex) {
    // newName = BlackList.checkName(OldName)
}
rename
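One possible reading of the blacklist pattern, sketched in Python: a read with an outdated field name fails, and the error handler asks a rename table for the current name. The table contents and the names `RENAMED_FIELDS` and `read_field` are hypothetical and only illustrate the idea.

```python
# Hypothetical rename table ("blacklist"): old field name -> new field name.
RENAMED_FIELDS = {"surname": "last_name", "plz": "zip"}


def read_field(doc: dict, name: str):
    """Read a field; if the name is unknown, retry with its renamed successor."""
    try:
        return doc[name]
    except KeyError:
        new_name = RENAMED_FIELDS.get(name)
        if new_name is None or new_name not in doc:
            raise  # not a known rename: report the original error
        return doc[new_name]
```

The stored documents are never rewritten here; the cost of the rename is paid lazily on each failing read.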

B) "Rails" migration: old name → new name, record by record (not if the data is replicated too often)
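The eager variant can be sketched as a batch job that rewrites every record still carrying the old field name. This is a minimal illustration over a list of dictionaries; the function name `rename_field` is an assumption, and a real migration would run against the database rather than an in-memory list.

```python
def rename_field(docs, old_name, new_name):
    """Rename old_name to new_name in every document that still uses the old name.

    Returns the number of documents changed; running it a second time changes nothing.
    """
    changed = 0
    for doc in docs:
        if old_name in doc and new_name not in doc:
            doc[new_name] = doc.pop(old_name)
            changed += 1
    return changed
```

Because every copy of a record has to be rewritten, the cost grows with the number of replicas and duplicates, which fits the caveat on the slide.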

Duplicates = space
Freshness of the data
"Pre-joined" data! "Pre-computed"
• growing data: move it out or pre-space it

Into the cloud…

Clients → mongos (router) + config servers → Shard A, Shard B, Shard C
Each shard: RAM + disk + replica set
Possible arbiter: micro
64 bit [extra | double | quadruple] large
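The routing step in this picture can be sketched as a lookup from a shard-key value to a shard. The sketch below is a toy with fixed key ranges; in MongoDB the router obtains the chunk ranges from the config servers and they change as chunks are split and moved. The class name `RangeRouter` and the boundaries are assumptions for illustration.

```python
import bisect


class RangeRouter:
    """Toy router: maps a shard-key value to a shard via sorted range boundaries."""

    def __init__(self, boundaries, shards):
        # boundaries[i] is the exclusive upper bound of shards[i];
        # the last shard takes everything from the last boundary upward.
        if len(shards) != len(boundaries) + 1:
            raise ValueError("need exactly one more shard than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        self.boundaries = list(boundaries)
        self.shards = list(shards)

    def shard_for(self, key):
        """Return the shard whose key range contains the given key."""
        return self.shards[bisect.bisect_right(self.boundaries, key)]
```

A query that includes the shard key can be sent to one shard this way; a query without it has to be sent to all shards.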

Lessons learned…
• RAID configurations (00, 01, 10, 03, 05, …)
• Journaling file systems (ext4, xfs, …)
• (Security) ports, file descriptors, snapshots, …
• www.mongodb.org/display/DOCS/Amazon+EC2

K/V stores: mapping data structures ->
+ very fast: > 100,000 /sec
+ configurable disk sync
+ API for custom bindings
+ simple replication
+ hash, list, set, sorted set, messages
+ installation: UNIX 38 sec, Windows 18 sec
- cloud cluster not until version 3.*

Sorted Set
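What a sorted set offers can be sketched in plain Python: unique members, each with a score, kept in score order. This is an illustration of the data structure only, not of how a K/V store implements it; the class and method names are assumptions, and ranks are non-negative and inclusive.

```python
import bisect


class SortedSet:
    """Unique members, each with a score, kept ordered by (score, member)."""

    def __init__(self):
        self._score = {}   # member -> score
        self._order = []   # sorted list of (score, member)

    def add(self, member, score):
        """Insert a member or update its score."""
        if member in self._score:
            old = (self._score[member], member)
            self._order.pop(bisect.bisect_left(self._order, old))
        self._score[member] = score
        bisect.insort(self._order, (score, member))

    def rank(self, member):
        """Zero-based position of the member in ascending score order."""
        return bisect.bisect_left(self._order, (self._score[member], member))

    def range_by_rank(self, start, stop):
        """Members from rank start to rank stop, both inclusive."""
        return [member for _, member in self._order[start:stop + 1]]
```

A typical use is a leaderboard: updating a player's score moves the player to the right rank without re-sorting by hand.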

memcached API

• simple dynamic scaling (up & down)
• scales linearly
• proven bulletproof by Zynga.com
• limited membase protocol
• Membase TAP (protocol interception)
• Code-Node:

Membase in the cloud
• Ready-made RightScale & AMI templates
• Open various ports
• DNS entry, and no changing IPs
• Specify the master node
• It sets the quota for the nodes that inherit from it
• Backups for EBS

GraphDBs Property Graph

player

Graph DBs in the cloud
• > N billion nodes? Sharding!
• but usually no "predictable lookup"
• only possible with domain-specific knowledge
• balanced DBs without sweet spots are hardly possible
• access patterns + heuristics (insert sharding / runtime sharding) => partitioning algorithms
• (HA) Neo4j cache sharding!
• multi-master cluster for consistent routing

> 220 DBs: consulting can be thoroughly frustrating…

Data, Transactions, Performance, Queries, Architecture • other non-functional requirements

Analyse your data
Domain data, log data, event data, message data, critical data, business data, meta data, temp data, session data, geo data, etc.
Data / storage model: relational, column-oriented, doc-alike, graphs, objects, etc.
What types / type system? Data navigation, data amount, data complexity (deep XML?)
ACID vs. BASE vs. mixture? CAP decisions
Performance dimension analysis: latency, request behaviour, throughput, scale-up vs. scale-out
Query requirements: typical queries, tools, ad-hoc queries, SQL / LINQ needed, Map/Reduce? …
Distribution architecture: local, parallel, distributed / grid, service, cloud, mobile, p2p, …
Data access patterns: read / write distribution, random / sequential, access design patterns
Non-functional requirements: replication, refactoring frequency, DB support, qualification / simplicity, company restrictions, DB diversity (allowed?), security, safety / backup & restore, crash resistance, licence…

NoSQL conclusion

Definitely assume RAM & SSD! RethinkDB. Gustavo Alonso. Lots of > 1 PT RAM DBs in California! SAP strategy? Service, RAM, cloud, mobile

The DaaS era: for MongoDB alone, well over 100 "Database-as-a-Service" providers! Amazon: SimpleDB, Hadoop, etc.

Many clever hybrid solutions! CouchBase, Hadoop + MySQL

Availability, Ad Hoc Query, OLAP, Database-aaS => best mix!

Non-critical data (view, domain, master, meta, log, …) by Couch, MongoDB, Redis, Membase, …
Critical data (management, payment data, personal data, …) by classic RDBMS, Vertica, VoltDB, Database.com, GenieDB, …
Hadoop*: BI, OLAP, BI analytics
Dwight Merriman (10gen)

Links • nosql-database.org • nosqltapes.com • mynosql.com

Thanks for listening! http://edlich.de Discussion!

Functional (graph) decomposition? Or… protective patent → Group By
Use case: aggregate
pi -> 10^15 -> 1000 cluster
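The "Group By / aggregate" use case can be sketched as a single-process Map/Reduce. This is only an illustration of the programming model; a real framework distributes the map, shuffle and reduce phases across a cluster. The function name `map_reduce` and the sample records in the usage note are assumptions.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):    # map phase: emit (key, value) pairs
            groups[key].append(value)        # shuffle phase: group by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce phase
```

For example, a mapper that emits `(order["city"], order["amount"])` combined with a reducer that sums the values yields the total per city, which is exactly a GROUP BY with SUM.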

Programming is great! Programming is annoying!
Wonderfully parallelizable
Only `large data indexing`
"A giant step back! Incompatible, missing features, not new, …" (Stonebraker)
Strong competition: Stratosphere (TUB), ePic, SwissBox, etc.

Parallelization Contracts: Cross, Map, Match, CoGroup, Reduce, graph ops and much more… => compile, analyze, optimize on a breathing cloud!

Eventually consistent: ACID, WATER, BASE • Amazon Dynamo • MySQL replication
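Amazon Dynamo, listed above, uses vector clocks to tell whether two versions of a value are ordered or conflicting. A minimal sketch follows, assuming a clock is a dictionary from node name to counter; the node names in the usage note are hypothetical and the bracket notation from the theory slide is not reproduced here.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'after', 'before' or 'concurrent'."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "concurrent"  # neither version contains the other: a conflict to resolve


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum, used after a conflict has been reconciled."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

If two nodes "B" and "C" both update a version written by "A", their clocks compare as concurrent, and the merged clock descends from both.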
