Explore non-relational, distributed databases such as key-value, document, and graph stores, focusing on scalability, evolution, and the CAP Theorem in modern data management systems. Understand the ACID properties, the 2PC protocol, and the BASE consistency model in NoSQL systems.
Database Systems: 10 NoSQL Systems. Matthias Boehm, Graz University of Technology, Austria; Computer Science and Biomedical Engineering; Institute of Interactive Systems and Data Science; BMVIT endowed chair for Data Management. Last update: May 20, 2019
Announcements/Org • #1 Video Recording • Since lecture 03, video/audio recording • Link in TeachCenter & TUbe • #2 Exercises • Exercise 1 graded, feedback in TC, office hours • Exercise 2 in progress of being graded • Exercise 3 published, due Jun 04, 11.59pm • #3 Exam Dates • Jun 24, 4pm, HS i13 • Jun 27, 4pm, HS i13 • Jun 27, 7.30pm, HS i13 • Additional dates for repetition (beginning of WS19) • Exam starts +10 min, working time: 90 min (no lecture materials)
SQL vs NoSQL Motivation • #1 Data Models/Schema • Non-relational: key-value, graph, document, time series (logs, social media, documents/media, sensors) • Impedance mismatch / complexity • Pay-as-you-go / schema-free (flexible/implicit) • #2 Scalability • Scale-up vs simple scale-out • Horizontal partitioning (sharding) and scaling • Commodity hardware, network, disks ($) • NoSQL Evolution • Late 2000s: non-relational, distributed, open-source DBMSs • Early 2010s: NewSQL: modern, distributed, relational DBMSs • Not Only SQL: combination of relational techniques (RDBMS) and specialized systems (consistency/data models) [Credit: http://nosql-database.org/]
Agenda • Consistency and Data Models • Key-Value Stores • Document Stores • Graph Processing • Time Series Databases • Lack of standards and imprecise classification [Wolfram Wingerath, Felix Gessert, Norbert Ritter: NoSQL & Real-Time Data Management in Research & Practice. BTW 2019]
Consistency and Data Models Recap: ACID Properties • Atomicity • A transaction is executed atomically (completely or not at all) • If the transaction fails/aborts, no changes are made to the database (UNDO) • Consistency • A successful transaction ensures that all consistency constraints are met (referential integrity, semantic/domain constraints) • Isolation • Concurrent transactions are executed in isolation from each other • Appearance of serial transaction execution • Durability • Guaranteed persistence of all changes made by a successful transaction • In case of system failures, the database is recoverable (REDO); see the sketch below
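A minimal sketch of atomicity from the application side, using Python's built-in sqlite3 module (the accounts table and transfer data are hypothetical examples, not from the lecture): a transaction that fails mid-way is rolled back completely (UNDO), a committed one is persisted.

  import sqlite3

  conn = sqlite3.connect("bank.db")   # hypothetical example database
  conn.execute("CREATE TABLE IF NOT EXISTS accounts(id INTEGER PRIMARY KEY, balance INTEGER)")
  conn.execute("INSERT OR REPLACE INTO accounts VALUES (1, 100), (2, 0)")
  conn.commit()

  try:
      with conn:  # transaction scope: commit on success, rollback on exception
          conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
          raise RuntimeError("failure mid-transaction")  # simulated abort
  except RuntimeError:
      pass

  # atomicity: the partial debit was undone, account 1 still holds 100
  print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())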
Consistency and Data Models Two-Phase Commit (2PC) Protocol • Distributed TX Processing • N nodes with logically related but physically distributed data (e.g., vertical data partitioning) • Distributed TX processing to ensure a consistent view (atomicity/durability) • Two-Phase Commit (via 2N messages) • Phase 1 PREPARE: check for successful completion, logging • Phase 2 COMMIT: release locks and other cleanups • Problem: blocking protocol • Excursus: Wedding Analogy • Coordinator: marriage registrar • Phase 1: ask for willingness • Phase 2: if all are willing, declare the marriage • [Diagram: coordinator connected to participants DBS 1–DBS 4]
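A single-process sketch of the 2PC control flow under simplifying assumptions (in-memory participants; no logging, timeouts, or recovery; all class and function names are hypothetical):

  class Participant:
      # hypothetical 2PC participant; prepare() casts its commit vote
      def __init__(self, name, can_commit=True):
          self.name, self.can_commit = name, can_commit
          self.state = "INIT"
      def prepare(self):   # phase 1: local validation and logging, then vote
          self.state = "PREPARED" if self.can_commit else "ABORTED"
          return self.can_commit
      def commit(self):    # phase 2: make changes durable, release locks
          self.state = "COMMITTED"
      def abort(self):     # phase 2 alternative: undo local changes
          self.state = "ABORTED"

  def two_phase_commit(participants):
      votes = [p.prepare() for p in participants]  # phase 1 PREPARE (N messages)
      if all(votes):
          for p in participants:                   # phase 2 COMMIT (N messages)
              p.commit()
          return "COMMITTED"
      for p in participants:                       # phase 2 ABORT on any "no" vote
          p.abort()
      return "ABORTED"

  dbs = [Participant("DBS %d" % i) for i in range(1, 5)]
  print(two_phase_commit(dbs))   # COMMITTED (participants block if the coordinator fails in between)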
Consistency and Data Models CAP Theorem • Consistency • Visibility of updates to distributed data (atomic or linearizable consistency) • Different from ACID's consistency in terms of integrity constraints • Availability • Responsiveness of a service (clients reach the available service, read/write) • Partition Tolerance • Tolerance of temporarily unreachable network partitions • System characteristics (e.g., latency) maintained • CAP Theorem: “You can have AT MOST TWO of these properties for a networked shared-data system.” [Eric A. Brewer: Towards robust distributed systems (abstract). PODC 2000] • Proof: [Seth Gilbert, Nancy A. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 2002]
Consistency and Data Models CAP Theorem, cont. • CA: Consistency & Availability (ACID single node) • Network partitions cannot be tolerated • Visibility of updates (consistency) in conflict with availability, so no distributed system • CP: Consistency & Partition Tolerance (ACID distributed) • Availability cannot be guaranteed • On connection failure, unavailable (wait for the overall system to become consistent) • AP: Availability & Partition Tolerance (BASE consistency model) • Consistency cannot be guaranteed, use of optimistic strategies • Simple to implement, main concern: availability to ensure revenue ($$$)
Consistency and Data Models BASE Properties • Basically Available • Major focus on availability, potentially with outdated data • No guarantee on global data consistency across the entire system • Soft State • Even without explicit state updates, the data might change due to asynchronous propagation of updates and nodes that become available • Eventual Consistency • Updates are eventually propagated; the system would reach a consistent state if no further updates arrive and network partitions are fixed • No temporal guarantees on when changes are propagated
Consistency and Data Models Eventual Consistency [Peter Bailis, Ali Ghodsi: Eventual consistency today: limitations, extensions, and beyond. Commun. ACM 2013] • Basic Concept • Changes made to a copy eventually migrate to all replicas • If update activity stops, replicas converge to a logically equivalent state • Metric: time to reach consistency (probabilistic bounded staleness) • #1 Monotonic Read Consistency • After reading data object A, the client never reads an older version • #2 Monotonic Write Consistency • After writing data object A, it will never be replaced with an older version • #3 Read Your Own Writes / Session Consistency • After writing data object A, a client never reads an older version • #4 Causal Consistency • If client 1 communicated to client 2 that data object A has been updated, subsequent reads on client 2 return the new value
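As an illustration of convergence (a toy merge policy chosen here, not from the lecture), a last-writer-wins sketch: once the update has propagated and update activity stops, the replicas hold logically equivalent state.

  import time

  class Replica:
      # hypothetical replica storing (value, timestamp), merging by last-writer-wins
      def __init__(self):
          self.value, self.ts = None, 0.0
      def write(self, value):
          self.value, self.ts = value, time.time()
      def merge(self, other):   # anti-entropy: adopt the newer version
          if other.ts > self.ts:
              self.value, self.ts = other.value, other.ts

  r1, r2 = Replica(), Replica()
  r1.write("Graz")              # the update initially reaches only replica r1
  r2.merge(r1)                  # ... and is eventually propagated to r2
  assert r1.value == r2.value   # replicas converged to the same state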
Key-Value Stores Motivation and Terminology • Motivation • Basic key-value mapping via a simple API (more complex data models can be mapped to key-value representations) • Reliability at massive scale on commodity HW (cloud computing) • System Architecture • Key-value maps, where values can be of a variety of data types • APIs for CRUD operations (create, read, update, delete) • Scalability via sharding (horizontal partitioning) • Example key-value pairs: users:1:a → “Inffeldgasse 13, Graz”, users:1:b → “[12, 34, 45, 67, 89]”, users:2:a → “Mandellstraße 12, Graz”, users:2:b → “[12, 212, 3212, 43212]” • Example Systems • Dynamo (2007, AP) → Amazon DynamoDB (2012) [Giuseppe DeCandia et al: Dynamo: Amazon's highly available key-value store. SOSP 2007] • Redis (2009, CP/AP)
Key-Value Stores Example Systems • Redis Data Types • Redis is not a plain KV store, but a “data structure server” with a persistent log (appendfsync no/everysec/always) • Key: ASCII string (max 512MB, common key schemes: comment:1234:reply.to) • Values: strings, lists, sets, sorted sets, hashes (map of string-string), etc • Redis APIs (see the sketch below) • SET/GET/DEL: insert a key-value pair, lookup value by key, or delete by key • MSET/MGET: insert or lookup multiple keys at once • INCRBY/DECRBY: increment/decrement counters • Others: EXISTS, LPUSH, LPOP, LRANGE, LTRIM, LLEN, etc • Other Systems • Classic KV stores (AP): Riak, Aerospike, Voldemort, LevelDB, RocksDB, FoundationDB, Memcached • Wide-column stores: Google BigTable (CP), Apache HBase (CP), Apache Cassandra (AP)
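A short usage sketch of the listed Redis commands via the redis-py client (assumes a Redis server on localhost:6379; all keys and values are hypothetical):

  import redis

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)

  r.set("users:1:a", "Inffeldgasse 13, Graz")  # SET: insert a key-value pair
  print(r.get("users:1:a"))                    # GET: lookup value by key

  r.mset({"users:2:a": "Mandellstrasse 12, Graz",   # MSET: insert multiple keys
          "users:2:b": "[12, 212, 3212, 43212]"})
  print(r.mget("users:1:a", "users:2:a"))           # MGET: lookup multiple keys

  r.incrby("pageviews", 5)            # INCRBY: increment a counter by 5
  r.decrby("pageviews", 2)            # DECRBY: decrement a counter by 2

  r.lpush("jobs", "job1", "job2")     # LPUSH: prepend elements to a list
  print(r.lrange("jobs", 0, -1))      # LRANGE: read a list slice
  r.delete("users:1:a")               # DEL: delete by key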
Key-Value Stores Log-structured Merge Trees [Patrick E. O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth J. O'Neil: The Log-Structured Merge-Tree (LSM-Tree). Acta Inf. 1996] • LSM Overview • Many KV stores rely on LSM-trees as their storage engine (e.g., BigTable, DynamoDB, LevelDB, Riak, RocksDB, Cassandra, HBase) • Approach: buffer writes in memory, flush data as sorted runs to storage, merge runs into larger runs of the next level (compaction) • System Architecture • Writes go to the in-memory buffer C0 (max capacity T) • Reads check both C0 and the on-disk storage C1 • Compaction (rolling merge): sort and merge runs, including deduplication
Key-Value Stores Log-structured Merge Trees, cont. • LSM Tiering (write-optimized) • Keep up to T-1 runs per level L • Merge all runs of Li into 1 run of Li+1 • LSM Leveling (read-optimized) • Keep 1 run per level L • Merge the run of Li with Li+1 [Niv Dayan: Log-Structured-Merge Trees, Comp115 guest lecture, 2017]
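A toy in-memory sketch of the LSM write path (leveling with a single run, Python lists instead of on-disk runs, a hypothetical capacity constant; cascading merges across multiple levels are omitted):

  MEMTABLE_CAPACITY = 4   # hypothetical max entries in C0 before a flush

  class ToyLSM:
      def __init__(self):
          self.memtable = {}   # C0: in-memory write buffer (key -> value)
          self.run = []        # C1: one sorted on-disk run (list of (key, value))
      def put(self, key, value):
          self.memtable[key] = value
          if len(self.memtable) >= MEMTABLE_CAPACITY:
              self.flush()
      def flush(self):
          # compaction: sort C0, merge into C1 with deduplication (newer wins)
          merged = dict(self.run)
          merged.update(self.memtable)
          self.run = sorted(merged.items())
          self.memtable = {}
      def get(self, key):
          if key in self.memtable:           # reads check C0 first ...
              return self.memtable[key]
          return dict(self.run).get(key)     # ... then the on-disk run C1

  lsm = ToyLSM()
  for i in range(10):
      lsm.put("k%d" % i, i)
  print(lsm.get("k3"))   # 3, served from the flushed and compacted run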
Document Stores Recap: JSON (JavaScript Object Notation) • JSON Data Model • Data exchange format for semi-structured data • Not as verbose as XML (especially for arrays) • Popular format (e.g., Twitter) • Example document: {"students": [{"id": 1, "courses": [{"id": "INF.01014UF", "name": "Databases"}, {"id": "706.550", "name": "AMLS"}]}, {"id": 5, "courses": [{"id": "706.004", "name": "Databases 1"}]}]} • Query Languages • Most common: libraries for tree traversal and data extraction • JSONiq: XQuery-like query language [http://www.jsoniq.org/docs/JSONiq/html-single/index.html] • JSONPath: XPath-like query language • JSONiq example: declare option jsoniq-version “…”; for $x in collection(“students”) where $x.id lt 10 let $c := count($x.courses) return {“sid”: $x.id, “count”: $c}
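For comparison, the same extraction as the JSONiq query via plain tree traversal in Python (using only the standard json module on the students document from above):

  import json

  doc = json.loads('''{"students": [
      {"id": 1, "courses": [
          {"id": "INF.01014UF", "name": "Databases"},
          {"id": "706.550", "name": "AMLS"}]},
      {"id": 5, "courses": [
          {"id": "706.004", "name": "Databases 1"}]}]}''')

  # for each student with id < 10, return the id and the number of courses
  result = [{"sid": s["id"], "count": len(s["courses"])}
            for s in doc["students"] if s["id"] < 10]
  print(result)   # [{'sid': 1, 'count': 2}, {'sid': 5, 'count': 1}]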
Document Stores Motivation and Terminology • Motivation • Application-oriented management of structured, semi-structured, and unstructured information (pay-as-you-go, schema evolution) • Scalability via parallelization on commodity HW (cloud computing) • System Architecture • Collections of (key, document) pairs, e.g., 1234 → {customer: ”Jane Smith”, items: [{name: ”P1”, price: 49}, {name: ”P2”, price: 19}]}, 1756 → {customer: ”John Smith”, ...}, 989 → {customer: ”Jane Smith”, ...} • Scalability via sharding (horizontal partitioning) • Custom SQL-like or functional query languages • Example Systems • MongoDB (C++, 2007, CP) → RethinkDB, Espresso, Amazon DocumentDB (Jan 2019) • CouchDB (Erlang, 2005, AP) → CouchBase
Document Stores Example MongoDB [Credit: https://api.mongodb.com/python/current]
• Creating a collection:
  import pymongo as m
  conn = m.MongoClient("mongodb://localhost:123/")
  db = conn["dbs19"]        # database dbs19
  cust = db["customers"]    # collection customers
• Inserting into a collection:
  mdict = {"name": "Jane Smith", "address": "Inffeldgasse 13, Graz"}
  id = cust.insert_one(mdict).inserted_id
  # ids = cust.insert_many(mlist).inserted_ids
• Querying a collection:
  print(cust.find_one({"_id": id}))
  ret = cust.find({"name": "Jane Smith"})
  for x in ret:
      print(x)
Graph Processing Motivation and Terminology • Ubiquitous Graphs • Domains: social networks, open/linked data, knowledge bases, bioinformatics • Applications: influencer analysis, ranking, topology analysis • Terminology • Graph G = (V, E) of vertices V (set of nodes) and edges E (set of links between nodes) • Different types of graphs: undirected graph, directed graph, multigraph, labeled graph, data/property graph (vertices/edges carry labels and key-value properties, e.g., a Gene vertex with an "interacts" edge annotated k1=v1, k2=v2)
Graph Processing Terminology and Graph Characteristics • Terminology, cont. • Path: sequence of edges and vertices (walk: allows repeated edges/vertices) • Cycle: closed walk, i.e., a walk that starts and ends at the same vertex • Clique: subgraph of vertices where every two distinct vertices are adjacent • Metrics (see the sketch below) • Degree (in/out-degree): number of incoming/outgoing edges of a vertex (e.g., in-degree 3, out-degree 2) • Diameter: maximum distance over all pairs of vertices (longest shortest-path) • Power Law Distribution • The degree distribution of most real graphs follows a power law (e.g., 80-20 rule: tall head, long tail)
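A small sketch of the degree and diameter metrics on a directed graph stored as adjacency lists (plain Python with BFS; the example graph is hypothetical):

  from collections import deque

  G = {1: [2, 3], 2: [3], 3: [4], 4: [1]}   # hypothetical directed graph

  out_deg = {v: len(nbrs) for v, nbrs in G.items()}   # outgoing edges per vertex
  in_deg = {v: 0 for v in G}                          # incoming edges per vertex
  for nbrs in G.values():
      for w in nbrs:
          in_deg[w] += 1

  def bfs_dist(G, src):
      # shortest-path distances from src via breadth-first search
      dist, queue = {src: 0}, deque([src])
      while queue:
          v = queue.popleft()
          for w in G[v]:
              if w not in dist:
                  dist[w] = dist[v] + 1
                  queue.append(w)
      return dist

  # diameter: the longest shortest-path over all reachable vertex pairs
  diameter = max(d for v in G for d in bfs_dist(G, v).values())
  print(in_deg, out_deg, diameter)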
Graph Processing Vertex-Centric Processing [Grzegorz Malewicz et al: Pregel: a system for large-scale graph processing. SIGMOD 2010] • Google Pregel • Name: Seven Bridges of Koenigsberg (Euler 1736) • “Think-like-a-vertex” computation model • Iterative processing in super steps, communication via message passing • Programming Model • Represent the graph as a collection of vertices with edge (adjacency) lists, partitioned across workers (e.g., vertex 2 with adjacency list [1, 3, 4] on worker 1) • Implement algorithms via the Vertex API:
  public abstract class Vertex {
    public String getID();
    public long superstep();
    public VertexValue getValue();
    public void compute(Iterator<Message> msgs);
    public void sendMsgTo(String v, Message msg);
    public void voteToHalt();
  }
• Terminate if all vertices halted / no more messages
Graph Processing Vertex-Centric Processing, cont. • Example 1: Connected Components • Determine connected components of a graph (subgraphs of connected nodes) • Propagate max(current, msgs) to neighbors if != current, terminate if no msgs • [Figure: component-ID propagation over super steps 0–3 until convergence] • Example 2: PageRank [Credit: https://en.wikipedia.org/wiki/PageRank] • Ranking of webpages by importance/impact • #1: Initialize vertices to 1/numVertices() • #2: In each super step • Compute current vertex value: value = 0.15/numVertices() + 0.85*sum(msg) • Send to all neighbors: value/numOutgoingEdges() (see the sketch below)
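A compact sequential simulation of the PageRank super-step loop from above (synchronous message passing in plain Python; the graph and the number of super steps are hypothetical, and dangling vertices are not handled):

  G = {1: [2, 3], 2: [3], 3: [1], 4: [3]}   # hypothetical directed graph
  N = len(G)

  value = {v: 1.0 / N for v in G}   # #1 initialize vertices to 1/numVertices()

  for superstep in range(20):       # #2 super steps until (near) convergence
      msgs = {v: [] for v in G}
      for v, nbrs in G.items():     # send value/numOutgoingEdges() to neighbors
          for w in nbrs:
              msgs[w].append(value[v] / len(nbrs))
      # recompute vertex values: 0.15/numVertices() + 0.85 * sum(msgs)
      value = {v: 0.15 / N + 0.85 * sum(msgs[v]) for v in G}

  print(value)   # vertex 3 accumulates the highest rank in this example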
Graph Processing Graph-Centric Processing • Motivation • Exploit graph structure for algorithm-specific optimizations (number of network messages, scheduling overhead for super steps) • Large diameter / average vertex degree • Programming Model • Partition the graph into subgraphs (block/graph) • Implement the algorithm directly against subgraphs (internal and boundary nodes) • Exchange messages in super steps only between boundary nodes, yielding faster convergence [Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson: From "Think Like a Vertex" to "Think Like a Graph". PVLDB 2013] [Da Yan, James Cheng, Yi Lu, Wilfred Ng: Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs. PVLDB 2014]
Graph Processing Resource Description Framework (RDF) • RDF Data • Data and meta data description via triples • Triple: (subject, predicate, object) • Triple components can be URIs or literals • Example triples: (uri2:T.Mueller, rdf#type, uri3:Player), (uri2:T.Mueller, uri4#worksFor, uri1:BayernMunich), (uri2:T.Mueller, uri4#age, 29) • Formats: e.g., RDF/XML, RDF/JSON, Turtle • An RDF graph is a directed, labeled multigraph • Querying RDF Data • SPARQL (SPARQL Protocol And RDF Query Language) • Subgraph matching, e.g.:
  SELECT ?person WHERE {
    ?person rdf:type uri3:Player ;
            uri4:worksFor uri1:BayernMunich .
  }
• Selected Example Systems
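A hedged sketch of the example with the rdflib Python library (the example.org namespace and resource names are hypothetical stand-ins for the slide's uri1–uri4 prefixes):

  from rdflib import Graph, Literal, Namespace, RDF

  EX = Namespace("http://example.org/")   # hypothetical namespace
  g = Graph()

  # triples: (subject, predicate, object)
  g.add((EX.T_Mueller, RDF.type, EX.Player))
  g.add((EX.T_Mueller, EX.worksFor, EX.BayernMunich))
  g.add((EX.T_Mueller, EX.age, Literal(29)))

  # SPARQL subgraph matching, analogous to the query above
  q = """SELECT ?person WHERE {
           ?person rdf:type ex:Player ;
                   ex:worksFor ex:BayernMunich . }"""
  for row in g.query(q, initNs={"rdf": RDF, "ex": EX}):
      print(row.person)   # http://example.org/T_Mueller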
Graph Processing Example Systems • Understanding Use in Practice • Types of graphs users have • Graph computations run • Types of graph systems used (e.g., GraphX) [Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, M. Tamer Özsu: The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB 2017] • Summary of State-of-the-Art Runtime Techniques [Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, James Cheng: Big Graph Analytics Systems. SIGMOD 2016]
Time Series Databases Motivation and Terminology • Ubiquitous Time Series • Domains: Internet-of-Things (IoT), sensor networks, smart production/planet, telemetry, stock trading, server/application metrics, event/log streams • Applications: monitoring, anomaly detection, time series forecasting • Dedicated storage and analysis techniques, hence specialized systems • Terminology • A time series X is a sequence of data points xi for a specific measurement identity (e.g., sensor) and time granularity • Regular (equidistant) time series (xi), e.g., one point per second, vs irregular time series (ti, xi)
Time Series Databases Example InfluxDB [Paul Dix: InfluxDB Storage Engine Internals, CMU Seminar, 09/2017] • Input Data (measurement, tags, attribute values, time):
  cpu,region=west,host=A user=85,sys=2,idle=10 1443782126
• System Architecture • Written in Go, originally a key-value store, now a dedicated storage engine • Time Structured Merge Tree (TSM), similar to LSM • Organized in shards, TSM indexes and an inverted index for reads • Write path: append-only WAL (fsync), in-memory index, periodic flushes to TSM files, compaction & compression, periodic drop of shards (files) according to the retention policy • TSM file layout: Header | Blocks | Index | Footer, with index entries KeyLen | Key | Type | Min T | Max T | Off | …
Time Series Databases Example InfluxDB, cont. • Compression (of blocks) • Compress up to 1000 values per block (Type | Len | Timestamps | Values) • Timestamps: run-length encoding for regular time series; Simple8B or uncompressed for irregular • Values: double delta for FP64, bits for Bool, double delta + zig zag for INT64, Snappy for strings (see the sketch below) • Query Processing • SQL-like and functional APIs for filtering (e.g., range) and aggregation:
  SELECT percentile(90, user) FROM cpu
  WHERE time > now() - 12h AND "region" = 'west'
  GROUP BY time(10m), host
• Inverted indexes • Posting lists: cpu [1,2,3,4,5,6], host=A [1,2,3], host=B [4,5,6], region=west [1,2,3] • Measurement to fields: cpu [user, sys, idle], host [A, B], region [west, east]
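A small illustration of the two integer techniques named above, double delta (delta-of-delta) and zig-zag encoding, in plain Python (illustrative only, not InfluxDB's actual encoder; the final bit-packing of the codes is omitted):

  def zigzag(n):
      # map signed to unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4 (small magnitude -> small code)
      return (n << 1) ^ (n >> 63)

  def double_delta(values):
      # delta-of-delta: near-constant slopes (regular series) yield runs of zeros
      deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
      return [deltas[0]] + [deltas[i] - deltas[i - 1] for i in range(1, len(deltas))]

  ts = [1000, 1010, 1020, 1030, 1041]   # near-equidistant timestamps
  dods = double_delta(ts)               # [10, 0, 0, 1] -> highly compressible
  print([zigzag(d) for d in dods])      # non-negative codes ready for bit-packing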
Time Series Databases Other Systems • Prometheus • Metrics, high-dimensional data model, sharding and federation, custom storage and query engine, implemented in Go • OpenTSDB • TSDB on top of HBase or Google BigTable, Hadoop • TimescaleDB • TSDB on top of PostgreSQL, standard SQL and reliability • Druid • Column-oriented storage for time series, OLAP, and search • IBM Event Store • HTAP system for high data ingest rates, and data-parallel analytics via Spark • Shard-local logs, groomed data [Ronald Barber et al: Evolving Databases for New-Gen Big Data Applications. CIDR 2017]
Conclusions and Q&A • Summary 10 NoSQL Systems • Consistency and Data Models • Key-Value and Document Stores • Graph and Time Series Databases • Next Lectures (Part B: Modern Data Management) • 11 Distributed file systems and object storage [May 27] • 12 Data-parallel computation (MapReduce, Spark) [May 27] • 13 Data stream processing systems [Jun 03] • Jun 17: Q&A and exam preparation