Explore non-relational, distributed databases such as key-value, document, and graph stores, focusing on scalability, evolution, and the CAP Theorem in modern data management systems. Understand the ACID properties, the 2PC protocol, and the BASE consistency model in NoSQL systems.
Database Systems: 10 NoSQL Systems. Matthias Boehm, Graz University of Technology, Austria; Computer Science and Biomedical Engineering; Institute of Interactive Systems and Data Science; BMVIT endowed chair for Data Management. Last update: May 20, 2019
Announcements/Org • #1 Video Recording • Since lecture 03, video/audio recording • Link in TeachCenter & TUbe • #2 Exercises • Exercise 1 graded, feedback in TC, office hours • Exercise 2 in progress of being graded • Exercise 3 published, due Jun 04, 11.59pm • #3 Exam Dates • Jun 24, 4pm, HS i13 • Jun 27, 4pm, HS i13 • Jun 27, 7.30pm, HS i13 • Additional dates for repetition (beginning of WS19) • Exam starts +10 min, working time: 90 min (no lecture materials)
SQL vs NoSQL Motivation • #1 Data Models/Schema • Non-relational: key-value, graph, document, time series (logs, social media, documents/media, sensors) • Impedance mismatch / complexity • Pay-as-you-go / schema-free (flexible/implicit) • #2 Scalability • Scale-up vs simple scale-out • Horizontal partitioning (sharding) and scaling • Commodity hardware, network, disks ($) • NoSQL Evolution • Late 2000s: non-relational, distributed, open-source DBMSs • Early 2010s: NewSQL: modern, distributed, relational DBMSs • Not Only SQL: combination of relational techniques (RDBMS) and specialized systems (consistency/data models) [Credit: http://nosql-database.org/]
Agenda • Consistency and Data Models • Key-Value Stores • Document Stores • Graph Processing • Time Series Databases • Lack of standards and imprecise classification [Wolfram Wingerath, Felix Gessert, Norbert Ritter: NoSQL & Real-Time Data Management in Research & Practice. BTW 2019]
Consistency and Data Models Recap: ACID Properties • Atomicity • A transaction is executed atomically (completely or not at all) • If the transaction fails/aborts, no changes are made to the database (UNDO) • Consistency • A successful transaction ensures that all consistency constraints are met (referential integrity, semantic/domain constraints) • Isolation • Concurrent transactions are executed in isolation from each other • Appearance of serial transaction execution • Durability • Guaranteed persistence of all changes made by a successful transaction • In case of system failures, the database is recoverable (REDO); see the sketch below
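A minimal sketch of atomicity from the application side, using Python's built-in sqlite3 module (the accounts table and transfer data are hypothetical examples, not from the lecture): a transaction that fails mid-way is rolled back completely (UNDO), a committed one is persisted.

  import sqlite3

  conn = sqlite3.connect("bank.db")   # hypothetical example database
  conn.execute("CREATE TABLE IF NOT EXISTS accounts(id INTEGER PRIMARY KEY, balance INTEGER)")
  conn.execute("INSERT OR REPLACE INTO accounts VALUES (1, 100), (2, 0)")
  conn.commit()

  try:
      with conn:  # transaction scope: commit on success, rollback on exception
          conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
          raise RuntimeError("failure mid-transaction")  # simulated abort
  except RuntimeError:
      pass

  # atomicity: the partial debit was undone, account 1 still holds 100
  print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())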
Consistency and Data Models Two-Phase Commit (2PC) Protocol • Distributed TX Processing • N nodes with logically related but physically distributed data (e.g., vertical data partitioning) • Distributed TX processing to ensure a consistent view (atomicity/durability) • Two-Phase Commit (via 2N messages) • Phase 1 PREPARE: check for successful completion, logging • Phase 2 COMMIT: release locks and other cleanups • Problem: blocking protocol • Excursus: Wedding Analogy • Coordinator: marriage registrar • Phase 1: ask for willingness • Phase 2: if all are willing, declare the marriage • [Diagram: coordinator connected to participants DBS 1–DBS 4]
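A single-process sketch of the 2PC control flow under simplifying assumptions (in-memory participants; no logging, timeouts, or recovery; all class and function names are hypothetical):

  class Participant:
      # hypothetical 2PC participant; prepare() casts its commit vote
      def __init__(self, name, can_commit=True):
          self.name, self.can_commit = name, can_commit
          self.state = "INIT"
      def prepare(self):   # phase 1: local validation and logging, then vote
          self.state = "PREPARED" if self.can_commit else "ABORTED"
          return self.can_commit
      def commit(self):    # phase 2: make changes durable, release locks
          self.state = "COMMITTED"
      def abort(self):     # phase 2 alternative: undo local changes
          self.state = "ABORTED"

  def two_phase_commit(participants):
      votes = [p.prepare() for p in participants]  # phase 1 PREPARE (N messages)
      if all(votes):
          for p in participants:                   # phase 2 COMMIT (N messages)
              p.commit()
          return "COMMITTED"
      for p in participants:                       # phase 2 ABORT on any "no" vote
          p.abort()
      return "ABORTED"

  dbs = [Participant("DBS %d" % i) for i in range(1, 5)]
  print(two_phase_commit(dbs))   # COMMITTED (participants block if the coordinator fails in between)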
Consistency and Data Models CAP Theorem • Consistency • Visibility of updates to distributed data (atomic or linearizable consistency) • Different from ACID's consistency in terms of integrity constraints • Availability • Responsiveness of a service (clients reach the available service, read/write) • Partition Tolerance • Tolerance of temporarily unreachable network partitions • System characteristics (e.g., latency) maintained • CAP Theorem: “You can have AT MOST TWO of these properties for a networked shared-data system.” [Eric A. Brewer: Towards robust distributed systems (abstract). PODC 2000] • Proof: [Seth Gilbert, Nancy A. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 2002]
Consistency and Data Models CAP Theorem, cont. • CA: Consistency & Availability (ACID single node) • Network partitions cannot be tolerated • Visibility of updates (consistency) in conflict with availability, so no distributed system • CP: Consistency & Partition Tolerance (ACID distributed) • Availability cannot be guaranteed • On connection failure, unavailable (wait for the overall system to become consistent) • AP: Availability & Partition Tolerance (BASE consistency model) • Consistency cannot be guaranteed, use of optimistic strategies • Simple to implement, main concern: availability to ensure revenue ($$$)
Consistency and Data Models BASE Properties • Basically Available • Major focus on availability, potentially with outdated data • No guarantee on global data consistency across the entire system • Soft State • Even without explicit state updates, the data might change due to asynchronous propagation of updates and nodes that become available • Eventual Consistency • Updates are eventually propagated; the system would reach a consistent state if no further updates arrive and network partitions are fixed • No temporal guarantees on when changes are propagated
Consistency and Data Models Eventual Consistency [Peter Bailis, Ali Ghodsi: Eventual consistency today: limitations, extensions, and beyond. Commun. ACM 2013] • Basic Concept • Changes made to a copy eventually migrate to all replicas • If update activity stops, replicas converge to a logically equivalent state • Metric: time to reach consistency (probabilistic bounded staleness) • #1 Monotonic Read Consistency • After reading data object A, the client never reads an older version • #2 Monotonic Write Consistency • After writing data object A, it will never be replaced with an older version • #3 Read Your Own Writes / Session Consistency • After writing data object A, a client never reads an older version • #4 Causal Consistency • If client 1 communicated to client 2 that data object A has been updated, subsequent reads on client 2 return the new value
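As an illustration of convergence (a toy merge policy chosen here, not from the lecture), a last-writer-wins sketch: once the update has propagated and update activity stops, the replicas hold logically equivalent state.

  import time

  class Replica:
      # hypothetical replica storing (value, timestamp), merging by last-writer-wins
      def __init__(self):
          self.value, self.ts = None, 0.0
      def write(self, value):
          self.value, self.ts = value, time.time()
      def merge(self, other):   # anti-entropy: adopt the newer version
          if other.ts > self.ts:
              self.value, self.ts = other.value, other.ts

  r1, r2 = Replica(), Replica()
  r1.write("Graz")              # the update initially reaches only replica r1
  r2.merge(r1)                  # ... and is eventually propagated to r2
  assert r1.value == r2.value   # replicas converged to the same state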
Key-Value Stores Motivation and Terminology • Motivation • Basic key-value mapping via a simple API (more complex data models can be mapped to key-value representations) • Reliability at massive scale on commodity HW (cloud computing) • System Architecture • Key-value maps, where values can be of a variety of data types • APIs for CRUD operations (create, read, update, delete) • Scalability via sharding (horizontal partitioning) • Example key-value pairs: users:1:a → “Inffeldgasse 13, Graz”, users:1:b → “[12, 34, 45, 67, 89]”, users:2:a → “Mandellstraße 12, Graz”, users:2:b → “[12, 212, 3212, 43212]” • Example Systems • Dynamo (2007, AP) → Amazon DynamoDB (2012) [Giuseppe DeCandia et al: Dynamo: Amazon's highly available key-value store. SOSP 2007] • Redis (2009, CP/AP)
Key-Value Stores Example Systems • Redis Data Types • Redis is not a plain KV store, but a “data structure server” with a persistent log (appendfsync no/everysec/always) • Key: ASCII string (max 512MB, common key schemes: comment:1234:reply.to) • Values: strings, lists, sets, sorted sets, hashes (map of string-string), etc • Redis APIs (see the sketch below) • SET/GET/DEL: insert a key-value pair, lookup value by key, or delete by key • MSET/MGET: insert or lookup multiple keys at once • INCRBY/DECRBY: increment/decrement counters • Others: EXISTS, LPUSH, LPOP, LRANGE, LTRIM, LLEN, etc • Other Systems • Classic KV stores (AP): Riak, Aerospike, Voldemort, LevelDB, RocksDB, FoundationDB, Memcached • Wide-column stores: Google BigTable (CP), Apache HBase (CP), Apache Cassandra (AP)
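A short usage sketch of the listed Redis commands via the redis-py client (assumes a Redis server on localhost:6379; all keys and values are hypothetical):

  import redis

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)

  r.set("users:1:a", "Inffeldgasse 13, Graz")  # SET: insert a key-value pair
  print(r.get("users:1:a"))                    # GET: lookup value by key

  r.mset({"users:2:a": "Mandellstrasse 12, Graz",   # MSET: insert multiple keys
          "users:2:b": "[12, 212, 3212, 43212]"})
  print(r.mget("users:1:a", "users:2:a"))           # MGET: lookup multiple keys

  r.incrby("pageviews", 5)            # INCRBY: increment a counter by 5
  r.decrby("pageviews", 2)            # DECRBY: decrement a counter by 2

  r.lpush("jobs", "job1", "job2")     # LPUSH: prepend elements to a list
  print(r.lrange("jobs", 0, -1))      # LRANGE: read a list slice
  r.delete("users:1:a")               # DEL: delete by key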
Key-Value Stores Log-structured Merge Trees [Patrick E. O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth J. O'Neil: The Log-Structured Merge-Tree (LSM-Tree). Acta Inf. 1996] • LSM Overview • Many KV stores rely on LSM-trees as their storage engine (e.g., BigTable, DynamoDB, LevelDB, Riak, RocksDB, Cassandra, HBase) • Approach: buffer writes in memory, flush data as sorted runs to storage, merge runs into larger runs of the next level (compaction) • System Architecture • Writes go to the in-memory buffer C0 (max capacity T) • Reads check both C0 and the on-disk storage C1 • Compaction (rolling merge): sort and merge runs, including deduplication
Key-Value Stores Log-structured Merge Trees, cont. • LSM Tiering (write-optimized) • Keep up to T-1 runs per level L • Merge all runs of Li into 1 run of Li+1 • LSM Leveling (read-optimized) • Keep 1 run per level L • Merge the run of Li with Li+1 [Niv Dayan: Log-Structured-Merge Trees, Comp115 guest lecture, 2017]
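A toy in-memory sketch of the LSM write path (leveling with a single run, Python lists instead of on-disk runs, a hypothetical capacity constant; cascading merges across multiple levels are omitted):

  MEMTABLE_CAPACITY = 4   # hypothetical max entries in C0 before a flush

  class ToyLSM:
      def __init__(self):
          self.memtable = {}   # C0: in-memory write buffer (key -> value)
          self.run = []        # C1: one sorted on-disk run (list of (key, value))
      def put(self, key, value):
          self.memtable[key] = value
          if len(self.memtable) >= MEMTABLE_CAPACITY:
              self.flush()
      def flush(self):
          # compaction: sort C0, merge into C1 with deduplication (newer wins)
          merged = dict(self.run)
          merged.update(self.memtable)
          self.run = sorted(merged.items())
          self.memtable = {}
      def get(self, key):
          if key in self.memtable:           # reads check C0 first ...
              return self.memtable[key]
          return dict(self.run).get(key)     # ... then the on-disk run C1

  lsm = ToyLSM()
  for i in range(10):
      lsm.put("k%d" % i, i)
  print(lsm.get("k3"))   # 3, served from the flushed and compacted run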
Document Stores Recap: JSON (JavaScript Object Notation) • JSON Data Model • Data exchange format for semi-structured data • Not as verbose as XML (especially for arrays) • Popular format (e.g., Twitter) • Example document: {"students": [{"id": 1, "courses": [{"id": "INF.01014UF", "name": "Databases"}, {"id": "706.550", "name": "AMLS"}]}, {"id": 5, "courses": [{"id": "706.004", "name": "Databases 1"}]}]} • Query Languages • Most common: libraries for tree traversal and data extraction • JSONiq: XQuery-like query language [http://www.jsoniq.org/docs/JSONiq/html-single/index.html] • JSONPath: XPath-like query language • JSONiq example: declare option jsoniq-version “…”; for $x in collection(“students”) where $x.id lt 10 let $c := count($x.courses) return {“sid”: $x.id, “count”: $c}
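For comparison, the same extraction as the JSONiq query via plain tree traversal in Python (using only the standard json module on the students document from above):

  import json

  doc = json.loads('''{"students": [
      {"id": 1, "courses": [
          {"id": "INF.01014UF", "name": "Databases"},
          {"id": "706.550", "name": "AMLS"}]},
      {"id": 5, "courses": [
          {"id": "706.004", "name": "Databases 1"}]}]}''')

  # for each student with id < 10, return the id and the number of courses
  result = [{"sid": s["id"], "count": len(s["courses"])}
            for s in doc["students"] if s["id"] < 10]
  print(result)   # [{'sid': 1, 'count': 2}, {'sid': 5, 'count': 1}]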
Document Stores Motivation and Terminology • Motivation • Application-oriented management of structured, semi-structured, and unstructured information (pay-as-you-go, schema evolution) • Scalability via parallelization on commodity HW (cloud computing) • System Architecture • Collections of (key, document) pairs, e.g., 1234 → {customer: ”Jane Smith”, items: [{name: ”P1”, price: 49}, {name: ”P2”, price: 19}]}, 1756 → {customer: ”John Smith”, ...}, 989 → {customer: ”Jane Smith”, ...} • Scalability via sharding (horizontal partitioning) • Custom SQL-like or functional query languages • Example Systems • MongoDB (C++, 2007, CP) → RethinkDB, Espresso, Amazon DocumentDB (Jan 2019) • CouchDB (Erlang, 2005, AP) → CouchBase
Document Stores Example MongoDB [Credit: https://api.mongodb.com/python/current]
• Creating a collection:
  import pymongo as m
  conn = m.MongoClient("mongodb://localhost:123/")
  db = conn["dbs19"]        # database dbs19
  cust = db["customers"]    # collection customers
• Inserting into a collection:
  mdict = {"name": "Jane Smith", "address": "Inffeldgasse 13, Graz"}
  id = cust.insert_one(mdict).inserted_id
  # ids = cust.insert_many(mlist).inserted_ids
• Querying a collection:
  print(cust.find_one({"_id": id}))
  ret = cust.find({"name": "Jane Smith"})
  for x in ret:
      print(x)
Graph Processing Motivation and Terminology • Ubiquitous Graphs • Domains: social networks, open/linked data, knowledge bases, bioinformatics • Applications: influencer analysis, ranking, topology analysis • Terminology • Graph G = (V, E) of vertices V (set of nodes) and edges E (set of links between nodes) • Different types of graphs: undirected graph, directed graph, multigraph, labeled graph, data/property graph (vertices/edges carry labels and key-value properties, e.g., a Gene vertex with an "interacts" edge annotated k1=v1, k2=v2)
Graph Processing Terminology and Graph Characteristics • Terminology, cont. • Path: sequence of edges and vertices (walk: allows repeated edges/vertices) • Cycle: closed walk, i.e., a walk that starts and ends at the same vertex • Clique: subgraph of vertices where every two distinct vertices are adjacent • Metrics (see the sketch below) • Degree (in/out-degree): number of incoming/outgoing edges of a vertex (e.g., in-degree 3, out-degree 2) • Diameter: maximum distance over all pairs of vertices (longest shortest-path) • Power Law Distribution • The degree distribution of most real graphs follows a power law (e.g., 80-20 rule: tall head, long tail)
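A small sketch of the degree and diameter metrics on a directed graph stored as adjacency lists (plain Python with BFS; the example graph is hypothetical):

  from collections import deque

  G = {1: [2, 3], 2: [3], 3: [4], 4: [1]}   # hypothetical directed graph

  out_deg = {v: len(nbrs) for v, nbrs in G.items()}   # outgoing edges per vertex
  in_deg = {v: 0 for v in G}                          # incoming edges per vertex
  for nbrs in G.values():
      for w in nbrs:
          in_deg[w] += 1

  def bfs_dist(G, src):
      # shortest-path distances from src via breadth-first search
      dist, queue = {src: 0}, deque([src])
      while queue:
          v = queue.popleft()
          for w in G[v]:
              if w not in dist:
                  dist[w] = dist[v] + 1
                  queue.append(w)
      return dist

  # diameter: the longest shortest-path over all reachable vertex pairs
  diameter = max(d for v in G for d in bfs_dist(G, v).values())
  print(in_deg, out_deg, diameter)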
Graph Processing Vertex-Centric Processing [Grzegorz Malewicz et al: Pregel: a system for large-scale graph processing. SIGMOD 2010] • Google Pregel • Name: Seven Bridges of Koenigsberg (Euler 1736) • “Think-like-a-vertex” computation model • Iterative processing in super steps, communication via message passing • Programming Model • Represent the graph as a collection of vertices with edge (adjacency) lists, partitioned across workers (e.g., vertex 2 with adjacency list [1, 3, 4] on worker 1) • Implement algorithms via the Vertex API:
  public abstract class Vertex {
    public String getID();
    public long superstep();
    public VertexValue getValue();
    public void compute(Iterator<Message> msgs);
    public void sendMsgTo(String v, Message msg);
    public void voteToHalt();
  }
• Terminate if all vertices halted / no more messages
Graph Processing Vertex-Centric Processing, cont. • Example 1: Connected Components • Determine connected components of a graph (subgraphs of connected nodes) • Propagate max(current, msgs) to neighbors if != current, terminate if no msgs • [Figure: component-ID propagation over super steps 0–3 until convergence] • Example 2: PageRank [Credit: https://en.wikipedia.org/wiki/PageRank] • Ranking of webpages by importance/impact • #1: Initialize vertices to 1/numVertices() • #2: In each super step • Compute current vertex value: value = 0.15/numVertices() + 0.85*sum(msg) • Send to all neighbors: value/numOutgoingEdges() (see the sketch below)
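A compact sequential simulation of the PageRank super-step loop from above (synchronous message passing in plain Python; the graph and the number of super steps are hypothetical, and dangling vertices are not handled):

  G = {1: [2, 3], 2: [3], 3: [1], 4: [3]}   # hypothetical directed graph
  N = len(G)

  value = {v: 1.0 / N for v in G}   # #1 initialize vertices to 1/numVertices()

  for superstep in range(20):       # #2 super steps until (near) convergence
      msgs = {v: [] for v in G}
      for v, nbrs in G.items():     # send value/numOutgoingEdges() to neighbors
          for w in nbrs:
              msgs[w].append(value[v] / len(nbrs))
      # recompute vertex values: 0.15/numVertices() + 0.85 * sum(msgs)
      value = {v: 0.15 / N + 0.85 * sum(msgs[v]) for v in G}

  print(value)   # vertex 3 accumulates the highest rank in this example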
Graph Processing Graph-Centric Processing • Motivation • Exploit graph structure for algorithm-specific optimizations (number of network messages, scheduling overhead for super steps) • Large diameter / average vertex degree • Programming Model • Partition the graph into subgraphs (block/graph) • Implement the algorithm directly against subgraphs (internal and boundary nodes) • Exchange messages in super steps only between boundary nodes, yielding faster convergence [Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson: From "Think Like a Vertex" to "Think Like a Graph". PVLDB 2013] [Da Yan, James Cheng, Yi Lu, Wilfred Ng: Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs. PVLDB 2014]
Graph Processing Resource Description Framework (RDF) • RDF Data • Data and meta data description via triples • Triple: (subject, predicate, object) • Triple components can be URIs or literals • Example triples: (uri2:T.Mueller, rdf#type, uri3:Player), (uri2:T.Mueller, uri4#worksFor, uri1:BayernMunich), (uri2:T.Mueller, uri4#age, 29) • Formats: e.g., RDF/XML, RDF/JSON, Turtle • An RDF graph is a directed, labeled multigraph • Querying RDF Data • SPARQL (SPARQL Protocol And RDF Query Language) • Subgraph matching, e.g.:
  SELECT ?person WHERE {
    ?person rdf:type uri3:Player ;
            uri4:worksFor uri1:BayernMunich .
  }
• Selected Example Systems
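A hedged sketch of the example with the rdflib Python library (the example.org namespace and resource names are hypothetical stand-ins for the slide's uri1–uri4 prefixes):

  from rdflib import Graph, Literal, Namespace, RDF

  EX = Namespace("http://example.org/")   # hypothetical namespace
  g = Graph()

  # triples: (subject, predicate, object)
  g.add((EX.T_Mueller, RDF.type, EX.Player))
  g.add((EX.T_Mueller, EX.worksFor, EX.BayernMunich))
  g.add((EX.T_Mueller, EX.age, Literal(29)))

  # SPARQL subgraph matching, analogous to the query above
  q = """SELECT ?person WHERE {
           ?person rdf:type ex:Player ;
                   ex:worksFor ex:BayernMunich . }"""
  for row in g.query(q, initNs={"rdf": RDF, "ex": EX}):
      print(row.person)   # http://example.org/T_Mueller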
Graph Processing Example Systems • Understanding Use in Practice • Types of graphs users have • Graph computations run • Types of graph systems used (e.g., GraphX) [Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, M. Tamer Özsu: The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB 2017] • Summary of State-of-the-Art Runtime Techniques [Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, James Cheng: Big Graph Analytics Systems. SIGMOD 2016]
Time Series Databases Motivation and Terminology • Ubiquitous Time Series • Domains: Internet-of-Things (IoT), sensor networks, smart production/planet, telemetry, stock trading, server/application metrics, event/log streams • Applications: monitoring, anomaly detection, time series forecasting • Dedicated storage and analysis techniques, hence specialized systems • Terminology • A time series X is a sequence of data points xi for a specific measurement identity (e.g., sensor) and time granularity • Regular (equidistant) time series (xi), e.g., one point per second, vs irregular time series (ti, xi)
Time Series Databases Example InfluxDB [Paul Dix: InfluxDB Storage Engine Internals, CMU Seminar, 09/2017] • Input Data (measurement, tags, attribute values, time):
  cpu,region=west,host=A user=85,sys=2,idle=10 1443782126
• System Architecture • Written in Go, originally a key-value store, now a dedicated storage engine • Time Structured Merge Tree (TSM), similar to LSM • Organized in shards, TSM indexes and an inverted index for reads • Write path: append-only WAL (fsync), in-memory index, periodic flushes to TSM files, compaction & compression, periodic drop of shards (files) according to the retention policy • TSM file layout: Header | Blocks | Index | Footer, with index entries KeyLen | Key | Type | Min T | Max T | Off | …
Time Series Databases Example InfluxDB, cont. • Compression (of blocks) • Compress up to 1000 values per block (Type | Len | Timestamps | Values) • Timestamps: run-length encoding for regular time series; Simple8B or uncompressed for irregular • Values: double delta for FP64, bits for Bool, double delta + zig zag for INT64, Snappy for strings (see the sketch below) • Query Processing • SQL-like and functional APIs for filtering (e.g., range) and aggregation:
  SELECT percentile(90, user) FROM cpu
  WHERE time > now() - 12h AND "region" = 'west'
  GROUP BY time(10m), host
• Inverted indexes • Posting lists: cpu [1,2,3,4,5,6], host=A [1,2,3], host=B [4,5,6], region=west [1,2,3] • Measurement to fields: cpu [user, sys, idle], host [A, B], region [west, east]
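A small illustration of the two integer techniques named above, double delta (delta-of-delta) and zig-zag encoding, in plain Python (illustrative only, not InfluxDB's actual encoder; the final bit-packing of the codes is omitted):

  def zigzag(n):
      # map signed to unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4 (small magnitude -> small code)
      return (n << 1) ^ (n >> 63)

  def double_delta(values):
      # delta-of-delta: near-constant slopes (regular series) yield runs of zeros
      deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
      return [deltas[0]] + [deltas[i] - deltas[i - 1] for i in range(1, len(deltas))]

  ts = [1000, 1010, 1020, 1030, 1041]   # near-equidistant timestamps
  dods = double_delta(ts)               # [10, 0, 0, 1] -> highly compressible
  print([zigzag(d) for d in dods])      # non-negative codes ready for bit-packing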
Time Series Databases Other Systems • Prometheus • Metrics, high-dimensional data model, sharding and federation, custom storage and query engine, implemented in Go • OpenTSDB • TSDB on top of HBase or Google BigTable, Hadoop • TimescaleDB • TSDB on top of PostgreSQL, standard SQL and reliability • Druid • Column-oriented storage for time series, OLAP, and search • IBM Event Store • HTAP system for high data ingest rates, and data-parallel analytics via Spark • Shard-local logs, groomed data [Ronald Barber et al: Evolving Databases for New-Gen Big Data Applications. CIDR 2017]
Conclusions and Q&A • Summary 10 NoSQL Systems • Consistency and Data Models • Key-Value and Document Stores • Graph and Time Series Databases • Next Lectures (Part B: Modern Data Management) • 11 Distributed file systems and object storage [May 27] • 12 Data-parallel computation (MapReduce, Spark) [May 27] • 13 Data stream processing systems [Jun 03] • Jun 17: Q&A and exam preparation