310 likes | 322 Views
Learn about relational databases and the SQL language used to query and manipulate data. Understand how relational databases are organized on disk and their role in cloud infrastructure.
E N D
14-848Cloud Infrastructure Lecture 10 * Fall 2018 * Kesden
Socrative • https://api.socrative.com/rc/Nfu6Lp
Today’s Goals • A brief introduction to relational databases • A brief introduction to the SQL language used by relational databases • A brief look at how relational databases are organized on disk • Reflect on how they can be used within clouds • Reflect on why there are limits to their ability to scale
Not Among Today’s GoalS • A deep study of relational databases • Or, even minimal functional competency with SQL databases • If interested, take 15-615 or 15-645
Database “A database is an organized collection of data, stored and accessed electronically.” -- https://en.wikipedia.org/wiki/Database
Relational Database • “A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. “ • “The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic…where all data is represented in terms of tuples, grouped into relations. • “Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database.” -- https://en.wikipedia.org/wiki/Relational_database -- https://en.wikipedia.org/wiki/Relational_model
Relational DatabasesRole In Clouds • Most widely used type of database since the 1980s • Unlikely to lose that title, or see reduced growth in use, any time soon • Good for up to millions of records • Beyond that, for “Big Data” applications, we’ll talk about various NoSQL databases • Everything from managing system configuration data to customer accounts to inventories to order and shipping records, to HR data, etc.
Relational Database, Informally • A database wherein the data is structured in tables known as relations. • Rows represent entities and are unordered • A relational database, although drawn as a table, functions more like an unordered set of rows. • Columns represent attributes • Constraints can be enforced w.r.t. types and values • Schema for each table describes attributes and associated types. • Operations can be performed across tables.
Keys:What Are They? • Used to name rows within tables • Used to associate rows across tables • Can be tuple of attributes
KeYS:Example Role • Unique within relation • Used to associate relations
SQL (“Sequel”)Structured Query Language • Used by nearly all relational databases • Both a Data Definition Language (DDL) and a Data Manipulation Language (DML) • DDL: Create, define, modify, and administer (access restrictions, etc) relations • Structure, types, constraints, etc – not values therein • DML: Query and modify values: Tuples and attributes thereof • Hides organization of data, e.g. whether and how indexed, how stored, location, etc. • Big difference from older database languages that described tree paths, etc. • Name derived from earlier IBM predecessor, “Standard English QUeryLanguage”
Keys:Types • Candidate – Unique to row • Primary – Used as name for row • Foreign – Key used in another table • Think of a foreign key as a pointer
SQL:CREATE CREATE TABLE StudentBios ( StudentID CHAR(9) NOT NULL, FName VARCHAR(200)NOT NULL, LName VARCHAR(200)NOT NULL, YearEnrolled YEAR )
SQL:CREATE, REVISITED CREATE TABLE StudentCourses ( StudentID CHAR(9) NOT NULL, Course CHAR(6) NOT NULL, Semester CHAR(3) NOT NULL, Grade CHAR(2) PRIMARY KEY (StudentID, Course, Semester), FOREIGN KEY (StudentID) REFERENCES StudentBios ) Why is the primary key (StudentID, Course, Semester)? Not just StudentID?
SQL:INSERT INSERT INTO StudentBios (CHAR(9), VARCHAR(200),VARCHAR(200),YEAR) VALUES (123456789, ‘lkennedy’, ‘Laurie’, ‘Kennedy’, 2014)
SQL:SELECT SELECTFname, Lname from StudentBios WHERE YearEnrolled >= 2015
SQL:SELECT, Continued SELECTStudentID, Fname, Lname from StudentBios WHERE YearEnrolled <= 2015 ORDER BY FName,LName
SQL:DELETE DELETE FROM StudentBios WHERE StudentBios.Fname = ‘Laurie’
SQL:JOIN • Combines columns from one or more relations into a new relation • Rows are aligned based upon a matching key • Columns are selected from logically combined rows • Yes, a relation can be joined with itself, hence one or more • Complexity • Some keys may be present in some relations but not others • Depending upon which table is being iterated, some rows may be missed or some columns may be empty • This leads to different flavors of JOIN
SQL:JOIN SELECT AndrewID, Course, Semester, Grade FROM StudentBiosINNER JOIN StudentCourses WHEREStudentBIOS.StudentID = StudentCourses.StudentID • This effectively forms a new relation which is the union of the StudentBios and StudentCourses tables • The rows of the two are associated up based upon the StudentID of the the rows matching. • INNER JOIN requires matches in both tables • If there isn’t a matching key in both tables it isn’t included • The result has (AndrewID, Course, Semester, Grade) columns
SQL:Views CREATEVIEWStudentClassList AS SELECT AndrewID, Course, Semester FROM StudentBios INNER JOIN StudentCourses WHERE StudentBIOS.StudentID = StudentCourses.StudentID • This forms what is essentially a virtual table • Abstracts away complexity of underlying tables • Allows view to be created from tables that may be from different domains or systems • Allows protections to be applied to table, e.g. can access course list for student, but not course grades • May enable better system performance by aiding caching, etc.
Relational Databases:Memory Hierarchy: A Quick Review • Really approximate access times (latency): • Registers: Speed of processor • SRAM (Cache): 2-3nS = ~2.5x10-6 mS = ~2.5/1,000,000 mS • DRAM (Main Memory): 20-30nS = ~2.5x10-5 mS = 2.5/100,000 mS • FlashRAM (SSD): 0.035 mS • Disk Drive: 1 mS, sequential • Disk Drive: 10 mS, random • The name of the game is… • To build a system that performs nearly as well as if all of its memory is of the fastest type • At a cost nearly as low as if all of its memory is of the slowest type • Caching is the magic sauce
Relational Databases:Additional Considerations • Memory is limited • High throughput is key • ACID properties • Atomicity – A transaction is all done, or note done at all, but not half done (Succeed or fail) • Consistency – Find things consistent, leave things consistent. Don’t break things with side-effects. • Isolation – Even if there is concurrency, effects should be the same as if sequential • Durability – Once it is done, it is done (.) • Minimize overhead, while maximizing performance
Relational Databases:Implications OF ENVIRONMENT, GOALS • Caching is critical • Durability will require consideration of hardware • SSD or other FlashRAM layer? • Battery powered backup? • Etc • Indexes and data stored separately • Pack as many indexes into RAM as possible. Think like Haystack, but 40 years earlier. • Multiple indexes, separate for each use • Indexes and data different caching domains
Relational Databases:Typical Organization • Block/Page based storage of data pages • Often LRU cached • All same size for interchangeable allocation and caching, mapping to disk blocks, etc • Indexes constructed to speed queries • B-Tree based – sorted, range-based, etc • Caching may be page-based LRU. • Hash based – fast lookup without rich features • Trades disk space for lookup speed • Exact lookup only, not ranges, etc. • May be dynamic • Lookup results can be cached. • Indexes point into data pages • Commonly created by administrator, not based upon any intrinsic understanding by the database, itself
Relational Databases:Obtaining Robustness • Backups are hard because they may require a freeze • Blurry backups may not be useful because of ACID • Some algorithms can make blurry backup and fix-up • Replicated databases are a common option • A few replicas provide robustness without blowing scale • Need policies for working through failure and obtaining recovery • Need atomic commit protocol or master-slave, etc, to maintain consistency
Relational Databases:Scaling Up • Scaling to a point is easy enough • Buy bigger disk • Solution can be replicated • Scaling even bigger is easy enough • Buy an array • Solution can be replicated • Scaling even bigger gets complicated • Distributing whole tables makes joins more expensive • Sharding tables by row makes summarizing across attributes slow • Sharding tables by column makes lookup by entity slow • At a certain point, it may be better to focus on precise needs and structure a solution to meet them • NoSQL Databases: Row-Oriented, Column-Oriented, Key-Value Stores, etc.
Relational Databases:Scaling Up, Simple Ideas • Distill data as much as possible. • Make representation of data as dense as possible. • Keep data redundantly, if (absolutely needed) to allow distribution with performance
SQL:The STORY CONTINUES • There is a ton of theory behind relational databases • There is a ton more to the SQL language • There is a ton more to how SQL databases are organized, indexed, etc. • Take 15-615 or 15-645 if interested.
Moving Forward:Data (Not File) Storage at Scale • Key-Value Stores • Column, Row, and Document-Oriented Databases • RAM layer storage