Databases 101 & Multimedia Support

Databases 101 & Multimedia Support Joe Hellerstein Computer Science Division UC Berkeley

Overview for the day • Databases 101 • Motivation • Relational databases • Relational Algebra & SQL • Functional units in a DBMS Engine • Indexes and Query Optimizers • MM Objects in a DBMS • ADT support • GiST & multi-d indexing • Similarity Search and MM indexing

What we’re skipping • Concurrency and Recovery • How DBs ensure your data is safe, despite concurrent access and crashes • Buffer Management • How to manage a main-mem cache, and still guarantee correctness in the face of crashes • Query optimization • how to map a declarative query (SQL) to a query plan (relational algebra + implementations) • Query processing algorithms • sort, hash, join algorithms • Database design • logical design: E-R models & normalization • physical design: indexes, clustering, tuning

More Background • Overview Texts: • Ramakrishnan, Database Management Systems (the cow book) • Silberschatz, Korth, Sudarshan: Database System Concepts (the sailboat book) • O’Neil: Database Principles, Programming, Performance • Ullman & Widom: A 1st Course in Database Systems • Graduate-level Texts • Stonebraker & Hellerstein, eds. Readings in Database Systems (http://redbook.cs.berkeley.edu) • Gray & Reuter: Transaction Processing: Concepts and Techniques. • Ullman: Principles of Database Systems • Faloutsos: Searching Multimedia Databases by Content

Database Management Systems • What more could we want than a file system? • Simple, efficient ad hoc1 queries • concurrency control • recovery • benefits of good data modeling 1ad hoc: formed or used for specific or immediate problems or needs

Describing Data: Data Models • A data modelis a collection of concepts for describing data. • Aschemais a description of a particular collection of data, using the a given data model. • The relational model of datais the most widely used model today. • Main concept: relation, basically a table with rows and columns. • Every relation has a schema, which describes the columns, or fields.

Levels of Abstraction View 1 View 2 View 3 • Many views, single conceptual (logical) schemaand physical schema. • Views describe how users see the data. • Conceptual schema defines logical structure • Physical schema describes the files and indexes used. Conceptual Schema Physical Schema

Example: University Database • Conceptual schema: • Students(sid: string, name: string, login: string, age: integer, gpa:real) • Courses(cid: string, cname:string, credits:integer) • Enrolled(sid:string, cid:string, grade:string) • Physical schema: • Relations stored as unordered files. • Index on first column of Students. • External Schema (View): • Course_info(cid:string,enrollment:integer)

Data Independence • Applications insulated from how data is structured and stored. • Logical data independence: Protection from changes in logical structure of data. • Physical data independence: Protection from changes in physical structure of data. • One of the most important benefits of using a DBMS!

Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB These layers must consider concurrency control and recovery Structure of a DBMS • A typical DBMS has a layered architecture. • The figure does not show the concurrency control and recovery components. • This is one of several possible architectures; each system has its own variations.

Advantages of a DBMS • Data independence • Efficient data access • Data integrity & security • Data administration • Concurrent access, crash recovery • Reduced application development time • So why not use them always? • Can be expensive, complicated to set up and maintain • This cost & complexity must be offset by need • - Often worth it!

Relational Algebra p By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race. -- Alfred North Whitehead (1861 - 1947)

Relational Query Languages • Query languages: Allow manipulation and retrieval of data from a database. • Relational model supports simple, powerful QLs: • Strong formal foundation based on logic. • Allows for much optimization. • Query Languages != programming languages! • QLs not expected to be “Turing complete”. • QLs not intended to be used for complex calculations. • QLs support easy, efficient access to large data sets.

Formal Relational Query Languages Two mathematical Query Languages form the basis for “real” languages (e.g. SQL), and for implementation: • Relational Algebra: More operational, very useful for representing internal execution plans. (Database byte-code…) • Relational Calculus: Lets users describe what they want, rather than how to compute it. (Non-operational, declarative -- SQL comes from here.)

Preliminaries • A query is applied to relation instances, and the result of a query is also a relation instance. • Schemasof input relations for a query are fixed (but query will run regardless of instance!) • The schema for the resultof a given query is also fixed! Determined by definition of query language constructs. • Languages are closed (can compose queries)

R1 Example Instances • “Sailors” and “Reserves” relations for our examples. • We’ll use positional or named field notation, assume that names of fields in query results are `inherited’ from names of fields in query input relations. S1 S2

Relational Algebra • Basic operations: • Selection ( ) Selects a subset of rows from relation. • Projection ( ) Deletes unwanted columns from relation. • Cross-product( ) Allows us to combine two relations. • Set-difference ( ) Tuples in reln. 1, but not in reln. 2. • Union( ) Tuples in reln. 1 and in reln. 2. • Additional operations: • Intersection, join, division, renaming: Not essential, but (very!) useful.

Projection • Deletes attributes that are not in projection list. • Schema of result: • exactly the fields in the projection list, with the same names that they had in the (only) input relation. • Projection operator has to eliminate duplicates! (Why??) • Note: real systems typically don’t do duplicate elimination unless the user explicitly asks for it. (Why not?)

Selection • Selects rows that satisfy selection condition. • No duplicates in result! • Schema of result: • identical to schema of (only) input relation. • Result relation can be the input for another relational algebra operation! (Operatorcomposition.)

Union, Intersection, Set-Difference • All of these operations take two input relations, which must be union-compatible: • Same number of fields. • `Corresponding’ fields have the same type. • What is the schema of result?

Cross-Product • S1 x R1: Each row of S1 is paired with each row of R1. • Result schema has one field per field of S1 and R1, with field names `inherited’ if possible. • Conflict: Both S1 and R1 have a field called sid. • Renaming operator:

Joins • Condition Join: • Result schema same as that of cross-product. • Fewer tuples than cross-product, might be able to compute more efficiently • Sometimes called a theta-join.

Joins • Equi-Join: Special case: condition c contains only conjunction of equalities. • Result schema similar to cross-product, but only one copy of fields for which equality is specified. • Natural Join: Equijoin on all common fields.

SELECT [DISTINCT] target-list FROMrelation-list WHERE qualification Basic SQL Query • relation-list : A list of relation names • possibly with a range-variable after each name • target-list : A list of attributes of tables in relation-list • qualification : Comparisons combined using AND, OR and NOT. • Comparisons are Attr op const or Attr1 op Attr2, where op is one of • DISTINCT: optional keyword indicating that the answer should not contain duplicates. • Default is that duplicates are not eliminated!

Conceptual Evaluation Strategy • Semantics of an SQL query defined in terms of the following conceptual evaluation strategy: • Compute the cross-product of relation-list. • Discard resulting tuples if they fail qualifications. • Delete attributes that are not in target-list. • If DISTINCT is specified, eliminate duplicate rows. • Probably the least efficient way to compute a query! • An optimizer will find more efficient strategiessame answers.

Example of Conceptual Evaluation SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103

A Note on Range Variables • Really needed only if the same relation appears twice in the FROM clause. The previous query can also be written as: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 Some folks suggest using range variables always! SELECT sname FROM Sailors, Reserves WHERE Sailors.sid=Reserves.sid AND bid=103 OR

Find sailors who’ve reserved at least one boat SELECT S.sid FROM Sailors S, Reserves R WHERE S.sid=R.sid • Would adding DISTINCT to this query make a difference? • What is the effect of replacing S.sid by S.sname in the SELECT clause? • Would adding DISTINCT to this variant of the query make a difference?

Expressions and Strings SELECT S.age, age1=S.age-5, 2*S.age AS age2 FROM Sailors S WHERE S.sname LIKE ‘B_%B’ • Arithmetic expressions, string pattern matching. • ASand = are two ways to name fields in result. • LIKE is used for string matching. • `_’ stands for any one character and `%’ stands for 0 or more arbitrary characters.

COUNT (*) COUNT ( [DISTINCT] A) SUM ( [DISTINCT] A) AVG ( [DISTINCT] A) MAX (A) MIN (A) Aggregate Operators • Significant extension of relational algebra. single column SELECT COUNT (*) FROM Sailors S SELECT S.sname FROM Sailors S WHERE S.rating= (SELECT MAX(S2.rating) FROM Sailors S2) SELECT AVG (S.age) FROM Sailors S WHERE S.rating=10 SELECT COUNT (DISTINCT S.rating) FROM Sailors S WHERE S.sname=‘Bob’ SELECT AVG ( DISTINCT S.age) FROM Sailors S WHERE S.rating=10

Database APIs: ODBC/JDBC • special procedures/objects for procedural languages to access DBs • pass SQL strings from language, presents result sets in a language-friendly way • Microsoft’s ODBC becoming C/C++ standard on Windows • Sun’s JDBC a Java equivalent • Supposedly DBMS-neutral • a “driver” traps the calls and translates them into DBMS-specific code • database can be across a network

SQL API in Java (JDBC) Connection con = // connect DriverManager.getConnection(url, ”login", ”pass"); Statement stmt = con.createStatement(); // set up stmt String query = "SELECT COF_NAME, PRICE FROM COFFEES"; ResultSet rs = stmt.executeQuery(query); try { // handle exceptions // loop through result tuples while (rs.next()) { String s = rs.getString("COF_NAME"); Float n = rs.getFloat("PRICE"); System.out.println(s + " " + n); } } catch(SQLException ex) { System.out.println(ex.getMessage () + ex.getSQLState () + ex.getErrorCode ()); }

Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB These layers must consider concurrency control and recovery Structure of a DBMS • A typical DBMS has a layered architecture. • The figure does not show the concurrency control and recovery components. • This is one of several possible architectures; each system has its own variations.

Query Optimization & Processing • Optimizer maps SQL to algebra tree with specific algorithms • access methods and join algorithms • relational operators implemented as iterators • open() • next(possible with condition) • close • processing engine is a pull-based, single-threaded data flow • parallelizes naturally

Index Entries (Direct search) Data Entries ("Sequence set") B+ Tree: The Most Widely Used Index • Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) • Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. • Supports equality and range-searches efficiently.

Example B+ Tree • Search begins at root, and key comparisons direct it to a leaf. • Search for 5*, 15*, all data entries >= 24* ... Root 30 24 13 17 39* 22* 24* 27* 38* 3* 5* 19* 20* 29* 33* 34* 2* 7* 14* 16* • Based on the search for 15*, we know it is not in the tree!

MM objects in a Standard RDBMS? create table frames (frameno integer, image BLOB, category integer) • Binary Large Objects (BLOBs) can be stored and fetched • User-level code must provide all logic for BLOBs • Scenario: client (Machine A) requests “thumbnail” images for all frames in DBMS (Machine B) • Inefficient, too hard to express queries.

Solution 2: Object-Relational DBMS • Idea: add OO features to the type system of SQL. I.e. “plain old SQL”, but... • columns can be of new types (ADTs) • user-defined methods on ADTs • columns can be of complex types • reference types and “deref” • inheritance and collection inheritance • old SQL schemas still work! (backwards compatibility) • Relational vendors all moving this way (SQL99).

ADTs: User-Defined Atomic Types • Built-in SQL types (int, float, text, etc.) limited • have simple methods as well (math, LIKE, etc.) • ORDBMS: can define new types (& methods) create type jpeg (internallength = variable, input = jpeg_in, output = jpeg_out); • Not naturally composed of built-in types • new atomic types • Need input & output methods for types • convert from text to internal type and back • we’ll see how to do method definition soon...

An Example Queries in SQL-3 • Clog cereal wants to license an image of Herbert the Worm in front of a sunrise: • The thumbnail method produces a small image • The Sunrise_rank method returns the relative similarity of a pic to a sunrise • The Herbert_rank method returns the relative similarity of a pic to Herbert select F.frameno, thumbnail(F.image), C.lease_price from frames F, categories C where F.category = C.cid order by Sunrise_rank(F.image) + Herbert_rank(F.image);

User-Defined Methods • New ADTs will need methods to manipulate them • e.g. for jpeg: thumbnail, crop, rotate, smooth, etc. • expert user writes these methods in a language like Java or C • register methods with ORDBMS: create function thumbnail(jpeg) returns jpeg as external name ‘/a/b/c/Dinkey.o’ • ORDBMS dynamically links functions into server.

Modifications to support this? • Indexing • Need extensible indexing architecture • Indexing on arbitrary comparison methods (not just < > =) • Must support similarity search • Chunked array support • Optimization • WHERE clause exprs can be expensive • Need to carefully order selections • May want to do join before selection • Teach optimizer about extensible indexes • Query execution • Make sure you trust downloaded code • Cache method invocations

Beyond B-trees • Question: Can B+-trees handle more complicated searches? • A typical example: “gpa > 3.7 and age < 18” • Same thing: “all restaurants in downtown Berkeley” • Even fancier: “all pictures resembling </tmp/sunset.gif>” • (Easy: “all pictures identical to </tmp/sunset.gif>“) • B+-trees exploit data order to do range search • 1-d range search is not always what you want

Search Trees in General • Lots of trees invented for multidimensional data • e.g. R-trees • New tree indexes for other kinds of data • DNA sequences, etc. • Basically “unordered” B+-trees, fancy keys • In some ways, B+-tree is just a “special case” of these trees.

key1 key2 ... Index Nodes (directory) Data Nodes (linked list) Search Trees from 30,000 feet

How to search a disorderly B+-tree? Equality search is identical to traditional! Range search = traversing multiple paths follow all pointers where key range overlaps query range. [32 - 61) [17 - 32) [61 - inf) [-inf - 17) A “Disorderly” B+-tree Traditional 17 32 61 Disorderly

[47 - 80) [17 - 32) [74 - 103) [1 - 17) Another Disorderly B+-Tree • Insert 41??? • keys need not cover all possible values • Search for 77?? • keys may “overlap”

Generalized Search Tree (GiST) • A disorderly B+-tree, with user-defined keys • tree doesn’t interpret keys. • “user” implements keys as OO objects, with methods that guide search, insert, delete, split, etc. • Structure: balanced tree of (p, ptr) pairs • p is a key “predicate” (assertion) • p holds for all data records below ptr • have to have n keys for n pointers (unlike B+-tree)

User-provided Key Methods • Search: • Consistent(E,q): E.p Ù q? (no/maybe) • e.g. E.p = “gpa > 3.7 and age < 18” q = “2.0 < gpa < 3.0 and 15 < age < 18” • Generating new keys after splits: • Union(P): new key that holds for all tuples in P • Clustering: • Penalty(E1,E2): penalty of inserting E2in subtree E1 • PickSplit(P): split P into two groups of entries

Search • General technique: • DFS tree where Consistent is TRUE • More sophisticated • Priority queue to control the next Consistent node to visit • Example: similarity (near-neighbor) search for ranked queries • We’ll return to this idea

Databases 101 & Multimedia Support