180 likes | 191 Views
This recitation covers the history and growth of relational databases, the difference between database systems and file systems, the concept of data abstraction, tables in a relational database, queries and relational operators, query processing, and relational joins.
E N D
6001structure & interpretation of computer programsrecitation 8/ october 10, 1997
overview • today’s ideas • relational databases • relational operators and queries • simple query optimization
history • origins • 1960’s: network and hierarchical databases • 1970: Codd invents the relational model; revolutionizes the field • early 1980’s: IBM starts DB2, Oracle founded • growth • late 1980’s: RDBs become big business: Oracle, Sybase, Ingres, Informix, etc • 1997: Oracle’s revenue is $5B, total RDB market about $10-15B • future • object-oriented databases unlikely to take off • but hybrid ‘object-relational databases’ are starting to get popular
what’s a database? • what does a database do? • stores large volumes of data • allows rapid access for many different kinds of query • databases includes mechanisms for handling • secondary storage management: indexing and storage of data that doesn’t fit in memory • persistence: data shouldn’t go away! • concurrency control: many users at once • distribution: users at different sites • data protection: making sure updates are consistent • recovery: when system fails, eg • the difference between DB systems and file systems • programmer doesn’t have to worry about how data is laid out on disk(sometimes called the ‘data independence principle’)
the big idea • what was wrong with early databases (network and hierarchical) • the user’s operations were coded directly in terms of data structures • these data structures closely mirrored the way the data was actually stored • when developer changed the data structures for performance,the queries had to be rewritten • the solution • have an abstract data model • queries and insertions all expressed in terms of abstract model • hide the details of the actual data structures from the user • in other words… • data abstraction!
tables • what’s in a relational database • an RDB consists of a collection of tables or relations • each table has a name • columns have names too (called attributes) • each row is a sequence of values one for each column • these may be numbers, booleans, strings, etc • rows are also called tuples or records • example • EVENTS table • EVENT CAFEMEISTER HOURS • java-jump ben 10 • tour-de-café alice 15 • seattle-sleepless carol 8 • ATHLETES table • NAME SPONSOR • alice coffee-connection • ben starbucks • carol green-mountain
queries • to extract information from the RDB • present it with a query • examples • who won the java-jump event? • what company is sponsoring ben? • which company sponsored the winner of the java-jump event? • kinds of query • some queries can be answered by examining only one table • who won the java-jump event? –> look at EVENTS table • what company is sponsoring ben? –> look at ATHLETES table • some need more than one table • which company sponsored the winner of the java-jump event? • –> look up winner in EVENTS and then find sponsor in ATHLETES • query processing • queries are declarative: they say what you want, not how to get it • database system translates query into operation on internal data structures • may optimize the query before applying, so that it goes faster
queries on a single table • for uniformity • think of result of a query as a table itself • two ways to make a table smaller • eliminate some columns • eliminate some rows • eliminating columns • query • select <column names> from <table> • example • select CAFEMEISTER, HOURS from EVENTS • CAFEMEISTER HOURS • ben 10 • alice 15 • carol 8
simple queries, continued • eliminating rows • query • filter <condition> from <table> • example • filter NAME = ben from ATHLETES • NAME SPONSOR • ben starbucks • operators • filter and select are called relational operators • they are like the operations of an abstract data type • take a table and produce another table
combining queries • example • who won the java-jump event? • want a table containing only the name of the winner • so only column is CAFEMEISTER • want to restrict rows so that EVENT is java-jump • query: • select CAFEMEISTER from • filter EVENT = java-jump from • EVENTS • could do it the other way round too • other examples • who sponsors alice? • which athletes did more than 9 hours? • whom does starbucks sponsor?
relational join • relational join • given two tables T1 and T2 with columns a1, b1, c1, … and a2, b2, c2, … • make a new table with columns a1, b1, c1, … a2, b2, c2, … • for each row R1 in T1 and each row R2 in T2, make a new row R1 R2 • how many rows if each table has k rows? k^2 • assume that column names of T1 and T2 are disjoint
example of a join • join EVENTS and ATHLETES • EVENT CAFEMEISTER HOURS NAME SPONSOR • java-jump ben 10 alice coffee- connection • tour-de-café alice 15 alice coffee- connection • seattle-sleepless carol 8 alice coffee- connection • java-jump ben 10 ben starbucks • tour-de-café alice 15 ben starbucks • seattle-sleepless carol 8 ben starbucks • java-jump ben 10 carol green-mountain • tour-de-café alice 15 carol green-mountain • seattle-sleepless carol 8 carol green-mountain
queries involving two tables • now we can formulate a query on two tables by • joining the tables together into a single table • applying filter and select to the new table • example • which company sponsored the winner of the java-cup event? • query is • select SPONSOR from • filter EVENT = java-jump from • filter CAFEMEISTER = NAME from • join EVENTS and ATHLETES • before filter, we have • EVENT CAFEMEISTER HOURS NAME SPONSOR • java-jump ben 10 ben starbucks • final table (result of query) is • SPONSOR • starbucks
more example queries • exercises • which events did coffee-connection sponsor the winner of? • who sponsored an athlete who did more than 12 hours?
query optimization • note • join is expensive! • want to apply it to small tables whereever possible • could have done one of the selects before the join first • instead of • select SPONSOR from • filter EVENT = java-jump • filter CAFEMEISTER = NAME from • join EVENTS and ATHLETES • we could write • select SPONSOR from • filter CAFEMEISTER = NAME from • join ATHLETES and • filter EVENT = java-jump from EVENTS • now join gives table with only 3 rows instead of 9 • we’d like • to write the query either way, and have the database figure out a better way to write it • this is called query optimization: can be expressed with simple rules
join of tables with shared column names • convenient to • use same names for related columns • for example, CAFEMEISTER in EVENTS and NAME in ATHLETES could both be NAME • refine the meaning of join • can’t have two columns with same name • so • form a table with both columns • eliminate rows with different values in the two columns • merge the two columns • example • assume CAFEMEISTER column in EVENTS is now called NAME • join of EVENTS and ATHLETES: • EVENT NAME HOURS SPONSOR • tour-de-café alice 15 coffee- connection • java-jump ben 10 starbucks • seattle-sleepless carol 8 green-mountain
query optimizations • which of these rules are valid? • filter C1 from filter C2 from T = filter C1 and C2 from T • select COLS1 from select COLS2 from T = select COLS1 U COLS2 from T • under what conditions are these valid? • select COLS from (join of T1 and T2) = join (select COLS from T1) and (select COLS from T2) • filter C from (join of T1 and T2) = join of (select C from T1) and T2
puzzles for student presentations • procedures with variable length argument lists • find out from the R4RS manual how to write a procedurethat can take a variable number of arguments • make up an entertaining procedure that exploits this feature • powerlists • define a procedure that takes a list and returns a list ofall the sublists • for example, (p (list 1 2 3)) should return a list containing()(1)(2)(3)(1 2)(1 3)(2 3)(1 2 3)in some order • use map!