250 likes | 362 Views
This Week. Introduction to Databases Course Information Grading and Other Things Questionnaire The E/R Data Model. Textbooks for COSC 6340.
E N D
This Week • Introduction to Databases • Course Information • Grading and Other Things • Questionnaire • The E/R Data Model
Textbooks for COSC 6340 • Required Text:Raghu Ramakrishnan and Johannes Gehrke, Data Management Systems, McGraw Hill, Third Edition, 2002 (complication: the chapter numbers in the new edition are different!!) • Recommended:Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufman Publishers, 2001, ISBN 1-55860-489-8 (4 chapters will be covered) • Other books with relevant material:Ramez Elmasri and Shamkant Navathe, Fundamentals of Database Systems, Third Edition Addison Wesley ISBN: 0-8053-1755-4
Lectures in COSC 6340 • I: Basic Database Management Concepts --- Review of basic database concepts, techniques, and languages (9 classes, Chapters 1-5, 8-11, and 16 of the textbook). • III: Introduction to KDD and Data Warehousing centering on data warehouses, OLAP, and data mining; moreover, more detailed coverage of querying and mining data streams and database clustering (5 classes; Chapters 1, 2, 6, and 7 of the Han/Kamber book & chapter 25 of our textbook and additional material) • III: Relational Database Design (2 classes, chapters 19) • IV: Implementation of Relational Operators, Query Optimization, and Physical Database Design (Chapters 12+14+20, 3-4 classes) • V: Internet Databases and XML (1 class, chapter 27 of the textbook) • VI: Other: discussion of home works and exams, student presentations; discussion of course projects (3 classes)
Tentative Schedule COSC 6340 • Jan. 18: Introduction to COSC 6340 • Fast Review of Undergraduate Material (Jan. 20-Feb. 15) • Jan. 20: Entity-Relationship Data Model more detailed than textbook • Jan. 25: Entity-Relationship Data Model • Jan. 27: Relational Data Model • Feb. 1: Mapping E/R Diagrams to Relations • Feb. 8/10/15: Index & storage structures, B+-trees, and hashing, PDBD • Feb. 15/17: Relational Algebra • Feb. 22: Writing SQL Queries (somewhat short) • Feb. 24: Leftovers / Review • March 1: Exam0 (Undergraduate Review Exam)
Tentative Schedule COSC 6340 Part2 • II: KDD and Data Warehousing (approx. 5.5 classes) • March 3: Introduction to KDD • March 8: Similarity Assessment • March 10: Clustering • March 22: Association Rule Mining • March 29: Spatial Databases • March 31: Data Warehouses and OLAP • April 8 (30 minutes): Spatial Data Mining • April 5+7+[8]: III: Relational Database Design (2.5 classes) • April 14: Midterm Exam (30 minutes review on April 12, 2005) • April 19+26+[28]: Student Presentations • IV: Physical Database Design and Query Optimization (3 classes) • April 8(makeup): Implementation of Relational Operators • April 8(makeup)/12: Introduction to Query Optimization • April 12/19: Physical Database Design II • V: Internet Databases (1 class) • April 21: Introduction to XML and Semi-Structured Data • April 28: Course Summary + Teaching Evaluation + Leftovers
Exams and Homeworks COSC 6340 • Tu., March 1: Undergraduate Material Review Exam • Th., April 14: Midterm Exam • 3 graded homeworks: deadlines: Feb. 17, April 11, April 28 • Ungraded Homeworks • Final Exam: Tu/TH., May 10, 11a-1:30p • Qualifying Exam Part2: Fr., May 13, 10-11:30a
Why are integrated databases popular? • Avoidance of uncontrolled redundancy • Making knowledge accessible that would otherwise not be accessible • Standardization --- uniform representation of data facilitating import and export • Reduction of software development (though the availability of data management systems) Bookkeeping Device Integrated Database Car Salesman
Popular Topics in Databases • Efficient algorithms for data collections that reside on disks (or which are distributed over multiple disk drives, multiple computers or over the internet). • Study of data models (knowledge representation, mappings, theoretical properties) • Algorithms to run a large number of transactions on a database in parallel; finding efficient implementation for queries that access large databases; database backup and recovery,… • Database design • How to use database management systems as an application programmer / end user. • How to use database management systems as database administrator • How to implement database management systems • Data summarization, knowledge discovery, and data mining • Special purpose databases (genomic, geographical, internet,…)
Data Model Data Model is used to define Schema (defines a set of database states) Current Database State
when name title ssn phone Person Check_out Schema for the Library Example using the E/R Data Model author B# Book (0,35) (0,1) 1-to-1 1-to Many Many-to-1 Many-to-Many
Relational Schema for Library Example in SQL/92 CREATE TABLE Person (ssn CHAR(9), name CHAR(30), phone INTEGER, PRIMARY KEY (ssn)); CREATE TABLE Book (B# INTEGER, title CHAR(30), author CHAR(20), PRIMARY KEY (B#)); CREATE TABLE Checkout( book INTEGER, person CHAR(9), since DATE, PRIMARY KEY (book), FOREIGN KEY (book) REFERENCES Book, FOREIGN KEY (person) REFERENCES Person));
Referential Integrity in SQL/92 CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid,cid), FOREIGN KEY (sid) REFERENCESStudents ON DELETE CASCADE ON UPDATE SET DEFAULT ) • SQL/92 supports all 4 options on deletes and updates. • Default is NO ACTION (delete/update is rejected) • CASCADE (also delete all tuples that refer to deleted tuple) • SET NULL / SET DEFAULT(sets foreign key value of referencing tuple)
Example of an Internal Schemafor the Library Example INTERNAL Schema Library12 references Library. Book is stored sequentially, index on B# using hashing, index on Author using hashing. Person is stored using hashing on ssn. Check_out is stored sequentially, index on since using B+-tree.
Example: Stored Database Index on B# Block#= B# mod 10 Index on Author Relation Book 1 11 51 20 30 0 W, … (1, C,W) (20, Y,W) (51, C, B) 1 (11, Y,W) (30, Z, B) Relation Checkout (101,…) (200,…) (500,…) 0 Index on since Relation Person Block#= sss mod 10 1
Modern Relational DBMS Transaction Concepts; capability of running many transactions in parallel; support for backup and recovery. Support for Web-Interfaces, XML, and Data Exchange Modern DBMS Support for OO; capability to store operations Support for data- driven computing Efficient Implementation of Queries (Query Optimization, Join & Selection & Indexing techniques) Support for Data Mining operations Support for OLAP and Data Warehousing Support for special Data-types: long fields, images, html-links, DNA-sequences, spatial information,… Support for higher level user interfaces: graphical, natural language, form-based,…
Disks and Files • DBMS stores information on (“hard”) disks. • This has major implications for DBMS design! • READ: transfer data from disk to main memory (RAM). • WRITE: transfer data from RAM to disk. • Both are high-cost operations, relative to in-memory operations, so must be planned carefully!
Why Not Store Everything in Main Memory? • Costs too much. $100 will buy you either 512MB of RAM or 50GB of disk today --- that is disk storage 100 times cheaper(but it is approx. 10000 times slower). • Main memory is volatile. We want data to be saved between runs. (Obviously!) • Typical storage hierarchy: • Main memory (RAM) for currently used data. • Disk for the main database (secondary storage). • Tapes for archiving older versions of the data (tertiary storage). Remark: All reported disk performance/prize data are as of middle of 2003
Tracks Arm movement Arm assembly Components of a Disk Spindle Disk head • The platters spin (say, 90rps). • The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder(imaginary!). Sector Platters • Only one head reads/writes at any one time. • Block size is a multiple of sector size (which is fixed).
Accessing a Disk Page • Time to access (read/write) a disk block: • seek time (moving arms to position disk head on track) • rotational delay (waiting for block to rotate under head) • transfer time (actually moving data to/from disk surface) • Seek time and rotational delay dominate. • Seek time varies from about 1 to 20msec • Rotational delay varies from 0 to 10msec • Transfer rate is about 1msec per 32KB page
Review: The ACID properties • A tomicity: All actions in the Xact happen, or none happen. • C onsistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent. • I solation: Execution of one Xact is isolated from that of other Xacts. • D urability: If a Xact commits, its effects persist. • The Recovery Manager guarantees Atomicity & Durability.
Example • Consider two transactions (Xacts): T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.06*A, B=1.06*B END • Intuitively, the first transaction is transferring $100 from B’s account to A’s account. The second is crediting both accounts with a 6% interest payment. • There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. However, the net effect must be equivalent to these two transactions running serially in some order.
Atomicity of Transactions • A transaction mightcommitafter completing all its actions, or it could abort(or be aborted by the DBMS) after executing some actions. • A very important property guaranteed by the DBMS for all transactions is that they are atomic. • DBMS logs all actions so that it can undothe actions of aborted transactions and redo the actions of successful transactions.
Concurrency in a DBMS • Users submit transactions, and can think of each transaction as executing by itself. • Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions. • Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins. • DBMS will enforce some ICs, depending on the ICs declared in CREATE TABLE statements. • Beyond this, the DBMS does not really understand the semantics of the data. (e.g., it does not understand how the interest on a bank account is computed). • Issues:Effect of interleaving transactions, and crashes.
Example (Contd.) • Consider a possible interleaving (schedule): T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • This is OK. But what about: T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • The DBMS’s view of the second schedule: T1: R(A), W(A), R(B), W(B) T2: R(A), W(A), R(B), W(B)
Summary • Concurrency control and recovery are among the most important functions provided by a DBMS. • Users need not worry about concurrency. • System automatically inserts lock/unlock requests and schedules actions of different transactions in such a way as to ensure that the resulting execution is equivalent to executing the transactions one after the other in some order. • Write-ahead logging (WAL) is used to undo the actions of aborted transactions and to restore the system to a consistent state after a crash.