1.17k likes | 1.62k Views
CS5423 Principles of Database Systems. DB Fundamentals. What are Database Management Systems. DBMS is a system for providing EFFICIENT , CONVENIENT , and SAFE MULTI-USER storage of and access to MASSIVE amounts of PERSISTENT data. Capabilities of DBMS. Persistent Storage
E N D
CS5423Principles of Database Systems DB Fundamentals
What are Database Management Systems DBMS is a system for providing EFFICIENT, CONVENIENT, and SAFEMULTI-USER storage of and access to MASSIVE amounts of PERSISTENT data
Capabilities of DBMS • Persistent Storage • Data Structures for efficient access to very large amounts of data • Programming Interface • More than reading and writing of files. Access/search and modify data through a query language • Transaction Management- ACID • a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc • Atomicity: each transaction is treated as a single "unit • Consistency: any data written to the database must be valid according to all defined rules, • Isolation: controls how and when changes are made and if they become visible to each other, users, and systems • Durability: completed transactions will survive permanently
Example: Banking System • Data • Information on accounts, customers, balances, current interest rates, transaction histories, etc. • MASSIVE • Many gigabytes at a minimum for big banks, more if keep history of all transactions, even more if keep images of checks -> Far too big to fit in main memory • PERSISTENT • Data outlives programs that operate on it
Example: Banking System • SAFE: • from system failures • from malicious users • CONVENIENT: • simple commands to debit account, get balance, write statement, transfer funds, etc. • also unpredicted queries should be easy • EFFICIENT: • don't search all files in order to get balance of one account, get all accounts with low balances, get large transactions, etc. • massive data! -> DBMS's carefully tuned for performance
Multi-user Access • Many people/programs accessing same database, or even same data, simultaneously -> Need careful controls • Alex @ ATM1: withdraw $100 from account #007 get balance from database; if balance >= 100 then balance := balance - 100; dispense cash; put new balance into database; • Bob @ ATM2: withdraw $50 from account #007 get balance from database; if balance >= 50 then balance := balance - 50; dispense cash; put new balance into database; • Initial balance = 120. Final balance = ??
Why File Systems Won’t Work • Storing data: file system is limited • size limit by disk or address space • when system crashes we may lose data • Password/file-based authorization insufficient • Query/update: • need to write a new C++/Java program for every new query • need to worry about performance • Concurrency: limited protection • need to worry about interfering with other users • need to offer different views to different users (e.g. registrar, students, professors) • Professor: All students’ grade • Student-Own grade • Schema change: • entails changing file formats • need to rewrite virtually all applications That’s why the notion of DBMS was motivated!
DBMS Architecture User/Web Forms/Applications/DBA query transaction DDL commands Query Parser Transaction Manager DDL Processor Query Rewriter Concurrency Control Logging & Recovery Query Optimizer Query Executor Records Indexes Lock Tables Buffer: data, indexes, log, etc Buffer Manager Main Memory Storage Manager Storage data, metadata, indexes, log, etc CS411
Data Structuring: Model, Schema, Data • Data model • conceptual structuring of data stored in database • E.g. arrays, objects in C or Java • ex: data is set of records, for a university system, student table • each with student-ID, name, address, courses, photo • ex: airplane system: • data is graph where nodes represent cities, edges represent airline routes • Schema versus data • schema: describes how data is to be structured, defined at set-up time, rarely changes (also called "metadata") • A set of attributed with a name for a relation • Student(Id, Name, Address, Curse,..) • datais actual "instance" of database, rows of relations(tables) changes rapidly • Each row has a component(value) for each attribute of relation defined in schema • vs. types and variables in programming languages
Schema vs. Data • Schema: name, name of each field, the type of each field • Students (Sid:string, Name:string, Age: integer, GPA: real) • A template for describing a student • Data: an example instance of the relation
Data Structuring: Model, Schema, Data • Data definition language (DDL) • commands for setting up schema of database • CREATE, DROP, ALTER • Data Manipulation Language (DML) • Commands to manipulate data in database: • Select, INSERT, DELETE, MODIFY • Also called "query language“ • Does not affect schema
Create another column e.g. address • DDL • Insert another student information • DML
People • DBMS user: queries/modifies data • DBMS application designer • set up schema, loads data, … • DBMS administrator • user management, performance tuning, … • DBMS implementer: builds systems
Key Steps in Building DB Applications • Step 0: pick an application domain • Step 1: conceptual design • Discuss what to model in the application domain • Need a modeling language to express what you want • ER model is the most popular such language • Output: an ER diagram of the application domain • Step 2: pick a type of DBMS’s • Relational DBMS is most popular and is our focus
Key Steps in Building DB Applications • Step 3: translate ER design to a relational schema • Use a set of rules to translate from ER to relational schema • Use a set of schema refinement rules to transform the above relational schema into a goodrelational schema • 1NF, 2NF, 3NF, BCNF, 4NF,…… • At this point • You have a good relational schema on paper
Key Steps in Building DB Applications • Step 4: Implement your relational DBMS using a "database programming language" called SQL • SELECT-FROM-WHERE-GROUPBY-HAVING • Step 5: Ordinary users cannot interact with the database directly and the database also cannot do everything you want, hence write your application program in C++, Java, PHP, etc. to handle the interaction and take care of things that the database cannot do
ER Model • ER ? • A language to specify • what information a database must hold • what are the relationships among components of that information • What we will cover • basic constructs, and design principles • Proposed by Peter Chen in 1976 • "The Entity-Relationship Model --- Toward a Unified View of Data". in ACM transactions on database systems (TODS) • One of the most cited CS papers
Entities and Attributes • Entities • Real-world objects distinguishable from other objects • Described using a set of attributes • Attributes • each has an atomic domain: string, integers, reals, etc. • Entity set: a collection of similar entities
makes Product Company 1 a b 2 A= c 3 B= d (Binary) Relationship • A mathematical definition: • if A, B are sets, then a relation R is a subset of A x B (Cartesian product) • A={1,2,3}, B={a, b, c, d}, R = {(1,a), (1,c), (3,b)} • makes is a subset of Product x Company:
makes Product Company a b c 1 2 3 1 2 3 1 2 3 a b c d a b c d Multiplicity of E/R Relationships • one-one: • many-one • many-many
did budget name dname ssn lot Departments Employees Manages Works_In Example • An employee can work in many departments; a department can have many employees • In contrast, each department has at most one manager. • So, how can we create ER diagram for this ?
ER Model vs. Relational Model • Both are used to model data • ER model has many concepts • Entities, relationships, attributes, etc. • Well-suited for capturing the app. requirements • Not well-suited for computer implementation • Relational model • Has just a single concept: relation (table) • World is represented with a collection of tables • Well-suited for efficient manipulations on computers
Relations • Schema vs. instance = columns vs. rows • Schema of a relation • Relation name • Attribute names • Attribute types (domains) • Schema of a database • A set of relation schemas • Questions • When do you determine a schema (instance)? • How often do you change your mind?
Relation: An Example Name of Table (Relation) Column (Field, Attribute) Products Domain (Atomic type) Row (Record, Tuple)
Relations • The database maintains a current database state • Updates to the data happen very frequently • add a tuple • delete a tuple • modify an attribute in a tuple • Updates to the schema are relatively rare, and rather painful. Why?
Translating ER Diagram to Rel. Design • Basic cases • entity set E = relation with attributes of E • relationshipR = relation with attributes being keys of related entity sets + attributes of R • Special cases • combining two relations • translating weak entity sets • translating is-a relationships and subclasses
since name dname ssn lot Employees Manages Translating ER Diagrams budget did Departments Translation to relational model? 1-to-1 1-to-Many Many-to-1 Many-to-Many
E2 ER Diagram to Relations Relational schema e.g.account=(bname, acct_no, bal) E/R diagram E = ( a1, …, an ) E a1 ….. an R1 E1 R1= ( a1, b1, c1, …, ck) a1 …. an c1 …. ck b1 …. bm
E2 ER Diagram to Relations R1 E1 c1 …. ck a1 …. an b1 …. bm • Could have : R1= ( a1, b1, c1, …, ck) • Put b1 as the key for R1, it is also the key for E2=(b1, …., bn) • Usual strategy(combination) • ignore R1 • Add a1, c1, …., ck to E2 instead, i.e. • E2=(b1, …., bn, a1, c1, …, ck)
E2 ER Diagram to Relations R1 ? ? E1 c1 …. ck a1 …. an b1 …. bm E2 = ( b1, …, bm ) E1 = ( a1, …, an ) R1 R1 = ( a1, b1, c1 …, ck ) E1 = ( a1, …, an ) R1 E2 = ( b1, …, bm , a1, c1, …, ck) E1 = ( a1, …, an , b1, c1, …, ck) R1 E2 = ( b1, …, bm ) R1 Treat as n:1 or 1:m
Relational Algebra • Querying the database: specify what we want from our database • Find all the people who earn more than $1,000,000 and pay taxes in Oklahoma • Could write in C++/Java, but a bad idea • Instead use high-level query languages: • Theoretical: Relational Algebra, • Practical: SQL • Relational algebra: a basic set of operations on relations that provide the basic principles
What is an “Algebra”? • Mathematical system consisting of: • Operands--- variables or values from which new values can be constructed • Operators--- symbols denoting procedures that construct new values from given values • Examples • Arithmetic algebra, linear algebra, Boolean algebra …… • What are operands? • Arithmetic Algebra: variables like x and constants like 15 • What are operators? • Arithmetic Algebra: addition, subtraction…
What is Relational Algebra? • An algebra • Whose operands are relations or variables that represent relations • Whose operators are designed to do common things that we need to do with relations in a database • relations as input, new relation as output • Can be used as a query languagefor relations
Relational Operators at a Glance • Five basic RA operations: • Basic Set Operations • union, difference (no intersection, no complement) • Selection: (s) eliminates some rows of relations • Projection: (p) eliminates some columns of relations • Cartesian Product: (X) combine the tuples of two relations • When our relations have attribute names: • Renaming: (r) change the relation schema • Derived operations: • Intersection, complement • Joins (natural join, equi-join, theta join, semi-join, ……)
Set Operations • Union: all tuples in R1 or R2, denoted as R1 U R2 • R1, R2 must have the same schema • R1 U R2 has the same schema as R1, R2 • Example: • Active-Employees U Retired-Employees • If any, is duplicate elimination required? • yes • Intersection: all tuples in both R1 and R2 as R1 ∩ R2 • R1, R2 must have the same schema • R1 U R2 has the same schema as R1, R2 • Example: • Active-Employees U Retired-Employees • Difference: all tuples in R1 but not in R2, denoted as R1 – R2 • R1, R2 must have the same schema • R1 - R2 has the same schema as R1, R2 • Example • All-Employees - Retired-Employees
Selection • Applied to a relation R, produce a new relation with a subset of R’s tuples • Returns all tuples (rows) which satisfy a condition that involves the attributes of R, denoted assc(R) • c is a condition: =, <, >,≤, ≥, ≠, AND, OR, NOT • Output schema: same as input schema • Find all employees with salary more than $40,000: • sSalary > 40000(Employee)
Projection • Unary operation: returns certain columns, denoted as P A1,…,An(R) • Eliminates duplicate tuples ! • Input schema R(B1, …, Bm) • Condition: {A1, …, An} {B1, …, Bm} • Output schema S(A1, …, An) • Example: project social-security number and names: • PSSN, Name (Employee)
Selection vs. Projection • Think of relation as a table • How are they similar? • Selection of subsets from table • How are they different? • Horizontal vs. vertical? • selection-projection • Duplicate elimination for both? • No duplicate in the result of selection
Cartesian Product • Each tuple in R1 with each tuple in R2, denoted as R1 x R2 • Input schemas R1(A1,…,An), R2(B1,…,Bm) • Output schema is S(A1, …, An, B1, …, Bm) • Very rare in practice; but joinsare very common • Example: Employee x Dependent
Example Employee Dependent Employee x Dependent
Renaming • Does not change the relational instance, denoted as Notation: rS(B1,…,Bn) (R) • Changes the relational schema only • Input schema: R(A1, …, An) • Output schema: S(B1, …, Bn) • Example: • Soc-sec-num, firstname(Employee)
Set Operations: Intersection • Intersection: all tuples both in R1 and in R2, denoted as R1 R2 • R1, R2 must have the same schema • R1 R2 has the same schema as R1, R2 • Example: UnionizedEmployeesRetiredEmployees • Intersection is derived: • R1 R2 = R1 – (R1 – R2) why ? R1 R2 R1 ∩ R2
Theta Join • A join that involves a predicate q, denoted as R1 qR2 • Input schemas: R1(A1,…,An), R2(B1,…,Bm) • Output schema: S(A1,…,An,B1,…,Bm) • Derived operator: R1 qR2 = sq (R1 x R2) • Take the product R1 x R2 • Then apply SELECTC to the result • As for SELECT, C can be any Boolean-valued condition
Theta Join: Example Product Component ProductInfo:= Product Product.pname=Component.PName Component
Natural Join • Notation: R1 R2 • Input Schema: R1(A1, …, An), R2(B1, …, Bm) • Output Schema: S(C1,…,Cp) • Where{C1, …, Cp} = {A1, …, An} U{B1, …, Bm} • Meaning: combine all pairs of tuples in R1 and R2 that agree on the attributes: • {A1,…,An} {B1,…, Bm}(called the join attributes)
Natural Join: Examples Employee Dependent Employee Dependent = PSSN, Name, Dependent-Name(sEmployee.SSN=Dependent.SSN(Employee x Dependent)
Natural Join: Examples R S R S
Natural Join • Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R S ? • Given R(A, B, C), S(D, E), what is R S? • Given R(A, B), S(A, B), what is R S?
Equi-join • Special case of theta join: condition c contains only conjunction of equalities • Result schema is the same as that of Cartesian product • May have fewer tuples than Cartesian product • Most frequently used in practice: R1 A=BR2 • Natural join is a particular case of equi-join • A lot of research on how to do it efficiently