820 likes | 845 Views
This course explores database management systems with a focus on sentient spaces and new data models. Topics include data quality, new computing architectures, data privacy, and a walk down the history of data models.
E N D
ICS 224: Database Management Systems Spring 2011 Professor Sharad Mehrotra Information and Computer Science Department University of California, Irvine
Course General Info • URL: http://www.ics.uci.edu/~cs224/ • All course info will be posted online • Lecture times: Tue-Thurs 5 – 6.30 • Instructor: Sharad Mehrotra, BH 2082, sharad@ics.uci.edu • Office Hours: on request Notes 01
Prerequisites • Basic Data Management Concepts: • DB design, relational model, SQL, database programming • CS 122 or equivalent • Database system implementation • Indexing, query optimization, query processing, storage management, etc. • ICS 222 or equivalent • Basic Computer Science Concepts: • Depth-first search, directed/undirected graphs, “big O” notation, computational complexity, NP completeness … Notes 01
Course Requirements • Class Participation: 50% • Attendance, presentations, comments, interaction, enthusiasm, etc. • Class Projects: 50% • Implementation Oriented: • Take a idea/topic, identify a project, get it okayed by instructor, develop a demonstration • Survey of an area • In depth survey in the style of computing survey articles. Provide your own perspective in a subarea. • MUST commit to project at end of 2nd week. Notes 01
Class Structure • Each week we will • Pick a topic • identify 1 paper per student/group of 2 students • 2 papers as lead papers for presentation (one for each class), others presented as short presentations • Each week • Start with overview • Lead paper presentation • short presentation of other papers (main idea) • Discussions Notes 01
This course … Most important ideas in data management (instructor’s pick) But with the eye towards an end application … Sentient spaces Notes 01
Sentient Spaces … • Spaces in which sensors are used to capture the dynamic evolving state which is then analyzed for implementing adaptations. • Numerous examples … • intelligent transportation systems • reconnaissance • surveillance systems • smart buildings • smart grid ...
Query Example:Smart Video Surveillance Query Analysis Event Database CS Building in UC Irvine Semantic Extraction Surveillance Video Database Video collection
Implications of Sentient Space focus .. • Class focuses on topics which you might need to know if you wanted to explore application in sentient space … • Projects should target something about sentient spaces … • E.g., data cleaning of sentient data, data model to represent sentient spaces, … Notes 01
Data Models (2 weeks) • Representing time - TSQL2 • Representing space • Querying streaming data – CQL, ASQL • Semi-structured data –OEM, Lore Notes 01
New Ideas in Storage & Indexing (2 weeks) • New storage models • Key-Value store • Bigtable • Column Stores • New database system architecture • Data outsourcing • Multitenant databases • New Indexing techniques • Correlation maps Notes 01
Data Quality (2 weeks) • Data quality issues • Inaccuracy, incompleteness, ambiguity, errors, … • Two aspects: • Techniques to improve quality • Exploiting contextual knowledge, issues of efficiency • Techniques to tolerate poor quality of data in applications. Notes 01
New Computing Architecture (2 weeks) • Map Reduce framework • Hive • Pig latin • Join processing • HadoopDB • Hyrax? Notes 01
Data Privacy (2 weeks) • Use cases • Data publishing, queries, sharing, data outsourcing. • Diverse criteria • Differential privacy, Anonymity, l-diversity, .. • Mechanisms to implement Notes 01
A walk down the history of data models … Two papers (MUST READ) Inclusion of New types in relational databases, Stonebraker Postgrest Next Generation databsase, Stonebraker.
The Paleolithic Period … • There were no general purpose tools for managing large volumes of data… • OS provided resource management • Data was stored in files • Applications performed data management functionalities • Fault-tolerance • Concurrency control • Reliability • Optimizations • … • Such functionalities had to be re-implemented for each application Notes 01
The Neolithic Period… • Early file systems evolve into general-purpose data management tools. • DBMS Goals: • Efficiency and scalability (faster than files) • Management of large heterogeneous types of structured data • High reliability • Information sharing (multiple users) • DBMS Users: • E-commerce companies, banks, airlines, transportation companies, corporate databases, government agencies, … • Anyone you can think of! Notes 01
The Dark Ages …. • Network & hierarchical data models • Resulted in data spaghetti • Applications needed to chase pointers • There was little data abstraction or separation of concerns • little difference between physical data representation and logical data representation • optimization was entirely left to application writers • There were no clean data management languages • Unless you are a Cobol fan! Notes 01
The Relational Era.. • Relational model proposed by Codd • Everything is a relation • Query consists of algebraic composition of a few powerful operators • Equivalent to a first-order relational calculus • Primary features • Simple clean data representation • solid mathematical basis • data abstraction • Users did not need to be concerned about how data is stored physically • simple declarative query language • User’s specify what to compute not how to do it. • optimization by the system Notes 01
Data Wars (1) • Codasyl versus relational debates began… • Heated arguments during early SIGMODS • Codasyl: relational model is too simple, applications built using it will never scale in performance. • Relational: network/hierarchical models have no formal basis, are too complex, and unmanageable as application complexity increases. • Relational model found many supporters • Specially at universities • Its simplicity was enticing Notes 01
Data Wars (2) • Many projects started off trying to implement a relational DBMS • System R @ IBM Almaden • Ingres @ Berkeley • These early systems led to the technologies that drive modern data management • Early prototypes became products • DB2 & Ingres • Principle designers from both the System R teams & Ingres left to start companies • Oracle, Sybase • Early relational companies went door to door converting industry to the relational model • Industry got hooked on to the simplicity of writing complex applications in relational model • Boeing among the first converts Notes 01
Pointer’s Strike Back… Application data structures • Complex objects in emerging DBMS applications cannot be effectively represented as records in relational model. • Representing information in RDBMSs requires complex and inefficient conversion into and from the relational model to the application programming language • ODBMSs provide a direct representation of objects to DBMSs overcoming the impedance mismatch problem Copy and translation Transparent ODBMS data transfer Relational representation RDBMS Notes 01
Object Model • Object: • observable entity in the world being modeled • similar to concept to entity in the E/R model • An object consists of: • attributes: properties built in from primitive types • relationships: properties whose type is a reference to some other object or a collection of references • methods: functions that may be applied to the object. Notes 01
Object Oriented Databases • Evolved as persistent Object Oriented Programming Languages: • Start with an OO language (e.g., C++, Java, SMALLTALK) which has a rich type system • Add persistence to the objects in programming language where persistent objects stored in databases Notes 01
Persistent Programming Languages • Single programming language for application and data management • Update to persistent variable results in automatic update to database. • Persistent data could be types such as sets and lists and arrays. • Application can follow pointers (OID) to navigate through data. Notes 01
Persistence • Objects created may have different lifetimes: • transient: allocated memory managed by the programming language run-time system. • E.g., local variables in procedures have a lifetime of a procedure execution • global variables have a lifetime of a program execution • persistent: allocated memory and stored managed by ODBMS runtime system. • Classes are declared to be persistence-capable or transient. • Different languages have different mechanisms to make objects persistent: • creation time: Object declared persistent at creation time (e.g., in C++ binding) (class must be persistent-capable) • persistence by reachability: object is persistent if it can be reached from a persistent object (e.g., in Java binding) (class must be persistent-capable). Notes 01
Persistent Object-Oriented Programming Languages • Persistent objects are stored in the database and accessed from the programming language. • Single programming language for applications as well as data management. • Avoid having to translate data to and from application programming language and DBMS • efficient implementation • less code • Programmer does not need to write explicit code to fetch data to and from database • persistent objects to programmer looks exactly the same as transient objects. • System automatically brings the objects to and from memory to storage device. (pointer swizzling). Notes 01
Approaches To Persistent Programming • Persistent Virtual Memory • disk representation and memory representation of data is identical. • No cost to translate data from one representation to another— efficient! • DB size limited to address space 32bit processor 2^32 byte addressability (4 GBytes) • Differentiating persistent objects and non-persistent objects is difficult. • Difficult to optimize disk layout and locality of access. • Example system using approach: OBJECT STORE. Notes 01
Approaches To Persistent Programming Languages • Store persistent objects in files • Objects brought to memory on demand. • Implementation of OID complex since pointers do not suffice in general. • If object in memory pointer can be used for OID • if object on disk a disk address still not good as OID since storage can be reorganized. A separate mechanism needed. • Pointer swizzling for efficiency. Notes 01
Challenges In Building Persistent Languages • Efficient caching of objects in client address space. • Cache coherence. • In OODB data migrates to clients unlike relational client server systems where query migrates to server. • Given a large number of clients each with the cache of objects ensuring consistency of object across multiple clients is a challenge. Notes 01
Disadvantages of ODBMS Approach • Low protection • since persistent objects manipulated from applications directly, more changes that errors in applications can violate data integrity. • Non-declarative interface: • difficult to optimize queries • difficult to express queries • But ….. • Most ODBMSs offer a declarative query language OQL to overcome the problem. • OQL is very similar to SQL and can be optimized effectively. • OQL can be invoked from inside ODBMS programming language. • Objects can be manipulated both within OQL and programming language without explicitly transferring values between the two languages. • OQL embedding maintains simplicity of ODBMS programming language interface and yet provides declarative access. Notes 01
The Return of the Relations … POSTGRES • Relational model evolved into ORDBMSs that include “best of” object-oriented concepts • Amongst the first ORDBMS prototype built @ BerkeleyPOSTGRES Illustra Informix IUS • Has had major impact on major commercial DBMS which have all migrated to ORDBMS model. • SQL3 supported by modern databases adapted many of the concepts developed in Postgres commercialized bought by Notes 01
POSTGRES — Combinations • Introduced object orientation into relation DBMSs. • Fundamental Concepts. • Each record has an OID. • Access to data though: • query language POSTQUEL. • navigation through OIDs. • Classes: • Inheritance: • Types: rich set of types available for columns. • Functions: can be called within POSTQUEL. Notes 01
Classes And Inheritance • Class analogous to relation • User can create new classcreate Emp (name = c12, salary = float, age = int) • Classes can inherit from otherscreate Salesman (quota = float) inherits Emp • Multiple inheritance permitted. If new class causes ambiguity it is not created. • Classes: • real: base classes or relations • derived: views • version: maintained differentially compared to parent class Notes 01
Types In POSTGRES • Standard base types • float, int, charac. Strings, etc. • Abstract data type (ADT) facility to create new base types e.g.; create type point (x = int, y = int)create type polygon • ADT’s can be used in class definitions.Create Dept( dname = c10, mgr = c12, floorspace = polygon mailstop = point ) mailstop Notes 01
Functions In POSTGRES • Three types: (1) C functions (2) Operators (3) POSTQUEL functions • C-functions • any C-function over base types or composite type retrieve (Dept. name) where area (Dept. floorspace) > 500 retrieve (Emp. name) where overpaid (Emp) Function over a class or method Notes 01
Operators • Arbit C-functions are not optimized by query optimizers. • Special functions - operators can utilize indexes for their evaluation. • Operator: function with 1 or 2 operand retrieve (Dept. name) where Dept. floor space-AGT “(0,0), (1,1), (0,2)” • Index (e.g.; B-tree) defined properly can be used to speed up evaluation of operators such as AGT. Area Greater Than Notes 01
Other Features Of POSTGRES • Allowed creation of new indices by user. • To an extent pioneered the approach of extensible database technology which is prevalent with vendors today • Supported transitive closure in query. retrieve* into ans (parent. older) from a in answer where. Parent. younger = “John” or parent. younger = a. older • Supported rules Notes 01
POSTQUEL Functions • Any collection of commands in POSTQUEL. • query = POSTQUEL function.define function high-pay returns Emp as retrieve (Emp. all) where Emp. salary > 50k • POSTQUEL function with parameters.define function Sal-lookup (c12) returns float as retrieve (Emp. Salary) where Emp. name = $1 • Usage of POSTQUEL function retrieve Emp. name where Emp. Salary = Sal-lookup (“Joe”) Notes 01
Composite Types In POSTGRES • POSTQUEL: • Composite types accessed via path expressions, using nested dot notation.remove (Emp mgr age) where (Emp name = ‘joe’) • Prevents having to specify a join. Notes 01
Composite Types In POSTGRES • Attributes can have a class name as a type resulting in complex objects with structure.Create Emp ( name = c12, salary = float [c12], age = int, mgr = Emp, coworker = Emp ) • A set type that can hold elements of any class.Add to Emp (hobbies = set) Refers to 0 or more references of Emp class. Could be elements of any class Notes 01
Types In POSTGRES • Array type (constructor) crate Emp ( name = c12, salary = float [12], age = int ) • POSTQUEL queryretrieve (Emp name) where (Emp salary [4] = 1000) Salary for each month. Array in query usage. Notes 01
Database Technology Matrix Q u r y S u p p o r t Y E S RDBMSs ORDBMSs File System N O OODBMSs Simple Complex Database Types Notes 01
XML & RDF - the new revolution • Just when relational model had driven out object-oriented database technology, WWW led to the proliferation of semi-structured data. • 2 approaches to supporting XML/RDF • Extend relational technology to support XML/RDF • Native XML databases Notes 01
Summary of Evolution of Data Model • The Dark Ages: network & heirarchical models • Victory of simplicity and beauty over data spaghetti: The Relational DBMS: • The pointers strike back -- Object-Orientation, OODBMSs • The return of the relations -- ORDBMS -- took the best of the OO concepts and incorporated them in the relational model. • The current and near future -- support for XML & RDF • The final frontier -- anyone’s guess! Notes 01
Key Database Technologies • File Management • provides a file abstraction as a collection of records stored in disk • Index Management and Access Methods • implements techniques for associative access to data • Query Optimization and Processing • given a query and data storage structures, determines an efficient strategy to evaluate the query. • Transaction management • ensures consistency of the database in presence of concurrent transactions and various types of failures • Catalog Management • maintains database schema information • Authorization and Integrity Management • tests for integrity constraints and user authorization Notes 01
Database Management System Architecture Application Queries Schema changes compilers optimizer Query processor Metadata and data dictionary evaluator Buffer manager Transaction Manager Storage manager File system Database and Indices Notes 01
Storage Media and their Properties • Main Memory • costs $100/Mbyte -- reduces every year • ‘volatile’ -- does not survive system failures • random I/O very fast • data can be processed by CPU directly • capacity limited to orders of magnitude lower than what database needs. • Magnetic Disk • costs $0.50/Mbyte -- reduces each year • Non-volatile (except when disk crashes) • random I/O not as fast • CPU cannot directly process data. Needs to be transferred to main memory • Tape • Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival. Notes 01
Databases and Storage Devices • Due to capacity, cost, volatility factors databases traditionally stored in disks. • Data brought to main memory for processing from disks • There are many ways to interface memory with disk resident data • E.g., virtual memory: • VM size limited to max address generated by CPU • Existing VM does not support durability • File system provides a more powerful mapping between memory and disk storage • A bunch of tricks used ensure that high latency of secondary storage does not impact application response time and system throughput • access disks asynchronously with active applications • prefetch data before application needs it • intelligent caching techniques Notes 01