1 / 82

ICS 224: Database Management Systems Spring 2011

This course explores database management systems with a focus on sentient spaces and new data models. Topics include data quality, new computing architectures, data privacy, and a walk down the history of data models.

brianb
Download Presentation

ICS 224: Database Management Systems Spring 2011

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ICS 224: Database Management Systems Spring 2011 Professor Sharad Mehrotra Information and Computer Science Department University of California, Irvine

  2. Course General Info • URL: http://www.ics.uci.edu/~cs224/ • All course info will be posted online • Lecture times: Tue-Thurs 5 – 6.30 • Instructor: Sharad Mehrotra, BH 2082, sharad@ics.uci.edu • Office Hours: on request Notes 01

  3. Prerequisites • Basic Data Management Concepts: • DB design, relational model, SQL, database programming • CS 122 or equivalent • Database system implementation • Indexing, query optimization, query processing, storage management, etc. • ICS 222 or equivalent • Basic Computer Science Concepts: • Depth-first search, directed/undirected graphs, “big O” notation, computational complexity, NP completeness … Notes 01

  4. Course Requirements • Class Participation: 50% • Attendance, presentations, comments, interaction, enthusiasm, etc. • Class Projects: 50% • Implementation Oriented: • Take a idea/topic, identify a project, get it okayed by instructor, develop a demonstration • Survey of an area • In depth survey in the style of computing survey articles. Provide your own perspective in a subarea. • MUST commit to project at end of 2nd week. Notes 01

  5. Class Structure • Each week we will • Pick a topic • identify 1 paper per student/group of 2 students • 2 papers as lead papers for presentation (one for each class), others presented as short presentations • Each week • Start with overview • Lead paper presentation • short presentation of other papers (main idea) • Discussions Notes 01

  6. This course … Most important ideas in data management (instructor’s pick) But with the eye towards an end application … Sentient spaces Notes 01

  7. Sentient Spaces … • Spaces in which sensors are used to capture the dynamic evolving state which is then analyzed for implementing adaptations. • Numerous examples … • intelligent transportation systems • reconnaissance • surveillance systems • smart buildings • smart grid ...

  8. Query Example:Smart Video Surveillance Query Analysis Event Database CS Building in UC Irvine Semantic Extraction Surveillance Video Database Video collection

  9. Implications of Sentient Space focus .. • Class focuses on topics which you might need to know if you wanted to explore application in sentient space … • Projects should target something about sentient spaces … • E.g., data cleaning of sentient data, data model to represent sentient spaces, … Notes 01

  10. Data Models (2 weeks) • Representing time - TSQL2 • Representing space • Querying streaming data – CQL, ASQL • Semi-structured data –OEM, Lore Notes 01

  11. New Ideas in Storage & Indexing (2 weeks) • New storage models • Key-Value store • Bigtable • Column Stores • New database system architecture • Data outsourcing • Multitenant databases • New Indexing techniques • Correlation maps Notes 01

  12. Data Quality (2 weeks) • Data quality issues • Inaccuracy, incompleteness, ambiguity, errors, … • Two aspects: • Techniques to improve quality • Exploiting contextual knowledge, issues of efficiency • Techniques to tolerate poor quality of data in applications. Notes 01

  13. New Computing Architecture (2 weeks) • Map Reduce framework • Hive • Pig latin • Join processing • HadoopDB • Hyrax? Notes 01

  14. Data Privacy (2 weeks) • Use cases • Data publishing, queries, sharing, data outsourcing. • Diverse criteria • Differential privacy, Anonymity, l-diversity, .. • Mechanisms to implement Notes 01

  15. A walk down the history of data models … Two papers (MUST READ) Inclusion of New types in relational databases, Stonebraker Postgrest Next Generation databsase, Stonebraker.

  16. The Paleolithic Period … • There were no general purpose tools for managing large volumes of data… • OS provided resource management • Data was stored in files • Applications performed data management functionalities • Fault-tolerance • Concurrency control • Reliability • Optimizations • … • Such functionalities had to be re-implemented for each application Notes 01

  17. The Neolithic Period… • Early file systems evolve into general-purpose data management tools. • DBMS Goals: • Efficiency and scalability (faster than files) • Management of large heterogeneous types of structured data • High reliability • Information sharing (multiple users) • DBMS Users: • E-commerce companies, banks, airlines, transportation companies, corporate databases, government agencies, … • Anyone you can think of! Notes 01

  18. The Dark Ages …. • Network & hierarchical data models • Resulted in data spaghetti • Applications needed to chase pointers • There was little data abstraction or separation of concerns • little difference between physical data representation and logical data representation • optimization was entirely left to application writers • There were no clean data management languages • Unless you are a Cobol fan! Notes 01

  19. The Relational Era.. • Relational model proposed by Codd • Everything is a relation • Query consists of algebraic composition of a few powerful operators • Equivalent to a first-order relational calculus • Primary features • Simple clean data representation • solid mathematical basis • data abstraction • Users did not need to be concerned about how data is stored physically • simple declarative query language • User’s specify what to compute not how to do it. • optimization by the system Notes 01

  20. Data Wars (1) • Codasyl versus relational debates began… • Heated arguments during early SIGMODS • Codasyl: relational model is too simple, applications built using it will never scale in performance. • Relational: network/hierarchical models have no formal basis, are too complex, and unmanageable as application complexity increases. • Relational model found many supporters • Specially at universities • Its simplicity was enticing Notes 01

  21. Data Wars (2) • Many projects started off trying to implement a relational DBMS • System R @ IBM Almaden • Ingres @ Berkeley • These early systems led to the technologies that drive modern data management • Early prototypes became products • DB2 & Ingres • Principle designers from both the System R teams & Ingres left to start companies • Oracle, Sybase • Early relational companies went door to door converting industry to the relational model • Industry got hooked on to the simplicity of writing complex applications in relational model • Boeing among the first converts Notes 01

  22. Pointer’s Strike Back… Application data structures • Complex objects in emerging DBMS applications cannot be effectively represented as records in relational model. • Representing information in RDBMSs requires complex and inefficient conversion into and from the relational model to the application programming language • ODBMSs provide a direct representation of objects to DBMSs overcoming the impedance mismatch problem Copy and translation Transparent ODBMS data transfer Relational representation RDBMS Notes 01

  23. Object Model • Object: • observable entity in the world being modeled • similar to concept to entity in the E/R model • An object consists of: • attributes: properties built in from primitive types • relationships: properties whose type is a reference to some other object or a collection of references • methods: functions that may be applied to the object. Notes 01

  24. Object Oriented Databases • Evolved as persistent Object Oriented Programming Languages: • Start with an OO language (e.g., C++, Java, SMALLTALK) which has a rich type system • Add persistence to the objects in programming language where persistent objects stored in databases Notes 01

  25. Persistent Programming Languages • Single programming language for application and data management • Update to persistent variable results in automatic update to database. • Persistent data could be types such as sets and lists and arrays. • Application can follow pointers (OID) to navigate through data. Notes 01

  26. Persistence • Objects created may have different lifetimes: • transient: allocated memory managed by the programming language run-time system. • E.g., local variables in procedures have a lifetime of a procedure execution • global variables have a lifetime of a program execution • persistent: allocated memory and stored managed by ODBMS runtime system. • Classes are declared to be persistence-capable or transient. • Different languages have different mechanisms to make objects persistent: • creation time: Object declared persistent at creation time (e.g., in C++ binding) (class must be persistent-capable) • persistence by reachability: object is persistent if it can be reached from a persistent object (e.g., in Java binding) (class must be persistent-capable). Notes 01

  27. Persistent Object-Oriented Programming Languages • Persistent objects are stored in the database and accessed from the programming language. • Single programming language for applications as well as data management. • Avoid having to translate data to and from application programming language and DBMS • efficient implementation • less code • Programmer does not need to write explicit code to fetch data to and from database • persistent objects to programmer looks exactly the same as transient objects. • System automatically brings the objects to and from memory to storage device. (pointer swizzling). Notes 01

  28. Approaches To Persistent Programming • Persistent Virtual Memory • disk representation and memory representation of data is identical. • No cost to translate data from one representation to another— efficient! • DB size limited to address space 32bit processor  2^32 byte addressability (4 GBytes) • Differentiating persistent objects and non-persistent objects is difficult. • Difficult to optimize disk layout and locality of access. • Example system using approach: OBJECT STORE. Notes 01

  29. Approaches To Persistent Programming Languages • Store persistent objects in files • Objects brought to memory on demand. • Implementation of OID complex since pointers do not suffice in general. • If object in memory pointer can be used for OID • if object on disk a disk address still not good as OID since storage can be reorganized. A separate mechanism needed. • Pointer swizzling for efficiency. Notes 01

  30. Challenges In Building Persistent Languages • Efficient caching of objects in client address space. • Cache coherence. • In OODB data migrates to clients unlike relational client server systems where query migrates to server. • Given a large number of clients each with the cache of objects ensuring consistency of object across multiple clients is a challenge. Notes 01

  31. Disadvantages of ODBMS Approach • Low protection • since persistent objects manipulated from applications directly, more changes that errors in applications can violate data integrity. • Non-declarative interface: • difficult to optimize queries • difficult to express queries • But ….. • Most ODBMSs offer a declarative query language OQL to overcome the problem. • OQL is very similar to SQL and can be optimized effectively. • OQL can be invoked from inside ODBMS programming language. • Objects can be manipulated both within OQL and programming language without explicitly transferring values between the two languages. • OQL embedding maintains simplicity of ODBMS programming language interface and yet provides declarative access. Notes 01

  32. The Return of the Relations … POSTGRES • Relational model evolved into ORDBMSs that include “best of” object-oriented concepts • Amongst the first ORDBMS prototype built @ BerkeleyPOSTGRES Illustra Informix IUS • Has had major impact on major commercial DBMS which have all migrated to ORDBMS model. • SQL3 supported by modern databases adapted many of the concepts developed in Postgres commercialized bought by Notes 01

  33. POSTGRES — Combinations • Introduced object orientation into relation DBMSs. • Fundamental Concepts. • Each record has an OID. • Access to data though: • query language POSTQUEL. • navigation through OIDs. • Classes: • Inheritance: • Types: rich set of types available for columns. • Functions: can be called within POSTQUEL. Notes 01

  34. Classes And Inheritance • Class analogous to relation • User can create new classcreate Emp (name = c12, salary = float, age = int) • Classes can inherit from otherscreate Salesman (quota = float) inherits Emp • Multiple inheritance permitted. If new class causes ambiguity it is not created. • Classes: • real: base classes or relations • derived: views • version: maintained differentially compared to parent class Notes 01

  35. Types In POSTGRES • Standard base types • float, int, charac. Strings, etc. • Abstract data type (ADT) facility to create new base types e.g.; create type point (x = int, y = int)create type polygon • ADT’s can be used in class definitions.Create Dept( dname = c10, mgr = c12, floorspace = polygon mailstop = point ) mailstop Notes 01

  36. Functions In POSTGRES • Three types: (1) C functions (2) Operators (3) POSTQUEL functions • C-functions • any C-function over base types or composite type retrieve (Dept. name) where area (Dept. floorspace) > 500 retrieve (Emp. name) where overpaid (Emp) Function over a class or method Notes 01

  37. Operators • Arbit C-functions are not optimized by query optimizers. • Special functions - operators can utilize indexes for their evaluation. • Operator: function with 1 or 2 operand retrieve (Dept. name) where Dept. floor space-AGT “(0,0), (1,1), (0,2)” • Index (e.g.; B-tree) defined properly can be used to speed up evaluation of operators such as AGT. Area Greater Than Notes 01

  38. Other Features Of POSTGRES • Allowed creation of new indices by user. • To an extent pioneered the approach of extensible database technology which is prevalent with vendors today • Supported transitive closure in query. retrieve* into ans (parent. older) from a in answer where. Parent. younger = “John” or parent. younger = a. older • Supported rules Notes 01

  39. POSTQUEL Functions • Any collection of commands in POSTQUEL. • query = POSTQUEL function.define function high-pay returns Emp as retrieve (Emp. all) where Emp. salary > 50k • POSTQUEL function with parameters.define function Sal-lookup (c12) returns float as retrieve (Emp. Salary) where Emp. name = $1 • Usage of POSTQUEL function retrieve Emp. name where Emp. Salary = Sal-lookup (“Joe”) Notes 01

  40. Composite Types In POSTGRES • POSTQUEL: • Composite types accessed via path expressions, using nested dot notation.remove (Emp  mgr  age) where (Emp  name = ‘joe’) • Prevents having to specify a join. Notes 01

  41. Composite Types In POSTGRES • Attributes can have a class name as a type resulting in complex objects with structure.Create Emp ( name = c12, salary = float [c12], age = int, mgr = Emp, coworker = Emp ) • A set type that can hold elements of any class.Add to Emp (hobbies = set) Refers to 0 or more references of Emp class. Could be elements of any class Notes 01

  42. Types In POSTGRES • Array type (constructor) crate Emp ( name = c12, salary = float [12], age = int ) • POSTQUEL queryretrieve (Emp  name) where (Emp  salary [4] = 1000) Salary for each month. Array in query usage. Notes 01

  43. Database Technology Matrix Q u r y S u p p o r t Y E S RDBMSs ORDBMSs File System N O OODBMSs Simple Complex Database Types Notes 01

  44. XML & RDF - the new revolution • Just when relational model had driven out object-oriented database technology, WWW led to the proliferation of semi-structured data. • 2 approaches to supporting XML/RDF • Extend relational technology to support XML/RDF • Native XML databases Notes 01

  45. Summary of Evolution of Data Model • The Dark Ages: network & heirarchical models • Victory of simplicity and beauty over data spaghetti: The Relational DBMS: • The pointers strike back -- Object-Orientation, OODBMSs • The return of the relations -- ORDBMS -- took the best of the OO concepts and incorporated them in the relational model. • The current and near future -- support for XML & RDF • The final frontier -- anyone’s guess! Notes 01

  46. Key Data Management Technologies (quick review)…

  47. Key Database Technologies • File Management • provides a file abstraction as a collection of records stored in disk • Index Management and Access Methods • implements techniques for associative access to data • Query Optimization and Processing • given a query and data storage structures, determines an efficient strategy to evaluate the query. • Transaction management • ensures consistency of the database in presence of concurrent transactions and various types of failures • Catalog Management • maintains database schema information • Authorization and Integrity Management • tests for integrity constraints and user authorization Notes 01

  48. Database Management System Architecture Application Queries Schema changes compilers optimizer Query processor Metadata and data dictionary evaluator Buffer manager Transaction Manager Storage manager File system Database and Indices Notes 01

  49. Storage Media and their Properties • Main Memory • costs $100/Mbyte -- reduces every year • ‘volatile’ -- does not survive system failures • random I/O very fast • data can be processed by CPU directly • capacity limited to orders of magnitude lower than what database needs. • Magnetic Disk • costs $0.50/Mbyte -- reduces each year • Non-volatile (except when disk crashes) • random I/O not as fast • CPU cannot directly process data. Needs to be transferred to main memory • Tape • Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival. Notes 01

  50. Databases and Storage Devices • Due to capacity, cost, volatility factors databases traditionally stored in disks. • Data brought to main memory for processing from disks • There are many ways to interface memory with disk resident data • E.g., virtual memory: • VM size limited to max address generated by CPU • Existing VM does not support durability • File system provides a more powerful mapping between memory and disk storage • A bunch of tricks used ensure that high latency of secondary storage does not impact application response time and system throughput • access disks asynchronously with active applications • prefetch data before application needs it • intelligent caching techniques Notes 01

More Related