560 likes | 678 Views
CS 319: Theory of Databases: C1. Dr. Alexandra I. Cristea http://www.dcs.warwick.ac.uk/~acristea/. Lecturers. Alexandra I. Cristea Hugh Darwen: hughdarwen@gmail.com Other invited talks?: TBA. Schedule. Usual: Mo 16-17 Wed 9-10 Thu 16-17 Exceptions: week 2: 11.10.06: no course
E N D
CS 319: Theory of Databases: C1 Dr. Alexandra I. Cristea http://www.dcs.warwick.ac.uk/~acristea/
Lecturers • Alexandra I. Cristea • Hugh Darwen: hughdarwen@gmail.com • Other invited talks?: TBA
Schedule • Usual: • Mo 16-17 • Wed 9-10 • Thu 16-17 • Exceptions: • week 2: 11.10.06: no course • week 3: 16.10.06: 2 hours • week 3: 18.10.06: no course • Others: TBA: check forum, website, course
Slides • Thanks to: • Dr. Paul Goldberg: • http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/cs319/cs319index.html • Dr. Meurig Beynon: • http://www.dcs.warwick.ac.uk/people/staff/Meurig.Beynon/ • Dr. Ad Aerts: • http://wwwis.win.tue.nl/~aaerts/ • Prof. Dr. Paul De Bra: • http://wwwis.win.tue.nl/~debra/ • Others: mentioned directly
Contact • Forum: http://forums.warwick.ac.uk/wf/browse/category.jsp?cat=24 • IF (and ONLY IF) a question is personal, you might address it to acristea@dcs.warwick.ac.uk • FORMAT: subject of email should contain ‘CS319’ and topic of the email (otherwise it will be filtered out)
Course site(s): • Current: • http://www.dcs.warwick.ac.uk/~acristea/courses/CS319/ • Will contain current slides, as taught at the course • Will contain notifications: check BEFORE & AFTER the course • Official: • http://www.dcs.warwick.ac.uk/undergraduate/modules/cs319.html • Past: • http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/cs319/cs319index.html
(provisionary) Content: Topics • Generalities DB • Integrity constraints (FD revisited) • Temporal Data • Relational Algebra (revisited) • Query optimisation • Tuple calculus • Domain calculus • Query equivalence • LLJ, DP and applications • The Askew Wall • Datalog
Books • Korth and Silberschatz, Database System Concepts, McGraw-Hill,1991. • Ullman J D, Principles of Database Systems (Vols 1 and 2), Computer Science Press,1988.
Purpose of this course • More in-depth information on the theory of databases • How (some of) the existing db languages fit in the theory (and how they don’t)
Overlaps and sequencing • Form: optional • Prerequisites • CS252: Fundamentals of Database Systems • Optional: CS253: Topics in Database Systems • (actual previous CS233: Database Systems)
Organization of the course • 15 CATS • CS, CSE, CBS, Mathematics • ~ 30 1-hour lectures • Exam at the end: 3 hours • Rules of the game: • Read also comments on the slides. • Presence is optional, but beware: slides-only are NOT ENOUGH to learn from for the exam; you need to participate, take your own notes, so self-study!
Scope of CS 319 • Expressive power of db query languages • Algorithms for the computational problems • Limitations to classical relational theory • A central contribution of theory is to say what cannot be done, not just what can. • Consider how this observation is relevant to the above topics.
A theme of CS319 • How do theory and computing practice relate to specific reference to databases? • Key ideas: • Theory: based on Codd's relational theory. • There is an excellent correspondence between relational theory and practical database application of a certain kind. • Relational databases can be seen as a precursor of 2 principal kinds of computer application: • environments for end-user programming and • computer-based models of real-world state.
DB generalities : What is a database? • Chris Date: • Database = computer-based record keeping system • R.W. Engles: "A Tutorial on DB Organisation" (1974) • Db = collection of stored operational data used by the applications system of a particular enterprise • enterprise: hospital, university, bank, company etc • operational data: • data on products, accounts, patients etc • typically persistent cf conventional program IO data
DB generalities : Why use a db? • Case-study: Banking (Korth &Silberschatz Chap. 1) • ? How to meet needs using a traditional file-processing system supported by a conventional OS • Files: permanent records of customers, accounts • Applications programs (APs): enable user to modify files • to credit or debit an account • to add a new account • to find the balance in an account • to generate monthly statements • APs written by systems programmers as required • new requirements new files + new programs
Original context for data modelling 1 • 1970s style applications • unsophisticated computer users • batch mode interaction • modest response times • no visualisation or GUI • modest expectations for ease-of-use • programming perceived as technical • simple infrastructure and environment • no PC, web etc • no live feeds of data • textual interaction the norm
Original context for data modelling 2 • 1970s style applications • Business context • simple business model, limited automation, access etc • low volume of data • not initially distributed • Computing context • existing/emerging DB proposals unconvincing • computers not very powerful • human and computing resources very expensive
Summary of issues for data management • Data redundancy and inconsistency • Difficulty in accessing data • Data isolation • Concurrent access anomalies • Security problems • Integrity problems
DB generalities : +’s and –’s of DB use • Conventional file systems have certain characteristics • we will review the key issues for data management: + indicates a positive impact of using a database – indicates a potentially negative impact of using a DB
Data redundancy & inconsistency • programmer uses ≠ format file, developed at ≠ stage in history of enterprise => data duplicated: + in a DB rationalise and standardise data [rationalise: shared source for data] … rationalise doesn't necessarily mean centralise
Data redundancy & inconsistency –compromises are needed • where users suit themselves => efficient results perfect data organisation to suit all users – duplication: insurance against info loss
Difficulty in accessing data • unforeseen requests, new functionality • new programs, possibly new data structures + in a DB, simplify access & manipulation by intelligent organisation of data cf. modelling approach to requirements e.g. in use of UML in OOSE
Data isolation • data to retrieve from many sources in APs • + in DB, hide source physical data : higher level of abstraction – automation: less human interaction with data • risk of corrupted data
Concurrent access anomalies • would like multiple access (efficiency & faster response) e.g. simultaneous withdrawal +concurrency can't be managed without a form of overall control
Security problems • to restrict access to (un)authorised users for confidential info + security needs a form of overall control – issues: where should the control be? inside or outside computer system * e.g., non-trivial problem to determine what can be inferred from responses to queries that aren't explicit
Integrity problems • integrity constraints • may arise dynamically: • difficult to modify programs to cope; • hard to guarantee if data stored in ≠ files + automated management demands a form of overall control – automation reduces scope for human intervention / interpretation
Conclusions: Issues for data management • For many commercial applications, a good solution is offered by a database management system (DBMS). • A DBMS is an unconventional OS operating over a structured file system.
Motivating the DBMS concept • devise an abstract model of the entire corpus of operational data that simplifies the data processing activity, so that • simple queries can be handled without writing new application programs • if APs => accessing & manipulating operational data consistently and efficiently is greatly simplified
The ingredients of a database • Data • integrated • shared • (possibly distributed) • Hardware • storage • Software • database management system: DBMS • protects users from hardware level detail • serves the needs of all users
DB Users • end-user: • non-specialist accessing data via a query language • naïve user accessing data via a special-purpose interface performs data retrieval and update (extend / modify) • applications programmer: • writes programs that use the DB by embedding queries to the DB in a HLL • develops interfaces for the naïve user
DB Users • Database Administrator (DBA): • responsible for overall control • decides what data is to be stored • designs the conceptual scheme used to represent the operational data • implements authorisation checks • decides strategy for backup and recovery • monitors performance • oversees modification to suit user requirements
Data abstraction in a db • addresses issues of design, use, management and implementation • Data model describes (formally) 3 different levels of abstraction: • physical level • conceptual level • view level
3 Levels • physical level: • how is the data actually represented in the hardware? • bits, bytes • conceptual level: • what meaningful relationships are expressed by the physical data? • entities, and relationships between entities • view level: • what particular relationships are required by users? • more abstract because partial typically very high-level knowledge constitutes the view
Illustrating data abstraction: • DB w. date of birth of a client (bit string). • senior citizens?: clients aged > 65 • Representations at ≠ abstraction levels: • Conceptual: date of birth • Physical: bit string • View: age (not stored in DB!!)
Data abstraction in a database DESIGN & MANAGEMENT USE IMPLEMENTATION
Role of abstraction (1) • Internal & external translation schemas: * protect conceptual model from change • when physical organisation changes / new views are required * protect user from a need to change views * protect altering physical organisation • if conceptual model is modified
Role of data abstraction (2) • physical data independence: protecting conceptual model from change when physical organisation changes • logical data independence: protecting user from need to change views when conceptual model changes
Characteristics of electronic data 1970 • “Abstract model of the entire corpus of operational data” • Demands of the abstract model in 1970 quite low … • small volumes of data, modest performance • limited levels of volatility and automation tolerated • Today different, BUT • (subject to viewing human agency as a metaphor for any agency, ) • key issues of a classical database are still vital • Any DB modelling paradigm must handle 70s problems
db models • 2 principal kinds of abstract data model: • object-based models • record-based models • earliest • reflects file system culture they displaced
Object-based models • main models: • entity-relationship models • object-oriented data models • Others: semantic & functional data models. • E-R model widely used to model data abstractly • OO model gaining acceptance in practice: effectively represents data + operations on it.
Record-based Logical Models • Used at the conceptual and view levels. Specify both • overall logical structure of the database • higher-level description of the implementation. • Record-based: uses records in fixed-format of several types. simplifies implementation <> trend towards richness and variety in structures used to implement OODBs
Varieties of record-based logical model • hierarchical model • records & links organised as a family of trees • network model • records & links organised as a family of graphs • relational model • uses tables to record relationships between data
Physical Data Models • not our primary concern in this module. • Relevant issues: • are data tables stored using hashing? • how are data tables indexed? • how are entries in data tables encoded and ordered? • what algorithms are used to retrieve and update?
Classical database features • Instances and Schemes • State of DB changes over time: structure vs. current state. • overall design of DB = database schema • current content of DB = instance of the DB • Useful analogy with procedural variables: • database schema type definition for variable • instance of database value of the variable
Data abstraction and schemas • physical scheme at lowest level • conceptual scheme at intermediate level • several sub-schemes (possibly user-defined) at highest level (views of the DB)
Data Definition Language (DDL) • for database schema • compiling DDL: Data Dictionary • the storage & access methods: specified in storage & definition language Implementation details usually hidden from users
Data Manipulation Language (DML) • data manipulation: accessing DB to retrieve, insert, delete, or modify data • data retrieval • most common • "querying the DB" • retrieval component of DML = query language (abusively: ‘query language’ ~ synonym DML)
Varieties of Data Manipulation Language • There is a tension between • efficiency at physical level • intelligibility / ease of use at higher level • Have both procedural and non-procedural DMLs • procedural: requires knowledge of data implementation • non-procedural: need only specify what data is needed
Data Manipulation Languages (DML) for typical data models • Procedural: • object-based, hierarchical, network models • user explicit responsibility: optimising queries, • needs knowledge of data organisation • Non-procedural: • relational models • formulate queries without above, • implementation has to be optimised