Introduction to Database Systems

Introduction to Database Systems

Today’s session • Some history • Origins of DB • Organisational structure • Organisational information systems • Data storage • DBMS • Data models

What is “Data”? • ANSI definition: • Data • A representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means. • Any representation such as characters or analog quantities to which meaning is or might be assigned. Generally, we perform operations on data or data items to supply some informationabout an entity. • Volatile vs. persistent data • Our concern is primarily with persistent data

PROGRAM 1 DATA SET 1 Data Management PROGRAM 2 DATA SET 2 Data Management PROGRAM 3 DATA SET 3 Data Management Early Data Management - Ancient History • Data are not stored on disk • Programmer defines both logical data structure and physical structure (storage structure, access methods, I/O modes, etc) • One data set per program. High data redundancy.

Problems • There is no persistence. • All data is transient and disappears when the program terminates. • Random access memory (RAM) is expensive and limited • All data may not fit available memory • Programmer productivity low • The programmer has to do a lot of tedious work.

File Processing - Recent (and Current) History • Data are stored in files with interface between programs and files. • Various access methods exist (e.g., sequential, indexed, random) • One file corresponds to one or several programs. PROGRAM 1 Data Management FILE 1 File System Services PROGRAM 2 Redundant Data Data Management PROGRAM 3 FILE 2 Data Management

File System Functions • Mapping between logical files and physical files • Logical files: a file viewed by users and programs. • Data may be viewed as a collection of bytes or as a collection of records (collection of bytes with a particular structure) • Programs manipulate logical files • Physical files: a file as it actually exists on a storage device. • Data usually viewed as a collection of bytes located at a physical address on the device • Operating systems manipulate physical files. • A set of services and an interface (usually called application independent interface – API)

Problems With File Systems • Data are highly redundant • sharing limited and at the file level • Data is unstructured • “flat” files • High maintenance costs • data dependence; accessing data is difficult • ensuring data consistency and controlling access to data • Sharing granularity is very coarse • Difficulties in developing new applications

Database Approach PROGRAM 1 DBMS Query Processor Integrated Database PROGRAM 1 Transaction Mgr … PROGRAM 2

Origins of Databases • c1455: Gutenberg invents printing. Explosive interest in publication of books (analogous with explosive growth of Web in early 1990s) leads to public libraries • Libraries were first to introduce standards for information storage and retrieval • These paper-based systems were extended and enhanced, and filing, indexing and classification schemes were developed • Second World War: accelerated R&D in computing technologies spawned capability to computerise maintenance of records

Computerised Data Storage • Advantages of computerised data storage over paper-based systems include: • ability to store data compactly (e.g Britannica CD) • enhanced data retrieval • ability to access data remotely, e.g. from a mobile workstation, off-site location, or distant branch • ability to share data amongst multiple users with concurrent access • facility to automate regular, speedy back-ups • enhanced data editing • Most significant disadvantage is vulnerability e.g system crash, corrupted data, viruses, hackers, etc.

Information: A Corporate Asset • Information is a vital corporate asset. Without accurate, current, relevant information, mistakes and misjudgements may be made • Data management is an essential capability in the modern business environment / information society • A knowledge organisation is one in which the primary asset is information; its competitive advantage is derived from effective use of documented knowledge. Examples: accounting firms, marketing companies, software houses • Organisational memory extends and amplifies information / knowledge by capturing, organising, disseminating, and reusing it

Organisational Memory Database (logical grouping of related files) File / Table (collection of related records) Record (collection of related fields, bound together as single units) Field (part of a record reserved for a particular data item) Byte (group of eight binary bits - can represent a single character) Organisational Memory

Organisational Memory Organisational memory Data Informed decisions Improved products and services

Attributes of Organisational Memory • Current • Timely • real-time systems are commonplace in modern business environment • Relevant • data is only useful if relevant to task in hand • Shareable • Complete • Accurate and consistent

Attributes of Organisational Memory • Transportable • authorised personnel should have access to data anywhere, anytime • Secure against unauthorised access

Organisational Information Systems • Generally, there is an alignment between business units and core operational systems • Typical core systems are: • Sales & Marketing Department: Customer management system, Order processing system • Operations Unit: Purchasing system, Inventory control system • Finance Department: Accounts payable and receivable systems, Credit Management system • There are interdependencies between these systems; hence the need for an integrated data management approach

Data Storage • Information systems create, read, update and delete data • Data can be stored in conventional files or databases • Filesare collections of similar records • Databasesare collections of interrelated files. • Records can be linked through specified relationships to records in other files

File Information System File File File Conventional Files • In the file environment, data storage is built around the applications that will use the files • Essentially, the file “belongs” to a specific application. This is termed program-data dependence • As applications are developed, customised files are created which may be unusable by other applications

Conventional Files • First attempts at computerised storage of records followed traditional paper-based metaphors (“Flat file” systems) • Flat files were inefficient for data retrieval: it might be necessary to search entire file for a record (which, it may transpire, does not exist). Remedy: index files • Indexing improved data retrieval, but conventional files have other disadvantages: • Program-Data dependence • Proprietary file formats (closed systems) • Poor scalability

Conventional Files • Disadvantages (Cont’d): • Duplicated and redundant data • ambiguity: same thing being referred to by different names in different places • inconsistency: conflicting / unsynchronised data • wasted effort • Separation and isolation of data • data dispersed amongst many files complicates processing • Inflexibility • cumbersome data structures and report layouts • not responsive to ad hoc queries • excessive program maintenance • Development environment • procedural -v- non-procedural (3GL -v- 4GL)

Conventional Files • Advantages: • Historically, conventional files have been faster to process than DBMS applications • As legacy file-based systems become candidates for reengineering, the trend is to replace them with database systems

Information System Information System Information System Information System Database Database Management Systems • A database is a large, integrated collection of data which models a real-world enterprise • A Database Management System (DBMS)is a software package designed to store and manage databases • In a DBMS environment, applications are built around an integrated adaptable database

Database Management Systems • Advantages: • ability to share the same data across multiple applications and systems • data independence • control of redundancy • enforced data integrity • improved data security • uniform data administration • concurrent access • improved backup and recovery facility • flexible data structures

Database Management Systems • Advantages (Cont’d): • databases allow the use of the data in ways not originally specified by the end-users (ad hoc queries) • database definition can be extended without impacting existing programs that use it • economies of scale

Database Management Systems • Disadvantages: • database technology is more complex than file technology • requires more sophisticated hardware and software (DBMS) • DBMS’s can still be slower than file-based systems • database technology requires a significant investment • database administration • operating costs and ongoing maintenance • end-user training • higher impact of system failure

Database Management Systems • Roles in a DBMS environment • Data Administrator • Database Administrator • Database Designer • Application Programmer • End-User

Database Architectures • Hierarchical Data Model • Network Data Model • Relational Data Model • Object-Relational Model

Data Relationships • One-to-One (1:1) • Example: Bank Manager manages one and only one Bank Branch; Bank Branch is managed by one and only one Bank Manager • One-to-Many (1:M) • Example: An Employee works in a designated Department; in any Department, there may be many Employees • Many-to-Many (M:M) • Example: A Student registers for one or more Courses; for any Course, there may be one or more registered Students

Hierarchical Data Model

Hierarchical Data Model • Arose as a solution to a practical problem • Managing millions of parts for the space program (standard “Bill of Materials” problem) • The basic structure is a hierarchy or tree • Parent-child relationship • Relationships are represented by pointers • Restrictions: • Each segment has at most one parent • All relationships are 1:M

Hierarchical Data Model • Problem: how to represent a M:M relationship ? • A hierarchical structure (tree) can only support 1:M relationships. Therefore, to represent M:M, must create multiple hierarchies • … but, this means that records are duplicated in different hierarchies • duplication gives rise to data anomalies • duplication can be eliminated using pointers; must decide in which table to store data, and in which table to store pointer. This is a very awkward means of implementing M:M

Network Data Model

Network Data Model • Objective was to overcome shortcomings of the hierarchical data model, in particular, representation of M:M relationship • Like the hierarchical model, it can be likened to trees; unlike hierarchical model, several trees can share branches • In practice, enjoyed little commercial success • Too complex: suited to use by programmers as opposed to end-users • Overtaken by the relational model • No clear theoretical base

Network Model: Structures • Data item • a field or attribute • Record • a collection of data items • Vectors (repeating structures) are permitted • Relationships are represented by sets • Sets have owners (parent) and members (children) • A member cannot have two parents in the one set • cannot directly represent M:M relationships • Member records of a set can be ordered

Shortcomings of Early Models • Languages of both hierarchical and network are procedural and record-at-a-time • To retrieve data … • you must navigate (find a path) to the required record • issue multiple statements directing the system to traverse that path • It was necessary to issue multiple requests to the DBMS to retrieve a data item • It was necessary to have a detailed understanding of how data was stored and structured • this is contrary to data independence principle

Relational Data Model • Relational model, developed by E. F. Codd, has a strong theoretical basis and overcomes shortcomings of network / hierarchical models • Relational model is non-procedural (What?, not How?), and set-at-a-time • Navigation is automatic • Relations • 2-dimensional data set consisting of N columns (fields / attributes) and M rows (records) • All rows in a relation are unique • A relational database is a set of relations

Relational Model: Structures

Relational Model: Structures • Primary key • A unique identifier of a row in a relation; can be composite • Candidate key • An attribute that could be a primary key • Alternative key • A candidate key that is not selected as the primary key • Foreign key • An attribute of a relation that is the primary key of another relation; can be composite

SQL Relational Algebra SELECT * FROM A WHERE condition A where condition Restrict A [X] SELECT X FROM A Project A times B SELECT * FROM A, B Product SELECT * FROM A UNION SELECT * FROM B A union B Union SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE A.X = B.X AND A.Y=B.Y AND ... A minus B Difference Relational Algebra and SQL • A major strength of the relational model is that it supports SQL, a flexible data retrieval language which facilitates ad hoc queries • Relational algebra is a standard for judging a data retrieval language

Relational Databases • A truly relational database supports • structures (domains and relations) • integrity rules • a manipulation language • The word “relational” is sometimes used too freely; many commercial systems are not fully relational because they do not support domains and integrity rules • E. F. Codd has set forward 12 rules that a database must satisfy before it can be said to be truly relational

Codd’s 12 Rules • Information representation • All data, including metadata (data definition, constraints, user names, etc.), is represented solely and explicitly as values in a table • Guaranteed access • Every value in a database is accessed by specifying a combination of table name, column name, and value of the primary key of the row in which it is stored • No artificial paths, such as linked lists • Systematic treatment of null values • There must be a distinct representation for unknown data, irrespective of data type • Null values are not equivalent to zero or the empty string

Codd’s 12 Rules • Dynamic on-line catalog based on the relational model • database description is represented at the logical level in the same way as ordinary data • thus, only one manipulation language needs to be learned • Comprehensive data sublanguage rule • A relational system may support several languages and various modes of terminal use • However, there must be at least one language that supports data definition, data manipulation, security and integrity constraints, and transaction processing operations • View updating • If the base tables of a view are updated, then the view itself should be updated

Codd’s 12 Rules • High-level insert, update, and delete • The system must support set-at-a-time operations; for example, a set of rows can be deleted by a single statement • Physical data independence • Changes to storage representation or access methods will not affect application programs • Logical data independence • Implementation of changes to base tables will not affect application programs; for example, if a table is restructured or split, applications should be immune to change (views are beneficial here)

Codd’s 12 Rules • Integrity independence • Integrity constraints should be part of a database's definition rather than embedded within application programs • It must be possible to change integrity constraints without affecting any existing application programs • Distribution independence • Introduction of a distributed DBMS or redistributing existing distributed data should have no impact on existing applications • Nonsubversion • It must not be possible to use low-level record-at-a-time interface to by-pass high-level set-at-a-time security or integrity constraints

Codd’s Rule 0 • A relational DBMS must be able to manage databases entirely through its relational capacities • A DBMS is either totally relational or it is not relational

Tasks in the seminars 1 - 3 • In order to do MMDB you need to be able to use SQL – self study sessions with tutor support

Introduction to Database Systems