CSC 485E/CSC 571 Advanced Databases Introduction

CSC 485E/CSC 571 AdvancedDatabases Introduction

What’s a database? • In essence a database is nothing more than a collection of information that exists over a long period of time. • Databases are empowered by a body of knowledge and technology embodied in specialized software called a database management system, or DBMS. • A DBMS is a powerful tool for creating and managinglarge amounts of data efficiently and allowing it to persist over long periods of time, safely. • Among the most complex types of software available.

The database [management] system • Allows users to create new databases and specify their schema(logical structure of the data), using a data-definition language. • Enables users to query and modify the data, using a query language and data-manipulation language. • Supports intelligent storage of very large amounts of data. • Protects data from accident or not proper use. • Example: We can require from the DBMS to not allow the insertion of two different employees with the same SIN. • Allows efficient access to the data for queries and modifications. • Example: Indexes over a specified fields • Controls access to data from many users at once (concurrency), without allowing “bad” interactions that can corrupt the data accidentally. • Recovers from software failures and crashes.

Early DBMS’s (1960’s) • They encouraged the user to view the data much as it was stored. • The chief models were the Hierarchical and Network. • The main characteristic of these models was the possibility of easy jumping or navigating from one object to another through pointers. • E.g. From one employee to his department. • However these models didn’t provide a high-level query language for the data. • So, one had still to write programs for querying the data. • Also they didn’t allow on-line schema modifications.

Relational databases • Codd (1970) • A database system should present the user with a view of data organized as tables (also called relations). • Behind the scene there could be a complex data structure that allows rapid response to a variety of queries. • But the user would not be concerned with the storage structure. • Queries could be expressed in a very high-level language, which greatly increases the efficiency of database programmers. • This high-level query language for relational databases is called: Structured Query Language (SQL)

Database Studies • Design of databases. • What kinds of information go into the database? • How is the information structured? • How do data items connect? • Database programming. • How does one express queries on the database? • How does one use other capabilities of a DBMS, such as transactions or constraints, in an application? • How is database programming combined with conventional programming? • Database system implementation. • How does one build a DBMS, including such matters as query processing, transaction processing and organizing storage for efficient access? We’ll focus on this part

Fictitious Megatron 2006 DBMS • Stores relations as Unix files • Students(name, sid, dept) is stored in the file /home/megatron/students as Smith#123#CS Jones#533#EE • Schemas are stored in /home/megatron/schemas e.g. Students#name#STR#id#INT#dept#STR Depts#name#STR#office#str

Megatron sample session mayne$ megatron WELCOME TO MEGATRON 2006 megaSQL% SELECT * FROM Students; Name id dept ---------------------------------- Smith 123 CS Johnson 522 EE megaSQL%

Megatron sample session II megaSQL% SELECT * FROM Students WHERE id >= 500; Johnson#522#EE megaSQL% quit THANK YOU FOR USING MEGATRON 2006 mayne$

Megatron Implementation • To execute SELECT * FROM R WHERE <COND> • Read file schema to get attributes of R • Check that the <COND> is semantically valid for R • Read file R, • for each line • check condition • if OK, display • If we pipe the result into a file, say T, then add an entry for T in the file /home/megatron/schemas

Megatron Implementation II • To execute SELECT office FROM Students, Dept WHERE Students.name = 'Smith' AND Students.dept = Depts.name; • Read file schema to get attributes and do semantic check. • If Ok, then, for each tuple s in Students for each tuple d in Depts if s and d satisfy the WHERE condition, display the office value from s

What’s wrong with Megatron? • Tuple layout on disk: no flexibility for DB modifications. • Change CS to ECON and the entire file has to be rewritten. • Search Expensive: no indexes; always read entire relation. • Bruteforce query processing. • Did we need to look at all pairs of studentdept tuples? • No buffer manager: everything comes off of disk all the time. • No concurrency control: several users can modify a file at the same time with unpredictable results. • No reliability: can lose data in a crash or leave operations half done. • Little security: file system protection too coarse.

Architecture of a DBMS • The “cylindrical” component contains not only data, but also metadata, i.e.info about the structure of data. • If DBMS is relational, metadata includes: • names of relations, • names of attributes of those relations, and • data types for those attributes (e.g., integer or character string). • A database also maintains indexes for the data. • Indexes are part of the stored data. • Description of which attributes have indexes is part of the metadata.

What will be covered • Secondary Storage Management • Disks • Accelerated access • Handling disk failures • Arranging data on disk • Index Structures • B-Trees, Extensible Hash Tables, etc. • Multidimensional Indexes (for GIS and OLAP) • Query Execution (we concentrate a lot here) • Algorithms for relational operators • Join methods. • Query Compiler (we concentrate a lot here) • Algebraic laws for improving query plans. • Cost based plan selection • Join orders

What will be covered • Concurrency Control • Pessimistic schemes (locking) • Optimistic schemes (timestamps) • Parallel and Distributed Databases • Parallel algorithms on relations • Distributed query processing • Distributed transactions • Google’s Map-Reduce framework • Peer-to-peer distributed search • Data Mining • Frequent-Itemset Mining • Finding similar items • Clustering of large-scale data • Databases and the Internet • Search engines • PageRank • Data streams • Data mining of streams

CSC 485E/CSC 571 Advanced Databases Introduction

CSC 485E/CSC 571 Advanced Databases Introduction

Presentation Transcript

Scientific Databases Lecture: Hubble Space Telescope Science Databases

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases

Logic databases

Introduction to Databases: Relational and XML Models and Languages

RIS622 – Advanced Databases

Advanced Editing of Pathway/Genome Databases

EBSCO Publishing

Introduction to Databases 2

Relational Databases

Databases for Bioinformatics A very partial introduction

Databases and the Web

ITN270 Advanced Internet Databases

CS1315: Introduction to Media Computation

CS1315: Introduction to Media Computation

Databases

Introduction to Relational Databases

Databases in Bioinformatics

Advanced Databases The ODMG Standard for Object Databases

PubMed Advanced: Linking PubMed to NCBI Genetics Databases

Object Oriented Databases - Overview

HAPTER 4

Chapter 22: Distributed Databases