150 likes | 161 Views
CSC 485E/CSC 571 Advanced Databases Introduction. What’s a database?. In essence a database is nothing more than a collection of information that exists over a long period of time.
E N D
What’s a database? • In essence a database is nothing more than a collection of information that exists over a long period of time. • Databases are empowered by a body of knowledge and technology embodied in specialized software called a database management system, or DBMS. • A DBMS is a powerful tool for creating and managinglarge amounts of data efficiently and allowing it to persist over long periods of time, safely. • Among the most complex types of software available.
The database [management] system • Allows users to create new databases and specify their schema(logical structure of the data), using a data-definition language. • Enables users to query and modify the data, using a query language and data-manipulation language. • Supports intelligent storage of very large amounts of data. • Protects data from accident or not proper use. • Example: We can require from the DBMS to not allow the insertion of two different employees with the same SIN. • Allows efficient access to the data for queries and modifications. • Example: Indexes over a specified fields • Controls access to data from many users at once (concurrency), without allowing “bad” interactions that can corrupt the data accidentally. • Recovers from software failures and crashes.
Early DBMS’s (1960’s) • They encouraged the user to view the data much as it was stored. • The chief models were the Hierarchical and Network. • The main characteristic of these models was the possibility of easy jumping or navigating from one object to another through pointers. • E.g. From one employee to his department. • However these models didn’t provide a high-level query language for the data. • So, one had still to write programs for querying the data. • Also they didn’t allow on-line schema modifications.
Relational databases • Codd (1970) • A database system should present the user with a view of data organized as tables (also called relations). • Behind the scene there could be a complex data structure that allows rapid response to a variety of queries. • But the user would not be concerned with the storage structure. • Queries could be expressed in a very high-level language, which greatly increases the efficiency of database programmers. • This high-level query language for relational databases is called: Structured Query Language (SQL)
Database Studies • Design of databases. • What kinds of information go into the database? • How is the information structured? • How do data items connect? • Database programming. • How does one express queries on the database? • How does one use other capabilities of a DBMS, such as transactions or constraints, in an application? • How is database programming combined with conventional programming? • Database system implementation. • How does one build a DBMS, including such matters as query processing, transaction processing and organizing storage for efficient access? We’ll focus on this part
Fictitious Megatron 2006 DBMS • Stores relations as Unix files • Students(name, sid, dept) is stored in the file /home/megatron/students as Smith#123#CS Jones#533#EE • Schemas are stored in /home/megatron/schemas e.g. Students#name#STR#id#INT#dept#STR Depts#name#STR#office#str
Megatron sample session mayne$ megatron WELCOME TO MEGATRON 2006 megaSQL% SELECT * FROM Students; Name id dept ---------------------------------- Smith 123 CS Johnson 522 EE megaSQL%
Megatron sample session II megaSQL% SELECT * FROM Students WHERE id >= 500; Johnson#522#EE megaSQL% quit THANK YOU FOR USING MEGATRON 2006 mayne$
Megatron Implementation • To execute SELECT * FROM R WHERE <COND> • Read file schema to get attributes of R • Check that the <COND> is semantically valid for R • Read file R, • for each line • check condition • if OK, display • If we pipe the result into a file, say T, then add an entry for T in the file /home/megatron/schemas
Megatron Implementation II • To execute SELECT office FROM Students, Dept WHERE Students.name = 'Smith' AND Students.dept = Depts.name; • Read file schema to get attributes and do semantic check. • If Ok, then, for each tuple s in Students for each tuple d in Depts if s and d satisfy the WHERE condition, display the office value from s
What’s wrong with Megatron? • Tuple layout on disk: no flexibility for DB modifications. • Change CS to ECON and the entire file has to be rewritten. • Search Expensive: no indexes; always read entire relation. • Bruteforce query processing. • Did we need to look at all pairs of studentdept tuples? • No buffer manager: everything comes off of disk all the time. • No concurrency control: several users can modify a file at the same time with unpredictable results. • No reliability: can lose data in a crash or leave operations half done. • Little security: file system protection too coarse.
Architecture of a DBMS • The “cylindrical” component contains not only data, but also metadata, i.e.info about the structure of data. • If DBMS is relational, metadata includes: • names of relations, • names of attributes of those relations, and • data types for those attributes (e.g., integer or character string). • A database also maintains indexes for the data. • Indexes are part of the stored data. • Description of which attributes have indexes is part of the metadata.
What will be covered • Secondary Storage Management • Disks • Accelerated access • Handling disk failures • Arranging data on disk • Index Structures • B-Trees, Extensible Hash Tables, etc. • Multidimensional Indexes (for GIS and OLAP) • Query Execution (we concentrate a lot here) • Algorithms for relational operators • Join methods. • Query Compiler (we concentrate a lot here) • Algebraic laws for improving query plans. • Cost based plan selection • Join orders
What will be covered • Concurrency Control • Pessimistic schemes (locking) • Optimistic schemes (timestamps) • Parallel and Distributed Databases • Parallel algorithms on relations • Distributed query processing • Distributed transactions • Google’s Map-Reduce framework • Peer-to-peer distributed search • Data Mining • Frequent-Itemset Mining • Finding similar items • Clustering of large-scale data • Databases and the Internet • Search engines • PageRank • Data streams • Data mining of streams