
Lecture 1: Introduction




  1. Lecture 1: Introduction AnHai Doan CS 511 Fall 05

  2. Welcome to CS 511, Advanced Database Management! Instructor: AnHai Doan, anhai@cs • 2118 Siebel • Office hours: WF 3:15-4:15 (right after each lecture) • Won't have office hour today Home page: google for "cs511 uiuc" Texts and readings: • Hellerstein and Stonebraker: Readings in Database Systems, 4th ed. • Can be ordered online • Supplementary papers (will be linked via schedule)

  3. More Administrivia • TAs • Yoonkyong Lee (also the I2CS TA) • Govind Kabra • Their office hours will be announced on the newsgroup • class.cs511 is very important • all announcements will appear there • Slides • integrated from those of Kevin Chang, Zack Ives, Jeff Naughton • will try to have them posted before each lecture

  4. What to Do if You are Confused? • Admin issues: • ask the newsgroup • ask your team members • ask your TA (I2CS students ask Yoonkyong) • ask me • Academic issues: • same thing

  5. Course Objectives • Study fundamental lessons in relational data management • Examine how to adapt those lessons to other settings • data integration, data mining, IR/Web search • Why? • the painter and the flower girl • Can't learn lessons without knowing gory details

  6. Prerequisites • Need cs 411-equivalent background • first homework will evaluate this • Strong programming skills

  7. Course Format • I will lecture twice a week • Before each lecture you read a paper and send a brief review to the newsgroup (can miss up to 3) • Attending lecture is required (can miss up to 3) • Participating in discussion is required • Homework 1: evaluates your CS 411 background (individual) • Homework 2: programming on DBlife (team) • Programming project: team, on DBlife • Each team presents 1-2 papers in Nov. • Take-home final • At the end, you should be equipped to do research in this field, or to take ideas from databases and apply them to your field

  8. Grading • Participation: 10% • reviews, discussion • attendance (not required for I2CS) • Two homeworks: 20% • Project: 35% • Presentation: 10% • Final exam: 25%

  9. Rough Schedule • September • relational model, 2 homeworks • October • data integration • November • data mining • IR/Web search

  10. For the rest of this lecture: let's talk about the relational model and its lessons. As a sample opinion, see Zack's slides following this.

  11. So What Is This Course About? Not how to build an Oracle-driven Web site… … nor even how to build Oracle…

  12. What Is Unique about Data Management? • It’s been said that databases and data management focus on scalability to huge volumes of data • What is it that makes this possible – and what makes the work interesting if NOT at huge scale? • Why are data management techniques useful in situations where scale isn’t the bottleneck?

  13. The Key Principle: Data Independence • Most methods of programming don’t separate the logical and physical representations of data • The data structures, access methods, etc. are all given via interfaces! • The relational data model was the first model for data that is independent of its data structures and implementation

  14. What Is Data Independence? • Codd points out that previous methods had: • Order dependence • Index dependence • Access path dependence • What might you be able to do in removing those?
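Removing order, index, and access-path dependence means a query keeps working, and keeps returning the same answer, when the physical layout changes underneath it. A minimal sketch of that idea using Python's built-in sqlite3 module (the table and data are made-up examples): adding an index changes only the access path, not the query text or its result.

```python
# Sketch of access-path independence with sqlite3 (hypothetical toy table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, year INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Ann", 2), (2, "Bob", 3), (3, "Cy", 2)])

query = "SELECT name FROM student WHERE year = 2 ORDER BY name"
before = conn.execute(query).fetchall()

# Changing the physical design: the same query now runs via an index scan,
# but neither the query nor its answer changes.
conn.execute("CREATE INDEX idx_year ON student(year)")
after = conn.execute(query).fetchall()
```

A program written against an order- or index-dependent model would have to be rewritten here; a relational query does not.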

  15. The Relational Data Model More than just tables! • True relations: sets of tuples • The only data representation a user/programmer “sees” • Explicit encoding of everything in values Additional integrity constraints • Key constraints, functional dependencies, … General and universal means of encoding everything! • (Semantics are pushed to queries) A secondary concept: views • Define virtual, derived relations that are always “live” • A way of encapsulating, abstracting data
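Two of the slide's points, key constraints and "always live" views, can be shown in a few lines of sqlite3 (schema and data are hypothetical): the key constraint rejects a duplicate tuple, and a view defined over the relation reflects updates made after the view was created.

```python
# Sketch: key constraints and live views in sqlite3 (toy schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (sid INTEGER, cid TEXT, grade TEXT, "
             "PRIMARY KEY (sid, cid))")
conn.execute("INSERT INTO takes VALUES (1, 'cs511', 'A')")

# Key constraint: the same (sid, cid) pair cannot appear twice.
try:
    conn.execute("INSERT INTO takes VALUES (1, 'cs511', 'B')")
    violated = False
except sqlite3.IntegrityError:
    violated = True

# A view is a virtual, derived relation that is always "live":
# it sees tuples inserted after its definition.
conn.execute("CREATE VIEW a_students AS SELECT sid FROM takes WHERE grade = 'A'")
conn.execute("INSERT INTO takes VALUES (2, 'cs411', 'A')")
a_sids = [r[0] for r in conn.execute("SELECT sid FROM a_students ORDER BY sid")]
```

The view never stores data of its own; it is a query re-evaluated over the base relation, which is what makes it a means of encapsulating and abstracting data.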

  16. Constraints and Normalization • Fundamental idea: we don’t want to build semantics into the data model, but we want to be able to encode certain constraints • Functional dependencies, key constraints, foreign-key constraints, multivalued dependencies, join dependencies, etc. • Allows limited data validation, plus opportunities for optimization • The theory of normalization (see CSE 330, CIS 550) makes use of known constraints • Idea: eliminate redundancy, in order to maintain consistency in the presence of updates • (Note that there’s no reason for normalization of data in views!) • Ergo, XML???
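The redundancy argument can be made concrete with a toy example (all names hypothetical): if the functional dependency cid → cname holds, then storing cname in every enrollment tuple repeats the same fact, and normalization would decompose the relation on that dependency.

```python
# Sketch: checking a functional dependency and decomposing on it (toy data).
takes = [  # (sid, cid, cname) -- cname is repeated for every enrollment
    (1, "cs511", "Adv. Databases"),
    (2, "cs511", "Adv. Databases"),
    (3, "cs411", "Databases"),
]

def holds_fd(rows, lhs, rhs):
    """True iff the FD lhs -> rhs holds (lhs/rhs are tuple positions)."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# cid -> cname holds, so cname is redundant in takes; decompose:
course = sorted({(cid, cname) for _, cid, cname in takes})
enroll = sorted({(sid, cid) for sid, cid, _ in takes})
```

After the decomposition, renaming a course touches one tuple in `course` instead of every matching enrollment, which is exactly the update-consistency motivation on the slide.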

  17. Relational Completeness (Plus Extensions): Declarativity What is special about relational query languages that makes them amenable to scalability? • Limited expressiveness – particularly when we consider conjunctive queries (even with recursion) • Guaranteed polytime execution in size of data • Can reason about containment, invert them, etc. • “Magic sets” • (What about XQuery’s Turing-completeness???) • Equivalence between relational calculus and algebra • Calculus → fully declarative, basis of query languages • Algebra → imperative but polytime, basis of runtime systems • Predictability of operations → cost models • Ability to supplement data with auxiliary structures for performance
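The algebra side of the calculus/algebra equivalence can be sketched directly: a conjunctive query, written declaratively in the calculus, evaluates as a composition of select, project, and join operators, each of which runs in time polynomial in the size of the data (relations and schema below are toy examples).

```python
# Sketch: a conjunctive query as relational algebra over sets of tuples.
def select(rel, pred):
    return {t for t in rel if pred(t)}

def project(rel, idxs):
    return {tuple(t[i] for i in idxs) for t in rel}

def join(r, s, ri, si):
    # Equijoin on r[ri] = s[si]; a nested loop here, polytime in the data.
    return {t + u for t in r for u in s if t[ri] == u[si]}

student = {(1, "Ann"), (2, "Bob")}        # (sid, name)
takes = {(1, "cs511"), (2, "cs411"), (1, "cs411")}  # (sid, cid)

# Calculus: { name | exists sid. student(sid, name) AND takes(sid, 'cs511') }
# Algebra:  project_name( student |x| select_cid='cs511'(takes) )
answer = project(
    join(student, select(takes, lambda t: t[1] == "cs511"), 0, 0),
    [1],
)
```

The query author wrote no loops and chose no operator order; an optimizer is free to reorder these operators (e.g. pushing the selection below the join, as done here), which is what makes declarativity the lever for scalability.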

  18. Concurrency and Reliability (Generally requires full control) • Another key element of databases – ACID properties • Atomicity, Consistency, Isolation, Durability • Transaction: an atomic sequence of database actions (read/write) on data items (e.g. a calendar entry) • Recoverability via a log: keeping track of all actions carried out by the database • How do distributed systems, Web services, service-oriented architectures, and the like affect these properties?
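Atomicity in particular is easy to demonstrate with sqlite3 (account names and amounts are made up): a transfer is two writes that must happen together, and if the transaction aborts partway through, rollback undoes the write that already happened.

```python
# Sketch: atomicity of a two-write transaction in sqlite3 (toy accounts).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

try:
    # Debit one account...
    conn.execute("UPDATE account SET balance = balance - 80 WHERE owner = 'a'")
    # ...the matching credit to 'b' would go here; simulate a crash before it:
    raise RuntimeError("crash before the credit is applied")
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = dict(conn.execute("SELECT owner, balance FROM account"))
```

In a real DBMS the same guarantee survives a process crash, because the log (per the slide) records enough to undo uncommitted actions at recovery time.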

  19. Other Data Models • Concepts from the relational data model have been adapted to form object-oriented data models (with classes and subclasses), XML models, etc. • But doesn’t this result in some loss of logical-physical independence? • GMAP and answering queries using views?

  20. What Is a Data Management System? • Of course, there are traditional databases • The focus of most work in the past 25 years • “Tight loops” due to locally controlled data • Indexing, transactions, concurrency, recovery, optimization • But…

  21. 80% of the World’s Data is Not in Databases! Examples: • Scientific data (large images, complex programs that analyze the data) • Personal data • WWW and email (some of it is stored in something resembling a DBMS) • Network traffic logs • Sensor data • Are there benefits to declarative techniques and data independence in tackling these issues? • XML is a great way to make this data available • Also need to deal with data we don’t control and can’t guarantee consistency over

  22. An Example of Data Management with Heterogeneity: Data Integration • A layer above heterogeneous sources, to combine them under a unified logical abstraction • Some of these are databases over which we have no control • Some must be accessed in special ways • Data integration system translates queries over the mediated schema to the languages of the sources; converts answers to the mediated schema • (Diagram: XML sources feeding a unified “Mediated Schema”)
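A toy mediator makes the architecture tangible (every name here is hypothetical, not a real system): queries are posed against a mediated schema Person(name, email); each source answers in its own native format, and a per-source wrapper converts the answers before the mediator unions them.

```python
# Sketch of a mediator over two heterogeneous sources (all names hypothetical).
def source_csv():
    # Source 1: already shaped like the mediated schema (name, email).
    return [("Ann", "ann@a.edu"), ("Bob", "bob@b.org")]

def source_ldap():
    # Source 2: a different native format, accessed "in a special way".
    return [{"cn": "Cy", "mail": "cy@c.com"}]

def wrap_ldap(entries):
    # Wrapper: convert source-2 answers into the mediated schema.
    return [(e["cn"], e["mail"]) for e in entries]

def mediated_query(name_pred):
    # The mediator translates the query to each source, unions converted
    # answers, and applies the remaining condition over the mediated schema.
    rows = source_csv() + wrap_ldap(source_ldap())
    return [r for r in rows if name_pred(r[0])]

everyone = mediated_query(lambda n: True)
```

The user never sees the sources' formats; that is the "unified logical abstraction" the slide describes, with data independence re-applied one level up.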

  23. Other Interesting Points • Data streams and sensor data: how do we process infinite amounts of data? • Peer-to-peer architectures: what’s the best way of finding data here? • Personal information management: can we use integration-style concepts and a bit of AI to manage associations between our data? • Web search: what’s the back-end behind Google? • Semantic Web: how do we semantically interrelate data to build a better Web?

  24. Layers of a Typical Data Management System (Simplification!) • Diagram, top to bottom, with data and requests flowing between adjacent layers: API/GUI → Query Optimizer (consults Stats and the Catalog of Schemas; emits a physical plan) → Execution Engine (logging, recovery) → Access Methods → Buffer Mgr → Pages / physical retrieval from the Source • Red = logical, Blue = physical

  25. Query Answering in a Data Management System • Based on declarative query languages • Based on restricted first-order logic expressions over relations • Not procedural – defines constraints on the output • Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine • Data may be local or remote • Data may be heterogeneous or homogeneous • Data sources may have different interfaces, access methods, etc. • Most common query languages: • SQL (based on tuple relational calculus) • Datalog (based on domain relational calculus, plus fixpoint) • XQuery (functional language; has an XML calculus core)

  26. Processing the Query • The query flows through the stack: Web Server / UI / etc → Optimizer → Execution Engine → Storage Subsystem • SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = COURSE.cid • Plan sketch from the slide: a hash join and a merge join (by cid) combining STUDENT, Takes, and COURSE
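One of the physical operators the optimizer would pick for this plan, a hash join, is short enough to sketch in full (tuples and schema below are toy examples, not the slide's actual data): build a hash table on the smaller input, then probe it once per tuple of the other input.

```python
# Sketch: an in-memory hash join, one physical operator in a query plan.
from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    # Build phase: hash the (ideally smaller) build input on its join key.
    table = defaultdict(list)
    for t in build:
        table[t[build_key]].append(t)
    # Probe phase: one hash lookup per probe tuple -- linear in both inputs.
    out = []
    for u in probe:
        for t in table.get(u[probe_key], []):
            out.append(t + u)
    return out

student = [(1, "Ann"), (2, "Bob")]                   # (sid, name)
takes = [(1, "cs511"), (2, "cs411"), (1, "cs411")]   # (sid, cid)
joined = hash_join(student, takes, 0, 0)             # STUDENT.sid = Takes.sid
```

The declarative query says nothing about hashing; choosing hash join versus merge join per edge of the plan is exactly the optimizer's job, guided by cost models and statistics.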

  27. DBMSs in the Real World • Big, mature relational databases • IBM, Oracle, Microsoft • “Middleware” above these • SAP, PeopleSoft, dozens of special-purpose apps • “Application servers” • Integration and warehousing systems • Current trends: • Web services; XML everywhere • Smarter, self-tuning systems • Stream systems

  28. Our Agenda this Semester • Reading the canonical papers in the data management literature • Some are very systems-y • Some are very experimental • Some are highly algorithmic, complexity-oriented • Gaining an understanding of the principles of building systems to handle declarative queries over large volumes of data
