880 likes | 891 Views
This course provides an introduction to databases and how to transform data into valuable information. Learn about the relational data model, ER diagrams, schema of tables, SQL, and more.
E N D
Introduction to Databases, CS 3/586Fall 2009 This work, and all other files in this series, are licensed under the Creative Commons Attribution License. To view a copy of this license, visit http://creativecommons.org/licensesor send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California, 94105, USA. These slides have benefitted from the content of slides developed by Raghu Ramakrishnan and Johannes Gehrke, Lois Delcambre and David Maier, all of whom I thank. These slides are best viewed with “slide show”. To print the slides, consider using "File/Print/Properties/Pages per sheet" instead of Power Point's "handouts" mode. Slides will be ready by 6PM each Monday evening. Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386
Table of Contents • Relational Data Model • Relation, row, attributes, instance • Cardinality, Degree, Domain • Keys, Primary and Candidate • Schema, Legal Instance • Finding Candidate & Foreign Keys • ER Diag. Schema of Tables • Database, DBMS • Definitions • Languages • Postgresql • FEC data • SQL • History • NULL values • DDL, DML • SELECT/FROM/WHERE • DISTINCT,ORDER BY,LIKE • Introduction • Translating Data into Information • Three Grand Challenges • Data is widespread • Data is vast • Structured vs Unstructured Data • ER Diagrams • Requirements Analysis • Use Cases • Entities, Attributes, Relationships • Instance, legal instance • Keys, Primary and Candidate • Attributes of Relationship Sets • Ternary Relationship Sets • Enties vs Attributes
Prerequisites • The prerequisites for this course are CS 161 and 250. • This week's class will expect you to know the definitions of these terms from CS250: sets, bags, tuples, relations. • Please review these concepts from the text that you used for CS250 or its equivalent. • The text that we use at PSU is [James Hein, Discrete Structures, Logic and Compatibility, 3rd Edition, Jones and Bartlett, 2009].
Learning Objectives* LO1.1 Write an ER diagram from an RA using UML, including primary keys, cardinality constraints, attributes of relationship sets, and ternary relationship sets LO1.2 Given a schema, find all foreign keys for legal instances of the schema. LO1.3 Translate an ER diagram into a schema of tables, preserving all information. *An asterisk in the title means that there is info on the notes page (View/Notes page). This info is either what I say in class, or answers to questions posed on the slides. I recommend that you NOT print the notes page in the copy of the slides that you bring to class, but look at the notes page if you miss class or if your notes are incomplete.
Information • Digital Information is precious • Modern business, culture and society could not exist in its present form without digital information. • Think about how often you use digital information in your life. • If most modern businesses lost their current and backup information they would be candidates for bankruptcy. • But nature gives us data, not information. • Data: disorganized, slow to access, understand and visualize
Data vs. Information • Data has its place. It’s not useless. Data Information Atlas Yahoo Maps, Mapquest Encyclopedia Wikipedia, Google Search Brick and Mortar Store Online Store, e.g., Amazon Printout Data Access Form • In other CS courses, you learn how to manipulate (in-memory) data, using programs. • In CS386/586 you will learn to transform data into information and manage that information.
Three Challenges of Data* • Data is widespread – accessible by the Internet • Covered in CS4/594, Internetworking Protocols • Data is vast - Much too large to fit in memory • Library of congress: 20 terabytes(see notes view for units) • Photos uploaded to Facebook each month: 20 terabytes • Amazon.com: 42 terabytes • Choicepoint: 250 terabytes • Data processed by Google’s servers each hour: 1 Petabyte • In this class you’ll learn how to • Transform data into information • Manage disk-based information • Manage information in parallel
Database Challenges 3: Data Structure* • Some data, e.g., business data, has clearly defined attributes. In this class you’ll learn how to transform such structured data into information by the Databasics Anonymous 7-step method**: • Organizing structured data into an ER Diagram, Ch. 2 • Transforming an ER diagram into a Schema of Tables, Ch. 3 • Eliminating anomalies from those tables (normalization), Ch. 21 • Structuring those tables efficiently (physical design), Ch. 22 • Managing those tables using the intergalactic standard language (SQL), Ch. 5 • The DBMS manages the tables internally using Relational Algebra, Ch. 4 • Protecting those tables during concurrent use (ACID properties), Ch. 16 • Accessing those tables through a web-based interface (some scripting language) **I chose this name out of respect for Alchoholics Anonymous, which has saved innumerable lives
Data Challenges 3: Structure (ctd) • Some data’s attributes are not clearly defined, e.g., documents, web pages • The attributes are words in the data or tags • In this class you’ll learn how to transform unstructured data into information by: • Building data structures to make retrieval efficient (inverted indexes) • Ranking retrieval results (Google’s page rank algorithm) • Measuring the effectiveness of querying unstructured data (recall)
A Brief Database History* 1960s: Network and hierarchical models of structured data 1970s: E.F. Codd's relational model, implemented in Ingres and System R 1976: P. Chen's ER model 1980s: Relational systems/SQL dominate, PC DBMSs become popular 1990s: Client-server architecture, object-oriented model, Excel, WWW, Google
Overview • This ends our introduction • First we will study structured data, for which the intergalactic standard model is the relational model. • Recall the Databasics Anonymous 7-step method**: • Organizing structured data into an ER Diagram, Ch. 2 • Transforming an ER diagram into a Schema of Tables, Ch. 3 • Eliminating anomalies from those tables (normalization), Ch. 21 • Structuring those tables efficiently (physical design), Ch. 22 • Managing those tables using the intergalactic standard language (SQL), Ch. 5 • The DBMS manages the tables internally using Relational Algebra, Ch. 4 • Protecting those tables during concurrent use (ACID properties), Ch. 16 • Accessing those tables through a web-based interface (some scripting language) • We will perform steps 1, 2 and part of 5 this week.
Election Data • For examples in this course we’ll be using data, collected by the Federal Elections Commission, about contributions to campaign committees, from the 2008 election*. • General info about the FEC is at http://www.fec.gov/ans/answers_general.shtml. • Forms-based interfaces to the data are at http://www.fec.gov/finance/disclosure/disclosure_data_search.shtml • Raw data in tabular form is at http://www.fec.gov/finance/disclosure/ftpdet.shtml • Our, and all database texts, use simple data examples. We hope that by using real data you’ll not only gain more valuable experience but be able to ask interesting questions of the data. • Perhaps you’ll make some newsworthy discoveries. *This idea was suggested by Vassilis Papadimos, whom we thank.
Raw Data* • When you first encounter data, it may be in a raw, disorganized form. In front of you, and below, is some raw FEC data. Note • Some data (at the top) is descriptive, some (at the bottom) is in tables. • Some data items are self contained • But note “at the same address as his principal committee” • Some notation needs human interpretation • ‘names all start with “McCain Victory” and end with one of…’ Raw Data Document in http://www.cs.pdx.edu/~len/386/fec/Mats.doc
Real World Data • Real world data often starts out like the raw data in front of you, except that there is often much more data and many more attributes. • The first step in transforming this raw data into information is to draw a diagram of it, with the hope that your customer can understand the diagram and thus the structure of the data. • One way to make the diagram understandable is to use few concepts in the diagram: primarily entity and relationship.
Entity and Relationship • An entity is a real-world object distinguishable from other objects. • An entity is described using a set of attributes. • A relationship is an association among 2 or more entities. • What are some possible entities, relationships and attributes in the raw FEC data?
ER Diagram of FEC Data* Entity Set Candidate CandName Party Address Committee CommName Address Principal Associated Attribute Donated to Relationship Set Donor Occupation
Entity, Relationship Sets* • There is a subtle distinction here: • Gordon Smith, a candidate, is an entity. • “Candidate” is the name of the set of all candidates and is called an entity set. • Similarly, the pair (Jeff Merkley, Jeff Merkley for Oregon) is a relationship, but “Principal” is a relationship set. • What are some examples of entities, entity sets, relationships ,and relationship sets? • I will sometimes be sloppy and refer to Candidate, Committee, etc as entities, but that is not precise. • There is no such distinction for attributes
Requirements Analysis • The standard wisdom (from our text and CS300) is to transform the raw data, as in the handout, into a Requirements Analysis, or RA, as on the next page, then to transform the RA into an ER diagram. • From now on we’ll give you an RA and expect you to transform it into an ER diagram.
Requirements Analysis • Store information about employees, departments and projects • Each employee has a ssn and name • Each department has a code and a name • Each project has a number, name, due date and budget • Employees are assigned to projects • Departments sponsor projects • Each department has a manager and each employee has a home department
Guidelines • If something has an attribute, it is an entity set • Use language from the RA in the ER diagram • Relationship set names should be part of a sentence including the names of the entity sets. • Some parts of the RA may not belong in the ER diagram. • Keep the ER diagram as simple as possible. • No redundancy • Check that every item in the RA is in the ER diagram if it belongs there, and nothing else.
Instance, Legal Instance* • An instance of an entity set is the current contents of the entity set. • A legal instance is a an instance that could occur in the real world. • What is an example of an illegal instance? • Normally you need a domain expert to determine what is a legal instance. • For example, can two departments have the same name? The same code? The same name and the same code both? It depends on company policies – only a domain expert will know.
Keys* • A key is a minimal set of attributes that uniquely defines an entity in an entity set, for all legal instances of the entity set. • Is name a key for Employee? (Number,Name) a key for Project? • You don't know unless you ask a domain expert. • If there is more than one key in an entity set, one is chosen as the primary key; the others are called candidate keys. • The primary key is usually the one by which the data items are most often accessed • In ER diagrams, the primary key is underlined. • PRACTICE: Underline primary keys in the Emp-Dept ER diagram. • Len is the domain expert. • Note that some ER diagrams do not have keys, e.g., the FEC data. It is common, but not required, during the transformation from data to information, to add keys at some point.
Requirements Analysis: Professors • Professors have a SSN, a name and an age and their SSNs uniquely identify them. • There is a bulletin. In the bulletin, each course is listed with the professor who supervises the course and the professors who teach the course. • Courses are uniquely identified by their Quarter and CRN, and they have a name. • Sample data: CS386 taught by Len and Lois, supervised by Dave • PRACTICE: Draw an ER Diagram, including primary keys • INCLUDE ONLY INFO IN THE RA!
Sidebar: Use cases • In Homework 7 you will need to define 3-5 use cases for your application to illustrate how your users will interact with your application. • I'll give you an example so you can begin thinking about that assignment. • A use case describes how an actor interacts with the system you have built. • Here's an example use case for the Prof-Course ER diagram we just produced, assuming it has been implemented as a system.
Example Use Case: Find Course Supervisor Actor: Student who wants to find course supervisor • Student logs into Banner • If login is invalid, issue error message and return to step 1 • Student chooses "find course supervisor" from menu. • Student inserts quarter and CRN from dropdown list. • System returns course supervisor information. • Student chooses "exit" from menu.
Notes about Use Cases • Each use case must have a name and an actor. The actor may be another system, for example if another company is purchasing an item. • Each step of the use case is doable by the actor or the system you have built. • Error cases can be handled inline, as in the example, or at the end of the use case. • In order to find uses cases, think of who users are, what they will do, and/or what screens will look like. • Use the given format: start with the actor and number the steps.
Candidate CandName Party Address Cardinality Constraints on Relship sets:How many entities can participate? Associated i..j m..n Committee CommName Address • m..n means each candidate is the associated candidate for at least m and at most n different committees. i..j means that each committee has at least i and at most j associated candidates. • What are reasonable values for m, n, i and j? Don’t forget to skip over.
FEC With Cardinality Constraints* • Fill in the correct cardinalities Candidate CandName Party Address Committee CommName Address Principal Associated 0..1 0..* Donated to Donor Occupation A domain expert, who knows the semantics of the application, is needed to attach correct cardinalities.
More Cardinality Constraints • A relationship like Associated is one-to-many if one row is Associated with many rows but not vice versa.
More Cardinality Concepts • Many to Many Relationships • A left entity is related to many right entities and vice versa. • Which of the FEC relationships is many to many? • How can you tell that a relationship is many to many • The max cardinality on both sides is greater than one • One to Many Relationships • A left entity is related to many right entities but not vice versa. • Which of the FEC relationships is one to many? • Max cardinality on at most one side is greater than one • One to One Relationships • A left entity is related to exactly one right entity, and vice versa. • Which of the relationships is one to one? • Cardinality on both sides is 1..1 • Relatively rare, means two entity sets correspond exactly.
FEC with new Cardinalities One to Many Candidate CandName Party Address Committee CommName Address Principal 0..* 0..1 Associated 0..1 0..* One to Many 1..* Donated to Many to Many 0..* Donor Occupation
Employee Ssn Name lot Relationship sets can have attributes Department Code Name home 0..* 0..1 start-date descriptive attribute of the relationship set
Employee Ssn Name Lot Try all three locations for the attributes:What does each one mean? Department code name home 0..* 0..1 start-date ? start-date ? ? start-date
What is a possible attribute for the DonatedTo Relationship Set?* Candidate CandName Party Address Committee CommName Address Principal 0..1 1..1 Associated 0..1 0..* 1..* Donated to 0..* Donor Occupation
What’s wrong with this picture? Doctor Name Address Specialty Patient Ssn Address Insurance ID Primary Care Appointment Time
Binary vs. Ternary Relship Sets* • A classic example of a ternary relationship set is Parent, between the entity sets father, mother and child. • The alternative is to store the data in binary relationship sets Father and Mother. Father Mother • If Elayne is a Mother, Len is a Father and Dan, Ari and Joe are Children, what are examples of each relationship (not set)? Parent Father Mother Child
Binary vs. Ternary, ctd* • Ternary relationship sets are succint; Parent is one relationship set; FatherOf and MotherOf are two. • But binary relationship sets always (see notes) hold as much or more information as a ternary (or n-ary) relationship sets. • For example, if a child has no mother you can store information about the child's father in a binary relationship but not in a ternary relationship (without using NULLs). • However, one ternary relationship (not set) will store more information than one binary relationship • (Elayne,Len,Dan) vs. (Elayne,Dan) • In this class, if you have a choice, use a ternary relation! • (No cardinality constraints in ternary relations)
Employee Ssn Name Office Duality:entityattribute Project Number Name Due Date Assignment Manager Should Office be an attributeof Employee? or a separate entity set? Most attributes can be “promoted” to an entity set and some entities can be “demoted” to an attribute value. This explains why there are so many different ways to design a schema.
Employee Ssn Name Entity vs. Attribute Office Number Project Number Name Due Date Assignment Assigned Office Manager What are some reasons to model Office as an entity set? • an employee can have more than one office • there are other attributes of Office • Office needs to participate in other relationship sets such as a relationship set connecting to furniture or • telephones or network drops (located in the office)
Employee Ssn Name Entity vs. Attribute Office Number sq-ft orientation Project Number Name Due Date Assignment Assigned Office Manager Location NetworkDrop IP-address location
LO1.1 Practice: Pharmacy • Each patient has a unique ssn, age and primary doctor. • Each drug has a unique generic name and a unique chemical formula. • Each drug is sold by a at least two pharmacies. Each pharmacy sells the drug at some price. Each pharmacy has a unique address, a phone number, and sells at least 100 drugs. • Each prescription is written by a doctor for a patient and dispenses a quantity of a drug. • Gather into groups of 2-3. Draw the ER diagram. Include cardinality constraints and underline the primary keys.
Reminder: Guidelines • If something has an attribute, it is an entity set or a relationship set • Use language from the RA in the ER diagram • Relationship set names should be part of a sentence uincluding the names of the entity sets. • Some parts of the RA may not belong in the ER diagram. • Keep the ER diagram as simple as possible. • No redundancy • Check that every item in the RA is in the ER diagram if it belongs there, and nothing else.
Table of Contents • Relational Data Model • Relation, row, attributes, instance • Cardinality, Degree, Domain • Keys, Primary and Candidate • Schema, Legal Instance • Finding Candidate & Foreign Keys • ER Diag. Schema of Tables • Database, DBMS • Definitions • Languages • Postgresql • FEC data • SQL • History • NULL values • DDL, DML • SELECT/FROM/WHERE • DISTINCT,ORDER BY,LIKE • Introduction • Translating Data into Information • Three Grand Challenges • Data is widespread • Data is vast • Structured vs Unstructured Data • ER Diagrams • Requirements Analysis • Use cases • Entities, Attributes, Relationships • Instance, Legal Instance • Keys, Primary and Candidate • Attributes of Relationship Sets • Ternary Relationship Sets • Enties vs Attributes Done
Data Models • The next step in our quest to transform data into information is to transform an ER diagram into a more highly structured model than the ER model. • A DataModel is a language used to describe the structure of data. • The relational model is the intergalactic standard for storing structured data, although some legacy data is still stored in network and hierarchical models, and some niche data is stored in object, XML and other models. • If we were building, say, an XML database, then we would transform our ER diagram into an XML schema • The relational model is a theory, mathematically based, so we can use it to prove results, though some of its assumptions, like requiring keys, may not be realistic.
Relational Model: Notation* • A relation is a set of rows, each having the same attributes. • Look at the Sample Relations handout, also below. • Look at the instance (current contents) of the Candidate relation. • How many rows are in this instance of the Candidate relation? • Called the cardinality of the relation. • How many attributes are in this instance of the Candidate relation? • Called the degree of the relation Sample Relations Document in http://www.cs.pdx.edu/~len/386/fec/Mats.doc
Domains* • What are the names of the attributes in DonateTo? • Each attribute has a domain – its set of possible values. • Possible domains: character strings, integers, dates • Decide domain in terms of its planned use • Arithmetic? Integer. Otherwise character • Domain is like a type in computer science. • A relation is mathematically defined as subset of the cross product of the domains. • What is the domain of each attribute in DonateTo?
Keys • Key, primary key, and candidate key have the same definitions* as in ER diagrams, but the relational model requires that every relation have at least one key attribute. • The primary key of each relation is used to identify each data item/row. • The displayed candid, commid and fecid values were chosen by the FEC to identify each candidate, etc. *Key: A minimal set of attributes that uniquely identifies each row Primary key: One key chosen as the primary way to identify each row Candidate key: any key other than the primary key