330 likes | 410 Views
Databases. From A to Boyce Codd. What is a database?. It depends on your point of view.
E N D
Databases From A to Boyce Codd
What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in which multiple trajectories of interaction with the information are possible. When thinking about “the database” on this level of abstraction, the difference between a collection of discrete objects (e.g., a bunch of books) and a particular structured representation of those objects (e.g., a set of catalog records) are, for example, not so different. Manovich says exactly this.
What is a database? In contrast, in the computational universe, the difference between a set of files and a set of relational tables is significant, in terms of the operations that can be performed on each. For computational purposes, a database is a set of structured data; Brookshear includes the requirement that the data be structured in a multidimensional way, so that it can be presented “from a variety of perspectives.”
What is a database? Are these databases? In what way? The books in my office. The PCL. YouTube. The Internet Movie Database. The Web. Expedia. Epicurious. What is useful about calling any of these things databases or not?
What is a database? Even in the computational universe, a “database” can exist at multiple levels of abstraction: a conceptual model of a database (as presented through entity-relationship diagrams) can be implemented via different logical models (e.g., as different tables in a relational database, or potentially as objects in an object database, or...), and these can be stored and accessed differently (for example, distributed over many servers)...
Data modeling with ER diagrams Entity-relationship diagrams depict a conceptual model (schema) of data by specifying entities (things), attributes (properties of the things) and relationships between the things. ERDs document semantics, not implementation. There are many forms of notation for ERDs (Chen, the optional reading, is an early one).
Data modeling with ER diagrams There is no “right answer” when determining entities, attributes, and relationships; different entities might be defined, or entities might be defined as relationships instead. It can be difficult to predict consequences of one choice vs. another. Similar decisions and consequences occur when translating a conceptual data model to a logical model for implementation (and then in the actual implementation).
Data modeling with ER diagrams Let’s take a simple example to play with ERDs. We want to model the idea that students take courses and instructors teach courses. One way to do this is to have three entities: students, courses, and instructors.
Data modeling with ER diagrams We could also do this by having two entities: courses and people. In this model, people would have two roles, instructor and student. Chen has an example of a marriage as a relationship between two people, with roles for husband and wife (Chen was writing in 1976). Chen notes that a marriage can also be modeled as an entity.
Data modeling with ER diagrams If we are interested in capturing other information about students and instructors besides their names, then we probably want to separate them (e.g., we might want to track how many credits students have and how many requirements they have completed, and these don’t apply to instructors). If there was overlap between students and instructors, we might have some redundancy going on. We might want to keep this in mind.
Data modeling with ER diagrams Say we want to add the idea of grades to this model. Instructors assess student performance in courses by assigning them grades. How might we model grades? Are they entities? Are they attributes of some existing entity? Are they are relationship between entities? Let’s think about it.
From ERD to database Databases can be implemented in different ways. Brookshear describes relational databases and object-oriented databases. In each case, the database would be implemented in a database management system (DBMS), which would hide details of the actual data composition and storage and such from application programs that use the data.
Relational databases In a relational database, entities are described as rows in a table. The table is called a relation. The table columns are attributes. Each table row is also called a tuple.
A relation Here is a relation that contains information about peer reviewers for an academic conference. There are some problems with this relation.
A relation One immediate problem is that we have multiple values for a single attribute. That’s going to make it hard to figure out how many papers are really assigned to Barb Chen (is that one number or two numbers), or to reassign papers, etc.
Revised relation We can fix this by making additional tuples so that no cell has multiple values. But then we have some redundancy in the relation; every time we assign a new paper, we need to input that name, school and city information as well. Redundancy can cause problems with data integrity.
Revised relations To fix this redundancy problem, we need to split up the table into multiple tables and relate each table with a “foreign key” that links back to the original table. Then we can also include more information about the papers, which is probably necessary anyway.
Revised relations There’s still a problem with that original table, though; the city attribute isn’t really about the reviewer, is it? It’s about the school. It might not be likely that schools will change cities, but it’s still good practice to keep all the attributes in a table about the entity specified by the primary key (in this case, the reviewer).
Revised relations So that’s good. But there was still redundancy in those Paper and Expertise tables. Also we have deletion problems...if a paper isn’t assigned to a reviewer, does that mean it can’t exist in the database? How would we fix that?
Normalization Wowee, that’s a lot of tables there! Do we really want to do that? Yes. And no. On the one hand, it’s good to minimize data redundancy, because this can lead to problems with updating. On the other hand, performing lots of Join operations to put information together again can be inefficient, from a performance perspective.
Database operations Databases are powerful because we can reassemble data in myriad ways. Basic relational database operations include Select, Project, and Join.
Database operations Select extracts rows (tuples) from a relation. Project extracts columns (attribute values). Join combines multiple relations into a new relation.
Query languages Database query languages, such as SQL, may implement the basic operations in different ways. A SQL statement may perform all three operations—Join, Select, and Project.
Very large databases For tremendous datasets such as Facebook, Twitter, and so on (which include not only things like posts but also relationships between items and user behavior), it’s hard to develop comprehensive models that align with the dynamic data environment. Plus, queries with lots of tables are slow.
Very large databases Hadoop is a set of technologies to manage extremely large, hetereogeneous, dynamic data environments. Primary Hadoop technologies include HDFS, a system for distributing data across multiple computers, and MapReduce, software for distributing processing on the distributed data. There are a variety of associated technologies emerging to facilitate processing of Hadoop data.
So...what then? What if I learn MySQL and it becomes obsolete? Of course it will become obsolete! But that doesn’t mean it’s foolish to learn it. Also, it’s important to remember where MySQL lies in the realm of “the database.” For example, conceptual models may be stable where logical models and implementation details change.