1.3k likes | 1.49k Views
Introduction to Databases. Structure. Motivation Introduction to MySQL Example Queries Using SQL to query SCOP. Motivation. In the last term , we accessed most information online via the web we interacted directly and manually with databases and tools
E N D
Structure Motivation Introduction to MySQL Example Queries Using SQL to query SCOP
Motivation In the last term, we accessed most information online via the web we interacted directly and manually with databases and tools we had to manually submit queries, interpret results. select interesting results, cut&paste them, and submit queries again,… Pro: Reasonably easy to get hold of information Con: Not possible to ask many queries Queries limited by interface provided by web page Difficult/impossible to integrate information from different sites In this term, we will look at the databases underlying the online front ends How is the data internally stored? How can we - and more important computer programs - directly interact with the underlying data, so that we can ask more powerful queries, large queries, and integrate different systems
What actually happens when you retrieve data online? LLNE YLEEVE EYEEDE LLNE YLEEVE EYEEDE Client Message Web Server 1 Get home page Get it and send it 2 Send home page 3 Display home page, enter query, and press submit Send query Start programme that evaluates query by accessing database… 4 Display result Send result … Compose result web page and send it
What actually happens You are limited by what web server allows you to ask: Example CATH: • PDB ID, • CATH code, or • General text But you cannot ask: • In how many different PDB structures is there a P-loop domain? • Is there a PDB entry with a P-loop and a DNA-binding domain • How many different superfamilies does the largest structure in PDB have? • With direct access to the underlying database you could answer all these questions (and many more)
Querying over the Web Problem is always the same: The web interface limits access to the underlying database How can we interact directly with the database
What databases are about Logical organization of data data models, schema design, dictionaries Physical organization of data Fast retrieval, indexing, compact storage of data Other requirements: Logging (important to know who did what to the data) Security and access control (important to know who can do what) Transactions and concurrency control (important when more than one person is working on database) Integrity (important to ensure that only valid entries in the database) Recovery (important as hardware and software can sometimes fail
Different types of databases Flat files XML Relational database (Object databases) (Object relational databases)
Flat files We can store any data in a flat file, e.g. EMBL But is this a database? Logical data organisation: None, unless we define one (as done for EMBL) and adhere to it, which is not enforced Physical data organisation: None, we cannot optimise retrieval for common queries Logging: No Access control: Implicit through Unix Transaction and concurrency control: None Integrity: None Recovery: If files are backed-up they can be recovered. However, not on the fly ID BTBPTIG standard; genomic DNA; MAM; 3998 BP. XX AC X03365; K00966; XX SV X03365.1 XX DT 18-NOV-1986 (Rel. 10, Created) DT 20-MAY-1992 (Rel. 31, Last updated, Version 3) XX DE Bovine pancreatic trypsin inhibitor (BPTI) gene XX KW Alu-like repetitive sequence; protease inhibitor; trypsin inhibitor. XX OS Bos taurus (cow) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; OC Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; OC Bos. XX RN [1] RP 1-3998 RX MEDLINE; 86158754. RX PUBMED; 2420326. RA Kingston I.B., Anderson S.; RT "Sequences encoding two trypsin inhibitors occur in strikingly similar RT genomic environments"; RL Biochem. J. 233(2):443-450(1986). XX RN [2] RX MEDLINE; 84070725. RX PUBMED; 6580617. RA Anderson S., Kingston I.B.; RT "Isolation of a genomic clone for bovine pancreatic trypsin inhibitor by RT using a unique-sequence synthetic DNA probe.";
XML files We can store any data in XML, the eXtentable Mark-up Language, e.g. Medline But is this a database? Logical data organisation: yes, XML schema, which is enforced Physical data organisation: None, we cannot optimise retrieval for common queries Logging: No Access control: Implicit through Unix Transaction and concurrency control: None Integrity: None Recovery: If files are backed-up they can be recovered. However, not on the fly <Article> <Journal> <ISSN>0270-7306</ISSN> <JournalIssue> <Volume>19</Volume> <Issue>11</Issue> <PubDate> <Year>1999</Year> <Month>Nov</Month> </PubDate> </JournalIssue> </Journal> <ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3. </ArticleTitle> <Pagination> <MedlinePgn>7651-60</MedlinePgn> </Pagination> <Abstract> <AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText> </Abstract> <Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA. </Affiliation>…
Relational Database Central Idea: Data as relations in a table E.g. SCOP, Structural Classification of Proteins +-------+------+---------+---------+--------------------------------------+ | id | type | sccs | sid | description | +-------+------+---------+---------+--------------------------------------+ | 46457 | cf | a.1 | - | Globin-like | | 46458 | sf | a.1.1 | - | Globin-like | | 46459 | fa | a.1.1.1 | - | Truncated hemoglobin | | 46460 | dm | a.1.1.1 | - | Truncated hemoglobin | | 46461 | sp | a.1.1.1 | - | Ciliate (Paramecium caudatum) | | 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: | | 46462 | sp | a.1.1.1 | - | Green alga (Chlamydomonas eugametos) | | 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: | | 63437 | sp | a.1.1.1 | - | Mycobacterium tuberculosis | | 62301 | px | a.1.1.1 | d1idra_ | 1idr A: | +-------+------+---------+---------+--------------------------------------+
Relational Database Central Idea: Data as relations in a table E.g. Employee +-------+------+---------+---------+ | id | name | salary | role | +-------+------+---------+---------+ | 46457 | pete | 50.000 | director| | 46458 | jane | 60.000 | nurse | | 46459 | asif | 70.000 | driver | +-------+------+---------+---------+
Relational Database Central Idea: Data as relations in a table E.g. pets +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
Relational Database Central Idea: Data as relations in a table E.g. school +-------+------+---------+ | id | prof | subject | +-------+------+---------+ | 51221 | bert | bio | | 55435 | anne | math | +-------+------+---------+ +-------+------+---------+ | id | name | subject | +-------+------+---------+ | 46458 | rick | bio | | 46459 | gerd | bio | | 46460 | mary | bio | | 46461 | ella | math | | 14982 | anne | math | | 46462 | paul | math | +-------+------+---------+ +---------+------+-----+------+ | subject | room | day | time | +---------+------+-----+------+ | bio | A | mo | 3pm | | math | B | tue | 1pm | +---------+------+-----+------+
Relational Database A cell in the table stores a single number or string, but not a list Lists, sets need to be flattened +------+-------------+ | prof | subject | +------+-------------+ | bert | bio | | bert | sport | | anne | arts | | anne | math | +------+-------------+ +------+-------------+ | prof | subjects | +------+-------------+ | bert | {bio,sport} | | anne | {math,arts} | +------+-------------+
Bioinformatics: 10 years of resistance to flattening! Why the resistance? Bioinformatics data is naturally nested Extensive Use of sets and lists E.g. Swissprot: Features, keywords, References Such data can be flattened, but the resulting relational schema is hard to understand hence it is hard to formulate queries. For example, storing the SWISSPROT entry in a relational database would split it over 15-20 tables.
Relational Databases RDB introduced in 1970 by Codd Took off in the 80s In the business world, relational databases are the rule (Oracle, Sybase, mySQL, DB2, Microsoft Access). Large biomedical databases typically use a relational technology; but there are also a lot of homegrown systems (ACeDB, SRS indexed files). Data is almost always viewed and exported in a variety of flat file formats (EMBL, GenBank among others)
The flood of biomedical data… Since 1980, the number and size of biomedical databases has been growing exponentially. How can you find sources of information you are seeking? Nucleic Acids Research Database Issue in January of every year (http://nar.oupjournals.org/) Dbcat (http://www.infobiogen.fr/ ): a flat file database of 500 biological databases. Susan B. Davidson, Biol537/CIS636, Fall 2003
Relational Schema The schema of a database is a set of relation names, their field names and types. Example: Entry(ID: int, Length: int, Seq: string, Mod:date) Feature(ID: int, Type: string, From: int,To: int) EntryandFeature are relation names, ID, Seq, Mod, etc are attribute names, and int, string, dateare domains Susan B. Davidson, Biol537/CIS636, Fall 2003
Relation Instance An instance of a relation is a set of tuples of the type of the relation. A tuple of Entry could be: ( ID: 82814, Length: 597, Seq:“ccagctaaccg”, Mod: 1-7-95) A tuple of Feature could be: (ID: 82814, Type:“source”, From:1,To:8959) Susan B. Davidson, Biol537/CIS636, Fall 2003
Tabular representation Typically, relations are displayed as tables Sequence: Feature: attributes ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 98608 18976 “accgcct...” 2-14-98 tuples ID Type From To 82814 “Source”1 184 82814 “Gene” 23 65 Susan B. Davidson, Biol537/CIS636, Fall 2003
Entities and Relationships There is a one-many relationship from Entry to Feature; each entry can have many features, but a feature can be on at most one entry. Put another way, the existence of a feature depends on the existence of the owning entry referential integrity Susan B. Davidson, Biol537/CIS636, Fall 2003
Integrity Constraints IDis the keyof Entry, indicated by underlining: No two tuples of any instance of Entrycan have the same ID. InFeature, there is a referential integrity constraint on ID: Every ID in Featuremust appear in some tuple in Entry. This is specified in the data definition language (DDL), and enforced by the system as updates are made to the instance. Susan B. Davidson, Biol537/CIS636, Fall 2003
DDL for this relational schema CREATE TABLE Feature (Id INTEGER, Type CHAR(15), From INTEGER, To INTEGER, PRIMARY KEY (Id, Type, From, To) FOREIGN KEY (Id) REFERENCES Entry ON DELETE CASCADE ON UPDATE CASCADE) CREATE TABLE Entry (Id INTEGER, Length INTEGER, Sequence LONGCHAR, Mod DATE, PRIMARY KEY (Id) ) Susan B. Davidson, Biol537/CIS636, Fall 2003
Querying relational databases The language SQL has become a standard for querying relational databases. Based on a curious mixture of the relational algebra and relational calculus (formal languages), it allows new relations of information to be computed from a set of relations. Unlike the relational algebra, it allows other useful stuff: count, sum, min, max, etc. Susan B. Davidson, Biol537/CIS636, Fall 2003
Basic Query relation-list A list of relation names (possibly with a range-variable after each name). target-list A list of attributes of relations in relation-list. * can be used to denote all atts. qualification Comparisons (Attr op const or Attr1 op Attr2, where op is one of <, <=, >, >=, =, <> combined using AND, OR and NOT. DISTINCT(optional) keyword indicates that the answer should not contain duplicates. Default is that duplicates are not eliminated! SELECT [DISTINCT] target-list FROMrelation-list WHERE qualification Susan B. Davidson, Biol537/CIS636, Fall 2003
Conceptual Evaluation Strategy Compute the product of relation-list Discard tuples that fail qualification Project over attributes in target-list If DISTINCT then eliminate duplicates This is probably a very bad way of executing the query, and a good query optimizer will use all sorts of tricks to find efficient strategies to compute the same answer. Susan B. Davidson, Biol537/CIS636, Fall 2003
Sample tables Sequence: Feature: ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 98608 18976 “accgcct...” 2-14-98 16665 “gtgtaa….” 1-19-97 76582 9976 “actgga…” 2-29-00 ID Type From To 82814 “Source”1 597 82814 “Gene” 23 65 “Gene” 3 9999 13428 “Gene” 11000 16665 13428 “Source” 1 16665 Susan B. Davidson, Biol537/CIS636, Fall 2003
Simple queries Print all sequences with length less than 10000. SELECT * FROM Sequence WHERE Length < 10000; ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 76582 9976 “actgga…” 2-29-00 Type “Source” “Gene” “Gene” “Gene” “Source” SELECT Type FROM Feature; Print the type of all features. Susan B. Davidson, Biol537/CIS636, Fall 2003
Distinct Note that SQL did not eliminate duplicates. We need to request this explicitly. SELECT DISTINCT Type FROM Feature; Print the type of all features (no duplicates). Type “Source” “Gene” Susan B. Davidson, Biol537/CIS636, Fall 2003
Pattern Matching Can be used in where clause. “_” denotes any character, “%” 0 or more characters. SELECT * FROM Sequence WHERE Seq LIKE ‘a_%g' ID Length Seq Mod 98608 18976 “accgcct...” 2-14-98 76582 9976 “actgga…” 2-29-00 Susan B. Davidson, Biol537/CIS636, Fall 2003
Arithmetic “as” can be used to label columns in the output; arithmetic can be used to compute results SELECT DISTINCT ID, To-From+1 as Length FROM Feature; ID Length 82814 597 82814 43 13428 9997 13428 5666 13428 16665 Susan B. Davidson, Biol537/CIS636, Fall 2003
Set operations -- union SELECT ID FROM Sequence WHERE Length<10000 UNION SELECT ID FROM Feature WHERE Type=“Source”; ID 76582 82814 13428 • Duplicates do not occur in the union. Susan B. Davidson, Biol537/CIS636, Fall 2003
The UNION ALL operator preserves duplicates ID 76582 82814 82814 13428 SELECT ID FROM Sequence WHERE Length<10000 UNION ALL SELECT ID FROM Feature WHERE Type=“Source”; Susan B. Davidson, Biol537/CIS636, Fall 2003
Intersection and difference ID 82814 13428 ID 98608 76582 SELECT Id FROM Sequence INTERSECT SELECT Id FROM Feature; SELECT Id FROM Sequence MINUS SELECT Id FROM Feature; Susan B. Davidson, Biol537/CIS636, Fall 2003
Products Note that the ID column name is duplicated in the output. SELECT * FROM Sequence,Feature; ID Length Seq Mod ID Type From To 82814 597 “ccagctaa...” 1-07-95 82814 “Source” 1 597 98608 18976 “accgcct...” 2-14-98 82814 “Source” 1 597 16665 “gtgtaa….” 1-19-97 82814 “Source” 1 597 76582 9976 “actgga…” 2-29-00 82814 “Source” 1 597 ……. (lots more!) Susan B. Davidson, Biol537/CIS636, Fall 2003
Conditional join SELECT * FROM Sequence, Feature WHERE Sequence.Id = Feature.Id; ID Length Seq Mod ID Type From To 82814 597 “ccagctaa...” 1-07-95 82814 “Source” 1 597 82814 597 “ccagctaa...” 1-07-95 82814 “Gene” 23 65 13428 16665 “gtgtaa….” 1-19-97 13428 “Gene” 3 9999 16665 “gtgtaa….” 1-19-97 13428 “Gene” 11000 16665 13428 16665 “gtgtaa….” 1-19-97 13428 “Source” 1 16665 Susan B. Davidson, Biol537/CIS636, Fall 2003
Counting Surprisingly, the answer to both of these is the following: COUNT(TYPE) 5 Print the number of feature entries. SELECT COUNT(*) FROM Feature; Print the number of types of features. SELECT COUNT(Type) FROM Feature; Susan B. Davidson, Biol537/CIS636, Fall 2003
Counting, cont. To fix this, we use the keyword “DISTINCT”: Can also use SUM, AVG, MIN and MAX. COUNT(DISTINCT Type) 3 SELECT COUNT(DISTINCT Type) FROM Feature; Susan B. Davidson, Biol537/CIS636, Fall 2003
Group by So far, these aggregate operators have been applied to all qualifying tuples. Sometimes we want to apply them to each of several groups of tuples. For example: “Print the type and number of features of each type.” Susan B. Davidson, Biol537/CIS636, Fall 2003
Group by Note that only the columns that appear in the GROUP BY statement and “aggregated” columns can appear in the output. So the following would generate an error. Type COUNT(*) “Source” 2 “Gene” 3 SELECT Type, COUNT(*) FROM Feature GROUP BY Type; SELECT Type, From, To, COUNT(*) FROM Feature GROUP BY Type; Susan B. Davidson, Biol537/CIS636, Fall 2003
Group by … having HAVING is to GROUP BY as WHERE is to FROM “HAVING” is used to restrict the groups that appear in the result. TYPE COUNT(*) “Source” 2 SELECT Type, COUNT(*) FROM Feature WHERE From-To > 50 GROUP BY Type HAVING AVG(From-To)> 8500 Susan B. Davidson, Biol537/CIS636, Fall 2003
Summary SQL is “relationally complete”: allows you to perform operators in an algebra of relations (the relational algebra). Additional features: string comparisons, set membership, arithmetic and grouping. In contrast, Entrez is a much more limited language. Susan B. Davidson, Biol537/CIS636, Fall 2003
A Little Exercise Given the table pet below let us formulate some queries… +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
A Little Exercise Get all pet names… +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
A Little Exercise SELECT name FROM pet; Get all owners and list them only once +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
A Little Exercise SELECT DISTINCT owner FROM pet; Select the names of all birds… +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
A Little Exercise SELECT name FROM pet WHERE species=“bird”; Select the names of all female birds… +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
A Little Exercise SELECT name FROM pet WHERE species=“bird” AND sex=“f”; Select names and owners +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+