Please use Linux today if possible!
Introduction to Molecular Biology Databases Alinda Nagy & Hedi Hegyi, PhD @ Institute of Enzymology, Budapest The BioSapiens Permanent School of Bioinformatics Budapest, Sept 4-8, 2006
What is a database? • A database is a structured collection of information. (An organized array of information.) • A database consists of basic objects called records or entries. • Each record consists of fields, which hold defined data related to that record. • For example, a protein database would typically have proteins as records and protein properties as fields (e.g. name, length, sequence, taxonomic origin). Noam Kaplan
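The record-and-field idea can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the field names, and the two entries are hypothetical and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record; each attribute is a field of that record."""
    name: str
    organism: str
    sequence: str

    @property
    def length(self) -> int:
        # The length field is derived from the sequence field.
        return len(self.sequence)


# A tiny made-up "database": a collection of records with the same fields.
records = [
    ProteinRecord("Example kinase", "Homo sapiens", "MEPPEGAGTG"),
    ProteinRecord("Example dehydrogenase", "Rattus norvegicus", "MPVAGSELPRRPL"),
]

for record in records:
    print(record.name, record.organism, record.length)
```

Because every record has the same fields, any field can later be used as a search criterion.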
What is a database? • A database is searchable (index) -> table of contents • A database is updated periodically (release) -> new edition • A database is cross-referenced (hyperlinks) -> links to other databases
Why Databases? • The purpose of databases is not merely to collect and organize data, but mainly to allow advanced data retrieval. • A query is a method to retrieve information from the database. • The organization of each record into predetermined fields allows us to use queries on fields. • Example: Find all human proteins that are enzymes and have a length of 1000-1200 aa. Noam Kaplan
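The example query on this slide can be sketched as a filter over fields. The records, the field names (`organism`, `is_enzyme`, `length`) and the helper `find_enzymes` below are assumptions made for illustration, not the schema of any real protein database.

```python
# Made-up records; every record has the same predetermined fields.
proteins = [
    {"name": "protein A", "organism": "Homo sapiens", "is_enzyme": True, "length": 1100},
    {"name": "protein B", "organism": "Homo sapiens", "is_enzyme": False, "length": 1150},
    {"name": "protein C", "organism": "Mus musculus", "is_enzyme": True, "length": 1050},
    {"name": "protein D", "organism": "Homo sapiens", "is_enzyme": True, "length": 400},
]


def find_enzymes(records, organism, min_length, max_length):
    """Return enzyme records from one organism within a length range (inclusive)."""
    return [
        record
        for record in records
        if record["organism"] == organism
        and record["is_enzyme"]
        and min_length <= record["length"] <= max_length
    ]


# "Find all human proteins that are enzymes and have a length of 1000-1200 aa."
hits = find_enzymes(proteins, "Homo sapiens", 1000, 1200)
print([record["name"] for record in hits])
```

Each condition in the filter tests exactly one field, which is why the fixed field layout matters.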
Databases on the Internet • Biological databases often have a web interface, which allows the user to send queries to the database. • Some databases can be accessed through different web servers, each offering a different interface. [Diagram: the user sends a request to the web server, which sends a query to the database server; the result is returned to the user as a web page.] Noam Kaplan
Databases on the Internet [Diagram layers: Information system, Query system, Storage system, Data] Francis Ouellette
Databases on the Internet • Data, for example: GenBank flat file, PDB file, interaction record, title of a book, book [Diagram layers: Information system, Query system, Storage system, Data] Francis Ouellette
Databases on the Internet • Storage system, for example: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves [Diagram layers: Information system, Query system, Storage system, Data] Francis Ouellette
Databases on the Internet • Query system, for example: a list you look at, a catalogue, indexed files, SQL, grep [Diagram layers: Information system, Query system, Storage system, Data] Francis Ouellette
Databases on the Internet • Information system, for example: the UBC library, Google, Entrez (NCBI), SRS (Sequence Retrieval System) [Diagram layers: Information system, Query system, Storage system, Data] Francis Ouellette
Database download • Nearly all biological databases are available for download as simple text files. • A local copy of the database removes limitations on how you process the data. • Processing data in files requires some minimal computer-programming skills. • Perl is an easy programming language that can be used to extract and analyze data from files. Noam Kaplan
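As a rough idea of what processing a downloaded text file involves, here is a minimal sketch that reads FASTA-formatted text and reports sequence lengths. It uses Python rather than the Perl suggested on the slide, and the function name `read_fasta` and the two entries are hypothetical; a real download would be opened with `open(path)` instead of the in-memory example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Stand-in for a downloaded file; the entries are made up.
example = io.StringIO(
    ">seq1 hypothetical entry\nMEPPEG\nAGTG\n>seq2 another hypothetical entry\nMPVAGS\n"
)
lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(example)}
print(lengths)
```

The same loop structure applies to other flat-file formats; only the rule for recognizing the start of a record changes.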
Tour of the major molecular biology databases • There is a tremendous amount of information about biomolecules in publicly available databases. • Today, we will just look at some of the main databases and what kind of information they contain. • Exercises will give you a little practice at browsing databases.
List of molecular biology databases • Nucleic Acids Research publishes an annual database issue. The 2006 update of the online Molecular Biology Database Collection includes 858 databases. • http://www3.oup.co.uk/nar/database/c/
Large Growth in the Number of Biological Databases NAR Database Issue
Molecular biology data types • Organisms • Genome maps [Figure: mouse chromosome X from the Mouse Genome Informatics project, http://www.informatics.jax.org/] Lei Liu
Molecular biology data types • Organisms • Genome maps • DNA sequences • RNA sequences [Example: ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...] Lei Liu
Molecular biology data types • Organisms • Genome maps • DNA sequences • RNA sequences • Protein sequences [Example: ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...] Lei Liu
Molecular biology data types • Organisms • Genome maps • DNA sequences • RNA sequences • RNA structures • Protein sequences • Protein structures [Figure: PDB entry 1CIS, P. Osmark, P. Sorensen, F.M. Poulsen] Lei Liu
Molecular biology data types • Organisms • Genome maps • DNA sequences • DNA motifs • RNA sequences • RNA structures • RNA expression • Protein sequences • Protein structures • Protein motifs Lei Liu
Types of molecular biology databases 14 main NAR categories: • Nucleotide Sequence • RNA Sequence • Protein Sequence • Structure • Genomics (non-vertebrate) • Metabolic and Signaling Pathways • Human and other Vertebrate Genomes • Human Genes and Diseases • Microarray Data and other Gene Expression • Proteomics Resources • Other • Organelle • Plant • Immunological
Resources are Becoming More Diverse [Figure: NAR database categories, 2004 vs. 2006]
NAR 2006 – A Closer Look • Genome scale databases have proliferated • Traditional sequence databases are now a small part • Databases around new specific data types are emerging • Pathway and disease orientated databases are emerging
Using a database • How to get information out of a database: • Summaries: how many entries, average or extreme values • Browsing: no targeted information to retrieve • Search: looking for particular information • Searching a database: • Must have a key that identifies the element(s) of the database that are of interest. • Name of gene • Sequence of gene • Other information Larry Hunter
Searching sequence databases • Start from a sequence and find information about it • Many kinds of input sequences • Could be an amino acid or a nucleotide sequence • Genomic, mRNA/cDNA, or protein sequence • Complete or fragmentary sequences • Exact matches are rare (and in many cases uninteresting), so the goal is often to retrieve a set of similar sequences. • Both small (mutations) and large (required for function) differences among “similar” sequences can be interesting. Larry Hunter
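To make "similar rather than identical" concrete, the sketch below ranks a few made-up candidate sequences by percent identity to a query. This is a deliberately simplified stand-in: it compares equal-length sequences position by position, whereas real search tools align sequences with gaps and substitution scores. The function name `percent_identity` and all sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


query = "AATGGTACCG"
candidates = {
    "exact": "AATGGTACCG",         # identical to the query
    "one_mutation": "AATGCTACCG",  # differs at a single position
    "unrelated": "CCGTTAGGAA",     # shares no position with the query
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: percent_identity(query, candidates[name]),
    reverse=True,
)
for name in ranked:
    print(name, percent_identity(query, candidates[name]))
```

A search that only accepted exact matches would return the first candidate alone and miss the single-mutation variant, which is often the more interesting result.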
What might we want to know about a sequence? • Is this sequence similar to any known genes? How close is the best match? Significance? • What do we know about that gene? • Genomic (chromosomal location, allelic information, regulatory regions, etc.) • Structural (known structure? structural domains? etc.) • Functional (molecular, cellular & disease) • Evolutionary information: • Is this gene found in other organisms? • What is its taxonomic tree? Larry Hunter
What can be discovered about a gene by a database search? • A little or a lot, depending on the gene • Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc. • Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc. • Structural information: associated protein structures, fold types, structural domains • Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc. • Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases Larry Hunter
NCBI and Entrez • One of the most useful and comprehensive sources of databases is the NCBI (National Center for Biotechnology Information), part of the NIH (National Institutes of Health). • NCBI provides interesting summaries, browsers for genome data, and search tools • Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez • Can search on gene names, sequences, chromosomal location, diseases, keywords, ... Larry Hunter
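Entrez searches can also be issued from a program through NCBI's E-utilities rather than the web page shown on the slide. The sketch below only builds a search URL and does not contact NCBI; the helper name `build_esearch_url` and the example search term are assumptions for illustration, and the current E-utilities documentation should be checked for usage rules before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service (ESearch).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL for a keyword query against one Entrez database."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Hypothetical query: protein entries matching two keywords, limited to human.
url = build_esearch_url("protein", "alcohol dehydrogenase AND human[Organism]", 5)
print(url)
```

Pasting the printed URL into a browser returns a list of matching record identifiers, which can then be used to fetch the records themselves.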
BLAST: Searching with a sequence • The goal is to find other sequences that are more similar to the query than would be expected by chance (and are therefore likely to be homologous). • Can start with a nucleotide or amino acid sequence, and search for either (or both) • Many options • E.g. ignore low-information (repetitive) sequence, set the significance threshold • Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES! Larry Hunter
Major choices: • Translation • Database • Filters • Restrictions • Matrix Larry Hunter
Close hit: Rat ADH alpha Larry Hunter
Distant hit: Human sorbitol dehydrogenase Larry Hunter
Parameters (at bottom!) Larry Hunter
Click on: Larry Hunter
BLAST searches online • http://www.ncbi.nlm.nih.gov/BLAST/ • Sequences:
>ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED
>ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV
Take home messages • There are a lot of molecular biology databases, containing a lot of valuable information • Not even the best databases have everything (or the best of everything) • These databases are moderately well cross-linked, and there are “linker” databases • Sequence is a good identifier, maybe even better than gene name! Larry Hunter
Protein sequence databases • General sequence databases (e.g. UniProt) • Protein properties (e.g. PFD – Protein Folding Database) • Protein localization and targeting (e.g. NPD – Nuclear Protein Database) • Protein sequence motifs and active sites (e.g. BLOCKS, InterPro, PROSITE, PRINTS) • Protein domain databases; protein classification (e.g. InterPro, ProDom, SMART, Pfam) • Databases of individual protein families (e.g. Histone Database) • http://www3.oup.co.uk/nar/database/cat/1