Introducing Bioinformatics Databases
Tan Tin Wee / Victor Tong / Susan Moore, Dept of Biochemistry, NUS
Mohammad Asif Khan, Perdana University Graduate School of Medicine
Sources of Biological Knowledge
• Past: textbooks, monographs, books, journals.
• Today: online accessible databases, keyword searchable (e.g. via Google).
• Every class of biological molecule has at least a few databases associated with it.
• Every area of biology, biotechnology, medicine and life science research has some kind of database associated with it.
• You must be aware of, and familiar with, the MAJOR databases.
• You must be able to discover NEW databases and master them as and when they appear.
Biological knowledge today!
• STORED digitally: almost all critical biological data, information and knowledge is currently stored in computers.
• ACCESSIBLE globally: all current critical biological knowledge is publicly accessible via the Internet.
• SHARED extensively: most research data is exchanged via the Internet today; if not public and free, it is shared among international collaborators.
• PUBLISHED online: most scientific journals are now published with a digital version accessible online, either free (open access) or for a subscription fee paid by the individual or the institution.
10 years ago, this was not so. There has been tremendous change.
UNSTOPPABLE DATA GROWTH
[Figure: Growth of GenBank DNA sequence data (2005 – 2009). More than 100,000,000 sequences; exponential increase; next-generation sequencing technologies. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]
[Figure: Growth of PDB protein and macromolecular structures, driven by various structural genomics initiatives such as the Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI) and JCSG (http://www.jcsg.org/). Source: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100]
RELENTLESS INCREASE IN DATABASES
Michael Y. Galperin and Guy R. Cochrane (2009). Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 37:D1-D4 (doi:10.1093/nar/gkn942). http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1
• A lot of data
• A lot of databases
• What do they mean?
• Most of the data begin to make sense once they are integrated
• But many plans to integrate these databases have failed
Biological Databases – examples and general considerations • Biological databases – what they are; purpose • Some general considerations • Sample databases
Biological databases
Many (but not all) definitions of “database” include:
• Storage of data on a computer in an organized way
• Provision for searching and data extraction
• By these definitions, web pages, books, journal articles, text files and spreadsheet files cannot be considered databases
Purposes of biological databases:
• To disseminate biological data and information
• To provide biological data in computer-readable form
• To allow analysis of biological data
But first… a few terms
• Database record: “A collection of related data, arranged in fields and treated as a unit. The data for each [item] in a database make up a record.” (www.d.umn.edu/lib/reference/skills/vocab.html)
• Field: “the part of a record reserved for a particular type of data…” (www.amberton.edu/VL_terms.htm)
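Example from the “Grocery Shopping Database”
A different view of the first “record”, one field per line:
• Date: 18/08/2006
• Item: White bread
• Store: Dover Provision
• Price: $1.29
Date, Item, Store and Price are the fields; 18/08/2006, White bread, Dover Provision and $1.29 are the field values; together they make up a record.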
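The record/field vocabulary maps directly onto a small data structure. The sketch below is only an illustration of the grocery example: the class name `GroceryRecord` and the lowercase field names are assumptions made for this example, not part of any real database.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class GroceryRecord:
    """One record: a set of related values treated as a unit."""
    date: str     # field "Date"
    item: str     # field "Item"
    store: str    # field "Store"
    price: float  # field "Price", in dollars


# The record from the grocery example.
record = GroceryRecord(date="18/08/2006", item="White bread",
                       store="Dover Provision", price=1.29)

# Field names come from the record structure; field values from this record.
field_names = [f.name for f in fields(record)]
field_values = list(asdict(record).values())

for name, value in zip(field_names, field_values):
    print(f"{name}: {value}")
```

Every record of this type has the same fields, which is what makes searching and extraction by field possible.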
Some features of Biological Databases • Data/information… • Stored in records according to some predetermined structure/format • +/- evidence • +/- unique identifiers • +/- additional annotation • +/- DB Xrefs (cross references)
Authoritative and Reliable
• Most biological databases are from authoritative and reliable sources, however…
• Not all websites and databases are reliable.
• Not all data and information stored in authoritative and reliable websites or databases is accurate, correct or up to date
• Nevertheless, most of them are useful and instructive
• Many of them contain valuable information and knowledge
Identifying authority and evaluating reliability are very important. Every serious scientist must be critical of the information they read, whether online or not.
Discoverability
• Most publications, books and courses include online references in the form of a Web address (URL), e.g. http://www.pdb.org/ for protein structural data
• Most useful resources are also listed and taught in courses, or spread by word of mouth.
• Most databases are searchable by appropriate keywords, and their authority can be judged from their web addresses, the institutions behind them or the authors’ reputation
• Most databases provide full details of their content and how to use them.
NAR Database Categories List From: http://nar.oxfordjournals.org
TABLE OF NAR DATABASES ISSUE http://en.wikipedia.org/wiki/Biological_database http://www.oxfordjournals.org/nar/database/c/ • Nucleotide Sequence Databases • RNA sequence databases • Protein sequence databases • Structure Databases • Genomics Databases (non-vertebrate) • Metabolic and Signaling Pathways • Human and other Vertebrate Genomes • Human Genes and Diseases • Microarray Data and other Gene Expression Databases • Proteomics Resources • Other Molecular Biology Databases • Organelle databases • Plant databases • Immunological databases • Bibliographic databases
Database of Biological Databases
• Alphabetical order: http://www.oxfordjournals.org/nar/database/a/
• Category: http://www3.oup.co.uk/nar/database/cap/
Information flow in Biology
• Human Genome Project – DNA sequence
• Microarray – RNA expression and levels
• Proteomics – protein expression and concentration in cells
• Structural proteomics or genomics – protein structure (and function)
• Functional genomics – protein function
Examples of Major Bioinformatics Resources
Browsing databases:
• NCBI Entrez http://www.ncbi.nlm.nih.gov/sites/gquery
• EBI Ensembl http://www.ensembl.org/index.html
Retrieving sequences:
• SRS - Sequence Retrieval System http://srs.ebi.ac.uk/
• ExPASy – Expert Protein Analysis System – Proteomics server http://au.expasy.org/
Bibliographic Information • PubMed and Medline • Recent National Institutes of Health USA policy • Google Scholar • Web of Science and Science Citation Index • Online journals • SuperTier Top Journals – Nature, Science, Cell, PNAS, etc. • Open access journals • Public Library of Science PLoS • Biomed Central
Literature - PubMed
• Citations and abstracts for articles from approx. 5000 (not all!) biomedical journals
• Text searching to identify citations of interest
• Links to full-text articles (free or otherwise)
• More than 16,000,000 records*
* As of Dec 29 2005. PubMed News. http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=PubMedNews
Literature – PubMed
[Screenshot: PubMed search for “p53 cancer”. Each result shows the authors, the article title, the bibliographic information (journal name, date, volume, issue, page numbers) and the PMID, a unique ID for the record.]
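The same “p53 cancer” text search can also be expressed as a URL for a program to fetch. The sketch below assumes NCBI's E-utilities ESearch endpoint, which the slides do not mention; the helper name `pubmed_search_url` and the default of 20 results are hypothetical choices for illustration. It only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not named in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that asks PubMed for PMIDs matching `term`."""
    params = {
        "db": "pubmed",         # which Entrez database to search
        "term": term,           # the text query, as typed in the search box
        "retmax": max_results,  # maximum number of PMIDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("p53 cancer")
print(url)
```

Fetching this URL should return a list of PMIDs, the unique record IDs described above, which can then be used to retrieve individual citations.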
STORING YOUR OWN BIBLIOGRAPHIC INFORMATION
• Online: Wizfolio, http://www.wizfolio.com
• Software: ENDNOTE or REFMAN
Genetic and Genomic Databases
• Data come from sequencing specific genes or entire genomes
• Data are prepared, annotated and stored in databases: GenBank (NCBI), DDBJ (NIG), EBI/EMBL
• Making deposits (http://www.ncbi.nlm.nih.gov/Genbank/update.html): BankIt, Sequin
Nucleic Acid Databases
Include:
• GenBank, DDBJ, EMBL: archives of primary data, which exchange data amongst themselves
• RefSeq: summary/integration of primary data
GenBank
• Data from individual laboratories and sequencing centres
• Any organism
• Individual records may be incomplete or inaccurate, e.g. sequencing errors or incomplete sequences
Source: NCBI Handbook
p53 GenBank record: HEADER
• Identifiers, version, definition line
• Organismal source
• Data sources
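The header fields listed here (identifiers, version, definition line, organismal source) can be pulled out of a GenBank-style flat file with a few lines of code. The sketch below is a minimal illustration, assuming the usual layout in which keywords occupy the first 12 columns and continuation lines are indented. The sample text is a made-up placeholder, not the real p53 record, and `parse_header` is a hypothetical helper; for real work an established parser such as Biopython's `SeqIO` is the safer choice.

```python
# Made-up header in GenBank flat-file style; NOT a real database entry.
SAMPLE_HEADER = """\
LOCUS       XX000000                 120 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record for illustration only,
            not a real GenBank entry.
ACCESSION   XX000000
VERSION     XX000000.1
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
"""


def parse_header(text: str) -> dict:
    """Collect header keywords and their values up to the FEATURES table."""
    header = {}
    current = None
    for line in text.splitlines():
        keyword = line[:12].strip()  # keyword column (may be blank)
        value = line[12:].strip()    # value column
        if keyword == "FEATURES":
            break                    # the header ends where the feature table starts
        if keyword:
            current = keyword
            header[current] = value
        elif current is not None and value:
            # A blank keyword column marks a continuation of the previous field.
            header[current] += " " + value
    return header


header = parse_header(SAMPLE_HEADER)
for key, value in header.items():
    print(f"{key}: {value}")
```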
p53 GenBank record: FEATURES
• Cross-references to other DBs
• Protein product
Links from p53 GenPept record Available links vary from one record to another
With so many records how do we know which one to work with? They may: • Come from different source databases • eg DDBJ, GenBank, EMBL (nucleotide) • Have the same or different sequence information • Single changes in nucleotides/amino acids • Incomplete sequence • Have variable extra annotation • Eg: Signal peptide; domains; DB XRefs etc
The RefSeq Project
• Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.” (http://www.ncbi.nlm.nih.gov/RefSeq/index.html)
• Info from:
• Predictions from genomic sequence
• Analysis of GenBank records
• Collaborating databases
p53 RefSeq mRNA features include… • Links: • GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation • OMIM – Online Mendelian Inheritance in Man – disease information • CDD – conserved protein domain • HGNC – official nomenclature for human genes • HPRD – Human Protein Reference Database • CDS (CoDing Sequence) • Gene Ontology terms applied to the protein • Nucleotide sequence range of translated product • Translation – the protein sequence • Link to RefSeq Protein record • Other features – sequence ranges refer to the nucleotide • Nuclear Localization Signal • Polyadenylation site etc
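The CDS feature ties a nucleotide range on the mRNA to its protein translation. The sketch below shows that relationship on a toy sequence, assuming a 1-based inclusive range and the standard genetic code. The sequence, the range 4..12 and the helper names `extract_cds` and `translate` are invented for illustration and are not taken from the p53 record.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_cds(mrna: str, start: int, end: int) -> str:
    """Return the coding sequence for a 1-based, inclusive range start..end."""
    return mrna[start - 1:end]


def translate(cds: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":  # stop codon: the protein ends here
            break
        protein.append(aa)
    return "".join(protein)


# Toy mRNA in the DNA alphabet, as sequence records display it (T, not U).
toy_mrna = "GGCATGGCCTAAGG"
cds = extract_cds(toy_mrna, 4, 12)  # hypothetical CDS range 4..12
print(cds, translate(cds))          # ATGGCCTAA MA
```

Feature ranges on an mRNA record count nucleotides, so the same numbers cannot be reused on the protein record, where ranges count amino acids.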
p53 RefSeq Protein continued Sequence ranges in features refer to the amino acid sequence
Interpreting RefSeq identifiers
Genomic DNA:
• NC_123456 - complete genome, complete chromosome, complete plasmid
• NG_123456 - genomic region
• NT_123456 - genomic contig
mRNA: NM_123456
Protein: NP_123456
Gene and protein models from genome annotation projects:
• XM_123456 - mRNA
• XR_123456 - RNA (non-coding transcripts)
• XP_123456 - protein
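Because the two-letter prefix carries the meaning, a RefSeq identifier can be interpreted with a simple lookup. The sketch below covers only the prefixes listed on this slide; `describe_refseq` is a hypothetical helper, and the accessions used are the slide's own placeholder numbers rather than real records.

```python
# Prefix meanings as listed on the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (from genome annotation)",
    "XR": "model non-coding RNA (from genome annotation)",
    "XP": "model protein (from genome annotation)",
}


def describe_refseq(accession: str) -> str:
    """Describe a RefSeq accession from the prefix before the underscore."""
    prefix, separator, _number = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        return "unknown or non-RefSeq identifier"
    return REFSEQ_PREFIXES[prefix]


for acc in ["NM_123456", "XP_123456", "AB123456"]:
    print(acc, "->", describe_refseq(acc))
```

An identifier without an underscore, such as the third one, falls outside this scheme and is reported as unknown.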
RefSeq status
Listed from most confident (top) to least confident (bottom):
• Validated
• Reviewed
• Provisional
---------------
• Predicted
• Model
• Inferred
• Genome Annotation
Protein Database – Swiss-Prot
Swiss-Prot: a curated database of protein sequences
• Trained biologists extract and analyze relevant evidence from scientific publications
• Post-translational modifications, sequence variations, functions, etc.
TrEMBL = Translated EMBL
UniProtKB = Swiss-Prot + TrEMBL
Structures: PDB • Three-dimensional structures of biomolecules Image: Eric Martz RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm (Accessed Aug 16, 2006)