Introducing Bioinformatics Databases

Tan Tin Wee / Victor Tong / Susan Moore, Dept of Biochemistry, NUS; Mohammad Asif Khan, Perdana University Graduate School of Medicine

Sources of Biological Knowledge • Past: textbooks, monographs, books, journals. • Today: online, accessible databases that are keyword searchable, e.g. via Google. • Every class of biological molecule has at least a few databases associated with it. • Every area of biology, biotechnology, medicine and life science research has some kind of database associated with it. • You must be aware of, and familiar with, the MAJOR databases. • You must be able to discover NEW databases and master them as and when they appear.

Biological knowledge today! • STORED digitally: almost all critical biological data, information and knowledge is currently stored in computers • ACCESSIBLE globally: all current critical biological knowledge is publicly accessible via the Internet • SHARED extensively: most research data is exchanged via the Internet today; if not public and free, it is shared among international collaborators • PUBLISHED online: most scientific journals are now published with a digital version accessible online, either free (open access) or for a subscription fee paid by the individual or by the institution. Ten years ago, this was not so. There has been tremendous change.

UNSTOPPABLE DATA GROWTH • [Figure: growth of GenBank DNA sequences, 2005 – 2009] >100,000,000 sequences; exponential increase; next-generation sequencing technologies. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html • [Figure: growth of PDB protein and macromolecular structures] Driven by various Structural Genomics initiatives such as the Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI) and JCSG (http://www.jcsg.org/). Source: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

RELENTLESS INCREASE IN DATABASES • Michael Y. Galperin and Guy R. Cochrane (2009). Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 37:D1-D4 (doi:10.1093/nar/gkn942) http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1 • A lot of data • A lot of databases • What do they mean? • Most of the data begin to make sense if they are integrated • But many plans to integrate these databases have failed

Biological Databases – examples and general considerations • Biological databases – what they are; purpose • Some general considerations • Sample databases

Biological databases Many (but not all) definitions of “database” include: • Storage of data on a computer in an organized way • Provision for searching and data extraction. • By these definitions web pages, books, journal articles, text files, and spreadsheet files cannot be considered as databases Purposes of biological databases: • To disseminate biological data and information • To provide biological data in computer-readable form • To allow analysis of biological data

But first…a few terms • Database Record: “A collection of related data, arranged in fields and treated as a unit. The data for each [item] in a database make up a record.” (www.d.umn.edu/lib/reference/skills/vocab.html) • Field: “the part of a record reserved for a particular type of data…” (www.amberton.edu/VL_terms.htm)

Example from the “Grocery Shopping Database”: one record, made up of fields, each holding a value • Date: 18/08/2006 • Item: White bread • Store: Dover Provision • Price: $1.29
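
The record/field vocabulary can be made concrete with a few lines of Python. This is only a sketch: the class name `PurchaseRecord` and the helper `search` are made up for illustration, and the single record is the one from the slide.

```python
from dataclasses import dataclass, asdict


@dataclass
class PurchaseRecord:
    """One record of the grocery shopping database; each attribute is a field."""
    date: str     # kept as text, in the DD/MM/YYYY form used on the slide
    item: str
    store: str
    price: float  # in dollars


# The "database": an organized collection of records (here, just one).
records = [
    PurchaseRecord(date="18/08/2006", item="White bread",
                   store="Dover Provision", price=1.29),
]


def search(records, field, value):
    """Return every record whose named field holds exactly the given value."""
    return [r for r in records if getattr(r, field) == value]


hits = search(records, "item", "White bread")
print(asdict(hits[0]))  # field names mapped to the field values of the record
```

Storing data in a fixed structure and being able to search it by field is what separates this from a plain text file or a web page.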

Some features of Biological Databases • Data/information… • Stored in records according to some predetermined structure/format • +/- evidence • +/- unique identifiers • +/- additional annotation • +/- DB Xrefs (cross references)

Authoritative and Reliable • Most biological databases are from authoritative and reliable sources, however… • Not all Websites and Databases are reliable. • Not all data and information stored in authoritative and reliable websites or databases are accurate or correct, or up-to-date • Nevertheless, most of them are useful and instructive • Many of them contain valuable information and knowledge Identification of authority and Evaluation of reliability – very important Every serious scientist must be critical of the information they read, whether online or not.

Discoverability • Most publications, books and courses include online references – Web address (URL), e.g. http://www.pdb.org/ for protein structural data • Most useful resources are also listed and taught in courses, or spread by word of mouth. • Most databases can be found by searching with appropriate keywords; their authority can be judged from their web addresses, the institutions behind them or the authors’ reputation. • Most databases provide full details of their content and how to use them.

NAR Database Categories List From: http://nar.oxfordjournals.org

TABLE OF NAR DATABASES ISSUE http://en.wikipedia.org/wiki/Biological_database http://www.oxfordjournals.org/nar/database/c/ • Nucleotide Sequence Databases • RNA sequence databases • Protein sequence databases • Structure Databases • Genomics Databases (non-vertebrate) • Metabolic and Signaling Pathways • Human and other Vertebrate Genomes • Human Genes and Diseases • Microarray Data and other Gene Expression Databases • Proteomics Resources • Other Molecular Biology Databases • Organelle databases • Plant databases • Immunological databases • Bibliographic databases

Database of Biological Databases • Alphabetical order: http://www.oxfordjournals.org/nar/database/a/ • Category: http://www3.oup.co.uk/nar/database/cap/

Information flow in Biology • Human Genome Project – DNA sequence • Microarray – RNA expression and levels • Proteomics – protein expression and concentration in cells • Structural proteomics or genomics – protein structure (and function) • Functional genomics – protein function

Examples of Major Bioinformatics Resources • Browsing databases • NCBI Entrez http://www.ncbi.nlm.nih.gov/sites/gquery • EBI Ensembl http://www.ensembl.org/index.html • Retrieving sequences • SRS – Sequence Retrieval System http://srs.ebi.ac.uk/ • ExPASy – Expert Protein Analysis System – Proteomics server http://au.expasy.org/

Bibliographic Information • PubMed and Medline • Recent National Institutes of Health USA policy • Google Scholar • Web of Science and Science Citation Index • Online journals • SuperTier Top Journals – Nature, Science, Cell, PNAS, etc. • Open access journals • Public Library of Science PLoS • Biomed Central

Literature – PubMed • Citations and abstracts for articles from approx. 5000 (not all!) biomedical journals • Text searching to identify citations of interest • Links to full-text articles (free or otherwise) • More than 16,000,000 records (as of Dec 29 2005; source: PubMed News, http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=PubMedNews)

Literature – PubMed: search for “p53 cancer”. Each result shows: • Authors • Article title • Bibliographic information (journal name, date, volume, issue, page numbers) • PMID: unique ID for this record
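
The same keyword search can also be expressed as a web request. The sketch below assumes NCBI's E-utilities `esearch` service, which the slides do not mention (they show the PubMed web page only); the function name `pubmed_search_url` is made up for illustration. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Address of the E-utilities "esearch" service (an assumption of this sketch,
# not something taken from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an esearch URL asking PubMed for the PMIDs that match `term`."""
    params = {
        "db": "pubmed",     # which Entrez database to search
        "term": term,       # the same text typed into the PubMed search box
        "retmax": retmax,   # maximum number of PMIDs to return
        "retmode": "json",  # ask for a JSON reply
    }
    return ESEARCH + "?" + urlencode(params)


url = pubmed_search_url("p53 cancer")
print(url)
```

Opening the printed URL in a browser, or fetching it from a script, should return a list of PMIDs, each of which identifies one PubMed record.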

AbstractPlus view - PubMed

STORING YOUR OWN BIBLIOGRAPHIC INFORMATION • Online: Wizfolio http://www.wizfolio.com • Software: ENDNOTE or REFMAN

Genetic and Genomic Databases • From sequencing of specific genes or of entire genomes • Data are prepared, annotated and stored in databases • GenBank, NCBI • DDBJ, NIG • EBI/EMBL • Making Deposits http://www.ncbi.nlm.nih.gov/Genbank/update.html • BankIt • Sequin

Nucleic Acid Databases Include: • GenBank, DDBJ, EMBL – archives of primary data; exchange data amongst themselves • RefSeq – summary/integration of primary data

GenBank • Data from: – Individual laboratories – Sequencing centres • Any organism • Individual records may be incomplete or inaccurate – e.g. sequencing errors – e.g. incomplete sequences (Source: NCBI Handbook)

Searching Entrez Nucleotide for human p53

p53 GenBank record: GI 48094186

p53 GenBank record: HEADER • Identifiers, Version, Definition Line • Organismal Source • Data sources

p53 GenBank record: FEATURES • Cross-References to Other DBs • Protein product

p53 GenBank record: SEQUENCE
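
The three parts of a record (header, features, sequence) can be pulled apart with a few lines of code. The sketch below is a deliberately minimal illustration, not a full parser: `TOY_RECORD` is a made-up entry that only imitates the flat-file layout (it is not the p53 record), and the function name `split_genbank_record` is hypothetical. Real records have multi-line fields and feature qualifiers that this sketch ignores.

```python
# A made-up record that imitates the GenBank flat-file layout.
TOY_RECORD = """\
LOCUS       TOY000001        24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example sequence, not a real GenBank entry.
ACCESSION   TOY000001
VERSION     TOY000001.1
FEATURES             Location/Qualifiers
     source          1..24
     misc_feature    1..24
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""


def split_genbank_record(text):
    """Split one flat-file record into header fields, feature lines and sequence."""
    header, features, sequence = {}, [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # start of the feature table
            section = "features"
            continue
        if line.startswith("ORIGIN"):    # start of the sequence block
            section = "sequence"
            continue
        if section == "header":
            if line[:1].strip():         # header keywords start in the first column
                keyword, _, value = line.partition(" ")
                header[keyword] = value.strip()
        elif section == "features":
            features.append(line.strip())
        else:
            # keep letters only: drops the position number and the spacing
            sequence.append("".join(ch for ch in line if ch.isalpha()))
    return header, features, "".join(sequence)


header, features, sequence = split_genbank_record(TOY_RECORD)
print(header["ACCESSION"], header["VERSION"], len(sequence))
```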

The linked protein record: GenBank → GenPept

Links from p53 GenPept record • Available links vary from one record to another

With so many records, how do we know which one to work with? They may: • Come from different source databases • e.g. DDBJ, GenBank, EMBL (nucleotide) • Have the same or different sequence information • Single changes in nucleotides/amino acids • Incomplete sequence • Have variable extra annotation • e.g. signal peptide; domains; DB XRefs etc.
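
A first check when two records claim to describe the same molecule is whether their sequences agree. The sketch below is a rough illustration under simple assumptions: the function name `compare_sequences` and the toy sequences are made up, and only exact containment and equal-length mismatches are detected. Anything more complicated needs a proper alignment tool.

```python
def compare_sequences(a, b):
    """Describe how two sequences differ, ignoring upper/lower case."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if a in b or b in a:
        return "one sequence is an incomplete copy of the other"
    if len(a) == len(b):
        # 1-based positions at which the two sequences disagree
        diffs = [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]
        return f"same length, differs at position(s) {diffs}"
    return "different lengths and content; needs a proper alignment"


print(compare_sequences("ACGT", "acgt"))
print(compare_sequences("ACGTAC", "CGTA"))
print(compare_sequences("ACGT", "ACCT"))
```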

The RefSeq Project • Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.” (http://www.ncbi.nlm.nih.gov/RefSeq/index.html) • Info from: • Predictions from genomic sequence • Analysis of GenBank Records • Collaborating databases

RefSeq:

Example: p53 RefSeq mRNA record

p53 RefSeq mRNA features

p53 RefSeq mRNA features continued

p53 RefSeq mRNA features include… • Links: • GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation • OMIM – Online Mendelian Inheritance in Man – disease information • CDD – conserved protein domain • HGNC – official nomenclature for human genes • HPRD – Human Protein Reference Database • CDS (CoDing Sequence) • Gene Ontology terms applied to the protein • Nucleotide sequence range of translated product • Translation – the protein sequence • Link to RefSeq Protein record • Other features – sequence ranges refer to the nucleotide • Nuclear Localization Signal • Polyadenylation site etc

p53 RefSeq Protein

p53 RefSeq Protein continued

p53 RefSeq Protein continued • Sequence ranges in features refer to the amino acid sequence

Interpreting RefSeq identifiers • Genomic DNA: • NC_123456 - complete genome, complete chromosome, complete plasmid • NG_123456 - genomic region • NT_123456 - genomic contig • mRNA: NM_123456 • Protein: NP_123456 • Gene and protein models from genome annotation projects: • XM_123456 - mRNA • XR_123456 - RNA (non-coding transcripts) • XP_123456 - protein
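
The prefix rules above amount to a lookup table. The sketch below covers only the prefixes listed on the slide; the names `REFSEQ_PREFIXES` and `describe_refseq` are made up for illustration, and the accession numbers are the slide's placeholders.

```python
# Two-letter prefix -> kind of molecule, as listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (genome annotation)",
    "XR": "model non-coding RNA (genome annotation)",
    "XP": "model protein (genome annotation)",
}


def describe_refseq(accession):
    """Say what kind of record a RefSeq-style accession refers to."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        return "not a RefSeq accession covered by this table"
    return REFSEQ_PREFIXES[prefix]


for acc in ("NM_123456", "NP_123456", "XR_123456", "ABC123"):
    print(acc, "->", describe_refseq(acc))
```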

RefSeq status (listed from most confident to least confident) • Validated • Reviewed • Provisional --------------- • Predicted • Model • Inferred • Genome Annotation

Protein Database – Swiss-Prot • A curated database of protein sequences • Trained biologists extract and analyze relevant evidence from scientific publications • Post-translational modifications, sequence variations, functions, etc. • TrEMBL = Translated EMBL • UniProtKB = Swiss-Prot + TrEMBL

Structures: PDB • Three-dimensional structures of biomolecules Image: Eric Martz RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm (Accessed Aug 16, 2006)
