”Resources of Biomolecular Data: Sequences, Structures and Functionality” PhD course #27803 Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob@cbs.dtu.dk
Outline • Magnitudes and Scales • Resources: Data Sources & Tools • Primary DNA sources • Sequence Repositories • Structure Repositories • Functional Categorization • Integration of Databases • The Human Genome • Genome Browsers • Prediction Tools • Evaluation of Prediction Servers • Starting points • Link collections
Learning Objectives • The student should be able to: • Describe differences between sequence repositories and curated databases • Describe the challenges of maintaining genome-wide biological databases • List two entry points for getting an overview of ”my gene of interest” • Describe how prediction servers may be evaluated
Resources: Sources & Tools • There are A LOT OF biomolecular databases/sources • A LOT OF overlap of information/redundancy • A LOT OF TOOLS • Personal picks/preferences • User-friendliness • Update intervals • Curation efforts / error correction • Linkage to other DBs
Human Genome Published • Public consortium (International Human Genome Sequencing Consortium): Nature, 15 Feb 2001 • Celera: Science, 16 Feb 2001
Magnitudes and Scales • Human genome: 3,200,000,000 bp • Single base pair → full genome spans 9 orders of magnitude • Genome = football field: ~3 billion blades of grass • Single base A, T, G, C (or SNP) = 1 blade of grass • Genome browsing • Zooming from whole stadium to single blade of grass
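The scale argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the genome size is the figure quoted on the slide, while the function names and the idea of counting tenfold "zoom steps" are illustrative assumptions, not part of the course material.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size as quoted on the slide


def orders_of_magnitude(n: int) -> float:
    """Base-10 logarithm: how many powers of ten separate 1 from n."""
    return math.log10(n)


def zoom_steps(n: int, factor: int = 10) -> int:
    """Number of zoom steps (each magnifying by `factor`) needed to go
    from a view of all n bases down to a single base. Hypothetical helper."""
    return math.ceil(math.log(n, factor))


if __name__ == "__main__":
    print(f"log10(genome size) = {orders_of_magnitude(GENOME_BP):.1f}")
    print(f"tenfold zoom steps from genome to base: {zoom_steps(GENOME_BP)}")
```

The logarithm lands between 9 and 10, which is why a genome browser has to span roughly nine to ten tenfold zoom levels between the whole-genome view and a single base.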
How we got the sequence • Sanger chain termination method
Primary DNA sources • Trace file repositories • Single read: 500-1000 bp (~golf ball size / jigsaw puzzle piece) • Variable quality • WashU-Merck Human EST Project / Trace files • ”Base-calling” is non-trivial: G, C or nothing?
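To illustrate why base-calling is non-trivial, here is a toy sketch that decides a base from the four channel intensities at one trace position. Everything in it is an assumption made for illustration (the function name, the `min_ratio` threshold, the example intensities); real base callers also use peak shape and spacing and report a quality value per base.

```python
from typing import Dict


def call_base(intensities: Dict[str, float], min_ratio: float = 2.0) -> str:
    """Toy base caller for a single trace position.

    Returns the base with the strongest signal, or 'N' when the strongest
    channel does not exceed the runner-up by at least `min_ratio`
    (the "G, C or nothing?" situation).
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"  # no signal at all
    if second > 0 and best / second < min_ratio:
        return "N"  # two channels compete: ambiguous call
    return best_base


if __name__ == "__main__":
    clean = {"A": 900.0, "C": 50.0, "G": 40.0, "T": 30.0}
    noisy = {"A": 10.0, "C": 400.0, "G": 380.0, "T": 5.0}
    print(call_base(clean))  # clear A peak
    print(call_base(noisy))  # C and G overlap, so the call is N
```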
Sequence repositories - GenBank et al. • GenBank / EMBL / DDBJ • Highly redundant (many versions of same gene) • Cross-updated daily • Version history is recorded • Previous sequence records can be retrieved • Contigs/HTGS (100-200 kb) at different stages of finishing • Draft → Finished • Includes genomic DNA, cDNA, ESTs, translated peptides
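Version history shows up in the identifiers themselves: a GenBank record is referred to as `accession.version`. The sketch below splits such an identifier and builds a request URL for the NCBI E-utilities `efetch` endpoint. The helper names are hypothetical, no network request is made, and whether a given older version can still be retrieved this way is an assumption to check against the NCBI documentation.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def split_accession(acc_version: str) -> Tuple[str, Optional[int]]:
    """Split 'U49845.1' into ('U49845', 1); version is None if absent."""
    accession, _, version = acc_version.partition(".")
    return accession, int(version) if version else None


def efetch_url(acc_version: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build (but do not send) an efetch URL for a flat-file record."""
    params = {"db": db, "id": acc_version, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    # U49845.1 is the record used in NCBI's GenBank sample-record page.
    print(split_accession("U49845.1"))
    print(efetch_url("U49845.1"))
```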
Non-redundant and Curated databases • Non-redundant • Manual or automatic curation • DNA • RefSeq (NCBI; semi-automated) • Ensembl gene index (automated) • Protein • RefSeq (NCBI; semi-automated) • TrEMBL (EMBL; automated)
Curated database: UniProt/SwissProt • SIB - Swiss Institute of Bioinformatics • Protein Knowledgebase / Sequence Database • Highly curated • Experimental evidence evaluated (e.g. modifications) • All 80,000 entries checked by Amos Bairoch himself ;-) • ExPASy - Expert Protein Analysis System • Proteomics tools: links + local servers
Structure databases / Protein Data Bank (PDB) • X-ray, NMR biomolecular structures • Protein Data Bank (PDB) • http://www.rcsb.org/pdb/
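Structures in the PDB are distributed, among other formats, as fixed-column text files in which each atom is one `ATOM` or `HETATM` line. As a minimal sketch of what such an entry contains, the parser below reads the main fields by column position. The `Atom` type, the function name, and the sample line (with made-up coordinates) are assumptions for illustration; for real work a maintained parser is the safer choice.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record of a PDB-format file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up example line, laid out in the fixed-column ATOM format.
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE))
```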
Functional Categorization • Gene Ontology (GO) • Hierarchical • Controlled vocabulary
Functional Categorization • Gene Ontology (GO) http://www.geneontology.org/ • Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase • Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
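The practical value of a hierarchical, controlled vocabulary is that an annotation to a specific term implies all of its more general parent terms. The sketch below shows that idea on a tiny hand-made hierarchy. The term names are borrowed from the Molecular Function examples above, but the parent links are simplified assumptions and not an extract of the real ontology, in which a term may also have several parents.

```python
from typing import Dict, List, Set

# Hand-made toy hierarchy: child term -> list of parent terms (assumption).
PARENTS: Dict[str, List[str]] = {
    "DNA helicase activity": ["helicase activity"],
    "helicase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term: str, parents: Dict[str, List[str]] = PARENTS) -> Set[str]:
    """Return every more general term reachable from `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    # A gene product annotated as a DNA helicase is implicitly also
    # annotated with each of these broader terms.
    print(sorted(ancestors("DNA helicase activity")))
```

Because the traversal keeps a `seen` set, it also works unchanged when a term has more than one parent.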
Integration of databases - Webs of web-sites • Links, links, links... • SRS = Sequence Retrieval System http://srs.ebi.ac.uk/ • Powerful, complex query language • BioDAS – Distributed Annotation System
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards+OMIM) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)
Genetic/Medical Information • OMIM, Online Mendelian Inheritance in Man (NCBI) • The OMIM database is a catalog of human genes and genetic disorders • >16,000 entries (April, 2006) • Examples: cystic fibrosis, prions, amyloid precursor protein • Condensed, highly curated descriptions of genetics/disease/animal models/references
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards+OMIM) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)
Genome Browsing • Three public • Open access • Use same genome build/assembly • NCBI (U.S.) • UCSC (Santa Cruz, U.S.) • Ensembl (EBI, Europe) • (One private) • (Restricted, commercial; closed 2005)
Genome Browsers - Portals to the Genomic World • UCSC – Univ. California – Santa Cruz (U.S.) • http://genome.ucsc.edu/ • NCBI – National Center for Biotechnology Information (U.S.) • http://www.ncbi.nlm.nih.gov/Genomes/index.html • Ensembl – EMBL-EBI and the Sanger Institute (Europe) • http://www.ensembl.org/
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs) or Gene Structure? (Prediction servers) • ...and evaluate the reliability of prediction methods
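One common way to evaluate the reliability of a prediction method is to compare its output with experimentally verified sites and summarise the result as sensitivity, specificity and the Matthews correlation coefficient (MCC). The sketch below assumes one true/false label per candidate site; the function name and the example labels are illustrative and do not describe how any particular server was benchmarked.

```python
import math
from typing import Dict, Sequence


def evaluate(predicted: Sequence[bool], actual: Sequence[bool]) -> Dict[str, float]:
    """Compare predicted sites with verified sites, position by position."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # avoid division by zero

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of real sites found
        "specificity": ratio(tn, tn + fp),  # fraction of non-sites rejected
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    predicted = [True, True, False, False, True, False]
    actual = [True, False, False, False, True, True]
    print(evaluate(predicted, actual))
```

MCC is useful here because modification sites are usually rare: a method that predicts "no site" everywhere scores high on specificity but gets an MCC of zero.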
NetPhos – a prediction server http://www.cbs.dtu.dk/services/NetPhos/