
Biology 4900


manjit

Presentation Transcript


  1. Biology 4900 Biocomputing

  2. Chapter 2 Molecular Databases and Data Analysis

  3. Literature Databases • Online databases available at CSU • Galileo • JSTOR • Online databases at other sites • PubMed. If you find a useful article, you can check PubMed Central to see if it is available online for free. • Where to get articles • PubMed Central • GIL • Interlibrary loan

  4. Sources of Molecular Data (diagram: DNA → RNA → protein → phenotype)
     • DNA: genomic DNA databases
     • RNA: cDNA, ESTs*, UniGene
     • Protein: protein sequence databases
     • *ESTs = Expressed Sequence Tags

  5. Molecular Databases
     • Primary databases
       • Archival: sequences submitted directly from experimental sequencing results
       • Very little interpretation
       • Anyone can submit; accuracy is not checked
       • Examples
         • Nucleic acid: EMBL, DDBJ, GenBank
         • Protein: Swiss-Prot, PIR, PDB
     • Secondary databases
       • Curated: sequences are validated/checked and may be annotated
       • RefSeq (nucleic acids and proteins, but limited to certain organisms)
       • TrEMBL, GenPept, UniProt

  6. Nucleic Acid Databases
     • Contain nucleic acid sequences, obtained by:
       • Chain termination method (Sanger sequencing)
         • Used for sequences of 100-1000 bp
       • Whole Genome Shotgun (WGS) sequencing
         • Used for sequences >1000 bp
         • DNA is chopped into little chunks
         • Chunks are sequenced using the chain termination method (reads)
         • Numerous overlapping reads are collected and assembled into a sequence (computational methods)
     • Contain annotations for each sequence:
       • Putative identification of open reading frames (ORFs = parts of a gene that encode protein) in the sequence
       • Putative intron (excised) / exon (retained) locations
       • Authors, dates, publication, etc.
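
The assembly step mentioned for shotgun sequencing can be illustrated with a toy greedy overlap merger. This is only a minimal sketch, assuming error-free reads from a single strand; the function names and the example reads are made up for illustration, and real assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.

    Returns the remaining contigs (one contig if all reads connect).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; leave contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads taken from one short sequence.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_len` bases are left as separate contigs, which mirrors how gaps in coverage prevent a complete assembly.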

  7. International Nucleotide Sequence Database Collaboration (public nucleotide and protein sequence databases)
     • GenBank: National Center for Biotechnology Information, National Institutes of Health
     • EMBL (European Molecular Biology Laboratory): European Bioinformatics Institute (EBI)
     • DDBJ (DNA Database of Japan): National Institute of Genetics, Mishima
     • GenBank, EMBL, and DDBJ share information with one another daily

  8. GenBank
     • As of April 2011, there were approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions.
     • Read the following paper: http://www.ncbi.nlm.nih.gov/pubmed/21071399
     • Home page: http://www.ncbi.nlm.nih.gov/genbank/
     • Most sequenced organisms (billions of bases):
       • Homo sapiens: 14.9
       • Mus musculus: 8.9
       • Rattus norvegicus: 6.5
       • Bos taurus: 5.4
       • Zea mays: 5.0
       • Sus scrofa: 4.8
       • Danio rerio: 3.1
       • Strongylocentrotus purpuratus: 1.4
       • Oryza sativa (japonica): 1.2
       • Nicotiana tabacum: 1.2

  9. GenBank Home Page

  10. NCBI Resources • PubMed • BLAST • OMIM • Taxonomy Browser • Structure

  11. NCBI key features: PubMed • National Library of Medicine's search service • 21 million citations from MEDLINE & others (as of 2011) • Links to other online journals • http://www.ncbi.nlm.nih.gov/pubmed • Starting point for most research

  12. Literature Searches through PubMed

  13. Use the pull-down menu to access related resources such as Medical Subject Headings (MeSH)

  14. A “how to” pull-down menu links to tutorials

  15. Use “Advanced search” to limit by author, year, language, etc.

  16. PubMed search strategies • Try the tutorial • Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease • Try using limits (see Advanced search) • There are links to find Entrez entries and external resources
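
A Boolean query like the one above can also be sent to PubMed programmatically. The sketch below only builds the request URL and does not contact the server. It assumes NCBI's E-utilities `esearch` endpoint, which the slides do not mention; the helper names `pubmed_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an addition; the slides use the web page).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(*terms: str, operator: str = "AND") -> str:
    """Join search terms with a capitalized Boolean operator."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f" {op} ".join(terms)


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build (but do not fetch) an esearch URL for the PubMed database."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = pubmed_query("lipocalin", "disease")
print(query)
print(esearch_url(query))
```

Lowercase operators passed to `pubmed_query` are capitalized automatically, following the advice on the slide.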

  17. Boolean queries on sets 1 and 2 (Venn diagrams) • 1 AND 2: lipocalin AND disease (504 results) • 1 OR 2: lipocalin OR disease (2,500,000 results) • 1 NOT 2: lipocalin NOT disease (2,370 results)
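
The three operators correspond to set intersection, union, and difference. A minimal sketch with made-up article IDs (the numbers below are hypothetical, not real PubMed identifiers):

```python
def boolean_search(set1: set[int], set2: set[int], operator: str) -> set[int]:
    """Combine two result sets the way AND, OR, and NOT combine queries."""
    if operator == "AND":
        return set1 & set2   # articles matching both terms
    if operator == "OR":
        return set1 | set2   # articles matching either term
    if operator == "NOT":
        return set1 - set2   # articles matching term 1 but not term 2
    raise ValueError(f"unsupported operator: {operator!r}")


# Hypothetical article IDs returned for each single-term search.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

for op in ("AND", "OR", "NOT"):
    print(op, sorted(boolean_search(lipocalin, disease, op)))
```

This is why OR returns far more results than AND in the counts above: the union can never be smaller than either set, and the intersection can never be larger.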

  18. Save Searches, Save Results, Get Papers

  19. PubMed Author Search

  20. Google Scholar Search • http://scholar.google.com/ • Includes references that may not be found in PubMed

  21. NCBI key features • A search from the NCBI main page will search: • the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes • String search • Search by author, date, keyword, publication, etc. • Classroom exercise: author searches, paper searches, protein searches

  22. NCBI key features: BLAST • BLAST is… • Basic Local Alignment Search Tool • NCBI's sequence similarity search tool • supports analysis of DNA and protein databases (Figure: PDB structure 3CLN)

  23. NCBI key features: OMIM • Online Mendelian Inheritance in Man • Catalog of human genes and genetic disorders

  24. NCBI key features: Taxonomy Browser • Browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) • Taxonomy information such as genetic codes • Molecular data on extinct organisms • Useful to find a protein or gene from a species

  25. NCBI key features: Structure • Molecular Modeling Database (MMDB): biopolymer structures obtained from the Protein Data Bank (PDB) • Cn3D (a 3D-structure viewer) • Vector Alignment Search Tool (VAST)

  26. Cn3D • A 3D-structure viewer • Must download (ftp://ftp.ncbi.nlm.nih.gov/cn3d/Cn3D-4.3.msi) • Use to align structures identified as similar by VAST

  27. Example: Researching beta globin
      • Beta globin is a protein, so it will be found in 3 different types of databases:
      • DNA: GenBank, GenBank dbGSS, GenBank dbHTGS, GenBank dbSTS, Entrez Gene
      • RNA*: GenBank dbEST, UniGene, Gene Expression Omnibus
      • Proteins: Entrez Protein, UniProt, PDB, SCOP, CATH
      • *Because RNA is unstable, it can be transcribed into complementary DNA (cDNA)

  28. Necessary (yet annoying) Definitions • Sequence Tagged Site (STS): Small DNA fragments with both DNA sequence data and mapping data (genes assigned to chromosomes) • Expressed Sequence Tags (EST): Partial DNA sequence of a complementary (cDNA) clone • Typically these are randomly-selected cDNA clones sequenced on a single strand (300-800 bp) • Useful for identifying novel genes • Higher rate of error http://genome.wellcome.ac.uk/doc_WTD020755.html

  29. UniGene
      • The Unique Gene (UniGene) project creates gene-oriented clusters by partitioning ESTs into non-redundant sets
      • http://www.ncbi.nlm.nih.gov/unigene
      • Ultimately there should be only 1 cluster per gene
      • Usually there is more than 1 due to errors
      • Types of errors
        • 2 or more clusters may represent different parts of the same gene
        • Sequence errors
        • Cloning artifacts (DNA transcribed during creation of cDNA that doesn't correspond to an authentic transcript)
      • (Figure: ESTs grouped into a UniGene cluster)
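
The idea of partitioning ESTs into non-redundant sets can be illustrated with a toy clustering routine. This is not UniGene's actual procedure: it is a minimal sketch that assumes two ESTs belong together whenever they share any exact k-mer, and the sequences and names (`est1`, `cluster_ests`, and so on) are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: dict[str, str], k: int = 8) -> list[set[str]]:
    """Group ESTs that share at least one exact k-mer (single linkage)."""
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[str, str] = {}  # k-mer -> first EST that contained it
    for name, seq in ests.items():
        for km in kmers(seq.upper(), k):
            if km in first_seen:
                parent[find(name)] = find(first_seen[km])  # merge clusters
            else:
                first_seen[km] = name

    clusters: dict[str, set[str]] = {}
    for name in ests:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Hypothetical ESTs: est1 and est2 overlap; est3 is unrelated.
ests = {
    "est1": "ATGGCGTACGTTAGC",
    "est2": "GTACGTTAGCCATGA",
    "est3": "TTTTTTCCCCCCGGGG",
}
print(cluster_ests(ests))
```

Two ESTs from non-overlapping parts of the same gene would land in separate clusters here, which is one of the error types listed on the slide.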

  30. UniGene http://www.ncbi.nlm.nih.gov/unigene

  31. GenBank Flatfile
      • A format for organizing genomic sequence data. Includes the following:
      • Sequence and annotations
      • Header
        • Locus name or accession number: unique to sequence description
        • Size: number of nucleotide bases or amino acid residues
        • Molecule: DNA, RNA, strandedness (ds, ss), and type of RNA or DNA
        • GenBank division code: 18 divisions (PRI = primate, PLN = plant, BAC = bacterial, etc.)
        • Date of last modification
      • Definition line: brief description of sequence (e.g. source organism, protein/gene name, function)
      • Accession: unique identifier for a record
      • Version
        • May be more than one accession
        • Record modification (accession.1; accession.2)
        • GI: is specific to version; may be more than one
      • Keywords
      • Source: organism or clone description
      • Reference: publications that discuss data reported
        • Authors and journal publication info
        • PubMed identifier: link to sequence record (abstract)
      • Features: vary (chromosomal info, coding info, protein id, % of each nucleotide)
      • Sequence data
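
To make the flatfile layout concrete, the sketch below reads the header fields and sequence from a small record. The record is entirely made up for illustration (EXAMPLE01 is not a real accession), and the parser is a minimal sketch that ignores the REFERENCE and FEATURES sections; for real work a library such as Biopython would be the more usual choice.

```python
# Hypothetical record, for illustration only (not a real GenBank entry).
SAMPLE = """\
LOCUS       EXAMPLE01                 24 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record,
            for illustration only.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.2
KEYWORDS    .
SOURCE      Homo sapiens (human)
ORIGIN
        1 atggcgtacg ttagccatga ctga
//
"""


def parse_header(record: str) -> dict[str, str]:
    """Map each header keyword (LOCUS, DEFINITION, ...) to its text."""
    fields: dict[str, str] = {}
    key = None
    for line in record.splitlines():
        if line.startswith("ORIGIN") or line.startswith("//"):
            break  # the header ends where the sequence data begins
        if line and not line[0].isspace():
            key, _, value = line.partition(" ")  # keyword starts in column 1
            fields[key] = value.strip()
        elif key and line.strip():
            fields[key] += " " + line.strip()    # indented continuation line
    return fields


def parse_locus(value: str) -> dict[str, object]:
    """Split the LOCUS line into name, size, molecule, division, and date."""
    name, length, unit, molecule, topology, division, date = value.split()
    return {"name": name, "length": int(length), "unit": unit,
            "molecule": molecule, "topology": topology,
            "division": division, "date": date}


def parse_sequence(record: str) -> str:
    """Concatenate the letters between ORIGIN and the closing //."""
    chunks, in_origin = [], False
    for line in record.splitlines():
        if line.startswith("//"):
            break
        if in_origin:
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "".join(chunks)


header = parse_header(SAMPLE)
locus = parse_locus(header["LOCUS"])
print(locus["division"], locus["length"], header["VERSION"])
print(parse_sequence(SAMPLE))
```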

  32. What is an accession number?
      • An accession number is a label that is used to identify a sequence. It is a (unique) string of letters and/or numbers that corresponds to a molecular sequence.
      • Examples (all for retinol-binding protein, RBP4):
        • DNA
          • X02775: GenBank genomic DNA sequence
          • NT_030059: Genomic contig (overlapping DNA fragments)
          • rs7079946: dbSNP (single nucleotide polymorphism)
        • RNA
          • N91759.1: An expressed sequence tag (1 of 170)
          • NM_006744: RefSeq DNA sequence (from a transcript)
        • Protein
          • NP_006735: RefSeq protein
          • AAC02945: GenBank protein
          • Q28369: SwissProt protein
          • 1KT7: Protein Data Bank structure record

  33. NCBI’s important RefSeq project: best representative sequences
      • RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
      • RefSeq identifiers include the following formats:
        • Complete genome: NC_######
        • Complete chromosome: NC_######
        • Genomic contig: NT_######
        • mRNA (DNA format): NM_###### (e.g. NM_006744)
        • Protein: NP_###### (e.g. NP_006735)
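
The RefSeq prefixes listed above lend themselves to a simple lookup. The sketch below is a minimal illustration that only knows the four prefixes from this slide (RefSeq defines others) and also splits off an optional version suffix such as the ".2" in NP_006179.2; the function name `describe_refseq` is hypothetical.

```python
import re

# Only the prefixes named on the slide; RefSeq uses additional ones.
REFSEQ_PREFIXES = {
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Two letters, underscore, digits, then an optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq(accession: str):
    """Return (molecule type, version or None), or None if not recognized."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None or match.group(1) not in REFSEQ_PREFIXES:
        return None
    version = int(match.group(3)) if match.group(3) else None
    return REFSEQ_PREFIXES[match.group(1)], version


for acc in ("NM_006744", "NP_006179.2", "NT_030059", "X02775"):
    print(acc, describe_refseq(acc))
```

Accessions without the two-letter-plus-underscore form, such as the GenBank record X02775, are reported as unrecognized rather than guessed at.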

  34. UniGene Name Search: Oncomodulin • All results listed • Allows filtering

  35. UniGene Name Search: Select Human Oncomodulin • 4 Expressed Sequence Tags from 1 complementary DNA library • Identifies chromosome and map position on chromosome • Compares cluster transcripts with RefSeq proteins

  36. UniGene Name Search: Select Human Oncomodulin • Click on the link for a menu of other links: conserved domains, gene summary, protein sequence • Clicking on the protein sequence link then takes you to the predicted protein sequence file (NP_006179.2)

  37. UniGene Name Search: Select Human Oncomodulin • Once here, you can: (1) open the FASTA file, (2) run BLAST, (3) identify and view conserved domains, (4) see related proteins

  38. Access to sequences: Gene at NCBI • Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. • Example: RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mRNA) or protein (NP_000509). • These reference sequences should be more reliable than uncurated submissions.

  39. Gene Name Search: Oncomodulin • Returns a list of gene entries for oncomodulin for different organisms • Click on a highlighted link to see details

  40. Gene Name Search: Select Human Oncomodulin • Summary of all gene information, including mapping (when available). Note that this sequence has been validated as a RefSeq. • Scrolling down, you can find a link to protein data through UniProt.

  41. Gene Name Search: Link to Oncomodulin Protein

  42. Protein Name Search: Oncomodulin • Notice that I filtered this search so that results show only human oncomodulin

  43. You can change the display (as shown)…

  44. FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code
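
Because the format is just a header line starting with ">" followed by sequence lines, a FASTA reader fits in a few lines. The sketch below is a minimal illustration; the two records are made up (the identifiers and sequences are hypothetical), and the function name `parse_fasta` is not from any particular library.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without the leading '>') to its sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                     # skip blank lines
        if line.startswith(">"):
            header = line[1:]            # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical FASTA text with one protein and one nucleotide record.
SAMPLE_FASTA = """\
>example_protein hypothetical protein, for illustration only
MSITDILSAE
DIAAALQ
>example_dna hypothetical nucleotide sequence
ATGGCGTACGTTAGC
"""

for name, seq in parse_fasta(SAMPLE_FASTA).items():
    print(name.split()[0], len(seq), seq)
```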

  45. Comparison of Gene to other resources • Gene: collects key information on each gene/protein from major databases. It covers all major organisms. • UniGene: database with information on where in a body, when in development, and how abundantly a transcript is expressed. • HomoloGene: gathers information on sets of related proteins based on common genetic ancestry.

  46. HomoloGene Name Search: Oncomodulin • Provides a list of homologous (related) genes

  47. HomoloGene Name Search: Oncomodulin • Shows conserved domains of protein sequences. • Clicking on the graphic takes you to a summary of domain/family information.

  48. ExPASy to access protein and DNA sequences • ExPASy (Expert Protein Analysis System) sequence retrieval system • Visit http://www.expasy.ch/ • Similar to Entrez for NCBI • Example: search for calmodulin

  49. UniProt: a centralized protein database (uniprot.org) • It is separate from NCBI but interlinked with it.

  50. UniProt: Calmodulin • Search Results for bovine calmodulin (P62157)
