
Access to Sequence Data and Literature Information



  1. Access to Sequence Data and Literature Information Department of Computer and Information Science Spring 2012 CISC 4020 Bioinformatics

  2. NCBI
     • National Center for Biotechnology Information @ National Institutes of Health
     • http://www.ncbi.nlm.nih.gov/
     • NCBI hosts GenBank
       • a database of most known public DNA, RNA, and protein sequences (about 100 billion nucleotides from 100 million sequences).
       • It contains bibliographic and biological annotation for these sequences.

  3. Growth of GenBank [Figure: base pairs of DNA (millions) and sequences (millions) plotted by year, 1982 to 2002]

  4. Three major public DNA databases
     • EMBL: housed at EBI (European Bioinformatics Institute)
     • GenBank: housed at NCBI (National Center for Biotechnology Information)
     • DDBJ: housed in Japan (DNA Data Bank of Japan)
     The underlying raw DNA sequences are identical. (Page 14)

  5. Taxonomy at NCBI: >200,000 species are represented in GenBank
     http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

  6. National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov

  7. Entrez @ NCBI
     • A search and retrieval system
     • Entrez integrates…
       • the scientific literature;
       • DNA and protein sequence databases;
       • 3D protein structure data;
       • population study data sets;
       • assemblies of complete genomes

  8. Entrez Gene
     • A great starting point
     • It collects key information on each gene/protein from major databases.
     • It covers all major organisms.

  9. Accession number
     • A label for a molecular sequence.
     • Different databases have characteristic formats.
     • GenBank formats
       • X02775
       • AF025334
     • In addition, NCBI assigns unique sequence IDs to the individual sequences within a record, e.g. GI:28302128
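The identifier shapes on this slide can be recognized with simple patterns. The following is a minimal sketch, assuming the two example layouts shown (one letter plus five digits, two letters plus six digits) and the `GI:` prefix; the function name `describe_identifier` and the labels it returns are illustrative, and other accession layouts exist that this sketch does not cover.

```python
import re

# Assumed patterns, based only on the examples on the slide.
ONE_PLUS_FIVE = re.compile(r"[A-Z]\d{5}")     # e.g. X02775
TWO_PLUS_SIX = re.compile(r"[A-Z]{2}\d{6}")   # e.g. AF025334
GI_NUMBER = re.compile(r"GI:\d+")             # e.g. GI:28302128


def describe_identifier(text):
    """Return a rough, illustrative label for a sequence identifier."""
    token = text.strip()
    if ONE_PLUS_FIVE.fullmatch(token):
        return "GenBank accession (1 letter + 5 digits)"
    if TWO_PLUS_SIX.fullmatch(token):
        return "GenBank accession (2 letters + 6 digits)"
    if GI_NUMBER.fullmatch(token):
        return "GI number"
    return "unrecognized"


for example in ("X02775", "AF025334", "GI:28302128", "HBB"):
    print(example, "->", describe_identifier(example))
```

A gene symbol such as HBB falls through to "unrecognized", which is the intended behavior: it names a gene, not a sequence record.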

  10. Example: beta globin
      • There are thousands of accession numbers for beta globin (thousands of core nucleotide sequences).
        • full length vs. partial.
        • natural variants vs. spliced transcripts.
        • errors.

  11. Example: beta globin. From the NCBI home page, type “beta globin” and hit “Search”

  12. CISC 4020 Bioinformatics

  13. RefSeq: The Reference Sequence Project
      • One RefSeq entry for a given gene or gene product
        • NM_000000 (or more digits)
      • Several RefSeq entries if there are splice variants or distinct loci.
      • Example: myoglobin has 3 distinct splice variants, which encode an identical protein.
        • NM_005368, NM_203377, and NM_203378
        • NP_005359, NP_976311, and NP_976312 for corresponding proteins (different sources)
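RefSeq accessions carry their molecule type in the prefix: NM_ marks an mRNA transcript and NP_ a protein. Below is a small sketch that splits an accession into prefix, number, and optional version suffix. The helper `parse_refseq` is hypothetical, only the two prefixes from the slide are labeled, and the pairing of transcripts to proteins assumes the slide lists them in matching order.

```python
import re

# Only the two prefixes used on the slide are labeled here.
REFSEQ_KINDS = {"NM": "mRNA", "NP": "protein"}

# Two-letter prefix, underscore, six or more digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_(\d{6,})(?:\.(\d+))?")


def parse_refseq(accession):
    """Split a RefSeq-style accession into its parts, or return None."""
    match = REFSEQ_PATTERN.fullmatch(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "kind": REFSEQ_KINDS.get(prefix, "other"),
        "number": number,
        "version": int(version) if version else None,
    }


# Myoglobin entries from the slide; the pairing order is an assumption.
transcripts = ["NM_005368", "NM_203377", "NM_203378"]
proteins = ["NP_005359", "NP_976311", "NP_976312"]
for nm, np_acc in zip(transcripts, proteins):
    print(nm, parse_refseq(nm)["kind"], "->", np_acc, parse_refseq(np_acc)["kind"])
```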

  14. Follow the link to “Gene”

  15. Entrez Gene is in the header. Note the “Official Symbol” HBB for beta globin. Note the “limits” option.

  16. Using “limits” you can restrict your search to human (or any other organism)

  17. By applying limits, there are now far fewer entries

  18. Entrez Gene (top of page): Note a useful summary, and links to other databases

  19. Entrez Gene (middle of page): genomic region, gene sequence

  20. GenBank Entry

  21. Gene Sequence: lower part of the entry page

  22. Entrez Gene (middle of page, continued): phenotypes, function

  23. Entrez Gene (bottom of page): RefSeq accession numbers for Protein

  24. Entrez Protein: accession, organism, literature… (Fig. 2.8)

  25. Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code (Fig. 2.8)

  26. One-letter amino acid code
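The table from this slide did not survive in the transcript. As a stand-in, here is a sketch of the standard one-letter code for the 20 common amino acids, written as a Python lookup. The helper `to_three_letter` and the short example peptide are illustrative additions, not part of the original slide.

```python
# Standard one-letter to three-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Spell out a one-letter protein sequence in three-letter codes.

    Raises KeyError for letters outside the 20 standard residues.
    """
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())


# A short example peptide, used only for illustration.
print(to_three_letter("MVHLTPEEK"))
```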

  27. Entrez Protein: You can change the display (as shown)…

  28. FASTA format: versatile and compact, with one header line followed by a string of nucleotides or amino acids in the single-letter code
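To make the layout concrete, here is a minimal sketch of a FASTA reader: a line starting with `>` is the header, and every following line up to the next header is sequence. The function name `parse_fasta` and the sample record (including its header text) are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the header text is invented for this example.
example = """>example_protein hypothetical header text
MVHLTPEEKSAVTALWGKV
NVDEVGGEALGRLLVVYPW
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```

Joining the wrapped lines matters in practice, since sequence lines are usually broken at a fixed width.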

  29. Entrez Gene & Entrez Protein
      • Begin your search with Entrez Gene.
        • Identify the official gene name.
        • Identify the chromosomal location of the gene.
        • Identify the RefSeq accession numbers associated with all the DNA and protein variants.
      • Visit Entrez Protein
        • Obtain the FASTA format of the protein
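The slides perform the last step through the website. The same record can also be requested programmatically through NCBI's E-utilities, which is a different route from the one shown in the slides. The sketch below only builds the `efetch` request URL and does not contact the server; the helper name `fasta_url` is an assumption, and NP_005359 is one of the myoglobin protein accessions from slide 13.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="protein"):
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    params = {
        "db": db,             # which Entrez database to read from
        "id": accession,      # accession number of the record
        "rettype": "fasta",   # return the record in FASTA format
        "retmode": "text",    # as plain text
    }
    return EFETCH + "?" + urlencode(params)


print(fasta_url("NP_005359"))
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the FASTA text, subject to NCBI's usage guidelines.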

  30. PubMed @ NCBI
      • National Library of Medicine's search service
      • 21 million citations in MEDLINE (as of 2011).
      • links to online full-text journal articles and the integrated molecular biology databases.
      • PubMed tutorial on the site or visit NLM: http://www.nlm.nih.gov/bsd/disted/pubmed.html

  31. BLAST @ NCBI
      • Basic Local Alignment Search Tool
      • NCBI's sequence similarity search tool
      • supports analysis of DNA and protein databases
      • 100,000 searches per day

  32. OMIM @ NCBI
      • Online Mendelian Inheritance in Man
      • catalog of human genes and genetic disorders
      • contains links to PubMed articles and sequence information.
      • created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI

  33. TaxBrowser @ NCBI
      • A taxonomy browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses).
      • It features
        • taxonomy information
        • genetic codes
        • taxonomy resources
        • molecular data on extinct organisms
        • recent changes to classification schemes.
      • Practically useful for finding a protein or gene from a species
      • Each organism or group in GenBank is assigned a unique taxonomy identifier.
        • txid9606 for human (Homo sapiens)

  34. How many gene entries are there for human?

  35. Result

  36. Proteins for human

  37. Proteins of molecular weight 2,000 to 10,000 daltons.
      txid9606[Organism:exp] AND 002000:010000[molwt]
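The query above pads both weights to six digits. A small sketch of a helper that assembles such a query string follows; the function name `molwt_query` is hypothetical, and the six-digit padding simply mirrors the format shown on the slide.

```python
def molwt_query(taxid, low, high):
    """Build an Entrez protein query limited by organism and molecular weight.

    Weights are in daltons and are zero-padded to six digits, mirroring
    the query shown on the slide.
    """
    if not 0 <= low <= high <= 999999:
        raise ValueError("weights must satisfy 0 <= low <= high <= 999999")
    return f"txid{taxid}[Organism:exp] AND {low:06d}:{high:06d}[molwt]"


# Human (txid9606) proteins between 2,000 and 10,000 daltons.
print(molwt_query(9606, 2000, 10000))
```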

  38. Result

  39. Tips for using Entrez databases
      • Use the Boolean operators AND, OR and NOT (must be capitalized).
      • To narrow your search to a particular organism, look up the organism in Taxonomy first.
      • Limit the weight of a protein by adding XXXXXX:YYYYYY[molwt] to the query.
      • Put * as the suffix of your query to get all records that begin with a particular text string.
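These tips can be combined in one query. The sketch below joins two terms with a capitalized Boolean operator, appends `*` for a prefix search, and wraps the result in an E-utilities `esearch` URL. The helper names `combine`, `starts_with`, and `esearch_url` are assumptions, as is the search stem "glob"; the code builds the URL only and sends nothing.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def combine(operator, left, right):
    """Join two query terms with a Boolean operator, forcing capitals."""
    op = operator.upper()
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{left} {op} {right}"


def starts_with(stem):
    """Prefix search: match every term that begins with `stem`."""
    return stem + "*"


def esearch_url(db, term):
    """Build an esearch URL; urlencode escapes spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = combine("and", "txid9606[Organism:exp]", starts_with("glob"))
print(term)
print(esearch_url("protein", term))
```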
