Introduction to NCBI (and Other Online Bioinformatics Resources) Society for Developmental Biology 2008. Li-San Wang, Penn Center for Bioinformatics, University of Pennsylvania. lswang@mail.med.upenn.edu http://people.pcbi.upenn.edu/~lswang/
Outline • Introduction to the NCBI databases and web services • Introduction to some concepts in bioinformatics • Hands-on experience • Other online resources: • UCSC Genome Browser and NIAID DAVID
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html (Nov 2004)
Entities • Genome, Chromosome • Gene, Exon, Intron • Protein, Domain, SNP … Relations • Homology • Taxonomy • Ontology • OMIM etc. Annotations • Phenotype • Publication • Gene Expression
Some Common Tasks • Find information about a gene/genome, etc. • Find homologs • Find genes related to a phenotype • Find similar sequences to an input sequence (BLAST)
NCBI Entrez (Google “Entrez”) http://www.ncbi.nlm.nih.gov/sites/gquery
Accession Numbers • Example: TP53 • NM_000546.4 (gi:187830767) → NP_000537.3 (gi:120407068), tumor protein p53 isoform a • http://www.ncbi.nlm.nih.gov/Sequin/acc.html
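A RefSeq identifier such as NM_000546.4 combines a prefix, a number, and a version. The following is a minimal sketch of splitting one apart. The function name `parse_accession` and the small prefix table are illustrative assumptions, not part of the slides, and the table covers only a few common RefSeq prefixes.

```python
import re

# Assumed lookup of a few common RefSeq prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
}


def parse_accession(acc):
    """Split a versioned RefSeq accession into (accession, version, molecule type)."""
    match = re.fullmatch(r"([A-Z]{2}_\d+)\.(\d+)", acc)
    if match is None:
        raise ValueError(f"not a versioned RefSeq accession: {acc!r}")
    base = match.group(1)
    version = int(match.group(2))
    # The first three characters (two letters plus underscore) give the record type.
    return base, version, REFSEQ_PREFIXES.get(base[:3], "unknown")


print(parse_accession("NM_000546.4"))
print(parse_accession("NP_000537.3"))
```

The version suffix changes whenever the sequence itself is revised, so it is worth recording the full versioned accession when citing a sequence.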
File Formats • FASTA • GFF • XML
NCBI Entrez Gene (previously LocusLink) http://www.ncbi.nlm.nih.gov/sites/entrez
Exercise (NCBI Minicourse) • Retrieve human entries related to "prion protein" in Entrez Gene. • Name the map location of this gene on the human genome. • What is the function of this protein? • What are the alternate gene symbols? • Name the phenotypes associated with the mutations in this gene. • How many alternatively spliced products have been annotated for the gene?
Entrez Gene and dbSNP • Retrieve the human prion protein gene (PRNP) in Entrez Gene • Identify the variations annotated on this gene by clicking on the SNP:geneView link. • How many of them are nonsynonymous changes? • Are there known SNPs in the coding region of the gene that are associated with any phenotype?
Exercise • Find human GDNF on Map Viewer • Download the gene sequence and 5 kb upstream by using the "dl" link. • Add the Component and Contig maps for this region. Name the contig and GenBank accession numbers for the sequence covering this region. Are the sequences finished? • Add the Ab initio (model) and Transcript (RNA) maps. How many alternatively spliced transcripts have been annotated for the gene? • Display the current data as "Data As Table View". • Add the phenotype map. Name the disease with which the GDNF gene is associated. Obtain more information about the disease by linking to the corresponding OMIM record.
NCBI Genome and Genome Project http://www.ncbi.nlm.nih.gov/Genomes/
Relations Between Sequence Data • Gene • UniGene • HomoloGene • Taxonomy
UniGene http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=2723799&TAXID=9606&SEARCH=tp53%20AND%20human%20[organism]
Exercise • Locate chimpanzee using TaxBrowser. What is its lineage? How many subspecies are there? • How many genome projects are under the class Mammalia? • Find the common tree of the following species: • Human/Chimp/Dog/Horse/Mouse/Rat/Chicken/Zebrafish • Which is closer to human: mouse or dog? • Which species diverged earliest from the human lineage?
Example Query • Gene: Prion Protein (PRNP) (or your preferred gene) • How many proteins does the gene encode? • What proteins in other organisms are homologous to this protein? • What are the domains in the protein? Find a sequence alignment to its homologs. • View the conserved regions on the 3D structure (download NCBI Cn3D)
GEO http://www.ncbi.nlm.nih.gov/geo/
Examples • What human genes are related to hypertension? Which of those genes are on chromosome 17? • List the OMIM entries that describe genes on chromosome 10. • List the OMIM entries that contain information about allelic variants. • Retrieve the OMIM record for the cystic fibrosis transmembrane conductance regulator (CFTR), and link to related protein sequence records via Entrez. • Find the OMIM record for the p53 tumor protein, and link out to related information in Entrez Gene and the p53 Mutation Database. http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#SampleQuestions
Complex Queries cancer[titl] AND 11[chrom] AND autosomal dominant[clin] • The Boolean operators AND, OR, and NOT must be written in upper case. Use parentheses to control precedence. • Search field tags are enclosed in square brackets and follow the term they qualify.
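The query syntax above can be assembled programmatically. This is a minimal sketch only: the helper names `field_term` and `combine` are hypothetical, and the field tags are simply the ones shown on the slide, which may not be valid for every Entrez database.

```python
from urllib.parse import quote_plus


def field_term(text, tag):
    """Attach a search field tag in square brackets, e.g. cancer[titl]."""
    return f"{text}[{tag}]"


def combine(operator, *terms):
    """Join terms with an upper-case Boolean operator, wrapped in parentheses."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    field_term("cancer", "titl"),
    field_term("11", "chrom"),
    field_term("autosomal dominant", "clin"),
)
print(query)
# URL-encode the query before placing it in a search URL.
print(quote_plus(query))
```

Wrapping every combination in parentheses makes precedence explicit, so nested calls to `combine` stay unambiguous.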
Quick Review • GenBank • Entrez Gene, HomoloGene, UniGene • Protein structures and CDD • Taxonomy • GEO • OMIM • Complex queries
Example (NCBI PubMed tutorial exercise 4) • Use the MeSH Database to build a strategy that will find citations to articles about schizophrenia resulting from prenatal exposure to influenza. Schizophrenia and influenza should be the major topics of the articles.
Basic Local Alignment Search Tool (BLAST) • Usage: Find sequences in a database that are similar to the input sequence • Applications: • Infer the function of newly sequenced genes • Predict new members of gene families • Explore evolutionary relationships • Predict the location and function of protein-coding and transcription-regulation regions in genomic DNA
How BLAST works • Sequence databases are preprocessed for faster access by BLAST • Given an input sequence S: • List all k-mers (e.g. k=11 for DNA) of S • Find sequences in the database having similar k-mers • Extend the matched words to form high-scoring segment pairs (HSPs) • Evaluate the significance of each HSP
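The seed-and-extend steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not BLAST itself: it uses exact k-mer matches only, ungapped extension with a simple drop-off rule, hypothetical scoring values (match=1, mismatch=-3, drop=5), and it omits the significance evaluation entirely.

```python
from collections import defaultdict


def build_index(db, k):
    """Preprocess the database: map every k-mer to its (sequence name, offset) hits."""
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def extend_one_way(query, subject, q, s, step, match, mismatch, drop):
    """Ungapped extension in one direction; stop once the score falls 'drop' below the best."""
    score = best = best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > drop:
            break
        q += step
        s += step
    return best, best_len


def find_hsps(query, db, k=11, match=1, mismatch=-3, drop=5):
    """Return (name, query start, subject start, length, score) tuples, best score first."""
    index = build_index(db, k)
    hsps = set()
    for q in range(len(query) - k + 1):
        for name, s in index.get(query[q:q + k], []):
            subject = db[name]
            right, rlen = extend_one_way(query, subject, q + k, s + k, 1, match, mismatch, drop)
            left, llen = extend_one_way(query, subject, q - 1, s - 1, -1, match, mismatch, drop)
            hsps.add((name, q - llen, s - llen, k + llen + rlen, k * match + left + right))
    return sorted(hsps, key=lambda h: -h[-1])


db = {"s1": "TTTTACGTACGTACGTCCCC"}
print(find_hsps("ACGTACGTACGT", db, k=11))
```

Seeds that lie on the same diagonal extend to the same segment pair, which is why the results are collected in a set before sorting.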
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
>gi|187960039|ref|NM_001127233.1| Mus musculus transformation related protein 53 (Trp53), transcript variant 2, mRNA TTTCCCCTCCCACGTGCTCACCCTGGCTAAAGTTCTGTAGCTTCAGTTCATTGGGACCATCCTGGCTGTAGGTAGCGACTACAGTTAGGGGGCACCTAGCATTCAGGCCCTCATCCTCCTCCTTCCCAGCAGGGTGTCACGCTTCTCCGAAGACTGGATGACTGCCATGGAGGAGTCACAGTCGGATATCAGCCTCGAGCTCCCTCTGAGCCAGGAGACATTTTCAGGCTTATGGAAACTACTTCCTCCAGAAGATATCCTGCCATCACCTCACTGCATGGACGATCTGTTGCTGCCCCAGGATGTTGAGGAGTTTTTTGAAGGCCCAAGTGAAGCCCTCCGAGTGTCAGGAGCTCCTGCAGCACAGGACCCTGTCACCGAGACCCCTGGGCCAGTGGCCCCTGCCCCAGCCACTCCATGGCCCCTGTCATCTTTTGTCCCTTCTCAAAAAACTTACCAGGGCAACTATGGCTTCCACCTGGGCTTCCTGCAGTCTGGGACAGCCAAGTCTGTTATGTGCACGTACTCTCCTCCCCTCAATAAGCTATTCTGCCAGCTGGCGAAGACGTGCCCTGTGCAGTTGTGGGTCAGCGCCACACCTCCAGCTGGGAGCCGTGTCCGCGCCATGGCCATCTACAAGAAGTCACAGCACATGACGGAGGTCGTGAGACGCTGCCCCCACCATGAGCGCTGCTCCGATGGTGATGGCCTGGCTCCTCCCCAGCATCTTATCCGGGTGGAAGGAAATTTGTATCCCGAGTATCTGGAAGACAGGCAGACTTTTCGCCACAGCGTGGTGGTACCTTATGAGCCACCCGAGGCCGGCTCTGAGTATACCACCATCCACTACAAGTACATGTGTAATAGCTCCTGCATGGGGGGCATGAACCGCCGACCTATCCTTACCATCATCACACTGGAAGACTCCAGTGGGAACCTTCTGGGACGGGACAGCTTTGAGGTTCGTGTTTGTGCCTGCCCTGGGAGAGACCGCCGTACAGAAGAAGAAAATTTCCGCAAAAAGGAAGTCCTTTGCCCTGAACTGCCCCCAGGGAGCGCAAAGAGAGCGCTGCCCACCTGCACAAGCGCCTCTCCCCCGCAAAAGAAAAAACCACTTGATGGAGAGTATTTCACCCTCAAGATCCGCGGGCGTAAACGCTTCGAGATGTTCCGGGAGCTGAATGAGGCCTTAGAGTTAAAGGATGCCCATGCTACAGAGGAGTCTGGAGACAGCAGGGCTCACTCCAGCCTCCAGCCTAGAGCCTTCCAAGCCTTGATCAAGGAGGAAAGCCCAAACTGCTAGCTCCCATCACTTCATCCCTCCCCTTTTCTGTCTTCCTATAGCTACCTGAAGACCAAGAAGGGCCAGTCTACTTCCCGCCATAAAAAAACAATGGTCAAGAAAGTGGGGCCTGACTCAGACTGACTGCCTCTGCATCCCGTCCCCATCACCAGCCTCCCCCTCTCCTTGCTGTCTTATGACTTCAGGGCTGAGACACAATCCTCCCGGTCCCTTCTGCTGCCTTTTTTACCTTGTAGCTAGGGCTCAGCCCCCTCTCTGAGTAGTGGTTCCTGGCCCAAGTTGGGGAATAGGTTGATAGTTGTCAGGTCTCTGCTGGCCCAGCGAAATTCTATCCAGCCAGTTGTTGGACCCTGGCACCTACAATGAAATCTCACCCTACCCCACACCCTGTAAGATTCTATCTTGGGCCCTCATAGGGTCCATATCCTCCAGGGCCTACTTTCCTTCCATTCTGCAAAGCCTGTCTGCATTTATCCACCCCCCACCCTGTCTCCCTCTTTTTTTTTTTTTTACCCCTTTTTATATATCAATTTCCTATTTTACAATAAAATTTTGTTATCACTTAAAAAAAAAA
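The record above is in FASTA format: a ">" definition line followed by sequence. As a minimal sketch, the definition line can be split into its pipe-separated identifiers and free-text description. The function names `read_fasta` and `parse_defline` are illustrative assumptions, and the parser handles only the `tag|value|tag|value|` layout seen here.

```python
def read_fasta(lines):
    """Yield (definition line, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(defline):
    """Split '>gi|123|ref|NM_x.1| description' into ({'gi': ..., 'ref': ...}, description)."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    ids, _, description = defline[1:].partition(" ")
    fields = ids.strip("|").split("|")
    # Identifiers alternate between a database tag and its value.
    return dict(zip(fields[0::2], fields[1::2])), description


defline = (">gi|187960039|ref|NM_001127233.1| Mus musculus transformation "
           "related protein 53 (Trp53), transcript variant 2, mRNA")
ids, description = parse_defline(defline)
print(ids)
print(description)
```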
Databases • Protein • nr / refseq / swissprot / pat / pdb / month / env_nr • Nucleotide • nr / refseq_rna / refseq_genomic / est / est_human / est_others / gss / htgs / pat / pdb / month / dbsts / chromosome / wgs / env_nt http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
Other Options • Number of hits to display • Weights for matching • Nucleotide: match/mismatch scores • Protein: scoring matrix • Weights for gaps • A gap of length k costs (open) + (k-1) * (extend) • Organism • Mask low-complexity regions
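The gap weight on this slide is an affine penalty. A small sketch of the formula as written, with hypothetical penalty values chosen only for illustration (they are not BLAST defaults):

```python
def gap_cost(k, gap_open, gap_extend):
    """Cost of a gap of length k under the slide's formula: open + (k - 1) * extend."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (k - 1) * gap_extend


# Hypothetical penalties: opening costs 5, each additional gapped position costs 2.
for length in (1, 2, 3):
    print(length, gap_cost(length, gap_open=5, gap_extend=2))
```

Because the extension penalty is smaller than the opening penalty, one long gap costs less than several short ones of the same total length. Conventions differ between tools on whether the opening penalty already includes the first gapped position; the slide's formula assumes it does.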
gi: 4759254 >gi|4759254|ref|NP_004611.1| TNF receptor-associated factor 6 [Homo sapiens] MSLLNCENSCGSSQSESDCCVAMASSCSAVTKDDSVGGTASTGNLSSSFMEEIQGYDVEFDPPLESKYECPICLMALREAVQTPCGHRFCKACIIKSIRDAGHKCPVDNEILLENQLFPDNFAKREILSLMVKCPNEGCLHKMELRHLEDHQAHCEFALMDCPQCQRPFQKFHINIHILKDCPRRQVSCDNCAASMAFEDKEIHDQNCPLANVICEYCNTILIREQMPNHYDLDCPTAPIPCTFSTFGCHEKMQRNHLARHLQENTQSHMRMLAQAVHSLSVIPDSGYISEVRNFQETIHQLEGRLVRQDHQIRELTAKMETQSMYVSELKRTIRTLEDKVAEIEAQQCNGIYIWKIGNFGMHLKCQEEEKPVVIHSPGFYTGKPGYKLCMRLHLQLPTAQRCANYISLFVHTMQGEYDSHLPWPFQGTIRLTILDQSEAPVRQNHEEIMDAKPELLAFQRPTIPRNPKGFGYVTFMHLEALRQRTFIKDDTLLVRCEVSTRFDMGSLRREGFQPRSTDAGV gi:22027612 >gi|22027612|ref|NP_066961.2| TNF receptor-associated factor 2 [Homo sapiens]MAAASVTPPGSLELLQPGFSKTLLGTKLEAKYLCSACRNVLRRPFQAQCGHRYCSFCLASILSSGPQNCAACVHEGIYEEGISILESSSAFPDNAARREVESLPAVCPSDGCTWKGTLKEYESCHEGRCPLMLTECPACKGLVRLGEKERHLEHECPERSLSCRHCRAPCCGADVKAHHEVCPKFPLTCDGCGKKKIPREKFQDHVKTCGKCRVPCRFHAIGCLETVEGEKQQEHEVQWLREHLAMLLSSVLEAKPLLGDQSHAGSELLQRCESLEKKTATFENIVCVLNREVERVAMTAEACSRQHRLDQDKIEALSSKVQQLERSIGLKDLAMADLEQKVLEMEASTYDGVFIWKISDFARKRQEAVAGRIPAIFSPAFYTSRYGYKMCLRIYLNGDGTGRGTHLSLFFVVMKGPNDALLRWPFNQKVTLMLLDQNNREHVIDAFRPDVTSSSFQRPVNDMNIASGCPLFCPVSKMEAKNSYVRDDAIFIKAIVDLTGL http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
Example (NCBI Minicourse #6) • Problem: A laboratory has generated an EST library from a hemochromatosis patient and wants to identify the gene(s) causing the phenotype. We will follow these steps to solve the problem: • Compare ESTs from a hemochromatosis patient to the human genome (using BLAST). • Identify the gene(s) that align to the ESTs and download their sequences (using Map Viewer). • Identify whether the ESTs contain any known nucleotide variations (single nucleotide polymorphisms) (using dbSNP). • Determine whether a mutant form of the gene is known to cause a phenotype (using OMIM).
Sequences • TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCCTGGATCAGCCCCTCATTGTGATCTGGG http://people.pcbi.upenn.edu/~lswang/seq1.txt http://www.ncbi.nlm.nih.gov/Class/minicourses/diseasegene.html
UCSC Genome Browser • Google “Genome Browser” • http://genome.ucsc.edu/
Example • Locate MLL (myeloid/lymphoid or mixed-lineage leukemia) on the human genome • Find relevant information • Conservation across the gene • Retrieve the sequences of human MLL and divide into exon/intron regions • Retrieve the 5’ and 3’ flanking region sequences
BLAT • Blast-like alignment tool • Quickly finds genomic regions highly similar to the input query sequence
Example >hg18_knownGene_uc002gil.1_1 range=chr17:7531420-7531642 5'pad=0 3'pad=0 strand=- repeatMasking=none ACTTGTCATGGCGACTGTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGG http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=Human&db=hg18&hgsid=110368820
Other Tasks for Genome Browser • Download the database • Retrieve genomic sequences/annotations • Upload your own annotation (customized track) using .bed format and visualize on the browser • Many tasks are easier using the Galaxy web service from Penn State U (Google “Galaxy Trac” or go to http://galaxy.psu.edu/)
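A custom track in .bed format is tab-separated text with chromosome, start, end, and optional name, score, and strand columns. The sketch below converts a browser-style range such as the chr17:7531420-7531642 region from the earlier example into one BED line. It assumes the displayed range is 1-based and inclusive while BED starts are 0-based; the function name, track name, and description are hypothetical.

```python
import re


def ucsc_range_to_bed(position, name, strand="+", score=0):
    """Convert 'chr17:7531420-7531642' (assumed 1-based, inclusive) to a six-column BED line."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", position.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2)) - 1  # BED start is 0-based
    end = int(match.group(3))        # BED end is exclusive, so it stays unchanged
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


# Hypothetical track header followed by one feature on the minus strand.
track_header = 'track name="myRegion" description="Example custom track"'
bed_line = ucsc_range_to_bed("chr17:7531420-7531642", "uc002gil.1", strand="-")
print(track_header)
print(bed_line)
```

The two printed lines can be saved to a text file and uploaded as a custom track.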
DAVID (NIAID) • Google “NIH DAVID”