Bioinformatics databases & sequence retrieval

Lecture by Celia van Gelder, CMBI, UMC Radboud, June 2009. Contents: introduction, bioinformatics data & databases, sequence retrieval with MRS.



  1. Bioinformatics databases & sequence retrieval • Content of lecture • Introduction • Bioinformatics data & databases • Sequence Retrieval with MRS • Celia van Gelder • CMBI • UMC Radboud • June 2009

  2. I. Bioinformatics questions • Lookup • Is the gene known for my protein (or vice versa)? • What sequence patterns are present in my protein? • To what class or family does my protein belong? • Compare • Are there sequences in the database which resemble the protein I cloned? • How can I optimally align the members of this protein family? • Predict • Can I predict the active site residues of this enzyme? • Can I predict a (better) drug for this target? • How can I predict the genes located on this genome? ©CMBI 2009

  3. Sequence similarity • Imagine you sequenced this human protein: MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG • You know it is a serine protease. • Which residues belong to the active site? • Is its sequence similar to the mouse serine protease?

  4. Sequence Alignment (human on top, mouse below; * = identical, : = strongly similar, . = weakly similar)
     human  MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
     mouse  MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
            *::* .**** **. :. :   *:**:*** : .** * *.*  *********: ****** *::

     human  WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
     mouse  WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
            ******* ** *:******** *.***:**** ***.*::**  *********: **.**.****

     human  VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
     mouse  VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
            **:*** ***  ******: * ********:* ******:*** ****:*::** *:*.***:**

     human  GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
     mouse  GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
            ********** ********** ******:*.  ******** . ***.****** **********

     human  DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
     mouse  DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
            ********** *. ***:*** *******: : ***** ** * *****::*** ******

     => Transfer of information
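A rough way to put a number on "how similar" two aligned sequences are is percent identity: the share of aligned columns in which both sequences carry the same residue. The sketch below is illustrative only; the function name `percent_identity` and the choice to skip gap columns are assumptions, not part of the lecture, and real alignment tools report more refined scores.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of gap-free alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Assumption: columns containing a gap are left out of the count.
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First ten alignment columns of the human and mouse proteins from the slide.
human = "MVVSGAPPAL"
mouse = "MMISRPPPAL"
print(f"{percent_identity(human, mouse):.1f}% identical")
```

High identity over the full length is what justifies the "transfer of information" step: annotation known for one protein becomes a hypothesis for the other.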

  5. II. Bioinformatics data and databases • Many different data types • Amount of data is growing very fast • Examples shown: mRNA expression profiles (DNA microarrays); collision-induced dissociation spectra (tandem mass spectrometry) • http://courses.washington.edu/bioinfo/

  6. EMBL DNA database

  7. Biological databases (1) • Primary databases contain biomolecular sequences or structures (experimental data!) and associated annotation information • Sequences • Nucleic acid sequences: EMBL, GenBank, DDBJ • Protein sequences: SwissProt, TrEMBL, UniProt • Structures • Protein structures: PDB • Structures of small compounds: CSD • Genomes • Human Genome Database: HGD • Mouse Genome Database: MGD

  8. Biological databases (2) • Secondary databases contain data derived from primary database(s) • Patterns, motifs, domains: PROSITE, PFAM, PRINTS, INTERPRO, ... • Disease mutations: OMIM / MIM • SNPs: dbSNP • Pathways: KEGG

  9. Databases • Data must be in a certain format for software to recognize it • Every database can have its own format, but some data elements are essential for every database: 1. Unique identifier, or accession code 2. Name of depositor 3. Literature references 4. Deposition date 5. The real data
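The five essential elements can be pictured as the fields of one record. This is a minimal sketch, not the schema of any real database: the class name `DatabaseEntry`, the field names, and the placeholder depositor are hypothetical. The accession code, date, and sequence are taken from the 1CRN example used later in the lecture.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DatabaseEntry:
    """Hypothetical record holding the five essential data elements."""
    accession: str            # 1. unique identifier / accession code
    depositor: str            # 2. name of depositor
    deposition_date: date     # 4. deposition date
    data: str                 # 5. the real data (here: a protein sequence)
    references: List[str] = field(default_factory=list)  # 3. literature references


entry = DatabaseEntry(
    accession="1CRN",
    depositor="(depositor name, not given on the slides)",
    deposition_date=date(1981, 4, 30),
    data="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN",
)
print(entry.accession, entry.deposition_date.isoformat(), len(entry.data))
```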

  10. Quality of Data • SwissProt • Data is only entered by annotation experts • EMBL, PDB • “Everybody” can submit data • No human intervention when submitted; some automatic checks

  11. SwissProt database • Database of protein sequences • 468,851 entries (June 2009) • About 200 annotation experts worldwide • Keyword-organised flat file • Obligatory deposit of sequence in SwissProt before publication • Presently, databases are being merged into UniProt.

  12. Important records in SwissProt (1)
      ID   HBA_HUMAN   Reviewed;   142 AA.
      AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
      DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
      DT   23-JAN-2007, sequence version 2.
      DT   23-SEP-2008, entry version 63.
      DE   RecName: Full=Hemoglobin subunit alpha;
      DE   AltName: Full=Hemoglobin alpha chain;
      DE   AltName: Full=Alpha-globin;
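Because every line of such a flat file starts with a short line code (ID, AC, DT, DE, ...), a program can sort the lines into records by that code. The sketch below shows the idea on the HBA_HUMAN lines from the slide; the helper names `group_by_line_code` and `accession_numbers` are made up for illustration, and a real parser would handle many more line types.

```python
from collections import defaultdict
from typing import Dict, List

ENTRY = """\
ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Alpha-globin;
"""


def group_by_line_code(flatfile_text: str) -> Dict[str, List[str]]:
    """Group flat-file lines by their leading line code."""
    records = defaultdict(list)
    for line in flatfile_text.splitlines():
        if not line.strip():
            continue
        # Everything before the first space is the code, the rest is content.
        code, _, content = line.partition(" ")
        records[code].append(content.strip())
    return dict(records)


def accession_numbers(records: Dict[str, List[str]]) -> List[str]:
    """Split the semicolon-separated AC lines into individual accession codes."""
    accessions = []
    for ac_line in records.get("AC", []):
        accessions.extend(p.strip() for p in ac_line.split(";") if p.strip())
    return accessions


records = group_by_line_code(ENTRY)
print(records["ID"][0].split()[0])   # entry name
print(accession_numbers(records))    # primary accession comes first
```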

  13. Important records in SwissProt (2) • Cross references section: hyperlinks to all entries in other databases which are relevant for the protein sequence • For HBA_HUMAN: genes & mRNA, protein domains, structures, diseases

  14. Important records in SwissProt (3) • Features section: post-translational modifications, signal peptides, binding sites, enzyme active sites, domains, disulfide bridges, local secondary structure, sequence conflicts between references, etc.

  15. And finally, the amino acid sequence!

  16. EMBL database • Nucleotide database • EMBL: 159 million sequence entries comprising 251 billion nucleotides (June 2009) • EMBL records follow roughly the same scheme as SwissProt • Obligatory deposit of sequence in EMBL before publication • Most EMBL sequences are never seen by a human

  17. Protein Data Bank (PDB) • Databank for 3-dimensional structures of biomolecules: protein, DNA, RNA, ligands • Obligatory deposit of coordinates in the PDB before publication • ~50000 entries (April 2008) (~2500 “unique” structures) • A PDB file is a keyword-organised flat file (80 columns): 1) human readable 2) every line starts with a keyword (3-6 letters) 3) platform independent

  18. PDB important records (1)
      • PDB nomenclature: filename = accession number = PDB code. The filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
      • HEADER describes the molecule & gives the deposition date
        HEADER PLANT SEED PROTEIN 30-APR-81 1CRN
      • COMPND name of the molecule
        COMPND CRAMBIN
      • SOURCE organism
        SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED

  19. PDB important records (2)
      • SEQRES sequence of the protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
        SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
        SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
        SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
        SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
      • SSBOND disulfide bridges
        SSBOND 1 CYS 3 CYS 40
        SSBOND 2 CYS 4 CYS 32
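To compare a PDB entry with sequence databases, the three-letter residue names in SEQRES have to be turned into a one-letter sequence. The following sketch does this for the 1CRN lines shown on the slide. The function name `seqres_to_sequence` is hypothetical, and the approach is a simplification: it keeps only tokens that are one of the 20 standard residue names, so modified or unknown residues are silently skipped.

```python
from typing import Iterable

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

SEQRES_LINES = [
    "SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51",
    "SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52",
    "SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53",
    "SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54",
]


def seqres_to_sequence(pdb_lines: Iterable[str]) -> str:
    """Build a one-letter sequence from the SEQRES lines of a PDB file."""
    letters = []
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        for token in line.split()[1:]:
            # Serial numbers, lengths and the trailing "1CRN 51" are not
            # residue names, so the dictionary lookup filters them out.
            if token in THREE_TO_ONE:
                letters.append(THREE_TO_ONE[token])
    return "".join(letters)


sequence = seqres_to_sequence(SEQRES_LINES)
print(sequence, len(sequence))
```

The resulting length can be checked against the residue count stated in the SEQRES lines themselves (46 for crambin).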

  20. PDB important records (3)
      • And at the end of the PDB file the “real” data:
      • ATOM one line for each atom, with its unique name and its x, y, z coordinates
        ATOM   1  N    THR  1  17.047  14.099   3.625  1.00  13.79  1CRN 70
        ATOM   2  CA   THR  1  16.967  12.784   4.338  1.00  10.80  1CRN 71
        ATOM   3  C    THR  1  15.685  12.755   5.133  1.00   9.19  1CRN 72
        ATOM   4  O    THR  1  15.268  13.825   5.594  1.00   9.85  1CRN 73
        ATOM   5  CB   THR  1  18.170  12.703   5.337  1.00  13.02  1CRN 74
        ATOM   6  OG1  THR  1  19.334  12.829   4.463  1.00  15.06  1CRN 75
        ATOM   7  CG2  THR  1  18.150  11.546   6.304  1.00  14.23  1CRN 76
        ATOM   8  N    THR  2  15.115  11.555   5.265  1.00   7.81  1CRN 77
        ATOM   9  CA   THR  2  13.856  11.469   6.066  1.00   8.31  1CRN 78
        ATOM  10  C    THR  2  14.164  10.785   7.379  1.00   5.80  1CRN 79
        ATOM  11  O    THR  2  14.993   9.862   7.443  1.00   6.94  1CRN 80
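Once the ATOM lines are read, the coordinates can be used for simple geometry, for example the distance between two atoms. The sketch below parses the first two lines from the slide and measures the N to CA distance in residue 1. It rests on one loud assumption: the fields are split on whitespace, which works for this old-style excerpt, whereas real PDB files are fixed-column and should be read by column position. The names `Atom` and `parse_atom_line` are illustrative only.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM line as shown on the slide."""
    fields = line.split()
    if fields[0] != "ATOM":
        raise ValueError("not an ATOM record: " + line)
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
    )


n_atom = parse_atom_line("ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70")
ca_atom = parse_atom_line("ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71")

# Euclidean distance between the two atoms, in the units of the file (angstrom).
distance = math.dist(n_atom[4:], ca_atom[4:])
print(f"N-CA distance: {distance:.3f}")
```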

  21. Structure Visualization • Structures from PDB can be visualized with: • Yasara (www.yasara.org) • SwissPDBViewer (http://spdbv.vital-it.ch/) • Protein Explorer (http://www.umass.edu/microbio/rasmol/) • Cn3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)

  22. Part III: Sequence Retrieval with MRS • Google: the best generic search and retrieval system • Google searches everywhere for everything • MRS: Maarten’s Retrieval System (http://mrs.cmbi.ru.nl) • MRS searches in selected data environments • MRS is the Google of the biological database world • Search engine (like Google) • Input/query = word(s) • Output = entry/entries from database • Other programs exist: Entrez, SRS, ...

  23. MRS • MRS is mainly used for (but not restricted to) protein/nucleic acid and related databases • DNA and protein sequences • Sequence-related information (e.g. alignments, protein domains, enzymes, metabolic pathways, structural information) • Genomic information • Hereditary information

  24. MRS Search Steps • Select database(s) of choice • Formulate your query • Hit “Search” • The result is a “query set” or “hitlist” • Analyze the results

  25. http://mrs.cmbi.ru.nl

  26. MRS Database Selection • You can choose between selecting all databases or just one of them. • But think about your query first!!

  27. MRS Search options • Simply type your keywords in the keyword field and choose SEARCH. • If you know the fields of the database you are searching in, you can specify your query further. • But think about your query first!!

  28. MRS Hitlist (1)

  29. MRS Hitlist (2)

  30. MRS Options • MRS creates a result, or a “query set”, or “hitlist”. • With the result you can do different things in MRS: • View the hits • Blast single hit sequences • Clustal multiple hit sequences

  31. MRS - View Hits

  32. Combine in MRS • AND or & (AND is implicit) • OR or | • NOT or !
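Combining query terms with AND, OR and NOT amounts to set operations on hitlists. The toy sketch below is not MRS and does not use its query syntax or any MRS API; the four entries, their identifiers, and the `search` helper are invented purely to show how the three operators narrow or widen a result.

```python
from typing import Dict, Set

# Hypothetical mini "database": identifier -> description.
ENTRIES: Dict[str, str] = {
    "E1": "human serine protease",
    "E2": "mouse serine protease",
    "E3": "human hemoglobin subunit alpha",
    "E4": "plant seed protein crambin",
}


def search(word: str) -> Set[str]:
    """Return the identifiers of all entries whose description contains the word."""
    word = word.lower()
    return {key for key, text in ENTRIES.items() if word in text.lower().split()}


human_and_protease = search("human") & search("protease")   # AND: intersection
human_or_mouse = search("human") | search("mouse")          # OR: union
protease_not_mouse = search("protease") - search("mouse")   # NOT: difference

print(sorted(human_and_protease))
print(sorted(human_or_mouse))
print(sorted(protease_not_mouse))
```

AND shrinks the hitlist, OR enlarges it, and NOT removes unwanted hits, which is why thinking about the query first saves a lot of scrolling.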

  33. MRS - Options • Home brings you back to the start page of MRS, from which you can do keyword searches. • Blast brings you to the MRS page from which you can do Blast searches. • Blast results brings you to the page where MRS stores your Blast results. • Clustal brings you to the MRS page from which you can do Clustal alignments. • Settings lets you choose your favourite display style. • Databanks lists all databases that MRS can search in. • DB:uniprot shows the currently selected database. • Help provides some help.

  34. Try it yourself with the exercises! • Ground rules for bioinformatics • Don't always believe what programs tell you: they're often misleading & sometimes wrong! • Don't always believe what databases tell you: they're often misleading & sometimes wrong! • Don't always believe what lecturers tell you: they're sometimes wrong! • Don't be a naive user; computers don’t do biology & bioinformatics, you do! • Freely adapted from Terri Attwood
