
Bioinformatics approaches for…




  1. Bioinformatics approaches for…
      Teresa K Attwood
      Faculty of Life Sciences & School of Computer Science
      University of Manchester, Oxford Road, Manchester M13 9PT, UK
      http://www.bioinf.man.ac.uk/dbbrowser/

  2. …analysing GPCRs…

  3. …which craft is best?

  4. Overview
      • What are GPCRs?
          • why they’re interesting & important
          • why bioinformatics approaches are important
      • In silico function prediction
          • a reality check
      • Family-based methods for characterising GPCRs
      • Understanding the tools
          • problems with pair-wise & family-based approaches
          • estimating (biological) significance
      • Seeking deeper functional insights
      • Conclusions

  5. What are GPCRs? G protein-coupled receptors
      • A functionally diverse family of cell-surface 7TM proteins
      • Functional diversity achieved via
          • interaction with a variety of ligands
          • stimulation of various intracellular pathways via coupling to different G proteins
      [Figure; labels: GTP, GDP]

  6. Why are GPCRs interesting?
      Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp.60-71.
      • They are ubiquitous
          • >800 GPCR genes in the human genome, from 3 major superfamilies
          • rhodopsin-, secretin- & metabotropic glutamate receptor-like
      • Share almost no sequence similarity
          • but are united by common 7TM architecture
      • Constitute a complex multi-gene family
          • populated by >50 families & >350 subtypes

  7. Isn’t just stamp collecting!
      Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp.60-71.
      • GPCRs are of profound biomedical importance
          • targets for >50% of prescription drugs
          • yield sales >$16 billion/annum
          • they’re big business!
      • Given their importance, we need to
          • characterise the ones we know about
          • identify new ones
          • & discover what they do!
          • e.g., as potential new drug targets

  8. Why studying GPCRs is difficult
      • Only 2 crystal structures available
          • bovine rhodopsin (2000) & human β2-adrenergic receptor (2007)
      • Many GPCRs haven’t been characterised experimentally
          • remain ‘orphans’, with unknown ligand specificity
      • With >800 human GPCRs, this isn’t much to go on!

  9. Why use bioinformatics approaches?
      • Computational approaches are important
          • can be used to help identify, characterise & model novel receptors
          • usually by similarity & extrapolation of known characteristics
      • Bioinformatics thus offers complementary tools for elucidating the structures & functions of receptors
      • But the task is non-trivial
          • GPCRs exhibit rich relationships & complex molecular interactions
          • present many challenges for in silico analysis
          • in trying to derive meaningful functional insights, traditional methods are likely to be limited

  10. We’ve been using biology-unaware search tools to analyse such complex systems.
      How far can we truly expect to understand cellular function with such naïve approaches…?

  11. In silico function prediction…a reality check
      • What is the function of this structure?
      • What is the function of this sequence?
      • What is the function of this motif?
          • the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
          • knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

  12. “A test case for structural genomics: Structure-based assignment of the biochemical function of hypothetical protein mj0577” (Zarembinski et al., PNAS, 95, 1998)
      Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown

  13. What's in a sequence?

  14. Methods for family analysis
      Attwood, TK (2000). The quest to deduce protein function from sequence: the role of pattern databases. Int. J. Biochem. Cell Biol., 32(2), 139–155.
      • Single motif methods
          • Fuzzy regex (eMOTIF)
          • Exact regex (PROSITE)
      • Full domain alignment methods
          • Profiles (Profile Library)
          • HMMs (Pfam)
      • Multiple motif methods
          • Identity matrices (PRINTS)
          • Weight matrices (Blocks)

  15. The challenge of family analysis
      • highly divergent family with single function?
      • superfamily with many diverse functional families?
      • must distinguish if function analysis done in silico
      • a tough challenge!

  16. In the beginning was PROSITE
      • [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
      [Figure; label: TM domain]
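A PROSITE pattern is, in effect, a regular expression written in its own syntax, so scanning a sequence with it can be sketched in a few lines of Python. The sketch below is illustrative only and is not the PROSITE scanning software: the function names `prosite_to_regex` and `find_prosite_matches` are hypothetical, pattern anchors (`<`, `>`) are not handled, and the 17-residue test sequence is invented to fit the pattern. The pattern itself is the full PS00237 signature as given in the PROSITE entry on the next slide.

```python
import re

# Full PS00237 pattern, copied from the PROSITE entry (PA lines joined).
PS00237 = ("[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-"
           "[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into Python regex syntax (hypothetical helper)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, lo, hi = m.groups()
        if core.lower() == "x":                       # any residue
            piece = "."
        elif core.startswith("[") and core.endswith("]"):   # allowed residues
            piece = core
        elif core.startswith("{") and core.endswith("}"):   # forbidden residues
            piece = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha():       # one fixed residue
            piece = core
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if lo is not None:
            piece += "{" + lo + ("," + hi if hi is not None else "") + "}"
        parts.append(piece)
    return "".join(parts)


def find_prosite_matches(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]


# Invented toy sequence that satisfies every position of the pattern.
toy = "ASLAALAALSLDRYAAL"
print(find_prosite_matches(PS00237, toy))
# Changing the single fixed residue (R) to K abolishes the match.
print(find_prosite_matches(PS00237, toy.replace("R", "K")))
```

The second call shows how brittle exact pattern matching is: one substitution at a fixed position turns a hit into a miss.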

  17. Diagnostic limitations of PROSITE
      ID   G_PROTEIN_RECEP_F1_1; PATTERN.
      AC   PS00237;
      DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); SEP-2004 (INFO UPDATE).
      DE   G-protein coupled receptors family 1 signature.
      PA   [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-
      PA   [GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].
      NR   /RELEASE=44.6,159201;
      NR   /TOTAL=1622(1621); /POSITIVE=1530(1529); /UNKNOWN=0(0);
      NR   /FALSE_POS=92(92); /FALSE_NEG=261; /PARTIAL=61;
      • This represents an apparent 22% error rate
          • the actual rate is probably higher
      • Thus, a match to a pattern is not necessarily true
          • & a mis-match is not necessarily false!
      • False-negatives are a fundamental limitation to this type of pattern matching
          • if you don't know what you're looking for, you'll never know you missed it!
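The slide does not say how the 22% figure is derived. One reading that reproduces it, offered here as an assumption, is (false positives + false negatives) / total hits, using the counts on the NR lines. The short sketch below computes that ratio along with the usual sensitivity and precision; the variable names are illustrative.

```python
# Counts taken from the NR lines of the PS00237 entry on this slide.
total_hits = 1622       # /TOTAL
true_pos = 1530         # /POSITIVE
false_pos = 92          # /FALSE_POS
false_neg = 261         # /FALSE_NEG

# Assumed definition: the slide gives only the resulting figure (22%).
apparent_error = (false_pos + false_neg) / total_hits

# Standard diagnostic measures, for comparison.
sensitivity = true_pos / (true_pos + false_neg)   # fraction of known members found
precision = true_pos / (true_pos + false_pos)     # fraction of hits that are genuine

print(f"apparent error rate: {apparent_error:.1%}")
print(f"sensitivity:         {sensitivity:.1%}")
print(f"precision:           {precision:.1%}")
```

Note that the partial matches (/PARTIAL=61) are left out of this calculation; counting them as misses would push the error rate higher, which is consistent with the slide's remark that the actual rate is probably higher.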

  18. Where do motifs (fingerprints) fit in? (fingerprints are hierarchical)
      [Figure; labels: TM domain, loop region, TM domain]

  19. Rhodopsin-like superfamily, family & subtype GPCRs in PRINTS
      Attwood, TK (2001) A compendium of specific motifs for diagnosing GPCR subtypes. TiPS, 22(4), 162-165.

  20. Searching PRINTS - FingerPRINTScan
      Scordis, P, Flower, DR & Attwood, TK (1999) FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics, 15, 523-524.
      • GPCR fingerprints are embedded in PRINTS
          • allows diagnosis of GPCR mosaics

  21. Visualising fingerprints
      Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Eng., 6(2), 167–176.
      [Figures; labels: N, C]

  22. Visualising fingerprints
      Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Eng., 6(2), 167–176.
      [Figure; labels: N, C]

  23. Diagnosing partial matches
      • Missed by PROSITE
          • wasn’t annotated as a FN

  24. An integrated approach
      Mulder, NJ, Apweiler, R, Attwood, TK, Bairoch, A et al. (2007) New developments in InterPro. NAR, 35, D224-8.
      • To simplify sequence analysis, the family dbs were integrated within a unified annotation resource – InterPro
          • initial partners were PRINTS, PROSITE, profiles & Pfam
          • now many more partners
          • linked to its satellite dbs
          • but lags behind their coverage
          • by Oct 2007, it had 14,768 entries & covered 76% of UniProtKB
          • major role in fly & human genome annotation

  25. InterPro – method comparison

  26. Where has this got us?

  27. Understanding the tools…estimating significance
      • How do we know what to believe?
      • Let’s explore some of the difficulties that arise when pair-wise search tools (BLAST & FastA) & family-based methods are used naïvely
          • these examples caution us to think about what the results actually mean in biological terms…

  28. Identifying sequence similarity
      • GPCRs present many challenges for in silico functional analysis
      • Several signature-based methods now available
          • with different areas of optimum application
      • Yet naïve, pair-wise similarity searching has been the mainstay of functional annotation efforts
          • it allows us to identify/quantify relationships between sequences
      • But quantifying similarity between sequences is not the same as identifying their functions

  29. Problems with pairwise similarity tools
      Gaulton, A & Attwood, TK (2003) Bioinformatics approaches for the classification of G protein-coupled receptors. Current Opinion in Pharmacology, 3, 114-120.
      • For identifying precise families to which receptors belong & the ligands they bind, pair-wise tools are limited
          • at what level of seq ID is ligand specificity conserved?
          • some GPCRs with 25% ID share a common ligand;
          • others, with greater levels, don’t…
      • It may be impossible to tell from BLAST if an orphan belongs to a known family (the top hit), or if it will bind a novel ligand
          • e.g., for the now de-orphaned UR2R, BLAST indicates most similarity to the type 4 SSRs, yet it is known to bind a different (related) ligand
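For reference, the "seq ID" quoted above is simply the fraction of aligned positions that carry the same residue. A minimal sketch follows, assuming the two sequences have already been aligned by some other tool and that gapped columns are excluded from the denominator (conventions differ between programs, so reported values can vary). The function name and the toy alignment are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue            # assumed convention: skip columns with a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented toy alignment: 4 identical residues out of 5 compared columns.
print(percent_identity("ACDE-F", "ACDQGF"))
```

The number says nothing about which positions differ, which is exactly why two receptors at 25% identity may share a ligand while a more similar pair may not.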

  30. When is a GPCR not an SSR?
      Query length: 389 AA
      Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
      Taxon: Homo sapiens
      Database: XXswissprot 120,412 sequences; 45,523,583 total letters
      SWISS-PROT Release 40.29 of 10-Oct-2002

      Db  AC      Description                                                Score  E-value
      sp  Q9UKP6  Q9UKP6 Orphan receptor [Homo sapiens...                      782  0.0
      sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167  3e-41
      sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147  4e-35
      sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144  3e-34
      sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140  3e-33
      sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140  6e-33
      sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134  3e-31
      sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134  3e-31
      sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133  7e-31
      sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132  2e-30
      sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...    128  2e-29
      sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125  1e-28
      sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125  1e-28

  31. When is a GPCR not an SSR?…when it’s a UR2R
      Query length: 389 AA
      Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
      Taxon: Homo sapiens
      Database: XXswissprot 120,412 sequences; 45,523,583 total letters
      SWISS-PROT Release 40.29 of 10-Oct-2002

      Db  AC      Description                                                Score  E-value
      sp  Q9UKP6  UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho...    782  0.0
      sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167  3e-41
      sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147  4e-35
      sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144  3e-34
      sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140  3e-33
      sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140  6e-33
      sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134  3e-31
      sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134  3e-31
      sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133  7e-31
      sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132  2e-30
      sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...    128  2e-29
      sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125  1e-28
      sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125  1e-28

  32. The trouble with top hits
      • The most statistically significant hit is not always the most biologically relevant
      • Yet many rule-based ‘expert systems’ still rely on top BLAST or FastA hits to make their diagnoses
      • BLAST/FastA ‘see’ generic similarity & not the often-subtle differences that constitute the functional determinants between closely-related receptor families & subtypes
      • Failure to appreciate this fundamental point has generated numerous annotation errors in our databases

  33. Misleading annotation via FastA
      [Figure; labels: μ-opioid receptor, true μ-opioid receptor, κ-opioid receptor]

  34. Misleading results from BLAST
      • As we’ve seen, it’s tempting to use top hits from BLAST or FastA results to classify unknown proteins
          • but this may lead us (& especially computer programs) to false functional conclusions
      • PSI-BLAST is more sensitive than BLAST, because it creates a profile from hits above a given threshold
          • but this too can cause problems
          • let’s take a closer look

  35. So, is UL78 a GPCR? & if so, what sort?

  36. What PSI-BLAST said (profile dilution in action)

  37. What GeneQuiz said… a thrombin receptor

  38. What GeneQuiz said later…

  39. Overview of results: pair-wise & family-based methods

  40. What is UL78?   Bioinformatics tools, alone, cannot tell us!

  41. So, beware top hits…but also beware bottom hits! Let us now compare & contrast some InterPro results with those of its source dbs…

  42. Rhodopsin-like superfamily GPCRs in InterPro 2005
      Accession  Name                  Proteins
      IPR000276  GPCR_Rhodpsn          7752
      PS50262    G_PROTEIN_RECEP_F1_2  7702
      PF00001    7tm_1                 7064
      PS00237    G_PROTEIN_RECEP_F1_1  6527
      PR00237    GPCRRHODOPSN          5821 (don’t include partials)

  43. Rhodopsin-like superfamily GPCRs in the source databases
      Database           FP  FN   U   TP      matches
      Pfam               ?   ?    ?   ? 8776  7064
      PROSITE (profile)  3   3    12  1837    7702
      PROSITE (regex)    92  261  0   1530    6527
      PRINTS             0   ?    0   1154    5821
      >2165 updated

  44. Rhodopsin-like superfamily GPCRs in InterPro 2007
      Accession  Name                  Proteins
      IPR000276  GPCR_Rhodpsn          16,845
      PS50262    G_PROTEIN_RECEP_F1_2  16,714
      PF00001    7tm_1                 15,712
      PR00237    GPCRRHODOPSN          13,405
      PS00237    G_PROTEIN_RECEP_F1_1  13,723
      No human curator has time to validate all these matches…

  45. 14,615 rhodopsin-like superfamily GPCRs in Pfam?

  46. Pfam match Q6NV75/24-297
      ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
      AC   Q6NV75;
      DT   05-JUL-2004 (TrEMBLrel. 27, Created)
      DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
      DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
      DE   G protein-coupled receptor 153.
      GN   Name=GPR153;
      OS   Homo sapiens (Human).
      OX   NCBI_TaxID=9606
      RN   [1]
      RP   SEQUENCE FROM N.A.
      RC   TISSUE=Brain;
      RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
      RA   Jones S.J., Marra M.A.;
      RT   "Generation and initial analysis of more than 15,000 full-length
      RT   human and mouse cDNA sequences.";
      RL   Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).
      RP   SEQUENCE FROM N.A.
      RC   TISSUE=Brain;
      RA   Strausberg R.;
      RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
      DR   EMBL; BC068275; AAH68275.1; -.
      DR   GO; GO:0004872
      DR   InterPro; IPR000276; GPCR_Rhodpsn.
      DR   Pfam; PF00001; 7tm_1; 1.
      DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
      KW   Receptor
      SQ   SEQUENCE 609 AA; 65341 MW; E525CC7F60D0891C CRC64;
           MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
           IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
           SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
           GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
           QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
           DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
           YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
           LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
           GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
           PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
           HSDSLGSAS
      //
      PROSITE (profile) no match
      false negative
      PROSITE (regex) no match
      PRINTS no match
      ClustalW – sequences too divergent to be aligned
      GPCR?

  47. Beware top & bottom hits…but also beware simplistic analysis tools coupled with wet experiments! Let’s finally look at how hydropathy profiles can compel biologists to make strange deductions… - & still get their results published in Science!
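To make the caution concrete, here is how simple a hydropathy profile is: a sliding-window average of per-residue hydrophobicity values. The sketch below assumes the Kyte-Doolittle scale, a 19-residue window and a 1.6 cutoff, which are common choices for spotting candidate transmembrane stretches; the slide names none of these, so treat them (and the function names) as illustrative assumptions.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slide names none).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Mean hydropathy of every full window along the sequence."""
    # Unknown residue codes (e.g. X) score 0.0: an arbitrary, assumed choice.
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]   # slide the window by one
        profile.append(total / window)
    return profile


def candidate_tm_windows(seq: str, window: int = 19, threshold: float = 1.6) -> list:
    """Start positions (0-based) of windows whose mean meets the threshold."""
    return [i for i, value in enumerate(hydropathy_profile(seq, window))
            if value >= threshold]


# Invented toy sequence: a 19-residue hydrophobic core flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy))
```

A peak above the cutoff says only that a stretch is hydrophobic enough to sit in a membrane; it says nothing about how many helices there really are, how they pack, or what the protein does.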
