
Day 6 Carlow Bioinformatics


dora-potter

Presentation Transcript


  1. Day 6 Carlow Bioinformatics Proteins: structure, function, databases, formats

  2. Wot’s a protein, then? Hierarchical
     • A collection of amino acids (0-D)
     • AACompIdent can identify a protein from AA%s
     • A sequence of AAs (1-D)
     • Secondary structural elements: α-helix etc. (2-D)
     • Domains: (independent) functional units
     • Whole protein (from single CDS) (3-D)
     • Quaternary structure: dipeptides, ribosomes
     • Interactome, pathways
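The "0-D" view above is just the amino acid composition. A minimal sketch of how those AA%s could be computed follows; it only illustrates the composition itself and is not a description of how AACompIdent does its matching. The function name `aa_composition` is made up for this example.

```python
from collections import Counter

def aa_composition(seq):
    """Return {residue: percent of sequence} for a protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    # Sort by residue letter so the output is stable and easy to compare.
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}

# Sequence "Known1" from the homology slide later in this deck.
comp = aa_composition("LVMFWSIVGE")
for aa, pct in comp.items():
    print(f"{aa} {pct:.1f}%")
```

Note that the order of residues is thrown away here, which is exactly why composition is the weakest level of the hierarchy.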

  3. Amino acid properties again … and again and again

  4. ClustalW groups
     • Marked with ":" ("strong groups"): STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW
     • Marked with "." ("weak groups"): CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY

  5. Amino acid groups
     • KR (Lys, Arg) NH3+ basic
     • DE (Asp, Glu) COO- acidic
     • WYF (Trp, Tyr, Phe) large aromatic
     • GP (Gly, Pro) α-helix-breaking
     • C (Cys) disulphide -S-S- bridges
     • C not in disulphide bridges
     • etc.

  6. Secondary structure: easy, like exon prediction
     • α-helix (no Pro, Gly)
     • 3.6 residues per turn
     • Leucine zipper …LXXXXXXLXXXXXXL…
     • Amphipathic helix (charged on one side)
     • Transmembrane (α-helix, hydrophobic, ~21 AA long)
     • β-sheet
     • 2-dimensional zigzag
     • Coil, random
     • Turn
     • ClustalW knows about α, β: preferring gaps elsewhere
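The leucine zipper spacing on this slide (an L, six residues of anything, another L, and so on) translates directly into a regular expression. This is a rough sketch that searches for exactly the three-leucine pattern drawn above. A hit is only a candidate, since many unrelated sequences will match by chance. The name `find_leucine_repeats` is invented for the example.

```python
import re

# L, six arbitrary residues, L, six arbitrary residues, L.
# The lookahead (?=...) lets overlapping candidates all be reported.
ZIPPER = re.compile(r"(?=(L.{6}L.{6}L))")

def find_leucine_repeats(seq):
    """Return 0-based start positions of L-x(6)-L-x(6)-L in seq."""
    return [m.start() for m in ZIPPER.finditer(seq.upper())]

# Toy sequence built to contain one repeat starting at index 2.
toy = "AA" + "L" + "A" * 6 + "L" + "A" * 6 + "L" + "AA"
print(find_leucine_repeats(toy))
```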

  7. Basic information
     How big is my protein? Where are the β-sheets? Is there a signal peptide? Is there a trypsin cleavage site?
     • ProtParam tool (MWt etc.)
     • TMpred, TMHMM: transmembrane helix, inside/outside, external loops
     • JPred for 2-D structure
     • See practical manual for examples
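The trypsin question can be answered with a few lines of code. The sketch below assumes the usual textbook rule (trypsin cuts on the C-terminal side of Lys or Arg, but not when the next residue is Pro); real digestion has further exceptions that are ignored here. Both function names are made up for this example.

```python
import re

def trypsin_sites(seq):
    """1-based positions of K/R residues after which a cut is predicted.

    The lookahead requires a following residue that is not P, so a K/R
    at the very end of the sequence is not reported as a cut site.
    """
    return [m.start() + 1 for m in re.finditer(r"[KR](?=[^P])", seq.upper())]

def trypsin_digest(seq):
    """Split seq into the peptides implied by trypsin_sites()."""
    seq = seq.upper()
    cuts = [0] + trypsin_sites(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

print(trypsin_sites("AKRPGKLR"))
print(trypsin_digest("AKRPGKLR"))
```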

  8. Tertiary structure: difficult, like gene prediction
     • The holy grail of bioinformatics
     • 3-D orientation of known α, β
     • Proteins made of functional units, “domains”
     • Tried, tested module
     • Domain shuffling and exon boundaries
     • Bioinformatics tries to make predictive calls on aspects of the 3-D structure
     • Q. Why is 3-D important? A. Structure = function

  9. What bioinformatics can do about 3-D
     • Expressed/exported proteins have a signal peptide
     • Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
     • But homology to a known 3-D structure (X-ray, NMR) is the best predictor (threading)
     • Plan to X-ray all “folds” in the human genome
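A hydropathy plot is a sliding-window average of per-residue hydrophobicity. The sketch below uses the Kyte-Doolittle scale, which is one common choice and an assumption here, since the slide does not name a scale. The window length is a free parameter: short windows are typically used for surface regions, and windows near the ~21 AA transmembrane length for membrane helices.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each full window along seq.

    Returns len(seq) - window + 1 values; an empty list if seq is
    shorter than the window.
    """
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Toy sequence: a charged stretch followed by a hydrophobic one.
profile = hydropathy_profile("KKDDEEKKDD" + "LLIIVVLLIIVV", window=5)
print([round(x, 2) for x in profile])
```

Plotting `profile` against window position gives the familiar hydropathy plot; sustained high values suggest buried or membrane-spanning segments, low values suggest surface exposure.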

  10. recA

  11. SwissProt/UniProt
      Some of the 194 lines of info in a SwissProt entry:

      ID RECA_ECOLI STANDARD; PRT; 352 AA.
      AC P0A7G6; P03017; P26347; P78213;
      RX MEDLINE=92114994; PubMed=1731246;
      RA Story R.M., Weber I.T., Steitz T.A.;
      RT "The structure of the E. coli recA protein";
      RL Nature 355:318-325(1992).
      DR EMBL; V00328; CAA23618.1; -; Genomic_DNA.
      DR PDB; 2REB; X-ray; @=-.
      DR PRINTS; PR00142; RECA.
      DR ProDom; PD000229; RecA; 1.
      DR SMART; SM00382; AAA; 1.
      DR TIGRFAMs; TIGR02012; tigrfam_recA; 1.
      DR PROSITE; PS00321; RECA_1; 1.
      FT HELIX 72 85
      FT TURN 86 87
      FT STRAND 90 94
      FT HELIX 101 106

      UniProt is the key hub of bioinformatics databases
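Because every line of the flat file starts with a two-letter code, pulling out one kind of information is mostly a matter of filtering on that code. The sketch below extracts the secondary-structure FT lines, assuming the whitespace-separated layout shown on this slide (key, start, end); current UniProt releases may lay out FT lines differently, so treat this as an illustration of the idea and not a general parser. The function name is hypothetical.

```python
# Excerpt copied from the slide; spacing is not significant to the parser.
ENTRY = """\
ID RECA_ECOLI STANDARD; PRT; 352 AA.
DR PDB; 2REB; X-ray; @=-.
FT HELIX 72 85
FT TURN 86 87
FT STRAND 90 94
FT HELIX 101 106
"""

SECONDARY = {"HELIX", "STRAND", "TURN"}

def secondary_structure_features(text):
    """Return (kind, start, end) tuples for HELIX/STRAND/TURN FT lines."""
    features = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] in SECONDARY:
            features.append((fields[1], int(fields[2]), int(fields[3])))
    return features

features = secondary_structure_features(ENTRY)
for kind, start, end in features:
    print(f"{kind:<6} {start:>4}-{end:<4} ({end - start + 1} residues)")
```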

  12. Motifs and Domains
      • BLAST compares (zillions of) sequences pairwise.
      • Why BLAST? Homology/structure/function
      • Is my protein/family/“fold” present?
      • Suppose you find another homolog: how do you incorporate info from both to find a third?
      • Bioinformatics is increasingly “knowledge-based”.
      • So it is better able to cope with noisy biological data.

  13. Homology?

      LVMFWSIVGE Known1
      L   W   GE
      LIVYWTVIGE Unknown 40% ID

      ILVFYTVVGD Known2
        V  TV G
      LIVYWTVIGE Unknown 40% ID

      Is Unknown part of the same family? Or is this just a 4/10 coincidence?
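The 40% figures can be checked by counting identical positions in the ungapped alignment. A minimal sketch (the function name is invented, and gaps are not handled):

```python
def percent_identity(a, b):
    """Percent of aligned positions with identical residues (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

KNOWN1 = "LVMFWSIVGE"
KNOWN2 = "ILVFYTVVGD"
UNKNOWN = "LIVYWTVIGE"

print(percent_identity(KNOWN1, UNKNOWN))
print(percent_identity(KNOWN2, UNKNOWN))
```

Identity treats every mismatch as equally bad, so it cannot tell a conservative I/L/V swap from a drastic one; that limitation is what the RegEx approach on the next slide addresses.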

  14. RegEx

      LVMFWSIVGE Known1
      ILVFYTVVGD Known2
      [IL]-[LV]-[MV]-[FYW](2)-[ST]-[IV]-V-G-[DE] RegEx
      LIVYWTVIGE Unknown
      * ***** **

      More convincing that it is the same family? How would you modify the RegEx to include the 3rd sequence?
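The pattern on this slide can be tried out with Python's `re` module after a small syntax translation. The converter below is a sketch that handles only the notation used in this deck (bracketed alternatives, `X` for any residue, `(n)` repeats, `-` separators); it is not a full PROSITE parser. The widened pattern is one possible answer to the slide's question, not necessarily the intended one.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex.

    Supported: [ABC] alternatives, x/X wildcards, (n) repeats and '-'
    separators. Exclusions {..} and anchors < > are NOT handled.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".").replace("X", ".")
    return regex.replace("(", "{").replace(")", "}")

FAMILY = "[IL]-[LV]-[MV]-[FYW](2)-[ST]-[IV]-V-G-[DE]"
# Assumption: positions 2 and 8 widened to admit the Unknown sequence.
WIDENED = "[IL]-[LIV]-[MV]-[FYW](2)-[ST]-[IV]-[IV]-G-[DE]"

for name, seq in [("Known1", "LVMFWSIVGE"),
                  ("Known2", "ILVFYTVVGD"),
                  ("Unknown", "LIVYWTVIGE")]:
    strict = bool(re.fullmatch(prosite_to_regex(FAMILY), seq))
    wide = bool(re.fullmatch(prosite_to_regex(WIDENED), seq))
    print(name, "strict:", strict, "widened:", wide)
```

Every widening makes the pattern less specific, so the gain in coverage has to be weighed against the extra chance matches it will pick up across a large database.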

  15. Family Databases Three methods

  16. Prosite
      • Groups families by conserved motif, which is:
      • Present in all family members
      • Absent in all other proteins
      • No/few false positives (selectivity)
      • All true positives (sensitivity)
      • Motif defined with a regular expression

  17. What Prosite looks like

      ID RECA_1; PATTERN.
      AC PS00321;
      DT APR-1990 (CREATED); NOV-1997
      DE recA signature.
      PA A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
      NR /RELEASE=49.0,207132;
      NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
      NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
      DR Q01840, RECA1_LACLA, T; P48291, RECA1_MYXXA, T;
      DR P48292, RECA2_MYXXA, T; Q9ZUP2, RECA3_ARATH, T;
      (etc. for 70 lines)
      DR Q7UJJ0, RECA_RHOBA , N; Q9EVV7, RECA_STRTR , N;    (false negatives)
      DR Q4X0X6, EXO70_ASPFU, F; Q5AZS0, EXO70_EMENI, F;    (false positives)
      3D 2REB; 2REC;                                        (PDB structures)
      DO PDOC00131;                                         (documentation)
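The NR lines give enough numbers to put figures on how well the recA pattern performs. The sketch below takes /POSITIVE as true positives, /FALSE_POS as false positives and /FALSE_NEG as false negatives, and ignores the partial sequences. "Precision" is used here for what the Prosite slide calls selectivity; that mapping is an interpretation.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real family members that the pattern finds."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos, false_pos):
    """Fraction of pattern hits that are real family members."""
    return true_pos / (true_pos + false_pos)

# Values read from the NR lines of PS00321 as shown on the slide.
TRUE_POS, FALSE_POS, FALSE_NEG = 279, 2, 11

print(f"sensitivity = {sensitivity(TRUE_POS, FALSE_NEG):.3f}")
print(f"precision   = {precision(TRUE_POS, FALSE_POS):.3f}")
```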

  18. Prosite weak entries
      • Glycosaminoglycan site S-G-X-G
      • PKC phosphorylation site [ST]-X-[RK]
      • Amidation site X-G-[RK](2)
      • But this is all we know about the relevant sites.
      • False positives inevitable

  19. Prosite problems
      • RegEx now breaking down as the number of known recAs increases, so it no longer defines the protein
      • Database now huge, so the probability of finding any short motif is high.
      • Many copies of ELVIS hiding in UniProt
      • May be more than 1 motif defining a family
      • A great first attempt and still useful, but too crude
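The "ELVIS" point can be made quantitative with a back-of-envelope estimate. The sketch below assumes all 20 amino acids are equally likely and independent at each position, which is not true of real proteins, and uses a made-up database size of 100 million residues purely for illustration.

```python
from math import prod

def chance_hit_probability(allowed_per_position, alphabet=20):
    """Probability that a random position matches the motif.

    allowed_per_position lists how many residues are accepted at each
    motif position, e.g. [1, 1, 1, 1, 1] for ELVIS or [2, 20, 2] for
    [ST]-X-[RK].
    """
    return prod(n / alphabet for n in allowed_per_position)

def expected_chance_hits(allowed_per_position, db_residues, alphabet=20):
    """Expected number of chance matches, ignoring sequence boundaries."""
    return db_residues * chance_hit_probability(allowed_per_position, alphabet)

DB_RESIDUES = 100_000_000  # hypothetical database size, not a real figure

elvis = expected_chance_hits([1] * 5, DB_RESIDUES)
pkc = expected_chance_hits([2, 20, 2], DB_RESIDUES)
print(f"ELVIS:       about {elvis:.0f} chance hits")
print(f"[ST]-X-[RK]: about {pkc:.0f} chance hits")
```

Under these assumptions, the exact five-letter word already turns up dozens of times by chance, and a short degenerate pattern such as the PKC site matches about one position in every hundred.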

  20. Prints
      • A database of multiple domains/motifs.
      • Multiple motifs abstracted to database
      • Stored as probability matrix
      • If two proteins have the same motifs in the same order, they are likely to be homologous.
      • More biological/real/sensitive than Prosite

  21. ProDom
      • A French DB
      • All-against-all search of the nr protein DB.
      • Includes domains with no known function
      • cf. synteny of non-coding regions
      • Great for determining the domain structure of a particular protein.

  22. Pfam
      • Moves up from the short, highly conserved, easily aligned bits of a protein family.
      • Uses a PSSM (position-specific scoring matrix)
      • … on complete aligned family members

  23. Multiple sequence alignment:

      1234567890
      NSGTIVFLWP
      DSGTAIFLKP
      ESGTIIFLHN
      DSDTVRSLKP

      PSSM:

      Posn1  50%  D, N, E
      Posn2 100%  S
      Posn3  75%  G, D
      Posn4 100%  T
      Posn5  50%  I, A, V
      Posn6  50%  I, V, R
      Posn7  75%  F, S
      Posn8 100%  L
      Posn9  50%  K, H, W
      Posn0  75%  P, N
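The per-position percentages above can be reproduced by counting residues down each column of the alignment. The sketch below builds only that frequency table. A real PSSM would usually go on to convert frequencies into log-odds scores against background frequencies, with pseudocounts for unseen residues; those steps are left out. `profile_score` is a deliberately crude stand-in for PSSM scoring, included only to show how a query is compared column by column.

```python
from collections import Counter

ALIGNMENT = [
    "NSGTIVFLWP",
    "DSGTAIFLKP",
    "ESGTIIFLHN",
    "DSDTVRSLKP",
]

def column_frequencies(alignment):
    """One {residue: fraction of sequences} dict per alignment column."""
    n_seqs = len(alignment)
    return [
        {aa: count / n_seqs for aa, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

def profile_score(seq, freqs):
    """Mean column frequency of seq's residues; unseen residues score 0."""
    return sum(col.get(aa, 0.0) for aa, col in zip(seq, freqs)) / len(freqs)

FREQS = column_frequencies(ALIGNMENT)
for pos, col in enumerate(FREQS, start=1):
    top = max(col, key=col.get)
    print(f"Posn{pos % 10} {col[top]:.0%} {top}")

print(profile_score("DSGTIIFLKP", FREQS))
```

Unlike a regular expression, this representation keeps the information that G is three times as common as D at position 3, so a query with D there is penalised but not rejected outright.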

  24. Domain take-home
      • Run your protein against:
      • InterProScan
      • CD server at NCBI
      • Pfscan
      • It is likely that the crucial bit of info is in only one of the above.
