Day 6 Carlow Bioinformatics Proteins: structure, function, databases, formats
Wot’s a protein, then?
Hierarchical
• A collection of amino acids (0-D)
• AACompIdent can identify a protein from AA%s
• A sequence of AAs (1-D)
• 2ndry structural elements: α-helix etc. (2-D)
• Domains – (independent) functional units
• Whole protein (from single CDS) (3-D)
• Quaternary structure: dimers, ribosomes
• Interactome, pathways
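The 0-D view can be sketched in a few lines: throw away the order and keep only the percentage of each residue. This is only an illustration of the idea behind composition-based identification, not AACompIdent's actual method; the function name `aa_composition` is made up for this example and the test sequence is one of the short sequences used later in these slides.

```python
from collections import Counter


def aa_composition(seq: str) -> dict[str, float]:
    """Return the percentage of each amino acid in a one-letter sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Percentages are relative to the full sequence length.
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


comp = aa_composition("NSGTIVFLWP")
for aa, pct in comp.items():
    print(f"{aa}: {pct:.1f}%")
```

Two proteins with the same composition but different order give the same result, which is exactly why this is the 0-D level.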
ClustalW groups
Marked with ":" ("strong groups"): STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW
Marked with "." ("weak groups"): CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY
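A minimal sketch of how the groups above turn into the symbol line under an alignment: "*" for an identical column, ":" if every residue fits inside one strong group, "." if inside one weak group. The function names are invented for this example, and treating any column with a gap as unmarked is an assumption here, not a statement about ClustalW's internals.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column: str) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:          # assumption: gapped columns get no mark
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG):
        return ":"
    if any(residues <= set(group) for group in WEAK):
        return "."
    return " "


def consensus_line(seqs: list[str]) -> str:
    """Build the symbol line for equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*seqs))


aligned = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]
print(consensus_line(aligned))
```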
Amino acid groups
• KR (Lys Arg) NH3+ basic
• DE (Asp Glu) COO- acidic
• WYF (Trp Tyr Phe) large aromatic
• GP (Gly, Pro) α-helix-breaking
• C (Cys) in disulphide –S–S– bridges
• C not in disulphide bridges
• etc.
Secondary structure
Easy like exon prediction
• α-helix (no Pro Gly)
• 3.6 residues per turn
• Leucine zipper …LXXXXXXLXXXXXXL…
• Amphipathic helix (charged on one side)
• Transmembrane (α-helix, hydrophobic ~21AA long)
• β-sheet
• 2 dimensional zigzag
• Coil, random
• Turn
• ClustalW knows about α, β: it prefers to put gaps elsewhere
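The leucine zipper pattern on this slide (a leucine at every 7th position) is easy to scan for with a regular expression. This is a rough sketch only: the function name and the threshold of three leucines are assumptions made for the example, and a hit says nothing about whether the region is really helical.

```python
import re

# L, then six residues of anything, then L again, at least twice over:
# i.e. three or more leucines spaced exactly seven apart.
HEPTAD_LEU = re.compile(r"L(?:.{6}L){2,}")


def find_leucine_repeats(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched segment) for each heptad leucine run."""
    return [(m.start(), m.group()) for m in HEPTAD_LEU.finditer(seq.upper())]


toy = "AA" + "LAAAAAA" * 2 + "L" + "AA"   # made-up toy sequence
print(find_leucine_repeats(toy))
```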
Basic information
How big is my protein? Where are the β-sheets? Is there a signal peptide? Is there a trypsin cleavage site?
• ProtParam tool (MWt etc.)
• TMpred, TMHMM: transmembrane helix, inside/outside, external loops.
• JPRED for 2-D structure
• See practical manual for examples
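The trypsin question can be answered by hand with the usual textbook rule: cut after K or R, but not when the next residue is P. The sketch below applies only that rule; the function name is made up, and real digestion tools handle further exceptions and missed cleavages.

```python
import re


def trypsin_digest(seq: str) -> list[str]:
    """Split a protein after K or R, except when the next residue is P."""
    # Zero-width split point: preceded by K/R and not followed by P.
    # Splitting on empty matches needs Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", seq.upper())
    return [p for p in pieces if p]


print(trypsin_digest("AKRPGKA"))      # toy sequence
print(trypsin_digest("DSGTAIFLKP"))   # K is followed by P, so no cut
```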
Tertiary structure
Difficult like gene prediction
• The holy grail of bioinformatics
• 3-D orientation of known α, β
• Proteins made of functional units, “domains”
• Tried and tested modules
• Domain shuffling and exon boundaries
• Bioinf tries to make predictive calls on aspects of the 3-D structure
• Q. Why is 3-D important? A. Structure = function
What binf can do about 3-D
• Expressed/exported proteins have a signal peptide
• Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
• But homology to a known 3-D structure (X-ray, NMR) is the best predictor – threading.
• Plan to X-ray all “folds” in the human genome.
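A hydropathy plot is just a sliding-window average of per-residue hydrophobicity. The sketch below uses the Kyte-Doolittle scale, which is one common choice; the slides do not say which scale is intended, and the window of 7 is likewise an assumption (longer windows, around 19, are typically used when hunting transmembrane helices).

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy(seq: str, window: int = 7) -> list[float]:
    """Mean hydropathy of each full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


profile = hydropathy("LIVYWTVIGE", window=7)
print([round(x, 2) for x in profile])
```

Low (negative) stretches are candidates for surface exposure; high stretches are candidates for buried or membrane-spanning regions.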
SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry
ID   RECA_ECOLI STANDARD; PRT; 352 AA.
AC   P0A7G6; P03017; P26347; P78213;
RX   MEDLINE=92114994; PubMed=1731246;
RA   Story R.M., Weber I.T., Steitz T.A.;
RT   "The structure of the E. coli recA protein";
RL   Nature 355:318-325(1992).
DR   EMBL; V00328; CAA23618.1; -; Genomic_DNA.
DR   PDB; 2REB; X-ray; @=-.
DR   PRINTS; PR00142; RECA.
DR   ProDom; PD000229; RecA; 1.
DR   SMART; SM00382; AAA; 1.
DR   TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX 72 85
FT   TURN 86 87
FT   STRAND 90 94
FT   HELIX 101 106
UniProt is the key hub of Bioinformatics databases
Motifs and Domains
• BLAST compares (zillions of) seqs pairwise.
• Why BLAST? Homology/structure/function
• Is my protein/family/“fold” present?
• Suppose you find another homolog: how do you incorporate info from both to find a third?
• Bioinformatics is increasingly “knowledge-based”.
• So better able to cope with biological, noisy data.
Homology?
LVMFWSIVGE Known1
L   W   GE
LIVYWTVIGE Unknown 40% ID
ILVFYTVVGD Known2
  V  TV G
LIVYWTVIGE Unknown 40% ID
Is Unknown part of the same family? Or is this just a 4/10 coincidence?
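The 40% figures can be checked by counting identical positions. A minimal sketch for ungapped, equal-length sequences (the function name is invented for this example):

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with identical residues (no gaps allowed)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


known1, known2, unknown = "LVMFWSIVGE", "ILVFYTVVGD", "LIVYWTVIGE"
print(percent_identity(known1, unknown))
print(percent_identity(known2, unknown))
```

Note that the two comparisons agree on the score but match at different positions, so neither pairwise result uses what the other one knows.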
RegEx
LVMFWSIVGE Known1
ILVFYTVVGD Known2
[IL]-[LV]-[MV]-[FYW](2)-[ST]-[IV]-V-G-[DE]
LIVYWTVIGE Unknown
* ***** **
More convincing that it is the same family? How to modify the RegEx to include the 3rd sequence?
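The pattern on this slide can be tried out directly by translating it into a Python regular expression. The converter below is a sketch that handles only the pieces used here ("[..]" sets, "x", "(n)" repeats, "-" separators); it is not a full implementation of the Prosite pattern syntax. The widened pattern is one possible answer to the question above: add the Unknown's residues at the two positions that failed.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".").replace("X", ".")   # any residue
        element = element.replace("(", "{").replace(")", "}")   # repeat count
        parts.append(element)
    return "".join(parts)


original = prosite_to_regex("[IL]-[LV]-[MV]-[FYW](2)-[ST]-[IV]-V-G-[DE]")
# Widened at positions 2 and 8 so that the third sequence also fits.
widened = prosite_to_regex("[IL]-[ILV]-[MV]-[FYW](2)-[ST]-[IV]-[IV]-G-[DE]")

for name, seq in [("Known1", "LVMFWSIVGE"), ("Known2", "ILVFYTVVGD"),
                  ("Unknown", "LIVYWTVIGE")]:
    print(name, bool(re.fullmatch(original, seq)),
          bool(re.fullmatch(widened, seq)))
```

Every widening step lets more unrelated sequences through as well, which is the trade-off the Prosite slides go on to discuss.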
Family Databases Three methods
Prosite
• Groups families by conserved motif, which is
• Present in all family members
• Absent in all other proteins
• No/few false positives (selectivity)
• All true positives (sensitivity)
• Motif defined with a regular expression
What Prosite looks like
ID   RECA_1; PATTERN.
AC   PS00321;
DT   APR-1990 (CREATED); NOV-1997
DE   recA signature.
PA   A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR   /RELEASE=49.0,207132;
NR   /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR   /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
DR   Q01840, RECA1_LACLA, T; P48291, RECA1_MYXXA, T;
DR   P48292, RECA2_MYXXA, T; Q9ZUP2, RECA3_ARATH, T;
Etc. for 70 lines
DR   Q7UJJ0, RECA_RHOBA , N; Q9EVV7, RECA_STRTR , N;
DR   Q4X0X6, EXO70_ASPFU, F; Q5AZS0, EXO70_EMENI, F;
3D   2REB; 2REC;
DO   PDOC00131;
Slide callouts: DO = documentation; DR entries flagged F = false positives; 3D = PDB structures; DR entries flagged N = false negatives
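The NR lines give enough to put numbers on how well the recA pattern performs. The sketch below reads /POSITIVE as true positives and ignores /PARTIAL; it also maps the slides' "selectivity" onto precision (the fraction of hits that are real), which is an interpretation rather than something the entry states.

```python
def pattern_stats(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, sensitivity) for a pattern's database hits."""
    precision = true_pos / (true_pos + false_pos)      # hits that are real
    sensitivity = true_pos / (true_pos + false_neg)    # real members found
    return precision, sensitivity


# Values copied from the NR lines of PS00321 shown above.
precision, sensitivity = pattern_stats(true_pos=279, false_pos=2, false_neg=11)
print(f"precision   = {precision:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```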
Prosite weak entries
• Glycosaminoglycan attachment site S-G-X-G
• PKC phosphorylation site [ST]-X-[RK]
• Amidation site X-G-[RK](2)
• But this is all we know about the relevant sites.
• False positives inevitable
Prosite problems
• RegEx now breaking down as the number of known recAs increases, so it no longer defines the protein
• Database now huge, so the probability of finding any short motif by chance is high.
• Many copies of ELVIS hiding in UniProt
• May be more than 1 motif defining a family
• A great first attempt and still useful, but too crude
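A back-of-envelope sketch of why short motifs turn up by chance. It assumes all 20 amino acids are equally likely and independent, which real proteins are not, and the database size of 100 million residues is a made-up round number for illustration, not a UniProt statistic. Function names are invented for this example.

```python
ALL_AA = "ACDEFGHIKLMNPQRSTVWY"


def match_probability(allowed_per_position: list[str]) -> float:
    """Chance that a random position starts a match, under uniform residues."""
    p = 1.0
    for allowed in allowed_per_position:
        p *= len(set(allowed)) / len(ALL_AA)
    return p


def expected_hits(allowed_per_position: list[str], db_residues: int) -> float:
    """Expected number of chance matches in a database of db_residues."""
    return match_probability(allowed_per_position) * db_residues


DB_RESIDUES = 100_000_000   # hypothetical database size

elvis = ["E", "L", "V", "I", "S"]   # the exact word ELVIS
pkc = ["ST", ALL_AA, "RK"]          # [ST]-X-[RK]

print(expected_hits(elvis, DB_RESIDUES))
print(expected_hits(pkc, DB_RESIDUES))
```

Under these assumptions the three-residue PKC site matches about one position in a hundred, so hits alone carry almost no information.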
Prints
• A database of multiple domains/motifs.
• Multiple motifs abstracted to database
• Stored as probability matrix
• If two proteins have the same motifs in the same order they are likely to be homologous.
• More biological/real/sensitive than Prosite
ProDom
• A French DB
• All-against-all search of the nr protein DB.
• Includes domains with no known function
• cf. synteny of non-coding regions
• Great for determining the domain structure of a particular protein.
Pfam
• Moves up from the short, highly conserved, easily aligned bits of a protein family.
• Uses a PSSM (position-specific scoring matrix)
• … on complete aligned family members
Multiple sequence alignment:
1234567890
NSGTIVFLWP
DSGTAIFLKP
ESGTIIFLHN
DSDTVRSLKP
PSSM
Posn1 50% D, N, E
Posn2 100% S
Posn3 75% G, D
Posn4 100% T
Posn5 50% I, A, V
Posn6 50% I, V, R
Posn7 75% F, S
Posn8 100% L
Posn9 50% K, H, W
Posn0 75% P, N
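The per-position percentages above can be reproduced by counting residues down each column of the alignment. This sketch stops at raw frequencies plus a naive score (the mean frequency of the query's residue at each position); a real PSSM would normally convert frequencies to log-odds against background and add pseudocounts. Function names are invented for this example.

```python
from collections import Counter

ALIGNMENT = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]


def column_frequencies(seqs: list[str]) -> list[dict[str, float]]:
    """One {residue: frequency} dict per alignment column."""
    n = len(seqs)
    return [
        {aa: count / n for aa, count in Counter(col).items()}
        for col in zip(*seqs)
    ]


def naive_score(freqs: list[dict[str, float]], query: str) -> float:
    """Mean frequency of the query's residue at each position (0 if unseen)."""
    return sum(f.get(aa, 0.0) for f, aa in zip(freqs, query)) / len(freqs)


freqs = column_frequencies(ALIGNMENT)
for pos, column in enumerate(freqs, start=1):
    top, share = max(column.items(), key=lambda item: item[1])
    print(f"Posn{pos % 10} {share:.0%} {top}")

print(naive_score(freqs, "DSGTIIFLKP"))
```

Unlike a regular expression, a residue never seen at a position lowers the score instead of rejecting the sequence outright.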
Domain take home
• Run your protein against
• InterProScan
• CD server at NCBI
• Pfscan
• Likely that the crucial bit of info is only in one of the above.