School B&I TCD Bioinformatics

Proteins: structure, function, databases, formats

Wot’s a protein, then? Hierarchical
• A collection of amino acids (0-D)
  • AACompIdent can identify a protein from its AA percentages
• A sequence (string) of AAs (1-D)
• Secondary structural elements: α-helix etc. (2-D)
• Domains: (independent) functional units
• Whole protein (from a single CDS) (3-D)
• Quaternary structure: dimers, ribosomes
• Interactome, pathways
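The 0-D view treats a protein as nothing more than its amino acid percentages. A minimal sketch of computing that composition is given here; the function name `aa_composition` and the demo sequence are made up for illustration, and this is not a reproduction of how AACompIdent itself scores candidates.

```python
from collections import Counter

def aa_composition(seq):
    """Return {amino acid: percentage of the sequence} for a one-letter protein string."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

# Hypothetical sequence, purely for illustration.
for aa, pct in sorted(aa_composition("MKKLLLAG").items()):
    print(f"{aa}: {pct:.1f}%")
```

Note that the order of the residues is thrown away, which is exactly why this is the "0-D" level of the hierarchy.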

Protein functions

Amino acid properties again … and again and again

Amino acid groups
• KR (Lys Arg) NH3+ basic
• DE (Asp Glu) COO- acidic
• WYF (Trp Tyr Phe) large aromatic
• GP (Gly, Pro) helix-breaking
• C (Cys) disulphide -S-S- bridges
• C is not always in a disulphide bridge
• etc.

Secondary structure
Easy like exon prediction
• α-helix (no Pro, Gly)
  • 3.6 residues per turn
  • Leucine zipper …LXXXXXXLXXXXXXL…
  • Amphipathic helix (charged on one side)
  • Transmembrane (α-helix, hydrophobic, ~21 AA long)
• β-sheet
  • 2-dimensional zigzag
• Coil, random
• Turn (kink)
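The leucine zipper spacing on this slide (L, six other residues, L, six more, L) translates directly into a regular expression. The sketch below is only a pattern scan, not a structure predictor; the function name and the demo sequence are hypothetical.

```python
import re

# Heptad repeat from the slide: L, six arbitrary residues, L, six more, L.
# The lookahead lets overlapping candidates be reported as well.
ZIPPER = re.compile(r"(?=(L.{6}L.{6}L))")

def find_leucine_zippers(seq):
    """Return (0-based start, matched text) for every heptad-spaced L...L...L run."""
    return [(m.start(), m.group(1)) for m in ZIPPER.finditer(seq.upper())]

# Hypothetical test sequence, not taken from any real protein.
demo = "MK" + "LAAAAAA" * 3 + "LGG"
for start, hit in find_leucine_zippers(demo):
    print(start, hit)
```

A hit only says the spacing is right; whether the region is really helical still has to be checked by other means.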

Patterns to recognise (more reliable in MSA than in single seq)
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
• Alternate hydrophobic residues
  • Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
  • Interior/buried β-sheet
• Residues with 3.5 AA spacing (amphipathic)
  • α-helix WNNWFNNFNNWNNNF
• Gaps/indels
  • Probably surface, not core
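Why the WNNWFNNFNNWNNNF spacing reads as an amphipathic helix can be checked with a helical wheel. The sketch below assumes the textbook figure of about 100° of rotation per residue and treats W, F, Y, L, I, V, M as the hydrophobic set; both are assumptions made for illustration, as is the function name.

```python
def hydrophobic_arc(seq, hydrophobic="WFYLIVM", degrees_per_residue=100):
    """Smallest arc (in degrees) of the helical wheel that holds every hydrophobic residue."""
    angles = sorted({(i * degrees_per_residue) % 360
                     for i, aa in enumerate(seq.upper()) if aa in hydrophobic})
    if len(angles) < 2:
        return 0
    # Gaps between neighbouring angles, including the wrap-around gap.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(360 - angles[-1] + angles[0])
    return 360 - max(gaps)

print(hydrophobic_arc("WNNWFNNFNNWNNNF"))  # slide pattern: prints 120
print(hydrophobic_arc("WNWNWNWN"))         # strict alternation: prints 200
```

With the slide's pattern every hydrophobic residue falls within a 120° face of the helix, whereas strict alternation (the β-sheet pattern) spreads them over 200° of the wheel.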

Conserved residues
• W, F, Y: large hydrophobic, internal/core
  • conserved WFY is the best signal for domains
• G, P: turns, can mark the end of an α-helix or β-sheet
• C conserved with reliable spacing suggests C-C disulphide bridges, e.g. defensins
• H, S: often catalytic sites in proteases (and other enzymes)
• K, R, D, E: charged; ligand binding or salt bridge
• L: very common AA but not conserved
  • except in leucine zipper L234567L234567L234567L

Basic information
How big is my protein? Where are the β-sheets? Is there a signal peptide? Is there a trypsin cleavage site?
• ProtParam tool (MWt etc.)
• TMpred, TMHMM: transmembrane helices, inside/outside, external loops
• JPRED for 2-D structure
• see practical manual for examples
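The trypsin question can be answered with a few lines of code. This sketch uses the commonly quoted simple rule that trypsin cuts on the C-terminal side of Lys (K) or Arg (R) unless the next residue is Pro; real digestion has further exceptions, and the function names and demo peptide are hypothetical.

```python
def trypsin_sites(seq):
    """1-based positions of residues after which trypsin is expected to cut."""
    seq = seq.upper()
    sites = []
    # The last residue is skipped: there is nothing after it to cut off.
    for i, aa in enumerate(seq[:-1]):
        if aa in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    return sites

def digest(seq):
    """Split the sequence into the peptides implied by trypsin_sites()."""
    cuts = [0] + trypsin_sites(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

# Hypothetical peptide for illustration.
print(trypsin_sites("MAKRPLLKGE"))
print(digest("MAKRPLLKGE"))
```

For this demo peptide the R at position 4 is not a site because it is followed by P.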

Tertiary structure
Difficult like gene prediction
• The holy grail of bioinformatics
• 3-D orientation of known α, β
• Proteins made of functional units, “domains”
  • Tried and tested modules
  • Domain shuffling and exon boundaries
• Bioinf tries to make predictive calls on aspects of the 3-D structure
• Q. Why is 3-D important? A. Structure = function

What binf can do about 3-D
• Expressed/exported proteins have a signal peptide
• Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
• But homology to a known 3-D structure (X-ray, NMR) is the best predictor: threading.
• Plan to X-ray all “folds” in human genome.
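A hydropathy plot is just a sliding-window average of per-residue hydrophobicity. The sketch below uses the Kyte-Doolittle scale, which is one common choice and not necessarily the one used in the practical; the window size of 9 and the function name are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(seq, window=9):
    """Mean Kyte-Doolittle score for each full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
profile = hydropathy("KKDEKRLLIVLLAVILLKKDER")
print([round(x, 2) for x in profile])
```

Windows with a strongly negative mean are more likely to be on the surface; long positive stretches (around 20 residues) are candidates for transmembrane helices.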

recA

SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry
  ID   RECA_ECOLI STANDARD; PRT; 352 AA.
  AC   P0A7G6; P03017; P26347; P78213;
  RX   MEDLINE=92114994; PubMed=1731246;
  RA   Story R.M., Weber I.T., Steitz T.A.;
  RT   "The structure of the E. coli recA protein";
  RL   Nature 355:318-325(1992).
  DR   EMBL; V00328; CAA23618.1; -; Genomic_DNA.
  DR   PDB; 2REB; X-ray; @=-.
  DR   PRINTS; PR00142; RECA.
  DR   ProDom; PD000229; RecA; 1.
  DR   SMART; SM00382; AAA; 1.
  DR   TIGRFAMs; TIGR02012; tigrfam_recA; 1.
  DR   PROSITE; PS00321; RECA_1; 1.
  FT   HELIX 72 85
  FT   TURN 86 87
  FT   STRAND 90 94
  FT   HELIX 101 106
UniProt is the key hub of Bioinformatics databases
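Because every line of the entry starts with a two-letter code, the flat file is easy to pick apart. As a sketch, the code below pulls the secondary-structure features out of the FT lines quoted on this slide; it assumes the simple "FT TYPE START END" layout shown here, which may differ from the current UniProt flat-file format, and the function name is made up.

```python
# FT lines copied from the slide's RECA_ECOLI excerpt.
ENTRY = """\
ID   RECA_ECOLI STANDARD; PRT; 352 AA.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX 72 85
FT   TURN 86 87
FT   STRAND 90 94
FT   HELIX 101 106
"""

def secondary_structure(entry):
    """Return (feature type, start, end) for every 4-field FT line."""
    features = []
    for line in entry.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0] == "FT":
            features.append((fields[1], int(fields[2]), int(fields[3])))
    return features

for kind, start, end in secondary_structure(ENTRY):
    print(f"{kind:<6} {start:>4}-{end:<4} ({end - start + 1} residues)")
```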

Homology?
  LVMFWSIVGE  Known1
  L   W   GE
  LIVYWTVIGE  Unknown   40% ID
  ILVFYTVVGD  Known2
    V  TV G
  LIVYWTVIGE  Unknown   40% ID
Is Unknown part of the same family? Or is this just a 4/10 coincidence?
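The 40% figures can be reproduced by counting identical positions in each ungapped pair. A minimal sketch, with hypothetical function names, using the three sequences from the slide:

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def match_line(a, b):
    """Identical residues shown in place, mismatches as spaces."""
    return "".join(x if x == y else " " for x, y in zip(a, b))

KNOWN1, KNOWN2, UNKNOWN = "LVMFWSIVGE", "ILVFYTVVGD", "LIVYWTVIGE"

for name, known in (("Known1", KNOWN1), ("Known2", KNOWN2)):
    print(name, percent_identity(known, UNKNOWN), repr(match_line(known, UNKNOWN)))
```

Identity alone ignores conservative substitutions such as I/V or F/Y, which is why it undersells the similarity here.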

RegEx
  LVMFWSIVGE  Known1
  ILVFYTVVGD  Known2
  [MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
  LIVYWTVIGE  Unknown
  * ***** **
More convincing that it is the same family? How would you modify the RegEx to include the 3rd sequence?
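The pattern on this slide can be tested mechanically by translating it into Python `re` syntax. The converter below handles only the subset of PROSITE notation that appears in these slides (residue classes, `x`, `(n)` repeats, `-` separators); the relaxed pattern at the end is one possible answer to the question, not the only one.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                   # PROSITE patterns end with a full stop
    regex = regex.replace("-", "")                # '-' only separates positions
    regex = regex.replace("x", ".")               # 'x' means any residue
    return re.sub(r"\((\d+)\)", r"{\1}", regex)   # (3) becomes {3}

SEQS = {"Known1": "LVMFWSIVGE", "Known2": "ILVFYTVVGD", "Unknown": "LIVYWTVIGE"}

slide_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]"
# One possible relaxation: allow I as well as V at position 8.
relaxed_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-[IV]-G-[DE]"

for label, pattern in (("slide", slide_pattern), ("relaxed", relaxed_pattern)):
    regex = prosite_to_regex(pattern)
    hits = [name for name, seq in SEQS.items() if re.fullmatch(regex, seq)]
    print(label, regex, hits)
```

The slide pattern matches both known sequences but misses Unknown at a single position (I instead of V at position 8); every relaxation of this kind also raises the risk of chance matches.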

Family Databases Three methods

Prosite
• Groups families by conserved motif, which is
  • Present in all family members
  • Absent in all other proteins
• No/few false positives (selectivity)
• All true positives (sensitivity)
• Motif defined with a regular expression

cf SwissProt
What Prosite looks like
  ID   RECA_1; PATTERN.
  AC   PS00321;
  DT   APR-1990 (CREATED); NOV-1997
  DE   recA signature.
  PA   A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
  NR   /RELEASE=49.0,207132;
  NR   /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
  NR   /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
  DR   Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
  DR   P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
  Etc for 70 lines
  DR   Q7UJJ0,RECA_RHOBA ,N; Q9EVV7,RECA_STRTR ,N;    (False negatives)
  DR   Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;    (False positives)
  3D   2REB; 2REC;                                     (PDB structures)
  DO   PDOC00131;                                      (Documentation)
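The NR lines of this entry carry enough numbers to put figures on sensitivity and selectivity. The sketch below reads the counts from the text quoted on the slide and takes sensitivity as TP/(TP+FN) and selectivity as precision, TP/(TP+FP); reading the slide's "selectivity" as precision is an assumption.

```python
import re

# NR lines as quoted on the slide (PROSITE entry PS00321).
NR = ("/TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0); "
      "/FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;")

# Take the first number after each key, ignoring the bracketed count.
counts = {key: int(value) for key, value in re.findall(r"/(\w+)=(\d+)", NR)}

tp = counts["POSITIVE"]
fp = counts["FALSE_POS"]
fn = counts["FALSE_NEG"]

sensitivity = tp / (tp + fn)   # fraction of true recAs that the pattern finds
precision = tp / (tp + fp)     # fraction of pattern hits that are true recAs

print(f"sensitivity = {sensitivity:.3f}")
print(f"precision   = {precision:.3f}")
```

On these numbers the pattern is very selective (only 2 false hits) but already misses 11 genuine family members.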

Prosite problems
• RegEx now breaking down as the number of recAs increases, so it no longer defines the protein
• Database now huge, so the probability of finding any short motif by chance is high
  • Many copies of ELVIS hiding in UniProt
• May be more than 1 motif defining a family
• A great first attempt and still useful, but too crude
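The ELVIS point can be made roughly quantitative. Under the crude assumption that all 20 amino acids are equally frequent and independent, a pattern's chance probability at one position is the product of (class size / 20) over its positions. The database size used below is a hypothetical round number, not the size of any UniProt release.

```python
from math import prod

def chance_probability(class_sizes, alphabet=20):
    """Probability that a random stretch matches, given allowed residues per position."""
    return prod(size / alphabet for size in class_sizes)

def expected_hits(class_sizes, db_residues):
    """Approximate number of chance matches in a database of db_residues residues."""
    return db_residues * chance_probability(class_sizes)

HYPOTHETICAL_DB_RESIDUES = 100_000_000   # assumed size, for illustration only

elvis = [1, 1, 1, 1, 1]                  # E-L-V-I-S: one residue allowed per position
# A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R: the recA signature pattern
reca = [1, 1, 2, 2, 2, 3, 4, 5, 1]

print(expected_hits(elvis, HYPOTHETICAL_DB_RESIDUES))
print(expected_hits(reca, HYPOTHETICAL_DB_RESIDUES))
```

Under these assumptions a 5-residue exact motif turns up by chance dozens of times, while the 9-position recA pattern is still expected well under once; the margin shrinks as the database grows and as positions are relaxed.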

Prints
• A database of multiple domains/motifs.
• Multiple motifs abstracted to database
• Stored as probability matrix
• If two proteins have the same motifs in the same order they are likely to be homologous.
• More biological/real/sensitive than Prosite

ProDom
• A French DB
• All-against-all search of the nr protein DB
• Includes domains with no known function
  • cf synteny of non-coding regions
• Great for determining the domain structure of a particular protein.

Pfam
• Moves up from the short, highly conserved, easily aligned bits of a protein family.
• Uses a PSSM (position-specific scoring matrix)
• … on complete aligned family members

Multiple sequence alignment:
  1234567890
  NSGTIVFLWP
  DSGTAIFLKP
  ESGTIIFLHN
  DSDTVRSLKP
PSSM
  Posn1   50%  D,N,E
  Posn2  100%  S
  Posn3   75%  G,D
  Posn4  100%  T
  Posn5   50%  I,A,V
  Posn6   50%  I,V,R
  Posn7   75%  F,S
  Posn8  100%  L
  Posn9   50%  K,H,W
  Posn0   75%  P,N
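The per-position percentages are simply column counts over the alignment. The sketch below rebuilds them from the four sequences on the slide; it gives raw frequencies only, whereas a real PSSM would normally convert these to log-odds scores with pseudocounts. The function names are made up.

```python
from collections import Counter

ALIGNMENT = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]

def column_profile(alignment):
    """For each column, return {residue: fraction of sequences with that residue}."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile

def consensus(profile):
    """Most frequent residue at each position."""
    return "".join(max(col, key=col.get) for col in profile)

profile = column_profile(ALIGNMENT)
for pos, col in enumerate(profile, start=1):
    ranked = sorted(col.items(), key=lambda item: -item[1])
    top_aa, top_freq = ranked[0]
    print(f"Posn{pos % 10}  {top_freq:4.0%}  " + ",".join(aa for aa, _ in ranked))
print(consensus(profile))
```

Scoring a new sequence against these per-column frequencies, instead of against a single regular expression, is what lets a profile tolerate the occasional mismatch.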

Domain take home
• Run your protein against
  • InterProScan
  • CD server at NCBI
  • Pfscan
• Likely that the crucial bit of info is only in one of the above.
