Proteiinianalyysi 6: Secondary structure prediction
http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi
Secondary structure
• Amino acid sequence => secondary structure
• Conformational preferences of amino acids
• 13-17 residue window
• Correlations between positions => neural networks
• Biophysical background: http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
DSSP: an algorithm to define secondary structure
Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. W. Kabsch & C. Sander, Biopolymers 22, 2577-2637 (1983)
Hydrogen bonds
Partial charges: C +0.42e, O -0.42e; N -0.20e, H +0.20e (q1 = 0.42e, q2 = 0.20e)
E ~ q1 q2 [ 1/r(ON) + 1/r(CH) - 1/r(CN) - 1/r(OH) ]
An ideal H-bond is co-linear, with r(NO) = 2.9 Å and E = -3.0 kcal/mol.
The cutoffs in DSSP allow 2.2 Å excess distance and a ±60° angle.
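A minimal sketch of this energy test, assuming atom coordinates as (x, y, z) tuples in Å; the conversion factor 332 kcal·Å/(mol·e²) and the -0.5 kcal/mol assignment cutoff are the values used in the DSSP paper:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y, z) coordinates in Å."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hbond_energy(C, O, N, H, q1=0.42, q2=0.20, f=332.0):
    """Electrostatic energy (kcal/mol) of a C=O ... H-N hydrogen bond."""
    return q1 * q2 * f * (1.0 / dist(O, N) + 1.0 / dist(C, H)
                          - 1.0 / dist(C, N) - 1.0 / dist(O, H))

# DSSP assigns an H-bond when E < -0.5 kcal/mol.
def is_hbond(C, O, N, H):
    return hbond_energy(C, O, N, H) < -0.5
```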
Elementary H-bond patterns
• n-turn(i) =: Hbond(i, i+n), n = 3, 4, 5
• Parallel bridge(i,j) =: [ Hbond(i-1,j) AND Hbond(j,i+1) ] OR [ Hbond(j-1,i) AND Hbond(i,j+1) ]
• Antiparallel bridge(i,j) =: [ Hbond(i,j) AND Hbond(j,i) ] OR [ Hbond(i-1,j+1) AND Hbond(j-1,i+1) ]
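These definitions translate almost directly into code. A minimal sketch, assuming an hbond(i, j) predicate (e.g. the energy test above applied to residue i's C=O and residue j's N-H):

```python
def n_turn(hbond, i, n):
    """n-turn(i) =: Hbond(i, i+n), n = 3, 4, 5."""
    return hbond(i, i + n)

def parallel_bridge(hbond, i, j):
    return ((hbond(i - 1, j) and hbond(j, i + 1)) or
            (hbond(j - 1, i) and hbond(i, j + 1)))

def antiparallel_bridge(hbond, i, j):
    return ((hbond(i, j) and hbond(j, i)) or
            (hbond(i - 1, j + 1) and hbond(j - 1, i + 1)))
```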
N-turns
[Diagram: backbone segments showing the H-bonds that define a 3-turn Hbond(i,i+3), a 4-turn Hbond(i,i+4) and a 5-turn Hbond(i,i+5)]
Parallel bridge
[Diagram: two backbone strands running in the same direction, connected by the bridge H-bonds]
Antiparallel bridge
[Diagram: two backbone strands running in opposite directions, connected by the bridge H-bonds]
The antiparallel beta-sheet is significantly more stable due to its well-aligned H-bonds.
Cooperative H-bond patterns
• 4-helix(i, i+3) =: [ 4-turn(i-1) AND 4-turn(i) ]
• 3-helix(i, i+2) =: [ 3-turn(i-1) AND 3-turn(i) ]
• 5-helix(i, i+4) =: [ 5-turn(i-1) AND 5-turn(i) ]
• Longer helices are defined as overlaps of minimal helices
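Continuing the sketch above, the cooperative patterns reduce to pairs of consecutive turns; a minimal illustration, again assuming an hbond(i, j) predicate:

```python
def minimal_helix(hbond, i, n):
    """n-helix covering residues i .. i+n-1: two consecutive n-turns (n = 3, 4, 5)."""
    return hbond(i - 1, i - 1 + n) and hbond(i, i + n)

def helix_residues(hbond, length, n=4):
    """Mark residues belonging to helices; longer helices are overlaps of minimal ones."""
    marked = set()
    for i in range(1, length - n):
        if minimal_helix(hbond, i, n):
            marked.update(range(i, i + n))  # residues i .. i+n-1
    return marked
```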
Beta-ladders and beta-sheets
• Ladder =: set of one or more consecutive bridges of identical type
• Sheet =: set of one or more ladders connected by shared residues
• Bulge-linked ladder =: two ladders or bridges of the same type connected by at most one extra residue on one strand and at most four extra residues on the other strand
3-state secondary structure
• Helix
• Strand
• Loop
• The quoted consistency of secondary-structure state assignments between structures of sequence-similar proteins is ~70 %
• Richer descriptions are possible, e.g. phi/psi regions
Amino acid preferences for different secondary structures
• The alpha helix may be considered the default state of secondary structure. Although its potential energy is not as low as that of the beta sheet, its H-bonds are local (formed within one stretch of chain), so there is an entropic advantage over the beta sheet, where H-bonds must form from strand to strand between segments that may be quite distant in the polypeptide sequence.
• The main criterion for alpha-helix preference is that the amino acid side chain should cover and protect the backbone H-bonds in the core of the helix. Most amino acids do this, with some key exceptions.
• Alpha-helix preference: Ala, Leu, Met, Phe, Glu, Gln, His, Lys, Arg
• The extended structure leaves the maximum space free for the amino acid side chains; as a result, amino acids with large, bulky side chains prefer to form beta-sheet structures:
• just plain large: Tyr, Trp, (Phe, Met)
• bulky and awkward due to a branched beta carbon: Ile, Val, Thr
• large S atom on the beta carbon: Cys
• The remaining amino acids have side chains that disrupt secondary structure and are known as secondary-structure breakers:
• side-chain H too small to protect the backbone H-bond: Gly
• side chain linked to the alpha N, so there is no N-H to donate an H-bond; the rigid ring restricts phi to about -60°: Pro
• H-bonding side chains compete directly with backbone H-bonds: Asp, Asn, Ser
• Clusters of breakers give rise to regions known as loops or turns, which mark the boundaries of regular secondary structure and link up secondary-structure segments.
Secondary structure prediction
• GOR method
• Visual, expert assessment
• Neural networks
• Nearest-neighbour assignment
• … consensus filters
GOR method
• The state of the central residue is influenced by the adjacent positions in a window
• …A...X…….
• …A...Q…….
• …A…X…L..
• Superseded by more accurate methods
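A minimal sketch of the GOR idea: sum position-specific propensities over a window around the central residue and pick the best-scoring state. The table info[state][offset][residue] stands in for the real GOR information values and is a hypothetical input here:

```python
def gor_predict(seq, pos, info, states=("H", "E", "C"), half_window=8):
    """Predict the state of seq[pos] from a 17-residue window (GOR-style scoring).

    info[state][offset][residue] is assumed to hold log-odds information values.
    """
    scores = {}
    for s in states:
        total = 0.0
        for d in range(-half_window, half_window + 1):
            if 0 <= pos + d < len(seq):
                total += info[s][d].get(seq[pos + d], 0.0)
        scores[s] = total
    return max(scores, key=scores.get)  # state with the highest summed information
```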
Structure parsing
• Multiple alignment
• Conservation => core elements
• Gaps, Pro, Gly, polar stretch => loops
• 3.5-residue periodicity => amphiphilic helix
• 2-residue periodicity => amphiphilic strand
• Row of hydrophobics => buried strand
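The periodicity tests can be made concrete with a hydrophobic-moment calculation: project residue hydrophobicities onto an angle of 100° per residue (~3.6-residue periodicity, helix) or 180° per residue (2-residue periodicity, strand). A sketch; the hydrophobicity values are a hypothetical excerpt of a Kyte-Doolittle-style scale:

```python
import math

# Hypothetical excerpt of a hydrophobicity scale (Kyte-Doolittle-like values).
HYDRO = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "A": 1.8,
         "G": -0.4, "S": -0.8, "Q": -3.5, "E": -3.5, "K": -3.9}

def hydrophobic_moment(seq, angle_deg):
    """Magnitude of the hydrophobic moment of seq at a given periodicity angle."""
    rad = math.radians(angle_deg)
    sx = sum(HYDRO.get(aa, 0.0) * math.cos(i * rad) for i, aa in enumerate(seq))
    sy = sum(HYDRO.get(aa, 0.0) * math.sin(i * rad) for i, aa in enumerate(seq))
    return math.hypot(sx, sy)

# A high moment at 100 degrees suggests an amphiphilic helix;
# a high moment at 180 degrees suggests an amphiphilic strand.
```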
What are neural networks?
• Parallel, distributed information-processing structures that draw their ultimate inspiration from neurons in the brain
• Main class = feed-forward network, also known as the multi-layer perceptron
• A paradigm for tackling pattern-classification and regression tasks
Why (not) use neural networks?
• Efficient at secondary structure prediction
• “Black boxes”: the learned mapping is hard to interpret
• Can deal with non-linear combinations of multiple factors
• Rule-based explanations can over-simplify and mislead
Neural networks are made of units that are often assumed to be simple, in the sense that their state can be described by a single number, their "activation" value. Each unit generates an output signal based on its activation. Units are connected to each other very specifically, each connection having an individual "weight" (again described by a single number). Each unit sends its output value to all units to which it has an outgoing connection, and through these connections the output of one unit can influence the activations of other units.

The unit receiving the connections calculates its activation by taking a weighted sum of the input signals: it multiplies each input signal by the weight of the corresponding connection and adds up these products. The output is determined from this activation by the activation function (e.g. the unit generates output or "fires" if the activation is above a threshold value). Networks learn by changing the weights of the connections.
Feed-forward architecture
[Diagram: layered feed-forward network; one node has a typical output of 1.0 for all patterns (a bias unit)]
Output of each node in the network, for a given pattern p:
s_j(p) = f( Σ_i w_ji s_i(p) )
The squashing function f(x) is typically a sigmoid or logistic function, f(x) = 1 / (1 + e^(-x)).
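A minimal sketch of this per-node computation:

```python
import math

def logistic(x):
    """Squashing function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias=0.0):
    """Activation = weighted sum of the input signals; output = f(activation)."""
    activation = sum(w * s for w, s in zip(weights, inputs)) + bias
    return logistic(activation)
```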
A two-layer neural network capable of calculating XOR. The numbers within the neurons represent each neuron's explicit threshold (which can be factored out so that all neurons have the same threshold, usually 1). The numbers that annotate arrows represent the weight of the inputs. This net assumes that if the threshold is not reached, zero (not -1) is output.
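Since the figure itself is not reproduced here, the weights and thresholds below are one illustrative choice that realizes XOR with threshold units (OR but not AND), not necessarily those of the original figure:

```python
def step(x, threshold):
    """Threshold unit: outputs 1 if the activation reaches the threshold, else 0."""
    return 1 if x >= threshold else 0

def xor_net(x1, x2):
    h_or  = step(1 * x1 + 1 * x2, 0.5)   # fires if at least one input is on
    h_and = step(1 * x1 + 1 * x2, 1.5)   # fires only if both inputs are on
    return step(1 * h_or - 1 * h_and, 0.5)  # OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # outputs 0, 1, 1, 0
```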
Training a feed-forward net
• Supervised learning
• Training pattern and associated target = training pair
• Input patterns in the training set must have the same number of elements as the net has input nodes
• Every target must have the same number of elements as the net has output nodes
Ability to generalise
• The number of training patterns versus the number of network weights
• Rule of thumb: at least 20 times as many patterns as network weights are needed
• The number of hidden nodes
• Too few nodes impede learning
• Too many nodes impede generalisation
• The number of training iterations
Basic approach
• Each training pair is of the form
Pattern: **LSADQISTVQASFDK
Target: H
• Three target classes
DSSP classes => prediction class:
H, G => helix
E => strand
B, I, S, T, e, g, h => coil
• Encoding
Alanine: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Helix: 1 0 0
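A sketch of this sparse encoding: each of the 17 window positions becomes 21 input values (20 amino acids plus one spacer unit for positions beyond the chain ends, marked '*' above). The alphabet ordering is illustrative:

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids + spacer for chain ends

def encode_window(window):
    """Turn a 17-residue window such as '**LSADQISTVQASFDK' into 17 * 21 = 357 inputs."""
    inputs = []
    for aa in window:
        onehot = [0] * len(ALPHABET)
        onehot[ALPHABET.index(aa)] = 1  # one unit on per window position
        inputs.extend(onehot)
    return inputs

TARGETS = {"H": [1, 0, 0], "E": [0, 1, 0], "C": [0, 0, 1]}  # 3 output nodes
```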
Back-propagation algorithm
• Gradient descent with momentum:
w_ij ← w_ij − η ∂E/∂w_ij + m Δw_ij(previous)   (1)
• Partial derivative of the error E with respect to weight w_ij (output node i, hidden node j):
∂E/∂w_ij = (s_i − d_i) s_i (1 − s_i) s_j   (2)
s_j = signal emitted by hidden node j
s_i = output of output node i
d_i = desired value of output node i
η = rate of training (typical value 0.03)
m = smoothing (momentum) factor (typical value 0.2)
Example: signal s_j sent from H_j to O_i = 0.2; output s_i = 0.2; desired output d_i = 1:
∂E/∂w_ij = (0.2 − 1) × 0.2 × 0.8 × 0.2 = −0.0256, so w_ij will be increased according to (1)
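The numeric example spelled out as code, with the learning rate and momentum as given above; the previous weight change for the momentum term is assumed to be zero here:

```python
def weight_update(w, s_out, d, s_hidden, rate=0.03, momentum=0.2, prev_delta=0.0):
    """One gradient-descent step for an output-node weight, per equations (1) and (2)."""
    grad = (s_out - d) * s_out * (1 - s_out) * s_hidden  # dE/dw for an output node
    delta = -rate * grad + momentum * prev_delta
    return w + delta, delta

# Example from the slide: s_hidden = 0.2, s_out = 0.2, desired d = 1:
# grad = (0.2 - 1) * 0.2 * 0.8 * 0.2 = -0.0256, so the weight increases.
```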
Typical numbers
• Training set
Several hundred non-homologous protein chains
Total number of residues = number of training patterns
• Architecture
Fully-connected 17(21)-5-3
17 × 21 = 357 input nodes
(357 × 5) + 5 + (5 × 3) + 3 = 1,808 weights (including biases)
• Prediction
Winner-takes-all over the three output nodes
Performance measures
• Q3: three-state per-residue prediction accuracy
• Correlation coefficient
• SOV: segment overlap
• Reliability index
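Q3 is the simplest of these; a minimal sketch:

```python
def q3(predicted, observed):
    """Percentage of residues whose 3-state (H/E/C) prediction matches the DSSP assignment."""
    assert len(predicted) == len(observed)
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

print(q3("HHHECCC", "HHEECCC"))  # 6 of 7 residues correct -> ~85.7
```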
Improvements on the basic approach
• Using evolutionary information: up 6 %-points
• Balanced training: equal representation of H, E, L patterns
• Increasing the amount of training data: up 4 %-points (training on 318 vs. 128 proteins)
• Post-processing and filtering
• Using an ensemble of networks: a jury of 10 nets is up 2 %-points
PSIPRED
A PSI-BLAST multiple alignment (profile) analysed by two consecutive feed-forward neural networks
Prediction of secondary structure by nearest-neighbour analysis
Examples of two of the most accurate nearest-neighbour prediction programs:
(1) NNSSP (accuracy ~73.5 %), choosing the PSSP / NNSSP option. The output probabilities P_a and P_b give a normalized score by converting the values of f_a, f_b and f_coil to a scale of 0-9.
(2) Predator (accuracy ~75 %), using the FSSP assignments of secondary structure for the training sequences. Predator does not provide a normalized score. Predator predictions are shown below the NNSSP prediction on each line.
The input sequence was the alpha subunit of S. typhimurium tryptophan synthase, Swiss-Prot ID TRPA_SALTY, accession P00929, which is among the training sequences since its 3D structure is known.
                  10        20        30        40        50
PredSS    aaaaaaaaaaaaa     bbbbbb     aaaaaaaaaaaaaaaaaaaaa
AA seq    MERYESLFAQLKERKEGAFVPFVTLGDPGIEQSLKIIDTLIEAGADALEL
Prob a    99999999999974211100000010001688899999999974578863
Prob b    00000000000000001277788741000100000000000000001122
Predator  ___HHHHHHHHHHHHHH_EEEEEE_______HHHHHHHHHHH________

                  60        70        80        90       100
PredSS                 aaaaaaaaaa    aaaaaaaaaaaaa     bbba
AA seq    GIPFSDPLADGPTIQNATLRAFAAGVTPAQCFEMLALIRQKHPTIPIGLL
Prob a    11111111100124568899887311058899999999852000111133
Prob b    23221101100012110000001111000000000000000002335544
Predator  ______________HHHHHHHHH______HHHHHHHHHHH______HHHH

                 110       120       130       140       150
PredSS    aaaaaaa   aaaaaaaaaaa  bbbbb           aaaaaaa
AA seq    MYANLVFNKGIDEFYAQCEKVGVDSVLVADVPVEESAPFRQAALRHNVAP
Prob a    54554453447899999988400100000111222234788998731111
Prob b    32112211000000000000011168986322110100000000000123
Predator  HHHHH______HHHHHHHHH____EEEEEE________HHHHHHHH___E

                 160       170       180       190       200
PredSS    bbb      aaaaaaaaa     bbbb       aaaaaaaaaaaaaaaa
AA seq    IFICPPNADDDLLRQIASYGRGYTYLLSRAGVTGAENRAALPLNHLVAKL
Prob a    00000000158999999731111212235211125556654388899999
Prob b    89852000000000000110113677531112211100112200000000
Predator  EEE_______HHHHHHHH_____EEEEE______HHHHH_____HHHHHH

                 210       220       230       240       250
PredSS    aaa              aaaaaaaaa         aaaaaaaaaaa aaa
AA seq    KEYNAAPPLQGFGISAPDQVKAAIDAGAAGAISGSAIVKIIEQHINEPEK
Prob a    88632100111101114789999987453122226878888997542588
Prob b    00000000133433200000000000000122111010000000000000
Predator  HHH_______________HHHHHHH___________HHHHHHHHH__HHH

                 260
PredSS    aaaaaaaaaaaaaaaaa
AA seq    MLAALKVFVQPMKAATRS
Prob a    989999998878898663
Prob b    000000000000000011
Predator  HHHHHHHH__________
The Paracelsus challenge
• Paracelsus was an alchemist active in the 16th century.
• The protein design challenge: design an amino acid sequence that is at least 50 % identical to a known protein yet folds into a different structure.
• The first artificial sequence to meet the challenge, named Janus (Dalal et al. 1997, Nat. Struct. Biol. 4, 548-552), converts the B1 domain from a beta-dominated fold (bbabb) into an alpha-helical structure (aa).
• Janus is structurally similar to the Rop protein. The Rop monomer forms a hairpin of two antiparallel helices; in nature Rop dimerizes and forms a four-helix bundle.
(a) Structure of the B1 domain. The amino acids retained in the Janus sequence are marked in red. (b) Structure of the ROP dimer. The amino acids present in the Janus sequence are marked in blue.
(a) Compute the pairwise sequence identities of the B1 domain, Janus and Rop
• B1-Janus 27/56
• B1-Rop 3/56
• Janus-Rop 23/56
         1         2         3         4         5
    .    0    .    0    .    0    .    0    .    0    .
CEEEEECCCSSCEEEEECCCSCHHHHHHHHHHHHHHTTCCEEEEECCCEEEEEECC
MTYKLILNGKTLKGETITEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE   B1 domain
|| | ||   | |      || ||  ||   | |||| || | | ||  ||  |
MTKKAILALNTAKFLRTQAAVLAAKLEKLGAQEANDNAVDLEDTADDLYKTLLVLA   Janus
 ||    ||| | | | |   |  ||  | | |  |    | | || ||   |
GTKQEKTALNMARFIRSQTLTLLEKLNELDADEQADICESLHDHADELYRSCLARF   Rop monomer
CCHHHHHHHHHHHHHHHHHHHHHHHHHHTTCHHHHHHHHHHHHHHHHHHHHHHHHH
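As a check on (a), a minimal sketch that counts identities in the gap-free 56-residue alignment above (sequences copied verbatim):

```python
def identity(a, b):
    """Number of identical positions in two equal-length aligned sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b))

b1    = "MTYKLILNGKTLKGETITEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"
janus = "MTKKAILALNTAKFLRTQAAVLAAKLEKLGAQEANDNAVDLEDTADDLYKTLLVLA"
rop   = "GTKQEKTALNMARFIRSQTLTLLEKLNELDADEQADICESLHDHADELYRSCLARF"

print(identity(b1, janus), identity(b1, rop), identity(janus, rop))  # 27 3 23
```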
(b) In addition to the identical amino acids, mark in the alignment the substitutions that score positively in the BLOSUM62 matrix.
• None between B1 and Janus.
• Eight between Janus and Rop.
(c) Do the mutations chosen for Janus occur naturally in the B1 family or the Rop family?
• None of the B1/Janus mutations occur in the B1 family.
• Seven of the Janus/Rop mutations occur in other members of the Rop family; five of these mutations are shared with the B1 sequence.
• In B1 the core amino acids were changed, whereas the Rop core is conserved in Janus.