Secondary Structure Prediction Using Decision Lists Deniz YURET Volkan KURT
Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?
What is the problem? • The generic prediction algorithm • Some important pitfalls: definition, data set • Upper and lower bounds on performance • Evolution and homology enter the picture
Secondary Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm • Sequence to Structure • Structure to Structure
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ??????????????????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ---???????????????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----????????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HH??????????????????????????
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE------?
A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------
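The frames above can be read as a sliding-window loop: each residue is labelled from the amino acids around it, one position at a time. The following is a minimal Python sketch of that reading. The window half-width, the padding character, and the `toy_classifier` lookup table are all assumptions made for illustration; the slides do not specify them, and the table is invented rather than learned.

```python
from typing import Callable, Iterator


def windows(seq: str, half_width: int, pad: str = "_") -> Iterator[str]:
    """Yield one fixed-size window of residues centred on each position."""
    padded = pad * half_width + seq + pad * half_width
    for i in range(len(seq)):
        yield padded[i : i + 2 * half_width + 1]


def predict_sequence_to_structure(
    seq: str, classify: Callable[[str], str], half_width: int = 3
) -> str:
    """Label every residue with H, E or - using only its amino-acid window."""
    return "".join(classify(w) for w in windows(seq, half_width))


# Hypothetical stand-in for a trained classifier: it looks only at the
# centre residue. The table is made up for illustration.
TOY_TABLE = {"A": "H", "E": "H", "L": "H", "V": "E", "F": "E"}


def toy_classifier(window: str) -> str:
    centre = window[len(window) // 2]
    return TOY_TABLE.get(centre, "-")


seq = "MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD"
pred = predict_sequence_to_structure(seq, toy_classifier)
print(seq)
print(pred)
```

Any real predictor (neural network, decision list, and so on) would take the place of `toy_classifier`; the loop around it stays the same.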
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ?---H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?--H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --?-H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----?-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE------?
A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------
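The structure-to-structure pass revisits each predicted label using the labels predicted around it. Below is a minimal Python sketch, assuming a three-label window and a hand-written smoothing rule (`smooth_isolated`, a hypothetical stand-in for whatever second-stage model is actually trained). It reads all neighbours from the first-pass output rather than updating in place, which is also an assumption.

```python
from typing import Callable


def smooth_isolated(left: str, centre: str, right: str) -> str:
    """Toy second-stage rule: overrule a label that disagrees with two
    neighbours that agree with each other."""
    if left == right and centre != left:
        return left
    return centre


def predict_structure_to_structure(
    first_pass: str,
    rule: Callable[[str, str, str], str] = smooth_isolated,
    pad: str = "-",
) -> str:
    """Re-label each position from the first-pass labels around it."""
    padded = pad + first_pass + pad
    return "".join(
        rule(padded[i], padded[i + 1], padded[i + 2])
        for i in range(len(first_pass))
    )


first_pass = "----H-----HHHHHHHHHH------EEEEE-------"
second_pass = predict_structure_to_structure(first_pass)
print(first_pass)
print(second_pass)
```

On this example the rule removes the isolated H at the fifth position and leaves the long helix and strand untouched, which matches the final frame above.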
Pitfalls for newcomers • Definition of secondary structure • Choice of data set
Pitfall 1: Definition of Secondary Structure • DSSP: H, B, E, G, I, T, S • STRIDE: H, G, I, E, B, b, T, C • DEFINE: ??? • Convert all to H, -, and E • The three definitions agree only 71% of the time! (95% for DSSP and STRIDE) • Solution: Use DSSP
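The reduction to three states and the agreement figure can be sketched in a few lines of Python. The mapping used here (H, G, I to H; E, B to E; everything else to -) is one common convention and is an assumption: the slide does not say which mapping was applied, and different mappings give different agreement and accuracy figures. The example strings are made up.

```python
# ASSUMPTION: one common 8-to-3 reduction; the slide does not state
# which mapping was actually used.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three(labels: str) -> str:
    """Map detailed structure codes to H, E or - (everything else)."""
    return "".join(DSSP_TO_3.get(code, "-") for code in labels)


def agreement(a: str, b: str) -> float:
    """Fraction of positions at which two label strings agree."""
    if len(a) != len(b):
        raise ValueError("label strings must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up label strings, for illustration only.
detailed = "STTHHHHGGGTEEEEBSS"
reduced = reduce_to_three(detailed)
print(reduced)
print(agreement("HHE-", "HHEE"))
```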
Pitfall 2: Dataset • Trivial to get 80%+ when homologies are present between the training and the test set • Homology identification keeps evolving • RS126, CB513, etc. • Comparison of programs on different data sets is meaningless…
Performance Bounds • Simple baselines for lower bound • A method for estimating an upper bound
Performance Bounds • Baseline 1: 43% of all residues are tagged “loop” • Scale: 43%: assign loop
Performance Bounds • Baseline 2: 49% of all residues are tagged with the most frequent structure for the given amino acid • Scale: 49%: assign most frequent; 43%: assign loop
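Both baselines amount to counting. The Python sketch below computes them on the single example protein from the earlier slides, fitting and scoring on the same data for brevity; the function names are hypothetical, and the numbers it prints are for this toy input only, not the 43% and 49% reported on the slides.

```python
from collections import Counter, defaultdict


def baseline_majority_class(pairs):
    """Baseline 1: always predict the single most frequent label."""
    label_counts = Counter(label for _, label in pairs)
    label, count = label_counts.most_common(1)[0]
    return label, count / len(pairs)


def baseline_per_residue(pairs):
    """Baseline 2: for each amino acid, predict its most frequent label."""
    by_residue = defaultdict(Counter)
    for residue, label in pairs:
        by_residue[residue][label] += 1
    table = {r: c.most_common(1)[0][0] for r, c in by_residue.items()}
    correct = sum(table[r] == label for r, label in pairs)
    return table, correct / len(pairs)


seq = "MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD"
ss = "----------HHHHHHHHHH------EEEEE-------"
pairs = list(zip(seq, ss))

label, acc1 = baseline_majority_class(pairs)
table, acc2 = baseline_per_residue(pairs)
print(f"baseline 1: always '{label}', accuracy {acc1:.2f}")
print(f"baseline 2: per-residue table, accuracy {acc2:.2f}")
```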
Performance Bounds • Upper bound: Only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall. • Scale: 100%: ???; 75%: estimated upper bound; 49%: assign most frequent; 43%: assign loop
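One possible reading of the exact-match estimate: store every frame seen in training together with the label of its centre residue, then score only those test frames that occur verbatim in that table. The Python sketch below follows that reading. The function names, the use of odd frame sizes without padding, and the toy data are assumptions; the slides do not give these details.

```python
from collections import Counter, defaultdict


def frames(seq, ss, size):
    """Yield (frame, centre_label) for every full frame of odd length `size`."""
    half = size // 2
    for i in range(half, len(seq) - half):
        yield seq[i - half : i + half + 1], ss[i]


def exact_match_stats(train, test, size):
    """Return (accuracy on matched frames, fraction of test frames matched)."""
    table = defaultdict(Counter)
    for seq, ss in train:
        for frame, label in frames(seq, ss, size):
            table[frame][label] += 1

    covered = correct = total = 0
    for seq, ss in test:
        for frame, label in frames(seq, ss, size):
            total += 1
            if frame in table:
                covered += 1
                correct += table[frame].most_common(1)[0][0] == label

    accuracy = correct / covered if covered else float("nan")
    coverage = covered / total if total else 0.0
    return accuracy, coverage


# Made-up toy data, for illustration only.
train = [("AAGAAG", "HH-HH-")]
test = [("AAGAV", "HH-HE")]
for size in (1, 3):
    acc, cov = exact_match_stats(train, test, size)
    print(f"frame size {size}: accuracy {acc:.2f}, coverage {cov:.2f}")
```

Even on this tiny input, coverage drops as the frame grows, which is the trade-off the slide describes.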
The Miracle of Homology • People used to be stuck at around 60%. • Rost and Sander crossed the 70% barrier in 1993 using homology information. • All algorithms gain 5-10% from homology. • The homologues are of unknown structure; the training and test sets are still unrelated! • Why?
Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?
GORV • Diagram components: Sequence, PSI-BLAST, Information Function / Bayesian Statistics, Majority Vote, Filter, Secondary Structure • Accuracy: 66.9%, +6.5% from PSI-BLAST, 73.4% final • Garnier et al., 2002
PHD • Diagram components: HSSP Frequency Profile, Neural Network, Neural Network, Jury + Filter, Secondary Structure • Figures on the diagram: 61.7% / 65.9%; 62.6% / 67.4%; gains of +4.3% and +3.4%; 70.8% final • Rost & Sander, 1993
JNet • Diagram components: Profiles (PSIBLAST, HMMER2, CLUSTALW), Neural Network, Neural Network, Jury + Jury Network, Secondary Structure • Accuracy: 76.9% • Cuff & Barton, 2000
PSIPRED • Diagram components: PSI-BLAST Profiles, Neural Network, Neural Network, Secondary Structure • Accuracy: 76.3% • Jones, 1999
Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?
Introduction to Decision Lists • Prototypical machine learning problem: • Decide democrat or republican for 435 representatives based on 16 votes. Class Name: 2 (democrat, republican) 1. handicapped-infants: 2 (y,n) 2. water-project-cost-sharing: 2 (y,n) 3. adoption-of-the-budget-resolution: 2 (y,n) 4. physician-fee-freeze: 2 (y,n) 5. el-salvador-aid: 2 (y,n) 6. religious-groups-in-schools: 2 (y,n) … 16. export-administration-act-south-africa: 2 (y,n)
Introduction to Decision Lists • Prototypical machine learning problem: • Decide democrat or republican for 435 representatives based on 16 votes. 1. If adoption-of-the-budget-resolution = y and anti-satellite-test-ban = n and water-project-cost-sharing = y then democrat 2. If physician-fee-freeze = y then republican 3. If TRUE then democrat
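A decision list is evaluated top to bottom, and the first rule whose conditions all hold decides the class. The Python sketch below encodes the three rules from the slide; the data layout (a list of condition/label pairs) and the `classify` function name are illustrative choices, not something the slides prescribe.

```python
from typing import Dict, List, Tuple

Rule = Tuple[Dict[str, str], str]  # (required votes, predicted class)

DECISION_LIST: List[Rule] = [
    (
        {
            "adoption-of-the-budget-resolution": "y",
            "anti-satellite-test-ban": "n",
            "water-project-cost-sharing": "y",
        },
        "democrat",
    ),
    ({"physician-fee-freeze": "y"}, "republican"),
    ({}, "democrat"),  # "If TRUE": an empty condition always matches
]


def classify(votes: Dict[str, str], rules: List[Rule] = DECISION_LIST) -> str:
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(votes.get(name) == value for name, value in conditions.items()):
            return label
    raise ValueError("decision list has no default rule")


print(classify({"physician-fee-freeze": "y"}))
print(classify({"physician-fee-freeze": "n"}))
```

Because order matters, a representative who matches rule 1 is classified as democrat even if rule 2 would also match.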