Todd Taylor George Mason University School of Computational Sciences ttaylora@gmu

Review of "Neutral networks in protein space: a computational study based on knowledge-based potentials of mean force" and "Exploring protein sequence space using knowledge-based potentials"
Todd Taylor, George Mason University, School of Computational Sciences, ttaylora@gmu.edu

Summary of Prosa II Potential W(x,)= W[xi, xj, |i-j|; dij ] +  V[xi; (i) ] i<j W[xi, xj, |i-j|; dij ] = additive pair contribution  = C or C x = AA sequence  = structure a and b = amino acids: a at xi and b at xj |i-j| = separation in sequence of a and b dij = Euclidean distance between  atoms of a and b

Summary of Prosa II Potential Continued
$$\text{Z-score} = \frac{W(x,\psi) - \overline{W}(x)}{\sigma_W(x)}$$
$\overline{W}(x)$ = average energy of $x$ over all structures
$\sigma_W(x)$ = standard deviation of the energy of $x$ over all structures
$V[x_i; \rho(i)]$ = surface term
$c$ = C$\alpha$ or C$\beta$
$x$ = AA sequence
$a$ and $b$ = amino acids: $a$ at $x_i$ and $b$ at $x_j$
$\rho(i)$ = the number of protein atoms in a sphere centered at $x_i$
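As an illustration of the Z-score definition, the sketch below normalizes one energy against the energies of the same sequence on a set of other structures. The function name and the choice of the population (rather than sample) standard deviation are assumptions made for this example, not details taken from the papers.

```python
import statistics

def z_score(energy, structure_energies):
    """Z-score of `energy` relative to the energies of the same
    sequence threaded onto all structures in `structure_energies`."""
    mean = statistics.fmean(structure_energies)
    # Population standard deviation; this choice is an assumption.
    std = statistics.pstdev(structure_energies)
    if std == 0:
        raise ValueError("energies have zero spread; Z-score undefined")
    return (energy - mean) / std
```

For example, `z_score(-1.0, [1.0, 3.0])` gives -3.0: the energy lies three standard deviations below the mean of the reference set.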

Low Prosa Z-scores Correspond to Native Structures

Definition of Adaptive Walk
Pick a structure and the corresponding "wild type" sequence; this structure is what your sequences will "adaptively walk" toward. Pick some other sequence with the same AA frequencies as globular proteins generally, and compute the PROSA Z-score for this sequence on the chosen structure. (The sequence-structure alignments that PROSA scores are ungapped, and lower Z-scores are more significant.) If the Z-score is not lower than the wild type Z-score, generate one-residue mutants until you find one with a lower Z-score than the current sequence, and make it the current sequence. Repeat until you find sequences with Z-scores below the wild type.
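The procedure above can be sketched as a simple loop. This is an illustrative outline only: `score` is a hypothetical callable standing in for the PROSA Z-score of a sequence on the target structure, mutations are proposed uniformly at random, and `max_steps` is an assumed safety limit that the papers do not mention.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def adaptive_walk(start, score, wild_type_score, max_steps=10000, rng=None):
    """Accept single-residue mutations that lower `score` until the
    current sequence scores below `wild_type_score`.

    score : hypothetical callable, sequence -> float (lower is better)
    Returns the final sequence and the list of accepted scores.
    """
    rng = rng or random.Random()
    current, current_score = start, score(start)
    trajectory = [current_score]
    for _ in range(max_steps):
        if current_score < wild_type_score:
            break  # walk has passed the wild type
        pos = rng.randrange(len(current))
        residue = rng.choice(AMINO_ACIDS)
        if residue == current[pos]:
            continue  # not a mutation
        mutant = current[:pos] + residue + current[pos + 1:]
        mutant_score = score(mutant)
        if mutant_score < current_score:
            # Only strictly improving mutants are accepted.
            current, current_score = mutant, mutant_score
            trajectory.append(current_score)
    return current, trajectory
```

With a toy score such as the negative number of positions matching a reference sequence, the accepted scores in `trajectory` decrease strictly.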

Definition of Neutral Walk
Start with a sequence found by an adaptive walk that has a Z-score at least as low as the wild type and that is therefore assured (at least for the purposes of this paper) of folding to the same structure as the wild type sequence. Make one-residue mutations until you find a second sequence that has a Z-score at least as low as the wild type; this becomes the current sequence. Repeat until you hit a dead end and cannot find a mutant with a sufficiently low Z-score.
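A neutral walk can be outlined in the same style. In this sketch `score` is again a hypothetical stand-in for the PROSA Z-score, and the rule that a sequence is never revisited is an assumption added here so that a dead end is well defined; the papers' exact acceptance rule may differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def neutral_walk(start, score, threshold, max_steps=100, rng=None):
    """Step between single-residue mutants whose score stays at or
    below `threshold`, stopping at a dead end or after `max_steps`.

    score : hypothetical callable, sequence -> float (lower is better)
    Returns the list of sequences visited, starting with `start`.
    """
    rng = rng or random.Random()
    if score(start) > threshold:
        raise ValueError("start sequence is not on the neutral network")
    path, visited, current = [start], {start}, start
    for _ in range(max_steps):
        # Enumerate every single-residue mutant, then try them in random order.
        mutants = [current[:i] + aa + current[i + 1:]
                   for i in range(len(current))
                   for aa in AMINO_ACIDS if aa != current[i]]
        rng.shuffle(mutants)
        next_seq = next((m for m in mutants
                         if m not in visited and score(m) <= threshold), None)
        if next_seq is None:
            break  # dead end: no unvisited neutral neighbor
        visited.add(next_seq)
        path.append(next_seq)
        current = next_seq
    return path
```

Enumerating all mutants at each step is affordable for short sequences; for long proteins with an expensive score, sampling mutants lazily would be the more practical choice.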

Definition of Hamming Distance in the Context of these Papers
The authors use the term Hamming distance even though their sequences come from the 20-letter AA alphabet. Here, Hamming distance means the number of positions at which two sequences of equal length have different letters. Sequence identity is 1 - (Hamming distance / sequence length).
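Both quantities are easy to compute directly from the definitions above. The function names below are chosen for this example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def sequence_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return 1.0 - hamming_distance(a, b) / len(a)
```

For instance, "ACDE" and "ACDF" differ at one position out of four, so their Hamming distance is 1 and their sequence identity is 0.75.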

Prosa II Z-scores Along Adaptive Walks

Hamming Distances Between Neutral Sequences

HP Patterns in Neutral Sequences

HP Profiles of Highly Designable Sequences

Secondary Structure of Neutral Sequences

Data from Closest Approach Walks
Df is the Hamming distance between the pairs of final sequences. D1 and D2 are the Hamming distances between the wild type and dead end sequences in walks 1 and 2. N is the average Hamming distance between dead end sequences from all runs, and n is the number of residues in the proteins.

Results for Adaptive and Neutral Walks
Surprisingly, many sequences with Z-scores much better than wild type were found. Neutral networks seem to be very extensive, and sequences tend to have low sequence identity with each other. The average Hamming distance between neutral net sequences is comparable to the distance between random sequences. Neutral network studies of RNA secondary structure indicate that the nets typically permeate all of fold space: there is "shape space covering", i.e., the distance is usually small from any randomly picked sequence to some other sequence that folds to any arbitrary structure you might pick. The authors claim their results indicate the same is true for proteins.

More Results for Adaptive and Neutral Walks
As a check that neutral net sequences could actually fold to the structure, the authors ran secondary structure prediction on the novel sequences and compared the predictions against the known secondary structure assignments of the wild type sequence. The rates of agreement for neutral sequences with good Z-scores were high. Reduced alphabet neutral sequences (HP and ADLG) have higher sequence identity than 20-letter sequences but still seem to permeate fold space.

Results from Closest Approach Walks and Janus Protein
The Hamming distance between wild type sequences in adjacent nets is large. The Hamming distance between the sequences on the border between two nets is small, ~5 mutations. The Janus sequence of Dalal has 50% sequence identity with 1PGB and 43% identity with 1ROP. 1PGB and 1ROP have very different structures, and Janus folds to the same structure as 1ROP. The structure of Janus was correctly predicted by PROSA, and neutral walks generated several other sequences with high sequence homology to 1PGB that are predicted to fold to the 1ROP structure.

Interesting Points Raised by This Work
The authors found sequences with Z-scores many standard deviations below the wild type. Is this due to inaccuracies in the PROSA potential, or is stability beyond some threshold not strongly selected for? Is robustness to mutation optimized at the expense of stability? The Babajide papers do not state what fraction of one-residue mutants were rejected at each point in the neutral walk, but you can guesstimate 80% or more from one figure. This fraction would correspond to the fraction of mutations that are deleterious (at least deleterious due to disruption of correct folding) and could presumably be checked experimentally.

More Interesting Points Raised by This Work
The closest approach walks indicate that sequences at the "edges" of neutral nets are separated by only ~5 mutations, but the wild type sequences are widely separated. What is the topology of protein sequence neutral nets? How rapidly do the Z-scores change as you leave the neutral net, i.e., for the 1 or 2 residue mutants near the neutral sequences on the boundary of the net? Sequences in a neutral network tend to have low sequence identity, as low as 10-15%. In structural genomics papers you often see statements like "the functions of 30-50% of putative proteins from complete genomes cannot be inferred due to low sequence homology with known proteins". Might it be true that most globular folds have already been found and many sequenced but unidentified proteins are remote neutral net neighbors of existing sequences?

References
Babajide A, Hofacker IL, Sippl MJ, Stadler PF (1997): Neutral networks in protein space: a computational study based on knowledge-based potentials of mean force. Fold Des. 2(5):261-9.
Babajide A, Farber R, Hofacker IL, Inman J, Lapedes AS, Stadler PF (2001): Exploring protein sequence space using knowledge-based potentials. J Theor Biol. 212(1):35-46.
