Identifying Functional signatures in Proteins - a computational design approach
David Bernick, Rohl group, 16-Mar-2005
The big picture
• what is function?
  • hinges
  • substrate/DNA/protein binding/alignment/recognition
  • catalytic sites
• what isn't function? (structure)
  • secondary structures
  • fold architecture
  • thermodynamically required elements
• nature selects for function (structure is implicit)
• computational methods select for structure
• can we predict… quickly?
Some terms
• pssm - position-specific scoring matrix
  • a [20 x length] model of residue frequencies for every position of a sequence family
• homolog - natural sequences evolved from a common parent
• morpholog - computationally derived sequence generated from a parent structure
• ortholog - common ancestor, derived by speciation (constrained functional divergence)
• paralog - common ancestor, same species (unconstrained functional divergence)
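To make the pssm definition above concrete, here is a minimal sketch of a [20 x length] frequency model built from an aligned sequence family. The function name `build_pssm`, the `pseudocount` parameter, the gap handling, and the toy alignment are all illustrative assumptions, not the code used in the talk.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(aligned_seqs, pseudocount=0.0):
    """Return one column per alignment position.

    Each column is a dict mapping residue -> frequency at that position.
    Gaps and non-standard characters are skipped (an assumption of this sketch).
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")

    pssm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # Column made entirely of gaps: no frequency information.
            column = {aa: 0.0 for aa in AMINO_ACIDS}
        else:
            column = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        pssm.append(column)
    return pssm


# Hypothetical toy alignment of four sequences, three positions each.
family = ["ACD", "ACE", "GCD", "A-D"]
pssm = build_pssm(family)
```

The same routine would apply to either sequence set: fed with natural homologs it gives a homolog pssm, fed with designed morphologs it gives a morpholog pssm.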
structure ensembles
• Larson (2003) - Improved homology searches
• Pei (2003) - Homology detection and active site searches
• Kuhlman (2000) - Structural optimality of natural sequences
Results - SH3 domain
• 11 structures
• 62 additional sequences
Results - S100 domain
• Ca++ loop1 not detected: backbone-coordinated residues
• Ca++ loop2 not detected: insufficient homolog depth
• 11 structures
• 30 additional sequences
the protocol
[Flowchart not recoverable from the extracted text. Box labels: Sequence; blast; representative structure; CE + SCOP; Taylor Doms; paralog structures; Flexible Design; fixed design; cogs, pfam, reverse blast; homolog alignment; pssmH; pssmM; score (statistical, geometric)]
genome scale
• high-cost step: producing pssmM
• precalculate pssmM for every domain
morpholog pssms: genome scale
• Data sources
  • Taylor parsed Domain database
  • CE all-to-all + SCOP
• Precompute pssms for every domain
  • ~8000 domains
  • 100 sequences: ~90% diversity
  • 1000 sequences: ~99% diversity
  • ~4-8 wks, 70p cluster for initial set
scoring
• compare pssmH to pssmM
  • pssmM contains only structure signal
  • pssmH contains both function and structure
• each position represents a count-normalized position in 20-space (H or M)
• R-position: average aa position
• RH and RM define 20-space vectors
  • 'function vector'
  • 'structure vector'
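The comparison above can be sketched as a per-position geometric score. This is only an illustration under stated assumptions: it treats each pssm column as a count-normalized point in 20-space and uses the Euclidean distance between the homolog and morpholog points as the score. The slides do not specify the actual scoring function, and the names `normalize`, `euclidean_distance`, and `score_positions`, as well as the toy counts, are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed axis order for the 20-space


def normalize(counts):
    """Turn raw residue counts for one position into a 20-vector summing to 1."""
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def euclidean_distance(r_h, r_m):
    """Distance between the homolog point and the morpholog point."""
    return math.sqrt(sum((h - m) ** 2 for h, m in zip(r_h, r_m)))


def score_positions(pssm_h, pssm_m):
    """Score each aligned position; assumes both pssms have the same length."""
    return [
        euclidean_distance(normalize(h), normalize(m))
        for h, m in zip(pssm_h, pssm_m)
    ]


# Hypothetical counts for two positions.
# Position 0: homologs prefer H, morphologs prefer A/L (candidate functional site).
# Position 1: both sets agree on G (structure alone explains the residue).
pssm_h = [{"H": 9, "A": 1}, {"G": 10}]
pssm_m = [{"A": 5, "L": 5}, {"G": 4}]
scores = score_positions(pssm_h, pssm_m)
```

Under this assumption, a large distance marks a position where natural sequences carry signal that design for structure alone does not reproduce, and a distance near zero marks a position that structure alone accounts for.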
next steps
• complete this set of domains - verification
• full domain pssmM generation
acknowledgements
• Carol Rohl
• Kevin Karplus
• Craig Lowe
• Rohl group
• HP