RNA/Protein Structures
RNA structure • Stem-loop structure
RNA structure • A loop structure • A loop between i and j when base at i pairs with base at j • Base at i+1 pairs with base at j • Or base at i pairs with base at j-1 • Or a multiple loop
RNA secondary structure • Search for minimum free energy • Gibbs free energy at 37 °C • Free energy increments of base pairs are counted as stacks of adjacent pairs • Successive CGs: -3.3 kcal/mol • Unfavorable loop initiation energy to constrain bases in a loop
RNA structure prediction • Ad-hoc approach • Simply look at a strand and find areas where base pairing can occur • Possible to find many locations where folds can occur • Prediction should be able to determine the most likely one • What should the criterion be? • 1980, Nussinov-Jacobson algorithm • The more stable structure is the most likely one • Find the fold that forms the greatest number of base pairs (base pairing lowers the overall energy of the strand, making it more stable) • Checking all possible folds is infeasible, so dynamic programming is used
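The base-pair maximization idea on this slide can be sketched as a small dynamic program. This is a minimal illustration, not the original 1980 formulation: the function names, the inclusion of G-U wobble pairs, and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) are assumptions made for the example.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus G-U wobble (the wobble pair is an assumption)."""
    return (a, b) in {("A", "U"), ("U", "A"), ("G", "C"),
                      ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of base pairs, dot-bracket structure) for seq."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # dp[i][j] = maximum number of pairs within seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j unpaired
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i + 1][j]:
            stack.append((i + 1, j))
        elif dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
        elif can_pair(seq[i], seq[j]) and dp[i][j] == dp[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if dp[i][j] == dp[i][k] + dp[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return dp[0][n - 1], "".join(structure)


pairs, fold = nussinov("GGGAAAUCC")
print(pairs, fold)
```

The table is filled in order of increasing subsequence length, so each cell reuses already-solved shorter subproblems; the cost is cubic in sequence length instead of exponential.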
Amino Acid • General structure of amino acids • an amino group • a carboxyl group • α-carbon bonded to a hydrogen and a side-chain group, R • Side chain R determines the identity of a particular amino acid • R: large white and gray • C: black • Nitrogen: blue • Oxygen: red • Hydrogen: white
Protein • Protein: polymer consisting of AA’s linked by peptide bonds • AA in a polymer is called a residue • Folded into 3D structures • Structure of protein determines its function • Primary structure: linear arrangement of AA’s • AA sequence (primary structure) determines 3D structure of a protein, which in turn determines its properties • N- and C-terminal • Secondary structure: short stretches of AAs • Tertiary structure: overall 3D structure
Secondary structure • Secondary structures have repetitive interactions resulting from hydrogen bonding between the N-H and carbonyl (C=O) groups of the peptide backbone • Conformations of the AA side chains are not part of the secondary structure • α-helix
β-pleated sheet • Parallel/antiparallel • 3D form of antiparallel
Secondary structure: domain • Part of the chain folds independently of the folding of other parts • Such an independently folded portion of a protein is called a domain (super-secondary structure) • βαβ unit • αα unit (helix-turn-helix) • β-meander • Greek key
Domain • Larger proteins are modular • Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins • Domains are not only structurally, but also functionally, discrete units: domain family members are structurally and functionally conserved and recombined in complex ways during evolution • Domains can be seen as the units of evolution • Novelty in protein function often arises as a result of gain or loss of domains, or by re-shuffling existing domains along the sequence • For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)
Motif • A repetitive super-secondary structure is a motif (or module) • Greek key motif is often found in β-barrel tertiary structure • Complement control protein module • Immunoglobulin module • Fibronectin type I module • Growth factor module • Kringle module
Motif Representation • Motif • In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks • They tend to correspond to core structural and functional elements of the proteins
Linked series of β-meanders • Greek key pattern • Alternating βα units • Top and side views (α-helical section is outside)
Secondary structure: conformation • Two types of protein conformations • Fibrous • Globular: folds back onto itself to create a spherical shape • Schematic diagrams of fibrous and globular proteins • Computer-generated model of globular protein
SRC protein • Tyrosine kinase • Enzyme that puts a phosphate group on a tyrosine AA (phosphorylation) • Activates an inactive protein, eventually activating cell-division proteins
Secondary Structure Prediction by PSIPRED • Prediction of regions of the protein that form alpha-helix, beta-sheet, or random coil • NP_005408 >gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens] MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
Examining Crystal Structure • Cn3D: NCBI structure viewer and modeling tool • DeepView: SWISS-PROT • Jmol • NCBI Structure database • Links to NCBI MMDB (Molecular Modeling Database) • MMDB contains experimentally verified protein structures • SRC – MMDB ID 56157, PDB ID 1FMK • View Structure from NCBI Structure database • Opens up Cn3D window • Click to rotate; Ctrl-click to zoom; Shift-click to move • Rendering and coloring menus
Tertiary structure • 3D arrangement of all atoms in the molecule • Considers arrangement of helical and sheet sections, conformations of side chains, arrangement of atoms of side chains, etc. • Experimentally determined by • X-ray crystallography – measure diffraction patterns of atoms • NMR (Nuclear Magnetic Resonance) spectroscopy – use protein samples in aqueous solution
Protein families • Groups of genes of identical or similar sequence are common • Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product • e.g., a genome contains multiple copies of ribosomal RNA genes • Human chromosome 1 has 2000 genes for 5S rRNA (S: sedimentation coefficient), and chr 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S • Amplification of rRNA genes evolved because of heavy demand for rRNA synthesis during cell division • These rRNA genes are examples of gene families having identical or near-identical sequences • Sequence similarities indicate a common evolutionary origin • α- and β-globin families have distinct sequence similarities and evolved from a single ancestral globin gene
Protein families and superfamilies • Dayhoff classification, 1978 • Protein families – at least 50% AA sequence similarity (based on physico-chemical AA features) • Related proteins with less similarity (35%) belong to a superfamily and may have quite diverse functions • α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily • These families have distinct sequence similarities and evolved from a single ancestral globin gene
Protein family database • Pattern or secondary database derived from sequences • A pattern may be the most conserved aspects of sequence families • The most conserved part may vary between species • Use a scoring system to account for some variability • Position-specific scoring matrix (PSSM) or profile • In contrast to a pairwise alignment, which uses the same weights regardless of position • Protein family databases are derived by different analytical techniques • But all try to find motifs, conserved regions considered to reflect shared structural or functional characteristics • Three groups: single motifs, multiple motifs, or full domain alignments
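To make the position-specific scoring idea concrete, here is a minimal sketch of building a log-odds PSSM from an ungapped alignment. The toy alignment, the uniform background frequency, the pseudocount of 1, and the function names `build_pssm` and `score_segment` are all assumptions for illustration; real family databases use more careful weighting and background models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


toy_alignment = ["GSH", "GSH", "GAH", "GSH"]  # made-up example data
pssm = build_pssm(toy_alignment)
for candidate in ("GSH", "GAH", "WWW"):
    print(candidate, round(score_segment(pssm, candidate), 2))
```

Unlike a single substitution matrix used in pairwise alignment, the same residue scores differently in each column: here S is rewarded only at the second position, where the family actually conserves it.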
Single Motif Method • Regular expression • PROSITE • PDB 1ivy • Carboxypet_Ser_His (PS00560) • [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA] • [] – any of the enclosed symbols • x – any residue • (3) – number of repeats • Fuzzy regular expression • Build regular expressions with info on shared biochemical properties of AA • Provide flexibility according to AA group clustering
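A PROSITE-style pattern maps almost directly onto an ordinary regular expression. The sketch below handles only the syntax listed on the slide (`[]`, `x`, and repeat counts) plus `{}` exclusions; anchors such as `<` and `>` are not handled. The function name `prosite_to_regex` and the test peptide are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, repeat = match.groups()
        if core == "x":                                   # any residue
            token = "."
        elif core.startswith("[") and core.endswith("]"):  # allowed residues
            token = core
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            token = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isupper():            # single fixed residue
            token = core
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)


PS00560 = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-"
           "[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(PS00560)
print(regex)

# A made-up peptide constructed to satisfy the pattern, not a real sequence.
toy_peptide = "MKLAALAIGSGHAIPAAAPQ"
hit = re.search(regex, toy_peptide)
print(hit.group(0) if hit else "no match")
```

A fuzzy regular expression would widen each bracket group to a whole physico-chemical class of residues, trading specificity for sensitivity.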
Multiple motif methods • PRINTS • Encode multiple motifs (called fingerprints) in ungapped, unweighted local alignments • BLOCKS • Derived from PROSITE and PRINTS • Use the most highly conserved regions in protein families in PROSITE • Use motif-finding algorithm to generate a large number of candidate blocks • Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors • Blocks are iteratively extended and ultimately encoded as ungapped local alignments • Graph theory is used to assemble a best set of blocks for a given family • Use position-specific scoring matrix (PSSM), similar to a profile
Full domain alignment • Profiles • Use family-based scoring matrix via dynamic programming • Has position-specific info on insertions and deletions in the sequence family • Hidden Markov Model (HMM) • PFAM, SMART, TIGRFAM represent full domain alignments as HMMs • PFAM • Represents each family as seed alignment, full alignment, and an HMM • Seed contains representative members of the family • Full alignment contains all members of the family as detected with the HMM constructed from the seed alignment
Hidden Markov Model (HMM) • Markov Process • Decomposed into successive discrete states • e.g., first-order Markov process – a traffic light • Hidden: process states are not directly observable – spoken sounds vs. physical changes in vocal cords, position of tongue, etc. • Profile HMM • Discrete states correspond to successive columns of a protein multiple sequence alignment • Match, insertion, deletion states • States have an associated symbol emission probability distribution • Position-specific gap weight represents transition probability from indel to match
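The roles of transition and emission probabilities can be illustrated with the forward algorithm, which sums the probability of an observed sequence over all hidden state paths. This is a generic two-state toy model, not a profile HMM with match, insertion, and deletion states; the state names and every probability below are made-up values for illustration.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summed over all hidden paths."""
    if not observations:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model (all numbers are assumptions): a "conserved" state that strongly
# prefers G and a "variable" state that emits G and A equally.
STATES = ("conserved", "variable")
START = {"conserved": 0.5, "variable": 0.5}
TRANS = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.3, "variable": 0.7},
}
EMIT = {
    "conserved": {"G": 0.9, "A": 0.1},
    "variable": {"G": 0.5, "A": 0.5},
}

for observed in ("GG", "GA", "AA"):
    print(observed, forward_probability(observed, STATES, START, TRANS, EMIT))
```

The observer sees only the emitted symbols, never the state sequence, which is what makes the model hidden. A profile HMM applies the same recursion with one group of match, insertion, and deletion states per alignment column.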
Protein structures • Domain • Polypeptide chain in a protein folds into a ‘tertiary’ structure • One or more compact globular regions called domains • The tertiary structure associated with a domain region is also described as a protein fold • Multi-domain • Proteins whose polypeptide chains fold into several domains • Nearly half of the known globular structures are multidomain, and more than half of these have two domains • Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in PDB
Reasons for structural comparisons • Ligand binding • Binding of a ligand or a substrate to an active site in a protein induces a structural change which facilitates the reaction being catalyzed at the site or promotes binding of substrates at another site • Comparing ligand-bound and unbound structures sheds light on these processes and on drug design • Distant evolutionary relationship • Protein structure is more highly conserved than sequence • Structure comparison can detect homologs with substantial sequence changes • Structural variations in protein families • Identification of common structural motifs
Structure comparison algorithms • Two main components in structure comparison algorithms • Scoring similarities in structural features • Optimization strategy maximizing the measured similarities • Most are based on geometric properties from 3D coordinates • Intermolecular method • Superpose structures by minimizing the distance between superposed positions • Intramolecular method • Compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions • Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
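The RMSD definition above translates directly into a few lines of code. This sketch assumes the two coordinate lists are already superposed and listed in equivalent-atom order; it does not search for the optimal superposition, which intermolecular methods must do first. The function name and the toy coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates in angstroms (made-up values): the second atom is
# displaced by 2 along z, the first is unchanged.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(structure_a, structure_b))
```

Because deviations are squared before averaging, a few badly matched atoms dominate the value, which is one reason intramolecular methods instead count the number of equivalent positions.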
Distant homolog • Structure is more conserved than sequence during evolution • Structural similarity between distant homologs can be found • Pairwise sequence similarity • SSAP structural similarity score in parentheses (0 – 100)
Structure comparison algorithms • SSAP, 1989 • Residue level, intramolecular, dynamic programming • DALI, 1993 • Residue fragment level, intramolecular, Monte Carlo optimization • COMPARER, 1990 • Multiple element level, both inter- and intramolecular, dynamic programming
Structure classification • Most structure classifications are established at the domain level • Domains are thought to be an important evolutionary unit, and domain boundaries are easier to determine from structural data than from sequence data • Criteria for assessing domain regions within a structure • The domain possesses a compact globular structure • Residues within a domain make more internal contacts than contacts to residues in the rest of the polypeptide • Secondary structure elements are usually not shared with other regions of the polypeptide • There is evidence for the existence of this region as an evolutionary unit
Structure classification hierarchy • Class level: proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations) • Mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β regions are segregated) • Architecture • The manner in which secondary structure elements are packed together (arrangement of sec. structures in 3D space) • Fold group (topology) • Orientation of sec. structures and the connectivity between them • Superfamily • Family
Protein Structure databases • PDB • Over 20,000 entries deduced from X-ray diffraction, NMR or modeling • Massively redundant • 1FMK, 1BK5, 2F9C, .. • SCOP (Structural Classification of Proteins) • Multi-domain protein is split into its constituent domains • Known structures are classified according to evolutionary and structural relationship • Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes • Family level – group together domains with clear sequence similarities • Superfamily – group of domains with structural and functional evidence for their descent from a common evolutionary ancestor • Fold – group of domains with the same major secondary structures and the same chain topology • Domains identified manually by visually inspecting structures
Protein Structure databases • SCOP (cont’d) • Proteins in the same superfamily often have the same function • CATH (Class, Architecture, Topology, Homology) • Homology – clustered domains with 35% sequence identity and shared common ancestry • 800 fold families, 10 of which are super-folds • 2009 www.cs.uml.edu/~kim/580/08_cath.pdf
Protein Function Prediction • In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function • The more similar the sequence, the more similar the function is likely to be • Not always true • Can clues to function be derived directly from 3D structure? • Definition of function • Function can be described at many levels: biochemical, biological processes, pathways, organ level • Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, .. • GO (Gene Ontology) scheme
Protein Function Prediction • Sequence-based – largely unreliable • Profile-based • Profiles are constructed from sequences of whole protein families, with families grouped by 3D structure or function (as in Pfam) • Start with sequences matched by an initial search, then iteratively pull in more remote homologues • More sensitive than simple sequence comparison because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable • Structure-based • Fold-based • Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein • Sometimes, the function of a protein changes during evolution while the fold remains unchanged • Thus, a fold match is not always reliable • Surface clefts and binding pockets