650 likes | 1.16k Views
Exploring 3D Molecular Structures Using NCBI Tools ... By Structure Similarity: VAST. By Conserved Function: RPS-BLAST and CDD. Finding a Structural Template for a Query Protein ...
E N D
Slide 1:Exploring 3D Molecular Structures Using NCBI Tools
A Field Guide June 17, 2004
Slide 2:NCBI Structure Resources
Overview of Structural Informatics at NCBI How 3D Macromolecular Structures are Determined Indexing Structural Data at NCBI Finding Homologous Structures By Sequence Similarity: BLAST By Structure Similarity: VAST By Conserved Function: RPS-BLAST and CDD Finding a Structural Template for a Query Protein
Slide 3:The National Center for Biotechnology Information
Created as a part of NLM in 1988 Establish public databases Perform research in computational biology Develop software tools for sequence analysis Disseminate biomedical information
Slide 4:Structural Informatics
Chemical Formula 3D Conformation Function ARKLMPQSCSW… Modifications Ions Ligands Binding Sites Catalytic Residues Kinetics Thermodynamics Substrates Intermediates Structure Dynamics Active States Folding
Slide 5:Structural Informatics
Chemical Formula 3D Conformation Function GenPept NCBI RefSeq SWISS-PROT PIR PRF Multiple Sequence Alignments: Pfam, SMART, COGs, CDD PDB
Slide 6:Structural Informatics at NCBI
Chemical Formula 3D Conformation Function GenPept NCBI RefSeq SWISS-PROT PIR PRF Multiple Sequence Alignments: Pfam, SMART, COGs, CDD PDB 4,818,495 25,003 11,382 103,820
Slide 7:The Entrez System
Entrez Nucleotide PubMed Protein Taxonomy Structure Domains 3D Domains Books Journals PMC OMIM UniSTS PopSet Genome SNP UniGene Gene GEO GEO Datasets MeSH
Slide 9:More About Resolution
1EJG: Crambin at 0.54 Å 2TMA: Tropomyosin at 15 Å protons!! only alpha carbons!! 3 Å “Temperature”
Slide 11:PDB
Slide 12:PDB File: Header
HEADER ISOMERASE/DNA 01-MAR-00 1EJ9 TITLE CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX COMPND MOL_ID: 1; COMPND 2 MOLECULE: DNA TOPOISOMERASE I; COMPND 3 CHAIN: A; COMPND 4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765; COMPND 5 EC: 5.99.1.2; COMPND 6 ENGINEERED: YES; COMPND 7 MUTATION: YES; COMPND 8 MOL_ID: 2; COMPND 9 MOLECULE: DNA (5'- COMPND 10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP* COMPND 11 TP*TP*TP*T)-3'); COMPND 12 CHAIN: C; COMPND 13 ENGINEERED: YES; COMPND 14 MOL_ID: 3; COMPND 15 MOLECULE: DNA (5'- COMPND 16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP* COMPND 17 TP*TP*TP*T)-3'); COMPND 18 CHAIN: D; COMPND 19 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS; SOURCE 3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM; SOURCE 4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS; SOURCE 5 MOL_ID: 2; SOURCE 6 SYNTHETIC: YES; SOURCE 7 MOL_ID: 3; SOURCE 8 SYNTHETIC: YES KEYWDS PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN REMARK 1 REMARK 2 REMARK 2 RESOLUTION. 2.60 ANGSTROMS. REMARK 3 REMARK 3 REFINEMENT. REMARK 3 PROGRAM : X-PLOR 3.1 REMARK 3 AUTHORS : BRUNGER … REMARK 280 REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20 REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT REMARK 290 ...
Slide 14:From PDB to Entrez
Structure 3D Domains Protein Domains
Slide 15:From Coordinates to Models
1EJ9: Human topoisomerase I
Slide 16:Building the Structure Summary
Taxonomy Pubmed Protein 3D Domains Domains Nucleotide
Slide 17:Indexing into MMDB
Structure Import only experimentally determined structures Convert to ASN.1 Verify sequences inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } , id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } , Add secondary structure Add chemical bonds Create “backbone” model (Ca, P only) Create single-conformer model
Slide 18:Structure Indexing
Entrez MMDB-ID MMDB entry date EC number Organism PDB Accession Release date Class Source Description Comment Ligands PDB code PDB name PDB description Literature Article title Author Journal Publication date Experimental Method Resolution Counters Ligand types Modified amino acids Modified nucleotides Modified ribonucleotides Protein chains DNA chains RNA chains topoisomerase AND 2[dnachaincount] AND human[organism]
Slide 19:Creating Sequence Records
Protein Nucleotide Nucleotide 1EJ9A 1EJ9C 1EJ9D One record per chain
Slide 20:Building the Structure Summary
Slide 21:Building the Structure Summary
Slide 22:Annotating Secondary Structure
1EJ9: Human topoisomerase I a-Helices ß-strands coils/loops
Slide 23:Creating 3D Domains
3D Domain 0: 1EJ9A0 = entire polypeptide
Slide 24:Creating 3D Domains
1EJ9A1 1EJ9A3 1EJ9A2 1EJ9A4 1EJ9A5 < 3 Secondary Structure Elements
Slide 25:Building the Structure Summary
Slide 26:Building the Structure Summary
Slide 27:3D Domain Indexing
Entrez SDI MMDB-ID Accession MMDB entry date Organism Domain number Cumulative number PDB Accession Release date Class Source Description Comment Literature Article title Author Publication date Counters Modified amino acids a-Helices ß-Strands Residues Molecular weight REMEMBER: 3D Domain 0 is the entire polypeptide chain! 4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism] Find all viral four helix bundles
Slide 28:Conserved Domains
Weakly conserved serine Active site serine Sequences Aligned by Function
Slide 29:Linking Sequence to Function
The PSSM Position Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3 Serine scored differently in these two positions Active site nucleophile
Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT COG SMART CDSlide 30:Entrez Domains (CDD v2.00)
HMM based models originally concentrating on eukaryotic signaling domains, now expanding BLAST based alignments derived from complete proteomes of prokaryotes NCBI curated domains based on sequence and structural alignments Pfam pfam01234 smart00123 cd01234 COG0123 NCBI NCBI Sanger EMBL Single Domains Protein Families A database of Position Specific Score Matrices (PSSMs)
Slide 31:CD-Search Output
CD SMART Pfam COG Click on a colored bar to align your sequence to the CD
Slide 32:CD Summary
Alignment view controls Cn3D launch PSSM created Aligned query
Slide 33:Building the Structure Summary
Slide 34:Building the Structure Summary
Cn3D
Slide 36:Links to CDs
CD-Search / RPS-BLAST 1EJ9A Query: protein sequence Database: PSSMs pre-computed in Entrez Protein Enter accession, GI, or FASTA sequence into RPS-BLAST
Slide 37:Finding Homologous Structures
By sequence similarity: BLAST By structural similarity: VAST By conserved function: CD-Search Entrez Protein
Slide 38:BLAST: Sequence Neighbors
BLAST Related Structures Displays a graphical and text alignment between a query sequence and a similar sequence with structure Accessed from Blink Any protein BLAST search ? GVKWKYLEHKGPVFAPPYDPLP GIKWKFLEHKGPVFAPPYEPLP
Slide 39:BLink Neighbors
EAA05377: ENSANGP00000011118 from A. gambiae Related Structures
Slide 40:Related Structures from BLASTp
Slide 41:Related Structures
Cn3D
Slide 42:VAST: Searching by Structure
Why search for similar structures? To find homologs that sequence searches cannot: distant protein homologs often conserve structure more strongly than sequence To explore protein evolution: similar protein folds can be used to support different functions To identify conserved core elements of a protein fold that can be used to model related proteins of unknown structure
Slide 43:VAST: Structure Neighbors
Vector Alignment Search Tool For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors. 1 2 3 4 5 6 Human IL-4 structure: 1RCBstructure: 1RCB
Slide 45:VAST: Calculate (rik, zik)
3 1 z For both the query and target structures, For each SSE k, set the origin at the midpoint of k. Then calculate rik and zik for the endpoints of SSEs i ? k. Vector position relative to the xy plane xy z13 r13 structure: 1RCBstructure: 1RCB
Slide 47:VAST: Refinement
Aligned residues are red Alignment extended to the end of this strand Ca atoms are added to the aligned SSEs Alignments are allowed to extend beyond SSE boundaries All atoms are added to the models, and the detailed backbone and sidechain positions are refined
Slide 48:VAST: Alignment of Sequence
Aligned blocks represent structural core elements Aligned blocks have no internal gaps Aligned residues occupy the same position in space Aligned residues are shown in CAPITAL letters Helix 1 Helix 2 Helix 3 Helix 4
Slide 49:VAST: Scoring
p = d P(s > s0, n) c(n, P1, P2) P(s > s0, n) Probability of observing an alignment of n SSEs with a score greater than s0 by chance. c(n, P1, P2) Search space: Number of possible alignments of n SSEs between vector sets P1 and P2. d Number of structures searched (set to 500) The probability that the VAST alignment occurred by chance.
Slide 51:Query by Chain vs 3D Domain
Query by whole chain Query by domain 5 Not found using whole chain query! c(n, P1, P2) is smaller for a 3D domain!
Slide 52:VAST: Multiple Alignments
Cn3D
Slide 53:nr-PDB Sets
Choose criteria for inclusion in a set Non-redundant set of sequence similar clusters VAST reports one representative from each cluster
Slide 54:Submitting a PDB File to VAST
Pick the correct file format Remove all records except ATOM This is the best way to convert PDB into MMDB format!
Slide 55:Blocks in CD Alignments
Alignment view controls Aligned query Cn3D launch Block 1 Block 2 Block 3 Consensus sequence created PSSM created
Slide 56:Curating CD Alignments
smart00235 VAST cd00203 Cn3D Cn3D
Slide 57:Curated CD Summary
List of annotated features Customized view of the selected feature in Cn3D Residues comprising the selected feature Cn3D
Slide 58:
CD-Curation: Effect on model alignment accuracy A. Marchler-Bauer
Slide 59:CDART
Only available for single domain records: cd, pfam, smart
Slide 60:Finding a Structural Template
Overall Strategy: For a query protein sequence, construct a block alignment representing conserved core SSEs of the most sequence similar structures to the query, and then align the query sequence to this template. Construct the block alignment Curated CD: Locate using CD-Search and use the sequences most similar to the query VAST: Find the most sequence similar structure and find its VAST neighbors Align the query to the template: Use Cn3D PSI-BLAST: Aligns sequence using PSSM of current alignment BLOCKER: Aligns sequence to an existing block alignment: use where sequence similarity is high Threader: Aligns sequence to a structure and a block alignment: use where sequence similarity is low
Slide 61:BLOCKER: The Block Aligner
PSSM Creates alignments that match the existing block structure Matches are scored from a PSSM generated from the block alignment An entire block must be matched with no internal gaps There are no penalties for gaps between blocks up to a set gap length Can perform both local and global alignments Generally used after BLAST or PSI-BLAST The Block Aligner tests the existing block structure
Slide 63:The NCBI Threader
L R L S L E Q L Q V I A I A N Input Structure Block alignment Sequence Attempts to find matches based on chemical contacts, mainly buried hydrophobic interactions Useful on blocks for which sequence alignment methods fail Should be iterated with varying block structures Cn3D
Slide 64:The Future
More curated CDs: they keep coming… Pre-computed Related Structures for all sequences in Entrez Protein CD “children”: subfamilies of large CD records based on sequence and structure similarity Improved mapping of SNP data onto 3D structures Further linking of structural and genomic biology
Slide 65:What comes next…
Workshop I Working with Structures Workshop II Working with Alignments All exercises and other resources will remain on the course web pages info@ncbi.nlm.nih.gov NCBI Handbook, Ch. 3