Sequence-Function Relationships



Presentation Transcript


    1. Sequence-Function Relationships Stuart M. Brown New York University School of Medicine

    2. Overview DNA Structure and Function Regulatory Sites in DNA Finding Genes in DNA Sequences RNA Structure Protein Structure and Function Protein Motifs

    3. Sequence Analysis on the Web Can analyze sequence using a mainframe (GCG), on a Mac/PC (MacVector, OMIGA, LaserGene, etc.) or with free tools on the Web Web tools are often best Available to everyone Constantly upgraded But not always available and subject to random change

    4. DNA Structure Primary = the sequence itself Secondary = double helix Tertiary = supercoiled, bent, etc. Quaternary = complexes with proteins Histones RNA Polymerase DNA binding proteins (transcription factors) Chromosome structure centromeres & telomeres

    6. DNA Information Content Just a 4 letter alphabet (GATC) Encodes proteins with 3 letter codons Punctuation determines transcription starts and stops Transcriptional regulation (promoters, enhancers, etc.) Determines its own replication

    7. Many DNA Regulatory Sequences are Known Databases of promoters, enhancers, etc. TransFac, the Transcription Factor database: 4342 entries w/ known protein binding and transcriptional regulatory functions Maintained by Gesellschaft für Biotechnologische Forschung mbH (Braunschweig, Germany) The Eukaryotic Promoter Database (EPD): Bucher & Trifonov. (1986) NAR 14: 10009-26 1314 entries taken directly from scientific literature Maintained by ISREC (Lausanne, Switzerland) as a subset of the EMBL

    8. Tools to find TF sites in DNA GCG: FINDPATTERNS with TFSITES.DAT Macintosh (Signal Scan), PC/UNIX (Promoter Scan) Dr. Dan S. Prestridge, Univ. of Minnesota

    9. TF Binding sites lack information Most TF binding sites are determined by just a few base pairs (typically 6) This is not enough information for proteins to locate unique promoters for each gene TF's bind cooperatively and combinatorially The key is in the location in relation to each other and to the transcription units of genes
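The point about limited information in a short site can be made concrete with a rough estimate of how often a fixed site occurs by chance. This is a minimal sketch, not a published method: it assumes the four bases are equally likely and independent, counts both strands, and uses an approximate human genome size of 3 billion bp (an assumed figure, not taken from the slides). The function name is illustrative.

```python
def expected_chance_hits(site_length, genome_size, both_strands=True):
    """Expected number of exact matches to one fixed site in random DNA,
    assuming the four bases are equally likely and independent."""
    p_match = 0.25 ** site_length            # chance of a match at one position
    positions = genome_size - site_length + 1
    strands = 2 if both_strands else 1       # ignores palindromic sites
    return positions * p_match * strands


GENOME_SIZE = 3_000_000_000  # assumption: approximate human genome size in bp

if __name__ == "__main__":
    for length in (6, 8, 12, 16):
        hits = expected_chance_hits(length, GENOME_SIZE)
        print(f"{length:2d} bp site: about {hits:,.0f} chance matches")
```

Under these assumptions a 6 bp site is expected more than a million times in a genome of that size, which is why the position of sites relative to each other and to the transcription unit matters more than any single match.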

    10. Websites for Promoter finding: Promoter Scan (NIH Bioinformatics, BIMAS) http://bimas.dcrt.nih.gov/molbio/proscan/ ; Promoter Scan II (Univ. of Minnesota & Axys Pharmaceuticals) http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm ; Signal Scan (NIH Bioinformatics, BIMAS) http://bimas.dcrt.nih.gov:80/molbio/signal/index.html ; Transcription Element Search (TESS) (Center for Bioinformatics, Univ. of Pennsylvania) http://www.cbil.upenn.edu/tess/ ; Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html ; TargetFinder (Telethon Inst. of Genetics and Medicine, Milan, Italy) http://hercules.tigem.it/TargetFinder.html

    11. Finding Genes in Genomic DNA Translate (in all 6 reading frames) and look for similarity to known protein sequences Translate and look for long Open Reading Frames (ORFs) between start and stop codons Look for known gene markers: TATA box, intron splice sites, etc. Statistical methods (codon preference)
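The ORF approach on this slide can be sketched in a few lines. This is a simplified illustration, not the algorithm used by ORFfinder or any other tool: it assumes the standard stop codons, requires an ATG start, and reports coordinates on whichever strand is being read. The names `find_orfs` and `min_codons` are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns tuples (strand, frame, start, end, codons). start and end are
    0-based offsets on the strand being read; end includes the stop codon;
    codons counts the coding codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
    print(find_orfs(demo, min_codons=3))
```

A scan like this is only a first pass. It knows nothing about introns, so it suits bacterial sequence or cDNA far better than eukaryotic genomic DNA.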

    12. Gene Finding on the Web: GRAIL (Oak Ridge Natl. Lab, Oak Ridge, TN) http://compbio.ornl.gov/grailexp ; ORFfinder (NCBI) http://www.ncbi.nlm.nih.gov/gorf/gorf.html ; DNA translation (Univ. of Minnesota Med. School) http://alces.med.umn.edu/webtrans.html ; GenLang http://cbil.humgen.upenn.edu/~sdong/genlang.html ; BCM GeneFinder (Baylor College of Medicine, Houston, TX) http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html and http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

    13. Genomic Sequence Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence This is where the transcription factor binding sites are located Search for known TF sites, and discover new ones (among co-regulated genes)

    14. Intron/Exon structure Gene finding programs work well in bacteria None of these gene prediction programs do an adequate job predicting intron/exon boundaries The only reasonable gene models are based on alignment of cDNAs to genome sequence Perhaps 50% of all human genes still do not have a correct coding sequence defined

    15. RNA Structure Similar to DNA - base pairing Smaller molecules, free to take on more complex shapes tRNA, ribozymes, self-splicing introns

    16. tRNA Structures

    17. RNA Information Content Primary structure (sequence) contains: Information for 3-D self-assembly Genetic code for amino acids in protein Translation start and stop signals Intron splicing signals Controls for RNA stability and transcription level

    18. RNA Secondary Structure Rules for base pairing and free energy minimization are known Characteristic tRNA stem-loop structures Michael Zuker created the computer program FoldRNA GCG, UNIX/Mac/PC freeware, in commercial products, and on the Web Can predict many RNA secondary structures, not necessarily the optimal or true structure
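To give a feel for how base-pairing rules become a prediction, here is a sketch of Nussinov-style dynamic programming. Note the substitution: it maximizes the number of nested base pairs rather than minimizing free energy as Zuker's programs do, so it is a simpler stand-in and not the FoldRNA algorithm. Allowing G-U wobble pairs and a minimum hairpin loop of 3 bases are assumptions of this sketch.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA string.

    Nussinov-style dynamic programming. best[i][j] holds the best count
    for the subsequence rna[i..j]; every hairpin must leave at least
    `min_loop` unpaired bases between the two paired positions.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case: j stays unpaired
            for k in range(i, j - min_loop):       # case: j pairs with k
                if (rna[k], rna[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCCC"))
```

As the slide cautions for real folding programs, the result is one optimal solution under the scoring scheme, not necessarily the true structure.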

    19. Protein Sequence Analysis Molecular properties (pH, mol. wt., isoelectric point, hydrophobicity) Secondary Structure Super-secondary (signal peptide, coiled-coil, trans-membrane, etc.) 3-D prediction, Threading Domains, motifs, etc.

    20. Self-assembly Proteins self-assemble in solution All of the information necessary to determine the complex 3-D structure is in the amino acid sequences Structure determines function lock & key model of enzyme function Know the sequence, know the function? Nearly infinite complexity

    21. Structure prediction Protein Structure prediction is the Holy Grail of bioinformatics Since structure = function, then structure prediction should allow protein design, design of inhibitors, etc. Huge amounts of genome data - what are the functions of all of these proteins?

    22. Chemical Properties of Proteins Proteins are linear polymers of 20 amino acids Chemical properties of the protein are determined by its amino acids Molecular wt., pH, isoelectric point are simple calculations from amino acid composition Hydrophobicity is a property of groups of amino acids - best examined as a graph
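Because hydrophobicity is a property of groups of residues, it is usually computed as a sliding-window average and then plotted. The sketch below uses the Kyte-Doolittle hydropathy scale, which is one common choice and an assumption here, since the slides do not name a scale. The window size of 9 is likewise an illustrative default.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each window of `window` consecutive residues.

    Returns one value per window start position, so a protein of length n
    gives n - window + 1 values (an empty list if it is shorter).
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    demo = "MKRDEEKLLIVVAALLIVGGAFLLKRDE"
    for position, value in enumerate(hydropathy_profile(demo)):
        print(position, round(value, 2))
```

Plotting the returned values against position gives the kind of hydrophobicity graph the slide describes. Long stretches of high values are candidate membrane-spanning segments.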

    23. Hydrophobicity Plot

    24. Web Sites for Simple Protein Analysis Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science , Israel http://bioinformatics.weizmann.ac.il/hydroph/ SAPS - statistical analysis of protein sequences: composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity http://www.isrec.isb-sib.ch/software/SAPS_form.html

    25. Secondary Structure Protein secondary structure takes one of three forms: Alpha helix Beta pleated sheet Turn 2ndary structure is predicted within a small window Many different algorithms, not highly accurate Better predictions from a multiple alignment

    26. GCG Protein Analysis Toolkit Isoelectric: plots aa charge as a function of pH PeptideStructure: secondary structure predictions PlotStructure: plots protein secondary structure PepPlot: plots protein secondary structure and hydrophobicity in parallel panels Moment: makes a contour plot of the helical hydrophobic moment HelicalWheel: plots a peptide sequence as a helical wheel to help you recognize alpha-helical regions.

    27. Structure Prediction on the Web Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html BCM Search Launcher: Protein Secondary Structure Prediction: Baylor College of Medicine http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html PREDATOR: EMBL, Heidelberg http://www.embl-heidelberg.de/cgi/predator_serv.pl UCLA-DOE Protein Fold Recognition Server http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html

    28. Sample Structure Prediction

    29. Super-secondary Structure Common structural motifs Membrane spanning (GCG= TransMem) Signal peptide (GCG= SPScan) Coiled coil (GCG= CoilScan) Helix-turn-helix (GCG = HTHScan)

    30. Web servers that predict these structures: Predict Protein server (EMBL Heidelberg) http://www.embl-heidelberg.de/predictprotein/ ; SOSUI (Tokyo Univ. of Ag. & Tech., Japan) http://www.tuat.ac.jp/~mitaku/adv_sosui/submit.html ; TMpred, transmembrane prediction (ISREC, Swiss Institute for Experimental Cancer Research) http://www.isrec.isb-sib.ch/software/TMPRED_form.html ; COILS, coiled coil prediction (ISREC) http://www.isrec.isb-sib.ch/software/COILS_form.html ; SignalP, signal peptides (Tech. Univ. of Denmark) http://www.cbs.dtu.dk/services/SignalP/

    31. 3-D Structure Cannot be accurately predicted from sequence alone (known as ab initio) Levinthal's paradox: a 100 aa protein has 3^200 possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers There are perhaps only a few hundred basic structures, but we don't yet have this vocabulary or the ability to recognize variants on a theme
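The scale of Levinthal's paradox is easy to check with integer arithmetic. This sketch reads the figure on the slide as 3 to the power 200, that is, three states for each of two backbone angles per residue; that derivation is an assumption about how the number was obtained. The sampling rate of 10^15 conformations per second is a deliberately generous hypothetical, not a measured figure.

```python
RESIDUES = 100
STATES_PER_ANGLE = 3      # assumption: three allowed states per backbone angle
ANGLES_PER_RESIDUE = 2    # phi and psi

# Total backbone configurations if every angle varies independently.
conformations = STATES_PER_ANGLE ** (ANGLES_PER_RESIDUE * RESIDUES)

SAMPLES_PER_SECOND = 10 ** 15            # hypothetical exhaustive-search rate
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years_needed = conformations // (SAMPLES_PER_SECOND * SECONDS_PER_YEAR)

if __name__ == "__main__":
    print(f"configurations: about 10^{len(str(conformations)) - 1}")
    print(f"years to enumerate: about 10^{len(str(years_needed)) - 1}")
```

Even at that rate, enumerating every configuration would take vastly longer than the age of the universe, so exhaustive search is not an option.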

    32. Threading Protein Structures Best bet is to compare with similar sequences that have known structures >> Threading Only works for proteins with >25% sequence similarity to a protein with known structure Current state of the art requires many days of computing on a dedicated workstation Some websites offer quick approximations Will improve as more 3-D structures are described Another aspect of the Genome Project

    33. Predicted Structure

    34. Protein Data Bank There is a database of all known protein structures called the PDB. These have been determined by X-ray crystallography and/or NMR. Anyone can download and view these structures with a PDB viewer program.

    35. RasMol RasMol is the simplest PDB viewer. http://www.umass.edu/microbio/rasmol/ It can work together with a web browser to let you view the structure of any sequence found with Entrez that has a known 3-D structure.

    36. Websites for 3-D structure prediction UCLA-DOE Protein Fold Recognition http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html SwissModel: ExPASy, Univ. of Geneva http://www.expasy.ch/swissmod/SWISS-MODEL.html CPHmodels: Technical Univ. of Denmark http://www.cbs.dtu.dk/services/CPHmodels/

    37. Searching for Patterns in Proteins

    38. Protein Domains/Motifs Proteins are built out of functional units known as domains (or motifs) These domains have conserved sequences Often much more similar than their respective proteins Exon splicing theory (W. Gilbert) Exons correspond to folding domains which in turn serve as functional units Unrelated proteins may share a single similar exon (e.g., ATPase or DNA binding function)

    39. Protein Motif Databases Known protein motifs have been collected in databases Best database is PROSITE The Dictionary of Protein Sites and Patterns maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland contains a comprehensive list of documented protein domains constructed by expert molecular biologists.

    40. PROSITE is based on Patterns Each domain is defined by a simple pattern Patterns can have alternate amino acids in each position and defined spaces, but no gaps Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
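Since a PROSITE pattern is an exact-match description, it maps almost directly onto a regular expression. The sketch below converts the common pattern elements (`x` for any residue, `[...]` for alternatives, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) and scans a protein. The function names are illustrative, the demo protein is made up, and less common syntax is not handled. The example pattern is the N-glycosylation site, `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    regex = ""
    for element in pattern.split("-"):
        # Excluded residues {P} become a negated character class [^P].
        element = element.replace("{", "[^").replace("}", "]")
        # Repeat counts x(2,4) become regex quantifiers .{2,4}.
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        # Anchors for the N-terminal and C-terminal ends.
        element = element.replace("<", "^").replace(">", "$")
        regex += element
    return regex


def find_sites(pattern, protein):
    """Return (start, matched_text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "MKNATLNPSA"))
```

The second asparagine in the demo sequence is followed by proline, so it is skipped. This shows the behavior the slide warns about: a variant that breaks the pattern at a single position is not found at all.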

    41. Tools for PROSITE searches Free Mac program: MacPattern ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx Free PC program (DOS): PATMAT ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos GCG provides the program MOTIFS Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.

    42. Websites for PROSITE Searches: ScanProsite at ExPASy (Univ. of Geneva) http://expasy.hcuge.ch/sprot/scnpsit1.html ; Network Protein Sequence Analysis (Institut de Biologie et Chimie des Protéines, Lyon, France) http://pbil.ibcp.fr/NPSA/npsa_prosite.html ; PPSRCH (EBI, Cambridge, UK) http://www2.ebi.ac.uk/ppsearch/

    43. Profiles Profiles are tables of amino acid frequencies at each position in a motif They are built from multiple alignments PROSITE entries also contain profiles built from an alignment of proteins that match the pattern Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
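A profile can be sketched as a per-column table of log-odds scores built from an alignment. The code below is a simplified illustration under stated assumptions: the alignment has no gap characters, the background is uniform over the 20 amino acids, and a pseudocount of 1 is added to every residue. Unlike real profile searching, which uses an alignment algorithm and allows gaps, this sketch only slides the profile along the sequence without gaps. The function names and the toy alignment are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a list of per-column log-odds tables from aligned sequences.

    Assumes equal-length sequences with no gap characters and a uniform
    background frequency for the 20 amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(len(alignment[0])):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, protein):
    """Return (score, start) of the best ungapped match to the profile."""
    width = len(profile)
    scored = [
        (sum(profile[i][aa] for i, aa in enumerate(protein[s:s + width])), s)
        for s in range(len(protein) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDE", "ACDF"]
    score, start = best_window(build_profile(toy_alignment), "GGGACDEGGG")
    print(round(score, 2), start)
```

Because every residue gets some score at every position, a sequence that differs from all the aligned examples can still score well. That is the extra sensitivity over exact pattern matching that the slide refers to.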

    44. GCG ProfileSearch GCG has a set of profile analysis tools. Start with a multiple alignment Create a profile with ProfileMake ProfileSearch scans a database with your profile ProfileSegments displays alignments between a profile and matching database sequences ProfileGap makes pairwise alignments between a single sequence and a profile

    45. Websites for Profile searching PROSITE ProfileScan: ExPASy, Geneva http://www.isrec.isb-sib.ch/software/PFSCAN_form.html BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA http://www.blocks.fhcrc.org/blocks_search.html PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases): http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

    46. More Protein Motif Databases PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis http://pfam.wustl.edu/hmmsearch.shtml ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France http://www.toulouse.inra.fr/prodom/doc/blast_form.html [This is my favorite protein database - nicely colored results]

    47. Hidden Markov Models Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis. Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next. Pfam is built with HMMs. GCG version 10.2 (released March 2001) has added a bunch of HMM tools (and Pfam).

    48. Sample ProDom Output

    49. Discovery of new Motifs All of the tools discussed so far rely on a database of existing domains/motifs How to discover new motifs Start with a set of related proteins Make a multiple alignment Build a pattern or profile You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.

    50. Patterns in Unaligned Sequences Sometimes sequences may share just a small common region common signal peptide new transcription factors MEME: San Diego Supercomputing Facility http://www.sdsc.edu/MEME/meme/website/meme.html - GCG also includes the MEME program

    51. Summary DNA has genes and other information Transcription factors RNA has predictable structures Proteins have predictable 2ndary structures and functional domains, but generally can't predict new 3-D structures