Sequence-Function Relationships

1. Sequence-Function Relationships
Stuart M. Brown
New York University School of Medicine

2. Overview
DNA Structure and Function
Regulatory Sites in DNA
Finding Genes in DNA Sequences
RNA Structure
Protein Structure and Function
Protein Motifs

3. Sequence Analysis on the Web
Can analyze sequence using a mainframe (GCG), on a Mac/PC (MacVector, OMIGA, LaserGene, etc.) or with free tools on the Web
Web tools are often best
  Available to everyone
  Constantly upgraded
  But not always available and subject to random change

4. DNA Structure
Primary = the sequence itself
Secondary = double helix
Tertiary = supercoiled, bent, etc.
Quaternary = complexes with proteins
  Histones
  RNA Polymerase
  DNA binding proteins (transcription factors)
Chromosome structure
  centromeres & telomeres

6. DNA Information Content
Just a 4 letter alphabet (GATC)
Encodes proteins with 3 letter codons
Punctuation determines transcription starts and stops
Transcriptional regulation (promoters, enhancers, etc.)
Determines its own replication

7. Many DNA Regulatory Sequences are Known
Databases of promoters, enhancers, etc.
TransFac, the Transcription Factor database
  4342 entries w/ known protein binding and transcriptional regulatory functions
  Maintained by Gesellschaft für Biotechnologische Forschung mbH (Braunschweig, Germany)
The Eukaryotic Promoter Database (EPD)
  Bucher & Trifonov (1986) NAR 14: 10009-26
  1314 entries taken directly from scientific literature
  Maintained by ISREC (Lausanne, Switzerland) as a subset of the EMBL

8. Tools to find TF sites in DNA
GCG: FINDPATTERNS with TFSITES.DAT
Macintosh (Signal Scan), PC/UNIX (Promoter Scan)
  Dr. Dan S. Prestridge, Univ. of Minnesota

9. TF Binding sites lack information
Most TF binding sites are determined by just a few base pairs (typically 6)
This is not enough information for proteins to locate unique promoters for each gene
TFs bind cooperatively and combinatorially
The key is in the location of the sites in relation to each other and to the transcription units of genes
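A rough calculation shows why a 6 bp site carries too little information. The sketch below assumes random DNA with equal base frequencies and independent positions, and uses a round figure of 3 billion bp for the human genome; both are simplifying assumptions, not values from the slides, and the function name is illustrative only.

```python
def expected_random_hits(site_length, genome_length, both_strands=True):
    """Expected number of chance matches to one fixed site in random DNA.

    Assumes each base occurs with probability 1/4, independently.
    """
    probability_per_position = 0.25 ** site_length
    positions = genome_length - site_length + 1
    strands = 2 if both_strands else 1
    return probability_per_position * positions * strands


# Assumption: approximate human genome size, used only for scale.
HUMAN_GENOME_BP = 3_000_000_000

hits = expected_random_hits(6, HUMAN_GENOME_BP)
print(f"A 6 bp site is expected about {hits:,.0f} times by chance alone")
```

A 6 bp site matches once every 4^6 = 4096 positions on average, so a single site cannot mark a unique promoter; specificity has to come from combinations of sites and their spacing.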

10. Websites for Promoter finding
Promoter Scan: NIH Bioinformatics (BIMAS)
  http://bimas.dcrt.nih.gov/molbio/proscan/
Promoter Scan II: Univ. of Minnesota & Axyx Pharmaceuticals
  http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
Signal Scan: NIH Bioinformatics (BIMAS)
  http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania
  http://www.cbil.upenn.edu/tess/
Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP
  http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
TargetFinder: Telethon Inst. of Genetics and Medicine, Milan, Italy
  http://hercules.tigem.it/TargetFinder.html

11. Finding Genes in Genomic DNA
Translate (in all 6 reading frames) and look for similarity to known protein sequences
Translate and look for long Open Reading Frames (ORFs) between start and stop codons
Look for known gene markers
  TATA box, intron splice sites, etc.
Statistical methods (codon preference)
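The six-frame translation and ORF search described on this slide can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not the method of any tool named in these slides; the function names and the `min_codons` cutoff are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, protein) for each ATG...stop ORF in all 6 frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Every piece except the last one is terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_codons:
                    orfs.append((strand, frame, segment[start:]))
    return orfs


for strand, frame, protein in find_orfs("CCATGAAATTTGGGTAACC"):
    print(strand, frame, protein)
```

On real genomic DNA the cutoff would be far larger (hundreds of codons), and in eukaryotes introns break ORFs apart, which is why this simple approach works much better in bacteria.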

12. Gene Finding on the Web
GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
  http://compbio.ornl.gov/grailexp
ORFfinder: NCBI
  http://www.ncbi.nlm.nih.gov/gorf/gorf.html
DNA translation: Univ. of Minnesota Med. School
  http://alces.med.umn.edu/webtrans.html
GenLang
  http://cbil.humgen.upenn.edu/~sdong/genlang.html
BCM GeneFinder: Baylor College of Medicine, Houston, TX
  http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
  http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

13. Genomic Sequence
Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence
This is where the transcription factor binding sites are located
Search for known TF sites, and discover new ones (among co-regulated genes)

14. Intron/Exon structure
Gene finding programs work well in bacteria
In eukaryotes, none of these gene prediction programs do an adequate job predicting intron/exon boundaries
The only reasonable gene models are based on alignment of cDNAs to genome sequence
Perhaps 50% of all human genes still do not have a correct coding sequence defined

15. RNA Structure
Similar to DNA - base pairing
Smaller molecules, free to take on more complex shapes
tRNA, ribozymes, self-splicing introns

16. tRNA Structures

17. RNA Information Content
Primary structure (sequence) contains:
  Information for 3-D self-assembly
  Genetic code for amino acids in protein
  Translation start and stop signals
  Intron splicing signals
  Controls for RNA stability and transcription level

18. RNA Secondary Structure
Rules for base pairing and free energy minimization are known
Characteristic tRNA stem-loop structures
Michael Zuker created the computer program FoldRNA
  GCG, UNIX/Mac/PC freeware, in commercial products, and on the Web
Can predict many RNA secondary structures, not necessarily the optimal or "true" structure
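To give a feel for how base-pairing rules turn into a prediction, here is a deliberately simplified sketch. It is not Zuker's free energy minimization: it uses Nussinov-style dynamic programming, which only maximizes the number of base pairs (Watson-Crick plus G-U wobble). The `min_loop` value of 3 unpaired bases in a hairpin and the function name are assumptions for the example.

```python
# Allowed pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style, no energies)."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the best pair count for the subsequence rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving a loop of >= min_loop
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # a small hairpin: stem of 3 pairs, AAA loop
```

Real folding programs score stacking, loops, and bulges with measured free energies, so their answer can differ from the maximum-pairing structure found here.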

19. Protein Sequence Analysis
Molecular properties (pH, mol. wt., isoelectric point, hydrophobicity)
Secondary Structure
Super-secondary (signal peptide, coiled-coil, trans-membrane, etc.)
3-D prediction, Threading
Domains, motifs, etc.

20. Self-assembly
Proteins self-assemble in solution
All of the information necessary to determine the complex 3-D structure is in the amino acid sequences
Structure determines function
  lock & key model of enzyme function
Know the sequence, know the function?
Nearly infinite complexity

21. Structure prediction
Protein Structure prediction is the "Holy Grail" of bioinformatics
Since structure = function, then structure prediction should allow protein design, design of inhibitors, etc.
Huge amounts of genome data - what are the functions of all of these proteins?

22. Chemical Properties of Proteins
Proteins are linear polymers of 20 amino acids
Chemical properties of the protein are determined by its amino acids
Molecular wt., pH, isoelectric point are simple calculations from amino acid composition
Hydrophobicity is a property of groups of amino acids - best examined as a graph
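Because hydrophobicity is a property of groups of residues, it is usually computed as a sliding-window average along the sequence. The sketch below assumes the Kyte-Doolittle hydropathy scale and a 9-residue window; the slides do not say which scale or window the plotted example uses, and the test peptide is made up for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each window of consecutive residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical peptide: charged ends around a hydrophobic core.
peptide = "KKDEKRLLIVAVLLIVGAKKDE"
for position, value in enumerate(hydropathy_profile(peptide), start=1):
    bar = "#" * max(0, round(value * 4))
    print(f"{position:3d} {value:6.2f} {bar}")
```

Plotting the returned values against window position gives the familiar hydrophobicity plot; sustained peaks of roughly 20 residues are the usual hint of a membrane-spanning segment.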

23. Hydrophobicity Plot

24. Web Sites for Simple Protein Analysis
Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science, Israel
  http://bioinformatics.weizmann.ac.il/hydroph/
SAPS (statistical analysis of protein sequences): composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity
  http://www.isrec.isb-sib.ch/software/SAPS_form.html

25. Secondary Structure
Protein secondary structure takes one of three forms:
  Alpha helix
  Beta pleated sheet
  Turn
Secondary structure is predicted within a small window
Many different algorithms, not highly accurate
Better predictions from a multiple alignment

26. GCG Protein Analysis Toolkit
Isoelectric: plots aa charge as a function of pH
PeptideStructure: secondary structure predictions
PlotStructure: plots protein secondary structure
PepPlot: plots protein secondary structure and hydrophobicity in parallel panels
Moment: makes a contour plot of the helical hydrophobic moment
HelicalWheel: plots a peptide sequence as a helical wheel to help you recognize alpha-helical regions

27. Structure Prediction on the Web
Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg
  http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
BCM Search Launcher: Protein Secondary Structure Prediction: Baylor College of Medicine
  http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
PREDATOR: EMBL, Heidelberg
  http://www.embl-heidelberg.de/cgi/predator_serv.pl
UCLA-DOE Protein Fold Recognition Server
  http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html

28. Sample Structure Prediction

29. "Super-secondary" Structure
Common structural motifs
  Membrane spanning (GCG = TransMem)
  Signal peptide (GCG = SPScan)
  Coiled coil (GCG = CoilScan)
  Helix-turn-helix (GCG = HTHScan)

30. Web servers that predict these structures
Predict Protein server: EMBL Heidelberg
  http://www.embl-heidelberg.de/predictprotein/
SOSUI: Tokyo Univ. of Ag. & Tech., Japan
  http://www.tuat.ac.jp/~mitaku/adv_sosui/submit.html
TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research)
  http://www.isrec.isb-sib.ch/software/TMPRED_form.html
COILS (coiled coil prediction): ISREC
  http://www.isrec.isb-sib.ch/software/COILS_form.html
SignalP (signal peptides): Tech. Univ. of Denmark
  http://www.cbs.dtu.dk/services/SignalP/

31. 3-D Structure
Cannot be accurately predicted from sequence alone (known as ab initio)
Levinthal's paradox: a 100 aa protein has 3^200 possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers
There are perhaps only a few hundred basic structures, but we don't yet have this vocabulary or the ability to recognize variants on a theme
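The scale of Levinthal's paradox is easy to check with arithmetic. The sketch below reads the slide's figure as 3 to the power 200 (three states for each of roughly 200 backbone angles in a 100 aa protein). The sampling rate of one trillion conformations per second and the age of the universe are round-number assumptions chosen for illustration.

```python
CONFORMATIONS = 3 ** 200           # slide's figure, read as 3 to the power 200
RATE_PER_SECOND = 10 ** 12         # assumption: one trillion conformations/s
SECONDS_PER_YEAR = 365 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.4e10     # approximate, for comparison only


def years_to_enumerate(conformations, rate_per_second):
    """Years needed to visit every conformation once at a fixed rate."""
    return conformations / rate_per_second / SECONDS_PER_YEAR


years = years_to_enumerate(CONFORMATIONS, RATE_PER_SECOND)
print(f"3^200 has {len(str(CONFORMATIONS))} decimal digits")
print(f"Exhaustive search would take about {years:.1e} years")
print(f"That is {years / AGE_OF_UNIVERSE_YEARS:.1e} times the age of the universe")
```

Faster hardware changes the exponent only slightly, which is why practical methods lean on known structures rather than exhaustive search.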

32. Threading Protein Structures
Best bet is to compare with similar sequences that have known structures >> Threading
Only works for proteins with >25% sequence similarity to a protein with known structure
Current state of the art requires many days of computing on a dedicated workstation
  Some websites offer quick approximations
Will improve as more 3-D structures are described
  Another aspect of the Genome Project

33. Predicted Structure

34. Protein Data Base
There is a database of all known protein structures called the PDB.
These have been determined by X-ray crystallography and/or NMR.
Anyone can download and view these structures with a PDB viewer program.

35. RasMol
RasMol is the simplest PDB viewer.
  http://www.umass.edu/microbio/rasmol/
It can work together with a web browser to let you view the structure of any sequence found with Entrez that has a known 3-D structure.

36. Websites for 3-D structure prediction
UCLA-DOE Protein Fold Recognition
  http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
SwissModel: ExPASy, Univ. of Geneva
  http://www.expasy.ch/swissmod/SWISS-MODEL.html
CPHmodels: Technical Univ. of Denmark
  http://www.cbs.dtu.dk/services/CPHmodels/

37. Searching for Patterns in Proteins

38. Protein Domains/Motifs
Proteins are built out of functional units known as domains (or motifs)
These domains have conserved sequences
  Often much more similar than their respective proteins
Exon splicing theory (W. Gilbert)
  Exons correspond to folding domains which in turn serve as functional units
  Unrelated proteins may share a single similar exon (e.g. ATPase or DNA binding function)

39. Protein Motif Databases
Known protein motifs have been collected in databases
Best database is PROSITE
  The Dictionary of Protein Sites and Patterns
  maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
  contains a comprehensive list of documented protein domains constructed by expert molecular biologists

40. PROSITE is based on Patterns
Each domain is defined by a simple pattern
Patterns can have alternate amino acids in each position and defined spaces, but no gaps
Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
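PROSITE patterns map almost directly onto regular expressions, which makes the "exact matching" behavior easy to see. The sketch below converts the common pattern elements (`x`, `[..]`, `{..}`, `(n)` or `(n,m)` repeats, `<` and `>` anchors) and scans a sequence. The function names and the test sequence are made up for this example; the pattern shown is the well-known N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string to a Python regex string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
    regex = regex.replace("x", ".")                      # x    -> any residue
    regex = regex.replace("<", "^").replace(">", "$")    # terminus anchors
    return regex.replace("-", "")


def scan(pattern, sequence):
    """Return (1-based position, matched text) for all matches, overlaps included."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}."
print(prosite_to_regex(N_GLYCOSYLATION))
print(scan(N_GLYCOSYLATION, "AANGSAANPSA"))
```

The second asparagine in the test sequence is followed by proline, so it is not reported: a single disallowed residue is enough to lose the match, which is the weakness of pattern searching noted on this slide.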

41. Tools for PROSITE searches
Free Mac program: MacPattern
  ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
Free PC program (DOS): PATMAT
  ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
GCG provides the program MOTIFS
Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.

42. Websites for PROSITE Searches
ScanProsite at ExPASy: Univ. of Geneva
  http://expasy.hcuge.ch/sprot/scnpsit1.html
Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France
  http://pbil.ibcp.fr/NPSA/npsa_prosite.html
PPSRCH: EBI, Cambridge, UK
  http://www2.ebi.ac.uk/ppsearch/

43. Profiles
Profiles are tables of amino acid frequencies at each position in a motif
They are built from multiple alignments
PROSITE entries also contain profiles built from an alignment of proteins that match the pattern
Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
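A profile can be sketched as one table of scores per alignment column. The example below builds log-odds scores from a toy alignment and slides the profile along a sequence. It is a simplified illustration only: it assumes a uniform background of 1/20 per amino acid, adds a pseudocount of 1, and does not allow gaps, unlike the real profile search tools described here. The alignment and sequence are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One dict of log-odds scores per column of an equal-length alignment."""
    background = 1 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Return (score, 0-based start) of the best ungapped match."""
    width = len(profile)
    return max(
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )


toy_alignment = ["GKSGT", "GKTGS", "GKSGS"]  # hypothetical motif instances
profile = build_profile(toy_alignment)
score, start = best_window(profile, "AAAGKTGTAAA")
print(f"best match starts at residue {start + 1} with score {score:.2f}")
```

The matched window, GKTGT, is not identical to any sequence in the alignment, yet it scores well because each residue was seen in its column. That tolerance for new combinations is what makes profiles more sensitive than exact patterns.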

44. GCG ProfileSearch
GCG has a set of profile analysis tools.
Start with a multiple alignment
Create a profile with ProfileMake
ProfileSearch scans a database with your profile
ProfileSegments displays alignments between a profile and matching database sequences
ProfileGap makes pairwise alignments between a single sequence and a profile

45. Websites for Profile searching
PROSITE ProfileScan: ExPASy, Geneva
  http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  http://www.blocks.fhcrc.org/blocks_search.html
PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases):
  http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

46. More Protein Motif Databases
PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis
  http://pfam.wustl.edu/hmmsearch.shtml
ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France
  http://www.toulouse.inra.fr/prodom/doc/blast_form.html
  [This is my favorite protein database - nicely colored results]

47. Hidden Markov Models
Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.
Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next.
Pfam is built with HMMs.
GCG version 10.2 (released March 2001) has added a bunch of HMM tools (and Pfam).

48. Sample ProDom Output

49. Discovery of new Motifs
All of the tools discussed so far rely on a database of existing domains/motifs
How to discover new motifs:
  Start with a set of related proteins
  Make a multiple alignment
  Build a pattern or profile
You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.

50. Patterns in Unaligned Sequences
Sometimes sequences may share just a small common region
  common signal peptide
  new transcription factors
MEME: San Diego Supercomputing Facility
  http://www.sdsc.edu/MEME/meme/website/meme.html
  GCG also includes the MEME program

51. Summary
DNA has genes and other information
  Transcription factors
RNA has predictable structures
Proteins have predictable secondary structures and functional domains, but we generally can't predict new 3-D structures
