1. Sequence-Function Relationships Stuart M. Brown
New York University School of Medicine
2. Overview DNA Structure and Function
Regulatory Sites in DNA
Finding Genes in DNA Sequences
RNA Structure
Protein Structure and Function
Protein Motifs
3. Sequence Analysis on the Web Can analyze sequence using a mainframe (GCG), on a Mac/PC (MacVector, OMIGA, LaserGene, etc.) or with free tools on the Web
Web tools are often best
Available to everyone
Constantly upgraded
But not always available and subject to random change
4. DNA Structure Primary = the sequence itself
Secondary = double helix
Tertiary = supercoiled, bent, etc.
Quaternary = complexes with proteins
Histones
RNA Polymerase
DNA binding proteins (transcription factors)
Chromosome structure
centromeres & telomeres
6. DNA Information Content Just a 4 letter alphabet (GATC)
Encodes proteins with 3 letter codons
Punctuation determines transcription starts and stops
Transcriptional regulation (promoters, enhancers, etc.)
Determines its own replication
7. Many DNA Regulatory Sequences are Known Databases of promoters, enhancers, etc.
TransFac the Transcription Factor database
4342 entries w/ known protein binding and transcriptional regulatory functions
Maintained by Gesellschaft für Biotechnologische Forschung mbH (Braunschweig, Germany)
The Eukaryotic Promoter Database (EPD)
Bucher & Trifonov. (1986) NAR 14: 10009-26
1314 entries taken directly from scientific literature
Maintained by ISREC (Lausanne, Switzerland) as a subset of the EMBL
8. Tools to find TF sites in DNA GCG: FINDPATTERNS with TFSITES.DAT
Macintosh (Signal Scan), PC/UNIX (Promoter Scan)
Dr. Dan S. Prestridge, Univ. of Minnesota
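The idea behind these tools, scanning a sequence for short consensus sites on both strands, can be sketched in a few lines of Python. This is only an illustration of the approach, not the code of FINDPATTERNS or Signal Scan. The function names and the site `TATAWA` are hypothetical examples, not entries taken from TFSITES.DAT.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement of every IUPAC code (R <-> Y, K <-> M, B <-> V, D <-> H).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(site):
    """Reverse complement of a site written with IUPAC codes."""
    return site.upper().translate(COMPLEMENT)[::-1]


def find_sites(dna, site):
    """Return sorted (position, strand) hits; positions are 0-based on the + strand."""
    dna = dna.upper()
    hits = []
    for strand, pattern in (("+", site.upper()), ("-", reverse_complement(site))):
        regex = "".join(IUPAC[base] for base in pattern)
        # A lookahead is used so that overlapping matches are also reported.
        for match in re.finditer(f"(?=({regex}))", dna):
            hits.append((match.start(), strand))
    return sorted(hits)


# Hypothetical example sequence and site.
print(find_sites("GGTATAAAGGCCTTTATACC", "TATAWA"))
```

A palindromic site is reported once per strand at the same position, which is usually what you want when counting candidate binding sites.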
9. TF Binding sites lack information Most TF binding sites are determined by just a few base pairs (typically 6)
This is not enough information for proteins to locate unique promoters for each gene
TFs bind cooperatively and combinatorially
The key is in the location in relation to each other and to the transcription units of genes
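A rough calculation shows why 6 base pairs is not enough. The sketch below assumes a random sequence with equal base frequencies and a round figure of 3 billion bp for the human genome. Both are simplifying assumptions, and the function name is hypothetical.

```python
def expected_occurrences(site_length, genome_length, both_strands=True):
    """Expected number of chance matches to one fixed site in a random sequence."""
    p_match = 0.25 ** site_length               # probability of a match at one position
    positions = genome_length - site_length + 1  # number of possible start positions
    strands = 2 if both_strands else 1
    return p_match * positions * strands


genome = 3_000_000_000  # assumed round figure for the human genome, in bp

print(f"A 6 bp site occurs by chance once every {4 ** 6} bp")
print(f"Expected chance matches in the genome: {expected_occurrences(6, genome):,.0f}")
```

Under these assumptions a single 6 bp site is expected well over a million times in the genome, far more often than there are genes, which is why the combination and spacing of sites carries the real signal.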
10. Websites for Promoter finding Promoter Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov/molbio/proscan/
Promoter Scan II: Univ. of Minnesota & Axys Pharmaceuticals
http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
Signal Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania
http://www.cbil.upenn.edu/tess/
Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP
http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
TargetFinder: Telethon Inst.of Genetics and Medicine, Milan, Italy
http://hercules.tigem.it/TargetFinder.html
11. Finding Genes in Genomic DNA Translate (in all 6 reading frames) and look for similarity to known protein sequences
Translate and look for long Open Reading Frames (ORFs) between start and stop codons
Look for known gene markers
TATA box, intron splice sites, etc.
Statistical methods (codon preference)
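The second approach, translating in all 6 reading frames and looking for ORFs, can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and ATG as the only start codon. It is not the method used by ORFfinder or any other tool listed here, and the function names and `min_codons` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; '*' is a stop, unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, protein) for each ATG...stop ORF in all 6 frames.

    frame is the 0-based offset within the strand being read.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            start = None
            for i, aa in enumerate(protein):
                if aa == "M" and start is None:
                    start = i                     # first ATG opens the ORF
                elif aa == "*" and start is not None:
                    if i - start >= min_codons:   # keep only long enough ORFs
                        orfs.append((strand, frame, protein[start:i]))
                    start = None
    return orfs


# Hypothetical example sequence.
print(find_orfs("GGATGAAACCCGGGTAGTT"))
```

In real genomic DNA `min_codons` would be set much higher (for example 100), since short ORFs occur by chance in every frame.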
12. Gene Finding on the Web GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
http://compbio.ornl.gov/grailexp
ORFfinder: NCBI
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
DNA translation: Univ. of Minnesota Med. School
http://alces.med.umn.edu/webtrans.html
GenLang
http://cbil.humgen.upenn.edu/~sdong/genlang.html
BCM GeneFinder: Baylor College of Medicine, Houston, TX
http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
13. Genomic Sequence Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence
This is where the transcription factor binding sites are located
Search for known TF sites, and discover new ones (among co-regulated genes)
14. Intron/Exon structure Gene finding programs work well in bacteria
In eukaryotes, none of these gene prediction programs does an adequate job of predicting intron/exon boundaries
The only reasonable gene models are based on alignment of cDNAs to genome sequence
Perhaps 50% of all human genes still do not have a correct coding sequence defined
15. RNA Structure Similar to DNA - base pairing
Smaller molecules, free to take on more complex shapes
tRNA, ribozymes, self-splicing introns
16. tRNA Structures
17. RNA Information Content Primary structure (sequence) contains:
Information for 3-D self-assembly
Genetic code for amino acids in protein
Translation start and stop signals
Intron splicing signals
Controls for RNA stability and transcription level
18. RNA Secondary Structure Rules for base pairing and free energy minimization are known
Characteristic tRNA stem-loop structures
Michael Zuker created the computer program FoldRNA
GCG, UNIX/Mac/PC freeware, in commercial products, and on the Web
Can predict many RNA secondary structures, not necessarily the optimal or true structure
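To show how base-pairing rules turn into a prediction, here is a sketch of the Nussinov algorithm, which simply maximizes the number of nested base pairs. It is a much simpler stand-in for free energy minimization and is not Zuker's algorithm or the FoldRNA code. The function name and the `min_loop` parameter (minimum number of unpaired bases in a hairpin loop) are assumptions for illustration.

```python
def nussinov(rna, min_loop=3):
    """Maximize nested base pairs; return (pair count, dot-bracket structure)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    best = [[0] * n for _ in range(n)]  # best[i][j] = max pairs within rna[i..j]

    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j is unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (rna[k], rna[j]) in pairs:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


print(nussinov("GGGAAAACCC"))
```

Like the energy-based programs, this returns one optimal answer out of possibly many, which is one reason a predicted fold is not necessarily the true structure.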
19. Protein Sequence Analysis Molecular properties (mol. wt., charge vs. pH, isoelectric point, hydrophobicity)
Secondary Structure
Super-secondary (signal peptide, coiled-coil, trans-membrane, etc.)
3-D prediction, Threading
Domains, motifs, etc.
20. Self-assembly Proteins self-assemble in solution
All of the information necessary to determine the complex 3-D structure is in the amino acid sequences
Structure determines function
lock & key model of enzyme function
Know the sequence, know the function?
Nearly infinite complexity
21. Structure prediction Protein Structure prediction is the Holy Grail of bioinformatics
Since structure = function, then structure prediction should allow protein design, design of inhibitors, etc.
Huge amounts of genome data - what are the functions of all of these proteins?
22. Chemical Properties of Proteins Proteins are linear polymers of 20 amino acids
Chemical properties of the protein are determined by its amino acids
Molecular weight, charge at a given pH, and isoelectric point are simple calculations from amino acid composition
Hydrophobicity is a property of groups of amino acids - best examined as a graph
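A hydrophobicity graph is usually a sliding-window average. The sketch below uses the Kyte-Doolittle hydropathy scale as one common choice. The slides do not say which scale the listed servers use, and the function name and window size of 9 are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Average hydropathy over each window of residues along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
sequence = "KKDEERLLIVAVLLIAGVIVKKDEER"
for position, value in enumerate(hydropathy_profile(sequence)):
    bar = "#" * max(0, round(value * 4))
    print(f"{position:3d} {value:6.2f} {bar}")
```

Plotting the returned values against position gives the kind of graph shown on the next slide; long stretches of high values suggest membrane-spanning segments.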
23. Hydrophobicity Plot
24. Web Sites for Simple Protein Analysis Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science, Israel
http://bioinformatics.weizmann.ac.il/hydroph/
SAPS - statistical analysis of protein sequences: composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity
http://www.isrec.isb-sib.ch/software/SAPS_form.html
25. Secondary Structure Protein secondary structure takes one of three forms:
Alpha helix
Beta pleated sheet
Turn
2ndary structure is predicted within a small window
Many different algorithms, not highly accurate
Better predictions from a multiple alignment
26. GCG Protein Analysis Toolkit Isoelectric: plots aa charge as a function of pH
PeptideStructure: secondary structure predictions
PlotStructure: plots protein secondary structure
PepPlot: plots protein secondary structure and hydrophobicity in parallel panels
Moment: makes a contour plot of the helical hydrophobic moment
HelicalWheel: plots a peptide sequence as a helical wheel to help you recognize alpha-helical regions.
27. Structure Prediction on the Web Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg
http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
BCM Search Launcher: Protein Secondary Structure Prediction: Baylor College of Medicine
http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
PREDATOR: EMBL, Heidelberg
http://www.embl-heidelberg.de/cgi/predator_serv.pl
UCLA-DOE Protein Fold Recognition Server
http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
28. Sample Structure Prediction
29. Super-secondary Structure Common structural motifs
Membrane spanning (GCG= TransMem)
Signal peptide (GCG= SPScan)
Coiled coil (GCG= CoilScan)
Helix-turn-helix (GCG = HTHScan)
30. Web servers that predict these structures PredictProtein server: EMBL, Heidelberg
http://www.embl-heidelberg.de/predictprotein/
SOSUI: Tokyo Univ. of Ag. & Tech., Japan
http://www.tuat.ac.jp/~mitaku/adv_sosui/submit.html
TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research)
http://www.isrec.isb-sib.ch/software/TMPRED_form.html
COILS (coiled coil prediction): ISREC
http://www.isrec.isb-sib.ch/software/COILS_form.html
SignalP (signal peptides): Tech. Univ. of Denmark
http://www.cbs.dtu.dk/services/SignalP/
31. 3-D Structure Cannot be accurately predicted from sequence alone (known as ab initio prediction)
Levinthal's paradox: a 100 aa protein has 3^200 possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers
There are perhaps only a few hundred basic structures, but we don't yet have this vocabulary or the ability to recognize variants on a theme
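The size of the Levinthal number is easy to check. The sketch below assumes the usual reasoning behind 3^200: two backbone torsion angles per residue with three allowed states each. The search rate of 10^15 conformations per second is a deliberately generous, hypothetical figure.

```python
import math


def levinthal_estimate(residues=100, angles_per_residue=2, states_per_angle=3,
                       conformations_per_second=1e15):
    """Return (number of conformations, years needed to enumerate them all)."""
    conformations = states_per_angle ** (angles_per_residue * residues)
    seconds = conformations / conformations_per_second
    years = seconds / (365.25 * 24 * 3600)
    return conformations, years


count, years = levinthal_estimate()
print(f"3^200 is about 10^{math.log10(count):.0f} conformations")
print(f"Exhaustive search would take about 10^{math.log10(years):.0f} years")
```

Even with a far faster computer the exponent barely moves, which is why exhaustive search is not an option and comparison with known structures is the practical route.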
32. Threading Protein Structures Best bet is to compare with similar sequences that have known structures >> Threading
Only works for proteins with >25% sequence similarity to a protein with known structure
Current state of the art requires many days of computing on a dedicated workstation
Some websites offer quick approximations
Will improve as more 3-D structures are described
Another aspect of the Genome Project
33. Predicted Structure
34. Protein Data Bank There is a database of all known protein structures called the PDB (Protein Data Bank).
These have been determined by X-ray crystallography and/or NMR.
Anyone can download and view these structures with a PDB viewer program.
35. RasMol RasMol is the simplest PDB viewer.
http://www.umass.edu/microbio/rasmol/
It can work together with a web browser to let you view the structure of any sequence found with Entrez that has a known 3-D structure.
36. Websites for 3-D structure prediction UCLA-DOE Protein Fold Recognition
http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
SwissModel: ExPASy, Univ. of Geneva
http://www.expasy.ch/swissmod/SWISS-MODEL.html
CPHmodels: Technical Univ. of Denmark
http://www.cbs.dtu.dk/services/CPHmodels/
37. Searching for Patterns in Proteins
38. Protein Domains/Motifs Proteins are built out of functional units known as domains (or motifs)
These domains have conserved sequences
Often much more similar than their respective proteins
Exon shuffling theory (W. Gilbert)
Exons correspond to folding domains which in turn serve as functional units
Unrelated proteins may share a single similar exon (e.g., ATPase or DNA binding function)
39. Protein Motif Databases Known protein motifs have been collected in databases
Best database is PROSITE
The Dictionary of Protein Sites and Patterns
maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
contains a comprehensive list of documented protein domains constructed by expert molecular biologists.
40. PROSITE is based on Patterns Each domain is defined by a simple pattern
Patterns can have alternate amino acids in each position and defined spaces, but no gaps
Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
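Because PROSITE patterns are exact-match patterns, they map directly onto regular expressions. The converter below is a simplified sketch: it handles single residues, `x`, `[...]` alternatives, `{...}` exclusions, `(n)` and `(n,m)` repeats, and `<` / `>` anchors at the ends of the pattern, but not every corner of the PROSITE syntax. The function name is hypothetical. The example pattern is the N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split an element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            piece = "."                          # any amino acid
        elif core.startswith("[") and core.endswith("]"):
            piece = core                         # alternatives
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"      # excluded residues
        elif len(core) == 1 and core.isalpha():
            piece = core                         # one specific residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(pieces)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
# Hypothetical protein fragment; prints the 0-based start of each match.
print(regex, [m.start() for m in re.finditer(regex, "MKNGSAANPSG")])
```

The second asparagine in the example is followed by proline, so it is rejected. A sequence that differs from the pattern at even one position is missed in the same way, which is the weakness of pattern searching noted above.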
41. Tools for PROSITE searches Free Mac program: MacPattern
ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
Free PC program (DOS): PATMAT
ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
GCG provides the program MOTIFS
Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.
42. Websites for PROSITE Searches ScanProsite at ExPASy: Univ. of Geneva
http://expasy.hcuge.ch/sprot/scnpsit1.html
Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France
http://pbil.ibcp.fr/NPSA/npsa_prosite.html
PPSRCH: EBI, Cambridge, UK
http://www2.ebi.ac.uk/ppsearch/
43. Profiles Profiles are tables of amino acid frequencies at each position in a motif
They are built from multiple alignments
PROSITE entries also contain profiles built from an alignment of proteins that match the pattern
Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
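A profile can be sketched as a table of log-odds scores built from the columns of an alignment. The version below is deliberately minimal: it assumes a gap-free alignment block, a uniform background of 1/20 per amino acid, and a simple pseudocount, and it scans without gaps. Real profile searches such as ProfileScan use an alignment algorithm that allows gaps. The function names and the example alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column: amino acid -> log-odds score vs. uniform background."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_match(profile, protein):
    """Slide the profile along the protein without gaps; return (best score, start)."""
    width = len(profile)
    scored = [
        (sum(col.get(aa, 0.0) for col, aa in zip(profile, protein[i:i + width])), i)
        for i in range(len(protein) - width + 1)
    ]
    if not scored:
        raise ValueError("protein is shorter than the profile")
    return max(scored)


# Hypothetical alignment of a short motif.
profile = build_profile(["GKSGT", "GKTGS", "GKSGS", "GRSGT"])
print(best_match(profile, "MAAGKSGTLLV"))
```

Unlike a pattern, the profile still gives a high score to a sequence that differs from every aligned example at one or two positions, which is where the extra sensitivity comes from.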
44. GCG ProfileSearch GCG has a set of profile analysis tools.
Start with a multiple alignment
Create a profile with ProfileMake
ProfileSearch scans a database with your profile
ProfileSegments displays alignments between a profile and matching database sequences
ProfileGap makes pairwise alignments between a single sequence and a profile
45. Websites for Profile searching PROSITE ProfileScan: ExPASy, Geneva
http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
http://www.blocks.fhcrc.org/blocks_search.html
PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases):
http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi
46. More Protein Motif Databases PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis
http://pfam.wustl.edu/hmmsearch.shtml
ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France
http://www.toulouse.inra.fr/prodom/doc/blast_form.html
[This is my favorite protein database - nicely colored results]
47. Hidden Markov Models Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.
Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next.
Pfam is built with HMMs.
GCG version 10.2 (released March 2001) has added a bunch of HMM tools (and Pfam).
48. Sample ProDom Output
49. Discovery of new Motifs All of the tools discussed so far rely on a database of existing domains/motifs
How to discover new motifs
Start with a set of related proteins
Make a multiple alignment
Build a pattern or profile
You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.
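The last step, turning a multiple alignment into a pattern, can be sketched as follows. The code assumes a gap-free alignment block and writes the result in PROSITE-style notation. The function name, the `max_alternatives` cutoff, and the example sequences are hypothetical.

```python
def alignment_to_pattern(alignment, max_alternatives=3):
    """Derive a PROSITE-style pattern from a gap-free alignment block."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    elements = []
    for col in range(length):
        residues = sorted({seq[col] for seq in alignment})
        if len(residues) == 1:
            elements.append(residues[0])                  # fully conserved position
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # a few alternatives
        else:
            elements.append("x")                          # too variable: wildcard

    # Merge runs of wildcards into x(n).
    merged, run = [], 0
    for element in elements + [None]:
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# Hypothetical aligned motif from four related proteins.
print(alignment_to_pattern(["CAKDC", "CSRDC", "CTKEC", "CGRDC"]))
```

A pattern built this way from only a few sequences tends to be too strict, so it is worth rebuilding it as more family members are found.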
50. Patterns in Unaligned Sequences Sometimes sequences may share just a small common region
common signal peptide
new transcription factors
MEME: San Diego Supercomputer Center
http://www.sdsc.edu/MEME/meme/website/meme.html
GCG also includes the MEME program
51. Summary DNA has genes and other information
Transcription factors
RNA has predictable structures
Proteins have predictable 2ndary structures and functional domains, but we generally can't predict new 3-D structures