CSCE555 Bioinformatics

CSCE555 Bioinformatics Lecture 11 Promoter Prediction Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 HAPPY CHINESE NEW YEAR University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu

Outline • Introduction to DNA Motif • Motif Representations (Recap) • Motif database search • Algorithms for motif discovery

Search Space • Motif width = W • Number of sequences = N • Sequence length = L • Size of search space = (L - W + 1)^N • L = 100, W = 15, N = 10 → size ≈ 10^19
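
The search-space size on this slide can be checked with a short calculation, reading the formula as (L - W + 1) raised to the power N (one motif start position chosen independently in each of N sequences). This is a minimal sketch; the function name `motif_search_space` is an illustrative assumption, not something from the lecture.

```python
import math

def motif_search_space(seq_len: int, motif_width: int, n_seqs: int) -> int:
    """Number of ways to choose one motif start position in each sequence."""
    positions_per_seq = seq_len - motif_width + 1  # valid start positions per sequence
    return positions_per_seq ** n_seqs

size = motif_search_space(seq_len=100, motif_width=15, n_seqs=10)
# Report the order of magnitude rather than the exact integer.
print(f"about 10^{math.floor(math.log10(size))} candidate alignments")
```

Exhaustive enumeration is clearly out of reach at this size, which is why the following slides turn to a stochastic search.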

Worked Example N = 4 pi = ¼ cki = Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29

Gibbs Sampling Search • Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!) • Start at a random point X. • Randomly pick a dimension. • Look at all points along this dimension. • Move to one of them randomly, with probability proportional to its score π. • Repeat.

Gibbs Sampling for Motif Search • Choose a random starting state. • Randomly pick a sequence. • Look at all motif positions in this sequence. • Pick one randomly, with probability proportional to exp(score). • Repeat.
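
The four steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the score is assumed to be a log-likelihood ratio of a candidate site under a profile built from the other sequences versus a uniform background, and the function names, pseudocount, and toy sequences are all hypothetical. Sequences are assumed to contain only uppercase A, C, G, T.

```python
import math
import random

BASES = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Column-wise base probabilities estimated from the current motif sites."""
    profile = []
    for col in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile

def log_odds(site, profile, background=0.25):
    """Assumed score: log-likelihood ratio, motif profile vs. uniform background."""
    return sum(math.log(profile[i][b] / background) for i, b in enumerate(site))

def gibbs_motif_search(seqs, width, n_iter=2000, seed=0):
    rng = random.Random(seed)
    # Choose a random starting state: one motif start per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        # Randomly pick a sequence and build the profile from all the others.
        i = rng.randrange(len(seqs))
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        profile = build_profile(others)
        # Look at all motif positions in the picked sequence.
        candidates = range(len(seqs[i]) - width + 1)
        weights = [math.exp(log_odds(seqs[i][p:p + width], profile))
                   for p in candidates]
        # Pick one randomly, with probability proportional to exp(score).
        starts[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return starts

# Toy input (hypothetical): four short sequences sharing the 6-mer TTGACA.
seqs = ["ACGTTGACAGCT", "GGTTGACATTCA", "CATTGACACGGA", "TTGACAGGCTAC"]
print(gibbs_motif_search(seqs, width=6))
```

Because the sampler is stochastic, the returned start positions depend on the seed and the number of iterations; real tools typically run several restarts and keep the best-scoring alignment.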

Does it Work in Practice? • Only successful cases get published! • Seems more successful in microbes (bacteria & yeast) than in animals. • The search algorithm seems to work quite well; the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance. I.e. the needle looks like hay. • Attempts to deal with this: • Assume the motif is an inverted palindrome (they often are). • Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse). • As usual, repetitive sequences cause problems. • More powerful algorithm: MEME

Go to our MEME server: • http://molgen.biol.rug.nl/meme/website/meme.html • Fill in your email address and a description of the sequences • Open the FASTA-formatted file you just saved with Genome2d (click “Browse”) • Select the number of motifs, the number of sites and the optimum width of the motif • Click “Search given strand only” • Click “Start search”

Something like this will appear in your email. The results are quite self-explanatory.

Promoter Prediction • What are promoters? • Three strategies for promoter prediction • Signal-based • Comparative genomics/phylogenetic footprinting • Expression-profile-based de novo motif discovery algorithms

What is a Promoter? Region of gene that binds RNA polymerase and transcription factors to initiate transcription

Promoters: What signals are there? Simple ones in prokaryotes

Prokaryotic promoters • RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site • RNA polymerase complex binds directly to these, with no requirement for “transcription factors” • Prokaryotic promoter sequences are highly conserved • -10 region • -35 region

What signals are there? Complex ones in eukaryotes

Eukaryotic genes are transcribed by 3 different RNA polymerases, which recognize different types of promoters & enhancers

Eukaryotic promoters & enhancers • Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!) • Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment) • RNA polymerase complexes do not specifically recognize promoter sequences directly • Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Eukaryotic transcription factors • Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription • TFs contain characteristic “DNA binding motifs” http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039 • TFs recognize specific short DNA sequence motifs (“transcription factor binding sites”) • Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors • Common in eukaryotic proteins • Estimated 1% of mammalian genes encode zinc-finger proteins • In C. elegans, there are 500! • Can be used as highly specific DNA binding modules • Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Predicting Promoters • Overview of strategies • What sequence signals can be used? • What other types of information can be used? • Algorithms • Promoter prediction software • 3 major types • many, many programs

Promoter prediction: Eukaryotes vs prokaryotes • Promoter prediction is easier in microbial genomes • Why? • Highly conserved • Simpler gene structures • More sequenced genomes! (for comparative approaches) • Methods? • Previously, again mostly HMM-based • Now: • Similarity-based • Comparative methods (because so many genomes are available) • De novo motif discovery

Predicting promoters: Steps & Strategies • Closely related to gene prediction • Obtain genomic sequence • Use sequence-similarity based comparison (BLAST, MSA) to find related genes • But: "regulatory" regions are much less well-conserved than coding regions • Locate ORFs • Identify TSS (if possible!) • Use promoter prediction programs (e.g. FirstEF) • Analyze motifs, etc. in sequence (TRANSFAC)

Automated promoter prediction strategies • Pattern-driven algorithms • Sequence-similarity based algorithms • Combined "evidence-based" • BEST RESULTS? Combined, sequential

1: Promoter Prediction: Pattern-driven algorithms • Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO) • Tend to produce huge numbers of false positives (FPs) • Why? • Binding sites (BS) for specific TFs are often variable • Binding sites are short (typically 5-15 bp) • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding • One binding site is often recognized by multiple TFs • Biology is complex: promoters are often specific to organism/cell/stage/environmental condition
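
A pattern-driven scan of the kind described on this slide can be sketched as a position weight matrix (PWM) slid along a sequence. The count matrix below is hypothetical (it is not a TRANSFAC or PROMO entry), and the threshold and function names are illustrative assumptions.

```python
import math

# Hypothetical position frequency matrix for a 4-bp site (one dict per position).
# Counts are made up for illustration and include no zeros, so logs are defined.
PFM = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 17, "C": 1, "G": 1, "T": 1},
]

def to_log_odds(pfm, background=0.25):
    """Convert counts to log2 odds against a uniform background."""
    pwm = []
    for col in pfm:
        total = sum(col.values())
        pwm.append({b: math.log2((n / total) / background) for b, n in col.items()})
    return pwm

def scan(seq, pwm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

pwm = to_log_odds(PFM)
print(scan("GGTATAAGCTATACC", pwm, threshold=5.0))
```

With a site this short, lowering the threshold even slightly admits many more windows, which is one way to see why such scans tend to produce large numbers of false positives.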

Solutions to the problem of too many FP predictions? • Take sequence context/biology into account • Eukaryotes: clusters of TFBSs are common • Prokaryotes: knowledge of σ factors helps • Probability of a "real" binding site increases if an annotated transcription start site (TSS) is nearby • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped • CpG islands before promoter around TSS • TATA box, CCAAT box • Content information: hexamer frequency
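
As one concrete example of a contextual signal from this list, a CpG-island check can be sketched as below. The thresholds (length at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria, which are an assumption here since the slide does not specify any; the function names are illustrative.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to a whole candidate region."""
    gc, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc > min_gc and obs_exp > min_obs_exp

print(looks_like_cpg_island("CG" * 100))  # extreme CpG-rich toy sequence
print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```

In practice the test is applied in sliding windows along the genome rather than to one fixed region.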

Why can't we rely on consensus sequences? • The Inr (Initiator) consensus sequence will appear once every 512 bp in random sequences • For the TATA box, once every 120 bp • Short sequence patterns can appear by chance with high likelihood (false positives)
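
The chance-occurrence argument can be made concrete with a small calculation: for a consensus written in IUPAC codes, the expected spacing between random matches is the reciprocal of the per-position match probabilities multiplied together. This sketch assumes uniform base composition, independent positions, and a single strand. The two patterns are illustrative only; the figures quoted on the slide depend on the exact consensus or matrix threshold used and are not recomputed here.

```python
# Number of bases (out of 4) that each IUPAC nucleotide code matches.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def match_probability(consensus):
    """Probability that a random position starts a match to the consensus."""
    p = 1.0
    for code in consensus.upper():
        p *= IUPAC_DEGENERACY[code] / 4
    return p

def expected_spacing(consensus):
    """Expected number of bp between chance matches (1 / match probability)."""
    return 1 / match_probability(consensus)

for pattern in ("TATAAA", "YYANWYY"):  # illustrative patterns
    print(pattern, expected_spacing(pattern))
```

The more degenerate the pattern, the shorter the spacing, so a genome-scale scan with a loose consensus returns far more chance hits than real sites.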

2: Promoter Prediction: Phylogenetic Footprinting • Assumption: common functionality can be deduced from sequence conservation • Comparative promoter prediction ("phylogenetic footprinting"): rVista, ConSite, PromH, FootPrinter • For comparative (phylogenetic) methods: • Must choose appropriate species • Different genomes evolve at different rates • Classical alignment methods have trouble with translocations and inversions that change the order of functional elements • If the entire region is highly conserved (high background conservation), comparison is useless • Not enough data (Prokaryotes >>> Eukaryotes) • Biology is complex: many (most?) regulatory elements are not conserved across species!
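
The core idea of phylogenetic footprinting can be sketched as a sliding-window identity scan over a pairwise alignment of orthologous upstream regions. This is a simplified illustration and not how rVista, ConSite, PromH, or FootPrinter are implemented; the window size, identity cutoff, and toy aligned sequences are assumptions.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return (start, identity) for alignment windows at or above min_identity.

    aln_a and aln_b are rows of a pairwise alignment ('-' marks a gap).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        # A column counts as conserved only if both residues match and are not gaps.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# Toy alignment (hypothetical): only the central 8 columns are identical.
species_a = "AAGCTATAAAGGCTTC"
species_b = "TTCATATAAAGGAGCA"
print(conserved_windows(species_a, species_b, window=8, min_identity=1.0))
```

If the two species are too closely related, nearly every window passes the cutoff and the scan carries no information, which is the "background conservation" caveat above.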

3: Promoter Prediction: Co-expression based algorithms • Problems: • Need sets of co-regulated genes • Genes experimentally determined to be co-regulated (using microarrays??) • Careful: how do we determine co-regulation? • Alignments of co-regulated genes should highlight elements involved in regulation • Algorithms: MEME, AlignACE, PhyloCon

Examples of promoter prediction/characterization software • MATCH, MatInspector • TRANSFAC • MEME & MAST • BLAST, etc. • Others? FirstEF, Dragon Promoter Finder; also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.) • JASPAR

TRANSFAC matrix entry: for TATA box • Fields: • Accession & ID • Brief description • TFs associated with this entry • Weight matrix • Number of sites used to build (How many here?) • Other info

Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

Check out optional review & try associated tutorial: Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html Check this out: http://www.phylofoot.org/NRG_testcases/ D Dobbs ISU - BCB 444/544X: Promoter Prediction (really!)

Annotated lists of promoter databases & promoter prediction software • URLs from Mount Chp 9, available online Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html • Table in Wasserman & Sandelin Nat Rev Genet article http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm • URLs for Baxevanis & Ouellette, Chp 5: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links More lists: • http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter • http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104 • http://www3.oup.co.uk/nar/database/subcat/1/4/

Summary • Promoter & gene regulation • 3 types of methods for promoter prediction • Many programs have sensitivity and specificity less than 0.5 • Integrative algorithms are more promising
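
To make the accuracy figures in the summary concrete, the sketch below computes sensitivity and specificity from prediction counts. It assumes the convention, common in gene- and promoter-prediction evaluations, that "specificity" means TP / (TP + FP) (that is, positive predictive value); the slide does not state which definition it uses, and the counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real promoters that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity_ppv(tp: int, fp: int) -> float:
    """Fraction of predictions that are real promoters: TP / (TP + FP).

    Assumed convention; the classical TN-based specificity is a different quantity.
    """
    return tp / (tp + fp)

# Hypothetical counts: 100 real promoters, 40 found, plus 70 false predictions.
tp, fn, fp = 40, 60, 70
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity (PPV) = {specificity_ppv(tp, fp):.2f}")
```

Under these made-up counts both measures fall below 0.5: the program misses most real promoters, and most of what it reports is wrong.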

Acknowledgement • Zhiping Weng (Boston Uni.)
