Motif Finding Workshop Project

Motif Finding Workshop Project Chaim Linhart January 2008

Outline 1. Some background again… 2. The project

1. Background Slides with Ron Shamir and Adi Akavia

Gene: from DNA to protein • DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein

DNA • DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T } • Resides in chromosomes • Complementary strands: A-T ; C-G • Forward/sense strand: AACTTGCG • Reverse-complement/anti-sense strand: TTGAACGC (shown 3’ to 5’, aligned under the forward strand; read 5’ to 3’ it is CGCAAGTT) • Directional: from 5’ to 3’: • 5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
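The base-pairing rule above is what a two-strand search relies on: a motif on the anti-sense strand can be found by searching the forward strand for its reverse complement. A minimal Java sketch follows; the class and method names are hypothetical, and leaving N (unknown base) unchanged is an assumption.

```java
// Illustrative sketch only; DnaStrand and its methods are hypothetical names.
final class DnaStrand {
    private DnaStrand() {}

    /** Complement of one base (A-T, C-G), preserving case. Anything else, e.g. N, is returned unchanged. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return base; // assumption: unknown/masked bases stay as they are
        }
    }

    /** The opposite strand, read in its own 5' to 3' direction. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand AACTTGCG, `DnaStrand.reverseComplement` returns CGCAAGTT.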

Gene structure (eukaryotes) • DNA (coding strand): promoter, transcription start site (TSS), exons and introns • Transcription (RNA polymerase) → pre-mRNA • Splicing (spliceosome) → mature mRNA: 5’ UTR, start codon, coding region, stop codon, 3’ UTR • Translation (ribosome) → protein

Translation • Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation • Stop codons - signal termination of the protein synthesis process http://ntri.tamuk.edu/cell/ribosomes.html

Genome sequences • Many genomes have been sequenced, including those of viruses, microbes, plants and animals. • Human: • 23 pairs of chromosomes • 3+ Gbps (bps = base pairs), only ~3% are genes • ~25,000 genes • Yeast: • 16 chromosomes • 20 Mbps • 6,500 genes

Regulation of Expression • Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks • Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, and physiological conditions • Main regulatory mechanism – transcriptional regulation

Transcriptional regulation [Figure: TFs bound to binding sites (BSs) on the DNA, 5’ to 3’, upstream of the gene’s TSS] • Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs) • TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS) • BSs of a particular TF share a common pattern, or motif • Some TFs operate together – TF modules

TFBS motif models • Consensus (“degenerate”) string, e.g. [AC]ACT[CG]T, summarizing the binding sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT found in the promoters of 5 of the 10 genes shown • Statistical models… • Motif logo representation
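A degenerate consensus can be derived mechanically from a set of aligned, equal-length binding sites by recording which bases occur at each position. The Java sketch below is only an illustration: it writes each position as a single IUPAC ambiguity code (for example M = A or C, S = C or G) rather than the bracket notation, and the class and method names are hypothetical.

```java
// Illustrative sketch only; ConsensusSketch is a hypothetical name.
final class ConsensusSketch {
    private ConsensusSketch() {}

    // IUPAC codes indexed by a bitmask with A=1, C=2, G=4, T=8 (index 0 = no base seen).
    private static final String IUPAC = "-ACMGRSVTWYHKDBN";

    private static int bit(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 4;
            case 'T': return 8;
            default:  return 15; // assumption: N or anything else matches every base
        }
    }

    /** Degenerate consensus of aligned sites; all sites must have the same length. */
    static String consensus(String[] sites) {
        if (sites.length == 0) return "";
        int len = sites[0].length();
        StringBuilder sb = new StringBuilder(len);
        for (int pos = 0; pos < len; pos++) {
            int mask = 0;
            for (String site : sites) {
                if (site.length() != len) {
                    throw new IllegalArgumentException("sites must have equal length");
                }
                mask |= bit(site.charAt(pos)); // collect every base seen at this position
            }
            sb.append(IUPAC.charAt(mask));
        }
        return sb.toString();
    }
}
```

For the five sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT this yields MACTST. Note that a consensus of this kind ignores how often each base occurs, which is what the statistical models and logos capture.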

Human G2+M cell-cycle genes: The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1) CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8) TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog) CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4) GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1) AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2) TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
• TFs: NF-Y, CHR • BSs are short and non-specific, and they hide on both strands and at various locations along the promoters

The computational challenge • Given a set of co-regulated genes (e.g., from gene expression chips) • Find a motif that is over-represented (occurs unusually often) in their promoters • This may be the TF binding site motif • Find TF modules – over-represented motifs that tend to co-occur

The computational challenge (II) • Motifs can also be found without a given target-set – “genome-wide” • Find a motif that is localized – occurs more often near the TSS of genes • Find a motif with a strand bias – occurs more often on the genes’ coding strand • Find TF modules with biases in their order / orientation / distance

Motif finding algorithms • More than 100 motif finding algorithms exist • Main differences between them: • Type of analysis & input: • Target-set vs. genome-wide • Single vs. multi-species (conservation) • Single motifs vs. modules • Motif model • Score for evaluating motif • Motif search technique: • Combinatorial (enumeration) vs. statistical optimization

Example - Amadeus Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle: CHR NF-Y

2. The project

General goals • Develop software from A-Z: • Design • Implementation • (Optimization) • Execution & analysis of real data • A taste of bioinformatics • Have fun • Get credit…

The computational task • Given a set of DNA sequences • Find “interesting” pairs of motifs: • Order bias • Other scores… • Main challenges: • Performance (time, memory) • Output redundancy

Input • File with DNA sequences in “fasta” format:
>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag

Input (II) • Ignore the header lines • Sequence may span multiple lines or one long line • Sequence contains the characters A,C,G,T,N in upper or lower case • “N” means unknown or masked base • Sample input files will be supplied
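The input rules above (skip header lines, join multi-line sequences, accept both cases) can be sketched in Java as follows. This is a minimal illustration, not the required design: the class name is hypothetical, the parser works on text already loaded into a String, and normalizing to upper case is an assumption made to simplify later matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; FastaSketch is a hypothetical name.
final class FastaSketch {
    private FastaSketch() {}

    /** Returns one upper-case sequence string per ">" header found in the FASTA text. */
    static List<String> parse(String text) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String rawLine : text.split("\\R")) { // \R matches any line break
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A header starts a new record; its content is ignored.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // A sequence may span several lines, so keep appending.
                current.append(line.toUpperCase(Locale.ROOT));
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

For real input files, which can be large, reading line by line with a `BufferedReader` instead of splitting one big String would likely be preferable.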

Input (III) • Search parameters: • Length of motifs (between 5 and 10) • Min. and max. distance between the motifs, e.g. ACGGATTGATNNNTGGATGCCAT: distance=9 • Single-strand vs. two-strand search • Min. number of occurrences (hits) of the pair, e.g. GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA (three hits marked) • Max. p-value • Additional parameters… (e.g., don’t count overlaps, as in AAAAAA)
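To make the distance and hit parameters concrete, the Java sketch below counts occurrences of string A followed by string B on a single strand. It rests on assumptions that the slides do not fix: the distance is taken to be the number of bases between the end of A and the start of B, every (A, B) combination within the allowed range counts as one hit, and overlapping occurrences are not filtered. The class and method names are hypothetical.

```java
// Illustrative sketch only; PairHits is a hypothetical name.
final class PairHits {
    private PairHits() {}

    /**
     * Counts occurrences of a followed by b in seq, where the gap (bases between
     * the end of a and the start of b) lies in [minGap, maxGap].
     */
    static int countOrdered(String seq, String a, String b, int minGap, int maxGap) {
        int hits = 0;
        // Visit every occurrence of a, including overlapping ones.
        for (int i = seq.indexOf(a); i >= 0; i = seq.indexOf(a, i + 1)) {
            int endA = i + a.length();
            for (int gap = minGap; gap <= maxGap; gap++) {
                int startB = endA + gap;
                if (startB + b.length() > seq.length()) {
                    break; // b can no longer fit in the sequence
                }
                if (seq.startsWith(b, startB)) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

Calling the method once as `countOrdered(seq, a, b, ...)` and once as `countOrdered(seq, b, a, ...)` gives the two counts, A→B and B→A, that the order-bias score compares. This brute-force scan is only meant to pin down the definitions; enumerating all string pairs this way would be far too slow, which is why performance is listed as a main challenge.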

Output • A list of the string pairs with the best order-bias score (smallest p-values):
Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12
• A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

Part A: String pairs with order bias • nA = # of A→B ; nB = # of B→A • WLOG, nA > nB • n = nA + nB • H0 = random order: nA ~ B(n, 0.5) • p-value = probability of at least nA occurrences of A→B = upper tail of B(n, 0.5) • Normal approximation (central limit thm.) • Fix for multiple testing: multiply the p-value by 2 (either order could have been the more frequent one)
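The order-bias score can be sketched in Java as below. Note the substitution: instead of the normal approximation named on the slide, this sketch sums the exact binomial tail in log space, which stays accurate for the very small p-values of interest; whether that is fast enough for Part A is a separate question. The class and method names are hypothetical.

```java
// Illustrative sketch only; OrderBias is a hypothetical name.
final class OrderBias {
    private OrderBias() {}

    /** P(X >= k) for X ~ Binomial(n, 0.5), summed exactly in log space. */
    static double upperTail(int n, int k) {
        if (k <= 0) return 1.0;
        if (k > n) return 0.0;
        double[] logTerms = new double[n - k + 1];
        double max = Double.NEGATIVE_INFINITY;
        // Start at log P(X = n) = n * log(0.5), then step down using
        // C(n, j - 1) = C(n, j) * j / (n - j + 1).
        double logTerm = n * Math.log(0.5);
        for (int j = n; j >= k; j--) {
            logTerms[n - j] = logTerm;
            max = Math.max(max, logTerm);
            logTerm += Math.log(j) - Math.log(n - j + 1);
        }
        double sum = 0.0;
        for (double t : logTerms) {
            sum += Math.exp(t - max); // scale by the largest term to avoid underflow
        }
        return Math.min(1.0, Math.exp(max) * sum);
    }

    /** Order-bias p-value: tail for the more frequent order, doubled and capped at 1. */
    static double pValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * upperTail(n, larger));
    }
}
```

As a small check, with 3 occurrences of A→B and 1 of B→A the tail is (4 + 1) / 16 = 0.3125, so `OrderBias.pValue(3, 1)` gives 0.625.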

Part B: Non-redundant list of motif pairs • Collect similar strings to motif with better score (motif = consensus):
String pair (p-value)
ACGTT , GGATT (4.3E-15)
ACGAT , GGATT (2.4E-11)
AGGAT , GGTTT (1.7E-5)
AGGTT , GGTTT (5.9E-5)
Motif pair: [logo] , [logo] (8.1E-31)
• Don’t report similar motif pairs: • Motifs that consist of similar strings • Motif pairs that are small shifts of one another • Palindromes

Part B (cont.): Additional score Option I: Co-occurrence rate • N = total # of sequences • sA = # of sequences that contain motif A (sB likewise for motif B) • sAB = # of sequences that contain motifs A and B • H0 = motifs occur independently and randomly • p-value = probability of at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
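The co-occurrence p-value can be sketched in Java as follows, reading the hypergeometric model as: out of N sequences, sA contain motif A, and the sB sequences containing motif B are drawn at random. The class and method names are hypothetical, and the binomial coefficients are computed with a plain sum of logarithms, which is simple but not tuned for speed.

```java
// Illustrative sketch only; CoOccurrence is a hypothetical name.
final class CoOccurrence {
    private CoOccurrence() {}

    /** log C(n, k), or negative infinity when the coefficient is zero. */
    static double logChoose(int n, int k) {
        if (k < 0 || k > n) return Double.NEGATIVE_INFINITY;
        k = Math.min(k, n - k);
        double sum = 0.0;
        for (int i = 1; i <= k; i++) {
            sum += Math.log(n - k + i) - Math.log(i);
        }
        return sum;
    }

    /**
     * P(X >= sAB) where X is hypergeometric: total sequences, sA of them
     * contain motif A, and sB sequences (those containing B) are drawn.
     */
    static double pValue(int total, int sA, int sB, int sAB) {
        double logDenominator = logChoose(total, sB);
        int maxJoint = Math.min(sA, sB);
        double p = 0.0;
        for (int x = Math.max(sAB, 0); x <= maxJoint; x++) {
            // Impossible configurations contribute exp(-infinity) = 0.
            p += Math.exp(logChoose(sA, x) + logChoose(total - sA, sB - x) - logDenominator);
        }
        return Math.min(1.0, p);
    }
}
```

For example, with N = 10, sA = 4, sB = 5 and sAB = 3, the tail is (4·15 + 1·6) / 252 = 66/252, roughly 0.26.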

Part B (cont.): Additional score Option II: Distance bias Is the distance between the two motifs uniform (H0), or are there specific distances that are very common? Option III: Gap variability Are the sequences between the motifs conserved (H0), or are they highly variable? Other options??

Implementation • Java (Eclipse) ; Linux • GUI: Simple graphical user interface for supplying the input parameters and reporting the results • Packages for motif logo and statistical scores will be supplied • Time performance will be measured only for part A • Reasonable documentation • Separate packages for data-structures, scores, GUI, I/O, etc.

Design document • Due in 3 weeks (Feb 24) • 3-5 pages (Word), Hebrew/English • Briefly describe main goal, input and output of program • Describe main data structures, algorithms, and scores for parts A+B • Meet with me before submission

Fin
