Motif Finding Workshop Project Chaim Linhart January 2008
Outline 1. Some background again… 2. The project
1. Background Slides with Ron Shamir and Adi Akavia
Gene: from DNA to protein [Figure: DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein]
DNA • DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T } • Resides in chromosomes • Complementary strands: A-T ; C-G • Forward/sense strand: 5’ AACTTGCG 3’ • Reverse-complement/anti-sense strand: 3’ TTGAACGC 5’ (read 5’ to 3’: CGCAAGTT) • Directional: from 5’ to 3’: (upstream) 5’ end AACTTGCGATACTCCTA 3’ end (downstream)
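The strand relationship above can be sketched in a few lines of Java. This is a minimal illustration, not part of the supplied packages; the class name `DnaStrands` and its methods are hypothetical. Read 5’ to 3’, the reverse complement of AACTTGCG is CGCAAGTT, which is the same strand as TTGAACGC written 3’ to 5’. Mapping N (unknown base) to itself is an assumption.

```java
final class DnaStrands {
    /** Complement of one base, preserving case; anything unrecognized becomes 'N'. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            case 'n': return 'n';
            default:  return 'N';
        }
    }

    /** Reverse complement: walk the sequence backwards and complement each base. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACTTGCG")); // prints CGCAAGTT
    }
}
```

For a two-strand search, one option is to scan each input sequence and its reverse complement with the same code.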
Gene structure (eukaryotes) [Figure: DNA coding strand with promoter and transcription start site (TSS); transcription (RNA polymerase) yields pre-mRNA with exons and an intron; splicing (spliceosome) yields mature mRNA with 5’ UTR, coding region (start codon to stop codon) and 3’ UTR; translation (ribosome) yields protein]
Translation • Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation • Stop codons - signal termination of the protein synthesis process http://ntri.tamuk.edu/cell/ribosomes.html
Genome sequences • Many genomes have been sequenced, including those of viruses, microbes, plants and animals. • Human: • 23 pairs of chromosomes • 3+ Gbps (bps = base pairs), only ~3% are genes • ~25,000 genes • Yeast: • 16 chromosomes • 20 Mbps • 6,500 genes
Regulation of Expression • Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks • Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, and physiological conditions • Main regulatory mechanism – transcriptional regulation
Transcriptional regulation [Figure: TFs bound to BSs near a gene, with the TSS marked; strand drawn 5’ to 3’] • Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs) • TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS) • BSs of a particular TF share a common pattern, or motif • Some TFs operate together – TF modules
TFBS motif models • Consensus (“degenerate”) string: [AC]ACT[CG]T [Figure: promoters of genes 1-10, some carrying a site: AACTGT, CACTGT, CACTCT, CACTGT, AACTGT] • Statistical models… • Motif logo representation
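A consensus string can be checked against candidate sites with a regular expression. The sketch below assumes the consensus is written with bracketed alternatives, e.g. [AC]ACT[CG]T, which covers all five sites listed on this slide (AACTGT, CACTGT, CACTCT); the class name `ConsensusMotif` is hypothetical.

```java
import java.util.regex.Pattern;

final class ConsensusMotif {
    private final Pattern pattern;

    /** consensus uses regex character classes for degenerate positions, e.g. "[AC]ACT[CG]T". */
    ConsensusMotif(String consensus) {
        this.pattern = Pattern.compile(consensus);
    }

    /** True if the whole site (not just a substring) fits the consensus. */
    boolean matches(String site) {
        return pattern.matcher(site).matches();
    }

    public static void main(String[] args) {
        ConsensusMotif motif = new ConsensusMotif("[AC]ACT[CG]T");
        for (String site : new String[] {"AACTGT", "CACTGT", "CACTCT", "CACTAT"}) {
            System.out.println(site + " " + motif.matches(site));
        }
    }
}
```

A consensus is an all-or-nothing model: a site either fits or it does not. The statistical models mentioned on the slide instead weigh how often each base appears at each position.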
Human G2+M cell-cycle genes: The CHR – NF-Y module
TFs: NF-Y, CHR
CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
BSs are short, non-specific, hiding in both strands and at various locations along the promoters
The computational challenge • Given a set of co-regulated genes (e.g., from gene expression chips) • Find a motif that is over-represented (occurs unusually often) in their promoters • This may be the TF binding site motif • Find TF modules – over-represented motifs that tend to co-occur
The computational challenge (II) • Motifs can also be found w/o a given target-set – “genome-wide” • Find a motif that is localized - occurs more often near the TSS of genes • Find a motif with a strand bias – occurs more often on the genes’ coding strand • Find TF modules with biases in their order / orientation / distance
Motif finding algorithms • >100 motif finding algs • Main differences between them: • Type of analysis & input: • Target-set vs. genome-wide • Single vs. multi-species (conservation) • Single motifs vs. modules • Motif model • Score for evaluating motif • Motif search technique: • Combinatorial (enumeration) vs. Statistical optimization
Example - Amadeus Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle: CHR NF-Y
General goals • Develop software from A-Z: • Design • Implementation • (Optimization) • Execution & analysis of real data • A taste of bioinformatics • Have fun • Get credit…
The computational task • Given a set of DNA sequences • Find “interesting” pairs of motifs: • Order bias • Other scores… • Main challenges: • Performance (time, memory) • Output redundancy
Input
File with DNA sequences in “fasta” format:
>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag
Input (II) • Ignore the header lines • Sequence may span multiple lines or one long line • Sequence contains the characters A,C,G,T,N in upper or lower case • “N” means unknown or masked base • Sample input files will be supplied
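The input rules above translate into a short parser. This is a minimal sketch under a few assumptions: the file's lines have already been read into a list (for example with `java.nio.file.Files.readAllLines`), sequences are normalized to upper case, and characters other than A, C, G, T, N are not validated. The class name `FastaReader` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FastaReader {
    /**
     * Returns one upper-case string per sequence.
     * Header lines (starting with '>') only mark the start of a new sequence
     * and are otherwise ignored; a sequence may span several lines.
     */
    static List<String> parse(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            if (line.charAt(0) == '>') {
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // Locale.ROOT keeps the case conversion independent of the machine's locale
                current.append(line.toUpperCase(Locale.ROOT));
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

Text that appears before the first header line is dropped here; a stricter reader might report it as a format error instead.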
Input (III)
• Search parameters:
  • Length of motifs (between 5 and 10)
  • Min. + Max. distance between the motifs, e.g. ACGGATTGATNNNTGGATGCCAT: distance=9
  • Single vs. two strands search
  • Min. number of occurrences (hits) of pair, e.g. GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA: 3 hits
  • Max. p-value
  • Additional parameters… (don’t count overlaps, e.g. AAAAAA)
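One way to count the hits of an ordered pair under the distance parameters is sketched below. Several points are assumptions rather than part of the project definition: the distance is taken as the number of bases between the end of the first motif and the start of the second (this gives 9 for GGATT and ATGCC in ACGGATTGATNNNTGGATGCCAT), only one strand is scanned, and every qualifying pair of positions counts as one hit. The class name `PairHits` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PairHits {
    /** Start indices of every exact occurrence of motif in seq (overlapping ones included). */
    static List<Integer> occurrences(String seq, String motif) {
        List<Integer> starts = new ArrayList<>();
        for (int i = seq.indexOf(motif); i >= 0; i = seq.indexOf(motif, i + 1)) {
            starts.add(i);
        }
        return starts;
    }

    /**
     * Number of position pairs where `first` lies upstream of `second` and the
     * gap between them is within [minGap, maxGap]. With minGap >= 0 the two
     * motifs never overlap each other.
     */
    static int countOrdered(String seq, String first, String second, int minGap, int maxGap) {
        List<Integer> secondStarts = occurrences(seq, second);
        int count = 0;
        for (int a : occurrences(seq, first)) {
            int endOfFirst = a + first.length();
            for (int b : secondStarts) {
                int gap = b - endOfFirst;
                if (gap >= minGap && gap <= maxGap) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Calling `countOrdered(seq, a, b, min, max)` and `countOrdered(seq, b, a, min, max)` gives the A→B and B→A counts for one sequence. This brute-force scan handles a single candidate pair; enumerating all pairs of length-5 to length-10 strings this way would be far too slow, which is where the performance challenge comes in.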
Output
• A list of the string pairs with the best order-bias score (smallest p-values):
  Motif A   Motif B   A→B   B→A   p-value
  ACGTT     GGATT      97    17   4.3E-15
  ACGTT     GATTC      87    16   2.7E-13
  TTAAC     CAGCC      31   114   1.2E-12
• A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores
Part A: String pairs with order bias • nA = # of A→B ; nB = # of B→A • WLOG, nA > nB • n = nA + nB • H0 = random order: nA ~ B(n, 0.5) • p-value = prob for at least nA occurrences of A→B = tail of B(n, 0.5) • Normal approximation (central limit thm.) • Fix for multiple testing: x2
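The order-bias score can be sketched as follows. Note the swap: instead of the normal approximation named on the slide, this version sums the exact tail of B(n, 0.5), because the p-values of interest are tiny and an exact sum is cheap for the hit counts involved. The class name `OrderBias` and the cap at 1.0 are assumptions.

```java
final class OrderBias {
    /** Exact P(X >= k) for X ~ Binomial(n, 0.5). */
    static double upperTail(int n, int k) {
        if (k <= 0) {
            return 1.0;
        }
        if (k > n) {
            return 0.0;
        }
        // log of C(n, k) * 0.5^n, accumulated as a sum of logs to avoid overflow
        double logTerm = n * Math.log(0.5);
        for (int i = 1; i <= k; i++) {
            logTerm += Math.log((double) (n - k + i) / i);
        }
        double term = Math.exp(logTerm);
        double sum = term;
        // step from C(n, j-1) to C(n, j) with the ratio (n - j + 1) / j
        for (int j = k + 1; j <= n; j++) {
            term *= (double) (n - j + 1) / j;
            sum += term;
        }
        return Math.min(1.0, sum);
    }

    /** Order-bias p-value: tail for the larger of the two counts, doubled and capped at 1. */
    static double pValue(int nAB, int nBA) {
        int n = nAB + nBA;
        int nA = Math.max(nAB, nBA);
        return Math.min(1.0, 2.0 * upperTail(n, nA));
    }
}
```

The doubling is the x2 correction from the slide: because the larger count is always the one tested, both directions are effectively examined.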
Part B: Non-redundant list of motif pairs
• Collect similar strings to motif with better score (motif = consensus):
  String pair (p-value)
  ACGTT , GGATT (4.3E-15)
  ACGAT , GGATT (2.4E-11)
  AGGAT , GGTTT (1.7E-5)
  AGGTT , GGTTT (5.9E-5)
  Motif pair
  [logo A] , [logo B] (8.1E-31)
• Don’t report similar motif pairs:
  • Motifs that consist of similar strings
  • Motif pairs that are small shifts of one another
  • Palindromes
Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A (sB likewise for motif B)
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = probability of at least sAB joint occurrences, given the number of hits of each single motif = tail of the hypergeometric distribution
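The co-occurrence score can be sketched as a hypergeometric tail. Here sB denotes the number of sequences that contain motif B, by analogy with sA; under H0 the number of sequences containing both motifs is hypergeometric with population N, sA marked sequences and sB draws. The class name `CoOccurrence` is hypothetical, and the statistical packages supplied for the project may offer this directly.

```java
final class CoOccurrence {
    /** P(X >= sAB) where X = number of sequences containing both motifs under H0. */
    static double upperTail(int n, int sA, int sB, int sAB) {
        // logFact[i] = log(i!), built once so binomial coefficients stay in log space
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        int kMin = Math.max(sAB, Math.max(0, sA + sB - n)); // smallest feasible overlap
        int kMax = Math.min(sA, sB);                        // largest feasible overlap
        double sum = 0.0;
        for (int k = kMin; k <= kMax; k++) {
            double logP = logChoose(logFact, sA, k)
                        + logChoose(logFact, n - sA, sB - k)
                        - logChoose(logFact, n, sB);
            sum += Math.exp(logP);
        }
        return Math.min(1.0, sum);
    }

    private static double logChoose(double[] logFact, int n, int k) {
        return logFact[n] - logFact[k] - logFact[n - k];
    }
}
```

For example, with N = 10 sequences, sA = sB = 5 and sAB = 5, the only term is C(5,5)·C(5,0)/C(10,5) = 1/252.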
Part B (cont.): Additional score
Option II: Distance bias. Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?
Option III: Gap variability. Are the sequences between the motifs conserved (H0), or are they highly variable?
Other options??
Implementation • Java (Eclipse) ; Linux • GUI: Simple graphical user interface for supplying the input parameters and reporting the results • Packages for motif logo and statistical scores will be supplied • Time performance will be measured only for part A • Reasonable documentation • Separate packages for data-structures, scores, GUI, I/O, etc.
Design document • Due in 3 weeks (Feb 24) • 3-5 pages (Word), Hebrew/English • Briefly describe main goal, input and output of program • Describe main data structures, algorithms, and scores for parts A+B • Meet with me before submission