Oral Preliminary Exam (May 7, 2007). Regulation of Alternative Splicing. Jihye Kim. Outline. Alternative Splicing Overview Goal : Investigate “regulation” of AS Method : Association Rule Mining Part I : Finding association rules of cis -regulatory elements involved in alternative splicing
Oral Preliminary Exam (May 7, 2007) Regulation of Alternative Splicing Jihye Kim
Outline • Alternative Splicing Overview • Goal : Investigate “regulation” of AS • Method : Association Rule Mining • Part I : Finding association rules of cis-regulatory elements involved in alternative splicing • Part II : Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing • Summary • Future Work
Splicing • Introns are removed and flanking exons are concatenated • Spliceosome - snRNPs and other proteins [image from http://fig.cox.miami.edu/~cmallery/150/gene/c7.17.11.spliceosome.jpg]
Splice Sites • Recognized by the spliceosome • Splice-site signals alone are too weak to predict intron locations accurately [image from http://web-books.com/MoBio/Free/Ch5A4.htm]
Splicing Factors and Binding Sites • Assist the spliceosome in identifying splice sites • Splicing factors • SR (serine/arginine-rich) proteins • Exonic and intronic enhancers and silencers (cis-acting) • ESE (A/G-rich motifs), ESS (hnRNP), ISE (G triples, UGCAUG), ISS [Source: Katherina Kechris, Rocky ’05 Conference]
Alternative Splicing • Pre-mRNA → mRNA → protein • Over 70% in human genome • Major mechanism to generate protein diversity • Highly relevant to disease • 15% of disease-causing mutations affect splicing [Krawczak 1992] [Krawczak 1992] Krawczak, M., Reiss, J., and Cooper, D.N. 1992 Hum. Genet. 90: 41-54
Types of Alternative Splicing Cassette Exon [Source from Cartegni et al. 2002]
Investigating Alternative Splicing • Traditionally, align ESTs and mRNAs to genomic sequences • Recently, microarray technology (Splice arrays) • Exon skipping is measured • Hard to measure other types of AS
Previous Work on AS Regulation • Most methods • use only sequence data • focus on the effect of individual motifs • Brain-specific exon skipping [Brudno 2001] • 25 brain-specific cassette exons from literature • Over-representation of UGCAUG in downstream intron • RESCUE-ESE [Fairbrother 2002] • Frequent hexamers in exons with weak splice sites • 10 ESE motifs show enhancer activity in experiment [Brudno 2001] Brudno, M., Gelfand, M.S., et al., 2001 Nucleic Acids Res. 29(11):2338-2348 [Fairbrother 2002] Fairbrother, W.G., et al., 2002 Science 297(5583):1007-1013
What We Have Done So Far • Investigate cis-regulatory motifs that influence the amount of AS or tissue-specific AS • Use mouse splice array data • Apply Association Rule Mining • Investigate motif combinations involved in tissue-specific AS [Jihye Kim, Sihui Zhao, Steffen Heber, “Finding association rules of cis-regulatory elements involved in alternative splicing”, Proceedings of the 45th annual southeast regional conference (ACM-SE), pp. 232-237] [Jihye Kim, Sihui Zhao, Steffen Heber, “Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing”, 7th workshop on Algorithms in Bioinformatics (WABI 2007) (submitted)]
Aim I-I : representing data context AS Datasets in Mouse • Dataset : Splice Array [Pan 2004] with 6 probes • 3126 exon skipping genes in mouse • %ASex : percentage of exon skipping in 10 tissues [Pan 2004] Pan, Q., et al., 2004 Mol Cell 16(6):929-942
Association Rule Mining • Introduced by Agrawal et al. in 1993 • Initially used for Market Basket Analysis • An association rule X => Y is a pattern stating that when X occurs, Y occurs with a certain probability • X : antecedent (left-hand side, lhs), Y : consequent (right-hand side, rhs) • Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf)
Rule Strength Measures • Given a rule X => Y, • Support = Pr(X∧Y) • Confidence = Pr(Y | X) • Lift = Pr(X∧Y) / (Pr(X) Pr(Y)) • Measures the dependency of lhs and rhs • Generally, lhs and rhs have positive dependency if lift > 1.0
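The three measures can be computed directly from a list of transactions. The following is a minimal sketch, not the implementation used in this work; the function names and the toy transactions are assumptions made for illustration, and it assumes the lhs and rhs each occur at least once.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Pr(rhs | lhs) = support(lhs and rhs) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Pr(lhs and rhs) / (Pr(lhs) Pr(rhs)); above 1.0 suggests positive dependency."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical toy transactions (not from the data set).
toy = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}, {"Z"}]
print(support(toy, {"X", "Y"}), confidence(toy, {"X"}, {"Y"}), lift(toy, {"X"}, {"Y"}))
```

In this toy set X and Y each occur in 3 of 5 transactions and together in 2, so the lift is slightly above 1.0.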
ARM Example • Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana • Cart 2 : Beer, Nuts, Tissue, Diaper • Cart 3 : Apple, Beer • Cart 4 : Jam, Beer, Diaper • Cart 5 : Bread, Butter, Tissue, Jam • Min supp = 0.5, Min conf = 0.7 • Frequent itemset = itemset whose support > 0.5 • Not frequent : Bread (2/5 < 0.5) • Frequent itemsets (support) : Beer (0.8), Jam (0.6), Diaper (0.6), {Beer, Diaper} (0.6) • Association rules (confidence) : Beer => Jam (2/4 < 0.7, rejected), Beer => Diaper (0.75)
Apriori Algorithm • Most popular algorithm • Two steps: • Find all itemsets that satisfy min_supp (frequent itemsets) • Any subset of a frequent itemset is also frequent • Find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on • Generate rules • A => B is an association rule if Confidence(A => B) ≥ min_conf
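The two steps can be sketched in a few lines of Python. This is an illustrative, unoptimized version of the level-wise search (function names are assumptions, and it uses "≥ min_supp" as the frequency test), applied to the five shopping carts from the ARM example.

```python
from itertools import combinations


def apriori(transactions, min_supp):
    """Return {itemset: support} for every itemset with support >= min_supp."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    singles = {frozenset([i]) for t in transactions for i in t}
    level = {s: supp(s) for s in singles}
    level = {s: v for s, v in level.items() if v >= min_supp}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: supp(c) for c in candidates}
        level = {c: v for c, v in level.items() if v >= min_supp}
        frequent.update(level)
        k += 1
    return frequent


def generate_rules(frequent, min_conf):
    """Return (lhs, rhs, confidence) for rules with confidence >= min_conf."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = s / frequent[lhs]  # lhs is frequent by downward closure
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


carts = [
    {"Milk", "Bread", "Diaper", "Beer", "Jam", "Banana"},
    {"Beer", "Nuts", "Tissue", "Diaper"},
    {"Apple", "Beer"},
    {"Jam", "Beer", "Diaper"},
    {"Bread", "Butter", "Tissue", "Jam"},
]
frequent = apriori(carts, min_supp=0.5)
rules = generate_rules(frequent, min_conf=0.7)
```

On these carts the frequent itemsets are Beer, Jam, Diaper and {Beer, Diaper}. Besides Beer => Diaper, the reverse rule Diaper => Beer also passes the confidence threshold.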
Part I : Finding association rules of cis-regulatory elements involved in alternative splicing [Proceedings of the 45th annual southeast regional conference (ACM-SE), Winston-Salem, North Carolina, pp. 232-237]
Aim I-I : representing data context K-mers Around Cassette Exon (items) • Pre-mRNA sequences • Transcripts from NCBI • BLAT to align transcripts to mouse genome • 200 bps from 7 regions around cassette exon • 2565 genes in total • Items (6mers) : AAAAAA to TTTTTT in region 1 … 7
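As a sketch of how a gene could be turned into a transaction, the snippet below tags every 6-mer with the region it occurs in, using the `region_kmer` item naming that appears later in the deck (e.g. `7_TGAAGA`). The function name, the dictionary input format and the two short sequences are hypothetical; the real regions are 200 bp long.

```python
def region_kmer_items(regions, k=6):
    """Build the item set for one gene.

    `regions` maps a region number (1..7) to the sequence extracted
    from that region; each k-mer becomes an item "<region>_<kmer>".
    """
    items = set()
    for region, seq in regions.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                items.add(f"{region}_{kmer}")
    return items


# Hypothetical, shortened region sequences for one gene.
gene = {6: "TTTTCTA", 7: "TGAAGAA"}
print(sorted(region_kmer_items(gene)))
```

Using a set means each hexamer is counted at most once per region and gene, which matches the presence/absence view of items in a transaction.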
ARM in Finding AS Motif Rule • Items : all possible hexamers (motifs) • Transactions : 2565 AS genes • Goal : finding motif association rules in AS genes (e.g., AGGATA => TTAGCT) • By Apriori algorithm [Agrawal 1993] : find all frequent hexamers, then generate hexamer rules [Agrawal 1993] Agrawal R., Imielinski T., Swami AN., 1993 SIGMOD 22(2):207-216
ARM Example • Seq 1 : ACGATTAGG • Seq 2 : GAATAGG • Seq 3 : TGCAGG • Seq 4 : GGATTAGG • Seq 5 : CAGAT • Min support = 0.5, Min confidence = 0.7 • Frequent 3-mer sets (support) : AGG (0.8), GAT (0.6), TAG (0.6), {AGG, TAG} (0.6) • Rules (confidence) : AGG => GAT (2/4 = 0.5 < minconf, rejected), AGG => TAG (0.75), TAG => AGG (1.0)
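The numbers in this example can be reproduced with a short script. This is a sketch that treats each sequence as a transaction whose items are its distinct 3-mers; it only searches single 3-mers and pairs, which is sufficient here because no larger set can be frequent. Variable and function names are illustrative.

```python
from itertools import combinations

seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]
min_supp, min_conf = 0.5, 0.7


def kmers(seq, k=3):
    """Distinct k-mers of a sequence (the items of one transaction)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


transactions = [kmers(s) for s in seqs]
n = len(transactions)


def count(items):
    """Number of sequences containing all of the given k-mers."""
    return sum(set(items) <= t for t in transactions)


# Frequent single 3-mers, then frequent pairs built from them.
singles = sorted(m for m in set().union(*transactions) if count([m]) / n >= min_supp)
pairs = [p for p in combinations(singles, 2) if count(p) / n >= min_supp]

# Rules in both directions for every frequent pair.
rules = {}
for a, b in pairs:
    for lhs, rhs in ((a, b), (b, a)):
        conf = count((a, b)) / count([lhs])
        if conf >= min_conf:
            rules[(lhs, rhs)] = conf

print(singles, pairs, rules)
```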
Aim I-II : finding motif association rules for all AS genes Motif Association Rules from AS Genes • Frequent 6-mers, Minsup = 0.05 (129 genes) - 7_TGAAGA, 7_GAAGAA (ASF/SF2, SRp55) - 6_TTTTCT, 6_AATAAA, … - Among 6,000 6-mers, 1/3 are in AEDB - Candidates of regulatory motifs • Association Rules, Minconf = 0.4 - 7_AAAAAT => 7_TGAAGA, 7_AAAGGA => 7_AGAAGA, - 7_GAAAAA => 7_AAGAAG, 7_CTGCCT => 7_CTGGAG, - 7_AGGAAA => 7_AAGAAG, 7_AATAAA => 7_AAGAAG - Candidates of regulatory combinations for AS
Aim I-III : finding motif association rules for cluster Clustering by AS Pattern in 10 Tissues • Hypothesis : Motif combinations “cause” the AS profile • Cluster genes based on AS profile. We use • Euclidean distance / Correlation • Average linkage clustering • Frequent 6-mers in a cluster are motif candidates
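A minimal pure-Python sketch of average linkage clustering with either distance is given below. It is a naive implementation for illustration only (a library routine would normally be used for thousands of genes); the function names and the four three-tissue profiles are hypothetical, and the correlation distance assumes no profile is constant.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 - Pearson correlation; undefined for constant profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / sqrt(va * vb)


def average_linkage(profiles, n_clusters, dist=euclidean):
    """Agglomerative clustering; returns clusters as lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        total = sum(dist(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        index_pairs = ((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters)))
        a, b = min(index_pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Hypothetical %ASex profiles over three tissues (the data set has 10).
profiles = [[10, 20, 80], [12, 22, 78], [70, 30, 10], [68, 33, 12]]
print(average_linkage(profiles, 2))
print(average_linkage(profiles, 2, dist=correlation_distance))
```

Correlation distance groups genes by the shape of their profile across tissues, while Euclidean distance also takes the absolute skipping level into account.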
Aim I-III : finding motif association rules for cluster Association Rules from Clusters • Lift (X => Y) > 2.0 • Comparison with outside the cluster (p-value < 2.13e-10) • Association rules are candidates of motif combinations for the corresponding AS pattern [Figure: correlation-based clusters]
Part II : Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing [7th workshop on Algorithms in Bioinformatics (WABI 2007) (submitted)]
Aim II-I : finding motif association rules for tissue-specific AS Finding Motifs Involved in Tissue-Specific AS • Items : • hexamers in gene regions and • exon skipping rate in tissues • Transactions : • 2565 genes from Pan’s data set • Goal : find associations such as AGGATA in cassette exon => High exon skipping in Brain • We focus on complex rules, e.g. {AGGATA in cassette exon, CCTGCG in downstream intron} => High exon skipping in Brain
AS profile items • Use quartiles to convert numeric %ASex values into categorical AS profile items • BrainLow : the first %ASex quartile in Brain • BrainHigh : the last %ASex quartile in Brain
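The quartile conversion can be sketched with the standard library as follows. The function name, the gene identifiers and the %ASex values are hypothetical, and the treatment of values that fall exactly on a quartile boundary (counted as Low or High here) is an assumption, as is the default quantile method of `statistics.quantiles`.

```python
from statistics import quantiles


def profile_items(as_ex, tissue):
    """Map genes to '<tissue>Low' / '<tissue>High' items.

    `as_ex` maps gene -> %ASex in one tissue. Genes in the middle
    two quartiles receive no item for this tissue.
    """
    q1, _median, q3 = quantiles(list(as_ex.values()), n=4)
    items = {}
    for gene, value in as_ex.items():
        if value <= q1:
            items[gene] = f"{tissue}Low"
        elif value >= q3:
            items[gene] = f"{tissue}High"
    return items


# Hypothetical %ASex values in brain for eight genes.
brain = {"g1": 5, "g2": 10, "g3": 15, "g4": 20,
         "g5": 40, "g6": 60, "g7": 80, "g8": 90}
print(profile_items(brain, "Brain"))
```

Applying the same function once per tissue yields items such as BrainLow, BrainHigh, HeartLow and HeartHigh, which are then added to each gene's transaction next to its hexamer items.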
Motif Combination ARM Example • [Sequence] + [AS profile] • Seq 1 : ACGATTAGG + BH, HH • Seq 2 : GAATAGG + BH, HL • Seq 3 : TGCAGG + BH, HH • Seq 4 : GGATTAGG + BL, HH • Seq 5 : CAGAT + BH, HL • Min support = 0.5, Min confidence = 0.7 • BH : BrainHigh, BL : BrainLow, HH : HeartHigh, HL : HeartLow
Aim II-I : finding motif association rules for tissue-specific AS Tissue-Specific AS Motif Combinations • With strict thresholds • Min_supp = 0.01, Min_conf = 0.5, Min_lift = 1.2 • MinLen of lhs = 2 (for complex rule) • Rule appearance • lhs : hexamers, rhs : AS profile items • 197 association rules are found in total • 27 complex rules are found • lhs : combinations of 34 frequent hexamers • rhs : AS profile items in tissues • All rules have lift > 1.9 • 23 rules show motif combinations in different regions
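To illustrate how the rule-appearance and lhs-length constraints restrict the search, here is a brute-force sketch on the toy data of the motif combination example (3-mers as motifs, and the AS profile rows assumed to pair with Seq 1 to Seq 5 in order). It enumerates motif sets directly instead of running Apriori, and the function name and toy thresholds are illustrative, not the settings used on the real data.

```python
from itertools import combinations

seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]
profiles = [{"BH", "HH"}, {"BH", "HL"}, {"BH", "HH"}, {"BL", "HH"}, {"BH", "HL"}]


def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


motif_sets = [kmers(s) for s in seqs]
n = len(seqs)


def mine(min_supp, min_conf, min_lift, min_lhs_len=1, max_lhs_len=2):
    """Rules with motifs only on the lhs and one AS profile item on the rhs."""
    motifs = sorted(set().union(*motif_sets))
    targets = sorted(set().union(*profiles))
    found = []
    for size in range(min_lhs_len, max_lhs_len + 1):
        for lhs in combinations(motifs, size):
            rows = [i for i in range(n) if set(lhs) <= motif_sets[i]]
            if not rows:
                continue
            for rhs in targets:
                both = sum(rhs in profiles[i] for i in rows)
                supp = both / n
                conf = both / len(rows)
                rhs_supp = sum(rhs in p for p in profiles) / n
                lift = conf / rhs_supp
                if supp >= min_supp and conf >= min_conf and lift >= min_lift:
                    found.append((lhs, rhs, supp, conf, lift))
    return found


simple = mine(0.5, 0.7, 1.2, min_lhs_len=1, max_lhs_len=1)
complex_rules = mine(0.4, 0.7, 1.2, min_lhs_len=2, max_lhs_len=2)
```

With minimum support 0.5, only AGG => HH passes all three thresholds on this toy data. Complex rules such as {AGG, GAT} => HH appear only once the support threshold is lowered to 0.4, which illustrates why a low minimum support is needed when rules with several motifs on the lhs are of interest.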
Aim II-I : finding motif association rules for tissue-specific AS {5_TTTTTA, 7_AGAGGA} => {HeartHigh}
Aim II-II : analyzing motif combination AS Profile of Motif Combinations
Summary of Graphs • In some cases, genes with a single motif show no difference in AS profile from all AS genes • However, genes containing all motifs of a combination often show significantly changed exon skipping levels • Combinations of cis-regulatory motifs can influence the AS profile in tissues
Comparison with AEDB • AEDB at EBI • Transcript regulatory sequences from literature • 292 enhancers and silencers • >60% of the extracted frequent hexamers are part of AEDB motifs • >97% of the hexamers involved in complex rules are part of AEDB motifs
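One simple way to compute such an overlap is sketched below, under the assumption that "part of" means the hexamer occurs as a substring of a known motif. The function names and the two known motifs are hypothetical placeholders, not actual AEDB entries, and RNA motifs are converted from U to T before matching.

```python
def strip_region(item):
    """Turn an item such as '7_TGAAGA' into the bare hexamer 'TGAAGA'."""
    return item.split("_", 1)[1] if "_" in item else item


def fraction_covered(items, known_motifs):
    """Fraction of distinct hexamers that are substrings of a known motif."""
    known = [m.upper().replace("U", "T") for m in known_motifs]
    hexamers = {strip_region(i).upper() for i in items}
    hits = {h for h in hexamers if any(h in m for m in known)}
    return len(hits) / len(hexamers), hits


# Hypothetical inputs for illustration only.
items = ["7_TGAAGA", "7_GAAGAA", "6_TTTTCT"]
known = ["UGAAGAAC", "GGGAGGA"]
fraction, hits = fraction_covered(items, known)
print(fraction, sorted(hits))
```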
Summary • Association rule mining (ARM) applied • Finding motif association rules for AS • Finding motif association rules for AS clusters • Finding motif combinations for tissue-specific AS
Future Work Improve method • Improve motif representation, e.g. • variable motif length, gapped k-mers • results from motif finding tools • Improve AS profile representation • Add more features, e.g. • position and distance between motifs • splice site • exon / intron length • conservation, gene information • Statistical analysis • Thresholds • Multiple testing
Future Work • Systematic analysis of simple & complex motifs • Other data sources • Human splice array [Johnson 2003] • ESTs • Investigate discovered motifs • Apply motif discovery tools • Analyze genome occurrence • Analyze gene and protein structure • Build predictive model and apply it (if time permits) • Experimental verification [Johnson 2003] Science. 2003 Dec 19;302(5653):2141-4
Acknowledgements • Dr. Steffen Heber • Dr. Eric A. Stone • Dr. Zhao-Bang Zeng • Dr. Barbara Sherry • Sihui Zhao • Li Zhang • Hyunmin Kim THANK YOU