Mining of cis-Regulatory Motifs Associated with Tissue-Specific Alternative Splicing. Jihye Kim, Bioinformatics Research Center. Outline. Background and Motivation Association Rule Mining (ARM) Use ARM techniques to discover cis-regulatory elements involved in alternative splicing
Mining of cis-Regulatory Motifs Associated with Tissue-Specific Alternative Splicing Jihye Kim Bioinformatics Research Center
Outline • Background and Motivation • Association Rule Mining (ARM) • Use ARM techniques to discover cis-regulatory elements involved in alternative splicing • Conclusions and Future Directions
Central Dogma of Molecular Biology [Figure: in the nucleus, a gene (DNA with exons and introns) undergoes TRANSCRIPTION into pre-mRNA and RNA SPLICING into mature RNA; after EXPORT to the cytoplasm, the mature RNA undergoes TRANSLATION into protein]
Splicing • Introns are removed and flanking exons are concatenated [image from http://fig.cox.miami.edu/~cmallery/150/gene/c7.17.11.spliceosome.jpg]
Alternative Splicing (AS) [Figure: pre-mRNA, mRNA, protein] • Over 70% of human genes show AS • Some genes express thousands of different mRNAs
Biological Relevance of AS • Major mechanism to generate protein diversity • Important in gene regulation • Highly relevant to disease • 15% of disease-causing mutations affect splicing [Krawczak 1992] [Krawczak 1992] Krawczak, M., Reiss, J., and Cooper, D.N. 1992 Hum. Genet. 90: 41-54
Types of Alternative Splicing Cassette Exon [Source from Cartegni et al. 2002]
Regulation of AS • The spliceosome detects splice sites • Often, splicing factors bind to introns or exons to assist or repress exon splicing [Image from J.R. Sanford, et al., Cell Science at a Glance 117(26):6261]
Cis-Regulatory Elements • Short sequences • ESE, ESS, ISE, ISS (exonic/intronic splicing enhancers/silencers) • Close to splice sites • Example database entry: GENEINFO:Specie: Homo sapiens, human; GENEINFO:Gene Name: fibronectin eda exon; GENEINFO:Entry type: Exon enhancer; GENEINFO:Methods: In vivo splicing assay; SEQINFO:Sequence: GAAGAAGA; SEQINFO:Sequence origin: Exonic [Source: http://www.ebi.ac.uk/asd/aedb/] [Image from Z. Wang and C. Burge, RNA 2008]
Investigating AS Regulation • Several computational methods • Over-represented hexamers from brain-specific genes [Brudno 2001] • RESCUE-ESE found 10 motifs with enhancer activity [Fairbrother 2002] • Motif pairs by coCOA (compositionally orthogonalized Co-Occurrence Analysis) [Friedman 2008] • Most methods • use only sequence data • focus on the effect of individual motifs [Brudno 2001] Brudno M., Gelfand M.S., et al., 2001 NAR 29(11):2338-2348 [Fairbrother 2002] Fairbrother W.G., et al., 2002 Science 297(5583):1007-13 [Friedman 2008] Friedman B.A., et al., 2008 Genome Res 18(10):1643-51
Motivation • Often, AS is regulated by a combination of several binding factors • Exonic UAGG and GGGG motifs are both required for skipping of the cassette exon of the glutamate NMDA R1 receptor [Han 2005] [Figure: UAGG and GGGG motifs] [Han 2005] K. Han, et al., PLoS Biol. 2005 3(5):e158
Goal • Find motifs and motif combinations involved in AS • Motif => Exon exclusion • Motif => Motif • Motif, Motif => Exon exclusion • Association rules: unexpected relationships between two objects
Association Rule Mining • Introduced by Agrawal et al. in 1993 • Initially used for Market Basket Analysis • An association rule is a pattern stating that when X occurs, Y occurs with a certain probability • X : antecedent (left-hand side, lhs) • Y : consequent (right-hand side, rhs) • Goal: Find all interesting rules X => Y
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam An unexpected rule Beer => Diaper
Rule Strength Measures • Given a rule X => Y, • Support = Pr(X∧Y) • Confidence = Pr(Y | X) • Lift = Pr(X∧Y) / (Pr(X) Pr(Y)) • Lift measures the dependency of lhs and rhs • Generally, lhs and rhs have positive dependency if lift > 1.0
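The three measures can be checked by hand on the five shopping carts from the ARM Example slide. The following is a minimal sketch; the helper names `support`, `confidence`, and `lift` are illustrative choices, not part of the original talk.

```python
# Market-basket transactions copied from the "ARM Example" slide.
carts = [
    {"Milk", "Bread", "Diaper", "Beer", "Jam", "Banana"},
    {"Beer", "Nuts", "Tissue", "Diaper"},
    {"Apple", "Beer"},
    {"Jam", "Beer", "Diaper"},
    {"Bread", "Butter", "Tissue", "Jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimate of Pr(rhs | lhs): support of the union over support of lhs."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Pr(lhs and rhs) / (Pr(lhs) * Pr(rhs)); above 1.0 suggests positive dependency."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / (
        support(lhs, transactions) * support(rhs, transactions)
    )

print(support({"Beer", "Diaper"}, carts))       # 3 of 5 carts
print(confidence({"Beer"}, {"Diaper"}, carts))  # about 0.75
print(lift({"Beer"}, {"Diaper"}, carts))        # about 1.25
```

For Beer => Diaper the lift is about 1.25, so under these definitions the two items co-occur more often than independence would predict.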
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam Min supp = 0.5 Min conf = 0.7 Frequent Itemset = itemset whose support > 0.5
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam Min supp = 0.5 Min conf = 0.7 Frequent Itemsets (support)
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam Min supp = 0.5 Min conf = 0.7 Frequent Itemsets (support) Beer (0.8), Jam (0.6), Diaper (0.6), {Beer, Diaper} (0.6)
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam Min supp = 0.5 Min conf = 0.7 Frequent Itemsets Association Rules (confidence) Beer (0.8), Jam (0.6), Diaper (0.6) {Beer, Diaper} (0.6)
ARM Example Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana Cart 2 : Beer, Nuts, Tissue, Diaper Cart 3 : Apple, Beer Cart 4 : Jam, Beer, Diaper Cart 5 : Bread, Butter, Tissue, Jam Min supp = 0.5 Min conf = 0.7 Frequent Itemsets Association Rules (confidence) Beer (0.8), Jam (0.6), Diaper (0.6) {Beer, Diaper} (0.6) Beer => Diaper (0.75)
Apriori Algorithm • Most popular ARM algorithm • Two steps: 1. Find all itemsets that satisfy min_supp (frequent itemsets) • Any subset of a frequent itemset is also frequent • Find all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on. 2. Generate rules • A => B is an association rule if Confidence(A => B) ≥ min_conf
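The two steps can be sketched in a few lines of Python. This is a simplified, unoptimized illustration of the level-wise idea, run on the shopping-cart example; the function names `apriori` and `generate_rules` are assumptions made for this sketch and do not come from any particular library.

```python
from itertools import combinations

carts = [
    {"Milk", "Bread", "Diaper", "Beer", "Jam", "Banana"},
    {"Beer", "Nuts", "Tissue", "Diaper"},
    {"Apple", "Beer"},
    {"Jam", "Beer", "Diaper"},
    {"Bread", "Butter", "Tissue", "Jam"},
]

def apriori(transactions, min_supp):
    """Step 1: level-wise search for frequent itemsets (itemset -> support)."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while level:
        # Keep only the candidates of size k that meet the support threshold.
        current = {}
        for cand in level:
            supp = sum(cand <= t for t in transactions) / n
            if supp >= min_supp:
                current[cand] = supp
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune any
        # candidate with an infrequent (k-1)-subset (downward closure).
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

def generate_rules(frequent, min_conf):
    """Step 2: split each frequent itemset into lhs => rhs and filter by confidence."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = supp / frequent[lhs]  # every subset is frequent too
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

frequent = apriori(carts, min_supp=0.5)
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.7):
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```

With min_supp = 0.5 and min_conf = 0.7 this sketch reports Beer => Diaper (0.75) and also the reverse rule Diaper => Beer (1.0), since every cart with Diaper also contains Beer.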
Association Rules of Motifs in AS • Beer => Diaper : Shopping items purchased together in market basket data • Motif A => Motif B : Motif pair that jointly regulates alternative splicing
Part I : Finding association rules of cis-regulatory elements involved in alternative splicing [Proceedings of the 45th annual southeast regional conference (ACM-SE) Winston-Salem, North Carolina pp. 232 – 237, 2007, BEST REGULAR PAPER]
AS Datasets in Mouse • Dataset: splice array [Pan 2004] with 6 probes • 3126 exon-skipping genes in mouse • %ASex : percentage of exon skipping in 10 tissues [Pan 2004] Pan, Q., et al., 2004 Mol Cell 16(6):929-942
K-mers Around Cassette Exon (items) • Pre-mRNA sequences • Transcripts from NCBI • BLAT to align transcripts to the mouse genome • 200 bp from 7 regions around the cassette exon • 2565 genes in total • Items (6-mers) : • AAAAAA to TTTTTT in regions 1 … 7
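As a rough illustration of how one gene might become a transaction of region-tagged hexamer items, consider the sketch below. The helper names, the toy sequences, and the `region_kmer` label format (modeled on items such as 4_TGAAGA that appear elsewhere in this deck) are assumptions for illustration, not the author's actual pipeline.

```python
def kmers(seq, k=6):
    """All distinct k-mers that occur in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def region_items(regions, k=6):
    """Turn {region_number: sequence} into items such as '4_TGAAGA'.

    The region number is kept in the item label so that the same hexamer
    in two different regions counts as two different items.
    """
    return {
        f"{region}_{kmer}"
        for region, seq in regions.items()
        for kmer in kmers(seq.upper(), k)
    }

# Hypothetical toy gene with two short regions (real regions would be 200 bp).
toy_gene = {4: "tgaagaa", 7: "GAAGAAG"}
print(sorted(region_items(toy_gene)))
```

Each gene then contributes one set of such items, which plays the role that a shopping cart plays in the market-basket example.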
ARM for AS Motif Rules • Items : all possible hexamers (motifs) • Transactions : 2565 AS genes • Goal : find motif association rules in AS genes (e.g., AGGATA => TTAGCT) • By the Apriori algorithm [Agrawal 1993]: 1. Find all frequent hexamers 2. Generate hexamer rules [Agrawal 1993] Agrawal R., Imielinski T., Swami A.N., 1993 SIGMOD 22(2):207-216
ARM Example [Example] Seq 1 : ACGATTAGG Seq 2 : GAATAGG Seq 3 : TGCAGG Seq 4 : GGATTAGG Seq 5 : CAGAT Min support = 0.5 Min confidence = 0.7
ARM Example [Example] Seq 1 : ACGATTAGG Seq 2 : GAATAGG Seq 3 : TGCAGG Seq 4 : GGATTAGG Seq 5 : CAGAT Min support = 0.5 Min confidence = 0.7 - Frequent 3mers sets (support) AGG (0.8), GAT (0.6), TAG (0.6), {AGG,TAG} (0.6)
ARM Example [Example] Seq 1 : ACGATTAGG Seq 2 : GAATAGG Seq 3 : TGCAGG Seq 4 : GGATTAGG Seq 5 : CAGAT Min support = 0.5 Min confidence = 0.7 - Frequent 3-mer sets (support) AGG (0.8), GAT (0.6), TAG (0.6), {AGG,TAG} (0.6) - Rules (confidence) AGG => TAG (0.75), TAG => AGG (1.0)
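The numbers in this sequence example can be reproduced with a short script. This is a minimal sketch restricted to single 3-mers and pairs, which is all the example needs; the variable names are illustrative.

```python
from itertools import combinations

seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]
K, MIN_SUPP, MIN_CONF = 3, 0.5, 0.7

# One transaction per sequence: the set of distinct 3-mers it contains.
transactions = [{s[i:i + K] for i in range(len(s) - K + 1)} for s in seqs]

def support(itemset):
    """Fraction of sequences that contain every k-mer in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

all_kmers = sorted(set().union(*transactions))
frequent_single = {
    kmer: support({kmer}) for kmer in all_kmers if support({kmer}) >= MIN_SUPP
}
# Only pairs of frequent 3-mers can themselves be frequent.
frequent_pairs = {
    pair: support(set(pair))
    for pair in combinations(frequent_single, 2)
    if support(set(pair)) >= MIN_SUPP
}

rules = {}
for (a, b), pair_supp in frequent_pairs.items():
    for lhs, rhs in ((a, b), (b, a)):
        conf = pair_supp / frequent_single[lhs]
        if conf >= MIN_CONF:
            rules[(lhs, rhs)] = conf

print(frequent_single)
print(frequent_pairs)
print(rules)
```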
Motif Association Rules from AS Genes [Figure: the 7 regions around the cassette exon] • Frequent 6-mers, Minsup = 0.05 (129 genes) - 4_TGAAGA, 7_GAAGAA (ASF/SF2, SRp55) - 6_TTTTCT, 6_AATAAA, … 6,000 6-mers - Candidates of regulatory motifs • Association Rules, Minconf = 0.4 - 4_AAAAAT => 4_TGAAGA, 4_AAAGGA => 4_AGAAGA, - 4_GAAAAA => 4_AAGAAG, 4_CTGCCT => 4_CTGGAG, - 4_AGGAAA => 4_AAGAAG, 4_AATAAA => 4_AAGAAG - Candidates of regulatory combinations for AS
Clustering by AS Pattern in 10 Tissues • Hypothesis : Motif combinations “cause” the AS profile • Cluster genes based on AS profile, using • Euclidean distance / correlation • Average linkage clustering • Frequent 6-mers in a cluster are motif candidates
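To make the clustering step concrete, here is a small pure-Python sketch of average-linkage clustering with a correlation-based distance. The toy profiles use three hypothetical tissue values per gene rather than ten, and the function names and the stopping rule (a fixed number of clusters) are assumptions for this sketch; the talk does not say how the tree was cut.

```python
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length AS profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)  # assumes neither profile is constant

def average_linkage(profiles, n_clusters, dist=pearson_distance):
    """Merge the two closest clusters until `n_clusters` remain.

    Cluster-to-cluster distance is the mean of all pairwise gene distances.
    """
    clusters = [[gene] for gene in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

# Hypothetical %ASex profiles (three tissues per gene, for brevity).
profiles = {
    "g1": [10, 20, 80],
    "g2": [12, 25, 90],
    "g3": [90, 40, 5],
    "g4": [85, 35, 10],
}
print(average_linkage(profiles, n_clusters=2))
```

Genes whose skipping rates rise and fall together across tissues end up in the same cluster, and frequent hexamers can then be mined within each cluster.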
Association Rules from Clusters [Figure: the 7 regions around the cassette exon; correlation-based clusters] • 112 frequent hexamers (0 – 39 for each cluster) • Lift(X => Y) > 2.0 • Comparison with outside the cluster (p-value < 2.13e-10) • Association rules are candidates of motif combinations for the corresponding AS pattern
AS profile of Genes with a Motif Rule Example: 7_AGCAGC => 6_GCAGCC
Summary • Motifs and motif association rules from a group of genes with similar AS pattern • Candidates of motif combinations • BUT: • Problems in choosing the “right” threshold • Dependent on clustering technique
Part II : Mining of Cis-regulatory Motifs Associated with Tissue-specific Alternative Splicing by Discretization-Based Quantitative Association Rule Mining
Quantitative Association Rule Mining • Mine numeric or quantitative data • Two methods : • Discretization (Binning methods, e.g., equi-width, equi-depth, distance-based) • Distribution-based
Example Cart 1 : Liquor $21, Vegetables $20, Meat $12 Cart 2 : Liquor $7, Vegetables $70 Cart 3 : Liquor $86, Meat $59 Cart 4 : Liquor $29, Vegetables $3 Cart 5 : Liquor $98 Cart 6 : Liquor $33, Meat $16
Discretization-based • Discretization of numeric attributes • Intuitive and popular • Sensitive to bin size
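The sensitivity to binning can be seen on the liquor amounts from the preceding example carts. The sketch below compares equi-width and equi-depth binning; the function names and the choice of three bins are assumptions for illustration.

```python
def equi_width_bins(values, n_bins):
    """Assign each value to one of `n_bins` equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equi_depth_bins(values, n_bins):
    """Assign bins by rank so that each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Liquor spending in carts 1-6 of the example.
liquor = [21, 7, 86, 29, 98, 33]
print(equi_width_bins(liquor, 3))  # [0, 0, 2, 0, 2, 0]
print(equi_depth_bins(liquor, 3))  # [0, 0, 2, 1, 2, 1]
```

With equal-width bins the middle bin is empty, whereas equal-depth bins put two carts in each bin, so the same data can yield different items and therefore different rules.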
AS profile items • Use quartiles to convert numeric %ASex values into categorical AS profile items • BrainLow : The first %ASex quartile in Brain • BrainHigh : The last %ASex quartile in Brain [Figure: BrainLow and BrainHigh quartiles]
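A possible way to derive such items is sketched below with the standard library's `statistics.quantiles`. The gene names and %ASex values are hypothetical, and treating values exactly on a quartile boundary as Low or High is an assumption, since the slide does not specify boundary handling.

```python
from statistics import quantiles

def as_profile_items(skip_rates, tissue):
    """Map each gene to '<tissue>Low' or '<tissue>High' based on quartiles.

    Genes in the two middle quartiles receive no AS profile item.
    """
    q1, _median, q3 = quantiles(list(skip_rates.values()), n=4)
    items = {}
    for gene, rate in skip_rates.items():
        if rate <= q1:
            items[gene] = f"{tissue}Low"
        elif rate >= q3:
            items[gene] = f"{tissue}High"
    return items

# Hypothetical %ASex values in brain for eight genes.
brain = {"g1": 5, "g2": 12, "g3": 20, "g4": 33,
         "g5": 47, "g6": 58, "g7": 76, "g8": 91}
print(as_profile_items(brain, "Brain"))
```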
Finding Motifs Involved in Tissue-Specific AS • Items : • hexamers in gene regions • exon skipping rate in tissues • Transactions : • 2565 genes from Pan’s data set • Goal : find associations between hexamers and exon skipping rate, e.g., AGGATA in cassette exon => High exon skipping in Brain
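Once hexamer items and AS profile items sit in the same transaction, rule mining can be restricted to rules whose right-hand side is an AS profile item. The following is a minimal sketch on hypothetical toy transactions; the thresholds and item names are illustrative and only single-hexamer left-hand sides are considered.

```python
# Hypothetical genes: region-tagged hexamers plus optional AS profile items.
transactions = [
    {"4_AGGATA", "6_TTTTCT", "BrainHigh"},
    {"4_AGGATA", "BrainHigh"},
    {"4_AGGATA", "6_TTTTCT"},
    {"6_TTTTCT", "BrainLow"},
    {"2_GAAGAA", "BrainLow"},
    {"2_GAAGAA"},
]
PROFILE_ITEMS = {"BrainHigh", "BrainLow"}

def support(itemset):
    """Fraction of genes whose item set contains `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def motif_to_profile_rules(min_supp, min_conf):
    """Rules 'hexamer => AS profile item' with confidence and lift."""
    motifs = set().union(*transactions) - PROFILE_ITEMS
    rules = []
    for motif in sorted(motifs):
        for profile in sorted(PROFILE_ITEMS):
            both = support({motif, profile})
            if both < min_supp:
                continue
            conf = both / support({motif})
            lift = conf / support({profile})
            if conf >= min_conf:
                rules.append((motif, profile, round(conf, 3), round(lift, 3)))
    return rules

print(motif_to_profile_rules(min_supp=0.3, min_conf=0.6))
```

On this toy data the only surviving rule is 4_AGGATA => BrainHigh, with confidence about 0.667 and lift 2.0.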
Tissue-Specific AS Motif Combinations • 1464 association rules are found in total • 204 complex rules are found • lhs : combinations of 113 frequent hexamers • rhs : AS profile items in tissues • All rules have lift > 1.9 • 117 rules show motif combinations in different regions
AS profile of Motif • 1260 simple rules with 806 hexamers
[Figure: the 7 regions around the cassette exon] {5_TTTTTA, 7_AGAGGA} => {HeartHigh}
AS Profile of Motif Combinations [Figure: the 7 regions around the cassette exon]
Part III: Mining of Cis-regulatory Motifs Associated with Tissue-specific Alternative Splicing by Distribution-Based Quantitative Association Rule Mining [J. Kim, S. Zhao, B. Howard, S. Heber, LNBI 5542, pp 260-71, 2009]
Distribution-based QARM • Proposed by Aumann and Lindell • Diaper => Liquor: mean = $12/week (overall mean = $7/week) • Association between a subset of a database and its “extraordinary” behavior • To define “extraordinary” behavior, statistical tests are used
Our Data • Heptamers : categorical items • Exon skipping rates : quantitative items G1 : 1_ACTGGAG, …, 7_TTTTCGA, 43(Brain), …, 78(Testis) G2 : 1_AAGCTTG, …, 7_TCTTAAA, 22(Brain), …, 54(Testis) G3 : 1_AGGCCAA, …, 7_TGAATTT, 4(Brain), …, 13(Testis) G4 : 1_ATATTTT, …, 7_TTTTCGA, 89(Brain), …, 100(Testis) … …
Our goal • Mining of “heptamer(s) => exon skipping rate” rules • Mean of exon skipping rates • T-test for extraordinary exon skipping rates • E.g., 4_TTGCGAC => mean(Brain) = 80 (overall mean(Brain) = 30)
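A rule of this kind can be scored by comparing the skipping rates of genes that carry the heptamer with those of genes that do not. The sketch below computes a Welch-style t statistic; the slide only says "T-test", so the choice of Welch's variant, the comparison against the complement group, and all data values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(
        va / len(sample_a) + vb / len(sample_b)
    )

def score_rule(genes, heptamer, tissue):
    """Mean skipping rate with the heptamer, overall mean, and t statistic."""
    with_motif = [g[tissue] for g in genes if heptamer in g["items"]]
    without = [g[tissue] for g in genes if heptamer not in g["items"]]
    overall = mean(g[tissue] for g in genes)
    return mean(with_motif), overall, welch_t(with_motif, without)

# Hypothetical genes: heptamer items plus a brain exon skipping rate (%).
genes = [
    {"items": {"4_TTGCGAC"}, "Brain": 78},
    {"items": {"4_TTGCGAC", "1_ACTGGAG"}, "Brain": 82},
    {"items": {"4_TTGCGAC"}, "Brain": 80},
    {"items": {"1_ACTGGAG"}, "Brain": 25},
    {"items": {"7_TTTTCGA"}, "Brain": 30},
    {"items": {"7_TTTTCGA"}, "Brain": 35},
    {"items": set(), "Brain": 30},
]
rule_mean, overall_mean, t_stat = score_rule(genes, "4_TTGCGAC", "Brain")
print(rule_mean, round(overall_mean, 1), round(t_stat, 1))
```

A large absolute t statistic flags the subset mean as "extraordinary"; turning it into a p-value would additionally require the t distribution, which is left out of this sketch.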