Some Data Mining Challenges Learned From Bioinformatics & Actions Taken
Limsoon Wong
National University of Singapore
Plan
• Bioinformatics Examples
  • Treatment prognosis of DLBC lymphoma
  • Prediction of translation initiation site
  • Prediction of protein function from PPI data
• What have we learned from these projects?
• What have I been looking at recently?
  • Statistical measures beyond frequent items
  • Small changes that have large impact
  • Evolution of pattern spaces
Example #1: Treatment Prognosis for DLBC Lymphoma
Ref: H. Liu et al., “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392
Image credit: Rosenwald et al., 2002
Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• It can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
• DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• Intl Prognostic Index (IPI)
  • Age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
  • Not very good for stratifying DLBC lymphoma patients for therapeutic trials
• Use gene-expression profiles to predict outcome of chemotherapy?
Knowledge Discovery from Gene Expression of “Extreme” Samples
[Figure: workflow from samples to selected genes]
• 240 samples, 7399 genes
• “Extreme” sample selection: < 1 yr vs > 8 yrs
• 47 short-term survivors, 26 long-term survivors
• Knowledge discovery from gene expression: 84 genes
• 80 samples
• T is long-term if S(T) < 0.3
• T is short-term if S(T) > 0.7
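The selection and labelling rules on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The record fields (`years`, `died`), the requirement that a short-term survivor has an observed death, and the treatment of scores between the two thresholds are all assumptions. The slide only gives the cut-offs (< 1 yr, > 8 yrs, 0.3, 0.7).

```python
def select_extreme_samples(samples, short_years=1.0, long_years=8.0):
    """Split training samples into short-term and long-term survivors.

    Assumption: a short-term survivor died within `short_years`; a long-term
    survivor was followed for more than `long_years`. All others are dropped.
    """
    short_term = [s for s in samples if s["died"] and s["years"] < short_years]
    long_term = [s for s in samples if s["years"] > long_years]
    return short_term, long_term


def risk_group(score, low=0.3, high=0.7):
    """Map a risk score S(T) to a label using the thresholds on the slide."""
    if score > high:
        return "short-term"
    if score < low:
        return "long-term"
    return None  # assumption: scores between the thresholds stay unlabelled


# Hypothetical toy records, for illustration only.
samples = [
    {"id": "p1", "years": 0.4, "died": True},
    {"id": "p2", "years": 9.2, "died": False},
    {"id": "p3", "years": 3.0, "died": True},
    {"id": "p4", "years": 0.8, "died": False},  # censored early, so not "extreme"
]
short_term, long_term = select_extreme_samples(samples)
```

Only `p1` and `p2` survive the selection here; the two intermediate cases are excluded from training.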
Kaplan-Meier Plot for 80 Test Cases
[Figure: Kaplan-Meier survival curves for the low-risk and high-risk groups]
• p-value of log-rank test: < 0.0001
• Risk score thresholds: 0.7, 0.3
• Without training sample selection, there is no clear difference in overall survival among the 80 samples in the validation group of the DLBCL study
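For readers unfamiliar with the plot, a Kaplan-Meier curve is the running product of (1 - deaths / number at risk) over the observed event times. The sketch below is a plain-Python version of that standard estimator, with hypothetical function and variable names; it does not reproduce the study's data or its log-rank test.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    `events[i]` is 1 if the death was observed at `times[i]`, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve


# Toy follow-up data: the subject at time 2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Computing one such curve per risk group gives the two lines that the plot compares.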
Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• What makes the second ATG the TIS?
Approach
• Training data gathering
• Signal generation
  • k-grams, distance, domain know-how, ...
• Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...
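The signal generation step can be illustrated with a small k-gram counter. This is a sketch under stated assumptions: each k-gram is counted separately upstream and downstream of a candidate ATG, and separately as in-frame, out-of-frame, or any-frame relative to that ATG. The slides do not spell out this encoding, so the feature keys and the function name below are hypothetical.

```python
from collections import Counter


def kgram_features(seq, atg_pos, k):
    """Count k-grams around a candidate ATG starting at index `atg_pos`.

    Feature key: (side, frame, kgram) with side in {"up", "down"} and
    frame in {"in", "out", "any"} (hypothetical naming).
    """
    counts = Counter()
    regions = {
        "up": range(0, atg_pos - k + 1),               # k-grams fully before the ATG
        "down": range(atg_pos + 3, len(seq) - k + 1),  # k-grams fully after the ATG
    }
    for side, starts in regions.items():
        for i in starts:
            kgram = seq[i:i + k]
            frame = "in" if (i - atg_pos) % 3 == 0 else "out"
            counts[(side, frame, kgram)] += 1
            counts[(side, "any", kgram)] += 1
    return counts


# Tiny toy sequence with the candidate ATG at index 2.
features = kgram_features("AAATGCCC", atg_pos=2, k=1)
```

Each candidate ATG then becomes one feature vector, which the later selection and integration steps consume.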
E.g., Too Many Signals
• For each value of k, there are $4^k \times 3 \times 2$ k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
Feature Selection
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
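The feature count above is easy to check, and the selection criterion can be made concrete with a t-statistic, one of the selection measures named earlier in the deck. The sketch below uses Welch's form of the statistic as one possible way to reward high inter-class distance and low intra-class spread. It is an illustration, not necessarily the scoring used in the cited work.

```python
from math import inf, sqrt
from statistics import mean, variance


def num_kgram_features(k):
    """Number of k-gram signals for one k: 4^k * 3 * 2."""
    return 4 ** k * 3 * 2


def t_statistic(pos, neg):
    """Welch t-statistic of one feature over two classes (needs >= 2 values each).

    Large when class means are far apart (inter-class distance) and
    within-class variances are small (intra-class distance).
    """
    gap = abs(mean(pos) - mean(neg))
    spread = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    if spread == 0:
        return inf if gap > 0 else 0.0
    return gap / spread


total = sum(num_kgram_features(k) for k in range(1, 6))
well_separated = t_statistic([5, 6, 7], [1, 2, 3])
overlapping = t_statistic([1, 2, 3], [2, 3, 4])
```

Ranking all features by such a score and keeping the top few is the simplest form of the selection step.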
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  • TAA, TAG, TGA (stop codon)
  • CTG, GAC, GAG, and GCC (codon bias?)
Validation Results (on Chr X and Chr 21)
[Figure: comparison of our method with ATGpr]
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
Example #3: Protein Function Prediction from Protein Interactions
[Figure: a protein with its level-1 and level-2 neighbours]
An Illustrative Case of Indirect Functional Association?
[Figure: SH3 proteins and SH3-binding proteins]
• Is indirect functional association plausible?
• Is it found often in real interaction data?
• Can it be used to improve protein function prediction from protein interaction data?
Freq of Indirect Functional Association
• 59.2% of proteins in the dataset share some function with level-1 neighbours
• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
[Figure: interaction neighbourhood, with each protein labelled by its functional category codes]
YAL012W |1.1.6.5 |1.1.9
YJR091C |1.3.16.1 |16.3.3
YMR300C |1.3.1
YPL149W |14.4 |20.9.13 |42.25 |14.7.11
YBR055C |11.4.3.1
YMR101C |42.1
YDR158W |1.1.6.5 |1.1.9
YPL088W |2.16 |1.1.9
YBR293W |16.19.3 |42.25 |1.1.3 |1.1.9
YBL072C |12.1.1
YLR140W
YMR047C |11.4.2 |14.4 |16.7 |20.1.10 |20.1.21 |20.9.1
YBR023C |10.3.3 |32.1.3 |34.11.3.7 |42.1 |43.1.3.5 |43.1.3.9 |1.5.1.3.2
YBL061C |1.5.4 |10.3.3 |18.2.1.1 |32.1.3 |42.1 |43.1.3.5 |1.5.1.3.2
YLR330W |1.5.4 |34.11.3.7 |41.1.1 |43.1.3.5 |43.1.3.9
YKL006W |12.1.1 |16.3.3
YDL081C |12.1.1
YDR091C |1.4.1 |12.1.1 |12.4.1 |16.19.3
YPL013C |12.1.1 |42.16
YPL193W |12.1.1
YOR312C |12.1.1
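The two percentages can be computed from an interaction graph and a function annotation table roughly as follows. This is a sketch on a hypothetical four-protein chain, not the yeast data. It assumes a level-2 neighbour is any protein reachable through a shared interaction partner, so the level-1 and level-2 sets may overlap.

```python
def level1(graph, p):
    """Direct interaction partners of p."""
    return set(graph.get(p, ())) - {p}


def level2(graph, p):
    """Proteins that share at least one interaction partner with p."""
    found = set()
    for q in level1(graph, p):
        found |= set(graph.get(q, ()))
    return found - {p}


def shares_function(funcs, p, others):
    """True if p has at least one function in common with some protein in `others`."""
    return any(funcs.get(p, set()) & funcs.get(q, set()) for q in others)


def association_fractions(graph, funcs):
    """Return (fraction sharing with L1, fraction sharing with L2 but not L1)."""
    direct = indirect_only = 0
    for p in graph:
        if shares_function(funcs, p, level1(graph, p)):
            direct += 1
        elif shares_function(funcs, p, level2(graph, p)):
            indirect_only += 1
    return direct / len(graph), indirect_only / len(graph)


# Hypothetical chain A - B - C - D with toy function labels.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
funcs = {"A": {"f1"}, "B": {"f1"}, "C": {"f2"}, "D": {"f1"}}
fractions = association_fractions(graph, funcs)
```

In the toy graph, A and B share a function directly, D shares one only with its level-2 neighbour B, and C shares none.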
Over-Rep of Functions in L1 & L2 Neighbours
[Figure: sensitivity (y-axis, 0 to 1) vs precision (x-axis, 0 to 1) for neighbours in L1 - L2, L2 - L1, and L1 ∩ L2]
Performance Evaluation
[Figure: sensitivity (y-axis, 0 to 1) vs precision (x-axis, 0 to 1) on informative FCs for NC, Chi², PRODISTIN, Weighted Avg, and Weighted Avg R]
• Prediction performance improves after incorporation of L1, L2, & interaction reliability info
Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers
• Recognizing what samples are relevant and what are not
• Recognizing what features are relevant and what are not & handling missing or incorrect values
• Recognizing trends, changes, and their causes
Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not
Going Beyond Frequent Patterns
• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
• Examples:
  • Odds ratio
  • Relative risk
  • Gini index
  • Yule’s Q & Y
  • etc.
[Figure: odds ratio]
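As a reminder of how some of these measures are computed, the sketch below evaluates the odds ratio, relative risk, and Yule's Q from a 2x2 contingency table. The cell naming (a, b, c, d) and the function names are illustrative choices, not taken from the slides. Here a and b count exposed cases with positive and negative outcome, and c and d count the same for unexposed cases.

```python
from math import inf, nan


def odds_ratio(a, b, c, d):
    """Odds of a positive outcome when exposed, divided by the odds when unexposed."""
    if b * c == 0:
        return inf if a * d > 0 else nan
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of a positive outcome when exposed, divided by the risk when unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    if risk_unexposed == 0:
        return inf if risk_exposed > 0 else nan
    return risk_exposed / risk_unexposed


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc), a rescaling of the odds ratio to [-1, 1]."""
    return (a * d - b * c) / (a * d + b * c)


# Toy table: 20 of 100 exposed and 10 of 100 unexposed have a positive outcome.
table = (20, 80, 10, 90)
measures = (odds_ratio(*table), relative_risk(*table), yules_q(*table))
```

For a pattern P in a two-class dataset, "exposed" would mean that a record contains P.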
Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …
• Proposition: Let $S_k^{OR}(ms,D) = \{ P \in F(ms,D) \mid OR(P,D) \ge k \}$. Then $S_k^{OR}(ms,D)$ is not convex, i.e., the space of odds ratio patterns is not convex
• Ditto for many other types of patterns
• [Figure: OR search space] Example: {A}: ∞, {A,B}: 1, {A,B,C}: 3. With k = 2, say, {A} and {A,B,C} qualify but {A,B}, which lies between them, does not
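The non-convexity can be reproduced on a toy dataset. A pattern space is convex if, whenever X ⊆ Y ⊆ Z and both X and Z belong to it, Y belongs to it too. The transactions below are constructed for illustration only (they are not from the talk) so that the odds ratios along the chain {A} ⊂ {A,B} ⊂ {A,B,C} come out as ∞, 1, and 3, matching the values on the slide.

```python
from math import inf, nan


def pattern_odds_ratio(pattern, positives, negatives):
    """Odds ratio of 'record contains pattern' against the positive class."""
    a = sum(pattern <= t for t in positives)  # positives containing the pattern
    b = sum(pattern <= t for t in negatives)  # negatives containing the pattern
    c = len(positives) - a
    d = len(negatives) - b
    if b * c == 0:
        return inf if a * d > 0 else nan
    return (a * d) / (b * c)


# Hypothetical transactions, chosen to reproduce the slide's three values.
positives = [{"A", "B", "C"}, {"A", "B", "C"}, {"A"}, {"A"}]
negatives = [{"A", "B", "C"}, {"A", "B"}, {"D"}, {"D"}]

chain = [{"A"}, {"A", "B"}, {"A", "B", "C"}]
values = [pattern_odds_ratio(p, positives, negatives) for p in chain]

k = 2
qualifies = [v >= k for v in values]  # membership of each pattern in the OR space
```

The middle pattern drops out of the space while both its subset and its superset stay in, so pruning rules that assume convexity would miss or wrongly discard patterns.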
Solution: Luckily They Become Convex When Decomposed Into Plateaus
• Theorem: Let $S_{n,k}^{OR}(ms,D) = \{ P \in F(ms,D) \mid P_{D,ed} = n,\ OR(P,D) \ge k \}$. Then $S_{n,k}^{OR}(ms,D)$ is convex
  • The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on positive (or negative) dataset
• Proposition: Let $Q \in [P]_D$; then $OR(Q,D) = OR(P,D)$
  • The plateau space can be further divided into convex equivalence classes on the whole dataset
  • The space of equivalence classes can be concisely represented by generators and closed patterns
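The representation by generators and closed patterns can be seen on a very small example. The sketch below enumerates all itemsets by brute force and groups them by the set of transactions they occur in; each group is an equivalence class, its largest member is the closed pattern, and its minimal members are the generators. This is for illustration only: it is exponential in the number of items and is not the GC-growth algorithm.

```python
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Return a list of (closed pattern, generators, support), one per class."""
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            # Patterns with the same occurrence set belong to the same class.
            tids = frozenset(
                i for i, t in enumerate(transactions) if pattern <= t
            )
            if len(tids) >= min_support:
                classes.setdefault(tids, []).append(pattern)
    result = []
    for tids, members in classes.items():
        closed = max(members, key=len)  # the unique maximal member
        generators = [p for p in members if not any(q < p for q in members)]
        result.append((closed, generators, len(tids)))
    return result


transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}]
classes = equivalence_classes(transactions)
```

Because every pattern in a class occurs in exactly the same records, any measure computed from occurrence counts, such as the odds ratio, is constant over the class, so storing the generators and the closed pattern is enough.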
Performance
• Mining odds ratio and relative risk patterns depends on GC-growth
• GC-growth mines both generators and closed patterns
• It is comparable in speed to the fastest algorithms that mine only closed patterns
Action #2: Tipping Factors---The Small Changes With Large Impact
Tipping Events
• Given a data set, such as those related to human health, it is interesting to determine important cohorts and important factors causing transition between cohorts
  • Tipping events
  • Tipping factors are “action items” for causing transitions
• “Tipping event” is two or more population cohorts that are significantly different from each other
• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts
• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event
• “Tipping point” (TP) is the combination of TB and a TF
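The relationship between these terms can be made concrete with set operations. In the sketch below, the item names, the records, and the use of a plain outcome rate are hypothetical. A real analysis would apply a statistical test to decide whether the cohorts differ significantly, which this illustration does not do.

```python
def cohort(records, pattern):
    """Records whose item set contains every item of `pattern`."""
    return [r for r in records if pattern <= r["items"]]


def outcome_rate(group):
    """Fraction of records in the group with outcome 1."""
    return sum(r["outcome"] for r in group) / len(group)


tipping_base = {"smoker"}                        # TB: shared by both cohorts
tipping_factor = {"exercise"}                    # TF: small pattern that differs
tipping_point = tipping_base | tipping_factor    # TP = TB combined with TF

# Hypothetical records: outcome 1 means the condition is present.
records = (
    [{"items": {"smoker", "exercise"}, "outcome": o} for o in (0, 0, 0, 1)]
    + [{"items": {"smoker"}, "outcome": o} for o in (1, 1, 1, 0)]
    + [{"items": {"exercise"}, "outcome": 0}]    # lacks the TB, so in neither cohort
)

with_tf = cohort(records, tipping_point)
without_tf = [r for r in cohort(records, tipping_base)
              if not tipping_factor <= r["items"]]
rates = (outcome_rate(with_tf), outcome_rate(without_tf))
```

The two cohorts share the tipping base and differ only in the tipping factor, which is what makes the factor a candidate "action item".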
Action #3: Evolution of Pattern Spaces---How Do They Change When the Sample Space Changes?
Acknowledgements
• DLBC Lymphoma: Jinyan Li, Huiqing Liu
• Translation Initiation: Fanfan Zeng, Roland Yap, Huiqing Liu
• Protein Function Prediction: Kenny Chua, Ken Sung
• Odds Ratio & Relative Risk: Mengling Feng, Yap-Peng Tan, Haiquan Li, Jinyan Li
• Tipping Points: Guimei Liu, Jinyan Li, Guozhu Dong
• Pattern Space Evolution: Mengling Feng, Yap-Peng Tan, Guozhu Dong, Jinyan Li