Some Data Mining Challenges Learned From Bioinformatics & Actions Taken

Limsoon Wong, National University of Singapore

Plan
• Bioinformatics examples
  • Treatment prognosis of DLBC lymphoma
  • Prediction of translation initiation site
  • Prediction of protein function from PPI data
• What have we learned from these projects?
• What have I been looking at recently?
  • Statistical measures beyond frequent items
  • Small changes that have large impact
  • Evolution of pattern spaces

Example #1: Treatment Prognosis for DLBC Lymphoma
Ref: H. Liu et al., “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392
Image credit: Rosenwald et al., 2002

Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
• DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• International Prognostic Index (IPI)
  • Age, Eastern Cooperative Oncology Group (ECOG) performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
  • Not very good for stratifying DLBC lymphoma patients for therapeutic trials
• Use gene-expression profiles to predict outcome of chemotherapy?

Knowledge Discovery from Gene Expression of “Extreme” Samples
[Figure: workflow diagram]
• 240 samples, 7399 genes
• “Extreme” sample selection: < 1 yr vs > 8 yrs
  • 47 short-term survivors
  • 26 long-term survivors
• Knowledge discovery from gene expression: 84 genes
• 80 samples
• T is long-term if S(T) < 0.3
• T is short-term if S(T) > 0.7
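The selection and thresholding steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. The `Patient` record, the function names, the requirement that a short-term survivor has an observed death, and the "undetermined" label for scores between the two thresholds are all assumptions made for the example. The slide only gives the cut-offs (< 1 yr, > 8 yrs) and the rules on S(T).

```python
from typing import List, NamedTuple, Tuple


class Patient(NamedTuple):
    """Hypothetical record layout; not taken from the original study."""
    patient_id: str
    survival_years: float  # follow-up time in years
    died: bool             # True if death was observed during follow-up


def select_extreme(patients: List[Patient],
                   short_cutoff: float = 1.0,
                   long_cutoff: float = 8.0) -> Tuple[List[Patient], List[Patient]]:
    """Keep only the 'extreme' samples: short-term vs long-term survivors.

    Assumption: a short-term survivor must have an observed death, so that
    patients censored early are not mistaken for short-term survivors.
    """
    short_term = [p for p in patients
                  if p.died and p.survival_years < short_cutoff]
    long_term = [p for p in patients if p.survival_years > long_cutoff]
    return short_term, long_term


def survival_class(score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Apply the slide's rule: long-term if S(T) < 0.3, short-term if S(T) > 0.7."""
    if score < low:
        return "long-term"
    if score > high:
        return "short-term"
    return "undetermined"  # assumption: the slide does not say what happens here


patients = [
    Patient("p1", 0.5, True),
    Patient("p2", 0.5, False),   # censored early, so not selected
    Patient("p3", 9.0, False),
    Patient("p4", 4.0, True),    # neither extreme
]
short_term, long_term = select_extreme(patients)
print([p.patient_id for p in short_term], [p.patient_id for p in long_term])
print(survival_class(0.85), survival_class(0.1), survival_class(0.5))
```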

Kaplan-Meier Plot for 80 Test Cases
[Figure: Kaplan-Meier survival curves for the low-risk and high-risk groups]
• p-value of log-rank test: < 0.0001
• Risk score thresholds: 0.7, 0.3
• No clear difference in overall survival is seen for the 80 samples in the validation group of the DLBCL study if no training sample selection is conducted
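For readers unfamiliar with the plot, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator, which produces the kind of survival curve shown on this slide. The function name and the toy data are assumptions. The log-rank test used to compare the two curves is not implemented here.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time with at least one death.

    times:  follow-up time of each patient
    events: True if death was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Toy data (made up): the patient at t=2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```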

Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• What makes the second ATG the TIS?

Approach
• Training data gathering
• Signal generation
  • k-grams, distance, domain know-how, ...
• Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...
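The signal generation step can be illustrated with a small sketch that locates candidate ATGs and counts in-frame 3-grams upstream and downstream of one candidate. The function names and the restriction to in-frame 3-grams are assumptions made for illustration. The features used in the referenced work may be defined differently.

```python
from collections import Counter
from typing import List


def find_atg_positions(seq: str) -> List[int]:
    """Return the 0-based start index of every ATG in the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def in_frame_codons(seq: str, atg_pos: int, direction: str) -> Counter:
    """Count 3-grams in the same reading frame as the candidate ATG.

    direction: "upstream" (before the ATG) or "downstream" (after it).
    """
    seq = seq.upper()
    if direction == "downstream":
        starts = range(atg_pos + 3, len(seq) - 2, 3)
    else:
        starts = range(atg_pos % 3, atg_pos, 3)
    return Counter(seq[i:i + 3] for i in starts)


# Small made-up sequence, used so the behaviour is easy to check by hand.
toy = "CCATGGCTATGTAA"
candidates = find_atg_positions(toy)           # ATG at index 2 and index 8
down = in_frame_codons(toy, 2, "downstream")   # GCT, ATG, TAA
up = in_frame_codons(toy, 8, "upstream")       # ATG, GCT
print(candidates, dict(down), dict(up))

# The same functions applied to the first row of the sample cDNA on the slide.
row1 = "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
for pos in find_atg_positions(row1):
    print(pos, in_frame_codons(row1, pos, "downstream").most_common(3))
```

Each candidate ATG then becomes one training or test instance whose feature vector is built from such counts.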

Too Many Signals
• For each value of k, there are $4^k \times 3 \times 2$ k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms

Feature Selection
• Choose a signal with low intra-class distance
• Choose a signal with high inter-class distance
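The feature count above can be checked directly, and the selection criterion (low intra-class distance, high inter-class distance) can be approximated by a t-statistic-style score. This is a sketch under assumptions: the parameter names `frames` and `regions` are one reading of the factors 3 and 2 that the slide does not spell out, and the Welch-style score stands in for whichever selection measure was actually used.

```python
from math import inf, sqrt
from statistics import mean, variance
from typing import Sequence


def kgram_feature_count(k: int, frames: int = 3, regions: int = 2) -> int:
    """Number of k-gram features: 4^k nucleotide strings times the two extra factors.

    Assumption: the factors 3 and 2 are modelled here as 'frames' and 'regions'.
    """
    return 4 ** k * frames * regions


def separation_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Welch-style t-statistic: large when class means are far apart
    (high inter-class distance) and each class is tight (low intra-class distance)."""
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    diff = abs(mean(pos) - mean(neg))
    if se == 0:
        return inf if diff > 0 else 0.0
    return diff / se


counts = [kgram_feature_count(k) for k in range(1, 6)]
print(counts, sum(counts))  # [24, 96, 384, 1536, 6144] 8184

# Toy feature values (made up): the first signal separates the classes well.
good = separation_score([1.0, 1.1, 0.9], [0.0, 0.1, -0.1])
poor = separation_score([1.0, 0.0, 2.0], [0.5, 1.5, -0.5])
print(good > poor)  # True
```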

Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  • TAA, TAG, TGA (stop codon)
  • CTG, GAC, GAG, and GCC (codon bias?)

Validation Results (on Chr X and Chr 21)
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
[Figure panels: Our method; ATGpr]

Example #3: Protein Function Prediction from Protein Interactions
[Figure labels: Level-1 neighbour; Level-2 neighbour]

An Illustrative Case of Indirect Functional Association?
[Figure labels: SH3 Proteins; SH3-Binding Proteins]
• Is indirect functional association plausible?
• Is it found often in real interaction data?
• Can it be used to improve protein function prediction from protein interaction data?

Frequency of Indirect Functional Association
• 59.2% of proteins in the dataset share some function with level-1 neighbours
• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
[Figure: example interaction network; each protein is labelled with its functional category codes]
YAL012W | 1.1.6.5 | 1.1.9
YJR091C | 1.3.16.1 | 16.3.3
YMR300C | 1.3.1
YPL149W | 14.4 | 20.9.13 | 42.25 | 14.7.11
YBR055C | 11.4.3.1
YMR101C | 42.1
YDR158W | 1.1.6.5 | 1.1.9
YPL088W | 2.16 | 1.1.9
YBR293W | 16.19.3 | 42.25 | 1.1.3 | 1.1.9
YBL072C | 12.1.1
YLR140W
YMR047C | 11.4.2 | 14.4 | 16.7 | 20.1.10 | 20.1.21 | 20.9.1
YBR023C | 10.3.3 | 32.1.3 | 34.11.3.7 | 42.1 | 43.1.3.5 | 43.1.3.9 | 1.5.1.3.2
YBL061C | 1.5.4 | 10.3.3 | 18.2.1.1 | 32.1.3 | 42.1 | 43.1.3.5 | 1.5.1.3.2
YLR330W | 1.5.4 | 34.11.3.7 | 41.1.1 | 43.1.3.5 | 43.1.3.9
YKL006W | 12.1.1 | 16.3.3
YDL081C | 12.1.1
YDR091C | 1.4.1 | 12.1.1 | 12.4.1 | 16.19.3
YPL013C | 12.1.1 | 42.16
YPL193W | 12.1.1
YOR312C | 12.1.1
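The two percentages above can be computed from an interaction graph and a function annotation table roughly as sketched below. The graph, the protein names, and the function names are hypothetical. Level-2 neighbours are taken here to be all proteins reachable through a shared interaction partner, which may overlap with level-1 neighbours; that definition is an assumption.

```python
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]        # protein -> direct interaction partners
Annotations = Dict[str, Set[str]]  # protein -> functional category codes


def level1(graph: Graph, p: str) -> Set[str]:
    """Direct interaction partners of p."""
    return set(graph.get(p, ())) - {p}


def level2(graph: Graph, p: str) -> Set[str]:
    """Proteins sharing at least one interaction partner with p (may overlap level 1)."""
    result: Set[str] = set()
    for q in level1(graph, p):
        result |= level1(graph, q)
    return result - {p}


def shares_function(functions: Annotations, p: str, others: Iterable[str]) -> bool:
    """True if p has at least one functional category in common with any other protein."""
    own = functions.get(p, set())
    return any(own & functions.get(q, set()) for q in others)


def indirect_only_fraction(graph: Graph, functions: Annotations) -> float:
    """Fraction of proteins sharing a function with level-2 but not level-1 neighbours."""
    hits = sum(
        1 for p in graph
        if shares_function(functions, p, level2(graph, p))
        and not shares_function(functions, p, level1(graph, p))
    )
    return hits / len(graph)


# Toy chain P1 - P2 - P3 with made-up annotations.
graph = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2"}}
functions = {"P1": {"1.1.9"}, "P2": {"12.1.1"}, "P3": {"1.1.9"}}
fraction = indirect_only_fraction(graph, functions)
print(level2(graph, "P1"), fraction)
```

In the toy chain, P1 and P3 share a function only through their common partner P2, so two of the three proteins count as indirect-only associations.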

Over-Rep of Functions in L1 & L2 Neighbours
[Figure: sensitivity vs precision; series: L1 - L2, L2 - L1, L1 ∩ L2]

Performance Evaluation
• Prediction performance improves after incorporation of L1, L2, & interaction reliability info
[Figure: “Informative FCs”, sensitivity vs precision; series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]

What Have We Learned?

Some of those “techniques” frequently needed in the analysis of biomedical data are insufficiently studied by current data mining researchers
• Recognizing what samples are relevant and what are not
• Recognizing what features are relevant and what are not & handling missing or incorrect values
• Recognizing trends, changes, and their causes

Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not

Going Beyond Frequent Patterns
• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
• Examples:
  • Odds ratio
  • Relative risk
  • Gini index
  • Yule’s Q & Y
  • etc.
[Figure: odds ratio]
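Several of the measures listed above are simple functions of a 2×2 contingency table. The sketch below uses the standard textbook definitions. The cell names a, b, c, d and the toy counts are assumptions for illustration; the slide's own formula was in a figure that is not reproduced here.

```python
from math import inf, nan, sqrt

# Assumed 2x2 table layout:
#                 diseased   healthy
#   exposed          a          b
#   unexposed        c          d


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a/b) / (c/d) = ad / bc."""
    if b * c == 0:
        return inf if a * d > 0 else nan
    return (a * d) / (b * c)


def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """RR = risk among exposed / risk among unexposed.

    Assumes both rows are non-empty and at least one unexposed subject is diseased.
    """
    return (a / (a + b)) / (c / (c + d))


def yules_q(a: int, b: int, c: int, d: int) -> float:
    """Yule's Q = (ad - bc) / (ad + bc), equivalently (OR - 1) / (OR + 1)."""
    return (a * d - b * c) / (a * d + b * c)


def yules_y(a: int, b: int, c: int, d: int) -> float:
    """Yule's Y = (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))


# Toy counts (made up): 20 of 30 exposed are diseased, 5 of 30 unexposed are diseased.
table = (20, 10, 5, 25)
print(odds_ratio(*table))     # 10.0
print(relative_risk(*table))  # approximately 4.0
print(yules_q(*table), yules_y(*table))
```

In pattern mining, "exposed" corresponds to transactions that contain the pattern and "diseased" to transactions in the positive class.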

Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …
• Proposition: Let $S_k^{OR}(ms, D) = \{ P \in F(ms, D) \mid OR(P, D) \ge k \}$. Then $S_k^{OR}(ms, D)$ is not convex, i.e., the space of odds ratio patterns is not convex
• Ditto for many other types of patterns
[Figure: OR search space, with example odds ratios {A}: ∞, {A,B}: 1, {A,B,C}: 3]
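The three example patterns on this slide already exhibit the non-convexity. A pattern space is convex if, whenever X ⊆ Y ⊆ Z and both X and Z are in the space, Y is in it too. The sketch below searches for a violating triple among patterns with known odds ratios. The function name and the choice of thresholds are assumptions; the odds ratio values are the ones shown on the slide.

```python
from math import inf
from typing import Dict, FrozenSet, Optional, Tuple

Pattern = FrozenSet[str]


def find_convexity_violation(
    or_values: Dict[Pattern, float], k: float
) -> Optional[Tuple[Pattern, Pattern, Pattern]]:
    """Return (X, Y, Z) with X ⊂ Y ⊂ Z where X and Z pass the threshold but Y does not."""
    patterns = list(or_values)
    for x in patterns:
        for y in patterns:
            for z in patterns:
                if (x < y < z
                        and or_values[x] >= k
                        and or_values[z] >= k
                        and or_values[y] < k):
                    return (x, y, z)
    return None


# Odds ratios of the three patterns shown on the slide.
or_values = {
    frozenset("A"): inf,
    frozenset("AB"): 1.0,
    frozenset("ABC"): 3.0,
}

# With threshold k = 2, {A} and {A,B,C} qualify but {A,B} in between does not.
violation = find_convexity_violation(or_values, k=2)
print(violation)

# With k = 1 all three patterns qualify, so these patterns show no violation.
print(find_convexity_violation(or_values, k=1))  # None
```

Because the qualifying patterns do not form a contiguous band in the subset lattice, they cannot be summarized by a border of minimal and maximal patterns the way frequent patterns can.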

Solution: Luckily They Become Convex When Decomposed Into Plateaus
• Theorem: Let $S_{n,k}^{OR}(ms, D) = \{ P \in F(ms, D) \mid P_{D,ed} = n,\ OR(P, D) \ge k \}$. Then $S_{n,k}^{OR}(ms, D)$ is convex
  • The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on the positive (or negative) dataset
• Proposition: Let $Q \in [P]_D$, then $OR(Q, D) = OR(P, D)$
  • The plateau space can be further divided into convex equivalence classes on the whole dataset
  • The space of equivalence classes can be concisely represented by generators and closed patterns
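The equivalence classes mentioned above group patterns that occur in exactly the same transactions, so every pattern in a class has the same support counts and hence the same odds ratio. The brute-force sketch below enumerates such classes on a toy dataset and extracts each class's closed pattern (its unique maximal member) and generators (its minimal members). It is for illustration only; the function names and the data are assumptions, and real miners avoid enumerating every itemset.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Pattern = FrozenSet[str]


def equivalence_classes(
    transactions: List[Set[str]], min_support: int = 1
) -> Dict[FrozenSet[int], List[Pattern]]:
    """Group frequent patterns by the set of transaction ids that contain them."""
    items = sorted(set().union(*transactions))
    classes: Dict[FrozenSet[int], List[Pattern]] = defaultdict(list)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            tids = frozenset(
                i for i, t in enumerate(transactions) if pattern <= t
            )
            if len(tids) >= min_support:
                classes[tids].append(pattern)
    return classes


def closed_and_generators(patterns: List[Pattern]) -> Tuple[Pattern, List[Pattern]]:
    """Closed pattern = the largest member; generators = members with no proper subset in the class."""
    closed = max(patterns, key=len)
    generators = [p for p in patterns if not any(q < p for q in patterns)]
    return closed, generators


# Toy dataset (made up).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
classes = equivalence_classes(transactions)
for tids, patterns in classes.items():
    closed, gens = closed_and_generators(patterns)
    print(sorted(tids), "closed:", sorted(closed), "generators:", [sorted(g) for g in gens])
```

Every pattern in a class lies between one of the generators and the closed pattern, which is why the pair gives a concise representation of the whole class.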

Performance
• Mining odds ratio and relative risk patterns depends on GC-growth
• GC-growth mines both generators and closed patterns
• It is comparable in speed to the fastest algorithms that mine only closed patterns

Action #2: Tipping Factors---The Small Changes With Large Impact

Tipping Events
• Given a data set, such as those related to human health, it is interesting to determine important cohorts and important factors causing transitions between cohorts
  • Tipping events
  • Tipping factors are “action items” for causing transitions
• A “tipping event” is two or more population cohorts that are significantly different from each other
• “Tipping factors” (TF) are small patterns whose presence or absence causes a significant difference between population cohorts
• The “tipping base” (TB) is the pattern shared by the cohorts in a tipping event
• A “tipping point” (TP) is the combination of TB and a TF
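The definitions above can be made concrete with a small sketch that splits the cohort matching a tipping base by the presence of a candidate tipping factor and compares an outcome rate between the two sub-cohorts. The record layout, item names, and function names are hypothetical. No significance test is applied here, although the definitions require the difference to be significant.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Record = Dict[str, object]  # hypothetical layout: {"items": frozenset, "outcome": 0 or 1}


def outcome_rate(records: List[Record]) -> Optional[float]:
    """Fraction of records with outcome 1, or None for an empty cohort."""
    if not records:
        return None
    return sum(int(r["outcome"]) for r in records) / len(records)


def tipping_effect(
    records: List[Record], base: FrozenSet[str], factor: FrozenSet[str]
) -> Tuple[Optional[float], Optional[float]]:
    """Compare the cohort matching TB plus TF with the cohort matching TB but not TF."""
    base_cohort = [r for r in records if base <= r["items"]]
    with_factor = [r for r in base_cohort if factor <= r["items"]]
    without_factor = [r for r in base_cohort if not factor <= r["items"]]
    return outcome_rate(with_factor), outcome_rate(without_factor)


# Toy records (made up): item "x" is the tipping base, item "y" the candidate factor.
records = [
    {"items": frozenset({"x", "y"}), "outcome": 1},
    {"items": frozenset({"x", "y"}), "outcome": 1},
    {"items": frozenset({"x"}), "outcome": 0},
    {"items": frozenset({"x"}), "outcome": 0},
    {"items": frozenset({"x"}), "outcome": 1},
    {"items": frozenset({"z"}), "outcome": 0},
]
rate_with, rate_without = tipping_effect(records, frozenset({"x"}), frozenset({"y"}))
print(rate_with, rate_without)
```

A large gap between the two rates, once confirmed by a suitable statistical test, would mark {x, y} as a candidate tipping point with base {x} and factor {y}.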

Impact-To-Cost-Ratio of Tipping Points

Some Simple Results Useful For Constructing TPs

Action #3: Evolution of Pattern Spaces---How Do They Change When the Sample Space Changes?

Impact of Adding New Transactions on Key and Closed Patterns

Impact of Removing Items From All Transactions

Acknowledgements
• DLBC Lymphoma: Jinyan Li, Huiqing Liu
• Translation Initiation: Fanfan Zeng, Roland Yap, Huiqing Liu
• Protein Function Prediction: Kenny Chua, Ken Sung
• Odds Ratio & Relative Risk: Mengling Feng, Yap-Peng Tan, Haiquan Li, Jinyan Li
• Tipping Points: Guimei Liu, Jinyan Li, Guozhu Dong
• Pattern Space Evolution: Mengling Feng, Yap-Peng Tan, Guozhu Dong, Jinyan Li
