Bioinformatics, Data Integration and Machine Learning: a Thesis Proposal • Kaushik Sinha • Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Motivation • Integration is hard • Data explosion • Data size & number of data sources • New analysis tools • Autonomous resources • Heterogeneous data representation & various interfaces • Frequent Updates • New trend: web and grid services
Motivation contd… • In recent years DNA microarray and other gene and protein assays have become essential tools for biologists • The next step of biological inquiry is to find out • What is known about these genes? • How are these genes related to each other or to other genes identified in similar studies? • However, the major difficulties are • How do we extract key properties shared by a set of candidate genes? • How do we generate reasonable hypotheses to explain them? • How do we define and evaluate similarity between sets of genes?
Motivating Example • Suppose that after a microarray experiment a biologist suspects that a small set of genes is related to a disease • This can be confirmed by searching the existing literature • One would expect related genes to appear together in the literature • Due to the sheer volume of literature • Searching is time consuming and error prone • Other complications can arise as well • For example, suppose genes A and C are related and both of them are weakly related to gene B • In the literature, one would then expect that • A and C appear together, and/or • A and B appear together, and/or • B and C appear together • How do we efficiently conclude that A and C are the genes that are actually related?
Our Approach • Using data mining / machine learning techniques to extract useful information from biological data • Different forms of data • Flat-file data • Microarray data • Online literature abstracts • Develop different forms of tools • Layout extractor • Hypergraph mining • Similarity measure among sets of genes
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Learning Layout of a Flat-File • In general – intractable • Try to learn the layout and have a domain expert verify it • Key issue: what delimiters are being used?
Finding Delimiters • Some knowledge from a domain expert is required (semi-automatic) • Naïve approaches • Frequency Counting • Counts frequently occurring single tokens (words separated by spaces) • Sequence Mining • Counts frequently occurring sequences of tokens
Assumptions • Biological datasets are written for humans to read • It is very unlikely that delimiters will be scattered all around, in different places in a line • Position of the possible delimiters might provide useful information • Combination of positional and frequency information might be a better choice
Positional Weight • Let P be the set of different positions in a line where a token can appear • For each position i ∈ P, tot_seq(i,j) represents the total # of token sequences of length j starting at position i • For each position i ∈ P, tot_unique_seq(i,j) represents the total # of unique token sequences of length j starting at position i • For any tuple (i,j), p_ratio(i,j) is defined as the ratio tot_seq(i,j) / tot_unique_seq(i,j) • p_ratio(i,j) can be log normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)
Delimiter score (d_score) • The frequency weight f_wt(s) of a token sequence s of length j starting at position i is obtained by log normalizing its frequency f(s) • Obviously, f_wt(s) ∈ (0,1) • The positional and frequency weights can now be combined to get the d_score as follows, • d_score(s) = α * p_wt(i,j) + (1-α) * f_wt(s) • where α ∈ (0,1) • Thus d_score has the following two properties, • d_score(s) ∈ (0,1) • d_score(s) > d_score(s') implies s is more likely to be a delimiter than s'
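As a rough illustration (not the actual implementation), the following Python sketch computes a d_score for token sequences by combining a positional weight with a frequency weight; the whitespace tokenization, the specific log normalization, and the helper names (log_normalize, delimiter_scores) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def log_normalize(value, max_value):
    """Map a positive count into (0, 1] using a log scale (one possible normalization)."""
    return math.log(1 + value) / math.log(1 + max_value) if max_value > 0 else 0.0

def delimiter_scores(lines, seq_len=1, alpha=0.5):
    """Score every token sequence of length seq_len by combining positional
    repetitiveness (p_wt) with raw frequency (f_wt):
    d_score = alpha * p_wt + (1 - alpha) * f_wt."""
    seq_freq = Counter()                 # f(s): corpus-wide frequency of each sequence
    pos_total = defaultdict(int)         # tot_seq(i, j): sequences starting at position i
    pos_unique = defaultdict(set)        # unique sequences starting at position i
    occurrences = []                     # (position, sequence) pairs for the second pass
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - seq_len + 1):
            seq = tuple(tokens[i:i + seq_len])
            seq_freq[seq] += 1
            pos_total[i] += 1
            pos_unique[i].add(seq)
            occurrences.append((i, seq))
    # positional weight: high when few distinct sequences dominate a position
    ratios = {i: pos_total[i] / len(pos_unique[i]) for i in pos_total}
    max_ratio = max(ratios.values(), default=1.0)
    max_freq = max(seq_freq.values(), default=1)
    scores = {}
    for i, seq in occurrences:
        p_wt = log_normalize(ratios[i], max_ratio)
        f_wt = log_normalize(seq_freq[seq], max_freq)
        scores[seq] = max(scores.get(seq, 0.0), alpha * p_wt + (1 - alpha) * f_wt)
    return scores   # higher d_score -> more likely to be a delimiter
```

Setting α closer to 1 emphasizes positional regularity, while α closer to 0 emphasizes raw frequency, mirroring the trade-off described above.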
Generating layout descriptor • Once the delimiters are identified, an NFA can be built by scanning the whole database, where the delimiters are the states of the NFA • This NFA can be used to generate a layout descriptor since it nicely represents optional and repeating states • The following figure shows an NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C, D being repeating delimiters
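A minimal sketch of this step, assuming each record has already been reduced to its sequence of delimiters; the function name build_layout_nfa and the representation of states as delimiter strings are illustrative choices, not the actual implementation.

```python
from collections import defaultdict

def build_layout_nfa(records):
    """records: list of delimiter sequences, one per data record,
    e.g. [['A', 'B', 'C', 'D', 'C', 'D', 'E'], ['A', 'C', 'D', 'E']].
    Returns (transitions, optional) where transitions maps a delimiter to the
    set of delimiters observed to follow it, and optional is the set of
    delimiters that do not appear in every record."""
    transitions = defaultdict(set)
    seen_in = defaultdict(int)
    for seq in records:
        for cur, nxt in zip(seq, seq[1:]):
            transitions[cur].add(nxt)
        for d in set(seq):
            seen_in[d] += 1
    optional = {d for d, n in seen_in.items() if n < len(records)}
    return dict(transitions), optional

# Example mirroring the slide: B is optional, C D repeats in the first record
transitions, optional = build_layout_nfa(
    [['A', 'B', 'C', 'D', 'C', 'D', 'E'], ['A', 'C', 'D', 'E']])
print(optional)          # {'B'}
print(transitions['D'])  # {'C', 'E'}  -- the back edge D -> C marks repetition
```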
Results • By suitably varying α, a tight superset of the possible delimiters is found • A domain expert can then help to identify the true delimiters • Results were obtained on 3 different flat-file datasets
Comparison with naïve approaches • The d_score based approach clearly outperforms the naïve approaches • The comparison on the same datasets shows the improvement
Realistic Situation • The task of identifying the complete list of correct delimiters is difficult • Most likely we will end up with an incomplete list of delimiters • The delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
Identifying Optional Delimiters • Given an incomplete list of delimiters, how can we identify the missing optional delimiters, if any? • Build an NFA based on the given incomplete information • Perform clustering to identify possible crucial delimiters • Perform contrast analysis
Crucial delimiter • A delimiter is considered crucial if a missing delimiter can appear immediately after it • The goal is to create two clusters, • one containing the delimiters that are not crucial • the other containing the crucial delimiters
Identifying crucial delimiters: a few definitions • Succ(X): set of delimiters that can immediately follow X • Dist_App: # of distinct groups of occurrences of X, grouped by the # of text lines between X and the immediately following delimiter • Info_Tuple (nXi, fXi, tXi): information recorded for each Dist_App group — the # of lines between X and the next delimiter, the frequency of that group, and the associated text • Info_Tuple_List LX: for any X, the list of all such Info_Tuples
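To make these definitions concrete, here is a small sketch that groups the occurrences of a delimiter X by the number of text lines separating X from the next known delimiter, producing Info_Tuples (n, f, t); it assumes, purely for illustration, that delimiters appear at the start of a line, and build_info_tuples is a hypothetical helper name.

```python
from collections import defaultdict

def build_info_tuples(lines, delimiters, X):
    """Group occurrences of delimiter X by the number of text lines between X
    and the immediately following delimiter.  Returns a list of Info_Tuples
    (n, f, t): the gap size, how often that gap occurs, and the intervening
    text collected for that gap size."""
    groups = defaultdict(lambda: [0, []])   # gap -> [frequency, collected lines]
    positions = [i for i, line in enumerate(lines) if line.startswith(X)]
    for pos in positions:
        nxt = next((j for j in range(pos + 1, len(lines))
                    if any(lines[j].startswith(d) for d in delimiters)), len(lines))
        gap = nxt - pos - 1
        groups[gap][0] += 1
        groups[gap][1].extend(lines[pos + 1:nxt])
    return [(gap, freq, "\n".join(text)) for gap, (freq, text) in groups.items()]
```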
Metric for clustering • A ratio rXf is computed for each delimiter X; rXf is likely to be low if an optional delimiter appears immediately after X, and high otherwise • Choose a suitable cut-off value rc and assign delimiters to the two groups as follows, • If rXf < rc, assign X to the group containing possible crucial delimiters • Else assign X to the group containing non-crucial delimiters
Observations and Facts • Missing optional delimiters can appear immediately after crucial delimiters ONLY • Non-crucial delimiters can therefore be pruned away • Consider two Info_Tuples (nX1, fX1, tX1) and (nX2, fX2, tX2) in LX • If a missing delimiter appears immediately after the appearances corresponding to the first tuple but not the second one, then • nX1 > nX2 • The missing delimiter will appear in tX1 but not in tX2
A hypothetical example illustrating Contrast Analysis • Suppose X is a crucial delimiter having 2 Info_Tuples, L1 and L2, as follows, • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows, • S1 = {f1, f5, f6, f8, f13, f21} • S2 = {f1, f4, f6, f7, f8, f10, f13, f21} • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
Contrast Analysis • For any i, j, if nXi > nXj, look for frequently occurring sequences in tXi and tXj, and call these sets fsXi and fsXj respectively • If there exists a frequent sequence fs such that fs ∈ fsXi but fs ∉ fsXj, then fs is quite likely to be a possible missing delimiter • If fs has a fairly high d_score or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
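A minimal sketch of this pairwise contrast step; mine_frequent_sequences (returning a set of frequent sequences for a piece of text) and the precomputed d_score table are hypothetical stand-ins for the components described earlier.

```python
def contrast_analysis(info_tuples, mine_frequent_sequences, d_score, threshold=0.7):
    """info_tuples: list of (n, f, text) for one crucial delimiter X.
    Compare every pair with n_i > n_j and collect sequences that are frequent
    in the larger-gap text but absent from the smaller-gap one."""
    candidates = set()
    mined = [mine_frequent_sequences(text) for _, _, text in info_tuples]
    for i, (n_i, _, _) in enumerate(info_tuples):
        for j, (n_j, _, _) in enumerate(info_tuples):
            if n_i > n_j:
                candidates |= mined[i] - mined[j]
    # keep only candidates with a high d_score; the rest go to a domain expert
    return {fs for fs in candidates if d_score.get(fs, 0.0) >= threshold}
```

Candidates that fall below the d_score threshold would be handed to the domain expert rather than discarded outright.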
Generalized Contrast Analysis • In case of more than two Info_Tuples, compute the mean of all nXi values • Form one group by appending the text from all Info_Tuples whose nXi is greater than or equal to the mean • Form another group by appending the text from all Info_Tuples whose nXi is less than the mean • Perform contrast analysis among all such possible groups
Another example illustrating Generalized Contrast Analysis • Suppose X is a crucial delimiter having 3 Info_Tuples, L1, L2, L3, as follows, • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • L3 = (15, 10, l3.txt) • Mean number of lines = (50 + 20 + 15)/3 ≈ 28.3 • Since only L1 is above the mean, append l2.txt and l3.txt and call the result t2.txt • Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows, • S1 = {f1, f5, f6, f8, f13, f21} • S2 = {f1, f4, f6, f7, f8, f10, f13, f21} • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
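The grouping step could look like the sketch below, which splits the Info_Tuples around the mean gap size and appends the corresponding text; group_by_mean is a hypothetical helper name, and the two resulting groups would then be fed to the pairwise contrast step shown earlier.

```python
from statistics import mean

def group_by_mean(info_tuples):
    """Split Info_Tuples (n, f, text) into two appended-text groups:
    those with n at or above the mean of all n values, and those below."""
    avg = mean(n for n, _, _ in info_tuples)
    above = "\n".join(text for n, _, text in info_tuples if n >= avg)
    below = "\n".join(text for n, _, text in info_tuples if n < avg)
    return above, below

# Example mirroring the slide: n values 50, 20, 15 have mean ~28.3,
# so l1's text forms one group and l2 + l3 are appended into the other.
above, below = group_by_mean([(50, 20, "text of l1"),
                              (20, 12, "text of l2"),
                              (15, 10, "text of l3")])
```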
Results: Optional delimiters • % Pruning
Results: Non-optional Missing delimiters • Even though it was designed for finding optional delimiters, our algorithm works, in some cases, for missing non-optional delimiters too • If a missing non-optional delimiter appears in exactly the same location in each record, then our algorithm fails • If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, then our algorithm works
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Hypergraph Mining • Basic Motivation • To find useful "transitive relations" (hypergraphs) among genes • Example (Gene-Disease Relationship) • Gene A is related to gene B • Gene B is related to gene C • Is gene A related to gene C? • Gene Source • Microarray experiments • Information Source • Online literature abstracts
Formal Problem Definition • Given • A dictionary KT of keywords • A dictionary KM of user-provided keywords (KM ⊂ KT) • A collection of literature abstracts, each abstract represented as a set of keywords • Task • To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from KM that are potentially connected by another set of linking words from KT − KM
Modeling • Purpose • To use an approach similar to frequent itemset mining • Define • total weight = support + cross support • Support: the set of keywords appears together in one document • Cross support: the set of keywords can be partitioned so that each partition appears in a different document • Issues • Since the downclosure property does not hold for total weight, a modified downclosure property can be defined
Idea • Support satisfies the downclosure property • Let X be a set and Ω its power set. A function f : Ω → R+ satisfies the downclosure property if for all A, B ∈ Ω with A ⊃ B, f(B) ≥ f(A) • Cross support can be designed to be restricted below a particular value, i.e., it is bounded • Form a function h as the sum of two functions, h = f + g • f satisfies the downclosure property • g is bounded • h satisfies a modified downclosure property • For any θ ≥ 0, if h(K_n) ≥ θ then, for any subset K_{n−1} ⊂ K_n of size n−1, f(K_{n−1}) ≥ max{0, θ − sup(g)} • This property can be used to devise an efficient algorithm
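The sketch below shows how the modified downclosure bound could drive Apriori-style level-wise pruning; the total_weight callable and the bound g_sup stand in for the total weight and sup(g) defined above, and the candidate generation is the standard level-wise one, not necessarily the algorithm actually used in this work.

```python
from itertools import combinations

def mine_hyperedges(abstracts, keywords, total_weight, theta, g_sup, max_size=4):
    """abstracts: list of keyword sets.  total_weight(kset) = support + cross
    support (hypothetical callable).  Prune candidates level by level using
    the modified downclosure bound: every (n-1)-subset of a surviving n-set
    must have support >= max(0, theta - g_sup)."""
    def support(kset):
        return sum(1 for doc in abstracts if kset <= doc)

    bound = max(0, theta - g_sup)
    level = [frozenset([k]) for k in keywords if support(frozenset([k])) >= bound]
    results = []
    for size in range(2, max_size + 1):
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # keep only candidates all of whose (size-1)-subsets pass the support bound
        level = [c for c in candidates
                 if all(support(frozenset(sub)) >= bound
                        for sub in combinations(c, size - 1))]
        results += [c for c in level if total_weight(c) >= theta]
    return results
```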
Similarity Measure among sets of genes • Each file containing gene names can be considered a discrete random variable (DRV) • Each such DRV can take several values (gene names) • For two such files X, Y and for any pair (x,y), x ∈ X and y ∈ Y, p(x,y) can be computed from online abstracts based on co-occurrence • Defining Z = g(X,Y), Z is also a random variable • The expectation of Z can be used as a similarity measure • Different choices of g give rise to different similarity measures
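For instance, the expected similarity could be estimated as below; the joint distribution obtained by normalizing pairwise co-occurrence counts over abstracts is one illustrative choice, and g is left as a user-supplied function since the slide leaves it open.

```python
def expected_similarity(genes_x, genes_y, abstracts, g):
    """Estimate E[g(X, Y)] where the joint distribution p(x, y) is obtained by
    normalizing pairwise co-occurrence counts over online abstracts.
    abstracts: list of sets of gene names; g: function of a gene pair."""
    counts = {(x, y): sum(1 for doc in abstracts if x in doc and y in doc)
              for x in genes_x for y in genes_y}
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # E[Z] = sum over pairs of p(x, y) * g(x, y)
    return sum((c / total) * g(x, y) for (x, y), c in counts.items())

# Example: with g(x, y) = 1 for every pair the measure is trivially 1;
# a more informative g could, e.g., down-weight very common genes.
```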
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Query Planning for Deepweb Mining • A huge source of online biological information is available in the form of the deep web • An online query form needs to be filled out • The required information is obtained by filling out many such forms from different websites • There might be some dependency among these forms • Requires redundancy elimination
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Semi-supervised Ranking • Ranking • Given a training set of examples with labels / pairwise relationships • The task is to rank an unseen test set, i.e. to get a permutation so that relevant examples are ranked higher than irrelevant ones • This corresponds to learning a ranking function • Semi-supervised Ranking • Incorporating unlabeled examples to learn the ranking function • Out-of-sample extension
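For reference only, here is a tiny sketch of learning a ranking function from pairwise relationships using a linear score and a margin-based update; it ignores unlabeled examples entirely, so a semi-supervised version would add, for example, a smoothness term over all examples. This is an illustration of the problem setup, not the proposed method.

```python
import numpy as np

def learn_ranking_function(X, pairs, epochs=100, lr=0.1):
    """X: (n, d) feature matrix; pairs: list of (i, j) meaning example i
    should be ranked above example j.  Learns a linear score w.x with a
    simple margin-based pairwise update."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i, j in pairs:
            margin = w @ (X[i] - X[j])
            if margin < 1:                      # violated pairwise constraint
                w += lr * (X[i] - X[j])
    return lambda x: x @ w                      # higher score = ranked higher
```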
Potential Application • Following a microarray experiment it might be possible to guess whether gene A is more important than gene B among the genes involved in the experiment • However, determining all possible order relationships is time consuming and error prone • Thus, from a small set of order relationships, and using the other genes from the experiment as unlabeled data, a semi-supervised ranking function can be learned
Roadmap • Motivation • Our Approach • Current Work • Learning Layouts of Flat-file Biological Datasets • Exploratory Tools for Biological Data Analysis • Proposed Work • Deep Web Mining for Biological Data • Semi-supervised Ranking • Multiple-instance Learning • Conclusion
Multiple Instance Learning • Instead of instance-label pairs (x,y), bag-label pairs (B,y) are provided as training data • A bag contains multiple instances • A bag label is negative if every instance in the bag has a negative label • A bag label is positive if there exists at least one instance with a positive label • Given an unseen bag, the task is to predict its label
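To make the bag-label semantics concrete, the sketch below encodes the any-positive rule and a naive prediction rule that labels an unseen bag positive if any instance scores above a threshold under some per-instance scorer; this is a toy baseline, not a real multiple-instance learning algorithm.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one instance in it is positive."""
    return 1 if any(y == 1 for y in instance_labels) else 0

def predict_bag(bag, instance_score, threshold=0.5):
    """Toy prediction rule: label the bag positive if any instance's score
    exceeds the threshold.  instance_score is any per-instance scorer."""
    return 1 if any(instance_score(x) > threshold for x in bag) else 0
```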
Potential Application • Following a microarray experiment it might be possible to form bags of genes with appropriate labels • From different biological labs doing similar experiments, many such bags can be obtained to use as training data • Before designing a new microarray experiment, a gene set can be selected based on multiple instance learning
Summary • Use of data mining / machine learning techniques to extract information from biological data • Work done • Learning layouts of flat-file biological datasets • Hypergraph mining • Similarity measure among sets of genes • Proposed Work • Study and application of machine learning techniques