Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets Kaushik Sinha Xuan Zhang Ruoming Jin Gagan Agrawal
Overall Goal • Informatics tools for biological data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools • Autonomous resources • Heterogeneous data representation & various interfaces • Frequent Updates • Common Situations: • Flat-file datasets • Ad-hoc sharing of data
Current Approaches • Manually written wrappers • Problems • O(N²) wrappers needed; O(N) of them must be rewritten for a single update • Mediator-based integration systems • Problems • Need a common intermediate format • Unnecessary data transformation • Integration using web/grid services • Needs all tools to be web services (all data in XML?)
Our Approach • Automatically generate wrappers • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users • Help biologists write layout descriptors using data mining techniques
Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • What data mining techniques to use?
Wrapper Generation System Overview • Components (from the architecture diagram): Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer
Key Open Questions • How hard is it to write layout descriptors? • Given a flat file, how hard is it to learn its layout? • Can we make the process semi-automatic?
Learning the Layout of a Flat File • In general – intractable • Learn a candidate layout, then have a domain expert verify it • Key issue: what delimiters are being used?
Finding Delimiters • Difficult problem • Some knowledge from a domain expert is required (semi-automatic) • Naïve approaches • Frequency counting • Counts frequently occurring single tokens (words separated by spaces) • Sequence mining • Counts frequently occurring sequences of tokens
Frequency Counting • Problems • Some tokens that appear very frequently are not delimiters • A delimiter could be a sequence of tokens rather than a single token • Possible solution • Use the frequencies of a token sequence and all its subsequences to decide on possible delimiter sequences
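As a concrete illustration, naïve frequency counting amounts to little more than tokenizing each line on whitespace and counting tokens. A minimal Python sketch (the function name and the Pfam-like sample lines are illustrative, not from the slides):

```python
from collections import Counter

def count_token_frequencies(lines):
    """Naive frequency counting: split each line on whitespace and
    count how often each single token occurs across the dataset."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Hypothetical flat-file fragment (not real Pfam data)
sample = [
    "#=GF AC PF00001",
    "#=GF DE Example family",
    "#=GF AC PF00002",
]
freq = count_token_frequencies(sample)
print(freq["#=GF"])  # 3
print(freq["AC"])    # 2
```

As the slide notes, raw counts alone cannot distinguish frequent data tokens from delimiters, which motivates looking at sequences and their subsequences next.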
Sequence Mining Example • For any sequence of tokens s, f(s) represents the frequency of s • Let's say A, B, C are tokens • Case 1: • f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10 • Information about AB, BC, CA is already embedded in ABC • ABC is a possible delimiter but AB, BC, CA are not • Case 2: • f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10 • BC and CA occur less frequently than AB • ABC cannot be a delimiter • AB is a possible delimiter
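The two cases above suggest a simple rule: a sequence is a delimiter candidate only if all of its contiguous subsequences occur with the same frequency (their counts are fully explained by the longer sequence), and it is not itself embedded in a longer sequence of equal frequency. A sketch of that rule under my reading of the slide (function names and the tuple encoding are my own):

```python
def contiguous_subseqs(seq, min_len=2):
    """Proper contiguous subsequences of seq with length >= min_len."""
    n = len(seq)
    return [seq[i:j] for i in range(n) for j in range(i + min_len, n + 1)
            if j - i < n]

def delimiter_candidates(freq):
    """freq maps token tuples to observed counts.  A sequence s is a
    candidate if (a) every contiguous subsequence of s we know about
    occurs exactly as often as s, and (b) s is not embedded in a longer
    sequence of equal frequency."""
    def contained_in(s, t):
        return len(s) < len(t) and any(
            t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))
    out = []
    for s, f in freq.items():
        # unknown subsequences are assumed consistent (default f)
        subs_ok = all(freq.get(u, f) == f for u in contiguous_subseqs(s))
        embedded = any(contained_in(s, t) and ft == f
                       for t, ft in freq.items())
        if subs_ok and not embedded:
            out.append(s)
    return out

case1 = {("A", "B", "C"): 10, ("A", "B"): 10, ("B", "C"): 10}
case2 = {("A", "B", "C"): 10, ("A", "B"): 20, ("B", "C"): 10}
print(delimiter_candidates(case1))  # [('A', 'B', 'C')]
print(delimiter_candidates(case2))  # [('A', 'B')]
```

In Case 1, AB and BC are absorbed into ABC; in Case 2, the extra occurrences of AB rule ABC out and leave AB as the candidate, matching the slide.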
Limitations of Sequence Mining • Does not work very well if token frequencies are distributed in a skewed manner • An example from the Pfam dataset where it fails: • \n, #=GF, AC are tokens with • f(\n,#=GF) >> f(#=GF,AC) • f(\n,#=GF) >> f(\n,#=GF,AC) • \n #=GF is concluded to be a possible delimiter • In reality, \n #=GF AC is the delimiter
Can we do better? • Biological datasets are written for humans to read • It is very unlikely that delimiters will be scattered across different positions within a line • The positions of possible delimiters might therefore provide useful information • A combination of positional and frequency information might be a better choice
Positional Weight • Let P be the set of positions in a line where a token can appear • For each position i ∈ P, tot_seqji represents the total # of token sequences of length j starting at position i • For each position i ∈ P, tot_unique_seqji represents the total # of unique token sequences of length j starting at position i • For any tuple (i,j), p_ratio(i,j) = tot_seqji / tot_unique_seqji • p_ratio(i,j) can be log normalized to obtain the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)
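Under the reading above (p_ratio as total over unique sequences at a position, so repeated sequences at a fixed position score high), the positional weight can be sketched as follows. The normalization used here (dividing each log ratio by the maximum log ratio) is an assumption; the slides only state that p_wt is a log normalization lying in (0,1):

```python
import math
from collections import defaultdict

def positional_weights(lines, j=2):
    """Sketch of p_wt: p_ratio(i, j) = tot_seq / tot_unique_seq for
    length-j token sequences starting at token position i, then
    log-normalised by the maximum log ratio (an assumed normalisation)."""
    tot = defaultdict(int)
    uniq = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - j + 1):
            tot[i] += 1
            uniq[i].add(tuple(tokens[i:i + j]))
    ratio = {i: tot[i] / len(uniq[i]) for i in tot}
    max_log = max(math.log(r) for r in ratio.values())
    if max_log == 0:  # every position equally diverse
        return {i: 0.0 for i in ratio}
    return {i: math.log(r) / max_log for i, r in ratio.items()}

# Hypothetical fragment: position 0 repeats, position 1 varies more
lines = ["#=GF AC X", "#=GF AC Y", "#=GF DE Z"]
w = positional_weights(lines)
print(w)  # position 0 scores highest
```

Position 0 sees the same sequence repeatedly, so its ratio exceeds 1 and it gets the highest weight, which is exactly the delimiter-like behavior the slide describes.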
Delimiter score (d_score) • The frequency weight for any token sequence sji of length j starting at position i, f_wt(sji), is obtained by log normalizing the frequency f(sji) • Obviously, f_wt(sji) ∈ (0,1) • The positional and frequency weights can now be combined to get d_score as follows: • d_score(sji) = α * p_wt(i,j) + (1-α) * f_wt(sji), where α ∈ (0,1) • Thus d_score has the following two properties: • d_score(sji) ∈ (0,1) • d_score(sji) > d_score(sjk) implies sji is more likely to be a delimiter than sjk
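The combination itself is a simple convex blend of the two evidence sources. A minimal sketch; the exact log normalization for f_wt (log f divided by the maximum log f, so the most frequent sequence gets weight 1) is my assumption, since the slides only say f_wt is log-normalized into (0,1):

```python
import math

def f_weights(freqs):
    """Assumed log-normalisation of sequence frequencies: log f divided
    by the maximum log f, so the most frequent sequence scores 1."""
    max_log = max(math.log(f) for f in freqs.values())
    if max_log == 0:
        return {s: 0.0 for s in freqs}
    return {s: math.log(f) / max_log for s, f in freqs.items()}

def d_score(p_wt, f_wt, alpha=0.5):
    """Convex combination of positional and frequency evidence,
    with alpha in (0, 1) balancing the two."""
    return alpha * p_wt + (1 - alpha) * f_wt

# Illustrative frequencies (hypothetical, not from the slides)
f_wt = f_weights({"\\n #=GF": 100, "the": 10})
print(f_wt["\\n #=GF"])             # 1.0
print(d_score(p_wt=1.0, f_wt=1.0))  # 1.0 (strong on both axes)
print(d_score(p_wt=0.2, f_wt=f_wt["the"]))  # low combined score
```

With alpha near 0.5, a sequence must look delimiter-like both positionally and by frequency to score highly, which is what lets d_score beat either signal alone.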
Finding delimiters using d_score • Since the delimiter sequence length is not known in advance, an iterative algorithm is used to obtain a superset S of potential delimiters, where S = ∪i Ni • At any iteration i, ci is a cut-off value, determined by observing a substantial difference in the sorted d_score values • All token sequences with d_score above ci form the set Ni
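One iteration of that cut-off selection can be sketched as follows: sort the d_scores in decreasing order, place the cut at the largest drop between consecutive values, and keep everything above it as Ni. The scores below are illustrative, not measured:

```python
def cutoff_candidates(scores):
    """One iteration of the cut-off selection: c_i sits at the largest
    drop in the sorted d_score values; everything above it forms N_i."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    gaps = [ranked[k][1] - ranked[k + 1][1]
            for k in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # position of the largest drop
    return [seq for seq, _ in ranked[:cut + 1]]

# Hypothetical scores: three delimiter-like sequences, two common words
scores = {"\\n": 0.95, "#=GF": 0.90, "AC": 0.88, "the": 0.30, "of": 0.25}
print(cutoff_candidates(scores))
```

Here the drop from 0.88 to 0.30 is far larger than the gaps among the top three, so the three delimiter-like sequences survive the cut.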
Generating layout descriptor • Once the delimiters are identified, an NFA can be built by scanning the whole dataset, where the delimiters are the states of the NFA • This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states • In the example NFA, A, B, C, D, and E are delimiters, with B being an optional delimiter and C D being a repeating pair
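The optional and repeating properties the NFA encodes can be read directly off the per-record delimiter sequences: a delimiter absent from some record is optional (a skip edge), and one occurring several times within a record is repeating (a back edge). A simplified sketch, not the paper's actual construction:

```python
def classify_states(records):
    """Given the delimiter sequence observed in each record, flag
    delimiters that are optional (absent from some record) or
    repeating (occur more than once within a record)."""
    all_delims = {d for r in records for d in r}
    optional = {d for d in all_delims if any(d not in r for r in records)}
    repeating = {d for d in all_delims
                 if any(r.count(d) > 1 for r in records)}
    return optional, repeating

# The slide's example: B is optional, the pair C D repeats
records = [
    ["A", "B", "C", "D", "C", "D", "E"],
    ["A", "C", "D", "E"],
]
opt, rep = classify_states(records)
print(opt)  # {'B'}
print(sorted(rep))  # ['C', 'D']
```

A full construction would also record the edges between consecutive delimiters to recover the state order, but the optional/repeating flags are the part the layout descriptor's `?`-like and loop constructs need.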
Realistic Situation • Identifying the complete list of correct delimiters is difficult • Most likely we will end up with an incomplete list of delimiters • Delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
Identifying Optional Delimiters • Given an incomplete list of delimiters, how can we identify the optional delimiters, if any? • Build an NFA based on the given incomplete information • Perform clustering to identify possible crucial delimiters • Perform contrast analysis
Crucial delimiter • A delimiter is considered crucial if a missing delimiter may appear immediately after it • The goal is to create two clusters: • one containing the delimiters that are not crucial • the other containing the crucial delimiters
Identifying crucial delimiters: a few definitions • Succ(X): set of delimiters that can immediately follow X • Dist_App: groups of occurrences of X, formed according to the # of text lines between X and the immediately following delimiter • Info_Tuple (nXi, fXi, tXi): the information kept for each Dist_App group • Info_Tuple_List LX: for any X, the list of all its Info_Tuples
Metric for clustering • rXf is likely to be low if an optional delimiter appears immediately after X, and high otherwise • Choose a suitable cut-off value rc and assign delimiters to groups as follows: • If rXf < rc, assign X to the group of possible crucial delimiters • Otherwise, assign X to the group of non-crucial delimiters
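The slide's formula for rXf did not survive the export, but its described behavior (low when an optional delimiter sometimes follows X, so X's occurrences split into several distinct-appearance groups) is consistent with, for example, the dominant group's share of X's total frequency. A purely illustrative sketch under that assumption, with hypothetical Info_Tuples:

```python
def r_f(info_tuples):
    """One plausible form of the clustering metric (assumption, the
    original formula was lost): the share of X's occurrences captured
    by its dominant distinct-appearance group.  When an optional
    delimiter sometimes follows X, occurrences split across groups
    and this share drops."""
    freqs = [f for (_n, f, _t) in info_tuples]
    return max(freqs) / sum(freqs)

# X sometimes followed by an optional delimiter: occurrences split 20/12
print(r_f([(50, 20, "l1.txt"), (20, 12, "l2.txt")]))  # 0.625
# X never followed by one: a single group
print(r_f([(30, 32, "l.txt")]))  # 1.0
```

Any metric with this monotonic behavior would support the same clustering step; the cut-off rc then separates the low-scoring (possibly crucial) delimiters from the rest.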
Observations and Facts • Missing optional delimiters can appear immediately after crucial delimiters ONLY • Non-crucial delimiters can therefore be pruned away • Consider two Info_Tuples (nX1, fX1, tX1) and (nX2, fX2, tX2) in LX • If a missing delimiter appears immediately after the appearances corresponding to the first tuple but not the second, then: • nX1 > nX2 • The missing delimiter will appear in tX1 but not in tX2
A hypothetical example illustrating Contrast Analysis • Suppose X is a crucial delimiter having 2 Info_Tuples, L1 and L2, as follows: • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2: • S1 = { f1, f5, f6, f8, f13, f21 } • S2 = { f1, f4, f6, f7, f8, f10, f13, f21 } • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
Contrast Analysis • For any i, j with nXi > nXj, look for frequently occurring sequences in tXi and tXj; call them fsXi and fsXj respectively • If there exists a frequent sequence fs such that fs ∈ fsXi but fs ∉ fsXj, then fs is quite likely a possible delimiter • If fs has a fairly high d_score, or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
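The core of the step is a set difference between the two frequent-sequence collections, applied in the direction of the larger line count. A minimal sketch using the sets from the hypothetical example above:

```python
def contrast(fs_longer, fs_shorter):
    """Contrast analysis step: sequences frequent in the text after the
    longer appearances (larger n) but absent after the shorter ones are
    candidate missing delimiters, pending a d_score or expert check."""
    return fs_longer - fs_shorter

S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}           # mined from t_Xi
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}  # from t_Xj
print(contrast(S1, S2))  # {'f5'}
```

Note the asymmetry: sequences in S2 but not S1 (f4, f7, f10) are not candidates, because a delimiter missed after the longer appearances would show up on the nXi > nXj side.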
Generalized Contrast Analysis • In the case of more than two Info_Tuples, compute the mean of all the nXi values • Form one group by appending the text from all Info_Tuples with nXi greater than the mean • Form another group by appending the text from all Info_Tuples with nXi at most the mean • Perform contrast analysis between all such possible groups
Another example illustrating Generalized Contrast Analysis • Suppose X is a crucial delimiter having 3 Info_Tuples, L1, L2, L3, as follows: • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • L3 = (15, 10, l3.txt) • Mean number of lines = (50 + 20 + 15) / 3 ≈ 28.3 • Append l2.txt and l3.txt; call the result t2.txt • Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2: • S1 = { f1, f5, f6, f8, f13, f21 } • S2 = { f1, f4, f6, f7, f8, f10, f13, f21 } • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
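The grouping step of the generalized version (splitting the Info_Tuples around the mean line count before contrasting) can be sketched directly on the numbers from the example above:

```python
def split_by_mean(info_tuples):
    """Generalised contrast analysis, step 1: split Info_Tuples around
    the mean line count; the above-mean texts are contrasted against
    the concatenation of the below-mean texts."""
    mean_n = sum(n for n, _f, _t in info_tuples) / len(info_tuples)
    above = [t for n, _f, t in info_tuples if n > mean_n]
    below = [t for n, _f, t in info_tuples if n <= mean_n]
    return mean_n, above, below

tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
mean_n, above, below = split_by_mean(tuples)
print(round(mean_n, 1))  # 28.3
print(above)             # ['l1.txt']
print(below)             # ['l2.txt', 'l3.txt']
```

With the mean at about 28.3 lines, only l1.txt lands in the upper group, so l2.txt and l3.txt are appended into t2.txt and the two-way contrast of the previous slides applies unchanged.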
Results: Optional delimiters (% pruning of non-crucial delimiters)
Results: Non-optional Missing Delimiters • Even though it was designed for finding optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters • If a missing non-optional delimiter appears at exactly the same location in every record, our algorithm fails • If the non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, our algorithm works
Summary • Semi-automatic tool for learning the layout of a flat-file dataset • Mechanism for identifying missing optional delimiters • Automatic tool for wrapper generation • Once the layout descriptor is known • Can ease integration of new/updated sources