Exploitation of Structural Similarity in Semi-Structured Bioinformatics Data for Efficient Storage Construction Dongkyoo Shin (shindk@sejong.ac.kr) Sejong University, InCob2007
Table of contents • Abstract • Background • Methods • Results • Conclusions Multimedia & Internet Laboratory, Sejong University
Abstract (1)
• Background
  • Much research addresses storing XML data
    • It aims to reduce the number of joins between tables
    • It is not well suited to microarray data, which has a distinctive hierarchy
  • Hierarchical feature of the microarray data model
    • A few core values occur repeatedly
  • New approach for capturing this feature
    • Classify elements with similar structure into a group
    • Design a common database table for each group
Abstract (2)
• Results
  • Database schema created by our approach
    • Reduces the number of table joins remarkably
    • Improves the performance of storing and loading XML-based microarray data
• Conclusions
  • Mining the structural similarity of elements is an efficient way to improve the performance of storing and loading microarray data
Background (1)
• DTD (Document Type Definition)-dependent approach
  • Map one element into one table
    For each e ∈ E, #(S) ≥ 1 OR #(A) ≥ 1 -> define_Class(e)
    For each Se ∈ S -> Add_attributes_of_Class(e)
    Se ∈ SequenceType -> Define_multivalued_att(Se, e)
Background (2)
• Inline technique
  • Reduces the complexity of the DTD (Document Type Definition)
    For each e, #(S) == 1 AND Se ∈ SequenceType -> Add_Multi-valued_attribute_of_ParentClass(e)
Background (3)
• Drawbacks of previous approaches
  • DTD-dependent
    • The database schema has the same complexity as the DTD
  • Inline technique
    • Depends strongly on the number of omissible elements
• New design approach for microarray databases
  • Capture similar structural features of microarray data
  • Needs a fast and simple way to mine the structural features
Background (5)
• Microarray data and MAGE (Microarray Gene Expression) standards
  • Research groups share microarray data with others and use it to answer their biological questions
  • MGED Society's standard definitions
    • MIAME (Minimum Information About a Microarray Experiment)
    • MAGE-OM and MAGE-ML
      • Exchange object model and format for MIAME
  • Structural feature of MAGE-OM
    • A variety of objects define the same data types, including complex types
Background (6)
• Decision tree
  • A simple model that makes classification rules, correlations, and effects between variables easy to understand
  • Well suited to mining structural features of the MAGE-ML DTD itself (not MAGE-ML instances)
  • Can classify all elements into three levels:
    • A root, a mediators group, and a bottoms group
Methods (1)
• Classification of core features using a decision tree
• Terminology for expressing a complexType
  • e: an element defined in the XML schema
  • E: the set of all elements e
  • SE: the set of sub-elements of e
  • a: an attribute of e
  • A: the set of attributes of e
  • SA: the set of attributes of all sub-elements of e
  • complexType: structural information that consists of SE and (or) A of e
  • LowestChild: an element without a sub-element
  • LowestParent: an element with a sub-element that is one of the LowestChild elements
  • PG (Parent Group): a set of candidate elements to be parents of a LowestChild
  • LPCG (The Lowest Parent Candidate Group): a set of candidates to be LowestParent
  • LCG (The Lowest Child Group): a set of LowestChild elements
  • LPG (The Lowest Parent Group): a set of LowestParent elements
  • ULPG (Upper Level Parent Group): a set of upper-level parents, including elements that are neither LowestChild nor LowestParent
Methods (2)
• Expression of a complexType
  • A complexType defines the structural information of an element
  • A set of arrays including data types
• Definition of structural similarity
    SE_{ele_x} = {e_1, e_2, …, e_n}, SA_{ele_x} = {A_{e_1}, A_{e_2}, …, A_{e_n}}
    complexType(ele_x) = {SE_{ele_x}, SA_{ele_x}}
  • Elements ele_x and ele_y are structurally similar when complexType(ele_x) == complexType(ele_y)
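
The similarity test above can be sketched in Python. This is only an illustration under stated assumptions: the `Element` class, the toy schema, and the decision to compare SE and SA as unordered sets are all hypothetical and not taken from the slides, which do not say whether order or multiplicity matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical in-memory view of one schema element declaration."""
    name: str
    sub_elements: tuple = ()  # names of direct sub-elements (SE)
    attributes: tuple = ()    # names of the element's own attributes (A)


def complex_type(element, schema):
    """Return (SE, SA) for `element` as hashable, order-insensitive sets.

    SE is the set of sub-element names. SA pairs each sub-element name
    with the set of that sub-element's attributes.
    """
    se = frozenset(element.sub_elements)
    sa = frozenset(
        (child, frozenset(schema[child].attributes))
        for child in element.sub_elements
    )
    return (se, sa)


def structurally_similar(x, y, schema):
    """True when both elements have the same complexType signature."""
    return complex_type(x, schema) == complex_type(y, schema)


# Toy schema with made-up names (not real MAGE-ML elements).
toy_schema = {
    "p": Element("p", sub_elements=("a", "b")),
    "q": Element("q", sub_elements=("b", "a")),
    "r": Element("r", sub_elements=("a",)),
    "a": Element("a", attributes=("id",)),
    "b": Element("b", attributes=("id", "unit")),
}
```

Here `p` and `q` would count as similar because they share the same sub-elements with the same attributes, while `r` would not.
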
Methods (3)
• Decision tree for recognizing the core features
  • Condition 1: If rule 1 is satisfied, then e arrives at LCG. Otherwise, it arrives at PG.
  • Condition 2: If rule 2 is satisfied, then e and the elements similar to e arrive at a new LCG.
  • Condition 3: If rule 3 is satisfied, then e arrives at LPG. Otherwise, it arrives at ULPG.
  • Condition 4: If rule 4 is satisfied, then e and the elements similar to e arrive at a new LPG.
Methods (4)
• Classification rules
  • Rule 1
    • Decide whether an element belongs to group LCG or PG
    For each ei ∈ E {
      if (number of elements in SEei == 0) {
        ei is classified into LCG;
      } else {
        ei is classified into PG;
      }
    }
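
Rule 1 can be sketched in a few lines of Python. The input format, a dictionary from element name to its list of sub-element names, is an assumption made for illustration, as are the function name and the toy element names.

```python
def rule1(sub_elements_of):
    """Split elements into LCG (no sub-elements) and PG (has sub-elements).

    `sub_elements_of` is a hypothetical mapping: element name -> list of
    the names of its direct sub-elements.
    """
    lcg, pg = [], []
    for name, subs in sub_elements_of.items():
        if len(subs) == 0:
            lcg.append(name)  # LowestChild: no sub-element
        else:
            pg.append(name)   # candidate parent
    return lcg, pg


# Toy input with made-up element names.
toy_elements = {
    "Root": ["Mid"],
    "Mid": ["Leaf1", "Leaf2"],
    "Leaf1": [],
    "Leaf2": [],
}
lcg, pg = rule1(toy_elements)
```
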
Methods (5)
• Classification rules
  • Rule 2
    • Classify multiple sets of LCG
    p = 0;
    For each ei ∈ LCG0 {
      Flag = 0;
      If (p > 0) {
        For q = 1 to p
          If (complexType(ei) == complexType(element in LCGq)) {
            ei is classified into LCGq; Flag = 1;
          }
      }
      If (Flag == 0) {
        For each ej ∈ LCG0, ej ≠ ei
          If (complexType(ei) == complexType(ej)) {
            If (Flag == 0) { p = p + 1; Flag = 1; }
            ei and ej are classified into group LCGp;
          }
      }
    }
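
The grouping in Rule 2 can also be sketched by bucketing elements on a hashable complexType signature instead of comparing them pairwise. This is a swapped-in technique, not the procedure on the slide, and it assumes each element's signature has already been computed. The names `group_by_complex_type` and `signature_of`, and the choice of the attribute set as the signature of a LowestChild, are hypothetical.

```python
from collections import defaultdict


def group_by_complex_type(names, signature_of):
    """Group element names that share a complexType signature.

    Returns (groups, ungrouped): `groups` lists every set of two or more
    similar elements (LCG1..LCGp), and `ungrouped` lists elements with no
    similar partner, which stay in the initial group LCG0.
    """
    buckets = defaultdict(list)
    for name in names:
        buckets[signature_of[name]].append(name)
    groups = [members for members in buckets.values() if len(members) > 1]
    ungrouped = [members[0] for members in buckets.values() if len(members) == 1]
    return groups, ungrouped


# Toy signatures: here, the attribute set of each LowestChild (assumption).
toy_signatures = {
    "Name": frozenset({"value"}),
    "Title": frozenset({"value"}),
    "Id": frozenset({"identifier"}),
    "Label": frozenset({"value"}),
}
groups, ungrouped = group_by_complex_type(list(toy_signatures), toy_signatures)
```

Bucketing visits each element once, whereas the pairwise loop compares every element with every other; both should yield the same groups as long as similarity means exact equality of signatures.
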
Methods (6)
• Classification rules
  • Rule 3
    • Separate elements in PG into two groups: LPG and ULPG
    For each ei ∈ PG {
      if (some sub-element in SEei belongs to LCG) {
        ei is classified into LPG;
      } else {
        ei is classified into ULPG;
      }
    }
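
Rule 3 can be sketched as follows. By default the sketch follows the LowestParent definition given earlier (an element with at least one sub-element in LCG). Because the relation between SEei and LCG could also be read as "every sub-element is in LCG", a hypothetical `require_all` flag exposes that stricter reading. The function name, input format, and toy element names are assumptions.

```python
def rule3(pg, sub_elements_of, lcg, require_all=False):
    """Split PG into LPG and ULPG.

    An element goes to LPG when at least one of its sub-elements is in
    LCG (default), or when all of them are (require_all=True).
    """
    lcg_set = set(lcg)
    check = all if require_all else any
    lpg, ulpg = [], []
    for name in pg:
        hits = [child in lcg_set for child in sub_elements_of[name]]
        if check(hits):
            lpg.append(name)
        else:
            ulpg.append(name)
    return lpg, ulpg


# Toy input with made-up element names.
toy_subs = {
    "Root": ["Mid"],
    "Mid": ["Leaf1", "Leaf2"],
    "Mixed": ["Mid", "Leaf1"],
}
toy_pg = ["Root", "Mid", "Mixed"]
toy_lcg = ["Leaf1", "Leaf2"]
```

The two readings differ only for elements such as `Mixed`, whose sub-elements include both a LowestChild and another parent.
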
Methods (7)
• Classification rules
  • Rule 4
    • Classify multiple sets of LPG
    p = 0;
    For each ei ∈ LPG0 {
      Flag = 0;
      If (p > 0) {
        For q = 1 to p
          If (complexType(ei) == complexType(element in LPGq)) {
            ei is classified into LPGq; Flag = 1;
          }
      }
      If (Flag == 0) {
        For each ej ∈ LPG0, ej ≠ ei
          If (complexType(ei) == complexType(ej)) {
            If (Flag == 0) { p = p + 1; Flag = 1; }
            ei and ej are classified into group LPGp;
          }
      }
    }
Result (1) • Database design by the proposed decision tree
Result (2) • Database space complexity • Time complexity
Result (3) • Reconstructing the XML Document
Conclusions
• Proposed approach
  • Mine elements with structural similarity from the XML Schema for biological information
• Experimental result
  • Mining the structural similarity of the object model is well suited to microarray data and more efficient than previous approaches
• Future work
  • Plan to extend the current classification rules to the root, LCG, LPG, and ULPG respectively