260 likes | 338 Views
A Tool for Supporting Integration Across Multiple Flat-File Datasets. Xuan Zhang, Gagan Agrawal Ohio State University. Outline. Motivation System Overview System Implementation Languages Query Execution Experiments. Motivation. Biological researches ask for
E N D
A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University
Outline • Motivation • System Overview • System Implementation • Languages • Query Execution • Experiments
Motivation • Biological researches ask for • Accessing multiple heterogeneous data sources • Lack of common data model, data format • Tracking multiple objects • A motivating example: protein sequence analysis
An Example To predict protein function >unknown sequence. ……MCMFGSSVIECPNPRIWFVWPYEFPLFLLPGGDRMEI…… NCBI protein-protein BLAST service List of similar sequences Clustering Analysis (ClustalW, RiPE, etc)
Current Solution • Manual • Copy-and-paste keyword search • Format conversion programs • NCBI link-out • Database • Load data (BLAST output, sequence database) • Parse input; Re-format output
Our Approach • Join request between BLAST output and SWISSPROT (sequence database) • Data maintained in flat files • Query specification and data description are high-level, declarative • Data parsing and query processing are behind the scene
Advantage • Retrieve multiple pieces of information all at once • Data easily available • Declarative languages only • High flexibility • Low over-head
System Overview query Source/target names Query parser Metadata collection Dataset descriptors Descriptor parser mappings Schema & Layout information Application analyzer QUERYINFOR Source data files Target Data file DataReader DataWriter synchronizer
Outline • Motivation • System Overview • System Implementation • Languages • Query Language • Metadata Description Language • System • Query Analysis • Query Execution • Experiments
Query Language • Declarative, SQL-like • Projection, selection, cross product, join queries • Example AUTOWRAP POSTBLAST FROM BLASTP, SWISSPROT BY BLASTP.SP_ID = SWISSPROT.ID WHERE POSTBLAST.QUERY = BLASTP.QUERY POSTBLAST.SP_AC = BLASTP.SP_AC POSTBLAST.SP_ID = BLASTP.SP_ID POSTBLAST.FULL_DESCR = SWISSPROT.DE POSTBLAST.SEQUENCE = SWISSPORT.SQ POSTBLAST.SCORE = BLASTP.SCORE POSTBLAST.E_VALUE = BLASTP.E_VALUE Target dataset Source datasets Join criteria Attribute pairs
Metadata Description Language • One descriptor for each flat file dataset. • Schema • Layout • Re-usable by different queries • Can be learned semi-automatically using data mining techniques • Example: BLAST output
Schema Descriptors • Written in XML DTD format • Example <?xml version='1.0' encoding='UTF-8'?> <!ELEMENT BLASTP (QUERY, SEQUENCE*)> <!ELEMENT QUERY (#PCDATA)> <!ELEMENT SEQUENCE (SP_AC, SP_ID, DESCR, SCORE, E_VALUE)> <!ELEMENT SP_AC (#PCDATA)> <!ELEMENT SP_ID (#PCDATA)> <!ELEMENT DESCR (#PCDATA)> <!ELEMENT SCORE (#PCDATA)> <!ELEMENT E_VALUE (#PCDATA)>
Layout Descriptors • Example DATASET "BLASTP" { DATATYPE {BLASTP} DATASPACELINESIZE = 90 { … … } DATA {data/Blast_htm.txt} } Dataset name Schema name File layout File location
Layout descriptor Actual data file Description of File Layout "BLASTP" VERSION … … "Query=" QUERY "\nDatabase:" DB_NAME < "\nsp|" SP_AC "|" SP_ID " " DESCR " " SCORE " " E_VALUE > "\n\nALIGNMENT" DUMMY BLASTP2.2.11 [Jun-05-2005] Reference: … … RID: … … Query=Random 50 residue protein sequence. Database:Non-redundant SwissProt sequences 175,661 sequences; 64,716,374 total letters Score E Sequences producing significant alignments: (Bits) Value sp|P11884|AL1A1_SHEEPModification methylase MwoI (N-4 cytosin...30.01.5 sp|P00352|AL1A1_HUMANOxygen-independent coproporphyrinogen II...28.1 5.7 sp|P40530|YIE2_YEASTOxygen-independent coproporphyrinogen II...28.1 5.7 ALIGNMENTS >sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosine-specific … …
Query Analysis query Source/target names Query parser Metadata collection Dataset descriptors Descriptor parser mappings Schema & Layout information Application analyzer Application analyzer QUERYINFOR Source data files Target Data file DataReader DataWriter synchronizer
Terminology • DLM-VAR node/pair • a pairing of a delimiter and an attribute value • E.g "Query=" QUERY • Reach-ability • DLM-VAR node r is reachable from node a iff configuration “ar” is allowed by the layout description • Regular v.s Semi-structured Attribute • Regular: fixed number of values per entry • Semi-structured: various number of values per entry • Number v.s Index • Label for layout node v.s schema node • 1 index/number, 1+ number/index
Application Analyzer • Label Schema and Layout Tree • Query analysis • Record layout information • Delimiter look-up table • Draw correspondence between schema and layout • Label look-up table • Collect constant values in query • Pseudo-label look-up table • Calculate reachable nodes • Reachable look-up table • Other information Parameters QUERYINFOR
QUERY-PROC Structure • Three general action modules • DataReader • DataWriter • Synchronizer • One query-specific data module • QUERYINFOR QUERYINFOR Source 1 Target DataReader DataWriter Source 2 Synchronizer
QUERY-PROC Structure (cont.) • One value buffer • Configuration vary from query to query • Accessible to three general modules QUERY SP_AC SP_ID BLASTP SCORE E_VALUE Source 1 Source 2 Regular Semi-structured
QUERY-PROC Action • DataReader • Extract attribute value • Start: Delimiter look-up table • End: Reachable look-up table • Fill value buffer: Label look-up table • DataWriter • Retrieve from value buffer: Label look-up table • Write target file: Delimiter look-up table • Truncate or wrap: Reachable look-up table + label look-up table
QUERY-PROC Action (cont.) • Synchronizer • Set up pseudo-attributes: Pseudo label look-up table • Call DataReader on source 1 and 2, Call DataWriter on target: Parameters • Test join conditions: Parameters • Clean value buffer: Parameters
Outline Outline • Motivation • System Overview • System Implementation • Languages • System • Experiments
Post-BLAST Query • Enhance BLAST output • Join query between BLAST output and SWISSPROT • Results in FASTA format • 2 modes • UNIQUE: halt once a match is found in source 2 • ALL: search all source 2 entries
Chip-supplement Query • Look up microarray genes information • Join query between protein array and yeast genome database • Results in tabular form • 2 queries • Chip-Supplement: array join genome • Chip-Supplement-Sorted: genome join array
OMIM-plus Query • Add reverse links of proteins to disease database • Join query between OMIM database and SWISSPROT database • Results in OMIM form • 86.38 seconds/entry * 12,158 OMIM entry = 291.7 hours
Summary • A data integration tool • Answers query on flat-file datasets • Light-weighted • Modest programming efforts • No DBMS • Various flat file formats supported