Summary of SDM ETC Kickoff for the Data Integration Task
Terence Critchlow, Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk, Tom Potok
People
• People involved:
  • Terence Critchlow (LLNL)
  • Calton Pu (GT)
  • Ling Liu (GT)
  • David Buttler (GT)
  • Bertram Ludaescher (UCSD)
  • Amarnath Gupta (UCSD)
  • TBD:
    • Ph.D. student at Georgia Tech
    • Developer at UCSD
  • Mladen Vouk / Tom Potok (NCSU / ORNL)
• Commitment per institution
  • LLNL
    • 0.25 (likely) – 1.0 FTE
  • Georgia Tech
    • 2 Ph.D. students
    • X months Calton’s time
    • Y months Ling’s time
  • UCSD
    • 1 FTE
    • 1 month Bertram’s time
    • 1 month Gupta’s time
  • Agent team
    • 2-4 months over the course of the year
Application ties
• Primary domain: bioinformatics
• Secondary domains:
  • Material science
  • Air / water quality
• Scientists (early adopters)
  • Matt Coleman (LLNL), contacted by Terence
  • Allen Christian (LLNL), contacted by Terence
  • Phil Bourne (PDB), contacted by Bertram / Gupta
Use Case 1: Finding out everything about a sequence
• Bob starts with one or several DNA or protein sequences that he wants to analyze
  • OR: Bob finds protein or gene sequences of interest by querying databases / web sites for metabolic pathways / cell signaling pathways (e.g., KEGG)
  • OR: Bob looks at a database of microarray experiments and chooses those genes that exhibit specified patterns of co-occurrence (what subsets of genes “go hand in hand” across a large number of experiments)
• The relevant sequences are submitted to one or more sequence databases for BLAST search
• The homologous sequences found in the searched database(s) are
  • directly returned to the user, sorted by score
  • OR: post-processed by the mediator (duplicate elimination, groupings, links to additional contextual data)
• The resulting sequences can be queried for their associated information
• Bob can use these sequences for new similarity searches
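The "sorted by score" and "duplicate elimination" steps of this use case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record, its field names, and the rule "keep the best-scoring copy of each accession" are hypothetical and are not taken from the kickoff material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homologous sequence returned by a source (hypothetical record)."""
    source: str      # assumed source label, e.g. "NCBI"
    accession: str   # assumed unique identifier of the sequence
    score: float     # similarity score reported by the search


def merge_hits(hit_lists):
    """Combine hits from several sources, drop duplicates, sort by score.

    Assumption: two hits are duplicates when they share an accession,
    and the copy with the higher score is the one worth keeping.
    """
    best = {}
    for hits in hit_lists:
        for hit in hits:
            current = best.get(hit.accession)
            if current is None or hit.score > current.score:
                best[hit.accession] = hit
    # Highest score first; accession breaks ties so the order is stable.
    return sorted(best.values(), key=lambda h: (-h.score, h.accession))


if __name__ == "__main__":
    ncbi = [Hit("NCBI", "P1", 90.0), Hit("NCBI", "P2", 50.0)]
    pdb = [Hit("PDB", "P1", 95.0), Hit("PDB", "P3", 70.0)]
    for hit in merge_hits([ncbi, pdb]):
        print(hit.accession, hit.source, hit.score)
```

A real mediator would also group hits and attach links to contextual data; those steps depend on source schemas that the slides do not specify.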
Use Case 1: Additional scenarios
• Helpful features for users
  • Multiple sequences entered through a single file
  • Ability to tie in other programs to preprocess data before passing it to wrappers / mediator
• Follow-up searches may be more than just BLAST searches
  • Selection / project / join queries through the interface
  • Tie in other tools such as RasMol
  • Other types of search such as PHI-BLAST, PSI-BLAST or other structural similarity searches
Data Integration Architecture
[Figure: architecture diagram. Labels shown: External Program (if invoked, pre-processes query parameters and post-processes results); Query Dispatch and Collection (QDaC); Integration component / KB-Mediator (KBM); XML Wrappers; CM Wrappers; sources including Medline (via VIPAR) and PDB; Source / Agent MetaData Registry; XWRAP Wrapper Generator; XQuery interface (subsets, e.g. select/project only) as the API.]
Architecture comments
• Communication protocol:
  • Use agent technology to communicate between components
    • Don’t use full capabilities when on the same machine
  • Between QDaC and wrappers, QDaC and mediator, mediator and CMs, CMs and wrappers
  • NOT expected between wrappers and source
• Embedded representation:
  • XML sources are queried using a subset of XQuery (fragments)
    • Primarily concerned with selection and projection – not join
  • Query results are returned in XML
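To make "selection and projection – not join" concrete, here is a minimal Python sketch of those two operations over an XML document, with the result returned as XML. It is plain Python standing in for the XQuery subset, not an XQuery engine; the element names (`hits`, `hit`, `id`, `score`, `organism`) and the `select_project` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_project(xml_text, record_tag, predicate, fields):
    """Keep records that satisfy `predicate` (selection) and copy only
    the child elements named in `fields` (projection)."""
    root = ET.fromstring(xml_text)
    result = ET.Element("results")
    for record in root.iter(record_tag):
        if not predicate(record):
            continue
        out = ET.SubElement(result, record_tag)
        for name in fields:
            child = record.find(name)
            if child is not None:
                ET.SubElement(out, name).text = child.text
    # Results go back to the caller as XML text.
    return ET.tostring(result, encoding="unicode")


# Hypothetical wrapper output for two BLAST hits.
SAMPLE = (
    "<hits>"
    "<hit><id>P1</id><score>95</score><organism>human</organism></hit>"
    "<hit><id>P2</id><score>40</score><organism>mouse</organism></hit>"
    "</hits>"
)

if __name__ == "__main__":
    print(select_project(
        SAMPLE,
        "hit",
        lambda r: float(r.findtext("score")) >= 50,  # selection
        ["id", "score"],                             # projection
    ))
```

Leaving joins out keeps each wrapper simple: every query touches a single source document, and any combination across sources is left to the components above the wrappers.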
Architecture comments
• Meta-data repository (= metadata server)
  • Contains:
    • Location, schema
    • Query capabilities (BLAST, keyword, XPath) of sources
  • May be duplicated / shared between QDaC and KBM
  • Eventually may be treated as an agent
• External programs
  • Will be included as preprocessing steps
  • May need wrappers to handle translations properly
  • Will be tied in to interface where possible
    • Gives users access to tools they need / want / are familiar with
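One way to picture the meta-data repository is as a small registry that records each source's location, schema, and query capabilities and can be asked which sources support a given kind of query. The sketch below is an assumption-laden illustration: the class names, field names, and the `example.org` URLs are hypothetical placeholders, not the task's actual design.

```python
from dataclasses import dataclass


@dataclass
class SourceEntry:
    """Metadata kept for one source (hypothetical layout)."""
    name: str
    location: str        # where the wrapper can be reached (placeholder URL)
    schema: list         # names of the fields the source exposes
    capabilities: set    # e.g. {"blast", "keyword", "xpath"}


class MetadataRegistry:
    """In-memory stand-in for the meta-data repository."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def sources_with(self, capability):
        """Names of all sources that advertise the given capability."""
        return sorted(
            e.name for e in self._entries.values()
            if capability in e.capabilities
        )


if __name__ == "__main__":
    registry = MetadataRegistry()
    registry.register(SourceEntry(
        "NCBI", "http://example.org/ncbi-wrapper",
        ["id", "score", "sequence"], {"blast", "keyword"}))
    registry.register(SourceEntry(
        "Medline", "http://example.org/medline-wrapper",
        ["pmid", "title", "abstract"], {"keyword"}))
    print(registry.sources_with("blast"))
    print(registry.sources_with("keyword"))
```

Because the registry is just data plus lookups, the same contents could be duplicated or shared between QDaC and KBM, as the slide suggests.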
Architecture comments
• Expect most wrappers to be generated by XWrap in practice, but it shouldn’t matter as long as they follow the specified protocol and representation
  • VIPAR used to wrap publication sources
  • Simple SQL wrapper for direct database access
• Definitions:
  • CM – conceptual mapping: a wrapper that translates source-specific XML into
Year 1 deliverables
• Send XQuery command to BLAST sources, combine results, and return to user interface
  • Interact with at least 4 sources
    • Integration component will have at least 2 sources
    • QDaC will directly query NCBI and at least one other
• Operate QDaC and mediator in a distributed environment
  • Interface / QDaC at LLNL and mediator at UCSD
Have agent stubs at UCSD and LLNL passing text strings within 3 months
Detailed tasks
• Interface (LLNL)
  • Extended to handle BLAST against new sources
    • Some of which are not integrated
• QDaC (LLNL)
  • Identify available wrappers from meta-data
    • This includes the SDSC component
  • Query wrappers using XQuery
  • Collect and sort responses
  • Adopt agent protocol
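The QDaC steps listed above (query the available wrappers, then collect and sort the responses) can be sketched as a small fan-out routine. Everything concrete here is an assumption made for illustration: wrappers are modeled as plain Python callables rather than agents, a response is a list of `(accession, score)` pairs, and a source that fails is simply skipped.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_and_collect(query, wrappers, timeout=30):
    """Send `query` to every wrapper in parallel, then collect and sort.

    `wrappers` maps a source name to a callable taking the query and
    returning a list of (accession, score) pairs (hypothetical protocol).
    """
    collected = []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in wrappers.items()}
        for name, future in futures.items():
            try:
                rows = future.result(timeout=timeout)
            except Exception:
                # Design choice for this sketch: one failing or slow
                # source should not block answers from the others.
                continue
            collected.extend((name, acc, score) for acc, score in rows)
    # Highest score first; source and accession make the order stable.
    return sorted(collected, key=lambda r: (-r[2], r[0], r[1]))


if __name__ == "__main__":
    def source_a(query):
        return [("P1", 90.0)]

    def source_b(query):
        return [("P3", 70.0), ("P1", 95.0)]

    def source_down(query):
        raise RuntimeError("source unavailable")

    wrappers = {"A": source_a, "B": source_b, "C": source_down}
    for row in dispatch_and_collect("MKTAYIAKQR", wrappers):
        print(row)
```

In the planned system the dictionary of wrappers would come from the meta-data lookup, and the calls would travel over the agent protocol instead of direct function calls.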
Detailed tasks
• XWrap (GT)
  • Accept XPath / XQuery input
  • Handle complex BLAST interfaces
  • Adopt agent protocol
• Mediator (UCSD)
  • Model of pathways, gene and protein expressions ==> ontology to be used for driving BLAST queries and interpreting their results
  • Accept XQuery queries
  • Identify available sources from meta-data
  • Modify CM wrappers to generate XQuery commands
• Agent technology (ORNL, LLNL, UCSD)
  • Use VIPAR to wrap Medline database
  • Use protocols to communicate between LLNL and SDSC components
Administrative
• Reports
  • Quarterly reports
    • To be collected by Terence, (possibly) summarized, and forwarded on to Arie
    • Short – bulleted form (Word file or plain text preferred)
• Center-wide communications
  • Telecon 1st Monday of the month, 11:00 – 12:00 PST
    • It is OK to miss this
  • Semi-annual meetings
    • Next at ORNL in mid-March
  • Center web site will point to individual task sites
  • Shared CVS repository at NC State
    • Primarily for major releases / sharing code between tasks
Administrative
• Advisory committee
  • Potential names from bioinformatics area
    • Carole Goble (Univ of Manchester), Tom Slezak (LLNL), ???
  • Unclear who pays travel for members
  • This is for us, so they will not be generating reports
Task specific
• Mail list
  • For our task ONLY
  • sdmctr-integrate@llnl.gov is being set up
  • Will be archived
• Site contacts
  • Terence (LLNL)
  • Bertram (UCSD)
  • Calton (GT)
  • Tom (Agents)
• Web site
  • Being set up at GT
• Use main CVS repository for major releases
• Code sharing option 1
  • Task-only CVS repository for day-to-day work
  • Unlikely LLNL could host this service
• Code sharing option 2
  • Site-specific CVS repositories for day-to-day work
  • Alexandria repository for inter-task code sharing
    • https://www-casc.llnl.gov/alexandria/
    • Disadv: tar-balls
    • Adv: we don’t all need an account on the repository machine