Semantic Mediation in myGrid. Chris Wroe Manchester University. UK e-Science Pilot Project. Oct 2001 – April 2005. £3.4 million. £0.4 million studentships. Newcastle. Sheffield. Manchester. Nottingham. Hinxton. Southampton. Data-intensive bioinformatics.
Semantic Mediation in myGrid Chris Wroe Manchester University
UK e-Science Pilot Project. • Oct 2001 – April 2005. • £3.4 million. • £0.4 million studentships. Newcastle Sheffield Manchester Nottingham Hinxton Southampton
Data-intensive bioinformatics ID MURA_BACSU STANDARD; PRT; 429 AA. DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE DE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE DE ENOLPYRUVYL TRANSFERASE) (EPT). GN MURA OR MURZ. OS BACILLUS SUBTILIS. OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE; OC BACILLUS. KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE. FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY). FT CONFLICT 374 374 S -> A (IN REF. 3). SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Bioinformaticians Tool Providers Service Providers Service Stack Work bench Taverna Talisman Web Portal Applications Gateway Personalisation Registries Service and Workflow Discovery Provenance Event Notification Ontologies Ontology Mgt Views Metadata Mgt Core services myGrid Information Repository OGSA-DQP Distributed Query Processor FreeFluo Workflow Enactment Engine Web Service (Grid Service) communication fabric External services SoapLab GowLab Native Web Services AMBIT Text Extraction Service Legacy apps Legacy apps
Workflow approach Grave’s Disease
Issues
• Connecting web services together
   • Shim services
• Connecting data to web services
   • Data provenance delivered by LSIDs
• Connecting data to data
   • Distributed Query Processing
Technology
• Resource Description Framework (RDF)
   • Representing metadata about data and services
• Web Ontology Language (OWL)
   • Representing concepts and classifications
myGrid & Bioinformatics world
• Automating mainstream, well-known tasks
• Well-known, mature data formats
• Often no formal description of formats
• Lots of code to manipulate formats already exists (BioPerl, BioJava …)
• Semantic mediation is work in progress
Williams-Beuren Syndrome Workflow Explore gap regions within the W-B Critical Region [Workflow diagram: main bioinformatics applications and services linked by SHIM services]
Williams Example (simple) Genbank retrieval service Genscan gene prediction service Semantic level Genbank record has_part genomic sequence genomic sequence in Syntactic level Genbank record FASTA sequence
Sample Genbank Record LOCUS AY214156 1065 bp mRNA linear VRT 07-MAY-2004 DEFINITION Oncorhynchus nerka RH1 opsin mRNA, complete cds. ACCESSION AY214156 VERSION AY214156.1 GI:37787241 KEYWORDS . SOURCE Oncorhynchus nerka (sockeye salmon) ORGANISM Oncorhynchus nerka Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Actinopterygii; Neopterygii; Teleostei; Euteleostei; Protacanthopterygii; Salmoniformes; Salmonidae; Oncorhynchus. REFERENCE 1 (bases 1 to 1065) AUTHORS Dann,S.G., Allison,W.T., Levin,D.B., Taylor,J.S. and Hawryshyn,C.W. TITLE Salmonid opsin sequences undergo positive selection and indicate an alternate evolutionary relationship in oncorhynchus JOURNAL J. Mol. Evol. 58 (4), 400-412 (2004) PUBMED 15114419 REFERENCE 2 (bases 1 to 1065) AUTHORS Dann,S.G., William,A.E., David,L.B. and Craig,H.W. TITLE Direct Submission JOURNAL Submitted (08-JAN-2003) Biology, University of Victoria, PO Box 3020 Stn CSC, Victoria, British Columbia V8W 3N5, Canada FEATURES Location/Qualifiers source 1..1065 /organism="Oncorhynchus nerka" /mol_type="mRNA" /db_xref="taxon:8023" CDS 1..1065 /codon_start=1 /product="RH1 opsin" /protein_id="AAP58347.1" /db_xref="GI:37787242" /translation="MNGTEGPDFYVPMSNATGIVRNPYEYPQYYLVSPAAYSLMAAYM FFLILTGFPINFLTLYVTIEHKKLRTALNYILLNLAVADLFMVIGGFTTTMYTSMHGY FVFGRTGCNIEGFCATHGGEIALWSLVVLAIERWLVVCKPISNFRFSETHAIIGVAFT WVMAAACSVPPLLGWSRYIPEGMQCSCGIDYYTRAPDINNESFVIHMFVVHFMIPLFI ISFCYGNLLCAVKAAAAAQQESETTQRAEREVTRMVIMMVVSFLVCWVPYASVAWYIF CNQGTEFGPVFMTIPAFFAKSSSLYNPLIYVLMNKQFRNCMITTLCCGKNPFEEEEGA STTASKTEASSVSSSSVAPA" ORIGIN 1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt 61 aggaacccct atgaataccc ccagtactac cttgtcagcc cagcggcgta ctcactcatg 121 gctgcctaca tgttcttcct catcctcacc ggcttcccca tcaacttcct cacactctat 181 gtcaccatcg agcacaaaaa gctgaggacc gccctgaact acatcctgct gaacctggct 241 gtggccgatc tcttcatggt aatcggaggc ttcaccacta cgatgtacac ctccatgcat 301 ggctatttcg tctttggaag aacgggctgc aacatcgagg gattctgtgc tacccatggt 361 
ggtgagattg ccctatggtc cctggttgtc ctggctattg agaggtggtt ggtcgtctgc 421 aaacctatta gcaacttccg cttcagtgag acccatgcca tcataggcgt ggcctttacc 481 tgggtcatgg ctgctgcttg ctccgtcccc cctctgcttg ggtggtcccg ctatatcccc 541 gaaggcatgc agtgctcatg tggaattgac tactacacgc gcgcccctga catcaacaat 601 gagtcctttg tcatccacat gttcgttgtc cactttatga ttcccctgtt catcatctcc 661 ttctgctacg gcaacctgct ctgcgctgtc aaggcagctg ccgccgccca gcaggagtct 721 gagaccaccc agagggctga gagggaagtg acccgcatgg tcatcatgat ggtcgtctcc 781 ttcctagtgt gctgggtgcc ctacgccagc gtggcctggt atatcttctg caaccaggga 841 acagagttcg gccccgtctt catgacaatt ccggcattct ttgccaagag ttcgtccctg 901 tacaaccctc tcatctacgt gttgatgaac aagcagttcc gcaactgcat gatcaccacc 961 ctgtgctgtg ggaagaaccc cttcgaggag gaggagggag cctccaccac tgcctccaag 1021 accgaggcct cctccgtgtc ctccagctcc gtggctcctg cataa //
FASTA >gi|37787241|gb|AY214156.1| Oncorhynchus nerka RH1 opsin mRNA, complete cds ATGAACGGCACAGAGGGACCAGATTTCTACGTCCCTATGTCCAATGCTACTGGCATTGTTAGGAACCCCT ATGAATACCCCCAGTACTACCTTGTCAGCCCAGCGGCGTACTCACTCATGGCTGCCTACATGTTCTTCCT CATCCTCACCGGCTTCCCCATCAACTTCCTCACACTCTATGTCACCATCGAGCACAAAAAGCTGAGGACC GCCCTGAACTACATCCTGCTGAACCTGGCTGTGGCCGATCTCTTCATGGTAATCGGAGGCTTCACCACTA CGATGTACACCTCCATGCATGGCTATTTCGTCTTTGGAAGAACGGGCTGCAACATCGAGGGATTCTGTGC TACCCATGGTGGTGAGATTGCCCTATGGTCCCTGGTTGTCCTGGCTATTGAGAGGTGGTTGGTCGTCTGC AAACCTATTAGCAACTTCCGCTTCAGTGAGACCCATGCCATCATAGGCGTGGCCTTTACCTGGGTCATGG CTGCTGCTTGCTCCGTCCCCCCTCTGCTTGGGTGGTCCCGCTATATCCCCGAAGGCATGCAGTGCTCATG TGGAATTGACTACTACACGCGCGCCCCTGACATCAACAATGAGTCCTTTGTCATCCACATGTTCGTTGTC CACTTTATGATTCCCCTGTTCATCATCTCCTTCTGCTACGGCAACCTGCTCTGCGCTGTCAAGGCAGCTG CCGCCGCCCAGCAGGAGTCTGAGACCACCCAGAGGGCTGAGAGGGAAGTGACCCGCATGGTCATCATGAT GGTCGTCTCCTTCCTAGTGTGCTGGGTGCCCTACGCCAGCGTGGCCTGGTATATCTTCTGCAACCAGGGA ACAGAGTTCGGCCCCGTCTTCATGACAATTCCGGCATTCTTTGCCAAGAGTTCGTCCCTGTACAACCCTC TCATCTACGTGTTGATGAACAAGCAGTTCCGCAACTGCATGATCACCACCCTGTGCTGTGGGAAGAACCC CTTCGAGGAGGAGGAGGGAGCCTCCACCACTGCCTCCAAGACCGAGGCCTCCTCCGTGTCCTCCAGCTCC GTGGCTCCTGCATAA
Williams Example (simple) Genbank retrieval service Genscan gene prediction service Genbank service EMBOSS seqret service Semantic level Genbank record has_part genomic sequence genomic sequence in Syntactic level Genbank record FASTA sequence
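The syntactic step on this slide (Genbank record in, FASTA sequence out) can be sketched in a few lines. This is only an illustration of what a shim of this kind does, not the EMBOSS seqret implementation; the function name `genbank_to_fasta` and the simplified header (VERSION plus DEFINITION, without the `gi|…|gb|…|` prefix shown on the FASTA slide) are assumptions made for the example.

```python
import re


def genbank_to_fasta(record: str, width: int = 70) -> str:
    """Hypothetical shim: turn one GenBank flat-file record into a FASTA entry."""
    version = ""
    definition = []
    bases = []
    section = None  # which multi-line field we are currently inside
    for line in record.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line.startswith("ORIGIN"):
            section = "origin"
        elif section == "origin":
            # Sequence lines hold a position number and blocks of bases;
            # keep only the letters.
            bases.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("DEFINITION"):
            section = "definition"
            definition.append(line[len("DEFINITION"):].strip())
        elif section == "definition" and line.startswith(" "):
            definition.append(line.strip())  # continuation of DEFINITION
        else:
            section = None
            fields = line.split()
            if len(fields) > 1 and fields[0] == "VERSION":
                version = fields[1]
    sequence = "".join(bases).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    header = ">" + " ".join([version] + definition).strip()
    return "\n".join([header] + wrapped) + "\n"
```

Everything outside DEFINITION, VERSION and ORIGIN is discarded, which is the point of the slide: the semantic content (a genomic sequence that is part of the record) survives, while the syntax changes to what the downstream service accepts.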
Graves disease Array Express Gene clustering service Semantic level Microarray expression data out Microarray expression data in Syntactic level Affymetrix CEL file Treeview format
Example data

CEL format
CellHeader=X Y MEAN STDV NPIXELS
0 0 112.0   24.4   25
1 0 10699.0 1340.6 20
2 0 147.0   42.4   25
3 0 10602.0 2126.2 25
4 0 100.8   29.9   20
5 0 96.0    11.9   25
6 0 9829.0  1983.4 25
7 0 133.3   21.6   20
8 0 9092.0  1470.7 25

Template
Cell header   Probe ID
2 0           1000_at
5 0           1001_at
2 3           1002_at

Treeview format
Probe_Id   Sample1236
1000_at    147
1001_at    96
1002_at    -59
Graves disease Array Express Gene clustering service Template file AffyR service Semantic level Microarray expression data out Microarray expression data in Syntactic level Affymetrix CEL file Treeview format
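A minimal sketch of the join this slide implies: the template file maps cell coordinates to probe IDs, and the shim looks up each cell's MEAN value in the CEL data. The function name `cel_to_treeview` is hypothetical, and the one-cell-per-probe lookup is a deliberate simplification. A real AffyR-style service would combine several cells per probe set and apply background correction, which is presumably how a negative value such as -59 arises.

```python
def cel_to_treeview(cel_rows, template_rows, sample_name):
    """Hypothetical shim: join CEL intensities to probe IDs via the template.

    cel_rows      -- iterable of (x, y, mean) taken from the CEL file
    template_rows -- iterable of (x, y, probe_id) taken from the template file
    """
    # Index the CEL data by cell coordinate for constant-time lookup.
    intensity = {(x, y): mean for x, y, mean in cel_rows}
    lines = [f"Probe_Id\t{sample_name}"]
    for x, y, probe_id in template_rows:
        if (x, y) in intensity:  # skip probes whose cell is not in the data
            lines.append(f"{probe_id}\t{intensity[(x, y)]:g}")
    return "\n".join(lines)


# Rows copied from the example data slide (only X, Y and MEAN are used).
cel = [(0, 0, 112.0), (1, 0, 10699.0), (2, 0, 147.0), (5, 0, 96.0)]
template = [(2, 0, "1000_at"), (5, 0, "1001_at"), (2, 3, "1002_at")]
print(cel_to_treeview(cel, template, "Sample1236"))
```

With these rows the output contains 1000_at and 1001_at only, because the cell at (2, 3) is not among the CEL rows supplied.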
Classification of shims Defn: experimentally neutral service used to connect domain services that don’t quite fit Shim service FILTER MAPPER DEREFERENCER TRANSLATOR syntax (e.g. GenBank to EMBL) data (e.g. DNA to protein) TRANSFORMER SIFTER (sql SELECT type operation) PARSER (sql PROJECT type operation) - also known as SPLITTER or DECOMPOSER COMPARER SORTER
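To make two of these categories concrete, here is a small sketch of a SIFTER and a PARSER in the relational sense the slide gives them: the sifter keeps whole records that satisfy a condition, the parser keeps only some fields of every record. The function names and the record layout are assumptions for illustration, not part of myGrid.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def sifter(records: Iterable[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """SELECT-type shim: pass through only the records that match."""
    return [r for r in records if keep(r)]


def parser(records: Iterable[Record], fields: List[str]) -> List[Record]:
    """PROJECT-type shim: keep only the named fields of each record."""
    return [{f: r[f] for f in fields} for r in records]


# Hypothetical BLAST-like hits used only to exercise the two shims.
hits = [
    {"accession": "AC005089.3", "score": 831.0, "species": "Homo sapiens"},
    {"accession": "AL163282.2", "score": 44.1, "species": "Homo sapiens"},
]
strong = sifter(hits, lambda r: r["score"] > 100)
accessions = parser(strong, ["accession"])
```

Neither shim changes the experimental meaning of the data, which matches the definition on the slide: they only reshape the output of one service so the next one can consume it.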
Providing more assistance 1. Register Taverna workbench Pedro 3. Query 2. Annotate Taverna workbench
myGrid’s model of services operation name, description input output task method resource application service name, description authororganisation parameter name, description semantic type format transport type collection type collection format workflow WSDL operation WSDL service Soaplab service bioMoby service
Service Description Flow Instance Store Semantic Indexing Component XML document describing service FACT DL reasoner Pedro Discovery Client Extract service descriptions to reason over Registry Jena RDF repository
Pedro XML <serviceDescription> <organisation>http://genetics.man.ac.uk</organisation> <operation> <name>execute</name> <task>http://www.mygrid.org.uk/ontology#pairwise_local_aligning</task> …..
RDF RDF #service http://genetics.man.ac.uk #aligning type published_by a1234 #operation subclass hasOperation “execute” #local_pairwise_aligning type a2 name a3 task Queries possible within RDF repository: Find me an operation called “exec*” Find me a service provided by groups working on Williams disease Find me an operation which performs aligning?
RDF #service http://genetics.man.ac.uk #aligning type published_by a1234 #operation subclass hasOperation “execute” #local_pairwise_aligning type a2 name a3 task Queries not possible: Find me an operation which performs aligning which is local? Where does this service fit into a classification
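The difference between the two kinds of query can be sketched with a toy triple store. The triples below are one reading of the flattened diagram (which node each arc attaches to is an assumption), and `match`, `operations_named` and `operations_performing` are hypothetical helpers, not the Jena API.

```python
TRIPLES = [
    ("a1234", "type", "#service"),
    ("a1234", "published_by", "http://genetics.man.ac.uk"),
    ("a1234", "hasOperation", "a2"),
    ("a2", "type", "#operation"),
    ("a2", "name", "execute"),
    ("a2", "task", "a3"),
    ("a3", "type", "#local_pairwise_aligning"),
    ("#local_pairwise_aligning", "subclass", "#aligning"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in TRIPLES
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]


def superclasses(cls):
    """The class itself plus everything reachable over 'subclass' arcs."""
    seen, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(s=current, p="subclass"):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def operations_named(prefix):
    """Find me an operation called 'exec*'."""
    return [s for s, _, name in match(p="name") if name.startswith(prefix)]


def operations_performing(task_class):
    """Find me an operation which performs a task of the given class."""
    found = []
    for op, _, task in match(p="task"):
        for _, _, cls in match(s=task, p="type"):
            if task_class in superclasses(cls):
                found.append(op)
    return found
```

Both helpers only follow arcs that are stated explicitly. Nothing here says that `#local_pairwise_aligning` is a kind of aligning that is local; that knowledge sits in the name of the class, which is why a query such as "aligning which is local" needs definitions built from properties rather than graph matching alone.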
OWL classes #service Most specific class expression extracted Owl property restriction: hasOperation #operation #local_pairwise_aligning Owl property restriction: performsTask Definition: Service which has an operation which performs the task local pairwise aligning
OWL classes Each service class has its own property based OWL definition service aligning service local aligning service pairwise local aligning service a1234 Instance store indexes our service instance in the appropriate place Classification calculated by the FACT reasoner using property based definitions
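The idea of placing a service instance under the most specific class whose definition it satisfies can be sketched without a reasoner. This toy version stands in for what the slide attributes to FaCT and the instance store: it assumes a single-parent task hierarchy and service classes defined only by the task they perform, and all names in it are illustrative.

```python
# Assumed task hierarchy: child -> parent (None marks the root).
TASK_PARENT = {
    "pairwise_local_aligning": "local_aligning",
    "local_aligning": "aligning",
    "aligning": None,
}

# Each service class is "a service with an operation performing <task>".
SERVICE_CLASSES = {
    "aligning service": "aligning",
    "local aligning service": "local_aligning",
    "pairwise local aligning service": "pairwise_local_aligning",
}


def ancestors_or_self(task):
    """Walk up the task hierarchy, starting with the task itself."""
    chain = []
    while task is not None:
        chain.append(task)
        task = TASK_PARENT.get(task)
    return chain


def classify(service_tasks):
    """Return every service class the instance satisfies, most specific first."""
    matches = []
    for name, required in SERVICE_CLASSES.items():
        if any(required in ancestors_or_self(t) for t in service_tasks):
            matches.append((name, required))
    # A deeper required task means a more specific service class.
    matches.sort(key=lambda m: len(ancestors_or_self(m[1])), reverse=True)
    return [name for name, _ in matches]


# Service a1234 has one operation whose task is pairwise local aligning.
print(classify(["pairwise_local_aligning"])[0])  # pairwise local aligning service
```

A description logic reasoner does considerably more than this (multiple properties, polyhierarchies, consistency checking), but the lookup above shows why the classification falls out of the definitions instead of being asserted by hand.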
Query by navigation Service browser Service classified by task
Use of ontologies
• Property based classification requires property based modelling
• Advantages
   • Explicit, machine interpretable, easier to maintain large ontologies with polyhierarchies
• Disadvantages
   • Complex definitions take time and skill to author, and require expert domain knowledge
   • Difficult to present back to the user
Property based classification on steroids Data nucleic acid sequence data RNA sequence data DNA sequence data
Property based classification on steroids Data Feature nucleic acid sequence nucleic acid sequence data RNA sequence RNA sequence data DNA sequence DNA sequence data encodes
Property based classification on steroids Data Feature Biological Concept nucleic acid nucleic acid sequence nucleic acid sequence data RNA RNA sequence RNA sequence data DNA DNA sequence DNA sequence data encodes sequence_of
Property based classification on steroids Data Feature Biological Concept nucleic acid nucleotide nucleic acid sequence nucleic acid sequence data RNA ribonucleotide RNA sequence RNA sequence data DNA deoxyribonucleotide DNA sequence DNA sequence data encodes sequence_of polymer_of
Property based classification on steroids Data Feature Biological Concept nucleic acid nucleotide nucleic acid sequence nucleic acid sequence data RNA ribonucleotide RNA sequence RNA sequence data DNA deoxyribonucleotide DNA sequence DNA sequence data encodes sequence_of polymer_of
Human readable ontologies GROWL parser GROWL renderer OWL API OWL API Reasoner
Only data to hand • Metadata associated with data items. • Life science identifier (LSID) protocol used to retrieve metadata. • Metadata model similar to service parameter Data item name, description semantic type format collection type collection format
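An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional revision, so an identifier such as urn:lsid:taverna:datathing:13 can be split into its parts before any metadata lookup. The sketch below only parses the identifier; it does not implement the LSID resolution protocol, and the names `LSID` and `parse_lsid` are assumptions for the example.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


print(parse_lsid("urn:lsid:taverna:datathing:13"))
```

The authority part tells a client which resolver to contact; the metadata returned for the identifier would then carry the fields listed on this slide (name, description, semantic type, format, collection type, collection format).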
Organisation level provenance Process level provenance Provenance (1) Service Project runBy e.g. BLAST @ NCBI Experiment design Process Workflow design componentProcess e.g. web service invocation of BLAST @ NCBI partOf Event instanceOf componentEvent e.g. completion of a web service invocation at 12.04pm Workflow run Data/ knowledge level provenance knowledge statements e.g. similar protein sequence to run for User can add templates to each workflow process to determine links between data items. Data item Person Organisation Data item Data item data derivation e.g. output data derived from input data
..masked_sequence_of .. nucleotide_sequence project ..part_of organisation >gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAG GAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTC AAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCT GTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG experiment definition rdf:type ..part_of group urn:lsid:taverna:datathing:13 ..part_of ..author workflow definition ..works_for ..invocation_of ..author person ..BLAST_Report workflow invocation ..similar_sequences_to ..run_for ..run_during service description rdf:type 19747251 AC005089.3 831 Homo sapiens BAC clone CTA-315H11 from 7, complete sequence 15145617 AC073846.6 815 Homo sapiens BAC clone RP11-622P13 from 7, complete sequence 15384807 AL365366.20 46.1 Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence 7717376 AL163282.2 44.1 Homo sapiens chromosome 21 segment HS21C082 16304790 AL133523.5 44.1 Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence 34367431 BX648272.1 44.1 Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119) 5629923 AC007298.17 44.1 Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence 34533695 AK126986.1 44.1 Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486 20377057 AC069363.10 44.1 Homo sapiens chromosome 17, clone RP11-104J23, complete sequence 4191263 AL031674.1 44.1 Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence 17977487 AC093690.5 44.1 Homo sapiens BAC clone RP11-731I19 from 2, complete sequence 17048246 AC012568.7 44.1 Homo sapiens chromosome 15, clone RP11-342M21, complete sequence 14485328 AL355339.7 44.1 Human DNA sequence from clone 
RP11-461K13 on chromosome 10, complete sequence 5757554 AC007074.2 44.1 Homo sapiens PAC clone RP3-368G6 from X, complete sequence 4176355 AC005509.1 44.1 Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence 2829108 AF042090.1 44.1 Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence urn:lsid:taverna:datathing:15 service invocation ..described_by ..created_by ..filtered_version_of A B Provenance tracking Relationship BLAST report has with other items in the repository Other classes of information related to BLAST report
Using IBM’s Haystack GenBank record Portion of the Web of provenance Managing collection of sequences for review
Storage • LSID has no protocol for storage • Taverna/ Freefluo implements its own data/ metadata storage protocol Publish interface Taverna/ Freefluo Metadata Store data Data store metadata
Retrieval • LSID protocol used to retrieve data and metadata • Query handled separately RDF aware client LSID aware client Query LSID interface Metadata Store Data store
Queries within Workflows
Semantic content of result depends on query and data source schema
query → Grid Data Service → result
• Select GO_ID FROM GO WHERE GO.term LIKE “enzyme activity”; → Gene ontology term ID
• Select GO_Annotation_ID FROM GOA WHERE GO.term LIKE “enzyme activity”; → protein ID
Distributed Query Processing • DQP linked with the OGSA-DAI activity • Built within myGrid project • Plans execution of a query over multiple Grid Data Services • Each Grid Data Service provides schema metadata • Currently no semantic mediation
Example query DQP Plan
• select p.proteinId, blast(p.sequence) from p in protein, t in proteinTerm where t.termId = 'GO:0008372' and p.proteinId = t.proteinId
• “Select proteins and homologous proteins from SWISS-PROT which have been annotated with GO:0008372”
Join: t.proteinId = p.proteinId. Both sides are data encoding the identity of a protein in the SWISS-PROT namespace; t comes from the Gene ontology database, p from the SWISS-PROT protein database.
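What the example query computes can be sketched locally as a filter on proteinTerm followed by a join on proteinId and a call to blast for each surviving protein. This says nothing about how DQP actually partitions the plan across Grid Data Services; `run_query` is a hypothetical name, the rows are invented, and `blast` is a stub standing in for a real sequence search.

```python
def blast(sequence):
    """Stub for the blast() call in the query; a real plan would invoke a service."""
    return f"homologues_of({sequence})"


def run_query(protein, protein_term, term_id):
    """Select p.proteinId, blast(p.sequence) for proteins annotated with term_id."""
    # Filter first: IDs of proteins annotated with the requested GO term.
    annotated = {t["proteinId"] for t in protein_term if t["termId"] == term_id}
    # Join on proteinId, then apply blast to each surviving sequence.
    return [
        (p["proteinId"], blast(p["sequence"]))
        for p in protein
        if p["proteinId"] in annotated
    ]


# Invented rows; only the column names come from the query on the slide.
protein = [
    {"proteinId": "P1", "sequence": "MEKLNIAGGD"},
    {"proteinId": "P2", "sequence": "SLNGTVHISG"},
]
protein_term = [
    {"proteinId": "P1", "termId": "GO:0008372"},
    {"proteinId": "P2", "termId": "GO:0000001"},
]
rows = run_query(protein, protein_term, "GO:0008372")
```

The join is only meaningful because both proteinId columns identify proteins in the same namespace, which is exactly the semantic fact that the schema metadata alone does not record.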
TAMBIS I Query 1: Select motifs for antigenic human proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as lard). Translation: Select patterns in the proteins that invoke an immunological response and participate in programmed cell death that are similar in their sequence of amino acids to the protein that is associated with triggering cell death in the white cells of the immune system. (A) Ontology expression: Motif which <isComponentOf (Protein which <hasOrganismClassification Species functionsInProcess Apoptosis hasFunction Antigen isHomologousTo Protein which <hasName ProteinName>)>)> Species: Is instantiated by value “human” ProteinName: Is instantiated by value “lard”
TAMBIS II
• Informal query plan:
   • Select proteins with protein name “lard” from SWISS-PROT
   • Execute a BLAST sequence alignment process against SWISS-PROT results
   • Check the entries for apoptosis process and antigen function
   • Pass the resultant sequences to PROSITE to scan for their motifs
• CPL expression:
   set-unique {(#motif1:motif1) |
      \protein3 <- get-sp-entries-by-de("lard"),
      \protein2 <- do-blastp-by-sq-in-entry(protein3),
      check-sp-entries-by-kwd("apoptosis",protein2),
      check-sp-entries-by-de("antigen",protein2),
      check-sp-entry-for-species("human",protein2),
      \motif1 <- do-ps-scan-by-sq-in-entry(protein2)}
select p.proteinId, blast(p.sequence) from p in protein, t in proteinTerm where t.termId = 'GO:0008372' and p.proteinId = t.proteinId
• How we did it in the past
   • Service type directory
• How we currently plan to do it
   • Shims, GenBank, microarray
• How we may want to do it in the future
   • DQP & TAMBIS
Overview • We’re not attacking the same problem • When would your problem become our problem • Common descriptions of the core entities involved. • Data items, Datasets, Services.