Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses

Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses. Sixth International Conference on Bioinformatics (InCoB2007), Hong Kong, 28th August 2007. Olivo Miotto

Presentation Transcript


  1. Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses • Sixth International Conference on Bioinformatics (InCoB2007), Hong Kong, 28th August 2007 • Olivo Miotto (Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore) • Tan Tin Wee (Yong Loo Lin School of Medicine, National University of Singapore) • Vladimir Brusic (Cancer Vaccine Center, Dana-Farber Cancer Inst.)

  2. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions

  3. Outline 1 • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions

  4. Knowledge Aggregation: Scaling up Bioinformatics • Currently, dataset preparation is manual • Bioinformatic analysis is currently limited in scope • Usually single domain (single aspect) • Mostly small datasets (single genes, or few sequences) • "Horizontal" scalability: connecting domains • Multiple database sources, diversely purposed data • Systemic and semantic heterogeneity • Discovery by relationship analysis • "Vertical" scalability: analyzing large datasets • Many thousands of records • Diversity of geography, tissue types, host, etc. • Discovery by comparative analysis

  5. Horizontal Scalability BioHaystack Semantic Web Browser IBM + MIT Quan, D (2004): BioHaystack: Gateway to the Biological Semantic Web www.w3.org/2004/Talks/0520-em-swa/WWW-2004-BioHaystack-W3C-track.ppt

  6. Vertical Scalability: Mutual Information Analysis (figure labels: Identification of Characteristic Sites; Metadata Selection)

  7. Obstacles to Scalability • Heterogeneity of Biological Databases • Systemic: access to data in different databases • Syntactic: data formats, use of free text • Structural: different table structures in different databases • Semantic: data with different meaning and intent • Semantic heterogeneity is particularly insidious • Data is rarely used in the way it was originally intended • Low level of end-user technical expertise • Biologists, not computer scientists • Excel spreadsheets, Web page “scraping” • Does not scale up

  8. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions 2

  9. Knowledge Aggregation: Technology requirements • To enable large-scale Knowledge Aggregation we need a technology platform with • Structural independence • Structural adaptability • To support biological researchers we need a technology platform with • Limited infrastructure needs • Intuitiveness • Easy interchange and transformation • Best current candidate: Semantic Technologies

  10. Semantic Technologies: XML • XML is a tried-and-tested self-descriptive encoding that supports any data application • Has a standard software platform for parsing and transforming data • Example: <struct_refCategory> <struct_ref id="1"> <db_name>UNP</db_name> <db_code>HEMA_IAZH3</db_code> <pdbx_db_accession>P11134</pdbx_db_accession> <entity_id>1</entity_id> <pdbx_seq_one_letter_code> GLFGAIAGFIENGWEGMIDGWYG </pdbx_seq_one_letter_code> <pdbx_align_begin>330</pdbx_align_begin> </struct_ref> </struct_refCategory>
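As a minimal sketch of the "standard software platform" point, the XML fragment from this slide can be read with Python's standard-library parser. The function name `read_struct_refs` and the choice of Python are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# XML fragment copied from the slide (whitespace added for readability).
FRAGMENT = """
<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code>
      GLFGAIAGFIENGWEGMIDGWYG
    </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
"""


def read_struct_refs(xml_text):
    """Return one dict per <struct_ref>, mapping child tag names to stripped text.

    Hypothetical helper for illustration; it relies only on the tag names
    being self-descriptive, not on any schema knowledge.
    """
    root = ET.fromstring(xml_text)
    records = []
    for ref in root.findall("struct_ref"):
        record = {"id": ref.get("id")}
        for child in ref:
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records


records = read_struct_refs(FRAGMENT)
for key, value in records[0].items():
    print(f"{key}: {value}")
```

Because the tags name their own content, the generic loop recovers every field without hard-coding the layout of the record.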

  11. Semantic Technologies: RDF • RDF defines a very simple universal data structure encoded in XML • Same structure for any kind of data!
      RDBMS table:
        ID      | Name        | DOB        | Street               | Postcode | Spouse
        S324567 | Goh Ah Beng | 25/12/1972 | 127, Orchard Road    | 243623   | S658347
        S885347 | Tan Ah Lian | 1/1/1975   | 88, Bukit Timah Road | 536564   |
      RDF:
        Subject      | Property | Value
        S324567      | name     | Goh Ah Beng
        S324567      | dob      | 25/12/1972
        S324567      | address  | S324567-home
        S324567-home | street   | 127, Orchard Road
        S324567-home | postcode | 243623
        S324567      | spouse   | S658347
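The mapping from a table row to subject/property/value statements shown on this slide can be sketched in a few lines of Python. The function `row_to_triples` and the dictionary layout are assumptions made for illustration; the property names and the `-home` address node follow the slide.

```python
def row_to_triples(row):
    """Turn one relational row into (subject, property, value) triples.

    The street and postcode hang off an intermediate address node named
    '<ID>-home', as in the slide's RDF listing.
    """
    subject = row["ID"]
    home = f"{subject}-home"
    triples = [
        (subject, "name", row["Name"]),
        (subject, "dob", row["DOB"]),
        (subject, "address", home),
        (home, "street", row["Street"]),
        (home, "postcode", row["Postcode"]),
    ]
    # An empty cell simply produces no triple; no NULL placeholder is needed.
    if row.get("Spouse"):
        triples.append((subject, "spouse", row["Spouse"]))
    return triples


row = {
    "ID": "S324567",
    "Name": "Goh Ah Beng",
    "DOB": "25/12/1972",
    "Street": "127, Orchard Road",
    "Postcode": "243623",
    "Spouse": "S658347",
}

for triple in row_to_triples(row):
    print(triple)
```

Every fact has the same three-part shape regardless of which table it came from, which is the sense in which the structure is the same for any kind of data.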

  12. Semantic Technologies: Ontologies • Ontologies: vocabularies of concepts and properties that describe a field of knowledge • OWL technology allows user to define ontologies • Shared ontologies allow interchange of data • Ontologies support REASONING by means of programs that • Read RDF data, encoded using an ontology • Apply rules that relate to the described properties • Generate new knowledge from these rules

  13. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions 3

  14. Study goals • Analyze all influenza protein sequences available • GenBank + GenPept = 92,343 documents • Final dataset comprises 40,169 unique sequences • Various types of analysis, e.g. • Identify amino acid mutation sites that characterize human-transmissible strains • Compare the diversity of viral sequences over different periods of time and geographical areas • Several metadata fields required: Protein name, Subtype, Isolate, Host, Country, Year • Manual Curation is not an Option!

  15. Inconsistencies in GenBank records (figure: example records labelled "Good", "Not so Good" and "Pretty Bad")

  16. Experimental Approach • Retrieve all influenza A records from GenBank and GenPept in XML format, using the ABK platform • Miotto O, Tan TW, Brusic V (2005) LNCS 3578, 398-405. • Use XML structural rules to extract, merge and reconcile the metadata from the records • Use RDF encoding and an Ontology to encode and structure the resulting metadata • Use a Reasoner with Semantic Rules to restructure the metadata, and make inferences that improve the consistency

  17. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions 4

  18. Leveraging on XML • XML offers great advantages for extracting heterogeneous metadata • Wide availability • Popular encoding for source databases • Standard processing software • Independence from source schemas • Query Language (XPath) • Some disadvantages • Almost unreadable by humans • Interpretation of semantics requires understanding the schema
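To illustrate how a path query locates a metadata value inside a record, here is a small Python sketch. The element names are modelled on NCBI's GBSeq XML output, but the record itself is a hypothetical fragment built from the isolate used later in the talk, and `qualifier` is an illustrative helper. Python's `xml.etree.ElementTree` supports only a subset of XPath, which is enough for this query.

```python
import xml.etree.ElementTree as ET

# Hypothetical record fragment; element names follow the GBSeq XML layout.
RECORD = """
<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>source</GBFeature_key>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>strain</GBQualifier_name>
          <GBQualifier_value>A/Duck/GD/1234/04</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>China</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""


def qualifier(root, name):
    """Return the value of the first qualifier called `name`, or None."""
    # Select a GBQualifier whose GBQualifier_name child has the given text,
    # then step down to its GBQualifier_value child.
    path = f".//GBQualifier[GBQualifier_name='{name}']/GBQualifier_value"
    node = root.find(path)
    if node is None or not node.text:
        return None
    return node.text.strip()


root = ET.fromstring(RECORD)
print(qualifier(root, "country"))
print(qualifier(root, "strain"))
print(qualifier(root, "host"))
```

The query names a location and a constraint rather than a position in the file, so it keeps working when other elements are added to the record. It also shows the stated disadvantage: writing the path requires knowing what the schema's element names mean.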

  19. ABK Structural Rules • Hierarchical value reconciliation • Automatic formation of XML Structural Rule • Concise visualization of XML as name/value tree • Familiar presentation of metadata for biologists • Point-and-click selection of location and constraints • Tabulated visualization and manual curation • RDF storage and output

  20. Structural Rules for Influenza Analysis Applicable to GBXML (Genbank and Genpept)

  21. Database Performance • GenBank is more thoroughly annotated than GenPept (figure: per-database comparison, GenBank vs. GenPept)

  22. Rule performance Multiple rules often needed Some properties are very fragmented
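One way to picture why several rules per property are needed: the same value may sit in different places in different records, so rules are tried in priority order until one yields a value. The Python sketch below is an assumption-laden illustration, not the ABK implementation. The field names, both rule functions, and the two-digit-year pivot are hypothetical.

```python
import re


def year_from_collection_date(record):
    """Rule A (hypothetical): take a four-digit year from a collection date."""
    match = re.search(r"\b(19|20)\d{2}\b", record.get("collection_date", ""))
    return int(match.group(0)) if match else None


def year_from_isolate_name(record):
    """Rule B (hypothetical): take the year from the last token of the isolate name."""
    token = record.get("isolate", "").rsplit("/", 1)[-1].strip()
    if re.fullmatch(r"(19|20)\d{2}", token):
        return int(token)
    if re.fullmatch(r"\d{2}", token):
        # Assumption: two-digit years up to 07 belong to the 2000s,
        # since the talk dates from 2007; anything else is 19xx.
        two_digits = int(token)
        return 2000 + two_digits if two_digits <= 7 else 1900 + two_digits
    return None


# Rules listed from most to least trusted.
YEAR_RULES = [year_from_collection_date, year_from_isolate_name]


def first_match(record, rules):
    """Apply rules in order; return (value, rule name) for the first hit."""
    for rule in rules:
        value = rule(record)
        if value is not None:
            return value, rule.__name__
    return None, None


print(first_match({"collection_date": "12-Jan-2004"}, YEAR_RULES))
print(first_match({"isolate": "A/Duck/GD/1234/04"}, YEAR_RULES))
print(first_match({}, YEAR_RULES))
```

Recording which rule produced each value makes it possible to count how often each rule fires, which is the kind of per-rule performance this slide reports.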

  23. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions 5

  24. Semantic Metadata Restructuring • Semantic Structure Gap • GenBank semantics represent individual sequences • A single isolate can comprise multiple sequences • As a result, sequences from the same isolate can present metadata discrepancies • Semantic Restructuring • Restructure metadata to relate sequences from the same isolate • Implemented using Jena2 (http://jena.sourceforge.net/) • Native Jena rule-based reasoner • Jena OWL reasoner validates inferences against ontology

  25. Semantic Restructuring (diagram) • A. Semantics of GenBank: SequenceRecord record-234567 carries isolate = A/Duck/GD/1234/04, genbankRef = Genbank:123456, origin = CHINA, year = 2004, proteinName = NS1, dnaSequence = record-234567/nt (a DnaSequence) • B. Restructured Semantics: IsolateRecord isolate-a/duck/gd/1234/04 carries isolate = A/Duck/GD/1234/04, origin = CHINA, year = 2004, hasSequenceRecord = record-234567; SequenceRecord record-234567 keeps genbankRef = Genbank:123456, proteinName = NS1, dnaSequence = record-234567/nt (a DnaSequence)

  26. Restructuring Rules
      [rule1: (?rec rdf:type vg:SequenceRecord)
              (?rec vg:isolate ?isolateId)
              normalizeIsolate(?isolateId, ?nIsoId)
              uriConcat('urn:abk:isolate:', ?nIsoId, ?isolateUri)
           -> (?isolateUri rdf:type vg:IsolateRecord)
              (?isolateUri vg:hasSequenceRecord ?rec) ]
      [rule2: (?isolateUri vg:hasSequenceRecord ?rec)
              (?rec ?prop ?value)
              oneOf(?prop, vg:isolate, vg:virusSubtype, vg:year, vg:country, vg:hostOrganism)
           -> (?isolateUri ?prop ?value) ]
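The effect of these two rules can be imitated in plain Python over a set of triples. This is a sketch of what the rules infer, not the Jena reasoner. In particular, the body of `normalize_isolate` is an assumption: the slides show only that the isolate URI is lower-cased, and the actual `normalizeIsolate` builtin is not given.

```python
# Properties that rule2 copies from a sequence record up to its isolate.
PROMOTED = {"vg:isolate", "vg:virusSubtype", "vg:year", "vg:country", "vg:hostOrganism"}


def normalize_isolate(isolate_id):
    """Assumed normalization: remove whitespace and lower-case the name."""
    return "".join(isolate_id.split()).lower()


def restructure(triples):
    """Return the triples that rule1 and rule2 would add."""
    triples = set(triples)
    records = {s for s, p, o in triples
               if p == "rdf:type" and o == "vg:SequenceRecord"}
    inferred = set()

    # rule1: every sequence record naming an isolate yields an IsolateRecord
    # node linked back to that record.
    for s, p, o in triples:
        if p == "vg:isolate" and s in records:
            uri = "urn:abk:isolate:" + normalize_isolate(o)
            inferred.add((uri, "rdf:type", "vg:IsolateRecord"))
            inferred.add((uri, "vg:hasSequenceRecord", s))

    # rule2: isolate-level properties are copied onto the isolate node.
    links = [(s, o) for s, p, o in inferred if p == "vg:hasSequenceRecord"]
    for uri, rec in links:
        for s, p, o in triples:
            if s == rec and p in PROMOTED:
                inferred.add((uri, p, o))
    return inferred


source = [
    ("record-234567", "rdf:type", "vg:SequenceRecord"),
    ("record-234567", "vg:isolate", "A/Duck/GD/1234/04"),
    ("record-234567", "vg:country", "CHINA"),
    ("record-234567", "vg:year", "2004"),
    ("record-234567", "vg:proteinName", "NS1"),
]

inferred = restructure(source)
for triple in sorted(inferred):
    print(triple)
```

Sequence-level properties such as `vg:proteinName` stay on the sequence record, while the five listed properties become statements about the isolate. Two records that spell the isolate name with different case or spacing end up linked to the same isolate node.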

  27. Semantic Validation identifies Inconsistencies (diagram) • A. Two SequenceRecords share isolate = A/Duck/GD/1234/04: record-234567890 (origin = CHINA, proteinName = NA) and record-345678901 (origin = JAPAN, proteinName = HA) • B. After restructuring, IsolateRecord isolate-a/duck/gd/1234/04 has hasSequenceRecord = record-234567890 and record-345678901, and receives both origin = CHINA and origin = JAPAN • Flagged as: Multiple Values For Functional Property
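The check illustrated on this slide can be approximated without an OWL reasoner: after restructuring, any isolate that holds two different values for a property declared functional (single-valued) is reported. The sketch below assumes `vg:country` and `vg:year` are among the functional properties; the slides do not list them. It is a simplification of OWL validation, which the talk performs with the Jena OWL reasoner.

```python
from collections import defaultdict

# Assumed set of single-valued properties, for illustration only.
FUNCTIONAL = {"vg:country", "vg:year"}


def functional_conflicts(triples, functional_properties):
    """Map (subject, property) to its sorted values when there is more than one."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


isolate = "urn:abk:isolate:a/duck/gd/1234/04"
restructured = [
    (isolate, "vg:hasSequenceRecord", "record-234567890"),
    (isolate, "vg:hasSequenceRecord", "record-345678901"),
    (isolate, "vg:country", "CHINA"),   # promoted from record-234567890
    (isolate, "vg:country", "JAPAN"),   # promoted from record-345678901
    (isolate, "vg:year", "2004"),
]

conflicts = functional_conflicts(restructured, FUNCTIONAL)
for (subject, prop), vals in conflicts.items():
    print(f"{subject} has multiple values for {prop}: {vals}")
```

The discrepancy is invisible while each sequence record is viewed alone and only surfaces once the records are grouped under their isolate. `vg:hasSequenceRecord` is deliberately left out of the functional set, since an isolate is expected to have several sequence records.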

  28. Isolate Restructuring Full Genome studies are main contributors

  29. Re-annotation Results Huge Manual Curation savings

  30. Outline • Knowledge Aggregation in large-scale analysis • Semantic Technologies for Knowledge Aggregation • Task: Annotating the Influenza Dataset • XML-based structural rules • Rule-based knowledge restructuring • Discussion and Conclusions 6

  31. Discussion - 1 • Large-scale metadata recovery from public databases is difficult even for simple requirements • Relatively simple approaches such as structural rules can do most of the tedious work • Accuracy can be further improved with machine learning • Semantic inferences can improve data quality • Significant impact on manual curation task • Rules have more potential for intuitive end-user GUI than programming • cf. email rules, firewall rules

  32. Discussion - 2 • Semantic Technologies are suitable for bioinformatics metadata management today • Limited infrastructure requirements • Flexibility and extensibility of ontologies (Open World) • Enormous potential for analysis tool integration • Build tools that are "semantically agnostic" • Reasoning is currently computationally expensive • Our simple reasoning tasks exceeded the power of a current desktop when applied to tens of thousands of records • Divide-and-conquer strategies were effective, but require manual work, and are not always applicable • Reasoning services and computing grids can help scalability, but only if easy to access

  33. Acknowledgements and Thanks • Institute of Systems Science, NUS • Funding support for this conference • Prof. J Thomas August, Johns Hopkins University • AT Heiny, NUS • Partial Grant Support: • National Institute of Allergy and Infectious Diseases, NIH • Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C • ImmunoGrid Project • EC Contract FP6-2004-IST-4, No. 028069

  34. Metadata Extraction Ontology (fragment) • Sequence Record class • Six functional properties
