Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses

Sixth International Conference on Bioinformatics (InCoB2007)
Hong Kong, 28th August 2007
Olivo Miotto, Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
Tan Tin Wee, Yong Loo Lin School of Medicine, National University of Singapore
Vladimir Brusic, Cancer Vaccine Center, Dana-Farber Cancer Inst.

Outline
• Knowledge Aggregation in large-scale analysis
• Semantic Technologies for Knowledge Aggregation
• Task: Annotating the Influenza Dataset
• XML-based structural rules
• Rule-based knowledge restructuring
• Discussion and Conclusions

Knowledge Aggregation: Scaling up Bioinformatics
• Bioinformatic analysis is currently limited in scope
  • Usually single domain (single aspect)
  • Mostly small datasets (single genes, or few sequences)
• "Horizontal" scalability: connecting domains
  • Multiple database sources, diversely purposed data
  • Systemic and semantic heterogeneity
  • Discovery by relationship analysis
• "Vertical" scalability: analyzing large datasets
  • Many thousands of records
  • Diversity of geography, tissue types, host, etc.
  • Discovery by comparative analysis
Currently, dataset preparation is manual

Horizontal Scalability
BioHaystack Semantic Web Browser (IBM + MIT)
Quan, D (2004): BioHaystack: Gateway to the Biological Semantic Web. www.w3.org/2004/Talks/0520-em-swa/WWW-2004-BioHaystack-W3C-track.ppt

Vertical Scalability: Mutual Information Analysis
[Figure labels: Identification of Characteristic Sites; Metadata Selection]

Obstacles to Scalability
• Heterogeneity of Biological Databases
  • Systemic: access to data in different databases
  • Syntactic: data formats, use of free text
  • Structural: different table structures in different databases
  • Semantic: data with different meaning and intent
• Semantic Heterogeneity is particularly insidious
  • Data is rarely used in the way it was originally intended
• Low level of end-user technical expertise
  • Biologists, not computer scientists
  • Excel spreadsheets, Web page "scraping"
  • Does not scale up

Knowledge Aggregation: Technology requirements
• To enable large-scale Knowledge Aggregation we need a technology platform with
  • Structural independence
  • Structural adaptability
• To support biological researchers we need a technology platform with
  • Limited infrastructure needs
  • Intuitiveness
  • Easy interchange and transformation
Best current candidate: Semantic Technologies

Semantic Technologies: XML
• XML is a tried-and-tested self-descriptive encoding that supports any data application
• Has a standard software platform for parsing and transforming data
    <struct_refCategory>
      <struct_ref id="1">
        <db_name>UNP</db_name>
        <db_code>HEMA_IAZH3</db_code>
        <pdbx_db_accession>P11134</pdbx_db_accession>
        <entity_id>1</entity_id>
        <pdbx_seq_one_letter_code>
          GLFGAIAGFIENGWEGMIDGWYG
        </pdbx_seq_one_letter_code>
        <pdbx_align_begin>330</pdbx_align_begin>
      </struct_ref>
    </struct_refCategory>
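
As an illustration of the "standard software platform" point, the fragment on this slide can be read with nothing more than a stock XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `struct_refs` is hypothetical and not part of the ABK platform.

```python
import xml.etree.ElementTree as ET

# The XML fragment shown on the slide, copied verbatim.
XML_FRAGMENT = """
<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code>
      GLFGAIAGFIENGWEGMIDGWYG
    </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
"""


def struct_refs(xml_text):
    """Return one dict per <struct_ref>, keyed by child element name.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    records = []
    for ref in root.findall("struct_ref"):
        record = {"id": ref.get("id")}
        for child in ref:
            # Element names are self-descriptive, so they double as keys.
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records


records = struct_refs(XML_FRAGMENT)
print(records[0]["db_code"], records[0]["pdbx_db_accession"])
```

No schema knowledge is needed to get name/value pairs out of the record; interpreting what `pdbx_align_begin` means still requires the schema.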

Semantic Technologies: RDF
RDBMS Table
    ID       Name         DOB         Street                Postcode  Spouse
    S324567  Goh Ah Beng  25/12/1972  127, Orchard Road     243623    S658347
    S885347  Tan Ah Lian  1/1/1975    88, Bukit Timah Road  536564
RDF
    Subject       Property  Value
    S324567       name      Goh Ah Beng
    S324567       dob       25/12/1972
    S324567       address   S324567-home
    S324567-home  street    127, Orchard Road
    S324567-home  postcode  243623
    S324567       spouse    S658347
• RDF defines a very simple universal data structure encoded in XML
Same structure for any kind of data!
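
The table-to-triples mapping on this slide can be sketched in a few lines. This is an illustrative Python sketch, not an RDF library: triples are plain tuples, and the function name `row_to_triples` and the `-home` address node naming simply follow the example on the slide.

```python
def row_to_triples(row):
    """Flatten one relational row into (subject, property, value) triples.

    Hypothetical helper; the address is modelled as its own node,
    as in the slide's example.
    """
    subject = row["ID"]
    home = f"{subject}-home"
    triples = [
        (subject, "name", row["Name"]),
        (subject, "dob", row["DOB"]),
        (subject, "address", home),
        (home, "street", row["Street"]),
        (home, "postcode", row["Postcode"]),
    ]
    # A missing column value simply produces no triple.
    if row.get("Spouse"):
        triples.append((subject, "spouse", row["Spouse"]))
    return triples


row = {
    "ID": "S324567",
    "Name": "Goh Ah Beng",
    "DOB": "25/12/1972",
    "Street": "127, Orchard Road",
    "Postcode": "243623",
    "Spouse": "S658347",
}
triples = row_to_triples(row)
for triple in triples:
    print(triple)
```

Adding a new column to the table changes its structure; adding a new property to the triple list does not, which is the point of "same structure for any kind of data".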

Semantic Technologies: Ontologies
• Ontologies: vocabularies of concepts and properties that describe a field of knowledge
• OWL technology allows users to define ontologies
• Shared ontologies allow interchange of data
• Ontologies support REASONING by means of programs that
  • Read RDF data, encoded using an ontology
  • Apply rules that relate to the described properties
  • Generate new knowledge from these rules

Study goals
• Analyze all influenza protein sequences available
  • GenBank + GenPept = 92,343 documents
  • Final dataset comprises 40,169 unique sequences
• Various types of analysis, e.g.
  • Identify amino acid mutation sites that characterize human-transmissible strains
  • Compare the diversity of viral sequences over different periods of time and geographical areas
• Several metadata fields required
  • Protein name, Subtype, Isolate
  • Host, Country, Year
Manual Curation is not an Option!

Inconsistencies in GenBank records
[Figure panels: Good; Not so Good; Pretty Bad]

Experimental Approach
• Retrieve all influenza A records from GenBank and GenPept in XML format, using ABK platform
  • Miotto O, Tan TW, Brusic V (2005) LNCS 3578, 398-405.
• Use XML structural rules to extract, merge and reconcile the metadata from the records
• Use RDF encoding and an Ontology to encode and structure the resulting metadata
• Use a Reasoner with Semantic Rules to restructure the metadata, and make inferences that improve the consistency

Leveraging XML
• XML offers great advantages for extracting heterogeneous metadata
  • Wide availability
    • Popular encoding for source databases
    • Standard processing software
  • Independence from source schemas
  • Query Language (XPath)
• Some disadvantages
  • Almost unreadable by humans
  • Interpretation of semantics requires understanding the schema

ABK Structural Rules
• Hierarchical value reconciliation
• Automatic formation of XML Structural Rule
• Concise visualization of XML as name/value tree
• Familiar presentation of metadata for biologists
• Point-and-click selection of location and constraints
• Tabulated visualization and manual curation
• RDF storage and output
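
One way to picture a structural rule with value reconciliation is as an ordered list of XPath locations per metadata property, where the first location that yields a value wins. The sketch below is an assumption about the general idea, not the ABK rule format: the `RULES` table, the `apply_rules` function, and the sample record are all hypothetical, and the GBXML-style element names are used for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed GBXML-style record (not real data).
RECORD = """
<GBSeq>
  <GBSeq_organism>Influenza A virus (A/Duck/GD/1234/04(H5N1))</GBSeq_organism>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>source</GBFeature_key>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>China</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""

# Assumed rule table: for each property, XPath locations in priority order.
RULES = {
    "country": [
        ".//GBQualifier[GBQualifier_name='country']/GBQualifier_value",
    ],
    "isolate": [
        ".//GBQualifier[GBQualifier_name='strain']/GBQualifier_value",
        ".//GBQualifier[GBQualifier_name='isolate']/GBQualifier_value",
        "./GBSeq_organism",  # last resort: free-text organism field
    ],
}


def apply_rules(record, rules):
    """Return {property: value}, taking the first location that has text."""
    metadata = {}
    for prop, paths in rules.items():
        for path in paths:
            node = record.find(path)
            if node is not None and node.text and node.text.strip():
                metadata[prop] = node.text.strip()
                break  # higher-priority location found; stop searching
    return metadata


metadata = apply_rules(ET.fromstring(RECORD), RULES)
print(metadata)
```

In this sample the isolate name is only available in the free-text organism field, so the third location is used; a real rule set would also need to parse the name out of that text.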

Structural Rules for Influenza Analysis
Applicable to GBXML (GenBank and GenPept)

Database Performance
GenBank is more thoroughly annotated than GenPept
[Chart series: GenBank; GenPept]

Rule performance
• Multiple rules often needed
• Some properties are very fragmented

Semantic Metadata Restructuring
• Semantic Structure Gap
  • GenBank semantics represent individual sequences
  • A single isolate can comprise multiple sequences
  • As a result, sequences from the same isolate can present metadata discrepancies
• Semantic Restructuring
  • Restructure metadata to relate sequences from the same isolate
  • Implemented using Jena2 (http://jena.sourceforge.net/)
    • Native Jena rule-based reasoner
    • Jena OWL reasoner validates inferences against ontology

Semantic Restructuring
A. Semantics of GenBank
    SequenceRecord record-234567
      isolate      A/Duck/GD/1234/04
      genbankRef   Genbank:123456
      origin       CHINA
      year         2004
      proteinName  NS1
      dnaSequence  record-234567/nt (DnaSequence)
B. Restructured Semantics
    IsolateRecord isolate-a/duck/gd/1234/04
      isolate            A/Duck/GD/1234/04
      origin             CHINA
      year               2004
      hasSequenceRecord  record-234567
    SequenceRecord record-234567
      genbankRef   Genbank:123456
      proteinName  NS1
      dnaSequence  record-234567/nt (DnaSequence)

Restructuring Rules
    # rule1: derive an IsolateRecord URI from each SequenceRecord's isolate name
    # and link the isolate to the sequence record.
    [rule1:
        (?rec rdf:type vg:SequenceRecord)
        (?rec vg:isolate ?isolateId)
        normalizeIsolate(?isolateId, ?nIsoId)
        uriConcat('urn:abk:isolate:', ?nIsoId, ?isolateUri)
        ->
        (?isolateUri rdf:type vg:IsolateRecord)
        (?isolateUri vg:hasSequenceRecord ?rec)
    ]
    # rule2: copy isolate-level properties from the sequence record to the isolate.
    [rule2:
        (?isolateUri vg:hasSequenceRecord ?rec)
        (?rec ?prop ?value)
        oneOf(?prop, vg:isolate, vg:virusSubtype, vg:year, vg:country, vg:hostOrganism)
        ->
        (?isolateUri ?prop ?value)
    ]
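
To make the effect of the two rules concrete, the sketch below mimics them in plain Python over (subject, property, value) tuples. It is not Jena and does not reproduce its forward-chaining engine; it applies rule1 and then rule2 once. `normalize_isolate` is a stand-in for the `normalizeIsolate` builtin, and lowercasing plus trimming is an assumption about what that builtin does.

```python
ISOLATE_PROPS = {
    "vg:isolate", "vg:virusSubtype", "vg:year", "vg:country", "vg:hostOrganism",
}


def normalize_isolate(isolate_id):
    # Assumption: normalization lowercases and trims the isolate name.
    return isolate_id.strip().lower()


def restructure(triples):
    """Apply rule1 then rule2 once and return only the inferred triples."""
    triples = set(triples)
    inferred = set()

    # rule1: one IsolateRecord per isolate name, linked to its sequence records.
    for rec, prop, value in triples:
        if prop == "vg:isolate" and (rec, "rdf:type", "vg:SequenceRecord") in triples:
            uri = "urn:abk:isolate:" + normalize_isolate(value)
            inferred.add((uri, "rdf:type", "vg:IsolateRecord"))
            inferred.add((uri, "vg:hasSequenceRecord", rec))

    # rule2: copy isolate-level properties up from each linked sequence record.
    for uri, prop, rec in list(inferred):
        if prop != "vg:hasSequenceRecord":
            continue
        for subj, p, v in triples:
            if subj == rec and p in ISOLATE_PROPS:
                inferred.add((uri, p, v))
    return inferred


triples = [
    ("record-234567", "rdf:type", "vg:SequenceRecord"),
    ("record-234567", "vg:isolate", "A/Duck/GD/1234/04"),
    ("record-234567", "vg:year", "2004"),
    ("record-234567", "vg:proteinName", "NS1"),
]
inferred = restructure(triples)
for triple in sorted(inferred):
    print(triple)
```

Sequence-level properties such as `vg:proteinName` stay on the sequence record; only the properties listed in `oneOf` are lifted to the isolate.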

Semantic Validation identifies Inconsistencies
A. Two sequence records from the same isolate
    SequenceRecord record-234567890
      isolate      A/Duck/GD/1234/04
      origin       CHINA
      proteinName  NA
    SequenceRecord record-345678901
      isolate      A/Duck/GD/1234/04
      origin       JAPAN
      proteinName  HA
B. After restructuring
    IsolateRecord isolate-a/duck/gd/1234/04
      isolate            A/Duck/GD/1234/04
      origin             CHINA
      origin             JAPAN    <- Multiple Values For Functional Property
      hasSequenceRecord  record-234567890 (SequenceRecord, proteinName NA)
      hasSequenceRecord  record-345678901 (SequenceRecord, proteinName HA)
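
The inconsistency on this slide amounts to a functional property carrying more than one distinct value on the same subject. The sketch below shows that check in plain Python as a simplified stand-in for what an OWL validator reports; the function name `functional_conflicts` is hypothetical, and which properties count as functional is passed in as an assumption rather than read from the ontology.

```python
from collections import defaultdict


def functional_conflicts(triples, functional_props):
    """Return {(subject, property): [values]} where more than one value exists."""
    values = defaultdict(set)
    for subject, prop, value in triples:
        if prop in functional_props:
            values[(subject, prop)].add(value)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


# Restructured isolate from the slide, after both sequence records were merged.
isolate = "isolate-a/duck/gd/1234/04"
triples = [
    (isolate, "isolate", "A/Duck/GD/1234/04"),
    (isolate, "origin", "CHINA"),
    (isolate, "origin", "JAPAN"),
    (isolate, "hasSequenceRecord", "record-234567890"),
    (isolate, "hasSequenceRecord", "record-345678901"),
]

# Assumption: 'isolate' and 'origin' are functional; 'hasSequenceRecord' is not.
conflicts = functional_conflicts(triples, {"isolate", "origin"})
print(conflicts)
```

Each reported conflict points a curator at exactly the records that disagree, instead of requiring a scan of every record for the isolate.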

Isolate Restructuring
Full Genome studies are main contributors

Re-annotation Results
Huge Manual Curation savings

Discussion - 1
• Large-scale metadata recovery from public databases is difficult even for simple requirements
• Relatively simple approaches such as structural rules can do most of the tedious work
  • Accuracy can be further improved with machine learning
• Semantic inferences can improve data quality
  • Significant impact on manual curation task
• Rules have more potential for intuitive end-user GUI than programming
  • cf. email rules, firewall rules

Discussion - 2
• Semantic Technologies are suitable for bioinformatics metadata management today
  • Limited infrastructure requirements
  • Flexibility and extensibility of ontologies (Open World)
• Enormous potential for analysis tool integration
  • Build tools that are "semantically agnostic"
• Reasoning is currently computationally expensive
  • Our simple reasoning tasks exceeded the power of a current desktop when applied to tens of thousands of records
  • Divide-and-conquer strategies were effective, but require manual work, and are not always applicable
  • Reasoning services and computing grids can help scalability, but only if easy to access

Acknowledgements and Thanks
• Institute of Systems Science, NUS
  • Funding support for this conference
• Prof. J Thomas August, Johns Hopkins University
• AT Heiny, NUS
• Partial Grant Support:
  • National Institute of Allergy and Infectious Diseases, NIH
    • Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C
  • ImmunoGrid Project
    • EC Contract FP6-2004-IST-4, No. 028069

Metadata Extraction Ontology (fragment)
[Figure labels: Sequence Record Class; Six Functional Properties]
