Large-scale knowledge aggregation for infectious diseases
ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008
Olivo Miotto, Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
Large-scale Research Questions
• What can we learn from large-scale studies of pathogens?
• Does H5N1 avian influenza have pandemic potential?
• What makes human flu different from avian flu?
• What are stable potential immune epitopes to use as vaccine candidates for influenza?
• How does each serotype of dengue differ from all others?
These questions call for:
• Large scale
• Statistical evidence
• Historical data
• Systematic analysis
We need Metadata!
• Metadata = descriptive data about sequences
• If you want to compare avian vs human, you need host organism information
• If you want conservation analysis, you need serotype and host information
• If you want to study a period of virus evolution, you need date information
• If you want a balanced dataset, you may need to filter according to country, date, subtype
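The dependence of each analysis on metadata can be sketched in a few lines. The example below is only an illustration: the `SequenceRecord` class, the `select` helper, and the records themselves (including their identifiers and field values) are hypothetical and not taken from the talk. It shows that a comparison such as "human H5N1 between 2003 and 2007" is only possible for records whose host, subtype and year are actually known.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SequenceRecord:
    accession: str                 # hypothetical identifiers, not real accessions
    host: Optional[str] = None
    subtype: Optional[str] = None
    country: Optional[str] = None
    year: Optional[int] = None


def select(records, host=None, subtype=None,
           years: Optional[Tuple[int, int]] = None):
    """Keep records that match every criterion given.

    A record that lacks a field needed by a criterion is dropped,
    which is why missing metadata shrinks the usable dataset.
    """
    kept = []
    for rec in records:
        if host is not None and rec.host != host:
            continue
        if subtype is not None and rec.subtype != subtype:
            continue
        if years is not None:
            if rec.year is None or not (years[0] <= rec.year <= years[1]):
                continue
        kept.append(rec)
    return kept


# Illustrative records only.
records = [
    SequenceRecord("REC1", host="Avian", subtype="H5N1", country="Vietnam", year=2004),
    SequenceRecord("REC2", host="Human", subtype="H5N1", country="Thailand", year=2005),
    SequenceRecord("REC3", host="Human", subtype="H3N2", year=1999),
    SequenceRecord("REC4", subtype="H5N1", year=2006),  # host unknown
]

human_h5n1 = select(records, host="Human", subtype="H5N1", years=(2003, 2007))
```

Here `REC4` is excluded from the host-based selection even though it is an H5N1 sequence from the right period, simply because its host was never recorded.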
Knowledge Mining
[Figure: knowledge mining workflow.
Components: Knowledge Aggregation, Active Text Mining, Characteristic Mutations Analysis, Conservation Analysis.
Tasks: extract desired source knowledge from public databases; identify biomedical literature with cross-reactivity information; identify mutations in H5N1 that characterize transmissibility amongst humans; identify evolutionarily stable regions across subgroups.
Other labels: Public Database Records; Biomedical Text Documents with Cross-reactivity information; Viral Sequence and Metadata; Viral Protein References; Cross-reference Identifiers; User-defined Queries, Dictionaries, Patterns, Extraction Rules and Priorities; Curator's Knowledge; Previous Annotations; H5N1 mutation map; Evidence of strain co-circulation; Epitope Vaccine Candidates.]
Scalability in Bioinformatics Knowledge Mining
• Integrative scalability
  • We need to integrate heterogeneous information from multiple data repositories built for different purposes
• Quantitative scalability
  • We need methods that can leverage and effectively explore large-scale data sets
• Hierarchical scalability
  • We need to cascade analysis tasks, flowing knowledge from one task to the next
Obstacles to Scalability
• Heterogeneity of biological databases
  • Systemic: access to data in different databases
  • Syntactic: data formats, use of free text
  • Structural: different table structures in different databases
  • Semantic: data with different meaning and intent
• Semantic heterogeneity is particularly insidious
  • Data is rarely used in the way it was originally intended
• Low level of end-user technical expertise
  • Biologists, not computer scientists
  • Excel spreadsheets, Web page "scraping"
  • Does not scale up
Semantic Heterogeneity in GenBank
[Figure: example GenBank records labeled "Good", "Not so Good" and "Pretty Bad"]
Semantic Heterogeneity in GenBank
• Fields (e.g. country/date) are inconsistently encoded
  • Inconsistent level of detail between databases
  • Inconsistent field location within different records of the same database
  • Implicit encoding of the data (e.g. within the title of a publication)
  • Multiple usage of the same field

Usage of the isolation_source field in different GenPept records:
BAC77216   /isolation_source="Samoa"
AAN74539   /isolation_source="isolated in 1993"
AAT85667   /isolation_source="Homo sapiens"
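The three `isolation_source` values above carry a country, a date and a host respectively. A minimal sketch of how software might guess what kind of metadata such a free-text value holds is given below. The dictionaries, the year pattern and the function name are assumptions made for illustration; they are not part of ABK, and a real curation effort would need far larger vocabularies.

```python
import re

# Hypothetical, deliberately tiny dictionaries for illustration only.
KNOWN_COUNTRIES = {"samoa", "singapore", "china"}
KNOWN_HOSTS = {"homo sapiens", "gallus gallus"}

# Four-digit years from 1800 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def classify_isolation_source(value):
    """Guess which kind of metadata a free-text isolation_source value carries."""
    cleaned = value.strip()
    text = cleaned.lower()
    if text in KNOWN_HOSTS:
        return ("host", cleaned)
    if text in KNOWN_COUNTRIES:
        return ("country", cleaned)
    match = YEAR_PATTERN.search(text)
    if match:
        return ("year", match.group(1))
    return ("unknown", cleaned)


# The three GenPept examples from the slide.
records = {
    "BAC77216": "Samoa",
    "AAN74539": "isolated in 1993",
    "AAT85667": "Homo sapiens",
}
classified = {acc: classify_isolation_source(v) for acc, v in records.items()}
```

Values that match none of the heuristics are returned as `"unknown"` rather than guessed, so that they can be routed to a curator.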
Influenza Large-Scale Studies
• Analyze all influenza protein sequences available
  • GenBank + GenPept = 92,343 documents
  • Final dataset comprises 40,169 unique sequences
• Various types of analysis, e.g.
  • Identify amino acid mutation sites that characterize human-transmissible strains
  • Compare the diversity of viral sequences over different periods of time and geographical areas
• Several metadata fields required
  • Protein name, Subtype, Isolate
  • Host, Country, Year
Manual curation is not an option!
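Going from tens of thousands of documents to a smaller set of unique sequences implies that several documents describe the same sequence. The talk does not say how duplicates were detected; the sketch below assumes the simplest possible criterion, exact identity of the sequence string after whitespace and case are normalized, and uses made-up identifiers and sequences.

```python
from collections import defaultdict


def group_by_sequence(documents):
    """Group document identifiers by identical sequence.

    documents: iterable of (identifier, sequence) pairs.
    Returns {normalized_sequence: [identifiers]}.
    Assumption: two documents are duplicates only if their sequences
    are identical once whitespace and letter case are normalized.
    """
    groups = defaultdict(list)
    for identifier, sequence in documents:
        normalized = "".join(sequence.split()).upper()
        groups[normalized].append(identifier)
    return dict(groups)


# Hypothetical documents: DOC1 and DOC2 describe the same sequence.
documents = [
    ("DOC1", "MERIKELR"),
    ("DOC2", "merik elr"),
    ("DOC3", "MERIKELK"),
]
unique = group_by_sequence(documents)
```

Keeping the list of identifiers for each unique sequence matters, because the metadata needed for a sequence may be spread across its duplicate documents.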
The Aggregator of Biological Knowledge (ABK)
[Figure: ABK as a KDD system. Blocks: Public Repositories, Data Collection, Data Management, Data Analysis, Researcher. Arrow labels: input, augment, query, manage, control, filter.]
• An end-user environment for data retrieval, extraction and analysis
• Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed
• Wrapper framework provides access to multiple sources
• Manages extracted results
• Offers plug-in architecture for analysis tools
ABK Structural Rules
• Concise visualization of XML as name/value tree
• Familiar presentation of metadata for biologists
• Point-and-click selection of location and constraints
• Automatic formation of XML structural rule
• Hierarchical value reconciliation
• Tabulated visualization and manual curation
• RDF storage and output
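The idea of a structural rule, a location in the XML tree plus constraints on it, can be sketched with Python's standard `xml.etree.ElementTree` module. This is not ABK's rule format, which the slides do not show. The record layout, element names and rule list below are hypothetical and much simpler than real GenBank XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record; real GenBank XML is far more deeply nested.
RECORD_XML = """
<record accession="BAC77216">
  <feature key="source">
    <qualifier name="organism">Influenza A virus</qualifier>
    <qualifier name="isolation_source">Samoa</qualifier>
  </feature>
</record>
"""

# Each rule pairs a property name with a path that encodes both the
# location (feature/qualifier) and the constraints (attribute values).
# Several rules may target the same property, in order of preference.
RULES = [
    ("country", "./feature[@key='source']/qualifier[@name='country']"),
    ("country", "./feature[@key='source']/qualifier[@name='isolation_source']"),
    ("organism", "./feature[@key='source']/qualifier[@name='organism']"),
]


def apply_rules(record, rules):
    """Return {property: [values]} with values listed in rule order."""
    extracted = {}
    for prop, path in rules:
        for node in record.findall(path):
            if node.text and node.text.strip():
                extracted.setdefault(prop, []).append(node.text.strip())
    return extracted


record = ET.fromstring(RECORD_XML)
metadata = apply_rules(record, RULES)
```

In this record the first `country` rule finds nothing, so the value comes from the fallback rule on `isolation_source`, which illustrates why a single property may need more than one rule.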
Data Extraction and Cleaning
Example shown: DENV-1 sequences
• Values produced by user-defined rules
• Different rules (or different documents) produced conflicting values
• User can fill in or override values
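One way to handle values coming from several rules is to rank the rules, take the most trusted non-empty value, flag disagreements, and let a curator's entry win over everything. The function below is a sketch under those assumptions; the priority scheme and status names are invented for illustration and do not describe ABK's actual reconciliation logic.

```python
def reconcile(candidates, override=None):
    """Pick one value for a property from several rule outputs.

    candidates: list of (priority, value); a lower priority number
                means a more trusted rule (an assumption of this sketch).
    override:   value entered manually by the curator, if any.
    Returns (value, status) where status is one of
    "curated", "ok", "conflict" or "missing".
    """
    if override is not None:
        return override, "curated"
    ranked = sorted(candidates, key=lambda c: c[0])
    values = [value for _, value in ranked if value]
    if not values:
        return None, "missing"
    if len(set(values)) == 1:
        return values[0], "ok"
    # Rules disagree: keep the most trusted value but flag it for review.
    return values[0], "conflict"


agreeing = reconcile([(2, "1993"), (1, "1993")])
conflicting = reconcile([(2, "1994"), (1, "1993")])
curated = reconcile([(2, "1994"), (1, "1993")], override="1994")
empty = reconcile([(1, ""), (2, None)])
```

Flagging a conflict instead of silently choosing a value keeps the decision visible in a tabulated view, where the user can confirm or override it.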
Rule performance
• Multiple rules often needed
• Some properties are very fragmented
Using MI to detect Characteristic Sites
• At a characteristic site, the residue observed is strongly associated with a set of sequences
  • E.g.: Arg -> Avian, Thr -> Human
• This association is explored by measuring the mutual information (MI) of
  • The residue observed at a site
  • The label of the set in which it is observed
• MI is in the range 0 to 1.0
  • MI = 0.0 -> the residue observed is statistically independent of the set label (the same residue distribution in both sets)
  • MI = 1.0 -> residues observed in one set are never observed in the other, and vice versa
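The measure described above can be sketched as the standard mutual information between two discrete variables, computed in bits at a single alignment position. The function name and the toy residue columns are illustrative assumptions. Note that with this plain definition the value 1.0 is reached only when the two sets are the same size; for unequal sets the maximum is the entropy of the set label, which is below 1. How the method in the talk treats unequal set sizes is not stated in the slides.

```python
from collections import Counter
from math import log2


def mutual_information(residues_a, residues_b):
    """MI (in bits) between residue and set label at one alignment position.

    residues_a, residues_b: the residues observed at this position in
    the sequences of set A and set B (strings or lists of characters).
    """
    total = len(residues_a) + len(residues_b)
    # Joint counts of (residue, set label).
    joint = Counter((r, "A") for r in residues_a)
    joint.update((r, "B") for r in residues_b)
    # Marginal counts.
    residue_counts = Counter(residues_a) + Counter(residues_b)
    label_counts = {"A": len(residues_a), "B": len(residues_b)}

    mi = 0.0
    for (residue, label), n in joint.items():
        p_joint = n / total
        p_residue = residue_counts[residue] / total
        p_label = label_counts[label] / total
        mi += p_joint * log2(p_joint / (p_residue * p_label))
    return mi


# Toy columns: a perfectly characteristic site and an uninformative one.
characteristic = mutual_information("RRRR", "TTTT")
uninformative = mutual_information("RRTT", "RRTT")
```

Applying the function to every position of an alignment gives an MI profile in which characteristic sites stand out as peaks.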
Spikes indicate characteristic sites
[Figure: MI and entropy plotted along the PB2 protein for the A2A set (719 sequences) and the H2H set (1650 sequences)]
RNP proteins: PB2
[Figure: map of the PB2 protein (759 aa) showing A2A and H2H residues at 17 characteristic sites.
Functional regions labeled: Nuclear Localization Signal, PB1 binding, NP binding, RNA cap binding.
Site positions: 9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674, 702.
URL shown in figure: http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html]
Ongoing Projects at ISS
• InViDiA - Integrated Virus Diversity Analysis
  • Web-based tool for metadata-enabled diversity analysis
• WADE - Web-based Aggregation and Display of Epitopes
  • Web-based tool for aggregating epitope predictions from multiple prediction systems
Thanks to
• Johns Hopkins University
  • Prof. J Thomas August
• Dana-Farber Cancer Institute, Harvard
  • Dr. Vladimir Brusic
• Dept. of Biochemistry, NUS
  • Prof. Tan Tin Wee
  • AT Heiny, Asif M Khan, Hu Yong Li
• Institut Pasteur
  • Dr. Hervé Bourhy
• Partial Grant Support:
  • National Institute of Allergy and Infectious Diseases, NIH
  • Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C