Ontology Driven Data Collection for EuPathDB
Jie Zheng, Omar Harb, Chris Stoeckert
Center for Bioinformatics, University of Pennsylvania
Issues associated with Data Collection
• Heterogeneity of free text
• Difficulty in data integration, requires human intervention
• Complex queries are limited
Data Collection for EuPathDB
• Apply ontology to data submission form design
• Form to collect sequence data and information on isolates of pathogens
  • Geographic location from where isolate specimen collected
  • Host organism information: species, age, clinical information
• Genetic manipulation with resulting phenotype data collection form
  • Mutation method
  • Effects of genetic modification on the parasite and on the location, function, and involvement in biological process of the resultant modified protein
These data are important for parasite epidemiology and research on vaccines and anti-parasitic drugs
• Enable Queries
  • Compare sequence data from Plasmodium isolates that are restricted to East Africa to those from West Africa and are controlled for age and health of hosts
  • List genes that when knocked out result in a defect in parasite growth during the erythrocytic cycle
  • List genes fused to green fluorescent protein (GFP) that when expressed are located in the cell membrane
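The second query above ("genes that when knocked out result in a defect in parasite growth during the erythrocytic cycle") becomes a plain filter once each submission is stored as structured fields rather than free text. The following is only a sketch under assumptions: the record layout, field names, and gene identifiers are hypothetical and not taken from EuPathDB.

```python
# Hypothetical structured records, as a controlled-vocabulary form might produce.
# Gene identifiers and field names are invented for illustration.
RECORDS = [
    {"gene": "GENE_A", "mutation_type": "knockout",
     "stage": "erythrocytic cycle", "phenotype": "growth defect"},
    {"gene": "GENE_B", "mutation_type": "knockout",
     "stage": "erythrocytic cycle", "phenotype": "no observed change"},
    {"gene": "GENE_C", "mutation_type": "GFP fusion",
     "stage": "erythrocytic cycle", "phenotype": "growth defect"},
]


def genes_matching(records, mutation_type, stage, phenotype):
    """Return sorted gene names whose records match all three fields exactly."""
    return sorted({
        r["gene"]
        for r in records
        if r["mutation_type"] == mutation_type
        and r["stage"] == stage
        and r["phenotype"] == phenotype
    })


knockout_growth_defects = genes_matching(
    RECORDS, "knockout", "erythrocytic cycle", "growth defect"
)
```

Exact matching works here only because every field holds a controlled term; the same query against free-text comments would need manual curation first.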
EuPathDB
EuPathDB (Eukaryotic Pathogen Database Resources) is a NIAID Bioinformatics Resource Center covering eukaryotic parasites.
EuPathDB: a portal to eukaryotic pathogen databases. Aurrecoechea C, et al. Nucleic Acids Res. 2010
Isolate Data
• Need to import and integrate datasets from GenBank
• But GenBank did not specify needed metadata for isolates
• Manual curation required
  • Harmonize: enable host queries: Human -> Homo sapiens
  • Deconvolute descriptions in free text:
    • isolated from storm waters
    • isolated from Homo sapiens patient infected with HIV
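The two curation steps above, harmonizing host names and deconvoluting free-text descriptions, can be sketched in a few lines. This is an illustration under stated assumptions, not the EuPathDB curation pipeline: the synonym table, the environment keywords, and the function names are all hypothetical, and a real system would look terms up in an ontology rather than a hand-written dictionary.

```python
import re

# Hypothetical synonym table (assumption); real curation would use ontology lookups.
HOST_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "cattle": "Bos taurus",
    "cow": "Bos taurus",
}

# Hypothetical keywords marking an environmental rather than a host source.
ENVIRONMENT_HINTS = ("water", "soil", "sewage")


def harmonize_host(raw):
    """Map a free-text host name to a canonical species name, or None."""
    return HOST_SYNONYMS.get(raw.strip().lower())


def deconvolute(description):
    """Split an 'isolated from ...' description into host, environment, note."""
    text = description.strip()
    match = re.match(r"isolated from (.+)", text, flags=re.IGNORECASE)
    source = match.group(1) if match else text
    record = {"host": None, "environment": None, "note": None}

    # Pattern: "<host> patient infected with <condition>"
    patient = re.match(r"(.+?) patient infected with (.+)", source,
                       flags=re.IGNORECASE)
    if patient:
        host = patient.group(1)
        record["host"] = harmonize_host(host) or host
        record["note"] = "infected with " + patient.group(2)
        return record

    host = harmonize_host(source)
    if host:
        record["host"] = host
    elif any(hint in source.lower() for hint in ENVIRONMENT_HINTS):
        record["environment"] = source
    else:
        record["note"] = source  # leave unrecognized text for a curator
    return record
```

Anything the rules do not recognize is kept as a note rather than guessed at, which mirrors why the slide calls for manual curation.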
Isolate Submission Form
• Target isolate information
• Geographic location
• Source organism samples information
  • or Environmental samples information
• Sequence information
Ontology-based Representation of Isolate Data
The data collected in the submission form are shown in bold font. Fields that require ontology terms are shown in thick-bordered boxes.
Excel Format
• Generally already collected in this format according to our community advisors
• Lowers the barrier for usage
• Easily converted to GenBank submission-ready format automatically
• Allows multiple sequence submission
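The automatic conversion mentioned above can be pictured as a column-renaming pass from the spreadsheet to a tab-delimited table with one row per sequence. The sketch below assumes the spreadsheet has been exported to CSV so that only the standard library is needed. The spreadsheet headers, the output column names, and the sample rows are illustrative assumptions, not the actual EuPathDB template or an exact GenBank specification.

```python
import csv
import io

# Assumed mapping from spreadsheet headers to output column names (hypothetical).
COLUMN_MAP = {
    "Sequence ID": "Sequence_ID",
    "Isolate name": "Isolate",
    "Host species": "Host",
    "Country": "Country",
}


def to_source_table(csv_text):
    """Convert CSV text (one isolate per row) to a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMN_MAP.values())
    for row in reader:
        # Multiple sequences are handled simply by iterating over rows.
        writer.writerow(row[source].strip() for source in COLUMN_MAP)
    return out.getvalue()


# Made-up sample rows for illustration only.
SAMPLE = (
    "Sequence ID,Isolate name,Host species,Country\n"
    "seq1,iso-1,Homo sapiens,Kenya\n"
    "seq2,iso-2,Bos taurus,Ghana\n"
)
table = to_source_table(SAMPLE)
```

Because each spreadsheet row maps to one output row, a single file can carry a multi-sequence submission.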
Genetic Manipulation and Phenotype Data
• Integrate phenotype data from other resources (GeneDB)
• Allow individuals to submit phenotype data via the EuPathDB web site via User Comments on Gene pages
• Either way these are free text descriptions limiting utility for data exploration
T. brucei RNAi knockdowns
Genetic Manipulation and Phenotype Submission Form
• Genetic Manipulation
  • Mutation method including selective marker, report if available
  • Mutation type (effect on gene function)
• Phenotype data – impact of genetic manipulation on four possible observed features:
  • Quality of the organism
  • Cellular location of gene product
  • Molecular function of gene product
  • Biological process of gene product
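One way to picture a submission from this form is as a record holding the manipulation details plus a list of phenotype observations, each tagged with one of the four observed features and an ontology term. The sketch below is an assumption-laden illustration: the class names, field names, and gene identifier are hypothetical, and the pairing of features with ontologies (PATO for organism quality, GO for the three gene product features) is an assumed mapping rather than one stated in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed feature-to-ontology mapping (hypothetical, for illustration).
FEATURE_ONTOLOGY = {
    "quality": "PATO",
    "cellular_location": "GO",
    "molecular_function": "GO",
    "biological_process": "GO",
}


@dataclass
class PhenotypeObservation:
    feature: str            # one of the four observed features
    term_id: str            # ontology term in PREFIX:ID form
    description: str = ""

    def __post_init__(self):
        if self.feature not in FEATURE_ONTOLOGY:
            raise ValueError(f"unknown feature: {self.feature}")
        prefix = self.term_id.split(":", 1)[0]
        if prefix != FEATURE_ONTOLOGY[self.feature]:
            raise ValueError(
                f"{self.feature} expects a {FEATURE_ONTOLOGY[self.feature]} term"
            )


@dataclass
class GeneticManipulation:
    gene_id: str
    mutation_method: str
    mutation_type: str
    selective_marker: Optional[str] = None
    phenotypes: List[PhenotypeObservation] = field(default_factory=list)


# GENE_0001 is a placeholder; GO:0005886 is the GO term for plasma membrane.
submission = GeneticManipulation(
    gene_id="GENE_0001",
    mutation_method="GFP fusion",
    mutation_type="tagged gene product",
    phenotypes=[PhenotypeObservation("cellular_location", "GO:0005886")],
)
```

Rejecting a term from the wrong ontology at entry time is what keeps the stored data queryable without later curation.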
Ontology-based Representation of Genetic Manipulation with Resulting Phenotype Data
The data collected in the submission form are shown in bold font. Fields that require ontology terms are shown in thick-bordered boxes. The Ontology for Parasite Lifecycle (OPL) will be used to annotate the life cycle stage.
Ontology-based Representation of Genetic Manipulation – Gene Knock Out
Phenotype Section
[Figure: phenotype section of the form annotated with source ontologies. Recoverable labels: OPL, GO, OBI, PATO, Cellular location, Biological process]
Web-based Form
• Collect the data directly from specific components of the EuPathDB web site
• Change dynamically based on user's inputs (lifecycle stage based on species, display selective marker, report, etc. section when needed)
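The dynamic behaviour described above can be sketched as a function that builds the form definition from what the user has entered so far. This is a minimal illustration, not the EuPathDB implementation: the stage lists are a small hand-written sample (the real form would presumably draw them from OPL), the section names are hypothetical, and "report" is read here as a reporter section, which is an assumption.

```python
# Illustrative stage lists only (assumption); not an OPL export.
LIFECYCLE_STAGES = {
    "Plasmodium falciparum": [
        "sporozoite", "merozoite", "trophozoite", "schizont", "gametocyte",
    ],
    "Trypanosoma brucei": ["bloodstream form", "procyclic form"],
}


def build_form(species, uses_selective_marker=False, uses_reporter=False):
    """Return a form definition whose options depend on the user's inputs."""
    if species not in LIFECYCLE_STAGES:
        raise ValueError(f"no lifecycle stages known for {species}")
    form = {
        "species": species,
        # Stage choices are restricted to the selected species.
        "lifecycle_stage_options": list(LIFECYCLE_STAGES[species]),
        "sections": ["mutation method", "mutation type", "phenotype"],
    }
    # Optional sections appear only when the user indicates they apply.
    if uses_selective_marker:
        form["sections"].insert(1, "selective marker")
    if uses_reporter:
        position = form["sections"].index("mutation type")
        form["sections"].insert(position, "reporter")
    return form
```

For example, `build_form("Trypanosoma brucei", uses_selective_marker=True)` offers only the two T. brucei stages and adds the selective marker section after the mutation method.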
Future Work
• Submission forms are at the prototype stage
• Distribute isolate submission forms to EuPathDB users
• Incorporate genetic manipulation and phenotype form into EuPathDB website
• Evaluation of submission forms based on the data collected
• Improve the submission forms based on feedback
Acknowledgements
• Stoeckert Lab
  • Haiming Wang and EuPathDB Team
• EuPathDB Community
  Dr. G Robinson, Dr. R Chalmers, Dr. CJ Janse, Dr. G. Widmer, Dr. L. Xiao, Dr. SM Khan
• Funding
  • NIH grant 5R01GM93132-1
  • National Institute of Allergy and Infectious Diseases at the National Institutes of Health Award NO1-AI900038C Contract No. HHSN272200900038C