Challenges in developing and implementing standards-based approaches in bioinformatics • South African National Bioinformatics Institute • Electric Genetics • University of the Western Cape
The impact of open standards is a lot like that of open source software • Free software • Open source software • Myriad of licenses • Low or no cost access
Tools • Existing and growing numbers of initiatives • Applications: EMBOSS, BLAST • Environments, vocabularies, databases
Access to and support of open source (OS) tools • Understanding • How do I install and use this system/application? • What if I have never used a non-Windows environment? • Who else is using this that I can share my questions with? • Is anyone out there going to help me? • Is there a credible user base?
Impact • Commercial • Are there any legal or regulatory hurdles to employing open source tools? • Is it a time sink? • Any impact on early adopters within the company? • How is this supported in terms of impact on the enterprise?
Impact • Academic • Funding threat • Calls for development paid for with public funds to be made available back to the public • Requirement to distribute freely • A developer who wishes to do OS projects is sometimes required to commercialise as a condition of funding
Population Genetics of Open Source • Longevity of use as a function of penetrance • Most new mutations, even if they are not selected against, never become established in the population • Where N is the total finite (diploid) population size, 1/(2N) is the probability that a new neutral mutation will become fixed
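The 1/(2N) figure above can be checked with a small simulation. The sketch below assumes a standard neutral Wright-Fisher model (a diploid population of N individuals, so 2N gene copies, with binomial resampling each generation). The function names are illustrative and do not come from the slides.

```python
import random


def neutral_fixation_probability(n_individuals: int) -> float:
    """Analytic fixation probability of one new neutral mutation (diploid population)."""
    return 1.0 / (2 * n_individuals)


def simulate_fixation(n_individuals: int, trials: int, seed: int = 0) -> float:
    """Estimate the fixation probability with a neutral Wright-Fisher model."""
    rng = random.Random(seed)
    gene_copies = 2 * n_individuals
    fixed = 0
    for _ in range(trials):
        mutant_copies = 1  # a single new mutation enters the population
        while 0 < mutant_copies < gene_copies:
            freq = mutant_copies / gene_copies
            # Each gene copy of the next generation is drawn independently
            # from the current generation (binomial resampling).
            mutant_copies = sum(rng.random() < freq for _ in range(gene_copies))
        if mutant_copies == gene_copies:
            fixed += 1
    return fixed / trials


n = 10
print("analytic :", neutral_fixation_probability(n))
print("simulated:", simulate_fixation(n, trials=2000))
```

In the analogy of the slide, most runs end with the new variant lost, and only a small fraction end with it fixed.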
Software zygosity • Two possible forms of software or data • Open • Controlled access • Heterozygous • Both are used • Homozygous • Only one is used
Software selection - packaging • Opportunity to use • Support and documentation • Distribution and marketing • Training • User base • Knowledge of users • Repeat uses = impact • Funded/stable development • Commercial or open source support or both
Artificial Selection • Figure: mean number of users plotted against number of choice instances, with curves for "Packaging" and "No packaging"
Effects of software selection • Same selection can have different outcomes • Robertson and Reeve • Change in wing size in Drosophila • Number of cells • Size of cells • Selection for a web browser • Mosaic* • Netscape $ > * • Mozilla* • Internet Explorer * • Opera $ > *
Manifesto for bioinformatics • open source • open standards • open annotation • open data • open health care
North – South Divide • Generation of genome data has been performed mostly in the developed west • Major laboratories and researchers are not in developing countries • Researchers at ‘site of infection’ have to compete with developed country researchers for access to genome projects • Developing countries lack resources for large scale projects • Developing countries provide the genetic material
Lessons so far • Sharing knowledge is key to developing knowledge • Sharing is difficult if there is an impediment to access • Open philosophies provide access to those with limited resources but a need for knowledge • Standards improve sharing • Those who would benefit from access to knowledge should contribute to standards and sharing
Why did SANBI get involved in controlled vocabularies? • RP1 expression product is unique to retina • ESTs have nasty annotations • Leverage legacy expertise in gene expression data • Genome imminent
Looking at ESTs across libraries • Library descriptions are diverse and in many cases non-informative • NCI_CGAP_Lip2 • UT0117 (75% of all EST libraries) • Soares foetal %^&*() • What were the actual expression states that these libraries captured?
eVOC: Controlled Vocabulary for Unifying Gene Expression Data • Consistent description of different libraries • Mapped orthogonal vocabularies • Anatomy, Cell type, Pathology, Development • 7016 EST libraries classified + 104 SAGE • 700 controlled terms • Applying the same terms to SAGE and EST libraries allows cross-comparison for the first time; microarray to follow…
Uses of eVOC • Provided as an integrated public resource that allows: • Linking libraries, transcripts and genes with expression terms • Analysis of expression level and tissue expression profiles • Comparison of expression between species • Linkage of genome sequence with expression phenotype information
Data Structure • 4 orthogonal mutually exclusive knowledge domains • independent pure hierarchies • One parent but multiple children • Advantages of pure hierarchies over more complex data structures • Easily maintained • Easily expanded • Easily visualised • Human and computer readable • Powerful simple querying • Each node has specific concept • One or more synonymous terms • Nasal • Nose
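A pure hierarchy of this kind can be sketched as a tree in which every node has exactly one parent, any number of children, and a set of synonyms that resolve to the same concept. This is a minimal illustration, not the eVOC implementation. The class names and the placement of "nose" under "respiratory" are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass(eq=False)
class Node:
    concept: str
    synonyms: List[str] = field(default_factory=list)
    parent: Optional["Node"] = None          # exactly one parent (None for the root)
    children: List["Node"] = field(default_factory=list)


class PureHierarchy:
    """A single-parent tree with case-insensitive lookup by term or synonym."""

    def __init__(self, root_concept: str) -> None:
        self.root = Node(root_concept)
        self._by_name: Dict[str, Node] = {root_concept.lower(): self.root}

    def add(self, concept: str, parent: str, synonyms: Iterable[str] = ()) -> Node:
        names = [concept, *synonyms]
        for name in names:
            if name.lower() in self._by_name:
                raise ValueError(f"term already present: {name}")
        parent_node = self._by_name[parent.lower()]
        node = Node(concept, list(synonyms), parent_node)
        parent_node.children.append(node)
        for name in names:
            self._by_name[name.lower()] = node
        return node

    def find(self, name: str) -> Node:
        return self._by_name[name.lower()]

    def path(self, name: str) -> List[str]:
        """Concepts from the root down to the node matching `name`."""
        node: Optional[Node] = self.find(name)
        concepts: List[str] = []
        while node is not None:
            concepts.append(node.concept)
            node = node.parent
        return concepts[::-1]


# Hypothetical fragment of an anatomical hierarchy.
anatomy = PureHierarchy("anatomical system")
anatomy.add("respiratory", parent="anatomical system")
anatomy.add("nose", parent="respiratory", synonyms=["nasal"])
print(" > ".join(anatomy.path("Nasal")))
```

Because each node has a single parent, the path from the root to any term is unique. That is what keeps querying and visualisation simple compared with a DAG.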
No More Tangles? • Where multiple parents/relationship types exist and could be represented in a DAG, these can often be “untangled” into more than one hierarchy • Untangling a tangled ontology. A complex mixed ontology can be simplified by creating simpler ontologies representing distinct domains.
Figure: Tangled Ontology (node labels: System, Body Substance, Person, Organic Ion, Man, Woman, Doctor, Patient, Steroid, Hormone, Neurotransmitter, Male doctor, Female doctor, Testosterone, Glutamate) • Untangled Ontology, split into Value Types, Entities and Roles (node labels: Age, Sex, Physiological Role, Clinical Role, Person, Body Substance, Organic Ion, Doctor, Steroid, Patient, Neurotransmitter, Hormone, Male, Female, Adult, Child, Testosterone, Glutamate)
Relationships • Single type of relationship between nodes • Anatomical System • part-of • Cell Type + Pathology • subclass • Developmental Stage • is-a
Anatomical System Ontology • Untangling of Computational Biology and Informatics Laboratory’s (CBIL) terms (ICDM9) • removal of all references to tissue type, cell type or developmental stage • digestive system > pancreatic islets • Anatomical Site (spatial position) • 372 terms
Cell Type ontology • fine-grained description of where a gene is expressed. • listing of human cell types extracted from Gray’s Anatomy (Gray, H. L., Bannister, L. H, Williams, P. L, Collins, P., and Berry, M. M 1995). • 154 different cell types.
Developmental Stage ontology • Ordered timeline of human development for the description of gene expression in temporal space • Examples “embryo” and “adult”. • Embryogenesis is further divided into the standard Carnegie stages (www.ana.ed.ac.uk/anatomy/database/humat/) • first two months of human development. • further divided into weekly and yearly categories • 133 terms
Pathology Ontology • WHO ICD-9-CM basis • classification of morbidity and mortality information • Stats and indexing of hospital records by disease and surgery performed • first two levels • sample description • 141 terms
Figure: Total cDNA library collection • Anatomical System: liver • Pathology: neoplasia • Query: “liver AND neoplasia” • Result: intersection of libraries mapped to liver and to neoplasia
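The query on this slide amounts to a set intersection across ontologies. The sketch below assumes each ontology is stored as a mapping from term to the set of library IDs annotated with it. All library IDs and the function name are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical annotations: ontology -> term -> library IDs.
ONTOLOGIES: Dict[str, Dict[str, Set[int]]] = {
    "Anatomical System": {"liver": {101, 102, 103}, "kidney": {104}},
    "Pathology": {"neoplasia": {102, 103, 104}, "normal": {101}},
}


def query_and(ontologies: Dict[str, Dict[str, Set[int]]],
              criteria: Dict[str, str]) -> Set[int]:
    """Return the libraries annotated with every (ontology, term) pair in `criteria`."""
    result = None
    for ontology, term in criteria.items():
        libraries = ontologies[ontology].get(term, set())
        result = set(libraries) if result is None else result & libraries
    return result if result is not None else set()


hits = query_and(ONTOLOGIES, {"Anatomical System": "liver", "Pathology": "neoplasia"})
print(sorted(hits))
```

Because the four ontologies are orthogonal, a query can combine one term from each without the terms constraining one another.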
Ontology               Total cDNA Libraries   Annotated Libraries   Not Annotated
Anatomical System      7016                   6752                  5.2%
Cell Type              7016                   410                   94.2%
Developmental Stage    7016                   5891                  17.3%
Pathology              7016                   6401                  10.1%
Most libraries can be annotated with Anatomical System terms as these are generally present in the library record. Less information is available for Cell Type and Developmental Stage as these are not consistently recorded when library information is captured.
Figure: Ontologies > Clone Libraries > ESTs • Anatomical System: foreskin • Pathology: not classified • Developmental Stage: not classified • Cell Type: fibroblast • Clone libraries: Human TNF-treated BG9 fibroblasts (ID:1260); Homo sapiens foreskin fibroblast (ID:1620) • ESTs: U30152, U30154, U30159, U30162, U30163, U30164, U58979 • The four expression ontologies are used to annotate cDNA clone libraries. ESTs can be transitively associated with ontology terms via their association with a unique clone library.
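The transitive association can be sketched as two lookups: EST to clone library, then clone library to ontology terms. The library ID, terms and accessions are taken from the slide. The assignment of these ESTs to library 1620 is an assumption, since the slide does not say which library each EST belongs to.

```python
from typing import Dict, Optional

# Library annotations per ontology; None means "not classified".
LIBRARY_TERMS: Dict[int, Dict[str, Optional[str]]] = {
    1620: {
        "Anatomical System": "foreskin",
        "Cell Type": "fibroblast",
        "Pathology": None,
        "Developmental Stage": None,
    },
}

# Assumed EST-to-library assignment (each EST belongs to one clone library).
EST_TO_LIBRARY: Dict[str, int] = {"U30152": 1620, "U30154": 1620, "U30159": 1620}


def terms_for_est(accession: str) -> Dict[str, str]:
    """Ontology terms an EST inherits from its clone library (classified ones only)."""
    library_id = EST_TO_LIBRARY[accession]
    return {
        ontology: term
        for ontology, term in LIBRARY_TERMS[library_id].items()
        if term is not None
    }


print(terms_for_est("U30152"))
```

Annotating the library once is enough: every EST from that library picks up the same terms without being curated individually.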
Browsing, Querying and Curation An interface for browsing, curating and querying the ontologies is under development by Electric Genetics (see poster by Visagie et al. this meeting).
Curation • Central, versioned database of the eVOC ontologies • Curators who are domain experts add and delete terms or synonyms and make changes to the hierarchies on an ongoing basis • Groups that modify the ontologies are encouraged to contribute these modifications back to eVOC
Applications What happens when you link libraries (cDNA/SAGE) or microarray probes to terms in each ontology? • Expression profile selection of libraries • Terms > Libraries > Transcripts > Genes • Genes > Terms • Breadth of expression • Assess differential expression levels (SAGE) • Assess differential tissue expression (cDNA & SAGE) • Physical distribution of expression across the genome • Expression profile prioritisation of disease candidates • Link genome to standardised controlled terms • Assess expression clustering • Cross species expression comparison • Comparison of local data with whole picture • Choice of libraries by, for instance, molecular pathology : Neoplasia • Transitive Integration with GO
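The two traversal directions listed above (Terms > Libraries > Transcripts > Genes, and Genes > Terms) can be sketched with plain dictionaries. Breadth of expression is taken here to mean the number of distinct terms a gene reaches, which is an assumption about the slide's intent. All identifiers and data below are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical links between libraries, transcripts and genes.
LIBRARY_TERM: Dict[str, str] = {"lib1": "liver", "lib2": "liver", "lib3": "brain"}
TRANSCRIPT_LIBRARIES: Dict[str, Set[str]] = {
    "tx1": {"lib1", "lib3"},
    "tx2": {"lib2"},
    "tx3": {"lib3"},
}
TRANSCRIPT_GENE: Dict[str, str] = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}


def genes_for_term(term: str) -> Set[str]:
    """Terms > Libraries > Transcripts > Genes."""
    libraries = {lib for lib, t in LIBRARY_TERM.items() if t == term}
    return {
        TRANSCRIPT_GENE[tx]
        for tx, tx_libraries in TRANSCRIPT_LIBRARIES.items()
        if tx_libraries & libraries
    }


def terms_for_gene(gene: str) -> Set[str]:
    """Genes > Terms, via the gene's transcripts and their libraries."""
    terms: Set[str] = set()
    for tx, tx_gene in TRANSCRIPT_GENE.items():
        if tx_gene == gene:
            terms |= {LIBRARY_TERM[lib] for lib in TRANSCRIPT_LIBRARIES[tx]}
    return terms


def breadth_of_expression(gene: str) -> int:
    """Number of distinct terms in which the gene is observed."""
    return len(terms_for_gene(gene))


print(sorted(genes_for_term("brain")))
print(breadth_of_expression("geneA"), breadth_of_expression("geneB"))
```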
Integration • Current • ICL Candidate Gene Profiler: a disease gene candidate identification system which integrates genomic data with the GO and eVOC ontologies to identify and rank genes which are candidates for known diseases • Swiss Institute of Bioinformatics Transcriptome Database • Future • Ensembl Datamart: select expression profile in a defined genome region • GOBO: apply for incorporation • MGED: apply for inclusion as an MGED-approved expression ontology
Human Transcriptome Database • H-Invitational Odaiba, Japan • Human FLcDNA annotation jamboree • Non-redundant set of mapped, manually curated, expression profiled, classified cDNAs • eVOC terms used to describe mRNA expression
High resolution of eVOC • Xu, Modrek and Lee (2002) performed genome-wide detection of alternatively spliced transcripts and identified those which show tissue specificity • Flat list of 46 human tissue classes • Isoform-specific EST lists provided for a subset of the genes
Gene: IRP3
  Isoform 1 • Xu et al.: Brain-specific • eVOC: 5 nervous > brain; 1 respiratory > lung
  Isoform 2 • Xu et al.: No specificity • eVOC: 2 urogenital > genital > female > uterus; 1 urogenital > genital > female > placenta; 1 haematological > blood
  eVOC developmental stage: 4 infant; 3 adult
Gene: WNK1
  Isoform 1 • Xu et al.: Kidney-specific • eVOC: 7 urinary > kidney
  Isoform 2 • Xu et al.: No specificity • eVOC: 2 urogenital > genital > male > penis; 1 alimentary > pancreas
eVOC extends the expression information that can be obtained from other sources. IRP3, described by Xu et al. as having a brain-specific isoform, was shown to be infant brain specific by combining information gathered from the eVOC ontologies. The ESTs for each isoform were submitted to eVOC and the associated terms in each of the four ontologies were examined to identify expression state specificity.
Figure: GANESH deep annotation engine • ENSEMBL annotation engine • Annotation served using DAS • Controlled Expression Vocabulary • Annotation using sequences generated in the lab, and using local domain expertise • Candidate Gene Profiler • Candidate gene enrichment
Exon Skipping in Cancer: • Determine chromosomal location of 1011 gene set on human genome sequence • Assess the frequency and tissue distribution of exon skipping • Determine functional significance of exon skipping • Can the presence of transcripts demonstrating exon skipping be used as diagnostic/prognostic markers? • Can the biological effect of the skip on the resulting protein be explained?
Genes with exon skipped transcripts found uniquely in cancer tissues