Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)
Tissue Microarrays www.nature.com/clinicalpractice/onc
Stanford tissue microarray database http://tma.stanford.edu/tma_portal/
Key analysis issue • Tissue microarrays query a large number of samples/patients for one protein. • The key query dimension in TMA data is the tissue sample. • Because there is no commonly used ontology to describe the diagnosis [or annotations] for a given TMA sample in TMAD, it is not easy to perform such a query.
Ontologies considered • The NCI Thesaurus, version 05.09g • SNOMED-CT, from UMLS 2005AA
Available annotations for a block • Each donor block in the TMA has semi-structured text associated with it.
Map text to ontology terms • Make all possible permutations • Rules to weed out bad permutations • Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms) • Rules to weed out bad matches
Example: the term-set "Prostate Carcinoma Adeno intraductal" gives 24 permutations, including Prostate Carcinoma Adeno intraductal : Carcinoma Prostate intraductal Adeno : Adeno Carcinoma intraductal Prostate : Prostate intraductal Adeno Carcinoma. Matched term: Prostate_Ductal_Adenocarcinoma
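The permutation and exact-match steps can be sketched in Python as follows. This is a minimal illustration, not the TMAD implementation: the `LEXICON` dictionary and its single entry are hypothetical stand-ins for the NCI Thesaurus and SNOMED-CT labels and synonyms, and the rules for weeding out bad permutations and bad matches are not described on the slide, so they are omitted.

```python
from itertools import permutations

# Hypothetical mini-lexicon: normalized label or synonym -> ontology term.
# A real lookup would use NCI Thesaurus / SNOMED-CT terms and synonyms.
LEXICON = {
    "prostate intraductal adeno carcinoma": "Prostate_Ductal_Adenocarcinoma",
}

def normalize(text):
    """Lower-case and collapse whitespace so matching is exact but case-insensitive."""
    return " ".join(text.lower().split())

def candidate_strings(words):
    """All orderings of the words in a term-set (n! strings for n distinct words)."""
    return [" ".join(p) for p in permutations(words)]

def map_term_set(words, lexicon=LEXICON):
    """Return the ontology terms whose label exactly matches some permutation."""
    matches = set()
    for candidate in candidate_strings(words):
        term = lexicon.get(normalize(candidate))
        if term is not None:
            matches.add(term)
    return sorted(matches)

term_set = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
print(len(candidate_strings(term_set)))  # 24
print(map_term_set(term_set))
```

Four distinct words give 4! = 24 orderings, which agrees with the count on the slide.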
Results and validation • Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets. • 577 term-sets (6614 records) matched to the NCI Thesaurus • 365 term-sets (3465 records) matched to SNOMED-CT • In total, 6871 records (80% of the annotated records in TMAD; 641 distinct term-sets) were mapped to one or more ontology terms.
[Figure legend] Parent & sibling nodes with data (burlywood) • Child nodes with no data (grey) • Child nodes with data (yellow)
How does ontology-based annotation help? • Better search: we can retrieve samples of all the retroperitoneal tumors or malignant uterine neoplasms, for example. • Better integration of data: we can correlate gene expression with protein expression across multiple tumor types, using tissue microarray data from TMAD and gene expression data from GEO.
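The "better search" point can be illustrated with a small Python sketch: once samples carry ontology terms, a query for a general class can be expanded to all of its descendants. The is-a edges, term names other than those on the slides, and sample ids below are hypothetical, and each class is given a single parent for brevity even though real ontologies allow several.

```python
from collections import defaultdict

# Hypothetical is-a edges (child -> parent); not taken from the NCI Thesaurus.
IS_A = {
    "Uterine_Carcinoma": "Malignant_Uterine_Neoplasm",
    "Uterine_Sarcoma": "Malignant_Uterine_Neoplasm",
    "Endometrial_Adenocarcinoma": "Uterine_Carcinoma",
}

# Hypothetical sample annotations (sample id -> ontology term).
ANNOTATIONS = {
    "sample_1": "Endometrial_Adenocarcinoma",
    "sample_2": "Uterine_Sarcoma",
    "sample_3": "Prostate_Ductal_Adenocarcinoma",
}

def descendants(term, is_a):
    """Collect every class below `term` by walking the is-a edges downward."""
    children = defaultdict(list)
    for child, parent in is_a.items():
        children[parent].append(child)
    found, stack = set(), [term]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def samples_under(term, is_a, annotations):
    """Samples annotated with `term` itself or with any of its descendants."""
    wanted = descendants(term, is_a) | {term}
    return sorted(s for s, t in annotations.items() if t in wanted)

print(samples_under("Malignant_Uterine_Neoplasm", IS_A, ANNOTATIONS))
# ['sample_1', 'sample_2']
```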
Integrating mRNA and protein expression [Figure: two panels, axes Genes vs. Samples and Proteins vs. Samples]
Steps in Alignment • Anchor identification: identify similar class labels in the ontologies to be aligned; usually done by string matching • Ontology structure: use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric [Figure: two ontology graphs, one with nodes Root and Term-1 to Term-5, the other with nodes R and t1 to t7]
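The two alignment steps above can be sketched in Python. This is an illustrative sketch only: the slide does not say which similarity metric is used, so the Jaccard overlap of neighbours below is an assumed stand-in, and all class ids and labels are hypothetical.

```python
def normalize(label):
    """Lower-case, treat underscores as spaces, collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").split())

def find_anchors(labels_a, labels_b):
    """Step 1: pair class ids whose normalized labels are identical."""
    index_b = {normalize(lbl): cid for cid, lbl in labels_b.items()}
    return {
        cid_a: index_b[normalize(lbl)]
        for cid_a, lbl in labels_a.items()
        if normalize(lbl) in index_b
    }

def neighbor_overlap(a, b, nbrs_a, nbrs_b, anchors):
    """Step 2 (assumed metric): Jaccard overlap of the neighbours of a and b,
    after translating a's neighbours into ontology B through the anchors."""
    mapped = {anchors[n] for n in nbrs_a.get(a, ()) if n in anchors}
    other = set(nbrs_b.get(b, ()))
    union = mapped | other
    return len(mapped & other) / len(union) if union else 0.0

# Hypothetical class labels and neighbour lists for two small ontologies.
labels_a = {"A1": "Carcinoma", "A2": "Adenocarcinoma", "A3": "Ductal Adenocarcinoma"}
labels_b = {"B1": "carcinoma", "B2": "adenocarcinoma", "B3": "Intraductal_Adenocarcinoma"}
nbrs_a = {"A3": ["A1", "A2"]}
nbrs_b = {"B3": ["B1", "B2"]}

anchors = find_anchors(labels_a, labels_b)
print(anchors)  # {'A1': 'B1', 'A2': 'B2'}
print(neighbor_overlap("A3", "B3", nbrs_a, nbrs_b, anchors))  # 1.0
```

Here A3 and B3 have different labels, so string matching alone misses them, but their anchored neighbours coincide, which is the kind of structural evidence the second step relies on.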
We might improve alignment … • Ontology [graph] structure based step • Provide anchors from annotated data [Figure: the two ontology graphs from the previous slide, with t5, S2 and Term-5 highlighted]
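One possible reading of "anchors from annotated data" is that two classes from different ontologies become a candidate anchor when the same samples are annotated with both. The Python sketch below illustrates that reading only; the function name, the `min_shared` threshold, and the sample annotations are assumptions rather than details given on the slides.

```python
from collections import Counter

def data_anchors(annot_a, annot_b, min_shared=2):
    """Propose (class in A, class in B) pairs that co-annotate at least
    `min_shared` samples. annot_a / annot_b map sample id -> class id."""
    shared = Counter(
        (annot_a[s], annot_b[s]) for s in annot_a.keys() & annot_b.keys()
    )
    return sorted(pair for pair, n in shared.items() if n >= min_shared)

# Hypothetical annotations of the same samples against two ontologies.
annot_a = {"S1": "t5", "S2": "t5", "S3": "t3"}
annot_b = {"S1": "Term-5", "S2": "Term-5", "S3": "Term-3", "S4": "Term-1"}

print(data_anchors(annot_a, annot_b))  # [('t5', 'Term-5')]
```

Raising `min_shared` trades recall for fewer spurious anchors; with `min_shared=1` the pair ('t3', 'Term-3') would also be proposed.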
Summary Ability to map word-groups to ontology terms
Credits and acknowledgements • Pathology: Robert Marinelli, Matt van de Rijn • Medical Informatics: Kaustubh Supekar, Daniel Rubin, Mark Musen • Funding: NIH