K-neighborhood Decentralization: A Comprehensive Solution to Index the UMLS for Large Scale Knowledge Discovery
Yang Xiang
Joint work with Kewei Lu, Stephen L. James, Tara B. Borlawsky, Kun Huang, and Philip R.O. Payne
Journal of Biomedical Informatics, In Press
Unified Medical Language System (UMLS)
• A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  • Metathesaurus
  • Semantic Network
  • SPECIALIST Lexicon
• The UMLS contains more than just ontologies
• Maintained by the US National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/
UMLS - Metathesaurus
• More than 1 million biomedical concepts
• Drawn from over 100 incorporated controlled source vocabularies, including:
  • ICD (International Statistical Classification of Diseases and Related Health Problems)
  • MeSH (Medical Subject Headings)
  • SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  • LOINC (Logical Observation Identifiers Names and Codes)
  • Gene Ontology
  • OMIM (Online Mendelian Inheritance in Man)
  • …
• Source vocabularies: http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
UMLS - Semantic Network
• Semantic types (categories): 133 in 2011AA
  • Entity: Physical Object, Organism, …
  • Event: Activity, Behavior, …
• Semantic relationships (connecting two concepts): 591 in 2011AA
  • isa
  • associated_with
  • physically_related_to
  • part_of, …
  • spatially_related_to
  • location_of, …
• Example (from the slide's diagram): Drug A treats Disease B (inverse: treated_by); Disease B disease_is_marked_by_gene Gene A
• http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
• http://www.clres.com/semrels/umls_relation_list.html
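Concepts linked by relationships can be pictured as a labeled directed graph. The sketch below is only an illustration of that idea: the three triples are modeled on the Drug A / Disease B / Gene A example above, while the `build_graph` helper and the triple layout are assumptions made for this sketch, not the UMLS file format.

```python
from collections import defaultdict

# Hypothetical triples modeled on the slide's example; not real UMLS records.
TRIPLES = [
    ("Drug A", "treats", "Disease B"),
    ("Disease B", "treated_by", "Drug A"),
    ("Disease B", "disease_is_marked_by_gene", "Gene A"),
]

def build_graph(triples):
    """Return an adjacency map: concept -> list of (relationship, neighbor)."""
    graph = defaultdict(list)
    for source, relation, target in triples:
        graph[source].append((relation, target))
    return dict(graph)

graph = build_graph(TRIPLES)
```

Each concept becomes a vertex and each relationship a labeled edge, which is the graph view the reachability, distance, and path queries operate on.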
Knowledge Discovery in the UMLS Graph
• Reachability
• Distance
• Path
• Conceptual Knowledge Construct, CKC (Subject Matter Expert)
  • Depth-first search for CKC, limited to 4 hops and to a small number of data sources (CITIH)
  • Finding all paths is computationally intractable
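To make the hop limit concrete, here is a minimal sketch of a depth-limited depth-first search that enumerates simple paths of at most 4 hops. It is not the CKC implementation; the function name `paths_within_hops` and the toy graph are hypothetical and exist only for illustration.

```python
def paths_within_hops(graph, source, target, max_hops=4):
    """Enumerate simple paths from source to target using at most max_hops edges."""
    found = []

    def dfs(node, path):
        if node == target:
            found.append(list(path))
            return
        if len(path) - 1 == max_hops:  # hop budget used up, stop exploring
            return
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no repeated vertex)
                path.append(neighbor)
                dfs(neighbor, path)
                path.pop()

    dfs(source, [source])
    return found

# Hypothetical toy graph; the vertex names are placeholders, not UMLS concepts.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

paths = paths_within_hops(TOY_GRAPH, "a", "e", max_hops=4)
```

Without the hop limit, the number of simple paths can grow exponentially with graph size, which is why unrestricted enumeration does not scale.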
Reachability
The problem: Given two vertices u and v in a directed graph G, is there a path from u to v?
Example queries (on the slide's example graph with vertices 1 to 15; figure not reproduced):
• Query(1, 11)? Yes
• Query(3, 9)? No
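As a baseline for comparison, a reachability query can be answered without any index by a breadth-first search. The sketch below assumes an adjacency-list dictionary; the function name `is_reachable` and the toy graph are hypothetical, and the edges are not those of the slide's figure.

```python
from collections import deque

def is_reachable(graph, source, target):
    """Return True if a directed path from source to target exists."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

# Hypothetical toy graph, invented for this sketch.
TOY_GRAPH = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}
```

Each such query can visit every vertex and edge in the worst case, which is the cost that reachability indexing schemes aim to avoid on very large graphs.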
Distance
The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v?
Example query (on the slide's example graph with vertices 1 to 15; figure not reproduced): d_G(1, 11) = 3
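For an unweighted graph, the distance is the number of edges on a shortest path, which a breadth-first search computes directly. This is a minimal unindexed sketch; the name `hop_distance` and the toy graph are hypothetical and do not reproduce the slide's figure.

```python
from collections import deque

def hop_distance(graph, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None

# Hypothetical toy graph, invented for this sketch.
TOY_GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
```

Returning `None` for unreachable pairs keeps "no path" distinct from a distance of zero.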
Path
The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v?
Example query (on the slide's example graph with vertices 1 to 15; figure not reproduced): find a path from 1 to 11
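Returning one concrete path only requires remembering, for each visited vertex, the vertex it was reached from. The sketch below does this with a breadth-first search, so the path it returns is also a shortest one; `shortest_path` and the toy graph are hypothetical names and data chosen for illustration.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return one shortest path as a list of vertices, or None if none exists."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk parent pointers back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical toy graph, invented for this sketch.
TOY_GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["e"], "e": []}
```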
The estimated difficulty of building very efficient graph database indexing schemes (based on current research)
References:
• R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
• R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
Application: Disease Gene Prioritization
• 8,134 disease concepts from OMIM (Online Mendelian Inheritance in Man), selected by semantic type “Disease or Syndrome” or “Neoplastic Process”
• 29,333 genes from HUGO (Human Genome Organisation)
IDS gene to CLL
• The IDS gene is associated with inflammation and enlargement of the liver, as well as enlargement of the spleen, which is a lymphocytic organ
• GSE2466: IDS expression levels show a significant decrease in CLL patients compared with normal controls (t-test p-value < 10^-11, mean fold change = 1.63)
MIR1-1 gene to Breast Cancer
• The paths lead to breast carcinoma via three drugs (Cyclophosphamide, Methotrexate, and Fluorouracil), the three components of the NCI-recommended CMF regimen for breast cancer.
Thanks! Questions?