Life Science 2010. Meeting the Challenges with Semantic Technology. By Dr. Sheng-Chuan Wu. Some Looming US Healthcare Crises. Aging of population in developed countries US older population increases 50% in 30 years Life expectancy gets longer (79 – 84 in 2008)
Life Science 2010 Meeting the Challenges with Semantic Technology By Dr. Sheng-Chuan Wu
Some Looming US Healthcare Crises
• Aging of population in developed countries
  • US older population increases 50% in 30 years
  • Life expectancy gets longer (79 – 84 in 2008)
• Cost of healthcare skyrocketing
  • Drug prices increased 50% from 2000 to 2007 (US CPI rose 20% in the same period)
• Drug development costs mushroom
  • US$200+ million for a successful drug approval
• Uninsured reaches epidemic proportion (30%)
Despite Enormous Scientific Advance
• Sequencing of the human genome and greater insight into human genes (e.g., Gene Ontology)
• Microarray gene expression chips
• Better understanding of cancer and viral infection
• Greatly expanded research in pharmacology, pathology, immunology, physiology, etc.
An explosion of information (knowledge) in life science without comparable benefit
Challenges for Life Science – Diversity • Very diverse subjects How to relate all the information cohesively?
Challenges for Life Science – Taxonomy
[Figure labels: Physiologist, Geneticist, Pharmacologist, Biochemist, Virologist]
• Different disciplines use different taxonomies, even for the same thing
• Taxonomic science is intrinsically dynamic
Challenges for Life Science – Information Model • Mostly employing relational model Credited to Universiti Malaysia Sabah
Challenges for Life Science – Information Model • Horrendous RDB table schema • More than 70% of table cells contain null value • Need to call in experts to update schema Credited to Universiti Malaysia Sabah
Challenges for Life Science – Knowledge Representation
Designed for humans (90%+), not for computers
Challenges for Life Science – Integration
• Many sources (silos) of life science information
• Our understanding in some areas (e.g., pathways) is very limited and uncertain
• We don’t even know what else is to come
• A mammoth data integration problem, let alone integrated understanding and knowledge discovery
Try designing a schema for such data tables and knowledge warehouses!
Same Challenges for Every Field in Biology
• Many diverse but related subjects
• Different taxonomies from different disciplines
• Very complex information model, which must evolve constantly as we learn more
• Difficulty in knowledge representation, for computers and not just for humans
• Mammoth information integration problem
Semantic Technology can help overcome these challenges
What is Semantic Technology
“The Semantic Web is a vision: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.” [Tim Berners-Lee et al., 2001]
• A new way of representing (modeling) and accessing information
• Makes unstructured documents that lack semantics (on the web or not) intelligible to computers
• Key enabling standards: URI, RDF, RDFS, OWL and SPARQL
The Semantic Wave is NOT One Thing … Two Differing Major Waves Within It
• The Semantic Web
  • Information sharing on a global scale
  • Intranets vs. Internet
• Semantic Technology
  • Standard knowledge representation
  • Enhanced knowledge access and discovery
  • Semantic Interoperability
  • Information syndication and so forth
First, We Need Shared Reference
[Figure: triangle of Form, Concept and Referent; edge labels: activates, relates to, stands for. Example form: “Family”?]
Computers cannot help without clear data semantics
URI (from W3C) Removes Ambiguity
[Figure: CV documents that use different tag names (e.g., name, education/educ, work, private) for the same fields]
• W3C Ontology (RDF/RDFS/OWL)
  • Provides external common references, structure and properties for individual terms
• Uniform Resource Identifier (URI) as common reference
  • <http://http://xmlns.com/foaf/0.1/Thing.owl #Family>
  • <http://www.usda.gov/classification/plants/taxaonomy.owl#Family>
• Semantic languages (RDF)
  • To describe mappings, relations (properties) and structures of data
RDF – Distributed Data (Knowledge)
Credit: IDD Workshop 2008 | Semantic Web in Life Sciences | Martin Romacker, Therese Vachon | 30 Sep 2008
[Figure: graph with nodes Virus, Virus_infection, HIV_Virus and AIDS, connected by caused_by and rdf:type edges]
• RDF (Resource Description Framework) from W3C represents each statement as a triple with a uniform structure: [subject] [predicate] [object], e.g., Virus_infection caused_by Virus
• This structure is close to simple phrases in natural language.
• Linked through URIs, the triples form a graph (easy to visualize).
• All are represented as unique resources (URI).
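The triple structure described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming plain strings stand in for full URIs and that the figure's edges read as written in the code (one plausible reading of the diagram). The helper `objects` is hypothetical and does not come from any RDF library.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# Assumption: plain strings stand in for URIs to keep the sketch short.
triples = {
    ("Virus_infection", "caused_by", "Virus"),
    ("AIDS", "rdf:type", "Virus_infection"),   # assumed reading of the figure
    ("HIV_Virus", "rdf:type", "Virus"),        # assumed reading of the figure
}

def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate` (hypothetical helper)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Because every statement has the same shape, the set of triples is a graph:
# subjects and objects are nodes, predicates are labelled edges.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

print(objects("AIDS", "rdf:type"))
print(sorted(nodes))
```

A real deployment would use full URIs for every name, so that the same resource mentioned in two different files resolves to the same node.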
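One-to-Many Relational Model – Many Table Joins
[Figure: relational tables for the person data (one label reads "DOM 1913") and a "friends" table with columns ID1, ID2 and the row 2, 35]
• Adding a new relation requires a change of the DB schema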
Equivalent Semantic Model – Easy with URI & RDF
<triple 32: "person2" "type" "person">
<triple 33: "person2" "first-name" "Rose">
<triple 34: "person2" "middle-initial" "Elizabeth">
<triple 35: "person2" "last-name" "Fitzgerald">
<triple 36: "person2" "suffix" "none">
<triple 37: "person2" "alma-mater" "Sacred-Heart-Convent">
<triple 38: "person2" "birth-year" "1890">
<triple 39: "person2" "death-year" "1995">
<triple 40: "person2" "sex" "female">
<triple 41: "person2" "spouse" "person1">
<triple 58: "person2" "has-child" "person17">
<triple 56: "person2" "has-child" "person15">
<triple 54: "person2" "has-child" "person13">
<triple 52: "person2" "has-child" "person11">
<triple 50: "person2" "has-child" "person9">
<triple 48: "person2" "has-child" "person7">
<triple 46: "person2" "has-child" "person6">
<triple 44: "person2" "has-child" "person4">
<triple 42: "person2" "has-child" "person3">
<triple 60: "person2" "profession" "home-maker">
<triple 66: "person2" "has-friend" "person35">
<triple 67: "person2" "year-of-marriage" "1913">
Copyright Sheng-Chuan Wu, July 2009
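To make the contrast with the relational model concrete, here is a minimal sketch of a schema-free triple store in Python. It uses a subset of the triples listed above. The `match` helper and the use of `None` as a wildcard are assumptions made for illustration, not the API of any particular triple store.

```python
# A schema-free store: just a list of (subject, predicate, object) triples.
# Values are a subset of the triples above; `match` is a hypothetical helper.
store = [
    ("person2", "type", "person"),
    ("person2", "first-name", "Rose"),
    ("person2", "spouse", "person1"),
    ("person2", "has-child", "person3"),
    ("person2", "has-child", "person4"),
    ("person2", "has-friend", "person35"),
]

def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A one-to-many relation needs no join table: there is one triple per child.
children = [o for _, _, o in match("person2", "has-child")]

# Adding a brand-new kind of relation is a single append, with no schema change.
store.append(("person2", "year-of-marriage", "1913"))

print(children)
print(match(p="year-of-marriage"))
```

In the relational layout, the same two steps would need a join table for the children and an ALTER TABLE (or a new table) for the marriage year.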
Taxonomic Structure (ontology) A Little Ontology Goes A Long Way Copy Right, Sheng-Chuan Wu November 2008
Synergies in Knowledge Representation Biomedical Taxonomy/ Ontologies
RDF Class Hierarchy Maps Taxonomy
• NCI ontology – a comprehensive biomedical taxonomy, containing 1,200,000 concepts mapped to 2,900,000 terms with 5,000,000 relationships, e.g.,
  • Medicine
    • Medical_Specialties
      • Radiology
        • Radiology_Therapeutic
        • Radiology_Bone
        • Radiology_Dental
        • Pediatric_Radiology
        • Nuclear_Medicine
        • Medical_Radiation_Physics
        • Diagnostic_Radiology_Ionizing_and_Nonionizing_
        • Radiology_Thorax_Chest
        • Radiology_Soft_Tissue
        • Radiology_Head_Neck
        • Interventional_Radiology
Semantic Model – Explicit Relationship
Information given:
• WhiteLadySlipper type OrchidFamily
• OrchidFamily subClassOf Liliidae
Information inferred:
• WhiteLadySlipper type Liliidae
Question: In which subclass does WhiteLadySlipper belong?
Answer: Liliidae
Where are the relationships? In the relationship model itself:
• rdf:type and rdfs:subClassOf are from the W3C standards; rdfs:subClassOf is transitive, and rdf:type membership propagates up the class hierarchy
• Relationships are explicit in the model and directly available to applications!
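The inference on this slide can be reproduced with a short Python sketch. It assumes plain strings for resource names and implements only the two rules needed here (transitive rdfs:subClassOf, and type membership propagating to superclasses). The function names are hypothetical; a production system would rely on an RDFS reasoner instead.

```python
# Facts given on the slide, written as (subject, predicate, object) triples.
given = {
    ("WhiteLadySlipper", "rdf:type", "OrchidFamily"),
    ("OrchidFamily", "rdfs:subClassOf", "Liliidae"),
}

def superclasses(cls, triples):
    """All classes reachable from `cls` by following rdfs:subClassOf transitively."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def inferred_types(instance, triples):
    """Direct types of `instance` plus every superclass of those types."""
    direct = {o for s, p, o in triples if s == instance and p == "rdf:type"}
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls, triples)
    return result

print(inferred_types("WhiteLadySlipper", given))
```

The application never states that WhiteLadySlipper is in Liliidae; that fact follows from the two given triples and the meaning of the standard predicates.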
Relational Model – Implicit Relationship
Question: In which Subclass is WhiteLadySlipper located?
Develop a query:
  SELECT Subclass
  FROM Species_Table, Subclass_Table, Species_Family_Table
  WHERE Species_Name = 'WhiteLadySlipper' AND ID = SP_ID AND Family = FM_ID
Answer: In Liliidae
Tables:
• Species Table (ID, Species Name): IDC, WhiteLadySlipper
• Species_Family Table (SP_ID, FM_ID): IDC, OrchidFamily
• Subclass Table (Family, Subclass): OrchidFamily, Liliidae
Where are the relationships?
• Relationships are in documents, SQL code and collective memories, not available to applications!
• Data Definition Statements? Applications do not use them, they are not descriptive and their scope is a single database
• Data Dictionary? Data Registry? They are for human use, not computer use
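The relational version of the question can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the table and column names are normalized guesses based on the slide (underscores added, `Species_Name` spelled out), and the single row per table is the one shown in the figure.

```python
import sqlite3

# In-memory database; table and column names are assumptions based on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Species_Table        (ID TEXT, Species_Name TEXT);
    CREATE TABLE Species_Family_Table (SP_ID TEXT, FM_ID TEXT);
    CREATE TABLE Subclass_Table       (Family TEXT, Subclass TEXT);
    INSERT INTO Species_Table        VALUES ('IDC', 'WhiteLadySlipper');
    INSERT INTO Species_Family_Table VALUES ('IDC', 'OrchidFamily');
    INSERT INTO Subclass_Table       VALUES ('OrchidFamily', 'Liliidae');
""")

# The join conditions encode the relationships; nothing in the data says
# that SP_ID refers to a species or that FM_ID refers to a family.
query = """
    SELECT Subclass
    FROM Species_Table, Subclass_Table, Species_Family_Table
    WHERE Species_Name = 'WhiteLadySlipper'
      AND ID = SP_ID
      AND Family = FM_ID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The answer is correct, but only because the query author knew which columns to join. That knowledge lives in the SQL text, not in the database.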
Semantic Model – Explicit Relationship: Changes are Easy to Make
Information given:
• WhiteLadySlipper type OrchidFamily
• OrchidFamily subClassOf Liliidae
• OrchidFamily subClassOf Orchidales (new data)
• Orchidales subClassOf Liliidae (new data)
Information automatically inferred:
• WhiteLadySlipper type Orchidales
• WhiteLadySlipper type Liliidae
Question: In which subclass does WhiteLadySlipper belong?
Answer: Liliidae
type and subClassOf are from the W3C RDF and RDFS standards; subClassOf is transitive, so type membership propagates up the hierarchy
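A small Python sketch of why the change is cheap: the new Orchidales level is added as data, and the function that answers the question is left untouched. The dictionaries and function names are hypothetical stand-ins for a triple store with RDFS inference.

```python
def ancestors(cls, sub_class_of):
    """Follow subClassOf links transitively, starting from `cls`."""
    seen, stack = set(), [cls]
    while stack:
        for parent in sub_class_of.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def types_of(instance, type_of, sub_class_of):
    """Direct types of `instance` plus all of their ancestors."""
    result = set(type_of.get(instance, set()))
    for cls in list(result):
        result |= ancestors(cls, sub_class_of)
    return result

# Original data from the slide.
type_of = {"WhiteLadySlipper": {"OrchidFamily"}}
sub_class_of = {"OrchidFamily": {"Liliidae"}}
before = types_of("WhiteLadySlipper", type_of, sub_class_of)

# New data: an Order level is inserted. Only facts are added, no code changes.
sub_class_of["OrchidFamily"].add("Orchidales")
sub_class_of["Orchidales"] = {"Liliidae"}
after = types_of("WhiteLadySlipper", type_of, sub_class_of)

print(before)
print(after)
```

The original answer (Liliidae) is still found after the change, and the new intermediate class appears in the result without any edit to the query logic.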
Relational Model – Changes at Great Peril
Question: In which Subclass is WhiteLadySlipper located?
Develop a query:
  SELECT Subclass
  FROM Species_Table, Subclass_Table, Species_Family_Table
  WHERE Species_Name = 'WhiteLadySlipper' AND ID = SP_ID AND Family = FM_ID
Tables before the change:
• Species Table (ID, Species Name): IDC, WhiteLadySlipper
• Species_Family Table (SP_ID, FM_ID): IDC, OrchidFamily
• Subclass Table (Family, Subclass): OrchidFamily, Liliidae
Tables after the change (an Order level is added):
• Species Table (ID, Species Name): IDC, WhiteLadySlipper
• Species_Family Table (SP_ID, FM_ID): IDC, OrchidFamily
• Family_Order Table (Family_ID, Order_ID): OrchidFamily, ORD
• Order Table (Order Name, ID, Subclass): Orchidales, ORD, Liliidae
Result: the original query doesn’t work any more. Get no answer!
Changes should be avoided at ALL costs
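The breakage can be sketched with Python's built-in `sqlite3` module. The schema below is an assumption reconstructed from the slide's "after" tables, with names normalized. In this sketch the old query fails with an error because the Subclass table no longer exists; in other restructurings it might instead return no rows, which is the case the slide describes.

```python
import sqlite3

# Restructured schema (assumed from the slide): an Order level now sits
# between Family and Subclass, and the old Subclass table is gone.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Species_Table        (ID TEXT, Species_Name TEXT);
    CREATE TABLE Species_Family_Table (SP_ID TEXT, FM_ID TEXT);
    CREATE TABLE Family_Order_Table   (Family_ID TEXT, Order_ID TEXT);
    CREATE TABLE Order_Table          (Order_Name TEXT, ID TEXT, Subclass TEXT);
    INSERT INTO Species_Table        VALUES ('IDC', 'WhiteLadySlipper');
    INSERT INTO Species_Family_Table VALUES ('IDC', 'OrchidFamily');
    INSERT INTO Family_Order_Table   VALUES ('OrchidFamily', 'ORD');
    INSERT INTO Order_Table          VALUES ('Orchidales', 'ORD', 'Liliidae');
""")

old_query = """
    SELECT Subclass
    FROM Species_Table, Subclass_Table, Species_Family_Table
    WHERE Species_Name = 'WhiteLadySlipper' AND ID = SP_ID AND Family = FM_ID
"""
try:
    conn.execute(old_query).fetchall()
    old_query_error = None
except sqlite3.OperationalError as exc:
    old_query_error = str(exc)   # the old query no longer runs

# The query has to be rewritten by someone who knows the new join path.
new_query = """
    SELECT o.Subclass
    FROM Species_Table s
    JOIN Species_Family_Table sf ON s.ID = sf.SP_ID
    JOIN Family_Order_Table fo   ON sf.FM_ID = fo.Family_ID
    JOIN Order_Table o           ON fo.Order_ID = o.ID
    WHERE s.Species_Name = 'WhiteLadySlipper'
"""
rows = conn.execute(new_query).fetchall()
print(old_query_error)
print(rows)
```

Every application that embedded the old query has to be found and rewritten, which is the peril the slide warns about.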
Changing Taxonomy Affects Only the View on the Species
[Figure edge labels: rdf:type, rdfs:subClassOf]
How About Unstructured Documents with No Semantics?
Designed for humans (90%+), not for computers
Turn Document into Semantic Model
Query: Find me a Thai restaurant that is Halal, not too expensive, no alcohol served, near Jurong Point Shopping Centre in Singapore.
[Figure: RDF graph for one restaurant; the pairing of property labels and values below is reconstructed from the figure]
• type: Restaurant
• address: Jurong Point
• address: “1 jurong west central 2”
• cuisine: Thai, Indonesian
• halal: true
• halalAuth: http://www.zabihah.com/ds.php?id=1716
• alcoholServed: False
• price: Median
• phone: “65-6792-6593”
• city: Singapore
• country: Singapore
Intelligent, complex, ad hoc queries, beyond keyword search, are now possible
Even the best keyword search engine gives very unsatisfactory results
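Once the page is turned into triples, the restaurant question becomes a conjunction of simple patterns. The Python sketch below assumes the subject identifier `restaurant1` (hypothetical) and includes a second, invented entry (`restaurant2`) purely to show that the filter excludes non-matching data. The `find` helper is also hypothetical.

```python
triples = [
    # Values taken from the slide; the subject name is a placeholder.
    ("restaurant1", "type", "Restaurant"),
    ("restaurant1", "cuisine", "Thai"),
    ("restaurant1", "cuisine", "Indonesian"),
    ("restaurant1", "halal", True),
    ("restaurant1", "alcoholServed", False),
    ("restaurant1", "price", "Median"),
    ("restaurant1", "city", "Singapore"),
    # Hypothetical second entry, invented only to demonstrate filtering.
    ("restaurant2", "type", "Restaurant"),
    ("restaurant2", "cuisine", "Thai"),
    ("restaurant2", "halal", False),
    ("restaurant2", "alcoholServed", True),
    ("restaurant2", "city", "Singapore"),
]

def find(conditions):
    """Subjects for which every (predicate, object) pair in `conditions` is asserted."""
    facts = set(triples)
    subjects = {s for s, _, _ in triples}
    return sorted(s for s in subjects
                  if all((s, p, o) in facts for p, o in conditions))

result = find([
    ("type", "Restaurant"),
    ("cuisine", "Thai"),
    ("halal", True),
    ("alcoholServed", False),
    ("city", "Singapore"),
])
print(result)
```

A keyword search for "Thai halal no alcohol" cannot distinguish a page that says "no alcohol served" from one that says "alcohol served"; the structured query can.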
Ad Hoc Query with SPARQL – Biological Processes in Dendrites • Alzheimer’s disease is characterized by neural degeneration. • Among other things, there is damage to dendrites and axons, parts of nerve cells. • What resources do we have available to learn more about biological processes in dendrites?
Query Gene Ontology (GO) for Clues Inference at work
Looking for Alzheimer Disease Targets Signal transduction pathways are considered to be rich in “druggable” targets - proteins that might respond to chemical therapy CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease. Can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?
A SPARQL Query Spanning 4 Sources
Ad hoc queries over multiple data sources (in RDF) are easy
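The slide's actual SPARQL query is not reproduced in this transcript. As a rough illustration of why querying across RDF sources is simple, the Python sketch below merges two hypothetical sources and intersects two patterns. Every identifier in it (the `ex:` names, `source_a`, `source_b`, `subjects`) is an invented placeholder, not real annotation data.

```python
# Two hypothetical RDF sources. All identifiers are invented placeholders.
source_a = {  # stands in for a source annotating genes with processes
    ("ex:gene1", "ex:involvedIn", "ex:signal_transduction"),
    ("ex:gene2", "ex:involvedIn", "ex:lipid_metabolism"),
}
source_b = {  # stands in for a source recording where genes are active
    ("ex:gene1", "ex:expressedIn", "ex:CA1_pyramidal_neuron"),
    ("ex:gene2", "ex:expressedIn", "ex:CA1_pyramidal_neuron"),
}

# Integration is a set union: shared identifiers line up automatically,
# with no schema mapping step.
merged = source_a | source_b

def subjects(graph, predicate, obj):
    """Subjects that have `predicate` pointing at `obj`."""
    return {s for s, p, o in graph if p == predicate and o == obj}

# Genes involved in signal transduction AND active in pyramidal neurons.
candidates = (
    subjects(merged, "ex:involvedIn", "ex:signal_transduction")
    & subjects(merged, "ex:expressedIn", "ex:CA1_pyramidal_neuron")
)
print(candidates)
```

The same idea extends to four sources: as long as they use the same URIs for the same genes, their triples can be combined and queried as one graph.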
Finally, Semantic Technology: A Different but Better Mouse Trap
• Semantic Technology gives us an integrated view of available knowledge
• Database without schema or tables; easy to change
• Distribution and integration of data are easy and seamless with URI and RDF
• Separation of taxonomic information and individual data
• Ontology to bridge different taxonomies
• A query language (SPARQL) for ad hoc pattern matching
• Ideal for modeling and accessing life science data
Potential Applications
• Bridging natural herbal medicine and western medical knowledge
• Fresh water fishery (environmental) management
• Plant biotechnology, Precision Agriculture
• Biodiversity repository
All based on a single framework to model, integrate and access different life science “knowledge” sources
Semantic Technology for Life Science Technology for a healthier life