
Ontology-based Data Matching and Applications




Presentation Transcript


  1. Ontology-based Data Matching and Applications Dr. Ioana Ciuciu: iciuciu@vub.ac.be

  2. Overview • Idea • Basic notions • Ontology-based Data Matching Framework (ODMF) • ODMF Algorithms • ODMF Strategies • Applications – EU projects • Prolix • 3D Anatomical Human • TAS3 • DIYSE (Do It Yourself Smart Experience)

  3. Idea • …from last week’s OIS course • Step 5: Link your data with other data sources • Manually (e.g. FOAF profiles) • Automated linking algorithms (large data sets) • Ex: the common identifier ISBN 0747581088 links “Harry Potter and the Half-Blood Prince” across the RDF Book Mashup and DBpedia: <http://dbpedia.org/resource/Harry_Potter_and_the_Half-Blood_Prince> owl:sameAs <http://www4.wiwiss.fu-berlin.de/bookmashup/books/0747581088>

  4. Idea • What if there is no common identifier? • Use semantic information • Two resources with properties, relations between properties, etc. • Perform matching at the semantic level • Find a similarity score • (diagram: Resource 1 and Resource 2 → Matching → similarity score in %)

  5. Basic notions • Ontology Matching (Alignment) • Data Matching • Ontology-based Data Matching • (diagram: Resource 1 and Resource 2, annotated with Ontology A and Ontology B, matched with a % score)

  6. Ontology-based Data Matching Framework (ODMF) • ODMF • 9 algorithms and 7 strategies in total; WordNet + multilingual terminography, OntoGraM, GRASIM, LeMaSt and C-FOAM are the innovative strategies • ODMF matching levels • String matching (e.g. SecondString library) • Lexical matching (WordNet-based) • Graph matching (ontology-based)

  7. String Matching Algorithms • String similarity of two objects with the same super-ordinate concept • SecondString library • + new ODMF algorithms • ODMF.UnsmoothedJS (same context, e.g. competency) • ODMF.JaroWinklerTFIDF (fuzzy matching, e.g. “hearth” → “heart”) • Example scores for Competence 1: “Obtain and test capillary blood samples” vs. Competence 2: “Obtain and test specimens from individuals”: 0.4 (MongeElkan), 0.9 (WinklerRescorer), 0.5 (UnsmoothedJS, JaroWinklerTFIDF, TFIDF (Term Frequency–Inverse Document Frequency))
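ODMF’s string metrics come from the Java SecondString library. As a rough, language-independent illustration of edit-based string similarity (a stand-in, not the ODMF implementation), Python’s standard difflib yields a comparable ratio in [0, 1]:

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Rough string-similarity score in [0, 1]; a difflib stand-in for
    SecondString metrics such as MongeElkan or Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

c1 = "Obtain and test capillary blood samples"
c2 = "Obtain and test specimens from individuals"
score = string_similarity(c1, c2)  # partial overlap -> score strictly between 0 and 1
```

Note how close misspellings such as “hearth” vs. “heart” still score high, which is the behavior fuzzy matching relies on.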

  8. String Matching Algorithms • Advantages • Easy to implement • No complex knowledge resource required • Objects are described in natural language • Drawbacks • Different similarity scores for different wordings (e.g. with Jaro, Jaro-Winkler: “Ioana Ciuciu” ≠ “Ciuciu Ioana”) • Little explanation of the result to the user

  9. Lexical Matching Algorithms • Semantic similarity of two objects with the same super-ordinate concept • Object descriptions are terminologically annotated (WordNet-based) • Competence 1: “Obtain and test capillary blood samples” → Set1 = {obtain, test, capillary blood sample} • Competence 2: “Obtain and test specimens from individuals” → Set2 = {obtain, test, specimens, individuals} • Jaccard similarity coefficient: |Set1 ∩ Set2| / |Set1 ∪ Set2| = 2/5 = 0.4 • ODMF versions: Lexical1, Lexical2, Lexical3
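The Jaccard coefficient on this slide can be sketched directly, using the slide’s example term sets:

```python
def jaccard(set1: set, set2: set) -> float:
    """Jaccard similarity coefficient: |intersection| / |union|."""
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

set1 = {"obtain", "test", "capillary blood sample"}
set2 = {"obtain", "test", "specimens", "individuals"}
score = jaccard(set1, set2)  # 2 shared terms out of 5 total -> 0.4
```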

  10. WordNet • Lexical database of English (George A. Miller, Princeton, 1985) • Tool for computational linguistics & natural language processing • Synsets – sets of synonyms, each representing a distinct concept

  11. WordNet • Examples of semantic & lexical relations between synsets (nouns) • Hypernym – hyponym (e.g. canine – dog) • Holonym – meronym (e.g. car – wheel) • In ODMF, we use the semantic relations defined in WordNet for matching two resources • Ex.: “student” & “pupil” belong to the same synset

  12. Lexical Matching Algorithms(Jaccard-based) • Lexical1 • Does not make use of hypernym-hyponym relation • Lexical2 • Takes the hypernym-hyponym relation between terms into account • Knowledge engineer needs to create the type hierarchy between terms • Lexical3 • Takes the hypernym-hyponym relation between terms into account • Makes use of WordNet as upper ontology (concepts & relations) • Automatically converts information from WordNet to a Categorisation Framework • Knowledge engineer is assisted with possible concept suggestions

  13. Example: Competence description • Let a competence be, by definition, described as: Competence description = Persons Act or Interact on Objects in Manners using Instruments at Locations at Times • Then, e.g., for the competence “Obtain and test capillary blood samples” the annotation could look like: • Person • Agent • Action • Obtain capillary blood sample • Test capillary blood sample • Object • Capillary blood sample

  14. Example: possible sets of terms • Lexical1 {Agent, Obtain capillary blood sample, Test capillary blood sample, Capillary blood sample} • Lexical2 (Lexical1+hypernyms) {Person, Agent, Action, Obtain capillary blood sample, Test capillary blood sample, Object, Capillary blood sample}
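The Lexical2 expansion (term set plus hypernyms) can be sketched as follows. The hypernym table mirrors the slide-13 competence model and is illustrative only; in ODMF the knowledge engineer maintains this type hierarchy:

```python
HYPERNYMS = {  # term -> hypernym, following the slide-13 competence model
    "Agent": "Person",
    "Obtain capillary blood sample": "Action",
    "Test capillary blood sample": "Action",
    "Capillary blood sample": "Object",
}

def expand(terms: set) -> set:
    """Lexical2-style term set: the original terms plus their hypernyms."""
    return terms | {HYPERNYMS[t] for t in terms if t in HYPERNYMS}

lexical1 = {"Agent", "Obtain capillary blood sample",
            "Test capillary blood sample", "Capillary blood sample"}
lexical2 = expand(lexical1)  # adds Person, Action, Object -> 7 terms
```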

  15. Example: possible sets of terms • Lexical3 (WordNet) • Annotation • blood: the fluid (red in vertebrates) that is pumped by the heart • capillary: any of the minute blood vessels connecting arterioles with venules • obtain: come into possession of • sample: a small part of something intended as representative of the whole • test1: determine the presence or properties of (a substance) • test2: test or examine for the presence of disease or infection • Set of terms {blood, capillary, obtain, sample, test1, test2}

  16. Lexical Matching Algorithms Advantages The score is less dependent on the wording of the object descriptions (because synonyms and translation equivalents will be taken into account using the term base) The two compared sets of terms produce evidence about the score to the end user. Drawbacks The terminological resource (used to annotate object descriptions) should sufficiently cover the different domains.

  17. Graph Matching Algorithms • Recall: Every ODMF strategy contains at least one graph algorithm • Idea: using classification information of objects, the relations between objects, and the properties of objects, to: • Calculate the similarity between two objects (e.g. two competences) • Find related objects for a given object (e.g. find relevant qualifications to improve a competency) • Graph: semantic graph (terminological ontology) • Vertices: concepts • Edges: semantic relations between concepts • Bidirected (role - co-role) • 2 ODMF graph matching algorithms • Ontology-based Graph Matching (OntoGraM) • Graph-Aided Similarity Calculation (GRASIM)

  18. Ontology-based Graph Matching (OntoGraM) • Two objects: their classification, properties & semantic relations • Domain ontology + application ontology • Rule-based reasoning: forward chaining to infer new knowledge from the existing knowledge in the knowledge base • Ex: competency-based HRM ontology – competency as reference • Person p: Competencies_p = {c1,…,cn} • Task t: Competencies_t = {c1,…,cm} • Compare p & t by comparing Competencies_p & Competencies_t

  19. Ontology-based Graph Matching (OntoGraM) • Uses the holonym-meronym relation between competences (or tasks) • Ex (HSC237 – Health & Social Care Standard): “Obtain and test capillary blood samples” has the parts “Obtain capillary blood” and “Test, record and report on capillary blood sample results” • If a person has both sub-competences, we can deduce that he/she also has the competence “Obtain and test capillary blood samples” • 2 OntoGraM versions • OntoGraM version 1 (Graph1) • The one described above • OntoGraM version 2 (Graph2) • Uses extra (fuzzy) relations such as “is slightly similar”, “is moderately similar” or “is very similar” • More accurate results across domains/organizations, provided that the relations are applied correctly
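The holonym-meronym inference described on this slide can be sketched as a tiny forward-chaining loop. The PART_OF table follows the HSC237 example; the function and table names are illustrative, not ODMF’s API:

```python
# Meronym -> holonym table, after the HSC237 example:
# holding every part of a composite competence implies the whole.
PART_OF = {
    "Obtain capillary blood": "Obtain and test capillary blood samples",
    "Test, record and report on capillary blood sample results":
        "Obtain and test capillary blood samples",
}

def infer_competences(held: set) -> set:
    """Forward-chain to a fixpoint: add a composite competence once all
    of its known parts are held."""
    inferred = set(held)
    changed = True
    while changed:
        changed = False
        for whole in set(PART_OF.values()):
            parts = {p for p, w in PART_OF.items() if w == whole}
            if whole not in inferred and parts <= inferred:
                inferred.add(whole)
                changed = True
    return inferred
```

A person holding both sub-competences is thereby credited with the composite one; holding only one part infers nothing.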

  20. OntoGraM example • Get the similarity score between • Task = “Agree closure with customer” • Competence: “Process skills”, competence level: Good • Competence: “Soft skills”, competence level: Good • Function = “Junior SAT” • Competence: “Administrative skills & other skills”, competence level: Good • Competence: “Process skills”, competence level: Good • Competence: “Soft skills”, competence level: Good • Competence: “Technical skills”, competence level: Good • Remark: every competence has a list of sub-competences, and so on. Ex: “Process skills” has sub-competences {“additional tasks”, “escalations”, “information”, “products”, “supplier”} • Result: sim = 0.5 (acc. to Graph1; 2 competences overlap with the 4 required) • sim = 0.7593 (acc. to Graph2; intersection: 41 competences; union: 54 competences; all sub-competences are taken into account)

  21. OntoGraM Advantages The similarity of two objects that belong to a different object type can be calculated with a degree of accuracy comparable to that of a human expert. Extensive evidence for the calculated score may be presented to the end user (by reporting on the rules that were applied to calculate the score). Drawbacks The management of the application ontology requires a considerable effort by the knowledge engineer.

  22. Graph-Aided Similarity Calculation (GRASIM) • Idea [11]: • Compute the shortest path between two sub-graphs of a given graph (Dijkstra's or another shortest-path algorithm) • Use Semantic Decision Tables (SDTs) to freely & correctly adjust the graph algorithm (assign weights) • Convert the shortest path into a similarity score

  23. GRASIM – the algorithm • Select a shortest path algorithm (Dijkstra's shortest path algorithm) • Label graph arcs using SDTs • Calculate shortest paths (SP) • Calculate similarity (S) • Tang, Y. et al. (2010): Towards Freely and Correctly Adjusted Dijkstra's Algorithm with Semantic Decision Tables for Ontology Based Data Matching, in Proc. of the 2nd International Conference on Computer and Automation Engineering (ICCAE 2010), ISBN: 978-1-4244-5586-7, Singapore, February 26–28, 2010
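A minimal sketch of the GRASIM idea, assuming a toy weighted concept graph and an illustrative 1/(1 + cost) mapping from path cost to similarity; the actual SDT-assigned weights and conversion are defined in the cited paper:

```python
import heapq

def dijkstra(graph: dict, start: str, goal: str) -> float:
    """Shortest-path cost in a weighted graph {node: {neighbour: weight}}."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return float("inf")

def grasim_score(graph: dict, a: str, b: str) -> float:
    """Convert path cost to a similarity in (0, 1]; the 1/(1+cost)
    mapping is an illustrative assumption, not the ODMF formula."""
    cost = dijkstra(graph, a, b)
    return 0.0 if cost == float("inf") else 1.0 / (1.0 + cost)

# Toy concept graph; the SDT-assigned edge weights are invented here.
g = {
    "heart": {"capillary": 1.0, "blood": 0.5},
    "capillary": {"heart": 1.0, "blood": 0.5},
    "blood": {"heart": 0.5, "capillary": 0.5},
}
```

Identical concepts get cost 0 and similarity 1; unreachable concepts get similarity 0.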

  24. ODMF Strategies • Combination and composition of algorithms • Strategy One: Lexon Matching Strategy (LeMaSt) • Strategy Two: Controlled Fully Automated Ontology Based Matching Strategy (C-FOAM)

  25. Lexon Matching Strategy (LeMaSt) • LeMaSt combines • A string matching algorithm • A lexical matching algorithm • Lexon matching • Semantic similarity of the object descriptions (2 lexon sets) • Object descriptions annotated with lexons • Ex: the competence “Obtain and test capillary blood samples” annotated with the lexon <“Obtain and test capillary blood samples”, “agent”, “obtains”, “obtained by”, “capillary blood sample”> • Interpreted as: for the competence “Obtain and test capillary blood samples”, “an agent obtains a capillary blood sample” & “a capillary blood sample is obtained by an agent”

  26. Lexon Matching Strategy (LeMaSt) • Extension of the Jaccard similarity coefficient • xi – contribution score of each lexon: • xi = 1, if same head, role, co-role & tail • xi = 0.5, if same head & tail but different role & co-role • xi = 0.5, if same head or tail and same role & co-role • xi = 0.25, if same head or tail • C – contribution score (calculated from the Jaccard similarity scores) • S – average score of the lexons • Once S is calculated, OntoGraM is run to calculate the final matching score • uses lexical relations (e.g. holonym-meronym) between competences • uses semantic constraints (e.g. cardinality constraints)

  27. Extended Jaccard similarity coefficient – pseudo code
  FUNCTION FLOAT calculateScore(Ontology-DB) {
    LS1 = load lexons of object 1 from Ontology-DB;
    LS2 = load lexons of object 2 from Ontology-DB;
    add = 0.0;
    union = 0;
    WHILE (LS1.hasNext()) {
      Lexon lexon1 = (Lexon) LS1.next();
      Lexon lexon2 = null;
      union += 1;
      Lexon lexon2MaxOverlap = null;
      dMaxOverlap = 0.0;
      WHILE (LS2.hasNext()) {
        lexon2 = (Lexon) LS2.next();
        IF (lexon1.Head.equals(lexon2.Head)) AND (lexon1.Tail.equals(lexon2.Tail))
           AND (lexon1.Role.equals(lexon2.Role)) AND (lexon1.CoRole.equals(lexon2.CoRole)) {
          dMaxOverlap = 1.0; // same lexon
          lexon2MaxOverlap = lexon2;
          lexon2MaxOverlap.dMaxOverlap = dMaxOverlap;
          break;
        }

  28. Extended Jaccard similarity coefficient – pseudo code (continued)
        ELSE IF (lexon1.Head.equals(lexon2.Head)) AND (lexon1.Tail.equals(lexon2.Tail)) {
          dMaxOverlap = x1; // e.g. 0.5: same head & tail, different role & co-role
          IF ((lexon2MaxOverlap == null) OR (lexon2MaxOverlap.dMaxOverlap < dMaxOverlap)) {
            lexon2MaxOverlap = lexon2;
            lexon2MaxOverlap.dMaxOverlap = dMaxOverlap;
          }
        }
        ELSE IF ((lexon1.Head.equals(lexon2.Head)) OR (lexon1.Tail.equals(lexon2.Tail)))
                AND (lexon1.Role.equals(lexon2.Role)) AND (lexon1.CoRole.equals(lexon2.CoRole)) {
          dMaxOverlap = x2; // e.g. 0.5: same head or tail, same role & co-role
          IF ((lexon2MaxOverlap == null) OR (lexon2MaxOverlap.dMaxOverlap < dMaxOverlap)) {
            lexon2MaxOverlap = lexon2;
            lexon2MaxOverlap.dMaxOverlap = dMaxOverlap;
          }
        }
        ELSE IF (lexon1.Head.equals(lexon2.Head)) OR (lexon1.Tail.equals(lexon2.Tail)) {
          dMaxOverlap = x3; // e.g. 0.25: same head or tail only
          IF ((lexon2MaxOverlap == null) OR (lexon2MaxOverlap.dMaxOverlap < dMaxOverlap)) {
            lexon2MaxOverlap = lexon2;
            lexon2MaxOverlap.dMaxOverlap = dMaxOverlap;
          }
        }
      } // inner WHILE over LS2
      add += dMaxOverlap;
    } // outer WHILE over LS1
    WHILE (LS2.hasNext()) {
      … // symmetric pass: best overlap of each LS2 lexon against LS1
    }
    score = add / union;
    RETURN score;
  }
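The lexon-matching loop from the pseudo code can be condensed into runnable Python. The Lexon fields and helper names are illustrative; the contribution scores follow slide 26:

```python
from typing import NamedTuple

class Lexon(NamedTuple):
    head: str
    role: str
    co_role: str
    tail: str

def lexon_overlap(l1: Lexon, l2: Lexon) -> float:
    """Contribution score x_i of a lexon pair (rules from slide 26)."""
    same_ht = l1.head == l2.head and l1.tail == l2.tail
    some_ht = l1.head == l2.head or l1.tail == l2.tail
    same_rc = l1.role == l2.role and l1.co_role == l2.co_role
    if same_ht and same_rc:
        return 1.0   # identical lexon
    if same_ht:
        return 0.5   # same head & tail, different role/co-role
    if some_ht and same_rc:
        return 0.5   # same head or tail, same role & co-role
    if some_ht:
        return 0.25  # same head or tail only
    return 0.0

def lemast_score(ls1: list, ls2: list) -> float:
    """Extended Jaccard: sum of each lexon's best overlap in the other
    set, divided by the union size (both directions, as in the loop)."""
    add, union = 0.0, 0
    for a, b in ((ls1, ls2), (ls2, ls1)):
        for l1 in a:
            union += 1
            add += max((lexon_overlap(l1, l2) for l2 in b), default=0.0)
    return add / union if union else 0.0
```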

  29. Controlled Fully Automated Ontology Based Matching Strategy (C-FOAM) • String matching • Lexical matching • Graph matching (any combination of different graph algorithms)

  30. Controlled Fully Automated Ontology Based Matching Strategy (C-FOAM) • 2 modules: (1) Interpreter and (2) Comparator • Pre-processing by the Interpreter: matching at string level and at lexical level (e.g. both “hearty” and “warmhearted” are resolved to the ontology concept “heart” plus its annotation set, with a penalty applied to the score) • Ontology-based Comparator: graph matching on the interpreted concepts, producing the final similarity score

  31. Controlled Fully Automated Ontology Based Matching Strategy (C-FOAM) • Advantages • Inherits all the advantages of the selected algorithms • Supports fuzzy inputs from end users (e.g. both “hearty” and “warmhearted” are interpreted as “heart”), so it gives end users a lot of freedom • Robust (rarely fails) • Drawbacks • Inherits all the disadvantages of the selected algorithms • C-FOAM's complexity is worse than that of the worst of the selected algorithms, since it is a composition of them • The algorithms are inter-dependent (one algorithm must wait until another finishes its calculation)
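A hypothetical end-to-end sketch of the C-FOAM flow (Interpreter first, then Comparator); the concept table, synonym map, and penalty value are invented for illustration and are not the ODMF implementation:

```python
# Hypothetical stand-ins for the ODMF matchers; names are illustrative.
ONTOLOGY_TERMS = {"heart", "capillary", "blood"}
SYNONYMS = {"hearty": "heart", "warmhearted": "heart", "kneecap": "patella"}

def interpret(user_input: str, penalty: float = 0.1):
    """Interpreter: map a fuzzy user string to an ontology concept.
    Exact hits cost nothing; lexical (synonym) hits incur a penalty."""
    term = user_input.lower()
    if term in ONTOLOGY_TERMS:
        return term, 0.0
    if term in SYNONYMS:
        return SYNONYMS[term], penalty
    return None, 1.0  # uninterpretable input

def c_foam(input1: str, input2: str, graph_match) -> float:
    """Comparator: graph-match the interpreted concepts, minus penalties."""
    c1, p1 = interpret(input1)
    c2, p2 = interpret(input2)
    if c1 is None or c2 is None:
        return 0.0
    return max(0.0, graph_match(c1, c2) - p1 - p2)

# Both inputs resolve to "heart"; a trivial graph matcher stands in here.
score = c_foam("hearty", "warmhearted", lambda a, b: 1.0 if a == b else 0.5)
```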

  32. ODMF Applications • 3 EU projects • FP6 Prolix • Competency matching (learning & training in Competency Management) • FP6 3D Anatomical Human (3DAH) • Anatomic data matching (human anatomy: images, videos, text, etc.) • FP7 TAS3 (Trusted Architecture for Securely Shared Services) • Security attributes matching (security and privacy)

  33. Prolix – competency matching • (screenshots: Straightforward & Heart, Course ITIL1)

  34. GRASIM result - example • Matching a competence (“heart”) with a learning material (“Problem Solving and decision making”)

  35. 3DAH – anatomic data matching • Virtual teacher – 3 components: • Knowledge Base • Anatomy Browser • Controlled Fully Automated Ontology-based Data Matching Strategy (C-FOAM) • Match: anatomical data (images, videos, books, etc.) with user knowledge (captured from Computer-Human Interactions) • Purpose • Evaluate students • Retrieve and deliver personalized suggestions on the learning materials in order to improve the students’ skills

  36. Virtual teacher – the Knowledge Base • Extensor Hallucis Longus Tendon (image) • Acetabular labrum (video)

  37. Virtual Teacher – the Anatomy Browser • 3DAH Viewer [UNIGE&CRS4] • Knowledge Interaction Framework • Queries submitted to the CMS • Online knowledge retrieval: http://3dah.miralab.ch/index.php?option=com_remository&Itemid=78&func=finishdown&id=107

  38. Virtual Teacher – C-FOAM • Ω – the musculoskeletal ontology • C-FOAM computes the overlapping rate of Li and Li’ • L – lexon set describing labeled concepts (e.g. “Patella”) • L’ – lexon set describing the learning materials (e.g. “Imaging of the dysplasia”) • The student input is linked to a concept label via its interpretation, together with its synonym set • 3 possible situations

  39. eLearning Scenario – 8 steps • 1. The viewer shows the highlighted zone • 2. The student gives input: “labrum” • 3.–4. (black box) The matching engine finds “labrum” → Acetabular labrum, x labrum, y labrum, … • 5. 40% correct → calculate → 5 questions • 6. Final score: 70% • 7. The computer shows the correct answers for any answer ≠ 100% • 8. The computer finds the learning materials for Acetabular labrum: “Hip Arthroscopy”, “Acetabular Labral Tears”, “Operative Hip Arthroscopy”

  40. Results

  41. Interpretation • “Patella” – 4 situations • “Patella” – 1 • “Patela” – 0.98 (spelling error) • “Kneecap” – 0.75 (synonym) • “Knee” – 0.69 (synonym + spelling error) • C-FOAM handles them with • JaroWinkler (“patela” → “patella”) • WordNet (synonyms, e.g. “kneepan”) • LeMaSt • Combined (advanced C-FOAM) • Both use LeMaSt (a graph algorithm) to match against the learning materials

  42. TAS3 – security policy matching • Semantic interoperability • Between SR & SP • Security Policy Ontology (SecPODE) • Subjects, Actions, Targets • (Domination) Relation lookup (between 2 terms originating from different security policies) • C-FOAM matching strategy • Context: authorization architecture

  43. Ontology-based Interoperation Service (OBIS) • OBIS – a Web service (WS) • OBIS main method: • Val:

  44. OBIS • Example of security concept dominance:

  45. OBIS Architecture • (diagram: controlled mapping)

  46. Use Case Scenario • Employability Demonstrator

  47. Use Case Scenario

  48. Use Case Scenario

  49. Use Case Scenario • Service requester = Placement Co-ordinator • Resource = Email • (diagram: numbered message flow, steps 1–10)

  50. OBIS for credential validation • Step 1: CVS invokes OBIS: (‘Placement Co-ordinator’, ‘Placement Service’, Val) • Step 2: OBIS performs the (ontology-based data) matching: matching(‘Placement Co-ordinator’, ‘Placement Service’) • Step 3: OBIS returns Val = 1 (i.e. ‘equivalence’)
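The three-step CVS/OBIS exchange can be sketched as a stub; the equivalence table and return codes are illustrative (Val = 1 for ‘equivalence’, as on the slide):

```python
# Hypothetical equivalence table between role terms originating from
# two different security policies.
EQUIVALENT = {("Placement Co-ordinator", "Placement Service")}

def obis_match(term1: str, term2: str) -> int:
    """OBIS-style relation lookup: returns Val = 1 for 'equivalence',
    0 when no relation between the two terms is known."""
    pair = (term1, term2)
    if term1 == term2 or pair in EQUIVALENT or pair[::-1] in EQUIVALENT:
        return 1
    return 0
```

A CVS-style caller would invoke this with the requester’s role and the resource’s policy term and act on the returned Val.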
