Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching
Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Funded by NGA & US Air Force
Topic Outline
• Problem Statement
• Background Information
• Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT Matching)
  - Geographic Matching (GT Matching)
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
• Experimental Results
• Future Work
Motivation • Internet Architecture • Highly Distributed • Federated Architecture • Web Application Problems • Low Performance for Information Retrieval • Accuracy of Retrieved Information
Sample Scenario Query: Publication of Academic Staff Rank Data Source MIT Ontology UMBC Ontology Karlsruhe Ontology {Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer}
Different Bibliography Ontologies UMBC Ontology MIT Ontology Karlsruhe Ontology
Problem Statement: Schema Matching
Given two data sources, S1 and S2, each composed of a set of tables, where S1 = {T11, T12, T13, …, T1k, …, T1m} and S2 = {T21, T22, T23, …, T2j, …, T2n}, with 1 <= k <= m and 1 <= j <= n, determine the similarity between T1k and T2j.
[Figure: a Road table in S1 compared with a Road table in S2]
Problem Statement: Ontology Matching
Given two ontologies, O1 and O2, each composed of a set of concepts, where O1 = {C11, C12, C13, …, C1k, …, C1m} and O2 = {C21, C22, C23, …, C2j, …, C2n}, with 1 <= k <= m and 1 <= j <= n, determine the similarity between C1k and C2j.
Motivating Scenarios
1. Making Complex Business Decisions: “Should we invest in a new cholesterol drug for the Asia-Pacific region?” Yes/No/Maybe?
[Figure labels: Regulatory Affairs, R & D, Corporate, Marketing, Manufacturing]
2. Robust Semantic Web Applications: “Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting, and happened between 9AM-11AM within 5 miles of UTD” Yes/No/Maybe?
[Figure labels: RDFS Lookup, Temporal Logic, Geospatial Ontology, Social Network; Jeff, Jeff’s friends; Within 5 miles of UTD; Event of Type ‘Meeting’; 9:00am-11:00am]
Matching Approaches
Mappings may be generated in several ways. Some approaches are:
(1) Name Matching
(2) Structure Matching
(3) Instance Matching
[Figure: sample table fragments (COUNTYNAME, CID, County, DSP; Kitsap, Kingston, TRAIL RANGE DR, 96, Wahkiak, Puget Island, KITSAP, 97) and a name-matching example: Email vs. emailAddress?]
Some Definitions
Definition 1 (attribute): An attribute of a table T, denoted as att(T), is defined as a property of T that further describes it.
Definition 2 (instance): An instance x of an attribute att(T) is defined as a data value associated with att(T).
Definition 3 (keyword): A keyword k of an instance x associated with attribute att(T) is defined as a meaningful word (not a stopword) representing a portion of the instance.
Some Definitions (cont)
Definition 4a (geographic type (GT)): A geographic type GT associated with attribute att(T) is defined as a class of instances of att(T) that represent the same geographic feature (e.g., “lake”, “road”).
Definition 4b (non-geographic type (NGT)): A non-geographic type NGT associated with attribute att(T) is defined as a group of keywords from instances of att(T) that are semantically related to each other.
[Figure: example instances: Collin, New Jersey, Plano, Trenton, Richardson, Monmouth]
Overview of Matching Algorithm
1. Select attribute pairs for comparison [Figure attributes: roadName, roadType, rType, rName, county, city, town]
2. Match instances between compared attributes by running the Sim algorithms [Figure: roadName vs. rName, with instances K Ave., L Ave., LBJ Freeway, Jupiter Rd., Coit Rd., US 75]
3. Determine final attribute similarity [Figure: EBD = .98 between roadName and rName]
Determining Semantic Similarity
• We use Entropy-Based Distribution (EBD)
• EBD is a measurement of type similarity between 2 attributes (or columns): EBD = H(C|T) / H(C)
• EBD takes values in the range [0,1]. Greater EBD corresponds to more similar type distributions between compared attributes (columns)
Applying EBD to Semantic Matching
[Figure: instances of the compared columns labeled with types X, Y, and Z, used to compute the entropy H(C) and the conditional entropy H(C|T)]
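The ratio EBD = H(C|T) / H(C) can be illustrated with a short Python sketch. It assumes the standard Shannon definitions of entropy and conditional entropy (base 2); the function names, the (column, type) input format, and the toy labels are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def ebd(pairs):
    """EBD = H(C|T) / H(C) for a list of (column, type) observations."""
    h_c = entropy([col for col, _ in pairs])
    if h_c == 0:
        return 0.0  # assumption: with a single column there is nothing to compare
    by_type = {}
    for col, typ in pairs:
        by_type.setdefault(typ, []).append(col)
    # H(C|T): entropy of the column label within each type, weighted by type frequency
    h_c_given_t = sum(len(cols) / len(pairs) * entropy(cols)
                      for cols in by_type.values())
    return h_c_given_t / h_c


# Toy data: both columns contain types X, X, Y, Z in the same proportions.
same = [("A", t) for t in "XXYZ"] + [("B", t) for t in "XXYZ"]
# Toy data: the two columns share no types at all.
disjoint = [("A", "X"), ("A", "X"), ("B", "Y"), ("B", "Y")]
```

With these toy inputs, identical type distributions give an EBD of 1 and disjoint ones give 0, which matches the stated reading that a greater EBD means more similar type distributions.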
Matching Using N-grams
• Use commonly occurring N-grams [2,3] in compared attributes to determine similarity (N = 2)
Some N-grams extracted from A.StrName = {LO, OC, CU, ST, OV, …}
Some N-grams extracted from B.Street = {LO, OU, UI, OV, …}
[Figure: tables TA and TB with 2-grams such as LO, OV, ST, and UI, used to compute the conditional entropy H(C|T)]
[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS, Laguna Beach, California, Nov. 2008): 51.
[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC), Karlsruhe, Germany, Oct. 2008.
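As a rough illustration of N-gram matching, the sketch below treats every 2-gram as a type and computes EBD = H(C|T) / H(C) over the two columns. The helper names, the upper-casing step, and the sample values are assumptions made for this example; the cited papers may tokenize differently.

```python
import math
from collections import Counter


def ngrams(value, n=2):
    """Return the character n-grams of one instance, upper-cased."""
    text = value.upper()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def ngram_ebd(column_a, column_b, n=2):
    """EBD = H(C|T) / H(C), with each n-gram acting as a type T."""
    pairs = [("A", g) for v in column_a for g in ngrams(v, n)]
    pairs += [("B", g) for v in column_b for g in ngrams(v, n)]
    h_c = entropy([col for col, _ in pairs])
    if h_c == 0:
        return 0.0  # assumption: empty or one-sided input scores 0
    by_gram = {}
    for col, gram in pairs:
        by_gram.setdefault(gram, []).append(col)
    h_c_given_t = sum(len(cols) / len(pairs) * entropy(cols)
                      for cols in by_gram.values())
    return h_c_given_t / h_c
```

For example, `ngram_ebd(["Locust St"], ["Locust St"])` evaluates to 1, while two columns with no 2-grams in common, such as `["Dallas"]` and `["Beijing"]`, evaluate to 0.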
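Faults of this Method
• Semantically similar columns are not guaranteed to have a high similarity score. Example, with A ∈ T1 and B ∈ T2:
2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}
The listed 2-grams do not overlap, so shared N-grams give no evidence of similarity between A and B.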
Non-Geographic Matching
● Use clustering methods to group keywords of instances together without relying on shared N-grams between instances [4]
● K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering
● Use Normalized Google Distance (NGD) as a distance measure between any two keywords in a cluster
● WordNet would not be a suitable distance measure in the GIS domain
[Figure: example keywords: Dallas, USA, Houston, China, Tokyo, Jamaica, Beijing, India, Halifax, New Delhi, Malaysia]
[4] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Semantic Schema Matching without Shared Instances. 3rd IEEE International Conference on Semantic Computing (ICSC), Berkeley, California, September 2009: 297-302.
Definition of Google Distance
NGD(x, y) [7] is a measure for the symmetric conditional probability of co-occurrence of x and y
[7] Cilibrasi, R., Vitányi, P.: The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering 19, 370--383 (2007)
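For reference, the formula published in [7] is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The Python sketch below evaluates that formula from counts supplied by the caller; how the counts are obtained, and the choice to return infinity when the terms never co-occur, are assumptions of this example.

```python
import math


def ngd(fx, fy, fxy, n_pages):
    """Normalized Google Distance from page counts.

    fx, fy  : number of pages containing x, and y, respectively
    fxy     : number of pages containing both x and y
    n_pages : total number of indexed pages (N in the formula)
    """
    if fxy == 0:
        return math.inf  # assumption: terms that never co-occur are maximally distant
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))
```

Terms that always appear together (`fx == fy == fxy`) get a distance of 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts.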
K-medoid + NGD instance similarity
Step 1: Extract distinct keywords from compared attributes (T1 ∈ O1, T2 ∈ O2). Keywords extracted from attributes = {Johnson, Rd., School, 15th, …}
Step 2: Group distinct keywords together into semantic clusters, e.g. {“Rd.”, ”Dr.”, ”St.”, ”Pwy”, …} and {“Johnson”, ”School”, ”Dr.”, …} [Figure legend: Attribute1, Attribute2]
Step 3: Calculate similarity: Similarity = H(C|T) / H(C)
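Step 2 can be sketched with a small K-medoid loop in Python. This is a minimal illustration, not the authors' implementation: it seeds the medoids deterministically with the first k items (real runs typically seed randomly), and `toy_distance` is a hypothetical stand-in for NGD that assumes a distance of 0 only between identical keywords.

```python
def k_medoids(items, k, dist, max_iter=100):
    """Cluster items around k medoids using a pairwise distance function."""
    medoids = list(items[:k])  # assumption: deterministic seeding for repeatability
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: dist(item, m))
            clusters[nearest].append(item)
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, o) for o in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


STREET_WORDS = {"Rd.", "Dr.", "St."}


def toy_distance(a, b):
    """Hypothetical stand-in for NGD: small within a group, large across groups."""
    if a == b:
        return 0.0
    same_group = (a in STREET_WORDS) == (b in STREET_WORDS)
    return 0.1 if same_group else 0.9


keywords = ["Rd.", "Johnson", "Dr.", "St.", "School"]
clusters = k_medoids(keywords, 2, toy_distance)
```

Here the street suffixes end up in one cluster and the remaining keywords in the other. Each cluster can then serve as a type T when computing H(C|T) / H(C) in Step 3.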
Problems with Non-Geographic Matching via NGD + K-medoid
Two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) may be mistaken for being similar: similarity = .797
Geographic Type Matching
We use a gazetteer to determine the geographic type (GT) of an instance [5,6].
[Figure: instances of S1 and S2 (Anacortes, Victoria, Edmonds, Clinton) whose GTs are to be determined; Victoria and Clinton appear more than once]
[5] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geographically-Typed Semantic Schema Matching. In: Divyakant, A., Aref, W., Lu, C.T. et al. (eds.) ACM SIGSPATIAL GIS 2009, Seattle, Washington, pp. 456--459. ACM (Nov. 2009)
[6] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geospatial Schema Matching with High-Quality Clustering and Multi-Attribute Matching. Submitted to the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011, May 2011, Shenzhen, China).
Using Latlong Value to Enhance GT Matching
GSim: Combining NGT and GT Matching
We apply GT matching to an attribute comparison if >= 50% of the instances involved in the comparison have GT information. Otherwise, NGT matching is applied instead [1].
[Figure: decision “>= 50% of instances have a GT?” leading to GT Matching or NGT Matching; figure labels include Cooke, Lake, Rock, Collin, Creek, Mud, Stone, Briar, River]
[1] Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Enhanced Geographically-Typed Semantic Schema Matching. To appear in the Journal of Web Semantics, 2011.
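The 50% rule can be written as a small dispatch function. In this Python sketch the gazetteer is modeled as a plain dictionary from instance name to geographic type; that representation, the function name, and the toy entries are assumptions for illustration only.

```python
def choose_matcher(instances, gazetteer, threshold=0.5):
    """Return "GT" when at least `threshold` of the instances have a
    geographic type in the gazetteer, otherwise "NGT"."""
    if not instances:
        return "NGT"  # assumption: with no instances there is no GT evidence
    typed = sum(1 for value in instances if value in gazetteer)
    return "GT" if typed / len(instances) >= threshold else "NGT"


# Hypothetical gazetteer entries, used only for this example.
toy_gazetteer = {"Cooke Lake": "lake", "Mud Lake": "lake", "Collin Creek": "creek"}
```

With this toy gazetteer, `choose_matcher(["Cooke Lake", "Mud Lake", "Rock", "Stone"], toy_gazetteer)` selects GT matching because exactly half of the instances are typed, while dropping one of the lakes from the list tips the decision to NGT matching.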
Attribute Weighting
• We can distribute the weight of each attribute match based on its importance
[Figure: example weights of 23%, 27%, 26%, and 24%]
Measuring Attribute Match Importance
• Attribute Match Importance is determined by: (1) Attribute Relevance (2) Attribute Uniqueness
[Figure: table pairs and their attributes: Roads / Roads (name, roadType, road_type, rName, county, city, town); Ports / Sea Ports (name, destPort, Name, Dest, city, cty); Lakes / LakeFeatures (lakeType, name, edez_id, city, lakename, type)]
Attribute Uniqueness
• Determine uniqueness of attributes att1 and att2 involved in a match (att1-att2) by clustering all attributes from all tables over S1 and S2
[Figure: attribute clustering with cutoff 1 and cutoff 2]
Attribute Clustering • Use Intercluster Similarity (ICS) to decide if clusters A and B should merge: • Calculate cutoff point (CP) to determine when to stop clustering:
Calculating AU, corrected EBD value
• Calculate AU for an attribute att in a match (AUatt ∈ [0,1])
• Calculate pairwise uniqueness (PU) for a match att1-att2: PUatt1,att2 = avg(AUatt1(T), AUatt2(T’))
• Recalculate EBD between att1(T)-att2(T’): EBDcorr(att1, att2) = EBDorig(att1, att2) x PUatt1,att2
[Figure: example attributes: rName, name (Roads), destPort, name (Ports), Dest, name (Lakes), Name (Sea Ports), lakename]
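The PU and corrected-EBD steps reduce to simple arithmetic, sketched below in Python. The AU values are taken as given inputs in [0,1] because the AU formula itself is not reproduced in this transcript; the function names and the range check are illustrative assumptions.

```python
def pairwise_uniqueness(au_att1, au_att2):
    """PU for a match att1-att2: the average of the two AU values."""
    for au in (au_att1, au_att2):
        if not 0.0 <= au <= 1.0:
            raise ValueError("AU values are expected to lie in [0, 1]")
    return (au_att1 + au_att2) / 2


def corrected_ebd(ebd_orig, au_att1, au_att2):
    """EBDcorr = EBDorig x PU, as stated on the slide."""
    return ebd_orig * pairwise_uniqueness(au_att1, au_att2)
```

For instance, a match with an original EBD of 0.8 whose attributes have AU values of 1.0 and 0.5 gets PU = 0.75, so the corrected EBD is about 0.6. Fully unique attributes (AU = 1 on both sides) leave the original EBD unchanged.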
High-Quality Clustering
• Due to the inherent randomness of clustering (e.g., choosing the initial centroid), EBD scores may not be stable [6]
• We need a way to produce consistent EBD values
  - To eliminate EBD variability
  - To provide a confidence value for our EBD value
  - To guarantee that our EBD value was generated from a high-quality clustering
• We proposed the following two cluster-based measures:
  (1) Semantic Purity: the “meaning distance” between any two instances within the same cluster
  (2) Geographic Purity: the GT purity of a given cluster
Cluster Purity Measures
Distance-based Measure: ImpS =
Geographic-Type Measure:
Objective Function to be Minimized: OSSKM = where Wi =
[Figure: two example clusterings of Collin, Kaufman, Tarrant, Coppell, Plano, Richardson]
1:N Matching
Many relationships are not 1:1, but involve matching groups of entities
[Figure: attribute Cmp compared against attributes N1, N2, N3, N4]
Defining 1:N Matching
• 1:N matching can be defined in many ways
  - Optimize similarity or value of N?
  - Meronymy or Subsumption?
• We chose to optimize similarity (EBD)
  - Use EBD scores produced from 1:1 matches between Cmp and Nk (1 <= k <= N)
  - Apply a greedy algorithm to add attributes to the match with Cmp in order of decreasing EBD score (highest to lowest)
  - Any 1:N match will minimize the set difference between GT(Cmp) and the union of the sets of GTs for the N matching attributes
  - We do not include an attribute in a 1:N match if it would make the EBD of the current match decrease
1:N Matching Example
[Figure: Cmp and candidate attributes N1, N2, N3, N4, with numbered labels 1 to 4]
1:N Matching From Type Perspective
[Figure: instances labeled with types W, X, Y, and Z, used to compute the entropy H(C) and the conditional entropy H(C|T)]
Greedy 1:N Matching Algorithm
program 1:N_Matching (S(T2), Sebd(T2)) {
  var E(T2) = Φ;
  S(T2) = Φ;
  Sebd(T2) = 0.0;
  GTCmp = getGTSet(Cmp);
  E(T2) = getMatchCandidates(Cmp, T2, GTCmp);
  E(T2) = orderByEBD(E(T2));
  for att A ∈ E(T2) with max value of EBD(Cmp, A) {
    if (increaseEBD(Cmp, Sebd(T2))) {
      Emax = A;
      S(T2) = S(T2) U Emax;
      Sebd(T2) = addEBD(Sebd(T2), EBD(Cmp, Emax));
    }
    E(T2) = E(T2) – A;
  }
}
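A runnable Python sketch of the greedy loop is shown below. The slides do not define how the EBD of a 1:N group is computed (`addEBD`, `increaseEBD`), so the sketch takes two caller-supplied functions, `pair_ebd` and `group_ebd`; these names, the strict "must increase" test, and the toy scores are assumptions for illustration. Candidate filtering by GT set (`getMatchCandidates`) is assumed to have happened already.

```python
def greedy_one_to_n(cmp_attr, candidates, pair_ebd, group_ebd):
    """Greedily build a 1:N match for cmp_attr.

    pair_ebd(cmp_attr, att)    -> 1:1 EBD score between cmp_attr and att
    group_ebd(cmp_attr, group) -> EBD score between cmp_attr and a list of attributes
    """
    # Visit candidates from the highest to the lowest 1:1 EBD score.
    ordered = sorted(candidates, key=lambda a: pair_ebd(cmp_attr, a), reverse=True)
    selected, best = [], 0.0
    for att in ordered:
        score = group_ebd(cmp_attr, selected + [att])
        if score > best:  # assumption: keep an attribute only if the EBD increases
            selected.append(att)
            best = score
    return selected, best


# Hypothetical scores, used only to exercise the loop.
toy_pair = {"N1": 0.6, "N2": 0.5, "N3": 0.2}
toy_group = {
    frozenset(["N1"]): 0.6,
    frozenset(["N1", "N2"]): 0.8,
    frozenset(["N1", "N2", "N3"]): 0.7,
}

selected, score = greedy_one_to_n(
    "Cmp",
    ["N3", "N1", "N2"],
    pair_ebd=lambda c, a: toy_pair[a],
    group_ebd=lambda c, group: toy_group[frozenset(group)],
)
```

With the toy scores, N1 and N2 are accepted and N3 is skipped because adding it would lower the group score from 0.8 to 0.7, which mirrors the rule that an attribute is excluded if it would make the EBD of the current match decrease.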
Proof of Correctness
Theorem 1 (Greedy Choice Property for the 1:N matching algorithm): All choices for Emaxx(T2) will be present in an optimal 1:N match with Cmp ∈ T1.
Proof: Suppose that SebdN(T2), for an arbitrary SN(T2), produces an optimal EBD. Let us build a new set called S2ebdN(T2) from S2N(T2) such that every attribute included in S2N(T2) represents a value of Emaxx(T2) for some x. Also, the cardinalities of SN(T2) and S2N(T2) are equal, and every attribute between SN(T2) and S2N(T2) is identical, except for an arbitrary attribute indexed by r (r <= N) in S2N(T2). Then, by the definition of Emaxx for all x in Ex(T2), the EBD value produced between Cmp and attribute r in S2N(T2) is >= the EBD value produced between Cmp and attribute r in SN(T2). Since all other attributes are equal between SN(T2) and S2N(T2), their associated 1:1 EBD scores with Cmp are also identical. Therefore, EBD(Cmp, S2N(T2)) >= EBD(Cmp, SN(T2)). But since SN(T2) produces an optimal EBD with Cmp through SebdN(T2), EBD(Cmp, S2N(T2)) = EBD(Cmp, SN(T2)). Thus, S2N(T2) also produces an optimal EBD with Cmp through S2ebdN(T2).
Proof of Correctness (cont)
Theorem 2 (Optimal Substructure Property): Let SebdN-1(T2), N > 1, be the EBD score corresponding to the attribute match between Cmp ∈ T1 and SN-1(T2) ∈ T2. If SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2), then SebdN(T2) must also be an optimal EBD score.
Proof: Assume that SN(T2) was formed by adding Emaxx to SN-1(T2), but does not produce an optimal value of SebdN(T2). Emaxx represents the attribute with the highest EBD score with Cmp to be included in SN-1(T2) with respect to all other attributes in Ex(T2). This means that SN-1(T2) contains some attribute indexed by r (r <= N-1) whose EBD value is less than that of Emaxr. Thus, SebdN-1(T2) is not an optimal EBD score, which contradicts the assumption that SebdN-1(T2) is an optimal EBD score. Therefore, if SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2), then SebdN(T2) must be an optimal EBD score.
Theorem 3: Greedy 1:N matching produces a safe match with an optimal EBD score. This follows from Theorem 1 and Theorem 2.
Dataset Details GIS Transportation Dataset (GTD) GIS Location Dataset (GLD)
Dataset Details (cont)
GIS Point of Interest Dataset (GPD)
• Across all of our datasets, few shared instances exist
• Data is multijurisdictional in nature
  - Number of attributes and instances differ