410 likes | 531 Views
A Framework for Hierarchical Cost-sensitive Web Resource Acquisition. Tan Yee Fan WING Group Meeting 2009 November 10. Contents. Background and Motivation Cost-sensitive Record Acquisition Framework Solution for Record Matching Problems Evaluation Conclusion. Contents.
E N D
A Framework for Hierarchical Cost-sensitive Web Resource Acquisition Tan Yee Fan WING Group Meeting 2009 November 10
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion
Contents • Background and Motivation • Record linkage • Using web-based resources • Issues with existing work • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion
Record Linkage A B • Input • Two lists of items, A and B • Output • For each item pair (a, b) in A× B, are a and b a match? KDD PAKDD DKMD KDID Knowledge Discovery and Data Mining Pacific-Asia Conference on Knowledge Discovery and Data Mining Research Issues on Data Mining and Knowledge Discovery Knowledge Discovery in Inductive Databases ? Input information is sometimes incomplete or insufficientMay be useful to acquire additional information from the web
Information from Search Engine Hit count Title Snippet URL → Web page Possible queries “jcdl” “jcdl” “joint conference on digital libraries” “jcdl” OR “joint conference on digital libraries” jcdl digital libraries …
Using Search Engine Results • Inverse host frequency • Counts from snippets • Number of snippets containing a particular term • Web page similarity • TF-IDF cosine similarity between two pages • Test instance for classification • Frequency counts • Similarity metrics • …
Web-based Record Matching • Matching entities to concepts(Cimiano et al., 2005) • Semantic similarity between words(Bollegala et al., 2007) • Word sense disambiguation(Mihalcea and Moldovan, 1999) • Machine transliteration(Ohand and Isahara, 2008) • Record linkage(Elmacioglu et al., 2007) • Author name disambiguation(Tan et al., 2006) • Web people search(Kalashnikov et al., 2008)
Web-based Record Matching • Most existing work use this scheme • Query search engine with queries of one or more of these types: a, b, and a b • Optionally, append predefined terms or tokens t, e.g.,a t • Compute similarity metrics • Optionally, construct test instances for classification
Cost of Acquiring Web Resources • Acquiring web resources are slow and may incur other access costs • For a dataset of 1000 records, querying all records individually takes a day or more, querying them pairwise takes even longer • For large datasets, not feasible to query and download everything • Existing work sidesteps or ignores this issue • Some discussion in Tan et al. (2008)
Hierarchical Dependencies • Basically ignored in existing work! • Same resource may be used to extract different types of attribute values • Different instances can share attribute values Search engine results Web pages Term frequencies
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Resource dependency graph • Resource acquisition problem • Observations for record matching problems • Solution for Record Matching Problems • Evaluation • Conclusion
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b search(a1b1) search(b1) search(a1) hitcount(a1) hitcount(a1b1) hitcount(b1) dice(a1, b1) search(a1b2) search(a2b1) hitcount(a2b1) hitcount(a1b2) dice(a1, b2) dice(a2, b1) search(a2) search(a2b2) search(b2) hitcount(a2b2) hitcount(b2) hitcount(a2) dice(a2, b2) Resource Dependency Graph • Directed acyclic graph G = (V, E) • Vertex types T with |T| = 7
Resource Acquisition Graph • Feasible vertex set V’ V: • V’ can be acquired as-is without violating acquisition dependencies • If v V’, then for all u → v E, we require u V’ • Set of feasible subsets of vertices F(G) 2V: • All sets of feasible vertices
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Cost and Utility • Cost • search(.) cost 10 each • All other cost 1 each • Utility • Positive utility for vertices that are attribute values in test instances • Zero utility otherwise
Resource Acquisition Problem • Given • Resource dependency graph G = (V, E) • Acquisition cost function cost: V → R+ {0} • Utility function utility: F(G) → R • Vertex type function type: V → T • Budget budget • Select V’ F(G) with cost(V’) ≤ budget to maximize obj(V’) = utility(V’) – cost(V’)
Applications • Easy to construct resource dependency graphs • When web pages can be downloaded • When TF or TF-IDF cosine similarity is involved • Blocking is applied • …
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Observations for Record Matching • Structure of resource acquisition graph • Depth of 2 to 4 is typical • Vertex out-degree varies widelye.g., from 1 to max(|A|, |B|)
search(b) search(a) search(ab) search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(ab) hitcount(b) hitcount(a) hitcount(a) dice(a, b) a b a b instance xa, b dice(a, b) instance xa, b Observations for Record Matching • Usually possible to flatten graph to two levels • Root level: expensive search engine results and web page downloads • Leaf level: cheap attribute values
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Application of Tabu search • More intelligent legal moves • Propagation of utility values • Utility function when classifier is SVM • Evaluation • Conclusion
Search Algorithm • State space • F(G), set of feasible subsets of vertices • Search process • Current state denoted by V’ F(G) • Starting state V’ = φ • Legal moves add or remove vertices from V’ to reach next state • Objective • Maximize obj(V’) = utility(V’) – cost(V’)
search(b) search(a) search(ab) hitcount(ab) hitcount(b) hitcount(a) dice(a, b) a b instance xa, b Issues • Heuristic search required • Not feasible to enumerate all possible V’ to acquire • Many local maxima • V’ = φ is a local maxima • Simple greedy search does not work • Easy for states to be revisited • No guidance as to which root vertices to acquire • Each search(.) has the same objective value
Solutions • Application of Tabu search • More intelligent legal moves • Propagation of utility values
Tabu search • Simple hill climbing but with Tabu list • Moves in Tabu list are prohibited until expiry of prohibition, this avoids repeated visiting of states • Except when new objective value exceeds current known best objective value • When a move is made, moves that reverse the effect of the move are added to Tabu list • Expiration set by current iteration plus Tabu tenure • Stop when best objective value never improve for some number of iterations
Legal Moves • Let D(V’) be the set of children from any vertex in V’ that are not in V’ • If V’ is empty, then set D(V’) = V • Legal moves • Add(v): Add v from D(V’) and its ancestors into V’ • Remove(v): Remove v and its descendants from V’ • AddType(t): Add a maximal set of vertices (within acquisition budget) of type t from D(V’) into V’
Propagated Utility u v v
Surrogate Utility and Objective • Current state V’ and we consider adding V’’ • Surrogate utility function • Surrogate objective function • surr-obj(V’, V’’) = surr-utility(V’, V’’) – cost(V’’)
Utility Function for SVM • Decompose utility(V’) into sum ofutility(x, A(V’, x)) over test instances x • Compute utility(x, A’) as expected decrease in misclassification cost for acquiring attribute values A’ • Apply earlier work on cost-sensitive attribute value acquisition to compute utility(x, A’)
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Experimental setup and datasets • Results • Conclusion
search(sf) search(lf) search(sflf) hitcount(sflf) hitcount(lf) hitcount(sf) cosine(sf, lf) sf lf snippetcount(sf → lf) snippetcount(sf ← lf) instance xsf, lf Evaluation Task • Linkage of short forms (SF) to long forms (LF) • Evaluation metric • Total cost of acquisitions and misclassifications cost = 10 cost = 1
Datasets • Human genomes • 307 short forms × 307 long forms • Training: 154 × 154 pairs • 23716 training instances • Testing: 153 × 153 pairs • 23409 test instances • 117657 vertices, 140760 edges • Misclassification costs • 50 false positive, 500 false negative • DBLP venues • 920 short forms ×906 long forms • Training: 460 × 453 pairs • Randomly selected 100000 training instances • Testing: 460 × 453 pairs • 213444 test instances • 1069068 vertices, 1281588 edges • Misclassification costs • 50 false positive, 1500 false negative
Algorithms • Baselines • Random • Least cost • Highest utility • Best cost-utility ratio • Best type • Our algorithm • Manual (with access to solution set)
Evaluation Methodology • Start with graph with no vertices acquired • For each iteration • Run acquisition algorithm with budget 100 • Record data point • Total cost of acquisitions • Total cost of misclassifications • Total cost of acquisitions and misclassifications • Run for 100 iterations
Results • GENOMES
Results • DBLP
Results • Almost 50% improvement in average difference in misclassification cost!
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion • Contributions • Future work
Contributions • Framework for performing cost-sensitive acquisitions of resources with hierarchical dependencies • Model using resource dependency graph • Acquisition algorithm for record matching problems, given cost and utility function • Utility function when classifier is SVM • Evaluation demonstrate effectiveness on record matching problems when using SVM
Future Work • Apply resource acquisition framework to clustering problems