Efficient Approximate Entity Extraction with Edit Distance Constraints • Presented by: Aneeta Kolhe
Introduction • Named entity recognition finds approximate matches of entities in text. • It is an important task for information extraction and integration, text mining, and web search.
Problem • Approximate dictionary matching. • Previous solution: token-based similarity constraints • Proposed solution: neighborhood generation method
Limitations of the token-based solution • It uses Jaccard coefficient similarity. • It may miss some matches. • It may return too many matches.
For example, given the entity “al-qaida”: • “al-qaeda” or “al-qa’ida” won’t be matched unless we use a low Jaccard similarity threshold of 0.33. • At that threshold, “al-qaida” also matches “al gore” and “al pacino”. • Hence we use edit distance instead.
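The contrast between the two measures can be checked with a small sketch. It assumes tokens are obtained by splitting on whitespace and hyphens (the slide does not say how tokens are formed), and the helper names are illustrative only.

```python
import re


def tokens(s):
    # Assumption: tokens are split on whitespace and hyphens.
    return set(re.split(r"[\s\-]+", s.lower()))


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


entity = "al-qaida"
for candidate in ["al-qaeda", "al-qa'ida", "al gore", "al pacino"]:
    print(candidate, round(jaccard(entity, candidate), 2), edit_distance(entity, candidate))
```

Under this tokenization all four candidates get the same Jaccard score, while edit distance separates the spelling variants from the unrelated names.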
Problem Definition • Given: a document D, a dictionary E of entities, and an edit distance threshold τ. • To find: all substrings of D that are within edit distance τ of some entity in E. • Baseline solution: iterate through all the valid substrings of the document D, treat each substring as a query segment, and issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
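A minimal sketch of the baseline described above, assuming "valid" substrings are those whose length lies within τ of some entity length. The function names are illustrative, and the dictionary query is a plain linear scan rather than an index.

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def naive_extract(doc, entities, tau):
    """Return (start, segment, entity) for every substring within distance tau."""
    # Assumption: the dictionary is non-empty.
    min_len = max(1, min(map(len, entities)) - tau)
    max_len = max(map(len, entities)) + tau
    matches = []
    for start in range(len(doc)):
        for length in range(min_len, max_len + 1):
            if start + length > len(doc):
                break
            seg = doc[start:start + length]
            # Similarity selection query, done here by scanning the dictionary.
            for e in entities:
                if edit_distance(seg, e) <= tau:
                    matches.append((start, seg, e))
    return matches


print(naive_extract("the al-qaeda group", ["al-qaida"], 1))
```

Every substring is compared against every entity, which is the cost the neighborhood generation method is meant to avoid.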
Neighborhood generation method using partitioning • If s is split into kτ partitions and s’ is within edit distance τ of s, at least one partition has at most one edit error. • Select kτ = ⌈(τ + 1)/2⌉. • Example: s = [abcdefghijkl], s’ = [axxbcdefghxijkl], τ = 3, kτ = 2 • s = [abcdef], [ghijkl] • s’ = [axxbcde], [fghxijkl]
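The partitioning step can be sketched as follows. The slide does not say how to split a string whose length is not a multiple of kτ, so the choice to give the remainder to the last partition is an assumption, as are the function names.

```python
def k_tau(tau):
    # Integer form of ceil((tau + 1) / 2).
    return (tau + 2) // 2


def partition(s, k):
    # Assumption: equal-length partitions, with any remainder going to the last one.
    plen = len(s) // k
    return [s[i * plen:(i + 1) * plen] for i in range(k - 1)] + [s[(k - 1) * plen:]]


tau = 3
print(k_tau(tau), partition("abcdefghijkl", k_tau(tau)))
```

With kτ partitions and at most τ errors, the partitions cannot all contain two or more errors, since that would need at least 2kτ ≥ τ + 1 errors.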
Transformation rules • Shifting moves a partition without changing its length: shifting the first partition [abcdef] of s by 2 gives [cdefgh]. • Scaling changes the length of a partition: scaling [cdefgh] by −1 gives [cdefg]. • First partition: we only need to consider scaling within the range [−2, 2]. • Last partition: we only need to consider shifting by an amount within [−τ, τ] combined with scaling by the opposite amount, so that the last character is always included in the resulting substring. • Remaining partitions: we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
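The three rules can be sketched as below. The sketch assumes that shifting moves both ends of a partition, that scaling moves only its end, and that variations falling outside the string are dropped. `partition_variations` is an illustrative name, not one from the slides.

```python
def partition_variations(s, tau):
    """Return the set of (variation, partition_id) pairs for string s."""
    k = (tau + 2) // 2                      # k_tau = ceil((tau + 1) / 2)
    m, plen = len(s), len(s) // k
    out = set()
    for i in range(k):
        start = i * plen
        end = m if i == k - 1 else start + plen
        if i == 0:
            # First partition: scaling only, within [-2, 2].
            # Assumption: when k == 1 the single partition follows this rule.
            moves = [(0, c) for c in range(-2, 3)]
        elif i == k - 1:
            # Last partition: shift by d and scale by -d, so the end stays fixed.
            moves = [(d, -d) for d in range(-tau, tau + 1)]
        else:
            # Intermediate partitions: every shift/scale combination.
            moves = [(d, c) for d in range(-tau, tau + 1) for c in range(-2, 3)]
        for shift, scale in moves:
            lo, hi = start + shift, end + shift + scale
            if 0 <= lo < hi <= m:           # keep only in-bounds, non-empty substrings
                out.add((s[lo:hi], i + 1))
    return out


for variation, pid in sorted(partition_variations("abcdefghijkl", 3), key=lambda x: (x[1], len(x[0]))):
    print(pid, variation)
```

For s = [abcdefghijkl] and τ = 3 this yields 5 variations for the first partition and 7 for the last.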
Partitioned variant filtering • First partition: 5 variations • Each intermediate partition: 5(2τ + 1) variations • Last partition: 2τ + 1 variations • Total number of 1-variants generated = O(m + 2).
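As a worked count of the partition variations (not of the 1-variants derived from them), the per-partition figures above can be summed. The symbol N is introduced here for illustration, and the formula assumes kτ ≥ 2 so that the first and last partitions are distinct.

```latex
\[
N(\tau) \;=\; \underbrace{5}_{\text{first}}
\;+\; \underbrace{(k_\tau - 2)\cdot 5\,(2\tau + 1)}_{\text{intermediate}}
\;+\; \underbrace{(2\tau + 1)}_{\text{last}},
\qquad k_\tau \ge 2 .
\]
\[
\tau = 3,\; k_\tau = 2: \qquad N = 5 + 0 \cdot 35 + 7 = 12 .
\]
```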
s = [abcdef], [ghijkl] • s’ = [axxbcde], [fghxijkl] • Variations of partition 1 of s: <[abcd], 1>, <[abcde], 1>, <[abcdef], 1>, <[abcdefg], 1>, <[abcdefgh], 1> • Variations of partition 2 of s: <[jkl], 2>, <[ijkl], 2>, <[hijkl], 2>, <[ghijkl], 2>, <[fghijkl], 2>, <[efghijkl], 2>, <[defghijkl], 2> • The second partition of s’, [fghxijkl], has a 1-variant match with [fghijkl], a partition variation generated from the second partition of s.
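The 1-variant match in this example can be reproduced with a short sketch. It assumes that the 1-variant family of a string is the string itself plus every string obtained by deleting one character. The slides do not spell out the definition, so treat this as one plausible reading.

```python
def one_variants(s):
    # Assumption: a 1-variant is s itself or s with exactly one character deleted.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


doc_partition = "fghxijkl"      # second partition of s'
entity_variation = "fghijkl"    # variation of the second partition of s

shared = one_variants(doc_partition) & one_variants(entity_variation)
print(shared)
```

Under this definition, two strings within edit distance 1 always share at least one 1-variant, which is what makes the index lookup a safe filter.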
Prefix pruning method • If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants. • Assume lp is set to 3. Then 1-variants are generated from only the following prefixes: <[abc], 1>, <[ghi], 2>, <[hij], 2>, <[fgh], 2>. • By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
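The effect of prefix pruning on the number of generated 1-variants can be illustrated as follows. The one-deletion definition of a 1-variant is an assumption, and the function names are illustrative.

```python
def one_variants(s):
    # Assumption: a 1-variant is s itself or s with exactly one character deleted.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def pruned_one_variants(variation, lp):
    # Only the lp-prefix of the partition variation is expanded.
    return one_variants(variation[:lp])


variation, lp = "fghijkl", 3
print(len(one_variants(variation)), "variants without pruning")
print(len(pruned_one_variants(variation, lp)), "variants with lp =", lp)
```

The pruned family depends only on lp, not on the length of the partition variation.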
Indexing the entities • Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong. • For each entity whose length is smaller than kτ·lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort. • For each entity whose length is at least kτ·lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.
Algorithm: BuildIndex(E, τ, lp)
for each e ∈ E do
    if |e| < kτ·lp + τ then
        V ← GenVariants(e[1 .. min(lp, |e|)], τ);  /* GenVariants(s, k) generates the k-variant family of string s */
        for each v ∈ V do
            Ishort[v] ← Ishort[v] ∪ { e };
    if |e| ≥ kτ·lp then
        P ← the set of kτ partitions of e;
        for each i-th partition p ∈ P do
            PT ← TransformPartition(p);  /* according to the three transformation rules in Section 3.1 */
            for each partition variation pT ∈ PT do
                V ← GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                    Ilong[v] ← Ilong[v] ∪ { <e, i> };
return (Ishort, Ilong)
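A runnable sketch of BuildIndex under stated assumptions: k-variants are formed by deleting up to k characters, partitions are equal-length with the remainder in the last one, and shifting keeps the partition length while scaling moves only the end. None of these details is fixed by the slides, and the Python names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def gen_variants(s, k):
    # Assumption: the k-variant family is s with at most k characters deleted.
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            dropped = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in dropped))
    return out


def partition_variations(e, tau):
    # Applies the three transformation rules; returns (variation, partition_id) pairs.
    k = (tau + 2) // 2
    m, plen = len(e), len(e) // k
    out = set()
    for i in range(k):
        start = i * plen
        end = m if i == k - 1 else start + plen
        if i == 0:
            moves = [(0, c) for c in range(-2, 3)]
        elif i == k - 1:
            moves = [(d, -d) for d in range(-tau, tau + 1)]
        else:
            moves = [(d, c) for d in range(-tau, tau + 1) for c in range(-2, 3)]
        for shift, scale in moves:
            lo, hi = start + shift, end + shift + scale
            if 0 <= lo < hi <= m:
                out.add((e[lo:hi], i + 1))
    return out


def build_index(entities, tau, lp):
    k = (tau + 2) // 2
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:
            # Short entity: index the tau-variant family of its lp-prefix.
            for v in gen_variants(e[:lp], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:
            # Long entity: index 1-variants of the lp-prefix of each partition variation.
            for variation, pid in partition_variations(e, tau):
                for v in gen_variants(variation[:lp], 1):
                    i_long[v].add((e, pid))
    return i_short, i_long


i_short, i_long = build_index(["abcdefghijkl", "abc"], tau=3, lp=3)
print(sorted(i_long["fgh"]), sorted(i_short["ab"]))
```

As in the pseudocode, the two length conditions overlap, so an entity whose length lies in [kτ·lp, kτ·lp + τ) is placed in both indexes.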
Algorithm: MatchDocument(D, E, τ)
for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong(D[p .. p + lp − 1], E, τ);   /* matching entities no shorter than kτ·lp */
    SearchShort(D[p .. p + lp − 1], E, τ);  /* matching entities of length in [Lmin, kτ·lp) */
Algorithm: SearchLong(s)
R ← ∅;  /* holds results */
C ← ∅;  /* holds candidates */
V ← GenVariants(s, 1);  /* generate the 1-variant family */
for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
        C ← C ∪ { <e, pid> };  /* duplicates removed */
for each <e, pid> ∈ C do
    S ← QuerySegmentInstantiation(e, pid);  /* returns the set of query segment candidates for e */
    for each seg ∈ S do
        if Verify(seg, e) = true then
            R ← R ∪ { <seg, e> };
return R
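The filter-then-verify structure of SearchLong can be sketched as below. QuerySegmentInstantiation and Verify are passed in as callbacks because the slides do not describe their internals. The stand-ins used in the usage example are hypothetical, as are the hand-built index and the one-deletion definition of a 1-variant.

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def one_variants(s):
    # Assumption: a 1-variant is s itself or s with exactly one character deleted.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def search_long(s, i_long, instantiate, verify):
    results, candidates = set(), set()
    for v in one_variants(s):
        candidates |= i_long.get(v, set())      # the set removes duplicates
    for e, pid in candidates:
        for seg in instantiate(e, pid):
            if verify(seg, e):
                results.add((seg, e))
    return results


# Usage with a tiny hand-built index and hypothetical stand-in callbacks.
tau = 3
doc = "axxbcdefghxijkl"
i_long = {"fgh": {("abcdefghijkl", 2)}, "gh": {("abcdefghijkl", 2)}}


def instantiate(e, pid):
    # Hypothetical stand-in: treat the whole document as the only query segment.
    return {doc}


def verify(seg, e):
    return edit_distance(seg, e) <= tau


results = search_long("fgh", i_long, instantiate, verify)
print(results)
```

The index lookup only proposes candidates; the final decision always comes from the edit distance check in the verification step.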
SearchShort(s) • We need to generate the τ-variant families for each possible length l between Lmin − τ and lp. • If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified. • Otherwise, we need to perform verification for 2τ + 1 possible query segments.
Reducing the amount of enumeration • For example, enumerate the 1-variants of the string [abcdef] from left to right. • Suppose no variant in the index starts with abc. • The algorithm still enumerates the other three 1-variants that start with abc. • To avoid this, introduce a parameter lpp, set to lp/2.
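One way to read the lpp optimization is as a prefix check that skips a whole group of variants at once. The sketch below assumes one-deletion 1-variants and assumes the set of lpp-prefixes of all indexed variants is available. Both the mechanism and the names are illustrative, since the slide gives only the parameter setting.

```python
def probe_with_prefix_check(s, index_keys, lpp):
    """Return (hits, number of variants actually enumerated)."""
    # In practice this prefix set would be precomputed alongside the index.
    prefixes = {key[:lpp] for key in index_keys}
    enumerated = []
    # Deleting a character at a position before lpp changes the lpp-prefix.
    enumerated += [s[:i] + s[i + 1:] for i in range(min(lpp, len(s)))]
    # s itself and deletions at positions >= lpp all start with s[:lpp],
    # so they are skipped together if that prefix never occurs in the index.
    if s[:lpp] in prefixes:
        enumerated += [s] + [s[:i] + s[i + 1:] for i in range(lpp, len(s))]
    hits = {v for v in enumerated if v in index_keys}
    return hits, len(enumerated)


index_keys = {"bcdef", "xyz"}
hits, count = probe_with_prefix_check("abcdef", index_keys, lpp=3)
print(hits, count)
```

With lp = 6 and lpp = 3, the four variants of [abcdef] that start with abc are never enumerated when abc is not a prefix of any indexed variant.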
Conclusion • Reduced the size of the neighborhood • Proposed an efficient query processing algorithm • Optimized the algorithm to share computation • Avoided unnecessary variant enumeration