Efficient Approximate Entity Extraction with Edit Distance Constraints

Presented by: Aneeta Kolhe

Introduction • Named entity recognition finds approximate matches of entities in text. • It is an important task for information extraction and integration, text mining, and web search.

Problem • Approximate dictionary matching. • Previous solution – Token based similarity constraints • Proposed solution – Neighborhood generation method

Limitations of the token-based solution • It uses Jaccard coefficient similarity. • It may miss some matches. • It may return too many matches.

For example: given the entity “al-qaida”, the variants “al-qaeda” and “al-qa’ida” are not matched unless a low Jaccard similarity threshold of 0.33 is used. • At that threshold, “al-qaida” also matches “al gore” and “al pacino”. • Hence we use edit distance instead.
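The contrast above can be reproduced with a small sketch. The tokenizer here (lowercase, split on non-alphanumeric characters) is an assumption made for illustration; the slides do not say how tokens are formed.

```python
import re

def tokens(s):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    return {t for t in re.split(r"[^a-z0-9]+", s.lower()) if t}

def jaccard(a, b):
    # Jaccard coefficient over the two token sets.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def edit_distance(a, b):
    # Classic dynamic-programming edit distance, kept to one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

for candidate in ["al-qaeda", "al gore", "al pacino"]:
    print(candidate,
          round(jaccard("al-qaida", candidate), 2),
          edit_distance("al-qaida", candidate))
```

Under this tokenizer all three candidates share only the token "al" with "al-qaida", so Jaccard cannot separate them, while edit distance keeps "al-qaeda" close and pushes the other two far away.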

Problem Definition • Given: a document D, a dictionary E of entities, and an edit distance threshold τ. • To find: all substrings of D that are within edit distance τ of at least one entity in E. • Baseline solution: iterate through all valid substrings of D, treating each substring as a query segment. • For each query segment, issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
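A minimal sketch of this baseline, assuming that a "valid" substring is one whose length lies within τ of some entity length, and that the similarity selection query is a plain linear scan of the dictionary. The function name `naive_extract` is hypothetical.

```python
def edit_distance(a, b):
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    # Assumed validity rule: substring length within tau of some entity length.
    lo = max(1, min(map(len, entities)) - tau)
    hi = max(map(len, entities)) + tau
    matches = set()
    for start in range(len(doc)):
        for end in range(start + lo, min(len(doc), start + hi) + 1):
            seg = doc[start:end]          # the query segment
            for e in entities:            # selection query by linear scan
                if edit_distance(seg, e) <= tau:
                    matches.add((start, seg, e))
    return matches

print(naive_extract("the al-qaeda group", ["al-qaida"], 1))
```

Every substring triggers a full dictionary scan, which is the cost the neighborhood generation method is meant to avoid.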

Neighborhood generation method using partitioning • Split each string into kτ partitions so that, with at most τ edit errors in total, at least one partition has at most one edit error. • Select kτ = ⌈(τ + 1)/2⌉. • Example: s = [abcdefghijkl], s’ = [axxbcdefghxijkl], τ = 3, kτ = 2. • s = [abcdef], [ghijkl] • s’ = [axxbcde], [fghxijkl]
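The partitioning step can be sketched as follows. The way leftover characters are distributed (the last partitions get one extra character) is an assumption chosen only because it reproduces the split of s’ shown in the example; the function names are hypothetical.

```python
def num_partitions(tau):
    # k_tau = ceil((tau + 1) / 2), written with integer arithmetic.
    return (tau + 2) // 2

def partition(s, k):
    # Split s into k contiguous parts of near-equal length.
    # Assumption: when len(s) is not divisible by k, the LAST parts are longer.
    base, extra = divmod(len(s), k)
    parts, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        parts.append(s[pos:pos + size])
        pos += size
    return parts

k = num_partitions(3)
print(k, partition("abcdefghijkl", k), partition("axxbcdefghxijkl", k))
```

With τ = 3 the three insertions in s’ cannot put two or more errors into both partitions at once, which is why one partition is guaranteed to carry at most one error.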

Transformation rules • Shifting moves the start of a partition; scaling changes its length. • Example: shifting the first partition of s by 2 gives [cdefgh]; scaling that by −1 gives [cdefg]. • First partition: only scaling within the range [−2, 2] needs to be considered. • Last partition: only the combination of the same amount of shifting and scaling within the range [−τ, τ] needs to be considered (so that the last character is always included in the resulting substring). • Remaining partitions: shifting within the range [−τ, τ] and scaling within the range [−2, 2] both need to be considered.
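The three rules can be written as a small generator of partition variations. This is a sketch, assuming kτ ≥ 2 and 0-based positions; `transform_partition` and its parameters are illustrative names, not the authors' code.

```python
def transform_partition(s, start, length, index, k, tau):
    # Variations of the partition s[start:start+length]; index is 0-based.
    if index == 0:
        # First partition: scaling only, within [-2, 2].
        moves = [(0, d) for d in range(-2, 3)]
    elif index == k - 1:
        # Last partition: shift by d and scale by -d, so the end stays fixed.
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:
        # Middle partitions: any shift in [-tau, tau] with any scale in [-2, 2].
        moves = [(sh, sc) for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
    out = []
    for shift, scale in moves:
        a = start + shift
        b = a + length + scale
        if 0 <= a < b <= len(s):   # drop variations that fall outside s
            out.append(s[a:b])
    return out

s = "abcdefghijkl"
print(transform_partition(s, 0, 6, 0, 2, 3))
print(transform_partition(s, 6, 6, 1, 2, 3))
```

For s = [abcdef], [ghijkl] and τ = 3 this yields five variations of the first partition and seven of the last, twelve in total.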

Partitioned variant filtering • First partition: 5 variations • Intermediate partitions: 5(2τ + 1) variations each • Last partition: 2τ + 1 variations • Total number of 1-variants generated = O(mτ + τ²).
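The per-partition counts add up as in the sketch below. It assumes kτ ≥ 2 (so a distinct first and last partition exist) and ignores variations that would fall outside the string; `variation_count` is a hypothetical helper.

```python
def variation_count(tau):
    # Number of partition variations implied by the three counts above.
    k = (tau + 2) // 2                 # k_tau = ceil((tau + 1) / 2)
    if k < 2:
        raise ValueError("this sketch assumes at least two partitions")
    first = 5
    middle = (k - 2) * 5 * (2 * tau + 1)
    last = 2 * tau + 1
    return first + middle + last

print(variation_count(3))
print(variation_count(5))
```

The count grows quadratically in τ because both the number of partitions and the number of variations per partition grow linearly in τ.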

Example • s = [abcdef], [ghijkl] • s’ = [axxbcde], [fghxijkl] • Variations of the first partition of s: <[abcd], 1> <[abcde], 1> <[abcdef], 1> <[abcdefg], 1> <[abcdefgh], 1> • Variations of the second partition of s: <[jkl], 2> <[ijkl], 2> <[hijkl], 2> <[ghijkl], 2> <[fghijkl], 2> <[efghijkl], 2> <[defghijkl], 2> • The second partition of s’, [fghxijkl], has a 1-variant match with the variation [fghijkl] generated from the second partition of s.
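The match in this example can be checked mechanically. The sketch assumes that the k-variant family of a string is the set of strings obtained by deleting at most k characters (a deletion neighborhood); that definition is not spelled out on the slides.

```python
from itertools import combinations

def gen_variants(s, k):
    # Assumed definition: all strings obtained from s by deleting at most k characters.
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            dropped = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in dropped))
    return out

first_variations = ["abcd", "abcde", "abcdef", "abcdefg", "abcdefgh"]

# Second partition of s' against the variation [fghijkl] of s.
shared = gen_variants("fghxijkl", 1) & gen_variants("fghijkl", 1)
print(shared)

# First partition of s' carries two errors, so it shares no 1-variant
# with any variation of the first partition of s.
print(any(gen_variants("axxbcde", 1) & gen_variants(v, 1) for v in first_variations))
```

Deleting the inserted x from [fghxijkl] gives [fghijkl], which is the shared 1-variant.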

Prefix pruning method • If a partition (variation) is longer than a prefix length lp, only its lp-prefix is used to generate its 1-variants. • Assume lp is set to 3. Then 1-variants are generated from only the following prefixes: • <[abc], 1> <[ghi], 2> <[hij], 2> <[fgh], 2> • By setting lp ≤ m/kτ − 2, • the total number of 1-variants generated is further reduced to O(lp τ²).
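A minimal sketch of prefix pruning, again assuming deletion-based 1-variants. The helper names `one_variants` and `pruned_keys` are hypothetical.

```python
def one_variants(s):
    # The string itself plus every string obtained by deleting one character.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def pruned_keys(variations, lp):
    # Keep only the lp-prefix of each variation, deduplicate the prefixes,
    # then generate 1-variants from the prefixes instead of the full strings.
    prefixes = {v[:lp] for v in variations}
    keys = set()
    for p in prefixes:
        keys |= one_variants(p)
    return prefixes, keys

first_variations = ["abcd", "abcde", "abcdef", "abcdefg", "abcdefgh"]
print(pruned_keys(first_variations, 3))
```

All five variations of the first partition collapse to the single prefix [abc], so the number of generated 1-variants no longer depends on the partition length.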

Indexing the entities • Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong. • For each entity whose length is smaller than kτ lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort. • For each entity whose length is at least kτ lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.

Algorithm: BuildIndex(E, τ, lp)
  for each e ∈ E do
    if |e| < kτ lp + τ then
      V ← GenVariants(e[1 .. min(lp, |e|)], τ);
      /* The GenVariants(s, k) function generates the k-variant family of string s */
      for each v ∈ V do
        Ishort[v] ← Ishort[v] ∪ {e};
    if |e| ≥ kτ lp then
      P ← the set of kτ partitions of e;
      for each i-th partition p ∈ P do
        PT ← TransformPartition(p);
        /* according to the three transformation rules in Section 3.1 */
        for each partition variation pT ∈ PT do
          V ← GenVariants(pT[1 .. lp], 1);
          for each v ∈ V do
            Ilong[v] ← Ilong[v] ∪ {<e, i>};
  return (Ishort, Ilong)
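The index construction can be sketched in Python as below. Several details are assumptions made for illustration: variant families are deletion neighborhoods, leftover characters go to the last partitions, kτ ≥ 2, and partition ids are 1-based to match the <e, i> pairs.

```python
from collections import defaultdict
from itertools import combinations

def gen_variants(s, k):
    # Assumed: strings obtained from s by deleting at most k characters.
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            dropped = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in dropped))
    return out

def partition_bounds(m, k):
    # (start, length) of k near-equal partitions; last ones take the remainder.
    base, extra = divmod(m, k)
    bounds, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        bounds.append((pos, size))
        pos += size
    return bounds

def transform_partition(e, start, length, i, k, tau):
    # The three transformation rules: first, last, and middle partitions.
    if i == 0:
        moves = [(0, d) for d in range(-2, 3)]
    elif i == k - 1:
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:
        moves = [(sh, sc) for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
    out = []
    for shift, scale in moves:
        a = start + shift
        b = a + length + scale
        if 0 <= a < b <= len(e):
            out.append(e[a:b])
    return out

def build_index(entities, tau, lp):
    k = (tau + 2) // 2                      # k_tau = ceil((tau + 1) / 2)
    ishort, ilong = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:           # short entity: tau-variants of its prefix
            for v in gen_variants(e[:lp], tau):
                ishort[v].add(e)
        if len(e) >= k * lp:                # long entity: 1-variants per variation prefix
            for i, (start, length) in enumerate(partition_bounds(len(e), k)):
                for pt in transform_partition(e, start, length, i, k, tau):
                    for v in gen_variants(pt[:lp], 1):
                        ilong[v].add((e, i + 1))
    return ishort, ilong

ishort, ilong = build_index(["abcdefghijkl", "abcde"], tau=3, lp=3)
print(sorted(ilong["fgh"]), sorted(ishort["ab"]))
```

An entity whose length falls in [kτ lp, kτ lp + τ) satisfies both conditions and is stored in both indexes, as in the pseudocode.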

Algorithm: MatchDocument(D, E, τ)
  for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong(D[p .. p + lp − 1], E, τ); /* matching entities no shorter than kτ lp */
    SearchShort(D[p .. p + lp − 1], E, τ); /* matching entities of length in [Lmin, kτ lp) */
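The driver loop can be sketched with the two search routines passed in as callables. Their `(window, p)` signature is a simplification chosen for this sketch, not the signature in the pseudocode; positions are 0-based here.

```python
def match_document(doc, tau, lp, l_min, search_long, search_short):
    # Slide an lp-length window over every start position at which a match
    # of length at least l_min - tau can still fit into the document.
    results = set()
    last_start = len(doc) - l_min + tau    # 0-based form of |D| - Lmin + tau + 1
    for p in range(0, last_start + 1):
        window = doc[p:p + lp]
        results |= search_long(window, p)   # entities no shorter than k_tau * lp
        results |= search_short(window, p)  # entities of length in [l_min, k_tau * lp)
    return results

# Usage with stub search functions that only record the windows they see.
seen = []
def record(window, p):
    seen.append(window)
    return set()

match_document("abcdefgh", tau=1, lp=3, l_min=5,
               search_long=record, search_short=lambda w, p: set())
print(seen)
```

The upper bound on p follows from the shortest possible match, which has length Lmin − τ.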

SearchLong(s)
  R ← ∅; /* holds results */
  C ← ∅; /* holds candidates */
  V ← GenVariants(s, 1); /* generate the 1-variant family */
  for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
      C ← C ∪ {<e, pid>}; /* duplicates removed */
  for each <e, pid> ∈ C do
    S ← QuerySegmentInstantiation(e, pid); /* returns the set of query segment candidates for e */
    for each seg ∈ S do
      if Verify(seg, e) = true then
        R ← R ∪ {<seg, e>};
  return R
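A sketch of the same control flow in Python. Because the slides do not describe QuerySegmentInstantiation, it is passed in as a hypothetical `instantiate` callable, and the usage example plugs in a brute-force stand-in that is not the authors' method. Verify is taken to be an edit distance check, and variants are assumed to be deletion-based.

```python
def one_variants(s):
    # Assumed 1-variant family: s plus every single-character deletion.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_long(window, ilong, instantiate, tau):
    # Probe the long-entity index with the 1-variants of the window,
    # collect deduplicated <e, pid> candidates, then verify each segment.
    candidates = set()
    for v in one_variants(window):
        candidates |= ilong.get(v, set())
    results = set()
    for e, pid in candidates:
        for seg in instantiate(e, pid):
            if edit_distance(seg, e) <= tau:
                results.add((seg, e))
    return results

# Hand-built index fragment and a brute-force stand-in for instantiation:
# every substring of the document whose length is within tau of |e|.
doc, tau = "axxbcdefghxijkl", 3
ilong = {"fgh": {("abcdefghijkl", 2)}}
def brute_force(e, pid):
    return [doc[a:b] for a in range(len(doc)) for b in range(a + 1, len(doc) + 1)
            if abs((b - a) - len(e)) <= tau]

print((doc, "abcdefghijkl") in search_long("fgh", ilong, brute_force, tau))
```

Collecting candidates in a set before instantiation is what the "duplicates removed" comment refers to: an entity reached through several variants is verified only once per partition id.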

SearchShort(s) • The τ-variant families must be generated for each possible length l between Lmin − τ and lp. • If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified. • Otherwise, verification is performed for 2τ + 1 possible query segments.
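One possible reading of these three bullets is sketched below. It is an interpretation, not the authors' procedure: it assumes deletion-based variants, that a segment shorter than lp is itself the whole query segment, and that the 2τ + 1 query segments are the substrings starting at p whose lengths lie within τ of the candidate entity's length. Verification is exact, so reported pairs are true matches, but completeness is not claimed.

```python
from itertools import combinations

def gen_variants(s, k):
    # Assumed: strings obtained from s by deleting at most k characters.
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            dropped = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in dropped))
    return out

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_short(doc, p, ishort, tau, lp, l_min):
    results = set()
    for l in range(max(1, l_min - tau), lp + 1):
        seg = doc[p:p + l]
        if len(seg) < l:
            break                                   # ran off the end of the document
        candidates = set()
        for v in gen_variants(seg, tau):            # probe with the tau-variant family
            candidates |= ishort.get(v, set())
        for e in candidates:
            if l < lp:
                segments = [seg]                    # the segment is the whole query
            else:
                # up to 2*tau + 1 query segments, one per admissible length
                segments = [doc[p:p + n]
                            for n in range(max(1, len(e) - tau), len(e) + tau + 1)
                            if p + n <= len(doc)]
            for q in segments:
                if edit_distance(q, e) <= tau:
                    results.add((p, q, e))
    return results

# Hand-built index for the single entity "abcde" with tau = 1 and lp = 3.
ishort = {v: {"abcde"} for v in gen_variants("abc", 1)}
print(search_short("xxabxdezz", 2, ishort, tau=1, lp=3, l_min=3))
```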

Reducing the amount of enumeration • For example, enumerate the 1-variants of the string [abcdef] from left to right. • Suppose no variant in the index starts with abc. • The algorithm still enumerates the other three 1-variants that begin with abc. • To avoid this, set the parameter lpp to lp/2.
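The wasted work in this example can be counted directly, assuming deletion-based 1-variants; the enumeration order used here is only illustrative.

```python
def one_variants_in_order(s):
    # The string itself, then each single-character deletion by position.
    return [s] + [s[:i] + s[i + 1:] for i in range(len(s))]

variants = one_variants_in_order("abcdef")
with_abc = [v for v in variants if v.startswith("abc")]
print(with_abc)   # ['abcdef', 'abcef', 'abcdf', 'abcde']
```

Four of the seven 1-variants begin with abc, so once the first of them fails on that prefix, the remaining three probes are known in advance to fail as well.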

Consider 4 possible cases:

Conclusion • Successfully reduced the size of neighborhood • Proposed an efficient query processing algorithm • Optimized the algorithm to share computation • Avoid unnecessary variant enumeration

?? Questions ??

Thank You !!
