Creating Imprecise Databases from Information Extraction Models

  1. Creating Imprecise Databases from Information Extraction Models Rahul Gupta (IBM India Research Lab), Sunita Sarawagi (IIT Bombay)

  2. [Figure: text → state-of-the-art extraction model (CRF) → Extraction 1 with probability p1, Extraction 2 with probability p2, Extraction 3 with probability p3, ..., Extraction k with probability pk → Imprecise Database] Q1. Isn't storing the best extraction sufficient? Q2. Can we store all this efficiently?

  3. CRFs: Probability = Confidence • If N segmentations are labeled with probability p, then around Np should be correct.

  4. Is the best extraction sufficient?

  5. [Figure: same pipeline as slide 2] Q1. Isn't storing the best extraction sufficient? NO! Q2. Can we store all this efficiently?

  6. How much to store? • k can be exponential in the worst case • Retrieval is O(k log k) • An arbitrary amount of storage is not available

  7. [Figure: same pipeline as slide 2] Q1. Isn't storing the best extraction sufficient? NO! Q2. Can we store all this efficiently? NO!

  8. Our approach [Figure: text → CRF → Extraction 1..k with probabilities p1..pk, summarized directly into an Imprecise Database]

  9. Example Input: “52-A Bandra West Bombay 400 062” [Figure: the input fed to the CRF]

  10. Imprecise Data Models • Segmentation-per-row model (Exact) • One-row model • Multi-row model

  11. Segmentation-per-row model (Rows: Uncertain; Cols: Exact) Exact but impractical. We can have too many segmentations!

  12. One-row Model Each column is a multinomial distribution (Row: Exact; Columns: Indep, Uncertain) e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252 Simple model, closed form solution, but crude.
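
A minimal sketch of how such a one-row model scores a segmentation, in Python. The 0.7/0.6/0.6/1.0 values are the slide's illustrative numbers; the remaining entries of each multinomial (e.g. "52": 0.3) are made up only so the distributions sum to one.

```python
# One-row model: each column is an independent multinomial over segment values.
one_row = {
    "HNo":     {"52-A": 0.7, "52": 0.3},            # 0.3 is an assumed complement
    "Area":    {"Bandra West": 0.6, "Bandra": 0.4},
    "City":    {"Bombay": 0.6, "West Bombay": 0.4},
    "Pincode": {"400 062": 1.0},
}

def one_row_prob(row_model, segmentation):
    """P(segmentation) = product over columns of that column's multinomial probability."""
    p = 1.0
    for column, value in segmentation.items():
        p *= row_model[column].get(value, 0.0)
    return p

seg = {"HNo": "52-A", "Area": "Bandra West", "City": "Bombay", "Pincode": "400 062"}
print(one_row_prob(one_row, seg))  # 0.7 * 0.6 * 0.6 * 1.0 = 0.252
```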

  13. Multi-row Model Segmentation generated by a ‘mixture’ of rows (Rows: Uncertain; Columns: Indep, Uncertain) e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.833 x 1.0 x 1.0 x 1.0 x 0.6 + 0.5 x 0.0 x 0.0 x 1.0 x 0.4 = 0.50
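
A matching sketch for the multi-row mixture, reusing the slide's row weights (0.6, 0.4) and the quoted per-column probabilities; the other entries in each multinomial are made-up placeholders so the example runs.

```python
# Multi-row model: P(seg) = sum_k pi_k * prod_columns Q_column^k(value).
multi_row = [
    (0.6, {"HNo": {"52-A": 0.833, "52": 0.167},     # 0.167 is an assumed complement
           "Area": {"Bandra West": 1.0},
           "City": {"Bombay": 1.0},
           "Pincode": {"400 062": 1.0}}),
    (0.4, {"HNo": {"52-A": 0.5, "52": 0.5},
           "Area": {"Bandra": 1.0},                  # puts no mass on "Bandra West"
           "City": {"West Bombay": 1.0},             # puts no mass on "Bombay"
           "Pincode": {"400 062": 1.0}}),
]

def multi_row_prob(rows, segmentation):
    """Mixture probability: weighted sum over rows of per-row column products."""
    total = 0.0
    for weight, columns in rows:
        p = weight
        for column, value in segmentation.items():
            p *= columns[column].get(value, 0.0)
        total += p
    return total

seg = {"HNo": "52-A", "Area": "Bandra West", "City": "Bombay", "Pincode": "400 062"}
print(round(multi_row_prob(multi_row, seg), 2))  # 0.6*0.833*1*1*1 + 0.4*0.5*0*0*1 = 0.50
```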

  14. Goal • Devise algorithms to efficiently populate these imprecise data models • For multi-row model, populate a constant number of rows • Avoid enumeration of top-k segmentations • Is the multi-row model worth all this effort?

  15. Outline • DAG view of CRFs • Populating the one-row model • Populating the multi-row model • Enumeration based approach • Enumeration-less approach • Experiments • Related work and conclusions

  16. CRF: DAG View [Figure: DAG whose nodes are candidate segments: 52, 52-A, Bandra, Bandra West, West, Bombay, West Bombay, 400 062] • Segment ≈ Node • Segmentation ≈ Path • Probability ≈ Score = exp(Σ_u F(u) + Σ_e G(e))
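
A small sketch of the path score implied by this view; F (node potentials) and G (edge potentials) are hypothetical dictionaries, not values from the paper.

```python
import math

def path_score(path, F, G):
    """Unnormalized score of a segmentation = exp(sum_u F(u) + sum_e G(e)),
    where the path is a list of segment nodes and edges are consecutive pairs."""
    node_term = sum(F[u] for u in path)
    edge_term = sum(G[(a, b)] for a, b in zip(path, path[1:]))
    return math.exp(node_term + edge_term)
```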

  17. Semantic Mismatch • CRF: Probability factors over segments • Imprecise DB: Probability factors over labels • Have to approximate one family of distributions by another

  18. Populating Models: Approximating P() • Metric: KL(P||Q) = Σ_s P(s) log(P(s)/Q(s)) • Zero only when Q = P, positive otherwise • Q tries to match the modes of P • Easy to play with • Enumeration over s not required.
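
For reference, the KL objective written out directly over an explicit dictionary of segmentation probabilities. The later slides show how this enumeration over s is avoided; this is just the definition.

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_s P(s) * log(P(s) / Q(s)), over segmentations s with P(s) > 0."""
    return sum(p * math.log(p / max(Q.get(s, 0.0), eps))
               for s, p in P.items() if p > 0.0)
```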

  19. Outline • DAG view of CRFs • Populating the one-row model • Populating the multi-row model • Enumeration based approach • Enumeration-less approach • Experiments • Related work and conclusions

  20. Populating the One-row Model Min KL(P||Q) = Min KL(P || Π_y Q_y) = Min Σ_y KL(P_y||Q_y) • Solution: Q_y(t,u) = P(t,u,y) (marginal), computed via α and β

  21. CRFs: Marginal Probabilities [Figure: DAG over candidate segments, with the segment whose marginal is needed highlighted] Marginal: P((t,u,y)) ∝ β_u(y) Σ_{y'} α_{t-1}(y') Score(t,u,y',y)

  22. CRF: Forward Vectors [Figure: DAG over candidate segments] α_i(y) = Score of all paths ending at position i with label y = Σ_prefixes Score of (prefix + edge from prefix) = Σ_{y',d} α_{i-d}(y') Score(i-d+1, i, y', y)

  23. CRF: Backward Vectors [Figure: DAG over candidate segments] β_i(y) = Score of all paths starting at i+1 with label y at position i = Σ_suffixes Score of (suffix + edge to suffix) = Σ_{y',d} β_{i+d}(y') Score(i+1, i+d, y, y')
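
A rough Python sketch of the forward/backward recursions and of the segment marginal from the previous slide. `score(t, u, y_prev, y)` is an assumed callable returning the exponentiated potential of the segment covering tokens t..u with label y after a segment labelled y_prev; `max_len` bounds segment length, and the START bootstrap is an implementation convenience, not something on the slides.

```python
def forward(n, labels, max_len, score, START="START"):
    """alpha[i][y] = total score of all paths ending at position i with label y."""
    alpha = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    alpha[0] = {START: 1.0}
    for i in range(1, n + 1):
        for y in labels:
            alpha[i][y] = sum(a * score(i - d + 1, i, y_prev, y)
                              for d in range(1, min(max_len, i) + 1)
                              for y_prev, a in alpha[i - d].items())
    return alpha

def backward(n, labels, max_len, score):
    """beta[i][y] = total score of all paths over positions i+1..n, given label y at i."""
    beta = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    beta[n] = {y: 1.0 for y in labels}
    for i in range(n - 1, -1, -1):
        for y in labels:
            beta[i][y] = sum(beta[i + d][y_next] * score(i + 1, i + d, y, y_next)
                             for d in range(1, min(max_len, n - i) + 1)
                             for y_next in labels)
    return beta

def segment_marginal(t, u, y, alpha, beta, score, Z):
    """P(t,u,y) ∝ beta_u(y) * sum_{y'} alpha_{t-1}(y') * score(t,u,y',y); Z = sum_y alpha[n][y]."""
    s = sum(a * score(t, u, y_prev, y) for y_prev, a in alpha[t - 1].items())
    return beta[u][y] * s / Z
```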

  24. Outline • DAG view of CRFs • Populating the one-row model • Populating the multi-row model • Enumeration based approach • Enumeration-less approach • Experiments • Related work and conclusions

  25. Populating Multi-row Model • log Q_multi is unwieldy and disallows a closed-form solution • But the problem is the same as estimating the parameters of a mixture. • Algorithms • Enumeration-based approach • Enumeration-less approach.

  26. Enumeration-based approach • Edge weight of s → k denotes the 'soft membership' of segmentation s in row k. • Depends on how strongly row k 'generates' s. • A row's parameters depend on its member segmentations. [Figure: bipartite graph linking segmentations (with their probabilities) to rows]

  27. Continued.. • EM-Algorithm • Start with random edge-weights • Repeat • Compute new row-parameters (M-step) • Compute new edge-weights (E-step) • Monotone convergence to a locally optimal solution is guaranteed. • In practice, global optimality often achieved by taking multiple starting points.
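
A sketch of this enumeration-based EM in Python, following the update formulas on the later 'EM Details' backup slide; names and data structures (`segs` as a list of (segmentation, probability) pairs, segmentations as {column: value} dicts) are illustrative.

```python
import random
from collections import defaultdict

def fit_multi_row(segs, k, iters=50, seed=0):
    """Fit k rows (weight, {column: multinomial}) to enumerated segmentations."""
    rnd = random.Random(seed)
    # start with random soft memberships R(k|s)
    R = []
    for _ in segs:
        w = [rnd.random() for _ in range(k)]
        R.append([x / sum(w) for x in w])

    rows = []
    for _ in range(iters):
        # M-step: refit row weights and per-column multinomials from soft counts
        rows = []
        for j in range(k):
            pi = sum(p * R[i][j] for i, (_, p) in enumerate(segs))
            cols = defaultdict(lambda: defaultdict(float))
            for i, (seg, p) in enumerate(segs):
                for col, val in seg.items():
                    cols[col][val] += p * R[i][j]
            for col in cols:                       # normalize each column's multinomial
                z = sum(cols[col].values()) or 1.0
                for val in cols[col]:
                    cols[col][val] /= z
            rows.append((pi, cols))

        # E-step: R(k|s) ∝ pi_k * Q_k(s)
        for i, (seg, _) in enumerate(segs):
            scores = [pi * _prod(cols, seg) for pi, cols in rows]
            z = sum(scores) or 1.0
            R[i] = [s / z for s in scores]
    return rows

def _prod(cols, seg):
    p = 1.0
    for col, val in seg.items():
        p *= cols[col].get(val, 0.0)
    return p
```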

  28. Outline • DAG view of CRFs • Populating the one-row model • Populating the multi-row model • Enumeration based approach • Enumeration-less approach • Experiments • Related work and conclusions

  29. Enumeration-less approach [Figure: Segmentations (not enumerated) → Grouping mechanism → Compound segmentations (enumerated) → EM → Multi-row model (2 rows)] Similar to the enumeration-based approach

  30. Grouping Mechanism • Divide the DAG paths into distinct sets • e.g. whether they pass through a node or not, or whether they contain a given starting position or not. • We have the following boolean tests • A_tuy = segment (t,u,y) present? • B_ty = segment starts at t with label y? • C_uy = segment ends at u with label y? • P(A), P(B), P(C) can be computed easily using the α and β vectors
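
Assuming a `segment_marginal(t, u, y)` helper derived from the α and β vectors (as in the forward/backward sketch above), the test probabilities reduce to sums of segment marginals; a minimal sketch:

```python
def prob_A(t, u, y, segment_marginal):
    """P(A_tuy = 1): the segment (t, u, y) appears in the segmentation."""
    return segment_marginal(t, u, y)

def prob_B(t, y, n, max_len, segment_marginal):
    """P(B_ty = 1): some segment starts at t with label y."""
    return sum(segment_marginal(t, u, y) for u in range(t, min(t + max_len - 1, n) + 1))

def prob_C(u, y, max_len, segment_marginal):
    """P(C_uy = 1): some segment ends at u with label y."""
    return sum(segment_marginal(t, u, y) for t in range(max(1, u - max_len + 1), u + 1))
```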

  31. Grouping Mechanism: Example [Figure: decision tree over boolean tests: the root tests A_'52-A',HNo; its 1-branch tests B_'West',*; the leaves partition segmentations s1..s4 into {s1, s4}, {s3}, {s2}] P({s1,s4}) = P(A=0), P({s3}) = P(A=1, B=0) • Compute via α, β like we did for marginals • We can work with boolean tests instead of segmentations!

  32. Choosing the boolean tests • At each step, choose a current partition and a variable that 'best' bisects that partition • 'Best' computed by an 'approximation-quality' metric. • Can be computed using α and β

  33. Outline • DAG view of CRFs • Populating the one-row model • Populating the multi-row model • Enumeration based approach • Enumeration-less approach • Experiments • Related work and conclusions

  34. Experiments: Need for multi-row • KL very high at m=1. One-row model clearly inadequate. • Even a two-row model is sufficient in many cases.

  35. Experiments: KL divergence (Address) • Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.

  36. Related Work • Imprecise DBs • Families of data models ([Fuhr '90], [Sarma et al. '06], [Hung et al. '03]) • Efficient query processing ([Cheng et al. '03], [Dalvi and Suciu '04], [Andritsos et al. '06]) • Aggregate queries ([Ross et al. '05])

  37. Related Work (Contd.) • Approximating intractable distributions • With tree distributions ([Chow and Liu '68]) • Mixture of mean-fields ([Jaakkola and Jordan '99]) • With variational methods ([Jordan et al. '99]) • Tree-based reparameterization ([Wainwright et al. '01])

  38. Conclusions • First approach to quantify uncertainty of information extraction through imprecise database models. • Algorithms to populate multi-row DBs • A competitive enumeration-less algorithm for CRFs. • Future Work • Representing uncertainty of de-duplication • Multi-table uncertainties

  39. Questions?

  40. Backup: Expts: Number of partitions? • Far fewer partitions generated for merging as compared to the ideal enumeration-based approach. • Final EM run on substantially fewer (compound) segmentations.

  41. Backup: Expts: Sensitivity to cut-off • Info-gain cut-off largely independent of dataset. • Approximation quality not resilient to small changes in info-gain cut-off.

  42. Experiments: KL divergence (Address, Cora) • Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.

  43. Backup: Segmentation Models (Semi-CRF) s: segmentation; (t,u,y): segment e.g. s = [52-A]HNo [Bandra West]Area [Bombay]City [400 062]Pincode P(s|x) = exp(W^T Σ_j f(j,x,s)) / Z(x) • f(j,x,s) = feature vector for the jth segment • e.g. x_j contains 'East' or 'West' ∧ j ≤ n-1 ∧ y_j = AREA • x_j is a valid pincode ∧ y_{j-1} = CITY
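
A tiny sketch of the unnormalized semi-CRF score; `features` is a hypothetical callable returning a sparse {feature_name: value} dict for the j-th segment, `w` a dict of weights, and Z(x) would come from the forward pass (Σ_y α_n(y)) rather than being computed here.

```python
import math

def unnormalized_score(x, segmentation, features, w):
    """exp( w · sum_j f(j, x, s) ); divide by Z(x) to get P(s|x)."""
    total = 0.0
    for j in range(len(segmentation)):
        for name, value in features(j, x, segmentation).items():
            total += w.get(name, 0.0) * value
    return math.exp(total)
```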

  44. Backup: EM Details • E-step: R(k|s_d) ∝ π_k Q_k(s_d) • R(k|s_d) = 'soft' assignment of s_d to row k • R(k|s_d) favours the row which generates s_d with high probability • M-step: Q_y^k(t,u) ∝ Σ_{s_d : (t,u,y) ∈ s_d} P(s_d) R(k|s_d), π_k = Σ_{s_d} P(s_d) R(k|s_d) • (π_k = weight of a row = its total member count weighted by their probabilities)

  45. Backup: Enumeration-less approach [Figure: Segmentations (not enumerated) → Grouping Mechanism → Compound segmentations (hard partitions: enumerated) → Multi-row model (4 rows)]

  46. Backup: Issues • Hard-partitioning not as good as soft-partitioning. • Generally, numrows is too small; sufficient partitioning may be infeasible. • Solution: • Generate many more partitions than rows. • Apply the EM algorithm to soft-assign the partitions to the rows.

  47. Merging • Each partition is a 'compound' segmentation s. • Can afford to enumerate partitions • M-step • Need P(s) and R(k|s); P(s) is the same as the partition probability P(c). • E-step • Need Q_k(s) = probability of generating compound segmentation s from the kth row. Tricky!

  48. Backup: Merging (E-step) • Idea: • Decompose the generation task over the labels (they are independent) • Probability of generating the P_y(s) multinomial from the Q_y^k multinomial depends on the Bregman divergence = KL distance (for multinomials) • Q_k(s) ∝ π_k exp(−Σ_y KL(P_y(s) || Q_y^k))
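
A short sketch of this E-step; `P_cols` holds the per-label multinomials of one compound segmentation and `rows` the (π_k, per-label multinomials) pairs; names are illustrative.

```python
import math

def kl(P, Q, eps=1e-12):
    """KL between two multinomials given as {value: probability} dicts."""
    return sum(p * math.log(p / max(Q.get(v, 0.0), eps)) for v, p in P.items() if p > 0.0)

def row_responsibilities(P_cols, rows):
    """Q_k(s) ∝ pi_k * exp(-sum_y KL(P_y(s) || Q_y^k)), normalized over rows k."""
    scores = [pi * math.exp(-sum(kl(P_cols[y], Q_cols[y]) for y in P_cols))
              for pi, Q_cols in rows]
    z = sum(scores) or 1.0
    return [s / z for s in scores]
```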

  49. Expts: What about query results? • KL divergence highly correlated with ranking inversion errors. • Far fewer ranking errors for the Merging approach.

  50. Approximating segmentation models with mixtures Approximate P ∝ exp(W^T F(x)) with Q = Σ_k π_k Q_k(x) = Σ_k π_k Π_i Q_k^i(x) • Metrics • Approximation: KL(P||Q) • Data Querying: Rank-Inversion, Square-Error • Can be computationally intractable/non-decomposable
