
Optimizing Statistical Information Extraction Programs Over Evolving Text



  1. Optimizing Statistical Information Extraction Programs Over Evolving Text Fei Chen Xixuan (Aaron) Feng Christopher Ré Min Wang

  2. One-Slide Summary • Statistical Information Extraction (IE) is increasingly used. • For example, MSR Academic Search, Ali Baba (HU Berlin), MPI YAGO • isWiki at HP Labs • Text corpora evolve! • An issue: it is difficult to keep IE results up to date • Current approach: rerun from scratch, which can be too slow • Our goal: improve statistical IE runtime on evolving corpora by recycling previous IE results. • We focus on a popular statistical model for IE – conditional random fields (CRFs) – and build CRFlex • We show that a 10x speedup is possible for repeated extractions

  3. Background

  4. Background 1: CRF-based IE Programs David DeWitt is working at Microsoft. • Document • Token sequence • Trellis graph • Label sequence • Table • [Figure: trellis over token positions 1–6, with labels P (Person), A (Affiliation), O (Other)]
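The running example above can be sketched end to end. The CRF labeler here is a hypothetical stand-in (a lookup of the gold labels for this one sentence), not an actual trained model; `tokenize` and `crf_label` are illustrative names:

```python
# Minimal sketch of the slide's pipeline: document -> token sequence -> label
# sequence. A real system would run CRF inference instead of a lookup table.

def tokenize(doc: str) -> list[str]:
    # Strip trailing periods; a real tokenizer would be more careful.
    return [t.rstrip(".") for t in doc.split()]

def crf_label(tokens: list[str]) -> list[str]:
    # Stand-in for CRF labeling: P = Person, A = Affiliation, O = Other.
    gold = {"David": "P", "DeWitt": "P", "Microsoft": "A"}
    return [gold.get(t, "O") for t in tokens]

doc = "David DeWitt is working at Microsoft."
tokens = tokenize(doc)
labels = crf_label(tokens)
# tokens: ['David', 'DeWitt', 'is', 'working', 'at', 'Microsoft']
# labels: ['P', 'P', 'O', 'O', 'O', 'A']
```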

  5. Background 2: CRF Inference Steps • Token sequence → Label sequence (CRF Labeling) • (I) Computing Feature Functions (Applying Rules) • (II) Constructing Trellis Graph (Dot Product) • (III) Viterbi Inference (Dynamic Programming) • A variant of the standard shortest-path algorithm • Example (position 6): feature values f(O, A, x, 6) = 0 and g(O, A, x, 6) = 1 form the feature vector v = (0, 1); with model weights λ = (0.5, 0.2), the trellis edge weight is w = v∙λ = 0.2
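Steps (II) and (III) can be sketched as below, assuming step (I) has already produced the feature vectors. The label set matches the slides, but `toy_weight` and its scores are made up for illustration and are not the paper's model:

```python
LABELS = ["P", "A", "O"]  # Person, Affiliation, Other

def edge_weight(v, lam):
    # Step (II): a trellis edge weight is the dot product w = v . lambda.
    return sum(vi * li for vi, li in zip(v, lam))

def viterbi(n, weight):
    # Step (III): max-sum dynamic programming over the trellis, analogous
    # to a shortest-path computation. weight(t, prev_label, label) returns
    # the edge score at position t (prev_label is None at t = 0).
    score = {y: weight(0, None, y) for y in LABELS}
    back = []
    for t in range(1, n):
        new_score, ptr = {}, {}
        for y in LABELS:
            best = max(LABELS, key=lambda yp: score[yp] + weight(t, yp, y))
            new_score[y] = score[best] + weight(t, best, y)
            ptr[y] = best
        score = new_score
        back.append(ptr)
    # Backtrack from the best final label.
    y = max(LABELS, key=lambda lbl: score[lbl])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# The slide's example: feature vector v = (0, 1), model lambda = (0.5, 0.2).
w = edge_weight((0, 1), (0.5, 0.2))   # w = 0.2

def toy_weight(t, yp, y):
    # Made-up scores: favor P at position 0 and O everywhere else.
    if t == 0:
        return 1.0 if y == "P" else 0.0
    return 1.0 if y == "O" else 0.0

path = viterbi(3, toy_weight)         # -> ['P', 'O', 'O']
```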

  6. Pipeline: Token sequences → (I) Computing Feature Functions f1, f2, …, fK → Feature values → (II) Computing Trellis Graph → Trellis graph → (III) Performing Inference → Label sequences. Challenges • How to do CRF inference incrementally with exactly the same results as a re-run • no straightforward solution for each step • How to trade off savings and overhead • intermediate results (feature values & trellis graph) are much larger than the input (tokens) & output (labels)

  7. Technical Contributions

  8. Recycling Each Inference Step • (I) Computing Feature Functions (Applying Rules) • Cyclex: "Efficient Information Extraction over Evolving Text Data", F. Chen et al., ICDE 2008 • (II) Constructing Trellis Graph (Dot Product) • At a given position, unchanged features → unchanged trellis column • (III) Viterbi Inference (Dynamic Programming) • Auxiliary information needed to localize dependencies • A modified version of Viterbi for recycling
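The recycling idea for step (II) can be sketched as follows. This is a simplified illustration, not the CRFlex implementation: a trellis column is reduced to a single number here, whereas a real column holds one weight per label pair, and the function names are invented:

```python
def compute_column(v, lam=(0.5, 0.2)):
    # Stand-in for the per-position dot-product work of step (II).
    return sum(vi * li for vi, li in zip(v, lam))

def build_trellis(features, prev_features=None, prev_trellis=None):
    # At each position, an unchanged feature vector means the old trellis
    # column is still valid and can be copied instead of recomputed.
    trellis, reused = [], 0
    for pos, v in enumerate(features):
        if (prev_features is not None and pos < len(prev_features)
                and prev_features[pos] == v):
            trellis.append(prev_trellis[pos])  # recycle the old column
            reused += 1
        else:
            trellis.append(compute_column(v))  # recompute from scratch
    return trellis, reused

old = [(0, 1), (1, 0), (1, 1)]
new = [(0, 1), (1, 1), (1, 1)]  # only position 1 changed between snapshots
old_trellis = [compute_column(v) for v in old]
trellis, reused = build_trellis(new, old, old_trellis)
# reused == 2: positions 0 and 2 are copied from the previous snapshot
```

The output is identical to recomputing every column, which is the correctness guarantee the framework requires.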

  9. Performance Trade-off • Materialization decision in each inference step • A new trade-off thanks to the large amount of intermediate representation of statistical methods • CPU computation varies from task to task

  10. Optimization • Binary choices for 2 intermediate outputs → 2² = 4 plans • More plans are possible • e.g., with partial materialization in a step • No plan is always fastest → cost-based optimizer • CPU time per token, I/O time per token – task-dependent • Changes between consecutive snapshots – dataset-dependent • Measured by running on a subset at the first few snapshots
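The cost-based choice can be sketched as follows. Materializing an intermediate output pays I/O to write and read it but saves recomputation (CPU) at the next snapshot; the cost model and all numbers below are made up for illustration, not the paper's actual cost formulas:

```python
from itertools import product

def plan_cost(materialize_feats, materialize_trellis, stats):
    # Each materialized intermediate trades CPU recomputation for I/O.
    cost = stats["base_cpu"]
    cost += stats["feat_io"] if materialize_feats else stats["feat_cpu"]
    cost += stats["trellis_io"] if materialize_trellis else stats["trellis_cpu"]
    return cost

def choose_plan(stats):
    # Enumerate the 2^2 = 4 plans and pick the cheapest.
    plans = list(product([False, True], repeat=2))
    return min(plans, key=lambda p: plan_cost(p[0], p[1], stats))

# A task where features are expensive to recompute but cheap to store:
stats = {"base_cpu": 1.0, "feat_cpu": 5.0, "feat_io": 2.0,
         "trellis_cpu": 0.5, "trellis_io": 3.0}
best = choose_plan(stats)  # -> (True, False): materialize features only
```

For a different task (say, cheap regex features but a huge trellis) a different plan wins, which is why no single plan is always fastest.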

  11. Experiments

  12. Repeated Extraction Evaluation • Dataset • Wikipedia English w/ Entertainment tag, 16 snapshots (once every three weeks), 3000+ pages per snapshot on average • IE Task: Named Entity Recognition • Features • Cheap: token-based regular expressions • Expensive: approximate matching over dictionaries • [Chart: runtime per snapshot, showing an initial statistics-collection phase followed by a ~10X speed-up]

  13. Conclusion • Concerning real-world deployment of statistical IE programs, we: • Devised a recycling framework with no loss of correctness • Explored a performance trade-off: CPU vs. I/O • Demonstrated that a speed-up of up to about 10X is possible on a real-world dataset • Future Directions • More graphical models and inference algorithms • Parallel settings

  14. Importance of Optimizer • Only the fastest 3 (out of 8) are plotted • No plan is always within top 3

  15. Per Snapshot Comparisons

  16. Runtime Decomposition • Only the fastest 3 plans and Rerun are plotted • I/O time can be larger in the slow plans

  17. Scoping Details • Per-document IE • No assumptions about breaking a document into smaller units • Repeated crawling over a fixed set of URLs • Focus on the most popular model in IE • Linear-chain CRF • Viterbi inference • Optimize the inference process with a pre-trained model • Exactly the same results as a rerun, no approximation • Recycling granularity is a token (or position)

  18. Recycle Each Step • [Figure: recycling architecture, panels (a)–(c) for Steps I–III. Each step diffs the new input against the previous snapshot's (Unix diff on token sequences, vector match on feature values, factor match on factors), recomputes only the changed regions with a Recycler (Feature, Factor, or Inference Recycler), and copies the unchanged regions with a Copier (Feature, Factor, or Label Copier); Step III additionally stores and reuses the Viterbi context.]
