
Optimizing Complex Extraction Programs over Evolving Text Data

Authors: Fei Chen (University of Wisconsin-Madison, Madison, WI, USA); Byron J. Gao (Texas State University-San Marcos, San Marcos, TX, USA); AnHai Doan (University of Wisconsin-Madison, Madison, WI, USA)


Presentation Transcript


  1. Optimizing Complex Extraction Programs over Evolving Text Data
     • Authors:
       • Fei Chen, University of Wisconsin-Madison, Madison, WI, USA
       • Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA
       • AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA
       • Jun Yang, Duke University, Durham, NC, USA
       • Raghu Ramakrishnan, Yahoo! Research, Santa Clara, CA, USA
     • Presented by: Yogendra Godbole

  2. Introduction
     • Motivation
       • Traditional IE methods assume a static corpus
       • In practice, corpora are dynamic
         • DBLife (10,000+ URLs, 120+ MB corpus snapshot)
         • Enterprise intranets
     • Problem
       • How to extract information efficiently from dynamic corpora

  3. Problem Definition
     • Concepts
       • Data pages, extractors, mentions
       • An extractor E: p → R(a_1, a_2, …, a_n) extracts mentions of relation R from page p. A mention of R is a tuple (m_1, m_2, …, m_n) such that each m_i is either a mention of attribute a_i or nil.
     • Examples
     • Assumptions
       • Mentions are extracted from each data page in isolation
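The extractor definition on this slide can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the relation person(name, affiliation), the regular expression, the `Span` type, and the function `extract_person` do not come from the presentation. The sketch only shows the shape of an extractor: a function from one page to a list of tuples whose entries are attribute mentions or `None` (standing in for nil).

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Span(NamedTuple):
    """An attribute mention: character offsets into the page plus the text."""
    start: int  # inclusive offset (assumed convention)
    end: int    # exclusive offset (assumed convention)
    text: str


# A mention of the hypothetical relation person(name, affiliation).
Mention = Tuple[Optional[Span], Optional[Span]]

# Hypothetical pattern: "First Last" optionally followed by " (Affiliation)".
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)(?: \((?P<affil>[^)]+)\))?"
)


def extract_person(page: str) -> List[Mention]:
    """Toy extractor E: p -> person(name, affiliation), applied to one page."""
    mentions: List[Mention] = []
    for m in PATTERN.finditer(page):
        name = Span(m.start("name"), m.end("name"), m.group("name"))
        affil = None  # nil when the page gives no affiliation
        if m.group("affil") is not None:
            affil = Span(m.start("affil"), m.end("affil"), m.group("affil"))
        mentions.append((name, affil))
    return mentions


page = "Speakers: Alice Smith (MIT) and Bob Jones."
mentions = extract_person(page)
```

On this page the sketch yields two mentions: one with both attributes filled in, and one whose affiliation is `None`, matching the "mention of attribute a_i or nil" wording of the definition.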

  4. Methods
     • Concepts
       • Extractor scope: Let s.start and s.end be the start and end character positions of a string s in a page p. An extractor E has scope α iff, for every mention m = (m_1, …, m_n) produced by E, (max_i m_i.end − min_i m_i.start) < α, where m_i.start and m_i.end are the start and end character positions of attribute mention m_i in page p.
       • Extractor context: The β-context of mention m in page p is the string p[(m.start − β)..(m.end + β)], i.e., the string of m extended on both sides by β characters. An extractor E has context β iff, for any mention m and any page p′ obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m as a mention.
     • Challenges
       • Matchers (finding overlapping regions)
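The scope and context definitions above can be made concrete with a small sketch. The helper names (`extent`, `within_scope`, `beta_context`), the end-exclusive offset convention, and the clamping of the context window at page boundaries are assumptions for illustration, not details taken from the presentation. The sketch also assumes every mention has at least one non-nil attribute.

```python
from typing import Optional, Sequence, Tuple

# (start, end) character offsets of one attribute mention; end is exclusive
# (an assumed convention).
Span = Tuple[int, int]


def extent(mention: Sequence[Optional[Span]]) -> Span:
    """Return (min_i m_i.start, max_i m_i.end) over the non-nil attributes."""
    present = [s for s in mention if s is not None]
    return (min(s[0] for s in present), max(s[1] for s in present))


def within_scope(mention: Sequence[Optional[Span]], alpha: int) -> bool:
    """Check the scope condition (max_i m_i.end - min_i m_i.start) < alpha."""
    start, end = extent(mention)
    return (end - start) < alpha


def beta_context(page: str, mention: Sequence[Optional[Span]], beta: int) -> str:
    """Return p[(m.start - beta)..(m.end + beta)], clamped to the page."""
    start, end = extent(mention)
    return page[max(0, start - beta):min(len(page), end + beta)]


page = "Speakers: Alice Smith (MIT) and Bob Jones."
mention = ((10, 21), (23, 26))  # "Alice Smith" and "MIT"

width_ok = within_scope(mention, alpha=17)      # extent is 26 - 10 = 16
width_too_big = within_scope(mention, alpha=16)
context = beta_context(page, mention, beta=3)
```

Here the mention spans 16 characters, so it satisfies a scope of 17 but not 16, and its 3-context is the mention text plus three characters on either side.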

  5. Problem Definition (cont.) • Let P_1, …, P_n be consecutive snapshots of a text corpus, ρ be an IE program written in xlog, E_1, …, E_m be the IE blackboxes (i.e., IE predicates) in ρ, and (α_1, β_1), …, (α_m, β_m) be the estimated scopes and contexts for the blackboxes, respectively. Develop a solution to execute ρ over corpus snapshot P_{n+1} with minimal cost, by reusing extraction results over P_1, …, P_n.
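One way to picture the reuse this problem statement asks for is the simplified sketch below. It is not the authors' algorithm: Python's `difflib.SequenceMatcher` stands in for a matcher, each mention is reduced to a single (start, end) extent, and the function name `reusable_mentions` is hypothetical. The idea shown is only that an old mention can be copied to the new snapshot of a page when its whole β-context lies inside a region the two page versions share; everything else would still need to be re-extracted.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Overall (start, end) extent of a mention; end is exclusive (assumed).
Span = Tuple[int, int]


def reusable_mentions(
    old_page: str, new_page: str, old_mentions: List[Span], beta: int
) -> List[Span]:
    """Copy old mentions whose full beta-context survives in new_page.

    Conservative on purpose: a mention whose beta window would extend past
    a matched region (or past the page boundary) is not copied.
    """
    matcher = SequenceMatcher(None, old_page, new_page, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    copied: List[Span] = []
    for start, end in old_mentions:
        for b in blocks:
            # The window [start - beta, end + beta) must sit inside the block.
            if b.a <= start - beta and end + beta <= b.a + b.size:
                shift = b.b - b.a  # offset of the region in the new page
                copied.append((start + shift, end + shift))
                break
    return copied


old_page = "Speakers: Alice Smith (MIT) and Bob Jones (UW). Room 12."
new_page = "Updated. Speakers: Alice Smith (MIT) and Bob Jones (UW). Room 14."
# Extents of "Alice Smith", "Bob Jones", and the room number "12".
old_mentions = [(10, 21), (32, 41), (53, 55)]

copied = reusable_mentions(old_page, new_page, old_mentions, beta=3)
copied_text = [new_page[s:e] for s, e in copied]
```

In this example the two person mentions are carried over with shifted offsets, while the room number is not, because the edit falls inside its 3-context. A full solution would also rerun the extractor over the regions not covered by copied results and would choose among matchers by cost, which this sketch leaves out.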

  6. Solutions
     • CAPTURING IE RESULTS
       • Level of Reuse
       • IE Results to Capture
       • Storing Captured IE Results
     • REUSING CAPTURED IE RESULTS
       • Scope of Mention Reuse
       • Overall Processing Algorithm
       • Identifying Reuse with Matchers
     • SELECTING A GOOD IE PLAN
       • Searching for Good Plans
       • Cost Model

  7. Evaluation (Data Set)

  8. Experimental Results

  9. Sources:
     • ACM: http://portal.acm.org/citation.cfm?doid=1559845.1559881
     • Overview of SIGMOD 2009: idke.ruc.edu.cn/seminars/2009/07.04/SIGMOD2009%20Overview.ppt
