Optimizing Complex Extraction Programs over Evolving Text Data. Authors: Fei Chen, University of Wisconsin-Madison, Madison, WI, USA; Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA; AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA
Optimizing Complex Extraction Programs over Evolving Text Data • Authors: • Fei Chen, University of Wisconsin-Madison, Madison, WI, USA • Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA • AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA • Jun Yang, Duke University, Durham, NC, USA • Raghu Ramakrishnan, Yahoo! Research, Santa Clara, CA, USA • Presented by: Yogendra Godbole
Introduction • Motivation • Traditional IE methods assume a static corpus • In practice, corpora are dynamic • DBLife (10,000+ URLs, 120+ MB corpus snapshot) • Enterprise intranets • Problem • How to efficiently extract information from dynamic corpora
Problem Definition • Concepts • Data pages, Extractors, Mentions • An extractor E: p → R(a1, a2, …, an) extracts mentions of relation R from page p. A mention of R is a tuple (m1, m2, …, mn) such that each mi is either a mention of attribute ai or nil. • Examples • Assumptions • Each extractor extracts mentions from a single data page at a time
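To make the extractor definition concrete, here is a minimal Python sketch. The relation talk(speaker, room), the regular expression, and the names `Span` and `extract_talks` are hypothetical illustrations, not taken from the paper. Each mention is a tuple of attribute mentions, and an attribute that is absent is represented as `None` (nil).

```python
import re
from typing import NamedTuple, Optional


class Span(NamedTuple):
    start: int  # inclusive character offset in the page
    end: int    # exclusive character offset in the page
    text: str


# Hypothetical pattern for the illustrative relation talk(speaker, room).
# The room part is optional, so the room attribute may be nil.
TALK = re.compile(
    r"(?P<speaker>[A-Z][a-z]+ [A-Z][a-z]+) speaks(?: in (?P<room>Room \d+))?"
)


def extract_talks(page: str) -> list[tuple[Optional[Span], Optional[Span]]]:
    """Extractor E: page -> talk(speaker, room), one tuple per mention."""
    mentions = []
    for m in TALK.finditer(page):
        speaker = Span(m.start("speaker"), m.end("speaker"), m.group("speaker"))
        room = None
        if m.group("room") is not None:
            room = Span(m.start("room"), m.end("room"), m.group("room"))
        mentions.append((speaker, room))
    return mentions


if __name__ == "__main__":
    for mention in extract_talks("Ann Lee speaks in Room 12. Bob Ray speaks."):
        print(mention)
```

The extractor looks at one page only, which matches the single-page assumption on the slide.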
Methods • Concepts • Extractor scope • Let s.start and s.end be the start and end character positions of a string s in a page p. We say an extractor E has scope α iff for any mention m = (m1, …, mn) produced by E, (max_i mi.end − min_i mi.start) < α, where mi.start and mi.end are the start and end character positions of attribute mention mi in page p. • Extractor context • The β-context of mention m in page p is the string p[(m.start − β)..(m.end + β)], i.e., the string of m extended on both sides by β characters. We say extractor E has context β iff for any m and any p′ obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m as a mention. • Challenges • Matchers (find overlapping text regions)
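The scope and context definitions above can be sketched in Python as follows. This is an illustrative sketch only: the names `Span`, `mention_extent`, `within_scope`, and `beta_context` are hypothetical, end offsets are assumed to be exclusive (Python slice convention), and the context window is clipped at the page boundaries.

```python
from typing import NamedTuple, Optional, Sequence


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset (an assumption of this sketch)


def mention_extent(mention: Sequence[Optional[Span]]) -> tuple[int, int]:
    """Return (min_i mi.start, max_i mi.end) over the non-nil attributes."""
    spans = [s for s in mention if s is not None]
    return min(s.start for s in spans), max(s.end for s in spans)


def within_scope(mention: Sequence[Optional[Span]], alpha: int) -> bool:
    """Check (max_i mi.end - min_i mi.start) < alpha for one mention."""
    start, end = mention_extent(mention)
    return (end - start) < alpha


def beta_context(page: str, mention: Sequence[Optional[Span]], beta: int) -> str:
    """Return the mention text extended by beta characters on both sides."""
    start, end = mention_extent(mention)
    lo = max(0, start - beta)          # clip at the start of the page
    hi = min(len(page), end + beta)    # clip at the end of the page
    return page[lo:hi]


if __name__ == "__main__":
    page = "Talk: Ann Lee speaks in Room 12 today."
    speaker = Span(page.index("Ann Lee"), page.index("Ann Lee") + len("Ann Lee"))
    room = Span(page.index("Room 12"), page.index("Room 12") + len("Room 12"))
    mention = (speaker, room)
    print(mention_extent(mention))
    print(within_scope(mention, alpha=40))
    print(repr(beta_context(page, mention, beta=3)))
```

An extractor has scope α only if `within_scope` holds for every mention it can produce, so a check like this can refute an estimated α on sample data but cannot prove it.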
Problem Definition (cont.) • Let P1, …, Pn be consecutive snapshots of a text corpus, ρ be an IE program written in xlog, E1, …, Em be the IE blackboxes (i.e., IE predicates) in ρ, and (α1, β1), …, (αm, βm) be the estimated scopes and contexts for the blackboxes, respectively. Develop a solution to execute ρ over corpus snapshot Pn+1 with minimal cost, by reusing extraction results over P1, …, Pn.
Solutions • CAPTURING IE RESULTS • Level of Reuse • IE Results to Capture • Storing Captured IE Results • REUSING CAPTURED IE RESULTS • Scope of Mention Reuse • Overall Processing Algorithm • Identifying Reuse with Matchers • SELECTING A GOOD IE PLAN • Searching for Good Plans • Cost Model
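The idea behind "Identifying Reuse with Matchers" can be illustrated with a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in matcher, which is an assumption of this example and not one of the paper's matchers. The names `Mention` and `reusable_mentions` are hypothetical. A previously extracted mention is copied to the new page version only if its whole β-context lies inside a single matching region; mentions too close to a page boundary are conservatively not reused, and finding new mentions in the changed regions is outside the scope of the sketch.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Mention(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    value: str


def reusable_mentions(
    old_page: str, new_page: str, old_mentions: list[Mention], beta: int
) -> list[Mention]:
    """Copy old mentions whose beta-context is unchanged in the new page."""
    matcher = SequenceMatcher(None, old_page, new_page, autojunk=False)
    # Each block says old_page[a:a+size] == new_page[b:b+size].
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    copied = []
    for m in old_mentions:
        lo, hi = m.start - beta, m.end + beta  # unclipped context window
        for b in blocks:
            if b.a <= lo and hi <= b.a + b.size:
                shift = b.b - b.a  # offset change between page versions
                copied.append(Mention(m.start + shift, m.end + shift, m.value))
                break
    return copied


if __name__ == "__main__":
    old = "News. Ann Lee speaks in Room 12 today. Bob Ray speaks in Room 7 later."
    new = "Big news! Ann Lee speaks in Room 12 today. Bob Ray speaks in Room 9 later."
    mentions = [
        Mention(old.index("Room 12"), old.index("Room 12") + 7, "Room 12"),
        Mention(old.index("Room 7"), old.index("Room 7") + 6, "Room 7"),
    ]
    for c in reusable_mentions(old, new, mentions, beta=3):
        print(c, repr(new[c.start:c.end]))
```

In this usage example the second mention overlaps an edited character, so it is not copied and would have to be re-extracted from the new page.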
Sources : • ACM : http://portal.acm.org/citation.cfm?doid=1559845.1559881 • Overview of SIGMOD 2009 idke.ruc.edu.cn/seminars/2009/07.04/SIGMOD2009%20Overview.ppt