Toward Best-Effort Information Extraction, by Warren Shen, Pedro DeRose, Robert McCann, AnHai Doan, and Raghu Ramakrishnan. SIGMOD'08, June 2008, Vancouver, British Columbia, Canada, pp. 1031-1042. Presented by Andrew Zitzelberger.
Information Extraction • Many solutions exist for extracting structured data from raw data pages. • But virtually all of these solutions focus on precise information extraction (IE) programs that output exact results.
Current IE Limitations • A partially specified version of the program generally cannot be executed. • It generally takes a long time (days or weeks) to obtain the first meaningful results, partly because of the first limitation. This is not acceptable for time-sensitive applications. • Writing precise IE programs can be a waste of time in some instances.
Example • Given 500 pages, find all houses that cost more than $500,000 and whose high school is Lincoln. • Case 1: Define price as a numeric value and run the approximate program. 9 pages are returned, each containing a number greater than 500,000 and the word Lincoln. Search these 9 pages manually. • Case 2: Suppose 120 pages are returned instead. The program is underspecified, so the Next-Effort assistant is consulted; it asks whether prices are always in bold font. After the developer confirms that they are, this constraint is added to the specification, and this time 35 pages are returned.
iFlex (iterative Flexible Extraction System) • Allows the developer to quickly develop an approximate extraction program (Alog) • The approximate program can then be run to quickly retrieve approximate results (Compact Tables) • To improve results the developer can enlist the aid of the Next-Effort assistant
Xlog (basis for Alog) • Xlog is a variant of datalog. • An Xlog program consists of a number of rules of the form p :- q1, …, qn, where p and the qi are predicates; p is the head and the qi form the body. • Xlog does not allow rules with negated predicates or recursion. • Xlog can accommodate the procedural steps of real-world IE using p-predicates and p-functions.
p-predicates and p-functions • p-predicate • A p-predicate takes the form q(a1, . . . , an, b1, . . . , bm), where the ai and bj are variables and q is associated with some procedural code module. • The associated procedural code module takes as input a tuple (u1, . . . , un), where ui is bound to ai, i ∈ [1, n], and produces as output a set of tuples (u1, . . . , un, v1, . . . , vm). • p-function • A p-function f(a1, . . . , an) takes as input a tuple (u1, . . . , un) and returns a scalar value.
Xlog Example • Extract houses with a price above $500,000, more than 4500 square feet, and with a top high school. • p-predicates • extractHouses(x, p, a, h) • extractSchools(y, s) • p-function • approxMatch(h, s) • Query • R1: houses(x,p,a,h) :- housePages(x), extractHouses(x,p,a,h) • R2: schools(s) :- schoolPages(y), extractSchools(y,s) • R3: Q(x,p,a,h) :- houses(x,p,a,h), schools(s), p>500000, a>4500, approxMatch(h,s)
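The three rules above can be read as a small dataflow. The following Python sketch is only an illustration of that reading, not the Xlog engine: the page dictionaries, the bodies of extract_houses and extract_schools, and the token-overlap definition of approx_match are all assumptions made up for this example.

```python
# Hypothetical stand-ins for the procedural modules behind the p-predicates.
def extract_houses(page):
    """p-predicate extractHouses(x, p, a, h): yields (x, p, a, h) tuples."""
    for price, area, school in page["listings"]:
        yield (page["id"], price, area, school)

def extract_schools(page):
    """p-predicate extractSchools(y, s): yields (y, s) tuples."""
    for name in page["schools"]:
        yield (page["id"], name)

def approx_match(h, s, threshold=0.5):
    """p-function approxMatch(h, s); token-set Jaccard overlap is an assumed definition."""
    a, b = set(h.lower().split()), set(s.lower().split())
    return bool(a | b) and len(a & b) / len(a | b) >= threshold

def run_query(house_pages, school_pages):
    # R1: houses(x,p,a,h) :- housePages(x), extractHouses(x,p,a,h)
    houses = [t for x in house_pages for t in extract_houses(x)]
    # R2: schools(s) :- schoolPages(y), extractSchools(y,s)
    schools = {s for y in school_pages for (_, s) in extract_schools(y)}
    # R3: Q(x,p,a,h) :- houses(x,p,a,h), schools(s), p>500000, a>4500, approxMatch(h,s)
    return [(x, p, a, h) for (x, p, a, h) in houses
            if p > 500000 and a > 4500
            and any(approx_match(h, s) for s in schools)]

# Made-up input pages for illustration only.
house_pages = [
    {"id": "h1", "listings": [(650000, 5000, "Lincoln High")]},
    {"id": "h2", "listings": [(400000, 4800, "Lincoln High")]},
    {"id": "h3", "listings": [(700000, 5200, "Unknown Academy")]},
]
school_pages = [{"id": "s1", "schools": ["Lincoln High School"]}]
result = run_query(house_pages, school_pages)
```

Only the first house passes all three conditions of R3; the second fails the price test and the third has no matching school.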
Xlog Usage and Limitations • To write an Xlog program, the developer must first decompose the program into smaller tasks. • Then p-predicates and p-functions are designed to reflect the decomposition. • Next, procedural modules that implement the p-predicates and p-functions must be designed and written. This takes a lot of time, and the modules must be fairly complete before testing can begin. • Finally, the modules must be linked in.
Alog (best-effort support) • IE predicate – a p-predicate that extracts one or more output spans from a single input document or span. • The procedure-writing stage of Xlog is replaced by the ability to write description rules that are “good enough.” • The developer can still attach procedural modules if desired. • The developer also uses annotations to specify the type of approximation.
Description Rules • Written in the same form as traditional Xlog rules, except that the head of the rule must be an IE predicate. • Can be used to define domain constraints of the form f(a) = v (example: numeric(a) = yes). • Values can be yes, distinct-yes, no, distinct-no, and unknown. • Can also describe text features such as bold-font, followed-by, underlined, hyperlinked, etc. • iFlex provides a rich set of built-in features and an interface for the user to add more.
Feature Interface • Verify(s, f, v) checks whether f(s) = v. • Refine(s, f, v) returns all sub-spans t of s such that f(t) = v. • Each feature is implemented once and stored so that all future Alog programs can make use of it.
Safe Description Rules • Description rules must be safe, meaning that they do not produce an infinite relation. • extractHouses(x, p, a, h) :- numeric(p)=yes, numeric(a)=yes is not safe because it does not specify where p, a, and h are extracted from. • iFlex provides a built-in predicate from(x, y) that conceptually extracts all sub-spans y from document x. • This predicate can be used to easily make rules safe. • extractHouses(x, p, a, h) :- from(x, p), from(x, a), from(x, h), numeric(p)=yes, numeric(a)=yes
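A minimal sketch of the two-operation feature interface, assuming token-level spans and a hand-written numeric feature. The function names, the regular expression, and the brute-force enumeration are illustrative assumptions; the slides do not show how iFlex implements features.

```python
import re

def token_spans(text):
    """All contiguous token sub-spans of text (token granularity is an assumption)."""
    tokens = text.split()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            yield " ".join(tokens[i:j])

def numeric(span):
    """Hypothetical feature f: 'yes' if the span is a single number, else 'no'."""
    return "yes" if re.fullmatch(r"\$?\d[\d,]*", span) else "no"

def verify(s, f, v):
    """Verify(s, f, v): check whether f(s) = v."""
    return f(s) == v

def refine(s, f, v):
    """Refine(s, f, v): all sub-spans t of s with f(t) = v (brute force)."""
    return [t for t in token_spans(s) if verify(t, f, v)]

numbers = refine("Price: $550,000 for 4700 sqft", numeric, "yes")
```

Here numbers holds the two numeric sub-spans of the sentence. A real feature would likely avoid enumerating every sub-span; the brute-force loop is kept only because it mirrors the definition directly.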
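To see why from(x, y) makes the rule safe, note that a document has only finitely many sub-spans, so every variable ranges over a finite set. The sketch below assumes token-level spans and a simplistic numeric test; from_spans, is_numeric, and the sample document are hypothetical.

```python
from itertools import product

def from_spans(x):
    """from(x, y): all contiguous token sub-spans y of document x (assumed granularity)."""
    tokens = x.split()
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

def is_numeric(span):
    """Stand-in for numeric(span) = yes: digits with optional commas."""
    return span.replace(",", "").isdigit()

def extract_houses(x):
    """extractHouses(x,p,a,h) :- from(x,p), from(x,a), from(x,h), numeric(p)=yes, numeric(a)=yes"""
    spans = from_spans(x)
    numeric_spans = [s for s in spans if is_numeric(s)]
    # p and a are restricted to numeric spans; h may be any sub-span of x.
    return [(x, p, a, h) for p, a, h in product(numeric_spans, numeric_spans, spans)]

doc = "Price 550,000 Area 4700 Lincoln"
spans = from_spans(doc)          # 5 tokens give 5 * 6 / 2 = 15 sub-spans
candidates = extract_houses(doc)  # 2 * 2 * 15 candidate tuples
```

The candidate relation is large but finite, which is all that safety requires; narrowing it down is the job of further constraints.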
Annotations • Existence Annotation • Indicates that a tuple in the relation may or may not exist. • schools(s)? :- schoolPages(y), extractSchools(y,s) • Attribute Annotation • Indicates that an attribute takes a value from a given set, but we do not know which value. • houses(x,<p>,<a>,<h>) :- housePages(x), extractHouses(x,p,a,h)
Existence Annotation Example • Suppose we determine that school names are in bold font. • It is not likely that every bold word in the document is a school name. • Thus we can use the existence annotation to specify that each tuple found may or may not be in the actual relation. • Every tuple found is added to a relation, and the power set of that relation is returned to represent the set of relations that are possibly correct.
Attribute Annotation Example • Suppose that each document x in housePages describes exactly one house (that is, x is a key of the relation). • Then we can specify that price, area, and high school each take one of the matching values found on the page. • All possible relations for houses are constructed by selecting one value for each attribute.
Approximate Tables • Need a way to store the set of relations an Alog program produces. • An a-table is a multiset of a-tuples. • An a-tuple is a tuple (V1,…,Vn), where each Vi is a multiset of possible values. • An a-tuple may be annotated with a ‘?’, in which case it is also called a maybe a-tuple. • An a-table represents the set of all possible relations that can be constructed by: • (a) selecting a subset of the maybe a-tuples and all non-maybe a-tuples, then • (b) selecting one possible value for each attribute in each a-tuple in (a).
Compact Tables • A-tables are typically not succinct enough because an Alog rule may produce a huge number of extracted values. • iFlex employs compact tables, which exploit the sequential nature of text to “pack” the set of values in each cell into a much smaller set of so-called assignments. • A compact table is a multiset of compact tuples. A compact tuple is a tuple of cells (c1,…,cn), where each cell ci is a multiset of assignments or an expansion cell. A compact tuple may optionally be designated as a maybe compact tuple, denoted with a ‘?’.
Assignments • exact • exact(s) – encodes a value that is exactly span s • contain • contain(s) – encodes all values that are sub-spans of s on the page (example: contain(“Cherry”) includes {“C”, “Ch”, …, “Cherry”})
Expansion Cells • Suppose a tuple t has cells (c1, …, ci, …, cn), where ci = expand(v1, …, vk). • t can be expanded into a set of compact tuples obtained by replacing cell ci with an assignment exact(vj): (c1, …, exact(vj), …, cn), where 1 ≤ j ≤ k.
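Steps (a) and (b) can be spelled out as a brute-force enumeration. This is a sketch for intuition only: the (cells, is_maybe) encoding of an a-tuple and the sample data are assumptions, and a real system would never materialize all possible relations.

```python
from itertools import chain, combinations, product

def possible_relations(a_table):
    """a_table: list of (cells, is_maybe); cells is a tuple of lists of possible values."""
    sure = [cells for cells, is_maybe in a_table if not is_maybe]
    maybes = [cells for cells, is_maybe in a_table if is_maybe]
    # (a) every subset of the maybe a-tuples, together with all non-maybe a-tuples
    subsets = chain.from_iterable(
        combinations(maybes, k) for k in range(len(maybes) + 1))
    for subset in subsets:
        chosen = sure + list(subset)
        # (b) one possible value for each attribute of each chosen a-tuple
        per_tuple = [list(product(*cells)) for cells in chosen]
        for relation in product(*per_tuple):
            yield list(relation)

# Made-up a-table: one certain a-tuple and one maybe a-tuple.
a_table = [
    ((["h1"], [500000, 550000], ["Lincoln"]), False),
    ((["h2"], [600000], ["Lincoln", "Vanhise"]), True),
]
worlds = list(possible_relations(a_table))
```

The certain a-tuple contributes two choices and the maybe a-tuple is either absent or contributes two more, so this table stands for 2 + 2 * 2 = 6 possible relations.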
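A sketch of how the two assignment kinds could be decoded into the values they stand for. Spans are modeled as character offsets into the page, following the character-level example above; the class names and the values helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exact:
    start: int  # span offsets into the page text (assumed representation)
    end: int

@dataclass(frozen=True)
class Contain:
    start: int
    end: int

def values(assignment, page):
    """Expand an assignment into the list of values it encodes."""
    if isinstance(assignment, Exact):
        return [page[assignment.start:assignment.end]]
    # contain(s): every non-empty sub-span of s
    return [page[i:j]
            for i in range(assignment.start, assignment.end)
            for j in range(i + 1, assignment.end + 1)]

page = "Cherry Hills"
exact_values = values(Exact(0, 6), page)
contain_values = values(Contain(0, 6), page)
```

A single contain assignment over the 6-character span stands for 6 * 7 / 2 = 21 sub-spans, which is the packing that makes compact tables smaller than a-tables.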
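The expansion step can be sketched as follows, assuming a compact tuple is a Python tuple of cells, an expansion cell is written ("expand", [v1, ..., vk]), and an ordinary cell is a list of (kind, span) assignments. This encoding is made up for the example.

```python
def expand(compact_tuple):
    """Replace the first expansion cell with exact(vj), once per value vj."""
    for i, cell in enumerate(compact_tuple):
        if isinstance(cell, tuple) and cell[0] == "expand":
            return [compact_tuple[:i] + ([("exact", v)],) + compact_tuple[i + 1:]
                    for v in cell[1]]
    return [compact_tuple]  # no expansion cell: nothing to do

t = (
    [("exact", "page1")],
    ("expand", ["Lincoln", "Vanhise"]),
    [("contain", "$550,000 obo")],
)
expanded = expand(t)
```

The result is k compact tuples (here k = 2) that differ only in the cell that used to hold the expansion.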
Compact Table Limitations • Not a complete model for approximate data (cannot do mutual exclusion) • Not closed under traditional relational operators
Modified Relational Operators • Ensure superset semantics – the result is always a superset of the actual results. • Projection – project without duplicate detection. • Selection – if any of the possible tuples in a compact tuple meets the selection condition, the compact tuple is retained. If only some of the possible tuples meet the condition, it becomes a maybe compact tuple. • θ-join – evaluate the θ condition on all compact tuples in the Cartesian product using the selection criteria.
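The selection rule can be sketched over a simplified table in which each cell is just a list of possible values rather than a set of assignments. That simplification, the (cells, is_maybe) encoding, and the choice to leave cell contents unpruned are assumptions of this example.

```python
from itertools import product

def select(compact_table, predicate):
    """Superset-semantics selection over (cells, is_maybe) compact tuples."""
    out = []
    for cells, is_maybe in compact_table:
        outcomes = [predicate(*combo) for combo in product(*cells)]
        if not any(outcomes):
            continue  # no possible tuple qualifies: safe to drop
        # Only some qualify: keep the tuple but mark it as a maybe tuple.
        out.append((cells, is_maybe or not all(outcomes)))
    return out

# Made-up table of (page, price) compact tuples.
table = [
    ((["h1"], [400000, 550000]), False),
    ((["h2"], [300000]), False),
    ((["h3"], [700000]), False),
]
selected = select(table, lambda x, p: p > 500000)
```

h2 is dropped because none of its possible tuples qualifies, h3 is kept as is, and h1 is kept but becomes a maybe tuple, so the output still covers every relation the exact selection could produce.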
Approximate Query Processing • Unfold all rules (unifying variables if necessary) until only IE predicates remain that are associated with procedures in the program. • Construct a logical plan fragment
Next-Effort Assistant • Suggests ways to refine the current information extraction program by asking the developer questions. • Example: “is price in bold font?” • iFlex adds new constraints to the program based on the developer's responses. • If the number of tuples does not change for k iterations, the assistant can notify the developer that the results have converged.
Question Selection Strategies • Sequential • Rank attributes in decreasing order of importance (using various heuristics). • Always ask questions about the most important attribute. • Simulation • Ask the question whose answer is expected to eliminate the most candidate results. • The results of each stage of the execution plan are stored, so that only the parts affected by a change have to be rerun.
Evaluation • Domains: Movies, DBLP, Books • Comparisons in performance are based on the time it takes to write the extraction program (or to do the extraction by hand in the manual case). • Times are averaged over 1-3 volunteers for each task. • Timing stops when the correct result is obtained or the program converges.
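The convergence test in the last bullet amounts to checking that the result size has stayed the same over the last k iterations. A minimal sketch, assuming the assistant records the number of tuples after each iteration; has_converged and the sample sizes are hypothetical.

```python
def has_converged(sizes, k):
    """True if the result size did not change over the last k iterations.

    sizes: number of result tuples after each iteration, oldest first.
    """
    return len(sizes) > k and len(set(sizes[-(k + 1):])) == 1

history = [120, 35, 35, 35]
converged_k2 = has_converged(history, 2)  # sizes 35, 35, 35: unchanged twice
converged_k3 = has_converged(history, 3)  # would need 120 to equal 35
```

An unchanged size is only a proxy for an unchanged result, which is presumably why the assistant notifies the developer rather than stopping on its own.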
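One way to read the simulation strategy is as a one-step lookahead: for each candidate question, try each possible answer, measure the resulting result size, and ask the question with the smallest expected size. The sketch below assumes equally likely answers and a caller-supplied simulate function; both are illustrative assumptions, not details given in the slides.

```python
def pick_question(questions, simulate):
    """Pick the question with the smallest expected result size.

    questions: dict mapping each question to its list of possible answers.
    simulate(question, answer): result size if that answer were given.
    Answers are assumed equally likely (an assumption of this sketch).
    """
    def expected_size(q):
        answers = questions[q]
        return sum(simulate(q, a) for a in answers) / len(answers)
    return min(questions, key=expected_size)

# Made-up simulated result sizes.
simulated = {
    ("Is price in bold font?", "yes"): 35,
    ("Is price in bold font?", "no"): 120,
    ("Is school underlined?", "yes"): 90,
    ("Is school underlined?", "no"): 110,
}
questions = {
    "Is price in bold font?": ["yes", "no"],
    "Is school underlined?": ["yes", "no"],
}
best = pick_question(questions, lambda q, a: simulated[(q, a)])
```

Caching the intermediate results of the execution plan, as the last bullet notes, is what would make repeated simulation of this kind affordable.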
Results • iFlex reduced development time by 25-98% in all 27 scenarios.
Results • iFlex converged to the correct result in 23 of the 27 scenarios (not shown due to space limitations). • In the four remaining cases, the converged result was 170%, 161%, 114%, and 102% of the size of the correct result. • Two of those cases had a small number of tuples.
Real-World System Results • Tasks took 104, 351, and 107 seconds to run. • iFlex running time is comparable to that of Perl extraction programs.
Conclusion • iFlex is a best-effort information extraction system that can be used to quickly obtain approximate results. • iFlex significantly reduces developer time in creating information extraction programs. • iFlex is efficient enough to run at a speed comparable to Perl. • The simulation question-selection strategy of the Next-Effort assistant outperforms the sequential strategy.