Information extraction from web pages using extraction ontologies Martin Labský KEG Seminar, 28th November 2006
Agenda • Purpose • Knowledge sources • Extraction ontology • Finding attribute candidates • Instance parsing • Wrapper induction • Ex demo • Discussion
Purpose • Extract objects from documents • object = instance of a class from an ontology • document = text, possibly with formatting • Objects • belong to known, well-defined class(es) • classes consist of attributes, axioms, constraints • Documents • may come in collections of arbitrary sizes • Structured, semi-structured, free-text • Extraction should improve if: • documents contain some formatting (e.g. HTML) • this formatting is similar within or across document(s) • Examples • Product catalogues (e.g. detailed product descriptions) • Weather forecast sites (e.g. forecasts for the next day) • Restaurant descriptions (cuisine, opening hours etc.) • Contact information • Financial news
Knowledge sources • Why • for some attributes of a class it is often easier to obtain manual extraction knowledge than training data, and vice versa (experience from Bicycle product IE) • people can experiment with manually encoded patterns alone and quickly investigate whether an IE task is feasible; if so, training data can be added for the attributes that require it • 1. Knowledge entered manually by an expert • the only mandatory source • class definitions + extraction evidence • 2. Training data • sample attribute values or sample instances • possibly coupled with referring documents • used to induce typical content and context of extractable items, cardinalities and orderings of class attributes... • 3. Common formatting structure • of observed instances • in a single document, or • across documents from the same source
Extraction ontology • Attribute data types • assigned manually • Cardinality ranges • assigned manually, cardinality probability estimates could be trained • Patterns for content and typical context of attributes: • regular grammars at the level of words, lemmas, POS tags or word types (uppercase, capital, number, alphanumeric etc.) • phrase lists • attribute value lengths • the above equipped with probability estimates • assigned manually or to be induced from training data • For numeric attributes: • units • estimated probability distributions (e.g. tables, Gaussian) • assigned manually or trained • Sample Ex ontologies contacts_en.xml or monitors.eol.xml • see class and attribute definitions, data types • ECMAScript axioms, • regular pattern language, • pattern precision and recall parameters • Sample instances monitors.tsv and *.html
Finding attribute candidates (1) • Preprocessing • document tokenized • parsed into a light-weight DOM (if HTML) • Matching of attributes’ regular patterns • content and context patterns matched • each pattern has: • Pattern precision • estimates how often the pattern actually identifies a value of the attribute in question • P(attribute|pattern) • Pattern recall • estimates how many values of the attribute in question satisfy the pattern • P(pattern|attribute)
Finding attribute candidates (2) • Create a new attribute candidate (AC) wherever at least one pattern matches • An AC for attribute A is scored by the estimate of • P(A|Φ), where • Φ = the matched state of all patterns known for A • Independence is assumed for all patterns E, F from the set Φ of known patterns for attribute A • AC score computation: for the derivation see ex.pdf
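As a hedged illustration of the independence assumption (the derivation itself is in ex.pdf and not reproduced here), one standard way to combine the matched patterns' precisions P(A|pattern) into P(A|patterns) is a naive-Bayes-style log-odds sum. The prior P(A) and all numbers below are assumed values, not Ex's:

```java
public class AcScore {
    static double logit(double p) { return Math.log(p / (1.0 - p)); }
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /**
     * P(A | matched patterns) under the assumption that patterns are
     * conditionally independent given A and given not-A: start from the
     * prior's log-odds and add each matched pattern's log-odds shift.
     */
    static double score(double prior, double[] precisions) {
        double logOdds = logit(prior);
        for (double p : precisions) {
            // each matched pattern with precision P(A|pattern) shifts the
            // log-odds by its own log-odds relative to the prior
            logOdds += logit(p) - logit(prior);
        }
        return sigmoid(logOdds);
    }

    public static void main(String[] args) {
        // two agreeing patterns of precision 0.9 reinforce each other,
        // pushing the score above 0.9
        System.out.println(score(0.1, new double[]{0.9, 0.9}));
    }
}
```

With a single matched pattern the score reduces to that pattern's precision, which matches the intuition behind P(A|pattern).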
Finding attribute candidates (3) • Most attributes can occur independently or as part of their containing class • Each attribute is equipped with an estimate of • P(engaged|A), e.g. 0.75 • Three ways of explaining an AC: • part of an instance; the AC score is then computed as: • P(A|patterns) * P(engaged|A) • standalone: • P(A|patterns) * (1 - P(engaged|A)) • mistake: • 1 - P(A|patterns)
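The three explanation scores can be written down directly from the slide's formulas; note that they sum to one, so they form a distribution over the three explanations. A minimal sketch with illustrative values:

```java
public class AcExplanation {
    // part-of-instance explanation: the AC belongs to a containing class
    static double engaged(double pA, double pEngaged) {
        return pA * pEngaged;
    }
    // standalone explanation: the AC occurs outside any instance
    static double standalone(double pA, double pEngaged) {
        return pA * (1.0 - pEngaged);
    }
    // mistake explanation: the AC is not a value of A at all
    static double mistake(double pA) {
        return 1.0 - pA;
    }

    public static void main(String[] args) {
        double pA = 0.8;        // P(A|patterns), illustrative
        double pEngaged = 0.75; // P(engaged|A), as on the slide
        System.out.println(engaged(pA, pEngaged));
        System.out.println(standalone(pA, pEngaged));
        System.out.println(mistake(pA));
    }
}
```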
Finding attribute candidates (4) • ACs naturally overlap; they form a lattice within the document • [Figure: AC lattice — each node carries the AC's ID and the indices of its start and end tokens, edges run from an initial null state and are weighted by log(AC standalone score); the best path scores -0.5754] • if we wanted just standalone attributes, we would be done at this point
Parsing instances (1) • Initially, each AC is converted into a singleton instance candidate IC = {AC} • Nested ACs are supported • Then, iteratively, the most promising ICs are expanded: neighboring ACs are added to them • Expansion is possible only if no constraints are violated (e.g. max cardinality reached, or ECMAScript axioms fail; selective axiom evaluation) • IC scoring • so far, IC score = log(AC engaged score) + penalties for skipped ACs (orphans) within the IC's span + fixed penalties for the IC crossing formatting blocks • we need to incorporate ASAP: • likelihood of the IC's attribute cardinalities and ordering • learnable formatting-block crossing penalties
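The current IC score can be sketched as the log-linear sum described above. The two penalty weights are assumed placeholders, since the slide notes the block-crossing penalties are fixed for now and should become learnable:

```java
import java.util.List;

public class IcScore {
    static final double ORPHAN_PENALTY = 1.0; // assumed per-orphan penalty
    static final double CROSS_PENALTY = 0.5;  // assumed per-block-crossing penalty

    /**
     * IC score = sum of log engaged-AC scores, minus a penalty for each
     * skipped AC (orphan) within the IC's span, minus a fixed penalty
     * for each formatting block the IC crosses.
     */
    static double score(List<Double> engagedAcScores,
                        int orphans, int blocksCrossed) {
        double s = 0.0;
        for (double p : engagedAcScores) s += Math.log(p);
        return s - orphans * ORPHAN_PENALTY - blocksCrossed * CROSS_PENALTY;
    }
}
```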
Parsing instances (2) Simplified IC parsing algorithm • 1. Create a set ICs_singletons = { {AC}, {AC}, ... } of singleton ICs, each containing just 1 AC • 2. Enrich ICs_singletons by adding ICs with 2 or more contained attribute values (still referred to as singletons since they have a single containing root attribute) • 3. Create a set of instance candidates ICs_valid = {} • 4. Create a queue of instance candidates ICs_work = {}; keep ICs_work sorted by IC score, with a max size of K (heap) • 5. Add the content of ICs_singletons to ICs_work • 6. Pick the best-scoring IC_best from ICs_work • 7. Set the beam area of the document BA = span of the document fragment (e.g. HTML element) containing IC_best • 8. While expanding IC_best: • if BA contains no more ACs, expand BA to the parent BA • within BA, try adding to the IC those IC_near_singleton which are singletons and closest to the IC: IC_new = IC + IC_near_singleton • if IC_new does not violate integrity constraints (e.g. max cardinality already reached in the IC, or axiom failure) • add IC_new to ICs_work • if IC_new is valid, add it to ICs_valid • break if • a large portion of the ICs_near_singleton were refused due to integrity constraints, or • BA is too large or too high in the formatting block tree • 9. Remove IC_best from the document, and if ICs_work is not empty, go to 6 • 10. Return ICs_valid
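The algorithm above can be sketched in heavily simplified form: ACs are just indices with scores, the beam area and axiom machinery are reduced to a single max-cardinality check, and every constraint-satisfying IC counts as valid. This shows the shape of the best-first search, not Ex's actual implementation:

```java
import java.util.*;

public class IcParser {
    static final int MAX_CARD = 3;     // toy stand-in for integrity constraints
    static final int MAX_RESULTS = 50; // caps the search, simplifying the K-heap

    // IC score here: sum of log AC scores (no penalties in this sketch)
    static double score(Set<Integer> ic, double[] acScores) {
        double s = 0.0;
        for (int i : ic) s += Math.log(acScores[i]);
        return s;
    }

    static List<Set<Integer>> parse(double[] acScores) {
        // work queue kept sorted by IC score, best first
        PriorityQueue<Set<Integer>> work = new PriorityQueue<>(
                Comparator.comparingDouble((Set<Integer> ic) -> -score(ic, acScores)));
        Set<Set<Integer>> seen = new HashSet<>();
        List<Set<Integer>> results = new ArrayList<>();
        // seed with singleton ICs, one per AC
        for (int i = 0; i < acScores.length; i++) {
            Set<Integer> singleton = new TreeSet<>(List.of(i));
            if (seen.add(singleton)) work.add(singleton);
        }
        while (!work.isEmpty() && results.size() < MAX_RESULTS) {
            Set<Integer> best = work.poll();   // pick the best-scoring IC
            results.add(best);                 // constraint-satisfying => "valid" here
            // expand by adding nearby ACs not yet contained
            for (int i = 0; i < acScores.length; i++) {
                if (best.contains(i)) continue;
                Set<Integer> grown = new TreeSet<>(best);
                grown.add(i);
                // integrity-constraint check, reduced to max cardinality
                if (grown.size() <= MAX_CARD && seen.add(grown)) work.add(grown);
            }
        }
        return results;
    }
}
```

With two ACs scored 0.9 and 0.8, the sketch returns {0}, {1} and {0,1}, best first.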
Parsing instances (3) • Worked example • Class C: X card=1 (may contain Y), Y card=1..n, Z card=1..n • [Figure: tokens a..n annotated with ACs AX, AY, AZ and garbage, inside a block structure TABLE > TR > TD, TD; across the animation steps, ICs such as {AX}, {AY}, {AXAY}, {AX[AY]}, {AXAYAZ} are built up over the token line]
Parsing instances (4) • From the instance parser, we get a set of valid ICs • similar to ACs, these may overlap • valid ICs form a lattice within the analyzed document
Parsing instances (5) • Since we want to extract both valid instances and standalone attributes, we merge the AC lattice and the valid IC lattice: • ICs which interfere with other ICs are penalized in proportion to the parts of the interfering ICs they leave unexplained
Parsing instances (6) • The best path is found through the merged lattice • This should be the sequence of standalone ACs and valid ICs that best explains the document content
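If each standalone AC and valid IC is viewed as an interval over document tokens carrying a log score, the best-path search reduces (ignoring the interference penalties of the previous slide) to weighted interval scheduling. A sketch under that simplification:

```java
import java.util.*;

public class BestPath {
    /** A standalone AC or valid IC spanning tokens [start, end], with a log score. */
    record Item(int start, int end, double logScore) {}

    /**
     * Best total log score of a set of non-overlapping items: classic
     * weighted-interval-scheduling dynamic programming over items sorted
     * by end token.
     */
    static double bestScore(List<Item> items) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingInt(Item::end));
        int n = sorted.size();
        double[] best = new double[n + 1]; // best[i] = best total over first i items
        for (int i = 1; i <= n; i++) {
            Item it = sorted.get(i - 1);
            double take = it.logScore();
            // add the best solution over items ending strictly before this one
            for (int j = i - 1; j >= 1; j--) {
                if (sorted.get(j - 1).end() < it.start()) { take += best[j]; break; }
            }
            best[i] = Math.max(best[i - 1], take); // skip it, or take it
        }
        return best[n];
    }
}
```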
Wrapper induction (1) • During IC parsing, we search for common formatting patterns which encapsulate some of the ICs being generated • E.g. a person’s first name and last name (if extracted as separate attributes) could regularly be contained in the formatting pattern: • TR[1..n] { TD[0] {person.firstname} TD[1] {person.lastname} } • A formatting pattern is defined as the first block area (HTML tag) containing the whole IC, plus the paths from that area to each of the IC’s attributes • If “reliable” formatting patterns are found, we add them to the context patterns of the respective attributes. For such an attribute A, we then: • boost/lower the scores of all ACs of A, • create new ACs for A where the formatting patterns match and no AC existed before, • rescore all ICs which contain rescored ACs, • add new singleton ICs for the newly added ACs.
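The formatting-pattern definition above — first common block area plus the per-attribute paths below it — can be sketched over plain '/'-separated tag paths, a simplification of real DOM handling:

```java
import java.util.*;

public class FormattingPattern {
    /**
     * Derive a formatting pattern from the DOM paths of an IC's attribute
     * values: the longest common path prefix is the first block area
     * containing the whole IC; each attribute keeps its relative path.
     */
    static String induce(Map<String, String> attrPaths) {
        if (attrPaths.isEmpty()) return "";
        List<String[]> paths = new ArrayList<>();
        for (String p : attrPaths.values()) paths.add(p.split("/"));
        int prefix = 0; // length of the longest common prefix
        outer:
        while (true) {
            String tag = null;
            for (String[] parts : paths) {
                if (prefix >= parts.length) break outer;
                if (tag == null) tag = parts[prefix];
                else if (!tag.equals(parts[prefix])) break outer;
            }
            prefix++;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(String.join("/", Arrays.copyOf(paths.get(0), prefix)))
          .append(" { ");
        for (Map.Entry<String, String> e : attrPaths.entrySet()) {
            String[] parts = e.getValue().split("/");
            String rel = String.join("/",
                    Arrays.copyOfRange(parts, prefix, parts.length));
            sb.append(rel).append("{").append(e.getKey()).append("} ");
        }
        return sb.append("}").toString();
    }
}
```

For firstname at TABLE/TR/TD[0] and lastname at TABLE/TR/TD[1], this yields a pattern of the same shape as the slide's TR[1..n] example.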
Wrapper induction (2) Formatting pattern induction process • Segment all ICs from the parser’s queue (not only the valid ones) into clusters of ICs with the same attributes populated • e.g. {firstname: Varel, lastname: Fristensky} and {firstname: Karel, lastname: Nemec} would fit into one cluster • For each cluster, build an IC lattice through the document and find the best path of non-overlapping ICs • For ICs on the best path, compute the counts of each distinct formatting pattern. For each formatting pattern FP, estimate • precision(FP) = C(FP, instance from cluster) / C(instance from cluster) • recall(FP) = C(FP, instance from cluster) / C(FP), where C() denotes observed counts • We induce a new pattern if precision(FP), recall(FP) and C(FP, instance from cluster) all reach configurable thresholds
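The induction test above can be written down directly from the slide's count-based formulas; the threshold values below are illustrative stand-ins for Ex's configurable ones:

```java
public class PatternThresholds {
    static final double MIN_PRECISION = 0.8; // illustrative thresholds,
    static final double MIN_RECALL = 0.5;    // configurable in Ex
    static final int MIN_COUNT = 3;

    // precision(FP) = C(FP, instance from cluster) / C(instance from cluster)
    static double precision(int cFpAndInstance, int cInstance) {
        return (double) cFpAndInstance / cInstance;
    }

    // recall(FP) = C(FP, instance from cluster) / C(FP)
    static double recall(int cFpAndInstance, int cFp) {
        return (double) cFpAndInstance / cFp;
    }

    /** Induce a new context pattern only if all three thresholds are reached. */
    static boolean induceNewPattern(int cFpAndInstance, int cInstance, int cFp) {
        return precision(cFpAndInstance, cInstance) >= MIN_PRECISION
            && recall(cFpAndInstance, cFp) >= MIN_RECALL
            && cFpAndInstance >= MIN_COUNT;
    }
}
```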
Wrapper induction (3) • Plugging wrapper generation into the instance parsing algorithm • in the current implementation, formatting patterns are only induced once for singleton ICs • Parallel parsing of multiple documents • documents from the same source (e.g. website) often share formatting patterns; we expect measurable improvement over the single document extraction approach • to be implemented • More experiments needed
Ex demo • Command line version • GUI available • the GUI of the Information Extraction Toolkit exists as a separate project, ready to accommodate other IE engines • Simple API to enable usage in third-party systems • Everything written in Java • however, it may connect to lemmatizers / POS taggers / other tools written in arbitrary languages • Ex: ~26,000 lines of code • Information Extraction Toolkit: ~2,500 lines of code
Discussion • Thank you.