Another approach to Information Extraction using Extended Ontologies. Marek Nekvasil, xnekm06@vse.cz. Agenda: gathering information with wrappers; ways to build a wrapper; using and extending an ontology; templates and patterns; suggesting a simple wrapper induction method.
Another approach to Information Extraction using Extended Ontologies Marek Nekvasil xnekm06@vse.cz
agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method
wrapping up a document • a synonym for identifying relevant information in the document • there are many ways to wrap up a document
wrapper classes • string-based wrappers • Kushmerick's wrapper classes • tree-based wrappers • XPath • Elog • finite automata • methods comparison
LR class • basic class (LR stands for Left-Right) • 2n parameters (2 for every part of the extracted tuple) • example document, with a suitable wrapper LR(<B>; </B>; <I>; </I>):
    <HTML>
    <TITLE>Ceny pobytů</TITLE>
    <BODY>
    <B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>
    <B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>
    <B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR>
    <B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>
    </BODY>
    </HTML>
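The way an LR wrapper walks through a page can be sketched in Python. This is only a minimal illustration, not Kushmerick's original formulation: the function name `lr_extract` and the shortened sample page are assumptions made for the example, and a partial trailing row is simply discarded.

```python
def lr_extract(page, delimiters):
    """Extract tuples from `page` with an LR wrapper.

    `delimiters` holds one (left, right) pair per tuple attribute,
    i.e. the 2n parameters of the LR class.
    """
    tuples = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples          # no further left delimiter: done
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples          # unterminated value: stop
            row.append(page[start:end])
            pos = end + len(right)     # continue scanning after the value
        tuples.append(tuple(row))


# Shortened version of the example page (assumption: two rows are enough here).
PAGE = (
    "<HTML><TITLE>Ceny pobytů</TITLE><BODY>"
    "<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>"
    "<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>"
    "</BODY></HTML>"
)

rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

Each extracted tuple pairs a destination with its price; the `<BODY>` and `<BR>` tags are not mistaken for `<B>` because the whole delimiter string must match.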
other classes derived from LR • Nicholas Kushmerick's classes • HLRT (Head-Left-Right-Tail) • OCLR (Opening-Closing-Left-Right) • HOCLRT (…) • N-LR or N-HLRT (Nested-…)
XPath wrappers • use XPath queries to identify data in the tree representation of a document • often use only the most basic features of the XPath language • usually build queries from the root of the document
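A small Python sketch of an XPath wrapper follows. It assumes a well-formed XML document (real HTML usually needs an HTML parser first) and uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath; the sample document and the two queries are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page; the structure is an assumption for this sketch.
DOC = """<html><body>
<p><b>Egypt - Ghiza</b> <i>19 049 Kč</i></p>
<p><b>Mallorca - Santa Ponsa</b> <i>21 100 Kč</i></p>
</body></html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Queries built from the document root, using only basic path steps.
destinations = [b.text for b in root.findall("./body/p/b")]
prices = [i.text for i in root.findall("./body/p/i")]
```

Both queries use nothing beyond child steps, which matches the observation that wrappers rarely need more than the basic features of the language.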
Elog • declarative language similar to Prolog • uses predicates to generate instances • used in the Lixto tool • example of Elog wrapper
finite automata • FSM can be used for wrapping in various ways • usually used for searching in the linear representation of a document • Carme shows it is possible to use FSM for searching in the tree structure
methods comparison • Tree-based wrappers are more error-prone than linear string-based wrappers • Elog and N-LR allow extraction not only from tabular data structure but also from a general hierarchical data structure • XPath wrappers reuse a well defined standard
agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method
building a wrapper • by hand • Oracle and PAC analysis • interactive visual pattern design • tree-fragment queries • tree traversal pattern generalization • and many other …
PAC analysis • uses an abstract function called the Oracle to gather enough example instances of the extracted class (assuming the Oracle is embodied by a human) • gathers examples until it has enough (N) to suggest a wrapper class with a designated error e at a given probability level 1-d, using the formula: • finally searches for the first set of wrapper parameters that matches all the examples
interactive visual pattern design • used in the Lixto tool to craft wrappers in the Elog language • first the user points out example instances, which produces a generating rule, a pattern • then the user adds conditions (filters) to the patterns to restrict them, which is done visually
tree-fragment queries • search for a minimal XPath query that forms a tree-prefix of all examples • tree-prefix examples
tree traversal pattern generalization • applies graph theory to the generalized document tree • searches for the shortest path through the document tree and thus forms an efficient XPath query
agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method
ontologies and wrappers • an ontology is a knowledge model • we can make a knowledge model that summarizes what information we are going to extract • with a small extension we can use the ontology to identify examples of what we are going to extract • these examples can be used to build a wrapper with any method
ontology in OWL
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:owl="http://www.w3.org/2002/07/owl#">
      <owl:Ontology rdf:about="">
        <owl:imports rdf:resource="http://www.somedomain.com/x"/>
      </owl:Ontology>
      <owl:Class rdf:ID="class_A">
        <owl:disjointWith rdf:resource="#class_B"/>
      </owl:Class>
      <owl:Class rdf:ID="class_C">
        <rdfs:subClassOf rdf:resource="#class_A"/>
      </owl:Class>
      <owl:DatatypeProperty rdf:ID="property_A">
        <rdfs:domain rdf:resource="#class_A"/>
      </owl:DatatypeProperty>
    </rdf:RDF>
extending OWL • in terms of ontologies, we extract values of datatype properties • therefore we need some technique to identify (and rank) possible instances of these values • we suggest a way to define complex templates of typical values of a datatype property
placing a template into the ontology • we establish a new namespace: xmlns:ot="http://st.vse.cz/~XNEKM06/ontologytemplates#" • in the new namespace we use an element <ot:Template> to write a template down • such a template can only be attached to a datatype property
    <owl:DatatypeProperty rdf:ID="property_A">
      <rdfs:domain rdf:resource="#class_B"/>
      <ot:Template ...> ... </ot:Template>
    </owl:DatatypeProperty>
agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method
patterns • pattern: a general rule that can be evaluated against any continuous part of a document to determine the degree to which that part matches
template • template: a set of rules that can be evaluated as a whole against any continuous part of a document to determine the degree to which that part matches • a template is a special case of a pattern • thus a template can contain other templates
simple patterns • a pattern has an internal algorithm that can (given some parameters) identify possible matches throughout the document, with a pattern match degree as its output • moreover, we need to infer a degree of evidence certainty, which expresses our confidence that the match really is a value the pattern was meant to identify
deriving the degree of evidence certainty 1 • let us define two propositions: • A: the pattern algorithm identified a given part of a document • E: the part really should have been identified by that pattern • A and E are logical propositions, and in fuzzy logic their truth value is a real number from the interval [0, 1]
deriving the degree of evidence certainty 2 • intuitively there should be a relation $A \Rightarrow E$ • thanks to the modus ponens rule we can write in basic logic $(A \,\&\, (A \Rightarrow E)) \Rightarrow E$ • from that we can derive $\mathrm{val}(E) \ge \mathrm{val}(A \,\&\, (A \Rightarrow E))$ • and since we do not want to overestimate the evidence certainty, we set $\mathrm{val}(E) = \mathrm{val}(A \,\&\, (A \Rightarrow E))$
deriving the degree of evidence certainty 3 • now we introduce a parameter of the pattern, $\mathrm{val}(A \Rightarrow E) = p$ • we call it pattern precision • using, for example, Łukasiewicz logic we can derive $e = \max(0,\ a + p - 1)$, where $e$ stands for $\mathrm{val}(E)$ and $a$ for $\mathrm{val}(A)$
deriving the degree of evidence certainty 4 • without doubt it‘s true that(E A) E, and (A E) E • while in Łukasiewicz‘ logic we can derive from the above(A S E) (E A) • and therefore(E A)(A E)
deriving the degree of evidence certainty 5 • by substitution we can derive $\neg(E \Rightarrow A) \Rightarrow E$ • and we introduce a second parameter, $\mathrm{val}(E \Rightarrow A) = c$, which we call pattern completeness
deriving the degree of evidence certainty 6 • combining the two rules above we can derive the final rule $((A \,\&\, (A \Rightarrow E)) \lor \neg(E \Rightarrow A)) \Rightarrow E$ • and since we still do not want to overestimate the evidence certainty, we can write (in Łukasiewicz logic) $e = \max(\max(0,\ a + p - 1),\ 1 - c)$
simple patterns summary • a pattern identifies a given place in the document with a pattern match degree denoted a • every pattern has two parameters: p (precision) and c (completeness) • the degree of pattern evidence certainty can then be calculated as $e = \max(a + p - 1,\ 1 - c)$
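The summary formula is easy to check numerically. The following Python sketch evaluates it for given a, p and c; the function name `evidence_certainty` and the range check are assumptions added for the example, not part of the method itself.

```python
def evidence_certainty(a, p, c):
    """Evidence certainty e = max(max(0, a + p - 1), 1 - c).

    a: pattern match degree, p: precision, c: completeness, all in [0, 1].
    """
    for name, value in (("a", a), ("p", p), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Lukasiewicz conjunction of A and (A => E)
    conjunction = max(0.0, a + p - 1.0)
    # max-disjunction with the negation of (E => A)
    return max(conjunction, 1.0 - c)


strong_match = evidence_certainty(a=1.0, p=0.75, c=0.5)
weak_match = evidence_certainty(a=0.25, p=0.5, c=0.75)
```

Because 1 - c is never negative, the inner max(0, ...) can be dropped without changing the result, which is why the summary writes the formula in the shorter form.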
composite patterns • to form a template we can combine individual simple patterns • the evidence certainty is computed in the same way as for simple patterns, but we first have to derive a pattern match degree for the composite
deriving the composite pattern match degree • joining the evidences of two patterns can be viewed as joining two fuzzy sets • for this we can use either a set union (associated with disjunction) or a set intersection (associated with conjunction) • therefore we compute the composite pattern match degree as the conjunction or disjunction of the evidence certainties of all component patterns • so we get two kinds of templates: conjoint and disjoint
the nature of templates • for the calculations we use the formulae of min-conjunction and max-disjunction • the parameters p and c of component patterns now get a new meaning • in a disjoint template a high value of p means that the pattern forms a sufficient condition • in a conjoint template a high value of c means that the pattern forms a necessary condition
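The combination step can be sketched in Python as follows. This is an illustration under assumptions: the function names are invented for the example, and the template's own p and c are applied with the same formula as for simple patterns, as the slides describe.

```python
def evidence_certainty(a, p, c):
    """e = max(a + p - 1, 1 - c), the simple-pattern formula."""
    return max(a + p - 1.0, 1.0 - c)


def composite_match_degree(component_certainties, kind):
    """Match degree of a template from its components' evidence certainties.

    kind "conjoint" uses min-conjunction, kind "disjoint" uses max-disjunction.
    """
    if not component_certainties:
        raise ValueError("a template needs at least one component pattern")
    if kind == "conjoint":
        return min(component_certainties)
    if kind == "disjoint":
        return max(component_certainties)
    raise ValueError(f"unknown template kind: {kind!r}")


# Two component patterns with made-up parameters.
components = [
    evidence_certainty(a=1.0, p=0.75, c=1.0),   # 0.75
    evidence_certainty(a=0.5, p=0.75, c=1.0),   # 0.25
]
a_disjoint = composite_match_degree(components, "disjoint")
a_conjoint = composite_match_degree(components, "conjoint")
# The template's own parameters are then applied to the composite degree.
e_template = evidence_certainty(a_disjoint, p=1.0, c=0.5)
```

In the disjoint case a single strong component is enough to carry the template, while in the conjoint case the weakest component limits it.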
writing down the templates • we write the template down so that it fits into the ontology as shown before:
    <ot:Template ot:p="0.95" ot:c="0.8" ot:type="disjoint"> ... </ot:Template>
• the component patterns are written as nested XML tags
a few kinds of patterns
    • <ot:String ot:p="0.7">Egypt</ot:String>
    • <ot:Stringlist ot:source="c:\temp\zeme.txt" ot:c="0.62"/>
    • <ot:Concatenation> ..</..>
    • <ot:Context ot:side="left" ot:maxdistance="1" ot:c="0.5">..</..>
    • <ot:Number ot:min="1" ot:max="10"/>
    • <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
    • <ot:Regexp> ..</..>
    • …
example template
    <ot:Template ot:type="disjoint" ot:c="0.9">
      <ot:Concatenation>
        <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
        <ot:Stringlist>
          <ot:String ot:case="any">kc</ot:String>
          <ot:String ot:case="any">kč</ot:String>
          <ot:String ot:case="same">,-</ot:String>
        </ot:Stringlist>
      </ot:Concatenation>
      <ot:Context ot:side="left" ot:maxdistance="2" ot:p="0.6">
        <ot:Template>
          <ot:String ot:case="any">cena</ot:String>
          <ot:String ot:case="any">cena:</ot:String>
        </ot:Template>
      </ot:Context>
    </ot:Template>
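A template written this way can be read back with any namespace-aware XML parser. The Python sketch below walks a small template and lists each pattern with its p and c. It is an illustration only: the helper `describe` is hypothetical, the sample template is shortened, and treating a missing `ot:p` or `ot:c` as 1.0 is an assumption (with both at 1.0 the evidence certainty equals the match degree).

```python
import xml.etree.ElementTree as ET

OT = "http://st.vse.cz/~XNEKM06/ontologytemplates#"

# Shortened sample template; the namespace must be declared for parsing.
TEMPLATE = f"""<ot:Template xmlns:ot="{OT}" ot:type="disjoint" ot:c="0.9">
  <ot:String ot:p="0.7">Egypt</ot:String>
  <ot:String ot:case="any">cena</ot:String>
</ot:Template>"""


def describe(element, depth=0):
    """Return (depth, pattern kind, p, c) for the element and its descendants."""
    kind = element.tag.split("}", 1)[1]            # strip the "{namespace}" prefix
    p = float(element.get(f"{{{OT}}}p", "1.0"))    # assumed default: 1.0
    c = float(element.get(f"{{{OT}}}c", "1.0"))    # assumed default: 1.0
    rows = [(depth, kind, p, c)]
    for child in element:
        rows.extend(describe(child, depth + 1))
    return rows


rows = describe(ET.fromstring(TEMPLATE))
```

The nesting depth in each row mirrors the nesting of the XML tags, so composite patterns and their components stay distinguishable.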
agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method
annotating the document • first of all, we can use the ontology as a model of the extracted data • then we use the templates included in the ontology to identify possible example instances of the extracted values • these examples can be used with any wrapper induction method
purifying the evidences • since every pattern has the precision attribute p, we can say that up to a fraction 1-p of the template evidences can be false • we can group the evidences into segments based on their absolute XPath • then we calculate the sum of confidences of all evidences in each segment and ignore the fraction 1-p of segments with the lowest sums
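The purification step might look like this in Python. Several details are assumptions made for the sketch, since the slides leave them open: a segment is taken to be all evidences sharing the same XPath once positional indices are removed, the number of dropped segments is rounded down, and the function name `purify` is invented.

```python
import math
import re
from collections import defaultdict


def segment_key(xpath):
    """Assumed segmentation: the absolute XPath without positional indices."""
    return re.sub(r"\[\d+\]", "", xpath)


def purify(evidences, p):
    """Drop the fraction (1 - p) of segments with the lowest summed confidence.

    evidences: list of (absolute_xpath, confidence) pairs.
    """
    sums = defaultdict(float)
    for xpath, confidence in evidences:
        sums[segment_key(xpath)] += confidence
    ranked = sorted(sums, key=sums.get)                 # weakest segments first
    n_drop = math.floor((1.0 - p) * len(ranked))        # assumed rounding: down
    kept = set(ranked[n_drop:])
    return [(x, conf) for x, conf in evidences if segment_key(x) in kept]


EVIDENCES = [
    ("/html/body/table/tr[1]/td[2]", 0.75),
    ("/html/body/table/tr[2]/td[2]", 0.5),
    ("/html/body/div/span[1]", 0.25),
    ("/html/body/p[3]", 0.5),
    ("/html/head/title", 0.125),
]
purified = purify(EVIDENCES, p=0.75)
```

With four segments and p = 0.75, exactly one segment is ignored: the one whose confidences sum to the lowest value.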
generalizing the segments • we generalize a segment by turning the index in its XPath into a variable • by comparing the number of elements in the generalized segment with the number in the original, we can use the completeness parameter to measure the probable error of the generalization
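One way to picture the generalization is the Python sketch below. It is an assumption-laden illustration: "variable index" is interpreted as dropping the positional predicate at every step where the examples disagree, the paths are required to share the same tag sequence, and the names `generalize` and `coverage` are hypothetical. How the resulting ratio is compared with the completeness parameter c is left open here.

```python
import re

STEP = re.compile(r"^(?P<tag>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def generalize(xpaths):
    """Generalize absolute XPaths by removing indices that vary between them."""
    split = [x.strip("/").split("/") for x in xpaths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same depth")
    result = []
    for column in zip(*split):                      # same step across all paths
        parsed = [STEP.match(step) for step in column]
        if any(m is None for m in parsed):
            raise ValueError(f"unsupported step in {column!r}")
        tags = {m.group("tag") for m in parsed}
        if len(tags) != 1:
            raise ValueError("paths must share the same tag sequence")
        if len(set(column)) == 1:
            result.append(column[0])                # identical step: keep as is
        else:
            result.append(tags.pop())               # index varies: drop it
    return "/" + "/".join(result)


def coverage(n_evidences, n_selected):
    """Share of elements selected by the generalized query that were evidences."""
    return n_evidences / n_selected


SEGMENT = [
    "/html/body/table[1]/tr[1]/td[2]",
    "/html/body/table[1]/tr[2]/td[2]",
    "/html/body/table[1]/tr[4]/td[2]",
]
query = generalize(SEGMENT)
# If the generalized query selected 4 cells in the document, 3 of them evidences:
ratio = coverage(len(SEGMENT), 4)
```

A ratio well below 1 suggests either that the patterns missed some values or that the generalization is too broad.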
matching the segments • we can match the segments found for patterns of several datatype properties and thus form complex rules for extracting instances of ontology classes • the matching can be based on the number of their elements or on the conformity of their XPath
future work suggestions • integration with some wrapper generation tool • automatic learning of the patterns • using other properties of ontologies, such as cardinalities