Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation Martin Labský, Vojtěch Svátek Dept. of Knowledge Engineering, UEP {labsky,svatek}@vse.cz AI Seminar, November 13th 2008
Agenda
• Example applications of Web IE
• Difficulties in practical applications
• Extraction Ontologies
• Extraction process
• Experimental results
• Future work and Conclusion
AI Seminar IE based on Extraction Ontologies
Example apps of Web IE (1/5): online products
Example apps of Web IE (2/5): contact information
Example apps of Web IE (3/5): seminars, events
Example apps of Web IE (4/5): bike products
Example apps of Web IE (5/5)
• Store the extracted results in a DB to enable structured search over documents
  • information retrieval
  • database-like querying
  • e.g. an online product search engine
  • e.g. building a contact DB
• Support for web page quality assessment
  • involved in the EU project MedIEQ to support medical website accreditation agencies
• Source documents
  • internet, intranet, emails
  • can be very diverse
Difficulties in practical applications (1/3)
• Requirements
  • quickly prototype IE applications
    • not necessarily with the best accuracy initially
    • often needed for a proof-of-concept application
    • then more work can be done to boost accuracy
  • the extraction model changes
    • the meaning of to-be-extracted items may shift
    • items are often added or removed
Difficulties in practical applications (2/3)
• Purely manual rules
  • writing extraction rules manually does not scale once more complex rules need to be encoded
  • not easy to combine with trained models when training data become available in later phases
• Training data
  • trainable IE systems often require large amounts of training data, which are typically not available for the desired task
  • once training data have been collected, they are not easy to adapt to modified or additional criteria
• Wrappers
  • wrapper-only systems cannot be relied on when extracting from multiple websites
  • non-wrapper systems often do not utilize regular formatting cues
Difficulties in practical applications (3/3)
• It seems worthwhile to exploit at the same time:
  • extraction knowledge from domain experts
  • training data
  • formatting regularities
Extraction ontologies
• An extraction ontology is a part of a domain ontology transformed to suit extraction needs
• Contains classes composed of attributes
  • more like UML class diagrams, less like ontologies where e.g. relations are standalone
  • also contains axioms related to classes or attributes
• Classes and attributes are augmented with extraction evidence
  • manually provided patterns for content and context
  • axioms
  • value or length ranges
  • links to trained models
[Figure: example class Person with attribute cardinalities: name {1}, degree {0-5}, email {0-2}, phone {0-3}; a further label reads "Responsible"]
Extraction evidence provided by domain expert (1)
• Patterns
  • for attributes and classes
  • for their content and context
  • patterns may be defined at the following levels:
    • word and character level
    • formatting tag level
    • level of labels (e.g. sentence breaks, POS tags)
• Attribute value constraints
  • word length constraints, numeric value ranges
  • units can be attached to numeric attributes
• Axioms
  • may enforce relations among attributes
  • interpreted using the JavaScript scripting language
• Simple co-reference resolution rules
Extraction evidence provided by domain expert (2)
• Axioms
  • class level
  • attribute level
• Patterns
  • class content
  • attribute value
  • attribute context
  • class context
• Value constraints
  • word length
  • numeric value
Extraction evidence based on trained models (1)
• Links to trainable classifiers
  • may classify attributes only
  • binary or multi-class
• Trained models may use as features:
  • simple word-level features (word itself, word type, possibly POS tags)
  • all evidence provided by the expert (patterns, axioms, constraints), which is reused
  • induced binary features based on word n-grams
[Figure: classifier definition and classifier usage]
Extraction evidence based on trained models (2)
• Data representation for classifiers:
  • word sequence (1 word = 1 sample)
  • phrase set (sliding window method)
• Tested trainable classifiers:
  • CRF++ (Conditional Random Fields), http://crfpp.sourceforge.net
  • algorithms from the Weka machine learning toolkit, http://www.cs.waikato.ac.nz/ml/weka
    • SVM (Support Vector Machine)
    • JRip (rule induction)
  • Hidden Markov Model extractor
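The two data representations named above can be illustrated with a minimal Python sketch. The function names and the `max_len` parameter are hypothetical, chosen only for illustration; the slides do not say how the tool enumerates phrases.

```python
def word_samples(tokens):
    """Word-sequence representation: one sample per word."""
    return [(i, tok) for i, tok in enumerate(tokens)]


def phrase_samples(tokens, max_len):
    """Phrase-set representation: slide a window over the tokens and
    emit every contiguous phrase of 1..max_len words as one sample."""
    samples = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            samples.append((start, end, tokens[start:end]))
    return samples


tokens = ["Dr", "John", "Burns"]
words = word_samples(tokens)
phrases = phrase_samples(tokens, max_len=2)
```

In the word-sequence view a sequence labeler such as a CRF tags each word; in the phrase-set view a classifier judges each candidate phrase as a whole.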
Extraction evidence based on trained models (3)
• Feature induction
  • candidate features are all word n-grams of given lengths occurring inside or near training attribute values
  • pruning parameters:
    • point-wise mutual information thresholds
    • minimal absolute occurrence count
    • maximum number of features
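A rough Python sketch of how the three pruning parameters could interact. It assumes each observation is an (n-gram, attribute) pair collected from training data and uses the standard point-wise mutual information definition; the function name, the input layout and the order in which the filters are applied are assumptions, not details from the slides.

```python
import math
from collections import Counter


def induce_features(observations, min_pmi, min_count, max_features):
    """observations: list of (ngram, attribute) pairs, one per occurrence
    of an n-gram inside or near a training value of that attribute."""
    observations = list(observations)
    total = len(observations)
    pair_counts = Counter(observations)
    ngram_counts = Counter(ngram for ngram, _ in observations)
    attr_counts = Counter(attr for _, attr in observations)

    scored = []
    for (ngram, attr), count in pair_counts.items():
        if count < min_count:  # minimal absolute occurrence count
            continue
        # PMI = log( P(ngram, attr) / (P(ngram) * P(attr)) )
        pmi = math.log(count * total / (ngram_counts[ngram] * attr_counts[attr]))
        if pmi >= min_pmi:  # point-wise mutual information threshold
            scored.append((pmi, ngram, attr))

    # keep the highest-scoring features, up to the maximum number allowed
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(ngram, attr, pmi) for pmi, ngram, attr in scored[:max_features]]


observations = (
    [(("dr",), "name")] * 3
    + [(("the",), "name")] * 1
    + [(("the",), "other")] * 4
)
features = induce_features(observations, min_pmi=0.5, min_count=2, max_features=10)
```

Here only the n-gram ("dr",) survives for the attribute "name": ("the",) with "name" is too rare, and ("the",) with "other" falls below the PMI threshold.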
Probabilistic model to combine evidence
• Each piece of evidence E is equipped with 2 probability estimates with respect to the predicted attribute A:
  • evidence precision P(A|E) ... prediction confidence
  • evidence coverage P(E|A) ... necessity of evidence (support)
• Each attribute is assigned some low prior probability P(A)
• Consider the set of all evidence applicable to A
• Assume conditional independence among the pieces of evidence in this set
• Using Bayes' formula, P(A | its evidence values) is computed from the precision, coverage and prior estimates
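A minimal Python sketch of one way such a combination can be computed. It assumes the evidence is conditionally independent both given A and given not-A, so that the posterior odds equal the prior odds multiplied by one odds ratio per piece of evidence. For evidence that is absent, P(A|not E) is derived from precision, coverage and prior by Bayes' rule. The function names and the (precision, coverage, observed) tuple layout are hypothetical, and the exact formula shown on the original slide is not preserved in this transcript.

```python
def posterior_given_absent(prior, precision, coverage):
    """P(A | not E), derived from P(A), P(A|E) and P(E|A) by Bayes' rule."""
    p_evidence = coverage * prior / precision  # P(E) = P(E|A) P(A) / P(A|E)
    return (1.0 - coverage) * prior / (1.0 - p_evidence)


def combine_evidence(prior, evidence):
    """evidence: list of (precision, coverage, observed) tuples.
    Returns P(A | all evidence values) under the independence assumption."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for precision, coverage, observed in evidence:
        if observed:
            p = precision
        else:
            p = posterior_given_absent(prior, precision, coverage)
        # each piece of evidence scales the odds by (posterior odds / prior odds)
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


single = combine_evidence(0.01, [(0.8, 0.5, True)])
double = combine_evidence(0.5, [(0.8, 0.5, True), (0.8, 0.5, True)])
absent = combine_evidence(0.1, [(0.5, 0.5, False)])
nothing = combine_evidence(0.01, [])
```

With a single observed piece of evidence the result equals its precision, and with no evidence at all it falls back to the prior, which is a quick sanity check on the odds form. The sketch requires precisions strictly between 0 and 1.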
Extraction vs. domain ontologies
• When existing domain ontologies are available:
  • identify relevant parts
  • reuse classes, attributes, cardinalities, some axioms
• Transformation rules
  • reused parts of a domain ontology may require transformation to fit into an extraction ontology
    • because extraction ontologies focus on the way of presentation rather than on semantics
  • typical transformation rules were identified that could be used to transform parts of OWL-encoded ontologies
The extraction process (1/5)
• Tokenize, build HTML formatting tree, apply sentence splitter, POS tagger
• Match patterns
• Apply trained models
• Create Attribute Candidates (ACs)
  • for each created AC, let P_AC be its probability given the evidence that applies to it
  • prune ACs below threshold
  • build document AC lattice, score ACs by log(P_AC)
[Figure: example text "Washington, DC ..."]
The extraction process (2/5)
• Evaluate coreference resolution rules for each pair of ACs
  • e.g. “Dr. Burns” and “John Burns”
  • possible coreferring groups are remembered
  • rules are defined in the attribute’s value section
• Compute the best scoring path BP through the AC lattice
  • using dynamic programming
• Run the wrapper induction algorithm using all ACs on BP
  • the wrapper induction algorithm is described in a later slide
  • if new local patterns are induced, apply them to:
    • rescore existing ACs
    • create new ACs
  • update the AC lattice, recompute BP
• Terminate here if no instances are to be generated
  • output all ACs on BP (n-best paths supported)
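The best-path step can be sketched in Python as a dynamic program over token positions. The sketch assumes each AC covers a token span and is scored by log(P_AC), and that every token not covered by a chosen AC contributes a fixed background log-score. The background score, the tuple layout and the function name are assumptions made for illustration; the slides do not say how uncovered tokens are scored.

```python
import math


def best_path(n_tokens, candidates, background_logp):
    """candidates: list of (start, end, name, prob) covering tokens [start, end).
    Returns (best score, list of chosen non-overlapping candidates)."""
    best = [0.0] + [-math.inf] * n_tokens
    back = [None] * (n_tokens + 1)
    by_end = {}
    for cand in candidates:
        by_end.setdefault(cand[1], []).append(cand)

    for i in range(1, n_tokens + 1):
        # option 1: token i-1 is background
        best[i] = best[i - 1] + background_logp
        back[i] = (i - 1, None)
        # option 2: some candidate ends exactly at position i
        for cand in by_end.get(i, []):
            score = best[cand[0]] + math.log(cand[3])
            if score > best[i]:
                best[i] = score
                back[i] = (cand[0], cand)

    # follow back-pointers from the end to recover the chosen candidates
    path = []
    i = n_tokens
    while i > 0:
        prev, cand = back[i]
        if cand is not None:
            path.append(cand)
        i = prev
    path.reverse()
    return best[n_tokens], path


candidates = [
    (0, 2, "name", 0.6),
    (1, 3, "name", 0.9),
    (3, 5, "email", 0.8),
]
score, path = best_path(5, candidates, background_logp=math.log(0.3))
```

The two overlapping "name" candidates cannot both be on the path; the program keeps the span (1, 3) because it yields the higher total score together with the "email" candidate.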
The extraction process (3/5)
• Generate Instance Candidates (ICs) bottom-up
  • a triangular trellis is used to store partial ICs
  • when scoring new ICs, only consider axioms and patterns that can already be applied to the IC; validity is not required at this stage
  • pruning parameters: absolute and relative beam size at a trellis node, maximum number of ACs that can be skipped, minimum IC probability
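A small Python sketch of how beam pruning at a single trellis node might work. It assumes the relative beam is a fraction of the best probability at that node and the absolute beam is a cap on the number of ICs kept; both interpretations, the function name and the (probability, payload) layout are assumptions of this sketch.

```python
def prune_node(candidates, abs_beam, rel_beam, min_prob):
    """candidates: list of (probability, payload) partial ICs at one node."""
    # drop candidates below the minimum IC probability, best first
    kept = sorted(
        (c for c in candidates if c[0] >= min_prob),
        key=lambda c: c[0],
        reverse=True,
    )
    if not kept:
        return []
    # relative beam: keep candidates within a fraction of the best one
    top = kept[0][0]
    kept = [c for c in kept if c[0] >= rel_beam * top]
    # absolute beam: keep at most abs_beam candidates
    return kept[:abs_beam]


node = [(0.9, "a"), (0.5, "b"), (0.4, "c"), (0.05, "d"), (0.85, "e")]
survivors = prune_node(node, abs_beam=3, rel_beam=0.5, min_prob=0.1)
```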
The extraction process (4/5)
• IC generation, continued
• When a new IC is created, its P(IC) is computed from 2 components:
  • a score based on the ACs involved, where |IC| is the member attribute count, AC_skip is a non-member AC that lies fully or partially inside the IC, and P_ACskip is the probability of that AC being a “false positive”
  • a score based on the set of evidence known for the class C, computed using the same probabilistic model as for ACs
• The two scores are combined using the Prospector pseudo-Bayesian method
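The combination step can be sketched in Python using a common odds-multiplication form for merging two probability-like scores. Treating this form as the "Prospector pseudo-Bayesian method" meant here is an assumption, since the original formulas are not preserved in this transcript, and the function name is hypothetical.

```python
def combine_scores(score_a, score_b):
    """Combine two scores in [0, 1] by multiplying their odds.
    A score of 0.5 is neutral; two scores above 0.5 reinforce each other."""
    numerator = score_a * score_b
    denominator = numerator + (1.0 - score_a) * (1.0 - score_b)
    if denominator == 0.0:
        raise ValueError("scores 0 and 1 contradict each other and cannot be combined")
    return numerator / denominator


neutral = combine_scores(0.5, 0.7)
reinforced = combine_scores(0.8, 0.8)
```

In this form a component that carries no information (0.5) leaves the other score unchanged, which is one reason such a rule is convenient when one of the two components has little evidence.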
The extraction process (5/5)
• Insert valid ICs into the AC lattice
  • valid ICs were assembled during the IC generation phase
  • the score of a valid IC reflects all extraction evidence of its class
  • all unpruned valid ICs are inserted into the AC lattice together with their scores
• The best path BP is calculated through the IC+AC lattice (n-best supported)
  • the search algorithm allows constraints to be defined over the extracted path(s)
    • e.g. min/max count of extracted instances
  • output all ACs and ICs on BP
Extraction evidence based on formatting
• A simple wrapper induction algorithm
  • identify formatting regularities
  • turn them into “local” context patterns to boost contained ACs
• Assemble distinct formatting subtrees rooted at block elements containing ACs from the best path BP currently determined by the system
• For each subtree S and attribute Att, calculate C(S,Att) and prec(Att|S)
• If both C(S,Att) and prec(Att|S) reach defined thresholds, a new local context pattern is created with its precision set to C(S,Att) and its recall close to 0 (in order not to harm potential singleton ACs)
[Figure: a formatting subtree (TD containing B and A_href) learned using known names like “John Doe” (jdoe@web.ca) and applied to unknown names such as “Argentina Agosto” (aa@web.br)]
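A simplified Python sketch of the thresholding idea. It assumes that C(S,Att) is the number of text nodes under formatting path S that hold a best-path AC of attribute Att, and that prec(Att|S) is that count divided by the number of all text nodes under S. Both definitions, the flat path strings such as "TD/B" and the function name are assumptions, because the original formulas are not preserved in this transcript.

```python
from collections import Counter


def induce_local_patterns(nodes, min_count, min_precision):
    """nodes: list of (format_path, attribute or None), one per text node.
    Returns (format_path, attribute, count, precision) for each pattern kept."""
    totals = Counter(path for path, _ in nodes)
    hits = Counter((path, att) for path, att in nodes if att is not None)
    patterns = []
    for (path, att), count in sorted(hits.items()):
        precision = count / totals[path]
        if count >= min_count and precision >= min_precision:
            patterns.append((path, att, count, precision))
    return patterns


nodes = (
    [("TD/B", "name")] * 3      # bold cells already recognized as names
    + [("TD/B", None)]          # a bold cell with no AC on the best path
    + [("TD/A_href", "email")] * 2
    + [("TD/I", "name")]        # a single italic name: too little support
)
patterns = induce_local_patterns(nodes, min_count=2, min_precision=0.7)
```

An induced pattern such as ("TD/B", "name") could then boost the unrecognized bold cell, which mirrors the “John Doe” versus “Argentina Agosto” example.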
Experimental results: Seminar announcements
• 485 English seminar announcement text documents
• Manual: extraction ontology created after inspecting 40 randomly chosen documents, evaluated on the remaining 445
• Manual+CRF: the same extraction ontology equipped with a CRF classifier used as further extraction evidence; 10-fold cross-validation using the test set above
Cost of the IE system: Seminar announcements
• Creation of extraction ontology: 1-2 person weeks
  • annotate 40 training documents (expect 1-2 days)
  • inspecting examples in 40 documents
  • writing patterns, axioms, iterating
• Training inductive model in addition to ex. ontology
  • 2-3 person weeks to annotate training data (445 docs)
  • F-measure improvement of 2 to 6%
• ex. ontologies allow for fast & flexible prototyping (annotation design changes are quickly reflected)
• then, for parts of the ex. ontology that need accuracy improvement, obtain more training data & reuse all manual extraction evidence already provided as features
Experimental results: Contact information
• 109 English contact pages, 200 Spanish, 108 Czech
• Named entity counts: 7000, 5000, 11000, respectively; instances not labeled
• Only domain expert’s evidence and formatting pattern induction were used
• Domain expert saw 30 randomly chosen documents, the rest was test data
• Instance extraction done but not evaluated
Instance grouping
• Vilain score F = 60-70%
• Vilain recall = % of correct links recovered
• Vilain precision = % of recovered links that are correct
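A Python sketch of a link-based grouping score, assuming the measure meant here is the MUC-style coreference score of Vilain et al., applied to groups of attribute values. Recall counts, for each reference group, how many of its links survive in the predicted grouping; precision swaps the two roles. The function names and the set-based input layout are hypothetical.

```python
def link_score(key_groups, response_groups):
    """Fraction of links in key_groups that are preserved by response_groups."""
    owner = {}
    for index, group in enumerate(response_groups):
        for item in group:
            owner[item] = index
    preserved = 0
    possible = 0
    for group in key_groups:
        # split the key group by the response groups;
        # items missing from the response form singleton parts
        parts = {owner.get(item, ("single", item)) for item in group}
        preserved += len(group) - len(parts)
        possible += len(group) - 1
    return preserved / possible if possible else 0.0


def vilain_scores(key_groups, response_groups):
    recall = link_score(key_groups, response_groups)
    precision = link_score(response_groups, key_groups)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f_score = vilain_scores(key, response)
```

In this example the reference grouping has 3 links; the prediction recovers 2 of them and proposes 3 links of which 2 are correct.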
Experimental results: Bicycle descriptions
• Hidden Markov Model
  • trigram, naive topology
• 103 labeled web pages, 12346 named entities
• Instances not labeled; instance extraction done but not evaluated
• Single HMM for all extracted types:
  • 1 Background state
  • 1 Target, 1 Prefix and 1 Suffix state type for each extracted slot
  • = 1 + 3*N states
[Figure: HMM topology with states B, P, T, S, P’, T’, S’, ...]
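The state inventory above can be written out in a few lines of Python, assuming N is the number of extracted slots. The state labels and the slot names used in the example are hypothetical.

```python
def build_states(slots):
    """One shared Background state plus Prefix, Target and Suffix per slot."""
    states = ["Background"]
    for slot in slots:
        states += [f"Prefix:{slot}", f"Target:{slot}", f"Suffix:{slot}"]
    return states


states = build_states(["name", "price", "size"])
```

For three slots this yields 1 + 3*3 = 10 states.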
Bicycle structured search interface
Future work
• Attempt to improve a seed extraction ontology by bootstrapping using relevant pages retrieved from the Internet
• Adapt the structure of the extraction ontology according to data
  • e.g. add new attributes to represent product features
Conclusions
• Tool+tutorial available
  • http://eso.vse.cz/~labsky/ex/
• Presented an extraction ontology approach to
  • allow for fast prototyping of IE applications
  • accommodate extraction schema changes easily
  • utilize all available forms of extraction knowledge
    • domain expert’s knowledge
    • training data
    • formatting regularities found in web pages
• Results
  • indicate that extraction ontologies can serve as a quick prototyping tool
  • accuracy of the prototyped ontology can be improved when training data become available
Acknowledgements
• The research was partially supported by the EC under contract FP6-027026, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content: K-Space.
• The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.