Learning Effective Patterns for Information Extraction Gijs Geleijnse (gijs.geleijnse@philips.com)
Overview • my view on Ontology Population / Information Extraction • short discussion of the global approach with respect to Ontology Population • a subproblem: learning relation patterns • experiments with learned patterns • conclusions
What’s the problem? Information is freely accessible on the web ... but the information on the ‘traditional’ web is not interpretable by machines. Goal of my research: find, extract and combine information on the web into a machine-interpretable format
What’s the problem? (2) • 1. Come up with a model for the concept information: ontologies • 2. Come up with algorithms to populate this model
Populating an ontology • 1. Formulate queries with an instance: ‘U2’s album’. • 2. Collect Google search results: ‘U2’s album Pop ..’, ‘U2’s album on a flash card’, ‘U2’s album How to Dismantle..’ • 3. Identify instances in the results: albums (Boy), (HtDaAB), ... giving relation instances (U2, Boy), (U2, HtDaAB) [Figure: ontology with classes artist and album, relation producer]
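The three steps above can be sketched in Python as below. This is a minimal sketch, not the talk's implementation: the search call is a stub returning canned excerpts (a real system would call a search-engine API), and the extraction rule (the run of capitalized words directly after the query text) is a deliberately naive hypothetical one.

```python
import re

def excerpts_for(query):
    """Stub standing in for a web search API (hypothetical):
    returns text excerpts that Googling the query might yield."""
    canned = {
        "U2's album": [
            "U2's album Pop was released in 1997",
            "U2's album on a flash card",
            "U2's album How to Dismantle an Atomic Bomb debuted at #1",
        ],
    }
    return canned.get(query, [])

def populate(instance):
    """Step 1: formulate a query with the instance; step 2: collect
    search results; step 3: identify instances -- here naively, as
    the capitalized word run directly following the query text."""
    query = f"{instance}'s album"
    pairs = set()
    for excerpt in excerpts_for(query):
        tail = excerpt[len(query):].strip()
        m = re.match(r"(?:[A-Z]\w*\s?)+", tail)
        if m:
            pairs.add((instance, m.group(0).strip()))
    return pairs
```

Note how the naive extractor truncates "How to Dismantle an Atomic Bomb" to "How" at the first lowercase word: identifying instances in the retrieved texts is itself a subproblem, as the next slide points out.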
Subproblems of OP How to • identify patterns expressing relations? Amsterdam – Netherlands: ‘is the capital of’ • identify instances in the Googled texts? ‘buy i still know what you did last summer on dvd’ • define acceptance functions for instances and relations? ‘they think Amsterdam is the capital of Germany hahahaha’
Identifying effective relation patterns • We want patterns that give many useful results. Three criteria for effectiveness: • 1. A pattern must frequently occur on the web, i.e. it must return many results. • 2. A pattern must be precise, i.e. it must return many useful results. • 3. When relation R is one-to-many, a pattern must be wide-spread, i.e. it must return diverse results.
Identifying effective relation patterns • Approach: • Compose a training set with related items. • Google them to get a set of patterns. • Compute scores for the patterns. • Constraint: don’t Google too often!
Retrieving relation patterns We formulate queries with the elements in the training set: “Michael Jackson * Thriller”, “Thriller * Michael Jackson”. We retrieve all inner-sentence fragments between the instances and normalize them (remove punctuation marks and capitals).
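This fragment extraction can be sketched as follows, assuming the excerpts are already retrieved. The helper name `pattern_between` is mine, not from the talk, and the single-sentence check (rejecting fragments containing a full stop) is a simplifying assumption.

```python
import string

def pattern_between(excerpt, inst1, inst2):
    """Return the normalized inner-sentence fragment between two
    instances, or None. Both orders are tried, mirroring the two
    queries "A * B" and "B * A"."""
    for a, b in ((inst1, inst2), (inst2, inst1)):
        i, j = excerpt.find(a), excerpt.find(b)
        if i == -1 or j == -1 or i + len(a) > j:
            continue
        fragment = excerpt[i + len(a):j]
        if "." in fragment:  # crosses a sentence boundary -- reject
            continue
        # normalize: remove punctuation marks and capitals
        fragment = fragment.translate(str.maketrans("", "", string.punctuation))
        return " ".join(fragment.lower().split())
    return None
```

For example, "Thriller, the hit album by Michael Jackson, sold millions" yields the pattern "the hit album by", while "Michael Jackson's Thriller" yields "s", i.e. the slide's "[artist]'s [album]" pattern.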
Evaluate relation patterns We now have a (long) list of patterns: [album] by [artist] ; [artist]’s [album] ; [album] album cover by [artist] ; [album] di [artist] ; ......... Now to compute scores: frequency, precision, wide-spreadness
Evaluate relation patterns • Frequency: we take the frequency of the pattern in the list obtained. • Precision: we google the pattern in combination with an instance and observe the fraction of useful results. E.g. if we google “ABBA’s new album” we divide the number of excerpts containing an album title by the total number of excerpts found.
Evaluate relation patterns Wide-spreadness: we count the number of different instances found with the query. Score = freq * prec * spread. We only compute the scores of the N most frequent patterns. Number of queries: 2 * |training set| + N * |instance set|
Case-study: Hearst Patterns Are the Hearst Patterns indeed the most effective patterns for the is-a relation? O = ((country, hyponym), ({all countries}, {‘country’, ‘countries’}), is_a, {(Afghanistan, country), (Afghanistan, countries), (Akrotiri, country), (Akrotiri, countries), ...})
Case-study: Hearst Patterns Both the common Hearst Patterns and relations typical for this setting (countries) perform well.
Case-study: Burger King TREC QA question: In which countries can Burger King be found? O = ((country, restaurant), ({all countries}, {McDonald’s, KFC}), located_in, {(McDonald’s, USA), (KFC, China), ...})
Case-study: Burger King We first find patterns using the method described:
Case-study: Burger King • ... and simultaneously find names of restaurants: • candidates are capitalized words • a candidate X is accepted if Noh(“restaurants like X and”) >= 50, i.e. the query returns at least 50 hits
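A sketch of this acceptance check, with Noh read as the number of hits a search engine reports for the query. The hit-count function is a stub with made-up numbers, and "Tablecloth" is a hypothetical non-restaurant candidate.

```python
def hit_count(query):
    """Stub standing in for a search engine's reported hit count
    (hypothetical numbers; a real system would issue the query)."""
    canned = {
        '"restaurants like Burger King and"': 3400,
        '"restaurants like Tablecloth and"': 2,
    }
    return canned.get(query, 0)

def accept_restaurant(name, threshold=50):
    """Accept a candidate X only if it is capitalized and the phrase
    "restaurants like X and" has at least `threshold` hits."""
    if not name[:1].isupper():
        return False
    return hit_count(f'"restaurants like {name} and"') >= threshold
```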
Case-study: Burger King • Finally, we use the patterns found in combination with ‘Burger King’ to find relations. • Precision: 80% • Recall: 85% • Most errors are due to countries in which Burger King plans to open restaurants.
Conclusions • Automatic pattern selection is successful • Simple methods again lead to good results • Recognition of instances and the filtering of erroneous patterns remain a big challenge • Ontology Population is fun