An Improved Pattern Model for Automatic IE Pattern Acquisition
Kiyoshi Sudo, Satoshi Sekine, Ralph Grishman
New York University
ACL 2003
Automatic Pattern Acquisition
• Manually constructing extraction patterns is very costly.
• Preparing annotated data for supervised learning is still costly.
• Recent research on pattern acquisition has therefore moved toward unsupervised and semi-supervised learning.
Information Extraction
• Identify entities in the source text and map them to the slots of a pre-defined table.
“A smiling Palestinian suicide bomber triggered a massive explosion in the heavily policed heart of downtown Jerusalem today, …”
Date: today
Location: downtown Jerusalem
Perpetrator: A … suicide bomber
Local Context
• Local context provides useful information for identifying entities.
“A smiling Palestinian suicide bomber triggered a massive explosion in the heavily policed heart of downtown Jerusalem today, …”
Date: today
Location: downtown Jerusalem
Perpetrator: A … suicide bomber
Extraction Pattern
• Generalize each entity instance and its local context into an extraction pattern.
“A smiling Palestinian suicide bomber triggered a massive explosion in the heavily policed heart of downtown Jerusalem today, …”
[Figure: the pattern "<person> triggered a massive explosion", with <person> marked as an NE category and an association rule linking it to the Perpetrator slot]
Dependency Tree for Pattern Model
• Introducing syntax (a dependency tree) clarifies the relations between arguments and predicates.
[Figure: dependency tree of the example sentence, rooted at "triggered", with arcs SBJ (A smiling Palestinian suicide bomber), OBJ (a massive explosion), IN (heart, with "heavily policed" and "downtown Jerusalem" below it), and ADV (today)]
Extraction Pattern Models
• Predicate-Argument model (Yangarber et al. 2000)
  • Based on the direct relation with a predicate
• Chain model (Sudo et al. 2001)
  • Based on a chain of modifiers of a predicate
[Figure: example patterns. Predicate-Argument: "triggered" with <person> and "explosion". Chain: "triggered" with <person>; "triggered" with "heart" and "downtown Jerusalem".]
Predicate-Argument model
• The Predicate-Argument model is based on the direct relation between a predicate and its arguments.
[Figure: the example dependency tree with the predicate's direct arguments generalized to <person> and <date>; "downtown Jerusalem", below "heart", is not generalized]
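A minimal Python sketch of the restriction described on this slide. The nested (label, children) encoding of the tree, the `direct_arguments` helper, the reduction of phrases to head words, and the MOD arc labels below "heart" are illustrative assumptions, not part of the original work. It shows that only the predicate's direct dependents are visible, so the location below "heart" is out of reach.

```python
# Toy dependency tree for the example sentence: (label, [(relation, subtree), ...]).
# The "MOD" relation names are assumptions; the slides do not label those arcs.
TREE = ("triggered", [
    ("SBJ", ("<person>", [])),
    ("OBJ", ("explosion", [])),
    ("IN", ("heart", [
        ("MOD", ("heavily policed", [])),
        ("MOD", ("<location>", [])),
    ])),
    ("ADV", ("<date>", [])),
])


def direct_arguments(tree):
    """Return (predicate, relation, argument) triples for direct dependents only."""
    predicate, children = tree
    return [(predicate, rel, child[0]) for rel, child in children]


triples = direct_arguments(TREE)
for triple in triples:
    print(triple)
```

The <location> node never appears in the output because it is two arcs away from the predicate.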
Chain model
• The Chain model can capture a chain of modifiers of arbitrary depth in the tree, regardless of phrasal or clausal boundaries.
[Figure: the example dependency tree with <person>, <date>, and <location> generalized; <location> is reached through "heart"]
• Sudo et al. (2001) reported a 5% gain in recall at the same level of precision over the Predicate-Argument model.
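As a rough illustration of chains of arbitrary depth, the sketch below enumerates every path that starts at the predicate and walks down through one dependent per level. The tuple encoding, the `chains` helper, the head-word labels, and the MOD arc names are assumptions made for this example only.

```python
# Toy dependency tree: (label, [(relation, subtree), ...]).
# The "MOD" relation names are assumptions; the slides do not label those arcs.
TREE = ("triggered", [
    ("SBJ", ("<person>", [])),
    ("OBJ", ("explosion", [])),
    ("IN", ("heart", [
        ("MOD", ("heavily policed", [])),
        ("MOD", ("<location>", [])),
    ])),
    ("ADV", ("<date>", [])),
])


def chains(tree, prefix=()):
    """Yield every root-anchored path of labels with at least two nodes."""
    label, children = tree
    path = prefix + (label,)
    if prefix:  # skip the bare predicate on its own
        yield path
    for _rel, child in children:
        yield from chains(child, path)


all_chains = list(chains(TREE))
for chain in all_chains:
    print(" -> ".join(chain))
```

Unlike the direct-argument view, the path triggered -> heart -> <location> is now available, but each chain still carries only one node per level.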
Problem
• A Chain model pattern contains only one node at each level of the tree.
[Figure: the example dependency tree, rooted at "triggered", with <person>, "a massive explosion", "heart", <date>, "heavily policed", and "downtown Jerusalem"]
Problem
• Lack of context can make a pattern too general, causing false matches on irrelevant text.
“The Mexican peso was devalued and triggered a national financial crisis last week.”
[Figure: dependency tree rooted at "triggered" with SBJ (the Mexican peso), OBJ (a national financial crisis), and ADV (last week)]
Subtree model
• A generalization of the Predicate-Argument and Chain models
• Any connected subtree of a dependency tree is considered a candidate extraction pattern.
• Gives reliable context, as the Predicate-Argument model does
• Can capture long-distance relationships in the dependency tree
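To make "any connected subtree" concrete, here is a brute-force enumeration sketch in Python. It is not the subtree-mining algorithm used in the work; the tuple encoding, helper names, head-word labels, and MOD arc names are assumptions for illustration. Each child of a node is either dropped or kept together with one of its own rooted subtrees.

```python
from itertools import product

# Toy dependency tree: (label, [(relation, subtree), ...]).
# The "MOD" relation names are assumptions; the slides do not label those arcs.
TREE = ("triggered", [
    ("SBJ", ("<person>", [])),
    ("OBJ", ("explosion", [])),
    ("IN", ("heart", [
        ("MOD", ("heavily policed", [])),
        ("MOD", ("<location>", [])),
    ])),
    ("ADV", ("<date>", [])),
])


def rooted_subtrees(tree):
    """All connected subtrees that contain the root of `tree`."""
    label, children = tree
    options = []
    for rel, child in children:
        # None means "drop this child"; otherwise keep one of its rooted subtrees.
        options.append([None] + [(rel, sub) for sub in rooted_subtrees(child)])
    return [(label, [c for c in choice if c is not None])
            for choice in product(*options)]


def all_subtrees(tree):
    """All connected subtrees rooted at any node of `tree`."""
    found = rooted_subtrees(tree)
    for _rel, child in tree[1]:
        found.extend(all_subtrees(child))
    return found


print(len(rooted_subtrees(TREE)), len(all_subtrees(TREE)))
```

Even this seven-node tree yields dozens of candidates, which is why candidates need to be counted efficiently and ranked.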
Subtree model
• The Subtree model can provide more relevant context while remaining flexible enough to traverse the tree to arbitrary depth.
[Figure: the example dependency tree, rooted at "triggered", with <person>, "a massive explosion", "heart", <date>, "heavily policed", and "downtown Jerusalem"]
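The effect of the extra context can be sketched with a small tree matcher. Everything here is a hypothetical illustration: the tuple encoding, the `matches` helpers, the reduction of phrases to head words, the tagging of "last week" as <date>, and the two patterns themselves are assumptions, not patterns reported in the work.

```python
# Trees and patterns share one encoding: (label, [(relation, subtree), ...]).
BOMBING = ("triggered", [
    ("SBJ", ("<person>", [])),
    ("OBJ", ("explosion", [])),
    ("IN", ("heart", [("MOD", ("<location>", []))])),
    ("ADV", ("<date>", [])),
])
PESO = ("triggered", [
    ("SBJ", ("peso", [])),
    ("OBJ", ("crisis", [])),
    ("ADV", ("<date>", [])),
])

# A chain-style pattern with one node per level, and a subtree pattern
# that adds the object as further context.
CHAIN_PATTERN = ("triggered", [("ADV", ("<date>", []))])
SUBTREE_PATTERN = ("triggered", [
    ("OBJ", ("explosion", [])),
    ("ADV", ("<date>", [])),
])


def matches_at(pattern, node):
    """True if `pattern` is embedded in the tree with its root at `node`."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    return all(
        any(rel == p_rel and matches_at(p_child, child)
            for rel, child in n_children)
        for p_rel, p_child in p_children
    )


def matches(pattern, tree):
    """True if `pattern` matches at any node of `tree`."""
    if matches_at(pattern, tree):
        return True
    return any(matches(pattern, child) for _rel, child in tree[1])


print(matches(CHAIN_PATTERN, PESO))    # the under-specified pattern fires
print(matches(SUBTREE_PATTERN, PESO))  # the added context blocks the match
```

Both patterns still match the bombing sentence; only the pattern with more context rejects the financial news sentence.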
Experiment
• Entity Extraction task
  • Identify whether or not an NE instance is involved in the scenario
• Management Succession
  • Person, Organization, Post (Position_Title)
• Murder Arrest
  • Arresting Agency (Organization), Suspect (Person), Charge
• Source: 117,109 Japanese newspaper articles (Mainichi 1995)
• Test: accumulated from Mainichi 1994
  • Succession: 148 documents
  • Arrest: 205 documents
Acquisition Method
• The target scenario is specified by a TREC-like narrative description
  • “Management Succession at the level of executives of a company. The topic of interest should not be limited to the promotion inside the company mentioned, but also includes hiring executives from outside the company or their resignation.” [Translated from Japanese]
• Preprocessing
  • Dependency Analysis, NE-tagging
• Document Retrieval, yielding the set of retrieved documents R
Acquisition Method
• Count all possible subtrees in R
  • Subtree-mining algorithm (Zaki et al. 2002)
• Make a Pattern List of those that conform to the pattern model
• Rank each subtree. For each subtree i, the score uses:
  • the number of times subtree i occurred in the documents in R
  • the number of documents in the source which contain subtree i
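The ranking formula itself did not survive on these slides. The sketch below assumes a TF/IDF-style score in which the in-R frequency is multiplied by an inverse document frequency term raised to a power β, which is consistent with the later slides that describe β as a weight and IDF as a penalty on general patterns. The function name, the symbols tf, df, and N, and all counts except the collection size of 117,109 articles are hypothetical.

```python
import math

N_DOCS = 117_109  # size of the source collection, as stated on the Experiment slide


def score(tf, df, n_docs=N_DOCS, beta=1.0):
    """Assumed TF/IDF-style score: tf * (log(N / df)) ** beta.

    tf: times the subtree occurs in the retrieved set R
    df: documents in the whole source that contain the subtree
    """
    return tf * math.log(n_docs / df) ** beta


# Invented counts: a frequent but general pattern vs. a rarer, specific one.
general = {"tf": 50, "df": 10_000}
specific = {"tf": 30, "df": 100}

for beta in (0, 1, 2):
    g = score(general["tf"], general["df"], beta=beta)
    s = score(specific["tf"], specific["df"], beta=beta)
    print(beta, round(g, 1), round(s, 1))
```

With β = 0 only the raw frequency counts, so the general pattern ranks first; as β grows, the document-frequency penalty moves the specific pattern to the top.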
Overlapping patterns
• The Pattern List contains many overlapping patterns
  • (19) (<organization> report) ((<organization>-wa) Happyo_suru)
  • (480) (<organization> report that … be appointed) ((<organization>-wa) (Shunin_suru-to) Happyo-suru)
• β works as a weight on patterns with more relevant context
[Translated from Japanese]
β comparison
Unsupervised Parameter Tuning
• Unsupervised text classification task by pattern matching
  • retrieved: 300 documents retrieved
  • random: 300 randomly selected
• For each value of β, calculate the area under its precision-recall curve.
• Pearson correlation coefficient
  • r_p = 0.80 with 2% confidence
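A small Python sketch of the two computations named on this slide. The curves are invented toy data, the helper names are hypothetical, and the trapezoidal rule is an assumed way of measuring the area a curve covers. The slide does not say which two series the Pearson coefficient relates, so `pearson` is written for any pair of equal-length lists.

```python
import math


def pr_area(points):
    """Trapezoidal area under a curve given as (recall, precision) points."""
    pts = sorted(points)
    return sum((r2 - r1) * (p1 + p2) / 2
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented precision-recall curves, one per candidate value of beta.
curves = {
    0: [(0.0, 1.0), (0.5, 0.6), (1.0, 0.2)],
    1: [(0.0, 1.0), (0.5, 0.8), (1.0, 0.4)],
    2: [(0.0, 1.0), (0.5, 0.7), (1.0, 0.3)],
}
areas = {beta: pr_area(pts) for beta, pts in curves.items()}
best_beta = max(areas, key=areas.get)
print(areas, best_beta)
```

Picking the β with the largest area needs no annotated extraction data, which is what makes the tuning unsupervised.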
Extraction Performance
Lessons learned
• Subtree vs. Chain
  • Overly general patterns were penalized more under the Subtree model
  • Penalized by inverse document frequency (Subtree, Chain)
  • More scenario-specific patterns were promoted (Subtree)
Lessons learned
• Subtree vs. Predicate-Argument
  • Patterns with nominalized predicates
  • Extraction patterns for headlines
    • e.g. (promotion of <post> <person>) ((<person> <post>-no) Shokaku)
  • Noun phrase patterns with a chain of modifiers
    • e.g. (<post> with ministerial authority) (((Daihyoken-no (Aru-f (<post>)))
[Translated from Japanese]
Lessons to be learned
• Enhance the scoring function with modern IR techniques.
  • Some techniques directly help pattern acquisition
    • e.g. relevance feedback
  • However, note the crucial difference between pattern acquisition and IR
    • The same pattern does not appear twice in a document.
• Use a generic variable as the placeholder instead of sticking to Named Entity categories.
  • How robust can a pattern be without semantic restriction?
Conclusion
• We proposed the Subtree model as a generalization of
  • the Predicate-Argument model
  • the Chain model
• Subtree model patterns performed better overall than the other models in Entity Extraction tasks.
• The scoring function needs special consideration for overlapping patterns.
• Parameters can be tuned without supervision using a text classification task.