3 Typical Work on Automatic Relation Extraction

3 Typical Work on Automatic Relation Extraction (Three Important Methods for Automatic Relation Extraction) • 武文娟 (Wu Wenjuan) • 2009.06.04

Outline • DIPRE, 1998 • KnowItAll, 2005 • Open IE, 2007

1 DIPRE: Dual Iterative Pattern Relation Expansion • Sergey Brin, Extracting Patterns and Relations from the World Wide Web. In: Proceedings of the International Workshop on the Web and Databases, 1998.

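1 DIPRE: Dual Iterative Pattern Relation Expansion • The first work to use an iterative method to discover patterns and relations between data entities; it successfully discovered author/title pairs. • Input: a sample set of 5 books (author, title) • Output: automatically expanded to 15,000 books • Some of these books are not listed even by Amazon, the largest online bookstore.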

1.1 Idea • There is a duality between patterns and relations: patterns are used to extract tuples, and tuples are used to discover patterns.

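1.2 Algorithm • Start from a tuple set R • FindOccurrence(R, D): find the occurrences of the tuples in R • Each occurrence is a 7-tuple: (author, title, order, url, prefix, middle, suffix) • Generate and filter patterns from the occurrences • Search with the patterns to find new tuples, then repeat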

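Pattern generation • Input: occurrences O1, O2, …, Ok, each a 7-tuple (author, title, order, url, prefix, middle, suffix) • Group the occurrences by order and middle • For each group Oi, call GenOnePattern(Oi) to obtain a pattern p • A pattern p is a 5-tuple: (order, urlprefix, prefix, middle, suffix) • Matching: the URL must match urlprefix*, and the content must match *prefix, author, middle, title, suffix* • Output p only if it passes the specificity test (p is specific? YES: output p; NO: p is not output)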
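The pattern generation step above can be sketched in Python. This is a minimal illustration, not Brin's implementation: it assumes that a group's prefix is the longest common ending of the occurrences' prefixes, that its suffix and urlprefix are longest common beginnings, and that "specific" simply means no part is empty. The example URLs and surrounding HTML are made up.

```python
from collections import defaultdict
from os.path import commonprefix
from typing import NamedTuple, Optional


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool   # True if the author appears before the title
    url: str
    prefix: str   # text just before the first field
    middle: str   # text between the two fields
    suffix: str   # text just after the second field


class Pattern(NamedTuple):
    order: bool
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def common_suffix(strings):
    """Longest string that every input ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def gen_one_pattern(group) -> Optional[Pattern]:
    """Build one pattern from occurrences that share order and middle."""
    if len(group) < 2:
        return None  # assumption: one occurrence is too little evidence
    first = group[0]
    return Pattern(
        order=first.order,
        urlprefix=commonprefix([o.url for o in group]),
        prefix=common_suffix([o.prefix for o in group]),
        middle=first.middle,
        suffix=commonprefix([o.suffix for o in group]),
    )


def is_specific(p: Pattern) -> bool:
    """Toy specificity test (assumption): no string part may be empty."""
    return all(part for part in (p.urlprefix, p.prefix, p.middle, p.suffix))


def generate_patterns(occurrences):
    groups = defaultdict(list)
    for o in occurrences:
        groups[(o.order, o.middle)].append(o)
    patterns = []
    for group in groups.values():
        p = gen_one_pattern(group)
        if p is not None and is_specific(p):
            patterns.append(p)
    return patterns


# Hypothetical occurrences: the title comes first, so order is False.
occurrences = [
    Occurrence("Isaac Asimov", "The Robots of Dawn", False,
               "http://example.org/books/a.html", "<li><b>", "</b> by ", " ("),
    Occurrence("Charles Dickens", "Great Expectations", False,
               "http://example.org/books/d.html", "<ul><li><b>", "</b> by ", " (1861"),
]
patterns = generate_patterns(occurrences)
```

Both occurrences share the same order and middle, so they fall into one group and yield a single pattern whose urlprefix is `http://example.org/books/`.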

1.3 Experiments • Corpus • A repository of 24 million web pages • 147 GB

1.3 Experiments: Initial sample

1.3 Experiments: 3 Patterns in First Iteration

1.3 Experiments: 4047 new pairs in First Iteration

1.3 Experiments: review

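1.4 Conclusion • DIPRE: • The earliest work on semi-supervised relation learning • Exploits the duality between relations and patterns: on a large-scale corpus such as the Web, starting from a small number of samples as seeds, it iteratively extracts new patterns and instances.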

Outline • DIPRE, 1998 • KnowItAll, 2005 • Open IE, 2007

KNOWITALL • Oren Etzioni et al., University of Washington • Unsupervised Named-Entity Extraction from the Web: An Experimental Study • Artificial Intelligence, 2005

Introduction • Previous work: HMM, CRF • Small-scale corpora • Seed data must be provided • KNOWITALL: an unsupervised, domain-independent system that extracts information from the Web • Key challenges: • Ensuring precision: a novel generate-and-test architecture • Improving recall: • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)

1 Flowchart of the main components in KnowItAll • For every predicate: • creates extraction rules and discriminators • trains the discriminators • Example extraction rule: “cities such as ” NPList

Information Focus • The only domain-specific input is a set of predicates, which specify the domain of interest.

Generic extraction patterns

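Extraction Rules • Generic extraction patterns are combined with the labels of a predicate to generate extraction rules for that domain • For Class1 = ‘city’, the rules are • “cities such as ” NPList • “towns such as ” NPList • Keywords: “cities such as ”, “towns such as ” (submitted to the search engine)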
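As a rough sketch of how generic patterns and class labels might combine into extraction rules, the Python below fills a label slot in two templates and derives the search keywords from each rule. The template notation, the `ExtractionRule` type, and the function name are illustrative assumptions, not KnowItAll's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    keywords: str  # phrase submitted to the search engine
    pattern: str   # pattern applied to the returned text


# Generic templates with a slot for the plural class label (assumed notation).
GENERIC_TEMPLATES = ["{plural} such as NPList", "NP and other {plural}"]


def instantiate_rules(plural_labels):
    """Instantiate every generic template with every class label."""
    rules = []
    for template in GENERIC_TEMPLATES:
        for plural in plural_labels:
            pattern = template.format(plural=plural)
            # Keywords are the literal words of the pattern, without the NP slots.
            keywords = " ".join(
                w for w in pattern.split() if w not in ("NP", "NPList")
            )
            rules.append(ExtractionRule(keywords=keywords, pattern=pattern))
    return rules


rules = instantiate_rules(["cities", "towns"])
for rule in rules:
    print(f"{rule.keywords!r:22} <- {rule.pattern}")
```

With the labels "cities" and "towns", this produces four rules, including the “cities such as ” NPList and “towns such as ” NPList rules from the slide.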

Discriminator • Used to check whether an extracted item is valid • Uses PMI (pointwise mutual information)

Training discriminator: Bootstrapping

The result of training • A set of discriminators, e.g. • Discriminator: <I> is a city • Learned threshold T: 0.000016 • Conditional probabilities: P(PMI > T | class) = 0.83, P(PMI > T | ¬class) = 0.08

An Example • Predicate: city • Bootstrapping: • Generate extraction rules and discriminators • Train all discriminators and select the 5 best

An Example: Trained discriminator

An Example: Main cycle: Extract • Suppose the query is “and other cities”, from a rule with the extraction pattern NP “and other cities” • 2 instances are extracted: Fes, East Coast

An Example: Main cycle: Assess • To compute the probability of City(Fes), the Assessor sends six queries: • “Fes” has 446,000 hits • “Fes is a city” has 14 hits • “cities Fes” has 201 hits • “cities such as Fes” has 10 hits • “cities including Fes” has 4 hits • “Fes and other towns” has 0 hits • Combining the evidence from all discriminators, the final probability is 0.99815 • City(East Coast): • below the threshold for all discriminators • The final probability is 0.00027.
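The assessment step can be illustrated with a small Python sketch. It assumes the hit-count-ratio form of PMI, Hits(discriminator + I) / Hits(I), and a naive Bayes combination of thresholded discriminators. Only the “<I> is a city” discriminator is used, with the threshold and conditional probabilities shown for it earlier, and the prior of 0.5 is an assumption. The result therefore does not reproduce the 0.99815 on the slide, which combines several discriminators.

```python
def pmi(hits_with_discriminator: int, hits_instance: int) -> float:
    """PMI score as a hit-count ratio: Hits(discriminator + I) / Hits(I)."""
    if hits_instance == 0:
        return 0.0
    return hits_with_discriminator / hits_instance


def naive_bayes(prior: float, features) -> float:
    """Combine boolean features given as (fired, p_if_class, p_if_not_class)."""
    pos, neg = prior, 1.0 - prior
    for fired, p_class, p_not_class in features:
        pos *= p_class if fired else 1.0 - p_class
        neg *= p_not_class if fired else 1.0 - p_not_class
    return pos / (pos + neg)


# Hit counts from the slide: "Fes is a city" (14) and "Fes" (446,000).
score = pmi(14, 446_000)
fired = score > 0.000016  # learned threshold T for "<I> is a city"

# One discriminator only, with an assumed prior of 0.5.
prob = naive_bayes(0.5, [(fired, 0.83, 0.08)])
print(f"PMI = {score:.6f}, above threshold: {fired}, P(city) = {prob:.3f}")
```

An instance such as East Coast, which falls below the threshold for every discriminator, would have each factor replaced by its complement, pushing the probability toward zero.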

1.2 Experiment: noise tolerance

1.2 Experiment: finding negative training seeds for the Assessor

1.2 Experiment: search cutoff metric • Signal to Noise ratio (STN): the ratio of positive to negative instances • Query Yield Ratio (QYR): the amount of new information extracted from n web pages

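2 How to improve recall • Pattern Learning (PL) learns: • extraction rules • validation patterns for assessing the correctness of instances • Subclass Extraction (SE): • automatically identifies sub-concepts to make extraction easier • For example, to extract instances of scientists, first find the sub-concepts of scientist (physicist, geographer, etc.), then extract instances of those sub-concepts. • List Extraction (LE): • learns a “wrapper” for each list, and uses the wrapper to extract list elements. • The information extracted with the generic extraction patterns serves as the initial seeds for all three methods, so none of them needs training data supplied by a person in advance.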

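2.1 Pattern Learning (PL): • Generic patterns are usually not the most effective patterns for a specific domain • Examples of domain-specific patterns: • “the film <film> starring” • “headquartered in <city>”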

Pattern Learning algorithm • Start with I, a set of seed instances • Search: for each seed instance i, search the Web and collect the context of i • Filter: rank the candidate patterns by recall and precision • Output: the best patterns • Estimating recall & precision efficiently: take the positive examples of one class to be negative examples for all other classes.
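A minimal sketch of the filtering step, under simplifying assumptions: precision is estimated as c / (c + n) and recall as c / |seeds|, where c counts target-class seeds seen with a pattern and n counts seeds of other classes seen with it. The data layout, the function name, and the absence of any smoothing are illustrative choices rather than the paper's exact estimator.

```python
from collections import Counter


def score_patterns(contexts_by_class, target):
    """Estimate (precision, recall) of each context pattern for `target`.

    contexts_by_class maps a class name to
    {seed instance: set of patterns observed around that seed}.
    Seeds of every other class are treated as negative examples.
    """
    pos_hits, neg_hits = Counter(), Counter()
    for cls, contexts in contexts_by_class.items():
        for patterns in contexts.values():
            for p in patterns:
                if cls == target:
                    pos_hits[p] += 1
                else:
                    neg_hits[p] += 1
    n_seeds = len(contexts_by_class[target])
    scores = {}
    for p, c in pos_hits.items():
        precision = c / (c + neg_hits[p])
        recall = c / n_seeds
        scores[p] = (precision, recall)
    return scores


# Made-up contexts for two classes.
contexts = {
    "city": {
        "Paris": {"headquartered in <X>", "<X> and other"},
        "Tokyo": {"headquartered in <X>"},
    },
    "film": {
        "Alien": {"the film <X> starring", "<X> and other"},
    },
}
scores = score_patterns(contexts, "city")
best = max(scores, key=lambda p: scores[p])
```

Here “headquartered in <X>” is seen only with city seeds, so it scores higher than the ambiguous “<X> and other”, which also appears with a film seed.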

3 of the most productive rules

How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)

2.2 Subclass Extraction: Basic subclass extraction (SEbase) • Extracting candidate subclasses • The generic extraction rules extract subclasses as well as instances. How can the two be told apart? • Instances: proper nouns, capitalized: Scientists such as Einstein, Newton, … • Subclasses: common nouns: Scientists such as physical scientist, biologist, … • Assessing candidate subclasses, a combination method • Does the subclass name contain the superclass name? • “microbiologist” is a subclass of “biologist” • Is there a parent-child relation in WordNet? • SEbase Assessor: • bootstrap training method
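The two surface cues above (capitalization, and whether the subclass name contains the superclass name) can be sketched as follows. This is a simplified illustration with hypothetical function names. The WordNet check and the bootstrapped assessor are omitted, and real capitalization tests would need to handle sentence-initial words and multiword names more carefully.

```python
def split_candidates(noun_phrases):
    """Capitalized phrases are treated as instances, the rest as subclasses."""
    instances, subclasses = [], []
    for phrase in noun_phrases:
        words = phrase.split()
        if words and all(w[0].isupper() for w in words):
            instances.append(phrase)
        else:
            subclasses.append(phrase)
    return instances, subclasses


def name_contains_superclass(candidate: str, superclass: str) -> bool:
    """Morphology cue: the candidate's head word ends with the superclass name."""
    words = candidate.lower().split()
    if not words or candidate.lower() == superclass.lower():
        return False
    return words[-1].endswith(superclass.lower())


extracted = ["Einstein", "Newton", "physical scientist", "biologist"]
instances, subclasses = split_candidates(extracted)
print(instances)   # proper nouns
print(subclasses)  # common nouns, candidate subclasses of "scientist"
print(name_contains_superclass("microbiologist", "biologist"))
```

A candidate such as "biologist" does not contain "scientist", which is why the slide combines this cue with other evidence rather than relying on it alone.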

Rules for subclass extraction

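Improving Subclass Extraction Recall • For the extracted candidate subclasses, use the last two rules in Table 2 to extract their siblings, which yields more candidate subclasses. • Two kinds of subclass • Context-independent subclass • Person - Priest • Context-dependent subclass • Person - Pharmacist • Two assessing methods • SEself: trains a classifier by self-training • SEiter: iteratively computes a confidence for each extraction rule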

Experimental result: Context-independent subclass

Experimental result: Context-dependent subclass

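How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE) • Unlike the first two methods, which process unstructured text, • LE exploits the structure of web pages to extract information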

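2.3 List Extractor • Many lists on web pages are generated from databases, so they usually have clear structural regularities • Basic approach • Locate the lists in a web page • Learn a wrapper that automatically extracts all items in the list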
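A toy Python sketch of the wrapper idea: given a page and a few seed keywords, take the markup immediately around each keyword as a (prefix, suffix) pair, and if all seeds agree, use that pair to pull out every item in the list. The function names and the HTML snippet are hypothetical, and the sketch ignores how the real system chooses among competing wrappers.

```python
import re


def learn_wrapper(html: str, keywords):
    """Return the (prefix, suffix) markup shared by all seed keywords, or None."""
    contexts = set()
    for kw in keywords:
        i = html.find(kw)
        if i == -1:
            continue  # this seed does not occur on the page
        start = html.rfind("<", 0, i)        # start of the tag before the keyword
        end = html.find(">", i + len(kw))    # end of the tag after the keyword
        if start == -1 or end == -1:
            continue
        contexts.add((html[start:i], html[i + len(kw):end + 1]))
    # Accept the wrapper only if every located seed has the same context.
    return contexts.pop() if len(contexts) == 1 else None


def apply_wrapper(html: str, wrapper):
    """Extract every text item enclosed by the wrapper's prefix and suffix."""
    prefix, suffix = wrapper
    return re.findall(re.escape(prefix) + r"([^<]+)" + re.escape(suffix), html)


# Made-up selection list containing two seeds and two unseen items.
html = ("<select>"
        "<option>Paris</option>"
        "<option>Fes</option>"
        "<option>Tokyo</option>"
        "<option>Lima</option>"
        "</select>")
wrapper = learn_wrapper(html, ["Paris", "Tokyo"])
items = apply_wrapper(html, wrapper)
print(wrapper, items)
```

Two seeds are enough here to recover the other entries of the list, which is how a single page can yield many candidates for the Assessor.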

Learning a Wrapper

An Example: W3 is the best wrapper • Its HTML block is as small as possible • It matches as many keywords as possible

Experiments of LE

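Discussion • With LE, a large amount of information can be extracted with relatively few queries • Although its precision is not high enough, LE: • helps reduce the number of candidates, which greatly reduces the Assessor's workload • can find information that standard IE methods miss • e.g. rare cities in long selection lists in HTML documents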

2.4 Comparison of PL, SE and LE: recall • Classes compared: film, city, scientist • For extracting instances of general concepts, SE is more effective

Comparison of PL, SE and LE: extraction rate • extraction rate = num(unique extractions) / num(queries)

the Trade-off between Recall and Precision

3 Conclusion • KnowItAll: Unsupervised information extraction from the Web • Input: a set of predicate names • No hand-labeled training examples of any kind • Precision • utilizes a novel generate-and-test architecture • Extractor, Assessor • Recall • Pattern Learning, Subclass Extraction, List Extraction
