Three Typical Works on Automatic Relation Extraction • 武文娟 (Wu Wenjuan) • 2009.06.04
Outline • DIPRE, 1998 • KnowItAll, 2005 • Open IE, 2007
1 DIPRE: Dual Iterative Pattern Expansion • Sergey Brin, Extracting Patterns and Relations from the World Wide Web. In: Proceedings of the International Workshop on the Web and Databases, 1998.
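1 DIPRE: Dual Iterative Pattern Expansion • The first work to use an iterative method to discover patterns and relations between data entities; it successfully extracted (author, title) pairs. • Input: a sample set of 5 books (author, title) • Output: automatically expanded to 15,000 books • Some of these books were not listed even by Amazon, the largest online bookstore.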
1.1 Idea • There is a duality between patterns and relations • [Diagram: patterns extract tuples; tuples are used to discover patterns]
1.2 Algorithm • Occurrence: a 7-tuple (author, title, order, url, prefix, middle, suffix) • [Flowchart: R (tuple set) → FindOccurrence(R, D) → occurrences → generate & filter → patterns → search → R]
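The occurrence-finding step can be sketched as follows. This is a minimal illustration, not Brin's implementation: the function name `find_occurrences`, the in-memory `docs` dictionary standing in for the Web corpus D, and the context and middle length limits (`ctx`, `max_middle`) are all assumptions made for the example.

```python
import re
from typing import NamedTuple

class Occurrence(NamedTuple):
    """The 7-tuple from the slide."""
    author: str
    title: str
    order: bool   # True if the author appears before the title
    url: str
    prefix: str
    middle: str
    suffix: str

def find_occurrences(seeds, docs, ctx=10, max_middle=20):
    """Scan every document for each seed pair, in both orders.

    seeds: list of (author, title) pairs.
    docs:  dict mapping url -> page text (hypothetical stand-in for the corpus).
    """
    occurrences = []
    for url, text in docs.items():
        for author, title in seeds:
            for first, second, order in ((author, title, True), (title, author, False)):
                # A short, lazily matched gap between the two strings is the "middle".
                regex = re.escape(first) + "(.{1,%d}?)" % max_middle + re.escape(second)
                for m in re.finditer(regex, text):
                    occurrences.append(Occurrence(
                        author, title, order, url,
                        prefix=text[max(0, m.start() - ctx):m.start()],
                        middle=m.group(1),
                        suffix=text[m.end():m.end() + ctx],
                    ))
    return occurrences

docs = {"http://example.org/books.html":
        "<li><b>Isaac Asimov</b> wrote <i>Foundation</i></li>"}
occs = find_occurrences([("Isaac Asimov", "Foundation")], docs)
```

Each resulting occurrence records the text immediately around the pair, which is what pattern generation later generalizes over.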
Pattern generation • Occurrence: a 7-tuple (author, title, order, url, prefix, middle, suffix) • Pattern p: a 5-tuple (order, urlprefix, prefix, middle, suffix) • URL: must match urlprefix* • Content: must match *prefix, author, middle, title, suffix* • [Flowchart: occurrences O1, O2, …, Ok → group by order and middle → for each group Oi, GenOnePattern(Oi) → pattern p → “p is specific?” (YES / NO) → if YES, output p]
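A rough sketch of generating one pattern from a group of occurrences is given below. It assumes the usual reading of this step: the group must share order and middle, the URL prefix is the longest common prefix of the URLs, the pattern prefix is the longest common suffix of the occurrence prefixes, and the pattern suffix is the longest common prefix of the occurrence suffixes. The specificity test (product of the field lengths times the number of supporting books) and the threshold `min_specificity` are illustrative assumptions, not values from the slides.

```python
import os

def common_prefix(strings):
    # os.path.commonprefix compares character by character.
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    return common_prefix(s[::-1] for s in strings)[::-1]

def gen_one_pattern(occurrences, min_specificity=20):
    """occurrences: list of (author, title, order, url, prefix, middle, suffix).

    Returns (order, urlprefix, prefix, middle, suffix), or None if the group
    is inconsistent or the pattern is not specific enough.
    """
    orders = {o[2] for o in occurrences}
    middles = {o[5] for o in occurrences}
    if len(orders) != 1 or len(middles) != 1:
        return None
    books = {(o[0], o[1]) for o in occurrences}
    if len(books) < 2:
        return None  # a pattern backed by a single book is not trusted
    urlprefix = common_prefix(o[3] for o in occurrences)
    prefix = common_suffix(o[4] for o in occurrences)
    suffix = common_prefix(o[6] for o in occurrences)
    middle = next(iter(middles))
    # Hypothetical specificity measure: longer fields mean a more specific pattern.
    specificity = len(urlprefix) * len(prefix) * len(middle) * len(suffix)
    if specificity * len(books) <= min_specificity:
        return None
    return (next(iter(orders)), urlprefix, prefix, middle, suffix)

group = [
    ("Isaac Asimov", "Foundation", True, "http://example.org/a.html",
     "<li><b>", "</b> by <i>", "</i></li>"),
    ("Frank Herbert", "Dune", True, "http://example.org/b.html",
     "x <li><b>", "</b> by <i>", "</i> (1965)"),
]
pattern = gen_one_pattern(group)
```

Here the two occurrences agree on order and middle, so the sketch yields a single pattern whose prefix and suffix are the shared HTML context.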
1.3 Experiments • Corpus • A repository of 24 million web pages • 147 GB
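1.4 Conclusion • DIPRE: • Early work on semi-supervised relation learning • Exploits the duality between relations and patterns: starting from a few sample tuples as seeds, it iteratively extracts new patterns and new instances from a large-scale corpus such as the Web.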
Outline • DIPRE, 1998 • KnowItAll, 2005 • Open IE, 2007
KNOWITALL • Oren Etzioni et al., University of Washington • Unsupervised Named-Entity Extraction from the Web: An Experimental Study • Artificial Intelligence, 2005
Introduction • Previous work: HMM, CRF • Small corpora • Seed data must be provided • KNOWITALL: an unsupervised, domain-independent system that extracts information from the Web • Key challenges: • Maintaining precision: a novel generate-and-test architecture • Improving recall: • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)
1 Flowchart of the main components in KnowItAll • For every predicate: • create extraction rules and discriminators • train the discriminators • Example rule: “cities such as ” NPList
Information Focus • The only domain-specific input is a set of predicates, which specify the domain of interest.
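Extraction Rules • Generic extraction templates are combined with the labels of a predicate to produce extraction rules for that domain • For Class1 = ‘city’, the rules are • “cities such as ” NPList • “towns such as ” NPList • Keywords: “cities such as ”, “towns such as ” (submitted to the search engine)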
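The instantiation of a generic template can be illustrated with a small sketch. The names `make_rules` and `extract_instances` are hypothetical, and the capitalized-word regular expression is only a crude stand-in for the noun-phrase chunker that a real NPList extractor would need.

```python
import re

# Generic template; {label} is filled with a plural class label such as "cities".
GENERIC_TEMPLATES = ["{label} such as"]

def make_rules(plural_labels, templates=GENERIC_TEMPLATES):
    """Return the keyword strings that would be sent to the search engine."""
    return [t.format(label=label) for label in plural_labels for t in templates]

# Assumption: a proper noun is one or more capitalized words separated by spaces.
PROPER_NOUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

def extract_instances(keywords, sentence):
    """Return proper nouns that follow the rule keywords in the sentence."""
    match = re.search(re.escape(keywords) + r" (.+)", sentence)
    if match is None:
        return []
    return PROPER_NOUN.findall(match.group(1))

rules = make_rules(["cities", "towns"])
found = extract_instances(rules[0], "We visited cities such as Paris, New York and Fes.")
```

Keeping the templates separate from the class labels is what lets the same rule set be reused for any predicate.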
Discriminator • Used to check whether an extracted fact is valid • Based on PMI (pointwise mutual information)
The result of training • A set of discriminators, e.g. • Discriminator: <I> is a city • Learned threshold T: 0.000016 • Conditional probabilities: P(PMI > T | class) = 0.83, P(PMI > T | ¬class) = 0.08
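The slides do not spell out how the PMI score is computed. The sketch below assumes the hit-count ratio commonly associated with KnowItAll, namely hits for the discriminator phrase with the instance substituted in, divided by hits for the instance alone. The function name `pmi` is illustrative; the hit counts are the ones quoted for “Fes” in the worked example of this deck.

```python
def pmi(hits_with_discriminator, hits_instance):
    """Assumed PMI estimate: hits(discriminator + instance) / hits(instance)."""
    if hits_instance == 0:
        return 0.0
    return hits_with_discriminator / hits_instance

T = 0.000016                 # learned threshold for "<I> is a city" (from the slide)
score = pmi(14, 446_000)     # "Fes is a city": 14 hits; "Fes": 446,000 hits
fires = score > T            # the Boolean feature passed on to the Assessor
```

Under this assumption the “<I> is a city” discriminator fires for Fes, since 14 / 446,000 is roughly twice the threshold.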
An Example • Predicate: city • Bootstrapping: • Generate extraction rules and discriminators • Train all discriminators and select the 5 best
An Example • Main cycle: Extract • Suppose that • the query is “and other cities” • from a rule with extraction pattern: NP “and other cities” • 2 instances: Fes, East Coast
An Example • Main cycle: Assess • To compute the probability of City(Fes), the Assessor sends six queries: • “Fes” has 446,000 hits • “Fes is a city” has 14 hits • “cities Fes” has 201 hits • “cities such as Fes” has 10 hits • “cities including Fes” has 4 hits • “Fes and other towns” has 0 hits • Combining the evidence from all discriminators, the final probability is 0.99815 • City(East Coast) • below threshold for all discriminators • The final probability is 0.00027
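One way to combine the Boolean discriminator features into a single probability is a naive Bayes update, sketched below. This is an illustration under stated assumptions rather than the system's exact Assessor: the prior of 0.5 is invented, and the conditional probabilities 0.83 and 0.08 quoted for the “<I> is a city” discriminator are reused for all five discriminators, so the outputs will not reproduce the 0.99815 and 0.00027 on the slide.

```python
def assess(prior, evidence):
    """Naive Bayes combination of discriminator features.

    evidence: list of (fired, p_fire_given_class, p_fire_given_not_class).
    Returns P(class | evidence), assuming the features are independent.
    """
    pos, neg = prior, 1.0 - prior
    for fired, p_class, p_not_class in evidence:
        pos *= p_class if fired else 1.0 - p_class
        neg *= p_not_class if fired else 1.0 - p_not_class
    return pos / (pos + neg)

# Hypothetical setting: every discriminator shares the same conditionals.
P_C, P_N = 0.83, 0.08
fes = assess(0.5, [(True, P_C, P_N)] * 4 + [(False, P_C, P_N)])   # 4 fire, 1 does not
east_coast = assess(0.5, [(False, P_C, P_N)] * 5)                 # none fire
```

Even with these made-up parameters the qualitative behaviour matches the slide: several firing discriminators push the probability close to 1, while none firing pushes it close to 0.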
1.2 Experiment: search cutoff metric • Signal to Noise ratio (STN): the ratio of positive to negative examples • Query Yield Ratio (QYR): the amount of new information extracted from n web pages
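2 How to improve recall • Pattern Learning (PL): • learns extraction rules • learns validation patterns for assessing the correctness of instances • Subclass Extraction (SE): • automatically identifies sub-concepts to make extraction easier • For example, to extract instances of scientists, first find the sub-concepts of scientist (physicist, geographer, etc.), then extract instances of those sub-concepts. • List Extraction (LE): • learns a “wrapper” for each list, and uses the wrapper to extract list elements. • The facts extracted with the generic extraction templates serve as the initial seeds for all three methods, so none of them needs hand-supplied training data.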
2.1 Pattern Learning (PL): • Generic patterns are usually not the most effective patterns for a specific domain • “the film <film> starring” • “headquartered in <city>”
Pattern Learning algorithm • [Flowchart: I, a set of seed instances → search → context of each instance i → filter by recall & precision → best patterns] • Estimating recall & precision efficiently • take the positive examples of one class to be negative examples for all other classes
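The trick of treating one class's positives as negatives for every other class can be sketched as a simple counting exercise. The function `score_patterns` and the unsmoothed precision and recall estimates are assumptions for illustration; the slides do not give the exact formulas.

```python
from collections import Counter, defaultdict

def score_patterns(observations):
    """Estimate precision and recall per (pattern, class).

    observations: iterable of (pattern, class_name), one entry per seed
    occurrence whose context matched the pattern.
    Precision: share of the pattern's matches that belong to the class
    (matches from other classes count as negatives).
    Recall: share of the class's seed occurrences covered by the pattern.
    """
    per_pattern = defaultdict(Counter)
    per_class = Counter()
    for pattern, cls in observations:
        per_pattern[pattern][cls] += 1
        per_class[cls] += 1
    scores = {}
    for pattern, counts in per_pattern.items():
        total = sum(counts.values())
        for cls, hits in counts.items():
            scores[(pattern, cls)] = (hits / total, hits / per_class[cls])
    return scores

observations = (
    [("headquartered in <X>", "city")] * 3
    + [("the film <X> starring", "film")] * 2
    + [("<X> is", "city"), ("<X> is", "film")]
)
scores = score_patterns(observations)
```

A pattern such as “<X> is” matches seeds of both classes, so its estimated precision drops, which is how the filter step discards overly general patterns without any hand-labeled negatives.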
How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)
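2.2 Subclass Extraction • Basic subclass extraction (SEbase) • Extracting candidate subclasses • The generic extraction rules extract subclasses along with instances. How can the two be told apart? • Instances: proper nouns, capitalized, e.g. Scientists such as Einstein, Newton, … • Subclasses: common nouns, e.g. Scientists such as physical scientist, biologist, … • Assessing candidate subclasses: a combination of methods • whether the subclass name contains the parent class name • “microbiologist” is a subclass of “biologist” • whether WordNet records a parent-child relation between the two • SEbase Assessor: • bootstrap training method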
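The two surface heuristics above can be sketched in a few lines. The helper names are hypothetical, capitalization of the first character is used as a rough proxy for “proper noun”, and the WordNet check is left out because it needs an external lexical database.

```python
def split_candidates(noun_phrases):
    """Separate extracted noun phrases into instances and candidate subclasses.

    Assumption: a phrase starting with an uppercase letter is a proper noun
    (instance); one starting with a lowercase letter is a common noun (subclass).
    """
    instances = [np for np in noun_phrases if np[:1].isupper()]
    subclasses = [np for np in noun_phrases if np[:1].islower()]
    return instances, subclasses

def name_contains_parent(candidate, parent):
    """Morphological check: does the candidate's name contain the parent's name?"""
    return candidate != parent and parent in candidate

instances, subclasses = split_candidates(
    ["Einstein", "Newton", "physical scientist", "biologist"]
)
```

The containment test accepts “microbiologist” under “biologist” and “physical scientist” under “scientist”, but it says nothing about “biologist” under “scientist”, which is where the WordNet lookup would be needed.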
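Improving Subclass Extraction Recall • For the extracted candidate subclasses, use the last two rules in Table 2 to extract their siblings, which yields more candidate subclasses. • Two kinds of subclasses • Context-independent subclass • Person - Priest • Context-dependent subclass • Person - Pharmacist • Two assessing methods • SEself: trains a classifier by self-training • SEiter: iteratively computes a confidence score for each extraction rule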
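How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE) • Unlike the first two methods, which process unstructured text, • LE exploits the structure of web pages to extract information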
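2.3 List Extractor • Many lists on web pages are generated from databases, so they usually have clear structural regularities • Basic method • locate the lists in a web page • learn a wrapper that automatically extracts all items in each list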
An Example • W3 is the best: • its HTML block is as small as possible • it matches as many keywords as possible
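A toy version of wrapper learning is sketched below to make the idea concrete. It is a deliberately simple illustration and not the List Extractor's actual algorithm: it assumes each seed keyword occurs once in the page, takes the shared left and right context of the seeds as the wrapper, and ignores the choice among competing HTML blocks. The names `learn_wrapper` and `apply_wrapper` are hypothetical.

```python
import os
import re

def learn_wrapper(html, seeds):
    """Learn a (left, right) context pair shared by all seeds found in the page."""
    lefts, rights = [], []
    for seed in seeds:
        i = html.find(seed)
        if i >= 0:
            lefts.append(html[:i])
            rights.append(html[i + len(seed):])
    if len(lefts) < 2:
        return None  # need at least two matched keywords to generalize
    # Longest common suffix of the left contexts, longest common prefix of the right.
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        return None
    return left, right

def apply_wrapper(html, wrapper):
    """Extract every string that sits between the learned left and right context."""
    left, right = wrapper
    # Lookarounds keep adjacent items from consuming each other's delimiters.
    pattern = "(?<=" + re.escape(left) + ")(.*?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, html)

page = "<ul><li>Paris</li><li>Fes</li><li>Oslo</li></ul>"
wrapper = learn_wrapper(page, ["Paris", "Oslo"])
items = apply_wrapper(page, wrapper)
```

Two known cities are enough here to recover the third list item, which is the effect the slides describe: a few seeds plus page structure yield many new candidates per page.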
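Discussion • LE extracts a large amount of information with relatively few queries • Its precision is not high, but • it narrows down the set of candidate facts, which greatly reduces the Assessor's workload • it can find information that standard IE methods miss • e.g. rare cities in long selection lists in HTML documents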
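2.4 Comparison of PL, SE and LE: recall • [Chart: recall for the classes film, city, scientist] • SE is more effective for extracting instances of general concepts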
Comparison of PL, SE and LE: extraction rate • extraction rate = num(unique extractions) / num(queries)
3 Conclusion • KnowItAll: Unsupervised information extraction from the Web • Input: a set of predicate names • no hand-labeled training examples of any kind • Precision • utilizes a novel generate-and-test architecture • Extractor, Assessor • Recall • Pattern Learning, Subclass Extraction, List Extraction