Character-Level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Summary
We illustrated…
• the construction of character-based wrappers used in SEAL
• a method to extend SEAL to learn binary relational concepts
We showed that…
• character-based wrappers perform better than HTML-based wrappers
• binary SEAL has good performance
Background – SEAL
• Set Expander for Any Language (Wang & Cohen, ICDM 2007)
• An example of set expansion
  • Given an input query (seeds): { survivor, amazing race }
  • The output answer is: { american idol, big brother, ... }
Features
• Independent of human and markup language
  • Supports seeds in English, Chinese, Japanese, ...
  • Accepts documents in HTML, XML, SGML, TeX, …
• Does not require pre-annotated training data
  • Utilizes a readily available corpus: the World Wide Web
Research contributions
• Automatically construct wrappers for extracting candidate items
• Rank candidates using random walk
SEAL’s Architecture
• Fetcher: downloads web pages containing all seeds
• Extractor: learns and constructs wrappers
• Ranker: ranks candidate items using random walk
[Figure: architecture diagram with example items Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …, Canon, Nikon, Olympus]
Wrapper Learner
• The current WL learns only unary relations
  • e.g., x is a mayor
  • A unary wrapper consists of a pair of left (L) and right (R) context strings
  • Extracts all strings between L and R
• The extended WL learns binary relations
  • e.g., x is the mayor of city y
  • A binary wrapper has an additional middle (M) context string
  • Extracts string pairs between L and M and between M and R
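To make the extraction step concrete, here is a minimal Python sketch of how a unary (L, R) wrapper and a binary (L, M, R) wrapper could be applied to a document. It is not SEAL's implementation: the function names `extract_unary` and `extract_binary`, the use of regular expressions, and the choice of shortest (non-greedy) matches are all assumptions made for illustration, and the bracketed mock document is invented.

```python
import re

def extract_unary(document, left, right):
    """Return every shortest non-empty string found between `left` and `right`."""
    # re.escape treats the context strings as literal characters, not regex syntax.
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)

def extract_binary(document, left, middle, right):
    """Return (x, y) pairs: x lies between `left` and `middle`, y between `middle` and `right`."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return re.findall(pattern, document, flags=re.DOTALL)

# Mock documents in a made-up markup (hypothetical data).
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
binary_doc = "[mayor X|city A][mayor Y|city B]"

items = extract_unary(unary_doc, "<li>", "</li>")
pairs = extract_binary(binary_doc, "[", "|", "]")
```

Because the wrappers are plain character strings, the same two functions work on any markup, which is the point of a character-level approach.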
Real Unary Wrappers
• Given seeds: Ford, Nissan, Toyota
• Examples of wrappers and extractions:
Mock Unary Example
• Given seeds: Ford, Nissan, Toyota
• Example document written in an unknown mark-up language:
• Context tries for mock example:
• Constructed unary wrappers:
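As a rough illustration of wrapper construction, the Python sketch below learns unary wrappers by brute force. It assumes that a wrapper is the longest left context and longest right context shared by one occurrence of every seed. It does not build the context tries shown in the slides; it enumerates occurrence combinations instead, which is only practical for tiny documents. The function names and the mock document are hypothetical.

```python
import os
from itertools import product

def occurrences(document, seed):
    """Start offsets of every occurrence of `seed` in `document`."""
    positions, start = [], document.find(seed)
    while start != -1:
        positions.append(start)
        start = document.find(seed, start + 1)
    return positions

def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_unary_wrappers(document, seeds):
    """Return the set of (left, right) contexts shared by one occurrence of each seed."""
    spans_per_seed = [
        [(pos, pos + len(seed)) for pos in occurrences(document, seed)]
        for seed in seeds
    ]
    wrappers = set()
    # Try every way of picking one occurrence per seed (exponential; sketch only).
    for combo in product(*spans_per_seed):
        left = common_suffix([document[:start] for start, _ in combo])
        right = os.path.commonprefix([document[end:] for _, end in combo])
        if left and right:  # keep only wrappers with both contexts non-empty
            wrappers.add((left, right))
    return wrappers

# Mock document in a made-up markup (hypothetical data).
mock_doc = "GM[Ford]xx[Honda]yy[Nissan]zz[Toyota]ww"
learned = learn_unary_wrappers(mock_doc, ["Ford", "Nissan", "Toyota"])
```

In this mock document the only context shared by all three seeds is an opening and a closing bracket, so applying the learned wrapper would also pick up the unseeded item Honda.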
Unary SEAL Evaluation
• Metric: Mean Average Precision
• Dataset: 36 datasets (Wang & Cohen, ICDM 2007)
• Evaluated on 5 types of wrappers
  • Type 1 is the least strict (SEAL’s default)
  • Type 5 is the most strict, but still less strict than any HTML wrapper
• Result: stricter wrappers perform worse
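For reference, mean average precision can be computed as in the Python sketch below. It uses the standard textbook definition, where each run's average precision is normalized by the number of relevant items. The slides do not say which variant the evaluation used, so treat the details (and the function names) as assumptions.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of correct items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum, seen = 0, 0.0, set()
    for rank, item in enumerate(ranked, start=1):
        # Count each correct item once, at the rank where it first appears.
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy runs (hypothetical data): correct items at ranks 1 and 3, then at rank 2.
toy_runs = [(["a", "x", "b"], {"a", "b"}), (["x", "a"], {"a"})]
toy_map = mean_average_precision(toy_runs)
```

The first toy run scores (1/1 + 2/3) / 2 = 5/6 and the second 1/2, so the mean over both is 2/3.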
Binary Wrapper Construction
• Keep track of all middle contexts:
• In the unary code, replace Intersect with:
Binary SEAL Evaluation
• Relational datasets
  • Surveyed more than a dozen
  • Randomly selected five:
• Bootstrapped results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)