This article provides an overview of information extraction from the web, including different approaches for generating wrappers, such as manually constructed and machine learning-based methods. Examples of each approach are also discussed.
Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion
Terminology • IE = Information Extractor • WIE = Web Information Extractor • TIE = Traditional Information Extractor
Background • Abundant information on the web • Structured [tables] • Semi-structured [HTML / XML] • Free text [blogs] • TIE vs. WIE • Scalability • Cost • Flexibility • General approach • Wrappers
Wrapper • Wrapper • sets of highly accurate rules that extract a particular page's content • a function from a page to the set of tuples it contains • Flow of WIE based on wrappers • collecting training pages • labeling training examples [optional] • generalizing extraction rules (wrappers) • extracting the relevant data • outputting the result in an appropriate format
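The "function from a page to the set of tuples it contains" view can be sketched in a few lines. This is a minimal illustration, not a wrapper from any of the systems discussed: the page layout, the `RECORD_RULE` pattern, and the `wrapper` function name are all assumptions made for the example.

```python
import re
from typing import Set, Tuple

# Hypothetical extraction rule: each record on the page is assumed to look like
# <li><b>NAME</b> <i>PRICE</i></li>. Real wrappers hold rules tuned to one site.
RECORD_RULE = re.compile(r"<li><b>(.*?)</b>\s*<i>(.*?)</i></li>")


def wrapper(page: str) -> Set[Tuple[str, str]]:
    """Map a page to the set of (name, price) tuples it contains."""
    return set(RECORD_RULE.findall(page))


# Made-up page used only to exercise the rule.
page = (
    "<ul>"
    "<li><b>Ford</b> <i>20000</i></li>"
    "<li><b>Nissan</b> <i>18000</i></li>"
    "</ul>"
)
tuples = wrapper(page)
```

The rule is tied to one page layout, which is why the rest of the flow (collecting pages, labeling, generalizing) is about producing such rules with less manual effort.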
Approaches for generating wrappers • Degree of automation • Manually constructed • Supervised [machine learning] • Semi-supervised [machine learning] • Unsupervised [machine learning] • Source: A Survey of Web Information Extraction Systems @ TKDE 06
Manually Constructed Wrapper • Definition: Manually develop rules/commands/patterns for extracting data • Examples • TSIMMIS [Hammer et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu et al., 2000]
Manually Constructed Wrapper • Disadvantages • Time-consuming to write rules • Non-general • Need to understand the structure of the document • Requires users with special expertise [programmers]
Wrapper with Supervised Learning • Supervised learning • A machine learning task of inferring a function from supervised (labeled) training data • Examples • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • SoftMealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]
Wrapper with Supervised Learning • Disadvantage • Manually labeling training data is time-consuming • vs. manually constructed • general users instead of programmers can label training data, thus reducing the cost of wrapper generation
Wrapper with Semi-Supervised Learning • Semi-supervised learning • A class of machine learning techniques that use both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data • Examples • SEAL [Richard C. Wang, 2009] • Automatic Wrapper [Nilesh Dalvi, 2011] • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]
Wrapper with Unsupervised Learning • Unsupervised learning • The problem of finding hidden structure in unlabeled data • Examples • RoadRunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai et al., 2005]
Manually-Constructed Example • TSIMMIS • One of the first approaches to give a framework for manually building Web wrappers • Wrapper • Manually constructed as commands • Input: a specification file that declaratively states where the data of interest is located on the page • Output: Object Exchange Model (OEM) • Source: Semi-structured Data: The TSIMMIS Experience @ ADBIS 97
Manually-Constructed Example • Each command is of the form [variables, source, pattern], where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source • variables is a list of variables that hold the extracted results • Note: in a pattern • # means “save in the variable” • * means “discard”
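The [variables, source, pattern] command form can be illustrated with a small interpreter. This is a simplified sketch and not the TSIMMIS specification syntax: the names `pattern_to_regex` and `run`, the convention that the fetched page is stored under the key `"page"`, and the sample page are assumptions for the example. Only the meaning of `#` (save) and `*` (discard) is taken from the slide.

```python
import re
from typing import Dict, List, Tuple

Command = Tuple[List[str], str, str]  # (variables, source, pattern)


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern: '#' saves text, '*' discards text, the rest is literal."""
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")   # save in the next variable
        elif piece == "*":
            parts.append(".*?")     # discard
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts), re.DOTALL)


def run(commands: List[Command], page: str) -> Dict[str, str]:
    """Apply commands in order; each source names the page or an earlier variable."""
    env = {"page": page}  # assumed convention: the raw page is the initial source
    for variables, source, pattern in commands:
        match = pattern_to_regex(pattern).fullmatch(env[source])
        if match is None:
            raise ValueError(f"pattern {pattern!r} did not match source {source!r}")
        env.update(zip(variables, match.groups()))
    return env


# Made-up page and specification used only to exercise the interpreter.
page = "<html><title>Weather</title><body><td>Paris</td><td>21C</td></body></html>"
commands = [
    (["body"], "page", "*<body>#</body>*"),
    (["city", "temp"], "body", "<td>#</td><td>#</td>"),
]
result = run(commands, page)
```

Chaining commands through named variables is what lets a specification narrow down from the whole page to individual fields, and it is also why the author must know the page structure in advance.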
Manually-Constructed Example • [Figure: a specification file, the source Web page, and the resulting OEM output]
Supervised Learning Example • SRV • Top-down relational algorithm that generates single-slot extraction rules • The learning algorithm works like FOIL • Token-oriented • Logic rules • Source: Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98
SRV Learning process • SRV algorithm • Input: annotated documents and features • Induce rules on 2/3 of the training data • Validate the rules on the remaining 1/3 of the training data • Repeat the learning 3 times • Output: rules that predict a single slot
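One way to read the 2/3 and 1/3 split repeated three times is as a three-fold rotation, where each third of the training documents serves once as the validation set. The sketch below shows only that data protocol under this assumption; SRV's rule induction is not reproduced, and the function name `rotating_splits` and the document names are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rotating_splits(
    docs: Sequence[T], folds: int = 3
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training, validation) pairs; each fold validates on a different third."""
    for k in range(folds):
        validation = [d for i, d in enumerate(docs) if i % folds == k]
        training = [d for i, d in enumerate(docs) if i % folds != k]
        yield training, validation


# Placeholder document names; a real run would pass annotated documents.
docs = [f"doc{i}" for i in range(6)]
splits = list(rotating_splits(docs))
```

Rules induced on each training part would then be scored on the matching validation part before being kept for prediction.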
Supervised Learning Example • [Figure: rules for extracting the rating, shown with the source Web page]
Semi-Supervised Learning Example • SEAL (Set Expander for Any Language) • Expands entities automatically by utilizing resources from the Web in a language-independent fashion • Flow of SEAL • Extracting wrappers • Ranking wrappers / candidates • Source: Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09
Semi-Supervised Learning Example • Extracting wrappers • Input: seed instances and a document • Find the seed instances in the document • Generate the left/right contexts • Mine the left/right contexts • Find all the longest possible strings s from the left context set, given some constraints • For each found string s, find the longest possible string s′ from the right context such that s and s′ bracket at least one occurrence of every given seed in the document • Note: the left/right contexts are maintained in a Patricia trie
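The bracketing idea can be sketched at character level as follows. This is a simplified illustration rather than SEAL itself: it uses brute-force substring checks instead of a Patricia trie, grows contexts only from occurrences of the first seed, caps context length with an assumed `max_len`, and skips wrapper ranking. The function names and the sample document are made up for the example.

```python
import re
from typing import List, Set, Tuple


def brackets_all(doc: str, seeds: List[str], left: str, right: str) -> bool:
    """True if left and right bracket at least one occurrence of every seed."""
    return all(left + seed + right in doc for seed in seeds)


def learn_wrappers(doc: str, seeds: List[str], max_len: int = 30) -> Set[Tuple[str, str]]:
    """Find (left, right) context pairs that bracket every seed in doc."""
    wrappers = set()
    for m in re.finditer(re.escape(seeds[0]), doc):
        start, end = m.start(), m.end()
        left = ""
        # Grow the left context one character at a time while every seed still fits.
        for n in range(1, min(max_len, start) + 1):
            candidate = doc[start - n:start]
            if not brackets_all(doc, seeds, candidate, ""):
                break
            left = candidate
        right = ""
        # With the left context fixed, grow the right context the same way.
        for n in range(1, min(max_len, len(doc) - end) + 1):
            candidate = doc[end:end + n]
            if not brackets_all(doc, seeds, left, candidate):
                break
            right = candidate
        if left and right:
            wrappers.add((left, right))
    return wrappers


def extract(doc: str, left: str, right: str) -> Set[str]:
    """Return every string bracketed by the wrapper, seeds and new candidates alike."""
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return set(re.findall(pattern, doc))


# Made-up document; the seeds match the example set {Ford, Nissan, Toyota}.
doc = (
    '<td class="make">Ford</td><td>USA</td>'
    '<td class="make">Nissan</td><td>Japan</td>'
    '<td class="make">Toyota</td><td>Japan</td>'
    '<td class="make">Honda</td><td>Japan</td>'
)
seeds = ["Ford", "Nissan", "Toyota"]
wrappers = learn_wrappers(doc, seeds)
candidates = set().union(*(extract(doc, l, r) for l, r in wrappers))
```

Because a shorter context always matches wherever a longer one does, growing each context until the first failure yields the longest context that still brackets every seed; the unseeded entity in the sample document is then picked up as a new candidate.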
Semi-Supervised Learning Example • Document • Seeds • {Ford, Nissan, Toyota}
Unsupervised Learning Example • RoadRunner • A novel approach to wrapper inference for HTML pages • Idea • Generating HTML pages using scripts => Encoding • Extracting data from HTML pages => Decoding • Problem formulation • Find the nested type of the source dataset • Extract the source dataset from the HTML pages • Source: RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
Unsupervised Learning Example • Find the nested type • Theoretical background • Based on the close correspondence between nested types and union-free regular expressions (UFRE) • => Find the least upper bound (LUB) UFRE • Solution for the LUB UFRE • ACME (Align, Collapse under Mismatch, and Extract)
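A heavily reduced sketch of the align-and-generalize idea is shown below. It handles only the simplest case, where two pages share the same tag sequence and differ in text, so each differing text token is generalized to a `#PCDATA` field. Tag mismatches, which lead to optionals and iterators in the full technique, are rejected here. The tokenizer, function names, and sample pages are assumptions for illustration.

```python
import re
from typing import List

FIELD = "#PCDATA"  # placeholder for a data field in the generalized wrapper


def tokenize(html: str) -> List[str]:
    """Split a page into tag tokens and non-empty text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def is_tag(token: str) -> bool:
    return token.startswith("<")


def generalize(wrapper: List[str], sample: List[str]) -> List[str]:
    """Align wrapper and sample token by token; collapse text mismatches to FIELD."""
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch: optionals/iterators are not handled in this sketch")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif not is_tag(w) and not is_tag(s):
            result.append(FIELD)  # text differs between pages, so it is data
        else:
            raise ValueError("tag mismatch: optionals/iterators are not handled in this sketch")
    return result


def extract(wrapper: List[str], sample: List[str]) -> List[str]:
    """Read the data out of a page by taking the tokens aligned with FIELD."""
    return [s for w, s in zip(wrapper, sample) if w == FIELD]


# Made-up pages generated from the same hypothetical template.
page1 = "<html><b>Title:</b>Database Primer<b>Author:</b>John Smith</html>"
page2 = "<html><b>Title:</b>Computer Systems<b>Author:</b>Paul Jones</html>"
wrapper = generalize(tokenize(page1), tokenize(page2))
fields = extract(wrapper, tokenize(page2))
```

Text that is identical across pages stays in the wrapper as template, while text that varies is treated as data, which is the decoding view of pages that were encoded from a dataset by a script.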
Conclusion • WIE will remain important due to the “data flood” on the Internet • Current WIE systems are almost all based on machine learning, but are still not perfect • New technologies such as MapReduce, Hadoop, and Spark advance the development of ML, and may also benefit WIE
References • Information Extraction @ Wikipedia • Wrapper (data mining) @ Wikipedia • Supervised learning @ Wikipedia • Semi-supervised learning @ Wikipedia • Unsupervised learning @ Wikipedia • A Survey of Web Information Extraction Systems @ TKDE 06 • Semi-structured Data: The TSIMMIS Experience @ ADBIS 97 • Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98 • Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09 • RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001