Web Information Extraction

Web Information Extraction 邵蓥侠

Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion

Terminology • IE = Information Extractor • WIE = Web Information Extractor • TIE = Traditional Information Extractor

Background • Abundant information on the web • Structured [tables] • Semi-structured [HTML / XML] • Free text [blogs] • TIE vs. WIE • Scalability • Cost • Flexibility • General approach • Wrappers

Wrapper • Wrapper • sets of highly accurate rules that extract a particular page's content • a function from a page to the set of tuples it contains • Flow of WIE based on wrappers • collecting training pages • labeling training examples [optional] • generalizing extraction rules (wrappers) • extracting the relevant data • outputting the result in an appropriate format
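
The "function from a page to the set of tuples it contains" view can be made concrete with a minimal Python sketch. It assumes a simple delimiter-based wrapper with two fields; the names `make_wrapper` and `country_code` and the sample page are hypothetical illustrations, not taken from any system named in these slides.

```python
import re
from typing import Callable, Set, Tuple


def make_wrapper(left1: str, right1: str,
                 left2: str, right2: str) -> Callable[[str], Set[Tuple[str, str]]]:
    """Build a wrapper: a function from a page (str) to the set of tuples it contains.

    Each field is located by a left and a right delimiter (an assumption of this sketch).
    """
    pattern = re.compile(
        re.escape(left1) + r"(.*?)" + re.escape(right1)
        + r".*?"
        + re.escape(left2) + r"(.*?)" + re.escape(right2),
        re.DOTALL,
    )

    def wrapper(page: str) -> Set[Tuple[str, str]]:
        # Every non-overlapping match yields one (field1, field2) tuple.
        return {(m.group(1), m.group(2)) for m in pattern.finditer(page)}

    return wrapper


# Hypothetical page and delimiters, for illustration only.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
country_code = make_wrapper("<b>", "</b>", "<i>", "</i>")
tuples = country_code(page)
```

In the flow above, the "generalizing extraction rules" step is what would choose the four delimiters; here they are simply written by hand.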

Approaches for generating wrappers • Degree of automation • Manually constructed • Machine learning [supervised, semi-supervised, unsupervised] A Survey of Web Information Extraction Systems @ TKDE 06

Manually Constructed Wrapper • Definition: Manually develop rules/commands/patterns for extracting data • Examples • TSIMMIS [Hammer, et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu, et al., 2000]

Manually Constructed Wrapper • Disadvantages • Writing rules is time-consuming • Rules do not generalize • Requires understanding the structure of the document • Requires special expertise from users [programmers]

Wrapper with Supervised Learning • Supervised learning • A machine learning task of inferring a function from supervised (labeled) training data • Examples • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • Softmealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]

Wrapper with Supervised Learning • Disadvantage • Manually labeling training data is time-consuming • vs. manually constructed • general users instead of programmers can label training data, thus reducing the cost of wrapper generation

Wrapper with Semi-Supervised Learning • Semi-supervised learning • A class of machine learning techniques that use both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data • Examples • SEAL [Richard C. Wang, 2009] • Automatic Wrapper [Nilesh Dalvi, 2011] • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]

Wrapper with Unsupervised Learning • Unsupervised learning • The problem of finding hidden structure in unlabeled data • Examples • RoadRunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai, et al., 2005]

Manually-Constructed Example • TSIMMIS • One of the first approaches to provide a framework for building Web wrappers manually • Wrapper • Manually constructed as commands • Input: a specification file that declaratively states where the data of interest is located on the page • Output: Object Exchange Model (OEM) Semi-structured Data: The TSIMMIS Experience @ ADBIS 97

Manually-Constructed Example • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables are a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard” Semi-structured Data: The TSIMMIS Experience @ ADBIS 97
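
The [variables, source, pattern] command form can be illustrated with a toy interpreter in Python. This is only a sketch under stated assumptions: `#` is mapped to a regex capture group, `*` to a skipped span, and `source` is the name of a previously bound variable. The functions `compile_pattern` and `run_commands`, the sample page, and the commands are hypothetical and do not reproduce the actual TSIMMIS extractor or its specification syntax.

```python
import re


def compile_pattern(pattern: str):
    """Translate a '#'/'*' pattern into a regular expression (toy semantics)."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append("(.*?)")   # '#': save the matched text in a variable
        elif ch == "*":
            parts.append(".*?")     # '*': discard the matched text
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def run_commands(commands, page: str) -> dict:
    """Run [variables, source, pattern] commands in order; return all bindings."""
    env = {"page": page}
    for variables, source, pattern in commands:
        match = compile_pattern(pattern).fullmatch(env[source])
        if match is None:
            raise ValueError(f"pattern {pattern!r} does not match source {source!r}")
        env.update(zip(variables, match.groups()))
    return env


# Hypothetical page and specification, for illustration only.
page = "<html><h1>Weather</h1><table><tr><td>Paris</td><td>21</td></tr></table></html>"
commands = [
    (["table"], "page", "*<table>#</table>*"),
    (["city", "temp"], "table", "*<td>#</td><td>#</td>*"),
]
env = run_commands(commands, page)
```

Each command narrows the text handed to the next one, which is why the second command reads from `table` rather than from the whole page.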

Manually-Constructed Example • [Figure: specification file, Web page, and resulting OEM output]

Supervised Learning Example • SRV • A top-down relational algorithm that generates single-slot extraction rules • The learning algorithm works like FOIL • Token-oriented • Logic rules Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98

SRV Learning process • SRV algorithm • Input: annotated documents & features • Induce rules on 2/3 of the training data • Validate the rules on the remaining 1/3 of the training data • Repeat the learning 3 times • Output: rules for single-slot prediction Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98
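
The induce-on-2/3, validate-on-1/3, repeat-3-times loop can be sketched in Python as below. This shows only the data-splitting scheme; the rule learner is a deliberately trivial stand-in, not SRV's FOIL-like induction. The names `learn_with_validation` and `induce_digit_rule` and the toy examples are hypothetical.

```python
import random


def learn_with_validation(examples, induce, rounds=3, seed=0):
    """Return (rule, validation_accuracy) pairs pooled over several random splits.

    examples: list of (token, label) pairs; induce: function from a training
    list to a list of rules, where a rule is a callable token -> bool.
    """
    rng = random.Random(seed)
    scored_rules = []
    for _ in range(rounds):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        train, held_out = shuffled[:cut], shuffled[cut:]
        for rule in induce(train):
            # Score each rule on the held-out third it has not seen.
            correct = sum(rule(token) == label for token, label in held_out)
            scored_rules.append((rule, correct / len(held_out)))
    return scored_rules


def induce_digit_rule(train):
    """Toy learner (assumption): propose 'token is all digits' if it fits the positives."""
    positives = [token for token, label in train if label]
    if positives and all(token.isdigit() for token in positives):
        return [str.isdigit]
    return []


# Toy labeled tokens: True marks a token that fills the slot.
examples = [("42", True), ("7", True), ("2010", True),
            ("foo", False), ("bar", False), ("x1", False)]
scored = learn_with_validation(examples, induce_digit_rule)
```

Pooling rules from all rounds together with their held-out accuracy gives each rule a confidence estimate that does not come from the data it was induced on.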

Supervised Learning Example • [Figure: Web page and the learned rules for extracting the rating]

Semi-Supervised Learning Example • SEAL (Set Expander for Any Language) • expands entities automatically by utilizing resources from the Web in a language-independent fashion • Flow of SEAL • Extracting wrappers • Ranking wrappers / Candidates Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

Semi-Supervised Learning Example • Extracting wrappers • Input: seed instances & a document • Find the seed instances in the document • Generate the left/right contexts • Mine between the left/right contexts • Find all the longest possible strings s from the left context set, given some constraints • For each found string s, find the longest possible string s' from the right context such that s and s' bracket at least one occurrence of every given seed in the document • NOTE: the left/right contexts are maintained in a Patricia trie Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09
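
A character-level sketch of this wrapper extraction step, in Python. It is a brute-force simplification: it tries one occurrence per seed and keeps the longest shared left/right context, instead of using a Patricia trie, and it returns a single wrapper rather than all of them. The names `learn_wrapper` and `apply_wrapper` and the sample document are hypothetical.

```python
import itertools
import os
import re


def common_prefix(strings):
    return os.path.commonprefix(strings)


def common_suffix(strings):
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(doc, seeds):
    """Return the longest (left, right) context pair bracketing every seed, or None."""
    spans_per_seed = [
        [(m.start(), m.end()) for m in re.finditer(re.escape(seed), doc)]
        for seed in seeds
    ]
    best = None
    # Try every way of picking one occurrence of each seed.
    for combo in itertools.product(*spans_per_seed):
        left = common_suffix([doc[:start] for start, _ in combo])
        right = common_prefix([doc[end:] for _, end in combo])
        if left and right:
            if best is None or len(left) + len(right) > len(best[0]) + len(best[1]):
                best = (left, right)
    return best


def apply_wrapper(doc, wrapper):
    """Extract every string bracketed by the wrapper's left and right context."""
    left, right = wrapper
    return re.findall(re.escape(left) + r"(.+?)" + re.escape(right), doc)


# Hypothetical document; seeds as on the example slide.
doc = ('<p><a href="/honda">Honda</a> | <a href="/ford">Ford</a> | '
       '<a href="/bmw">BMW</a> | <a href="/nissan">Nissan</a> | '
       '<a href="/toyota">Toyota</a></p>')
wrapper = learn_wrapper(doc, ["Ford", "Nissan", "Toyota"])
candidates = apply_wrapper(doc, wrapper)
```

On this document the learned wrapper also brackets Honda and BMW, which is how the seed set gets expanded; ranking those candidates is the separate second step of the SEAL flow.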

Semi-Supervised Learning Example • Document • Seeds • {Ford, Nissan, Toyota} Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

Semi-Supervised Learning Example Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

Unsupervised Learning Example • RoadRunner • A novel approach to wrapper inference for HTML pages • Idea • Generating HTML pages using scripts => Encoding • Extracting data from HTML pages => Decoding • Formulate the problem • Find the nested type of the source dataset • Extract the source dataset from the HTML pages RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001

Unsupervised Learning Example • Find Nested Type • Theoretical Background • Based on close correspondence between nested type and union-free regular expressions (UFRE). • => find the Least Upper Bound UFRE • Solution for LUB UFRE. • ACME (Align, Collapse under Mismatch, and Extract) RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
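
The idea of generalizing a wrapper from mismatches between pages can be sketched in Python as follows. This is a heavily simplified stand-in for ACME: it aligns two pages with `difflib.SequenceMatcher`, turns text mismatches into `#PCDATA` fields and other mismatches into optionals, and does not discover iterators or backtrack. The names `tokenize` and `generalize` and the two sample pages are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def generalize(page_a, page_b):
    """Align two pages and generalize their mismatches into a union-free pattern."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    wrapper = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])            # shared template tokens
        elif op == "replace" and not any(
                t.startswith("<") for t in a[i1:i2] + b[j1:j2]):
            wrapper.append("#PCDATA")           # string mismatch: a data field
        else:
            # Tag mismatch: tokens present on one side only become optional.
            for side in (a[i1:i2], b[j1:j2]):
                if side:
                    wrapper.append("(" + " ".join(side) + ")?")
    return wrapper


# Two hypothetical pages produced by the same script.
page_a = "<html><b>John Smith</b><img src=x/><ul><li>DB Primer</li></ul></html>"
page_b = "<html><b>Paul Jones</b><ul><li>XML at Work</li></ul></html>"
wrapper = generalize(page_a, page_b)
```

The resulting token list reads as a union-free regular expression; decoding a page against it would bind each `#PCDATA` slot to one extracted value.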

Conclusion • WIE will remain important due to the “data flood” on the Internet • Current WIE systems are almost all based on machine learning, but are still not perfect • New techniques such as MapReduce, Hadoop, and Spark promote the development of ML and may also benefit WIE

Q&A

References • Information Extraction @ Wikipedia • Wrapper (data mining) @ Wikipedia • Supervised learning @ Wikipedia • Semi-supervised learning @ Wikipedia • Unsupervised learning @ Wikipedia • A Survey of Web Information Extraction Systems @ TKDE 06 • Semi-structured Data: The TSIMMIS Experience @ ADBIS 97 • Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98 • Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09 • RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
