Web Information Extraction

This article provides an overview of information extraction from the web, including different approaches for generating wrappers, such as manually constructed and machine learning-based methods. Examples of each approach are also discussed.

Presentation Transcript


  1. Web Information Extraction 邵蓥侠

  2. Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion

  3. Terminology • IE = Information Extraction • WIE = Web Information Extraction • TIE = Traditional Information Extraction

  4. Background • Abundant information on the web • Structured [tables] • Semi-structured [HTML / XML] • Free text [blogs] • TIE vs. WIE • Scalability • Cost • Flexibility • General approach • Wrappers

  5. Wrapper • Wrapper • sets of highly accurate rules that extract a particular page's content • a function from a page to the set of tuples it contains • Flow of WIE based on wrappers • collecting training pages • labeling training examples [optional] • generalizing extraction rules (wrappers) • extracting the relevant data • outputting the result in an appropriate format
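The definition of a wrapper as "a function from a page to the set of tuples it contains" can be sketched in Python. This is a minimal illustration, not a system from the slides: the delimiter-based rule format, the name `make_wrapper`, and the sample page are all assumptions made for the example.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical rule format: one (left, right) delimiter pair per attribute.
Delims = List[Tuple[str, str]]


def make_wrapper(delims: Delims) -> Callable[[str], Set[Tuple[str, ...]]]:
    """Build a wrapper: a function from a page to the set of tuples it contains."""

    def wrapper(page: str) -> Set[Tuple[str, ...]]:
        tuples: Set[Tuple[str, ...]] = set()
        pos = 0
        while True:
            row = []
            for left, right in delims:
                start = page.find(left, pos)
                if start == -1:
                    return tuples  # no further tuple on the page
                start += len(left)
                end = page.find(right, start)
                if end == -1:
                    return tuples  # unterminated attribute: stop
                row.append(page[start:end])
                pos = end + len(right)
            tuples.add(tuple(row))

    return wrapper


# Hypothetical page and rules, for illustration only.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
extract = make_wrapper([("<b>", "</b>"), ("<i>", "</i>")])
result = extract(page)
print(result)
```

In terms of the flow above, `make_wrapper` stands in for the result of the "generalizing extraction rules" step and calling `extract(page)` corresponds to "extracting the relevant data".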

  6. Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion

  7. Approaches for generating wrappers • Degree of automation of the approaches • Manually constructed • Machine learning: supervised, semi-supervised, unsupervised A Survey of Web Information Extraction Systems @ TKDE 06

  8. Manually Constructed Wrapper • Definition: Manually develop rules/commands/patterns for extracting data • Examples • TSIMMIS [Hammer, et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu, et al., 2000]

  9. Manually Constructed Wrapper • Disadvantages • Writing rules is time-consuming • Rules do not generalize to other pages • The structure of the document must be understood • Requires special expertise from users [programmers]

  10. Wrapper with Supervised Learning • Supervised learning • A machine learning task of inferring a function from supervised (labeled) training data • Examples • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • SoftMealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]

  11. Wrapper with Supervised Learning • Disadvantage • Manually labeling training data is time-consuming • Advantage over manually constructed wrappers • General users, rather than programmers, can label the training data, which reduces the cost of wrapper generation

  12. Wrapper with Semi-Supervised Learning • Semi-Supervised Learning • a class of machine learning techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data • Examples • SEAL [Richard C. Wang, 2009] • Automatic Wrapper [Nilesh Dalvi, 2011] • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]

  13. Wrapper with Unsupervised Learning • Unsupervised Learning • refers to the problem of trying to find hidden structure in unlabeled data • Examples • RoadRunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai, et al., 2005]

  14. A Survey of Web Information Extraction Systems @ TKDE 06


  18. Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion

  19. Manually-Constructed Example • TSIMMIS • one of the first approaches to provide a framework for manually building Web wrappers • Wrapper • Manually constructed as commands • Input: a specification file that declaratively states where the data of interest is located on the page • Output: Object Exchange Model (OEM) objects Semi-structured Data: The TSIMMIS Experience @ ADBIS 97

  20. Manually-Constructed Example • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables are a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard” Semi-structured Data: The TSIMMIS Experience @ ADBIS 97
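The [variables, source, pattern] command form can be illustrated with a small Python sketch. It only mimics the two pattern symbols described above (`#` saves the matched text in a variable, `*` discards it) by translating them to a regular expression. The function names, the dictionary-based environment, and the sample page are assumptions for illustration; this is not the actual TSIMMIS specification syntax or implementation.

```python
import re
from typing import Dict, List


def compile_pattern(pattern: str):
    """Translate a pattern with '#' (save) and '*' (discard) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append("(.*?)")   # save in the next variable
        elif ch == "*":
            parts.append(".*?")     # match and discard
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def run_command(variables: List[str], source: str, pattern: str,
                env: Dict[str, str]) -> None:
    """Apply one [variables, source, pattern] command to the environment."""
    match = compile_pattern(pattern).fullmatch(env[source])
    if match is None:
        raise ValueError(f"pattern {pattern!r} does not match {source!r}")
    if len(match.groups()) != len(variables):
        raise ValueError("number of '#' symbols must equal number of variables")
    env.update(zip(variables, match.groups()))


# Hypothetical page, for illustration only.
env = {"page": "<html><h1>Weather</h1><p>Paris 21C</p></html>"}
run_command(["title"], "page", "*<h1>#</h1>*", env)
run_command(["city", "temp"], "page", "*<p># #</p>*", env)
print(env["title"], env["city"], env["temp"])
```

Because each command names its source, a later command could take a variable filled by an earlier one as its input, which is how a specification file can narrow down step by step to the data of interest.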

  21. Manually-Constructed Example [Figure: specification file, Web page, and resulting OEM]

  22. Supervised Learning Example • SRV • top-down relational algorithm that generates single-slot extraction rules • The learning algorithm works like FOIL • Token-oriented • Logic rules Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98

  23. SRV Learning process • SRV Algorithm • Input: annotated documents and features • Induce rules from 2/3 of the training data • Validate the rules on the remaining 1/3 of the training data • Repeat the learning 3 times • Output: single-slot extraction rules Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98
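The split-and-validate loop on this slide (learn on 2/3 of the training data, validate on the remaining 1/3, repeat 3 times) can be sketched as follows. Only the loop structure comes from the slide. The toy learner, which simply picks the best of three token predicates, is a stand-in assumption and is far simpler than SRV's FOIL-like top-down rule induction; the example tokens and all names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, bool]     # (token, is it the target slot?)
Rule = Callable[[str], bool]   # a predicate over tokens

# Hypothetical token features; SRV uses a much richer feature set.
FEATURES: List[Rule] = [str.isdigit, str.isupper, str.istitle]


def accuracy(rule: Rule, examples: Sequence[Example]) -> float:
    """Fraction of examples on which the rule agrees with the label."""
    return sum(rule(tok) == label for tok, label in examples) / len(examples)


def learn(train: Sequence[Example]) -> Rule:
    """Toy stand-in for rule induction: pick the best single feature."""
    return max(FEATURES, key=lambda f: accuracy(f, train))


def induce_and_validate(examples: Sequence[Example], rounds: int = 3,
                        seed: int = 0) -> List[Tuple[Rule, float]]:
    """Repeat: learn on a random 2/3 split, validate on the other 1/3."""
    rng = random.Random(seed)
    results = []
    for _ in range(rounds):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        train, held_out = shuffled[:cut], shuffled[cut:]
        rule = learn(train)
        results.append((rule, accuracy(rule, held_out)))
    return results


# Hypothetical labeled tokens: the target slot is a numeric rating.
examples = [("7", True), ("9", True), ("3", True), ("8", True), ("10", True),
            ("Good", False), ("film", False), ("PLOT", False), ("cast", False)]
results = induce_and_validate(examples)
for rule, score in results:
    print(rule.__name__, score)
```

The held-out score gives each learned rule a confidence estimate that does not depend on the data it was induced from.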

  24. Supervised Learning Example [Figure: Web page and the rules for extracting the rating]

  25. Semi-Supervised Learning Example • SEAL (Set Expander for Any Language) • expands entities automatically by utilizing resources from the Web in a language-independent fashion • Flow of SEAL • Extracting wrappers • Ranking wrappers / Candidates Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

  26. Semi-Supervised Learning Example • Extracting wrappers • Input: seed instances and a document • Find the seed instances in the document • Generate the left/right contexts • Mine between the left/right contexts • find all the longest possible strings s from the left context set, given some constraints • for each string s found, find the longest possible string s' from the right context such that s and s' bracket at least one occurrence of every given seed in the document NOTE: the left/right contexts are maintained in a Patricia trie Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09
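The wrapper-extraction step can be sketched at character level in Python. This is a simplified illustration under stated assumptions, not the SEAL implementation: it compares contexts by brute force instead of using a Patricia trie, it derives one candidate wrapper per occurrence of the first seed rather than enumerating all longest left strings, and it omits ranking. The function names and the sample document are hypothetical.

```python
import re
from typing import List, Set, Tuple


def occurrences(doc: str, seed: str) -> List[Tuple[int, int]]:
    """All (start, end) spans of seed in doc."""
    spans, i = [], doc.find(seed)
    while i != -1:
        spans.append((i, i + len(seed)))
        i = doc.find(seed, i + 1)
    return spans


def suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def extract_wrappers(doc: str, seeds: List[str]) -> Set[Tuple[str, str]]:
    """Find (left, right) strings bracketing >= 1 occurrence of every seed."""
    occ = {seed: occurrences(doc, seed) for seed in seeds}
    if any(not spans for spans in occ.values()):
        return set()
    wrappers = set()
    first, rest = seeds[0], seeds[1:]
    for start, end in occ[first]:
        left, right = doc[:start], doc[end:]
        # Longest suffix of `left` that precedes some occurrence of each seed.
        n_left = min([len(left)] + [
            max(suffix_len(left, doc[:s]) for s, _ in occ[seed])
            for seed in rest])
        if n_left == 0:
            continue
        s_left = left[-n_left:]
        # Longest prefix of `right` that follows, for each seed, an
        # occurrence that is also preceded by s_left.
        n_right = min([len(right)] + [
            max(prefix_len(right, doc[e:])
                for s, e in occ[seed] if doc[:s].endswith(s_left))
            for seed in rest])
        if n_right > 0:
            wrappers.add((s_left, right[:n_right]))
    return wrappers


def apply_wrapper(doc: str, wrapper: Tuple[str, str]) -> List[str]:
    """Extract every string bracketed by the wrapper's left/right contexts."""
    left, right = wrapper
    pattern = re.escape(left) + "(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, doc, flags=re.DOTALL)


# Hypothetical document, for illustration only.
doc = ("<tr><td>Ford</td><td>USA</td></tr>"
       "<tr><td>Nissan</td><td>Japan</td></tr>"
       "<tr><td>Toyota</td><td>Japan</td></tr>"
       "<tr><td>Honda</td><td>Japan</td></tr>"
       "<tr><td>Audi</td><td>Germany</td></tr>")
wrappers = extract_wrappers(doc, ["Ford", "Nissan", "Toyota"])
candidates = [c for w in sorted(wrappers) for c in apply_wrapper(doc, w)]
print(wrappers, candidates)
```

Applying the learned wrapper back to the document returns the seeds plus the new candidates Honda and Audi, which is the set-expansion effect the slide describes.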

  27. Semi-Supervised Learning Example • Document • Seeds • {Ford, Nissan, Toyota} Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

  28. Semi-Supervised Learning Example Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

  29. Semi-Supervised Learning Example Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

  30. Unsupervised Learning Example • RoadRunner • A novel approach to wrapper inference for HTML pages • Idea • Generating HTML pages from a data source with scripts => encoding • Extracting data from HTML pages => decoding • Problem formulation • Find the nested type of the source dataset • Extract the source dataset from the HTML pages RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001

  31. Unsupervised Learning Example • Find Nested Type • Theoretical Background • Based on the close correspondence between nested types and union-free regular expressions (UFRE) • => find the least upper bound (LUB) UFRE • Solution for the LUB UFRE • ACME (Align, Collapse under Mismatch, and Extract) RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
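The align-and-generalize idea behind ACME can be sketched in a heavily reduced form: two pages are tokenized and compared in parallel, and wherever their text tokens differ, the wrapper is generalized to a `#PCDATA` field. This sketch is an assumption-laden illustration, not the RoadRunner algorithm: it handles string mismatches only, whereas the full technique also resolves tag mismatches by discovering optional and repeated structures. The function names and sample pages are hypothetical.

```python
import re
from typing import List

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text


def tokenize(html: str) -> List[str]:
    """Split a page into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def is_tag(token: str) -> bool:
    return token.startswith("<")


def generalize(wrapper: List[str], sample: List[str]) -> List[str]:
    """Align wrapper and sample; collapse differing text into #PCDATA."""
    if len(wrapper) != len(sample):
        raise NotImplementedError("tag mismatches are not handled here")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)            # tokens agree: keep as template
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")    # string mismatch: a data field
        else:
            raise NotImplementedError("tag mismatches are not handled here")
    return result


def extract(wrapper: List[str], page: str) -> List[str]:
    """Decode a page: return the text found at each #PCDATA position."""
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match the wrapper")
    return [t for w, t in zip(wrapper, tokens) if w == "#PCDATA"]


# Hypothetical pages generated from the same template.
page1 = "<html><b>Name:</b> John Smith <i>Age:</i> 30</html>"
page2 = "<html><b>Name:</b> Paul Jones <i>Age:</i> 41</html>"
wrapper = generalize(tokenize(page1), tokenize(page2))
fields = extract(wrapper, page2)
print(wrapper, fields)
```

The first page serves as the initial wrapper and each further page can only make it more general, which mirrors the search for a least upper bound described above.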

  32. Outline • Background • Approaches for generating wrappers • Manually constructed • Machine learning • Examples • Conclusion

  33. Conclusion • WIE will remain important because of the "data flood" on the Internet • Current WIE systems are almost all based on machine learning, but they are still not perfect • New techniques such as MapReduce, Hadoop, and Spark are advancing ML and may also benefit WIE

  34. Q&A

  35. References • Information Extraction @ Wikipedia • Wrapper (data mining) @ Wikipedia • Supervised learning @ Wikipedia • Semi-Supervised learning @ Wikipedia • Unsupervised learning @ Wikipedia • A Survey of Web Information Extraction Systems @ TKDE 06 • Semi-structured Data: The TSIMMIS Experience @ ADBIS 97 • Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98 • Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09 • RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
