
IEPAD: Information Extraction Based on Pattern Discovery

IEPAD: Information Extraction Based on Pattern Discovery. Chia-Hui Chang, Shao-Chen Lui, Dept. of Computer Science and Information Engineering, National Central University. WWW10 ’01.


Presentation Transcript


  1. IEPAD: Information Extraction Based on Pattern Discovery • Chia-Hui Chang, Shao-Chen Lui • Dept. of Computer Science and Information Engineering, National Central University • WWW10 ’01

  2. Introduction (1/4)

  3. Introduction (2/4) • Great need for value-added services that integrate information from multiple sources • Customizable Web information gathering robots/crawlers • Comparison-shopping agents • Meta-search engines • Newsbots • Suppose the data has been collected from different Web sites… • Write an extractor program (wrapper) to extract the contents of the Web pages • Work out the extraction rules by hand • Write a program for each Web site • Since the format of Web pages changes often, maintaining the wrappers can be expensive and impractical → labor-intensive!

  4. Introduction (3/4) • Related work • Tools that generate wrappers automatically • Machine learning techniques that induce extraction rules • Ex: WIEN, SoftMealy, STALKER • The designer must manually label the beginning and end of each training example used to generate the rules • Manual labeling is time-consuming and inefficient • Fully automated wrapper construction • No user-supplied training examples • Ex: One-tag separator approach (Embley et al.) • Discovers record boundaries in Web documents by identifying candidate separator tags using five independent heuristics • Problems arise when the separator tag is also used inside a record, not only at its boundary

  5. Introduction (4/4) • Eliminate human intervention by pattern mining • Motivated by the observation that useful information in a Web page is often placed in a structure with a particular alignment and order • Ex: Web pages produced by search engines generally present search results in regular, repetitive patterns • Mining repetitive patterns may therefore discover the extraction rules for wrappers

  6. System Overview (1/3) • The IEPAD system includes three components: • An extraction rule generator • Accepts an input Web page • A graphical user interface • Called the pattern viewer • Shows the repetitive patterns discovered • An extractor module • Extracts the desired information from similar Web pages according to the extraction rule chosen by the user

  7. System Overview (2/3) • The extraction rule generator includes: • Translator • PAT tree constructor • Pattern discoverer • Pattern validator • Extraction rule composer • The output of the rule generator is the set of extraction rules discovered in a Web page

  8. System Overview (3/3) 1. The user submits an HTML page 2. The translator receives the page and translates it into a string of abstract representations 3. The PAT tree constructor receives the resulting binary file and constructs a PAT tree 4. The pattern discoverer uses the PAT tree to discover repetitive patterns, called maximal repeats 5. The validator filters out undesired patterns and produces candidate patterns 6. The rule composer revises each candidate pattern to form an extraction rule expressed as a regular expression

  9. Extraction Rule Generator (1/2) • The desired information in a Web page is often placed in a structure with a particular alignment, so it forms repetitive patterns • These patterns may constitute the extraction rules for wrappers • Repetitive pattern: any substring that occurs at least twice in the encoded token string • Too many patterns satisfy this requirement • Define maximal repeats to uniquely identify the longest pattern

  10. Extraction Rule Generator (2/2) • Maximal repeats are necessary for identifying the frequently used repeats • Maximal repeats must be further verified by the validator, which keeps only the interesting ones
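To make the notion of a maximal repeat concrete, here is a brute-force sketch in Python. It assumes the usual definition: a substring that occurs at least twice and cannot be extended to the left or to the right without losing an occurrence (the "left diverse" test mentioned on a later slide, plus its right-hand counterpart). The function name `maximal_repeats` and its parameters are illustrative, and the cubic-time enumeration stands in for the PAT tree that IEPAD actually uses.

```python
from collections import defaultdict

def maximal_repeats(tokens, min_len=1, min_count=2):
    """Return {pattern: [start positions]} for every maximal repeat.

    Brute force over all substrings, so only suitable for short inputs.
    """
    n = len(tokens)
    occurrences = defaultdict(list)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            occurrences[tuple(tokens[start:end])].append(start)

    result = {}
    for pattern, starts in occurrences.items():
        if len(starts) < min_count:
            continue
        # Tokens just before / after each occurrence; None marks a string edge.
        left = {tokens[s - 1] if s > 0 else None for s in starts}
        right = {tokens[s + len(pattern)] if s + len(pattern) < n else None
                 for s in starts}
        # Maximal: extending the repeat on either side would lose
        # at least one occurrence.
        if len(left) > 1 and len(right) > 1:
            result[pattern] = starts
    return result

# Token string borrowed from the Rule Composer slide, one character per token.
repeats = maximal_repeats(list("adcwbdadcxbadcxbdadcb"), min_len=3)
for pattern, starts in sorted(repeats.items()):
    print("".join(pattern), starts)
```

Even on this 21-token string, several maximal repeats of length 3 or more come back, which is why a validation step is still needed afterwards.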

  11. Translator (1/2) • HTML page → token string, made up of two kinds of tokens • Tag token • Html(<tag_name>) • Text token • The non-tag text between two tags is treated as a single token • Text(_)
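The translation step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not IEPAD's translator: the regular expression `TAG_RE`, the function name `translate`, the upper-casing of tag names, and the sample input are all hypothetical, and HTML comments or script bodies are not handled.

```python
import re

# Matches one HTML tag; group 1 captures an optional "/" plus the tag name.
TAG_RE = re.compile(r"<\s*(/?\s*[A-Za-z][A-Za-z0-9]*)[^>]*>")

def translate(html):
    """Turn an HTML string into a list of Html(<tag>) and Text(_) tokens."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(html):
        # Any non-blank text between two tags collapses into one Text(_) token.
        if html[pos:m.start()].strip():
            tokens.append("Text(_)")
        name = m.group(1).replace(" ", "").upper()
        tokens.append(f"Html(<{name}>)")
        pos = m.end()
    # Trailing text after the last tag, if any.
    if html[pos:].strip():
        tokens.append("Text(_)")
    return tokens

print(translate("<b>Congo</b><i>242</i><br>"))
```

Attributes are discarded here, so `<a href="...">` and `<a>` map to the same token; whether that is desirable depends on the encoding scheme chosen.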

  12. Translator (2/2) • Example: Congo code [figure: the tokens of the example are numbered 1 to 14]

  13. PAT Tree Construction • Each internal node holds a bit position in the encoded bit string • Used when locating a given sistring in the PAT tree • The tree stores all its data in external nodes • Sistring: 000110001010110011100$

  14. Pattern Discoverer (1/2)

  15. Pattern Discoverer (2/2) • Besides the maximal repeats themselves, their occurrence counts, reference positions, and pattern lengths are also recorded • Ex: to find all patterns longer than 3 tokens, note that each token is encoded in 3 bits, so only the internal nodes with index bit > 3*3 = 9 need to be checked • d, e, g, l, m • Of these, only d is left diverse, so only d yields a maximal repeat
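The bit arithmetic behind the "index bit > 3*3 = 9" check can be illustrated with a small Python sketch. It builds no PAT tree; it only encodes tokens with a fixed width and measures how many leading bits two sistrings share, which is the quantity an internal node's index bit reflects. The helper names and the code assignment (codes handed out in order of first appearance) are assumptions made for the example.

```python
def encode(tokens, width=3):
    """Map each distinct token to a fixed-width binary code and concatenate."""
    codes = {}
    for tok in tokens:
        if tok not in codes:
            codes[tok] = format(len(codes), f"0{width}b")
    if len(codes) > 2 ** width:
        raise ValueError("width too small for the token alphabet")
    return "".join(codes[t] for t in tokens), codes

def sistrings(bits, width=3):
    """Semi-infinite strings: one suffix of the bit string per token boundary."""
    return [bits[i:] for i in range(0, len(bits), width)]

def shared_prefix_bits(a, b):
    """Number of leading bits two sistrings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

bits, codes = encode(list("abcdabcde"), width=3)
subs = sistrings(bits, width=3)
shared = shared_prefix_bits(subs[0], subs[4])
# Whole tokens shared by the two sistrings = shared bits // code width.
print(shared, shared // 3)
```

With 3-bit codes, two sistrings that share more than 9 bits share more than 3 whole tokens, so shallower internal nodes can be skipped when only longer patterns are wanted.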

  16. Pattern Validator (1/2) • A typical Web page usually contains a large number of maximal repeats • Not all of them are useful! • The validator uses 3 criteria to decide which maximal repeats are useful • Regularity • Measured as the standard deviation of the intervals between adjacent occurrences, divided by the mean of those intervals

  17. Pattern Validator (2/2) • 3 thresholds are used to filter out maximal repeats that do not meet the criteria • Maximal repeats that contain no Text token are also filtered out
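The regularity measure and the Text-token filter can be sketched in Python as below. Only these two checks are shown, because the transcript does not spell out the other criteria. The threshold value `0.5`, the use of the population standard deviation, and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Std. deviation of the gaps between adjacent occurrences over their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0
    return pstdev(gaps) / mean(gaps)

def validate(repeats, max_regularity=0.5):
    """Keep repeats that contain a Text token and occur at regular intervals."""
    kept = {}
    for pattern, positions in repeats.items():
        # Patterns made only of tags carry no content to extract.
        if not any(tok == "Text(_)" for tok in pattern):
            continue
        # max_regularity is a hypothetical threshold, not a value from the slides.
        if regularity(sorted(positions)) > max_regularity:
            continue
        kept[pattern] = positions
    return kept

candidates = {
    ("Html(<B>)", "Text(_)"): [0, 10, 20],       # evenly spaced, has text
    ("Html(<BR>)", "Html(<BR>)"): [0, 2, 4],     # no Text token
    ("Html(<I>)", "Text(_)"): [0, 1, 100],       # irregular spacing
}
print(list(validate(candidates)))
```

A regularity of 0 means the occurrences are perfectly evenly spaced; larger values mean the gaps vary more relative to their average.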

  18. Occurrence Partition • Special case: • The pattern of the target information forms three information blocks in the Web page • Because regularity is measured over all instances, Regularity → large! • Partition the occurrences into segments whose regularity is below a threshold • The threshold is set to a small value close to zero
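One way to picture occurrence partition is the greedy Python sketch below. The transcript does not give the actual procedure, so this is an assumption: occurrences are added to the current segment for as long as the segment's regularity stays at or below a small threshold `eps`, and a new segment starts otherwise. The names `partition` and `eps` and the sample positions are hypothetical.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Std. deviation of the gaps between adjacent occurrences over their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0
    return pstdev(gaps) / mean(gaps)

def partition(positions, eps=0.1):
    """Greedily split sorted positions into segments with regularity <= eps."""
    segments, current = [], []
    for p in positions:
        # A segment needs at least two members before a gap can look irregular.
        if len(current) >= 2 and regularity(current + [p]) > eps:
            segments.append(current)
            current = [p]
        else:
            current.append(p)
    if current:
        segments.append(current)
    return segments

# Three information blocks, each internally regular, separated by large gaps.
occurrences = [0, 10, 20, 30, 200, 210, 220, 500, 510, 520, 530]
print(round(regularity(occurrences), 2))   # large when measured over all instances
print(partition(occurrences))
```

Because the first two members of a segment are always accepted, an isolated occurrence would be merged with whatever follows it; a fuller treatment would need an extra rule for that case.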

  19. Rule Composer • Find a good representation of the critical common features of multiple strings • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • Multiple alignment of the strings • The extraction pattern can be generalized as “adc[w|x]b[d|-]” • Records are assumed to be contiguous; if there are more than 10 alternatives, the maximal repeat itself is still used • Center String Algorithm • An approximation that reduces time complexity • Another problem • The generated pattern is “c1c2c3...cn”, but the actual record may be a rotation of it, “cjcj+1cj+2...cnc1c2...cj–1” • Consider the records that start with cj, and check whether “cjcj+1cj+2...cnc1c2...cj–1” is the correct pattern
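The step from an alignment to a generalized pattern can be sketched in Python. The sketch assumes the alignment is already given (it does not implement the center string algorithm), that each token is a single character, and that the aligned rows are the first three records of the slide's token string, since the slide's pattern does not cover the final "adcb". The function name `compose_rule` is illustrative. In Python's regular expression syntax the slide's “adc[w|x]b[d|-]” is written `adc[wx]bd?`.

```python
import re

def compose_rule(aligned):
    """Build a regular expression from equal-length aligned rows ('-' = gap)."""
    parts = []
    for column in zip(*aligned):
        symbols = sorted(set(column) - {"-"})
        if not symbols:
            continue  # a column of gaps only contributes nothing
        if len(symbols) == 1:
            piece = re.escape(symbols[0])
        else:
            piece = "[" + "".join(re.escape(s) for s in symbols) + "]"
        # A gap in the column means the token may be absent in some records.
        if "-" in column:
            piece += "?"
        parts.append(piece)
    return "".join(parts)

# Assumed alignment of the records that start with the pattern "adc".
rule = compose_rule(["adcwbd", "adcxb-", "adcxbd"])
print(rule)
print(re.findall(rule, "adcwbdadcxbadcxbdadcb"))
```

Running the rule over the original token string recovers the three aligned records and skips the trailing "adcb", which matches the coverage of the pattern given on the slide.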

  20. The Extractor (1/2) 1. Two patterns are discovered 2. The pattern viewer shows the detailed measures of the selected pattern

  21. The Extractor (2/2) 3. The selected pattern is then forwarded to the extractor for pattern recognition and extraction • If the PAT tree has already been constructed: searching in a PAT tree is fast, since every subtree of a PAT tree has all its sistrings with a common prefix → efficient, linear-time • Otherwise: use a pattern-matching algorithm or a finite state machine for the extraction rule (regular expression)

  22. Experiments (1/3) • 14 search engines, each with 10 Web pages • Fixed min. length = 3 • Min. frequency = 5 • All-tag encoding scheme

  23. Experiments (2/3) • Encoding scheme [figure: recall and precision; a value of 0.4% is annotated] • A pattern may contain only a portion of the data record

  24. Experiments (3/3) • Occurrence partition • Lycos → 92% • Multiple string alignment

  25. Summary • Presented an unsupervised approach to pattern discovery in the encoded token string of Web pages • Discovered maximal repeats are filtered by the measures of regularity and compactness • Regularity higher than the threshold → occurrence partition • Multiple string alignment is applied to patterns to generalize over multiple records • Extraction rules are expressed as regular expressions • High retrieval rate and accuracy rate • No human intervention or training examples • Takes only 3 minutes to extract 140 pages → quick and efficient!
