IEPAD: Information Extraction Based on Pattern Discovery. Chia-Hui Chang, Shao-Chen Lui, Dept. of Computer Science and Information Engineering, National Central University. WWW10 ’01.
Introduction (2/4) • Great need for value-added services that integrate information from multiple sources • Customizable Web information gathering robots/crawlers • Comparison-shopping agents • Meta-search engines • Newsbots • Suppose the data has been collected from different Web sites… • Write an extractor program to extract the contents of the Web pages • Work out the extraction rules by hand • Write a program for each Web site • Since the format of Web pages is often subject to change, maintaining the wrappers can be expensive and impractical → labor-intensive!
Introduction (3/4) • Related work • Tools that can generate wrappers automatically • Machine learning techniques to summarize extraction rules • Ex: WIEN, Softmealy, Stalker • The designer must manually label the beginning and the end of the training examples used to generate the rules • Manual labeling is time-consuming and inefficient • Fully automated wrapper construction • Without users’ training examples • Ex: One-tag separator approach (Embley et al.) • Discovers record boundaries in Web documents by identifying candidate separator tags using five independent heuristics • A problem arises when the separator tag is also used inside a record, not only at the boundary
Introduction (4/4) • Eliminate human intervention by pattern mining • Motivated by the observation that useful information in a Web page is often placed in a structure having a particular alignment and order • Ex: Web pages produced by search engines generally present search results in regular and repetitive patterns • Mining repetitive patterns may discover the extraction rules for wrappers
System Overview (1/3) • The IEPAD system includes three components: • An extraction rule generator • Accepts an input Web page • A graphical user interface • Called the pattern viewer • Shows the repetitive patterns discovered • An extractor module • Extracts desired information from similar Web pages according to the extraction rule chosen by the user
System Overview (2/3) • The extraction rule generator includes: • Translator • PAT tree constructor • Pattern discoverer • Pattern validator • Extraction rule composer • The output of the extraction rule generator is the set of extraction rules discovered in a Web page
System Overview (3/3) 1. The user submits an HTML page 2. The translator receives the page and translates it into a string of abstract representations 3. The PAT tree constructor receives the binary file and constructs a PAT tree 4. The pattern discoverer uses the PAT tree to discover repetitive patterns, called maximal repeats 5. The pattern validator filters out undesired patterns and produces candidate patterns 6. The rule composer revises each candidate pattern to form an extraction rule expressed as a regular expression
Extraction Rule Generator (1/2) • Desired information in a Web page is often placed in a structure having a particular alignment, and so forms repetitive patterns • These patterns may constitute the extraction rules for wrappers • Repetitive pattern: any substring that occurs at least twice in the encoded token string • Too many patterns satisfy this definition • Define maximal repeats to uniquely identify the longest pattern
Extraction Rule Generator (2/2) • Maximal repeats are necessary for identifying frequently used, repeated terms • Maximal repeats must be further verified by the validator to keep only the interesting ones
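To make the notion of a maximal repeat concrete, here is a minimal brute-force sketch. It assumes the usual definition: a substring that occurs at least twice, whose occurrences are preceded by at least two different tokens (left diverse) and followed by at least two different tokens (right diverse). The function name and parameters are illustrative, each character stands for one token, and this is not the PAT-tree method used by IEPAD; it only shows what is being searched for.

```python
from collections import defaultdict

def maximal_repeats(s, min_len=1, min_count=2):
    """Brute-force maximal repeats of s (one character = one token).

    Returns {pattern: [start positions]}. Illustrative only: the cost is
    roughly cubic, unlike a PAT-tree based search.
    """
    n = len(s)
    result = {}
    for length in range(min_len, n):
        # Group start positions by the substring of this length.
        positions = defaultdict(list)
        for i in range(n - length + 1):
            positions[s[i:i + length]].append(i)
        for sub, occ in positions.items():
            if len(occ) < min_count:
                continue
            # Tokens just before / after each occurrence; a string
            # boundary (None) counts as its own distinct symbol.
            left = {s[i - 1] if i > 0 else None for i in occ}
            right = {s[i + length] if i + length < n else None for i in occ}
            if len(left) > 1 and len(right) > 1:
                result[sub] = occ
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb", min_len=3)
print(sorted(repeats))   # ['adc', 'adcxb', 'bdadc']
print(repeats["adc"])    # [0, 6, 11, 17]
```

Substrings such as "dadc" also repeat, but every occurrence is preceded by the same token, so they are extended to the left and reported only as part of "bdadc".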
Translator (1/2) • HTML page → token string containing two kinds of tokens • Tag token • Html(<tag_name>) • Text token • The non-tag text content between two tags is treated as a single token • Text(_)
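The translation step can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class and function names are made up here, tag names are upper-cased for readability, whitespace-only text is skipped, and the sample input is an invented snippet rather than the page used in the slides.

```python
from html.parser import HTMLParser

class TokenTranslator(HTMLParser):
    """Turn an HTML page into a list of abstract tokens (a sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the tag name is kept.
        self.tokens.append(f"Html(<{tag.upper()}>)")

    def handle_endtag(self, tag):
        self.tokens.append(f"Html(</{tag.upper()}>)")

    def handle_data(self, data):
        # All text between two tags collapses into one Text(_) token;
        # whitespace-only text is skipped (an assumption of this sketch).
        if data.strip() and (not self.tokens or self.tokens[-1] != "Text(_)"):
            self.tokens.append("Text(_)")

def translate(html):
    parser = TokenTranslator()
    parser.feed(html)
    parser.close()
    return parser.tokens

print(translate("<B>Congo</B><I>242</I><BR>"))
# ['Html(<B>)', 'Text(_)', 'Html(</B>)', 'Html(<I>)', 'Text(_)', 'Html(</I>)', 'Html(<BR>)']
```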
Translator (2/2) • Example: Congo code • [Figure: example with token positions numbered 1 to 14]
PAT Tree Construction • Each internal node records a bit position in the encoded bit string, which is used when locating a given sistring in the PAT tree • The PAT tree stores all its data in external nodes • Sistring: 000110001010110011100$
Pattern Discoverer (2/2) • Records not only the maximal repeats but also their occurrence counts, reference positions, and pattern lengths • Ex: To find all patterns longer than 3 tokens, note that each token is encoded in 3 bits, so only internal nodes with index bit > 3*3 = 9 need to be checked • d, e, g, l, m • Of these, only d is left diverse, giving the maximal repeat [not shown]
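The index-bit arithmetic in the example follows from fixed-width token encoding. The sketch below assumes each distinct token is assigned a fixed-width binary code in sorted order; the function names and the code assignment are hypothetical, and the actual IEPAD encoding table may differ.

```python
def encode_tokens(tokens):
    """Encode a token sequence as a bit string using fixed-width codes."""
    alphabet = sorted(set(tokens))
    # Smallest width that can distinguish every token (at least 1 bit).
    bits = max(1, (len(alphabet) - 1).bit_length())
    codes = {tok: format(i, f"0{bits}b") for i, tok in enumerate(alphabet)}
    return "".join(codes[t] for t in tokens), codes, bits

def min_index_bit(min_tokens, bits_per_token):
    """Only internal nodes whose index bit exceeds this value can
    correspond to patterns longer than min_tokens tokens."""
    return min_tokens * bits_per_token

encoded, codes, bits = encode_tokens(list("adcwbdadcxbadcxbdadcb"))
print(bits)                    # 3 (six distinct tokens)
print(len(encoded))            # 63 = 21 tokens * 3 bits
print(min_index_bit(3, bits))  # 9
```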
Pattern Validator (1/2) • A typical Web page usually contains a large number of maximal repeats • Not all are useful! • The validator uses 3 criteria to decide which maximal repeats are useful • Regularity • Measured by computing the standard deviation of the intervals between adjacent occurrences, divided by the mean of those intervals
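The regularity measure described above is the coefficient of variation of the gaps between occurrences. A minimal sketch, assuming the population standard deviation (the slides do not say whether the population or sample form is used) and sorted, distinct occurrence positions:

```python
from statistics import mean, pstdev

def regularity(positions):
    """Std. deviation of the intervals between adjacent occurrences,
    divided by their mean. Values near 0 mean evenly spaced occurrences."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if not intervals:
        raise ValueError("need at least two occurrences")
    return pstdev(intervals) / mean(intervals)

print(regularity([0, 10, 20, 30]))            # 0.0 (perfectly regular)
print(round(regularity([0, 6, 11, 17]), 3))   # 0.083
```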
Pattern Validator (2/2) • Three thresholds are used to filter out maximal repeats that do not meet the criteria • Maximal repeats that contain no Text token are also filtered out
Occurrence Partition • Special case: • The pattern of target information forms three information blocks in the Web page • Because regularity is measured over all instances, the regularity value becomes large! • Partition the occurrences into segments, each with regularity below a threshold • The threshold is set to a small value close to zero
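One way to picture occurrence partition is a greedy pass that closes a segment as soon as adding the next occurrence would push the segment's regularity above the threshold. The greedy strategy, the function names, and the default `eps=0.1` are assumptions made for illustration; the slides only state that each segment should have regularity below a small value.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Std. deviation of adjacent intervals divided by their mean."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if not intervals:
        return 0.0  # a single occurrence is trivially regular
    return pstdev(intervals) / mean(intervals)

def partition_occurrences(positions, eps=0.1):
    """Greedily split sorted, distinct positions into segments whose
    regularity stays at or below eps (hypothetical threshold)."""
    segments, current = [], []
    for pos in positions:
        if current and regularity(current + [pos]) > eps:
            segments.append(current)   # close the segment before the jump
            current = [pos]
        else:
            current.append(pos)
    if current:
        segments.append(current)
    return segments

# Three information blocks separated by large gaps.
occ = [0, 10, 20, 100, 110, 120, 300, 310, 320]
print(regularity(occ) > 1)          # True: irregular when taken as a whole
print(partition_occurrences(occ))
# [[0, 10, 20], [100, 110, 120], [300, 310, 320]]
```

A limitation of this greedy form is that a segment holding a single occurrence always accepts the next one, since one interval has zero deviation.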
Rule Composer • Find a good representation of the critical common features of multiple strings • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • Multiple alignment for strings • The extraction pattern can be generalized as “adc[w|x]b[d|-]” • Records are assumed to be contiguous; if there are more than 10 alternatives, the maximal repeat is still used • Center String Algorithm • Approximation, reduces time complexity • Another problem • The generated pattern is “c1c2c3...cn”, but the actual record pattern is “cjcj+1cj+2...cnc1c2...cj–1” • Consider records that begin with cj and check whether “cjcj+1cj+2...cnc1c2...cj–1” is the correct pattern
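The generalization step can be sketched as a column-by-column merge of rows that are already aligned. The alignment itself (for example by the center string algorithm) is not implemented here; the three aligned rows below are written by hand from the first three records of the example string, and the function name is hypothetical.

```python
def compose_rule(aligned_rows):
    """Merge equal-length aligned rows into a rule such as adc[w|x]b[d|-].

    '-' marks a gap in the alignment; one character = one token.
    """
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    parts = []
    for column in zip(*aligned_rows):
        symbols = []
        for ch in column:
            if ch not in symbols:
                symbols.append(ch)        # keep first-seen order
        if "-" in symbols:                # list the gap alternative last
            symbols.remove("-")
            symbols.append("-")
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

rows = ["adcwbd",
        "adcxb-",
        "adcxbd"]
print(compose_rule(rows))   # adc[w|x]b[d|-]
```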
The Extractor (1/2) 1. Two patterns are discovered 2. The pattern viewer shows the detailed measures of the selected pattern
The Extractor (2/2) 3. The selected pattern is then forwarded to the extractor for pattern recognition and extraction • If the PAT tree has already been constructed: searching in a PAT tree is fast, since every subtree of a PAT tree has all its sistrings with a common prefix → efficient, linear-time • Otherwise: use a pattern-matching algorithm or a finite state machine for the extraction rule (regular expression)
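For the case without a PAT tree, extraction by regular expression can be sketched with Python's `re` module. The sketch assumes single-character tokens and the bracket notation from the rule composer slide, where `[w|x]` lists alternatives and `-` marks an optional position; the helper names are hypothetical.

```python
import re

def rule_to_regex(rule):
    """Translate a rule like 'adc[w|x]b[d|-]' into Python regex syntax."""
    out, i = [], 0
    while i < len(rule):
        if rule[i] == "[":
            j = rule.index("]", i)
            alts = rule[i + 1:j].split("|")
            optional = "-" in alts            # '-' means the token may be absent
            body = "|".join(re.escape(a) for a in alts if a != "-")
            out.append("(?:" + body + ")" + ("?" if optional else ""))
            i = j + 1
        else:
            out.append(re.escape(rule[i]))
            i += 1
    return "".join(out)

def extract(rule, token_string):
    """Return every non-overlapping record matched by the rule."""
    return [m.group(0) for m in re.finditer(rule_to_regex(rule), token_string)]

print(rule_to_regex("adc[w|x]b[d|-]"))
# adc(?:w|x)b(?:d)?
print(extract("adc[w|x]b[d|-]", "adcwbdadcxbadcxbdadcb"))
# ['adcwbd', 'adcxb', 'adcxbd']
```

The trailing "adcb" in the example string is not matched, because the rule requires either w or x after "adc".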
Experiments (1/3) • 14 search engines, each with 10 Web pages • Fixed min. length = 3 • Min. frequency = 5 • All-tag encoding scheme
Experiments (2/3) • Encoding scheme • [Figure: recall and precision, with an annotated value of 0.4%] • A pattern may contain only a portion of the data record
Experiments (3/3) • Occurrence partition • Lycos → 92% • Multiple string alignment
Summary • Presented an unsupervised approach for pattern discovery in the encoded token string of Web pages • Discovered maximal repeats are filtered by the regularity and compactness measures • Regularity higher than the threshold → occurrence partition • Multiple string alignment is applied to patterns to generalize over multiple records • Extraction rules are expressed as regular expressions • High retrieval rate and accuracy rate • No human intervention or training examples • Takes only 3 minutes to extract 140 pages → quick and efficient!