IEPAD: Information Extraction based on Pattern Discovery Chia-Hui Chang National Central University, Taiwan http://www.csie.ncu.edu.tw/~chia
Outline • Introduction • Problem definition • Related Work • System architecture • Extraction rule generation • Experiments • Summary and future work
Introduction • Web information integration • multi-search engines, e.g. Metacrawler • shopping agents • etc. • Common tasks • Data collection • Information extraction
Information Extraction • Information Extraction (IE) • Input: Html pages • Output: A set of records
Related Work • Extractor Generation • Hand-coded wrappers by observation • Machine learning based approach • WIEN (Kushmerick), 1997 • SoftMealy (Hsu), 1998 • STALKER (Muslea), 1999 • Fully automatic approach • Embley et al., 1999 • Chang et al., 2000
System Architecture • Rule Generator: Html Page → Patterns • Pattern Viewer: Patterns → Users → Extraction Rule • Extractor: Extraction Rule + Html Pages → Extraction Results
Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string
The Rule Generator • Pipeline: HTML Page → Token Translator → A Token String → PAT Tree Constructor → PAT trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules • Translator • PAT tree construction • Pattern validator • Rule Composer
1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
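The two encoding rules above can be sketched in a few lines of Python. This is a minimal illustration, not the IEPAD translator itself: the function name `encode`, the tag regular expression, and the decision to ignore whitespace-only text (which matches the slide's example, where no TEXT token appears between <BR> and <B>) are assumptions.

```python
import re

# Assumption: a tag is "<", an optional "/", a name, then anything up to ">".
# Comments, scripts and ">" inside attribute values are not handled.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def encode(html):
    """Return the token list for an HTML string.

    Rule 1: each tag becomes one token (attributes dropped, name upper-cased).
    Rule 2: each run of text between two tags becomes the TEXT token "_".
    """
    tokens, pos = [], 0
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():      # non-blank text before this tag
            tokens.append("_")
        tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        pos = m.end()
    if html[pos:].strip():                   # trailing text after the last tag
        tokens.append("_")
    return tokens

page = "<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"
print(" ".join(encode(page)))
```

For the slide's example page this yields the same seven-token record twice, in line with the encoded token string shown above.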
Example of BL Encoding • Encoding scheme = Block-Level Tags • Rule 1’: Only block-level tags are considered; each such tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML example: <dl><dt><b>1.</b> <b><a ...>MGI 2.4 - Mouse <em>Genome</em> … </a> <dd>The Mouse <b>Genome</b> Informatics (MGI) ..<br> <span>URL:www.informatics.jax.org/ </span><br> <a ...> …</a><a ...>…</a><img src=…><a ...>…</a> Facts about:<a> …</a></dl> • Encoded token string: <dl> <dt> _ <dd> _ <br> _ <br> _ </dl> 1 5 9 64 68
2. PAT Tree Construction • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example binary codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • 000110001010110011100 • 000110001010110011100
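The binary encoding on this slide can be reproduced directly from the code table. The sketch below also lists the token-aligned suffixes in sorted order; that sorted list is only a naive stand-in (an assumption made for illustration) for the PAT tree, which stores the same suffixes in a compressed binary trie. The names `CODES`, `to_bits` and `sorted_suffixes` are hypothetical.

```python
# Code table copied from the slide: one 3-bit code per token.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_bits(tokens):
    """Concatenate the binary codes of a token sequence."""
    return "".join(CODES[t] for t in tokens)

def sorted_suffixes(tokens):
    """Return (start index, bit string) for every token-aligned suffix,
    sorted by bit string. A naive substitute for a PAT tree: suffixes that
    share a prefix end up next to each other, as they would share a path
    in the tree."""
    pairs = [(i, to_bits(tokens[i:])) for i in range(len(tokens))]
    return sorted(pairs, key=lambda pair: pair[1])

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(to_bits(record))                     # one record of the example
for start, bits in sorted_suffixes(record * 2)[:4]:
    print(start, bits)
```

Building every suffix explicitly costs quadratic space, so this is only suitable for small examples; a real PAT tree avoids that by storing positions instead of copies.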
Definition of Maximal Repeats • Let a occur in S at positions p1, p2, p3, …, pk • a is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1] • a is right maximal if there exists at least one (i, j) pair such that S[pi+|a|] ≠ S[pj+|a|] • a is a maximal repeat if it is both left maximal and right maximal
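The definition can be checked by brute force on small inputs. The sketch below is not the PAT tree method used by IEPAD; it enumerates every repeated substring and tests left and right maximality directly. Treating the positions just before the start and just after the end of S as a unique boundary symbol (`None`) is an assumption, and the function name `maximal_repeats` is hypothetical.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Map each maximal repeat (as a tuple of symbols) to its start positions.

    Works on strings or token lists. Brute force, intended for small inputs.
    """
    n = len(s)
    result = {}
    for length in range(1, n):
        occurrences = defaultdict(list)
        for p in range(n - length + 1):
            occurrences[tuple(s[p:p + length])].append(p)
        for sub, positions in occurrences.items():
            if len(positions) < 2:
                continue                       # not a repeat
            # Symbols immediately left and right of each occurrence;
            # None marks the boundary of s (assumed distinct from all symbols).
            lefts = {s[p - 1] if p > 0 else None for p in positions}
            rights = {s[p + length] if p + length < n else None
                      for p in positions}
            if len(lefts) > 1 and len(rights) > 1:
                result[sub] = positions
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])
```

On this string "adc" is reported as a maximal repeat, while "dc" is not, because every occurrence of "dc" is preceded by the same symbol "a".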
Finding Maximal Repeats • Definition: • Call character S[pi-1] the left character of suffix pi • A node v is left diverse if at least two leaves in v’s subtree have different left characters • Lemma: • The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
3. Pattern Validator • Suppose the occurrences of a maximal repeat a are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of the i-th occurrence (suffix) in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient V(a) • Adjacency: Density D(a)
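The slide names the two measures without giving formulas. The sketch below assumes the definitions commonly associated with IEPAD: regularity as the coefficient of variation (standard deviation divided by mean) of the gaps between adjacent occurrences, and density as (k-1)·|a| / (pk - p1). Both formulas, the use of the population standard deviation, and the function names are assumptions, not content of the slide.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Variance coefficient V: std. deviation of adjacent gaps / mean gap.

    Assumed definition; a value near 0 means evenly spaced occurrences.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_length):
    """Density D: fraction of the span p1..pk covered by the pattern.

    Assumed definition: (k - 1) * |a| / (pk - p1).
    """
    k = len(positions)
    return (k - 1) * pattern_length / (positions[-1] - positions[0])

# Occurrences of "adc" in "adcwbdadcxbadcxbdadcb" (0-based positions).
positions = [0, 6, 11, 17]
print(round(regularity(positions), 3), round(density(positions, 3), 3))
```

Under these assumptions a pattern whose occurrences touch each other exactly has density 1, and a density below 1 indicates uncovered tokens between occurrences.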
Pattern Validator (Cont.) • Basic Screening: for each maximal repeat a, compute V(a) and D(a) • a) Check the pattern’s variance: V(a) < 0.5; otherwise discard • b) Check the pattern’s density: 0.25 < D(a) < 1.5; otherwise discard • Flowchart: Pattern a → V(a) < 0.5? (No: Discard) → 0.25 < D(a) < 1.5? (No: Discard) → Pattern a
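The basic screening step is a pair of threshold tests and can be written as a small filter. The thresholds are the ones on the slide; the function names and the (name, V, D) tuple layout are hypothetical, and the V and D values in the example are made up for illustration.

```python
def passes_screening(v, d, v_max=0.5, d_min=0.25, d_max=1.5):
    """Return True if a pattern with variance v and density d is kept."""
    return v < v_max and d_min < d < d_max

def screen(candidates):
    """Keep only the candidates that pass both checks.

    candidates: iterable of (name, V, D) tuples (hypothetical layout).
    """
    return [name for name, v, d in candidates if passes_screening(v, d)]

# Illustrative values only, not measurements from the experiments.
candidates = [("p1", 0.08, 0.53), ("p2", 0.60, 1.00), ("p3", 0.10, 0.20)]
print(screen(candidates))
```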
4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern • Flowchart: a, occurrences → V(a) < 0.5? (No: Occurrence Partition → V(a) < 0.1? (No: Discard)) → 0.25 < D(a) < 1.5? (No: Discard) → D(a) < 1? (Yes: Multiple String Alignment → a’; No: a)
Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite with large regularity • Solution • Clustering of the occurrences of such a pattern • Flowchart: P → Clustering → V(P) < 0.1? (No: Discard; Yes: Check density)
Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
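The O(n*m) two-string case mentioned above can be sketched with the standard dynamic programming table. The unit costs for a mismatch and for a gap are assumptions (the slide does not give a scoring scheme), and `align` is a hypothetical name. Aligning k-1 substrings would need an extension of this, for example aligning the strings one at a time against a growing alignment, which is not shown here.

```python
def align(x, y, gap="-"):
    """Globally align two strings; return (aligned_x, aligned_y, cost).

    Assumed scoring: mismatch costs 1, gap costs 1, match costs 0.
    """
    n, m = len(x), len(y)
    # cost[i][j] = minimum cost of aligning x[:i] with y[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ax.append(x[i - 1]); ay.append(gap); i -= 1
        else:
            ax.append(gap); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), cost[n][m]

print(align("adcwbd", "adcxb"))
```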
Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
• The extraction pattern can be generalized as “adc[w|x]b[d|-]”
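Given an alignment like the one above, the generalized pattern can be read off column by column. The sketch below assumes the aligned rows already have equal length and use "-" for a gap; the function name `generalize` and the choice to list the gap last inside the brackets are assumptions made to match the slide's notation.

```python
def generalize(rows, gap="-"):
    """Build a generalized pattern from equal-length aligned rows.

    A column with one distinct symbol is emitted as is; a column with
    several is emitted as [s1|s2|...], with the gap symbol listed last.
    """
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))     # distinct, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            symbols.sort(key=lambda s: s == gap)  # stable sort: gap moves last
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

rows = ["adcwbd", "adcxb-", "adcxbd"]
print(generalize(rows))
```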
Pattern Viewer • Java-application based GUI • Web based GUI • http://140.115.155.102/WebIEPAD/
The Extractor • Matching the pattern against the encoding token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • matching the longest pattern • What are extracted? • The whole record
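As an illustration of the matching step, here is a minimal Knuth-Morris-Pratt search over a token sequence. It handles only a fixed pattern; rules with alternatives such as [w|x] would need an extended matcher, which is not shown. The function names are hypothetical and this is not the IEPAD extractor code.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    text and pattern may be strings or lists of tokens.
    """
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]          # continue, allowing overlapping matches
    return hits

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_find_all(record * 2, record))
```

Each reported index marks where one record starts in the encoded token string, so the extractor can map it back to the corresponding span of the HTML source.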
Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control
# of Patterns Discovered Using BlockLevel Encoding • Average 117 maximal repeats in our test Web pages
Translation • Average page length is 22.7KB
Summary • IEPAD: Information Extraction based on Pattern Discovery • Rule generator • The extractor • Pattern viewer • Performance • 97% retrieval rate and 94% accuracy rate
Problems • Guarantee high retrieval rate instead of accuracy rate • Generalized rule can extract more than the desired data • Only applicable when there are several records in a Web page, currently
Final • Acknowledgement • We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code. • Reference • Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.
Future Work • Interface for choosing a pattern • http://www.csie.ncu.edu.tw/~chia/webiepad/ • Multi-level extraction • From record boundary extraction to attribute value extraction • Extractors in Java and C++
Rule Format (k blocks, t attributes)
level 1 encoding scheme: rule
level 2 encoding scheme: rule for block 1
level 2 encoding scheme: rule for block 2
...
level 2 encoding scheme: rule for block k
level 1 block 1, level 2 block no for attribute 1
level 1 block 1, level 2 block no for attribute 2
...
level 1 block 1, level 2 block no for attribute t
Example (cont.)
Line 0: Blocklevel.h, <DL><DT>String<DD>String<BR>String<BR>String<BR>String</DD></DL>
Line 1: Alltag.h, rule for block 1
Line 2: Alltag.h, rule for block 2
...
Line k: Alltag.h, rule for block k
Line k+1: level 1 block no, level 2 block no for attribute 1
Line k+2: level 1 block no, level 2 block no for attribute 2
...
Line k+t: level 1 block no, level 2 block no for attribute t
Demo • ex: 3, 2 • ex: 5, all • ex: 5, 1 3
Performance Evaluation • Definition: • A pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a given threshold • Three Measures • Retrieval Rate • Accuracy Rate • Matching Percentage
Illustration • Let Gi,j denote the ordered occurrences pi, pi+1, ..., pj
S = ∅; i = 1;
For j = 1 to k-1 do
  If R(Gi,j+1) > threshold then
    If R(Gi,j) < m then
      S = S ∪ {Gi,j};
    endif
    i = j+1;
  endif
endfor