Annotation Free Information Extraction

Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002

Introduction • TEXT IE • AutoSlog-TS • Semi-structured IE • IEPAD

AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text Ellen Riloff University of Utah AAAI96

AutoSlog-TS • AutoSlog-TS is an extension of AutoSlog • It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus. • It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern. • A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.

AutoSlog-TS Concept

Relevance Rate • Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i • rel-freq_i: the number of instances of pattern_i that were activated in relevant texts • total-freq_i: the total number of instances of pattern_i that were activated in the training corpus • The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.

Rank function • Next, we use a rank function to rank the patterns in order of importance to the domain: relevance rate * log2(frequency) • So, a person only needs to review the most highly ranked patterns.
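The relevance rate and rank function can be sketched in a few lines of Python. This is a minimal illustration, not Riloff's implementation: the function names and the sample patterns and counts are hypothetical, and "frequency" in the rank function is assumed to mean total-freq_i.

```python
import math

def relevance_rate(rel_freq, total_freq):
    """Pr(relevant text | text contains pattern) = rel-freq / total-freq."""
    return rel_freq / total_freq

def rank_score(rel_freq, total_freq):
    """relevance rate * log2(frequency); frequency is assumed to be total-freq."""
    return relevance_rate(rel_freq, total_freq) * math.log2(total_freq)

def rank_patterns(stats):
    """stats maps a pattern to (rel_freq, total_freq); best pattern first."""
    return sorted(stats, key=lambda p: rank_score(*stats[p]), reverse=True)

# Hypothetical counts, for illustration only.
stats = {
    "<subj> was murdered": (9, 10),
    "<subj> said": (40, 100),
    "seen once": (1, 1),
}
for pattern in rank_patterns(stats):
    print(pattern, round(rank_score(*stats[pattern]), 2))
```

A pattern seen only once scores zero regardless of its relevance rate, because log2(1) = 0; the logarithm rewards frequent patterns without letting raw frequency dominate.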

Experimental Results: Setup • We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain. • We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992). • Training • AutoSlog: texts: 772 relevant; extraction patterns: 1237, 450 • AutoSlog-TS: texts: 1500 (50% relevant); extraction patterns: 32345, 11225, 210

Testing • To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set (50 relevant texts and 50 irrelevant texts). • We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing. • Correct: the item matched against the answer keys. • Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object. • Duplicate: the item was coreferent with an item in the answer keys. • Spurious: the item did not refer to any object in the answer keys. • Missing: items in the answer keys that were not extracted.

Experimental Results • We scored three items: perpetrators, victims, and targets.

Experimental Results • We calculated recall as correct / (correct + missing) • Compute precision as: (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
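The two scoring formulas translate directly into code. The sketch below uses hypothetical function names and made-up counts; it is not data from the experiment.

```python
def recall(correct, missing):
    """recall = correct / (correct + missing)"""
    return correct / (correct + missing)

def precision(correct, duplicate, mislabeled, spurious):
    """precision = (correct + duplicate) / (correct + duplicate + mislabeled + spurious)"""
    return (correct + duplicate) / (correct + duplicate + mislabeled + spurious)

# Made-up counts, for illustration only.
print(recall(correct=30, missing=10))
print(precision(correct=30, duplicate=10, mislabeled=5, spurious=35))
```

Duplicates count toward precision but not toward recall, so extracting the same answer-key item twice is not penalized as an error.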

Behind the scenes • In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not. • AutoSlog-TS produced 158 patterns with a relevance rate ≥ 90% and frequency ≥ 5. Only 45 of these patterns were in the original AutoSlog dictionary. • The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.

Future Directions • A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance. • The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.

IEPAD: Information Extraction based on Pattern Discovery C.H. Chang. National Central University WWW10

Semi-structured Information Extraction • Information Extraction (IE) • Input: HTML pages • Output: A set of records

Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

IEPAD Architecture • Pattern Generator: HTML page → patterns • Pattern Viewer: users select an extraction rule from the patterns • Extractor: extraction rule + HTML pages → extraction results

The Pattern Generator • Translator • PAT tree construction • Pattern validator • Rule Composer • Data flow: HTML page → Token Translator → token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules

1. Web Page Translation • Encoding of the HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
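The two encoding rules can be sketched with a regular expression. This is a rough illustration rather than IEPAD's translator: the function name, the sample markup, the choice to drop tag attributes, and the choice to skip whitespace-only text are all assumptions.

```python
import re

# A piece is either a whole tag or a run of text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<\s*(/?\w+)")

def translate(html):
    """Encode an HTML string as a token string."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            name = TAG_NAME.match(piece)
            if name is None:
                continue  # assumption: ignore comments and other odd markup
            # Rule 1: each tag becomes a token (attributes dropped here).
            tokens.append("T(<" + name.group(1).upper() + ">)")
        elif piece.strip():
            # Rule 2: any text between two tags becomes the TEXT token "_".
            tokens.append("T(_)")
    return "".join(tokens)

# Assumed sample markup with two records.
page = "<b>Congo</b><i>242</i><br>\n<b>Egypt</b><i>20</i><br>"
print(translate(page))
```

Because all text collapses to one token, two records with different contents produce the same token sequence, which is what makes the repeat visible.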

Various Encoding Schemes

2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) → 000110001010110011100 • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) → 000110001010110011100
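The binary encoding and the suffixes a PAT tree is built over can be illustrated as follows. This sketch does not build a Patricia tree; it only sorts the token-level suffixes so that suffixes sharing a prefix end up adjacent, which is the property the tree exploits. The tag names in the code table are assumed, and the function names are hypothetical.

```python
# Three-bit codes per token; the tag names are assumptions.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def encode(tokens):
    """Concatenate the binary codes of a token sequence."""
    return "".join(CODES[t] for t in tokens)

def sorted_suffixes(tokens):
    """Return (binary suffix, start position) pairs in lexicographic order."""
    return sorted((encode(tokens[i:]), i) for i in range(len(tokens)))

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(encode(record))

# Two identical records: suffixes starting at 0 and 7 share a long prefix.
for bits, start in sorted_suffixes(record * 2)[:4]:
    print(start, bits)
```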

The Constructed PAT Tree

Definition of Maximal Repeats • Let α occur in S at positions p1, p2, p3, …, pk • α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1] • α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|] • α is a maximal repeat if it is both left maximal and right maximal

Finding Maximal Repeats • Definition: • Let’s call character S[pi-1] the left character of suffix pi • A node v is left diverse if at least two leaves in v’s subtree have different left characters • Lemma: • The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
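The definition of a maximal repeat can be checked directly with a brute-force search. The sketch below is not the PAT tree method from the lemma: it enumerates every substring, so it is only suitable for short strings. It assumes one character per token and treats the string boundaries as distinct from every character; the function name is hypothetical.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Map each maximal repeat of s to its list of start positions."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue  # not a repeat
        # Characters just before and just after each occurrence;
        # None stands for "beyond the string boundary".
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else None
                 for p in positions}
        # Left (right) maximal: at least two occurrences differ on that side.
        if len(left) > 1 and len(right) > 1:
            result[sub] = positions
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats["adc"])
```

In this string "ad" is not a maximal repeat, because every occurrence is followed by "c", whereas "adc" is followed by "w", "x", and "b" in different places.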

3. Pattern Validator • Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density

Pattern Validator (Cont.) • Basic Screening: for each maximal repeat α, compute V(α) and D(α) • a) Check the pattern’s variance: keep α if V(α) < 0.5, otherwise discard it • b) Check the pattern’s density: keep α if 0.25 < D(α) < 1.5, otherwise discard it
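The basic screening step can be sketched as below. The slides name the two measures without defining them, so the definitions here are assumptions: V is taken to be the standard deviation of the gaps between adjacent occurrences divided by their mean, and D is taken to be (k-1)·|α| / (pk - p1). The function names are hypothetical; only the thresholds come from the slide.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_len):
    """Assumed D: (k - 1) * |pattern| / (p_k - p_1)."""
    return (len(positions) - 1) * pattern_len / (positions[-1] - positions[0])

def passes_basic_screening(positions, pattern_len):
    """Apply the two thresholds: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a single occurrence is not a repeat
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

# Evenly spaced, back-to-back occurrences of a 7-token pattern.
print(passes_basic_screening([0, 7, 14, 21], 7))
# Two distant clumps of a 2-token pattern: irregular spacing.
print(passes_basic_screening([0, 2, 50, 52], 2))
```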

4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern

Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite with large regularity • Solution • Cluster the occurrences of such a pattern • For each cluster P: if V(P) < 0.1, check its density; otherwise discard it

Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • A natural generalization of two-string alignment, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.

Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”: • a d c w b d • a d c x b - • a d c x b d • The extraction pattern can be generalized as “adc[w|x]b[d|-]”
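Two pieces of this step are easy to sketch: the O(n*m) dynamic program for aligning two strings, and the generalization of already-aligned rows into a pattern. The code below does not implement full multiple string alignment, and it assumes unit costs for gaps and mismatches; the function names are hypothetical.

```python
def align_pair(a, b, gap="-"):
    """Globally align two strings with unit gap and mismatch costs."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def generalize(rows):
    """Turn equal-length aligned rows into a pattern with alternatives."""
    parts = []
    for column in zip(*rows):
        options = list(dict.fromkeys(column))  # distinct, in first-seen order
        parts.append(options[0] if len(options) == 1
                     else "[" + "|".join(options) + "]")
    return "".join(parts)

print(align_pair("adcwbd", "adcxb"))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```

Columns where all rows agree stay as literal tokens; a column containing "-" marks a token that is optional in the generalized pattern.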

Pattern Viewer • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

The Extractor • Matching the pattern against the encoding token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • matching the longest pattern • What are extracted? • The whole record
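Matching a rule with alternatives against the token string can be illustrated with Python's `re` module. This swaps in a regular-expression engine for the Knuth-Morris-Pratt and Boyer-Moore matchers named above, so it shows the behavior rather than the algorithm. It assumes one character per token and the bracket notation “adc[w|x]b[d|-]”, where “-” marks an optional token; the function names are hypothetical.

```python
import re

def rule_to_regex(rule):
    """Compile a rule such as 'adc[w|x]b[d|-]' into a regular expression."""
    parts = []
    i = 0
    while i < len(rule):
        if rule[i] == "[":
            end = rule.index("]", i)
            options = rule[i + 1:end].split("|")
            tokens = [re.escape(o) for o in options if o != "-"]
            group = "(?:" + "|".join(tokens) + ")"
            # "-" means the token may be absent; the greedy "?" prefers the
            # longer match, in line with matching the longest pattern.
            parts.append(group + ("?" if "-" in options else ""))
            i = end + 1
        else:
            parts.append(re.escape(rule[i]))
            i += 1
    return re.compile("".join(parts))

def extract(rule, token_string):
    """Return (start position, matched tokens) for each whole record found."""
    return [(m.start(), m.group())
            for m in rule_to_regex(rule).finditer(token_string)]

print(extract("adc[w|x]b[d|-]", "adcwbdadcxbadcxbdadcb"))
```

On this token string the rule matches the first three occurrences; the last occurrence, “adcb”, has no “w” or “x” token and is not matched.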

Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control

# of Patterns Discovered Using BlockLevel Encoding • Average 117 maximal repeats in our test Web pages

Translation • Average page length is 22.7KB

Accuracy and Retrieval Rate

Summary • IEPAD: Information Extraction based on Pattern Discovery • Rule generator • The extractor • Pattern viewer • Performance • 97% retrieval rate and 94% accuracy rate

Problems • Guarantees a high retrieval rate rather than a high accuracy rate • A generalized rule can extract more than the desired data • Currently applicable only when a Web page contains several records

References • TEXT IE • Riloff, E. (1996) Automatically Generating Extraction Patterns from Untagged Text, AAAI-96, pp. 1044-1049. • Riloff, E. (1999) Information Extraction as a Stepping Stone toward Story Understanding, in Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.

References • Semi-structured IE • D.W. Embley, Y.S. Jiang, and W.-K. Ng, Record-Boundary Discovery in Web Documents, SIGMOD'99 Proceedings. • C.H. Chang and S.C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong. • B. Chidlovskii, J. Ragetli, and M. de Rijke, Automatic Wrapper Generation for Web Search Engines, The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000.
