Rapid Prototyping of WWW Niche Search Engines: Document Classification and Information Extraction
Christian W. Omlin
http://www.cs.uwc.ac.za/~comlin
http://www.coe.uwc.ac.za
Telkom/Cisco Center of Excellence for IP and Internet Computing
Department of Computer Science, University of the Western Cape
Outline
• Why Niche Search Engines
• Technical Challenges and Issues for Niche SEs
• Deadliner: A Search Engine for Conference CFPs
• What Does Deadliner Do?
• Deadliner Architecture
• Input Data Preprocessing
• Document Classifier
• Information Extraction: Simple Detectors, Optimal Detector Fusion, ROC Curves
• Neyman-Pearson
• Presentation and Cataloging
• Performance
• Conclusions & Future Work
Why Niche Search Engines (SEs)?
General SE
• General purpose
• Low precision/recall; results not highly structured
• High cost: large database, high bandwidth
• Lots to crawl
• More resources, most of which are rarely used
Niche SE (specialized SE)
• Domain specific
• High precision/recall; structured data
• Lower cost: smaller database, lower bandwidth
• More up to date
• Fewer resources; local, personalized
Technical Challenges & Issues for Niche SEs
• Focused crawling
• Page classification
• Automated data and knowledge extraction and indexing
  - Identification
  - Extraction
  - Summarization
  - Presentation
  - Integration
• Creating (general-purpose) tools to help build niche search engines
• Niche SE design research and system integration
What Does DEADLINER Do?
• Domain-specific document retrieval
• Meta-tool (uses and integrates other existing tools: focused crawler, meta-search engine, newsgroups)
• Prescreens gathered text using support vector machines
• Constructs detectors for target information in text (e.g. the theme of a conference)
• Bayesian detector fusion
• Presentation and cataloging of extracted information
• Places data into a structured database
• Allows complex queries (e.g. search by date)
Input Stages
Build upon other existing systems:
• Newsgroup scanner
• Query-modified meta-search: learned query modifications used with text derived from a database of calls for papers
• Focused WWW crawler
  - learns the context of relevant documents
  - a context graph models link hierarchies
Support Vector Machines (SVMs)
• SVMs screen the harvested documents
• Good text classifiers
• Handle high-dimensional input vectors
• Resist overfitting by choosing salient features
• Class vectors are separated by hyperplanes
• Hyperplanes maximize the margins between classes, which controls complexity and generalization (see the sketch below)
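A minimal sketch of this pre-screening step in Python, assuming scikit-learn; the training texts and labels are hypothetical stand-ins, not DEADLINER's actual data or implementation.

```python
# A linear SVM over binary word/ngram-presence features screens documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "call for papers submission deadline program committee",
    "second call for papers workshop deadline extended",
    "cheap flights and hotel deals book now",
    "breaking news sports scores and weather",
]
train_labels = [1, 1, 0, 0]          # 1 = call for papers, 0 = irrelevant

# binary=True yields the {0,1} feature vectors described on the next slide;
# ngram_range covers words, bigrams, and trigrams
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(train_texts)

clf = LinearSVC()                    # maximum-margin separating hyperplane
clf.fit(X, train_labels)
print(clf.predict(vectorizer.transform(["paper deadline for the conference"])))
```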
SVM Input Data Vectors
• Construct an input vocabulary
• A feature is a word, bigram, or trigram
• Features are chosen relative to a fixed vocabulary
• If a feature occurs in at least 7.5% of the true-class or false-class documents, it is a candidate for the vocabulary
• Candidates are ranked by the ratio of their frequency in true-class documents to their frequency in false-class documents
• The N highest-ranked features form the vocabulary (see the sketch below)
• Feature vectors are constructed from this vocabulary
• Vector entries are binary, labeled {0, 1}
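A sketch of the vocabulary-construction heuristic above; function and variable names are illustrative, not from the paper.

```python
# Keep features appearing in >= 7.5% of either class, rank by the ratio of
# true-class to false-class document frequency, take the top N.
def build_vocabulary(true_docs, false_docs, n, min_frac=0.075):
    def doc_freq(docs):
        counts = {}
        for doc in docs:
            for feat in set(doc):          # doc = list of words/bi-/trigrams
                counts[feat] = counts.get(feat, 0) + 1
        return {f: c / len(docs) for f, c in counts.items()}

    ft, ff = doc_freq(true_docs), doc_freq(false_docs)
    eps = 1e-6                              # avoid division by zero
    candidates = [f for f in set(ft) | set(ff)
                  if ft.get(f, 0) >= min_frac or ff.get(f, 0) >= min_frac]
    candidates.sort(key=lambda f: ft.get(f, 0) / (ff.get(f, 0) + eps),
                    reverse=True)
    return candidates[:n]

def to_vector(doc, vocabulary):
    feats = set(doc)
    return [1 if f in feats else 0 for f in vocabulary]  # binary {0,1} vector
```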
Heuristics to Combine Text Fields in Filters
• The simplest filters perform keyword matching on the built vocabulary using regular expressions
• Program committees: text is matched against a dictionary of known author names & affiliations obtained from ResearchIndex (~90K)
• Deadlines: standard date formats from sentences, with surrounding or immediately preceding text
• Titles: may contain city/state, date of meeting, deadline, list of sponsors, conference name, conference acronym, theme/summary of the conference
• A title must match at least two of these against the database
Bayesian Detector Fusion
• How do we find suitable pieces of the document to apply extraction rules to?
• Optimally integrate simpler detectors for every target field
• A regular expression or formatting rule is called a filter
• A filter either matches or does not; the combination of a filter and its match status is called a detector
• These partial detectors are combined to yield a new detector
• The new detector's precision/recall trade-off can be tuned via a single parameter
Bayesian Detector Fusion II
• Every detector is a binary classifier
• We seek the classifier with the highest probability of detection P_d for a given false-alarm rate P_f
• The combined output of N detectors is denoted as a bit string
• Assume two hypotheses: the document is either relevant or irrelevant
• There are therefore 2^N possible bit strings, and 2^(2^N) distinct possible classification rules
Receiver Operating Characteristic (ROC) Curves
• Graph of true-positive rate vs. false-positive rate
• The area under the curve is the probability of correctly distinguishing a (normal, abnormal) pair
• The operating points defined by the 2^(2^N) binary mappings form the Achievable Operating Set
• The ROC is the set of operating points yielding the maximal detection rate for a given false-alarm rate
• The Neyman-Pearson (NP) procedure can be used to construct an optimal hypothesis test for a distribution parameter
• NP ranks the 2^N possible bit strings by their likelihood ratios (see the sketch below)
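A sketch of the NP-style fusion this and the previous slide describe: rank the 2^N bit strings by likelihood ratio and accumulate them in that order to trace the ROC. The conditional probabilities below are hypothetical; in practice they would be estimated from labeled data.

```python
# Neyman-Pearson detector fusion: each prefix of the likelihood-ratio
# ordering of bit strings is one operating point on the ROC.
def roc_from_fusion(p_rel, p_irr):
    """p_rel[s] = P(bit string s | relevant), p_irr[s] = P(s | irrelevant)."""
    strings = sorted(p_rel, key=lambda s: p_rel[s] / max(p_irr[s], 1e-12),
                     reverse=True)          # NP ordering by likelihood ratio
    pd = pf = 0.0
    points = [(0.0, 0.0)]
    for s in strings:                       # admit one more bit string at a time
        pd += p_rel[s]                      # detection rate P_d
        pf += p_irr[s]                      # false-alarm rate P_f
        points.append((pf, pd))
    return points                           # maximal P_d at each achievable P_f

# Two detectors -> 2^2 = 4 bit strings (and 2^(2^2) = 16 deterministic rules)
p_rel = {"11": 0.60, "10": 0.20, "01": 0.15, "00": 0.05}
p_irr = {"11": 0.05, "10": 0.15, "01": 0.20, "00": 0.60}
print(roc_from_fusion(p_rel, p_irr))
```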
ROC Interpretation
• Selecting features (y_1, y_2, …, y_N) yields an operating point: a probability of detection P_d and a probability of false alarm P_f
• Along the ROC, P_d is the maximum achievable at each P_f for the chosen features
Constructing a ROC Curve
• Assume the true class-conditional density functions of the data are given [densities shown on slide]
• The ROC support classifiers φ_i(y), i = 0, 1, 2, 3, obtained via the NP design, are: [classifiers shown on slide]
Neyman-Pearson (often confused with an optimal Bayesian classifier)
Theorem: Let X be a random sample from f_X(x; θ), θ ∈ {θ_0, θ_1}, where both X and θ may be multidimensional vectors. Let C be a rejection region of significance α, and let λ = P(x; θ_1) / P(x; θ_0), where P is the likelihood function. Then C is a best (optimal) rejection region for testing H_0 against H_1 if, for some positive number k:
• λ ≥ k for x ∈ C
• λ < k for x ∈ complement(C)
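A standard textbook instance of the lemma (not from the slides), showing how the likelihood-ratio test reduces to a simple threshold when testing the mean of a unit-variance Gaussian:

```latex
% Likelihood ratio for X ~ N(theta, 1), testing theta_0 vs theta_1
\[
\lambda(x) \;=\; \frac{f(x;\theta_1)}{f(x;\theta_0)}
  \;=\; \exp\!\Big( (\theta_1-\theta_0)\,x \;-\; \tfrac{1}{2}\big(\theta_1^2-\theta_0^2\big) \Big).
\]
% For theta_1 > theta_0 the ratio is increasing in x, so the NP test
% lambda(x) >= k is equivalent to a threshold test on x itself:
\[
\lambda(x) \ge k \;\iff\; x \ge c,
\qquad \text{with } c \text{ chosen so that } P(X \ge c \mid \theta_0) = \alpha .
\]
```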
Presentation & Cataloging
• After detection, the target elements need to be extracted
• A particular setting might overrule one or more constituent filters, require a certain combination of features, or enforce some other joint relationship
• Matching filters and positive detectors are processed for extraction using heuristics
• The match with the highest confidence is used; for "close calls", all matches are indexed
• In principle, no irreversible decisions are made
Extraction of a Target Field (Example)
Different filters attempt to recognize the target, here the "title"; the result reflects overall detection confidence. [example shown on slide]
Primitives Used in DEADLINER
• Primitives are used to construct filters
• Most conference materials follow a block layout, e.g. a title, abstracts, program committee, affiliations, discussion topics, venue, scope, and a miscellaneous section
• Filters are constructed from dictionaries and/or heuristics
• Example: a title usually contains two or more of: a country name, city name, affiliation, date of meeting, deadline, conference name, list of sponsors, conference acronym, theme
Examples of Primitives
For a deadline, we use for example:
• (cur_sentence, /deadline/i) — matches the word "deadline"
• (cur_sentence, /<date>/) — matches a date
• #(cur_sentence, /<date>/) ≥ 3 — there are three or more dates in the current sentence
(See the sketch after this list.)
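A sketch of these three primitives as Python regular expressions; the `<date>` pattern is a simplified stand-in for DEADLINER's actual date grammar.

```python
import re

# Simplified <date> pattern: "March 1, 2000", "Mar. 1 2000", etc.
DATE = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},?\s+\d{4}"

def matches_deadline(sentence):        # (cur_sentence, /deadline/i)
    return re.search(r"deadline", sentence, re.IGNORECASE) is not None

def matches_date(sentence):            # (cur_sentence, /<date>/)
    return re.search(DATE, sentence) is not None

def count_dates_ge(sentence, n=3):     # #(cur_sentence, /<date>/) >= 3
    return len(re.findall(DATE, sentence)) >= n

s = "Paper deadline: March 1, 2000; notification April 15, 2000; camera-ready May 20, 2000."
print(matches_deadline(s), matches_date(s), count_dates_ge(s))
```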
Support Vector Machine Performance
• 592 manually classified CFPs, 2269 "random" negative examples from Inquirus2's logs, and 850 negative examples (conference-related, but not CFPs)
• Of these, 249 positive and 1250 negative examples were used for training the SVM
• The remaining 353 positive and 1019 negative examples were used for testing
• Non-English and multilingual sites present major issues
ROC Curve – "Deadline"
Optimal detection performance for different numbers of combined detectors; there is a significant performance increase from integrating multiple detectors. [ROC curves shown on slide]
Extractor Performance at P_f = 5%
• 500 training documents from DBWorld and 100 testing documents from DIKU
• The DBWorld pages contain 208 lists of interesting topics, 338 conference titles, 906 deadlines, and 197 program committees
• The DIKU pages contain 300 deadlines and 1455 program committee members

Extraction results for "deadline" and "program committee":

Target                    Deadline   Committee
Total                     300        1455
Detected/Extracted        214        1252
Detected/Not Extracted    2          72
Extraneous                31         136

Extraction results for "titles":

Start Date   End Date   Theme/Name   Country   Type of Meeting
73%          71%        81%          77.5%     85%
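Treating "Extraneous" as false positives (a reading the slide does not make explicit), these figures imply a deadline precision of 214 / (214 + 31) ≈ 87% with recall 214 / 300 ≈ 71%, and a committee precision of 1252 / (1252 + 136) ≈ 90% with recall 1252 / 1455 ≈ 86%.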
Related Work
• ResearchIndex.com and Cora collect and index online papers by text and citations
• SoftMealy (Hsu) applies finite-state transducer rules to HTML pages
• XWRAP is a wrapper construction system that transforms HTML pages into XML using generated rules
• Stalker is a greedy sequential covering algorithm that uses landmark automata to generate wrappers
Related Work: ResearchIndex, Cora
• ResearchIndex.com and Cora are examples of document niche search engines
• Both perform citation extraction from online publications, as well as page-content and citation analysis
• Cora uses trained Hidden Markov Models to extract information from citations
• ResearchIndex uses hand-constructed algorithms and heuristics, plus machine learning
• The DEADLINER approach: simple filters select appropriate text blocks, followed by simple extraction heuristics
Related Work: SoftMealy (Hsu)
• Trains a finite-state transducer (FST) for token extraction
• Uses a heuristic to prevent non-determinism (and therefore increase efficiency)
• Contextual rules are produced by an induction algorithm
• The resulting FSTs are applied to HTML pages (DEADLINER uses text-only pages in this experiment)
Related Work: XWRAP (Liu, Pu, Han)
• XWRAP is a wrapper construction system that transforms HTML pages into XML
• Rules are generated and applied to HTML; interesting document regions are identified via an interactive interface
• Semantic tokens are identified similarly
• These steps are followed by hierarchy determination, resulting in a context-free grammar
Related Work: Stalker (Muslea, Minton, Knoblock)
• An algorithm that uses landmark automata to generate wrappers
• Stalker is a greedy sequential covering algorithm
• Generates a landmark automaton that accepts only true positives
• Does so by finding a perfect disjunct, or until it runs out of training examples
• The best disjunct covers the most positive examples; new disjuncts are added iteratively to cover remaining positive candidates
Future Work
• Create feature-extraction tools (to decrease modeling time) and improve classification/extraction
• Increase the (types of) meta-data extracted
• Apply the DEADLINER architecture to other domains
• Extend to WWW image searches
• eAfrica.org: a repository of timely documents related to e-learning, e-commerce, e-government, e-healthcare, etc. for the African continent
Reference: A. Kruger, C. L. Giles, F. Coetzee, E. Glover, G. W. Flake, S. Lawrence, and C. W. Omlin, "DEADLINER: Building a New Niche Search Engine," 9th International Conference on Information and Knowledge Management (CIKM), 2000.