Bayesian Online Classifiers for Text Classification and Filtering A Paper by Kian Ming Adam Chai, Hwee Tou Ng and Hai Leong Chieu Presented by Eric Franklin and Changshu Jian
Agenda • Problem Description & Proposed Solution • Algorithm Description • Evaluation & Conclusions • Discussion & Critiques • Summary
Problem Description As the number of existing documents grows daily, the need for automatic ways to classify text documents grows with it, and handcrafting text classifiers is a tedious process. New methods for classifying documents are relevant to the field of Data Mining because classification is the entry point for many subsequent data mining functions.
Problem Description Why document classification? • Spam filtering • Users want to browse mass search results by topic (IR) • Cluster hypothesis: documents in the same cluster tend to be relevant to the same query • Why Bayesian? • Bayesian classification has proven useful in document classification and has a rigorous theoretical basis • Why not offline? Offline learning needs a large data set up front to achieve good accuracy
Proposed Solution • Two related Bayesian algorithms can perform comparably to Support Vector Machines (SVM) • Bayesian Online Perceptron • Bayesian Online Gaussian Process • The online approach allows continuous learning without storing all the previous data • Continuous learning allows the utilization of information obtained from subsequent data after the initial training
Bayesian Online Learning • Given m instances of past data Dm = {(yt, xt), t = 1...m}, the predictive probability of the relevance of a document described by x is p(y|x, Dm) = ∫ p(y|x, a) p(a|Dm) da • a is a random variable with probability density p(a|Dm) • Integrate over all the possible values of a to obtain the prediction • Explicit dependence of the posterior p(a|Dt+1) on the past data is removed by approximating it with a distribution p(a|At+1)
Bayesian Online Learning (cont) • Starting from the prior p0(a) = p(a|A0), learning comprises two steps • Update the posterior probability using Bayes rule • Approximate the updated posterior probability • Approximation is done by minimizing the Kullback-Leibler distance between the approximating and approximated distributions • Kullback-Leibler Distance • Non-symmetric measure of the difference between two probability distributions • Measures the expected number of extra bits required when coding with one probability distribution instead of the other
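The KL distance can be sketched for discrete distributions (a minimal illustration; the paper applies it to posterior densities over a, not to discrete distributions):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler distance D(p || q) between two discrete
    distributions, in bits. Non-symmetric: D(p||q) != D(q||p)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# identical distributions have zero distance
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # -> 0.0
# swapping arguments gives a different value: the measure is non-symmetric
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```

The asymmetry is why the direction of minimization matters when fitting the approximating distribution.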
What is a Perceptron? • Simplest feed-forward neural network • A perceptron is a binary classifier that maps real input vector values to binary output • A thresholding function, f(x) is used • If w ∙ x + b > 0, f(x) maps to 1, else f(x) maps to 0 • x is the input vector • w is a vector of real-valued weights • b is a bias, a constant term that does not depend on any input value
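The thresholding rule above can be sketched directly (a toy illustration; the weights, bias, and inputs here are made up):

```python
def perceptron_predict(w, x, b):
    """Threshold unit: f(x) = 1 if w . x + b > 0, else 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0

# hypothetical weights: first feature votes for the class, second against
w = [2.0, -1.0]
print(perceptron_predict(w, [1.0, 0.0], b=-0.5))  # 2.0 - 0.5 > 0 -> 1
print(perceptron_predict(w, [0.0, 1.0], b=-0.5))  # -1.0 - 0.5 <= 0 -> 0
```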
Bayesian Online Perceptron • Likelihood is defined as a probit model: p(y|x, a) = Φ( y (a ∙ x) / σ0 ) • Where a defines a perceptron • σ02 is a fixed noise variance • Φ is the cumulative Gaussian distribution • x is a vector representing a document • y is the document relevance, where y ∈ {−1,1}
Bayesian Online Perceptron Algorithm • Successive calculation of the means ⟨a⟩t and covariances Ct of the posterior probabilities for m documents • Initialize ⟨a⟩0 to be 0 and C0 to be 1 • For t = 0, 1, ..., m−1 • yt+1 is the relevance indicator for document xt+1 • Calculate st+1, σt+1, ⟨h⟩t = ⟨a⟩t ∙ xt+1 and ⟨p(yt+1|h)⟩t = Φ(yt+1⟨h⟩t / σt+1) • Calculate the first and second derivatives of ln ⟨p(yt+1|h)⟩t with respect to ⟨h⟩t • Update the mean: ⟨a⟩t+1 = ⟨a⟩t + Ct xt+1 (∂/∂⟨h⟩t) ln ⟨p(yt+1|h)⟩t • Update the covariance: Ct+1 = Ct + Ct xt+1 xt+1ᵀ Ct (∂2/∂⟨h⟩t2) ln ⟨p(yt+1|h)⟩t • The prediction for datum (y,x) is ⟨p(y|x,a)⟩m = ⟨p(y|h)⟩m
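The loop above can be sketched as an assumed-density-filtering step with the probit likelihood. The derivative formulas below are reconstructed from the standard derivation, not taken verbatim from the paper, so treat them as illustrative:

```python
import math

def phi(z):   # standard Gaussian cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def npdf(z):  # standard Gaussian density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def update(mean, cov, x, y, sigma0=1.0):
    """One online step: refresh <a>_t (mean) and C_t (covariance)
    after seeing document x with relevance y in {-1, +1}."""
    d = len(x)
    cx = [sum(cov[i][j] * x[j] for j in range(d)) for i in range(d)]  # C_t x
    h = sum(mean[i] * x[i] for i in range(d))                          # <h>_t
    s = math.sqrt(sigma0 ** 2 + sum(x[i] * cx[i] for i in range(d)))   # sigma_{t+1}
    z = y * h / s
    # first and second derivatives of ln <p(y|h)>_t w.r.t. <h>_t
    g1 = y * npdf(z) / (phi(z) * s)
    g2 = -g1 * (g1 + y * z / s)
    new_mean = [mean[i] + cx[i] * g1 for i in range(d)]
    new_cov = [[cov[i][j] + cx[i] * cx[j] * g2 for j in range(d)]
               for i in range(d)]
    return new_mean, new_cov

def predict(mean, cov, x, sigma0=1.0):
    """Predictive probability <p(y=1|x,a)>_m = Phi(<h>_m / sigma)."""
    d = len(x)
    h = sum(mean[i] * x[i] for i in range(d))
    var = sigma0 ** 2 + sum(x[i] * sum(cov[i][j] * x[j] for j in range(d))
                            for i in range(d))
    return phi(h / math.sqrt(var))

# one pass over a tiny relevance-labelled stream (toy data)
mean, cov = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for y, x in [(1, [1.0, 0.0]), (-1, [0.0, 1.0]), (1, [1.0, 0.0])]:
    mean, cov = update(mean, cov, x, y)
print(predict(mean, cov, [1.0, 0.0]))  # rises above 0.5 for the relevant type
```

Note that only the current mean and covariance are carried forward, which is exactly what makes the approach online: no past documents need to be stored.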
Bayesian Online Gaussian Process • Gaussian process (GP) classification has historically been constrained to problems with small data sets • Uses efficient and effective approximations to the full GP formulation • Similar to the perceptron, but works through a kernel function rather than an explicit weight vector
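The kernel substitution can be illustrated with toy kernels (names are illustrative; a linear dot-product kernel is the case where the GP behaves like the perceptron's implicit weight-space model):

```python
def linear_kernel(x, z):
    """Dot product between two document vectors: the kernel that
    corresponds to an explicit weight vector."""
    return sum(xi * zi for xi, zi in zip(x, z))

def poly_kernel(x, z, degree=2):
    """A nonlinear kernel: same inputs, richer implicit feature space."""
    return (1.0 + linear_kernel(x, z)) ** degree

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))   # -> 11.0
print(poly_kernel([1.0, 0.0], [1.0, 0.0]))     # -> 4.0
```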
Evaluation • The authors use case studies to validate the proposed methodology • Two benchmark data sets • Strengths • Easy to compare with other methods • Weaknesses • Tests are not comprehensive; no real-world application test • No theoretical proof
Evaluation • Two tasks: classification & filtering • Classification • Reuters-21578 corpus • 9,603 training documents and 3,299 test documents • Filtering • OHSUMED corpus • Only the Bayesian Online Perceptron is considered
Evaluation: Classification • Feature Selection • Select as features for each category the set of all words for which −2 ln λ > 12.13 • Further prune by keeping only the top 300 features • Thresholding • Bayes decision rule: classify as relevant when p(y = 1|x,Dm) > 0.5 • Additionally, MaxF1: a threshold empirically optimized for each category to maximize F1
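MaxF1 thresholding can be sketched as a sweep over candidate thresholds on held-out (score, relevance) pairs, keeping the one that maximizes F1 (function name and data below are illustrative):

```python
def max_f1_threshold(scores, labels):
    """Return (threshold, F1) maximizing F1 = 2PR/(P+R) when documents
    with score >= threshold are classified as relevant (label 1)."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue
        p, r = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * p * r / (p + r)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# toy scores: the two relevant docs are ranked above the two irrelevant ones
print(max_f1_threshold([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))  # -> (0.8, 1.0)
```

Tuning a per-category threshold this way is what lets the same probabilistic scores serve categories of very different sizes, which is why it helps most on rare categories.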
Classification on Reuters-21578 • Generally, MaxF1 thresholding increases the performance of all the systems, especially for rare categories. • For the Bayesian classifiers, ExpectedF1 thresholding improves the performance of the systems on rare categories. • Perceptron implicitly implements the kernel used by GP-1, hence their similar results. • With MaxF1 thresholding, feature selection impedes the performance of SVM.
Classification on Reuters-21578 • With limited features, the Bayesian classifiers outperform SVM for both common and rare categories • Based on the sign tests, the Bayesian classifiers outperform SVM (using 8,362 words) for common categories, and vice versa for rare categories
Evaluation: Filtering • Feature selection and adaptation • Training the classifier • Information gain • Results
Filtering on OHSUMED • System comparison • Using Bayesian online perceptron
Parameter Settings • Feature selection and Adaptation • Training the classifier • Information Gain
Results • A kind of active learning, where the willingness to trade off precision for learning decreases with Nret • Features are constantly added as relevant documents are seen. When the classifier is retrained on past documents, the new features enable it to gain new information from those documents.
Results • Bayesian online perceptron, together with the consideration for information gain, is a very competitive method.
Conclusions & Future Work • These algorithms performed comparably to SVM • Future work • Hybrid classification using Bayesian classifiers for common categories and maximum margin classifiers for rare categories • Modify Bayesian classifiers to use relevance feedback • Compare incremental SVM with the Bayesian online classifiers
Discussion • Major contributions of the paper • Testing of Existing Capability • Implemented and tested Bayesian online perceptron and Gaussian processes • Demonstrated the effectiveness of online learning with information gain on the TREC-9 batch-adaptive filtering task • New Capability • Offers online capability • Online processing is the most significant contribution of this paper, but we both feel that it needs further testing
Discussion • Assumptions made by the authors • Assume the feedback received is positive • How does negative feedback affect the system? • Assume the approximated posterior is close enough to the actual posterior • Based on the chosen probability distribution • Computing cost & scalability • The system was tested against a corpus of ~20,000 documents. How would it perform against 1 million documents? 1 billion documents?
Summary • The authors discuss the problem of classifying large sets of text documents • They propose two variants of Bayesian online classification algorithms • Testing was performed against the Reuters-21578 and OHSUMED corpora • The authors' algorithms performed comparably to Support Vector Machines