This paper presents an efficient Chinese web page classifier based on centroid methodology. It discusses the background, basic technique, classifier design, and implementation, along with various features and experiments. The classifier demonstrates satisfactory performance, high accuracy, and very fast speed.
An Efficient Centroid Based Chinese Web Page Classifier LIU Hui EE Dept of Tsinghua Univ. China Aug 28, 2003
Outline • Background • Basic Technique • Classifier Design & Implementation • Idea • Architecture • Feature • Experiment • Summary
Background of Web Page Classification • Explosive growth of information needs organization • Digital Library • Search Engine • Special (Categorized) Sites • Research hot topics • Data Mining • Information Retrieval • Pattern Recognition • Automatic Text Categorization
Background of Our Classifier • Net-compass Search Engine • An emerging large, distributed search engine • Classifier embedded in its new version • Chinese web page categorization competition • Held on March 14th-15th, 2003 • Ranked first • Workgroup • EE Dept of Tsinghua Univ., 3 master's students & 1 undergraduate student
Feature Selection • Term Frequency (TF) • Term Frequency & Inverse Document Frequency (TF.IDF) • Mutual Information (MI) • Statistics
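The TF.IDF weighting named on this slide can be sketched as follows. This is a minimal illustration, not the classifier's actual code: the function name `tf_idf`, the raw-count TF, and the unsmoothed `log(N / df)` form of IDF are all assumptions, since the slide does not say which variant was used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted
```

With this variant, a term that occurs in every document gets weight 0, which is why TF.IDF favors terms that are frequent in a page but rare across the corpus.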
Training - Statistical Machine Learning • Vector Distance • Centroid Based Method • k-Nearest Neighbor: lazy learning • Support Vector Machine: Structural Risk Minimization • Feedback & Combining Classifiers • Neural Network • Boosting method • Probability • Naïve Bayes: Pr(Term|Class) -> Pr(Text|Class)
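As a rough illustration of the centroid based method listed above: each class is represented by the normalized sum of its training vectors, and a page is assigned to the class whose centroid has the highest cosine similarity. The names `normalize`, `train_centroids`, and `classify` are hypothetical, and the sparse-dict vector representation is an assumption; the slides do not describe the real data structures.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(samples):
    """samples: iterable of (label, {term: weight}). Returns {label: unit centroid}."""
    sums = defaultdict(lambda: defaultdict(float))
    for label, vec in samples:
        for term, w in normalize(vec).items():
            sums[label][term] += w
    # Normalizing the sum gives the same direction as normalizing the mean.
    return {label: normalize(acc) for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest to vec by cosine similarity."""
    vec = normalize(vec)

    def cosine(centroid):
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

Training is a single pass over the samples and classification needs one dot product per class, which is consistent with the speed argument made for this family of methods.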
Idea • Large database (Net-compass Search Engine) • Fast speed • Tolerable precision • Web resources change fast • Classifier must be easy to build • Fast training • Supporting multiple languages • Word segmentation • Easy training set building & updating
Features: Preprocessing • Chinese Word Segmentation • Dictionary built on search engine log • Adaptability, Manageability, Accuracy • Maximum Matching Segmenting Method • Fast, tolerable accuracy • Noise Filtering • Stop words: common words, abandoned words • Advertising links: length & content
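The maximum matching segmentation and stop-word filtering steps above might look like the sketch below. It assumes the forward (left-to-right) variant of maximum matching, which the slide does not specify; the function names, the `max_len` parameter, and the tiny dictionary in the usage note are illustrative only.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation, scanning left to right.

    Characters that start no dictionary word are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]
```

For example, with the dictionary `{"清华", "清华大学", "大学", "网页", "分类"}`, the text `"清华大学的网页分类"` is segmented into `["清华大学", "的", "网页", "分类"]`, and filtering with the stop-word set `{"的"}` leaves `["清华大学", "网页", "分类"]`. The method needs only set lookups, which matches the "fast, tolerable accuracy" trade-off on the slide.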
Features • Combined Feature Selection • Statistics: tends to choose high-frequency words • Mutual Information: tends to choose low-frequency words • Subspace
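One way to read the combined feature selection above is sketched here. Two things are assumptions and not taken from the slides: that "Statistics" refers to the chi-square statistic, and that the two criteria are combined by taking the union of each one's top-k terms. The per-term counts follow the usual contingency layout for one class: `a` = documents in the class containing the term, `b` = documents outside the class containing it, `c` = documents in the class without it, `d` = documents outside the class without it.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for one class (assumed meaning of 'Statistics')."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise mutual information between the term and the class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


def select_features(counts, k):
    """counts: {term: (a, b, c, d)}. Hypothetical combination: union of both top-k lists."""
    by_chi = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)[:k]
    by_mi = sorted(counts, key=lambda t: mutual_information(*counts[t]), reverse=True)[:k]
    return set(by_chi) | set(by_mi)
```

In a toy corpus of 100 documents split evenly between the class and the rest, a frequent term with counts `(40, 10, 10, 40)` ranks first under chi-square, while a rare term with counts `(2, 0, 48, 50)` ranks first under mutual information, so the union with `k=1` keeps both. This mirrors the complementary bias the slide describes.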
Features • Adaptive Factors: adjust the model to compensate for deficiencies of the training set • Class Weight • VIP word factor • Implementation • Berkeley DB • Structured dictionary • Avoid I/O • 3000 medium-sized Chinese Web pages: 50 seconds
Experiment • Corpus • Chinese Web Page training set • Provided by Peking University • 11 classes, 14000 samples, highly unbalanced distribution • Evaluation • Precision, Recall, F-measure
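The evaluation measures named above can be computed per class as in this short sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the slide does not state which weighting was used, and the function name is illustrative.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Per-class precision, recall, and balanced F-measure (F1, an assumption)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

On an unbalanced corpus such as the one described here, per-class scores are more informative than overall accuracy, because a classifier can score well overall while missing most pages of a small class.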
Experiment Discussion • More samples, more accurate • Some classes are more difficult • Corpus coverage not large enough • Open testing: 85% • Figure: Relation between precision and number of training samples
Summary An efficient Chinese Web Page Classifier • Clear Design • Centroid based, general steps • Novel Features • Preprocessing tricks • Combined feature selection • Subspace & Adaptive factors • Satisfactory Performance • Comparatively high accuracy • Very fast speed • High adaptability
Thank you all! Questions are welcome.