Hierarchical Classification: Comparison with Flat Method Yongwook Yoon Jun 12, 2003 NLP Lab., POSTECH
Contents • Hierarchical vs. Flat Classifier • System Overview • Experiment & Results • Comparison with Flat Methods • Future Work
Hierarchical Classification • A natural method for very large groups of classes • Top-down classification • Promotes human readability • Better performance than the flat method • Much better recall and precision • Flexible strategy: applicable at different levels of the hierarchy
Flat vs. Hierarchical classification • [Figure: a flat classifier assigns documents D1 … Dn directly to one of the classes (e.g., Business, Grain, Oil); a hierarchical classifier routes each document from the Root node down through intermediate classes to a leaf class]
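As a rough illustration of the top-down routing the figure describes, a minimal sketch in Python (the Node class and the predict() interface are illustrative assumptions, not part of the actual system):

# Sketch of top-down hierarchical classification: each internal node
# holds a classifier over its children, and a document is routed
# downward until it reaches a leaf class.
class Node:
    def __init__(self, name, classifier=None, children=None):
        self.name = name
        self.classifier = classifier    # trained for internal nodes only
        self.children = children or {}  # child name -> Node

def classify_topdown(node, doc):
    while node.children:                      # stop at a leaf
        child = node.classifier.predict(doc)  # assumed predict() interface
        node = node.children[child]
    return node.name                          # final leaf class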
System Overview (1) • Currently a baseline system • BOW (Bag of Words) approach • No syntactic or semantic features yet • Naïve Bayesian classifier • Several feature selection methods • Pruning the vocabulary by document count or word count • Information gain • Hierarchical classification added
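The information-gain pruning mentioned above can be sketched as follows (a minimal version assuming binary word occurrence; the function names are illustrative):

import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, word):
    # docs are sets of words; IG(w) = H(C) - [P(w) H(C|w) + P(~w) H(C|~w)]
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    p_w = len(with_w) / len(labels)
    h_cond = (p_w * entropy(list(Counter(with_w).values()))
              + (1 - p_w) * entropy(list(Counter(without_w).values())))
    return entropy(list(Counter(labels).values())) - h_cond

def prune_vocabulary(docs, labels, vocabulary, k):
    # keep the k highest-scoring words
    return sorted(vocabulary, key=lambda w: information_gain(docs, labels, w),
                  reverse=True)[:k]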
System Overview (2) • Many extensions possible • The BOW library (by McCallum) supplies • Support Vector Machine • Batch only, not on-line • Maximum Entropy • K-nearest neighbor • Pr-TFIDF • BOW does not supply • On-line learning methods • Syntactic or semantic classification features
System Overview (3) • Functions implemented for hierarchical classification • Constructing the hierarchy prototype as a directory structure • Dividing documents into the different levels of the hierarchy • Training each classifier in the hierarchy • Testing documents automatically from the root down to the leaf classifiers • Logging and evaluation
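A minimal sketch of how the directory-structure hierarchy might drive training (train_node and the directory layout are assumptions for illustration, not the actual implementation):

import os

def train_hierarchy(root_dir, train_node):
    # Walk the hierarchy laid out as directories; at every internal
    # node (a directory with subdirectories) train one classifier that
    # discriminates among its immediate children.
    classifiers = {}
    for dirpath, subdirs, _files in os.walk(root_dir):
        if subdirs:
            classifiers[dirpath] = train_node(dirpath, subdirs)
    return classifiers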
[Figure: training pipeline, e.g., Business → Grain, Oil. Construct the hierarchical structure, make up the logical hierarchy, divide the documents into the hierarchy, then train each classifier to produce its training parameters]
Classifying documents • [Figure: a document is input at the root node (Level_0); from the intermediate result it is routed further down (e.g., Grain, Steel, Oil) until a final classification into one of Class_1 … Class_N is made]
Experiment • Data: the 20 Newsgroups collection • 20 classes • 1,000 documents per class • Intrinsic hierarchy • e.g., rec.sport.hockey, talk.politics.mideast • Two major trials • Flat vs. hierarchical • Evaluation and comparison
Experiment detail • Experiments under the same conditions as Sigir-2001 (Bekkerman et al.) • News headers excluded • except the Subject line, which is kept • All characters lowercased • Multi-labeled classes allowed • 4-fold cross-validation
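The preprocessing described above might look like this (a sketch assuming the usual news article format, where a blank line separates the header block from the body):

def preprocess(article):
    # Drop all header lines except Subject, then lowercase everything.
    header, _, body = article.partition("\n\n")
    subject = ""
    for line in header.splitlines():
        if line.lower().startswith("subject:"):
            subject = line[len("subject:"):].strip()
            break
    return (subject + "\n" + body).lower()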
4-fold cross validation • Applied identically to the flat and hierarchical cases • The 20,000 documents are randomly • divided into 4 sets of 5,000 documents each • 3 of the 4 sets are combined for training • and the remaining 1 is used for testing • 4 runs are performed • The final evaluation is the average over the 4 runs • Same evaluation method as Sigir-2001
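A sketch of the fold construction and averaging (train and test are placeholders for the actual training and evaluation routines):

import random

def four_fold_cv(docs, labels, train, test, seed=0):
    # Shuffle once, split into 4 equal folds; each fold serves once as
    # the test set while the other 3 are combined for training, and the
    # final score is the average over the 4 runs.
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::4] for i in range(4)]
    scores = []
    for i in range(4):
        train_idx = [j for k, f in enumerate(folds) if k != i for j in f]
        model = train([docs[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        scores.append(test(model, [docs[j] for j in folds[i]],
                           [labels[j] for j in folds[i]]))
    return sum(scores) / 4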
[Figure: class hierarchy — root → alt (atheism), comp (graphics; os → ms-windows; sys → ibm hardware, mac hardware; windows → x), misc (forsale), rec (autos, motorcycles, sport → baseball, hockey), sci (crypt, electronics, med, space), soc (religion → christian), talk (politics → guns, mideast, misc; religion → misc)] • A total of 8 classifiers are needed
Result – Flat method
The result of 4-fold cross-validation: ./test_out/cv3_info900.stats
Correct: 16573 out of 19996 (82.88% accuracy)
(columns 9–17 not shown in the source)
classname                     0    1    2    3    4    5    6    7    8   18   19  :total
 0 alt.atheism              825    .    .    .    1    .    .    1   23   11   82  :1000  82.50%
 1 comp.graphics              .  866   33   17   14   42    4    2    1    .    1  :1000  86.60%
 2 comp.os.ms-windows.misc    .   25  873   34    2   65    1    .    .    .    .  :1000  87.30%
 3 comp.sys.ibm.pc.hardware   1   60   97  662  126   12    8    1    .    .    .  :1000  66.20%
 4 comp.sys.mac.hardware      .   33   13   25  902    5    4    1    .    .    .  :1000  90.20%
 5 comp.windows.x             .  122   65    8    1  783    1    2    1    .    1  :1000  78.30%
 6 misc.forsale               .    8    5   25   17    6  801   31   61    4    1  :1000  80.10%
 7 rec.autos                  .    1    .    .    2    2   11  846   98    2    .  :1000  84.60%
 8 rec.motorcycles            .    2    .    .    .    .    2   17  971    .    .  :1000  97.10%
 9 rec.sport.baseball         .    .    .    .    .    1    1   11   55    1    .  :1000  86.60%
10 rec.sport.hockey           2    .    .    .    .    1    2    4   11    1    .  :1000  96.70%
11 sci.crypt                  .    8    6    .    .    8    .    .    2    1    .  : 999  93.99%
12 sci.electronics            .   30    6   13   31    3   10   35   12    .    .  :1000  74.30%
13 sci.med                    .   16    .    .    2    4    1    5    8    3    .  :1000  93.00%
14 sci.space                  2   13    3    .    3    1    .    3    2    3    1  :1000  95.70%
15 soc.religion.christian     4    .    .    .    .    .    .    .    .    1    2  : 997  99.10%
16 talk.politics.guns         1    .    1    .    .    .    2    3   12   68   22  :1000  87.00%
17 talk.politics.mideast      6    2    .    .    .    7    3    .   10  149    2  :1000  78.50%
18 talk.politics.misc         5    1    .    .    1    .    .    6    6  673   38  :1000  67.30%
19 talk.religion.misc       330    3    .    .    .    .    .    4    2  107  326  :1000  32.60%
Evaluation measure • From the three entries αi, βi, γi of the confusion matrix we compute • Precision Pi = αi / (αi + βi) • Recall Ri = αi / (αi + γi) • Micro-averaged BEP = (P + R) / 2 • Overall recall and precision are the same • αi: number of documents of class Ci correctly classified as Ci • βi: number of documents not originally in Ci but classified as Ci • γi: number of documents originally in Ci but classified as some other class Cj
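These measures can be computed from the confusion matrix as sketched below. Note that with single-label assignments, Σ(αi + βi) and Σ(αi + γi) both equal the total number of documents, which is one explanation for why the overall (micro-averaged) precision and recall coincide:

def micro_averaged_bep(confusion):
    # confusion[i][j] = number of documents of true class i classified as j
    n = len(confusion)
    alpha = [confusion[i][i] for i in range(n)]
    beta = [sum(confusion[j][i] for j in range(n)) - confusion[i][i]
            for i in range(n)]
    gamma = [sum(confusion[i]) - confusion[i][i] for i in range(n)]
    precision = sum(alpha) / (sum(alpha) + sum(beta))
    recall = sum(alpha) / (sum(alpha) + sum(gamma))
    return (precision + recall) / 2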
Result – Hierarchical (1)
# level_0
The result of 4-fold cross-validation: ./test_out/cv3_info20000.stats
Correct: 18440 out of 19996 (92.22% accuracy)
classname                    0     1    2     3     4    5     6  :total
 0 alt.atheism             962     1    .     3    12    4    18  :1000  96.20%
 1 comp.                     1  4807   32    13   147    .     .  :5000  96.14%
 2 misc.forsale              .   117  771    62    49    .     1  :1000  77.10%
 3 rec.                      1    18   14  3916    43    .     8  :4000  97.90%
 4 sci.                      6   187   23    39  3730    .    14  :3999  93.27%
 5 soc.religion.christian    4     1    .     .     .  978    14  : 997  98.09%
 6 talk.                   409    17    2    60   160   76  3276  :4000  81.90%

# level_1/comp.
The result of 4-fold cross-validation: ./test_out/cv3_info900.stats
Correct: 4276 out of 4807 (88.95% accuracy)
classname                     0    1     2    3  :total
 0 comp.graphics            802   33    35   67  : 937  85.59%
 1 comp.os.ms-windows.misc   13  848    55   75  : 991  85.57%
 2 comp.sys.                 42   78  1743   44  :1907  91.40%
 3 comp.windows.x            47   31    11  883  : 972  90.84%
Result – Hierarchical (2)
# level_1/sci.
The result of 4-fold cross-validation: ./test_out/cv3_info900.stats
Correct: 3596 out of 3730 (96.41% accuracy)
classname             0    1    2    3  :total
 0 sci.crypt        969    5    1    2  : 977  99.18%
 1 sci.electronics   20  731    6   43  : 800  91.38%
 2 sci.med            8   14  931   14  : 967  96.28%
 3 sci.space          6   10    5  965  : 986  97.87%

# level_1/talk.
The result of 4-fold cross-validation: ./test_out/cv3_info900.stats
Correct: 3014 out of 3276 (92.00% accuracy)
classname                0     1  :total
 0 talk.politics.     2680   113  :2793  95.95%
 1 talk.religion.misc  149   334  : 483  69.15%

# level_2/comp.sys.
The result of 4-fold cross-validation: ./test_out/cv3_info10.stats
Correct: 1709 out of 1743 (98.05% accuracy)
classname                     0    1  :total
 0 comp.sys.ibm.pc.hardware  815   22  : 837  97.37%
 1 comp.sys.mac.hardware      12  894  : 906  98.68%
Evaluation measure in Hierarchical • Recall • Same as in the flat case • The correct classification count at the leaf node • Precision • Must take the incorrect counts of the upper-level classifications into account • Divide each upper-level incorrect count by the number of classes at the lower level • Finally, the computation at the leaf node takes all of the averaged incorrect counts of the upper levels into account
Evaluation measure in Hierarchical – Cont'd
(Level_0 and level_1/comp. confusion matrices repeated from the previous slides for reference.)
• Averaged upper-level error for comp.: the off-diagonal entries of the comp. column at level 0 (1, 117, 18, 187, 1, 17), divided by the 5 classes under comp.:
(1 + 117 + 18 + 187 + 1 + 17) / 5 = 68.2
• Precision of comp.graphics: its diagonal entry over everything classified as comp.graphics at level 1 (off-diagonal entries 13, 42, 47) plus the averaged upper-level error:
802 / (802 + 13 + 42 + 47 + 68.2) = 82.5%
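The worked example can be reproduced in a few lines (all numbers taken from the tables above):

# Documents wrongly routed into comp. at level 0 (off-diagonal entries
# of the comp. column), averaged over the 5 leaf classes under comp.
avg_upper_error = (1 + 117 + 18 + 187 + 1 + 17) / 5   # 68.2

# Leaf-level precision for comp.graphics: its diagonal entry over all
# documents classified as comp.graphics, plus the averaged upper error.
precision = 802 / (802 + 13 + 42 + 47 + avg_upper_error)
print(round(precision * 100, 1))                      # 82.5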
Analysis • Hierarchical shows far better performance than the flat case • These are results with only Naïve Bayesian and word features • Other methods such as SVM and Maximum Entropy should be applied • In the hierarchical setting, a different classifier and a different feature selection method can be used at each level • More flexible application is possible • Much room for performance tuning • In the experiments, a different number of features was used at each level • Level_0: 20,000; Level_1: 900; Level_2: 10–900 words • Adding or deleting a class has only a local effect • Only the relevant level needs retraining
Future Work • Apply other classifiers in place of Naïve Bayesian • Support Vector Machine, Maximum Entropy, K-nearest neighbor, Pr-TFIDF, etc. • Hierarchy + adaptive learning • Find adaptive methods suited to the hierarchy • Apply other feature selection methods
Discussion • Is it coincidence or necessity that overall precision and recall coincide? • Is it because of micro-averaging, or a consequence of every class having the same number of documents? • At each stage of hierarchical classification, all documents should be passed down to the next level, not just the ones classified correctly at that stage (the realistic, reasonable setting) • In reality, the correct answers are not known at the intermediate stages • Likewise, when computing hierarchical precision, rather than averaging the upper-level errors over the lower level, wouldn't it be more correct to pass everything down (errors included) and compute precision at the leaf nodes against the answers? • Needs comparison with the current computation method • On Future Work • The research direction after the baseline must be settled • Covering class addition and deletion with incremental learning • a hard problem! • Simply hierarchical + adaptive learning