Classification and clustering methods development and implementation for unstructured documents collections, by Osipova Nataly, St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology
Contents • Introduction • Methods description • Information Retrieval System • Experiments
Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.
Definitions • Document • Terms dictionary • Dictionary • Cluster • Word context • Context or document conditional probability distribution • Entropy
Document conditional probability distribution
Document x:
  y        word1   word2   word3   …   wordn
  tf(y)    5       10      6       …   16
  p(y|x)   5/m     10/m    6/m     …   16/m
• y: the words of document x
• tf(y): frequency of y
• p(y|x): conditional probability of y in document x
• m: size of document x
• (5/m, 10/m, 6/m, …, 16/m): the document conditional probability distribution
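The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming a document is already tokenized into a list of words; the function name `document_distribution` and the toy document are hypothetical.

```python
from collections import Counter


def document_distribution(words):
    """Return p(y|x) = tf(y) / m for every word y of document x."""
    tf = Counter(words)   # tf(y): frequency of word y in the document
    m = len(words)        # m: document size in words
    return {y: count / m for y, count in tf.items()}


# Hypothetical toy document with m = 4 words.
doc_x = ["cluster", "document", "cluster", "entropy"]
p = document_distribution(doc_x)
```

The values of `p` sum to 1, as a probability distribution must.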
Word context
Word w: documents x1, x2, …, xk
Document x1:
  y         word1   word2   …   wordn1
  tf(y)     5       10      …   16
  p(y|x1)   5/m1    10/m1   …   16/m1
Document x2:
  y         word1   word3   …   wordn2
  tf(y)     7       12      …   4
  p(y|x2)   7/m2    12/m2   …   4/m2
Document xk:
  y         word1   word4   …   wordnk
  tf(y)     20      9       …   3
  p(y|xk)   20/mk   9/mk    …   3/mk
Context of w:
  y         word1        word2   word3   …   wordnk
  tf(y)     5+7+20=32    10      12      …   3
  p(y|w)    32/m         10/m    12/m    …   3/m
(32/m, 10/m, 12/m, …, 3/m): the context conditional probability distribution
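A small sketch of how such a context distribution might be computed. It assumes that the context of w is formed by pooling the term frequencies of every document that contains w, and that m is the pooled size of those documents; the slides do not spell this out, so both points and all names are assumptions.

```python
from collections import Counter


def context_distribution(word, documents):
    """Pool term frequencies over all documents that contain `word`
    and normalise by the pooled size m, giving p(y|w)."""
    pooled = Counter()
    for doc in documents:
        if word in doc:
            pooled.update(doc)   # add tf(y) of this document
    m = sum(pooled.values())     # assumed: m is the pooled document size
    return {y: count / m for y, count in pooled.items()}


docs = [
    ["w", "a", "a"],   # contains w
    ["w", "a", "b"],   # contains w
    ["b", "c"],        # does not contain w, so it is ignored
]
p_w = context_distribution("w", docs)
```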
Methods • document clustering method • dictionary construction methods • document classification method using a training set Information retrieval methods: • keyword search method • cluster based search method • similar documents search method
Contextual Document Clustering: Documents → Dictionary → Narrow context words → Distance calculation → Clusters
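Entropy
(p1, p2, …, pn): the context conditional probability distribution of y, with p1 + p2 + … + pn = 1
$$H(y) = H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$$
Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.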
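As an illustration of how entropy separates narrow contexts from broad ones, here is a minimal sketch. The base-2 logarithm is an assumption (the slides do not fix the base), and the two example distributions are made up.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution given as a dict or a sequence."""
    values = p.values() if isinstance(p, dict) else p
    return -sum(q * math.log2(q) for q in values if q > 0)


narrow = entropy([0.9, 0.05, 0.05])      # mass concentrated on one word
broad = entropy([1 / 3, 1 / 3, 1 / 3])   # uniform: the maximum for n = 3
```

A context whose probability mass sits on a few words has low entropy (a narrow context); a uniform context reaches the maximum.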
Contextual Document Clustering
$$\max H(y) = H\left(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)$$
Entropy
[Figure: entropy H as a function of α, with α ranging over 0, 0.5, 1.]
Word Context - Document Distance • context conditional probability distribution of y • average conditional probability distribution • conditional probability distribution of document x
Word Context - Document Distance
$$JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$$
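The Jensen-Shannon distance above translates directly into code. This is a sketch assuming distributions are stored as word-to-probability dicts and that entropy uses a base-2 logarithm; the example distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (base 2) of a word -> probability dict."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 * H(p1) - 0.5 * H(p2)."""
    support = set(p1) | set(p2)
    avg = {y: 0.5 * (p1.get(y, 0.0) + p2.get(y, 0.0)) for y in support}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


context = {"cluster": 0.5, "entropy": 0.5}
same = js_divergence(context, context)                # identical distributions
disjoint = js_divergence(context, {"server": 1.0})    # no shared words
```

With a base-2 logarithm the value lies between 0 (identical distributions) and 1 (distributions with no words in common).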
Dictionary construction Why: • large volumes: 60,000 documents, 50,000 words => 15,000 words in a context • importance of narrow context words
Dictionary construction Delete words with: 1. high or low frequency 2. high or low document frequency 3. both 1 and 2
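A sketch of the third option (filtering on both criteria at once). The thresholds, the function name `build_dictionary`, and the toy collection are illustrative assumptions; the slides give no cut-off values.

```python
from collections import Counter


def build_dictionary(documents, tf_range, df_range):
    """Keep words whose total frequency and document frequency both
    fall inside the given inclusive (low, high) ranges."""
    tf = Counter(w for doc in documents for w in doc)        # total frequency
    df = Counter(w for doc in documents for w in set(doc))   # document frequency
    return {
        w for w in tf
        if tf_range[0] <= tf[w] <= tf_range[1]
        and df_range[0] <= df[w] <= df_range[1]
    }


docs = [["the", "cluster", "the"], ["the", "entropy", "cluster"], ["the", "rare"]]
# Hypothetical thresholds: drop the very frequent "the" and the words seen once.
dictionary = build_dictionary(docs, tf_range=(2, 3), df_range=(2, 2))
```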
Retrieval algorithms • keyword search method • cluster based search method • search by example method
Keyword search method
Document 1: word 1, word 2, word 3, …, word n1
Document 2: word 10, word 25, word 30, …, word n2
Document 3: word 15, word 2, word 32, …, word n3
Document 4: word 11, word 21, word 3, …, word n4
Request: word 2
Result set: document 1, document 3
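The example on this slide can be reproduced with a simple inverted index. This is only a sketch of keyword search in general, not the system's actual implementation (which runs on a database server); the function names are hypothetical and the elided words of each document are left out.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def keyword_search(index, word):
    """Return the ids of all documents containing `word`, sorted by id."""
    return sorted(index.get(word, set()))


documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
result = keyword_search(build_index(documents), "word 2")
```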
Cluster based search method
Cluster context words (each cluster holds a set of documents):
Cluster 1: word 1, word 2, …, word n1
Cluster 2: word 12, word 26, …, word n2
Cluster 3: word 1, word 23, …, word n3
Request: word 1
Result set: Cluster 1, Cluster 3
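The same example as a short sketch: a request word is matched against the context words of each cluster rather than against individual documents. The data structure and the name `cluster_search` are assumptions made for illustration.

```python
def cluster_search(cluster_context_words, word):
    """Return the names of clusters whose context words include `word`."""
    return [name for name, words in cluster_context_words.items() if word in words]


clusters = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
result = cluster_search(clusters, "word 1")
```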
Similar documents search
[Figure: minimal spanning tree of a cluster, labelled with the cluster name, over document 1 to document 7.]
Request: document 3
Result set: document 6, document 7
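One plausible reading of this slide is that the documents of a cluster are linked by a minimal spanning tree over their pairwise distances, and that the documents adjacent to the requested one in the tree are returned. The sketch below follows that reading using Prim's algorithm; the choice of algorithm, the neighbour rule, and the toy distances are all assumptions.

```python
def minimum_spanning_tree(nodes, distance):
    """Prim's algorithm; returns the tree as an adjacency dict."""
    nodes = list(nodes)
    tree = {n: set() for n in nodes}
    in_tree = {nodes[0]}
    while len(in_tree) < len(nodes):
        # Cheapest edge that joins a tree node to a node outside the tree.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: distance(*edge),
        )
        tree[a].add(b)
        tree[b].add(a)
        in_tree.add(b)
    return tree


def similar_documents(tree, doc):
    """Documents directly connected to `doc` in the spanning tree."""
    return sorted(tree[doc])


# Hypothetical documents placed on a line; distance is the gap between them.
positions = {"d1": 0.0, "d2": 1.0, "d3": 3.0, "d4": 3.5}


def dist(u, v):
    return abs(positions[u] - positions[v])


tree = minimum_spanning_tree(positions, dist)
result = similar_documents(tree, "d3")
```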
Document classification: method 1 • Training set • Test documents • Clusters • List of topics • Topic contexts • Distances between topic contexts and cluster contexts • Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
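The final step of method 1, assigning each cluster the topic with the closest context, might look like the sketch below. It assumes the Jensen-Shannon distance defined earlier is also the distance used between topic and cluster contexts, which the slide does not state; the contexts are invented for illustration.

```python
import math


def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def js_divergence(p1, p2):
    support = set(p1) | set(p2)
    avg = {y: 0.5 * (p1.get(y, 0.0) + p2.get(y, 0.0)) for y in support}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


def classify_clusters(cluster_contexts, topic_contexts):
    """Assign each cluster the topic whose context is closest to it."""
    return {
        cluster: min(topic_contexts,
                     key=lambda t: js_divergence(ctx, topic_contexts[t]))
        for cluster, ctx in cluster_contexts.items()
    }


# Hypothetical topic contexts (from the training set) and cluster contexts.
topics = {
    "topic 3": {"server": 0.7, "client": 0.3},
    "topic 10": {"entropy": 0.6, "cluster": 0.4},
}
clusters = {
    "cluster 1": {"entropy": 0.5, "cluster": 0.5},
    "cluster 2": {"server": 0.9, "client": 0.1},
}
result = classify_clusters(clusters, topics)
```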
Document classification: method 2 • Training set • Test documents • All documents set • Topics list • Clusters • Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
Information Retrieval System • Architecture • Features • Use
Information Retrieval System architecture: database server and client.
IRS architecture • Database • Database server: MS SQL Server 2000 • Local Area Network • “Thick” client: C#
IRS architecture DBMS MS SQL Server 2000: • High-performance • Scalable • Secure • Handles huge volumes of data • T-SQL • Stored procedures
IRS features In the IRS the following problems are solved: • document clustering • keyword search method • cluster based search method • similar documents search method • document classification with the use of training set
DB structure The database of the IRS consists of the following tables: • documents • all words dictionary • dictionary • document-word (relations between documents and words) • words contexts • words with narrow contexts • clusters • intermediate tables used to build the main tables and to implement retrieval
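To make the table list more concrete, here is a small sketch of three of these tables and a keyword search expressed as a join. It uses Python's built-in `sqlite3` as a stand-in for MS SQL Server 2000, and every column name is hypothetical; the slides name the tables only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents     (doc_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dictionary    (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE document_word (doc_id  INTEGER REFERENCES documents(doc_id),
                            word_id INTEGER REFERENCES dictionary(word_id),
                            tf      INTEGER NOT NULL,
                            PRIMARY KEY (doc_id, word_id));
""")
conn.execute("INSERT INTO documents VALUES (1, 'doc one')")
conn.execute("INSERT INTO dictionary VALUES (1, 'cluster')")
conn.execute("INSERT INTO document_word VALUES (1, 1, 5)")

# Keyword search expressed as a join over the document-word relation.
rows = conn.execute("""
    SELECT d.title FROM documents d
    JOIN document_word dw ON dw.doc_id = d.doc_id
    JOIN dictionary w ON w.word_id = dw.word_id
    WHERE w.word = ?
""", ("cluster",)).fetchall()
```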
Algorithms implementation
[Diagram linking the data (Documents, All words dictionary, Dictionary, table “document-word”, Words contexts, Words with narrow contexts, Clusters, Centroid) to the algorithms that use it: Keyword search, Cluster based search, Similar documents search.]
Similar documents search
[Figure: a cluster drawn as a weighted graph over document1 to document5; edge labels, in extraction order: 0.26967, 0.211, 0.57231, 0.1011, 0.16285, 0.7231, 0.8731, 0.23851, 0.98154.]
Minimal Spanning Tree
[Figure: minimal spanning tree of the cluster, labelled with the cluster name, over document 1 to document 5.]
Similar documents search: Clusters table → Distances table → Tree table
Experiments The goals of the tests were: • testing algorithm accuracy • comparing different classification methods • evaluating algorithm efficiency
Experiments • 60,000 documents • 100 topics • Training set volume = 5% of the collection size
Result analysis • Russian Information Retrieval Evaluation Seminar • The macro-averaged measures recall, precision, and F-measure were calculated.
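For reference, macro-averaging computes each measure per category and then takes the unweighted mean over categories. The sketch below assumes one label per document and averages the per-category F-measures (some evaluations instead compute F from the macro-averaged precision and recall); the labels are made up.

```python
def macro_scores(true_labels, predicted_labels):
    """Macro-averaged precision, recall and F-measure over all categories."""
    pairs = list(zip(true_labels, predicted_labels))
    categories = set(true_labels) | set(predicted_labels)
    precisions, recalls, f_measures = [], [], []
    for c in categories:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_measures.append(2 * precision * recall / total if total else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    n = len(categories)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n


truth = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
precision, recall, f_measure = macro_scores(truth, predicted)
```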
Result analysis A list of some of the topics into which the test documents were classified.
Result analysis Recall results for every category. The best result for each category is shown in bold. All results are given in percent.