
Comparing TR-Classifier and kNN by using Reduced Sizes of Vocabularies

Explore topic identification techniques in text data with varying vocab sizes. Learn the TR-Classifier and kNN methods, their applications, speech recognition, language models, and vocabulary building. Discover how triggers impact topic identification.


Presentation Transcript


  1. Comparing TR-Classifier and kNN by using Reduced Sizes of Vocabularies Mourad Abbas Citala 2009

  2. Topic Identification: Definition • What does topic identification mean? It aims to assign a topic label to a stream of textual data.

  3. T.I applications • Document categorization, • Machine translation, • Selecting documents for web search engines, • Speech recognition systems, etc.

  4. Speech Recognition • According to Bayes' formula, P(W|X) is defined as: P(W|X) = P(X|W)·P(W) / P(X) • P(X|W) is the probability of observing the sequence of vectors X when the sequence of words W is emitted. It is given by an acoustic model. • P(W) is the probability of the word sequence W in the language used. It is given by a language model.

  5. Description of the recognition process • Diagram: Speech → Parametrization, X = {x1,…,xT} → Search: arg max over W = {w1,…,wT} of P(X|W)·P(W), using the Acoustic Model P(X|W) and the Language Model P(W) → Sequence of recognized words.
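
The search step of the recognition process follows from Bayes' formula in a few lines. The derivation below is a standard one written out as a sketch; the symbol \hat{W} for the recognized word sequence is introduced here for illustration and does not appear on the slides.

```latex
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} \\
        &= \arg\max_{W} P(X \mid W)\, P(W)
\end{aligned}
```

The last step holds because P(X) does not depend on W, so dropping it does not change which word sequence maximizes the expression.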

  6. Speech Recognition • Statistical language models are essential for large-vocabulary speech recognition. They make it possible to estimate, from a training corpus, the a priori probability P(W) of emitting a sequence of words W. • Nevertheless, in many cases the language model is unable to find the correct choice. • That is why language model adaptation is needed.

  7. Language model adaptation • One language model adaptation method consists in dividing the training documents into classes. • Each class represents a subset of the language that groups the documents sharing the same characteristics. In our case these subsets are known as topics. • Diagram: Corpus → Culture, Religion, Politics.

  8. This makes it possible to build, from these topics, a language model able to describe the characteristics of each topic. The aim is then to: - find the topic of the recognized spoken sentences; - use the model derived from the detected topic.

  9. Building the vocabulary • The vocabulary should be representative of the corpus. • The vocabulary is built from the training corpus. • A document is represented using the vocabulary. If a vocabulary word does not occur in the document, its value is zero. • Several methods can be used to construct the vocabulary: - Term Frequency, - Document Frequency, - Mutual Information, - Transition Point technique. • We used Term Frequency because it is simple and leads to good results. • Words whose frequency does not exceed 3 are discarded. • Non-content words are also discarded, since they carry no information about the sense of the text. • Example: وأن هذه الاجتماعات لن تمنع من عقد المجلس الوطني ("and these meetings will not prevent the convening of the National Council", 9 words) → اجتماعات تمنع عقد مجلس وطني (5 words).
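
The vocabulary-building rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_vocabulary`, the whitespace tokenization and the tiny English `STOP_WORDS` set are hypothetical stand-ins, and the slides do not say how the Arabic text was tokenized or which non-content word list was used.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; a real system
# would use a full list of Arabic non-content words.
STOP_WORDS = {"the", "of", "and", "to", "a", "in"}


def build_vocabulary(documents, min_count=3, stop_words=STOP_WORDS, max_size=None):
    """Return vocabulary words sorted by decreasing term frequency.

    Words whose corpus frequency does not exceed `min_count` are
    discarded, as are the non-content words in `stop_words`.
    """
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()  # assumption: whitespace tokenization
        if word not in stop_words
    )
    kept = [(word, count) for word, count in counts.items() if count > min_count]
    # Sort by frequency (highest first), then alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in kept]
    return words[:max_size] if max_size else words


corpus = ["the council meeting " * 4, "the council of culture"]
vocabulary = build_vocabulary(corpus)
```

The optional `max_size` argument is one way to obtain the reduced vocabulary sizes that the comparison is about: keep only the most frequent words.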

  10. One Arabic word is equivalent to four words in the following example.

  11. Fig. 3. Illustrative example: bag-of-words method

  12. Role of the vocabulary in representation • Each document d = {w1, w2, …, wn} is represented by a vector V = {f1, f2, …, fn} of real values, with fn = TF(wn, d)·IDF(wn). • The vector has one component per vocabulary word (Word 1, Word 2, Word 3, …, Word n), so its length is |V|, the size of the vocabulary. • We put 0 when the word cannot be found in the document.
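
A minimal sketch of this representation is given below. The slide only states fn = TF(wn, d)·IDF(wn); the raw-count TF and the variant IDF(w) = log(N / df(w)) used here are assumptions, and the names `idf_table` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def idf_table(vocabulary, documents):
    """Compute IDF(w) = log(N / df(w)) for every vocabulary word.

    This IDF variant is an assumption; the slides do not give the formula.
    """
    n_docs = len(documents)
    doc_sets = [set(doc.split()) for doc in documents]
    table = {}
    for word in vocabulary:
        df = sum(1 for words in doc_sets if word in words)
        table[word] = math.log(n_docs / df) if df else 0.0
    return table


def tfidf_vector(document, vocabulary, idf):
    """Represent a document as a vector of length |V|.

    A vocabulary word absent from the document has a count of 0,
    so its component is 0.
    """
    counts = Counter(document.split())
    return [counts[word] * idf[word] for word in vocabulary]


docs = ["a b b", "a c"]
vocab = ["a", "b", "c"]
idf = idf_table(vocab, docs)
vector = tfidf_vector(docs[0], vocab, idf)
```

Note that a word occurring in every document (here "a") gets an IDF of 0 under this variant, so it contributes nothing to the vector.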

  13. kNN • To identify the topic of an unlabelled document d, kNN ranks the neighbors of d among the training document vectors and uses the topics of the k nearest neighbors to predict the topic of d.
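
The kNN step can be sketched as follows. The slide does not name the similarity measure or the voting rule, so cosine similarity and a simple majority vote are assumptions here, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(test_vec, train_vecs, train_topics, k=3):
    """Predict a topic from the k training vectors most similar to test_vec."""
    ranked = sorted(
        zip(train_vecs, train_topics),
        key=lambda pair: cosine(test_vec, pair[0]),
        reverse=True,
    )
    # Assumption: unweighted majority vote among the k nearest neighbors.
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_topics = ["Culture", "Culture", "Politics", "Politics"]
predicted = knn_predict([1.0, 0.2], train_vecs, train_topics, k=3)
```

In practice the vectors would be the TF·IDF document vectors, so the vocabulary size directly sets the dimensionality that kNN works in.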

  14. TR-Classifier • The triggers of a word wk are the set of words that are highly correlated with it. • The main idea of the TR-Classifier is to compute the average mutual information (AMI) of each pair of words belonging to the vocabulary Vi. • The pairs of words, or "triggers", considered important for a topic identification task are those with the highest AMI values. • Each topic Ti is then assigned a number M of selected triggers, computed from the training corpus of that topic.

  15. TR-Classifier • The AMI of two words a and b is given by: [equation not preserved in the transcript] • AMI measures the association between words using the following values: - the number of documents in which a and b are found together; - the number of documents in which b is found without a; - the number of documents that contain the word b; - the number of documents that do not contain the word b; - the number of documents in which neither a nor b is found.
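
Since the slide's equation is not available here, the sketch below uses the standard definition of average mutual information between the presence or absence of two words: the sum, over the four combinations of a present/absent and b present/absent, of P(x, y)·log(P(x, y) / (P(x)·P(y))). Whether the slide used exactly this form is an assumption. Probabilities are estimated from document counts, and the function name `ami` and its parameters are hypothetical.

```python
import math


def ami(n_ab, n_a, n_b, n_docs):
    """Average mutual information of words a and b from document counts.

    n_ab   : documents containing both a and b
    n_a    : documents containing a
    n_b    : documents containing b
    n_docs : total number of documents
    """
    # Each cell: (joint count, marginal count for a-side, marginal count for b-side)
    cells = [
        (n_ab, n_a, n_b),                                            # a and b
        (n_a - n_ab, n_a, n_docs - n_b),                             # a without b
        (n_b - n_ab, n_docs - n_a, n_b),                             # b without a
        (n_docs - n_a - n_b + n_ab, n_docs - n_a, n_docs - n_b),     # neither
    ]
    total = 0.0
    for joint, margin_x, margin_y in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 taken as 0)
            p_xy = joint / n_docs
            p_x = margin_x / n_docs
            p_y = margin_y / n_docs
            total += p_xy * math.log(p_xy / (p_x * p_y))
    return total


independent = ami(n_ab=1, n_a=2, n_b=2, n_docs=4)
always_together = ami(n_ab=2, n_a=2, n_b=2, n_docs=4)
```

Two words that occur independently score 0, while two words that always occur together score highest, which is why the pairs with the largest AMI are natural trigger candidates.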

  16. TR-Classifier • Identifying topics with the TR method consists in: • Finding the corresponding triggers for each word wk ∈ Vi, where Vi is the vocabulary of topic Ti. • Selecting the best M triggers that characterize the topic Ti. • In the test step, extracting, for each word wk of the test document, its corresponding triggers. • Computing the Qi values using the TR-distance given by the equation: [equation not preserved in the transcript]

  17. Here i stands for the i-th topic. The denominator normalizes the AMI computation. […] are triggers that are included in the test document d and characterize the topic Ti. The test document is labelled with the topic Ti that gives arg max Qi. The TR-Classifier uses topic vocabularies composed of words ranked by frequency, from highest to lowest.
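
The decision rule can be illustrated with a rough sketch. The exact TR-distance equation is not preserved in this transcript, so the score below is an assumption and not the author's formula: it sums the AMI of a topic's selected trigger pairs that both occur in the test document and normalizes by the total AMI of that topic's M triggers. The names `tr_score` and `identify_topic`, and the example trigger values, are hypothetical.

```python
def tr_score(doc_words, topic_triggers):
    """Assumed stand-in for Qi (NOT the slide's exact TR-distance).

    topic_triggers maps (word, trigger) pairs to their AMI values,
    i.e. the M triggers selected for one topic.
    """
    present = set(doc_words)
    total = sum(topic_triggers.values())
    if total == 0:
        return 0.0
    matched = sum(
        value
        for (word, trigger), value in topic_triggers.items()
        if word in present and trigger in present
    )
    # Assumed normalization: share of the topic's trigger AMI found in the document.
    return matched / total


def identify_topic(doc_words, triggers_by_topic):
    """Label the document with the topic that maximizes the score (arg max Qi)."""
    return max(
        triggers_by_topic,
        key=lambda topic: tr_score(doc_words, triggers_by_topic[topic]),
    )


# Made-up trigger tables for illustration only.
triggers_by_topic = {
    "Culture": {("film", "festival"): 0.4, ("poet", "poem"): 0.6},
    "Politics": {("minister", "council"): 0.5, ("vote", "election"): 0.5},
}
doc = "the poet read a poem at the festival".split()
label = identify_topic(doc, triggers_by_topic)
```

Only the arg max step is taken directly from the slide; any other normalization that keeps the scores comparable across topics would fit the same skeleton.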

  18. The ten best triggers that characterize the topic Culture

  19. Evaluation of the methods • For a topic Tn, the method is evaluated using the following measures: • Recall R = number of documents correctly labelled Tn / total number of documents belonging to the topic Tn. • Precision P = number of documents correctly labelled Tn / number of documents labelled Tn by the method. • R and P are combined into F1, a single measure of how well documents are labelled: F1 = 2RP / (R + P).
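
These three measures translate directly into code. The sketch below follows the definitions on the slide; the function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions, and the labels are toy data.

```python
def precision_recall_f1(true_topics, predicted_topics, topic):
    """Compute P, R and F1 for one topic from parallel label lists."""
    pairs = list(zip(true_topics, predicted_topics))
    correct = sum(1 for t, p in pairs if t == topic and p == topic)
    belonging = sum(1 for t, _ in pairs if t == topic)   # documents of the topic
    labelled = sum(1 for _, p in pairs if p == topic)    # documents labelled with it
    recall = correct / belonging if belonging else 0.0
    precision = correct / labelled if labelled else 0.0
    denominator = recall + precision
    f1 = 2 * recall * precision / denominator if denominator else 0.0
    return precision, recall, f1


true_topics = ["Culture", "Culture", "Culture", "Politics"]
predicted_topics = ["Culture", "Culture", "Politics", "Culture"]
p, r, f1 = precision_recall_f1(true_topics, predicted_topics, "Culture")
```

In this toy case two of the three Culture documents are found and two of the three documents labelled Culture are correct, so P, R and F1 all equal 2/3.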

  20. Experiments and results

  21. Corpus gathering • The WinHTTrack software made it possible to collect many web pages. We only have to enter the address of the source.

  22. Corpus source • The source of the corpus is the Arabic newspaper Alwatan (Sultanate of Oman).

  23. Size of the corpus • ELWATAN newspaper

  24. TR-Classifier Performances

  25. Recall values versus number of triggers, using a vocabulary of size 300 • Maximal value of R = 89.67 % with 250 triggers.

  26. kNN Performances

  27. TR versus kNN

  28. Conclusion • The experiments were carried out on an Arabic corpus. • The strong point of the TR-Classifier is its ability to achieve better performance than kNN while using reduced topic vocabulary sizes. • The reason is the significance of the information present in the longer-distance history that the TR-Classifier uses. • Despite the small corpus used (800 words), the performance of kNN is relatively acceptable (~76 % in terms of Recall). • As future work, we aim to improve TR-Classifier performance by using larger vocabularies, even though it already outperforms kNN by 14 %.
