
String Kernels on Slovenian documents



Presentation Transcript


  1. String Kernels on Slovenian documents. Blaž Fortuna, Dunja Mladenić, Marko Grobelnik

  2. Outline of the talk • Bag-of-words and String Kernel • Datasets • Experiments • Conclusions

  3. Representation of text: Vector-space model (bag-of-words) • Most commonly used • Each document is encoded as a feature vector with word frequencies as elements • IDF weighting, vectors normalized • Similarity is the inner product of the normalized vectors (cosine similarity)
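The bag-of-words representation above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the talk: the whitespace tokenizer, the IDF formula log(N/df), and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: term frequency times log(N / df).
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return vectors


def cosine(u, v):
    """Inner product of two normalized sparse vectors, i.e. cosine similarity."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


# Toy documents (hypothetical, for illustration only).
docs = ["computer science portal", "computing portal", "garden portal"]
vecs = tfidf_vectors(docs)
sim_self = cosine(vecs[0], vecs[0])
sim_related = cosine(vecs[0], vecs[1])
```

In this toy example "computer" and "computing" are different features, and "portal" occurs in every document so its IDF weight is zero; the first two documents therefore get similarity 0 even though they are topically related.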

  4. Idea behind String Kernels (Lodhi et al., 2002) Words -> Substrings • Each document is encoded as a feature vector with substring frequencies as elements • More contiguous substrings receive higher weighting (through decay parameter λ)

  5. String Kernel: Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for SVM!
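A minimal sketch of such a dynamic program, following the standard recursion for the gap-weighted subsequence kernel as commonly presented after Lodhi et al. (2002). The function names `ssk` and `ssk_normalized` are chosen for this example, and this O(n·|s|·|t|) version is not necessarily the implementation used in the talk.

```python
import math


def ssk(s, t, n, lam):
    """Unnormalized subsequence kernel K_n(s, t) with decay factor lam."""
    ls, lt = len(s), len(t)
    if min(ls, lt) < n:
        return 0.0
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b])
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Sum over all pairs of positions where the last matched character ends.
    total = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total


def ssk_normalized(s, t, n, lam):
    """Kernel value scaled so that every string has similarity 1 with itself."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0


k_car_cat = ssk("car", "cat", 2, 0.5)
k_norm = ssk_normalized("car", "cat", 2, 0.5)
```

For "car" and "cat" with length 2, the only shared subsequence is "ca", contributing λ² from each string, so the unnormalized value is λ⁴ and the normalized value is 1/(2 + λ²).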

  6. Advantage of String Kernel: No need to stem or lemmatize words. Example: • Computer • Computing • Microcomputer • Computational This should help with highly inflected languages such as Slovenian or Croatian
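The effect can be illustrated with a simplified proxy: contiguous character 5-grams instead of the gap-weighted substrings of the actual string kernel. The helper `char_ngrams` is hypothetical and only meant to show that the four example words share features without any stemming.

```python
def char_ngrams(word, n):
    """Set of contiguous character n-grams of a lower-cased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


words = ["Computer", "Computing", "Microcomputer", "Computational"]
# 5-grams that occur in every one of the four words.
shared = set.intersection(*(char_ngrams(w, 5) for w in words))
print(sorted(shared))
```

Under bag-of-words the four words are four unrelated features; here they overlap on the substrings covering the common stem.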

  7. Disadvantages of string kernel compared to bag-of-words • Slower • Speed-ups available for linear SVMs cannot be used in training • Features are not explicitly visible, so the model is harder to analyse

  8. Datasets (1/2) • Mat'kurja – Slovenian internet directory • www.hr – Croatian internet directory Each web site has a short description and is assigned to a topic from the hierarchy. • Web site: Vrtnar.com; Topic: Science/Biology; Description: "Obnovljen mini vrtnarski portal s kratkimi informacijami." (Renovated mini gardening portal with short information.) • Web site: Elastik; Topic: Arts/Architecture; Description: "Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano." (Multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.)

  9. Datasets (2/2) Unbalanced! [Table grouped into Slovenian and Croatian datasets; contents not preserved in the transcript]

  10. Experimental setting • No pre-processing of documents • Documents for each domain were randomly split into a training part (30%) and a testing part (70%) • Results were averaged over 5 different splits • Break Even Point as success measure • SVM cost parameter C = 1.0 • String kernel decay parameter λ = 0.2 and substring length 5
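A small sketch of two pieces of this setting: a random 30%/70% split and the break-even point. The break-even point is computed here in one common way, as precision among the top-k ranked documents where k is the number of positive examples (at that cut-off precision equals recall). The function names and the toy scores are assumptions; the talk does not specify how the measure was computed.

```python
import random


def split_30_70(items, seed):
    """Randomly split items into a 30% training part and a 70% testing part."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(0.3 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def break_even_point(scores, labels):
    """Precision at the cut-off where precision equals recall.

    labels are 1 for positive and 0 for negative examples; scores are
    classifier outputs where larger means "more likely positive".
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos


train, test = split_30_70(range(10), seed=0)
# Toy scores and labels (hypothetical): two positives, one ranked first, one third.
bep = break_even_point([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

Averaging over 5 splits, as in the setting above, would amount to calling `split_30_70` with five different seeds and taking the mean of the resulting break-even points.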

  11. Experiments

  12. Unbalanced datasets (1/3) Larger difference between string kernel and bag-of-words on unbalanced categories!

  13. Unbalanced datasets (2/3) • We tried SVM with different cost parameters for positive and negative examples (parameter j) • Results for bag-of-words improve • No significant difference for string kernel
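One way to read the parameter j is as a factor that multiplies the cost of margin violations on positive examples in the soft-margin SVM objective. The sketch below evaluates such an objective for a fixed linear model; the function name and the toy data are assumptions, and the exact formulation used in the experiments is not given in the slides.

```python
def weighted_hinge_objective(w, b, X, y, C, j):
    """Soft-margin SVM objective with errors on positives weighted by j.

    w, b : linear model parameters
    X, y : feature vectors and labels in {+1, -1}
    C    : cost parameter; j : extra cost factor for positive examples
    """
    regularizer = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, label in zip(X, y):
        margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slack = max(0.0, 1.0 - margin)
        cost = C * j if label > 0 else C
        loss += cost * slack
    return regularizer + loss


# Toy data (hypothetical): one positive and one negative at the same point.
X = [[0.5], [0.5]]
y = [1, -1]
obj_j1 = weighted_hinge_objective([1.0], 0.0, X, y, C=1.0, j=1.0)
obj_j5 = weighted_hinge_objective([1.0], 0.0, X, y, C=1.0, j=5.0)
```

With j > 1 a misclassified positive example costs more than a misclassified negative one, which pushes the decision boundary toward covering the rare positive class.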

  14. Unbalanced datasets (3/3) • Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0 • Variation of parameter j on bag-of-words

  15. Conclusions • String kernel significantly outperforms bag-of-words on highly inflected natural languages • The difference is larger on categories with a small number of positive examples • SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel

  16. Questions?
