String Kernels on Slovenian documents

Blaž Fortuna, Dunja Mladenić, Marko Grobelnik

Outline of the talk • Bag-of-words and String Kernel • Datasets • Experiments • Conclusions

Representation of text Vector-space model (bag-of-words) • Most commonly used • Each document is encoded as a feature vector with word frequencies as elements • IDF weighting, normalized • Similarity is inner-product (cosine similarity)
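The bag-of-words encoding above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming whitespace tokenization and the common log(N/df) form of IDF; the function names `tfidf_vectors` and `cosine` are made up for this sketch and are not from the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # Normalize to unit length so the inner product is the cosine similarity.
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return vectors

def cosine(u, v):
    """Inner product of two sparse, already normalized vectors."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = ["computer science", "computing science", "garden flowers"]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # only "science" matches
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared words
```

Note that "computer" and "computing" contribute nothing to the similarity here, because the representation treats them as unrelated features.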

Idea behind String Kernels (Lodhi et al., 2002) Words -> Substrings • Each document is encoded as a feature vector with substring frequencies as elements • More contiguous substrings receive higher weighting (through decay parameter λ)
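For reference, the feature map can be written out as follows. This follows the subsequence-kernel formulation usually attributed to Lodhi et al. (2002); the symbols φ, Σ, u and the index vector i are notation assumed here and do not appear on the slide.

```latex
% u: a string of length n over the alphabet \Sigma
% \mathbf{i} = (i_1 < \dots < i_{|u|}): positions in s where u occurs as a subsequence
\phi_u(s) = \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]} \lambda^{\,l(\mathbf{i})},
\qquad l(\mathbf{i}) = i_{|u|} - i_1 + 1,
\qquad 0 < \lambda \le 1

% Kernel of order n: inner product of the two feature vectors
K_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t)
```

An occurrence that spans a longer stretch of the document has a larger l(i) and is therefore weighted down by a higher power of λ, which is what makes more contiguous occurrences count more.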

String Kernel Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for an SVM!
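A minimal sketch of such a dynamic program is given below. It assumes the subsequence-kernel recursion of Lodhi et al. (2002) and runs in time proportional to n·|s|·|t|; it is not the authors' implementation, and the names `ssk` and `normalized_ssk` are chosen for this illustration only.

```python
import math

def ssk(s, t, n, lam):
    """Subsequence kernel K_n(s, t) with decay lam, computed without feature vectors."""
    ls, lt = len(s), len(t)
    if min(ls, lt) < n:
        return 0.0
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Sum over all pairs of positions where the last character of the subsequence matches.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k

def normalized_ssk(s, t, n, lam):
    """Kernel scaled so that every string has similarity 1 with itself."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0
```

For example, `ssk("cat", "car", 2, 0.5)` evaluates to λ⁴ = 0.0625, since "ca" is the only length-2 subsequence the two strings share. The resulting values can be supplied to any SVM trainer that accepts a precomputed kernel matrix.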

Advantage of String Kernel No need to stem or lemmatize words. Example: • Computer • Computing • Microcomputer • Computational This should help on highly inflected languages like Slovenian or Croatian
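The effect on the four example words can be illustrated with a small sketch. It uses contiguous character 5-grams as a simplified stand-in for the string kernel's features (the actual kernel also counts gapped occurrences); the helper `char_ngrams` is hypothetical.

```python
def char_ngrams(word, n=5):
    """Set of contiguous character n-grams of a lowercased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}

words = ["Computer", "Computing", "Microcomputer", "Computational"]

# Bag-of-words without stemming sees four unrelated tokens.
distinct_tokens = {w.lower() for w in words}

# Substring features still overlap across all four word forms.
shared = set.intersection(*(char_ngrams(w) for w in words))
```

Here `shared` contains the 5-grams "compu" and "omput", so the four forms stay similar without any stemming or lemmatization step.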

Disadvantages of string kernel compared to bag-of-words • Slower • Linear speed-up cannot be used for training the SVM • Features not explicitly visible – harder to analyse the model

Datasets (1/2) • Mat’kurja – Slovenian internet directory • www.hr – Croatian internet directory Each web site has a short description and is assigned to a topic from the hierarchy. Web site: Vrtnar.com Topic: Science/Biology Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. [Renovated mini gardening portal with brief information.] Web site: Elastik Topic: Arts/Architecture Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. [Multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.]

Datasets (2/2) Unbalanced! [Figure: Slovenian and Croatian datasets]

Experimental setting • No pre-processing of documents • Documents for each domain were randomly split into a training part (30%) and a testing part (70%) • Results were averaged over 5 different splits • Break Even Point as success measure • SVM cost parameter C = 1.0 • String kernel decay parameter λ = 0.2 and length 5
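The evaluation protocol can be sketched as follows. The break-even point is computed here in one common way, as the precision among the top-P ranked documents where P is the number of positive test documents (at that cut-off precision equals recall); whether the authors computed it exactly like this is an assumption, and both function names are hypothetical.

```python
import random

def split_train_test(items, train_fraction=0.3, seed=0):
    """Randomly split items into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def break_even_point(scores, labels):
    """Precision (= recall) when exactly as many documents are retrieved as are positive."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return hits / n_pos

# Five different random splits, as in the setting above; results would be averaged.
splits = [split_train_test(range(100), 0.3, seed=k) for k in range(5)]
```

In a full run, each split would train one SVM per category on the training part, score the testing part, and average `break_even_point` over the five splits.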

Experiments

Unbalanced datasets (1/3) Higher difference on unbalanced categories!

Unbalanced datasets (2/3) • We tried SVM with different cost parameters for positive and for negative examples (parameter j) • Results for bag-of-words improve • No significant difference for string kernel
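One way to read parameter j is as a cost factor that makes training errors on positive examples j times as expensive as errors on negative ones, as in the option of the same name in SVMlight. Under that assumption (the slide does not spell out the formulation), the soft-margin objective becomes:

```latex
\min_{w,\,b,\,\xi}\quad
\frac{1}{2}\,\lVert w \rVert^{2}
\;+\; C\,j \sum_{i\,:\,y_i = +1} \xi_i
\;+\; C \sum_{i\,:\,y_i = -1} \xi_i
\qquad \text{subject to}\quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
```

With j = 1.0 this reduces to the standard objective with a single cost parameter C; with j > 1 the classifier is pushed towards fewer missed positives, which matters most for categories with few positive examples.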

Unbalanced datasets (3/3) • Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0 • Variation of parameter j on bag-of-words

Conclusions • String kernel significantly outperforms bag-of-words on highly inflected natural languages • The difference is larger on categories with a small number of positive examples • SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel

Questions?
