String Kernels on Slovenian documents Blaž Fortuna Dunja Mladenić Marko Grobelnik
Outline of the talk • Bag-of-words and String Kernel • Datasets • Experiments • Conclusions
Representation of text Vector-space model (bag-of-words) • Most commonly used • Each document is encoded as a feature vector with word frequencies as elements • IDF weighting, normalized • Similarity is inner-product (cosine similarity)
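The bag-of-words pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the `log(N / df)` IDF variant, and the names `tfidf_vectors` and `cosine` are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (dict: word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer: whitespace
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log(N / df); other variants are common too.
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Inner product of two normalized sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = ["computer science portal", "computing science news", "garden portal"]
vecs = tfidf_vectors(docs)
sim_shared_word = cosine(vecs[0], vecs[2])   # share only the word "portal"
sim_no_overlap = cosine(vecs[1], vecs[2])    # share no word at all
```

Note that "computer" and "computing" are unrelated features here: without stemming, bag-of-words sees no connection between them.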
Idea behind String Kernels (Lodhi et al., 2002) Words -> Substrings • Each document is encoded as a feature vector with substring frequencies as elements • More contiguous substrings receive higher weighting (through decay parameter λ)
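To make the feature space concrete, here is a brute-force sketch of the gap-weighted feature map in the spirit of Lodhi et al. (2002): every length-n subsequence is a feature, and each occurrence contributes the decay raised to the length of the span it covers. The function names and the parameters `n` and `lam` are chosen for this illustration only, and the enumeration is exponential, so it is only usable on very short strings.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, n, lam):
    """Explicit feature map: subsequence -> sum of lam**span over all its occurrences."""
    features = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # contiguous occurrences have the smallest span
        features[u] += lam ** span
    return dict(features)

def explicit_kernel(s, t, n, lam):
    """Inner product of the explicit feature vectors of s and t."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(w * ft.get(u, 0.0) for u, w in fs.items())

cat_features = subsequence_features("cat", 2, 0.5)
# "ca" and "at" are contiguous (span 2), "ct" skips a letter (span 3).
k_cat_car = explicit_kernel("cat", "car", 2, 0.5)  # only "ca" is shared
```

With a decay below 1, the contiguous "ca" weighs more than the gapped "ct", which is the weighting described in the second bullet.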
String Kernel Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for SVM!
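A sketch of such a dynamic program, following the recursion published by Lodhi et al. (2002), is given here. It is an illustration under stated assumptions rather than the code used for the experiments: the function names, the dense three-dimensional table, and the normalization step are choices made for this example. The running time is proportional to n · |s| · |t|.

```python
import math

def string_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n with decay lam, computed by DP."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes
    for i in range(1, n):
        for a in range(i, ls + 1):  # K'_i is 0 when a prefix is shorter than i
            kpp = 0.0  # running value K''_i(s[:a], t[:b])
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k

def normalized_string_kernel(s, t, n, lam):
    """Kernel scaled so that every string has similarity 1 with itself."""
    denom = math.sqrt(string_kernel(s, s, n, lam) * string_kernel(t, t, n, lam))
    return string_kernel(s, t, n, lam) / denom if denom > 0 else 0.0

# Inflected forms share many subsequences even though they are different words.
sim = normalized_string_kernel("computer", "computing", 5, 0.2)
```

The resulting kernel values can be collected into a Gram matrix and passed to any SVM implementation that accepts a precomputed kernel.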
Advantage of String Kernel No need to stem or lemmatize words. Example (all four words share the substring "comput"): • Computer • Computing • Microcomputer • Computational This should help with highly inflected languages such as Slovenian or Croatian
Disadvantage of string kernel compared to bag-of-words • Slower • The speed-up available for linear SVM training cannot be used • Features not explicitly visible – harder to analyse the model
Datasets (1/2) • Mat’kurja – Slovenian internet directory • www.hr – Croatian internet directory Each web site has a short description and is assigned to a topic from the hierarchy. Web site: Vrtnar.com Topic: Science/Biology Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. [Renovated mini gardening portal with brief information.] Web site: Elastik Topic: Arts/Architecture Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. [Multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.]
Datasets (2/2) Unbalanced! [Table not preserved; rows grouped under Slovenian and Croatian]
Experimental setting • No pre-processing of documents • Documents for each domain were randomly split into a training part (30%) and a testing part (70%) • Results were averaged over 5 different splits • Break Even Point as success measure • SVM cost parameter C = 1.0 • String kernel decay parameter λ = 0.2 and substring length 5
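The evaluation protocol can be sketched as follows. This is an illustrative reading of the setting above, not the authors' code: the function names and the seed handling are assumptions, and the break-even point is taken as the precision at the cutoff where the number of retrieved documents equals the number of positives (at that cutoff precision and recall coincide). Some implementations interpolate between cutoffs instead.

```python
import random

def split_documents(docs, train_fraction=0.3, seed=0):
    """Randomly split docs into a training part and a testing part."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

def break_even_point(scores, labels):
    """Precision (= recall) when as many documents are retrieved as there are positives."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return true_pos / n_pos

train, test = split_documents(range(10), train_fraction=0.3, seed=1)
# Two positives; the classifier ranks one of them in the top two.
bep = break_even_point([0.9, 0.8, 0.4, 0.3, 0.1], [1, -1, 1, -1, -1])
```

Averaging over 5 splits then amounts to calling `split_documents` with 5 different seeds and taking the mean of the resulting break-even points.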
Unbalanced datasets (1/3) The difference between string kernel and bag-of-words is larger on unbalanced categories!
Unbalanced datasets (2/3) • We tried SVM with different cost parameters for positive and for negative examples (parameter j) • Results for bag-of-words improve • No significant difference for string kernel
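The slides do not spell out how j enters the optimization. One common formulation, assumed here, is a soft-margin SVM in which training errors on positive examples are charged j times as much as errors on negative examples (this is how the cost factor of the same name works in SVMlight):

```latex
\min_{w,\,b,\,\xi}\quad
  \frac{1}{2}\lVert w \rVert^{2}
  + C\, j \sum_{i \,:\, y_i = +1} \xi_i
  + C \sum_{i \,:\, y_i = -1} \xi_i
\qquad \text{subject to} \qquad
  y_i \left( \langle w, \phi(x_i) \rangle + b \right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0 .
```

With j = 1 this reduces to the standard formulation with a single cost parameter C. With j > 1, misclassifying one of the few positive examples becomes more expensive, which counteracts the tendency to label everything negative in unbalanced categories.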
Unbalanced datasets (3/3) • Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0 • Variation of parameter j on bag-of-words
Conclusions • String kernel significantly outperforms bag-of-words on highly inflected natural languages • The difference is larger on categories with a small number of positive examples • SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel