Subsequent String Kernel • by Han Cheng Liang • Advanced Machine Learning • Prof. Tony Jebara
Subsequent String Kernel (SSK) • SSK function • Measures how similar two strings are by how many subsequences they share. • The subsequences do not have to be contiguous: • s = science is organized knowledge • t = wisdom is organized life • subsequence = “sie”
SSK Continued • s = science is organized knowledge • t = wisdom is organized life • subsequence = “sie” = u • The further apart the first and last characters of an occurrence of the subsequence are, the more it is penalized. • Define a decay factor $\lambda \in (0, 1)$; an occurrence that spans $l$ characters of the string is weighted by $\lambda^l$.
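SSK Formally Defined • Alphabet set: $\Sigma$ • Set of all possible subsequences of length $n$ over the alphabet: $\Sigma^n$ • String $s = s_1 s_2 \dots s_{|s|}$ • String $t = t_1 t_2 \dots t_{|t|}$ • $u$ is a subsequence of $s$ if there is a set of indices $\mathbf{i} = (i_1, \dots, i_{|u|})$, with $1 \le i_1 < \dots < i_{|u|} \le |s|$, such that $u_j = s_{i_j}$ for $j = 1, \dots, |u|$; this is written $u = s[\mathbf{i}]$ • Length of the subsequence $u$ in $s$: $l(\mathbf{i}) = i_{|u|} - i_1 + 1$
SSK Formally Defined • Feature for each $u \in \Sigma^n$, with decay factor $\lambda$: $\phi_u(s) = \sum_{\mathbf{i}: u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$ • Kernel: $K_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in \Sigma^n} \sum_{\mathbf{i}: u = s[\mathbf{i}]} \sum_{\mathbf{j}: u = t[\mathbf{j}]} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$ • Run time when the features are computed directly: $O(|\Sigma|^n)$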
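To make the index-set definition concrete, the sketch below enumerates every occurrence of a subsequence u in a string s and weights each one by the decay factor raised to its span. The function names `occurrences`, `span`, and `weight` are illustrative assumptions, not names from the slides, and the code uses 0-based indices where the definitions are 1-based.

```python
from itertools import combinations


def occurrences(u, s):
    """Yield each increasing index tuple (0-based) at which s spells out u.

    u is assumed to be non-empty.
    """
    for idx in combinations(range(len(s)), len(u)):
        if all(s[p] == c for p, c in zip(idx, u)):
            yield idx


def span(idx):
    """Characters of s covered from the first to the last matched index."""
    return idx[-1] - idx[0] + 1


def weight(u, s, lam):
    """Sum of lam ** span over all occurrences of u in s."""
    return sum(lam ** span(idx) for idx in occurrences(u, s))


if __name__ == "__main__":
    # "ct" occurs once in "cat", at indices (0, 2), covering 3 characters.
    print(list(occurrences("ct", "cat")))
    print(weight("ct", "cat", 0.5))
```

On the slide's example, the tightest occurrence of “sie” in "science is organized knowledge" is the one inside "scie", which has span 4. Brute-force enumeration like this is only practical for short strings.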
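A direct computation of the kernel can be sketched as below, mainly to show why it is expensive. This is a minimal sketch under assumptions: `feature_map` and `ssk_naive` are hypothetical names, and the code only visits subsequences that actually occur in each string rather than all of the alphabet's length-n strings, so it is still combinatorial in the string length.

```python
from collections import defaultdict
from itertools import combinations


def feature_map(s, n, lam):
    """Map each length-n subsequence u of s to the sum of lam ** span
    over all of its occurrences in s (n is assumed to be >= 1)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[p] for p in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def ssk_naive(s, t, n, lam):
    """Inner product of the two explicit feature maps."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)


if __name__ == "__main__":
    # "cat" and "car" share only the subsequence "ca" (span 2 in both),
    # so the kernel is lam**2 * lam**2 = lam**4.
    print(ssk_naive("cat", "car", 2, 0.5))
```

With n = 2 and a decay factor of 0.5, "cat" and "car" share only "ca", giving 0.5 ** 4 = 0.0625.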
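Improve Performance Using DP • Can improve the runtime to $O(n|s||t|)$ • Define, for $i = 1, \dots, n-1$: $K'_i(s, t) = \sum_{u \in \Sigma^i} \sum_{\mathbf{i}: u = s[\mathbf{i}]} \sum_{\mathbf{j}: u = t[\mathbf{j}]} \lambda^{|s| + |t| - i_1 - j_1 + 2}$, where $i_1$ and $j_1$ are the first indices of the occurrences of $u$ in $s$ and $t$, and $\lambda$ is the decay factor • 3 Basic Cases: $K'_0(s, t) = 1$ for all $s, t$; $K'_i(s, t) = 0$ if $\min(|s|, |t|) < i$; $K_i(s, t) = 0$ if $\min(|s|, |t|) < i$ • Recursive Step, for a string $s$ extended by one character $x$: $K'_i(sx, t) = \lambda K'_i(s, t) + \sum_{j: t_j = x} K'_{i-1}(s, t[1:j-1])\,\lambda^{|t| - j + 2}$ and $K_n(sx, t) = K_n(s, t) + \sum_{j: t_j = x} K'_{n-1}(s, t[1:j-1])\,\lambda^2$
DP Continued • Define: $K''_i(sx, t) = \sum_{j: t_j = x} K'_{i-1}(s, t[1:j-1])\,\lambda^{|t| - j + 2}$, so that $K'_i(sx, t) = \lambda K'_i(s, t) + K''_i(sx, t)$ • Two Cases: $K''_i(sx, tv) = \lambda^{|v|} K''_i(sx, t)$ if $x$ does not occur in $v$; $K''_i(sx, tx) = \lambda \left( K''_i(sx, t) + \lambda K'_{i-1}(s, t) \right)$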
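The dynamic program can be sketched as follows. This is one possible implementation, assuming the standard auxiliary quantities from the string kernel literature (called K' and K'' in the comments); the function name `ssk_dp` and the table layout are assumptions, not the presenter's code. It fills one table per subsequence length, so the work is proportional to n times |s| times |t|.

```python
def ssk_dp(s, t, n, lam):
    """Subsequence kernel of order n via dynamic programming.

    kp[i][a][b] holds K'_i(s[:a], t[:b]); the running value kpp holds
    K''_i(s[:a], t[:b]) while b sweeps across t.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]

    # Base case: K'_0 is 1 for every pair of prefixes.
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0

    # K'_i stays 0 whenever either prefix is shorter than i.
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    # Last symbols match: a new occurrence can end here.
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:
                    # No match: earlier contributions just decay once more.
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp

    # Close each subsequence at every matching pair of positions.
    total = 0.0
    for a in range(1, ls + 1):
        for b in range(1, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total


if __name__ == "__main__":
    print(ssk_dp("cat", "car", 2, 0.5))
```

For "cat" and "car" with n = 2 and a decay factor of 0.5 this returns 0.0625, the same value a direct enumeration gives. Because the function only indexes and compares elements, it also accepts lists of tokens.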
Experiments Performed • SSK vs. NGK (n-gram kernel) vs. WK (word kernel) • Varying sequence lengths and decay factors • Combining kernels of different lengths: has potential • Combining SSK and NGK: no improvement • Combining SSK with different decay factors: no improvement
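The slides do not say how the kernels were combined. One common choice, assumed here purely for illustration, is a non-negative weighted sum, which is again a valid kernel. The helpers `combine` and `normalized`, and the two toy kernels in the usage section, are hypothetical stand-ins for SSK, NGK, or kernels of different lengths.

```python
import math


def combine(kernels, weights):
    """Return k(s, t) = sum of weights[m] * kernels[m](s, t).

    Non-negative weights keep the result a valid kernel.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    def k(s, t):
        return sum(w * km(s, t) for w, km in zip(weights, kernels))

    return k


def normalized(kernel):
    """Scale a kernel so that k(s, s) == 1 whenever kernel(s, s) > 0."""

    def k(s, t):
        denom = math.sqrt(kernel(s, s) * kernel(t, t))
        return kernel(s, t) / denom if denom > 0 else 0.0

    return k


if __name__ == "__main__":
    # Toy kernels: number of shared characters and of shared bigrams.
    def shared_chars(s, t):
        return float(len(set(s) & set(t)))

    def bigrams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def shared_bigrams(s, t):
        return float(len(bigrams(s) & bigrams(t)))

    k = combine([shared_chars, shared_bigrams], [1.0, 0.5])
    print(k("cat", "car"))
```

Normalizing each kernel before summing keeps one kernel from dominating only because its raw values are larger.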
Subsequent Word Kernel • Instead of individual letters and the space character, the alphabet $\Sigma$ consists of whole English words. • The alphabet is much larger, but with the DP technique the runtime is still $O(n|s||t|)$, where $|s|$ and $|t|$ are now counted in words.
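A word-level version can reuse the same dynamic program by feeding it token lists instead of character strings. The sketch below assumes a naive tokenizer (lowercasing and splitting on whitespace), which the slides do not specify; `tokenize`, `seq_kernel`, and `word_ssk` are hypothetical names.

```python
def tokenize(text):
    """Assumed tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def seq_kernel(s, t, n, lam):
    """Order-n subsequence kernel over any two sequences (n >= 1).

    The cost depends on the sequence lengths, not on the alphabet size.
    """
    ls, lt = len(s), len(t)
    # kp[i][a][b]: auxiliary kernel of order i on the prefixes s[:a], t[:b].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            running = 0.0
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    running = lam * (running + lam * kp[i - 1][a - 1][b - 1])
                else:
                    running = lam * running
                kp[i][a][b] = lam * kp[i][a - 1][b] + running
    return sum(
        lam * lam * kp[n - 1][a - 1][b - 1]
        for a in range(1, ls + 1)
        for b in range(1, lt + 1)
        if s[a - 1] == t[b - 1]
    )


def word_ssk(text_s, text_t, n, lam):
    """Subsequence kernel whose symbols are whole words."""
    return seq_kernel(tokenize(text_s), tokenize(text_t), n, lam)


if __name__ == "__main__":
    s = "science is organized knowledge"
    t = "wisdom is organized life"
    print(word_ssk(s, t, 2, 0.5))
```

For the two example sentences, the only shared two-word subsequence is ("is", "organized"), adjacent in both, so with a decay factor of 0.5 the kernel is 0.5 ** 4 = 0.0625.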
Experiments • Data: Yahoo! News articles from AP, Reuters, etc. • Four categories: business, politics, entertainment, sports • 60 articles per category: 50 used for training and 10 for testing • Performance comparable to SSK (n=3): both reached around 90% accuracy. The word kernel outperformed SSK in some categories and underperformed in others. • Combining SSK and the word subsequence kernel did not yield improvements.
Kernel Estimation • SSK: used the most frequent contiguous subsequences found in some data set • Me: used the most frequently used English words • Results, by number of most frequent words used: • top 2000: bad • top 3000: bad • top 4000: 80% accuracy
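The slides do not give the estimator itself. The sketch below assumes the form commonly used for approximating string kernels, where the kernel between x and z is replaced by a sum over a small basis set of K(x, b) times K(z, b). All names (`ssk_naive`, `top_ngrams`, `approx_kernel`) are hypothetical, and the brute-force kernel is included only to keep the example self-contained.

```python
from collections import Counter, defaultdict
from itertools import combinations


def ssk_naive(s, t, n, lam):
    """Brute-force subsequence kernel; only suitable for short strings."""
    def phi(x):
        f = defaultdict(float)
        for idx in combinations(range(len(x)), n):
            f["".join(x[p] for p in idx)] += lam ** (idx[-1] - idx[0] + 1)
        return f

    ps, pt = phi(s), phi(t)
    return sum(w * pt[u] for u, w in ps.items() if u in pt)


def top_ngrams(corpus, n, k):
    """The k most frequent contiguous length-n substrings in the corpus."""
    counts = Counter(
        doc[i:i + n] for doc in corpus for i in range(len(doc) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(k)]


def approx_kernel(x, z, basis, kernel):
    """Assumed estimator: sum over the basis of kernel(x, b) * kernel(z, b)."""
    return sum(kernel(x, b) * kernel(z, b) for b in basis)


if __name__ == "__main__":
    lam, n = 0.5, 2
    basis = top_ngrams(["cat", "car"], n, 1)  # "ca" is the only repeated bigram
    k = lambda a, b: ssk_naive(a, b, n, lam)
    print(basis, approx_kernel("cat", "car", basis, k))
```

When every basis element has length n, each term equals the decay factor to the power 2n times the product of the matching feature values, so the estimate is a rescaled kernel restricted to the chosen features. How large the basis must be is an empirical question, as the results on the slide suggest.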
Future Work • Kernels with different lengths • Upper/lower bounds for the kernel estimation.