Subsequent String Kernel by Han Cheng Liang

Advanced Machine Learning • Prof. Tony Jebara

Subsequent String Kernel (SSK)
• SSK function
• Measures how similar two strings are by counting the subsequences they have in common.
• The subsequences do not have to be contiguous:
• s = science is organized knowledge
• t = wisdom is organized life
• subsequence u = “sie”

SSK Continued
• The further apart the first and last characters of a subsequence are within the string, the more that occurrence is penalized.
• Define a decay factor $\lambda \in (0, 1)$.
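To make the penalty concrete, the sketch below enumerates every way a subsequence u can be embedded in a string and weights each embedding by the decay factor raised to its span (last matched index minus first matched index, plus one). The helper name `embedding_weights` and the decay value 0.5 are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def embedding_weights(s, u, lam):
    """Return (index tuple, weight) pairs for every occurrence of the
    non-empty string u as a (possibly non-contiguous) subsequence of s.
    Hypothetical helper for illustration only."""
    out = []
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            span = idx[-1] - idx[0] + 1  # stretch of s covered by this occurrence
            out.append((idx, lam ** span))
    return out

lam = 0.5  # assumed decay factor

# "ct" occurs once in "cat" and spans all three characters: weight lam**3
print(embedding_weights("cat", "ct", lam))

# the slide's example: "sie" can be embedded in both strings
s = "science is organized knowledge"
t = "wisdom is organized life"
print(len(embedding_weights(s, "sie", lam)) > 0)
print(len(embedding_weights(t, "sie", lam)) > 0)
```

Occurrences whose matched characters sit close together keep a weight near $\lambda^{|u|}$, while widely spread occurrences contribute much less.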

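SSK Formally Defined
• Alphabet set: $\Sigma$
• Set of all candidate subsequences of length n over the alphabet $\Sigma$: $\Sigma^n$
• string $s = s_1 \dots s_{|s|}$
• string $t = t_1 \dots t_{|t|}$
• u is a subsequence of s if there is a set of indices $\mathbf{i} = (i_1, \dots, i_{|u|})$, with $1 \le i_1 < \dots < i_{|u|} \le |s|$, such that $u = s[\mathbf{i}]$
• length of the subsequence u in s: $l(\mathbf{i}) = i_{|u|} - i_1 + 1$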
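Given an alphabet, an index set and a subsequence length defined as on this slide, the feature map and kernel are usually written as follows. This is the standard formulation from Lodhi et al., "Text Classification using String Kernels" (2002), which the slides appear to follow; the notation ($\Sigma$, $\mathbf{i}$, $l(\mathbf{i})$, $\lambda$) is assumed from that paper rather than recovered from the slides.

```latex
\phi_u(s) = \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]} \lambda^{\,l(\mathbf{i})},
\qquad
l(\mathbf{i}) = i_{|u|} - i_1 + 1,
\qquad
K_n(s,t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t)
         = \sum_{u \in \Sigma^n}\;
           \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]}\;
           \sum_{\mathbf{j}\,:\,u = t[\mathbf{j}]}
           \lambda^{\,l(\mathbf{i}) + l(\mathbf{j})}
```

For example, with s = "cat", t = "car" and n = 2, the only shared feature is u = "ca", which spans two characters in each string, so $K_2(s,t) = \lambda^2 \cdot \lambda^2 = \lambda^4$.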

SSK Formally Defined
• run time: $O(|\Sigma|^n)$ when the features are computed explicitly
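A direct implementation of the definition can be sketched as below. It builds the explicit feature map of each string and takes the inner product. Note one assumption: rather than looping over every possible subsequence of length n over the alphabet, it enumerates the index tuples of each string, so its cost grows with the number of n-element index sets. That is still impractical for documents, which is what motivates the dynamic program. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def feature_map(s, n, lam):
    """Explicit SSK features: subsequence u -> sum of lam**span over all
    index tuples of length n in s that spell out u."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1
        phi[u] += lam ** span
    return phi

def ssk_naive(s, t, n, lam):
    """Unnormalized kernel: inner product of the two feature maps."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)

lam = 0.5  # assumed decay factor
print(ssk_naive("cat", "car", 2, lam))  # only "ca" is shared: lam**4
print(ssk_naive("cat", "cat", 2, lam))  # "ca", "at", "ct": 2*lam**4 + lam**6
```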

Improve Performance Using DP
• Can improve the runtime to $O(n|s||t|)$
• Define:
• 3 Basic Cases:
• Recursive Step:
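For reference, the dynamic program usually given for this kernel (Lodhi et al., 2002) is reproduced below. It is assumed, not confirmed, that the slide's "Define", "3 Basic Cases" and "Recursive Step" items refer to these equations. The auxiliary function $K'_i$ measures each shared subsequence from its first character to the end of the string, rather than to its last matched character, and $x$ denotes a single character appended to $s$.

```latex
\begin{aligned}
K'_i(s,t) &= \sum_{u \in \Sigma^i}\;
             \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]}\;
             \sum_{\mathbf{j}\,:\,u = t[\mathbf{j}]}
             \lambda^{\,|s| + |t| - i_1 - j_1 + 2},
             \qquad i = 1, \dots, n-1 \\[4pt]
K'_0(s,t) &= 1 \quad \text{for all } s, t \\
K'_i(s,t) &= 0 \quad \text{if } \min(|s|,|t|) < i \\
K_i(s,t)  &= 0 \quad \text{if } \min(|s|,|t|) < i \\[4pt]
K'_i(sx,t) &= \lambda\,K'_i(s,t)
             + \sum_{j\,:\,t_j = x} K'_{i-1}\bigl(s,\,t[1:j-1]\bigr)\,\lambda^{\,|t| - j + 2} \\
K_n(sx,t)  &= K_n(s,t)
             + \sum_{j\,:\,t_j = x} K'_{n-1}\bigl(s,\,t[1:j-1]\bigr)\,\lambda^{2}
\end{aligned}
```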

DP Continued
• Define:
• Two Cases:
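The recursion can be sketched in Python as follows. The sketch assumes the standard formulation from Lodhi et al. (2002), in which a second auxiliary quantity $K''_i$ is carried along each row so that the inner sum over matching positions is not recomputed. Its two cases are $K''_i(sx, tu) = \lambda K''_i(sx, t)$ when $u \ne x$, and $K''_i(sx, tx) = \lambda\,(K''_i(sx, t) + \lambda K'_{i-1}(s, t))$. The function name `ssk` and the table layout are choices made for this example.

```python
def ssk(s, t, n, lam):
    """Unnormalized subsequence kernel K_n(s, t) in O(n * |s| * |t|) time."""
    ls, lt = len(s), len(t)
    if min(ls, lt) < n:
        return 0.0
    # kp[i][a][b] holds K'_i(s[:a], t[:b]); K'_0 is 1 everywhere
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b]) along this row
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # K_n sums lam**2 * K'_{n-1} over every pair of matching end positions
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k

lam = 0.5  # assumed decay factor
print(ssk("cat", "car", 2, lam))  # lam**4
print(ssk("science is organized knowledge", "wisdom is organized life", 3, lam))
```

The three nested loops over i, a and b give the $O(n|s||t|)$ cost. The code only indexes the inputs and compares elements, so it works on any pair of sequences.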

Experiments Performed
• SSK vs. n-gram kernel (NGK) vs. word kernel (WK)
• Varying sequence lengths and decay factors
• Combining kernels of different lengths: has potential
• Combining SSK and NGK: no improvement
• Combining SSK with different decay factors: no improvement
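One way to combine kernels of different lengths is to normalize each $K_m$ and take a weighted sum, as sketched below. The slides do not say how the combination was formed, so the uniform default weights, the normalization $K_m(s,t)/\sqrt{K_m(s,s)\,K_m(t,t)}$ and the function names are all assumptions. The helper reuses the DP tables to return every length up to n in a single pass.

```python
import math

def ssk_up_to(s, t, n, lam):
    """Return [K_1(s,t), ..., K_n(s,t)] from one pass over the DP tables."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds K'_i(s[:a], t[:b]); K'_0 is 1 everywhere
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running K''_i along the row
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    return [
        sum(lam * lam * kp[m - 1][a - 1][b - 1]
            for a in range(m, ls + 1) for b in range(m, lt + 1)
            if s[a - 1] == t[b - 1])
        for m in range(1, n + 1)
    ]

def combined_kernel(s, t, n, lam, weights=None):
    """Weighted sum of normalized kernels for lengths 1..n (assumed scheme)."""
    weights = weights or [1.0] * n
    k_st = ssk_up_to(s, t, n, lam)
    k_ss = ssk_up_to(s, s, n, lam)
    k_tt = ssk_up_to(t, t, n, lam)
    total = 0.0
    for w, st, ss, tt in zip(weights, k_st, k_ss, k_tt):
        if ss > 0 and tt > 0:  # skip lengths longer than either string
            total += w * st / math.sqrt(ss * tt)
    return total

print(ssk_up_to("cat", "car", 2, 0.5))
print(combined_kernel("cat", "car", 2, 0.5))
```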

Subsequent Word Kernel
• Instead of having individual letters and the space character in $\Sigma$, have whole English words.
• The alphabet set is much larger, but with the DP technique the runtime is still $O(n|s||t|)$, where $|s|$ and $|t|$ now count words.
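Because the dynamic program only compares symbols for equality, moving from characters to words amounts to changing the tokenization. The sketch below runs the same recursion over lists of words, keeping only two layers of the $K'$ table at a time. Whitespace tokenization via `str.split` and the function name are assumptions; the slides do not describe the preprocessing that was used.

```python
def subsequence_kernel(x, y, n, lam):
    """K_n over two sequences of comparable items (characters or words)."""
    lx, ly = len(x), len(y)
    prev = [[1.0] * (ly + 1) for _ in range(lx + 1)]  # K'_0 = 1 everywhere
    for i in range(1, n):
        cur = [[0.0] * (ly + 1) for _ in range(lx + 1)]  # becomes K'_i
        for a in range(i, lx + 1):
            kpp = 0.0  # running K''_i along the row
            for b in range(i, ly + 1):
                if x[a - 1] == y[b - 1]:
                    kpp = lam * (kpp + lam * prev[a - 1][b - 1])
                else:
                    kpp = lam * kpp
                cur[a][b] = lam * cur[a - 1][b] + kpp
        prev = cur
    # prev now holds K'_{n-1}
    return sum(lam * lam * prev[a - 1][b - 1]
               for a in range(n, lx + 1) for b in range(n, ly + 1)
               if x[a - 1] == y[b - 1])

s = "science is organized knowledge".split()
t = "wisdom is organized life".split()
# the only shared two-word subsequence is ("is", "organized"),
# contiguous in both documents, so the kernel is lam**4
print(subsequence_kernel(s, t, 2, 0.5))
```

The work depends on the number of words in each document and on n, not on the size of the vocabulary.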

Experiments
• Data: Yahoo! News. News articles from AP, Reuters, etc.
• Four categories: business, politics, entertainment, sports
• 60 articles in each category: 50 used for training and 10 for testing
• Performance comparable to SSK (n=3): both reached an accuracy of around 90%. The word kernel outperformed SSK in some categories and underperformed in others.
• Combining SSK and the word subsequence kernel did not yield improvements.

Kernel Estimation
• SSK: used the most frequent contiguous subsequences found in some data set
• Me: used the most frequently used English words
• Results:
  • top 2000: bad
  • top 3000: bad
  • top 4000: 80% accuracy

Future Work
• Kernels with different lengths
• Upper/lower bounds for the kernel estimation
