Fast Methods for Kernel-based Text Analysis

Taku Kudo (工藤 拓), Yuji Matsumoto (松本 裕治) • NAIST (Nara Institute of Science and Technology) • 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

Background • Kernel methods (e.g., SVMs) have become popular • Prior knowledge can be incorporated independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product) • High accuracy

Problem • Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because classification (testing) is inefficient • Some kernel-based parsers take 2-3 seconds per sentence

Goals • Build fast but still accurate kernel-based text analyzers • Make them usable in a wider range of NL applications

Outline • Polynomial Kernel of degree d • Fast Methods for Polynomial kernel • PKI • PKE • Experiments • Conclusions and Future Work


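Kernel Methods • Training data: $L$ examples $X_j$ with weights $\alpha_j$ • Decision function: $f(X) = \sum_{j=1}^{L} \alpha_j K(X_j, X)$ • No need to represent an example as an explicit feature vector • Complexity of testing is $O(L \cdot |X|)$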

Kernels for Sets (1/3) • Focus on the special case where examples are represented as sets • Instances in NLP are usually represented as sets (e.g., bag-of-words) • Feature set: $F$ • Training data: $X_j \subseteq F$, $j = 1, \dots, L$

Kernels for Sets (2/3) • Simple definition: • Combinations (subsets) of features 2nd order 3rd order

Kernels for Sets (3/3) • Task: in "I ate a cake" (PRP VBD DT NN), with head "ate" and modifier "cake", decide: dependent (+1) or independent (-1)? • Basic features: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN} • After heuristic selection of combinations: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, …} • Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks • Previous approaches select combinations heuristically

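Polynomial Kernel of degree d • Implicit form: $K(X, X') = (|X \cap X'| + 1)^d$ • Explicit form: $K(X, X') = \sum_{r=0}^{d} c_d(r) \cdot |P_r(X \cap X')|$ • $P_r(X)$ is the set of all subsets of $X$ with exactly $r$ elements • $c_d(r)$ is the prior weight given to subsets of size $r$ (subset weight)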
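
The equivalence of the implicit and explicit forms can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the subset weights are obtained by solving the identity (n + 1)^d = Σ_r c_d(r)·C(n, r) for n = 0, …, d, which is an assumption about how to compute them rather than the closed form the authors use.

```python
from itertools import combinations
from math import comb


def subset_weights(d):
    """Return [c_d(0), ..., c_d(d)] with (n + 1)**d == sum_r c_d(r) * C(n, r)."""
    weights = []
    for r in range(d + 1):
        # At n = r the term C(r, r) equals 1, and all smaller weights are known.
        known = sum(weights[k] * comb(r, k) for k in range(r))
        weights.append((r + 1) ** d - known)
    return weights


def kernel_implicit(x, y, d):
    """Implicit form: (|X ∩ X'| + 1)^d."""
    return (len(x & y) + 1) ** d


def kernel_explicit(x, y, d):
    """Explicit form: weighted count of the shared subsets of size 0..d."""
    common = sorted(x & y)
    c = subset_weights(d)
    # len(list(combinations(common, r))) is |P_r(X ∩ X')|.
    return sum(c[r] * len(list(combinations(common, r))) for r in range(d + 1))


print(subset_weights(3))  # [1, 7, 12, 6]
print(kernel_implicit({"a", "b", "c"}, {"a", "c", "e"}, 3))  # 27
print(kernel_explicit({"a", "b", "c"}, {"a", "c", "e"}, 3))  # 27
```

Because (n + 1)^d is a polynomial of degree d in n, matching it at d + 1 points is enough to fix the weights for every overlap size.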

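Example (Cubic Kernel, d = 3) • Implicit form: $K(X, X') = (|X \cap X'| + 1)^3$ • Explicit form: $K(X, X') = 1 + 7\,|P_1(X \cap X')| + 12\,|P_2(X \cap X')| + 6\,|P_3(X \cap X')|$, where $P_r(\cdot)$ is the set of subsets with exactly $r$ elements • Subsets of up to 3 features are used as new features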

Toy Example • Feature set: F = {a, b, c, d, e} • Examples (#SVs L = 3):
  j   α_j   X_j
  1   1     {a, b, c}
  2   0.5   {a, b, d}
  3   -2    {b, c, d}
• Kernel: $K(X, X') = (|X \cap X'| + 1)^3$ • Test example: X = {a, c, e}

PKB (Baseline) • $K(X, X') = (|X \cap X'| + 1)^3$ • Test example: X = {a, c, e}
  j   α_j   X_j         K(X_j, X)
  1   1     {a, b, c}   (2+1)^3 = 27
  2   0.5   {a, b, d}   (1+1)^3 = 8
  3   -2    {b, c, d}   (1+1)^3 = 8
• $f(X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$ • Complexity is always $O(L \cdot |X|)$
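
The baseline computation can be written out in a few lines. This is a minimal sketch of the toy example only; the names `SUPPORT_VECTORS` and `pkb_score` are hypothetical, and no bias term is included because the slide's example has none.

```python
# Toy support vectors from the slide: (alpha_j, X_j).
SUPPORT_VECTORS = [
    (1.0, {"a", "b", "c"}),
    (0.5, {"a", "b", "d"}),
    (-2.0, {"b", "c", "d"}),
]


def pkb_score(x, support_vectors, d=3):
    """Baseline: evaluate the kernel against every support vector."""
    return sum(alpha * (len(sv & x) + 1) ** d for alpha, sv in support_vectors)


score = pkb_score({"a", "c", "e"}, SUPPORT_VECTORS)
print(score)  # 15.0
```

Every support vector is visited for every test example, which is where the O(L·|X|) cost comes from.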

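PKI (Inverted Representation) • $K(X, X') = (|X \cap X'| + 1)^3$ • Support vectors:
  j   α_j   X_j
  1   1     {a, b, c}
  2   0.5   {a, b, d}
  3   -2    {b, c, d}
• Inverted index (feature → support vectors containing it):
  a   {1, 2}
  b   {1, 2, 3}
  c   {1, 3}
  d   {2, 3}
• B = average size of an inverted-index entry • Test example: X = {a, c, e} • $f(X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$ • Average complexity is $O(B \cdot |X| + L)$ • Efficient if the feature space is sparse • Suitable for many NL tasks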
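
A minimal sketch of the inverted representation on the same toy data follows. The helper names are hypothetical, and support vectors are indexed from 0 here rather than from 1 as on the slide.

```python
from collections import defaultdict

# Toy support vectors from the slide: (alpha_j, X_j).
SUPPORT_VECTORS = [
    (1.0, {"a", "b", "c"}),
    (0.5, {"a", "b", "d"}),
    (-2.0, {"b", "c", "d"}),
]


def build_inverted_index(support_vectors):
    """Map each feature to the (0-based) indices of the support vectors containing it."""
    index = defaultdict(list)
    for j, (_, features) in enumerate(support_vectors):
        for feature in features:
            index[feature].append(j)
    return index


def pki_score(x, support_vectors, index, d=3):
    """Accumulate |X_j ∩ X| through the index, then apply the kernel once per SV."""
    overlap = [0] * len(support_vectors)
    for feature in x:
        # Only support vectors sharing this feature are touched: O(B) per feature.
        for j in index.get(feature, ()):
            overlap[j] += 1
    return sum(
        alpha * (overlap[j] + 1) ** d
        for j, (alpha, _) in enumerate(support_vectors)
    )


INDEX = build_inverted_index(SUPPORT_VECTORS)
print(pki_score({"a", "c", "e"}, SUPPORT_VECTORS, INDEX))  # 15.0
```

The final pass over all support vectors is the "+ L" term in the average complexity; the intersection counting costs B per feature of the test example.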

PKE (Expanded Representation) • Convert the classifier into a linear form by calculating a weight vector w • The mapping function projects X into the space of its subsets

PKE (Expanded Representation) • $K(X, X') = (|X \cap X'| + 1)^3$ • Support vectors:
  j   α_j   X_j
  1   1     {a, b, c}
  2   0.5   {a, b, d}
  3   -2    {b, c, d}
• Subset weights: $c_3(0) = 1$, $c_3(1) = 7$, $c_3(2) = 12$, $c_3(3) = 6$ • W (Expansion Table):
  s           c     w(s)
  ∅           1     -0.5
  {a}         7     10.5
  {b}         7     -3.5
  {c}         7     -7
  {d}         7     -10.5
  {a,b}       12    18
  {a,c}       12    12
  {a,d}       12    6
  {b,c}       12    -12
  {b,d}       12    -18
  {c,d}       12    -24
  {a,b,c}     6     6
  {a,b,d}     6     3
  {a,c,d}     6     0
  {b,c,d}     6     -12
• Example entry: $w(\{b,d\}) = 12 \cdot (0.5 - 2) = -18$ • Test example X = {a, c, e} expands to {∅, {a}, {c}, {e}, {a,c}, {a,e}, {c,e}, {a,c,e}} • $f(X) = -0.5 + 10.5 - 7 + 12 = 15$ • Complexity is $O(|X|^d)$, independent of the number of SVs (L) • Efficient if the number of SVs is large
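
The expansion table and the linear classification step can be sketched as follows. This is an exhaustive construction for the toy example only; `build_expansion_table` and `pke_score` are hypothetical names, and the cubic subset weights are hard-coded from the slide.

```python
from itertools import combinations

C3 = [1, 7, 12, 6]  # subset weights c_3(0..3) from the slide

# Toy support vectors from the slide: (alpha_j, X_j).
SUPPORT_VECTORS = [
    (1.0, {"a", "b", "c"}),
    (0.5, {"a", "b", "d"}),
    (-2.0, {"b", "c", "d"}),
]


def build_expansion_table(support_vectors, weights):
    """w(s) = c_d(|s|) * sum of alpha_j over the support vectors that contain s."""
    d = len(weights) - 1
    table = {}
    for alpha, features in support_vectors:
        for r in range(d + 1):
            # Subsets are stored as sorted tuples so they can be used as dict keys.
            for subset in combinations(sorted(features), r):
                table[subset] = table.get(subset, 0.0) + weights[r] * alpha
    return table


def pke_score(x, table, d=3):
    """Sum w(s) over every subset s of X with at most d elements."""
    items = sorted(x)
    return sum(
        table.get(subset, 0.0)
        for r in range(d + 1)
        for subset in combinations(items, r)
    )


EXPANSION_TABLE = build_expansion_table(SUPPORT_VECTORS, C3)
print(EXPANSION_TABLE[("b", "d")])  # -18.0
print(pke_score({"a", "c", "e"}, EXPANSION_TABLE))  # 15.0
```

Subsets that occur in no support vector (such as {a, c, d}) are simply absent from the dictionary, which is equivalent to a weight of 0. Classification never touches the support vectors, so its cost depends only on |X| and d.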

PKE in Practice • Hard to calculate the Expansion Table exactly • Use an Approximated Expansion Table • Subsets with smaller |w| can be removed, since |w| represents the contribution to the final classification • Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation

Subset Mining Problem • Transaction database:
  id   set
  1    {a, c, d}
  2    {a, b, c}
  3    {a, b, d}
  4    {b, c, e}
• Extract all subsets that occur in at least a given number of sets of the transaction database (2 in this example) • Results (subset: count): {a}: 3, {b}: 3, {c}: 3, {d}: 2, {a, b}: 2, {b, c}: 2, {a, c}: 2, {a, d}: 2 • With no size constraints → NP-hard • Efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
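
The mining problem on this transaction database can be illustrated with a small level-wise search in the style of Apriori. This is a simplified sketch, not an optimized implementation of Apriori or PrefixSpan; the function name and the `min_support` parameter are assumptions made for the example.

```python
from itertools import combinations

# Transaction database from the slide.
TRANSACTIONS = [
    {"a", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"b", "c", "e"},
]


def frequent_subsets(transactions, min_support):
    """Return {subset: count} for all subsets found in >= min_support transactions."""

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = sorted(set().union(*transactions))
    level = [frozenset([item]) for item in items]
    result = {}
    while level:
        counts = {c: support(c) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join step: merge frequent k-subsets that differ in exactly one item.
        candidates = {
            a | b for a, b in combinations(frequent, 2) if len(a | b) == len(a) + 1
        }
        # Prune step: a candidate survives only if all its k-subsets are frequent.
        level = [c for c in candidates if all(c - {i} in frequent for i in c)]
    return result


RESULT = frequent_subsets(TRANSACTIONS, min_support=2)
for subset, count in sorted(RESULT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), count)
```

The prune step relies on the fact that a subset can never occur more often than any of its own subsets, so infrequent candidates cut off whole branches of the search.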

Feature Selection as Mining • Support vectors:
  i   α_i   X_i
  1   1     {a, b, c}
  2   0.5   {a, b, d}
  3   -2    {b, c, d}
• Exhaustive generation and testing (build the full table W, then filter) → impractical!
  s      ∅     {a}   {b}   {c}  {d}    {a,b}  {a,c}  {a,d}  {b,c}  {b,d}  {c,d}  {a,b,c}  {a,b,d}  {a,c,d}  {b,c,d}
  w(s)   -0.5  10.5  -3.5  -7   -10.5  18     12     6      -12    -18    -24    6        3        0        -12
• Direct generation with subset mining (σ = 10, keeping subsets with |w(s)| ≥ σ):
  s      {a}   {d}    {a,b}  {a,c}  {b,c}  {b,d}  {c,d}  {b,c,d}
  w(s)   10.5  -10.5  18     12     -12    -18    -24    -12
• Can efficiently build the approximated table • σ controls the rate of approximation
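
One way to generate the approximated table directly, without building the full table first, is a depth-first search with a pruning bound. The sketch below is an assumption about how such mining could work, not the authors' exact procedure: it bounds |w| of every superset by the larger of the positive and negative weight mass that still covers the current subset, and stops growing a subset once that bound falls below σ. All names are hypothetical.

```python
C3 = [1, 7, 12, 6]  # subset weights c_3(0..3) from the slide

# Toy support vectors from the slide: (alpha_j, X_j).
SUPPORT_VECTORS = [
    (1.0, {"a", "b", "c"}),
    (0.5, {"a", "b", "d"}),
    (-2.0, {"b", "c", "d"}),
]


def mine_expansion_table(support_vectors, weights, sigma):
    """Collect subsets s with |w(s)| >= sigma, pruning branches that cannot reach sigma."""
    d = len(weights) - 1
    items = sorted(set().union(*(x for _, x in support_vectors)))
    table = {}

    def grow(subset, start, covering):
        # covering: the support vectors that contain every feature of `subset`.
        pos = sum(a for a, _ in covering if a > 0)
        neg = sum(-a for a, _ in covering if a < 0)
        w = weights[len(subset)] * (pos - neg)
        if abs(w) >= sigma:
            table[subset] = w
        if len(subset) == d:
            return
        # Supersets are covered by fewer SVs, so pos and neg can only shrink.
        if max(weights[len(subset) + 1:]) * max(pos, neg) < sigma:
            return
        for i in range(start, len(items)):
            narrowed = [(a, x) for a, x in covering if items[i] in x]
            if narrowed:
                grow(subset + (items[i],), i + 1, narrowed)

    grow((), 0, list(support_vectors))
    return table


APPROX_TABLE = mine_expansion_table(SUPPORT_VECTORS, C3, sigma=10)
for subset, w in sorted(APPROX_TABLE.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(subset, w)
```

With σ = 10 this keeps the same eight subsets as the slide. A larger σ gives a smaller table and faster classification at the cost of a coarser approximation of f(X).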

Experimental Settings • Three NL tasks • English Base-NP Chunking (EBC) • Japanese Word Segmentation (JWS) • Japanese Dependency Parsing (JDP) • Kernel Settings • Quadratic kernel is applied to EBC • Cubic kernel is applied to JWS and JDP

Results (English Base-NP Chunking)

Results (Japanese Word Segmentation)

Results (Japanese Dependency Parsing)

Results • 2- to 12-fold speed-up with PKI • 30- to 300-fold speed-up with PKE • Accuracy is preserved when an appropriate σ is set

Comparison with related work • XQK [Isozaki et al. 02] • Same concept as PKE • Designed only for the Quadratic Kernel • Exhaustively creates the expansion table • PKE • Designed for general Polynomial Kernels • Uses subset mining algorithms to create the expansion table

Conclusions • Proposed two fast methods for the polynomial kernel of degree d • PKI (Inverted) • PKE (Expanded) • 2- to 12-fold speed-up with PKI, 30- to 300-fold speed-up with PKE • Accuracy is preserved

Future Work • Examine the effectiveness on general machine learning datasets • Apply PKE to other convolution kernels • Tree Kernel [Collins 00] • Dot product between trees • Feature space is all sub-trees • Apply a sub-tree mining algorithm [Zaki 02]

English Base-NP Chunking • Extract non-overlapping noun phrases from text: [NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] . • BIO representation (treated as a tagging task) • B: beginning of a chunk • I: inside a chunk (non-initial) • O: outside any chunk • Pairwise method for the 3-class problem • Training: wsj15-18, test: wsj20 (standard set)
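
The BIO representation of the example sentence can be derived mechanically from the bracketed form. The sketch below assumes the bracket convention shown on the slide ("[NP" opens a chunk, "]" closes it); the function name is hypothetical.

```python
def bracketed_to_bio(text):
    """Convert '[NP ... ]' bracketing into (token, tag) pairs with tags B, I, O."""
    tagged = []
    inside = False  # currently within an NP chunk
    first = False   # next token starts the chunk
    # Pad closing brackets so that ']in' is split into ']' and 'in'.
    for token in text.replace("]", " ] ").split():
        if token == "[NP":
            inside, first = True, True
        elif token == "]":
            inside = False
        elif inside:
            tagged.append((token, "B" if first else "I"))
            first = False
        else:
            tagged.append((token, "O"))
    return tagged


SENTENCE = (
    "[NP He ] reckons [NP the current account deficit ] will narrow to "
    "[NP only # 1.8 billion ] in [NP September ] ."
)
for token, tag in bracketed_to_bio(SENTENCE):
    print(f"{token}/{tag}", end=" ")
print()
```

Each token then becomes one classification instance with three possible labels, which is what the pairwise method on the slide handles.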

Japanese Word Segmentation • Sentence: 太郎は花子に本を読ませた (Taro made Hanako read a book) • Each position between two adjacent characters is classified as a word boundary or not • Distinguish the relative position • Also use the character types of Japanese • Training: KUC 01-08, Test: KUC 09

Japanese Dependency Parsing • 私は ケーキを 食べる (I-top cake-acc. eat: "I eat a cake") • Identify the correct dependency relations between two bunsetsu (roughly, base phrases in English) • Linguistic features related to the modifier and head (word, POS, POS-subcategory, inflections, punctuation, etc.) • Binary classification (+1 dependent, -1 independent) • Cascaded Chunking Model [Kudo et al. 02] • Training: KUC 01-08, Test: KUC 09

Kernel Methods (1/2) • Suppose a learning task with a set of training examples • $X$: example to be classified • $X_i$: training examples • $\alpha_i$: weight for examples • $\phi$: a function that maps examples to another vector space

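PKE (Expanded Representation) • If we calculate $w(s) = c_d(|s|) \sum_{j=1}^{L} \alpha_j \, I(s \subseteq X_j)$ in advance for all subsets $s$ ($I$ is the indicator function), classification reduces to summing $w(s)$ over the subsets $s$ of $X$ with at most $d$ elements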

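TRIE representation • Approximated expansion table:
  s      {a}   {d}    {a,b}  {a,c}  {b,c}  {b,d}  {c,d}  {b,c,d}
  w(s)   10.5  -10.5  18     12     -12    -18    -24    -12
• Trie (one feature per edge; the weight is stored at the node that ends the subset):
  root
  ├─ a (10.5)
  │  ├─ b (18)
  │  └─ c (12)
  ├─ b
  │  ├─ c (-12)
  │  │  └─ d (-12)
  │  └─ d (-18)
  ├─ c
  │  └─ d (-24)
  └─ d (-10.5)
• Compress redundant structures • Classification can be done by simply traversing the TRIE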
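
Classification by trie traversal can be sketched as below. This is an illustrative nested-dictionary trie, not the compressed structure the slide refers to. The names are hypothetical, and the weight of {a, b} is taken as 12 · (1 + 0.5) = 18, following the formula for w.

```python
# Approximated expansion table (sigma = 10); keys are sorted feature tuples.
TABLE = {
    ("a",): 10.5, ("d",): -10.5,
    ("a", "b"): 18.0, ("a", "c"): 12.0,
    ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
    ("b", "c", "d"): -12.0,
}


def build_trie(table):
    """Insert each subset along a path of features in sorted order."""
    root = {"weight": 0.0, "children": {}}
    for subset, w in table.items():
        node = root
        for feature in sorted(subset):
            node = node["children"].setdefault(
                feature, {"weight": 0.0, "children": {}}
            )
        node["weight"] = w
    return root


def trie_score(x, root, d=3):
    """Sum the weights of all stored subsets of X by walking the trie once."""
    items = sorted(x)

    def walk(node, start, depth):
        total = node["weight"]
        if depth == d:
            return total
        for i in range(start, len(items)):
            child = node["children"].get(items[i])
            # A missing child means no stored subset continues with this feature.
            if child is not None:
                total += walk(child, i + 1, depth + 1)
        return total

    return walk(root, 0, 0)


TRIE = build_trie(TABLE)
print(trie_score({"a", "c", "e"}, TRIE))  # 22.5
```

For X = {a, c, e} the traversal visits only {a} and {a, c}, giving 22.5 instead of the exact 15; the sign, and therefore the predicted class, is unchanged in this example.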

