Parallel Clustering of English Verbs into Levin Classes
6.338/18.337 Final Project
Melanie Goetz, Andrew Hogue
May 13, 2004
Background
• Levin [1993] hand-classified verbs
• 3086 verbs into 264 classes (with overlaps)
• Utilized verb arguments and alternations
• E.g. “the glass broke” or “broke the glass”
• Classes correlated with semantic meaning of verbs
Our Approach
• Automatically classify verbs
• Build graph G with node for each word, edges if words appear in same sentence
• First, build bipartite graph with verbs and prepositions
• Extend with subject nouns, object nouns
• Use spectral partitioning to divide verbs into classes
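The graph construction above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each sentence has already been reduced to a list of verbs and a list of prepositions, and it weights each verb-preposition edge by the number of sentences in which the pair co-occurs (the slides do not say whether edges were weighted). The name `build_bipartite_graph` and the sample sentences are hypothetical.

```python
from collections import defaultdict

def build_bipartite_graph(sentences):
    """Build a verb-preposition co-occurrence graph.

    sentences: list of (verbs, prepositions) pairs, one pair per sentence.
    Returns a dict mapping (verb, preposition) -> number of sentences
    in which both appear. Edge weighting is an assumption of this sketch.
    """
    edges = defaultdict(int)
    for verbs, preps in sentences:
        # Use sets so a word repeated within one sentence counts once.
        for verb in set(verbs):
            for prep in set(preps):
                edges[(verb, prep)] += 1
    return dict(edges)

# Hypothetical input: three sentences, already reduced to verbs and prepositions.
sentences = [
    (["broke"], ["with"]),
    (["broke", "fell"], ["on"]),
    (["fell"], ["on", "from"]),
]
graph = build_bipartite_graph(sentences)
```

Extending the graph with subject and object nouns would, under the same assumptions, only add further node types to each sentence's word lists.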
Parallel Implementation
• Three components:
• Extract meaningful words from parsed corpus
• Merge per-processor sparse matrices without bringing data to front end
• Run parallel spectral partitioning on full graph
Parsing
• Embarrassingly parallel
• Wall Street Journal corpus of 99 documents
• Each processor separately extracts tree from corpus and relevant words from tree
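A rough sketch of the per-sentence extraction step is shown below. It assumes the parses are Penn Treebank-style bracketed strings, that verbs carry tags beginning with `VB`, and that prepositions carry the tag `IN` (which in that tagset also covers subordinating conjunctions). The function name `extract_words` and the two sample parses are made up for illustration. Each call depends only on its own input, which is what makes the step embarrassingly parallel: the documents can be split across processors in any way.

```python
import re

# Matches a leaf of a bracketed parse tree, e.g. "(VBD broke)".
# Internal nodes such as "(VP (VBD ..." do not match because the
# second group may not start with a parenthesis.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def extract_words(parse):
    """Return (verbs, prepositions) found in one bracketed parse string."""
    verbs, preps = [], []
    for tag, word in LEAF.findall(parse):
        if tag.startswith("VB"):      # VB, VBD, VBG, VBN, VBP, VBZ
            verbs.append(word.lower())
        elif tag == "IN":             # prepositions (assumed tag)
            preps.append(word.lower())
    return verbs, preps

# Hypothetical parses; each one can be handled independently.
parses = [
    "(S (NP (DT The) (NN glass)) (VP (VBD broke) (PP (IN on) (NP (DT the) (NN floor)))))",
    "(S (NP (PRP She)) (VP (VBD broke) (NP (DT the) (NN glass)) (PP (IN with) (NP (DT a) (NN hammer)))))",
]
per_sentence = [extract_words(p) for p in parses]
```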
Indexing
• Need to combine matrices from separate processors into one indexing scheme
• Bringing to front end is inefficient
• Solution: share “vocabulary lists” between processes
• Allows each process to use the same index for each word
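One way the shared-vocabulary idea could work is sketched below. The details are assumptions, not taken from the slides: each process is taken to hold a local word list plus a sparse matrix stored as `{(row, col): count}` in local indices, and the global index is built by sorting the union of all vocabulary lists so that every process derives the same numbering independently. The helper names `global_index`, `reindex`, and `merge` are hypothetical.

```python
def global_index(local_vocabs):
    """Union the per-process vocabulary lists and give each word one index.

    Sorting makes the result deterministic, so every process that sees
    the same lists computes the same mapping.
    """
    words = sorted(set().union(*local_vocabs))
    return {word: i for i, word in enumerate(words)}

def reindex(local_vocab, local_entries, index):
    """Rewrite a local sparse matrix {(row, col): count} into global indices."""
    return {
        (index[local_vocab[r]], index[local_vocab[c]]): n
        for (r, c), n in local_entries.items()
    }

def merge(matrices):
    """Sum sparse matrices that already share one indexing scheme."""
    total = {}
    for matrix in matrices:
        for key, n in matrix.items():
            total[key] = total.get(key, 0) + n
    return total

# Two hypothetical processes with different local orderings.
vocab_a = ["broke", "on"]
entries_a = {(0, 1): 2}             # broke-on seen twice
vocab_b = ["on", "fell", "broke"]
entries_b = {(1, 0): 1, (2, 0): 3}  # fell-on once, broke-on three times

index = global_index([vocab_a, vocab_b])
merged = merge([
    reindex(vocab_a, entries_a, index),
    reindex(vocab_b, entries_b, index),
])
```

Only the vocabulary lists need to be exchanged in this scheme; the matrix entries themselves can stay on the process that produced them until they are summed.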
Partitioning
• Based on specpart.m from Meshpart toolkit
• Serial version uses Cholesky decomposition
• Our parallel version uses eigs() function as we only need a few eigenvalues
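For readers unfamiliar with spectral partitioning, the sketch below shows the core idea: compute the eigenvector for the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector) and split the nodes by the sign of their entries. It is not the project's implementation and does not call `eigs()`. To stay dependency-free it substitutes a plain power iteration on a shifted Laplacian, and it assumes a connected graph with at least two nodes and a start vector that is not orthogonal to the Fiedler vector. The names `fiedler_vector` and `bisect` and the test graph are hypothetical.

```python
import math

def fiedler_vector(adj, iterations=500):
    """Approximate the Fiedler vector of a connected, unweighted graph.

    adj maps each node index 0..n-1 to the set of its neighbours.
    Runs power iteration on (c*I - L), where L is the graph Laplacian
    and c bounds its largest eigenvalue, while projecting out the
    constant vector (the eigenvector for eigenvalue 0 of L).
    """
    n = len(adj)
    c = 2 * max(len(nbrs) for nbrs in adj.values())  # Gershgorin bound on L
    x = [float(i) for i in range(n)]                 # deterministic start
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]                    # remove constant component
        # y = (c*I - L) x, using (L x)_i = deg(i) * x_i - sum of neighbour entries
        y = [(c - len(adj[i])) * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

def bisect(adj):
    """Split nodes into two parts by the sign of the Fiedler vector."""
    return [v >= 0 for v in fiedler_vector(adj)]

# Hypothetical graph: two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
side = bisect(adj)
```

On this example the split should separate the two triangles, cutting only the bridging edge. An iterative sparse eigensolver is the natural choice when only a few eigenvalues are needed; the power iteration here is just the simplest stand-in.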
Results
• Clustered 3317 sentences from Wall Street Journal corpus
• 2827 unique words
• Included subjects, verbs, objects, prepositions
Results - Indexing
Results - Partitioning
Results - Clustering
Future Work
• Parse other corpora (Project Gutenberg)
• Restrict word types to verb/preposition or subject/verb/object
• Other ways to use eigenvectors for partitioning into more than 2 parts