Combining Abstraction and Super-structuring on Macromolecular Sequence Classification
Adrian Silvescu, Cornelia Caragea, and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science
RECOMB 2009

Introduction: The choice of features used to describe the data presented to a learner, and the level of detail at which they describe the data, can have a major impact on the difficulty of learning and on the accuracy, complexity, and comprehensibility of the learned predictive model. The representation has to be rich enough to capture the distinctions that are relevant from the standpoint of learning, but not so rich as to make the task of learning infeasible.

Problem: Predict the subcellular localization for a protein sequence.
Example: 3-grams of the sample sequence: SIN, INQ, NQK, QKL, KLA, LAL, ALV, LVI, …

Previous Approaches to Feature Construction:
• Super-structuring: generating k-grams

Results: Comparison of super-structuring and abstraction (SS+ABS) with super-structuring and feature selection (SS+FSEL), super-structuring only (SS_ONLY), and unigram (UNIGRAM) on the Eukaryotes and Prokaryotes data sets.
[Figure panels: Eukaryotes 2-grams, Prokaryotes 2-grams, Eukaryotes 3-grams, Prokaryotes 3-grams]

[Figure: Class distributions induced by one of the m abstractions, and the class distributions induced by three 3-grams sampled from the abstraction on the Eukaryotes 3-gram data set, where (a) m=10; and (b) m=1000. The number of classes is 4.]

Constructing Abstractions over k-grams:
• greedy agglomerative procedure
• initially map each abstraction to a k-gram
• recursively group pairs of abstractions until m abstractions are obtained, e.g., m=2
[Figure: abstractions a1 … a11 built over 3-grams such as SIN, INQ, NQK, QKL, KLA]

Distance between Abstractions: Let X be a finite set, and consider two probability distributions over X. The weighted Jensen-Shannon divergence is given by: [equation not preserved in this text]
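The super-structuring and abstraction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses the standard weighted Jensen-Shannon divergence, JS(P, Q) = w_p KL(P || M) + w_q KL(Q || M) with M = w_p P + w_q Q. The way that divergence is scaled into a distance between abstractions (weights proportional to how often each abstraction occurs) is an assumption made here, since the poster's own formula does not appear in the text. All function names are hypothetical.

```python
from itertools import combinations
from math import log2


def kgrams(sequence, k):
    """Super-structuring: all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_js(p, q, w_p=0.5, w_q=0.5):
    """Standard weighted Jensen-Shannon divergence with positive weights
    that sum to 1."""
    mix = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl(p, mix) + w_q * kl(q, mix)


def abstraction_distance(counts_a, counts_b, total):
    """ASSUMED distance: JS divergence between the class distributions of
    two abstractions, weighted by their relative frequencies."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    p = [c / n_a for c in counts_a]
    q = [c / n_b for c in counts_b]
    w_a = n_a / (n_a + n_b)
    return (n_a + n_b) / total * weighted_js(p, q, w_a, 1 - w_a)


def greedy_abstractions(class_counts, m):
    """Greedy agglomerative grouping of k-grams into m abstractions.

    class_counts maps each k-gram to its per-class occurrence counts
    (every k-gram is assumed to occur at least once).
    """
    total = sum(sum(c) for c in class_counts.values())
    # Initially, each abstraction holds exactly one k-gram.
    clusters = [({g}, list(c)) for g, c in class_counts.items()]
    while len(clusters) > m:
        # Find the closest pair (i < j) under the assumed distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: abstraction_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], total
            ),
        )
        # Pop j first so that index i stays valid.
        members_j, counts_j = clusters.pop(j)
        members_i, counts_i = clusters[i]
        merged = [a + b for a, b in zip(counts_i, counts_j)]
        clusters[i] = (members_i | members_j, merged)
    return [members for members, _ in clusters]
```

For example, `kgrams("SINQKLA", 3)` yields the 3-grams SIN, INQ, NQK, QKL, KLA. The pairwise search is quadratic in the number of abstractions per merge, which is fine for illustration but would need a smarter data structure for large k-gram vocabularies.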
Then, the distance between two abstractions is defined as follows: [equation not preserved in this text] where Y is the class variable.

Example sequence: SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA

• Abstraction: grouping similar features to generate more abstract features

Our Approach:
• Combining super-structuring and abstraction to construct new features

Feature selection:
• alternative approach to reducing the number of k-grams to m k-grams
• we used mutual information between the class variable and k-grams to rank the k-grams

Data sets:
• Eukaryotes contains 2,427 protein sequences classified into one of four classes
• Prokaryotes contains 997 protein sequences classified into one of three classes

[Figure panels: 10 Abstractions, 1000 Abstractions]

Conclusions: We have shown that:
• combining super-structuring and abstraction makes it possible to construct predictive models that use a significantly smaller number of features than those obtained using super-structuring alone.
• abstraction in combination with super-structuring yields better-performing models than those obtained by feature selection in combination with super-structuring.

Acknowledgements: This work is supported in part by a grant from the National Science Foundation (NSF 0711356) to Vasant Honavar.
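The mutual-information ranking mentioned under Feature selection could be sketched as follows. This is an illustrative sketch only. It assumes each k-gram is treated as a binary feature (present or absent in a sequence), which the poster does not specify, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(present, labels):
    """Mutual information (in bits) between a binary presence feature
    and the class labels, estimated from empirical frequencies."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    p_x = Counter(present)
    p_y = Counter(labels)
    return sum(
        (c / n) * log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in joint.items()
    )


def rank_kgrams(sequences, labels, k, m):
    """Return the m k-grams with the highest mutual information with the
    class variable. ASSUMPTION: a k-gram's feature value is whether it
    occurs anywhere in the sequence."""
    vocab = sorted(
        {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    )
    scores = {
        g: mutual_information([g in s for s in sequences], labels)
        for g in vocab
    }
    return sorted(vocab, key=lambda g: scores[g], reverse=True)[:m]
```

Unlike abstraction, which merges k-grams into groups, this keeps m individual k-grams and discards the rest.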