Concept and Theme Discovery through Probabilistic Models and Clustering
Qiaozhu Mei
Oct. 12, 2005
Concepts and Themes
• Language units in biology literature mining:
  • Terms
  • Phrases
  • Entities
  • Concepts (tight groups of terms/entities representing semantics, e.g. gene synonyms)
  • Themes (loose groups of terms representing topics/subtopics)
Theme Discovery
• What we’ve got now:
  • A generative model to extract k themes from a collection
  • Each theme is a language model, represented by its top-probability words
  • KL divergence to model the distance/similarity between themes
    • Retrieve the themes most similar to a term group
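The KL-divergence comparison above can be sketched as follows. This is a minimal illustration, assuming each theme is stored as a dictionary from words to probabilities; the names `kl_divergence` and `most_similar_themes`, the smoothing constant `eps`, and the toy themes are assumptions for illustration and do not come from the slides.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union vocabulary of two word distributions.

    eps is an assumed smoothing constant so that words missing from one
    theme do not cause log(0) or division by zero.
    """
    vocab = set(p) | set(q)
    # Smooth, then renormalize both distributions over the shared vocabulary.
    p_s = {w: p.get(w, 0.0) + eps for w in vocab}
    q_s = {w: q.get(w, 0.0) + eps for w in vocab}
    p_z, q_z = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[w] / p_z) * math.log((p_s[w] / p_z) / (q_s[w] / q_z))
        for w in vocab
    )

def most_similar_themes(query, themes, top_n=1):
    """Rank theme names by ascending KL(query || theme)."""
    ranked = sorted(themes, key=lambda name: kl_divergence(query, themes[name]))
    return ranked[:top_n]

# Toy theme language models (hypothetical probabilities).
themes = {
    "mites": {"varroa": 0.5, "mite": 0.3, "brood": 0.2},
    "learning": {"learning": 0.4, "brain": 0.4, "memory": 0.2},
}
# A term group treated as a uniform distribution over its terms.
query = {"mite": 0.5, "varroa": 0.5}
print(most_similar_themes(query, themes))
```

KL divergence is asymmetric, so the direction (query against theme, or the reverse) is a design choice; a symmetrized variant would average both directions.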
Theme Discovery (cont.)
• What we’ve got now (cont.):
  • Use an HMM to segment the whole collection with the extracted themes
  • Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)
• Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
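As a rough sketch of the MMR step, the code below greedily selects phrases by trading off a relevance score against edit-distance similarity to phrases already chosen. The trade-off weight `lam`, the scaling of edit distance to [0, 1], and the toy scores are assumptions; the slides do not give these details.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit-distance similarity scaled to [0, 1]; 1 means identical."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(relevance, k, lam=0.5):
    """Greedily pick k phrases, balancing relevance against redundancy.

    relevance: dict phrase -> score (e.g. an n-gram probability).
    lam: assumed trade-off weight; 1.0 ignores redundancy entirely.
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(p):
            # Redundancy is the highest similarity to any chosen phrase.
            redundancy = max((similarity(p, s) for s in selected), default=0.0)
            return lam * relevance[p] - (1 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores for three candidate phrases.
phrases = {"juvenile hormone": 0.9, "juvenile hormones": 0.85, "vibration signal": 0.6}
print(mmr_select(phrases, k=2))
```

With `lam=1.0` the near-duplicate "juvenile hormones" is picked second; lowering `lam` penalizes it for being one edit away from the first pick.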
Some justifications
• Fly collection:
  • Cluster 0: circadian
  • Cluster 1: adh, evolution
  • Cluster 2: a mixture of two topics, apoptosis and promoters
  • Cluster 6: brain development
  • Cluster 8: cell division
  • Cluster 12: Drosophila immunity
  • Cluster 13: nervous systems
  • Cluster 14: hedgehog segment polarity gene
  • Cluster 16: Histone, Polycomb
  • Cluster 17: visual system
Theme Discovery (cont.)
• Problems:
  • How to select k? (How many themes do we believe there are in the collection? The bee collection should have a smaller k than the fly collection.)
  • Can we find themes in a hierarchical manner?
    • This could solve the former problem; however, when do we cut off?
  • How to represent a theme?
    • Top words: the semantics are sometimes difficult to tell
    • Phrases?
    • Sentences?
  • Other possible approaches to extract themes? (LDA, clustering methods)
Hierarchical Theme Discovery
• A straightforward approach (top-down splitting):
  • Discover k themes from the initial collection
  • Segment the collection by the k themes
  • For each theme, build a sub-collection from the segments of the previous step
  • For each sub-collection, extract k’ themes
  • Repeat these steps iteratively
• Problem: when to stop splitting?
Diagram: Collection → Theme1, Theme2, Theme3; Theme2 → Theme2.1, Theme2.2, Theme2.3; ……
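The top-down splitting loop can be sketched as a short recursion. Everything model-specific here is a toy stand-in: `top_word_themes` replaces the generative theme model with the k most frequent words, and `assign_to_theme` replaces HMM segmentation by assigning whole documents to one theme. The stopping rules `max_depth` and `min_docs` are assumptions, since the slides leave the stopping question open.

```python
from collections import Counter

def top_word_themes(docs, k, exclude):
    """Toy stand-in for theme extraction: the k most frequent unused words."""
    counts = Counter(w for d in docs for w in d.split() if w not in exclude)
    return [w for w, _ in counts.most_common(k)]

def assign_to_theme(doc, themes):
    """Toy stand-in for segmentation: index of the theme word most frequent in doc."""
    words = doc.split()
    return max(range(len(themes)), key=lambda i: words.count(themes[i]))

def split_top_down(docs, k, max_depth, min_docs=2, depth=0, used=frozenset()):
    """Recursively split a collection into a tree of themes."""
    # Assumed stopping rules: depth limit or too few documents.
    if depth >= max_depth or len(docs) < min_docs:
        return []
    themes = top_word_themes(docs, k, used)
    if not themes:
        return []
    # Build one sub-collection per theme.
    buckets = [[] for _ in themes]
    for d in docs:
        buckets[assign_to_theme(d, themes)].append(d)
    used = used | set(themes)
    return [
        {"theme": t, "size": len(b),
         "children": split_top_down(b, k, max_depth, min_docs, depth + 1, used)}
        for t, b in zip(themes, buckets)
    ]

docs = ["mite varroa mite", "mite brood", "pollen flower pollen", "pollen seed"]
tree = split_top_down(docs, k=2, max_depth=2)
for node in tree:
    print(node["theme"], [c["theme"] for c in node["children"]])
```

Swapping in a real theme model and segmenter would only change the two helper functions; the recursion and the stopping question stay the same.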
Hierarchical Theme Discovery (results)
• A bee collection with 929 documents
• Level 1: 5 themes
• Level 2: 3 sub-themes for each higher-level theme
Hierarchical Theme Discovery (results)
Level 1 themes (top words):
• Theme1: african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
• Theme2: queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
• Theme3: pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
• Theme4: learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
• Theme5: varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality
Hierarchical Theme Discovery (results)
Level 2 sub-themes of Theme1 (african, jelly, royal, european, venom):
• Theme1.1: venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality
• Theme1.2: african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
• Theme1.3: larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates
Hierarchical Theme Discovery (results)
Level 2 sub-themes of Theme2 (queen, workers, worker, signal, jh):
• Theme2.1: food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
• Theme2.2: queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
• Theme2.3: mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg
Hierarchical Theme Discovery (results)
Level 2 sub-themes of Theme3 (pollinator, plants, pollination, flowers, plantae):
• Theme3.1: ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
• Theme3.2: seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
• Theme3.3: pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis
Hierarchical Theme Discovery (results)
Level 2 sub-themes of Theme4 (learning, brain, conditioning, olfactory, neural):
• Theme4.1: bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
• Theme4.2: dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
• Theme4.3: imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein
Hierarchical Theme Discovery (results)
Level 2 sub-themes of Theme5 (varroa, mite, mites, jacobsoni, acarina):
• Theme5.1: mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
• Theme5.2: viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positives p apv entomopathogen
• Theme5.3: pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal
Phrase Representations:
biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour s gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml
Hierarchical Theme Discovery (cont.)
• A bottom-up agglomerative approach:
  • Find many micro-themes
  • Group similar micro-themes into larger ones
• Borrow a strategy from data mining:
  • BIRCH: incrementally form many micro-clusters, organized in a tree structure
  • Macro-clustering based on the micro-clusters
• Problem: again, when to stop?
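The bottom-up idea can be illustrated with plain greedy agglomerative merging. This is not BIRCH (there is no incrementally built tree of micro-clusters); it only shows the macro step of grouping similar micro-themes. The cosine similarity, the weighted-average merge, and the stopping threshold `min_similarity` are all assumptions made for this sketch.

```python
import math
from itertools import combinations

def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
        math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def merge(p, q, wp, wq):
    """Weighted average of two word distributions."""
    total = wp + wq
    return {w: (wp * p.get(w, 0.0) + wq * q.get(w, 0.0)) / total
            for w in set(p) | set(q)}

def agglomerate(micro_themes, min_similarity=0.3):
    """Merge the most similar pair until no pair reaches min_similarity.

    micro_themes: list of (distribution, weight) pairs, where weight is an
    assumed size measure such as the number of segments in the micro-theme.
    """
    clusters = list(micro_themes)
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: cosine(clusters[ij[0]][0], clusters[ij[1]][0]))
        if cosine(clusters[i][0], clusters[j][0]) < min_similarity:
            break  # assumed stopping rule
        (p, wp), (q, wq) = clusters[i], clusters[j]
        merged = (merge(p, q, wp, wq), wp + wq)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical micro-themes, each with weight 1.
micro = [
    ({"varroa": 0.6, "mite": 0.4}, 1),
    ({"mite": 0.7, "brood": 0.3}, 1),
    ({"pollen": 0.5, "flower": 0.5}, 1),
]
result = agglomerate(micro)
print([(sorted(dist), weight) for dist, weight in result])
```

The threshold makes the "when to stop?" question explicit: it is a free parameter here, which is exactly the open problem the slide raises.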
Hierarchical Theme Discovery (cont.)
• Model-based approach:
  • Hofmann, IJCAI 99
  • Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of GO hierarchies)
• Problem: in most cases we don’t know the hierarchies.
Other Research Problems
• Representing a theme:
  • Using top words: where to cut?
  • Using phrases: MMR has to be tuned (many possible strategies and parameters)
  • Using sentences? Similar to summarization
• Themes are interesting… but how do we make use of them?
• How to evaluate themes?
Concept Extraction
• What we have now:
  • N-gram algorithm (actually 2-gram): iteratively group the pair of terms most likely to be replaceable by each other, considering the context of one term before/after each
  • Time complexity: O(N^3); space complexity: currently O(N^2). The Beespace server can currently handle <= 9000 terms (2.4 GB of memory). (Performance not evaluated, since only small data sizes can be handled.)
• Problem: being based on mutual information, it prefers low-frequency 2-grams, and it makes no use of more distant context.
• Will removing stop words help or hurt performance?
Some findings:
• A small dataset (200+ abstracts containing gene synonyms)
• Only 600 iterations (600 merges)
• Most of the merged pairs are reasonable, but not really useful
• E.g. head-to-head, tail-to-tail
• E.g. within-locus, between-locus
• FBgn0000017: Dsrc, Dabl
• FBgn0000078: amylase-null, AMY-null
• Problem: the document set is too small and the n-grams too sparse to find useful concepts.
Concept Extraction (cont.)
• Other possible strategies:
  • Lin et al., KDD 02: use a feature vector to represent each term, with weights given by the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are used as context features, this will be similar to what we have.)
  • Use a committee to represent a cluster, which ensures the clusters are tight and robust.
• Problem: not sure how to select features
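The feature-vector idea can be sketched with pointwise mutual information (PMI) weights over left/right neighbor features, which corresponds to the 2-gram context case mentioned above. This shows only the term representation and a similarity measure, not the committee-based clustering. The feature encoding `("L", word)` / `("R", word)`, the choice to keep only positive PMI values, and the toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(sentences):
    """Map each term to {context feature: PMI weight}.

    Features are ("L", previous word) and ("R", next word).
    """
    pair_counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if i > 0:
                pair_counts[(t, ("L", toks[i - 1]))] += 1
            if i + 1 < len(toks):
                pair_counts[(t, ("R", toks[i + 1]))] += 1
    total = sum(pair_counts.values())
    term_counts, feat_counts = Counter(), Counter()
    for (t, f), c in pair_counts.items():
        term_counts[t] += c
        feat_counts[f] += c
    vectors = defaultdict(dict)
    for (t, f), c in pair_counts.items():
        pmi = math.log(c * total / (term_counts[t] * feat_counts[f]))
        if pmi > 0:  # assumed choice: keep positive associations only
            vectors[t][f] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy sentences (hypothetical); Dsrc and Dabl share identical contexts.
sentences = [
    "the Dsrc gene is expressed",
    "the Dabl gene is expressed",
    "bees collect pollen daily",
]
vecs = pmi_vectors(sentences)
print(cosine(vecs["Dsrc"], vecs["Dabl"]))
print(cosine(vecs["Dsrc"], vecs["pollen"]))
```

Adding other feature types (for example, words within a wider window) only changes how `pair_counts` is filled, which is where the open feature-selection question enters.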
Summary
• Theme extraction:
  • Generally performs well, if we can find a good k
  • Hierarchical clustering can solve this problem, but we still need a reasonable stopping criterion
  • Representation is an interesting problem: MMR phrase extraction should be tuned further
  • Difficult to evaluate other than by expert judgment
• Concept extraction:
  • The n-gram approach has space constraints, so its performance hasn’t really been tested; in general, performance should be better on large data sets
  • Other clustering algorithms can be explored