Topic Extraction from Biology Literature: Prior, Labeling, and Switching
Qiaozhu Mei
A Sample Topic
Meaningful labels: actin filaments; flight muscle; flight muscles
Word distribution (language model): filaments 0.0410238; muscle 0.0327107; actin 0.0287701; z 0.0221623; filament 0.0169888; myosin 0.0153909; thick 0.00968766; thin 0.00926895; sections 0.00924286; er 0.00890264; band 0.00802833; muscles 0.00789018; antibodies 0.00736094; myofibrils 0.00688588; flight 0.00670859; images 0.00649626
Example documents
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
Topic/Theme Extraction
• A theme/topic is represented with a multinomial distribution over words
  • Unigram language models
  • Easier to interpret
  • Easy to add prior
  • Easy for retrieval
• Assumption:
  • K themes in a collection
  • A document covers multiple themes
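The two assumptions above can be illustrated with a minimal Python sketch. The theme names, word probabilities, and coverage weights below are hypothetical toy values, not numbers from the slides; the only point is that each theme is a unigram distribution and a document mixes several themes.

```python
import math

# Hypothetical toy themes: each one is a multinomial (unigram language model).
themes = {
    "muscle": {"filaments": 0.5, "muscle": 0.3, "actin": 0.2},
    "foraging": {"foraging": 0.5, "nectar": 0.3, "food": 0.2},
}

def word_prob(word, coverage, themes):
    """p(w | d) = sum over themes j of coverage[j] * p(w | theme j)."""
    return sum(pi * themes[name].get(word, 0.0) for name, pi in coverage.items())

def log_likelihood(doc_words, coverage, themes, floor=1e-12):
    """Log-likelihood of a document under the mixture (floored to avoid log 0)."""
    return sum(math.log(max(word_prob(w, coverage, themes), floor))
               for w in doc_words)

# A document that covers both themes, mostly the first one (assumed weights).
coverage = {"muscle": 0.75, "foraging": 0.25}
p_actin = word_prob("actin", coverage, themes)
p_nectar = word_prob("nectar", coverage, themes)
```

A word that belongs to only one theme still gets non-zero probability in a mixed document, scaled by how much of the document that theme covers.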
Topic Extraction vs. Clustering
• Topic extraction:
  • Effective at revealing the latent topics and finding the documents most relevant to a topic
  • Better interpretation, worse accuracy
  • Easy to add priors (to control the topics)
• Clustering algorithms:
  • Effective at assigning documents to non-overlapping clusters
  • Better accuracy, worse interpretation
  • Hard to control
Topic Extraction (Results)
Word distribution: corpora 0.0438967; allata 0.0315774; hormone 0.0249687; juvenile 0.0184049; insulin 0.0174549; embryos 0.0165997; neurosecretory 0.0127734; embryo 0.0124167; biosynthesis 0.0118067; cardiaca 0.00969471; sexta 0.0088941; medium 0.00865245; iran 0.00703376; mannose 0.00668768; volume 0.00661038; synapse 0.00652483; injected 0.00636151
Related documents: 44 biosis:199598006316; 44 biosis:200000292072; 44 biosis:199293065558; 44 biosis:199799595920; 44 biosis:199395062782
Example document: stimulatory effect of octopamine on juvenile hormone biosynthesis in honey bees (apis mellifera): physiological and immunocytochemical evidence
• May want a more general topic
• How to tell the algorithm to find a more general topic, like “behavioral maturation”?
Topic Extraction (Results cont.)
Word distribution: pollen 0.467911; foraging 0.0373205; foragers 0.0365857; collected 0.0318249; grains 0.0314324; loads 0.025104; collection 0.0208903; nectar 0.0185726; sources 0.0113751; collecting 0.00999529; types 0.00978636; pellets 0.00942175; germination 0.00733012; load 0.00646375; stored 0.00599516; amount 0.00481306; trips 0.00478013
Related documents: 13 biosis:200200039990; 13 biosis:199900297835; 13 biosis:200100318017; 13 biosis:199497516580; 13 biosis:200000045397
Example document: the response of the stingless bee melipona beecheii to experimental pollen stress, worker loss and different levels of information input
• Biased towards “pollen”
• Not precisely covering “foraging”
• How to tell the algorithm to focus on “foraging”?
Topic Extraction (Full Results)
• 100 topics from biosis-bee: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic.html
• 5 themes for query “food” in biosis-bee; 500 documents: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-food-5-basic.html
Incorporating Topic Priors
• Neither topic extraction nor clustering can guarantee that the themes are the expected ones
• User exploration: the user usually has a preference
  • E.g., wants one topic/cluster to be about foraging behavior
• Use a prior to guide the theme extraction
• Prior as a simple language model
  • E.g. forage 0.2; foraging 0.3; food 0.05; etc.
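Incorporating Topic Priors
• Original EM: [update formula shown on the slide; not recoverable from the extracted text]
• Prior: a language model, interpreted as pseudo counts
• EM with prior: [update formula shown on the slide; not recoverable from the extracted text]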
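Since the slide's formulas did not survive extraction, the following Python sketch shows one common way a language-model prior can enter EM as pseudo counts. It assumes a PLSA-style mixture model; the function name `plsa_em`, the strength parameter `mu`, and the toy documents are illustrative assumptions, not the author's implementation. With `mu=0` or no prior, the M-step reduces to the ordinary update.

```python
import random

def plsa_em(docs, k, priors=None, mu=0.0, iters=50, seed=0):
    """docs: list of {word: count}. priors: {topic index: {word: prob}}.
    Returns (topics, coverage) with topics[j][w] = p(w | theme j)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    priors = priors or {}
    topics = []
    for _ in range(k):  # random, normalised initial word distributions
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        topics.append({w: v / norm for w, v in raw.items()})
    coverage = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        word_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        doc_counts = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: posterior over themes for word w in document d
                post = [coverage[d][j] * topics[j][w] for j in range(k)]
                norm = sum(post)
                if norm == 0.0:
                    continue
                for j in range(k):
                    share = c * post[j] / norm
                    word_counts[j][w] += share
                    doc_counts[d][j] += share
        for j in range(k):
            # M-step: expected counts plus mu * prior(w) as pseudo counts
            prior = priors.get(j, {})
            numer = {w: word_counts[j][w] + mu * prior.get(w, 0.0) for w in vocab}
            norm = sum(numer.values())
            if norm > 0.0:
                topics[j] = {w: v / norm for w, v in numer.items()}
        for d in range(len(docs)):
            norm = sum(doc_counts[d])
            if norm > 0.0:
                coverage[d] = [v / norm for v in doc_counts[d]]
    return topics, coverage

# Hypothetical toy collection; theme 0 is steered towards foraging.
docs = [{"foraging": 3, "food": 2}, {"gene": 3, "dna": 2},
        {"foraging": 2, "nectar": 2}, {"gene": 2, "sequence": 2}]
topics, coverage = plsa_em(docs, k=2,
                           priors={0: {"foraging": 0.5, "food": 0.5}}, mu=100.0)
```

A larger `mu` makes the prior dominate the data, while a small `mu` only nudges the theme; how to set it is a tuning choice the slides do not specify.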
Incorporating Topic Priors (results)
Prior: forage 0.1; foraging 0.1; food 0.1; source 0.1
Word distribution: foraging 0.0498044; food 0.0472535; foragers 0.0310718; dance 0.0266078; source 0.0254369; nectar 0.0162739; distance 0.0141869; forage 0.0141503; information 0.0129047; dances 0.012684; hive 0.0124987; landmarks 0.0119087; dancing 0.0109375; waggle 0.0101672; feeder 0.0101266; rate 0.0085641; sources 0.00825884; recruitment 0.00813717; forager 0.00796914
Incorporating Topic Priors (results: cont.)
Prior: labor 0.2; division 0.2
Word distribution: age 0.0672687; division 0.0551497; labor 0.052136; colony 0.038305; foraging 0.0357817; foragers 0.0236658; workers 0.0191248; task 0.0190672; behavioral 0.0189017; behavior 0.0168805; older 0.0143466; tasks 0.013823; old 0.011839; individual 0.0114329; ages 0.0102134; young 0.00985875; genotypic 0.00963096; social 0.00883439
Incorporating Topic Priors (results: cont.)
Prior: brain 0.1; predict 0.1; gene 0.1; expresion 0.1
Word distribution: gene 0.0648303; expression 0.0486273; sequence 0.0407999; sequences 0.0311126; brain 0.0233977; drosophila 0.020891; cdna 0.0186153; predict 0.0166939; expressed 0.0166521; amino 0.0126359; dna 0.010655; genome 0.0101629; conserved 0.0098135; bp 0.00908649; nucleotide 0.00906794; phylogenetic 0.00887771; encoding 0.00866418; melanogaster 0.00798409
Incorporating Topic Priors (results: cont.)
Prior: behavioral 0.2; maturation 0.2
Word distribution: behavioral 0.110674; age 0.0789419; maturation 0.057956; task 0.0318285; division 0.0312101; labor 0.0293371; workers 0.0222682; colony 0.0199028; social 0.0188699; behavior 0.0171008; performance 0.0117176; foragers 0.0110682; genotypic 0.0106029; differences 0.0103761; polyethism 0.00904816; older 0.00808171; plasticity 0.00804363; changes 0.00794045
Incorporating Topic Priors (Full results)
• 30 topics from biosis-bee (first 7 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior.html
• 30 topics from biosis-bee (first 2 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior3.html
Labeling a Topic
• Themes (topic models) can be hard to interpret.
• Giving meaningful labels to a topic is hard.
What is a Good Label?
• Suggests the theme (relevance)
• Understandable: phrases?
• High coverage inside the topic
  • A theme is often a mixture of concepts
• Discriminative across topics
  • A theme is usually in the context of k topics
• …
Our Method
• Guarantee understandability with a pre-processing step
  • Use phrases as candidate topic labels
  • Other possible choices: entities
• Satisfy relevance, coverage, and discriminability with a probabilistic framework
Good labels = Understandable + Relevant + High Coverage + Discriminative
Labeling a Topic: Candidate Labels
• Phrase generation:
  • Statistically significant 2-grams
  • Hypothesis testing
  • T-test used; ranked by t-score
• Other choices?
  • Entities?
  • Behavior ontology?
  • GO: hard to use, because its terms are not real phrases from the literature.
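The phrase-generation step can be sketched as follows. This is a minimal illustration of the standard t-test for bigram collocations (observed bigram rate against the rate expected if the two words were independent), not necessarily the exact procedure used for the slides; the function name, the `min_count` cutoff, and the toy sentence are assumptions.

```python
import math
from collections import Counter

def bigram_t_scores(tokens, min_count=2):
    """Rank adjacent word pairs by t-score, highest first."""
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    if n_bigrams < 1:
        return []
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue
        observed = c12 / n_bigrams
        # Expected rate under independence of the two words
        expected = (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
        variance = observed * (1.0 - observed)  # Bernoulli variance
        if variance == 0.0:
            continue
        scores[(w1, w2)] = (observed - expected) / math.sqrt(variance / n_bigrams)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy text.
text = ("flight muscle of the bee flight muscle and the thick "
        "filaments of the flight muscle")
ranked = bigram_t_scores(text.split())
```

On a real corpus the `min_count` cutoff would typically be higher, since t-scores for rare pairs are unreliable.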
Labeling a Topic: Semantic Relevance
• Zero-order: use phrases which cover the top words well:
[Figure: a latent topic with top words clustering, dimensional, algorithm, …, birch, …, shape, …, body. Good label: “clustering algorithm”. Bad label: “body shape”.]
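One plausible way to score how well a phrase covers a topic's top words is to sum, over the words in the phrase, the log ratio of the word's probability in the topic to its probability in a background model. The sketch below follows that assumption; the scoring function, the floor value, and all probabilities are hypothetical and are not taken from the slide.

```python
import math

def zero_order_score(label, topic, background, floor=1e-6):
    """Sum of log(p(w | topic) / p(w | background)) over the label's words."""
    return sum(math.log(topic.get(w, floor) / background.get(w, floor))
               for w in label.split())

# Hypothetical toy distributions over the words shown in the figure.
topic = {"clustering": 0.2, "algorithm": 0.1, "dimensional": 0.05,
         "birch": 0.02, "shape": 0.001, "body": 0.0005}
background = {"clustering": 0.01, "algorithm": 0.02, "dimensional": 0.01,
              "birch": 0.001, "shape": 0.01, "body": 0.02}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
```

Dividing by the background probability keeps very common words from dominating the score, which is a design choice of this sketch.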
Labeling a Topic: Semantic Relevance (cont.)
• First-order: use phrases with similar context:
[Figure: the topic's word distribution P(w|θ) over clustering, dimension, partition, algorithm, …, join, hash is compared with the context distribution P(w|l) of each candidate label (context: SIGMOD Proceedings) using the divergence D(θ||l). Good label: “clustering algorithm”. Bad label: “hash join”.]
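The first-order comparison can be sketched with a KL divergence between the topic distribution and each label's context distribution, preferring the label with the smallest divergence. How P(w|l) is estimated is not stated on the slide, so the context distributions below are hypothetical toy values, as are the function names and the smoothing floor.

```python
import math

def kl_divergence(p, q, floor=1e-9):
    """D(p || q) over the support of p; q is floored to avoid division by zero."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), floor))
               for w, pw in p.items() if pw > 0.0)

def rank_labels(topic, label_contexts):
    """Order candidate labels from smallest to largest divergence."""
    return sorted(label_contexts,
                  key=lambda label: kl_divergence(topic, label_contexts[label]))

# Hypothetical toy topic P(w | theta) and label contexts P(w | l).
topic = {"clustering": 0.4, "algorithm": 0.3, "partition": 0.2,
         "hash": 0.05, "join": 0.05}
label_contexts = {
    "hash join": {"hash": 0.4, "join": 0.4, "algorithm": 0.1,
                  "clustering": 0.05, "partition": 0.05},
    "clustering algorithm": {"clustering": 0.35, "algorithm": 0.35,
                             "partition": 0.2, "hash": 0.05, "join": 0.05},
}
ranking = rank_labels(topic, label_contexts)
```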
Labeling a Topic (results)
Word distribution: female 0.0892427; females 0.0856834; male 0.0854142; males 0.0812643; sex 0.0577668; reproductive 0.0214618; ratio 0.0142873; alleles 0.0133912; diploid 0.0125172; offspring 0.0120271; sexes 0.0116374; investment 0.0115359; mating 0.00902159; number 0.00823397; success 0.00785498; sexual 0.00751456; determination 0.00663546; size 0.00633002
Labels:
• sex ratio (2.49468) (32)
• male female (2.29508) (51)
• sex determination (2.16534) (21)
• female flowers (1.83686) (23)
• sex alleles (1.79415) (16)
• multiple mating (1.72684) (19)
Labeling a Topic (results cont.)
Word distribution: hormone 0.0536175; jh 0.0518038; juvenile 0.0466941; development 0.0387031; larval 0.0276814; hemolymph 0.0216493; pupal 0.0189934; stage 0.0188286; glands 0.0173832; larvae 0.0169996; adult 0.0154695; instar 0.0149492; haemolymph 0.0140053; vitellogenin 0.0131076; caste 0.0124822; protein 0.0116558; glucose 0.0112673; corpora 0.0105111
Labels:
• juvenile hormone 2.44992 117
• hormone jh 1.58432 49
• larval instar 1.53676 20
• worker larvae 1.52398 51
• corpora allata 1.50391 34
Labeling a Topic (results)
Prior: 0 forage 0.1; 0 foraging 0.1; 0 food 0.1; 0 source 0.1
Word distribution: foraging 0.0498044; food 0.0472535; foragers 0.0310718; dance 0.0266078; source 0.0254369; nectar 0.0162739; distance 0.0141869; forage 0.0141503; information 0.0129047; dances 0.012684; hive 0.0124987; landmarks 0.0119087; dancing 0.0109375; waggle 0.0101672; feeder 0.0101266; rate 0.0085641; recruitment 0.00813717; forager 0.00796914
Labels:
• food source -6.72378 107
• nectar foraging -7.11784 28
• nectar foragers -7.58965 47
• nectar source -7.78975 16
• food sources -7.8487 72
• waggle dance -8.21514 31
Labeling a Topic (full results)
• 100 topics from biosis-bee (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic-l.html
• 100 topics from biosis-fly-genetics (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/fly-100-l.html
Context Switching
• Use topic extraction for concept switching (two possible ways):
  • Label the same topic model with phrases from another context
  • Use the topic model from context A as a prior to extract topics from context B
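For the second option, a topic extracted from context A has to be turned into a prior language model before it can steer extraction in context B. One simple way, shown here as an assumption rather than the method behind the slides, is to keep the topic's top words and renormalize them. The helper name `topic_to_prior` and the cutoff `top_n` are hypothetical; the sample probabilities are the first five words of the foraging topic in the slides.

```python
def topic_to_prior(topic, top_n=10):
    """Keep the top_n most probable words of a topic and renormalize to sum to 1."""
    top = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    norm = sum(p for _, p in top)
    return {w: p / norm for w, p in top}

# First five words of the foraging topic from the bee context.
bee_topic = {"foraging": 0.142473, "foragers": 0.0582921, "forage": 0.0557498,
             "food": 0.0393453, "nectar": 0.03217}
prior_for_fly = topic_to_prior(bee_topic, top_n=3)
```

The resulting dictionary has the same shape as the hand-written priors earlier in the deck (for example forage 0.1; foraging 0.1), so it could be passed to a prior-aware extraction step in the other context.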
Word distribution: foraging 0.142473; foragers 0.0582921; forage 0.0557498; food 0.0393453; nectar 0.03217; colony 0.019416; source 0.0153349; hive 0.0151726; dance 0.013336; forager 0.0127668; information 0.0117961; feeder 0.010944; rate 0.0104752; recruitment 0.00870751; individual 0.0086414; reward 0.00810706; flower 0.00800705; dancing 0.00794827; behavior 0.00789228
Labels with bee context:
• foraging trip 2.31174 21
• nectar foragers 2.23428 47
• tremble dance 2.21407 10
• returning foragers 2.18954 16
• food sources 2.14453 72
• food source 2.13647 107
• foraging strategy 2.101 14
• individual foraging 2.08334 16
• waggle dance 2.07836 31
Labels with fly context:
• foraging behavior 2.45263 27
• age related 2.29676 20
• drosophila larvae 2.15361 67
• feeding rate 1.99218 17
• apis mellifera 1.9847 23
• diptera drosophilidae 1.9 25
First word distribution (the same topic as on the previous slide): foraging 0.142473; foragers 0.0582921; forage 0.0557498; food 0.0393453; nectar 0.03217; colony 0.019416; source 0.0153349; hive 0.0151726; dance 0.013336; forager 0.0127668; information 0.0117961; feeder 0.010944; rate 0.0104752; recruitment 0.00870751; individual 0.0086414; reward 0.00810706; flower 0.00800705; dancing 0.00794827; behavior 0.00789228
Second word distribution: foraging 0.290076; nectar 0.114508; food 0.106655; forage 0.0734919; colony 0.0660329; pollen 0.0427706; flower 0.0400582; sucrose 0.0334728; source 0.0319787; behavior 0.0283774; individual 0.028029; rate 0.0242806; recruitment 0.0200597; time 0.0197362; reward 0.0196271; task 0.0182461; sitter 0.00604067; rover 0.00582791; rovers 0.00306051
Questions? Thanks!