Part 2: What is under the hood: Topic modeling and things you can do with it …
Topic modeling allows useful analyses • Turn written words into measurable quantities • Measure how much is written on topic X • Measure how close documents A and B are
What is the topic model? • Bayesian model for a collection of text documents • Finds patterns of co-occurring words • Intuition: documents exhibit mixtures of topics • Learn using Gibbs sampling (unsupervised) • Blei, Ng, Jordan: Latent Dirichlet Allocation (2003) • Griffiths & Steyvers: Finding Scientific Topics (2004)
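The "learn using Gibbs sampling" bullet can be illustrated with a minimal collapsed Gibbs sampler for LDA. This is a sketch under stated assumptions, not the implementation behind these slides: the function name `lda_gibbs`, the hyperparameter defaults `alpha` and `beta`, and the toy corpus are all illustrative, and documents are assumed to be lists of integer word ids.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic mixes and per-topic word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # tokens in topic k
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Sample a new topic and add the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of the topic mixes and the topics themselves.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi

# Toy corpus (made up): words 0-1 tend to co-occur, as do words 2-3.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

Each row of `theta` is one document's mixture of topics and each row of `phi` is one topic's distribution over words, which are the two quantities the later slides build on.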
What is a topic?
• Topics are distributions over words
• Documents are a mixture of topics
• Topic is a latent variable
Example learned topics:
[BAYESIAN INFERENCE] sampling bayesian prior distribution sample monte_carlo method model samples posterior markov_chain inference importance gibbs likelihood parameter bayes mixture mcmc gaussian …
[DIGITAL LIBRARIES] digital library libraries access collection information metadata electronic repository repositories catalog archives archive providing content portal resources …
Topic modeling is better than clustering • Topic model: multiple topics per document • Clustering: one cluster per document
Topic model is fast and scalable • Newman+, NIPS 2007 • Porteous, Newman+, SIGKDD 2008 • Data: MEDLINE/PubMed, 8 million abstracts, 700 million words • Topic model: time = 1 year, memory = 400 GB • Distributed topic model (processors P1 … P1024): time = 1 day, memory < 1 GB/proc
Topic modeling to “measure” texts • Analyzed state-of-the-field of women’s history • Block & Newman, 2008 (under review, J. Women’s History) • Text mined 20 years of history publications (800,000 abstracts) • Busted some myths … • e.g. the myth that sexuality studies is a modern project
Proportion of women’s history publications devoted to sexuality studies • Q: How modern is sexuality studies?
Use to analyze research portfolio • What research does NINDS fund? • What other institutes fund research done under NINDS? • Measure spending by disease • Potentially more accurate, does not rely on classification/keyword/thesaurus terms • Find spending overlap (e.g. funding duplicated across institutes), and gaps
How is research on ion channels shared across other institutes?
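One way to "measure spending by disease" from topic mixes is to split each grant's dollars across topics in proportion to its topic mix, then total per institute. The sketch below assumes exactly that; the function names, the dollar amounts, and the two-topic mixes are made up for illustration, and nothing here is confirmed as the method used for the NINDS analysis.

```python
def spending_by_institute_and_topic(grants):
    """Split each grant's funding across topics in proportion to its topic mix.

    grants: list of (institute, amount, topic_mix), where topic_mix sums to 1.
    Returns {institute: [dollars attributed to topic 0, topic 1, ...]}.
    """
    totals = {}
    for institute, amount, mix in grants:
        row = totals.setdefault(institute, [0.0] * len(mix))
        for k, share in enumerate(mix):
            row[k] += amount * share
    return totals

def topic_share_by_institute(totals, topic):
    """Fraction of all spending on one topic that each institute accounts for."""
    grand_total = sum(row[topic] for row in totals.values())
    if grand_total == 0:
        return {institute: 0.0 for institute in totals}
    return {institute: row[topic] / grand_total for institute, row in totals.items()}

# Hypothetical grants: (institute, dollars, [share of topic 0, share of topic 1]).
example_grants = [
    ("NINDS", 100.0, [0.5, 0.5]),
    ("NHLBI", 200.0, [1.0, 0.0]),
    ("NINDS", 100.0, [1.0, 0.0]),
]
totals = spending_by_institute_and_topic(example_grants)
# totals == {"NINDS": [150.0, 50.0], "NHLBI": [200.0, 0.0]}
shares = topic_share_by_institute(totals, topic=0)
```

If topic 0 were an ion-channel topic, `shares` would answer the question of how that research is split across institutes without relying on keyword or thesaurus classification.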
Topic-based document-document distance • Each document is represented by its topic mix • dist(Doc A, Doc B) is computed between the topic mixes of Doc A and Doc B
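The slide does not say which distance is used between topic mixes. As one plausible choice, the sketch below uses the Hellinger distance, which is symmetric and ranges from 0 (identical mixes) to 1 (no shared topics); the function name and the example mixes are assumptions for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic mixes (each a list summing to 1)."""
    if len(p) != len(q):
        raise ValueError("topic mixes must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

# Hypothetical topic mixes over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.0, 0.1, 0.9]

near = hellinger(doc_a, doc_b)  # small: both documents are mostly about topic 0
far = hellinger(doc_a, doc_c)   # larger: the documents emphasize different topics
```

Jensen-Shannon divergence or cosine distance would slot into the same place; the important point is that the comparison happens in topic space rather than on raw word counts.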
Example: Most similar sections across books? • Between: • Any Austen book, and • Melville’s Moby Dick
Grants close to given grant
5R01NS024471-21 Ion channels of neurons, PI: JONES, STEPHEN W
Similar grants:
• (0.8) 5R01NS043259-04 Molecular mechanisms of voltage-gated ion channels (LARSSON, HANS PETER)
• (0.8) 5R01GM069837-03 Ion Regulation of Kv Channel Gating and Permeation (DEUTSCH, CAROL J.)
• (0.8) 5R01HL075536-03 Voltage Sensor Movement in the HERG Potassium Channel (TRISTANI-FIROUZI, MARTIN)
• (0.8) 5R01HL065299-06 Molecular Mechanisms of Pacemaker Channel Function (SANGUINETTI, MICHAEL C.)
• (0.8) 5R01NS045383-10 Molecular Physiology of K and Ca Channels (YANG, JIAN)
• (0.8) 5R01DK046950-12 Molecular Cloning of Epithelial K Channels (SACKIN, HENRY)
• (0.8) 5R01HL050411-13 Cardiac Na+ Channel: Molecular Basis of Permeation (TOMASELLI, GORDON)
• (0.8) 5R01HL044630-16 Pharmacology of Cardiac Sodium Channel Modifiers (SHEETS, MICHAEL F.)
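A "grants close to a given grant" lookup can be sketched as ranking every other grant by the similarity of its topic mix to the query grant's. The slides do not state how the (0.8) scores were computed; cosine similarity is used here purely as an assumption, and the grant ids `G1` to `G4` and their mixes are hypothetical.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic mixes."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(query_id, topic_mixes, k=3):
    """Return the k grants whose topic mixes are most similar to the query grant."""
    query = topic_mixes[query_id]
    scored = [
        (cosine_similarity(query, mix), grant_id)
        for grant_id, mix in topic_mixes.items()
        if grant_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(grant_id, round(score, 2)) for score, grant_id in scored[:k]]

# Hypothetical topic mixes over three topics.
mixes = {
    "G1": [0.8, 0.1, 0.1],
    "G2": [0.7, 0.2, 0.1],
    "G3": [0.1, 0.1, 0.8],
    "G4": [0.2, 0.7, 0.1],
}
neighbors = most_similar("G1", mixes)
```

For large collections the full scan would be replaced by an index, but the ranking logic stays the same.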
Hierarchical labeling of topic maps
• Learn topics for D documents
• Compute all D² document-document distances
• Compute 2-dim layout of D documents (using DrL, PCA, MDS, Isomap, LLE, etc.)
• Create labels
  • Lower levels: domain expert interprets topic, creates short label
  • Higher levels: cluster topics into group (use hierarchical agglomerative clustering), then domain expert labels group
• Place labels
  • Cluster 2-dim points using K-means
  • K-means well suited to clumpiness of DrL layouts
  • Use majority label
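The "place labels" step (K-means on the 2-dim layout, then a majority label per cluster) can be sketched as follows. It assumes the layout coordinates and the per-document expert labels already exist; the point coordinates, label strings, and function names are hypothetical, and the plain K-means here stands in for whatever implementation was actually used.

```python
import random
from collections import Counter

def kmeans_2d(points, k, iters=50, seed=0):
    """Plain K-means on 2-D points; returns (cluster id per point, cluster centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from k distinct input points
    cluster_ids = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        cluster_ids = [
            min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Move every center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, cid in zip(points, cluster_ids) if cid == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return cluster_ids, centers

def majority_labels(cluster_ids, doc_labels):
    """Pick the most common document label within each cluster."""
    votes = {}
    for cid, label in zip(cluster_ids, doc_labels):
        votes.setdefault(cid, Counter())[label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

# Hypothetical 2-D layout with two clumps, plus a label for each document.
layout = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["ion channels", "ion channels", "epilepsy", "stroke", "stroke", "stroke"]
cluster_ids, centers = kmeans_2d(layout, k=2)
placed = majority_labels(cluster_ids, labels)  # one label per cluster, drawn at its center
```

Each entry of `placed` would be drawn at the matching entry of `centers`, so a clump of documents gets a single readable label instead of one label per point.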
Interactive visual browse • Query → {documents} • Iterate: {documents} → topic map, then topic map → {user subselects documents}
Links • http://datalab-1.ics.uci.edu/anthrax2/test.php • http://datalab-1.ics.uci.edu/newman/pubmed/ • http://datalab-1.ics.uci.edu/ninds/ • http://yarra.ics.uci.edu/pubmedtrends/ • http://yarra.ics.uci.edu/calit2/ • http://yarra.ics.uci.edu/topic/enron/ • http://scimaps.org/maps/ninds/ • http://scimaps.org/maps/neurovis/
Topics of topics • Topic model learns patterns of co-occurring words • Rerun topic model (on ‘documents’ where the topic mixes are the ‘words’) to learn patterns of co-occurring topics
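The slide does not say how a topic mix is turned into 'words' for the second run. One simple possibility, shown here as an assumption, is to emit topic tokens in proportion to each topic's share and drop topics with a negligible share; the parameters `tokens_per_doc` and `min_share` and the function name are illustrative.

```python
def topic_mix_to_pseudo_doc(mix, tokens_per_doc=100, min_share=0.05):
    """Turn a topic mix into a bag of topic tokens such as 't0', 't1', ...

    Each topic contributes tokens in proportion to its share of the document;
    topics below min_share are dropped as noise.
    """
    pseudo_doc = []
    for k, share in enumerate(mix):
        if share >= min_share:
            pseudo_doc.extend([f"t{k}"] * round(share * tokens_per_doc))
    return pseudo_doc

# Hypothetical topic mix over three topics.
pseudo = topic_mix_to_pseudo_doc([0.5, 0.48, 0.02])
# pseudo holds 50 copies of 't0' and 48 copies of 't1'; 't2' is dropped.
```

Feeding these pseudo-documents back into the topic model yields 'super-topics' whose top 'words' are themselves topics.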
Co-occurring topics in PubMed
[super56]
• [t320] laparoscopic patient surgery open time complication procedure postoperative
• [t640] bladder urethral incontinence urinary urinary_incontinence patient detrusor urodynamic
• [t542] biliary gallbladder bile_duct patient duct bile common endoscopic
• [t299] patient surgery surgical treatment operation surgical_treatment indication conservative
• [t242] resection anastomosis patient anastomotic operation gastrectomy postoperative anastomoses
[super48]
• [t1389] patient radiotherapy chemotherapy treatment survival surgery tumor disease
• [t392] lymph_node patient lymph_nodes nodes dissection metastases axillary staging
• [t1163] patient prognostic survival prognosis factor prognostic_factor tumor stage
• [t1300] patient month follow-up year recurrence range underwent treated
• [t120] stage patient stages iii disease iv ii stage_i
Austen, Dickens, Melville • 1429 sections of 100 lines • 8 novels (Austen++): Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility, Our Mutual Friend, Moby Dick
Trend of ‘Sentiment’ topic throughout Austen novels
[SENTIMENT] felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering concern dreadful misery unhappy emotion
Topical similarity across Austen/Dickens/Melville [Figure: pairwise similarity between the 6 Austen novels, Our Mutual Friend, and Moby Dick, on a scale from Different to Similar]
Finding Funding Overlap • The US Office of Science and Technology Policy wanted to analyze NSF and NIH funding to determine areas of overlap. • How much funding overlap is there by topic area?
Sample Topics from Topic Model (22,000 abstracts)
Visualization of funding programs: nearby programs support similar topics [Figure legend: NSF – BIO, NSF – SBE, NIH]