
Part 2: What is under the hood: Topic modeling and things you can do with it …


avedis

Presentation Transcript


  1. Part 2: What is under the hood: Topic modeling and things you can do with it …

  2. Topic modeling allows useful analyses • Turn written words into measurable quantities • Measure how much is written on topic X • Measure how close documents A and B are

  3. What is the topic model? • Bayesian model for a collection of text documents • Finds patterns of co-occurring words • Intuition: documents exhibit mixtures of topics • Learn using Gibbs sampling (unsupervised) • Blei, Ng, Jordan: Latent Dirichlet Allocation (2003) • Griffiths & Steyvers: Finding Scientific Topics (2004)

  4. What is a topic? • Topics are distributions over words • Documents are a mixture of topics • Topic is a latent variable • Example learned topics: • [BAYESIAN INFERENCE] sampling bayesian prior distribution sample monte_carlo method model samples posterior markov_chain inference importance gibbs likelihood parameter bayes mixture mcmc gaussian … • [DIGITAL LIBRARIES] digital library libraries access collection information metadata electronic repository repositories catalog archives archive providing content portal resources …

  5. Topic modeling is better than clustering • Topic model: multiple topics per document • Clustering: one cluster per document

  6. Learn topic model using Gibbs sampling
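A minimal sketch of what learning by Gibbs sampling can look like, assuming the collapsed Gibbs sampler for LDA in the style of Griffiths & Steyvers. The function name `gibbs_lda`, the hyperparameter defaults (`alpha`, `beta`), and the representation of documents as lists of integer word ids are assumptions made for illustration, not details taken from the slides.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # word counts per topic
    n_k = [0] * num_topics                                 # total word count per topic
    z = []                                                 # topic id for every word token

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this token from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Smoothed estimates: topic mix per document, word distribution per topic.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kw[k]]
           for k in range(num_topics)]
    return z, theta, phi
```

Here `z` holds one topic id per word token, `theta` the per-document topic mixes, and `phi` the per-topic word distributions. No labels are supplied at any point, which is what makes the procedure unsupervised.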

  7. Topic model is fast and scalable • Newman+, NIPS 2007 • Porteous, Newman+, SIGKDD 2008 • Data: MEDLINE/PubMed, 8 million abstracts, 700 million words • Topic model (P1): time = 1 year, memory = 400 GB • Distributed topic model (P1, P2, P3, …, P1024): time = 1 day, memory < 1 GB/proc

  8. Topic modeling to “measure” texts • Analyzed state-of-the-field of women’s history • Block & Newman, 2008 (under review, J. Women’s History) • Text mined 20 years of history publications (800,000 abstracts) • Busted some myths … • e.g. Sexuality studies is a modern project

  9. Proportion of women’s history publications devoted to sexuality studies • Q: How modern is sexuality studies?
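One way such a proportion could be computed from a learned topic model is sketched below. It assumes each publication already has a topic mix and a publication year, and it takes the share for a year to be the average weight of the chosen topic over that year's publications. The function name and the averaging rule are assumptions; the slides do not say how the plotted proportion was calculated.

```python
from collections import defaultdict

def topic_share_by_year(doc_years, doc_topic_mixes, topic):
    """Average weight of `topic` across the documents published in each year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, mix in zip(doc_years, doc_topic_mixes):
        totals[year] += mix[topic]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}
```

Plotting the returned values against year gives a trend line for the topic.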

  10. Use to analyze research portfolio • What research does NINDS fund? • What other institutes fund research done under NINDS? • Measure spending by disease • Potentially more accurate, does not rely on classification/keyword/thesaurus terms • Find spending overlap (e.g. funding duplicated across institutes), and gaps
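A hedged sketch of how spending could be measured by topic: each grant's dollars are split across topics in proportion to its topic mix, then summed per institute. The proportional split, the function name, and the `(institute, dollars, topic_mix)` record layout are assumptions for illustration, not the method described in the slides.

```python
from collections import defaultdict

def spending_by_topic(grants):
    """grants: iterable of (institute, dollars, topic_mix) records.

    Returns {institute: {topic_index: dollars}}, splitting each grant's
    dollars across topics in proportion to its topic mix.
    """
    spend = defaultdict(lambda: defaultdict(float))
    for institute, dollars, mix in grants:
        for topic, weight in enumerate(mix):
            spend[institute][topic] += dollars * weight
    return {institute: dict(by_topic) for institute, by_topic in spend.items()}
```

Comparing the per-topic totals of two institutes then points to overlap (both spend on the topic) and gaps (neither does), without relying on keyword or thesaurus terms.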

  11. What topics does NINDS fund?

  12. How is research on ion channels shared across other institutes?

  13. Topic-based document-document distance • Doc A and Doc B are each represented by a topic mix • dist(Doc A, Doc B) is computed from the two topic mixes
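A small sketch of a topic-based distance, assuming the Hellinger distance between two topic mixes. The slides do not name the distance measure used, so the choice of Hellinger and the helper names `hellinger` and `most_similar` are assumptions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic mixes (0 = identical, 1 = no shared topics)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def most_similar(query, topic_mixes, k=3):
    """Indices of the k documents closest to document `query`, nearest first."""
    others = [i for i in range(len(topic_mixes)) if i != query]
    others.sort(key=lambda i: hellinger(topic_mixes[query], topic_mixes[i]))
    return others[:k]
```

The same ranking step applies whether the documents are book sections or grant abstracts, since only the topic mixes are compared.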

  14. Example: Most similar sections across books? • Between: • Any Austen book, and • Melville’s Moby Dick

  15. Grants close to given grant • Given grant: 5R01NS024471-21, Ion channels of neurons, PI: JONES, STEPHEN W • Similar grants: • (0.8) 5R01NS043259-04, Molecular mechanisms of voltage-gated ion channels (LARSSON, HANS PETER) • (0.8) 5R01GM069837-03, Ion Regulation of Kv Channel Gating and Permeation (DEUTSCH, CAROL J.) • (0.8) 5R01HL075536-03, Voltage Sensor Movement in the HERG Potassium Channel (TRISTANI-FIROUZI, MARTIN) • (0.8) 5R01HL065299-06, Molecular Mechanisms of Pacemaker Channel Function (SANGUINETTI, MICHAEL C.) • (0.8) 5R01NS045383-10, Molecular Physiology of K and Ca Channels (YANG, JIAN) • (0.8) 5R01DK046950-12, Molecular Cloning of Epithelial K Channels (SACKIN, HENRY) • (0.8) 5R01HL050411-13, Cardiac Na+ Channel: Molecular Basis of Permeation (TOMASELLI, GORDON) • (0.8) 5R01HL044630-16, Pharmacology of Cardiac Sodium Channel Modifiers (SHEETS, MICHAEL F.)

  16. Hierarchical labeling of topic maps

  17. Hierarchical labeling of topic maps • Learn topics for D documents • Compute all D² document-document distances • Compute 2-dim layout of D documents (using DrL, PCA, MDS, Isomap, LLE, etc.) • Create labels: at lower levels, a domain expert interprets each topic and creates a short label; at higher levels, cluster topics into groups (using hierarchical agglomerative clustering), then a domain expert labels each group • Place labels: cluster the 2-dim points using K-means (well suited to the clumpiness of DrL layouts), then use the majority label in each cluster
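The label-placement step can be sketched as follows, assuming the 2-dim layout has already been computed and every document already carries a short label. The plain Lloyd's-algorithm `kmeans` and the helper `majority_labels` are illustrative stand-ins written for this sketch, not the implementation behind the maps in the slides.

```python
import random
from collections import Counter

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on 2-dim points; returns (cluster index per point, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        new_centers = list(centers)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                new_centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if new_centers == centers:
            break  # converged: assignments match the final centers
        centers = new_centers
    return assign, centers

def majority_labels(assign, doc_labels):
    """Most common document label within each cluster."""
    votes = {}
    for cluster, label in zip(assign, doc_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}
```

Each majority label would then be drawn at its cluster center on the map.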

  18. K-Means clustering of DrL layout using K=130

  19. Interactive visual browse • Query → {documents} • Iterate: {documents} → topic map; topic map → {user subselects documents}

  20. query = “colon cancer”

  21. query = “colon cancer”

  22. Thank you

  23. Links • http://datalab-1.ics.uci.edu/anthrax2/test.php • http://datalab-1.ics.uci.edu/newman/pubmed/ • http://datalab-1.ics.uci.edu/ninds/ • http://yarra.ics.uci.edu/pubmedtrends/ • http://yarra.ics.uci.edu/calit2/ • http://yarra.ics.uci.edu/topic/enron/ • http://scimaps.org/maps/ninds/ • http://scimaps.org/maps/neurovis/

  24. query = “p53”

  25. query = “p53”

  26. Topics of topics • Topic model learns patterns of co-occurring words • Rerun topic model (on ‘documents’ where the topic mixes are the ‘words’) → learn patterns of co-occurring topics
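A minimal sketch of how a topic mix might be turned into a 'document' whose 'words' are topic ids, so that a topic model can be rerun on it. The rounding scheme, the `length` parameter, and the function name are assumptions; the slides do not describe how the conversion was done.

```python
def topic_mix_to_pseudo_doc(topic_mix, length=100):
    """Represent a topic mix as a list of topic ids, repeated in proportion to their weights."""
    pseudo_doc = []
    for topic, weight in enumerate(topic_mix):
        pseudo_doc.extend([topic] * round(weight * length))
    return pseudo_doc
```

Feeding these pseudo-documents to the same learner, with the number of topics as the vocabulary size, groups topics that tend to appear together.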

  27. Co-occurring topics in PubMed • [super56] • [t320] laparoscopic patient surgery open time complication procedure postoperative • [t640] bladder urethral incontinence urinary urinary_incontinence patient detrusor urodynamic • [t542] biliary gallbladder bile_duct patient duct bile common endoscopic • [t299] patient surgery surgical treatment operation surgical_treatment indication conservative • [t242] resection anastomosis patient anastomotic operation gastrectomy postoperative anastomoses • [super48] • [t1389] patient radiotherapy chemotherapy treatment survival surgery tumor disease • [t392] lymph_node patient lymph_nodes nodes dissection metastases axillary staging • [t1163] patient prognostic survival prognosis factor prognostic_factor tumor stage • [t1300] patient month follow-up year recurrence range underwent treated • [t120] stage patient stages iii disease iv ii stage_i

  28. Austen, Dickens, Melville • 1429 sections of 100 lines • 8 novels (Austen++): Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility, Our Mutual Friend, Moby Dick

  29. Trend of ‘Sentiment’ topic throughout Austen novels • [SENTIMENT] felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering concern dreadful misery unhappy emotion

  30. Topical similarity across Austen/Dickens/Melville • Figure: section-by-section similarity across the 6 Austen novels, Our Mutual Friend, and Moby Dick, on a scale from Similar to Different

  31. Most similar sections: Austen -- Melville

  32. A topic ID is assigned to every word

  33. A topic ID is assigned to every word

  34. Case Studies: Finding Funding Overlap • The US Office of Science and Technology Policy wanted to analyze NSF and NIH funding to determine areas of overlap • How much funding overlap is there by topic area?

  35. Case Studies: Finding Funding Overlap • Sample topics from topic model (22,000 abstracts)

  36. Program Similarity

  37. Visualization of funding programs: nearby programs support similar topics • Legend: NSF-BIO, NSF-SBE, NIH
