
Finding scientific topics

Finding scientific topics. Tom Griffiths (Stanford University) and Mark Steyvers (UC Irvine). Why map knowledge? Quickly grasp important themes in a new field; synthesize content of an existing field; discover targets for funding and research.


Presentation Transcript


  1. Finding scientific topics Tom Griffiths Stanford University Mark Steyvers UC Irvine

  2. Why map knowledge? • Quickly grasp important themes in a new field • Synthesize content of an existing field • Discover targets for funding and research

  3. Why map knowledge? • Quickly grasp important themes in a new field • Synthesize content of an existing field • Discover targets for funding and research INFORMATION OVERLOAD

  4. Apoptosis + Plant Biology

  5. Apoptosis + Medicine

  6. Apoptosis + Medicine

  7. Apoptosis + Medicine

  8. Apoptosis + Medicine

  9. Apoptosis + Medicine probabilistic generative model

  10. Apoptosis + Medicine statistical inference

  11. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  12. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  13. A generative model for documents • Each document a mixture of topics • Each word chosen from a single topic • $P(w \mid z = j)$ from parameters $\phi^{(j)}$ • $P(z = j)$ from parameters $\theta$ (Blei, Ng, & Jordan, 2003)

  14. A generative model for documents
      • topic 1, $P(w \mid z = 1) = \phi^{(1)}$: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
      • topic 2, $P(w \mid z = 2) = \phi^{(2)}$: HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

  15. Choose mixture weights for each document, generate “bag of words” • $\theta = \{P(z = 1), P(z = 2)\}$: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0} • Generated words: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
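
The two-step process on this slide (pick a topic for each word from the document's mixture weights, then pick a word from that topic) can be sketched in a few lines of Python. This is a minimal illustration using the toy topics from the previous slide; the function name `generate_document`, the document length of 10, and the fixed seed are assumptions made for the example, not details from the talk.

```python
import random

# Toy topics from the slides: each topic is uniform over five words.
TOPICS = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}


def generate_document(p_topic1, n_words, rng):
    """Sample a bag of words with P(z = 1) = p_topic1 and P(z = 2) = 1 - p_topic1."""
    words = []
    for _ in range(n_words):
        z = 1 if rng.random() < p_topic1 else 2  # choose a topic for this word
        words.append(rng.choice(TOPICS[z]))      # choose a word from that topic
    return words


rng = random.Random(0)  # arbitrary seed, only for repeatability
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, " ".join(generate_document(p, 10, rng)))
```

With weights {0, 1} every word comes from topic 2, and with {1, 0} every word comes from topic 1; intermediate weights mix the two vocabularies.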

  16. A generative model for documents • Called Latent Dirichlet Allocation (LDA) • Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001) • Graphical model: $\theta \rightarrow z \rightarrow w$, with one $z$ and one $w$ per word

  17. LDA: $P(w)$ (words × documents) = $P(w \mid z)$ (words × topics) × $P(z)$ (topics × documents) • SVD (Dumais, Landauer): $C$ (words × documents) = $U$ (words × dims) $D$ (dims × dims) $V^T$ (dims × documents)
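
The LDA factorization on this slide can be checked numerically on the toy example from the earlier slides. The sketch below is illustrative only: the names `P_W_GIVEN_Z`, `P_Z`, and `matmul` are made up for the example, and plain nested lists stand in for matrices.

```python
VOCAB = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]

# P(w|z): one row per word, one column per topic (words x topics).
P_W_GIVEN_Z = [[0.2, 0.0] for _ in range(5)] + [[0.0, 0.2] for _ in range(5)]

# P(z): one row per topic, one column per document (topics x documents),
# using the five mixture weights from the toy example.
P_Z = [[0.0, 0.25, 0.5, 0.75, 1.0],
       [1.0, 0.75, 0.5, 0.25, 0.0]]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# P(w): words x documents. Each column is one document's word distribution.
P_W = matmul(P_W_GIVEN_Z, P_Z)

for j in range(len(P_Z[0])):
    column = [row[j] for row in P_W]
    print("document", j, "column sum =", round(sum(column), 6))
```

Unlike the SVD factors, every factor here is a probability distribution, so each column of the product also sums to one.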

  18. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  19. Inverting the generative model • Maximum likelihood estimation (EM) • Variational EM (Blei, Ng & Jordan, 2003) • Bayesian inference

  20. Bayesian inference • Sum in the denominator over $T^n$ terms • Full posterior only tractable up to a constant
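
The slide's equation did not survive in this transcript. A standard way to write the posterior these bullets describe is given below, as a reconstruction under the talk's notation (z for topic assignments, w for words, T topics, n word tokens) rather than the slide's exact formula.

```latex
P(\mathbf{z} \mid \mathbf{w})
  = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}'} P(\mathbf{w}, \mathbf{z}')}
```

The sum in the denominator runs over every possible assignment of the n tokens to T topics, which is where the $T^n$ terms come from. The numerator can be evaluated for any single assignment, so the posterior is known up to that normalizing constant.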

  21. Markov chain Monte Carlo • Sample from a Markov chain which converges to target distribution • Allows sampling from an unnormalized posterior distribution • Can compute approximate statistics from intractable distributions
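
As a concrete instance of sampling from an unnormalized posterior, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It resamples one token's topic at a time from its conditional distribution given all other assignments. The function name `gibbs_lda`, the hyperparameter values `alpha` and `beta`, and the number of sweeps are illustrative assumptions, not the settings used in the talk.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids.

    alpha, beta, n_sweeps, and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic counts
    n_t = [0] * n_topics                                # tokens assigned to each topic
    z = []                                              # topic of every token

    # Start from a random assignment of tokens to topics.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized conditional for each candidate topic j.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                # random.choices normalizes the weights internally.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


# Tiny hypothetical corpus: word ids 0-2 and 3-5 tend to occur together.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 5, 3]]
assignments, word_topic, doc_topic = gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
print(assignments)
```

Only the counts are stored, which is why memory depends on the sizes of the word-topic and document-topic tables rather than on the number of tokens.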

  22. A visual example: Bars sample each pixel from a mixture of topics pixel = word image = document

  23. Interpretable decomposition • SVD gives a basis for the data, but not an interpretable one • The true basis is not orthogonal, so rotation does no good

  24. Bayesian model selection • How many topics do we need? • A Bayesian would consider the posterior: $P(T \mid \mathbf{w}) \propto P(\mathbf{w} \mid T)\,P(T)$ • Involves summing over assignments z
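
Once estimates of log P(w|T) are available for several values of T, turning them into a posterior over T is a small normalization step. The sketch below assumes those estimates already exist (obtaining them is the hard part, since it involves the sum over assignments z). The function name `posterior_over_T`, the uniform default prior, and the example numbers are all hypothetical.

```python
import math


def posterior_over_T(log_likelihoods, log_prior=None):
    """Normalize P(T|w), proportional to P(w|T) P(T), working in log space.

    log_likelihoods maps T -> log P(w|T). A uniform prior over the listed
    values of T is assumed when log_prior is not given.
    """
    if log_prior is None:
        log_prior = {T: 0.0 for T in log_likelihoods}
    scores = {T: ll + log_prior[T] for T, ll in log_likelihoods.items()}
    m = max(scores.values())  # subtract the max before exponentiating
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {T: math.exp(s - log_norm) for T, s in scores.items()}


# Hypothetical log-likelihoods, invented purely for illustration.
example = {100: -1000.0, 200: -995.0, 300: -990.0}
print(posterior_over_T(example))
```

Working in log space matters here because corpus log-likelihoods are large negative numbers whose exponentials underflow to zero.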

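  25. Bayesian model selection [Figure: $P(\mathbf{w} \mid T)$ against corpus $\mathbf{w}$, for T = 10 and T = 100]

  26. Bayesian model selection [Figure: $P(\mathbf{w} \mid T)$ against corpus $\mathbf{w}$, for T = 10 and T = 100]

  27. Bayesian model selection [Figure: $P(\mathbf{w} \mid T)$ against corpus $\mathbf{w}$, for T = 10 and T = 100]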

  28. Back to the bars

  29. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  30. Corpus preprocessing • Used all D = 28,154 abstracts from 1991-2001 • Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551) • Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus • Also, PNAS class designations for 2001 (thanks to Kevin Boyack)
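
The vocabulary rule on this slide (keep words that occur in at least five abstracts and are not on the stop list, splitting on any delimiting character) can be sketched as follows. The function name `build_vocabulary`, the choice to uppercase tokens, and the treatment of every non-alphanumeric character as a delimiter are assumptions for the example; the talk does not specify them.

```python
import re
from collections import Counter


def build_vocabulary(abstracts, stop_words, min_docs=5):
    """Return (vocab, docs): the kept words and each abstract as kept tokens."""
    # Split on any run of non-alphanumeric characters (assumed delimiter rule);
    # uppercase so that case variants are merged (also an assumption).
    tokenized = [
        [t for t in re.split(r"[^A-Za-z0-9]+", a.upper()) if t]
        for a in abstracts
    ]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per abstract
    stop = {s.upper() for s in stop_words}
    vocab = sorted(w for w, df in doc_freq.items()
                   if df >= min_docs and w not in stop)
    keep = set(vocab)
    docs = [[t for t in tokens if t in keep] for tokens in tokenized]
    return vocab, docs


# Tiny hypothetical corpus with a lowered threshold so something survives.
vocab, docs = build_vocabulary(
    ["the cell, the gene", "gene-cell the", "a gene"],
    stop_words=["the"],
    min_docs=2,
)
print(vocab, docs)
```

The threshold is on the number of abstracts containing a word, not on its total count, so the document frequency is accumulated from a set of each abstract's tokens.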

  31. Running the algorithm • Memory requirements linear in T(W+D), runtime proportional to nT • T = 50, 100, 200, 300, 400, 500, 600, (1000) • Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100 • All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
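
To give a feel for the scaling statements on this slide, the sketch below plugs in the corpus sizes reported for the PNAS abstracts (D, W, and n from the preprocessing slide). The function name `cost_summary` is made up, and the quantities are simple counts implied by "linear in T(W+D)" and "proportional to nT", not measured memory or runtime.

```python
W = 20_551      # vocabulary size
D = 28_154      # number of abstracts
n = 3_026_970   # word tokens in the corpus
CHAINS = 8
SAMPLES_PER_CHAIN = 10


def cost_summary(T):
    """Count-table entries and per-sweep topic evaluations for T topics."""
    return {
        "count_table_entries": T * (W + D),  # memory grows as T(W + D)
        "evaluations_per_sweep": n * T,      # runtime per sweep grows as nT
        "samples_collected": CHAINS * SAMPLES_PER_CHAIN,
    }


for T in (50, 100, 200, 300, 400, 500, 600, 1000):
    print(T, cost_summary(T))
```

For example, T = 300 gives 14,611,500 count-table entries and 908,091,000 topic evaluations per sweep over the corpus.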

  32. How many topics?

  33. A selection of topics
      • STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
      • NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
      • TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
      • MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
      • HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
      • FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL

  34. A selection of topics
      • MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
      • STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
      • MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
      • CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
      • ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
      • MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
      • PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS

  35. A selection of topics (the same seven topics as slide 34)

  36. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  37. Topics and classes • PNAS authors provide class designations • major: Biological, Physical, Social Sciences • minor: 33 separate disciplines* • Find topics diagnostic of classes • validate “reality” of classes • show topics pick out meaningful structure (classes, and the relations between them)
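
One simple way to look for topics that are diagnostic of a class is to average each document's topic weights within its class and then compare classes. The sketch below shows that heuristic; it is an illustration only, not necessarily the procedure used in the talk, and the function names and example data are hypothetical. It assumes at least two classes.

```python
def mean_topic_weights_by_class(thetas, labels):
    """Average the per-document topic weights within each class."""
    grouped = {}
    for theta, label in zip(thetas, labels):
        grouped.setdefault(label, []).append(theta)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def most_diagnostic_topic(class_means):
    """For each class, pick the topic whose mean weight most exceeds
    its average mean weight in the other classes."""
    result = {}
    for label, means in class_means.items():
        others = [m for other, m in class_means.items() if other != label]

        def margin(j):
            return means[j] - sum(o[j] for o in others) / len(others)

        result[label] = max(range(len(means)), key=margin)
    return result


# Hypothetical topic weights for four documents over three topics.
thetas = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
labels = ["Biological", "Biological", "Physical", "Social"]
print(most_diagnostic_topic(mean_topic_weights_by_class(thetas, labels)))
```

In this toy data each class ends up paired with a different topic, which is the kind of alignment between topics and class designations that the slide describes.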

  38. Topic 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS DENDRITES CA1 STIMULATION TERMINALS SYNAPSE

  39. Topic 201: RESISTANCE RESISTANT DRUG DRUGS SENSITIVE MDR MULTIDRUG SUSCEPTIBLE SELECTED GLYCOPROTEIN SENSITIVITY PGP AGENTS CONFERS MDR1 CYTOTOXIC CONFERRED CHEMOTHERAPEUTIC EFFLUX INCREASED

  40. Topic 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY EXPECTED NEUTRAL EVOLVED COMPETITION HISTORY

  41. Topic 222: CORTEX BRAIN SUBJECTS TASK AREAS REGIONS FUNCTIONAL LEFT MEMORY TEMPORAL IMAGING PREFRONTAL CEREBRAL TASKS FRONTAL AREA TOMOGRAPHY EMISSION POSITRON CORTICAL

  42. Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE EARTH ECOLOGICAL CHANGE TIME ECOSYSTEM

  43. Topic 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM EQUATION CLASSICAL COMPLEXITY NUMERICAL PROPERTIES

  44. 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts

  45. Mapping science • Topics provide dimensionality reduction • Some applications require visualization (and even lower dimensionality) • Low-dimensional representation from methods for analysis of compositional data
