Finding scientific topics Tom Griffiths Stanford University Mark Steyvers UC Irvine
Why map knowledge? • Quickly grasp important themes in a new field • Synthesize content of an existing field • Discover targets for funding and research INFORMATION OVERLOAD
Apoptosis + Medicine
Apoptosis + Medicine probabilistic generative model
Apoptosis + Medicine statistical inference
1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts
A generative model for documents • Each document a mixture of topics • Each word chosen from a single topic • $P(w \mid z = j)$ from parameters $\phi^{(j)}$ • $P(z = j)$ from parameters $\theta$ (Blei, Ng, & Jordan, 2003)
A generative model for documents
word          topic 1: P(w|z = 1) = $\phi^{(1)}$    topic 2: P(w|z = 2) = $\phi^{(2)}$
HEART         0.2                                  0.0
LOVE          0.2                                  0.0
SOUL          0.2                                  0.0
TEARS         0.2                                  0.0
JOY           0.2                                  0.0
SCIENTIFIC    0.0                                  0.2
KNOWLEDGE     0.0                                  0.2
WORK          0.0                                  0.2
RESEARCH      0.0                                  0.2
MATHEMATICS   0.0                                  0.2
Choose mixture weights for each document, generate “bag of words”
$\theta$ = {P(z = 1), P(z = 2)}: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
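The generative step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `VOCAB`, `PHI`, and `generate_document`, the document length of 10 words, and the random seed are all assumptions made for the example. The topic-word probabilities and the five mixture weights are taken from the slides.

```python
import numpy as np

VOCAB = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]

# Topic-word distributions from the slide: PHI[j] holds P(w | z = j + 1).
PHI = np.array([
    [0.2] * 5 + [0.0] * 5,   # topic 1
    [0.0] * 5 + [0.2] * 5,   # topic 2
])


def generate_document(theta, n_words, rng):
    """Sample a bag of words: draw a topic for each word, then a word from that topic."""
    topics = rng.choice(len(theta), size=n_words, p=theta)
    return [VOCAB[rng.choice(len(VOCAB), p=PHI[z])] for z in topics]


rng = np.random.default_rng(0)  # seed chosen arbitrarily for the example
for theta in [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]:
    print(theta, " ".join(generate_document(theta, 10, rng)))
```

With weights {0, 1} every word comes from topic 2, with {1, 0} every word comes from topic 1, and the intermediate weights mix the two vocabularies.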
A generative model for documents • Called Latent Dirichlet Allocation (LDA) • Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001)
[Graphical model: $\theta$ → z → w, with one topic assignment z and one word w per token]
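[Figure: two matrix factorizations of the words × documents matrix]
LDA: P(w) [words × documents] = P(w|z) [words × topics] · P(z) [topics × documents]
SVD (Dumais, Landauer): C [words × documents] = U [words × dims] · D [dims × dims] · V^T [dims × documents]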
1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts
Inverting the generative model • Maximum likelihood estimation (EM) • Variational EM (Blei, Ng & Jordan, 2003) • Bayesian inference
Bayesian inference • Sum in the denominator over $T^n$ terms • Full posterior tractable only up to a normalizing constant
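The expression these bullets refer to did not survive extraction. A plausible reconstruction is Bayes' rule applied to the topic assignments, assuming $\mathbf{w}$ denotes the $n$ word tokens and $\mathbf{z}$ their topic assignments, each taking one of $T$ values:

```latex
P(\mathbf{z} \mid \mathbf{w})
  = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}'} P(\mathbf{w}, \mathbf{z}')}
```

The sum in the denominator ranges over every possible assignment of the $n$ tokens to $T$ topics, which gives the $T^n$ terms. The numerator can be evaluated for any single $\mathbf{z}$, so the posterior is available up to that constant.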
Markov chain Monte Carlo • Sample from a Markov chain which converges to target distribution • Allows sampling from an unnormalized posterior distribution • Can compute approximate statistics from intractable distributions
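As an illustration of sampling from an unnormalized posterior, here is a minimal sketch of a collapsed Gibbs sampler for the topic model. It is an assumption-laden toy, not the implementation used for the results: the function name `gibbs_lda`, the symmetric hyperparameters `alpha` and `beta` and their default values, the iteration count, and the toy corpus are all chosen for the example.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.1, seed=0):
    """Collapsed Gibbs sampler sketch. `docs` is a list of lists of word ids.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    Returns the final topic assignments and a smoothed estimate of P(w | z).
    """
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((vocab_size, n_topics))   # word-topic counts
    n_dt = np.zeros((len(docs), n_topics))    # document-topic counts
    n_t = np.zeros(n_topics)                  # total tokens per topic

    # Random initial assignment of every token to a topic.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_wt[w, t] += 1
            n_dt[d, t] += 1
            n_t[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                t = z[d][i]
                n_wt[w, t] -= 1
                n_dt[d, t] -= 1
                n_t[t] -= 1
                # Unnormalized conditional for this token's topic.
                p = (n_wt[w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                # Normalizing over T topics is cheap, unlike the full posterior.
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
                n_t[t] += 1

    phi = (n_wt + beta) / (n_t + vocab_size * beta)   # columns sum to 1
    return z, phi


# Toy corpus: five documents over words 0-4, five over words 5-9.
docs = [[0, 1, 2, 3, 4] * 4 for _ in range(5)] + [[5, 6, 7, 8, 9] * 4 for _ in range(5)]
z, phi = gibbs_lda(docs, n_topics=2, vocab_size=10, n_iter=50)
print(np.round(phi, 2))
```

Each update only needs the conditional distribution of one token's topic given all the others, so the $T^n$-term normalizing constant is never computed.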
A visual example: Bars sample each pixel from a mixture of topics pixel = word image = document
Interpretable decomposition • SVD gives a basis for the data, but not an interpretable one • The true basis is not orthogonal, so rotation does no good
Bayesian model selection • How many topics do we need? • A Bayesian would consider the posterior: $P(T \mid w) \propto P(w \mid T)\, P(T)$ • Involves summing over assignments z
Bayesian model selection
[Figure: P(w|T) plotted against the corpus (w), with curves for T = 10 and T = 100]
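Once an estimate of log P(w|T) is available for each candidate T, turning it into a posterior over T is a small computation. The sketch below assumes such estimates are given and does not cover obtaining them, which is the hard part because it involves the sum over assignments z. The function name `posterior_over_T` and the numeric log-evidence values are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np


def posterior_over_T(log_evidence, log_prior=None):
    """Normalize P(T | w) ∝ P(w | T) P(T) in log space for numerical stability."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_evidence)   # assumed uniform prior over T
    log_post = log_evidence + log_prior
    log_post -= log_post.max()                    # avoid underflow in exp
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical log P(w | T) values for three candidate settings of T.
candidates = [10, 100, 1000]
post = posterior_over_T([-1000.0, -990.0, -995.0])
for T, p in zip(candidates, post):
    print(T, round(float(p), 4))
```

Working in log space matters here because corpus-level likelihoods are far too small to represent directly as floating-point probabilities.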
1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts
Corpus preprocessing • Used all D = 28,154 abstracts from 1991-2001 • Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551) • Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus • Also, PNAS class designations for 2001 (thanks to Kevin Boyack)
Running the algorithm • Memory requirements linear in T(W+D), runtime proportional to nT • T = 50, 100, 200, 300, 400, 500, 600, (1000) • Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100 • All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
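To give a sense of scale, the stated costs can be evaluated for the corpus sizes reported on the preprocessing slide. This is back-of-the-envelope arithmetic only, assuming one stored count per (word, topic) and (document, topic) pair and one topic evaluation per token per sweep; the function names are made up for the example.

```python
W = 20_551       # vocabulary size
D = 28_154       # number of abstracts
n = 3_026_970    # word tokens in the corpus


def count_entries(T):
    """Number of stored counts, linear in T(W + D)."""
    return T * (W + D)


def evaluations_per_sweep(T):
    """Topic probabilities evaluated in one pass over the corpus, proportional to nT."""
    return n * T


for T in (50, 300, 1000):
    print(T, count_entries(T), evaluations_per_sweep(T))

# 8 chains with 10 samples each gives the number of samples per value of T.
samples_per_T = 8 * 10
print(samples_per_T)
```

For T = 300 this gives 14,611,500 stored counts and 908,091,000 evaluations per sweep, and each value of T yields 80 samples in total.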
A selection of topics
• STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
• NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
• TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
• MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
• HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
• FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
A selection of topics
• MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
• STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
• MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
• CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
• ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
• MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
• PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts
Topics and classes • PNAS authors provide class designations • major: Biological, Physical, Social Sciences • minor: 33 separate disciplines* • Find topics diagnostic of classes • validate “reality” of classes • show topics pick out meaningful structure (classes, and the relations between them)
210 SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS DENDRITES CA1 STIMULATION TERMINALS SYNAPSE
201 RESISTANCE RESISTANT DRUG DRUGS SENSITIVE MDR MULTIDRUG SUSCEPTIBLE SELECTED GLYCOPROTEIN SENSITIVITY PGP AGENTS CONFERS MDR1 CYTOTOXIC CONFERRED CHEMOTHERAPEUTIC EFFLUX INCREASED
280 SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY EXPECTED NEUTRAL EVOLVED COMPETITION HISTORY
222 CORTEX BRAIN SUBJECTS TASK AREAS REGIONS FUNCTIONAL LEFT MEMORY TEMPORAL IMAGING PREFRONTAL CEREBRAL TASKS FRONTAL AREA TOMOGRAPHY EMISSION POSITRON CORTICAL
2 SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE EARTH ECOLOGICAL CHANGE TIME ECOSYSTEM
39 THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM EQUATION CLASSICAL COMPLEXITY NUMERICAL PROPERTIES
1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results • Topics and classes • Mapping science • Topic dynamics 4. Future directions • Tagging abstracts
Mapping science • Topics provide dimensionality reduction • Some applications require visualization (and even lower dimensionality) • Low-dimensional representation from methods for analysis of compositional data