Term Co-occurrence Analysis as an Interface to Digital Libraries

Jan W. Buzydlowski, Howard D. White, Xia Lin
College of Information Science and Technology, Drexel University, Philadelphia, Pennsylvania, USA

Digital Library Research • First Wave • How to store it • Next Wave • How to retrieve it (IR) • Text Mining • Visual Information Retrieval Interface (VIRI) • Term Co-occurrence Analysis (TCA) • Co-occurrence vs. lexical associations • Maps vs. lists

Term Definition • Unit of Analysis • Words • Documents • Authors • Journals • Section of Focus • Abstract/Text • Title • Bibliography • Keywords

Example • Words in Title: Term, Co-occurrence, Analysis, Interface, Digital, Library • Authors in Bibliography: Salton-G, Chen-C, White-HD, Ding-Y, Cleveland-W, McCain-K, Lin-X, Schvaneveldt-R, Kamada-T, Fruchterman-T

Term Co-occurrence Methodology • User determines which terms are of interest • Via a seed term • From a pre-defined list • The system returns the pair-wise co-occurrence counts of the terms over the collection of records

Example • Unit: Author; Section: Bibliography • User Supplied List: Plato, Aristotle, Smith, Brown • For a given data set (N = 4 unique terms) • Article 1: Plato, Aristotle, Smith, … • Article 2: Plato, Smith, … • Article 3: Plato, Aristotle, Smith, Brown, … • The following co-citations (C(4,2) = 6) are found:

COMBINATION            COUNT   ARTICLES
Plato and Smith        3       1, 2, 3
Plato and Aristotle    2       1, 3
Plato and Brown        1       3
Aristotle and Smith    2       1, 3
Aristotle and Brown    1       3
Smith and Brown        1       3
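The counts in this example can be reproduced with a few lines of Python. This is only a minimal sketch, not the system's implementation: the function name `cooccurrence_counts` and the dictionary layout are illustrative assumptions, and each article is reduced to the four names shown on the slide (the elided "…" authors are left out).

```python
from itertools import combinations

# User-supplied terms, in the order given in the example.
terms = ["Plato", "Aristotle", "Smith", "Brown"]

# Authors cited by each article, restricted to the names shown in the example.
articles = {
    1: {"Plato", "Aristotle", "Smith"},
    2: {"Plato", "Smith"},
    3: {"Plato", "Aristotle", "Smith", "Brown"},
}

def cooccurrence_counts(terms, articles):
    """Map each unordered pair of terms to the sorted list of article ids
    in which both terms occur; the co-occurrence count is the list length."""
    found = {pair: [] for pair in combinations(terms, 2)}
    for article_id in sorted(articles):
        cited = articles[article_id]
        for a, b in found:
            if a in cited and b in cited:
                found[(a, b)].append(article_id)
    return found

found = cooccurrence_counts(terms, articles)
for (a, b), ids in found.items():
    print(f"{a} and {b}: count={len(ids)}, articles={ids}")
```

With N = 4 terms the loop visits C(4,2) = 6 pairs, matching the six combinations listed in the example.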

Term Co-occurrence Significance • The frequent co-occurrence of a term pair within a set of documents indicates a strong association between those terms, whereas infrequent co-occurrence indicates a weak one • The association you would expect is borne out by the frequency • The frequency you compute suggests a level of association • Pain and Management vs. Pain and Obtainment • Plato and Aristotle vs. Plato and Cher • Science and Nature vs. Science and National Tattler • A and B vs. C and D

Term Co-occurrence Uses • Allows a user to get a “foothold” with just one term • One seed term returns many other related terms • Allows a user to get an “overview” with user-supplied/system-supplied terms • Co-occurrence counts with visualization

Seeding • User types in • One term, e.g., Plato • Boolean expression, e.g., Plato AND Brown • System supplies top n terms, in ranked order of frequency of co-occurrence with the initial term
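The seeding step can be sketched as follows. This is a hedged illustration rather than the system's actual code: the function name `top_cooccurring`, the representation of records as sets of terms, and the alphabetical tie-breaking are all assumptions. A Boolean AND seed is modeled by requiring every seed term to appear in a record; other Boolean operators are not handled.

```python
from collections import Counter

def top_cooccurring(seed_terms, records, n):
    """Return up to n terms ranked by how often they co-occur with the seed.

    seed_terms is a set: one element for a single-term seed, several
    elements for an AND expression (all must appear in a record)."""
    seed_terms = set(seed_terms)
    counts = Counter()
    for record in records:
        if seed_terms <= record:                 # record satisfies the seed
            counts.update(record - seed_terms)   # count every other term once
    # Highest count first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

# Toy records reusing the author example from earlier in the talk.
records = [
    {"Plato", "Aristotle", "Smith"},
    {"Plato", "Smith"},
    {"Plato", "Aristotle", "Smith", "Brown"},
]

single = top_cooccurring({"Plato"}, records, 2)
boolean_and = top_cooccurring({"Plato", "Brown"}, records, 5)
print(single)       # ['Smith', 'Aristotle']
print(boolean_and)  # ['Aristotle', 'Smith']
```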

Example • For Plato seed: • ARISTOTLE • PLUTARCH • CICERO • HOMER • BIBLE • EURIPIDES • ARISTOPHANES • XENOPHON • AUGUSTINE • HERODOTUS • KANT-I • AESCHYLUS • SOPHOCLES • THUCYDIDES • OVID • HESIOD • DIOGENES-LAERTI • HEIDEGGER-M • DERRIDA-J • PINDAR • NIETZSCHE-F • HEGEL-GWF • VERGIL • AQUINAS-T

Need for Visualization • Given a list of user- / system-supplied terms • Find the frequency of co-occurrence of each pair-wise combination of terms • Plato AND Aristotle = 1,920 • Plato AND Plutarch = 380 • … • Too many numbers to take in at once • C(25, 2) = (25 * 24) / 2 = 300 pairs • Three major visualization techniques • Multidimensional Scaling (MDS) • Self-Organizing (Kohonen) Maps (SOMs) • Pathfinder Networks (PFNETs)
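To make the reduction concrete, here is a minimal sketch of one of the three techniques, a Pathfinder network. The slides do not say which Pathfinder parameters the system uses or how counts are converted to distances, so this sketch assumes the common PFNET(r = ∞, q = n − 1) variant and a simple linear conversion (`max_count + 1 - count`). The terms A to D and their counts are made up for illustration, and `pathfinder_edges` is a hypothetical name.

```python
import math
from itertools import combinations

def pathfinder_edges(terms, counts):
    """Keep a link only if no indirect path has a smaller maximum step.

    counts maps (term_a, term_b) to a co-occurrence count; higher counts
    become shorter distances via max_count + 1 - count (an assumption)."""
    n = len(terms)
    index = {term: i for i, term in enumerate(terms)}
    max_count = max(counts.values())
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (a, b), count in counts.items():
        if count > 0:  # pairs that never co-occur stay unlinked
            i, j = index[a], index[b]
            dist[i][j] = dist[j][i] = max_count + 1 - count
    # Floyd-Warshall variant: a path's cost is its largest step (r = infinity).
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    return [(terms[i], terms[j])
            for i, j in combinations(range(n), 2)
            if dist[i][j] < inf and dist[i][j] <= minimax[i][j]]

print(math.comb(25, 2))  # 300 pairs for 25 terms, as computed on the slide

# Hypothetical counts for four placeholder terms.
terms = ["A", "B", "C", "D"]
counts = {("A", "B"): 10, ("A", "C"): 8, ("B", "C"): 3,
          ("A", "D"): 2, ("B", "D"): 1, ("C", "D"): 6}
edges = pathfinder_edges(terms, counts)
print(edges)  # [('A', 'B'), ('A', 'C'), ('C', 'D')]
```

In this toy case six pairwise counts reduce to three links, which is the kind of simplification that makes a map easier to read than a list of numbers.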

White’s MDS map of 15 co-cited classificationists, ca. 1990 • Authors shown: P Arabie, JH Ward, JC Gower, M Wish, RN Shepard, RR Sokal, JB Kruskal, SC Johnson, PHA Sneath, JD Carroll, PE Green, JA Hartigan, HA Skinner, VE McGee, RK Blashfield

White’s PFNet of co-cited authors in Biblical and literary hermeneutics, 1988-1997

Our System • Three-tiered • User interface • Server • Database • Real-time and interactive • Significant data sources • ISI AHCI • MedLine • Live interface for retrieval

User Interface - Seed

User Interface – SOM

Interface - PFNET

Interface - Visual Information Retrieval Interface (VIRI)

User Interface IV

Database Interface • API • String[] findRel(String, int) • int[] findOcc(String[]) • Implemented on: • BRS • API via a wrapper • Oracle • API via JDBC • Noah • Specialized co-occurrence database • API via JNI
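The slide gives only the two signatures, not their semantics. The sketch below is a Python analogue under stated assumptions: `find_rel(seed, n)` is taken to return the top n terms co-occurring with the seed, and `find_occ(terms)` is taken to return one count per pair in the order produced by `itertools.combinations`. The class names and the in-memory backend are hypothetical stand-ins for the BRS, Oracle, and Noah implementations, which are not shown in the talk.

```python
from abc import ABC, abstractmethod
from collections import Counter
from itertools import combinations

class CooccurrenceDatabase(ABC):
    """Backend-neutral interface mirroring findRel and findOcc."""

    @abstractmethod
    def find_rel(self, seed, n):
        """Return up to n terms related to seed (assumed: ranked by count)."""

    @abstractmethod
    def find_occ(self, terms):
        """Return co-occurrence counts (assumed: one per pair of terms)."""

class InMemoryDatabase(CooccurrenceDatabase):
    """Toy backend holding each record as a set of terms."""

    def __init__(self, records):
        self._records = [frozenset(record) for record in records]

    def find_rel(self, seed, n):
        counts = Counter()
        for record in self._records:
            if seed in record:
                counts.update(record - {seed})
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [term for term, _ in ranked[:n]]

    def find_occ(self, terms):
        return [sum(1 for record in self._records if a in record and b in record)
                for a, b in combinations(terms, 2)]

db = InMemoryDatabase([
    {"Plato", "Aristotle", "Smith"},
    {"Plato", "Smith"},
    {"Plato", "Aristotle", "Smith", "Brown"},
])
related = db.find_rel("Plato", 3)
pair_counts = db.find_occ(["Plato", "Aristotle", "Smith", "Brown"])
print(related)      # ['Smith', 'Aristotle', 'Brown']
print(pair_counts)  # [2, 3, 1, 2, 1, 1]
```

Keeping the interface this small is what would let the same user interface sit on top of different stores, since each backend only has to answer these two questions.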

Future Plans • User Study • Preference • Type of map, etc. • Cognitive map • How well does the map match experts’ mental models • Larger datasets • Additional data sources
