A Comparison of Graphical Techniques for the Display of Co-Occurrence Data. Jan W. Buzydlowski, Xia Lin, Howard D. White College of Information Science and Technology Drexel University Philadelphia, PA 19104 USA. Information Visualization.
Information Visualization • (Data) Visualization allows for the revelation of intricate structure which cannot be absorbed in any other way. [Cleveland, 1993] • (Information) Visualization has two aspects, structural modeling and graphic representation. [C. Chen, 1999] • data - model - display
Visualization Overview • Model - Display • Co-Occurrence Model • 3 Graphical Displays • Data • Co-citation counts from the Institute for Scientific Information (ISI), Philadelphia, PA • Obtained from a 10-year Arts & Humanities Citation Index database given to Drexel by ISI for research purposes
Co-Occurrence Model • Examples • Derivation • Metrics
Co-Occurrence Data - Example 1 • Market Basket Analysis • a shopping cart holds items purchased • e.g., milk, bread, razor blades, newspaper • Over all the sales for one day • what items are purchased together • how can we arrange the items in the store • Pampers and beer on Thursdays...
Co-Occurrence Data - Example 2 • Author Co-citation Analysis (ACA) • Bibliographic data on a given article holds, e.g., • title, keywords, abstract, citations to other documents • An article might cite, e.g.: • Plato, Aristotle, Smith, Brown • Over a given set of many citing articles • Count how many times each pair of authors was cited together • Resulting co-citation count shows common intellectual interest
Co-Occurrence Derivation • For a given data set (N = 4 unique terms) • Article 1: Plato, Aristotle, Smith • Article 2: Plato, Smith • Article 3: Plato, Aristotle, Smith, Brown • The following co-citations (C(4,2) = 6) are found:
COMBINATION            COUNT   ARTICLES
Plato and Smith        3       1, 2, 3
Plato and Aristotle    2       1, 3
Plato and Brown        1       3
Aristotle and Smith    2       1, 3
Aristotle and Brown    1       3
Smith and Brown        1       3
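The derivation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `cocitation_counts` and the dictionary layout are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Citing articles from the slide: article number -> authors it cites.
articles = {
    1: ["Plato", "Aristotle", "Smith"],
    2: ["Plato", "Smith"],
    3: ["Plato", "Aristotle", "Smith", "Brown"],
}

def cocitation_counts(articles):
    """Count, for every unordered author pair, the articles citing both."""
    counts = Counter()
    for cited in articles.values():
        # sorted(set(...)) gives each pair one canonical (alphabetical) key.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

counts = cocitation_counts(articles)
for (a, b), n in counts.most_common():
    print(f"{a} and {b}: {n}")
```

Pairs are keyed alphabetically, so "Plato and Aristotle" is stored as `("Aristotle", "Plato")`.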
Co-Occurrence Measures • Raw counts • Additional information • Correlations • Replace each cell with the correlation between the corresponding pair of columns • Conditional Probability • Compute each cell by dividing each pair's co-occurrence count by the total number of occurrences
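A possible reading of the two derived measures, sketched on the three-article example. Several choices here are assumptions not stated on the slide: the diagonal of the matrix holds the number of articles citing that author, the conditional probability divides by that diagonal value, the correlation is Pearson's, and the helper names are hypothetical.

```python
from math import sqrt

authors = ["Plato", "Aristotle", "Smith", "Brown"]
articles = [
    {"Plato", "Aristotle", "Smith"},
    {"Plato", "Smith"},
    {"Plato", "Aristotle", "Smith", "Brown"},
]

def together(a, b):
    """Number of articles citing both a and b (a == b gives a's own count)."""
    return sum(1 for cited in articles if a in cited and b in cited)

# Raw-count matrix; the diagonal is an assumption (articles citing the author).
matrix = [[together(a, b) for b in authors] for a in authors]

def conditional_probability(a, b):
    """Estimate P(b is cited | a is cited) from the raw counts."""
    i, j = authors.index(a), authors.index(b)
    return matrix[i][j] / matrix[i][i]

def correlation(a, b):
    """Pearson correlation of the columns for a and b; None if undefined."""
    x = [row[authors.index(a)] for row in matrix]
    y = [row[authors.index(b)] for row in matrix]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)

print(conditional_probability("Plato", "Aristotle"))
print(correlation("Plato", "Smith"))
```

Note that the conditional probability is asymmetric: every article citing Aristotle also cites Plato, but not the reverse.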
Graphical Techniques • Three Methodologies • Multi-dimensional scaling • Self-organizing maps • Pathfinder networks
MDS Methodology • Given original distances (similarities), estimate coordinates that could produce those distances • The computed distances should correspond to the original distances • Stress • Added dimensions
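To make "stress" and "added dimensions" concrete, the sketch below scores two candidate configurations against the same original distances. It assumes Kruskal's stress-1 formula, which the slide does not name, and the distances and coordinates are made up for illustration.

```python
import math
from itertools import combinations

def stress(original, coords):
    """Kruskal stress-1: sqrt(sum (d - orig)^2 / sum d^2) over all pairs,
    where d is the distance between the estimated coordinates."""
    num = den = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        num += (d - original[i][j]) ** 2
        den += d ** 2
    return math.sqrt(num / den)

# Hypothetical original distances between three items (a 3-4-5 triangle).
original = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]

one_d = [(0.0,), (3.0,), (-4.0,)]              # best effort on a line
two_d = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # exact fit in the plane

print(stress(original, one_d))  # positive: one dimension distorts a distance
print(stress(original, two_d))  # zero: the added dimension removes the misfit
```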
Self-Organizing Maps (SOMs) • Also known as Kohonen Maps • Based on Neural Networks • Related to wetware • robust techniques • If categories are known • supervised technique • backpropagation learning • If categories are sought • unsupervised technique • competitive learning
SOMs • Given a 2-D grid of nodes • each node has N weights • each vector (row) has N terms • map each input vector to a node • Similar to vector quantization (VQ)
SOMs Generation • nodes initially given random weights • randomly sample an input vector • row of co-occurrence matrix • with replacement • find the node closest to the vector • Euclidean distance • update node weights • node weight = node weight + gain term * (input vector - node weight) • update “neighborhood” • “cool” gain term and neighborhood • repeat…
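The generation steps above can be sketched as a small training loop. This is an illustrative toy, not the authors' code: the grid size, the linear cooling schedule, the circular neighborhood, and the names `train_som` and `best_node` are all assumptions, and the input rows reuse the three-article example with each author's own count on the diagonal.

```python
import math
import random

def best_node(nodes, x):
    """Grid position of the node whose weights are closest (Euclidean) to x."""
    return min(nodes, key=lambda pos: math.dist(nodes[pos], x))

def train_som(data, rows=3, cols=3, steps=500, gain0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    # Each grid node starts with N random weights.
    nodes = {(r, c): [rng.random() for _ in range(n)]
             for r in range(rows) for c in range(cols)}
    for t in range(steps):
        remaining = 1.0 - t / steps
        gain = gain0 * remaining                 # "cool" the gain term
        radius = max(radius0 * remaining, 0.5)   # shrink the neighborhood
        x = rng.choice(data)                     # sample with replacement
        winner = best_node(nodes, x)
        for pos, weights in nodes.items():
            if math.dist(pos, winner) <= radius:  # winner and its neighbors
                for i in range(n):
                    weights[i] += gain * (x[i] - weights[i])
    return nodes

# Rows of the example co-occurrence matrix (diagonal values are an assumption).
rows_by_author = {
    "Plato": [3, 2, 3, 1],
    "Aristotle": [2, 2, 2, 1],
    "Smith": [3, 2, 3, 1],
    "Brown": [1, 1, 1, 1],
}
som = train_som(list(rows_by_author.values()))
for author, row in rows_by_author.items():
    print(author, best_node(som, row))
```

Authors with identical rows (here Plato and Smith) always land on the same node, which is how the map groups similar citation profiles.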
Pathfinder Networks • Based on graph notation • nodes = authors • edges = co-citation counts • Co-occurrence is a complete network (weighted, undirected) • [Figure: example network with edges Plato - Smith (3), Plato - Aristotle (2), Smith - Aristotle (2)]
Pathfinder Networks Generation • Pathfinder Network is generated by varying the parameters: • distance (r) • triangle inequality (q)
Pathfinder Distance • Uses Minkowski metric: $d = \left( \sum_i e_i^r \right)^{1/r}$ • Example • $e_1 = 3$, $e_2 = 4$ • r = 1 => d = 3 + 4 = 7 • Driving distance / ratio data • r = 2 => d = $(9 + 16)^{1/2}$ = 5 • Euclidean distance • r (approaches) infinity => d = max(3, 4) = 4 • ordinal data • rank rather than value
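The three cases in the example can be checked with a short function. The name `minkowski` is chosen for this sketch; the r = infinity case is handled as the maximum edge weight, which is the limit of the formula.

```python
import math

def minkowski(edges, r):
    """Length of a path with the given edge weights under parameter r."""
    if math.isinf(r):
        return max(edges)  # limit of the formula as r grows without bound
    return sum(e ** r for e in edges) ** (1.0 / r)

path = [3, 4]
print(minkowski(path, 1))         # sum of the edge weights
print(minkowski(path, 2))         # Euclidean combination
print(minkowski(path, math.inf))  # largest edge weight
```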
Pathfinder Triangle Inequality • A required property of a metric definition: d(i,j) <= d(i,k) + d(k,j) • But may not be justified • in personal judgments • If a is similar to b, and b is similar to c, there may be no transitive judgment of similarity from a to c • in set intersections • Even though Smith and Jones appear together 12 times, and Jones and Brown appear together 5 times, the overlap between Smith and Brown cannot be predicted
Pathfinder Triangle Inequality • Defines q-triangular • check paths of length q to determine if inequality is met • minimum is 2 • maximum is n -1 • full compliance • the longer the length, the fewer the connections
Pathfinder Network Creation • PFNet (r, q) • Examine all paths of length q or less. • Use Minkowski Metric with parameter r to compute path length. • If a path of less weight is found, then remove the edge.
Pathfinder - Example • [Figure: triangle network with edge Smith - Jones = 5; the two edges through Brown have weights 4 and 3] • q = 2 • r = 1 => Smith - Jones is kept • r = 2 => Smith - Jones is kept • r = infinity => Smith - Jones is removed
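The example can be reproduced with a small PFNet(r, q) sketch. This is one straightforward reading of the creation rule, not the authors' code: edge weights are treated as distances, an edge is removed only when a strictly shorter path of at most q edges exists, and the assignment of 4 to Smith - Brown and 3 to Jones - Brown is an assumption (the result for Smith - Jones is the same either way).

```python
import math

def combine(a, b, r):
    """Minkowski length of a path made of two parts with lengths a and b."""
    if math.isinf(r):
        return max(a, b)
    return (a ** r + b ** r) ** (1.0 / r)

def pfnet(weights, r, q):
    """Return the edges kept by PFNet(r, q); weights maps frozenset pairs
    to distances."""
    nodes = sorted({n for edge in weights for n in edge})
    w = {(i, j): math.inf for i in nodes for j in nodes if i != j}
    for edge, value in weights.items():
        i, j = tuple(edge)
        w[i, j] = w[j, i] = value
    best = dict(w)  # shortest path lengths using at most 1 edge
    for _ in range(q - 1):  # extend paths by one edge per pass, up to q edges
        longer = dict(best)
        for i, j in w:
            for k in nodes:
                if k != i and k != j:
                    longer[i, j] = min(longer[i, j],
                                       combine(best[i, k], w[k, j], r))
        best = longer
    kept = set()
    for edge, value in weights.items():
        i, j = tuple(edge)
        if value <= best[i, j] + 1e-9:  # no strictly shorter path found
            kept.add(edge)
    return kept

weights = {
    frozenset({"Smith", "Jones"}): 5,
    frozenset({"Smith", "Brown"}): 4,  # assumed assignment of 4 and 3
    frozenset({"Jones", "Brown"}): 3,
}
for r in (1, 2, math.inf):
    kept = pfnet(weights, r, q=2)
    print(r, frozenset({"Smith", "Jones"}) in kept)
```

With r = 1 the detour through Brown costs 7, with r = 2 it costs exactly 5 (a tie, so the edge stays), and with r = infinity it costs 4, so only then is Smith - Jones removed.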
Comparison of Techniques • MDS • Reduces dimensions / reveals clusters • 2D may be insufficient • measurement may not be Euclidean • SOM • robust • no guarantee of convergence/unique solution • Pathfinder • does not assume ratio data/triangle inequality • connections rather than positions are important • additional methodology needed for display
Comparison of Techniques • Similarities • Spatial models • Differences • use of visual space • semantic meaning • as related to data • research in progress
Graphical Display of Methodologies • MDS • assume that 2 dimensions are sufficient • x, y for each point already defined • SOM • grid defines the 2D surface • plot each label with the appropriate node • Pathfinder • only defines the nodes and links • need additional methodologies • Spring-embedder models • Kamada and Kawai (1989) • Fruchterman and Reingold (1991) • Davidson and Harel (1996)
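Because a Pathfinder network supplies only nodes and links, a layout step must assign coordinates. The sketch below is a simplified force-directed layout in the spirit of Fruchterman and Reingold (repulsion proportional to k^2/d, attraction proportional to d^2/k, with a cooling temperature); it is not a faithful reproduction of any of the cited algorithms, and the constants and the name `spring_layout` are assumptions.

```python
import math
import random

def spring_layout(nodes, edges, iterations=100, seed=0):
    """Place nodes in the unit square with a simplified spring embedder."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    k = math.sqrt(1.0 / len(nodes))  # ideal spacing for a unit-area frame
    temperature = 0.1                # maximum move per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes: magnitude k^2 / distance.
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
        # Attraction along each edge: magnitude distance^2 / k.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        # Move each node, capped by the temperature, and clamp to the frame.
        for n in nodes:
            length = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
            step = min(length, temperature)
            for axis in (0, 1):
                moved = pos[n][axis] + disp[n][axis] / length * step
                pos[n][axis] = min(1.0, max(0.0, moved))
        temperature *= 0.95  # cool so the layout settles
    return pos

# Hypothetical three-node network with two links.
nodes = ["Smith", "Jones", "Brown"]
edges = [("Smith", "Brown"), ("Jones", "Brown")]
layout = spring_layout(nodes, edges)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```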
Graphical Comparison of Three Methods • Data • Institute for Scientific Information • Arts and Humanities Database (AHCI) • 1988 - 1997 • 1.26 million records • Example: • Given Plato, find related authors • Interface described in IV 2000 Paper • CSNA 2000 Paper • (Lin, Buzydlowski, White)
25 Authors Co-cited with Plato: PLATO (4928), ARISTOTLE (1861), PLUTARCH (838), CICERO (699), HOMER (627), BIBLE (552), EURIPIDES (515), ARISTOPHANES (474), XENOPHON (459), AUGUSTINE (432), HERODOTUS (425), KANT-I (385), AESCHYLUS (374), SOPHOCLES (363), THUCYDIDES (363), OVID (334), HESIOD (325), DIOGENES-LAERTIUS (317), HEIDEGGER-M (312), DERRIDA-J (304), PINDAR (292), NIETZSCHE-F (278), HEGEL-GWF (264), VERGIL (259), AQUINAS-T (255)
300 Pair-wise co-citations • 1: PLATO AND ARISTOTLE - 1940 docs • 2: PLATO AND PLUTARCH - 872 docs • . . . • 300: VERGIL AND AQUINAS-T - 38 docs
Visualization allows for the revelation of intricate structure which cannot be absorbed in any other way...
[Figure: PFNet of 25 authors co-cited with Plato]
Conclusion • Slides available at: • faculty.cis.drexel.edu/~jbuzydlo/ • janb@drexel.edu
Bibliography • Chen, C., Information Visualization and Virtual Environments, 1999. • Cleveland, W. S., Visualizing Data, Hobart Press, 1993. • Davidson, R., Harel, D., Drawing Graphs Nicely Using Simulated Annealing, ACM Transactions on Graphics, 15(4): 301-31 (1996). • Fruchterman, T. M. J., Reingold, E. M., Graph Drawing by Force-Directed Placement, Software Practice and Experience, 21: 1129-64 (1991). • Kamada, T., Kawai, S., An Algorithm for Drawing General Undirected Graphs, Information Processing Letters, 31(1): 7-15 (1989).