
SCIENCE MAPS

SCIENCE MAPS. SI767 – W10 – Matthew P. Simmons. Overview. What are Science Maps? Various definitions. Usage and utility. What techniques are used? Types of techniques. Overview of certain techniques. What has been done on the topic? Review of papers. The small world of these readings.


Presentation Transcript


  1. SCIENCE MAPS SI767 – W10 – Matthew P. Simmons

  2. Overview • What are Science Maps? • Various definitions. • Usage and utility. • What techniques are used? • Types of techniques. • Overview of certain techniques. • What has been done on the topic? • Review of papers. • The small world of these readings.

  3. What are Science Maps? ...and why do we care?

  4. In the reading... Science Maps are ... • Topic models: detecting the finite number of themes/topics that characterize the content of a knowledge domain. • Scientometric analysis of the provenance of ideas: tracking the memetic flow in scientific literature by analyzing the pattern of citations and collaborations. • Models of the evolution of bibliometric networks: finding the parameters that generate networks with similar structure to citation and coauthorship networks, and determining why those parameters matter.

  5. Various techniques/approaches; common theme: Find the hidden ontologies that organize data. Organize scientific data by: • Topic: "Which articles are about the same thing as this one?" • Influence: "What are the five most important papers in topic modeling?" • Provenance: "Where are Bayesian inference theories coming from, and who is using them?" • "Hotness" of a field: "Where are the research dollars in NLP today?"

  6. Usage and Utility • Enhance the ability to navigate data. • Identify potential collaborators. • Determine the impact of an author or paper. • Identify the problems that need solving (and where the money is...) • Create tools to allow our allocation of attention to efficiently scale with the massive increase of data. • "Revealing implicit knowledge that is presently known only to domain experts..." (Shiffrin et al. 2004)

  7. What Techniques are Used?  Hint: Lots!

  8. Answer: Lots! • Support Vector Machines • Clustering • Latent Dirichlet Allocation • Latent Semantic Analysis • Mixture Models • Network modeling • Network analysis • Network visualization • Bibliometric analysis ...and that's just what's mentioned in Shiffrin et al.!

  9. In the readings... • Markov Chain Monte Carlo – Griffiths et al. • Latent Dirichlet Allocation – Blei et al. & Griffiths et al. • Pathfinder [Networks|Scaling] – Chen • Bayesian Networks – Blei et al. & Griffiths et al. • Bibliometric networks – Garfield, Börner et al. & Chen ...and more, but there is an easy way to group them.

  10. Two main categories. Statistical models/methods: • Markov Chain Monte Carlo method • Dirichlet Distribution • Bayesian Inference • Latent Dirichlet Allocation. Network models/methods: • Degree distributions • Centrality measures • Scale free networks • Small world networks • Clustering coefficients • Pathfinder scaling

  11. Stats Stuff Everyone got all that?

  12. Just kidding... Let's start with Markov Chains

  13. Markov Chains • A system that can exist in various states, where the components of that system change in discrete steps. • The changes are governed by transition probabilities that satisfy the Markov property. • The Markov property states that the state of the system at time n+1 depends on the state of the system at time n, but not on its state at any time < n. Hence the immediately previous state is the only important factor in determining the next state of the system. • Example: A random walk on the number line with an equal probability of moving +1 or -1 at each step. Largely from: http://en.wikipedia.org/wiki/Markov_chain
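The random-walk example on this slide can be sketched in a few lines of Python. This is a minimal illustration only; the function name `random_walk` and its parameters are made up for the sketch and do not come from the readings.

```python
import random

def random_walk(steps, start=0, seed=None):
    """Simulate a symmetric random walk on the integers (hypothetical helper)."""
    rng = random.Random(seed)
    position = start
    path = [position]
    for _ in range(steps):
        # Markov property: the next position depends only on the current one.
        position += rng.choice((-1, 1))
        path.append(position)
    return path

path = random_walk(steps=10, seed=0)
print(path)
```

Note that the loop never looks at earlier entries of `path`; only the current `position` is needed to draw the next state.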

  14. Monte Carlo method • A process that utilizes repeated random sampling to derive an approximated result. General Process: • Define a domain of possible inputs. • Generate inputs randomly from the domain using a certain specified probability distribution. • Perform a deterministic computation using the inputs. • Aggregate the results of the individual computations into the final result. From: http://en.wikipedia.org/wiki/Monte_carlo_method
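The four-step process above can be illustrated with the classic estimate of pi. This is a sketch under simple assumptions (uniform sampling on the unit square); `estimate_pi` is a name chosen for the example.

```python
import random

def estimate_pi(n_samples, seed=None):
    """Monte Carlo estimate of pi from points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Steps 1-2: the domain is the unit square, sampled uniformly.
        x, y = rng.random(), rng.random()
        # Step 3: deterministic computation on the sampled input.
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. The quarter circle covers pi/4 of the square.
    return 4.0 * inside / n_samples

pi_estimate = estimate_pi(100_000, seed=1)
print(pi_estimate)
```

The estimate tightens slowly: the error shrinks roughly with the square root of the number of samples.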

  15. "By our powers combined..." Markov Chain Monte Carlo Method • Includes Gibbs sampling, a widely used special case of MCMC. From: Dirichlet Processes, Chinese Restaurant Processes and All That, Michael I. Jordan 2005 http://www.cs.berkeley.edu/~jordan David MacKay, Information theory, inference, and learning algorithms (Cambridge, UK; New York: Cambridge University Press, 2003).
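To make the combination concrete, here is a minimal Gibbs sampler for a toy target: a standard bivariate normal with correlation `rho`. The target and all names are assumptions chosen for illustration, not the model used in the readings. Each update draws one coordinate from its conditional distribution given the other, so the sequence of samples forms a Markov chain whose long-run distribution is the target.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=None):
    """Gibbs sampling from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        # Conditionals of the bivariate normal: x | y ~ N(rho * y, 1 - rho^2).
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if step >= burn_in:
            samples.append((x, y))
    return samples

def sample_correlation(samples):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000, seed=0)
print(sample_correlation(samples))
```

The Monte Carlo part is the averaging over `samples`; the Markov chain part is that each draw depends only on the previous one.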

  16. Bayesian Paradigm From: Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic, Percy Liang and Dan Klein http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf

  17. Bayesian Inference Think of A as the topics in a document and B as the words observed. The goal is to infer the most probable topic distribution given the observed words. From: Robert Cowell, Introduction to inference for Bayesian networks, in Learning in graphical models, ed. Michael Jordan (MIT Press, 1999), 9-27.
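In symbols, Bayes' rule reads P(A | B) = P(B | A) P(A) / P(B). The sketch below applies it to a toy case with two candidate topics and one observed word; the topic names and every probability are made-up numbers for illustration.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses A given one observation B."""
    unnormalized = {a: prior[a] * likelihood[a] for a in prior}
    evidence = sum(unnormalized.values())  # P(B)
    return {a: p / evidence for a, p in unnormalized.items()}

# Hypothetical numbers: P(A) for two topics, and P(B = "gene" | A).
prior = {"genetics": 0.5, "neuroscience": 0.5}
likelihood = {"genetics": 0.08, "neuroscience": 0.02}

post = posterior(prior, likelihood)
print(post)  # "genetics" ends up with a posterior of about 0.8
```

Observing a word that is four times more likely under one topic shifts an even prior to roughly 80/20 in favor of that topic.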

  18. Latent Dirichlet Allocation • A generative document model. • Each document is composed of a number of words drawn from the topics that make up the document. • There is a probability distribution over topics for each document and a probability distribution over words for each topic. ...pictures help here...
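The generative story can be sketched directly: draw a topic mixture for the document, then for each word draw a topic and then a word from that topic. This is a toy illustration using only the standard library; the vocabulary, the two topic-word distributions, and the function names are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["gene", "dna", "neuron", "brain"]  # hypothetical vocabulary
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a made-up "genetics" topic
    [0.0, 0.0, 0.5, 0.5],  # a made-up "neuroscience" topic
]
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Inference runs this story in reverse: given only the words, recover plausible values for `theta` and `topic_word`.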

  19. Latent Dirichlet Allocation - Cont. From: David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res. 3 (2003): 993-1022.

  20. Dirichlet Distribution From: Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic, Percy Liang and Dan Klein http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf
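For reference, the standard density of a K-dimensional Dirichlet distribution with parameter vector alpha is given below. The symbols are the conventional ones and are not taken from the slide's figure.

```latex
\mathrm{Dir}(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
    \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad \theta_k \ge 0, \quad \sum_{k=1}^{K} \theta_k = 1 .
```

Small values of alpha_k (below 1) favor sparse mixtures concentrated on a few components, while large values favor mixtures close to uniform.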

  21. Network Analysis/Methods From: Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803

  22. Centrality • Betweenness: bridging; the number of shortest paths between other nodes that pass through a node. • Closeness: based on the average distance to other nodes. • Degree: number of edges of one type or another.
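The three measures can be computed on a small undirected graph with the standard library alone. This is a sketch under stated assumptions: the graph is unweighted and connected, closeness is reported as the inverse of the average distance, and betweenness uses Brandes' algorithm. The example graph (two triangles joined by the edge c-d) is invented for illustration.

```python
from collections import deque

def degree_centrality(graph):
    """Number of edges incident to each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def closeness_centrality(graph):
    """Inverse of the average shortest-path distance to reachable nodes."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness_centrality(graph):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    score = {node: 0.0 for node in graph}
    for s in graph:
        order, dist = [], {s: 0}
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical graph: triangles a-b-c and d-e-f bridged by the edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)
print(degree, closeness, betweenness)
```

In this graph the bridging nodes c and d each lie on six shortest paths between other node pairs, while the remaining nodes lie on none, which is the "bridging" intuition behind betweenness.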

  23. Pathfinder Scaling • Network scaling (edge reduction) method. • Generates a minimum spanning tree plus a parameter-tunable number of redundant edges. • Can use different metrics to determine which edges to prune, such as the Euclidean distance or edge weight.
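A minimal sketch of the most aggressive Pathfinder setting, usually written PFNET(r = infinity, q = n - 1), follows. Under that setting the length of a path is its largest edge weight, and an edge is pruned when some alternative path is strictly shorter in that sense. The function name, the tuple-keyed edge dictionary, and the example weights are assumptions for illustration; weights are treated as distances (smaller means more closely related).

```python
import math

def pathfinder_network(nodes, weights):
    """Edges kept by PFNET(r=inf, q=n-1) for an undirected weighted graph.

    weights maps (u, v) tuples to distances.
    """
    dist = {u: {v: math.inf for v in nodes} for u in nodes}
    for u in nodes:
        dist[u][u] = 0.0
    for (u, v), w in weights.items():
        dist[u][v] = dist[v][u] = w
    # Floyd-Warshall variant: minimize the maximum edge weight along a path.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via_k = max(dist[i][k], dist[k][j])
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    # Keep an edge only if no alternative path beats its direct weight.
    return {edge for edge, w in weights.items() if w <= dist[edge[0]][edge[1]]}

nodes = ["A", "B", "C", "D"]
weights = {("A", "B"): 1.0, ("B", "C"): 2.0, ("A", "C"): 3.0, ("C", "D"): 1.0}
kept = pathfinder_network(nodes, weights)
print(sorted(kept))
```

Here the edge A-C is dropped because the path A-B-C has a largest step of 2.0, which is below the direct weight of 3.0. Relaxing the r and q parameters keeps more of the redundant edges.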

  24. What's been done on the topic? What did we learn today, class?

  25. R. M. Shiffrin and K. Börner, Mapping knowledge domains, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5183-5185. • Overview of the field and the PNAS articles. • Important takeaway: There is a lot of research potential in this area, which benefits from interdisciplinary analysis involving a variety of techniques.

  26. T. L. Griffiths and Mark Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5228-5235. • LDA • Optimal Topics ~300 • LDA over PNAS abstracts 1991-2001 • ~3 million words total, ~20k terms in vocabulary • Contributions: • Hot and Cold topics • Topical clustering (heatmap) • Demonstrate that content analysis can reveal topics

  27. The word distribution within 10 topics. 30 randomly generated "documents" generated from the above 10 topics. Using LDA to derive the original topics from the observed documents.

  28. Griffiths et al. cont... Cold Topics.        Hot Topics!

  29. David M. Blei and John D. Lafferty, A correlated topic model of Science, The Annals of Applied Statistics 1, no. 1 (2007): 17-35 • Correlated Topic Models • Evolution of LDA • Introduces the notion that the proportions of the topics comprising a document are not necessarily independent. • Replaces the Dirichlet distribution with a logistic normal distribution, whose covariance matrix is a model parameter.
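In the correlated topic model, the prior that replaces the Dirichlet is the logistic normal: a Gaussian vector is drawn and then mapped onto the simplex. The notation below is the conventional one, with d indexing documents and K the number of topics.

```latex
\eta_d \sim \mathcal{N}(\mu, \Sigma),
\qquad
\theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})},
\quad k = 1, \dots, K .
```

Off-diagonal entries of the covariance matrix are what let two topics tend to appear together (positive covariance) or exclude each other (negative covariance), something a Dirichlet prior cannot express.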

  30. Blei et al. cont... Contributions: • CTM outperforms LDA when the number of topics is larger. • CTM also predicts words more accurately with less training data than LDA. • Both of these are credited to the effect of topic correlation on the distribution of topics in a document. • A science map! • The covariance matrix is used to create a graph where the topics are vertices and the edges represent some level of covariance.

  31. Blei et al.

  32. Katy Börner and Jeegar T. Maru and Robert L. Goldstone, The simultaneous evolution of author and paper networks, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5266-5273. • Bibliometric network encompassing coauthorship and citation. • Built to model the PNAS collaboration/citation network. • 2 node types: • Authors • Papers • Several edge types: • directional information flow between paper:author and paper:paper • author and coauthor

  33. Börner cont... 3 main parameters in the model: • Topics - i.e. scientific specializations • Aging - Meant to capture the bias to cite recent material. • Recursive Linking - The propensity to read the papers cited by the papers you have read. Iterative simulation that modeled the addition of new authors, the removal of old ones, coauthorship, and the propensity of authors to publish within their topic domain.
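The interaction between aging and preferential attachment can be illustrated with a toy growth simulation. This is not the authors' model: it ignores authors, topics, and recursive linking, and the weighting rule (citation count plus one, multiplied by an exponential decay in age) and all parameter names are assumptions chosen for the sketch.

```python
import math
import random

def grow_citation_network(n_papers, refs_per_paper=3, aging_scale=10.0, seed=None):
    """Toy citation network: popular papers attract citations, old ones fade."""
    rng = random.Random(seed)
    citations = [0]  # citations received so far; paper 0 is the seed paper
    edges = []       # (citing paper, cited paper)
    for new in range(1, n_papers):
        candidates = list(range(new))
        # Preferential attachment damped by an exponential aging factor.
        weights = [
            (1 + citations[old]) * math.exp(-(new - old) / aging_scale)
            for old in candidates
        ]
        k = min(refs_per_paper, len(candidates))
        cited = set()
        while len(cited) < k:
            cited.add(rng.choices(candidates, weights=weights)[0])
        for old in cited:
            citations[old] += 1
            edges.append((new, old))
        citations.append(0)
    return citations, edges

citations, edges = grow_citation_network(200, seed=0)
print(max(citations), len(edges))
```

Raising `aging_scale` weakens the recency bias, so early papers keep accumulating citations; lowering it concentrates citations on recent work.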

  34. Looks like a decent fit...

  35. Börner cont... Contributions: • Models the constraint of aging on preferential attachment in scale free network formation. • Models the "splintering" of science caused by specialization.

  36. Chaomei Chen, CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature, Journal of the American Society for Information Science and Technology 57, no. 3 (2006): 359-377. • Co-citation network. • Clusters labeled based on Kleinberg's burst detection algorithm (Kleinberg 2002). 9 steps: • Identify a knowledge domain, e.g. "mass extinction". • Automated data collection. Uses PubMed and Web of Science. • Find burst terms. CiteSpace II scrapes [1-4]-grams. • Time slicing. Generate time series views of the network. • Choose thresholds (intellectual bases & research fronts). • Graph scaling. Reduce edges to improve visual clarity without sacrificing critical visual data. • Layout. Typically a force directed layout to emphasize clustering. • Visual inspection. Tweak labels and display of metadata. • Verify pivot points.
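The co-citation network at the core of this pipeline links two references whenever a later paper cites both of them. A minimal sketch of the counting step is shown below; the paper and reference identifiers are placeholders, and this is not CiteSpace's own code.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count how often each pair of references is cited together."""
    counts = Counter()
    for refs in reference_lists.values():
        # Every unordered pair in one reference list is one co-citation.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

# Hypothetical citing papers and their reference lists.
reference_lists = {
    "paper1": ["R1", "R2", "R3"],
    "paper2": ["R1", "R2"],
    "paper3": ["R2", "R3"],
}
counts = cocitation_counts(reference_lists)
print(counts)
```

The resulting pair counts are the edge weights of the co-citation network; time slicing amounts to running the same count separately on the papers published in each time window.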

  37. Chen cont... Contributions: • Detecting research fronts, intellectual bases, and pivots. • Detecting trends in scientific research. •  Visualization of knowledge domain.

  38. Overview of CiteSpace II. Tree ring view of citations over time.

  39. Eugene Garfield, Historiographic Mapping of Knowledge Domains Literature, Journal of Information Science 30, no. 2 (April 1, 2004): 119-145. • Co-founder of scientometrics. • Bibliometric analysis and link tracking reveal impact of papers on a field. • Concept of local citation score and global citation score. • Adding group and time slicing to learn more about the effect of those slices on bibliometric records.
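A local citation score counts only the citations a paper receives from other papers inside the collection being studied, as opposed to citations from the literature at large. The sketch below assumes the collection is given as a mapping from each paper to the identifiers it cites; the function name and the example data are invented for illustration.

```python
def local_citation_scores(reference_lists):
    """Citations each paper receives from other papers in the same collection."""
    scores = {paper: 0 for paper in reference_lists}
    for citing, refs in reference_lists.items():
        for cited in set(refs):
            # References pointing outside the collection are ignored.
            if cited in scores and cited != citing:
                scores[cited] += 1
    return scores

# Hypothetical collection; "X" is a reference outside the collection.
collection = {
    "A": [],
    "B": ["A", "X"],
    "C": ["A", "B"],
}
local_scores = local_citation_scores(collection)
print(local_scores)
```

Comparing this within-collection count against a paper's overall citation count helps separate papers that matter to the field under study from papers that are simply cited widely.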

  40. Garfield cont... Contributions: • Bibliometrics... • Finding out which papers were important at a certain time vs. from a current perspective.

  41. Thank You Questions or Comments?

  42. Bonus Bibliography Slide! • Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35. doi: 10.1214/07-AOAS114 • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022. • Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803 • Börner, K., Maru, J. T., & Goldstone, R. L. (2004). The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences, 101(suppl_1), 5266-5273. doi: 10.1073/pnas.0307625100 • Börner, K. (2007). Making sense of mankind’s scholarly knowledge and expertise: collecting, interlinking, and organizing what we know and different approaches to mapping (network) science. Environment and Planning B: Planning and Design, 34(5), 808-825. doi: 10.1068/b3302t • Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. doi: 10.1002/asi.20317 • Cowell, R. (1999). Introduction to inference for Bayesian networks. In M. Jordan (Ed.), Learning in graphical models (pp. 9-27). MIT Press. • Garfield, E. (2004). Historiographic Mapping of Knowledge Domains Literature. Journal of Information Science, 30(2), 119-145. doi: 10.1177/0165551504042802 • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1), 5228-5235. doi: 10.1073/pnas.0307752101 • Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1613715.1613763 • Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569-16572. doi: 10.1073/pnas.0507655102 • Janssens, F., Glänzel, W., & De Moor, B. (2007). Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 360-369). San Jose, California, USA: ACM. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1281192.1281233

  43. Bibliography continued... • Jordan, M. I. (1999). Learning in graphical models. MIT Press. • Leicht, E. A., Clarkson, G., Shedden, K., & Newman, M. E. J. (2007). Large-scale structure of time evolving citation networks. arXiv:0706.0015. doi: 10.1140/epjb/e2007-00271-7 • MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge, UK; New York: Cambridge University Press. • Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. J. Am. Soc. Inf. Sci. Technol., 60(3), 571-580. • Shiffrin, R. M., & Börner, K. (2004). Mapping knowledge domains. Proceedings of the National Academy of Sciences, 101(suppl_1), 5183-5185. doi: 10.1073/pnas.0307852100 • Torres-Moreno, J., St-Onge, P., Gagnon, M., El-Bèze, M., & Bellot, P. (2009, May 1). Automatic Summarization System coupled with a Question-Answering System (QAAS). ArXiv e-prints. Retrieved January 11, 2010, from http://adsabs.harvard.edu/abs/2009arXiv0905.2990T • Zhu, D., & Porter, A. L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69, 495-506. doi: 10.1016/S0040-1625(01)00157-3
