Entity-Based Data Mining from Spatio-Temporal Events and Text Sources
Presentation at KD-D Program Review, Nov 18-19 2003
Padhraic Smyth, Sharad Mehrotra
Information and Computer Science, University of California, Irvine
{smyth, sharad}@ics.uci.edu
www.datalab.uci.edu
Project Participants
• Principal Investigators
  - Padhraic Smyth: Data mining
  - Sharad Mehrotra: Databases
• Collaborators
  - Mark Steyvers: Text and Author Modeling
• Postdoctoral Researchers
  - Michal Rosen-Zvi, Dmitri Kalashnikov
• Staff Programmer
  - Amnon Meyers: Information Extraction
• Students
  - PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
  - Undergraduates: Yan-Biao Boey, Momo Alhazzazi
• Acknowledgements
  - Steve Lawrence for CiteSeer data
Problem of Interest
• Intelligence Analysis today
  - Massive volumes/streams of data
  - Text (newswire, reports, etc.)
  - Web data
  - Transactions/events
• Central problems
  - Need flexible tools to support an analyst’s exploration of the data
  - Automatically focus an analyst’s attention on interesting parts of the data space
• Need new theories/methods/tools…
Entities and Events
• Entities = Individuals, groups, communities, organizations, etc.
• Events = Contacts, collaborations, meetings, products, etc.
• Working hypothesis
  - A large component of intelligence work is centered on entities and events
  - Extracting entity information from text streams and transaction data
  - Predicting entity behavior
  - Detecting groups of related entities
• Our broad goal
  - Develop next-generation data management, exploration, and analysis tools for entity-event data
Nodes = Entities = Biotech-Related Organizations; Edges = Events = Collaborations
Red indicates nodes selected by the data analyst as important
Algorithm determines blue nodes are important relative to red nodes (Oxford and Cambridge)
Research Issues
• Information extraction
• Data management tools
• Visualization techniques
• Interactive ad hoc querying and mining
• Statistical modeling of graph data
• Query languages for graphs
• Scalability to large graphs
• …
Focus of Our Research
[Diagram components: Text Sources; Information Extraction; Entity-Event Databases; Statistical Modeling and Data Mining; Visualization; Query Languages; User Modeling]
Major Themes in Our Work
• Focus on data in the form of graphs
  - Nodes = entities, edges = events
  - Nodes and edges have attributes (e.g., temporal)
  - Year 1: entities = computer science researchers
  - Year 1: limited spatio-temporal aspects
• Integration and coupling of
  - Statistical modeling and data mining
  - Visualization
  - Query languages and data management
• Scalability
  - Methods should scale to millions of nodes and edges
• User Interaction
  - Conditional “query-driven” analysis and mining
  - Contrast with offline global modeling
Accomplishments
• Infrastructure and Data Sets
  - Created testbed data sets, e.g., 100k entities, 400k events
  - Developed suite of text information extraction tools
  - Developed and released a general public-domain JAVA API for graph data analysis and visualization
• Statistical Modeling and Data Mining
  - Developed new statistical technique for modeling entities based on authored text
  - Developed new class of scalable algorithms for interactive graph-based data mining
Accomplishments
• Graph-based Querying
  - Developed framework for general graph-based query language
  - New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs
• Software Tools
  - Netsight: JAVA-based graph visualization and analysis tool
  - Browser tool for exploring author-topic models
  - Interactive query refinement system
  - Prototype system for graph-based query language for interacting with heterogeneous graph data
Publications in Year 1
• Data Mining on Graphs
  - S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
  - J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
  - P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003.
• Statistical Author-Topic Models
  - T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
  - M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003.
• Data Management and Graph Querying
  - Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
  - Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
  - D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004.
  - D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation).
  - L. Jin, C. Li, and S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), March 26-28, 2003, Kyoto, Japan.
Author Database Schema
• Note: “individual-centric” not “document-centric”
From graphs to Markov chains
• Importance = recursive function of the nodes pointing at you
• Markov approach…
  - Notion of a “token” circulating around in Markov fashion
  - Important actors see the token more often
  - Importance = stationary probability of each node
  - PageRank: surfer randomly following links on the Web
[Figure: example weighted directed graph on nodes A, B, C, D and the corresponding Markov chain with transition probabilities on its edges]
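The graph-to-Markov-chain step can be sketched in a few lines. This is a minimal illustration, not the project's code: the edge weights below are hypothetical (they are not read off the slide's figure), and the function names `transition_matrix` and `stationary_distribution` are made up for this example. Edge weights are row-normalized into transition probabilities, and the stationary probability of each node is approximated by power iteration.

```python
import numpy as np

def transition_matrix(weights):
    """Row-normalize nonnegative edge weights into transition probabilities."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every node needs at least one outgoing edge")
    return weights / row_sums

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Power iteration: push a distribution through the chain until it stops changing."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)          # start the "token" uniformly at random
    for _ in range(max_iter):
        nxt = pi @ P                  # one step of the random walk
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical weighted directed graph; rows are sources, columns are targets.
nodes = ["A", "B", "C", "D"]
W = [
    [0, 4, 0, 0],   # A -> B
    [0, 0, 3, 2],   # B -> C, B -> D
    [2, 0, 0, 0],   # C -> A
    [2, 0, 2, 0],   # D -> A, D -> C
]
P = transition_matrix(W)
pi = stationary_distribution(P)
for name, prob in sorted(zip(nodes, pi), key=lambda pair: -pair[1]):
    print(f"{name}: {prob:.3f}")
```

Power iteration only converges when the chain is irreducible and aperiodic, which holds for this toy graph; PageRank-style methods add a random restart to guarantee it in general.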
Relative importance of node V to A: Trade off [distance from A, structural importance of V]
Algorithms for Relative Importance (S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)
• PageRank with Priors (PRankP)
  - Random walks that start from A and return to A periodically
  - Relative importance = stationary probability
  - Iterative algorithm (e.g., Haveliwala, 2002)
• HITS with priors
  - Formulate HITS as Markov chain, same idea…
• K-Step Markov
  - Use the transient probability distribution starting from A
  - Faster than stationary probability methods
• Weighted Paths
  - Heuristic approximation to K-step Markov: even faster
• All algorithms scale linearly in number of edges
  - Different constant factors
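Two of the ideas above can be sketched as follows. This is an illustrative reading of the slide, not the published implementation: the transition matrix is a hypothetical four-node chain, `beta` (the probability of jumping back to the root set) and `k` are arbitrary choices, and the K-step variant here simply averages the walk's distribution over steps 1..k, which is an assumption about how the transient distribution is summarized.

```python
import numpy as np

def pagerank_with_priors(P, prior, beta=0.3, tol=1e-12, max_iter=1000):
    """Random walk that jumps back to the prior (root) nodes with probability beta."""
    P = np.asarray(P, dtype=float)
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()
    pi = prior.copy()
    for _ in range(max_iter):
        nxt = (1.0 - beta) * (pi @ P) + beta * prior
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

def k_step_markov(P, prior, k=3):
    """Average of the walk's distribution over steps 1..k, starting from the prior."""
    P = np.asarray(P, dtype=float)
    dist = np.asarray(prior, dtype=float)
    dist = dist / dist.sum()
    total = np.zeros_like(dist)
    for _ in range(k):
        dist = dist @ P               # distribution after one more step
        total += dist
    return total / k

# Hypothetical row-stochastic transition matrix over nodes A, B, C, D.
nodes = ["A", "B", "C", "D"]
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
])
prior = np.array([1.0, 0.0, 0.0, 0.0])    # root set = {A}

prp = pagerank_with_priors(P, prior, beta=0.3)
ksm = k_step_markov(P, prior, k=3)
print("PageRank with priors:", dict(zip(nodes, prp.round(3))))
print("K-step Markov (k=3): ", dict(zip(nodes, ksm.round(3))))
```

Each iteration is a single vector-matrix product, so with a sparse matrix the cost per iteration is proportional to the number of edges. The K-step variant stops after a fixed k products instead of iterating to convergence, which is consistent with the slide's remark that it is faster.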
Computation Times for Ranking Algorithms (in seconds)
• PRankP and HITS converged in 20-30 iterations
JUNG: Java Universal Network/Graph Framework
• http://jung.sourceforge.net
• 16,000 page visits
• 800 downloads since August
Can we model authors, given documents? (more generally, build statistical profiles of entities given sparse observed data)
[Figure labels: Authors; Words]
Model = Author-Topic distributions + Topic-Word distributions
• Parameters learned via Bayesian learning
[Figure labels: Authors; Hidden Topics; Words]
“Topic Model”
• A document can be generated from multiple topics
• Hofmann (SIGIR ’99); Blei, Jordan, Ng (JMLR, 2003)
[Figure labels: Hidden Topics; Words]
Model = Author-Topic distributions + Topic-Word distributions
• NOTE: documents can be composed of multiple topics
[Figure labels: Authors; Hidden Topics; Words]
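How the two sets of distributions combine can be illustrated with a small generative sketch. Everything concrete here is made up for illustration: the vocabulary, the two authors, and the probability tables `theta` (author-topic) and `phi` (topic-word) are hypothetical, and the assumption that each word picks one of the document's authors uniformly at random is the usual author-topic formulation rather than something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["bayesian", "inference", "prior", "query", "index", "retrieval"]
authors = ["author_a", "author_b"]          # hypothetical authors

# Author-topic distributions: theta[a, t] = P(topic t | author a).
theta = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

# Topic-word distributions: phi[t, w] = P(word w | topic t).
phi = np.array([
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],   # a "probabilistic inference" topic
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],   # an "information retrieval" topic
])

def generate_document(doc_authors, n_words, rng):
    """Sample (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        a = rng.choice(doc_authors)                # pick one of the document's authors
        z = rng.choice(phi.shape[0], p=theta[a])   # that author picks a hidden topic
        w = rng.choice(len(vocab), p=phi[z])       # the topic emits a word
        tokens.append((authors[a], int(z), vocab[w]))
    return tokens

doc = generate_document(doc_authors=[0, 1], n_words=12, rng=rng)
for author, topic, word in doc:
    print(f"{author:9s} topic={topic} word={word}")
```

Because the topic is chosen per word, a single document mixes topics, as the NOTE on the slide points out. Learning runs this process in reverse: given only the words and the author lists, it infers `theta` and `phi`.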
Topic Models from CiteSeer
• Topic
  - WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief…
  - AUTHORS: N. Friedman, D. Heckerman, Z. Ghahramani, D. Koller, M. Jordan, R. Neal, A. Raftery, T. Lukasiewicz, J. Halpern…
• Topic
  - WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback…
  - AUTHORS: D. Oard, W. Croft, K. Jones, P. Schauble, E. Voorhees, A. Singhal, D. Hawking, J. Allan, A. Smeaton, M. Hearst…
Topic Models from CiteSeer
• Topic
  - WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation…
  - AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R. Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella…
• Topic
  - WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets…
  - AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H. Liu, L. Holder, H. Toivonen…
Author-Topic Models from CiteSeer
• Author = A McCallum:
  - Topic 1: classification, training, generalization, decision, data…
  - Topic 2: learning, machine, examples, reinforcement, inductive…
  - Topic 3: retrieval, text, document, information, content…
• Author = H Garcia-Molina:
  - Topic 1: query, index, data, join, processing, aggregate…
  - Topic 2: transaction, concurrency, copy, permission, distributed…
  - Topic 3: source, separation, paper, heterogeneous, merging…
• Author = P Cohen:
  - Topic 1: agent, multi, coordination, autonomous, intelligent…
  - Topic 2: planning, action, goal, world, execution, situation…
  - Topic 3: human, interaction, people, cognitive, social, natural…
Author-Topic Browser
• Interesting scalability issues
  - CiteSeer model exceeds 1 Gbyte
  - Real-time query answering demands Gibbs sampling (not well suited to SQL!)
• Solution
  - Coupling of Gibbs sampling and relational DB (it works!)
[Diagram components: JAVA Query GUI; SQL Interface; Bayesian Sampling; MySQL DB; Original Text + Statistical Model]
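One way such a coupling could look is sketched below. The slides do not describe the schema or the sampler, so everything here is an assumption: the `topic_word` table, its columns, the toy probabilities, and both function names are hypothetical, and Python's built-in `sqlite3` stands in for MySQL. The idea shown is that SQL fetches only the slice of the stored model that a query needs, and the Gibbs sampling then runs in application code on that small slice.

```python
import random
import sqlite3

# Hypothetical model table: P(word | topic), stored relationally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topic_word (topic INTEGER, word TEXT, prob REAL)")
conn.executemany(
    "INSERT INTO topic_word VALUES (?, ?, ?)",
    [(0, "query", 0.5), (0, "index", 0.4), (0, "agent", 0.1),
     (1, "query", 0.1), (1, "index", 0.1), (1, "agent", 0.8)],
)

def fetch_word_probs(conn, words, n_topics):
    """SQL step: pull only the rows for the words that appear in the query."""
    distinct = sorted(set(words))
    placeholders = ",".join("?" for _ in distinct)
    sql = f"SELECT topic, word, prob FROM topic_word WHERE word IN ({placeholders})"
    probs = {}
    for topic, word, prob in conn.execute(sql, distinct):
        probs.setdefault(word, [0.0] * n_topics)[topic] = prob
    return probs

def infer_topic_mixture(words, probs, n_topics, alpha=0.1, sweeps=200, burn_in=50, seed=0):
    """Sampling step: Gibbs-sample topic assignments with the topic-word table held fixed."""
    rng = random.Random(seed)
    words = [w for w in words if w in probs]        # ignore words the model has not seen
    if not words:
        return [1.0 / n_topics] * n_topics
    z = [rng.randrange(n_topics) for _ in words]
    counts = [z.count(t) for t in range(n_topics)]
    kept = [0.0] * n_topics
    for sweep in range(sweeps):
        for i, w in enumerate(words):
            counts[z[i]] -= 1                       # remove the current assignment
            weights = [(counts[t] + alpha) * probs[w][t] for t in range(n_topics)]
            z[i] = rng.choices(range(n_topics), weights=weights)[0]
            counts[z[i]] += 1
        if sweep >= burn_in:                        # average counts after burn-in
            for t in range(n_topics):
                kept[t] += counts[t] / (sweeps - burn_in)
    total = len(words) + n_topics * alpha
    return [(kept[t] + alpha) / total for t in range(n_topics)]

query = ["query", "index", "index"]
probs = fetch_word_probs(conn, query, n_topics=2)
mixture = infer_topic_mixture(query, probs, n_topics=2)
print([round(p, 3) for p in mixture])
```

The split keeps the large model in the database and the iterative, stateful sampling loop outside it, which is the kind of computation the slide flags as a poor fit for SQL.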