Current Research in Data Mining Research Group
Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
November 8, 2014
Outline
• An Introduction to Data Mining Research Group
• Mining and OLAPing Information Networks
• Mining Heterogeneous Information Networks
• Mining Text-Rich Information Networks
• OLAPing (multi-dimensional analysis) of information networks: TextCube, OLAP of heterogeneous networks
• Taming the Web: WINACS (integrated mining of Web structures and contents)
• Mining Cyber-Physical Systems and Networks
• Conclusions
Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC
• Mining patterns and knowledge discovery from massive data
• Data mining in heterogeneous information networks
• Exploring broad applications of data mining
• Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
• 600+ research papers in conferences and journals
• Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. Wallace McDowell Award, Daniel Drucker Eminent Faculty Award
• Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
• Project lead for NASA EventCube for Aviation Safety [2008-2012]
• Director of the Information Network Academic Research Center, funded by the Army Research Lab (ARL) [2009-2014]
New Books on Data Mining & Link Mining
• Sun and Han, Mining Heterogeneous Information Networks, 2012
• Han, Kamber and Pei, Data Mining, 3rd ed., 2011
• Yu, Han and Faloutsos (eds.), Link Mining, 2010
Mining Heterogeneous Information Networks
• RankClus/NetClus vs. RankCompete: A Competing Random Walk Model for Rank-Based Clustering
• RankClass [KDD’11]
• Knowledge Propagation in Heterogeneous Networks
Similarity Search and Role Discovery in Information Networks
• PathSim [VLDB’11]: Meta Path-Guided Similarity Search in Networks
• Example query: which images are most similar to me in Flickr?
• Example meta paths: ITI and ITIGITI
• Role Discovery in Information Networks [KDD’10]
• [Figure: a “dirty” information network (imaginary) and the cleaned/inferred adversarial network; roles such as Chief, Cell Lead and Insurgent are inferred automatically]
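The idea behind meta path-guided similarity can be illustrated with a small example. This is a minimal sketch, assuming the PathSim measure takes the form s(x, y) = 2·|paths x→y| / (|paths x→x| + |paths y→y|) for a symmetric meta path, and assuming the meta path ITI stands for image–tag–image. The image and tag names are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical toy data: image -> tags attached to it (not from the talk).
image_tags = {
    "img_a": ["beach", "sunset"],
    "img_b": ["beach", "sunset", "surf"],
    "img_c": ["desert"],
}

def path_count(x, y, links):
    """Number of x - tag - y path instances (assumed meta path I-T-I)."""
    cx, cy = Counter(links[x]), Counter(links[y])
    return sum(cx[t] * cy[t] for t in cx)

def pathsim(x, y, links):
    """PathSim under a symmetric meta path: 2*M[x][y] / (M[x][x] + M[y][y])."""
    denom = path_count(x, x, links) + path_count(y, y, links)
    return 2 * path_count(x, y, links) / denom if denom else 0.0

def top_k(query, links, k=2):
    """Rank all other objects by their PathSim score to the query object."""
    scores = [(pathsim(query, o, links), o) for o in links if o != query]
    return sorted(scores, reverse=True)[:k]
```

The normalization by self-path counts keeps heavily tagged objects from dominating the ranking purely because they have many links.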
Meta-Paths & Their Prediction Power
• List all the meta-paths in the bibliographic network up to length 4
• Investigate their respective power for coauthor relationship prediction
• Which meta-path has more prediction power?
• How to combine them to achieve the best quality of prediction?
Relationship Prediction in Heterogeneous Info Networks
• Why prediction of co-author relationships in DBLP?
• Prediction of relationships between different types of nodes in heterogeneous networks, e.g., what papers should Faloutsos write?
• Traditional link prediction: homogeneous networks, e.g., co-author networks in DBLP, friendship networks in Facebook
• Relationship prediction: study the roles of topological features in heterogeneous networks in predicting the building of co-author relationships
• Meta-path guided prediction!
• Y. Sun, et al., “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM’11, July 2011
Guidance: Meta Path in Bibliographic Network
• Relationship prediction: meta path-guided prediction
• Meta path relationships among similar typed links share similar semantics and are comparable and inferable
• Co-author prediction (A—P—A) using topological features also encoded by meta paths, e.g., citation relations between authors (A—P→P—A)
• [Figure: bibliographic network schema. Node types: venue, paper, author, topic. Relations and their inverses: publish/publish⁻¹, write/write⁻¹, mention/mention⁻¹, cite/cite⁻¹, contain/contain⁻¹]
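A topological feature encoded by a meta path can be as simple as the number of path instances between two authors. The following is a minimal sketch, assuming a toy network with only write and cite relations; all author and paper names are hypothetical, and path count is just one possible feature, not necessarily the one used in the cited study.

```python
from collections import defaultdict

# Hypothetical toy bibliographic network (illustration only).
write = {"ann": ["p1", "p2"], "bob": ["p2"], "cat": ["p3"]}  # author -> papers
cite = {"p1": ["p3"], "p2": ["p3"], "p3": []}                # paper -> cited papers

def inverse(rel):
    """Build the inverse relation, e.g. write^-1 maps paper -> authors."""
    inv = defaultdict(list)
    for src, targets in rel.items():
        for dst in targets:
            inv[dst].append(src)
    return dict(inv)

def path_counts(start, relations):
    """Count path instances from `start` along a meta path given as a list of relations."""
    frontier = {start: 1}
    for rel in relations:
        nxt = defaultdict(int)
        for node, n in frontier.items():
            for dst in rel.get(node, []):
                nxt[dst] += n
        frontier = dict(nxt)
    return frontier

written_by = inverse(write)
apa = [write, written_by]         # A-P-A: authors sharing a paper
appa = [write, cite, written_by]  # A-P->P-A: author cites a paper of another author
```

Here `path_counts("ann", appa)` counts how many times ann's papers cite papers by each other author; such counts can then serve as input features to a prediction model.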
Case Study in CS Bibliographic Network
• The learned significance of each meta path under the measure “normalized path count” for the HP-3hop dataset
Case Study: Predicting Concrete Co-Authors
• High-quality predictive power for such a difficult task
• Using data in T0 = [1989; 1995] and T1 = [1996; 2002]
• Predict new coauthor relationships in T2 = [2003; 2009]
iTopicModel: Model Set-Up & Objective Function
• Graphical model: ϴi = (ϴi1, ϴi2, …, ϴiT) is the topic distribution for document xi
• Structural layer: follows the same topology as the document network
• Text layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(ϴi), then pick a word w ~ multi(βz)
• X: observed text information
• G: document network
• Parameters: ϴ (topic distribution), β (word distribution)
• ϴ is the most critical; it needs to be consistent with the text as well as the network structure
• Objective function: joint probability, with a structure part and a text part (can model them separately!)
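The text layer's generative step (pick a topic, then pick a word) can be sketched directly. This is a minimal illustration of the PLSA-style text part only; the vocabulary and the parameter values for ϴi and β are hypothetical, and the structural layer over the document network is not modeled here.

```python
import math
import random

# Hypothetical parameters for illustration: 2 topics over a 4-word vocabulary.
vocab = ["mining", "network", "query", "index"]
beta = [[0.5, 0.4, 0.05, 0.05],   # beta[z][w]: word distribution of topic z
        [0.05, 0.05, 0.5, 0.4]]
theta_i = [0.7, 0.3]              # topic distribution of one document x_i

def generate_document(theta, beta, length, rng):
    """Text layer: for each word, draw z ~ multi(theta), then w ~ multi(beta[z])."""
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        words.append(vocab[w])
    return words

def log_likelihood(words, theta, beta):
    """log p(words | theta, beta) = sum over words of log(sum_z theta[z] * beta[z][w])."""
    total = 0.0
    for word in words:
        w = vocab.index(word)
        total += math.log(sum(theta[z] * beta[z][w] for z in range(len(theta))))
    return total

doc = generate_document(theta_i, beta, length=8, rng=random.Random(0))
```

The `log_likelihood` function corresponds to the text part of the joint probability; a full model would add a term that ties each ϴi to the topic distributions of its neighbors in G.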
Probabilistic Topic Models with Network-Based Biased Propagation
• Text-rich heterogeneous information network
• Ubiquitous textual documents (news, papers)
• Connect with users and other objects: topic propagation
• Deng, Han, et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11
• How to discover latent topics and identify clusters of multi-typed objects simultaneously?
• How can text data and heterogeneous information networks mutually enhance each other in topic modeling and other text mining tasks?
Biased Topic Propagation
Intuition:
• InfoNet provides valuable information
• Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
• Treat documents with rich text and other objects without explicit text in different ways
• Topic(D) ← inherent text + connected U; Topic(U) ← connected D
Basic Criterion (Biased Topic Propagation):
• The topic of an object without explicit text depends on the topics of the documents it connects to
• The topic of a document is correlated with its connected objects to some extent, but should be principally determined by the inherent content of its text
• A simple and unbiased topic propagation does not make much sense
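The basic criterion can be made concrete with one propagation step. This is only a sketch under stated assumptions: an object without text takes the average topic distribution of its connected documents, and a document mixes its own text-based topic distribution with that of its connected objects through a bias weight `xi`. The weight `xi`, the linear mixing form, and the toy data are all hypothetical and are not taken from the cited paper.

```python
def average(vectors):
    """Element-wise mean of equal-length topic vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def propagate(doc_text_topics, doc_users, xi=0.8):
    """One biased propagation step.

    doc_text_topics: doc -> topic distribution estimated from its own text
    doc_users: doc -> users connected to it
    xi: hypothetical bias weight; larger values trust the document's own text more
    """
    # Topic(U): users have no text, so they inherit from their connected documents.
    user_docs = {}
    for d, users in doc_users.items():
        for u in users:
            user_docs.setdefault(u, []).append(d)
    user_topics = {u: average([doc_text_topics[d] for d in ds])
                   for u, ds in user_docs.items()}

    # Topic(D): mostly inherent text, adjusted by the connected users.
    doc_topics = {}
    for d, text_topic in doc_text_topics.items():
        users = doc_users.get(d, [])
        if not users:
            doc_topics[d] = list(text_topic)
            continue
        neighbor = average([user_topics[u] for u in users])
        doc_topics[d] = [xi * t + (1 - xi) * n for t, n in zip(text_topic, neighbor)]
    return doc_topics, user_topics

# Hypothetical toy input: two documents sharing one user.
docs = {"d1": [0.9, 0.1], "d2": [0.5, 0.5]}
links = {"d1": ["u1"], "d2": ["u1"]}
doc_topics, user_topics = propagate(docs, links)
```

Setting `xi` to 0.5 would weight a document's own text and its neighbors equally, which is the unbiased propagation the slide argues against.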
Incorporating Heterogeneous Info. Network
• L(C): topic model
• R(G): biased propagation
Experiments: DBLP & NSF Awards
• Data collection: DBLP, NSF-Awards
• Metrics: accuracy (AC), normalized mutual information (NMI)
• Results
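For readers unfamiliar with the NMI metric, here is a minimal sketch of how it can be computed for two labelings of the same objects. It assumes normalization by the geometric mean of the two entropies; other normalizations exist, and the talk does not say which variant was used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same objects."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())

def nmi(truth, predicted):
    """NMI = I(truth; predicted) / sqrt(H(truth) * H(predicted))."""
    h_t, h_p = entropy(truth), entropy(predicted)
    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling puts everything in one cluster.
        return 1.0 if h_t == h_p else 0.0
    return mutual_information(truth, predicted) / math.sqrt(h_t * h_p)
```

NMI is invariant to how cluster labels are named, so a clustering that matches the ground truth up to a relabeling scores 1.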
Event Cube: An Overview
• Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
• Funded by NASA (2008-2010)
• Analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison
• [Figure: a multidimensional text database is organized into an Event Cube representation with Topic, Time and Location dimensions, explored by an analyst through drill-down and roll-up. Topic examples: turbulence, encounter birds, undershoot, overshoot, deviation. Time examples: 1998, 1999, 98.01, 98.02, 99.01, 99.02. Location examples: CA, FL, TX, LAX, SJC, MIA, AUS]
Text/Topic Cube: General Idea
• Heterogeneous: categorical attributes + unstructured text
• How to combine?
• Our solution: Cube (categorical attributes) with Text/Topic Model as the measure (unstructured text)
• [Figure: event report table with categorical columns ACN, Time, Location, Place, Environment, … and a text data column (event report)]
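The combination of categorical dimensions with a text measure can be sketched as follows. This is a minimal illustration, assuming the measure of each cell is a plain term-frequency vector and that roll-up merges the measures of child cells; the reports and the airport-to-state hierarchy are hypothetical sample data, and a topic-model measure would replace the term counts in a fuller version.

```python
from collections import Counter, defaultdict

# Hypothetical event reports: (year, airport, report text). Not real data.
reports = [
    ("1998", "LAX", "turbulence on approach"),
    ("1998", "SJC", "bird strike on approach"),
    ("1999", "MIA", "overshoot during landing"),
]
# Assumed location hierarchy used for roll-up (airport -> state).
airport_state = {"LAX": "CA", "SJC": "CA", "MIA": "FL"}

def build_cube(rows):
    """Base cells keyed by (time, location); the measure is a term-frequency vector."""
    cube = defaultdict(Counter)
    for year, airport, text in rows:
        cube[(year, airport)].update(text.split())
    return cube

def roll_up_location(cube, hierarchy):
    """Roll up the location dimension by merging the text measures of child cells."""
    rolled = defaultdict(Counter)
    for (year, airport), terms in cube.items():
        rolled[(year, hierarchy[airport])].update(terms)
    return rolled

base = build_cube(reports)
by_state = roll_up_location(base, airport_state)
```

Drill-down is the reverse navigation: starting from a state-level cell such as `("1998", "CA")`, the analyst moves back to the airport-level cells in `base`.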
Effective Keyword Search: TopCells System
• Example query “Healthcare Reform” returns cells such as (Person: Obama, Year: 2010), (Org: Congress, Year: 2010), (Person: Hillary, Year: 2008), …
• TopCells (ICDE’10): ranking aggregated cells (objects) in TextCube
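Ranking aggregated cells for a keyword query can be sketched in a few lines. This is an illustration under simplifying assumptions: a cell's score is the average relevance of the documents it aggregates, and relevance is a count of matching query terms. Both choices, the function names, and the sample documents are hypothetical and should not be read as the scoring used by TopCells.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical documents with categorical attributes and text (illustration only).
docs = [
    ({"Person": "Obama", "Year": "2010"}, "healthcare reform bill signed"),
    ({"Person": "Obama", "Year": "2010"}, "healthcare reform debate"),
    ({"Person": "Hillary", "Year": "2008"}, "healthcare plan proposed"),
    ({"Person": "Obama", "Year": "2008"}, "election campaign rally"),
]

def relevance(text, query):
    """Toy relevance: how many query terms appear in the document."""
    words = set(text.split())
    return sum(1 for term in query.split() if term in words)

def cells_of(attrs):
    """All non-empty attribute subsets a document belongs to (its cells)."""
    items = sorted(attrs.items())
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)

def top_cells(documents, query, k=3):
    """Rank cells by the average relevance of the documents they aggregate."""
    scores = defaultdict(list)
    for attrs, text in documents:
        rel = relevance(text, query)
        for cell in cells_of(attrs):
            scores[cell].append(rel)
    ranked = sorted(scores.items(),
                    key=lambda kv: (-sum(kv[1]) / len(kv[1]), kv[0]))
    return [(cell, sum(v) / len(v)) for cell, v in ranked[:k]]
```

Calling `top_cells(docs, "healthcare reform")` returns cells rather than individual documents, which mirrors the slide's example output of attribute combinations.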
Effective OLAP Exploration: TEXplorer System
• Example query “Healthcare Reform”: Top-1 dimension: Person; Top-2 dimension: Org; Top-3 dimension: Time (2010, 2008, 2004)
• TEXplorer (submitted): integrating keyword-based ranking and OLAP exploration
Effective Event Tracking: Popular Event Tracking System
• Example event “Healthcare Reform”, tracked by popularity and content over time
• Feb 2010: debate, cost, senate, …
• Mar 2010: pass, success, law, …
• Apr 2010: benefit, profit, effective, …
• PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
Mapping Pages to Records (CIKM’10)
• Database records can be found on link paths!
WinaCS: Web Information Network Analysis for Computer Science
• Integration of Web structure mining and information network analysis
• Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD’11 (system demo), Athens, Greece, June 2011
Discovery of Swarms and Periodic Patterns in Moving Object Data
• A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)
• Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
• Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
• [Figures: bird flying paths shown on Google Earth; periodic patterns mined by our new method; Swarm discovers more patterns, whereas Convoy discovers only restricted patterns]
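The contrast between the two pattern types can be sketched with a small checker. This is a minimal illustration, assuming a swarm requires a group of objects to share a cluster at some minimum number of timestamps that need not be consecutive, while a convoy requires consecutive timestamps. The thresholds `min_o` and `min_t`, the integer timestamps, and the snapshot data are hypothetical, and the code only tests a given candidate group rather than mining all patterns.

```python
def together_times(objects, clusters_by_time):
    """Timestamps at which all `objects` fall in the same cluster."""
    group = set(objects)
    return [t for t, clusters in sorted(clusters_by_time.items())
            if any(group <= set(c) for c in clusters)]

def is_swarm(objects, clusters_by_time, min_o, min_t):
    """Swarm: enough objects together at >= min_t timestamps, consecutive or not."""
    times = together_times(objects, clusters_by_time)
    return len(set(objects)) >= min_o and len(times) >= min_t

def is_convoy(objects, clusters_by_time, min_o, min_t):
    """Convoy: the objects must stay together for >= min_t consecutive timestamps."""
    if len(set(objects)) < min_o:
        return False
    run = best = 0
    prev = None
    for t in together_times(objects, clusters_by_time):
        run = run + 1 if prev is not None and t == prev + 1 else 1
        best = max(best, run)
        prev = t
    return best >= min_t

# Hypothetical clustering snapshots: timestamp -> list of clusters of object ids.
snapshots = {
    1: [{"o1", "o2"}, {"o3"}],
    2: [{"o1"}, {"o2", "o3"}],
    3: [{"o1", "o2", "o3"}],
    4: [{"o1", "o2"}, {"o3"}],
}
```

In this toy data, o1 and o2 are together at timestamps 1, 3 and 4, so with `min_t=3` they qualify as a swarm but not as a convoy, because the gap at timestamp 2 breaks the consecutive run.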
GeoTopic Discovery: Mining Spatial Text
• Geo-tagged photos with landscape (coast vs. desert vs. mountain)
• [Figure panels: LDM, TDM, GeoFolk, LGTA]
• Z. Yin, et al., GeoTopic Discovery and Comparison, WWW’11
Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks
• Most data objects are linked, forming heterogeneous information networks
• Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
• Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
• Structures can be progressively mined from less organized data sets by info. network analysis
• Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
• Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …
• It is promising to mine data semantics from rich info. networks!
References for the Talk
J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks” (tutorial), KDD’10.
M. Ji, J. Han, and M. Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD’11.
Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09.
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB’11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM’11.
C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD’10.
T. Weninger, M. Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD’11 (system demo).
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6), 2008.