Exploring Big Data: KDD2012 Conference Insights

KDD2012 & Dragon Star Course (龙星课程): Machine Learning. Wei Hu (胡伟), Gong Cheng (程龚) & Chuanlei Ni (倪传蕾), whu@nju.edu.cn

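Outline • KDD2012 summer school • KDD2012 main conference • Dragon Star summer course: Machine Learning • Kai Yu (于凯) [NEC] & Tong Zhang (张潼) [Rutgers Univ., USA] • Tsinghua University / Institute of Computing Technology, CAS / Graduate University of CAS • August 6–10 Websoft, Nanjing Univ. [http://ws.nju.edu.cn]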

KDD2012 • 1,200+ attendees at the conference, 356 at the summer school • Acceptance rate: 133 / 734 = 18.1% • KDD2013 will be held in Chicago, USA • Grand theme: Mining the Big Data

Big data • Big data are high-volume, high-velocity and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. [Gartner, 2012] • The 3 Vs: • Volume – amount of data (large) • Velocity – speed of data in and out (real-time) • Variety – range of data types and sources (diverse)

Big data • Technologies include: [McKinsey, 2011] • A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualization • Massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems • Data science & engineering

KDD2012 summer school • Applications • Jiawei Han: Mining heterogeneous information networks • Bing Liu: Modeling opinions and beyond in social media • Analytics • Christos Faloutsos: Large graph mining • Jure Leskovec: Methods for mining social media and networks • Algorithms • Ravi Kumar: Two computational paradigms for big data • Haixun Wang: Managing and mining billion-node graphs

Summer school • Jiawei Han • Meta-paths in heterogeneous information networks • Ravi Kumar • Counting distinct elements in a stream • Jure Leskovec • Kronecker graphs • Software / tools • Neo4j (NoSQL graph database) • PEGASUS (Hadoop + graph mining)
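As a rough illustration of the "counting distinct elements in a stream" topic, here is a minimal sketch of a k-minimum-values (KMV) estimator. This is one standard streaming technique and is not necessarily the method presented in the tutorial. The function names, the use of MD5 as the hash, and the default k = 256 are all assumptions made for this example. The idea: hash every item to a roughly uniform value in [0, 1], keep only the k smallest distinct hash values, and estimate the distinct count as (k - 1) divided by the k-th smallest value.

```python
import hashlib
import heapq


def _hash01(item: str) -> float:
    """Map an item to a roughly uniform value in [0, 1] via MD5 (an arbitrary choice)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k: int = 256) -> float:
    """Estimate the number of distinct items from the k smallest hash values (k >= 2)."""
    heap = []        # max-heap (values negated) holding the k smallest hashes seen
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in stream:
        h = _hash01(str(item))
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: the count is exact
    return (k - 1) / -heap[0]    # KMV estimate from the k-th smallest hash
```

Memory use is O(k) regardless of stream length, and the relative error shrinks roughly as 1/sqrt(k), so k trades memory for accuracy.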

KDD2012 main conference • Best paper • Searching and mining trillions of time series subsequences under dynamic time warping • Best student papers • Integrating meta-path selection with user-guided object clustering in heterogeneous information networks • Intrusion as (anti)social communication: Characterization and detection • Best video award • A 30-second video advertisement program

Word cloud from paper titles

KDD2012 keynotes • Robin Li: Nine real hard problems we’d like you to solve • Jiawei Han: Mining heterogeneous information networks: The next frontier • Michael Jordan: Divide-and-conquer and statistical inference for big data • Michael Kearns: Experiments in social computation (and the data they generate)

Keynotes and talks • Robin Li • Image search page views surpassed web search last year • Open Publishing Platform • Search box, user input box • Queries, user-generated content • Search, classification and handling of user intent • Semantic Web is useful for recommendation, not directly for search. • Michael Kearns • Local communication • Collectively “compute” global solutions • Wei-Ying Ma • Query-centric search: organizing query results as a magazine

CrowdKDD workshop • Pomics • For users: comic creation • For the system: image ranking, annotation, … • Some related thoughts • Crowdsourcing SW tasks • Typical cost: $200 • Automatic recommendation in SView • User-Type-View (feature)

Papers • Exemplary papers • PageRank on an Evolving Graph (how to conduct research on evolution problems) • Integrating Meta-Path Selection with User Guided Object Clustering in Heterogeneous Information Networks (how to vertically combine multiple techniques) • Problems worth attention • RolX: Structural Role Extraction & Mining in Large Graphs • Related to research on summarization • Entity-Centric Topic-Oriented Opinion Summarization in Twitter • Estimating Entity Importance via Counting Set Covers

Misc. • datatang.com

Panel • What is the nature of Big Data? What are the Big Data problems that you have encountered? Is this a long-term challenge or a short-term fad? • What opportunities and challenges does data mining face on Big Data? • What are effective Big Data solutions? What platforms, sampling solutions, and applications are most effective for handling Big Data? • Panelists • M.I. Jordan, C. Faloutsos, U. Fayyad • Jiawei Han, Wen Gao, xxx (instead of H. Wang, also from MSRA)

Dragon Star summer course (龙星暑期课程): Machine Learning

Learning theory • Classification: we want to construct a prediction function f from training data that minimizes future loss • Learning algorithm: minimize training error • What we are interested in: how close the test error is to the training error, and how to justify this theoretically • Chernoff bound (Hoeffding’s inequality)
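To make the Hoeffding bound concrete: for a single classifier f fixed independently of the n i.i.d. samples, with 0-1 loss, Hoeffding's inequality gives P(|test error - training error| >= eps) <= 2 exp(-2 n eps^2). The sketch below evaluates that bound and inverts it to get a sample size. The function names are hypothetical, and the result covers only one fixed f; a classifier chosen by minimizing training error needs an additional union bound or uniform-convergence argument, which is not shown here.

```python
import math


def hoeffding_bound(n: int, eps: float) -> float:
    """Upper bound on P(|test error - training error| >= eps) for one fixed classifier."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For example, `hoeffding_sample_size(0.05, 0.05)` returns 738: about 738 samples suffice for the training error of a fixed classifier to be within 0.05 of its test error with probability at least 95%.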

Model selection & combination • Many algorithms • Which one to choose? (Model selection) • How to combine for better performance? (Model combination) • Combination • Equally weighted averaging (voting) • Exponentially weighted model averaging • Weight optimization using stacking • Bagging • Additive model and boosting • K-fold cross-validation
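The difference between equally weighted and exponentially weighted averaging can be sketched as follows. This is a minimal illustration, not the formulation used in the course: it assumes each model has a validation loss (for instance from K-fold cross-validation) and weights model i in proportion to exp(-eta * loss_i). The names `exp_weights`, `combine` and the parameter `eta` are hypothetical.

```python
import math


def exp_weights(losses, eta=1.0):
    """Weights proportional to exp(-eta * loss), normalized to sum to 1."""
    best = min(losses)  # shift by the minimum loss for numerical stability
    raw = [math.exp(-eta * (loss - best)) for loss in losses]
    total = sum(raw)
    return [r / total for r in raw]


def combine(predictions, weights):
    """Weighted average of per-model predictions (one score per model)."""
    return sum(w * p for w, p in zip(weights, predictions))
```

Setting `eta=0` recovers equally weighted averaging, while a large `eta` concentrates nearly all weight on the model with the lowest loss, which approaches plain model selection.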

Regularization • The objective: empirical loss + regularization • The regularization term is usually the L2 norm, but often the L1 norm for sparse models • The empirical loss can be hinge loss, logistic loss, smooth hinge loss, … or your own invention • L_p-regularization (p=0: sparsity; p=1: Lasso; p=2: ridge regression) • Model complexity vs. feature selection
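The "empirical loss + regularization" objective can be written out for a linear classifier as a short sketch. Everything here is an assumption made for illustration: labels are in {-1, +1}, the empirical loss is averaged over examples, the p=2 penalty is the squared L2 norm (as in ridge regression), and the function names and the default `lam=0.1` are hypothetical.

```python
import math


def hinge(margin):
    """Hinge loss on the margin y * <w, x>."""
    return max(0.0, 1.0 - margin)


def logistic(margin):
    """Logistic loss on the margin y * <w, x>."""
    return math.log1p(math.exp(-margin))


def penalty(w, p):
    """L_p-style penalty: nonzero count (p=0), sum |w_j| (p=1), sum w_j^2 (p=2)."""
    if p == 0:
        return float(sum(1 for wj in w if wj != 0))
    return sum(abs(wj) ** p for wj in w)


def objective(w, X, y, lam=0.1, p=2, loss=hinge):
    """Mean empirical loss over (x_i, y_i) plus lam times the penalty on w."""
    margins = [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]
    empirical = sum(loss(m) for m in margins) / len(margins)
    return empirical + lam * penalty(w, p)
```

Swapping `loss=logistic` or changing `p` changes only one term of the objective, which is why the choice of loss and the choice of regularizer can be discussed separately.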

Deep learning (feature learning) • Image/video pixels => Hand-designed feature extraction => Trainable classifier => Object class • Features are not learned; the trainable classifier is often generic • Where next? Better classifiers, or keep building more features? • “…well-known depth-breadth tradeoff in circuits design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures…” [Bengio & LeCun 2007] • Some papers about text processing • Semi-supervised Learning of Compact Document Representations with Deep Networks [ICML 2008] • Deep Unsupervised Feature Learning for Natural Language Processing [HLT 2012] • A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning [ICML 2008]

Thanks. Comments?
