Practical Lessons of Data Mining at Yahoo! Presenter: Jun-Yi Wu Authors: Ye Chen, Dmitry Pavlov, Pavel Berkhin, Aparna Seetharaman, Albert Meltzer 國立雲林科技大學 National Yunlin University of Science and Technology CIKM 2009
Outline • Motivation • Objective • Experience • Conclusion • Comments
Motivation • The usage of data in many commercial applications has been growing at an unprecedented pace in the last decade. • While successful data mining efforts lead to major business advances, there were also numerous, less publicized efforts that failed for one reason or another. [Figure: Raw Data and Information]
Objective • To discuss practical lessons based on the authors' years of data mining experience at Yahoo! • To offer insights into how to drive a data mining effort to success in a business environment. • To reflect on four success factors: methodology, data, infrastructure, and people.
Success Factors • Methodology • Data • A Data-driven Perspective • Data Preprocessing • Data Size and Sampling • Data Distribution • Data Understanding • Modeling Goals and Evaluation
Success Factors • Infrastructure • An infrastructure for Web-scale Data • Gridification • The Scalability Dilemma • People • Engaging the Wider Community
Success Factors • Methodology • Many companies fail to take full advantage of their data because they do not apply data mining techniques to study, manage and learn from their data.
Success Factors • Data: A Data-driven Perspective • Companies habitually rely on their "gut feelings" instead of on the data to drive decision-making. • That being said, one should not underestimate the importance of domain knowledge. • We argue that domain knowledge should guide empirical investigation, especially at the exploratory stage.
Success Factors • Data: Data Preprocessing • The data mining process starts with data preprocessing, or so-called ETL (extract, transform and load), during which raw user data logs go through a series of perturbations and get loaded into a data warehouse (DW). • ETL may introduce biases in downstream data. • Timestamps may not be consistently normalized. • Data consistency is a big challenge. • Data integration is a big architectural challenge.
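The timestamp issue above can be illustrated with a minimal sketch of one ETL normalization step. It is not taken from the paper: the function name `normalize_timestamp`, the parameter `default_offset_hours`, and the choice of ISO 8601 input are all assumptions made for illustration. The idea is to convert every log timestamp to UTC before loading, and to make the handling of timestamps without an offset an explicit, documented choice rather than an accident of the pipeline.

```python
from datetime import datetime, timedelta, timezone


def normalize_timestamp(raw: str, default_offset_hours: int = 0) -> datetime:
    """Parse an ISO 8601 timestamp and return it as a UTC datetime.

    Hypothetical helper: timestamps that carry no UTC offset are assumed
    to be in a fixed offset given by default_offset_hours.
    """
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Naive timestamp: attach the assumed source offset explicitly.
        source_zone = timezone(timedelta(hours=default_offset_hours))
        parsed = parsed.replace(tzinfo=source_zone)
    # Convert to UTC so all downstream records share one time base.
    return parsed.astimezone(timezone.utc)


# The same instant logged by two hypothetical sources with different conventions.
from_offset_log = normalize_timestamp("2009-11-02T08:30:00-08:00")
from_naive_log = normalize_timestamp("2009-11-02 08:30:00", default_offset_hours=-8)
print(from_offset_log == from_naive_log)
```

Without such a step, the two records would appear eight hours apart in the warehouse, which is one way ETL can bias downstream time-based analyses.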
Success Factors • Data: Data Distribution
Comments • Advantage • Drawback • … • Application • Information Search and Retrieval