1st DATA SCIENCE MEETUP in SEOUL JINYOUNG KIM & HEEWON JEON
Data Science? • Organizations use their data for decision support and to build data-intensive products and services. • The collection of skills required by organizations to support these functions has been grouped under the term "Data Science". - J. Hammerbacher
Taxonomy of Data Science • What
Taxonomy of Data Science • How
Data Scientist? • A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning – H. Mason
Data Science Meet-up • Let’s learn from each other! • Foster collaboration among participants • Beginning of a long and fruitful journey!
In What Follows… • Presentation • Discussion • Who should care about Big Data? Everyone? • Developing a Career Path as a Data Scientist
Information Retrieval? • Definition • The study and the practice of how an automated system can enable its users to access, interact with, and make sense of information. • Characteristics • More than the ten blue links of search results • Algorithmic solutions for information problems • [Diagram: Information Retrieval / RecSys at the intersection of UX / HCI / Info. Vis., Large-scale System Infra., and Large-scale (Text) Analytics]
IR in the Taxonomy of Data Science • What • How
Major Problems in IR & RecSys • Matching • (Keyword) Search : query – document • Personalized Search : (user+query) – document • Item Recommendation : user – item • Contextual Advertising : (user+context) – advertisement • Quality • PageRank / Spam filtering / Freshness • Relevance • Combination of matching and quality features • Evaluation is critical for optimal performance
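To make the "relevance = combination of matching and quality features" framing above concrete, here is a minimal sketch; the feature names, weights, and toy matching function are all invented for illustration, not taken from the talk:

```python
# Minimal sketch: relevance as a weighted combination of matching and
# quality features. All feature names and weights are illustrative.

def matching_score(query_terms, doc_terms):
    """Toy keyword-overlap match between a query and a document."""
    overlap = len(set(query_terms) & set(doc_terms))
    return overlap / max(len(query_terms), 1)

def relevance(query_terms, doc):
    """Combine the matching score with quality features
    (PageRank-style authority, freshness, spam score)."""
    match = matching_score(query_terms, doc["terms"])
    quality = (0.5 * doc["pagerank"]       # link-based authority
               + 0.3 * doc["freshness"]    # recency signal
               - 0.8 * doc["spam_score"])  # penalize likely spam
    return 0.7 * match + 0.3 * quality

doc = {"terms": ["data", "science", "meetup"],
       "pagerank": 0.6, "freshness": 0.9, "spam_score": 0.05}
print(relevance(["data", "science"], doc))  # higher is more relevant
```

In a production system the weights would be learned rather than hand-set, which is why the slide stresses that evaluation is critical.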
The Great Divide: IR vs. RecSys
• IR: Query / Document; Provide relevant info.; Reactive (given query); SIGIR / CIKM / WSDM
• RecSys: User / Item; Support decision making; Proactive (push item); RecSys / KDD / UMAP
• Both require a similarity / matching score
• Personalized search involves user modeling
• Most RecSys also involve keyword search
• Both are parts of the user's info-seeking process
Improved Query Modeling for Structured Documents: A Sneak Peek at Information Retrieval Research
Matching for Structured Document Retrieval [ECIR09, 12; CIKM09] • Field Relevance • Different fields are important for different query terms • e.g., ‘registration’ is relevant when it occurs in <subject>, while ‘james’ is relevant when it occurs in <to> • Why not simply provide a field operator or an advanced UI?
Estimating the Field Relevance • If User Provides Feedback • Relevant documents provide sufficient information • If No Feedback is Available • Combine field-level term statistics from multiple sources (see the sketch below) • [Diagram: per-field term statistics (from/to, title, content) of the Collection plus the Top-k Docs ≅ those of the Relevant Docs]
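A minimal sketch of the no-feedback case, assuming simple count-based field statistics and a hand-set 50/50 interpolation; the actual estimator and smoothing in the cited papers differ:

```python
# Sketch: estimate per-term field relevance P(F|q) by mixing field-level
# term statistics from two sources, the whole collection and the top-k
# retrieved documents. The 0.5/0.5 mixture weight is illustrative.

def field_term_dist(docs, term):
    """P(F|term) from one source: the share of the term's occurrences
    that fall in each field."""
    counts = {}
    for doc in docs:
        for field, text in doc.items():
            counts[field] = counts.get(field, 0) + text.split().count(term)
    total = sum(counts.values()) or 1
    return {f: c / total for f, c in counts.items()}

def field_relevance(collection, topk_docs, term, lam=0.5):
    """Interpolate the two sources to approximate the field distribution
    that truly relevant documents would exhibit."""
    p_coll = field_term_dist(collection, term)
    p_topk = field_term_dist(topk_docs, term)
    fields = set(p_coll) | set(p_topk)
    return {f: lam * p_coll.get(f, 0) + (1 - lam) * p_topk.get(f, 0)
            for f in fields}

emails = [{"to": "james", "subject": "registration open", "content": "hi"}]
print(field_relevance(emails, emails, "registration"))  # mass on 'subject'
```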
Retrieval Using the Field Relevance • Comparison with Previous Work • Prior models mix per-field scores with fixed field weights $w_j$; the field relevance model replaces these with per-term field weights $P(F_j \mid q_i)$ • Ranking in the Field Relevance Model • $\mathrm{score}(Q,D) = \prod_{i=1}^{m} \sum_{j=1}^{n} P(F_j \mid q_i)\, P(q_i \mid D_{F_j})$ • Per-term field weights multiply per-term field scores, summed over fields and combined across query terms (see the sketch below)
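A sketch of the ranking formula above, assuming add-one smoothing as a crude stand-in for the smoothed field language models in the papers; `fr` is the per-term field weight table, e.g. as produced by the estimation step:

```python
# Sketch of ranking in the field relevance model: for each query term,
# sum the per-field term score weighted by P(F|q), then multiply the
# per-term results across the whole query.

def p_term_given_field(term, field_text):
    """Crudely smoothed P(q | D_F): term frequency within one field."""
    tokens = field_text.split()
    return (tokens.count(term) + 1) / (len(tokens) + 100)

def score(query_terms, doc, fr):
    """fr[term][field] holds the per-term field weight P(F|q)."""
    total = 1.0
    for term in query_terms:
        total *= sum(fr[term].get(field, 0.0)
                     * p_term_given_field(term, text)
                     for field, text in doc.items())
    return total

doc = {"to": "james", "subject": "registration open", "content": "see you"}
fr = {"registration": {"subject": 0.8, "content": 0.15, "to": 0.05}}
print(score(["registration"], doc, fr))
```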
Evaluating the Field Relevance Model • Retrieval Effectiveness (Metric: Mean Reciprocal Rank; computed as sketched below) • [Chart: MRR of per-term field weights vs. fixed field weights]
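Mean Reciprocal Rank itself is a standard metric; a small sketch of how it is computed (the doc ids and judgments are made up):

```python
# Mean Reciprocal Rank: average over queries of 1/rank of the first
# relevant result; a query with no relevant result contributes 0.

def mean_reciprocal_rank(rankings, relevant):
    """rankings: ranked doc ids per query; relevant: relevant set per query."""
    rr = [next((1 / (i + 1) for i, d in enumerate(ranked) if d in rel), 0.0)
          for ranked, rel in zip(rankings, relevant)]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([["d2", "d1"], ["d3", "d4"]],
                           [{"d1"}, {"d9"}]))  # (1/2 + 0) / 2 = 0.25
```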
Lessons from Data Science Perspective • Understanding user behavior provides key insights • The notion of field relevance • Choice of estimation technique depends on many factors • Availability of data and labels (e.g., can we use a CRF?) • Efficiency concerns (possibility of pre-computation) • Evaluation is critical for continuous improvement • IR people are very serious about datasets and metrics
LiFiDeA (= Life + Idea) Project • Goal • Improved personal info mgmt. => self-improvement • Collect behavioral data (schedule and tasks) • Correlate them with subjective judgments of happiness • Workflow • Write task-centric journals on Evernote • Weekly data migration into a spreadsheet • Statistical analysis using Excel charts and R (see the sketch below) • Findings • Tracking itself helps, but not for long • Keeping the right amount of tension is critical • [Image: my source of inspiration]
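The analysis itself was done with Excel charts and R; purely as an illustration (the column names and data here are invented), the same correlation step looks like this in Python/pandas:

```python
# Illustrative only: correlate tracked behavior with daily happiness
# ratings, mirroring the spreadsheet/R analysis step of the workflow.
import pandas as pd

log = pd.DataFrame({
    "wakeup_hour": [6.5, 7.0, 8.5, 6.0, 9.0],
    "place":       ["home", "office", "cafe", "home", "cafe"],
    "happiness":   [8, 7, 5, 9, 4],   # subjective daily rating, 1-10
})

print(log["wakeup_hour"].corr(log["happiness"]))  # linear association
print(log.groupby("place")["happiness"].mean())   # happiness by place
```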
My Self-tracking Efforts • Life-optimization Project (2002~2006) • A software dev. project for myself, running for 4 years • Covered all aspects of personal info mgmt. • A core component of my Ph.D. application
My Self-tracking Efforts • LiFiDeA Project (2011-2012) • [Screenshots: raw data on Evernote; data moved onto an Excel sheet; charts of happiness by place and happiness by wake-up time]
Lessons Learned • Combine existing solutions whenever possible • “Done is better than perfect” applies here • *You* should own your data, not the app you use • Apps can come and go, but the data should stay • Minimize data-collection effort for sustainability • Integrate self-tracking into your daily routine • “Effort << Benefit” should hold at all times • Communicating regularly helps you make progress • Writing has been the best way to learn about the subject
Criteria for Choosing IR vs. RecSys
• Favor RecSys when: • You are confident in predicting the user’s preference • Matching items to recommend are available
• Favor IR when: • The user is willing to express information needs • Evidence about the user himself is lacking
From Query to Session: the IR Way vs. the HCIR Way
• The IR way: rich user modeling • The SYSTEM keeps a user model (profile, context, behavior) and an interaction history across response/action cycles
• The HCIR way: rich user interaction • The USER steers the session via filtering conditions, related items, filtering / browsing, relevance feedback, …
• Personalized results and rich interactions are complementary, and both are needed in most scenarios
• No real distinction between IR vs. HCI, and IR vs. RecSys
The Great Divide: IR vs. CHI
• IR: Query / Document; Relevant Results; Ranking / Suggestions; Feature Engineering; Batch Evaluation (TREC); SIGIR / CIKM / WSDM
• CHI: User / System; User Value / Satisfaction; Interface / Visualization; Human-centered Design; User Study; CHI / UIST / CSCW
• Can we learn from each other?