MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ; Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco
DATA MINING Extracting actionable intelligence from large datasets • Is it a creative process requiring a unique combination of tools for each application? • Or is there a set of operations that can be composed using well-understood principles to solve most target problems? • Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining.
“MINING” APPLICATION CONTEXT • Scalability is important. • But when is 2x speed-up or scale-up important? When is 10x unimportant? • What is the appropriate measure, model? • Recall, precision • MT for search vs. MT for content conversion Answers to these questions come from the context of the application.
TALK OUTLINE • A New Approach to Customer Support • Mass Collaboration • Technical challenges • A framework and infrastructure for P2P knowledge capture and delivery • Role of data mining • Confluence of DB, IR, and mining
TYPICAL CUSTOMER SUPPORT
[Diagram: customers served through web support, a knowledge base (KB), and the customer support center]
TRADITIONAL KNOWLEDGE MANAGEMENT
[Diagram: experts build the knowledge base; consumers pose questions to it and receive answers]
Knowledge created and structured by trained experts using a rigorous process.
MASS COLLABORATION
[Diagram: experts, partners, customers, and employees answer each other's questions; each answer is added to the knowledge base to power self-service]
People using the web to share knowledge and help each other find solutions.
TIMELY ANSWERS
77% of answers are provided within 24h.
• Of 6,845 questions, 74% were answered
• Of the answers: 40% (2,057) provided within 3h, 65% (3,247) within 12h, 77% (3,862) within 24h, 86% (4,328) within 48h
• No effort to answer each question, no added experts, no monetary incentives for enthusiasts
MASS CONTRIBUTION
Users who on average provide only 2 answers each contribute 50% of all answers.
• 6,718 answers from 1,623 contributing users
• Top users: 7% of contributors (120) provide half of all answers
• The mass of users: 93% of contributors (1,503) provide the other 50% (3,329 answers)
POWER OF KNOWLEDGE CREATION
Two "shields" in front of the support organization (figures are averages from QUIQ implementations):
• Knowledge-creation self-service: -85% of support incidents
• Customer mass collaboration: -64%
• Roughly 5% of support incidents end up as agent cases (see the check below)
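As a rough check, assuming (the slide does not state this explicitly) that the second reduction applies to the incidents that get past the first:

```latex
\[
1 \times (1 - 0.85) \times (1 - 0.64) = 0.15 \times 0.36 \approx 0.05 = 5\%
\]
```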
TYPICAL SERVICE CHAIN
• $ Self service (knowledge base, FAQ, auto email): 40% of incidents
• $$ Manual email, chat: 50%
• $$$ Call center, 2nd-tier support: 10%
QUIQ SERVICE CHAIN
• $ QUIQ self-service and mass collaboration: 80%
• $$ Manual email, chat: 15%
• $$$ Call center, 2nd-tier support: 5%
CASE STUDIES: COMPAQ “In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.” “Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.” – Steve Young, VP of Customer Care, Compaq
ASP 2001 “Top Ten Support Site” “Austin-based National Instruments deployed … a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers’ participation, flattened call volume and continues to do the work of 50 support engineers.” – David Daniels, Jupiter Media Metrix
MASS COLLABORATION
Internet-scale P2P knowledge sharing: Communities + Knowledge Management + Service Workflows
[Quadrant chart spanning interactions vs. solutions and few vs. many experts: call center, support newsgroups, support knowledge base, and mass collaboration, with mass collaboration combining many experts with captured solutions]
CORPORATE MEMORY
Untapped knowledge in the extended business community.
[Diagram: customers, partners, employees, and suppliers all contributing to a shared knowledgebase]
User Community Universe
[Diagram of a self-organizing user community: user-to-user exchange, user-to-enthusiast structured user forums, and user-to-expert escalation, supported by editors, analysts, enthusiasts, and experts; incentives to participate, user acquisition via the web site, and areas of interest]
GOALS & ISSUES • Interactions must be structured to encourage creation of “solutions” • Resolve issue; escalate if necessary • Capture knowledge from interactions • Encourage participation • Sociology • Privacy, security • Credibility, authority, history • Accountability, incentives
REQUIRED CAPABILITIES • Roles: Credibility, administration • Moderators, experts, editors, enthusiasts • Groups: Privacy, security, entitlements • Departments, gold customers • Workflow: QoS, validation, escalation
SEARCHING “PEOPLE-BASES”
[Diagram: a search that finds no answer triggers routing and notification]
“If it’s not there, find someone who knows” - and get “it” there (knowledge creation)!
QUIQ, the “Best in Class” Support Channel
[Chart of agent cases as a share of support incidents across support configurations: email support and call center alone, 100%; adding automated emails, -20% (80%); adding web self-service knowledge creation, -42% (68%); QUIQ self-service (-85%) plus customer mass collaboration (-64%), 5%]
1) Source: QUIQ Client Information 2) Source: Association of Support Professionals
SEARCH AND INDEXING • User types in “How can I configure the IP address on my Presario?” • Need to find most relevant content that is of high quality and is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels. • User decides to post question because no good answer was found in the KB. • Search controls when experts and other users will see this new question; need to make this real-time. • Concurrency, recovery issues!
SEARCH AND INDEXING • Data is organized into tabular channels • Questions, responses, users, … • Each item has several fields, e.g., a question: • Author id, author status, service level, item popularity metrics, rating metrics, answer status, approval status, visibility group, update timestamp, notification timestamp, usage signature, category, relevant products, relevant problems, subject, body, responses Which 5 items should be returned?
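For illustration only, an item in the questions channel might carry fields like these (the names follow the list above but are assumed, not QUIQ's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical sketch of one item in the "questions" channel.
@dataclass
class QuestionItem:
    author_id: str
    author_status: str          # e.g., "enthusiast", "expert", "editor"
    service_level: str          # e.g., "gold"
    popularity: float           # item popularity metric
    rating: float               # rating metric
    answer_status: str          # e.g., "unanswered", "answered"
    approval_status: str        # e.g., "approved", "pending"
    visibility_group: str       # who is entitled to see this item
    updated_at: datetime
    notified_at: datetime
    category: str               # e.g., "Laptop"
    products: list[str] = field(default_factory=list)
    subject: str = ""
    body: str = ""
    responses: list[str] = field(default_factory=list)
```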
RUNTIME ARCHITECTURE
[Diagram: web servers and email feed the Hive Manager, which provides real-time indexing, caching, and alerts (cache, alerts, indexer), backed by files/logs, a DBMS, and a warehouse on RAID storage]
LEARNING FROM ACTIVITY: DATA TO KNOWLEDGE
[Diagram: periodic offline activity in which the miner (large reads/writes) and the indexer (small reads) operate over the files/logs, DBMS, and warehouse on RAID storage]
SEARCH AND INDEXING Which 5 items should be returned? • Question text, user attributes, system policies • IR-style ranked output • Search constraints: • Show matches; subject match twice as important • Show only approved answers to non-editors • Give preference to category Laptop • Give preference to recent solutions • Weight quality of solution
VECTOR SPACE MODEL
• Documents and queries are vectors in term space
• Vector distance from the query is used to rank retrieved documents
Q = (w_{11}, w_{12}, \ldots, w_{1t}), \quad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})
\mathrm{sim}_{\text{unnormalized}}(Q, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}
The i'th term in the summation can be seen as the "relevance contribution" of term i.
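A minimal sketch of this unnormalized dot-product ranking over toy TF-IDF vectors (the tokenization and weighting scheme are illustrative, not the system's actual scheme):

```python
import math
from collections import Counter

# Toy corpus; in the deck, "documents" are KB items (questions, answers, ...).
docs = [
    "how can i configure the ip address on my presario",
    "presario battery replacement instructions",
    "configure static ip address on a laptop",
]
query = "configure ip address presario"

def tfidf_vectors(texts):
    """Build a TF-IDF weight vector (term -> weight) for each text."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors, idf

doc_vecs, idf = tfidf_vectors(docs)
q_tf = Counter(query.split())
q_vec = {term: q_tf[term] * idf.get(term, 0.0) for term in q_tf}

def sim_unnormalized(q, d):
    """The i'th term of the sum is term i's 'relevance contribution'."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())

ranked = sorted(range(len(docs)),
                key=lambda i: sim_unnormalized(q_vec, doc_vecs[i]),
                reverse=True)
for i in ranked:
    print(f"{sim_unnormalized(q_vec, doc_vecs[i]):.3f}  {docs[i]}")
```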
A HYBRID DB-IR SYSTEM • Searches are queries with three parts: • Filter • DB-style yes/no criteria • Match • TF-IDF relevance based on a combination of fields • Quality • Relevance “boost” based on a policy
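As an illustration, the Presario query from the earlier slide might decompose into the three parts like this (the dictionary layout and parameter names are assumptions, not QUIQ's actual query syntax):

```python
# Hypothetical decomposition of "How can I configure the IP address on my
# Presario?" into the three query parts described above.
query = {
    # Filter: DB-style yes/no criteria (entitlements, approval status)
    "filter": {
        "approval_status": "approved",
        "visibility_group": {"in": ["public", "gold_customers"]},
    },
    # Match: TF-IDF relevance over a combination of fields
    "match": {
        "terms": "configure ip address presario",
        "fields": {"subject": 2.0, "body": 1.0},   # subject weighted 2x
    },
    # Quality: policy-driven relevance "boost"
    "quality": {
        "category": {"Laptop": 1.5},               # prefer category Laptop
        "recency_halflife_days": 90,               # prefer recent solutions
        "rating_weight": 0.5,                      # weight solution quality
    },
}
```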
A HYBRID DB-IR SYSTEM • A query is built up from atomic constraints using Boolean operators. • Atomic constraint: • [ value opterm, constraint-type ] • Terms are drawn from discrete domains and are of two types: hierarchy and scalar • Constraint-type is exact or approximate
A HYBRID DB-IR SYSTEM • Applying an atomic constraint to a set of items returns a tagged result set: • The result inherits the constraint-type • Each result item has a (TF-IDF) relevance score; 0 for exact • Combining two tagged item sets using Boolean operators yields a tagged set: • The result type is exact if both inputs are exact, and approximate otherwise • Result contains intersection of input item sets if either input is exact; union otherwise • Each result item is tagged with a combined relevance
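A minimal sketch of the tagging and combination rules just listed (class and function names are assumptions, not QUIQ's API):

```python
from dataclasses import dataclass

@dataclass
class TaggedSet:
    """Result of applying a constraint: item ids with relevance scores,
    tagged exact or approximate (items from an exact constraint carry 0)."""
    exact: bool
    scores: dict                      # item_id -> relevance score

def apply_exact(items, predicate):
    # DB-style yes/no constraint: matching items get relevance 0.
    return TaggedSet(exact=True, scores={i: 0.0 for i in items if predicate(i)})

def apply_approx(relevance_by_item):
    # IR-style constraint: items carry TF-IDF relevance scores.
    return TaggedSet(exact=False, scores=dict(relevance_by_item))

def combine(a: TaggedSet, b: TaggedSet) -> TaggedSet:
    """Combine two tagged sets using the rules on the slide."""
    # Result is exact only if both inputs are exact.
    exact = a.exact and b.exact
    if a.exact or b.exact:
        # If either input is exact, keep the intersection of item sets.
        ids = a.scores.keys() & b.scores.keys()
    else:
        # Two approximate inputs: keep the union.
        ids = a.scores.keys() | b.scores.keys()
    # Combined relevance; summing the contributions is one simple choice.
    scores = {i: a.scores.get(i, 0.0) + b.scores.get(i, 0.0) for i in ids}
    return TaggedSet(exact=exact, scores=scores)
```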
A HYBRID DB-IR SYSTEM • Semantics of Boolean expressions over constraints is associative and commutative • Evaluating exact constraints and approximate constraints separately (in DB and IR subsystems) is a special case. Additionally: • Uniform handling of relevance contributions of categories, popularity metrics, recency, etc. • Absolute and relative relevance modifiers can be introduced for greater flexibility.
CONCURRENCY, RECOVERY, PARALLELISM • Concurrency • Index is updated in real-time • Automatic partitioning, two-step locking protocol result in very low overhead • Relies upon post-processing to address some anomalies • Recovery • Partitioning is again the key • Leverages recovery guarantees of DBMS • Approach also supports efficient refresh of global statistics • Parallelism • Hash based partitioning
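A minimal sketch of how hash-based partitioning can localize index updates; this illustrates only the partitioning idea, not QUIQ's two-step locking or recovery scheme:

```python
import threading
from collections import defaultdict

NUM_PARTITIONS = 8

class PartitionedIndex:
    """Illustrative inverted index split by hash partitioning; each partition
    has its own lock, so concurrent real-time updates rarely contend."""
    def __init__(self, n=NUM_PARTITIONS):
        self.postings = [defaultdict(set) for _ in range(n)]
        self.locks = [threading.Lock() for _ in range(n)]
        self.n = n

    def _partition(self, item_id):
        return hash(item_id) % self.n

    def add_item(self, item_id, terms):
        p = self._partition(item_id)
        with self.locks[p]:                 # lock only one partition
            for term in terms:
                self.postings[p][term].add(item_id)

    def lookup(self, term):
        # Reads consult every partition and merge the results.
        result = set()
        for p in range(self.n):
            with self.locks[p]:
                result |= self.postings[p].get(term, set())
        return result
```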
NOTIFICATION • Extension of search: Each user can define one or more “standing searches”, and request instant or periodic notification. • Boolean combinations of atomic constraints. • Major challenges: • Scaling with number of standing searches. • Requires multiple timestamps, indexing searches. • Exactly-once delivery property. • Many subtleties center around “notifiability” of updates!
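A minimal sketch of standing searches with a per-search watermark, assuming a simple in-memory representation (the names and structure are illustrative; the real system must also handle periodic digests, indexing of searches, and the "notifiability" subtleties noted above):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class StandingSearch:
    """A saved query plus delivery state; 'matches' stands in for evaluating
    the user's Boolean combination of atomic constraints."""
    user: str
    matches: callable                       # item -> bool
    instant: bool = True                    # instant vs. periodic digest
    last_notified: datetime = datetime.min  # watermark for exactly-once delivery

def notify_pass(searches, updated_items, now):
    """One notification pass: deliver each qualifying update at most once by
    considering only items updated after the search's watermark."""
    for s in searches:
        hits = [it for it in updated_items
                if it["updated_at"] > s.last_notified and s.matches(it)]
        if hits and s.instant:
            deliver(s.user, hits)           # e.g., send an email (stub below)
        s.last_notified = now               # advance the watermark

def deliver(user, items):
    print(f"notify {user}: {[it['subject'] for it in items]}")
```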
DATA MINING TASKS • There is a lot of insight to be gained by analyzing the data. • What will help the user with her problem? • Who does a given user trust? • Characteristic metrics for high-quality content. • Identify helpful content in similar, past queries. • Summarize content. • Who can answer this question?
LEVERAGING DATA MINING • How do we get at the data? • Relevant information is distributed across several sources, not just the DBMS. • Aggregated in a warehouse. • How do we incorporate the insights obtained by mining into the search phase? • Need to constantly update info about every piece of content (Qs, As, users …)
LEVERAGING DATA MINING • Three-step approach: • Off-line analysis to gather new insight • Periodically refresh the indexes • Use the insight (from the KB/index) to improve search using the extended DB/IR query framework In short: use mining to create useful metadata (see the sketch below).
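A sketch of the three steps in miniature, with assumed metric names (average answer rating, per-author quality); the slide does not specify the actual mined metadata:

```python
# Illustrative offline pass: derive simple quality metadata from activity
# data and attach it to items so the search-time "quality" boost can use it.
from collections import defaultdict

def mine_quality_metadata(answers, ratings):
    """answers: list of {"answer_id", "author_id"}; ratings: list of
    {"answer_id", "score"}. Returns per-answer and per-author quality."""
    score_by_answer = defaultdict(list)
    for r in ratings:
        score_by_answer[r["answer_id"]].append(r["score"])
    answer_quality = {a: sum(s) / len(s) for a, s in score_by_answer.items()}

    by_author = defaultdict(list)
    for a in answers:
        if a["answer_id"] in answer_quality:
            by_author[a["author_id"]].append(answer_quality[a["answer_id"]])
    author_quality = {u: sum(s) / len(s) for u, s in by_author.items()}
    return answer_quality, author_quality

def refresh_index(index, answer_quality, author_quality, answers):
    # Periodic refresh: write the mined metadata back as item fields, where
    # the quality part of the DB/IR query can pick it up as a relevance boost.
    for a in answers:
        doc = index.setdefault(a["answer_id"], {})
        doc["quality"] = answer_quality.get(a["answer_id"], 0.0)
        doc["author_quality"] = author_quality.get(a["author_id"], 0.0)
```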
SOME UNIQUE TWISTS • Identify the kinds of feedback that would be helpful in refining a search • i.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts) • Metrics of quality • Link analysis is a good example, but what are the “links” here? (one possible instantiation is sketched below) • Self-tuning searches • The more the knobs, the more the choices • Next step: self-personalizing searches?
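To make the "what are the links here?" question concrete, one possible (hypothetical) instantiation is to treat an accepted answer as a link from asker to answerer and run a PageRank-style update to score answerer authority:

```python
# Illustration only, not the talk's prescription: edges asker -> answerer,
# one per accepted answer; authority estimated by a simplified PageRank
# (dangling nodes ignored, so scores are relative, not a true distribution).
from collections import defaultdict

def authority(accepted_answers, iterations=20, damping=0.85):
    """accepted_answers: list of (asker, answerer) pairs."""
    out_edges = defaultdict(list)
    users = set()
    for asker, answerer in accepted_answers:
        out_edges[asker].append(answerer)
        users.update([asker, answerer])
    rank = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new_rank = {u: (1 - damping) / len(users) for u in users}
        for asker, targets in out_edges.items():
            share = damping * rank[asker] / len(targets)
            for answerer in targets:
                new_rank[answerer] += share
        rank = new_rank
    return rank
```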
CONFLUENCES
[Diagram: the problem space at the intersection of IR (search), DB (queries), and P2P knowledge management]