Juliet Hougland and Jonathan Natkins Real-time recommendations for retail: Architecture, algorithms, and design
Who Are We? • Jonathan Natkins • Field Engineer at WibiData • Before that, Cloudera Software Engineer • Before that, Vertica Software/Field Engineer • Juliet Hougland • Data Scientist, previously at WibiData • MS in Applied Math • BA in Math-Physics
Recommendations in Retail • Personalized versus Non-Personalized
Recommender Contexts • Taste History • Based on everything you know about a user • Interests over months/years • Current Taste • Based on a user’s immediate history • Interests over minutes/hours • Ephemeral • Extreme version of current taste • For example, location • Demographic* • Similar to taste history, but less subjective • Geographic region, age bracket, etc.
Why Does Real-Time Matter? Relevancy
I am a Special Snowflake Natty
Requirements for a Real-Time System • General System Requirements • Handle millions of customers/users • Support collection and storage of complex data • Static and event-series • Real-Time System Requirements • Quickly retrieve subsets of data for a single user • Aggregate/derive new, first-class data per user
What is Kiji? • The Kiji project is a modular, open-source framework for building real-time applications that collect, store, and analyze entity-centric data • kiji.org • github.com/kijiproject
Three Challenges • Developing models for use in real-time • Scoring models in real-time • Deploying models into a production environment
How Can We Make Real-Time Models? • Population interests change slowly • Models don’t need to be retrained frequently • Individual interests change quickly • Application of a model should be fast
A Common Workflow • Train a model over the entire dataset • Save fitted model parameters to a file or another table • Access the model parameters when generating new recommendations based on new data This is EXPENSIVE
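The three steps above can be sketched in a few lines of Python. This is an illustration only: the popularity-count "model", the function names, the sample events, and the JSON file are all assumptions, not something shown in the talk or taken from Kiji.

```python
import json
import os
import tempfile
from collections import Counter


def train(events):
    """Step 1: fit a deliberately trivial model over the entire dataset.

    The "model" is just a per-item purchase count (an assumption for the sketch).
    """
    return dict(Counter(item for _user, item in events))


def save_params(params, path):
    """Step 2: save the fitted parameters to a file."""
    with open(path, "w") as f:
        json.dump(params, f)


def recommend_from_file(path, already_bought, n=2):
    """Step 3: load the parameters and rank items the user does not own yet."""
    with open(path) as f:
        params = json.load(f)
    candidates = [item for item in params if item not in already_bought]
    # Highest count first; ties are broken by name so the order is deterministic.
    return sorted(candidates, key=lambda item: (-params[item], item))[:n]


# Hypothetical (user, item) purchase events.
events = [
    ("beaker", "banana slicer"),
    ("bunsen", "banana slicer"),
    ("bunsen", "pineapple slicer"),
    ("natty", "lab coat"),
]
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_params(train(events), path)
print(recommend_from_file(path, already_bought={"banana slicer"}))
```

Step 1 reads every event, so repeating the whole workflow for each new interaction is where the cost comes from in this sketch.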
Developing Models • KijiExpress • Scala interface for interacting with Kiji data • Uses Scalding for designing complex dataflows • Model Lifecycle • Allows analysts and data scientists to break apart a model into phases
Scoring Models in Real-Time • Batch isn’t real-time • [Figure: number of users plotted against number of interactions. A lot of users have few interactions; a few users have many interactions.]
Fresheners Compute Lazily • Client reads a column through the KijiScoring Server • Server gets the stored value from HBase • Freshness Policy checks whether the value is fresh • Yes: return to client • No: run the Scorer, return the result to the client, and write it back to HBase for next time
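The read path on the Fresheners slides can be sketched roughly as follows. The class and function names here are hypothetical and are not the KijiScoring API; a plain dictionary stands in for HBase, and the age-based policy is just one possible freshness rule.

```python
import time


class LazyFreshener:
    """Hypothetical stand-in for the lazy read path; not the KijiScoring API."""

    def __init__(self, store, is_fresh, scorer):
        self.store = store        # stands in for HBase: key -> (value, written_at)
        self.is_fresh = is_fresh  # freshness policy: (written_at, now) -> bool
        self.scorer = scorer      # recomputes the value for a key
        self.scorer_calls = 0     # counter kept only to make the behavior visible

    def read(self, key, now=None):
        now = time.time() if now is None else now
        cached = self.store.get(key)
        if cached is not None and self.is_fresh(cached[1], now):
            return cached[0]            # fresh: return to client
        value = self.scorer(key)        # stale or missing: score on demand
        self.scorer_calls += 1
        self.store[key] = (value, now)  # write back for next time
        return value


def max_age(seconds):
    """Example policy (an assumption): a value is fresh for `seconds` after it is written."""
    return lambda written_at, now: now - written_at <= seconds


freshener = LazyFreshener(
    store={},
    is_fresh=max_age(60),
    scorer=lambda user: f"recs for {user}",
)
freshener.read("beaker", now=0)    # nothing stored yet, so the scorer runs
freshener.read("beaker", now=30)   # still fresh, served from the store
freshener.read("beaker", now=120)  # stale, so the scorer runs again
print(freshener.scorer_calls)
```

In this sketch the scorer runs only for users who actually issue a read, which is the point of computing lazily rather than in batch.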
Kiji Model Repository • Link between application and models • Stores Freshener metadata • FreshnessPolicy, Scorer, attached column • Location of trained model • Stores Scorer code • Code repository makes model scoring code available to the application from a central location • New models can be deployed to the Model Repository and made immediately available to the application
Types of Recommenders • Recommendation Algorithms • Collaborative Filtering Methods • Memory Based • Model Based • Content Based Methods
Content-Based Recommenders • Build models around entities using features that we think reflect inherent characteristics • Example features: Orange-Nosed, Lab Assistant, Meeps a lot
Content-Based Recommenders • Example features: safer, faster, knife
Pandora: Content-Based Expertly-Characterized Music
Collaborative Filtering • Represent user-item affinities as a sparse matrix • Users ≈ Rows • Items ≈ Columns • Example: Beaker (user); Banana Slicer, Pineapple Slicer (items)
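One simple way to hold such a sparse matrix is a dictionary of dictionaries that stores only the observed cells. The sketch below assumes made-up ratings and a hypothetical `build_sparse` helper; it is not how Kiji lays out its tables.

```python
from collections import defaultdict


def build_sparse(ratings):
    """Index (user, item, value) triples by row (user) and by column (item)."""
    by_user = defaultdict(dict)
    by_item = defaultdict(dict)
    for user, item, value in ratings:
        by_user[user][item] = value
        by_item[item][user] = value
    return dict(by_user), dict(by_item)


# Hypothetical affinities; every missing pair is simply absent, not stored as 0.
ratings = [
    ("beaker", "banana slicer", 5.0),
    ("beaker", "pineapple slicer", 2.0),
    ("bunsen", "banana slicer", 4.0),
    ("natty", "lab coat", 3.0),
]
by_user, by_item = build_sparse(ratings)

stored = sum(len(row) for row in by_user.values())
total = len(by_user) * len(by_item)
print(f"{stored} of {total} cells stored")
```

Keeping both a row index and a column index makes it cheap to look up either one user's items or one item's users.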
Aspirational Ratings I put in my queue… I actually watch
Collaborative Filtering: How It Works • Similar Users • Similar Products • Simple aggregate predictors
Similar Entities • What do we mean by similar? • Jaccard Index: a measure of set similarity • Cosine Similarity: the cosine of the angle between two vectors • Pearson Correlation: statistical measure, similar to cosine • Naively, we could compare every entity to every other entity • But that would not scale well with increasing numbers of entities
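The three measures named above can be written out in plain Python. This is a minimal sketch using the standard textbook definitions and dense lists for brevity; the function names and inputs are illustrative.

```python
import math


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([p - mean_x for p in x], [q - mean_y for q in y])


print(jaccard({"banana slicer", "lab coat"}, {"banana slicer", "goggles"}))
print(cosine([1.0, 2.0], [2.0, 4.0]))
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Writing Pearson as a centered cosine shows why the two are described as similar. Comparing every pair among n entities takes n(n-1)/2 such calls, which is the scaling problem the last bullet points at.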
Collaborative Filtering: Is This Useful? • Problem: Too much data! • Tracking user preferences and all their events generates huge amounts of data • Problem: Too little data! • Dimensions of user-space and item-space are usually very large • More variables make it more difficult to generate user preferences • Problem: Cold start • If you don’t know anything about a user, what should you recommend? • Problem: More ratings means slower computations • Identifying neighborhoods of entities is expensive
Collaborative Filtering: Why Is It Useful? • Because it works • Content-agnostic • All that matters is co-occurrence of events
Amazon: Item-Item Collaborative Filtering • Used for personalized recommendations • Fill screen real estate with related items • Produces specific, but non-creepy recommendations • Reference: Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003
Item-Item Collaborative Filtering • Beaker buys a banana slicer • Then: • Generate list of candidate items to predict ratings for • Predict ratings for candidate items • Select Top-N items
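The three steps on this slide might look like the following in Python. The similarity table, the ratings, and the choice of a similarity-weighted average as the predictor are assumptions made for illustration; the slide text does not specify them. The example also assumes Beaker had rated a lab coat earlier, so that the weighted average has more than one term.

```python
def recommend_item_item(user_ratings, item_sims, n=2):
    """user_ratings: item -> rating. item_sims: item -> {neighbor item: similarity}."""
    # 1. Candidates: neighbors of the items the user rated, minus items already rated.
    candidates = {
        neighbor
        for item in user_ratings
        for neighbor in item_sims.get(item, {})
        if neighbor not in user_ratings
    }

    # 2. Predicted rating: similarity-weighted average of the user's own ratings.
    predictions = {}
    for candidate in candidates:
        numerator = denominator = 0.0
        for item, rating in user_ratings.items():
            sim = item_sims.get(item, {}).get(candidate, 0.0)
            numerator += sim * rating
            denominator += abs(sim)
        if denominator > 0:
            predictions[candidate] = numerator / denominator

    # 3. Top-N: highest prediction first, ties broken by name.
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical precomputed item-item similarities.
item_sims = {
    "banana slicer": {"pineapple slicer": 0.9, "safety goggles": 0.1},
    "lab coat": {"safety goggles": 0.9, "pineapple slicer": 0.1},
}
beaker = {"banana slicer": 5.0, "lab coat": 1.0}
print(recommend_item_item(beaker, item_sims, n=1))
```

Only the user's own handful of ratings is touched at request time; the item-item similarities are assumed to have been computed ahead of time.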
Accessing External Data • KeyValueStore API enables external data access when applying a model • External data might be… • Trained model parameters • Hierarchical/Taxonomic data • Geo-lookup • Store external data flexibly • Text files, sequence files, Kiji tables, etc. • Data access is decoupled from use during execution • If the data doesn’t fit in memory, put it in a table
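The idea of decoupling data access from use can be illustrated with a small sketch. The classes below are hypothetical and are not the Kiji KeyValueStore API; they only show model code that depends on a single `get` method, so the backing storage can change without touching the model.

```python
import os
import tempfile


class DictStore:
    """In-memory store, for external data that fits in memory."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key, default=None):
        return self._data.get(key, default)


class TextFileStore:
    """Store backed by a text file of tab-separated key/value lines."""

    def __init__(self, path):
        self._data = {}
        with open(path) as f:
            for line in f:
                key, _sep, value = line.rstrip("\n").partition("\t")
                self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


def same_category(item_a, item_b, taxonomy):
    """Model code: it only calls taxonomy.get(), so any store can be swapped in."""
    category_a = taxonomy.get(item_a)
    return category_a is not None and category_a == taxonomy.get(item_b)


# Hypothetical taxonomy data, written once to a file for the file-backed store.
pairs = {
    "banana slicer": "kitchen",
    "pineapple slicer": "kitchen",
    "lab coat": "apparel",
}
path = os.path.join(tempfile.mkdtemp(), "taxonomy.tsv")
with open(path, "w") as f:
    for item, category in pairs.items():
        f.write(f"{item}\t{category}\n")

for store in (DictStore(pairs), TextFileStore(path)):
    print(same_category("banana slicer", "pineapple slicer", store))
```

A table-backed store for data that does not fit in memory would expose the same `get` method, and `same_category` would not need to change.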
How Much Less Work Can We Do? • We can choose a predictor that allows us to truncate a sum • There are two ways terms in the sum of our predictor can be small • No rating: ignore unrated items • Small similarity: ignore dissimilar items
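Both shortcuts can be seen in a small sketch. The predictor's formula does not appear in the slide text, so the code assumes a similarity-weighted average over the user's rated items; the `k` and `min_sim` parameters and all the numbers are illustrative.

```python
import heapq


def predict(user_ratings, sims_to_target, k=None, min_sim=0.0):
    """Predict a rating for one target item from the user's existing ratings.

    user_ratings:   item -> rating given by the user
    sims_to_target: item -> similarity between that item and the target item
    """
    terms = []
    # Ignore unrated items: iterate only over what the user actually rated.
    for item, rating in user_ratings.items():
        sim = sims_to_target.get(item, 0.0)
        # Ignore dissimilar items: drop terms whose similarity is too small.
        if abs(sim) > min_sim:
            terms.append((sim, rating))
    if k is not None:
        # Truncate the sum to the k most similar rated items.
        terms = heapq.nlargest(k, terms, key=lambda term: abs(term[0]))
    denominator = sum(abs(sim) for sim, _rating in terms)
    if denominator == 0:
        return None
    return sum(sim * rating for sim, rating in terms) / denominator


# Hypothetical data: similarities of other items to "pineapple slicer".
beaker = {"banana slicer": 5.0, "lab coat": 1.0, "beaker set": 4.0}
sims = {"banana slicer": 0.9, "beaker set": 0.5, "lab coat": 0.01, "egg slicer": 0.8}

full = predict(beaker, sims)
truncated = predict(beaker, sims, k=2)
print(round(full, 3), round(truncated, 3))
```

In this example the dropped term has a similarity of 0.01, so the truncated prediction stays close to the full one, and the unrated "egg slicer" never enters the sum at all.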