Recommending Similar Items at Large Scale Jay Katukuri Merchandising Team - eBay 07/25/2012
Similar Items Clustering Platform • Introduction • Merchandising Challenges • Similar Item Clustering (SIC) Architecture • Clustering Approach • Features • Method • Cluster Assignment Service • Applications • Replacement/Equivalent items on CVIP – Non-winner • Related/Complementary items on Checkout
Introduction • Grouping similar items together is essential for recommendation algorithms. • Two distinct items can be considered similar if their important features are similar: • Titles • Attributes • Images • The Similar Item Clustering (SIC) platform creates clusters of items. • These clusters now power various recommendation systems on the site.
Merchandising Challenges - Motivation for SIC • Non-productized inventory, long tail • Product coverage exists for only a few categories • The majority of items are ad hoc listings not covered by the catalog taxonomy • Maintaining catalogs is a daunting task for the long tail • One-of-a-kind inventory; items are short-lived • Unstructured data • Attribute coverage is minimal • Sparsity in the transactional data • Very few purchases for certain kinds of items
Merchandising Challenges - Motivation for SIC • Item-item pairs are supported by even fewer users. • We may not see users buying both a product and its accessories on eBay. • Large data: a much bigger data set, in both users and inventory, than other e-commerce sites. • Scale: several hundred million listings; several million new items every day.
Item Signatures: a possibility? • Example cluster signatures: “apple ipod touch 4g clear film protector screen”; “clarks women shoe pumps classics”
Similar Items: Clustering Architecture • Off-line (slow, periodic): Hadoop Cluster Generation → Cluster Dictionary + Item-Cluster Index • Run-time (fast): item → Cluster Assignment Service → Applications: Merchandising, Navigation, etc.
Query-Item Set • Pipeline: Click-stream Log → Filter Queries by Demand/Supply → Query Normalization → Query Backend → Query-to-Items Data • Use 1 month of user behavior data to collect the initial query set. • Filter queries by length and category-specific demand/supply ratios.
Query Selection • Input data: click-stream logs • Method for choosing the queries: • Minimum frequency • Average supply threshold • Min and max token constraints • Morphological constraints: queries that contain only numbers are not allowed, e.g. “10 5”
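The selection criteria above can be sketched as a single filter function. This is an illustrative reconstruction: the function name and all threshold values are assumptions, not eBay's actual settings.

```python
import re

def keep_query(query, frequency, avg_supply, min_freq=50, min_supply=10,
               min_tokens=2, max_tokens=6):
    """Apply the query-selection constraints: minimum frequency, average
    supply threshold, token-count bounds, and a morphological check that
    rejects queries containing only numbers (e.g. "10 5")."""
    tokens = query.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    if frequency < min_freq or avg_supply < min_supply:
        return False
    if re.fullmatch(r"[\d\s]+", query):  # numbers-only queries are disallowed
        return False
    return True
```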
K-Means Clustering • Pipeline: Query-to-Items Data → Base Cluster Generation → Generate Item Features → K-Means Clustering of Base Clusters (with Scoring Models) → Split Clusters • Use item title, category, and attributes as features for clustering. • Applying k-means to each base cluster separately produces better-quality clusters and makes the process faster. • Use cosine distance for item clustering. • Cluster size is chosen as a tuning parameter.
Base Cluster Generation • Base Cluster ≡ Query • Find merge candidates based on query term overlap • E.g.: “nike airmax tennis shoes” -> “nike airmax”; “nike airmax tennis shoes” -> “nike shoes” • Score candidates using cosine similarity • Term weight: TF-IDF in the query space (document = query) • TF: query demand • IDF: number of queries containing the term • Most similar merge candidate wins • E.g.: “nike airmax tennis shoes” -> “nike airmax” • Merge the corresponding recall sets
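The merge step above can be sketched as follows. This is a minimal reconstruction under the slide's definitions (TF = query demand, IDF from query document frequencies); the function names and data shapes are assumptions.

```python
import math

def tfidf(query, demand, n_queries, doc_freq):
    """Term weights in the query space (document = query):
    TF = query demand, IDF from the number of queries containing the term."""
    return {t: demand * math.log(n_queries / doc_freq[t]) for t in query.split()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_merge_candidate(query, candidates, vectors):
    """Among candidates sharing at least one term with `query`,
    the most cosine-similar merge candidate wins."""
    q_terms = set(query.split())
    overlapping = [c for c in candidates if set(c.split()) & q_terms]
    return max(overlapping, key=lambda c: cosine(vectors[query], vectors[c]),
               default=None)
```

With a toy document-frequency table where "airmax" is rarer than "shoes", the more specific query "nike airmax" wins the merge, matching the slide's example.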
Base Cluster Merge • Reduces the number of base clusters by half. • Example (related quilt queries merged together):
phrase(hand,made) phrase(king,s) queen quilt
phrase(hand,made) phrase(pink,s) quilt
phrase(hand,made) phrase(pre,owned) queen quilt
phrase(hand,made) queen quilt
phrase(hand,made) phrase(pre,owned) quilt
phrase(hand,made) quilt size twin
phrase(hand,made) quilt silk
phrase(hand,made) quilt twin
phrase(hand,made) phrase(patch,work) quilt
phrase(hand,made) quilt white
phrase(hand,made) phrase(king,size) quilt
phrase(hand,made) phrase(yo,yo,s) quilt
phrase(hand,made) quilt sale
phrase(hand,made) quilt red
phrase(hand,made) quilt
Item Features Generation • Item Title: “3x clear screen protector film skin for apple ipod touch 4 4g” • Normalization: “3-x clear screen protector film skin for apple ipod touch 4 4-g” • Concept Extractor: “3-x color=clear type=‘screen protector’ film skin compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g” • Expansions: “PHRASE(3,x) color=clear type=‘screen protector’ OR(film,films) OR(skin,skins) compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g” • Output: Normalized Item Features
Item Features: Concept Extraction • Problem: extract concepts from the item title. • Purpose: • Attribute coverage is sparse in many categories • Extracted concepts can be used as features • Approach: • Fast online service to extract entities from any eBay text (item title, product title, etc.) • Batch capability for use on Hadoop • Restricted to known and important (above a certain threshold) name/value pairs • Unsupervised model • Uses a statistical approach based on a large amount of data
Examples: Unstructured Item Title → Extracted Structured Data
• “Women’s black dress size 16 worn once” (Item id: 300494995198, Meta: CSA) → Size=16, Gender=Women, Color=Black, Style=Dress
• “Gucci medium ivory leather handbag” (Item id: 300477503372, Meta: CSA) → Brand=Gucci, Size=Medium, Color=Ivory, Material=Leather, Style=Handbag
• “Black Leather Case Cover for Reader Amazon Kindle 3 3G” (Item id: 380361729748, Meta: Computers & Networking) → Brand=Amazon Kindle 3, Model=3G, Type=Leather Case, Color=Black
Dictionary Generation Method • Pipeline: Data-warehouse → Data Cleansing → Dictionary Generation → Concept Dictionary • Intermediate data: co-occurrence matrix of name-values; tf-idf scores of name-values in a category • Other dictionaries used: units dictionary, synonym names, famous-persons list
Item Features: Concept Extraction • Co-occurrence of concepts is used to approximate the joint probability. • Brand=apple, model=iphone 4 • Using dictionaries at multiple levels reduces ambiguity when the same value can take multiple names. • “apple” is “compatible brand” in the accessories category • “apple” is “brand” in the devices category • ‘hp pavilion’ and ‘hp’ are both valid values for brand; the ambiguity is resolved using tf-idf scores of name-value pairs in the particular category. • Regexes were added to extract size patterns in CSA.
Item Features: Term Scores • Problem: given an item title in a leaf category, compute the significance of the terms in the title • While assigning items to clusters, identify which terms in the item title are more important than others • Issues with existing scoring models: • Built as a service; inefficient to use in batch mode on Hadoop • Unigram models
Mutual Information • The score of a term ‘t’ for a given item ‘i’ is computed using the mutual information of term ‘t’ and category ‘c’. • ‘c’ is the L2 category of item ‘i’. • Item titles from EDW are used as input data. • Scores are computed for the normalized tokens.
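The slide doesn't give the exact formula, so here is one plausible sketch using pointwise mutual information between a term and a category, estimated from title tokens. The function names and counting scheme are assumptions for illustration.

```python
import math
from collections import Counter

def build_pmi(titles):
    """titles: list of (category, title) pairs. Returns a pmi(term, category)
    function scoring how strongly a term is associated with a category."""
    term_cat, term, cat = Counter(), Counter(), Counter()
    total = 0
    for c, title in titles:
        for t in set(title.split()):  # count each term once per title
            term_cat[(t, c)] += 1
            term[t] += 1
            cat[c] += 1
            total += 1

    def pmi(t, c):
        joint = term_cat[(t, c)] / total
        if joint == 0.0:
            return 0.0
        return math.log(joint / ((term[t] / total) * (cat[c] / total)))

    return pmi
```

A category-specific term like "iphone" scores higher in its category than a generic term like "black" that appears everywhere.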
K-Means 1/3 • K-Means is a well-known clustering algorithm. • Choose k initial cluster centroids: m_1^{(1)}, …, m_k^{(1)} • Assignment step: assign each point to its nearest centroid, S_i^{(t)} = \{ x : \|x - m_i^{(t)}\|^2 \le \|x - m_j^{(t)}\|^2 \ \forall j \} • Update step: m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x \in S_i^{(t)}} x • Objective: minimize intra-cluster distortion (equivalently, maximize intra-cluster similarity)
K-Means 2/3 • 1. Choose random cluster centroids • 2. Update centroids based on the neighborhood • 3. Final clusters • We use a variant of k-means called “Bisecting K-means”, which tends to produce better-quality results than standard k-means.
K-Means 3/3 • Pros • Simple to understand and implement • Easily parallelizable • Generally produces good-quality clusters when K is small • Cons • Slow to converge when K is large • Cluster quality degrades with large K • K must be decided beforehand; domain knowledge and tuning are needed to find a suitable K
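Bisecting k-means repeatedly splits the largest cluster with 2-means until K clusters exist. The following is a minimal sketch on sparse term vectors with cosine similarity (matching the distance choice on the earlier slide); it is not the production implementation, and the data shapes are assumptions.

```python
import random

def centroid(points):
    """Mean of a list of sparse term-weight vectors (dicts)."""
    keys = {k for p in points for k in p}
    n = len(points)
    return {k: sum(p.get(k, 0.0) for p in points) / n for k in keys}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def two_means(points, iters=10, seed=0):
    """Standard 2-means with cosine similarity."""
    rng = random.Random(seed)
    c = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if cosine(p, c[0]) >= cosine(p, c[1]) else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        c = [centroid(g) for g in groups]
    return groups

def bisecting_kmeans(points, k):
    """Split the largest cluster with 2-means until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        g0, g1 = two_means(clusters.pop())  # pop and bisect the largest
        clusters += [g0, g1]
    return clusters
```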
K-Means Clustering: Cluster Description • Clusters are described using their centroids. • Cluster 1: “L1=293 L2=56169 L3=168096 compatible brand=apple compatible product=ipod touch Phrase(4,g) clear film protector screen” • Cluster 2: “L1=11450 L2=3034 L3=55793 brand=indigo by clarks shoe style=pumps classics” • There are about x million clusters for US. • These x million clusters cover more than 92% of the US inventory.
Shingling for Cluster Merging • Problem: given a set of clusters, find a grouping of similar clusters. • Approach: • Represent each cluster as a “document” • Compute the 5 minimum (min-hash) 3-shingles of each document • Clusters with an 80% shingle match belong to the same group
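A minimal sketch of the approach above: hash each cluster document's 3-token shingles, keep the 5 smallest hashes as a min-hash style sketch, and compare sketches against the 80% threshold. Function names and the overlap measure are illustrative assumptions.

```python
import hashlib

def min_shingles(doc, k=3, keep=5):
    """Hash all k-token shingles of the cluster 'document' and keep the
    `keep` smallest hash values."""
    tokens = doc.split()
    shingles = {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    hashed = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return set(hashed[:keep])

def same_group(doc_a, doc_b, threshold=0.8):
    """Two clusters belong to the same group if their sketches overlap
    at least `threshold` of the smaller sketch."""
    a, b = min_shingles(doc_a), min_shingles(doc_b)
    if not a or not b:
        return False
    return len(a & b) / min(len(a), len(b)) >= threshold
```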
Cluster Assignment • Off-line: Cluster Dictionary → Inverted Index (implemented using Lucene); meta-data files • Run-time flow: Closed View Item (item title, attributes, leaf category, site) → Pre-processing → Assignment Service → Rank Clusters (call for top N clusters) → Rank top N similar items (Voyager) → Recommended Similar Items
Cluster Assignment: Pre-processing • Input: “new 2x for canon lp-e8 battery + charger + lens hood eos 550d 600d digital rebel t3i” • RTL Normalization: new,2-x,for,canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital,rebel,t-3-i • Concept Extraction: new,2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i • Stop Word Filtering: 2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i • STL Expansion: 2-x,for canon,PHRASE(lp,e,8),OR(batteries,battery,batterys),OR(charger,chargers),lens,OR(hood,PHRASE(hood,s),hoods),eos,PHRASE(550,d),PHRASE(600,d),digital rebel,t-3-i • Query Reduction / Unification: 2-x,for canon,phrase(lp,e,8),batteries,charger,lens,phrase(hood,s),eos,phrase(550,d),phrase(600,d),digital rebel,t-3-i
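The first two pre-processing steps can be approximated with a toy normalizer: lowercase, split letter/digit boundaries (so “2x” becomes “2-x” and “lp-e8” becomes “lp-e-8”), and drop stop words. The stop list and regex here are illustrative assumptions, not the actual RTL normalization rules.

```python
import re

STOP_WORDS = frozenset({"new", "for", "the", "a"})  # illustrative stop list

def normalize_title(title):
    """Toy stand-in for the normalization + stop-word filtering steps:
    lowercase, insert '-' at letter/digit boundaries, drop stop words."""
    out = []
    for tok in title.lower().split():
        # split boundaries like 2x -> 2-x and lp-e8 -> lp-e-8
        tok = re.sub(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", "-", tok)
        if tok not in STOP_WORDS:
            out.append(tok)
    return out
```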
Cluster Assignment: Scoring • Indexing fields: title terms and categories • Reward matching terms and penalize non-matching terms • Reward for matching terms is based on: the number of terms matching from the input, the importance of each term in the input, and a query-time boost • Penalize non-matching terms from cluster ‘c’ • Index-time boost: field-length normalization
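The scoring scheme above can be sketched as follows. The exact weighting formula was on the original slide's (lost) equations, so this is a hedged reconstruction: reward matched cluster terms weighted by importance and boost, subtract a penalty per unmatched cluster term, and divide by description length as a stand-in for field-length normalization. The function name and penalty value are assumptions.

```python
def score_cluster(item_terms, cluster_terms, term_weight, boost=None,
                  penalty=0.1):
    """Reward cluster terms that match the input item (weighted by term
    importance and an optional query-time boost); penalize cluster terms
    the item lacks; normalize by cluster-description length."""
    boost = boost or {}
    s = 0.0
    for t in cluster_terms:
        if t in item_terms:
            s += term_weight.get(t, 1.0) * boost.get(t, 1.0)
        else:
            s -= penalty  # non-matching cluster term
    return s / (len(cluster_terms) ** 0.5)  # field-length normalization
```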
Cluster Assignment Cross Validation • Compute the precision of recommending items from the “correct” cluster(s) • Correct clusters: those that generate purchases (BIDs and/or BINs) • Labeled data: view-buy data generated from user-session analysis • CVIP -> Bid/BIN in the same user session • Same category
Cluster Assignment Cross Validation: Method • For each labeled item, compute the top-k (k=5) cluster lists from the assignment service and from the labeled data • Ignoring position, compute precision in the top k • Caveats: true precision depends on ranking; assume every item belonging to a cluster is equally likely to be recommended • Normalized precision: normalize by the size of the smallest cluster in the top-k list
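The position-agnostic precision measure can be sketched directly; the normalization term from the slide is omitted here since its exact definition was lost with the slide's equations.

```python
def precision_at_k(predicted_clusters, purchase_clusters, k=5):
    """Position-agnostic precision: fraction of the top-k predicted clusters
    that actually generated a purchase (BID/BIN) in the labeled data."""
    top = predicted_clusters[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in purchase_clusters) / len(top)
```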
Merchandising Applications • Two kinds of recommendation systems use SIC: • Recommending similar items on the CVIP non-winner page • Collaborative Filtering (CF) algorithms: • “Buy-Buy” – on the Checkout page • “View-Buy” – on AVIP
Similar Item Recommendations • User bid on an item but lost: show similar items as replacements. • User was watching an item that has ended: show similar items as replacements. • User viewed an item but did not make a purchase: show similar items to showcase more choices. • Inject diversity into the recommendations.
Collaborative Filtering on SIC – “Buy-Buy” • Once a user has purchased an item, what else can we recommend to go with that purchase? • Drive incremental purchases • On check-out, recommend other items that “go together” with the purchased item • E.g., for a cell phone we may recommend a charger, case, or screen protector. • For a dress shirt, we may recommend a tie, dress shoes, or a jacket.
Collaborative Filtering on SIC – “Buy-Buy” • Non-productized item inventory with short lifetime makes any CF based approach difficult. • Map the items to a higher level abstraction (clusters) to handle data sparsity. • Re-use the item clusters generated for Similar Item Recommendation.
Related Recommendations: Before & After Recommendations for Xbox 360 4GB on Checkout page
Conclusion • The SIC platform has proven its utility and is a critical component of merchandising algorithms • Future work: • Quality needs to be improved for long-tail categories like Art, Collectibles, etc. • Better distinguish between CVIP loser/browser • End-to-end cross-validation framework
Cluster Assignment: Aspect Demand • Historical (6–7 months) user behavior data • Rank-ordered lists of aspects used in: • Search queries • Left-navigation filters • Combined using rank aggregation • Gives the importance of an aspect in a category • Used as a query-time boost during cluster-index lookup • Example: • Input: AIR JORDAN RETRO 4 IV MILITARY BLUE 2006 SIZE 9.5 USED • k:air jordan^2.0 k:retro^1.25 k:military^1.25 k:blue^1.2 • Also used in Concept Landing Pages (CLPs) and Popular Watches w/ Aspects
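The slide doesn't specify which rank-aggregation method is used; a simple Borda-count sketch (an assumption for illustration) shows the idea of combining the two rank-ordered aspect lists into one importance ordering.

```python
from collections import defaultdict

def aggregate_ranks(*ranked_lists):
    """Borda-count aggregation of rank-ordered aspect lists (e.g. from
    search queries and left-nav filters); earlier positions earn more points."""
    scores = defaultdict(float)
    for lst in ranked_lists:
        n = len(lst)
        for pos, aspect in enumerate(lst):
            scores[aspect] += n - pos  # top of a list scores highest
    return sorted(scores, key=scores.get, reverse=True)
```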
Ranking • Aspect-demand data based on the input item is used in ranking • Ex: material=‘leather’ may not appear in the cluster description “Clarks Women Shoes” • Format bias based on the seed item’s format
Format Affinity • X% of seed items are auction-format for CVIP non-winner • High affinity towards the seed item’s format