Old Dominion University, Atsuyuki Harada April 13, 2011

Extracting User Profiles from Large Scale Data. Written by Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki

Introduction • The authors developed a large scale user profiling framework on top of Apache Hadoop. • The framework is designed for extracting and maintaining a very large number of user profiles from large scale data. • Hadoop’s Map/Reduce is used to learn user profiles from large scale data associated with users. • Hadoop’s HDFS is used to store both user profiles and large scale data associated with users.

User Profiles • A user is someone who visits the content provider’s website or updates a blog managed by the content provider’s blogging service. • In this presentation, I call a user a customer. • A user profile is information about the customer (e.g., interests and preferences). • User profiling is the process of implicitly learning a user profile from data associated with the customer. • A user profile can also be explicitly defined by the customer (e.g., through the customer’s registration for some service).

Large Scale Data Associated with Customers • The data is collected from: • The customer’s browsing sessions • The customer’s own generated content (e.g., the customer’s blog) • The customer’s social interactions with friends in the customer’s social network (e.g., the customer’s discussions with others) • Click-through data extracted from search logs • Other user profiles, using collaborative filtering techniques • Collaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. (Adapted from Wikipedia)

User Profiles Are Used for Personalization • A user profile can be utilized for: • Personalized search, by re-ranking search results according to the customer’s interests and preferences (e.g., Rerank.net). • Personalized recommendations for potential customers (e.g., Last.fm, Amazon, Shopping.com). • Improving the click-through rate (CTR) of disseminated ads (e.g., Google AdSense).

General Content Management Setting Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)

General Content Management Setting • Requests: HTTP-GET, HTTP-POST. • Services: Web server, blogging service, etc. • The system log stores all requests from users and all interactions between users and Web documents. • The user profiling module learns user profiles from the system log using a method based on the Kullback-Leibler (KL) divergence. Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)
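The slide names the KL divergence but does not say how it is applied. As a rough illustration only, the sketch below assumes that a profile consists of the terms that contribute most to the KL divergence between the term distribution of a customer's documents and that of the whole collection. The function names, the whitespace tokenization, the smoothing constant, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def kl_term_scores(user_docs, all_docs, smoothing=1e-9):
    """Score each term by its contribution p * log(p / q) to KL(P_user || P_collection).

    Assumption: documents are plain strings tokenized on whitespace.
    """
    user_counts = Counter(t for doc in user_docs for t in doc.split())
    coll_counts = Counter(t for doc in all_docs for t in doc.split())
    user_total = sum(user_counts.values())
    coll_total = sum(coll_counts.values())
    scores = {}
    for term, count in user_counts.items():
        p = count / user_total
        # Fall back to a tiny constant if the term is missing from the collection.
        q = coll_counts.get(term, 0) / coll_total or smoothing
        scores[term] = p * math.log(p / q)
    return scores

def top_terms(scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy data: four documents, two of them associated with one customer.
collection = [
    "hadoop cluster news",
    "football news today",
    "hadoop mapreduce job",
    "weather news today",
]
user_docs = ["hadoop cluster news", "hadoop mapreduce job"]

scores = kl_term_scores(user_docs, collection)
profile = top_terms(scores, k=2)
```

Terms that are common everywhere (here "news") receive a negative score, so they drop out of the profile, while terms that are over-represented in the customer's documents rise to the top.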

General Content Management Setting • The system log stores all requests from users and all interactions between users and Web documents. • Each record is a tuple <u, d, context>. • u is the customer. • d is the Web document associated with that customer. • context is metadata derived from the customer-document association (e.g., time, geographic location, search keyword, bookmark tag). Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)
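To make the <u, d, context> records concrete, here is a minimal in-memory sketch of a map step and a reduce step that group log records by customer. It only imitates the Map/Reduce pattern in plain Python and does not use Hadoop; the record type, the function names, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict, namedtuple

# One log record: customer u, Web document d, and context metadata.
LogRecord = namedtuple("LogRecord", ["u", "d", "context"])

def map_record(record):
    """Map step: emit (customer, (document, context)) for one log record."""
    yield record.u, (record.d, record.context)

def reduce_user(user, values):
    """Reduce step: collect the distinct documents seen for one customer."""
    docs = sorted({d for d, _ in values})
    return user, docs

def run_job(records):
    """Run map, an in-memory stand-in for the shuffle phase, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_user(user, values) for user, values in groups.items())

# Hypothetical sample log.
logs = [
    LogRecord("alice", "doc1", {"time": "2011-04-01", "keyword": "hadoop"}),
    LogRecord("bob", "doc2", {"time": "2011-04-01", "tag": "sports"}),
    LogRecord("alice", "doc3", {"time": "2011-04-02", "location": "Norfolk"}),
]
docs_by_user = run_job(logs)
```

Keying the map output on u means that all records for one customer reach the same reduce call, which is the point at which a per-customer profile could be computed.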

Recall the First Slide • The framework is designed for extracting and maintaining a very large number of user profiles from large scale data. • In this context, to maintain a user profile means to guarantee its quality over time, because a customer’s interests and preferences may change.

A Single User Profile Evolution Over Time Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)

A Single User Profile Evolution Over Time • In this particular figure, the user profile is represented by a single “tag-cloud.” • Fresher user profiles are represented by larger clouds. Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)

A Single User Profile Evolution Over Time • j represents the current time period. • j-1 represents the previous time period. • j-2 represents the time period before that. • At time period j, only the logs recorded between j-1 and j are dumped into the log system and used to update user profiles. • At time period j-1, only the logs recorded between j-2 and j-1 are dumped into the log system and used to update user profiles. Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)
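The incremental update described above can be sketched as follows. This is a simplified illustration, assuming that a profile is a bag of term counts and that each new log entry has already been reduced to a (customer, term) pair; neither assumption comes from the slides, and the names are hypothetical.

```python
from collections import Counter

def update_profiles(profiles, new_logs):
    """Fold only the logs of the latest period (between j-1 and j) into the profiles.

    Older logs are not re-read; the previous profiles carry their information.
    """
    # Copy so the profiles from period j-1 stay untouched.
    updated = {user: Counter(counts) for user, counts in profiles.items()}
    for user, term in new_logs:
        updated.setdefault(user, Counter())[term] += 1
    return updated

# Hypothetical state at the end of period j-1 and the logs that arrived during period j.
profiles_j_minus_1 = {"alice": Counter({"hadoop": 3})}
logs_period_j = [("alice", "hdfs"), ("alice", "hadoop"), ("bob", "music")]

profiles_j = update_profiles(profiles_j_minus_1, logs_period_j)
```

Because each update reads only one period of logs, the work per update depends on the size of that period rather than on the full log history.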

Scalability • Is the scalability affected by the number of user profiles to be maintained? Yes. The running time ratio is correlated with the number of user profiles, but not with the size of the data associated with customers. Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)

Scalability, cont. • Is the scalability affected by the size of the data associated with customers? No. The linear effect of data size is negligible. Adapted from Extracting User Profiles from Large Scale Data (Shmueli-Scheuer, Roitman, Carmel, Mass, Konopnicki)

Two Contributions of This Paper • Provides the design of an efficient user profiling framework • Provides a real, scalable implementation of the above design on top of Apache Hadoop • The implementation also gives a view of Hadoop’s real-world performance.

Any questions?
