Extracting User Profiles from Large Scale Data David Konopnicki, IBM Haifa Research Lab Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
Motivating Example • Keywords modeling: for each user, report the most meaningful keywords to describe her profile (e.g. "san-francisco", "peer", "michael jackson alive", "analysis") • Large scale content analysis for a massive number of users • [Diagram: user browsing feeds a profiles database; profiles are continuously updated and consumed by a dashboard and an advertisement system that tracks statistics about readers' interests]
Contributions • User Profiling Framework: • User profile model • KL approach to weight user profile • Large scale implementation: • MapReduce flow • Experiments: • Quality analysis • Scalability analysis
User Profiling Framework – Setting • Logging: <userID, docID> pairs, e.g. <u1,d1>, <u1,d2>, <u2,d2> • Content: <docID, content> pairs, e.g. <d1,{bla,bla,bla}>, <d2,{foo,foo}> • Profiles built from the logged pairs are used for targeting
User Profiling - Definitions • Bag of words model (BOW) • Profile maintenance • User snapshot • Community snapshot
User Profiling - Intuition • Find terms that are highly frequent in the user snapshot and that separate the most between the user and the community snapshots, e.g. { Travel, Tennis, Sport }
User Profiling – Naïve approach • Term frequency tf(t,d): number of times a term t appears in document d • Document frequency df(t,D): number of documents in D containing the term t • "Frequent": average tf of the term over the user snapshot (the probability to find the term in the user snapshot) • "Separate": inverse document frequency (idf) of the term in the community snapshot
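The naïve scoring above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the deck's exact formula: the `+1` smoothing in the idf denominator is an assumption added so terms unseen in the community do not divide by zero, and all names are hypothetical.

```python
import math
from collections import Counter

def naive_scores(user_docs, community_docs):
    """Naive tf-idf-style profile scoring: average tf over the
    user snapshot times idf in the community snapshot."""
    n_docs = len(community_docs)
    # df(t, D): number of community documents containing t
    df = Counter()
    for doc in community_docs:
        df.update(set(doc))
    # total tf of each term over the user's documents
    tf = Counter()
    for doc in user_docs:
        tf.update(doc)
    scores = {}
    for t, total in tf.items():
        # +1 in the denominator: illustrative smoothing so that
        # terms unseen in the community do not divide by zero
        idf = math.log(n_docs / (1 + df[t]))
        scores[t] = (total / len(user_docs)) * idf
    return scores
```

With this sketch, a term that the user mentions often but that is common community-wide gets a low idf and therefore a low score, matching the "frequent and separating" intuition.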
Kullback-Leibler (KL) Divergence • Measures the difference between two probability distributions P1 and P2: KL(P1 || P2) = Σt P1(t) · log( P1(t) / P2(t) ) • KL measures the distance between the community distribution and the user distribution • Each term is scored according to its contribution to the KL distance between the community and the user distributions • The top scored terms are then selected as the user's important terms
User Profiling – KL method • Community marginal term distribution: probability P(t|Dj) to find a term t in the community snapshot, estimated from the average tf of t over the community snapshot with a probability normalization factor • User marginal term distribution: the relative initial weight of term t, smoothed with the community snapshot (smoothing parameter λ = 0.001)
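The KL-based term scoring can be sketched as follows. This is a hedged reconstruction, not the paper's exact estimator: it assumes Jelinek–Mercer-style smoothing of the user distribution with the community distribution, and how the deck's λ = 0.001 enters the formula is an assumption; all function names are illustrative.

```python
import math

def kl_term_scores(user_tf, community_tf, lam=0.001):
    """Score each term by its contribution to the KL divergence
    between the (smoothed) user and the community term distributions.

    user_tf / community_tf: dicts mapping term -> frequency count.
    lam: weight given to the community distribution when smoothing
    the user distribution (a free parameter; the deck uses 0.001).
    """
    u_total = sum(user_tf.values())
    c_total = sum(community_tf.values())
    p_c = {t: n / c_total for t, n in community_tf.items()}
    scores = {}
    for t in user_tf:
        # maximum-likelihood user probability, smoothed with the community
        p_u = (1 - lam) * user_tf[t] / u_total + lam * p_c.get(t, 0.0)
        if p_c.get(t, 0.0) > 0:
            # per-term contribution to KL(user || community)
            scores[t] = p_u * math.log(p_u / p_c[t])
        else:
            scores[t] = 0.0  # term unseen in the community: no valid ratio
    return scores

def top_profile_terms(user_tf, community_tf, k=3):
    """The top-k scored terms form the user's profile."""
    s = kl_term_scores(user_tf, community_tf)
    return [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
```

Note that very common terms ("the", stop words) get negative or near-zero contributions because the user's probability barely exceeds the community's, while distinctive terms rise to the top.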
MapReduce Flow (each job reads its input from and writes its output to HDFS) • TF: map (d, text) → ({t,d}, 1); reduce sums to ({t,d}, tf(t,d)) • |Dj(u)|: map (u, d) → (u, 1); reduce sums to (u, |Dj(u)|) • TF: map (t, tf(t,d), |Dj|) → (t, {tf(t,d), |Dj|, 1}); reduce averages to (t, tf(t,Dj)) • UDF: map ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}) → ({u,t,|Dj(u)|}, {1}); reduce to ({u,t}, {udf(t,Dj(u))}) • DF: map ({t,d}, tf(t,Dj)) → (t, 1); reduce to (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)}) • Nj: map ({t}, {tf(t,Dj), cdf(t,Dj)}) → (t, Nj); identity reduce • P(t|Dj): map ({t}, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}) → (t, P(t|Dj)); identity reduce
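The first job of the flow (per-document term frequencies) can be sketched as a plain-Python simulation of map / shuffle / reduce. This is a didactic stand-in for a Hadoop job, not the paper's implementation; the function names are illustrative.

```python
from collections import defaultdict

# Simulate the first job of the flow: from (docID, text) records,
# emit ({t, d}, 1) pairs in the map phase and sum the counts in the
# reduce phase to obtain tf(t, d).

def tf_map(doc_id, text):
    for term in text.split():
        yield (term, doc_id), 1

def shuffle(pairs):
    """Group mapper output by key, as Hadoop's shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def tf_reduce(key, values):
    return key, sum(values)  # tf(t, d)

def run_tf_job(docs):
    mapped = [kv for d, text in docs for kv in tf_map(d, text)]
    return dict(tf_reduce(k, vs) for k, vs in shuffle(mapped).items())
```

The remaining jobs in the flow chain in the same way, each one consuming the (key, value) records the previous job left on HDFS.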
MapReduce Flow – cont. [Diagram: further HDFS-backed jobs compute the per-user term weights w, the normalization ∑w, and finally P(t|Dj(u))]
Experimental Data – quality analysis • Open Directory Project (ODP): • Categories are associated with manual labels • Considered as "ground truth" in this work • Examples: • ODP: Science/Technology/Electronics: Manual label: "Electronics" • ODP: Society/Religion/and/Spirituality/Buddhism: Manual label: "Buddhism" • Data Collection: • 100 different categories randomly selected from ODP • 100 documents randomly selected per category • A total collection size of about 10,000 Web pages • Evaluation: • A suggested label is counted as a match if it is identical to, an inflection of, or a WordNet synonym of the manual label
Results • Measured: in how many cases at least one of the top-K terms matched the manual label • KL outperforms all the other feature-selection approaches
Experimental Data – scalability analysis • Blogger.com • Data Collection: • We crawled 973,518 blog posts from March 2007 to January 2009 • Total collection size of 5.45GB, with ~120,000 users • Cluster setting: • 4-node cluster of commodity machines (each with 4GB RAM, 60GB HD, 4 cores) • Hadoop 0.20.1 • Example blog entry: http://grannyalong.blogspot.com/
Number of User Profiles • [Plot: document ratio, user-profile ratio, and runtime ratio] • Runtime ratio is correlated with the ratio of the number of user profiles
Data Size • #users: 18,000 users sampled from March–April 2007 • Runtime increases linearly with data size
Related Work • Content-based user profiling: • Profile contains a taxonomic hierarchy for the long-term model; the taxonomy is taken from the ODP, and short-term activities update the hierarchy • Adaptive user profile: use words that appear in the Web pages and combine them using tf-idf over a time window, giving different weights according to the recency of the browsing • KL approach to user tasks: • Filter out new documents that are not related to the user, based on her profile • Annotate a URL with the most descriptive query term for a given user, based on her profile • User targeting in large-scale systems: • Behavioral targeting system over Hadoop MapReduce • Large scale collaborative-filtering technique for movie recommendations • Incremental algorithm that constructs a user profile from monitoring and user feedback, trading off profile complexity against quality
Conclusions & Future Work • We proposed a scalable user profiling solution, implemented on top of Hadoop MapReduce • We showed quality and scalability results • We plan to extend the user model into a semantic model • Extend the user profile to include structured data