Extracting User Profiles from Large Scale Data David Konopnicki, IBM Haifa Research Lab Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
Motivating Example • Keywords modeling: for each user, report the most meaningful keywords to describe her profile (e.g. "san-francisco", "peer", "michael jackson alive", "analysis") • Large scale content analysis for a massive number of users • [Diagram: user browsing feeds a profiles database; profiles are continuously updated and consumed by a dashboard and an advertisement system that tracks statistics about readers' interests]
Contributions • User Profiling Framework: • User profile model • KL approach to weight user profile • Large scale implementation: • MapReduce flow • Experiments: • Quality analysis • Scalability analysis
User Profiling Framework – Setting • Logging: <userID, docID> pairs, e.g. <u1,d1>, <u1,d2>, <u2,d2> • Content: <docID, content> pairs, e.g. <d1,{bla,bla,bla}>, <d2,{foo,foo}> • Profiles built from the logged pairs are used for targeting
User Profiling - Definitions • Bag of words model (BOW) • Profile maintenance • User snapshot • Community snapshot
User Profiling - Intuition • Find terms that are highly frequent in the user snapshot and that separate the most between the user and the community snapshots, e.g. { Travel, Tennis, Sport }
User Profiling – Naïve approach • Term frequency tf(t,d): number of times a term t appears in document d • Document frequency df(t,D): number of documents in D containing the term t • "Frequent": average tf of the term over the user snapshot (the probability to find the term in the user snapshot) • "Separate": inverse document frequency (idf) of the term in the community snapshot
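The naïve scoring above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the deck's exact formula: the `+1` smoothing in the idf denominator is an assumption added so terms unseen in the community do not divide by zero, and all names are hypothetical.

```python
import math
from collections import Counter

def naive_scores(user_docs, community_docs):
    """Naive tf-idf-style profile scoring: average tf over the
    user snapshot times idf in the community snapshot."""
    n_docs = len(community_docs)
    # df(t, D): number of community documents containing t
    df = Counter()
    for doc in community_docs:
        df.update(set(doc))
    # total tf of each term over the user's documents
    tf = Counter()
    for doc in user_docs:
        tf.update(doc)
    scores = {}
    for t, total in tf.items():
        # +1 in the denominator: illustrative smoothing so that
        # terms unseen in the community do not divide by zero
        idf = math.log(n_docs / (1 + df[t]))
        scores[t] = (total / len(user_docs)) * idf
    return scores
```

With this sketch, a term that the user mentions often but that is common community-wide gets a low idf and therefore a low score, matching the "frequent and separating" intuition.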
Kullback-Leibler (KL) Divergence • Measures the difference between two probability distributions P1 and P2: KL(P1 || P2) = Σt P1(t) · log( P1(t) / P2(t) ) • KL measures the distance between the community distribution and the user distribution • Each term is scored according to its contribution to the KL distance between the community and the user distributions • The top scored terms are then selected as the user's important terms
User Profiling – KL method • Community marginal term distribution: probability P(t|Dj) to find a term t in the community snapshot, estimated from the average tf of t over the community snapshot with a probability normalization factor • User marginal term distribution: the relative initial weight of term t, smoothed with the community snapshot (smoothing parameter λ = 0.001)
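The KL-based term scoring can be sketched as follows. This is a hedged reconstruction, not the paper's exact estimator: it assumes Jelinek–Mercer-style smoothing of the user distribution with the community distribution, and how the deck's λ = 0.001 enters the formula is an assumption; all function names are illustrative.

```python
import math

def kl_term_scores(user_tf, community_tf, lam=0.001):
    """Score each term by its contribution to the KL divergence
    between the (smoothed) user and the community term distributions.

    user_tf / community_tf: dicts mapping term -> frequency count.
    lam: weight given to the community distribution when smoothing
    the user distribution (a free parameter; the deck uses 0.001).
    """
    u_total = sum(user_tf.values())
    c_total = sum(community_tf.values())
    p_c = {t: n / c_total for t, n in community_tf.items()}
    scores = {}
    for t in user_tf:
        # maximum-likelihood user probability, smoothed with the community
        p_u = (1 - lam) * user_tf[t] / u_total + lam * p_c.get(t, 0.0)
        if p_c.get(t, 0.0) > 0:
            # per-term contribution to KL(user || community)
            scores[t] = p_u * math.log(p_u / p_c[t])
        else:
            scores[t] = 0.0  # term unseen in the community: no valid ratio
    return scores

def top_profile_terms(user_tf, community_tf, k=3):
    """The top-k scored terms form the user's profile."""
    s = kl_term_scores(user_tf, community_tf)
    return [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
```

Note that very common terms ("the", stop words) get negative or near-zero contributions because the user's probability barely exceeds the community's, while distinctive terms rise to the top.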
MapReduce Flow (each job reads its input from and writes its output to HDFS) • TF: map (d, text) → ({t,d}, 1); reduce sums to ({t,d}, tf(t,d)) • |Dj(u)|: map (u, d) → (u, 1); reduce sums to (u, |Dj(u)|) • TF: map (t, tf(t,d), |Dj|) → (t, {tf(t,d), |Dj|, 1}); reduce averages to (t, tf(t,Dj)) • UDF: map ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}) → ({u,t,|Dj(u)|}, {1}); reduce to ({u,t}, {udf(t,Dj(u))}) • DF: map ({t,d}, tf(t,Dj)) → (t, 1); reduce to (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)}) • Nj: map ({t}, {tf(t,Dj), cdf(t,Dj)}) → (t, Nj); identity reduce • P(t|Dj): map ({t}, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}) → (t, P(t|Dj)); identity reduce
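The first job of the flow (per-document term frequencies) can be sketched as a plain-Python simulation of map / shuffle / reduce. This is a didactic stand-in for a Hadoop job, not the paper's implementation; the function names are illustrative.

```python
from collections import defaultdict

# Simulate the first job of the flow: from (docID, text) records,
# emit ({t, d}, 1) pairs in the map phase and sum the counts in the
# reduce phase to obtain tf(t, d).

def tf_map(doc_id, text):
    for term in text.split():
        yield (term, doc_id), 1

def shuffle(pairs):
    """Group mapper output by key, as Hadoop's shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def tf_reduce(key, values):
    return key, sum(values)  # tf(t, d)

def run_tf_job(docs):
    mapped = [kv for d, text in docs for kv in tf_map(d, text)]
    return dict(tf_reduce(k, vs) for k, vs in shuffle(mapped).items())
```

The remaining jobs in the flow chain in the same way, each one consuming the (key, value) records the previous job left on HDFS.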
MapReduce Flow – cont. [Diagram: further HDFS-backed jobs compute the per-user term weights w, the normalization ∑w, and finally P(t|Dj(u))]
Experimental Data – quality analysis • Open Directory Project (ODP): • Categories are associated with manual labels • Considered as "ground truth" in this work • Examples: • ODP: Science/Technology/Electronics: Manual label: "Electronics" • ODP: Society/Religion/and/Spirituality/Buddhism: Manual label: "Buddhism" • Data Collection: • 100 different categories randomly selected from ODP • 100 documents randomly selected per category • A total collection size of about 10,000 Web pages • Evaluation: • A suggested label is counted as a match if it is identical to, an inflection of, or a WordNet synonym of the manual label
Results • Measured: in how many cases at least one of the top-K terms matched the manual label • KL outperforms all the other feature-selection approaches
Experimental Data – scalability analysis • Blogger.com • Data Collection: • We crawled 973,518 blog posts from March 2007 to January 2009 • Total collection size of 5.45GB, with ~120,000 users • Cluster setting: • 4-node cluster of commodity machines (each with 4GB RAM, 60GB HD, 4 cores) • Hadoop 0.20.1 • Example blog entry: http://grannyalong.blogspot.com/
Number of User Profiles • [Plot: document ratio, user-profile ratio, and runtime ratio] • Runtime ratio is correlated with the ratio of the number of user profiles
Data Size • #users: 18,000 users sampled from March–April 2007 • Runtime increases linearly with data size
Related Work • Content-based user profiling: • Profile contains a taxonomic hierarchy for the long-term model; the taxonomy is taken from the ODP, and short-term activities update the hierarchy • Adaptive user profile: use words that appear in the Web pages and combine them using tf-idf over a time window, giving different weights according to the recency of the browsing • KL approach to user tasks: • Filter out new documents that are not related to the user, based on her profile • Annotate a URL with the most descriptive query term for a given user, based on her profile • User targeting in large-scale systems: • Behavioral targeting system over Hadoop MapReduce • Large scale collaborative-filtering technique for movie recommendations • Incremental algorithm that constructs a user profile from monitoring and user feedback, trading off profile complexity against quality
Conclusions & Future Work • We proposed a scalable user profiling solution, implemented on top of Hadoop MapReduce • We showed quality and scalability results • We plan to extend the user model into a semantic model • Extend the user profile to include structured data