Quantization of Social Data for Friend Advertisement Recommendation System

Lynne Grewe and Sushmita Pandey, California State University East Bay, lynne.grewe@csueastbay.edu

The Goal • Using social data to make social advertisement recommendations. [Figure: a user and friends in a social network use the PPARS social network application, which draws on advertisements and produces a recommendation such as “Your friends Nathan and Marty will like this”.]

The Problems • What is the social data? • Which social data is usable/best? • How do we capture and analyze it? • How do we relate social data to advertisements? • How do we deliver a social advertisement?

The Environment • Social Network: MySpace, Facebook, Hi5, Orkut, LinkedIn, Netlog, more

Overview of Talk • PPARS overview • Data – problem of multiple networks • Example of Data • Parsing • Quantization • Results • Advertisement Recommendation Results • Future Work

Our System Overview • PPARS = Peer Pressure Advertisement Recommendation System [Figure: system diagram with blocks for data input (user-origin), front end (get user and friends quantized), quantized model, ads, ad group / ad matches and socialize, process groups, and peer-pressure ad selection, ending in the user's ad choice.]

Social Data • Every network can provide different social data. • Two main splits: Facebook and OpenSocial (the majority of the others). • OpenSocial is an open standard adopted by over 30 containers and growing, with an international audience. It allows for “standardized” access. • Popular containers include MySpace, LinkedIn, Google, Yahoo!, etc. • Corporate support: Google, Yahoo!, IBM, Microsoft, and more.

Data Fields

Some Example Data

Social Data – which? • Not all networks provide access to same data • Users can keep information private • Not all data is “social” • Not all data is directly useful for advertisers

Data • Not typically available / private • Not all data is “social” • Not all data is directly useful for advertisers

Infrequent data • Our scheme needs data that users have in common, so that we can reason over a common feature space. • Data that is NOT frequent:

Social Data – which? • First pass: chosen based on network availability and commonality, user prevalence, and estimated advertisement usefulness • Balance between small sample space and feature dimensionality

PPARS – Front End [Figure: user-origin data (user data, friend 1 data, friend 2 data, ... friend X data; for example “I like cars, have 2 kids, …..”, “Movies: Star Wars”, “Age = 30”, …..) → Parsing → individual social data tokens → Quantization (supported by web services, ontology, and codebooks) → set of quantized user and friend data vectors.]

Parsing • Create small social data tokens to pass to Quantization • Hierarchical segmentation: raw social data → null data test → split by . / ! / ? → split by : → split by - → split by ; → split by , → individual social data tokens • Example input: “I like lots of movies. Like:Star Wars, Star Wars II, Jaws. And I love Harrison Fords acting.” • Resulting social data tokens: • I like lots of movies • Like • Star Wars • Star Wars II • Jaws • And I love Harrison Fords acting
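The hierarchical segmentation on the Parsing slide can be sketched in a few lines of Python. This is a minimal illustration, not the PPARS implementation: the function name `parse_social_data` and the list `DELIMITER_LEVELS` are hypothetical, and the delimiter order is an assumption read off the slide's diagram.

```python
import re

# Delimiter levels, coarse to fine. The order is an assumption taken from
# the slide: sentence enders, then ":", "-", ";" and finally ",".
DELIMITER_LEVELS = [r"[.!?]", r":", r"-", r";", r","]


def parse_social_data(raw):
    """Split one raw social data field into small social data tokens."""
    # Null data test: an empty or missing field yields no tokens.
    if raw is None or not raw.strip():
        return []
    segments = [raw]
    # Apply each delimiter level to every segment from the previous level.
    for pattern in DELIMITER_LEVELS:
        next_segments = []
        for segment in segments:
            next_segments.extend(re.split(pattern, segment))
        segments = next_segments
    # Trim whitespace and drop empty fragments.
    return [s.strip() for s in segments if s.strip()]


tokens = parse_social_data(
    "I like lots of movies. Like:Star Wars, Star Wars II, Jaws. "
    "And I love Harrison Fords acting."
)
print(tokens)
```

Splitting on a bare "-" also breaks hyphenated words apart, which is one reason the later slide lists syntax rules around delimiters as future work.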

Parsing Example About Me input = “I work as an engineer at Motorola. I work in the peripherals department and do chip design. I am doing some management.” Resulting Social Data Tokens: • I work as an engineer at Motorola • I work in the peripherals department and do chip design • I am doing some management

Parsing Example Interests input = “Internet, Movies, Reading, Karaoke, Building alternate communities” Resulting Social Data Tokens: • Internet • Movies • Reading • Karaoke • Language • Building alternate communities

Parsing Example Music input = “Bands:Superdrag, Weezer, The Doors, The Beach Boys, Journey Solo Artists: Billy Joel, Albums: Appetite for Destruction - Guns & Roses; Blue - Weezer” Resulting Social Data Tokens: • Bands • Superdrag • The Doors • Cheap Trick • The Beach Boys • Journey Solo Artists • Billy Joel • Albums • Appetite for Destruction • Guns & Roses • Blue • Weezer Note: the line return between “Journey” and “Solo Artists” was lost in the input, so the two parse as a single token.

Parsing • Simple technique of segmentation • Future work – include semantics of phrases to detect potential “headings”, syntax rules around delimiters like : and –

Quantization Take a social data token and translate it into a numerical feature vector. “I like cars” → Cars = 0.2 • For each social data field we need to create meaningful feature vector elements. • For each social data field we need techniques/algorithms that translate the raw social data token into support for its different feature vector elements.

Quantization – feature vector • Pattern recognition and matching are later parts of PPARS. • They need numerical representations of the user's and friends' social data, and of the ads. “I like cars” → which ad? Cars = 0.2 → an ad with Cars around 0.2

Quantization – feature vector • Each social data element, like “About Me”, “Gender”, or “Movies”, has its own feature vector design. • The design is a result of the technique used to quantize the input social data tokens. • It is also a result of studying keywords/trends in a user database of sample social tokens. • To understand this, let's first discuss the techniques used to quantize social data tokens as they relate to the “type” of data element.

Quantization and Social Data Type Numerical Data • Data is naturally numerical, e.g. age, date of birth • Or it can be quickly and effectively translated into a number in some defined range: • Address – can be translated into latitude and longitude • Phone – limited number of digits • Time zone – predefined ranges Categorizable Data • Data where there is a predefined, accepted taxonomy, e.g. movies and their genres • Data where categories can be derived through sample analysis and advertisement goals • Examples: interests, about me, food, fashion Indexed Data • Data that has defined sets of values specific to either the container or OpenSocial • Example: smoker = yes, no, occasionally, quit, never • Other examples: gender, relationship, drinker, sexual orientation Other • Data for which we cannot easily derive a categorizing algorithm • Examples: Profile Image, Profile Song URL, etc.
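As a rough illustration of the two simplest data types above, the sketch below quantizes an indexed field and a numerical field. Everything here is an assumption made for the example: the function names, the use of the list position as the index, the 0 to 100 age range, and the reuse of -1 (the value the results slides show for "no hits") to mark missing data.

```python
# Allowed values for the "smoker" field, as listed on the slide.
SMOKER_VALUES = ["yes", "no", "occasionally", "quit", "never"]

NO_DATA = -1  # assumption: same marker the results slides use for "no hits"


def quantize_indexed(value, allowed_values):
    """Map an indexed field to its position in the allowed value set."""
    if value is None:
        return NO_DATA
    normalized = value.strip().lower()
    if normalized in allowed_values:
        return allowed_values.index(normalized)
    return NO_DATA  # private, missing, or unrecognized value


def quantize_age(age, max_age=100):
    """Scale an age into [0, 1]; max_age is a hypothetical upper bound."""
    if age is None or age < 0:
        return NO_DATA
    return min(age, max_age) / max_age


print(quantize_indexed("Occasionally", SMOKER_VALUES))
print(quantize_age(30))
```

Categorizable fields need more machinery (taxonomies, web services, keyword lookup), which the following slides cover.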

Collapsing of Data • Some data fields have almost the same meaning, or their content typically overlaps heavily • About Me and Interests (and even Status) • Age and Date of Birth

Categorizable Data • This is the bulk of the data fields: About Me, Interests, Music, Movies, TV, Books, Looking For, Religion, Ethnicity, Language • Determine Feature Elements: • Accepted “standard” taxonomies • Web Service taxonomies • Advertisement driven taxonomies


Categorization: Web Service • For some of our social data fields we are able to utilize popular web services to convert our social data tokens into search hits that have categorized information associated with them. • Example: Internet Video Archive and IMDB • Use movie genre

IVA – movie search by actor “Robert Redford” • http://api.internetvideoarchive.com/Video/MoviesByActorName.aspx?DeveloperId=f377f57f-3bad-4704-8e80-1b643b206abd&SearchTerm=Robert+Redford • Some of the results:
<item>
<Description><![CDATA[ The Unforeseen movie trailer - starring Robert Redford, Willie Nelson, Ann Richards, Gary Bradley, Judah Folkman, William Greider. Directed by Laura Dunn. Theatrical Release Date: 2/29/2008 Genre: Documentary Rating: Not Rated ]]></Description>
<Title>THE UNFORESEEN</Title>
<Language>English</Language>
<Country>United States</Country>
<SiteUrl />
<Studio>Two Birds Films</Studio>
<StudioID>3018</StudioID>
<Rating>Not Rated</Rating>
<Genre>Documentary</Genre>
<GenreID>13</GenreID>

IVA – movie search continued • http://api.internetvideoarchive.com/Video/MoviesByActorName.aspx?DeveloperId=f377f57f-3bad-4704-8e80-1b643b206abd&SearchTerm=Robert+Redford
<HomeVideoReleaseDate>9/16/2008</HomeVideoReleaseDate>
<TheatricalReleaseDate>2/29/2008</TheatricalReleaseDate>
<Director>Laura Dunn</Director>
<DirectorID>36635</DirectorID>
<Actor1>Robert Redford</Actor1>
<ActorId1>7105</ActorId1>
<Actor2>Willie Nelson</Actor2>
<ActorId2>8591</ActorId2>
<Actor3>Ann Richards</Actor3>
<ActorId3>36642</ActorId3>
<Actor4>Gary Bradley</Actor4>
<ActorId4>36637</ActorId4>

IVA – movie search continued • http://api.internetvideoarchive.com/Video/MoviesByActorName.aspx?DeveloperId=f377f57f-3bad-4704-8e80-1b643b206abd&SearchTerm=Robert+Redford
<HomeVideoReleaseDate>9/16/2008</HomeVideoReleaseDate>
<Link>http://videodetective.com/titledetails.aspx?publishedid=947964</Link>
<BoxOfficeInMillions>-1</BoxOfficeInMillions>
<AirDayOfWeek>-1</AirDayOfWeek>
<AirStartTime />
<ShowLengthInMinutes>-1</ShowLengthInMinutes>
<IsTelevisionContent>false</IsTelevisionContent>
<FirstReleasedYear>2008</FirstReleasedYear>
<Image>http://content.internetvideoarchive.com/content/photos/1250/05253626_.jpg</Image>
<Duration>164</Duration>
<DateCreated>3/20/2008 8:00:00 AM</DateCreated>
<Media>Movie</Media>
<PublishedId>947964</PublishedId>
<DateModified>4/22/2011 1:57:00 PM</DateModified>
... and more. Of these fields, we selected GENRE.

IVA genres --- our movie feature elements

Movie Quantization • For each social data token (“Adam Sandler”, “Star Wars”) we can get multiple hits. • Example: “Robert Redford”, first 8 hits: • Drama = 5 • Western = 1 • Documentary = 2 • These genres become our movie feature elements. • Issues: • How do we know whether a token is an actor name, movie title, director, or something else? • Multiple hits for an actor or director: what do we do? (Evidence them all.) • Multiple hits for a movie title: what do we do? (Take the first hit.)

Order of Movie Quantization • Given any social data element parsed from the user's MOVIE data, we cannot know a priori whether it is a title, an actor's name, or a director's name. It may even be a genre of movies the user likes. • Title search (take first hit) • Actor search (evidence all) • Director search (evidence all) • Keyword matching (see next)
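The search order above can be sketched as a small cascade. This is only an illustration under stated assumptions: the three search callables stand in for web service lookups and are hypothetical (no real IVA or IMDB call is made), each returns one genre string per hit, and the choice to stop at the first stage that returns hits is an assumption the slides do not spell out.

```python
from collections import Counter


def quantize_movie_token(token, title_search, actor_search,
                         director_search, keyword_lookup):
    """Return genre evidence counts for one movie social data token.

    Each search argument is a hypothetical callable that takes the token
    and returns a list of genre strings, one per hit ([] when no hits).
    """
    hits = title_search(token)
    if hits:
        return Counter(hits[:1])  # title search: take the first hit only
    for search in (actor_search, director_search):
        hits = search(token)
        if hits:
            return Counter(hits)  # actor/director: evidence all hits
    # No service hits: fall back to keyword matching.
    return Counter(keyword_lookup(token))


# Stub lookups with made-up data that mirrors the "Robert Redford" slide.
def stub_title(token):
    return ["Science Fiction", "Comedy"] if token == "Star Wars" else []


def stub_actor(token):
    if token == "Robert Redford":
        return ["Drama"] * 5 + ["Western"] + ["Documentary"] * 2
    return []


def stub_none(token):
    return []


evidence = quantize_movie_token(
    "Robert Redford", stub_title, stub_actor, stub_none, stub_none
)
print(evidence)
```

Passing the lookups in as arguments keeps the cascade testable without network access; how the counts are scaled into feature values is left out because the slides do not state it.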

Quantization Result 1 Input: Up, Forrest Gump, Rear Window, District 9, Pac-Man, WALL·E, My Flesh and Blood, MacMusical Yields: • MOVIE_FAMILY = 0.6, MOVIE_SCIFI = 0.2, MOVIE_DOCUMENTARY = 0.4, MOVIE_THRILLER = 0.2

Quantization using other services • TV - IMDB, http://www.imdb.com/search/title?title_type=tv_series&title=". • Books - Google Books Search, http://books.google.com/books/feeds/volumes? • Music - IVA’s music API http://api.internetvideoarchive.com/Music/**

Quantization via Keyword Matching • What do we do when there is no predetermined taxonomy and no service for database hits? • Natural Language Processing techniques • We currently employ a simple (but effective and efficient) technique: keyword matching/lookup • Create a database of predetermined phrases/keywords • Use a lookup scheme to quantize social data token(s). “I work as an engineer” → About Me lookup? “Watch a lot of drama” → Movies lookup? [Figure: individual social data tokens → quantization using ontology and codebooks → set of quantized user and friend data vectors.]

Keyword Database • Used on : About Me / Interests, Religion, Ethnicity, Looking For, Language, Relationship • Secondary use: Books, TV, Music, Movies • When service fails to provide any hits

Keyword Database Creation • Manual scanning of hundreds (at the starting level) of user profiles • Domain-specific expert (human) knowledge • Dictionaries and taxonomies when they exist Issue: how do we determine weights for every entry? • Expert determined (consistency) or all equal valued (no sense of importance) Issue: at the very beginning level, can we create a dictionary for everything? No. Are there more advanced NLP techniques?

Some arbitrary Keyword DB entries (data field, feature element, keyword, weight) • ABOUT_ME HOME Cats 0.2 • ABOUT_ME HOME Children 0.2 • ABOUT_ME HOME Daughter 0.2 • ABOUT_ME HOME Dog 0.2 • ABOUT_ME HOME home 0.5

Some arbitrary Keyword DB entries • ABOUT_ME ENTERTAINMENT Shopping 0.2 • ABOUT_ME ENTERTAINMENT Shows 0.2 • ABOUT_ME ENTERTAINMENT Sing 0.2 • ABOUT_ME ENTERTAINMENT Ski 0.2 • ABOUT_ME ENTERTAINMENT Songwriter 0.2

Keyword DB – evidence weight Issue: how do we determine weights for every entry? • Expert determined (consistency) • or all equal valued (no sense of importance) System options: DB weights can take on different values, with an option to run with all weights equal.

Keyword DB – coverage Issue: at the very beginning level, can we create a dictionary for everything? No. Are there more advanced NLP techniques to explore for inferences? • While users can write anything (and do), remember that we are focused on advertisement recommendation, so the scope of our language is limited to hits related to our feature vector elements. This is a constrained problem. • Home, Entertainment, Smoking, Work, Social, Movies, TV, Shopping, Books, etc.; these are the kinds of areas we are concerned with.

Types of Keyword Matching STRICT • Social data token must exactly match a DB entry: “Drama” → DB entry “Drama” √; “I like Drama” → DB entry “Drama” X DB_ENTRY_CONTAINS_DATA_ELEMENT • Data token must exist inside the DB entry: “Drama” → DB entry “Drama and Comedy” √ DB_ENTRY_PARTOF_DATA_ELEMENT • Part of the data token matches the DB entry (this further segments the data token): “I like Drama” → DB entry “Drama” √
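The three matching modes can be expressed as one small predicate. This is a sketch rather than the PPARS code: the enum and function names are illustrative, and case-insensitive comparison on whole-word boundaries is an assumption (the slides do not say how case or partial words are handled).

```python
import re
from enum import Enum


class MatchType(Enum):
    STRICT = 1
    DB_ENTRY_CONTAINS_DATA_ELEMENT = 2
    DB_ENTRY_PARTOF_DATA_ELEMENT = 3


def _contains_phrase(haystack, phrase):
    """True if phrase occurs in haystack on word boundaries."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", haystack) is not None


def keyword_matches(token, db_entry, match_type):
    """Decide whether a social data token matches a keyword DB entry."""
    t = token.strip().lower()  # assumption: matching ignores case
    e = db_entry.strip().lower()
    if match_type is MatchType.STRICT:
        return t == e  # token must equal the entry
    if match_type is MatchType.DB_ENTRY_CONTAINS_DATA_ELEMENT:
        return _contains_phrase(e, t)  # token sits inside the entry
    if match_type is MatchType.DB_ENTRY_PARTOF_DATA_ELEMENT:
        return _contains_phrase(t, e)  # entry sits inside the token
    raise ValueError(f"unknown match type: {match_type}")


print(keyword_matches("I like Drama", "Drama", MatchType.STRICT))
print(keyword_matches("I like Drama", "Drama",
                      MatchType.DB_ENTRY_PARTOF_DATA_ELEMENT))
```

Word-boundary matching avoids hits such as "car" inside "scary"; a plain substring test would be simpler but noisier.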

Quantization Results: different kinds of Keyword Matching Input: ‘I am a student and I work and love cars’ Output with STRICT: • No hits • ABOUT_ME_ENTERTAINMENT = -1 • ABOUT_ME_WORK = -1 • ABOUT_ME_HOME = -1 • ABOUT_ME_SOCIAL = -1 • ABOUT_ME_FOOD = -1

Quantization Results: different kinds of Keyword Matching Input: ‘I am a student and I work and love cars’ Output with DB_ENTRY_CONTAINS_DATA_ELEMENT: • No hits • ABOUT_ME_ENTERTAINMENT = -1 • ABOUT_ME_WORK = -1 • ABOUT_ME_HOME = -1 • ABOUT_ME_SOCIAL = -1 • ABOUT_ME_FOOD = -1

Quantization Results: different kinds of Keyword Matching Input: ‘I am a student and I work and love cars’ Output with DB_ENTRY_PARTOF_DATA_ELEMENT: Evidence per keyword: • keyword = student: ABOUT_ME_WORK = 0.2 • keyword = work: ABOUT_ME_WORK = 0.5 • keyword = cars: ABOUT_ME_ENTERTAINMENT = 0.2 • keyword = love: ABOUT_ME_HOME = 0.2, ABOUT_ME_SOCIAL = 0.2 Resulting feature vector: • ABOUT_ME_ENTERTAINMENT = 0.2 • ABOUT_ME_WORK = 0.7 • ABOUT_ME_HOME = 0.2 • ABOUT_ME_SOCIAL = 0.2 • ABOUT_ME_FOOD = -1
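The result on this slide can be reproduced with a short sketch that accumulates keyword evidence per feature element. The keyword rows below are taken from the per-keyword output shown on the slide; the table layout, the function name, and the rule that weights for the same feature element are summed (0.2 + 0.5 = 0.7 for ABOUT_ME_WORK) are inferences from those numbers, not a confirmed description of PPARS.

```python
import re

# (feature element, keyword, weight) rows, taken from the slide's output.
KEYWORD_DB = [
    ("ABOUT_ME_WORK", "student", 0.2),
    ("ABOUT_ME_WORK", "work", 0.5),
    ("ABOUT_ME_ENTERTAINMENT", "cars", 0.2),
    ("ABOUT_ME_HOME", "love", 0.2),
    ("ABOUT_ME_SOCIAL", "love", 0.2),
]

FEATURES = [
    "ABOUT_ME_ENTERTAINMENT",
    "ABOUT_ME_WORK",
    "ABOUT_ME_HOME",
    "ABOUT_ME_SOCIAL",
    "ABOUT_ME_FOOD",
]

NO_EVIDENCE = -1  # value the slides show for elements with no hits


def quantize_about_me(token):
    """Quantize an About Me token using DB_ENTRY_PARTOF_DATA_ELEMENT matching."""
    text = token.lower()
    totals = {}
    for feature, keyword, weight in KEYWORD_DB:
        # The keyword must appear as a whole word inside the token.
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            # Assumption: evidence for the same element is summed.
            totals[feature] = totals.get(feature, 0.0) + weight
    return {
        f: round(totals[f], 2) if f in totals else NO_EVIDENCE
        for f in FEATURES
    }


vector = quantize_about_me("I am a student and I work and love cars")
for name, value in vector.items():
    print(name, "=", value)
```

Running with all weights equal, the system option mentioned on the evidence weight slide, would only require replacing the weight column with a constant.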
