Lexicon: exploring language trends on Facebook Walls • Roddy Lindsay • Data Team
What’s a Wall? • Walls are semi-public and public forums on profiles, groups, events, etc. (Screenshots: old and new Wall designs)
Numbers • Blogs • 1.6 million posts per day (Technorati) • ~18 posts per second • Walls • 12-20 million wall posts per day • ~180 posts per second • 5-9 million unique users per day • 2-2.5 GB of unstructured text per day
Brief History of Lexicon • First iteration: “Pulse” (2006) • Interests in profile fields ranked by count • E.g. “Top movies in San Francisco Network” • Pros • Structure comes from comma-delimited profile fields • Cons • Limited to profile field categories (movies, books, interests, TV shows, music) • Profile information is static (not updated frequently)
Brief History of Lexicon • Attempt 2: • Extract terms from public and semi-public conversations between friends (on the Wall) • Anonymize user data to respect privacy • Plot time series data to show usage trends • Pros • Wall conversations are closer to real-life conversations • Topics are constantly changing, giving a strong temporal signal • Cons • No structure • Greater computational requirements
How does Lexicon work? • Count occurrences of each word and bigram that is posted each day • Aggregate by unique user to minimize the effect of spam • Trim the long tail to handle data explosion • Normalize for intraweek and seasonal variance by putting total posts in the denominator (Charts: daily trend for “apple”) • Interactive Flash charts rolled in-house (used internally and externally for all Facebook reporting products)
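A minimal sketch of the normalization step described above, assuming hypothetical per-day tallies (the names `unique_posters` and `total_posts` are illustrative, not Lexicon’s actual tables):

```python
# Sketch of the daily normalization: divide a term's unique-poster count
# by the day's total posts, damping intraweek/seasonal volume swings.
# Inputs are hypothetical; the real pipeline computes them with
# Map-Reduce jobs over the Wall post logs.

unique_posters = {            # (term, day) -> distinct users who posted it
    ("apple", "2008-03-01"): 12000,
    ("apple", "2008-03-02"): 15000,
}
total_posts = {               # day -> total Wall posts that day
    "2008-03-01": 14_000_000,
    "2008-03-02": 16_000_000,
}

def normalized_frequency(term: str, day: str) -> float:
    """Share of the day's posts whose authors mentioned the term."""
    return unique_posters.get((term, day), 0) / total_posts[day]

print(normalized_frequency("apple", "2008-03-02"))  # 0.0009375
```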
How does Lexicon work? • More technically... • Use Scribe (distributed log file aggregation service built with Thrift) to collect wall post logs from web servers • Have a 180-node Hadoop cluster that loads the log files into Hive, our homegrown data warehouse sitting on top of Hadoop • Pipeline of Map-Reduce scripts (written in Python) that count the number of unique users for each (term, day) pair, then trim the long tail • Load into a horizontally partitioned MySQL tier for user queries • PHP front-end • Memcached sits in front to cache common queries • All of these are (or will be) open-source projects • Facebook is an active contributor to most of these projects
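Under Hadoop Streaming, one stage of that Python pipeline might look roughly like the pair of scripts below. The log-line layout, the tokenizer, and running the job with `-D stream.num.map.output.key.fields=3` (so records sort on all three fields) are assumptions for illustration, not the production code:

```python
#!/usr/bin/env python
# mapper.py -- emit one (term, day, user) record per token (sketch).
# Assumed log line format: day <tab> user_id <tab> post_text.
import re
import sys

TOKEN = re.compile(r"[a-z']+")

for line in sys.stdin:
    try:
        day, user_id, text = line.rstrip("\n").split("\t", 2)
    except ValueError:
        continue  # skip malformed lines
    for term in set(TOKEN.findall(text.lower())):  # emit each term once per post
        print(f"{term}\t{day}\t{user_id}")
```

```python
#!/usr/bin/env python
# reducer.py -- count distinct users per (term, day); input arrives
# sorted on the full record, so duplicate (term, day, user) rows are adjacent.
import sys

prev_key, prev_user, users = None, None, 0
for line in sys.stdin:
    term, day, user_id = line.rstrip("\n").split("\t")
    key = (term, day)
    if key != prev_key:
        if prev_key is not None:
            print(f"{prev_key[0]}\t{prev_key[1]}\t{users}")
        prev_key, prev_user, users = key, None, 0
    if user_id != prev_user:  # new user for this (term, day)
        users += 1
        prev_user = user_id
if prev_key is not None:
    print(f"{prev_key[0]}\t{prev_key[1]}\t{users}")
```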
What is Lexicon useful for? • Tracking news • Lexicon shows relative chatter surrounding current events • Can understand which events are of interest to the Facebook audience (Example charts: “tibet”; “died”, spiking around Heath Ledger’s death)
What is Lexicon useful for? • Natural language trends • Words and phrases constantly enter and exit the lexicon • Track the popularity of terms that are used in everyday conversation (Example charts: “lulz”, “pwned”)
What is Lexicon useful for? • Understanding the Facebook audience • Lexicon trends can yield insights into Facebook demographics, user attitudes towards Facebook products, and how the products are used (Example chart: “the add”)
What is Lexicon useful for? • Brand mindshare • Brands and commercial products are mentioned in Wall conversations, just as in face-to-face conversations (Example charts: “verizon”, “juno”)
What is Lexicon useful for? • Categories that are social in nature yield the strongest signal • Entertainment, Mobile, Automotive, QSR (quick-service restaurants), etc. (Example charts: “honda”, “toyota”)
What is Lexicon useful for? • Measuring the success of sponsored gift campaigns on Facebook • Sponsored gifts: images you can send to friends along with a Wall post (Example chart: “coors”)
Challenges • Term disambiguation • Words are used in a variety of contexts • E.g. my cousin Wendy’s birthday vs. Wendy’s hamburgers • Tracking each different context automatically with machine learning techniques is difficult • Language classifiers, proper tokenization, and smart cleaning of the data can get us part way there (a toy sketch follows below)
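To make “part way there” concrete, here is a toy context scorer in Python; the cue-word lists and the `disambiguate` helper are invented for illustration and are not Lexicon’s real classifiers:

```python
# Toy context disambiguator for "wendy's" (illustrative only; a real
# system would use trained language classifiers, not hand-picked cues).

BRAND_CUES = {"burger", "fries", "frosty", "drive", "thru", "ate", "lunch"}
PERSON_CUES = {"birthday", "cousin", "party", "her", "she", "miss", "love"}

def disambiguate(post: str) -> str:
    """Guess whether a post mentioning "wendy's" is about the brand
    or a person, by counting overlapping cue words."""
    words = set(post.lower().split())
    brand = len(words & BRAND_CUES)
    person = len(words & PERSON_CUES)
    if brand == person:
        return "unknown"
    return "brand" if brand > person else "person"

print(disambiguate("got a frosty at wendy's after lunch"))     # brand
print(disambiguate("happy birthday wendy's party was great"))  # person
```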
Challenges • Sentiment • Is the mention of a term positive, negative, neutral, something else? • Most challenging aspects: irony, ambiguous sentiment terms, complex grammar • Many top companies use humans to rate a sizable percentage of posts • Numerous Ph.D. candidates have quit graduate school over this problem • Obviously a difficult task...
Challenges • Sentiment • The language on Facebook wall posts is characterized by: • slang, lulz • mispellings • blunt sentences. • superfluous punctuation!!! • absent punctuation for example • emoticons ^_^ • acronyms, omg • a big freaking mess
Challenges • Sentiment • Blunt language without complex grammar means that irony and sarcasm aren’t big issues • Synonym identification (figuring out that “hotttt” == “hot”; see the sketch below), subjective/objective classification, and tokenization are more troublesome • Something to keep in mind: strong prior probability of a subjective post being positive (80-90% as rated by humans) • Walls are not blogs or movie reviews • Theory: users don’t want to appear negative, and so mostly avoid making overtly negative comments • A sentiment classifier that guesses positive every time gives the least error • Maybe sentiment isn’t as important for us...
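As one small piece of that synonym-identification problem, a regex heuristic can collapse elongated spellings so variants count toward the same term; this is an assumed approach for illustration, not necessarily what Lexicon did:

```python
import re

# Collapse runs of 3+ repeated characters to one, so elongated
# spellings map to a common form ("hotttt" -> "hot", "soooo" -> "so").
# Heuristic: in rare cases it can also squash legitimate long runs.
ELONGATED = re.compile(r"(\w)\1{2,}")

def normalize(token: str) -> str:
    return ELONGATED.sub(r"\1", token.lower())

assert normalize("hotttt") == "hot"
assert normalize("HOTTTT") == "hot"
assert normalize("soooo") == "so"
```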
Future trends for text analytics • Data visualization • Graph structure/Diffusion analysis • Cloud computing