TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets • SIGMOD ’11, C. Chen et al. • Presented by Pete Bohman and Adam Kunk
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Real-Time Search • Requirements • Contents searchable immediately following creation • Scale to thousands of updates/sec (e.g., news of Osama bin Laden’s death peaked at roughly 5,000 tweets/sec) • Results relevant to the query via cost-efficient ranking
Real-Time Search • [Figure: example query results ranked by time vs. by TI Rank]
TI Concepts • Real-time search of microblogging applications is provided via two components: • Indexing Mechanism – for pruning tweets, only looking at a subset of all tweets (allows for speed) • Ranking Mechanism – for looking at relevant tweets (weeding out tweets that are not deemed important enough) • Main idea: look at important tweets only
Real-Time Search • Real-Time Search = Indexing + Ranking • TI Index • Scalable indexing scheme based on partial indexing • Only index tweets likely to appear in query results • TI Rank • User’s PageRank • Popularity of the topic • Tweet-to-query similarity
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Partial Indexing • The Case for Partial Indexes • Stonebraker, 1989 • Index only a portion of a column • User-specified index predicates (where salary > 500) • Build the index as a side effect of query processing • Incremental index building
View Materialization • Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object. • One application of materialized views is to use cost models to automatically select which views to materialize. • TI borrows the idea of only indexing essential tweets in real time from view materialization.
Microblog Search • Google and Twitter have both released real-time search engines. • Google’s engine adaptively crawls the microblog • Twitter’s engine relies on Apache’s Lucene (high-performance, full-featured text search engine library) • But, both the Google and Twitter engines only utilize time in their ranking algorithms. • TI’s ranking algorithm takes much more than just time into account.
TI Cost Reduction • TI clusters similar tweets together and offloads noisy tweets in order to reduce the computation cost of real-time search. • Tweets are grouped into topics by their reply relationships: tweets replying to the same tweet or belonging to the same thread are organized as a tree. • TI also maintains popular topics in memory.
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
User Graph • Twitter users have links to other friends • A User Graph is utilized to demonstrate this relationship • Gu = (U, E) • U is the set of users in the system • E is the friend links between them
Tweet Tree Structure • Nodes represent tweets • Directed edges indicate replies or retweets • Implemented by assigning tweets a tree encoding ID
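A minimal sketch of how such a tree encoding could work, assuming a Dewey-style ID derived from the parent’s encoding (the paper does not spell out the exact scheme); TweetNode and its fields are illustrative:

```python
class TweetNode:
    """A node in a tweet tree; edges point from a tweet to its replies/retweets."""
    def __init__(self, tid, parent=None):
        self.tid = tid
        self.parent = parent
        self.children = []
        if parent is None:
            self.encoding = str(tid)                 # root encodes itself
        else:
            parent.children.append(self)
            # Dewey-style encoding: parent's encoding plus this child's position
            self.encoding = f"{parent.encoding}.{len(parent.children)}"

    def root_tid(self):
        # The root TID is recoverable from the encoding alone
        return int(self.encoding.split(".")[0])

# A reply chain hanging off tweet 100:
root = TweetNode(tid=100)
reply = TweetNode(tid=101, parent=root)
nested = TweetNode(tid=102, parent=reply)
print(nested.encoding, nested.root_tid())            # 100.1.1 100
```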
TI Design • Search is handled via an inverted index over tweets • Given a keyword, the inverted index returns a tweet list T • T contains the set of matching tweets sorted by timestamp
TI Inverted Index • TID = Tweet ID • U-PageRank = Used for ranking • TF = Term Frequency • tree = TID of root node of tweet tree • time = timestamp
Ranking Support • In order to help ranking, TI keeps a table of metadata for each tweet • TID = tweet ID • RID = ID of replied tweet (to find parent) • tree = TID of root node of tweet tree • time = timestamp • count = number of tweets replying to this tweet
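A sketch of the two structures just described. Field names follow the slides; the concrete Python layout (dataclasses, a dict-of-lists inverted index) is an assumption for illustration, not the paper’s storage format:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Posting:                 # one entry in a keyword's tweet list
    tid: int                   # tweet ID
    u_pagerank: float          # author's U-PageRank, used for ranking
    tf: int                    # term frequency of the keyword in the tweet
    tree: int                  # TID of the root node of the tweet tree
    time: float                # timestamp

@dataclass
class TweetMeta:               # per-tweet metadata table used for ranking
    tid: int                   # tweet ID
    rid: Optional[int]         # ID of the replied tweet (to find the parent)
    tree: int                  # TID of the root node of the tweet tree
    time: float                # timestamp
    count: int = 0             # number of tweets replying to this tweet

# keyword -> postings, kept sorted by timestamp
inverted_index: dict = defaultdict(list)
# TID -> metadata
tweet_table: dict = {}
```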
In-memory structures • Certain structures are kept in-memory to support indexing and ranking • Keyword threshold – records statistics of recent popular queries • Candidate topic list – information about recent topics • Popular topic list – information about highly discussed topics
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Tweet Classification • Observation • Users are only interested in the top-K results for a query • Given a tweet t and a user query set Q: if ∃ qi ∈ Q such that t is a top-K result for qi based on the ranking function F, then t is a distinguished tweet • What is the maintenance cost for the query set Q?
Query Set • Observation • 20% of queries account for 80% of user requests (Zipf’s distribution) • Suppose the n-th most frequent query appears with probability p(n) ∝ 1/n (Zipf’s distribution) • Let s be the number of queries submitted per second; the expected time interval between occurrences of the n-th query is t(n) = 1/(s · p(n)) • Batch processing occurs every t' seconds • Keep the n-th query in Q only if t(n) < t' (see the sketch below)
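A small sketch of that admission rule, assuming the Zipf probability p(n) ∝ 1/n is normalized over N known queries (the exponent actually used in the paper is not shown on the slide):

```python
def keep_in_query_set(n, N, s, t_batch):
    """Keep the n-th most popular query in Q only if its expected
    inter-arrival time t(n) is shorter than the batch interval t'."""
    harmonic = sum(1.0 / i for i in range(1, N + 1))
    p_n = (1.0 / n) / harmonic        # Zipf probability of the n-th query
    t_n = 1.0 / (s * p_n)             # expected seconds between its occurrences
    return t_n < t_batch

# Example: 1,000 distinct queries, 50 queries/sec, batch indexing every 30 s
# keeps roughly the 200 most popular queries in Q.
Q = [n for n in range(1, 1001) if keep_in_query_set(n, 1000, 50.0, 30.0)]
```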
Naïve Classifier • Dominant set ds(qi, t) • The tweets that have higher ranks than t for query qi • Performance problems • A full scan of the tweet set is required to compute the dominant set • Each tweet must be tested against every query
Optimization 1 • Observation • The ranks of the lower results are stable • Replace the dominant-set computation with a comparison against the score of each query’s K-th result
Optimization 2 • Compare a tweet only to queries that share its keywords • Given tweet t = <k1, k4>, compare t only to the queries containing k1 or k4 (e.g., Q1, Q3, Q4)
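The two optimizations combined in a short sketch: each query keeps the score of its current K-th result as a threshold, and a new tweet is only scored against queries that share at least one of its keywords. The score(query, tweet) function stands in for the ranking function F and is assumed here:

```python
from collections import defaultdict

kth_score = {}                          # query -> score of its current K-th result
queries_by_keyword = defaultdict(set)   # keyword -> queries containing that keyword

def is_distinguished(tweet_keywords, tweet, score):
    # Optimization 2: only consider queries sharing a keyword with the tweet
    candidates = set()
    for kw in tweet_keywords:
        candidates |= queries_by_keyword[kw]
    # Optimization 1: threshold test against each candidate's K-th result
    for q in candidates:
        if score(q, tweet) >= kth_score.get(q, 0.0):
            return True                 # beats some query's K-th result: index now
    return False                        # noisy tweet: defer to batch indexing
```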
Real-Time Indexing • New tweets classified as distinguished are indexed immediately • If the tweet belongs to an existing tweet tree, retrieve its parent tweet to get the root ID and generate its tree encoding; update the reply count in the parent • The tweet is inserted into the tweet data table • The tweet is inserted into the inverted index • The main cost is updating the inverted index (one posting per keyword in the tweet)
Batch Indexing • New tweets classified as noisy are indexed later • Instead of updating the inverted index immediately, the tweet is appended to a log file • A batch indexing process periodically scans the log file and indexes the tweets found there (both paths are sketched below)
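Putting the two paths together in one compressed sketch. It reuses inverted_index, tweet_table, and is_distinguished from the earlier sketches; make_posting, noisy_log, and the tweet’s keywords attribute are illustrative assumptions:

```python
noisy_log = []   # stand-in for the append-only log file of deferred tweets

def index_tweet(tweet, score):
    if is_distinguished(tweet.keywords, tweet, score):
        # Real-time path: index the distinguished tweet immediately
        parent = tweet_table.get(tweet.rid)
        if parent is not None:
            tweet.tree = parent.tree           # inherit the root TID of the tweet tree
            parent.count += 1                  # update the reply count in the parent
        tweet_table[tweet.tid] = tweet         # tweet data table
        for kw in tweet.keywords:              # main cost: one posting per keyword
            inverted_index[kw].append(make_posting(tweet, kw))
    else:
        # Batch path: append the noisy tweet to the log for later indexing
        noisy_log.append(tweet)

def batch_index():
    # Periodically scan the log and index its tweets
    # (same bookkeeping as above, amortized over the whole batch)
    while noisy_log:
        tweet = noisy_log.pop(0)
        tweet_table[tweet.tid] = tweet
        for kw in tweet.keywords:
            inverted_index[kw].append(make_posting(tweet, kw))
```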
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Ranking Desiderata • “The ranking function must consider both the timestamp of the data and the similarity between the data and the query.” • “The ranking function is composed of two independent factors, time and similarity.” • “The ranking function should be cost-efficient.”
Ranking Overview • Ranking functions are completely separate from the indexing mechanism • New ranking functions could be used • TI’s proposed ranking function is based on: • User’s PageRank • Popularity of the topic • Timestamp (self-explanatory) • Similarity between tweet and the query
User’s PageRank • Twitter has two types of links between users • f(u): the set of users who follow user u • f⁻¹(u): the set of users whom user u follows • A matrix Mf[i][j] records the follow links between users • A weight factor is given for each user • V = (w1, w2, …, wn)
User’s PageRank Formula • The PageRank formula is given as Pu = V · Mf^x (the weight vector V multiplied by the follow matrix Mf, applied x times) • So a user’s PageRank combines their own weight with the weights of the users who follow them • The more popular the user, the higher the PageRank
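A hedged sketch of computing U-PageRank by repeatedly multiplying the weight vector V by the follow matrix Mf for x iterations. Row normalization and the damping factor are standard PageRank ingredients assumed here for convergence; they are not spelled out on the slide, and numpy is used only for illustration:

```python
import numpy as np

def user_pagerank(follow_matrix, V, x=20, d=0.85):
    """follow_matrix[i][j] > 0 if user i follows user j; V is the weight vector."""
    # Each user splits its weight among the users it follows
    row_sums = follow_matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    Mf = follow_matrix / row_sums
    p = V.copy()
    for _ in range(x):                       # Pu = V · Mf^x after x iterations
        p = d * (p @ Mf) + (1 - d) * V       # damping keeps the iteration stable
    return p

# Users 0 and 1 both follow user 2 (and user 2 follows user 0),
# so user 2 ends up with the highest U-PageRank.
follows = np.array([[0., 0., 1.],
                    [0., 0., 1.],
                    [1., 0., 0.]])
V = np.ones(3) / 3
print(user_pagerank(follows, V))
```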
Popularity of Topics • Users can retweet or reply to tweets • Popularity can be determined by looking at the largest tweet trees • The popularity of a tree is the sum of the U-PageRank values of all tweets in the tree
Similarity between query and tweet • The similarity of a query q and a tweet t is computed as cosine similarity: sim(q, t) = (q · t) / (|q| |t|)
Ranking Function • q.timestamp = query submission time • tree.timestamp = timestamp of the tree t belongs to (timestamp of the root node) • w1, w2, w3 are weight factors for the components (all set to 1)
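The slide lists the signals and weights but not the combined formula, so the following is only a plausible sketch of how the four signals and w1, w2, w3 could be combined, not the paper’s exact ranking function. The tweet and query attributes (author_pagerank, topic_popularity, tree_timestamp, the keyword-weight vectors) are assumed names:

```python
import math

def cosine_sim(q_vec, t_vec):
    # sim(q, t) = (q · t) / (|q| |t|), vectors given as keyword -> weight dicts
    dot = sum(w * t_vec.get(k, 0.0) for k, w in q_vec.items())
    nq = math.sqrt(sum(w * w for w in q_vec.values()))
    nt = math.sqrt(sum(w * w for w in t_vec.values()))
    return dot / (nq * nt) if nq and nt else 0.0

def rank(tweet, query, w1=1.0, w2=1.0, w3=1.0, half_life=3600.0):
    age = query.timestamp - tweet.tree_timestamp          # seconds since the thread started
    time_decay = math.exp(-age / half_life)               # newer threads score higher
    return (w1 * tweet.author_pagerank                    # user's U-PageRank
            + w2 * tweet.topic_popularity                 # popularity of the tweet tree
            + w3 * time_decay                             # time component
            ) * cosine_sim(query.vec, tweet.vec)          # similarity between query and tweet
```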
Adaptive Indexing • The size of the inverted index limits the performance of the search for tweets • The size of the inverted index grows with the number of tweets • To alleviate this problem, adaptive indexing is proposed:
Adaptive Indexing Cont. • The main idea: • Iteratively read a block of the inverted index (rather than the entire list) • Stop reading blocks once the timestamps in the next block can only yield scores too low to enter the top-K results • Stopping is safe because the remaining tweets in the inverted index are older and will score even lower
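A sketch of that early-termination loop, assuming postings are stored in blocks ordered from newest to oldest and that block_upper_bound(block) returns the best score any tweet in that block (or a later one) could still achieve given its newest timestamp:

```python
import heapq

def adaptive_search(blocks, score_fn, block_upper_bound, K):
    """blocks: posting-list blocks for the query keyword, newest first."""
    top_k = []                                        # min-heap of (score, tid)
    for block in blocks:
        # Early termination: older blocks can only produce even lower scores
        if len(top_k) == K and block_upper_bound(block) <= top_k[0][0]:
            break
        for posting in block:
            s = score_fn(posting)
            if len(top_k) < K:
                heapq.heappush(top_k, (s, posting.tid))
            elif s > top_k[0][0]:
                heapq.heapreplace(top_k, (s, posting.tid))
    return sorted(top_k, reverse=True)                # best-scoring tweets first
```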
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Evaluation • Evaluation performed on a real dataset • Dataset collected over 3 years (October 2006 to November 2009) • 500 random users picked as seeds (from which other users are integrated into the social graph) • 465,000 total users • 25,000,000 total tweets • Experiments typically span 10 days • 5 days of training, 5 days of measuring performance
Evaluation Cont. • Query lengths are distributed as follows: • ~60% are 1 word • ~30% are 2 words • ~10% are more than 2 words • Queries are submitted at random; tweets are inserted into the system based on their original timestamps (from the dataset)
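As a rough illustration only (the slide does not describe how queries were actually constructed), a toy workload generator matching that length distribution might look like this:

```python
import random

def sample_query(vocabulary):
    r = random.random()
    length = 1 if r < 0.6 else (2 if r < 0.9 else 3)   # "more than 2 words" simplified to 3
    return random.sample(vocabulary, length)

queries = [sample_query(["obama", "nba", "iphone", "oscars", "earthquake"]) for _ in range(5)]
```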
Performance of Query Processing • TimeBased represents using only tweet timestamp (like Google)
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Conclusion • Current search engines cannot efficiently index social networking data in real time • An adaptive indexing mechanism reduces update cost • A cost-efficient and effective ranking function • Successful evaluation using a real dataset from Twitter