TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Pete Bohman, Adam Kunk

Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

Real-Time Search • Requirements • Contents searchable immediately following creation • Scale to thousands of updates/sec (e.g., roughly 5,000 tweets/sec following the death of Osama bin Laden) • Results relevant to query via cost-efficient ranking • Tradeoff: • Scalability and Performance vs. Ranking

Real-Time Search • Applications • The ability to receive updates as they occur • Applicability • It may not be feasible to provide real-time search results in a system with thousands of new entries per second

TI: Tweet Index • TI is an indexing and ranking mechanism for real-time search in microblogging systems, such as Twitter. • For TI to return real-time results, only some tweets are indexed immediately (distinguished tweets); the rest, deemed less important (noisy tweets), are indexed periodically.

Partial Indexing • The Case for Partial Indexes • Stonebraker, 1989 • Index only a portion of a column • User-specified index predicates (where salary > 500) • Build index as a side-effect of query processing
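
The partial-index idea can be illustrated with a minimal Python sketch. It only shows the predicate-restricted index from the slide; the class name `PartialIndex`, its methods, and the sample rows are hypothetical and are not taken from Stonebraker's paper or from TI.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class PartialIndex:
    """Index on one column that covers only rows satisfying a predicate."""

    def __init__(self, column: str, predicate: Callable[[Row], bool]) -> None:
        self.column = column
        self.predicate = predicate
        # Maps a column value to the positions of the indexed rows.
        self.entries: Dict[Any, List[int]] = {}

    def add(self, position: int, row: Row) -> None:
        # Rows that fail the predicate are simply left out of the index.
        if self.predicate(row):
            self.entries.setdefault(row[self.column], []).append(position)

    def lookup(self, value: Any) -> List[int]:
        return self.entries.get(value, [])


# Hypothetical sample data, indexed with the slide's predicate (salary > 500).
rows = [
    {"name": "ann", "salary": 400},
    {"name": "bob", "salary": 900},
    {"name": "cy", "salary": 900},
]
salary_index = PartialIndex("salary", lambda r: r["salary"] > 500)
for pos, row in enumerate(rows):
    salary_index.add(pos, row)
```

A lookup for a value outside the predicate (here, 400) finds nothing in the index, so such a query would have to fall back to scanning the table.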

View Materialization • An application of materialized views is to use cost models to automatically select which views to materialize. • Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object. • The concept of only indexing essential tweets in real-time was borrowed from the idea of view materialization.

Microblog Search • Google and Twitter have both released real-time search engines. • Google’s engine adaptively crawls the microblog • Twitter’s engine relies on Apache’s Lucene (high-performance, full-featured text search engine library) • But, both the Google and Twitter engines only utilize time in their ranking algorithms. • TI’s ranking algorithm takes much more than just time into account.

TI Cost Reduction • TI clusters similar tweets together and offloads noisy tweets in order to reduce the computation cost of real-time search. • Tweets are grouped into topics according to their reply and retweet relationships. • Tweets replying to the same tweet or belonging to the same thread are organized as a tree. • TI also maintains popular topics in memory.

TI Architecture

User Graph • Twitter users have links to other friends • A User Graph is utilized to demonstrate this relationship • $G_u = (U, E)$ • $U$ is the set of users in the system • $E$ is the set of friend links between them

Tweet Tree Structure • Nodes represent tweets • Directed edges indicate replies or retweets • Implemented by assigning tweets a tree encoding ID

TI Design • Search is handled via an inverted index for tweets • Given a keyword, the inverted index returns a tweet list, T • T contains set of tweets sorted by timestamp

TI Inverted Index • TID = Tweet ID • U-PageRank= Used for ranking • TF = Term Frequency • tree = TID of root node of tweet tree • time = timestamp
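
A minimal Python sketch of an inverted index whose entries carry the fields listed above. The class and method names, whitespace tokenization, and the ascending timestamp order are assumptions made for illustration; the slides say only that the tweet list is sorted by timestamp.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(order=True)
class Posting:
    # Ordering uses (time, tid) only, so each list stays sorted by timestamp.
    time: int
    tid: int
    u_pagerank: float = field(compare=False)
    tf: int = field(compare=False)
    tree: int = field(compare=False)


class InvertedIndex:
    def __init__(self) -> None:
        self.lists: Dict[str, List[Posting]] = defaultdict(list)

    def add(self, tid: int, text: str, time: int,
            u_pagerank: float, tree: int) -> None:
        words = text.lower().split()  # assumed tokenizer: whitespace split
        for word in set(words):
            posting = Posting(time, tid, u_pagerank, words.count(word), tree)
            insort(self.lists[word], posting)  # keeps timestamp order

    def search(self, keyword: str) -> List[Posting]:
        return self.lists.get(keyword.lower(), [])


# Hypothetical tweets, inserted out of timestamp order on purpose.
tweet_index = InvertedIndex()
tweet_index.add(tid=2, text="big big news", time=20, u_pagerank=0.5, tree=2)
tweet_index.add(tid=1, text="news today", time=10, u_pagerank=0.3, tree=1)
news_tids = [p.tid for p in tweet_index.search("news")]
```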

Ranking Support • In order to help ranking, TI keeps a table of metadata for each tweet • TID = tweet ID • RID = ID of replied tweet (to find parent) • tree = TID of root node of tweet tree • time = timestamp • count = number of tweets replying to this tweet

In-memory structures • Certain structures are kept in-memory to support indexing and ranking • Keyword threshold – records statistics of recent popular queries • Candidate topic list – information about recent topics • Popular topic list – information about highly discussed topics

TI Indexing Overview • TI categorizes tweets as either distinguished or noisy • Distinguished: real-time indexing scheme • Noisy: background batch indexing scheme • As a new tweet arrives, its content is analyzed in order to categorize the tweet as one of the above two types.

TI Inverted Index

Real-Time Indexing • New tweets categorized as being distinguished (index these immediately) • If tweet belongs to existing tweet tree, retrieve its parent tweet to get root ID and generate encoding. Update count number in parent. • Tweet is inserted into tweet data table. • Tweet is inserted into inverted index. • Main cost is updating the inverted index (due to each keyword in the tweet).
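
The steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the tweet data table and inverted index are plain dictionaries, the function name `index_distinguished` is hypothetical, and the tree encoding ID is reduced to storing the root TID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TweetRow:
    tid: int
    rid: Optional[int]  # ID of the replied-to tweet, None for a root tweet
    tree: int           # TID of the root node of the tweet tree
    time: int
    count: int = 0      # number of tweets replying to this tweet


tweet_table: Dict[int, TweetRow] = {}
inverted_index: Dict[str, List[int]] = {}


def index_distinguished(tid: int, text: str, time: int,
                        rid: Optional[int] = None) -> TweetRow:
    if rid is not None and rid in tweet_table:
        # Existing tree: inherit the root ID and bump the parent's count.
        parent = tweet_table[rid]
        tree = parent.tree
        parent.count += 1
    else:
        # No known parent: the tweet starts its own tree.
        tree = tid
    row = TweetRow(tid, rid, tree, time)
    tweet_table[tid] = row
    # One index update per distinct keyword; this loop is the main cost.
    for word in set(text.lower().split()):
        inverted_index.setdefault(word, []).append(tid)
    return row


# Hypothetical tweets: a root tweet and one reply to it.
root = index_distinguished(1, "earthquake in tokyo", time=100)
reply = index_distinguished(2, "stay safe tokyo", time=105, rid=1)
```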

Batch Indexing • New tweets categorized as being noisy (index these at a later time) • Instead of indexing in inverted index, append tweet to log file. • Batch indexing process periodically scans the log file and indexes the tweets there.
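
A minimal Python sketch of the log-then-batch pattern described above. The JSON-lines log format, the function names, and truncating the log after each pass are assumptions for illustration; the slides do not specify them.

```python
import json
import os
import tempfile
from typing import Dict, List


def append_to_log(log_path: str, tid: int, text: str) -> None:
    # Noisy tweets cost one sequential append instead of index updates.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"tid": tid, "text": text}) + "\n")


def run_batch(log_path: str, inverted_index: Dict[str, List[int]]) -> int:
    """Index every logged tweet, then empty the log. Returns the count."""
    if not os.path.exists(log_path):
        return 0
    with open(log_path, encoding="utf-8") as log:
        records = [json.loads(line) for line in log if line.strip()]
    for record in records:
        for word in set(record["text"].lower().split()):
            inverted_index.setdefault(word, []).append(record["tid"])
    open(log_path, "w", encoding="utf-8").close()  # truncate processed log
    return len(records)


# Hypothetical noisy tweets written to a temporary log file.
log_path = os.path.join(tempfile.mkdtemp(), "noisy_tweets.log")
append_to_log(log_path, 7, "lunch was good")
append_to_log(log_path, 8, "good morning")
batch_index: Dict[str, List[int]] = {}
indexed = run_batch(log_path, batch_index)
```

In a running system `run_batch` would be called on a timer; a second call right after the first finds an empty log and indexes nothing.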

Ranking Desiderata • “The ranking function must consider both the timestamp of the data and the similarity between the data and the query.” • “The ranking function is composed of two independent factors, time and similarity.” • “The ranking function should be cost-efficient.”

Ranking Overview • Ranking functions are completely separate from the indexing mechanism • New ranking functions could be used • TI’s proposed ranking function is based on: • User’s PageRank • Popularity of the topic • Timestamp (self-explanatory) • Similarity between tweet and the query

User’s PageRank • Twitter has two types of links between users • $f(u)$: the set of users who follow user $u$ • $f^{-1}(u)$: the set of users whom user $u$ follows • A matrix, $M_f[i][j]$, is used to record the following links between users • A weight factor is given for each user • $V = (w_1, w_2, \ldots, w_n)$

User’s PageRank Formula • PageRank formula is given as: $P_u = V M_f^{x}$ • So, the user’s PageRank is a combination of their user weight and how many followers they have • The more popular the user, the higher the PageRank
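
The formula above can be sketched in Python as a repeated vector-matrix product. Several points are assumptions not stated on the slides: x is read as the number of multiplications, each row of the matrix is normalized so that a user spreads weight evenly over the users they follow, and no damping factor is applied. The function names are hypothetical.

```python
from typing import List


def follow_matrix(follows: List[List[int]]) -> List[List[float]]:
    """follows[i] lists the users that user i follows.

    Entry m[i][j] is the share of user i's weight passed to user j
    (assumed: split evenly among the users that i follows).
    """
    n = len(follows)
    m = [[0.0] * n for _ in range(n)]
    for i, followees in enumerate(follows):
        for j in followees:
            m[i][j] = 1.0 / len(followees)
    return m


def user_pagerank(v: List[float], m: List[List[float]], x: int) -> List[float]:
    """Computes the row vector v multiplied by m, x times in a row."""
    p = list(v)
    n = len(p)
    for _ in range(x):
        p = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
    return p


# Hypothetical graph: users 0 and 1 follow user 2; user 2 follows user 0.
m_f = follow_matrix([[2], [2], [0]])
ranks = user_pagerank([1 / 3, 1 / 3, 1 / 3], m_f, x=1)
```

After one step, user 2 (two followers) holds more weight than user 0 (one follower), and user 1 (no followers) holds none, which matches the slide's point that more popular users get a higher PageRank.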

Popularity of Topics • Users can retweet or reply to tweets. • Popularity can be determined by looking at the largest tweet trees. • Popularity of tree is equal to the sum of the U-PageRank values of all tweets in the tree
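
The popularity rule above amounts to a grouped sum, sketched here in Python. The tuple layout `(tid, tree, u_pagerank)` and the function name are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tree_popularity(
    tweets: Iterable[Tuple[int, int, float]]
) -> Dict[int, float]:
    """Sums U-PageRank per tweet tree.

    Each tweet is (tid, tree, u_pagerank), where tree is the root TID.
    """
    popularity: Dict[int, float] = defaultdict(float)
    for _tid, tree, u_pagerank in tweets:
        popularity[tree] += u_pagerank
    return dict(popularity)


# Hypothetical tweets: tree 1 has two tweets, tree 3 has one.
popularity = tree_popularity([(1, 1, 0.5), (2, 1, 0.25), (3, 3, 0.125)])
most_popular = max(popularity, key=popularity.get)
```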

Similarity between query and tweet • The similarity of a query $q$ and a tweet $t$ can be computed as follows: $\mathrm{sim}(q, t) = \dfrac{q \cdot t}{|q|\,|t|}$
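
The similarity formula above is cosine similarity. A minimal Python sketch follows, assuming the query and tweet are represented as raw term-frequency vectors over whitespace-split words; the slides do not say how the vectors are built.

```python
import math
from collections import Counter


def cosine_similarity(query: str, tweet: str) -> float:
    # Assumed representation: term-frequency vectors of lowercase words.
    q = Counter(query.lower().split())
    t = Counter(tweet.lower().split())
    dot = sum(q[word] * t[word] for word in q)  # missing words count as 0
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_t = math.sqrt(sum(c * c for c in t.values()))
    if norm_q == 0 or norm_t == 0:
        return 0.0  # an empty query or tweet matches nothing
    return dot / (norm_q * norm_t)


# Hypothetical query and tweet sharing two of three words.
score = cosine_similarity("real time", "real time search")
```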

Ranking Function • q.timestamp = query submittal time • tree.timestamp = timestamp of the tree that tweet t belongs to (timestamp of its root node) • w1, w2, w3 are weight factors for each component (all set to 1)

Evaluation

Conclusion
