TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets • SIGMOD '11 • C. Chen et al. • Pete Bohman, Adam Kunk
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Real-Time Search • Requirements • Contents searchable immediately after creation • Scale to thousands of updates/sec (e.g., ~5,000 tweets/sec during the news of Osama bin Laden's death) • Results relevant to the query via cost-efficient ranking • Tradeoff: • Scalability and performance vs. ranking quality
Real-Time Search • (Figure: the same query's results ranked by time vs. by TI rank)
Real-Time Search • Real-Time Search = Indexing + Ranking • TI Index • Scalable indexing scheme based on partial indexing • Only index tweets likely to appear in query results • TI Rank • User's PageRank • Popularity of the topic • Tweet-to-query similarity
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Partial Indexing • The Case for Partial Indexes • Stonebraker, 1989 • Index only a portion of a column's values • User-specified index predicates (e.g., WHERE salary > 500) • Build the index as a side effect of query processing
View Materialization • A materialized view can be thought of as a snapshot of a database, in which the results of a query are stored as an object • One application uses cost models to automatically select which views to materialize • TI borrows this idea of materializing only what is worth the cost: only essential tweets are indexed in real time
Microblog Search • Google and Twitter have both released real-time search engines • Google's engine adaptively crawls microblogs • Twitter's engine relies on Apache Lucene (a high-performance, full-featured text search engine library) • But both engines rank results by time alone • TI's ranking algorithm takes much more than time into account
TI Cost Reduction • TI clusters similar tweets together and offloads noisy tweets to reduce the computation cost of real-time search • Tweets are grouped into topics according to their reply relationships, organized in a tree structure • Tweets replying to the same tweet, or belonging to the same thread, are organized as a tree • TI also maintains popular topics in memory
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
User Graph • Twitter users have links to other users (friends) • A user graph captures this relationship • Gu = (U, E) • U is the set of users in the system • E is the set of friend links between them
Tweet Tree Structure • Nodes represent tweets • Directed edges indicate replies or retweets • Implemented by assigning tweets a tree encoding ID
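The encoding is not spelled out on this slide; a minimal sketch, assuming a Dewey-style path encoding in which a reply's ID extends its parent's, so the tree's root is recoverable from any tweet's encoding alone:

```python
# Hypothetical Dewey-style tree encoding (the slide gives no details):
# a reply's encoding extends its parent's, so the root and the full
# reply path are both recoverable from the encoding itself.

class TweetTree:
    def __init__(self, root_tid):
        self.encoding = {root_tid: "1"}    # root tweet gets encoding "1"
        self.child_count = {root_tid: 0}   # direct replies seen per tweet

    def add_reply(self, tid, parent_tid):
        """Assign an encoding to a reply (or retweet) of parent_tid."""
        self.child_count[parent_tid] += 1
        self.encoding[tid] = (
            f"{self.encoding[parent_tid]}.{self.child_count[parent_tid]}"
        )
        self.child_count[tid] = 0
        return self.encoding[tid]
```

Replies to the root get "1.1", "1.2", …; a reply to "1.1" gets "1.1.1", and so on.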
TI Design • Search is handled via an inverted index over tweets • Given a keyword, the inverted index returns a tweet list T • T contains the set of matching tweets, sorted by timestamp
TI Inverted Index • TID = tweet ID • U-PageRank = the user's PageRank, used for ranking • TF = term frequency • tree = TID of the root node of the tweet's tree • time = timestamp
Ranking Support • In order to help ranking, TI keeps a table of metadata for each tweet • TID = tweet ID • RID = ID of replied tweet (to find parent) • tree = TID of root node of tweet tree • time = timestamp • count = number of tweets replying to this tweet
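The two per-tweet records above can be sketched as plain structures; the field names follow the slides (TID, U-PageRank, TF, tree, time; RID, count), while the types and the small indexing helper are illustrative assumptions:

```python
from dataclasses import dataclass

# Field names follow the slides; types and index_tweet() are assumptions.

@dataclass
class Posting:             # one entry in a keyword's inverted list
    tid: int
    u_pagerank: float
    tf: int                # term frequency of the keyword in the tweet
    tree: int              # TID of the root node of the tweet's tree
    time: float            # timestamp

@dataclass
class TweetMeta:           # one row of the tweet metadata table
    tid: int
    rid: int               # TID of the replied tweet (parent), -1 if none
    tree: int              # TID of the root node of the tweet's tree
    time: float            # timestamp
    count: int = 0         # number of tweets replying to this tweet

inverted_index = {}        # keyword -> list of Postings, in arrival order

def index_tweet(meta, terms, pr):
    """Append one posting per keyword of the tweet (terms: keyword -> TF)."""
    for term, tf in terms.items():
        inverted_index.setdefault(term, []).append(
            Posting(meta.tid, pr, tf, meta.tree, meta.time))
```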
In-memory structures • Certain structures are kept in-memory to support indexing and ranking • Keyword threshold – records statistics of recent popular queries • Candidate topic list – information about recent topics • Popular topic list – information about highly discussed topics
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Tweet Classification • Observation: users are only interested in the top-K results for a query • Given a tweet t and a user query set Q: if ∃ qi ∈ Q such that t is a top-K result for qi under the ranking function F, then t is a distinguished tweet • What is the maintenance cost for the query set Q?
Query Set • Observation • 20% of queries account for 80% of user requests (Zipf's distribution) • Suppose the n-th most frequent query appears with probability p(n) ∝ 1/n (Zipf's distribution) • Let s be the number of queries submitted per second; the expected time interval between arrivals of the n-th query is t(n) = 1 / (s · p(n)) • Batch processing occurs every t′ seconds, so the n-th query is kept in Q only if t(n) < t′
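Under two assumptions not stated on the slide (a Zipf exponent of 1, normalized over the N most frequent queries, i.e. p(n) = (1/n)/H_N with H_N the N-th harmonic number), the retention rule t(n) < t′ can be sketched as:

```python
# Sketch of the retention rule above. Assumptions (not on the slide):
# Zipf exponent 1, normalized over the N most frequent queries.

def queries_to_keep(N, s, t_batch):
    """Count how many of the N most frequent queries satisfy t(n) < t'."""
    H_N = sum(1.0 / n for n in range(1, N + 1))  # Zipf normalizer
    kept = 0
    for n in range(1, N + 1):
        p_n = (1.0 / n) / H_N    # probability of the n-th query
        t_n = 1.0 / (s * p_n)    # expected seconds between its arrivals
        if t_n < t_batch:
            kept += 1
        else:
            break                # t(n) grows with n: no later query qualifies
    return kept
```

For example, with 1,000 candidate queries, s = 100 queries/sec, and a batch every second, only the first dozen or so queries arrive often enough to keep.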
Naïve Classifier • Dominant set ds(qi, t) • The tweets that rank higher than t for query qi • Performance problems • A full scan of the tweet set is required to compute the dominant set • Each tweet must be tested against every query
Optimization 1 • Observation • The ranks of the lower results are stable • Replace the dominant-set computation with a comparison against the score of Q's K-th result
Optimization 2 • Compare a tweet only to queries that share its keywords • Given tweet t = <k1, k4>, compare t to Q1, Q3, Q4
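A sketch combining both optimizations: each query keeps only the score of its current K-th result (Optimization 1), and a new tweet is checked only against queries that share one of its keywords (Optimization 2). The query set, thresholds, and score function here are illustrative stand-ins, arranged to match the slide's example:

```python
from collections import defaultdict

# Illustrative data matching the slide's example: tweet t = <k1, k4>
# is compared only against Q1, Q3, Q4. Thresholds are made up.

kth_score = {"Q1": 0.40, "Q2": 0.90, "Q3": 0.10, "Q4": 0.55}  # each query's K-th result score

queries_by_keyword = defaultdict(set)
for q, kws in {"Q1": {"k1"}, "Q2": {"k2"},
               "Q3": {"k4"}, "Q4": {"k1", "k4"}}.items():
    for kw in kws:
        queries_by_keyword[kw].add(q)

def is_distinguished(tweet_keywords, score):
    """score(q) stands in for the tweet's rank score under F for query q."""
    candidates = set().union(*(queries_by_keyword[k] for k in tweet_keywords))
    return any(score(q) > kth_score[q] for q in candidates)
```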
Real-Time Indexing • New tweets categorized as distinguished are indexed immediately • If the tweet belongs to an existing tweet tree, retrieve its parent tweet to get the root ID and generate its tree encoding; update the reply count in the parent • The tweet is inserted into the tweet data table • The tweet is inserted into the inverted index • The main cost is updating the inverted index (one posting per keyword in the tweet)
Batch Indexing • New tweets categorized as noisy are indexed later • Instead of updating the inverted index, append the tweet to a log file • A batch-indexing process periodically scans the log file and indexes the tweets it contains
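The two indexing paths can be sketched together; classify() stands in for the distinguished/noisy classifier, and the log is kept as an in-memory list here rather than a file:

```python
# Distinguished tweets are indexed immediately; noisy tweets go to a
# log that a periodic batch pass drains.

inverted_index = {}   # keyword -> list of tweet IDs
noisy_log = []        # deferred tweets awaiting batch indexing

def index_now(tweet):
    for term in tweet["terms"]:
        inverted_index.setdefault(term, []).append(tweet["tid"])

def on_new_tweet(tweet, classify):
    if classify(tweet):          # distinguished: pay the index cost now
        index_now(tweet)
    else:                        # noisy: appending to the log is cheap
        noisy_log.append(tweet)

def batch_index():
    """Periodically drain the log and index the deferred tweets."""
    while noisy_log:
        index_now(noisy_log.pop(0))
```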
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Ranking Desiderata • “The ranking function must consider both the timestamp of the data and the similarity between the data and the query.” • “The ranking function is composed of two independent factors, time and similarity.” • “The ranking function should be cost-efficient.”
Ranking Overview • Ranking functions are completely separate from the indexing mechanism • New ranking functions can be plugged in • TI's proposed ranking function is based on: • The user's PageRank • The popularity of the topic • The timestamp • The similarity between the tweet and the query
User's PageRank • Twitter has two types of links between users • f(u): the set of users who follow user u • f⁻¹(u): the set of users whom user u follows • A matrix Mf[i][j] records the following links between users • Each user is given a weight factor • V = (w1, w2, …, wn)
User's PageRank Formula • The PageRank formula is given as: Pu = V · Mf^x • So a user's PageRank combines the user weights with the follow structure • The more popular the user, the higher the PageRank
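Reading Pu = V · Mf^x as x successive vector-matrix products (how Mf is normalized is not stated on the slide; here Mf[i][j] is simply the weight user i passes along its link to user j), a minimal sketch:

```python
# P = V . Mf^x computed by repeated vector-matrix products.
# Normalization of Mf is an assumption left to the caller.

def user_pagerank(V, Mf, x):
    """Return P = V . Mf^x for weight vector V and link matrix Mf."""
    n = len(V)
    P = list(V)
    for _ in range(x):
        P = [sum(P[i] * Mf[i][j] for i in range(n)) for j in range(n)]
    return P
```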
Popularity of Topics • Users can retweet or reply to tweets • Popularity can be determined by looking at the largest tweet trees • The popularity of a tree equals the sum of the U-PageRank values of all tweets in the tree
Similarity between query and tweet • The similarity of a query q and a tweet t is computed as the cosine of their term vectors: sim(q, t) = (q · t) / (|q| |t|)
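A direct transcription of the cosine formula, with q and t as sparse term-to-weight dictionaries (e.g., TF values):

```python
import math

# sim(q, t) = (q . t) / (|q| |t|) over sparse term-weight dictionaries.

def cosine_sim(q, t):
    dot = sum(w * t.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_t = math.sqrt(sum(w * w for w in t.values()))
    return dot / (norm_q * norm_t) if norm_q and norm_t else 0.0
```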
Ranking Function • q.timestamp = query submission time • tree.timestamp = timestamp of the tree t belongs to (the timestamp of its root node) • w1, w2, w3 are weight factors for each component (all set to 1)
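The formula itself does not survive in this text, so the following is a purely illustrative combination of the listed components, not the paper's actual function; the weighted sum and the exponential decay by tree age are assumptions:

```python
import math

# Illustrative only -- NOT TI's actual ranking formula, which is not
# reproduced on the slide. Decaying by tree age is an assumption.

def rank(u_pagerank, popularity, sim, q_time, tree_time, w=(1.0, 1.0, 1.0)):
    w1, w2, w3 = w
    age = max(q_time - tree_time, 0.0)   # q.timestamp - tree.timestamp
    return (w1 * u_pagerank + w2 * popularity + w3 * sim) * math.exp(-age)
```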
Adaptive Indexing • The size of the inverted index limits the performance of the search for tweets • The size of the inverted index grows with the number of tweets • To alleviate this problem, adaptive indexing is proposed:
Adaptive Indexing Cont. • The main idea: • Iteratively read one block of the inverted index at a time (rather than the entire list) • Stop iterating when a block's timestamp bounds its scores too low to enter the results • Stopping is safe because all remaining tweets in the inverted index have even lower scores
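The early-stopping scan can be sketched as follows; blocks are assumed newest-first, and upper_bound() stands in for the score bound derived from a block's newest timestamp:

```python
import heapq

# Block-at-a-time scan with early stopping: once a full top-K is held
# and a block's bound cannot beat the current K-th score, the scan ends.
# upper_bound() is an assumed stand-in for the time-derived bound.

def adaptive_scan(blocks, score, upper_bound, K):
    top = []                        # min-heap of the best K scores so far
    for block in blocks:
        if len(top) == K and upper_bound(block) <= top[0]:
            break                   # no remaining tweet can enter the top-K
        for posting in block:
            s = score(posting)
            if len(top) < K:
                heapq.heappush(top, s)
            elif s > top[0]:
                heapq.heapreplace(top, s)
    return sorted(top, reverse=True)
```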
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion
Evaluation • Evaluation performed on a real dataset • Dataset collected over 3 years (October 2006 to November 2009) • 500 random users picked as seeds (from which other users are integrated into the social graph) • 465,000 total users • 25,000,000 total tweets • Experiments typically 10 days long • 5 days of training, 5 days of measuring performance
Evaluation Cont. • Query lengths are distributed as follows: • ~60% are 1 word • ~30% are 2 words • ~10% are more than 2 words • Queries are submitted at random; tweets are inserted into the system according to their original timestamps (from the dataset)
Performance of Query Processing • TimeBased represents ranking by tweet timestamp only (as Google does)
Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion