Clustering and Exploring Search Results Using Timeline Constructions • Authors: Omar Alonso (University of California), Michael Gertz (University of Heidelberg, Germany), Ricardo Baeza-Yates (Yahoo! Research, Spain) • CIKM '09, pages 97-106 • Citations: 8 • Presenter: Zhong-Yong
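The paper first builds a temporal document profile for every document (covering three kinds of time: explicit, implicit, and relative). It then extracts and sorts the times of all documents and uses the sorted result to decide which timeline (year, month, …) should serve as the cluster labels. • If year is used as the cluster label, each year is a cluster, and a single document can belong to several clusters. • Documents within a cluster are ranked by their similarity to the query. • When the user opens a cluster, its contents can be organized further by month; this is the exploratory search approach. • The paper is evaluated by user satisfaction and by precision.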
Outline • 1. Introduction • 2. Related work • 3. Exploration scenarios • 4. Time annotated document model • 5. Timeline construction and document exploration • 6. Prototype • 7. Evaluation and results • 8. Conclusions and future work
Introduction (1/5) • Time plays a central role in any information space, and it has been studied in other areas such as information extraction, question answering, and summarization. • A look at any of the current search engines shows that temporal aspects of documents are used exclusively to sort the hit list by date, which is primarily the date a Web page was created or last modified.
Introduction (2/5) • Hit list clustering has emerged as an alternative mechanism to present similar documents without requiring the user to go through hundreds of items. • For example, if one would like to know the earliest or most recent paper(s) on a topic, or even the period of time when the topic was “popular”, organizing relevant documents along some kind of timeline would be very helpful.
Introduction (3/5) • Similar scenarios can be envisioned for exploring a news repository. • For example, how would one search for news about acquisitions a company has made before a particular date or in a particular time period?
Introduction (4/5) • It is important to identify diverse types of temporal information associated with documents. • A temporal expression can be explicit, such as “May 20, 2007” or “December 5th, 2005”; implicit, such as “New Year's Eve 2006” or “Labor Day 2001”; or relative to a point of narration, such as “yesterday” or “in two weeks”.
Introduction (5/5) • In this paper: • 1. Utilize temporal information extracted from the documents. • 2. Make this information explicit in the form of temporal document profiles (tdp). • 3. Arrange documents in the form of clusters along a timeline supporting multiple time granularities.
Related work (1/8) • There is some research on using time in different search applications, but little work has been done on exploiting temporal information associated with documents for clustering and exploring search results.
Related work (2/8) • New research has emerged on future retrieval, where temporal information can be used for searching the future. • The idea is to use news information to obtain possible future events and then search for events related to current (or future) information needs. • R. Baeza-Yates. Searching the Future. In SIGIR Workshop MF/IR, 2005. (Citations: 11)
Related work (3/8) • Google has added the view:timeline feature to display search results along a timeline, allowing a limited exploration of a hit list.
Related work (4/8) • Another technique related to the approach in this paper is hit list clustering. • Hit list clustering groups search results into categories that are derived from the actual search. • O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of 21st International ACM SIGIR Conference, ACM, 46–54, 1998. (C: 811)
Related work (5/8) • Current hit list clustering engines like Vivisimo rely on a separate search engine that provides some information like Web page title, URL, and document snippets for the construction of the clusters.
Related work (7/8) • Crowdsourcing has emerged as a viable alternative for conducting large-scale evaluation of different types of experiments for a wide range of applications, such as relevance evaluation and user studies. • O. Alonso, D. E. Rose, and B. Stewart. Crowdsourcing for Relevance Evaluation. SIGIR Forum (42):2, 12–18, 2008. (C: 33) • A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing User Studies with Mechanical Turk. In Proc. 26th SIGCHI Conference on Human Factors in Computing Systems, 453–456, 2008. (C: 107)
Related work (8/8) • In contrast to most of the approaches mentioned above, the authors establish a solid foundation to combine the aspects of: • 1. Extracting various types of temporal information from documents. • 2. Clustering and organizing documents based on temporal data. • 3. Visualizing such information in an exploratory search interface that helps users to study.
Exploratory Scenarios (1/6) • The research starts with a series of user surveys about timelines. • The first user study surveyed 30 people (graduate students and faculty) about temporal information. • The same survey was then conducted on AMT (Amazon Mechanical Turk), using a crowdsourcing paradigm. • 50 people responded to the survey.
Exploratory Scenarios (2/6) • 1. Do you think current timelines for organizing or clustering search results (such as in Google’s timeline) are useful for some of your daily search activities? • 76% answered “yes” for question 1.
Exploratory Scenarios (3/6) • 2. Do you use (or would use) timelines to explore search results? • 71% answered “yes” for question 2.
Exploratory Scenarios (4/6) • 3. Please indicate some search scenarios where you use timelines or would like to use timelines to organize search results. • Three main categories
Exploratory Scenarios (5/6) • 4. Please give some examples of search scenarios where current search engines do not sufficiently support the concept of timelines to organize and explore search results?
Exploratory Scenarios (6/6) • 5. What other features would you like to see in the context of timelines?
The surveys identify presentation and exploration as the main categories where users see value in using timelines for search.
Time annotated document model 4.1 Time and Timelines (1/3) • As the basis for anchoring documents in time, the authors assume a discrete representation of time based on the Gregorian calendar, with a single day being an atomic time interval called a chronon. • The base timeline, denoted Td, is an interval of consecutive day chronons. • For example, the sequence “March 12, 2002; March 13, 2002; March 14, 2002”.
Time annotated document model 4.1 Time and Timelines (2/3) • Contiguous sequences of chronons can be grouped into larger units called granules, such as weeks, months, years, or decades. • An example of a week chronon in Tw is “3rd week of 2005”.
Time annotated document model 4.1 Time and Timelines (3/3) • Assume the four timelines T = {Td, Tw, Tm, Ty}. • The relationship Tj >> Ti holds if timeline Tj is composed of granules of timeline Ti. • There are Ty >> Td, Ty >> Tm, and Tm >> Td, but not Tm >> Tw, as months are composed of days and not weeks. • For two chronons ti, tj ∈ T with ti ≠ tj, either ti <_T tj or tj <_T ti holds. • For example, for the two day chronons ti = “March 12, 2004” and tj = “January 5, 2004”, tj <_Td ti holds.
Time annotated document model 4.2 Temporal Expressions (1/5) • The first type of such information is the document metadata, which appears as the date a document d ∈ D was created or last modified. • It is denoted as the document timestamp d.ts.
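The two relationships on this slide can be made concrete with a small sketch. It is only an illustration under stated assumptions: the table `COMPOSED_OF` and the helpers `coarser_than` and `precedes` are hypothetical names, the table lists only the pairs named on the slide, and day chronons are modeled with Python's `datetime.date`.

```python
from datetime import date

# Hypothetical encoding of the ">>" relation, restricted to the pairs
# stated on the slide: Ty >> Td, Ty >> Tm, Tm >> Td.
COMPOSED_OF = {
    "y": {"m", "d"},
    "m": {"d"},
    "w": set(),  # the slide only states that Tm >> Tw does NOT hold
    "d": set(),
}


def coarser_than(tj: str, ti: str) -> bool:
    """Return True if Tj >> Ti according to the table above."""
    return ti in COMPOSED_OF[tj]


def precedes(ti: date, tj: date) -> bool:
    """Return True if ti <_Td tj for two day chronons."""
    return ti < tj


t_i = date(2004, 3, 12)  # "March 12, 2004"
t_j = date(2004, 1, 5)   # "January 5, 2004"

print(precedes(t_j, t_i))       # tj <_Td ti
print(coarser_than("y", "m"))   # Ty >> Tm
print(coarser_than("m", "w"))   # Tm >> Tw does not hold
```

Because two distinct day chronons are always comparable, exactly one of `precedes(t_i, t_j)` and `precedes(t_j, t_i)` is true.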
Time annotated document model 4.2 Temporal Expressions (2/5) • The second type of temporal information is a little bit more involved as it relates to the linguistic analysis of the textual content of documents.
Time annotated document model 4.2 Temporal Expressions (3/5) • Explicit temporal expressions: describe chronons in some timeline, such as an exact date or year. • For example, “December 2004” is an explicit expression that is anchored in the timeline Tm. • F. Schilder and C. Habel. (2001) From Temporal Expressions to Temporal Information: Semantic Tagging of News Messages. In ACL’01 Workshop on Temporal and Spatial Information Processing, 1–8. (C: 107)
Time annotated document model 4.2 Temporal Expressions (4/5) • Implicit temporal expressions: such as names of holidays or events. • For example, the token sequence “Columbus Day 2006” in the text of a document can be mapped to the expression “October 12, 2006”. • In general, implicit temporal expressions require that at least a year chronon appears in the context of a named event.
Time annotated document model 4.2 Temporal Expressions (5/5) • Relative temporal expressions : represent temporal entities that can only be anchored in a timeline in reference to another explicit or implicit, already anchored temporal expression. • For example, the expression “today” alone cannot be anchored in any timeline. However, it can be anchored if the document is known to have a creation date as a reference.
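Anchoring a relative expression against a reference date might be sketched as follows. This is a minimal illustration, not the authors' extractor: the lookup table `RELATIVE_OFFSETS` and the function `anchor_relative` are hypothetical, and the reference date is assumed to be the document's creation date.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookup table; a real system would rely on a temporal tagger.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=-1),
    "tomorrow": timedelta(days=1),
    "in two weeks": timedelta(weeks=2),
}


def anchor_relative(expression: str, reference: Optional[date]) -> Optional[date]:
    """Map a relative expression to a day chronon, given a reference date.

    Returns None when there is no reference date or the expression is
    unknown, mirroring the point that "today" alone cannot be anchored.
    """
    if reference is None:
        return None
    offset = RELATIVE_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    return reference + offset


created = date(2007, 5, 20)  # assumed document creation date
print(anchor_relative("yesterday", created))
print(anchor_relative("in two weeks", created))
print(anchor_relative("today", None))
```

Returning `None` instead of guessing keeps unanchored expressions out of the timeline.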
Time annotated document model 4.3 Temporal document profiles (1/4) • The process of entity extraction is a function denoted tdp (temporal document profile). • : denotes the set of explicit, implicit, and relative temporal expressions. • C: the set of chronons from timelines in T = {Td, Tw, Tm, Ty}. • P: the set of positions of temporal expressions in a document.
Time annotated document model 4.3 Temporal document profiles (2/4) Describe the explicit temporal expressions that have been determined in d with their normalized chronons and positions in d.
Time annotated document model 4.3 Temporal document profiles (2/4) Describe implicit temporal expressions.
Time annotated document model 4.3 Temporal document profiles (2/4) Corresponds to the timestamp d.ts of the document d. It is assumed that every document has such a timestamp. For example, if it is known that the document creation times are exact, then the document timestamp should be considered an explicit temporal expression.
Time annotated document model 4.3 Temporal document profiles (2/4) Describe relative temporal expressions.
Time annotated document model 4.3 Temporal document profiles (3/4) • There are some important properties of a temporal document profile that need to be recognized. • 1. All chronons ci, i = 1 . . . l, are normalized. That is, all chronons that are elements of the same timeline in T have the same format. • For example, all day chronons that have been associated with temporal expressions are represented in the day/month/year format, such as “15/04/1966”.
Time annotated document model 4.3 Temporal document profiles (4/4) • 2. A chronon c can be associated with many explicit, implicit, and relative temporal expressions. • In fact, the same chronon can even occur several times in a single profile tdp(d), but then at different positions in the document d.
A brief summary • 4.1 Defines time, timelines, and the relationships between them. • 4.2 Defines the types of temporal expressions: • Explicit, implicit, and relative • 4.3 Defines the form of the tdp.
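One way to picture a temporal document profile is as a list of (expression, chronon, position) entries. The sketch below is an assumption for illustration, not the paper's definition: the class `TemporalEntry`, its field names, the `kind` labels, and the position values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TemporalEntry:
    """One entry of a hypothetical temporal document profile tdp(d)."""
    expression: str  # surface text found in the document
    kind: str        # "explicit", "implicit", "relative", or "timestamp"
    chronon: str     # normalized chronon, e.g. day/month/year
    position: int    # position in d (the unit is an assumption)


def chronons_of(profile: List[TemporalEntry]) -> List[str]:
    """Return the chronons of a profile, keeping duplicates."""
    return [entry.chronon for entry in profile]


# Illustrative profile: the same chronon occurs twice at different positions.
tdp_d = [
    TemporalEntry("April 15, 1966", "explicit", "15/04/1966", 12),
    TemporalEntry("15 April 1966", "explicit", "15/04/1966", 87),
    TemporalEntry("Columbus Day 2006", "implicit", "12/10/2006", 140),
]

print(chronons_of(tdp_d))
```

Keeping duplicates matters because later steps count how many chronons of a document fall into each granule.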
Timeline construction and document exploration • Assume that for a query term q against a document collection D, the retrieval algorithm determines a hit list Lq = [d1, d2, . . . , dk] of k documents. • Given such a hit list, the temporal document profiles are used to construct a time outline for the documents first. • The documents are then clustered along this timeline based on their document profiles.
Timeline construction and document exploration 5.1 Constructing a time outline (1/2) • The first step in organizing documents along a multiple-granularity timeline is to construct a time outline for the documents in the hit list Lq. • For this, all chronons are extracted from the temporal document profiles of the documents in Lq. • Denote this multi-set of chronons ch(Lq), defined as follows: • Note that the elements in ch(Lq) may come from different timelines.
Timeline construction and document exploration 5.1 Constructing a time outline (2/2) • The time outline is chosen from the range of the chronons in ch(Lq). For example, if Lq contains a document with a temporal expression mapped to the year 1974 (the lower bound) and another document with a temporal expression mapped to the year 2007 (the upper bound), Ty is chosen as the time outline for Lq. • The time outline is a timeline representation that describes the temporal range of documents in Lq, independent of the “temporal distribution” of documents along this timeline.
For example • d1: 11/11/1990, 11/12/1990, 11/13/1990 • d2: 4/1/1992, 4/2/1992 • d3: 5/17/1996, 5/20/1996 • d4: 9/1/2000, 9/21/2000 • d5: 3/10/1997 • These chronons form ch(Lq). • Find the lower bound and the upper bound. • Ty is chosen as the time outline for Lq.
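The example above can be traced in a short sketch. Two things here are assumptions rather than facts from the slides: the dates are read as month/day/year (since “11/13/1990” cannot be day/month/year), and the rule in `time_outline` that picks the coarsest granularity in which the two bounds differ is only one plausible reading of "find the bounds, then choose the timeline". The function names are hypothetical.

```python
from datetime import date, datetime
from typing import Dict, List

# Hit list from the slide, with dates assumed to be month/day/year.
Lq: Dict[str, List[str]] = {
    "d1": ["11/11/1990", "11/12/1990", "11/13/1990"],
    "d2": ["4/1/1992", "4/2/1992"],
    "d3": ["5/17/1996", "5/20/1996"],
    "d4": ["9/1/2000", "9/21/2000"],
    "d5": ["3/10/1997"],
}


def ch(hit_list: Dict[str, List[str]]) -> List[date]:
    """Collect the multi-set of chronons over all documents in the hit list."""
    return [
        datetime.strptime(text, "%m/%d/%Y").date()
        for chronons in hit_list.values()
        for text in chronons
    ]


def time_outline(chronons: List[date]) -> str:
    """Pick a granularity from the bounds (assumed rule, not from the paper)."""
    lower, upper = min(chronons), max(chronons)
    if lower.year != upper.year:
        return "y"
    if lower.month != upper.month:
        return "m"
    if lower.isocalendar()[1] != upper.isocalendar()[1]:
        return "w"
    return "d"


all_chronons = ch(Lq)
print(min(all_chronons), max(all_chronons))
print(time_outline(all_chronons))
```

With bounds in 1990 and 2000, the sketch selects the year timeline, which matches the outcome stated on the slide.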
Timeline construction and document exploration 5.2 Document clustering (1/3) • The timeline chosen as time outline for Lq is used to normalize the chronons in ch(Lq), here according to Ty. • Denote this normalization of a chronon c based on a time granule g ∈ {y, m, w, d} as normg(c). • For example, normy(“15/4/1966”) = “1966”, and normm(“15/4/1996”) = “4/1996”.
Timeline construction and document exploration 5.2 Document clustering (2/3) • The labels for the initial document clusters for Lq and time granule g are then determined by the following set. • Assume there are l cluster labels y1, y2, . . . , yl in chy(Lq), with yj ∈ Ty, among which the precedence relationship > holds.
Timeline construction and document exploration 5.2 Document clustering (3/3) • The documents in a cluster yj, denoted cluster(yj), are then determined as follows: • There is a main cluster for each document in Lq. • For example, if the chronons associated with d refer to n different years, the main cluster for d, denoted c_main(d), would be the year for which d has the most chronons.
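The normalization normg(c) could be sketched as below. The function name `norm` is hypothetical, input chronons are assumed to be day/month/year strings as in the slide's examples, and the week label format is an assumption because the slides give no example for it.

```python
from datetime import datetime


def norm(chronon: str, g: str) -> str:
    """Normalize a day chronon (day/month/year) to granule g in {y, m, w, d}."""
    day = datetime.strptime(chronon, "%d/%m/%Y").date()
    if g == "y":
        return str(day.year)
    if g == "m":
        return f"{day.month}/{day.year}"
    if g == "w":
        # Assumed label format based on ISO week numbering.
        iso_year, iso_week, _ = day.isocalendar()
        return f"W{iso_week}/{iso_year}"
    if g == "d":
        return f"{day.day}/{day.month}/{day.year}"
    raise ValueError(f"unknown granule: {g!r}")


print(norm("15/4/1966", "y"))
print(norm("15/4/1996", "m"))
```

Parsing into a date first means malformed chronons fail early instead of producing a bogus cluster label.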
For example • d1: 11/11/1990, 11/12/1990, 1/1/1991, 11/13/1990 • d2: 4/1/1992, 4/2/1992 • d3: 5/17/1996, 5/20/1996 • d4: 9/1/2000, 9/21/2000 • d5: 3/10/1997 • These chronons form ch(Lq). • Find the lower bound and the upper bound; Ty is chosen as the time outline for Lq. • Normalize by Ty to obtain chy(Lq): d1: 1990, d1: 1991, d2: 1992, d3: 1996, d4: 2000, d5: 1997. • This gives 6 clusters. • c_main(d1) is the cluster with label 1990.
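The year clustering and the main cluster in this example can be reproduced with a small sketch. The helper names `year_of`, `cluster_by_year`, and `c_main` are hypothetical, and breaking ties by first occurrence is an assumption because the slides do not say how ties are resolved.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Hit list from the slide; the year is the last field in every date.
Lq: Dict[str, List[str]] = {
    "d1": ["11/11/1990", "11/12/1990", "1/1/1991", "11/13/1990"],
    "d2": ["4/1/1992", "4/2/1992"],
    "d3": ["5/17/1996", "5/20/1996"],
    "d4": ["9/1/2000", "9/21/2000"],
    "d5": ["3/10/1997"],
}


def year_of(chronon: str) -> str:
    """Normalize a date string to its year label."""
    return chronon.rsplit("/", 1)[1]


def cluster_by_year(hit_list: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Place each document in every year cluster it has a chronon for."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for doc, chronons in hit_list.items():
        for year in sorted({year_of(c) for c in chronons}):
            clusters[year].append(doc)
    return dict(sorted(clusters.items()))


def c_main(chronons: List[str]) -> str:
    """Year with the most chronons; ties go to the first seen (assumption)."""
    return Counter(year_of(c) for c in chronons).most_common(1)[0][0]


clusters = cluster_by_year(Lq)
print(clusters)
print(c_main(Lq["d1"]))
```

Document d1 appears in both the 1990 and 1991 clusters, while its main cluster is 1990 because three of its four chronons fall in that year.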