This research examines collection synthesis in digital libraries, including the concept of clusters, the document vector space model, and the use of centroids. It also explores the process of building seed URL sets and crawl control. The evaluation of collections is discussed, along with possible future developments in machine learning.
Collection Synthesis Donna Bergmark Cornell Digital Library Research Group March 12, 2002
Collection – what is it? • For a digital library, it could be a set of URLs • The documents pointed to are about the same topic • They may or may not be archived • They may be collected by hand or automatically
Collections and Clusters • Clusters are collections of items • The items within the cluster are closer to each other than to items in other clusters • There exist many statistical methods for cluster identification • If clusters are pre-existing, then collection synthesis is a “classification problem”
The Document Vector Space • Classic approach in IR • Represents each document (and each query) as a vector of term weights • Similarity between items is measured by comparing their vectors
Document Vector Space Model • Classic “Saltonian” theory • Originally based on collections • Each word is a dimension in N-space • Each document is a vector in N-space • Best to use normalized weights • Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
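In symbols (a standard way to write the model, not taken verbatim from the slide): with a dictionary of N terms, document j is represented by a weight vector, normalized to unit length so that long pages do not dominate.

```latex
\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{Nj}),
\qquad
\|\vec{d}_j\| = \sqrt{\sum_{i=1}^{N} w_{ij}^{2}} = 1
```

The example above is one such (very sparse) normalized vector.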
Distance in DV Space • How similar are two documents, or a document and a query? • You look at their vectors in N-space • If there is overlap, the documents are similar • If there is no overlap, the documents are orthogonal (i.e., totally unrelated)
Cosine Correlation • Correlation ranges between 0 and 1 • 0 = nothing in common at all (orthogonal) • 1 = all terms in common (complete overlap) • Easy to compute • Intuitive
Cosine Correlation • Given vectors x and y, each consisting of real numbers x1, x2, …, xN and y1, y2, …, yN • Compute the cosine correlation as:
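The formula itself did not survive the slide conversion; the standard cosine correlation over the two weight vectors is:

```latex
\cos(\vec{x}, \vec{y}) \;=\;
\frac{\displaystyle\sum_{i=1}^{N} x_i\, y_i}
     {\sqrt{\displaystyle\sum_{i=1}^{N} x_i^{2}}\;\sqrt{\displaystyle\sum_{i=1}^{N} y_i^{2}}}
```

With non-negative term weights this value falls between 0 and 1, matching the range given on the previous slide.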
The Dictionary • Usual to keep a dictionary of actual words (or their stems) • Efficient word lookup • Common (stop) words left out • For each word i, store its document frequency df(i) • and its discrimination value (inverse document frequency) idf(i)
Computing the Document Vector • Download a document, extract its words, and look each one up in our dictionary • For each word i that is actually in the dictionary, compute a weight for it: w(i) = tf(i) * idf(i)
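A minimal sketch of this weighting step in Python; the names document_vector, dictionary, and idf are illustrative rather than taken from the original system, and the idf table is assumed to have been built beforehand.

```python
import math
from collections import Counter

def document_vector(words, dictionary, idf):
    """Turn a downloaded page's tokens into a normalized tf*idf vector.

    `dictionary` is the set of retained (non-stopword) terms and `idf`
    maps each term to its inverse document frequency; both are assumed
    to exist already.
    """
    tf = Counter(w for w in words if w in dictionary)      # term frequencies
    vec = {t: tf[t] * idf[t] for t in tf}                  # w(i) = tf(i) * idf(i)
    norm = math.sqrt(sum(w * w for w in vec.values()))     # normalize to unit length
    return {t: w / norm for t, w in vec.items()} if norm else {}
```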
Assembling a Collection • Download a document • Compute its term vector • Add it to the collection it is most like, based on its vector and the collection’s vector • How to get the collection vectors?
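Assigning a page to its nearest collection could then look roughly like this; cosine and assign_to_collection are hypothetical helpers built on the document_vector sketch above, not the author's actual code.

```python
import math

def cosine(u, v):
    """Cosine correlation of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_to_collection(doc_vec, centroids):
    """Return the name of the centroid the document correlates with best."""
    return max(centroids, key=lambda name: cosine(doc_vec, centroids[name]))
```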
The Centroids • “Centroid” is what I called the collection’s document vector • It is critical to the quality of the collection that is assembled • Where do the centroids come from? • How to weight the terms?
The Topic Hierarchy
0 Algebra
  1 Basic Algebra
    2 Equations
      3 Graphing Equations
    2 Polynomials
  1 Linear Algebra
    2 Eigenvectors/Eigenvalues
    :
Building a seed URL set • Given topic “T” • Find hubs/authorities on that topic • Exploit a search engine to do this • How many results to keep? I chose 7; Kleinberg chooses 200. • Google does not allow automated searches without prior permission
Query: Graphing Basic Algebra…
Accessone.com/~bbunge/Algebra/Algebra.html
Library.thinkquest.org/20991/prealg/eq.html
Library.thinkquest.org/20991/prealg/graph.html
Sosmath.com/algebra/algebra.html
Algebrahelp.com/
Archives.math.utk.edu/topics/algebra.html
Purplemath.com/modules/modules.htm
Results: Centroids • 26 centroids (from about 30 topics) • Seed sets must have at least 4 URLs • All terms from seed URL documents were extracted and weighted • Kept the top 40 words in each vector • Union of the vectors became our dictionary • Centroid evaluation: 90% of seed URLs classified with “their” centroid
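A hedged sketch of how such a centroid might be computed from a seed set, following the description above (weight the seed documents' terms, keep the 40 heaviest); build_centroid is an illustrative name and averaging the seed vectors is an assumption about the aggregation step.

```python
def build_centroid(seed_vectors, top_k=40):
    """Average seed-document vectors and keep the top_k heaviest terms."""
    totals = {}
    for vec in seed_vectors:
        for term, weight in vec.items():
            totals[term] = totals.get(term, 0.0) + weight
    averaged = {t: w / len(seed_vectors) for t, w in totals.items()}
    heaviest = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(heaviest)
```

The union of the terms kept across all centroids then serves as the crawl dictionary, as the slide notes.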
Three Knobs for Crawl Control • “On topic”: the downloaded page correlates with the nearest centroid by at least Q, where 0 < Q <= 1.0 • Cutoff – how many off-topic pages to travel through before cutting off this search line? 0 <= Cutoff <= D • Time limit – how many hours to crawl
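A rough sketch of how the three knobs might interact in a focused-crawl loop. It reuses the document_vector, cosine, and assign_to_collection sketches above; fetch_page, extract_links, dictionary, and idf are assumed helpers and tables, and the default values for Q, cutoff, and hours are placeholders, not the author's settings.

```python
import time

def focused_crawl(seed_urls, centroids, Q=0.5, cutoff=0, hours=5.0):
    """Illustrative crawl loop showing the three control knobs.

    Q       - minimum correlation for a page to count as on topic
    cutoff  - how many consecutive off-topic pages a search line may pass through
    hours   - wall-clock time limit for the whole crawl
    """
    deadline = time.time() + hours * 3600
    frontier = [(url, 0) for url in seed_urls]        # (url, off-topic run length)
    collections = {name: [] for name in centroids}
    while frontier and time.time() < deadline:
        url, run = frontier.pop(0)
        vec = document_vector(fetch_page(url), dictionary, idf)   # assumed helpers
        best = assign_to_collection(vec, centroids)
        if cosine(vec, centroids[best]) >= Q:                     # on topic
            collections[best].append(url)
            frontier += [(link, 0) for link in extract_links(url)]
        elif run < cutoff:                                        # off topic, maybe keep going
            frontier += [(link, run + 1) for link in extract_links(url)]
    return collections
```

With Cutoff = 0, as reported on the next slide, off-topic pages are never expanded at all.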
Results: Some Collections • Built 26 collections in Math • Kept the 20-50 best-correlating URLs for each class • The best Cutoff is 0 • I have crawled (for math) about 5 hours • Some collections are larger than others
Collection “Evaluation” • The only automatic evaluation method is the correlation value, i.e., how closely an item correlates with the collection • With human relevance assessments, one can also compute a “precision” curve • Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n
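Written out, with rel(i) = 1 if the i-th ranked item was judged relevant and 0 otherwise:

```latex
P(n) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{rel}(i)
     \;=\; \frac{\bigl|\{\text{relevant items among the top } n\}\bigr|}{n}
```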
Results: Class 14
Mathforum.org/dr.math/problems/keesha.12.18.01.html
Mathforum.org/dr.math/problems/kmiller.9.2.96.html
Mathforum.org/dr.math/problems/santiago.10.14.98.html
www.geom.umn.edu/docs/education/build-icos
:
Mtl.math.uiuc.edu/message_board/messages/326.html
Conclusions • We are still working on the collections • Currently picking parameters • Machine learning will be added • Discussion? Questions?