Managing Distributed Collections: Evaluating Web Page Change, Movement, and Replacement
Richard Furuta and Frank Shipman
Center for the Study of Digital Libraries and the Department of Computer Science
Texas A&M University
Distributed Collections
• The Web is continuously changing
  • .gov and .edu pages change less frequently than .com pages (1999)
• Collections are needed to “organize” the Web
  • Bookmark lists
  • Yahoo! directories
  • Web portals (NSDL)
  • Walden’s Paths
• Collection managers cannot control changes
Changes to Items in Collections
• Items in collections
  • Play specific roles
  • Are semantically related
    • To each other
    • To the collection
• A change to an item may
  • Change its relationship to the collection
    • Less coherent with other items (default assumption)
    • More coherent, or no change in relationship
  • Affect the role it plays in the collection
    • Less suitable (default assumption)
    • More suitable, or no effect on the role
Research Focuses
• Develop techniques to help collection managers cope with changes
  • Change, migration, disappearance
• Categories of change
  • Missing pages (migration and disappearance)
    • Find exact matches
    • Suggest similar pages
  • Changed pages: characterizing change
    • Quantity of change
    • Nature of change
    • Relevance to the collection
• Implementation: Path Manager, a tool that helps collection managers cope with changes
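The "find exact matches" step for a missing page could be sketched as follows. This is a minimal illustration, not Path Manager's actual method: it assumes a cached copy of the missing page's text is available, and the names `fingerprint` and `find_exact_matches`, the whitespace and case normalization, and the example.org URLs are all hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lower-casing (an assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_matches(missing_page_text, candidates):
    """candidates maps URL -> page text; return URLs whose normalized text is identical."""
    target = fingerprint(missing_page_text)
    return sorted(url for url, text in candidates.items() if fingerprint(text) == target)


# Hypothetical candidate pages, e.g. gathered from a search for the page title.
candidates = {
    "http://example.org/moved": "Hello   World",
    "http://example.org/other": "Goodbye",
}
matches = find_exact_matches("hello world", candidates)
```

Candidates that do not match exactly would fall through to the "suggest similar pages" step, which needs a graded similarity measure rather than a hash.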
Management of Distributed Collections
• Detection of change is easy
• Determination of
  • Quantity of change is relatively easy
  • Relevance of change is less easy
  • Meaning of change is difficult
• Approaches
  • Human validation (Yahoo! surfers)
  • Automatic detection of change (Path Manager)
Path Manager: The Tool
• Types of change
  • Content changes (what)
  • Presentation changes (how)
  • Structural changes (linking)
  • Behavioral changes (scripting; not addressed)
[Screenshots: Collection-level overview, Page-level overview, Page details]
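One way to separate the first three types of change is to compare different features of two versions of a page. The sketch below is an assumption-laden illustration, not Path Manager's implementation: it treats visible words as content, the sequence of tags as presentation, and `href` targets as structure. The names `PageFeatures`, `extract`, and `change_types` are hypothetical, and text inside script or style elements is not filtered out.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collect three crude feature lists from an HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []  # content: words of text
        self.tags = []   # presentation: sequence of opening tags (assumed proxy)
        self.links = []  # structure: link targets

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

    def handle_data(self, data):
        self.words.extend(data.split())


def extract(html):
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser


def change_types(old_html, new_html):
    """Return the set of change types detected between two versions of a page."""
    old, new = extract(old_html), extract(new_html)
    found = set()
    if old.words != new.words:
        found.add("content")
    if old.tags != new.tags:
        found.add("presentation")
    if old.links != new.links:
        found.add("structural")
    return found


original = '<p>Hello <b>world</b> <a href="/a">link</a></p>'
restyled = '<p>Hello <i>world</i> <a href="/a">link</a></p>'
relinked = '<p>Hello <b>world</b> <a href="/b">link</a></p>'
reworded = '<p>Goodbye <b>world</b> <a href="/a">link</a></p>'
```

Behavioral changes are left out here as well, matching the slide.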
Page-level Overview
[Screenshot labels: Little change, Server unreachable, 404 error, No change, Drastic change]
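The five labels in the overview suggest a simple per-page classification. A minimal sketch follows, assuming the fetch result is already in hand; the function name `classify_page`, the use of `difflib` similarity, and the 0.8 cut-off between little and drastic change are illustrative assumptions, not values taken from Path Manager.

```python
import difflib


def classify_page(status_code, old_text, new_text, little_change_ratio=0.8):
    """Map a fetch result to one of the overview labels.

    status_code is None when the server could not be reached.
    Only the cases named on the slide are handled; other HTTP errors
    fall through to the text comparison.
    """
    if status_code is None:
        return "server unreachable"
    if status_code == 404:
        return "404 error"
    if old_text == new_text:
        return "no change"
    # Assumed measure: character-level similarity in [0, 1].
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return "little change" if similarity >= little_change_ratio else "drastic change"
```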
Page Details
[Screenshot labels: Page information, Modification details]
Content-based Metrics
[Chart: angle between original and replacing pages (in degrees)]
• Change is change…
• High angle of change for all cases
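The angle between two pages can be computed from their term vectors. The sketch below assumes plain term-frequency vectors over lower-cased words; the slides do not say which weighting or tokenization was used, so `term_vector` and `angle_degrees` are hypothetical names for one plausible reading.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count word occurrences (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def angle_degrees(vec_a, vec_b):
    """Angle between two term-frequency vectors, in degrees (0 means same direction)."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 90.0  # assumption: an empty page is treated as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


same = angle_degrees(term_vector("elephants live in herds"),
                     term_vector("elephants live in herds"))
unrelated = angle_degrees(term_vector("elephants live in herds"),
                          term_vector("quarterly earnings report"))
```

Because any replacement shares few terms with the page it replaces, this measure alone reports a high angle whether the new page is on-topic or not, which is the limitation the slide points out.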
Context-based Change Detection
• Context consists of
  • Content from other pages in the path
  • Annotations created by the author
  • Additional metadata provided by the author
• Distinguishes between edited and replaced pages
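A context vector built from the three sources listed above might look like the following. This is a sketch under stated assumptions: the sources are combined by summing raw term counts with equal weight, and `context_vector` and `angle_to_context` are hypothetical names rather than Path Manager functions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count word occurrences (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def context_vector(other_pages, annotations, metadata):
    """Sum term counts over everything that surrounds a page in its path."""
    context = Counter()
    for text in list(other_pages) + list(annotations) + list(metadata):
        context.update(term_vector(text))
    return context


def angle_to_context(page_text, context):
    """Angle in degrees between a page's term vector and the context vector."""
    page = term_vector(page_text)
    dot = sum(count * context[term] for term, count in page.items())
    norms = (math.sqrt(sum(c * c for c in page.values()))
             * math.sqrt(sum(c * c for c in context.values())))
    if norms == 0:
        return 90.0  # assumption: an empty page or context is treated as orthogonal
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


context = context_vector(
    other_pages=["African elephants live in herds", "Elephant habitat and diet"],
    annotations=["Notes on elephants"],
    metadata=["Topic: elephants"],
)
on_topic = angle_to_context("Elephants and their habitat", context)
off_topic = angle_to_context("Stock market earnings report", context)
```

An edited page should stay close to the context, while an unrelated replacement should sit at a much larger angle.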
Evaluation
• 20 paths, pages selected from Yahoo! Directories
  • Each path consisted of 10 to 12 pages
  • Pages were randomly selected
    • No Flash presentations or images
• A page in each path was randomly selected for replacement
• Each selected page was replaced by 3 pages
  • CNN Financials (large change)
  • Elephants (large change)
  • A page from the same Yahoo! Directory (small change)
Results: Distribution of Context-based Changes
[Chart: replacements resulting in moving towards and away from the context vector]
Experimental thresholds
• Negative angle = divergence from the collection
• Distinction between similar and different pages
• Managers can now focus on divergent pages
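The thresholding idea can be sketched as a triage step over per-page angle measurements. The sign convention here is an assumption inferred from the slide: the change is the angle to the context before the modification minus the angle after it, so a negative value means the page moved away from the collection. The function names, the default threshold of 0, and the example.org URLs are hypothetical.

```python
def context_change(angle_before, angle_after):
    """Positive: the page moved towards the context. Negative: it moved away."""
    return angle_before - angle_after


def divergent_pages(observations, threshold=0.0):
    """Return URLs whose change falls below the threshold, most divergent first.

    observations is an iterable of (url, angle_before, angle_after) in degrees.
    """
    changes = [(context_change(before, after), url) for url, before, after in observations]
    return [url for change, url in sorted(changes) if change < threshold]


observations = [
    ("http://example.org/edited", 40.0, 42.0),    # slight drift away
    ("http://example.org/replaced", 40.0, 85.0),  # strong divergence
    ("http://example.org/improved", 40.0, 35.0),  # moved towards the context
]
to_review = divergent_pages(observations)
urgent = divergent_pages(observations, threshold=-10.0)
```

A stricter (more negative) threshold narrows the review list to the pages that diverged most.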
For more information on Walden’s Paths
http://www.csdl.tamu.edu/walden/
walden@csdl.tamu.edu
Principal Investigators:
Richard Furuta (furuta@csdl.tamu.edu)
Frank Shipman (shipman@csdl.tamu.edu)