Digital Libraries: Re-inventing Scholarly Information Dissemination and Use Robert Wilensky Principal Investigator David Forsyth Co-principal Investigator The UC Berkeley Digital Library Team
Central Thrusts • Provide tools to facilitate • changing the publishing model • from centralized, linear, binary, expensive, “filter-then-disseminate” model, • to a much less costly, powerful, fully distributed “disseminate-filter-collaborate” cycle • without sacrificing good organization, peer review • treating non-textual material (photos, video, maps, primary data sets) as first class citizens
Who We Are • PI and Co-PI: • Robert Wilensky (CS & SIMS) • David Forsyth (CS) • Faculty Investigators • Richard Fateman (CS) • Ray Larson (SIMS) • Jitendra Malik (CS) • Philip Stark (Statistics) • Doug Tygar (CS & SIMS) • Nancy Van House (SIMS) • Hal Varian (SIMS) • Marti Hearst (SIMS) • James Landay (CS) • Joe Hellerstein (CS) • Post-docs • Kobus Barnard • Tracy Riggs • Byunghoon Kang • Jon Traupman • Other Investigators • Henry Baird (Xerox PARC) • Bernie Hurley (UCB Library) • Pinar Duygulu (Middle East Technical University) • Students • Byunghoon Kang • Xiaofeng Ren • Sumeet Solanki • Staff • Ginger Ogle • Jeff Anderson-Lee • Howard Foster • Loretta Willis • Joyce Gross • Tom Phelps
Partners • UCB Organizations • Museum of Vertebrate Zoology • Jepson Herbarium • U.C.B. Library • U.C.B. Instructional Technology Program • DLIB InterOp Project Partners • Stanford, UCSB • California Digital Library • SDSC • Not-for-profits • CalFlora • California Academy of Sciences • Fine Arts Museums of San Francisco • California Department of Fish and Game • Corporate • Xerox PARC • Hewlett-Packard • IBM Almaden • NEC • Sun Microsystems • Microsoft • Sharp
Some Technology • New Document Models • Multivalent Documents • GIS Viewer • Related Tools: TilePic and GISLite • Collaborative quality filtering as a proxy for academic review • Robust Linking • Personal Libraries • Image Analysis: • Better content-based image analysis • Combining image and text • Self-administrating Documents • Document recognition: • A turbo recognition DID-based approach to document layout analysis. • Collections: • Biologically-related large image and data collections • Rare books scanning effort
Tools for Information Management and Collaboration • Multivalent Documents: A Platform for New Ideas • An “Anytime, Anywhere, Any Type, Every Way User-Improvable Digital Document Platform” • Not format-centric; radically extensible to • support any format • perform standard document functionality • implement your new idea • Extensions work across all formats.
Multivalent Architecture • Extensibility achieved by • behaviors and layers paradigm • behaviors written to conform to an open protocol • document tree (that includes UI) • So each document can be its own custom browser. • Conducive to developing a “digital library”-centric browser • E.g., easy to support distributed annotation.
Multivalent Status • Multivalent Browser, DR4, available; beta ASN • An open source Java (1.4) application, at http://http.cs.berkeley.edu/~phelps/Multivalent/ • standard browser features (cache, UI, bookmarks, etc.), robust URL support • Implemented behaviors: • Media Adaptors: • HTML 3.2 + CSS • LaTeX/DVI • ASCII • PDF • “enlivened scanned images” • “multi-page” • Span: hyperlink, highlight, copyeditor annotations, “anchored ink”, style, redaction • Lenses: “show OCR”, magnify, cypher, notes, rulers, etc. • Structural: alt. select-and-paste, Notemarks • Misc: search hit visualization, “managers”
Multivalent Plans • Support project goals by providing • Complete media adaptors for common document formats (esp. HTML+CSS, XML, LaTeX/DVI, PDF) • More standard browser features (e.g., hierarchical bookmarks, preference editing) • Experiment with “history-enriched digital objects” • Mechanisms for manipulating multiple annotations • Support from document collaboration services • Support for (non-textual) data types • temporal and geographic extent, via JMF 2.0 • involving dynamic elements • data set elements
Related Image-oriented Tools • GIS Viewer • 4.0 released to public • Related Tools: • TilePic • GISLite
Robustness: The Challenge • How do we put together distributed applications • that rely on independently administered distributed resources • which change chaotically • yet whose performance degrades gracefully as the world changes? • One answer: Provide multiple, largely independent descriptions along uncontrolled network boundaries.
Robust Linking • Robust Locations • Refer to locations within a resource, but can still be used to find the location after the page is edited. • Implemented in Multivalent Browser • Robust Hyperlinks • Refer to whole resources, but can still be used to find the resource after the page is moved, etc. • Available now: http://www.cs.berkeley.edu/~phelps/Robust
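A minimal sketch of the robust-location idea, assuming a document is simply a list of paragraph strings. The location is stored as several independent descriptions (a paragraph index, a snippet of surrounding context, and the anchor text itself), and resolution tries them in turn. The class and function names here are hypothetical illustrations, not the Multivalent Browser's actual descriptors or API.

```python
from dataclasses import dataclass


@dataclass
class RobustLocation:
    """Hypothetical record holding redundant descriptions of one spot."""
    para_index: int  # positional description
    context: str     # text surrounding the anchor
    anchor: str      # the text being pointed at


def make_location(paragraphs, para_index, anchor, width=20):
    """Record a location together with up to `width` characters of context on each side."""
    text = paragraphs[para_index]
    start = text.index(anchor)
    context = text[max(0, start - width): start + len(anchor) + width]
    return RobustLocation(para_index, context, anchor)


def resolve_location(paragraphs, loc):
    """Return (paragraph index, character offset) or None, degrading gracefully."""
    # 1. The positional description still works if the page was not restructured.
    if loc.para_index < len(paragraphs) and loc.anchor in paragraphs[loc.para_index]:
        return loc.para_index, paragraphs[loc.para_index].index(loc.anchor)
    # 2. Otherwise look for the stored context anywhere in the document.
    for i, para in enumerate(paragraphs):
        if loc.context in para:
            return i, para.index(loc.context) + loc.context.index(loc.anchor)
    # 3. Last resort: the anchor text alone.
    for i, para in enumerate(paragraphs):
        if loc.anchor in para:
            return i, para.index(loc.anchor)
    return None
```

Each description fails under different kinds of edits, which is why keeping several of them lets the lookup survive changes that would break any single one.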
Robust Hyperlink Example • Compute “lexical signature” of page http://www.eng.nsf.gov/engnews/2001/Dec01RobotLegos/dec01robotlegos.htm • which turns out to be jjarosz lambirth telesurgery jarosz simulating • Add to URL to make robust URL: • http://www.eng.nsf.gov/engnews/2001/Dec01RobotLegos/dec01robotlegos.htm/?lexical-signature=jjarosz+lambirth+telesurgery+jarosz+simulating • Feed signature to a search engine on URL failure.
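The steps in this example can be sketched roughly as follows. The scoring below is an assumption: a generic capped TF-IDF ranking that favors words frequent on the page but rare on the web, which may differ from the project's actual term-selection heuristics. The document-frequency table `doc_freq`, the corpus size `n_docs`, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qs, urlencode, urlsplit


def lexical_signature(text, doc_freq, n_docs, k=5):
    """Pick the k terms that best identify this page (assumed capped TF-IDF)."""
    words = re.findall(r"[a-z]+", text.lower())
    tf = Counter(words)

    def score(word):
        # Rare words get a high inverse document frequency.
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        # Cap term frequency so one repeated word cannot dominate.
        return min(tf[word], 5) * idf

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda w: (-score(w), w))
    return ranked[:k]


def make_robust_url(url, signature):
    """Append the signature as a query parameter; spaces are encoded as '+'."""
    return url + "?" + urlencode({"lexical-signature": " ".join(signature)})


def signature_from_url(robust_url):
    """Recover the signature terms, e.g. to hand to a search engine."""
    query = parse_qs(urlsplit(robust_url).query)
    return query["lexical-signature"][0].split()
```

A client that fails to fetch the plain URL can call `signature_from_url` and submit the resulting terms as a search query.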
Robust Hyperlinks: Plans • Problem: No one wants to bother signing anything. • Proposal: Build a URL-signature data base; fail over to this upon 404 errors. • using Stanford’s WebBase
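A sketch of how such a fail-over might look from the client side, assuming the URL-signature database can be queried like a dictionary. The `fetch` and `search` callables and the `signature_db` mapping are hypothetical stand-ins injected by the caller; nothing here reflects WebBase's real interface.

```python
def resolve_with_failover(url, fetch, signature_db, search):
    """Fetch url; on a 404, look up its stored signature and search for the page.

    fetch(url) returns (status, body); signature_db maps url -> list of terms;
    search(terms) returns candidate URLs, best first. All three are assumptions.
    """
    status, body = fetch(url)
    if status != 404:
        return url, body
    terms = signature_db.get(url)
    if terms is None:
        return None  # nobody recorded a signature for this page
    for candidate in search(terms):
        status, body = fetch(candidate)
        if status == 200:
            return candidate, body
    return None
```

Because the database is consulted only after a 404, authors never have to sign their own links.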
Collaborative Quality Filtering • Idea: Traditional peer review is majorized by a good collaborative filtering system. • I.e., publishing = dissemination + collaborative quality filtering • Approach: • Good papers are ones good reviewers rate highly, etc. • Good reviewers are the ones that rate papers accurately. • Assumption: Good reviewers’ reviews should agree with the asymptotic average (looking forwards) • Use hubs-and-authorities type algorithm to establish credentials. • Note: • Can rate along multiple dimensions, e.g., importance and correctness • Later on, can add other factors, e.g., prediction of asymptotic citation index, credentials, expertise
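The mutual reinforcement between paper quality and reviewer credibility can be illustrated with a small fixed-point iteration. This is a sketch under stated assumptions, not the project's algorithm: a paper's score is taken to be the credibility-weighted mean of its ratings, and a reviewer's credibility is taken to be 1 / (1 + mean absolute deviation from those scores). Both formulas and the function name are illustrative choices.

```python
def rate_papers(reviews, iterations=20):
    """Jointly estimate paper scores and reviewer weights.

    reviews is a list of (reviewer, paper, rating) tuples.
    Returns (score per paper, weight per reviewer).
    """
    reviewers = {reviewer for reviewer, _, _ in reviews}
    weight = {reviewer: 1.0 for reviewer in reviewers}  # start with equal trust
    score = {}
    for _ in range(iterations):
        # Paper score: average of ratings, weighted by current reviewer trust.
        totals, norms = {}, {}
        for reviewer, paper, rating in reviews:
            totals[paper] = totals.get(paper, 0.0) + weight[reviewer] * rating
            norms[paper] = norms.get(paper, 0.0) + weight[reviewer]
        score = {paper: totals[paper] / norms[paper] for paper in totals}
        # Reviewer trust: high when ratings sit close to the consensus score.
        errors = {reviewer: [] for reviewer in reviewers}
        for reviewer, paper, rating in reviews:
            errors[reviewer].append(abs(rating - score[paper]))
        weight = {r: 1.0 / (1.0 + sum(e) / len(e)) for r, e in errors.items()}
    return score, weight
```

With two reviewers who agree and one outlier, the outlier's weight shrinks on each pass and the paper scores drift toward the majority's ratings.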
Collaborative Quality Filtering (cont’d) • Simple algorithm predicts users’ evaluations of reviewers in empirical study. • Parameters for number of items reviewer has reviewed, no. of reviews item receives, rank of review. • Advanced version incorporating notion of areas of expertise being tested. • Maintains reviewer ratings on a per-document basis; computes document rating based on similarity of documents. • Initial implementations in collaboration with NEC CiteSeer • More details are available.
Personal Libraries • Goal: Make it easy for individuals and groups to build and manage document collections. • Seamlessly incorporating born-digital and legacy documents • Approach: Provide collection manager • Manages collections in distributed repositories • Initial prototype • Supports collection creation, population, editing, access by metadata searching, full-text indexing. • HTML, PDF, ASCII, scanned images, composites (prototype) • Provides affiliated repository service • Scan-to-collection service
Personal Libraries (cont’d) • Future directions: • Incorporate robust linking • Full support for composites • Automatic collection population • OceanStore backend? • Have begun experimental use by CS Division and SIMS
Image Analysis for Access • BlobWorld++ • A new framework for segmentation (normalized cuts) • Shape • New algorithm developed for measuring image shape similarity using “shape contexts” • Now hold world record for handwritten digit recognition • Combining Image Features with Text for Image Data Organization and Search • Kobus Barnard and David Forsyth
Combining Image Features with Text • Idea: Use text and image features together • Text: semantic categories • Image features: visual similarity • Together learn interesting relationships • Use statistical models to learn structure • Cluster on “blobs” and (disambiguated, hierarchically enhanced) words • Arrange clusters in hierarchy • Result is automatic organization of large collections for • Browsing (using GIS Viewer as interface, and using TilePic) • “auto-illustration” • “auto-captioning”
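To give a feel for “auto-captioning”, here is a deliberately simplified sketch. It swaps the hierarchical clustering model described above for plain co-occurrence counting: each blob type votes for the words it has appeared with in training images. The blob identifiers, function names, and voting scheme are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word accompanies each blob type.

    examples is a list of (blob_ids, words) pairs, one per annotated image.
    """
    cooc = defaultdict(Counter)
    for blobs, words in examples:
        for blob in blobs:
            cooc[blob].update(words)
    return cooc


def caption(cooc, blobs, k=2):
    """Suggest up to k words for an image given the blob types found in it."""
    votes = Counter()
    for blob in blobs:
        counts = cooc.get(blob, Counter())
        total = sum(counts.values())
        if total == 0:
            continue  # blob type never seen in training
        for word, count in counts.items():
            votes[word] += count / total  # estimate of P(word | blob)
    # Highest vote first, ties broken alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

Running the same table in the other direction (from a word to the blob types it most often accompanies) gives a crude form of “auto-illustration”.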
Combining Image Features with Text • And here are some results for labeling image segments.
Summary • We need to rethink the entire cycle of information use • creation, dissemination and collaboration • We must provide support for • finding and presenting non-textual material (photos, video, maps) • collection creation of primary data sources and informal “publication” • radically new modes of use • robustness in a chaotic world • We will need a lot of help!