Information Access for a Digital Library: Cheshire II and the Berkeley Environmental Digital Library • Ray R. Larson, School of Information Management & Systems, University of California, Berkeley, ray@sherlock.berkeley.edu • Chad Carson, Computer Science Division, EECS, University of California, Berkeley, carson@eecs.berkeley.edu • ASIS Annual Meeting 1999: Ray R. Larson
UCB Digital Library Project: Research Agenda • Funded by NSF/NASA/DARPA Digital Library Initiative (Phases I and II) • Research agenda • Understand user needs. • Extend functionality of documents. • “Enliven” legacy documents. • Improve access to information. • Scale to large systems. • Re-Invent Scholarly Information Access and Use
Testbed: An Environmental Digital Library • Collection: Diverse material relevant to California’s key habitats. • Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries. • Potential: Impact on state-wide environmental system (CERES)
The Environmental Library - Users/Contributors • California Resources Agency, California Environmental Resources Evaluation System (CERES) • California Department of Water Resources • The California Department of Fish & Game • SANDAG • UC Water Resources Center Archives • New Partners: CDL and SDSC
The Environmental Library - Contents • Environmental technical reports, bulletins, etc. • County general plans • Aerial and ground photography • USGS topographic maps • Land use and other special purpose maps • Sensor data • “Derived” information • Collection data bases for the classification and distribution of the California biota (e.g., SMASCH) • Supporting 3-D, economic, traffic, etc. models • Videos collected by the California Resources Agency
The Environmental Library - Contents • As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.
Botanical Data: • The CalFlora Database contains taxonomic and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.
Geographical Data: • Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 Tiger records of S.F. Bay Area streets along with 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.
Documents: • Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.
Documents - cont. • The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.
Photographs: • The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources, several hundred aerial photos, 17,000 photos of California native plants from St. Mary's College, the California Academy of Science, and others, a small collection of California animals, and 40,000 Corel stock photos.
Testbed Success Stories • LUPIN: CERES’ Land Use Planning Information Network • California County General Plans and other environmental documents. • Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server. • California flood relief efforts • High demand for some data sets only available on our server (created by document recognition). • CalFlora: Creation and interoperation of repositories pertaining to plant biology. • Cloning of services at Cal State Library, FBI
Research Highlights • Documents • Multivalent Document prototype • Page images, structured documents, GIS data, photographs • Intelligent Access to Content • Document recognition • Vision-based Image Retrieval: stuff, thing, scene retrieval • Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces
User Interface Paradigms: Multivalent Documents • An approach to new document types and their authoring. • Supports active, distributed, composable transformations of multimedia documents. • Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.
Multivalent Documents • Figure: a scanned page image overlaid with a stack of layers (Cheshire Layer, GIS Layer, Table Layer, OCR Layer, OCR Mapping Layer) on top of Network Protocols & Resources • Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). Webster’s 7th Collegiate Dictionary
GIS in the MVD Framework • Layers are georeferenced data sets. • Behaviors are • display semi-transparently • pan • zoom • issue query • display context • “spatial hyperlinks” • annotations • Written in Java (to be merged with MVD-1 code line?)
GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html
Overview of Cheshire II • The Cheshire II system is intended to provide an easy-to-use, standards-compliant system capable of retrieving any type of information in a wide variety of settings.
Overview of Cheshire II • It supports SGML and XML. • It is a client/server application. • Uses the Z39.50 Information Retrieval Protocol. • Server supports a Relational Database Gateway. • Supports Boolean searching of all servers. • Supports probabilistic ranked retrieval in the Cheshire search engine. • Search engine supports “nearest neighbor” searches and relevance feedback. • GUI interface on X window displays. • WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire. • Image content retrieval using Blobworld. • Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 Gateway
Cheshire II Searching • Figure: local and remote servers reached via Z39.50 over the Internet, covering images and scanned text
Current Usage of Cheshire II • Web clients for: • NSF/NASA/ARPA Digital Library • Includes support for full-text and page-level search. • Experimental Blob-World image search • SunSite • University of Liverpool. • University of Essex, HDS (part of AHDS) • California Sheet Music Project • Cha-Cha (Berkeley Intranet Search Engine) • Univ. of Virginia • Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)
Image Retrieval Research • Finding “Stuff” vs “Things” • BlobWorld • Other Vision Research
Blobworld: use regions for retrieval • We want to find general objects • Represent images based on coherent regions
Outline • Why regions? • Creating Blobworld: segmentation and description • Using Blobworld: query experiments • Indexing blobs for faster querying • Conclusions
Creating and using Blobworld • Create: extract features, segment image, describe regions • Use: query
Extract features for each pixel • Color • Take average color (L*a*b*) at the selected scale; ignore local color variations due to texture • “zebra = gray horse + stripes” • Texture • Find contrast, anisotropy, polarity at the selected scale • Position
Find groups in feature space • Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)
Find regions in the image • Label each pixel based on its Gaussian cluster • Find connected components to obtain regions • Figure: pixel cluster labels (1 to 4) and the resulting regions
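The two steps on this slide can be sketched as follows. This is a minimal illustration, not the Blobworld code: the function name `connected_regions` and the choice of 4-connectivity are assumptions, and the input is taken to be a 2-D array holding each pixel's Gaussian cluster label.

```python
from collections import deque

import numpy as np


def connected_regions(labels):
    """Split a 2-D array of cluster labels into 4-connected regions.

    Returns (regions, count), where regions[r, c] is a region id in
    0..count-1 and pixels share an id only if they carry the same
    cluster label and are linked by a path of 4-neighbours.
    """
    labels = np.asarray(labels)
    h, w = labels.shape
    regions = np.full((h, w), -1, dtype=int)  # -1 marks "not visited yet"
    next_id = 0
    for r0 in range(h):
        for c0 in range(w):
            if regions[r0, c0] != -1:
                continue
            # Breadth-first flood fill from an unvisited seed pixel.
            regions[r0, c0] = next_id
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < h and 0 <= cc < w
                            and regions[rr, cc] == -1
                            and labels[rr, cc] == labels[r0, c0]):
                        regions[rr, cc] = next_id
                        queue.append((rr, cc))
            next_id += 1
    return regions, next_id
```

Two pixels with the same cluster label end up in different regions when they are not spatially connected, which is why one Gaussian cluster can yield several blobs.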
Describe regions by color, texture, shape • Color • Color histogram within region • Quadratic distance: encode similarity between color bins: $d^2_{\text{hist}}(x, y) = (x - y)^T A (x - y)$ • Texture • Mean contrast and anisotropy (stripes vs. spots vs. smooth) • (Basic) Shape • Fourier descriptors of contour
Select appropriate scale for processing • Polarity: do all the gradient vectors point in the same direction? • Choose the scale where polarity stabilizes, so that the window includes approximately one period
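A polarity measure of this kind can be sketched as below. This is an illustrative assumption rather than the authors' implementation: the function name `polarity` is hypothetical, the dominant direction is passed in by the caller, and the gradients in the window are weighted equally. It measures the imbalance between gradient energy on the two sides of the dominant direction, giving 1 when all gradients point the same way and 0 when they cancel.

```python
import numpy as np


def polarity(gradients, direction):
    """Fraction of gradient energy that agrees in sign along `direction`.

    gradients: (n, 2) array of gradient vectors inside the window.
    direction: 2-vector giving the assumed dominant orientation.
    """
    g = np.asarray(gradients, dtype=float)
    n = np.asarray(direction, dtype=float)
    n = n / np.linalg.norm(n)
    proj = g @ n                      # signed component along the direction
    e_plus = proj[proj > 0].sum()     # energy on the positive side
    e_minus = -proj[proj < 0].sum()   # energy on the negative side
    total = e_plus + e_minus
    return 0.0 if total == 0 else abs(e_plus - e_minus) / total
```

To select a scale in the spirit of the slide, one would evaluate this over windows of increasing size and stop where the value stops changing.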
Initialize means using image data • Before, we picked random initialization • Now, choose initial means based on image tiles • Add noise to means and restart EM (4 runs per K) • Figure panels: K = 2, K = 3, K = 4, K = 5
Grouping: Expectation-Maximization • Given class characteristics (μ, Σ), find class membership • Given class membership, find class characteristics (μ, Σ) • Iterate: update (μ, Σ), update labels, update (μ, Σ), update labels, and so on
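The alternation on this slide can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the Blobworld implementation: the name `em_gmm`, the `init_mu` parameter, the isotropic starting covariance, and the small regulariser `eps` are all hypothetical choices made for the example.

```python
import numpy as np


def em_gmm(X, K, n_iter=50, init_mu=None, seed=0, eps=1e-6):
    """Fit a K-component Gaussian mixture to X (n, d) with plain EM.

    Returns (alpha, mu, sigma, labels, log_likelihood). The returned
    log-likelihood is the one evaluated in the final E-step.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    if init_mu is None:
        # Fallback: random data points as initial means.
        mu = X[rng.choice(n, size=K, replace=False)].copy()
    else:
        mu = np.array(init_mu, dtype=float)
    sigma = np.stack([np.eye(d) * X.var() for _ in range(K)])
    alpha = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: given (mu, sigma), find soft class membership.
        log_p = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            _, logdet = np.linalg.slogdet(sigma[k])
            maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
            log_p[:, k] = (np.log(alpha[k])
                           - 0.5 * (maha + logdet + d * np.log(2 * np.pi)))
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])

        # M-step: given membership, update weights, means, covariances.
        nk = resp.sum(axis=0)
        alpha = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / nk[k]
            sigma[k] += eps * np.eye(d)  # keep the covariance invertible

    return alpha, mu, sigma, resp.argmax(axis=1), float(log_norm.sum())
```

Passing `init_mu` is where tile-based initial means would be supplied; the hard labels returned are simply the most probable component for each feature vector.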
How many Gaussians? • Model selection: Minimum Description Length • Prefer fewer Gaussians if performance is comparable
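One common way to write a Minimum Description Length criterion for a Gaussian mixture is sketched below. The exact form used in Blobworld is not given on the slide, so the penalty term here (half the number of free parameters times log n) is an assumption, and the names `gmm_param_count`, `mdl_score`, and `choose_k` are hypothetical.

```python
import math


def gmm_param_count(K, d):
    """Free parameters of a K-component, d-dimensional full-covariance mixture."""
    weights = K - 1                    # mixing weights sum to one
    means = K * d
    covariances = K * d * (d + 1) // 2  # symmetric d x d matrices
    return weights + means + covariances


def mdl_score(log_likelihood, K, d, n):
    """Description length: lower is better."""
    return -log_likelihood + 0.5 * gmm_param_count(K, d) * math.log(n)


def choose_k(log_likelihoods, d, n):
    """Pick K minimising the score, given {K: log-likelihood of the fitted model}."""
    return min(log_likelihoods,
               key=lambda K: mdl_score(log_likelihoods[K], K, d, n))
```

Because the penalty grows with K, a larger model wins only when its log-likelihood improves by more than the added description cost, which matches the slide's preference for fewer Gaussians when performance is comparable.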
EM math • Probability density • Update equations
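The equations for this slide did not survive extraction. For reference, the standard textbook form of EM for a mixture of K Gaussians is given below. It is supplied as an assumption about what the slide showed, with notation (mixing weights α, means μ, covariances Σ, N feature vectors of dimension d) chosen here for illustration.

```latex
% Probability density of the mixture
\[
p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, f_k(x \mid \mu_k, \Sigma_k),
\qquad
f_k(x \mid \mu_k, \Sigma_k) =
  \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) \right)
\]

% Update equations
\[
\alpha_k^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}}),
\qquad
\mu_k^{\text{new}} =
  \frac{\sum_{i=1}^{N} x_i \, p(k \mid x_i, \Theta^{\text{old}})}
       {\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})}
\]
\[
\Sigma_k^{\text{new}} =
  \frac{\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})
        (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^{T}}
       {\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})}
\]

% where the posterior membership probability is
\[
p(k \mid x_i, \Theta) =
  \frac{\alpha_k \, f_k(x_i \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \alpha_j \, f_j(x_i \mid \mu_j, \Sigma_j)}
\]
```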
Encode similarity between color bins • Quadratic distance • Distance between histograms x and y: $d^2_{\text{hist}}(x, y) = (x - y)^T A (x - y)$ • $A_{ij}$ is based on the similarity between bins i and j • Neighboring bins have $A_{ij} = 0.5$
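The quadratic distance can be worked through on a toy histogram as follows. This is a sketch under simplifying assumptions: bins are laid out in one dimension so that "neighboring" means adjacent indices, the diagonal of A is set to 1, and non-neighboring bins get 0. The real color bins live in a three-dimensional color space, and the helper names are hypothetical.

```python
import numpy as np


def neighbor_similarity(n_bins, neighbor_weight=0.5):
    """Similarity matrix A: 1 on the diagonal, neighbor_weight for adjacent bins."""
    A = np.eye(n_bins)
    idx = np.arange(n_bins - 1)
    A[idx, idx + 1] = neighbor_weight
    A[idx + 1, idx] = neighbor_weight
    return A


def quadratic_distance_sq(x, y, A):
    """Squared quadratic-form distance (x - y)^T A (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)


A = neighbor_similarity(3)
near = quadratic_distance_sq([1, 0, 0], [0, 1, 0], A)  # mass moved to a neighboring bin
far = quadratic_distance_sq([1, 0, 0], [0, 0, 1], A)   # mass moved two bins away
```

With the identity matrix in place of A, both cases would give the same distance; the off-diagonal entries are what make a shift to a similar color count as a smaller change.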
Fourier descriptors for shape • [Zahn & Roskies ’72, Kuhl & Giardina ’82] • Find (x,y) representation of outer contour • Find Fourier series of (x,y) • Coefficients specify an ellipse (4 parameters): major axis, minor axis, orientation, starting point • Remove starting point ambiguity • Store first ten Fourier coefficients
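The idea can be illustrated with a simpler variant than the elliptic descriptors cited on the slide. The sketch below swaps in the complex-FFT formulation: the contour is treated as a sequence of complex numbers x + iy, and coefficient magnitudes are kept, which discards the phase that depends on the starting point. The function name and the choice to drop the zero-frequency term are assumptions for this example.

```python
import numpy as np


def fourier_descriptors(contour, n_coeffs=10):
    """Magnitudes of the first n_coeffs non-zero-frequency Fourier coefficients.

    contour: (n, 2) array of (x, y) points sampled along a closed outline.
    """
    pts = np.asarray(contour, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]       # contour as a complex signal
    coeffs = np.fft.fft(z) / len(z)
    # Index 0 is the centroid (translation); skip it. Taking magnitudes
    # removes the dependence on where along the contour sampling began.
    return np.abs(coeffs[1:n_coeffs + 1])


# Usage: an ellipse sampled at 64 points, and the same contour
# traced from a different starting point.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
ellipse = np.column_stack([3.0 * np.cos(t), 1.0 * np.sin(t)])
shifted_start = np.roll(ellipse, 7, axis=0)
```

Calling `fourier_descriptors` on `ellipse` and on `shifted_start` yields the same ten numbers up to floating-point error, which is the starting-point invariance the slide asks for.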
Creating and using Blobworld • Create: extract features, segment image, describe regions • Use: query
Querying: let user see the representation • Current systems are unsatisfying • User can’t see what the computer sees • Unclear how parameters relate to the image • User should interact with the representation • Helps in query formulation • Makes results understandable • Minimizes disappointment http://elib.cs.berkeley.edu/photos/blobworld
Query experiments • Collection of 10,000 Corel stock photos • Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes) • Compare Blobworld to global histogram queries • Precision (% of retrieved images that are correct) vs. Recall (% of correct images that are retrieved)
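The two measures defined on this slide can be computed as in the following sketch. The function name and the toy image identifiers are hypothetical; values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of image ids returned by the system.
    relevant:  iterable of image ids that are correct for the query.
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Usage: four images returned, two of the three correct ones among them.
p, r = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
```

Here `p` is 2/4 and `r` is 2/3. Sweeping the number of retrieved images and recomputing both values traces out a precision versus recall curve.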
Distinctive objects • Tigers, cheetahs, and zebras: • Blobworld does better than global histograms • Figure panels: cheetahs, zebras
Distinctive objects and backgrounds • Eagles and black bears: • Blobworld does better than global histograms • Figure panel: black bears
Distinctive scenes • Airplanes and brown bears: • Global histograms do better than Blobworld • But Blobworld has room to grow (shape, etc.) • Figure panel: airplanes