The INFOMINE project • Gordon Paynter, Infomine Lead Programmer, and the Infomine team: Steve Mitchell, Margaret Mooney, Julie Mason et al. at the University of California, Riverside
The Infomine Project • Introduction to Infomine • The core Infomine system • Automation: finding and describing resources • Collaboration: the Fiat Lux portals • Conclusions
Introduction to Infomine • Infomine is a virtual library • Infomine's goal is to provide organised access to the Internet in the same way that we do for printed works • Library catalogs focus on books and periodicals • Infomine focuses on web sites (mostly, now) • There are many differences between books and web sites
Books vs. Web sites • Books: • Easily-defined, physical objects • Static • Permanent • LC: 119 million items • Web sites: • What is a “web site” anyway? • Continually changing • Frequently disappear • Google: 2 billion pages
Books vs. Web sites • Books: • Limited number of publishers • Existing, coordinated cataloging effort • Text not usually electronically available • Web sites: • Anyone can publish • Few indexers: Infomine, LII, IPL, BUBL, MEL, Scout; all are post-hoc • Can be downloaded and processed
Simplifying the problem • Editorial standards: • Only select the best Web sites • Automated assistance: • Collection building • Automated and semi-automated resource description • Catalog maintenance • Wide collaboration • More contributors • Less redundant effort
The core Infomine system • Infomine for patrons • Behind the scenes: Infomine for content builders • Open source inputs: what the community gives us • Open source outputs: what we're distributing
Infomine core: open source inputs • The Linux operating system • Debian GNU/Linux • Infrastructure: • The Apache webserver • MySQL and Berkeley DB databases • Programming tools: • The GNU Compiler Collection (gcc) and libraries, Emacs • Common libraries
Infomine core: open source outputs • The Infomine general-purpose library • http://infomine.ucr.edu/iVia/ • The full libInfomine library • Available in August (once documentation is completed) • The full Infomine source • Available Fall 2002
Automation: finding and describing resources • Discovering new resources • The Infomine record builder • Extracting useful metadata • Automatically classifying records • Open source inputs • Open source outputs
Discovering new resources • The semi-automatic focused web crawler • You suggest a topic or search term • The crawler searches for web pages and clusters them • You identify useful clusters of documents (optional) • The crawler reports the top 20 hubs and authorities • You choose from the list of URLs • The automatic record builder helps generate metadata • The fully-automatic focused web crawler • Coming soon!
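The slides do not say how the crawler scores hubs and authorities. The terms come from Kleinberg's HITS algorithm, so the following is a minimal sketch assuming a HITS-style iteration over the crawled link graph. The function names, the example graph and the page names are hypothetical, not part of Infomine.

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    This is a generic HITS-style iteration, assumed for illustration.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth


def top_n(scores, n=20):
    """Return the n highest-scoring pages (ties broken by name)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:n]


# Hypothetical crawl result: two link lists pointing at two sites.
links = {
    "hubA": ["site1", "site2"],
    "hubB": ["site1", "site2"],
    "site1": [],
    "site2": ["site1"],
}
hub, auth = hits(links)
print("Top hubs:", top_n(hub, 2))
print("Top authorities:", top_n(auth, 2))
```

With `n=20` the two lists correspond to the "top 20 hubs and authorities" from which the user chooses URLs.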
The Infomine record builder • Input: a URL or list of URLs • From the focused crawler • From the record builder interface • The record builder creates a new record • Fully-automatic operation • The builder creates new records on its own • Semi-automatic operation • The builder interacts with you at each stage • Output: new records in the pending database
New research: LCSH assignment • Dr. Steve Jones, of the University of Waikato • Aim: assign LCSH based on document content • Use training data to build a model • Training data: documents with keyphrases and LCSH • Model: based on keyphrase and LCSH co-occurrence • Use model to assign LCSH to new documents • Extract keyphrases with Kea • Similarity measures identify the best LCSH
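The train-then-assign flow on this slide can be sketched roughly as follows. The slide does not give the similarity measure, so the scoring here (each keyphrase votes for the headings it co-occurred with, in proportion to how often) is an assumption, as are the function names and the toy training data. Keyphrase extraction (done with Kea in the project) is assumed to have happened already.

```python
from collections import Counter, defaultdict


def train(documents):
    """Build a keyphrase -> LCSH co-occurrence table.

    documents is an iterable of (keyphrases, lcsh_headings) pairs.
    """
    cooc = defaultdict(Counter)
    for keyphrases, headings in documents:
        for kp in keyphrases:
            for heading in headings:
                cooc[kp][heading] += 1
    return cooc


def assign(cooc, keyphrases, top=3):
    """Rank LCSH for a new document from its extracted keyphrases.

    Assumed similarity measure: each keyphrase distributes one vote
    across the headings it co-occurred with in the training data.
    """
    scores = Counter()
    for kp in keyphrases:
        counts = cooc.get(kp)
        if not counts:
            continue  # keyphrase never seen in training
        total = sum(counts.values())
        for heading, count in counts.items():
            scores[heading] += count / total
    return sorted(scores, key=lambda h: (-scores[h], h))[:top]


# Hypothetical training documents: (keyphrases, LCSH) pairs.
training = [
    (["bark beetle", "pine forest"], ["Forest insects", "Bark beetles"]),
    (["bark beetle", "larvae"], ["Bark beetles"]),
    (["rapeseed", "crop yield"], ["Brassica", "Crops"]),
]
model = train(training)
print(assign(model, ["bark beetle", "larvae"]))
```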
New research: LCSH assignment • forest insects • Listed terms: forest insects; bark beetles; borers (insects); tobacco hornworm; scolytidae; greenhouse whitefly; agriculture in literature; mountain pine beetle
New research: LCSH assignment • BRASSICA • CROPS • PLANT BREEDING • Listed terms: cruciferae; Buriats; brassica; phytophagous insects; plants, effect of metals on; blood groups in animals; rapeseed; hybridization, vegetable
New research: LCSH assignment • CLIMATOLOGY • ENVIRONMENTAL SCIENCES • POLLUTION • Listed terms: atmospheric chemistry; meteorology; continentality (meteorology); chemical oceanography; multidimensional chromatography; turbulent diffusion (meteorology); aerosols; precipitation scavenging
New research: LCC assignment • Dr. Eibe Frank, of the University of Waikato • Aim: assign LCC based on a set of LCSH • Infomine has LCSH but no LCC • Use with LCSH classifier for new documents • Use training data to build a model • Training data: documents with LCSH and LCC • Model: LCC-hierarchy of Support Vector Machines • Use model to assign LCC to new documents
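To illustrate what classifying against a hierarchy involves, here is a small sketch of top-down descent through an LCC-like tree. It is not the project's model: the per-node decision uses a simple LCSH-overlap count as a stand-in for a trained Support Vector Machine, and the `Node` class, the tiny tree and the training headings are all hypothetical.

```python
class Node:
    """One class in a (tiny, hypothetical) LCC-like hierarchy."""

    def __init__(self, label, children=None, seen_lcsh=()):
        self.label = label
        self.children = children or []
        # LCSH observed directly under this class in training data.
        self.seen_lcsh = set(seen_lcsh)

    def all_lcsh(self):
        """LCSH seen at this class or anywhere below it."""
        found = set(self.seen_lcsh)
        for child in self.children:
            found |= child.all_lcsh()
        return found


def score(node, headings):
    # Stand-in for an SVM decision value: count of shared headings.
    return len(node.all_lcsh() & set(headings))


def classify(root, headings):
    """Descend from the root, picking the best child at each level.

    Stops early when no child matches, which yields a more general class.
    """
    path, node = [], root
    while node.children:
        best = max(node.children, key=lambda c: score(c, headings))
        if score(best, headings) == 0:
            break
        path.append(best.label)
        node = best
    return path


tree = Node("LCC", [
    Node("Q Science", [
        Node("QA Mathematics", [Node("QA1-43 General", seen_lcsh=["Mathematics"])],
             seen_lcsh=["Algebra"]),
        Node("QC Physics", seen_lcsh=["Meteorology"]),
    ]),
    Node("S Agriculture", [
        Node("SB Plant culture", seen_lcsh=["Plant breeding", "Crops"]),
    ]),
])
print(classify(tree, ["Crops", "Brassica"]))
```

A set of LCSH that matches nothing in the tree returns an empty path in this sketch; how the real classifier handles that case is not stated on the slide.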
New research: LCC assignment • Performance (preliminary) • Absolute accuracy around 58% (pleasing) • Also: 4% are too specific, 3% too general • Top-level accuracy around 80% • What to do if we encounter completely new LCSH? • QA1-43: Science > Mathematics > General
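One way to compute figures of this kind is sketched below. It assumes each LCC class is written as a path from the top of the hierarchy, that "too specific" means the prediction lies below the correct class, and that "too general" means it lies above it. These readings, the function names and the sample pairs are assumptions, not taken from the evaluation itself.

```python
from collections import Counter


def compare(predicted, actual):
    """Classify one prediction against the correct class path."""
    if predicted == actual:
        return "exact"
    if predicted[:len(actual)] == actual:
        return "too specific"   # prediction is below the correct class
    if actual[:len(predicted)] == predicted:
        return "too general"    # prediction is above the correct class
    return "wrong"


def summarise(pairs):
    """Return the fraction of predictions in each outcome category."""
    pairs = list(pairs)
    counts = Counter(compare(p, a) for p, a in pairs)
    result = {k: counts[k] / len(pairs)
              for k in ("exact", "too specific", "too general", "wrong")}
    # Top-level accuracy: only the first component has to match.
    result["top-level"] = sum(
        1 for p, a in pairs if p and p[:1] == a[:1]) / len(pairs)
    return result


# Hypothetical (predicted, actual) pairs.
pairs = [
    (("Q", "QA", "QA1-43"), ("Q", "QA", "QA1-43")),
    (("Q", "QA", "QA1-43"), ("Q", "QA")),
    (("Q",), ("Q", "QC")),
    (("S", "SB"), ("Q", "QK")),
]
print(summarise(pairs))
```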
Automation: open source inputs • General and C++ tools • Linux, Apache, gcc, flex, curl, etc. • Java tools • The Java MARC Events (James) toolkit • The Waikato Environment for Knowledge Analysis (WEKA) machine learning toolkit • The Kea keyphrase extraction program
Automation: open source outputs • LCSHtoLCC: LCC assignment • http://infomine.ucr.edu/iVia/ • KPtoLCSH: LCSH assignment • Available August • PhraseRate: keyphrase extractor • http://infomine.ucr.edu/iVia/ • Artur's Automatic Annotator • Available in Fall 2002 (with Infomine)
Collaboration: the Fiat Lux Portals • Fiat Lux • Advantages of collaboration • MyI: Research guides and pathfinders • Themes: co-branding for collaborators • Open standards, protocols and source code • Challenges of collaboration
Fiat Lux • Established at ALA Midwinter 2002 • Prominent, librarian-built, public portals: • BUBL, Infomine, IPL, lii.org, MEL, VRL • Goal: resource sharing through collaboration • Fiat Lux represents: • 170 librarians • 100,000 records • 30 million searches/year
Advantages of collaboration • Greater sustainability and scalability • Reduced redundant effort • Shared cataloging effort • More resources cataloged • Everyone gets a bigger (better) dataset • Shared systems development • Scalability of systems • Preserving institutional identity
Themes: co-branding through iVia • Co-branding for institutional cooperators • Many data views can be “themed” • The data is the same • The appearance is altered • http://infomine.ucr.edu/cgi-bin/canned_search?query=tree&theme=wfu
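The example URL shows the theme being selected with a `theme` query parameter alongside the `query` parameter. A small sketch of building such URLs follows. Only the endpoint and the two parameter names come from the slide; the helper name is hypothetical, and what the server does when `theme` is omitted is an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL on the slide.
BASE = "http://infomine.ucr.edu/cgi-bin/canned_search"


def themed_search_url(query, theme=None):
    """Build a canned-search URL, optionally with a collaborator's theme."""
    params = {"query": query}
    if theme:
        # Same data, different appearance: only the theme name changes.
        params["theme"] = theme
    return BASE + "?" + urlencode(params)


print(themed_search_url("tree", "wfu"))
print(themed_search_url("tree"))  # assumed to fall back to the default look
```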
MyI: custom collections • Create research guides / pathfinders • Create a “MyI category” • Add records to categories in the record editor • “Batch add” to your category • Create searches for your records • Examples: • CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM... • UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD • UDM-edu459
Challenges of collaboration • Investigating lii.org integration: • Granularity of metadata • Different editorial processes • Collection focus and audience level • Scholarly vs. K-12 vs. public library • How do you merge duplicate records? • LCSH, keywords: easy to combine • Annotation: not sure yet • These are editorial rather than technical issues
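The duplicate-record question can be made concrete with a short sketch. It assumes records are simple dictionaries with hypothetical field names (`url`, `lcsh`, `keywords`, `annotation`). Set-like fields are combined by union, while annotations are kept side by side for an editor to resolve, since the slide leaves that question open.

```python
def merge_records(a, b):
    """Merge two records that describe the same resource.

    Field names are hypothetical; this is not the Infomine schema.
    """
    if a["url"] != b["url"]:
        raise ValueError("records describe different resources")
    merged = {"url": a["url"]}
    for field in ("lcsh", "keywords"):
        # Easy case: union of both lists, order preserved, no duplicates.
        merged[field] = list(dict.fromkeys(a.get(field, []) + b.get(field, [])))
    # Open case: keep every annotation and defer the choice to an editor.
    merged["annotations"] = [
        r["annotation"] for r in (a, b) if r.get("annotation")
    ]
    return merged


# Hypothetical duplicate records from two collaborating portals.
record_a = {
    "url": "http://example.org/beetles",
    "lcsh": ["Forest insects", "Bark beetles"],
    "keywords": ["beetles", "forestry"],
    "annotation": "Scholarly overview of bark beetle research.",
}
record_b = {
    "url": "http://example.org/beetles",
    "lcsh": ["Bark beetles"],
    "keywords": ["forestry", "pests"],
    "annotation": "Introductory guide to forest pests.",
}
print(merge_records(record_a, record_b))
```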