The Indexer’s Legacy: Promoting Access to a Million Books

The Indexer’s Legacy: Promoting Access to a Million Books Michael Huggett Edie Rasmussen ICDL 2010

Problem statement Background to study Indexers and Indexes From Print to Digital Book Collections Searching Digital Collections Research Project Pilot Study Phase I: Building the Collection Phase II: Deconstructing the Indexes Phase III: Building a meta-index Phase IV: Index-augmented search Overview

Project Gutenberg (1971+) Million Book Project (2002+) Universal Digital Library Google Books Library Project (2004+) Open Content Alliance (2005+) Universal Digital Library, Internet Archive And many others... Digital Book Projects

Combination of ‘dirty OCR’ of text plus page image Standard IR techniques: query leads to relevance-ranked output Text-level vs. passage-level retrieval (e.g. INEX Book Track) Adequate for many purposes Problems with heterogeneity of text, ambiguity of terms Searching Digital Collections

The “million books problem” “…the human life contains only about 30,000 days; reading a book a day we would finish a million books only after 30 lifetimes of reading…No longer a distant probability, a digital representation of [the vast written record of humanity] is taking shape before us…” “digitization does provide scale (or quantity) but does so at the price of rich, largely manual encoding” (Many More Than A Million, 2007) Problem Statement I

Role of indexes: the index is one of the oldest known information retrieval devices, representing a network of interrelationships among concepts in a text Intellectual effort: an index represents hours of interpretation and analysis Intellectual content: includes information about a book’s content but also incorporates the structure of knowledge in a given field Standard information retrieval techniques reduce index terms (and all text terms) to a ‘bag of words’ model Problem Statement II

As we move from print to digital collections of scholarly works, how can we retain, extract and use the knowledge that is embedded in the indexes? The goal of this research is to develop techniques that will help to capture, visualize and access the world’s digital knowledge through application of text processing techniques to digital indexes of legacy materials Research Goal

Read → identify indexable concepts → (mark) → create vocabulary → invert? → sort and format (s/ware) → add cross references → edit for consistency Reduces contents of a book to its essentials (5–10%) Vocabulary is author’s plus indexer’s Goal is to facilitate access to material in the text The Indexing Process

Premises: The index identifies the most significant topics in the book The index expresses the topics in the author’s vocabulary and in the vocabulary of the field (i.e. that of the reader) The index provides links between concepts, showing how they are related As indexes on a topic are aggregated, significant concepts related to that topic, and the relationships between them are reinforced, creating both a vocabulary and a guide to the collection Knowledge in Indexes

Not all books are indexed Indexing conventions have changed over time Books in public domain are older; quality of index may be lower Quality of OCR, errors in text No markup; index structure is indicated visually (e.g. indents, punctuation) Matching page numbers in index to physical pages in text Challenges

‘key ideas’ (Schilit and Kolak, Google Research, 2008) Mining and linking ideas in digital books Quotation extraction (quote plus context) ‘Searching in a book’ (Liesaputra, Witten and Bainbridge, NZDL, 2009) E-book usability with indexes (Noorhidawata, 2007) Reorganizing indexes (Chi et al., 2004) Creating mini-indexes ‘on the fly’ Related Research

Work on a small number of digital items 3 biographies of Charles Darwin 12 books on BC history from UBC University Press Software to parse indexes From PDF to index structures Operator driven: scan and correct OCR errors; key indicators in database Parse index terms and entries by shared references Identify common words on shared page references Pilot Study I
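One way to picture "entries by shared references" is to invert a parsed index so that each page number points back to the headings that cite it. The sketch below is illustrative only: the function names and the sample data are hypothetical, and it assumes the index has already been parsed into a mapping from heading to page numbers. It is not the pilot study's software.

```python
from collections import defaultdict

def entries_by_page(index):
    """Invert {heading: [pages]} into {page: sorted list of headings}."""
    pages = defaultdict(set)
    for heading, refs in index.items():
        for page in refs:
            pages[page].add(heading)
    return {page: sorted(headings) for page, headings in pages.items()}

def shared_pages(index):
    """Keep only the pages that more than one index entry refers to."""
    return {p: h for p, h in entries_by_page(index).items() if len(h) > 1}

# Hypothetical index fragment (invented for illustration, not pilot data).
sample = {
    "Beagle, voyage of": [12, 13, 40],
    "Galapagos Islands": [13, 41],
    "finches": [13, 40],
}
print(shared_pages(sample))
```

Pages cited by several entries are candidates for the "common words on shared page references" step, since the text on those pages is where the linked concepts meet.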

Preliminary results: Measure of coherence Rank terms by frequency and normalize Deviation = ∑(average rank – term rank) Calculated for content, index entries, index words Calculated for all terms, and for shared terms only Pilot Study II
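The slide gives the deviation only in outline, so the following is a sketch under stated assumptions rather than the measure actually used. It assumes that terms are ranked by frequency within each book, that ranks are normalized by vocabulary size, that the average rank is taken across the books containing the term, and that differences are summed as absolute values (a signed sum around the mean would cancel to zero). The function names are hypothetical.

```python
from collections import Counter

def normalized_ranks(tokens):
    """Rank terms by frequency (1 = most frequent), scaled into (0, 1]."""
    counts = Counter(tokens)
    # Descending count, then alphabetical, so ties get a stable order.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    n = len(ordered)
    return {term: (i + 1) / n for i, term in enumerate(ordered)}

def deviation(books, shared_only=False):
    """Sum of |average rank - term rank| over terms and books (assumed form)."""
    ranks = [normalized_ranks(tokens) for tokens in books]
    vocab = set().union(*ranks)
    if shared_only:
        # Restrict to terms that occur in every book.
        vocab = {t for t in vocab if all(t in r for r in ranks)}
    total = 0.0
    for term in vocab:
        observed = [r[term] for r in ranks if term in r]
        mean = sum(observed) / len(observed)
        total += sum(abs(mean - x) for x in observed)
    return total
```

Under these assumptions a lower total means the books rank their terms more alike, which is one plausible reading of "coherence". The same function can be applied to content words, index entries, or index words by changing what is passed in as tokens.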

Preliminary results: Index terms show more coherence than corpus terms Suggests that back-of-book (BoB) indexes are a good source of corpus-level keywords Pilot Study III

Needed: General collection Collection of 1000 books With indexes! In the public domain Topic-oriented collections (5-6?) Collections of 100(?) books in a topic GRAs to identify and download target books Result: a test collection for this project (and others) Phase I: Building a Test Collection

No BoB indexing standards No controlled vocabulary A few indexing conventions Headings, subheadings, sub-subheadings... Structure is indicated by spacing and punctuation Need to parse the index to identify entries and page references Parsing software written and tested Phase II: Deconstructing the Indexes
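Since structure is carried by spacing and punctuation, a parser has to infer hierarchy from indentation and pull page references off the end of each line. The sketch below assumes an indented-style index with two spaces per level and comma-separated page numbers or ranges at the end of a line. Those conventions, the regular expressions, and the function names are assumptions for illustration, not a description of the project's parsing software.

```python
import re

# Trailing references such as ", 12, 45-47" (hyphen or en dash in ranges).
REFS = re.compile(r",\s*(\d+(?:[-\u2013]\d+)?(?:,\s*\d+(?:[-\u2013]\d+)?)*)\s*$")
ONE_REF = re.compile(r"(\d+)(?:[-\u2013](\d+))?")

def parse_refs(text):
    """Turn '12, 45-47' into [(12, 12), (45, 47)]."""
    ranges = []
    for part in re.split(r",\s*", text):
        m = ONE_REF.fullmatch(part)
        ranges.append((int(m.group(1)), int(m.group(2) or m.group(1))))
    return ranges

def parse_index(lines, indent=2):
    """Yield (heading_path, page_ranges) for each non-blank index line."""
    path = []
    for line in lines:
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        body = line.strip()
        m = REFS.search(body)
        heading = body[:m.start()].strip() if m else body
        refs = parse_refs(m.group(1)) if m else []
        # Drop deeper levels left over from earlier lines, then add this heading.
        path = path[:depth] + [heading]
        yield tuple(path), refs
```

Run-on indexes, "see" and "see also" cross references, and headings that themselves end in numbers would all need extra handling, which is part of why operator correction was used in the pilot.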

How can index structure (run-on or indented, heading hierarchies) be extracted? Can keyphrases be extracted (proper nouns, concepts)? What are the syntax and semantics of indexes? Can we identify the historical development of indexes? How have they changed over time? Can we use XML to create a useful intermediate product? Phase II: Research Questions

Meta-index: a digital collection-level aggregation of the BoB indexes for a digital collection Merging/concatenating index entries May be a standard index format (alphabetical, hierarchical entries), i.e., a digital browsable index Or may use new formats, e.g. visualizations, topic maps Phase III: Building a Meta-Index
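As a minimal sketch of the merging step, the code below folds several parsed book indexes into one alphabetical structure in which each heading lists the books and pages that mention it. The data layout, the crude case-and-whitespace normalization, and the function names are all assumptions made for illustration. Matching headings across books would need something stronger in practice.

```python
from collections import defaultdict

def normalize(heading):
    """Fold case and whitespace; a stand-in for real heading matching."""
    return " ".join(heading.lower().split())

def build_meta_index(book_indexes):
    """Merge {book_id: {heading: [pages]}} into {heading: {book_id: [pages]}}."""
    meta = defaultdict(dict)
    for book_id, index in book_indexes.items():
        for heading, pages in index.items():
            key = normalize(heading)
            merged = set(meta[key].get(book_id, [])) | set(pages)
            meta[key][book_id] = sorted(merged)
    # Alphabetical order gives the conventional browsable form.
    return dict(sorted(meta.items()))
```

The number of books listed under a heading gives a simple signal of which concepts the collection reinforces, which a visualization or topic map could build on.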

[Diagram with labels: BOOK INDEX, META-INDEX 1, DIGITAL COLLECTION, META-INDEX 2]

Can digital versions of BoB indexes be used to facilitate access to digital collections? What form should these indexes take? Conventional index format (alphabetical/searchable with headings and subheadings) Index visualization How do these meta-indexes compare to a standard search engine when searching a digital collection? Evaluation: task-oriented evaluation with human subjects (e.g. Humanities scholars) Phase III: Research Questions

Using the index information in new ways Building an ontology in domain areas Identifying concept relationships between index vocabulary and term vocabulary Use for: query expansion, question answering, summarization, categorization Phase IV: Index Augmented Search

Based on standard text processing procedures, i.e. stemming, use of stopwords, keyphrase extraction, term weighting such as tf*idf or BM25 How strong is the relationship between the index entry and the words on the page(s) referred to? Assume that for a single entry, this relationship is weak; over multiple similar entries in many books, do real relationships emerge and false ones disappear? Evaluation: using external collections, e.g. TREC or INEX, to measure contribution of index term relationships to retrieval performance Phase IV: Research Questions
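The question of whether real relationships emerge across many books can be illustrated with a simple count of book support: for each index entry, how many books place a given word on a page that the entry points to. This is a deliberately simpler stand-in for tf*idf or BM25 weighting, and the stopword list, tokenization, data layout, and function names are assumptions for illustration. It also assumes entries have already been normalized to a common form across books.

```python
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real run would use a standard one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "on", "was", "his"}

def page_words(text):
    """Lowercased, punctuation-stripped words on a page, minus stopwords."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def book_support(books):
    """Count, per index entry, how many books link each page word to it.

    books: list of (index, pages) pairs, where index is
    {entry: [page numbers]} and pages is {page number: page text}.
    """
    support = defaultdict(Counter)
    for index, pages in books:
        for entry, refs in index.items():
            seen = set()
            for ref in refs:
                seen |= page_words(pages.get(ref, ""))
            support[entry].update(seen)  # each book votes once per word
    return support
```

Words supported by only one book are candidates for the "false" relationships the slide mentions, while words whose support grows with the number of books are candidates for real ones. Whether that pattern holds is the research question, not something this sketch establishes.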

Building themed or personalized collections (using index for book similarity measures) Ability to mine large multidisciplinary collections for references (historical, economic, etc.) Ability to mine collections and build special-format indexes and browsers (e.g. images, figures) Changes in topics over time, evolution of thinking on a subject Knowledge discovery: detecting previously undiscovered links between topics Further Research

The Indexer’s Legacy… (a) an archaic addendum to an obsolete medium? OR (b) value-added knowledge in electronic text that enhances access to digital collections?

Thank you!
