Increasing the Information Density in Digital Library Results

IndoUS Workshop
Gio Wiederhold, Stanford University
22 June 2003

Map courtesy of Univ. of Pittsburgh DL Project

Attention is the issue.
"What information consumes is rather obvious; it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." [Herb Simon]
Complementary objective: Don't waste the attention afforded to information.

My focus: Science & business use (I ignore for now the artistic aspects, which are also important)
Required by customer:
• knowledge to process information, and
• tools to facilitate that process
  • Locate
  • Select
  • Articulate, not Integrate
  • Summarize
  • Project - exploit data mining

Technologies to filter Information
Survey of Technologies from common to rare
• Ranking
• Eliminate redundancy
• Assure novelty
• Abstraction
• Data mining
• Reduction for visual presentation
• Modeling
• Prediction
• Finding Abnormal Events

1. Ranking
Assumption: The consumer only considers a few documents at the top of the list.
• Ranking by authority
  • Select sites that are valued in a context:
    • a journal versus a workshop report,
    • a recent document.
• Ranking by reference authority
  • recursive value by references to it (Google)
  • extracts global communal knowledge
• Rank by customer's context
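The "recursive value by references" idea can be sketched as a PageRank-style power iteration. This is a simplified illustration, not Google's actual implementation: the function name, the damping value of 0.85, and the three sample documents are all assumptions made for the example.

```python
def reference_authority(links, damping=0.85, iterations=50):
    """Score documents by the references they receive, recursively.

    links maps each document to the list of documents it references.
    The damping factor and iteration count are illustrative assumptions.
    """
    docs = set(links) | {d for targets in links.values() for d in targets}
    n = len(docs)
    score = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document passes its value on to the documents it references.
                share = damping * score[d] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A document with no references spreads its value evenly.
                for t in docs:
                    new[t] += damping * score[d] / n
        score = new
    return score


# Hypothetical collection: the journal paper is referenced by both others.
links = {
    "workshop_report": ["journal_paper", "survey"],
    "survey": ["journal_paper"],
    "journal_paper": [],
}
scores = reference_authority(links)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because a document's score depends on the scores of the documents referencing it, the value is recursive: a reference from a highly valued source counts for more than one from an obscure source.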

2. Eliminate redundancy
• If similar documents are retrieved
  • present the latest one
  • present the highest ranked one, per a suitable criterion, i.e., user's context.
• Only report differences among documents
  • look for additional material
  • decide what are significant differences
  • abstract differences (see later)
  • show differences in layout only as maps
  • compute metric of difference
  • deal with many documents
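One possible "metric of difference" is the Jaccard distance over word shingles, combined with the "present the latest one" rule. The sketch below is only an assumption about how this could be done; the shingle size, the 0.5 threshold, and the sample documents are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences occurring in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def difference(a, b, k=3):
    """Jaccard distance: 0.0 for identical shingle sets, 1.0 for disjoint ones."""
    sa, sb = shingles(a, k), shingles(b, k)
    return 1.0 - len(sa & sb) / len(sa | sb)


def drop_redundant(docs, threshold=0.5):
    """docs is a list of (iso_date, text); keep the latest of each similar group."""
    kept = []
    for date, text in sorted(docs, reverse=True):  # newest first
        if all(difference(text, other) > threshold for _, other in kept):
            kept.append((date, text))
    return kept


docs = [
    ("2003-01-10", "digital libraries need better ranking of results for users"),
    ("2003-06-22", "digital libraries need better ranking of results for all users"),
    ("2003-03-05", "metabolic models explain drug effects on organisms"),
]
kept = drop_redundant(docs)
kept_dates = [date for date, _ in kept]
```

The January document differs from the June one by a single word, so only the later version survives; the unrelated March document is kept.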

3. Value is in the Novelty
• Information relative to a document collection
  • Exploits prior technologies
• Information relative to a customer
  • What is the knowledge held by an individual?
  • Can it be captured?
Domain recognition to determine context
Avoid (unsolvable?) problem of `common knowledge'

4. Abstraction
• Only present essentials of textual documents
  • Domain-independent abstraction: selecting sentences that appear to represent the contents
  • Domain-specific text can be effectively abstracted
    • pathology reports -- being done
    • automatic annotation of gene sequences from papers
• Abstracting contents of document collections
  • Classify
  • Differentiate (2.b)
  • Integrate
    • Semantic matching if the sources are autonomous
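Domain-independent abstraction by sentence selection can be illustrated with a word-frequency heuristic: sentences whose content words are frequent in the document are assumed to represent it. This is one common approach, not necessarily the one the slide has in mind; the stopword list, function name, and sample text are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "that", "it", "for", "on", "with"}


def content_words(sentence):
    """Lowercased alphabetic words of the sentence, without stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def select_sentences(text, count=1):
    """Return the `count` highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average document frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:count]
    return [s for s in sentences if s in top]


text = ("Digital libraries hold many documents. "
        "Ranking helps users find documents in digital libraries. "
        "The weather was pleasant.")
abstract = select_sentences(text, count=1)
```

Domain-specific abstraction, as for pathology reports, would replace the frequency score with knowledge about which fields matter in that domain.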

5. Data mining
Out of scope for digital library research, but
• Linking data-mining results with information from textual sources strengthens users' explanatory capabilities.
• Data mining develops models that can be further exploited

6. Reduction for Visualization
Motivated by modern customers’ settings
• Reduce numeric data for visual presentation
  • Common
  • Can be automated, but rarely done well
• Reduce textual information into visuals
  Requires
  • Abstraction
  • Placing the result into some model, i.e., temporal or spatial aspects:
    • Progress notes for a patient – disease model
    • Description of an exploratory journey – attach to a map
    • Progress of a scientific project – versus proposal

7. Modeling
Models of a domain allow analysis & manipulation
• to discern novelty
• representation of normal behavior
  • corporate finances from 10-K
  • ecological processes, and global change
  • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.

8. Prediction
Current information technologies {databases, data mining, digital libraries} provide only background information for decision-making.
Today: decision maker
• copies results into a spreadsheet
• adds formulas to make extrapolations into the future
• Continue model scenarios into the possible futures
  • Investments - monetary, personnel, research, . . .
  • Probabilities of outcomes etc.
• Allow comparison of alternatives
Information systems should not terminate their support with the past, but should also extrapolate the results with the models used for analysis.
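The spreadsheet step described above can be sketched in a few lines: fit a trend to past results, extrapolate it, and weight scenario outcomes by their probabilities so that alternatives can be compared. The linear trend, the multiplicative scenario factors, and all numbers are assumptions chosen for illustration.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; needs >= 2 values.

    Returns (slope, intercept).
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_outcome(history, scenarios, periods_ahead):
    """Extrapolate the trend, then weight scenario factors by probability.

    scenarios is a list of (probability, factor applied to the baseline).
    """
    slope, intercept = fit_trend(history)
    baseline = intercept + slope * (len(history) - 1 + periods_ahead)
    return sum(p * baseline * factor for p, factor in scenarios)


history = [100, 110, 120, 130]           # hypothetical past results
alternatives = {
    "invest": [(0.6, 1.2), (0.4, 0.9)],  # 60% chance of +20%, 40% of -10%
    "hold": [(1.0, 1.0)],                # trend simply continues
}
outcomes = {name: expected_outcome(history, s, periods_ahead=2)
            for name, s in alternatives.items()}
best = max(outcomes, key=outcomes.get)
```

Here the trend extrapolates to 150 two periods ahead, so "hold" yields 150 and "invest" an expected 162. An information system that already holds the model used for analysis could produce such comparisons directly.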

9. Finding Abnormal Events
• A hard challenge is discovering abnormal situations.
  • I.e., looking for terrorists.
Note: observables are the effect of many good and a few bad scenarios
• Traditional data mining finds frequent relationships
  • abduct the processes that generate those data
  • serves marketing folk
• Intelligence tasks seek unusual or abnormal behavior
  1. Use model based on recent incidents
     • flight-school enrollments of terrorists
  2. Create and use a reasonable, but hypothetical model
     • shipping containers can carry nuclear devices into the US
  3. Create a model of normal findings

9+. Create & exploit a normal state model
Prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified
• Populate an initial model with normal findings
• Coverage: all likely causes of some observable(s)
• Identify variation not due to known causes
• Temporal tracking is better than static schemes
• Increase coverage as needed - feedback to b.
• Maintain models to recognize unexplainables
Such models will be large since observed data are the aggregate of activities from many domains, e.g., travel patterns: business, holidays, family visits, emergencies.
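A minimal sketch of "quantifying normality": populate a model with normal findings, measure how far a new observation deviates from them, and keep updating the model over time. The class name, the three-standard-deviation threshold, and the sample counts are assumptions; a real model would cover many causes and observables.

```python
from statistics import mean, stdev


class NormalStateModel:
    """Tracks one observable and flags values far from its normal range."""

    def __init__(self, normal_findings, threshold=3.0):
        self.findings = list(normal_findings)  # needs at least two values
        self.threshold = threshold             # in standard deviations

    def deviation(self, value):
        """Distance of value from the mean, in standard deviations."""
        mu = mean(self.findings)
        sigma = stdev(self.findings)
        if sigma == 0:
            return 0.0 if value == mu else float("inf")
        return abs(value - mu) / sigma

    def observe(self, value):
        """Return True if abnormal; otherwise absorb the value as normal."""
        abnormal = self.deviation(value) > self.threshold
        if not abnormal:
            # Temporal tracking: the model follows slow drifts in normality.
            self.findings.append(value)
        return abnormal


# Hypothetical daily counts of travelers on one route.
model = NormalStateModel([100, 104, 98, 101, 97, 103, 99])
ordinary_day = model.observe(102)
unusual_day = model.observe(140)
```

A value flagged here is only "unexplained by the model", which may mean that a known cause, such as a holiday, still has to be added to increase coverage.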

Benefits
A `business model' for justifying ongoing DL research is needed [Y.T. Chien]
• A business model includes benefits and costs
• Benefits:
  • Broad access to knowledge
  • Education of the next generation
  • Preservation of cultural heritage
  • Mutual, inter-cultural understanding, reduction of conflicts
  • Improved decision-making
• Costs
  • Time and money spent on information systems
    • Technology
    • Contents
  • Time spent on obtaining the information
  • Time spent on analyzing the information
  • Due to errors
[Callout: Focus of prior slides]

Cost of Errors -- balance
Type 1 errors: Omitted relevant information
• Lost opportunities
• Unperceived risk
• Suboptimal choices
• Cost: f (variance)
  • High if the variance is high
  • Low if the variance is low (purchasing)
Type 2 errors: Excess irrelevant information
• Overload
• Inability to analyze all
• Risk of being misled
• Cost: delay, human
  • High if excess is high, human time is valuable
  • Low if precision is high
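The balance between the two error types can be made concrete with a toy cost model: choose how many ranked results to present so that the cost of omitted relevant items (Type 1) plus the cost of reviewing irrelevant ones (Type 2) is smallest. The linear costs, the function name, and the sample ranking are assumptions for illustration only.

```python
def best_cutoff(ranked_relevance, miss_cost, review_cost):
    """Find the number of results to present that minimizes total error cost.

    ranked_relevance: booleans in ranked order (True = relevant).
    miss_cost: assumed cost of each omitted relevant item (Type 1).
    review_cost: assumed cost of each irrelevant item shown (Type 2).
    Returns (cutoff, total_cost).
    """
    total_relevant = sum(ranked_relevance)
    best_k, best_cost = 0, total_relevant * miss_cost  # presenting nothing
    found = 0
    for k, relevant in enumerate(ranked_relevance, start=1):
        found += relevant
        excess = k - found
        cost = (total_relevant - found) * miss_cost + excess * review_cost
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


ranked = [True, True, False, True, False, False, False, True]
costly_misses = best_cutoff(ranked, miss_cost=5, review_cost=1)
cheap_misses = best_cutoff(ranked, miss_cost=2, review_cost=1)
```

When an omission is expensive, the best choice is to present all eight results; when it is cheap, cutting the list after the fourth result saves the reader's attention.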

Exploiting Information
[Diagram labels: Data and their relationships; knowledge; Decision; Action 1, Action 2, Action 3; Effects]
Decision: Has not been an explicit focus of DL research. It is the point that generates benefits.

The Major Feedback Loop
[Diagram labels: user exploiting communities; distilled knowledge, categorized; computer scientists, will provide tools; knowledge; user contributing communities; human knowledge, validated by data]

Conclusion
• Much work is left to be done with digital libraries
• Exploiting the results will motivate more investment
  • In technology
  • In content breadth and depth
• Customer's expectations will change
  • Global access is here
  • Heterogeneity will remain, cause errors
  • Ubiquitous access is near

Optional discussion points
• Interfaces
• Personalization
• Heterogeneity
• Computer scientists and their customers
• Data versus relationships
• Disruptive factors

New user interface settings
The new generation
• is more comfortable with screen displays
• can navigate to analyses, backup
• considers paper to be heavy and awkward
• is poor in handwriting and spelling
• is facile in brief keyboard messages
• expects simple voice command technology

Personalization, 2 models
• Everything about an individual
  • learn all about the individual
  • slow, delayed, lags
• An individual as a member of groups
  • learn about the likely memberships {8th grader, ...; carpenter, ...; opera goer, ...; ...}
  • learn and assign knowledge to group
  • inherit knowledge collected in those groups
  • leads, so that the individual also benefits
[Callout: Context]
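The group-membership model can be sketched as simple inheritance: what is known to a person's groups is assumed to be known to the person, so only the remaining items are novel. The group names follow the slide; the knowledge items, function names, and candidate list are hypothetical.

```python
def inherited_knowledge(memberships, group_knowledge):
    """Union of the knowledge assigned to each group the individual belongs to."""
    known = set()
    for group in memberships:
        known |= group_knowledge.get(group, set())
    return known


def novel_items(candidates, memberships, group_knowledge):
    """Keep only candidates not already covered by inherited group knowledge."""
    known = inherited_knowledge(memberships, group_knowledge)
    return [c for c in candidates if c not in known]


# Hypothetical knowledge collected per group.
group_knowledge = {
    "8th grader": {"fractions", "photosynthesis"},
    "carpenter": {"joinery", "fractions"},
    "opera goer": {"Verdi"},
}
memberships = ["carpenter", "opera goer"]
candidates = ["joinery", "metabolic models", "Verdi", "photosynthesis"]
novel = novel_items(candidates, memberships, group_knowledge)
```

Nothing has to be learned about the individual beyond the likely memberships, which is why this model can lead rather than lag.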

Interoperation/Interoperability
• Heterogeneity is a fact, and attempts at enforcing consistency are misguided
• natural consistency will be an outcome of collaboration

Data and their relationships
• Data are verifiable first-order objects
  • observable
  • automatic acquisition is common
• Relationships are also first-order objects
  • defined by metadata in context {schemas, references, dependencies, is-as, causality, ...}
  • Hard to discover
  • Instances verifiable in contexts
  • Needed for exploitation

Customers and Computer Scientists
• Mutual arrogance fed by misunderstandings
• Differing scientific paradigms
  • Mathematical: formal, definite
  • Social, biological: case-based, indefinite

Disruptive factors [June 03 NSF meet]
• Technologies
  • ubiquitous access
  • community empowerment
  • data & semantics contribution
  • Machine translation of modest quality
• Sociological
  • Imposed privacy constraints
  • TIA reactions national/international
  • Commercial pressures - skimming the cream

Roadblocks [Y.T. Chien]
• lack of a business model
• matching technology to user needs
• define a research pipeline [NAS HPCC report]
