Learn techniques to enhance information processing and extraction using advanced technologies such as data mining, visualization, modeling, and prediction. Discover methods to eliminate redundancy, ensure novelty, and find abnormal events.
Increasing the Information Density in Digital Library Results IndoUS Workshop Gio Wiederhold Stanford University 22 June 2003
Map Courtesy of Univ. of Pittsburgh DL Project IndoUS
Attention is the issue. "What information consumes is rather obvious; it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." [Herb Simon] Complementary objective: Don't waste the attention afforded to information
My focus: Science & business use (I set aside the artistic aspects for now, though they are also important) Required by customer: • knowledge to process information, and • tools to facilitate that process • Locate • Select • Articulate, not Integrate • Summarize • Project - exploit data mining
Technologies to filter Information Survey of Technologies from common to rare • Ranking • Eliminate redundancy • Assure novelty • Abstraction • Data mining • Reduction for visual presentation • Modeling • Prediction • Finding Abnormal Events
1. Ranking Assumption: The consumer only considers a few documents on the top of the list. • Ranking by authority. • Select sites that are valued in a context, • a journal versus a workshop report, • a recent document. • Ranking by reference authority • recursive value by references to it (Google) • extracts global communal knowledge • Rank by customer's context
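The "recursive value by references" idea can be sketched as a PageRank-style iteration. This is a minimal illustration, not the method used by any particular search engine; the function name `reference_authority`, the damping factor 0.85, and the sample citation graph are all assumptions made for the example.

```python
def reference_authority(links, damping=0.85, iterations=50):
    """Score documents by the references they receive, recursively.

    links maps each document to the list of documents it references.
    (Hypothetical helper; damping and iteration count are assumed defaults.)
    """
    docs = set(links) | {t for targets in links.values() for t in targets}
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Every document keeps a small baseline score.
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document passes its value on to what it references.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document with no references spreads its value evenly.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Made-up citation graph for illustration only.
citations = {
    "survey": ["paper_a", "paper_b"],
    "paper_a": ["paper_b"],
    "paper_b": [],
    "workshop_note": ["paper_b"],
}
ranks = reference_authority(citations)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

Only the first few entries of `ranking` would be shown to the consumer, in line with the assumption stated on the slide.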
2. Eliminate redundancy • If similar documents are retrieved • present the latest one • present the highest ranked one, per a suitable criterion, i.e., the user's context. • Only report differences among documents • look for additional material • decide what are significant differences • abstract differences (see later) • show differences in layout only as maps • compute a metric of difference • deal with many documents
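A minimal sketch of two of these steps: presenting only the latest of a group of similar documents, and reporting only the differences between two versions. Word-set Jaccard similarity and the 0.6 threshold are assumptions chosen for illustration; the slide does not name a similarity measure. The function names and sample texts are hypothetical.

```python
import difflib


def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def drop_redundant(docs, threshold=0.6):
    """docs is a list of (iso_date, text); keep the latest of each similar group."""
    kept = []
    for date, text in sorted(docs, reverse=True):  # newest first
        if all(jaccard(text, kept_text) < threshold for _, kept_text in kept):
            kept.append((date, text))
    return kept


def differences(old, new):
    """Report only the words removed ('- ') or added ('+ ') between two texts."""
    diff = difflib.ndiff(old.split(), new.split())
    return [d for d in diff if d.startswith(("+ ", "- "))]


# Made-up documents for illustration only.
docs = [
    ("2003-01-10", "the library indexes ten thousand maps of India"),
    ("2003-06-01", "the library indexes twelve thousand maps of India"),
    ("2003-03-15", "metabolic models explain drug effects"),
]
latest = drop_redundant(docs)
changes = differences(docs[0][1], docs[1][1])
```

Here `jaccard` also serves as a crude difference metric; deciding which differences are significant would still depend on the user's context.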
3. Value is in the Novelty • Information relative to a document collection • Exploits prior technologies • Information relative to a customer. • What is the knowledge held by an individual? • Can it be captured? Domain recognition to determine context Avoid (unsolvable?) problem of 'common knowledge'
4. Abstraction • Only present essentials of textual documents • Domain-independent abstraction selecting sentences that appear to represent the contents; • Domain-specific text can be effectively abstracted • pathology reports -- being done • automatic annotation of gene-sequences from papers. • Abstracting contents of document collections • Classify • Differentiate (2.b) • Integrate • Semantic matching if the sources are autonomous
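Domain-independent abstraction by sentence selection can be sketched with a simple word-frequency score: sentences whose words recur across the document are taken to represent its contents. This is one classic heuristic among many, used here as an assumption; the stopword list, the scoring rule, and the function name `abstract` are illustrative only.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "that", "it", "with", "as", "by"}


def content_words(sentence):
    """Lowercased words of a sentence, without stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def abstract(text, max_sentences=2):
    """Select the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:max_sentences]
    # Keep the selected sentences in their original order.
    return [sentences[i] for i in sorted(best)]


# Made-up text for illustration only.
text = ("Digital libraries hold many documents. "
        "Ranking documents helps users of digital libraries. "
        "The weather was pleasant.")
summary = abstract(text, max_sentences=2)
```

Domain-specific abstraction, such as for pathology reports, would replace the frequency score with knowledge of the domain's vocabulary and structure.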
5. Data mining Out of scope for digital library research, but • Linking data-mining results with information from textual sources strengthens users' explanatory capabilities. • Data-mining develops models that can be further exploited
6. Reduction for Visualization Motivated by modern customers’ settings • Reduce numeric data for visual presentation • Common • Can be automated, but rarely done well • Reduce textual information into visuals Requires • Abstraction • Placing the result into some model, i.e., temporal or spatial aspects: • Progress notes for a patient – disease model • Description of an exploratory journey – attach to a map • Progress of a scientific project – versus proposal
7. Modeling Models of a domain allow analysis & manipulation • to discern novelty • representation of normal behavior • corporate finances from 10-K • ecological processes, and global change • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.
8. Prediction Current information technologies { databases, data-mining, digital libraries } provide only background information for decision-making Today: decision maker • copies results into a spreadsheet • adds formulas to make extrapolations into the future • Continue model scenarios into possible futures • Investments - monetary, personnel, research, . . . • Probabilities of outcomes etc. • Allow comparison of alternatives Information systems should not end their support with the past, but should also extrapolate the results using the models used for analysis
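What the decision maker does by hand in a spreadsheet can be sketched as follows: fit a trend to past results, continue it into the future, and compare alternatives. A least-squares straight line stands in for "the models used for analysis" purely as an assumption; the function names and the two investment histories are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, future_xs):
    """Continue the fitted trend to future points."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in future_xs]


# Hypothetical yearly returns for two alternatives.
years = [2000, 2001, 2002, 2003]
alternatives = {
    "investment_a": [10.0, 12.0, 14.0, 16.0],
    "investment_b": [10.0, 11.0, 13.0, 14.0],
}
future = [2004, 2005]
scenarios = {name: extrapolate(years, history, future)
             for name, history in alternatives.items()}
# Compare the alternatives by their projected value in the final year.
best = max(scenarios, key=lambda name: scenarios[name][-1])
```

A fuller version would attach probabilities to each outcome rather than a single projected value, as the slide suggests.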
9. Finding Abnormal Events • A hard challenge is discovering abnormal situations. • E.g., looking for terrorists. Note: observables are the effect of many good and a few bad scenarios • Traditional data-mining finds frequent relationships • abducting the processes that generate those data serves marketing folk • Intelligence tasks seek unusual or abnormal behavior 1. Use a model based on recent incidents, • flight-school enrollments of terrorists 2. Create and use a reasonable, but hypothetical model • shipping containers can carry nuclear devices into the US 3. Create a model of normal findings
9+. Create & exploit a normal state model Prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified • Populate an initial model with normal findings • Coverage: all likely causes of some observable(s) • Identify variation not due to known causes • Temporal tracking is better than static schemes • Increase coverage as needed - feedback to b. • Maintain models to recognize unexplainables Such models will be large, since observed data are the aggregate of activities from many domains, e.g., travel patterns: business, holidays, family visits, and emergencies.
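A minimal sketch of quantifying normality: populate a model with normal findings per known cause, then flag observations that the model does not explain. Mean and standard deviation with a three-sigma threshold are an assumed, deliberately simple stand-in for a real model; the function names, cause labels, and counts are hypothetical.

```python
from statistics import mean, stdev


def build_normal_model(history):
    """history maps each known cause to a list of normal observations."""
    return {cause: (mean(values), stdev(values))
            for cause, values in history.items()}


def unexplained(model, observations, threshold=3.0):
    """Return observations the normal model cannot account for."""
    flagged = []
    for cause, value in observations:
        if cause not in model:
            # Coverage gap: feed back into the model with new normal findings.
            flagged.append((cause, value, "no model coverage"))
            continue
        mu, sigma = model[cause]
        if sigma == 0:
            deviates = value != mu
        else:
            deviates = abs(value - mu) / sigma > threshold
        if deviates:
            flagged.append((cause, value, "outside normal range"))
    return flagged


# Made-up weekly traveler counts per travel purpose.
history = {
    "business": [100, 104, 98, 102, 96],
    "holiday": [40, 44, 38, 42, 36],
}
model = build_normal_model(history)
alerts = unexplained(model, [("business", 101), ("holiday", 90), ("emergency", 5)])
```

A static mean per cause ignores the slide's point that temporal tracking works better; a fuller sketch would keep a baseline per period (season, weekday) and update it over time.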
Benefits A 'business model' for justifying ongoing DL research is needed [Y.T. Chien] • A business model includes benefits and costs • Benefits: • Broad access to knowledge • Education of the next generation • Preservation of cultural heritage • Mutual, inter-cultural understanding, reduction of conflicts • Improved decision-making • Costs • Time and money spent on information systems • Technology • Contents • Time spent on obtaining the information • Time spent on analyzing the information • Due to errors Focus of prior slides
Cost of Errors -- balance
Type 1 errors: Omitted relevant information • Lost opportunities • Unperceived risk • Suboptimal choices • Cost: f (variance) • High if variance is high • Low if variance is low • purchasing
Type 2 errors: Excess irrelevant information • Overload • Inability to analyze all • Risk of being misled • Cost: delay, human • High if excess is high, human time is valuable • Low if precision is high
Exploiting Information [Diagram labels: Data and their relationships; knowledge; Decision; Action 1, Action 2, Action 3; Effects] Decision: Has not been an explicit focus of DL research. It is the point that generates benefits.
The Major Feedback Loop [Diagram labels: user exploiting communities; distilled knowledge, categorized; computer scientists, will provide tools; knowledge; user contributing communities; human knowledge, validated by data]
Conclusion • Much work is left to be done with digital libraries • Exploiting the results will motivate more investment • In technology • In content breadth and depth • Customer's expectations will change • Global access is here • Heterogeneity will remain, cause errors • Ubiquitous access is near
Optional discussion points • Interfaces • Personalization • Heterogeneity • Computer scientists and their customers • Data versus relationships • Disruptive factors
New user interface settings The new generation • is more comfortable with screen displays • can navigate to analyses, backup • considers paper to be heavy and awkward • is poor in handwriting and spelling • is facile in brief keyboard messages • expects simple voice command technology
Personalization, 2 models • Everything about an individual • learn all about the individual • slow, delayed, lags • An individual as a member of groups • learn about the likely memberships { 8th grader, ...; carpenter, ...; opera goer, ...; ... } • learn and assign knowledge to group • inherit knowledge collected in those groups • leads, so that the individual also benefits Context
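The second model can be sketched as simple inheritance: knowledge is assigned to groups, and an individual is assumed to know whatever their groups know. Results already covered by that inherited knowledge are then not novel to the individual. The group names follow the slide; the topics, function names, and the set-based representation of knowledge are assumptions made for illustration.

```python
def inherited_knowledge(memberships, group_knowledge):
    """Union of the knowledge assigned to each group the individual belongs to."""
    known = set()
    for group in memberships:
        known |= group_knowledge.get(group, set())
    return known


def novel_items(items, memberships, group_knowledge):
    """Keep only the items not already covered by inherited group knowledge."""
    known = inherited_knowledge(memberships, group_knowledge)
    return [item for item in items if item not in known]


# Hypothetical knowledge assigned to groups.
group_knowledge = {
    "8th grader": {"fractions", "photosynthesis"},
    "carpenter": {"joinery", "wood grain"},
    "opera goer": {"libretto", "aria"},
}
memberships = ["carpenter", "opera goer"]
results = ["joinery", "aria", "metabolic models", "fractions"]
to_present = novel_items(results, memberships, group_knowledge)
```

Because knowledge is learned once per group rather than once per person, a new individual benefits as soon as their likely memberships are recognized.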
Interoperation/Interoperability • Heterogeneity is a fact, and attempts at enforcing consistency are misguided • natural consistency will be an outcome of collaboration.
Data and their relationships • Data are verifiable first-order objects • observable • automatic acquisition is common • Relationships are also first-order objects • defined by metadata in context { schemas, references, dependencies, is-a, causality, ... } • Hard to discover • Instances verifiable in contexts • Needed for exploitation
Customers and Computer Scientists • Mutual arrogance fed by misunderstandings • Differing scientific paradigms • Mathematical: formal, definite • Social, biological: case-based, indefinite
Disruptive factors [June 03 NSF meet] • Technologies • ubiquitous access • community empowerment • data & semantics contribution • Machine translation of modest quality • Sociological • Imposed privacy constraints • TIA reactions national/international • Commercial pressures - • skimming the cream
Roadblocks [Y.T. Chien] • lack of a business model • matching technology to user needs • define a research pipeline [NAS HPCC report]