
Increasing the Information Density in Digital Library Results

Learn techniques to enhance information processing and extraction using advanced technologies such as data mining, visualization, modeling, and prediction. Discover methods to eliminate redundancy, ensure novelty, and find abnormal events.




Presentation Transcript


  1. Increasing the Information Density in Digital Library Results. IndoUS Workshop, Gio Wiederhold, Stanford University, 22 June 2003.

  2. [Map courtesy of the Univ. of Pittsburgh DL Project.]

  3. Attention is the issue. "What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." [Herbert Simon]
  Complementary objective: don't waste the attention afforded to information.

  4. My focus: science and business use. (I ignore for now the artistic aspects, also important.)
  Required by the customer:
  • knowledge to process information, and
  • tools to facilitate that process:
    • locate
    • select
    • articulate, not integrate
    • summarize
    • project: exploit data mining

  5. Technologies to Filter Information
  A survey of technologies, from common to rare:
  • Ranking
  • Eliminate redundancy
  • Assure novelty
  • Abstraction
  • Data mining
  • Reduction for visual presentation
  • Modeling
  • Prediction
  • Finding abnormal events

  6. 1. Ranking
  Assumption: the consumer considers only the few documents at the top of the list.
  • Ranking by authority: select sites that are valued in a context,
    • a journal versus a workshop report,
    • a recent document.
  • Ranking by reference authority:
    • recursive value from the references to it (Google),
    • extracts global communal knowledge.
  • Ranking by the customer's context.
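Ranking by reference authority, as the slide sketches for Google, can be illustrated with a minimal power-iteration over a toy citation graph; the document names, damping factor, and iteration count below are illustrative, not from the talk.

```python
# Minimal sketch of recursive reference authority (PageRank-style).
# links maps each document to the documents it cites.

def reference_rank(links, damping=0.85, iterations=50):
    docs = list(links)
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}           # start uniformly
    for _ in range(iterations):
        new = {d: (1 - damping) / n for d in docs}
        for d, cited in links.items():
            if cited:
                share = damping * rank[d] / len(cited)
                for c in cited:                 # pass rank to cited docs
                    new[c] += share
            else:                               # dangling node: spread evenly
                for c in docs:
                    new[c] += damping * rank[d] / n
        rank = new
    return rank

citations = {"survey": ["seminal"],
             "workshop": ["seminal", "survey"],
             "seminal": []}
ranks = reference_rank(citations)
# The much-cited "seminal" paper outranks the workshop report.
```

A journal paper accumulating many references thus rises above a workshop report, matching the slide's "recursive value" idea.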

  7. 2. Eliminate Redundancy
  • If similar documents are retrieved:
    • present the latest one, or
    • present the highest-ranked one, per a suitable criterion, i.e., the user's context.
  • Only report the differences among documents:
    • look for additional material,
    • decide which differences are significant,
    • abstract the differences (see later),
    • show differences that are in layout only as maps,
    • compute a metric of difference,
    • deal with many documents.
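One way to "compute a metric of difference" and suppress near-duplicates is word-level Jaccard similarity against the already-kept results; the 0.6 threshold and sample texts below are illustrative assumptions.

```python
# Sketch of redundancy elimination: keep a document only if it is
# sufficiently different from every document already kept.

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def deduplicate(docs, threshold=0.6):
    kept = []
    for doc in docs:            # docs assumed ordered best-ranked first
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

results = [
    "digital library ranking improves retrieval",
    "digital library ranking improves retrieval quality",  # near-duplicate
    "metabolic models explain drug effects",
]
unique = deduplicate(results)   # the near-duplicate is dropped
```

Because the list is scanned in rank order, the highest-ranked member of each redundant cluster survives, as the slide prescribes.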

  8. 3. Value Is in the Novelty
  • Information relative to a document collection:
    • exploits the prior technologies.
  • Information relative to a customer:
    • what knowledge does the individual hold?
    • can it be captured?
  • Domain recognition to determine context.
  • Avoids the (unsolvable?) problem of `common knowledge'.
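Novelty relative to what a customer has already seen can be scored as one minus the best word overlap with any prior document; the scoring function and sample texts are an illustrative assumption, not the talk's method.

```python
# Sketch of novelty scoring: 1 - best Jaccard overlap with seen documents.

def novelty(doc, seen):
    words = set(doc.lower().split())
    if not seen:
        return 1.0                      # everything is novel at first
    best = max(len(words & set(s.lower().split())) /
               len(words | set(s.lower().split())) for s in seen)
    return 1.0 - best

already_read = ["data mining finds frequent patterns"]
repeat_score = novelty("data mining finds frequent patterns", already_read)
fresh_score = novelty("metabolic models explain drug effects", already_read)
```

A verbatim repeat scores 0.0 and a disjoint document scores 1.0, so results can be re-ranked to favor what the customer does not yet know.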

  9. 4. Abstraction
  • Present only the essentials of textual documents:
    • domain-independent abstraction selects sentences that appear to represent the contents;
    • domain-specific text can be abstracted effectively:
      • pathology reports (being done),
      • automatic annotation of gene sequences from papers.
  • Abstracting the contents of document collections:
    • classify,
    • differentiate (2.b),
    • integrate,
    • semantic matching if the sources are autonomous.
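Domain-independent abstraction by "selecting sentences that appear to represent the contents" can be sketched by scoring each sentence on the document-wide frequency of its content words; the stopword list and sample text are illustrative.

```python
# Sketch of extractive abstraction: keep the sentences whose words are
# most frequent across the whole document.
from collections import Counter

STOP = {"the", "a", "of", "and", "is", "to", "in"}

def abstract(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences
                   for w in s.lower().split() if w not in STOP)
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()
                                      if w not in STOP),
                    reverse=True)
    return scored[:n_sentences]

text = ("Gene annotation speeds research. "
        "Automatic gene annotation extracts gene names from papers. "
        "The weather was fine.")
summary = abstract(text)   # picks the sentence dense in central terms
```

The off-topic sentence scores lowest and drops out, which is the essence of frequency-based sentence selection.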

  10. 5. Data Mining
  Out of scope for digital-library research, but:
  • linking data-mining results with information from textual sources strengthens users' explanatory capabilities;
  • data mining develops models that can be exploited further.

  11. 6. Reduction for Visualization
  Motivated by modern customers' settings.
  • Reduce numeric data for visual presentation:
    • common;
    • can be automated, but is rarely done well.
  • Reduce textual information into visuals. Requires:
    • abstraction,
    • placing the result into some model, i.e., temporal or spatial aspects:
      • progress notes for a patient: a disease model,
      • description of an exploratory journey: attach it to a map,
      • progress of a scientific project: versus its proposal.
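Reducing numeric data for visual presentation often means aggregating a long series into the handful of points a display can carry; this binning sketch, with illustrative bin count and data, shows the simplest such reduction.

```python
# Sketch of numeric reduction for display: average a long series into
# a fixed number of bins, one displayable point per bin.

def reduce_for_display(values, n_bins=4):
    size = max(1, len(values) // n_bins)
    return [sum(values[i:i + size]) / len(values[i:i + size])
            for i in range(0, len(values), size)]

daily_readings = list(range(16))            # 16 raw measurements
summary = reduce_for_display(daily_readings)  # 4 plottable points
```

The reduction is trivially automatable; doing it *well*, as the slide notes, means also choosing bin boundaries that respect the underlying temporal or spatial model.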

  12. 7. Modeling
  Models of a domain allow analysis and manipulation:
  • to discern novelty,
  • to represent normal behavior:
    • corporate finances from 10-K filings,
    • ecological processes and global change,
    • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.

  13. 8. Prediction
  Current information technologies { databases, data mining, digital libraries } provide only background information for decision-making. Today the decision maker:
  • copies results into a spreadsheet,
  • adds formulas to make extrapolations into the future,
  • continues model scenarios into the possible futures:
    • investments: monetary, personnel, research, . . .
    • probabilities of outcomes, etc.,
  • compares the alternatives.
  Information systems should not terminate their support with the past, but also extrapolate the results with the models used for analysis.
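The spreadsheet-style extrapolation the slide describes amounts to fitting a trend to past observations and carrying it forward; this least-squares sketch uses illustrative revenue figures.

```python
# Sketch of extrapolation: fit a least-squares line to past values
# (at x = 0, 1, 2, ...) and project the next period.

def fit_line(ys):
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx            # slope, intercept

revenue = [10.0, 12.0, 14.0, 16.0]           # four past quarters
slope, intercept = fit_line(revenue)
forecast = slope * 4 + intercept             # project quarter at index 4
```

An information system that returns `slope` and `forecast` alongside the raw results supports the decision, not just the lookup.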

  14. 9. Finding Abnormal Events
  • A hard challenge is discovering abnormal situations, e.g., looking for terrorists. Note: the observables are the effect of many good and a few bad scenarios.
  • Traditional data mining finds frequent relationships:
    • abducting the processes that generate those data serves the marketing folk.
  • Intelligence tasks seek unusual or abnormal behavior:
    1. Use a model based on recent incidents:
       • flight-school enrollments of terrorists.
    2. Create and use a reasonable but hypothetical model:
       • shipping containers can carry nuclear devices into the US.
    3. Create a model of normal findings.

  15. 9+. Create and Exploit a Normal-State Model
  A prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified.
  • Populate an initial model with normal findings.
  • Coverage: all likely causes of some observable(s).
  • Identify variation not due to known causes.
  • Temporal tracking is better than static schemes.
  • Increase coverage as needed; feed back to b.
  • Maintain the models so as to recognize unexplainables.
  Such models will be large, since the observed data are the aggregate of activities from many domains; travel patterns alone mix business, holidays, family visits, and emergencies.
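The normal-state idea can be sketched in its simplest statistical form: fit the model to normal observations, then flag anything too many standard deviations away; the z-score threshold and traffic figures are illustrative assumptions.

```python
# Sketch of a normal-state model: quantify normality as (mean, stdev)
# over normal observations, then flag large deviations as abnormal.
from statistics import mean, stdev

def build_normal_model(observations):
    return mean(observations), stdev(observations)

def is_abnormal(value, model, z_threshold=3.0):
    mu, sigma = model
    return abs(value - mu) > z_threshold * sigma

normal_traffic = [100, 98, 103, 101, 99, 102, 100, 97]  # normal findings
model = build_normal_model(normal_traffic)
```

A real model would be far larger and temporally tracked, as the slide says, but the principle is the same: only by quantifying normality can an unexplainable reading be recognized.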

  16. Benefits
  A `business model' for justifying ongoing DL research is needed [Y.T. Chien]. A business model includes benefits and costs.
  • Benefits:
    • broad access to knowledge,
    • education of the next generation,
    • preservation of cultural heritage,
    • mutual, inter-cultural understanding and reduction of conflicts,
    • improved decision-making.
  • Costs:
    • time and money spent on information systems: technology and contents,
    • time spent on obtaining the information,
    • time spent on analyzing the information,
    • costs due to errors.
  (Focus of the prior slides.)

  17. Type 1 errors Omitted relevant information Lost opportunities Unperceived risk Suboptimal choices Cost: f (variance) High if  is high Low if  is low purchasing Type 2 errors Excess irrelevant information Overload Inability to analyze all Risk of being misled Cost: delay, human High if excess is high human time is valuable Low if precision is high Cost of Errors -- balance  IndoUS

  18. Exploiting Information
  [Diagram: data and their relationships feed knowledge, which drives a decision among alternative actions and their effects.]
  Exploitation has not been an explicit focus of DL research, yet it is the point that generates the benefits.

  19. The Major Feedback Loop
  [Diagram: user-contributing communities supply human knowledge, validated by data; computer scientists provide the tools; user-exploiting communities receive distilled, categorized knowledge.]

  20. Conclusion
  • Much work is left to be done with digital libraries.
  • Exploiting the results will motivate more investment:
    • in technology,
    • in content breadth and depth.
  • Customers' expectations will change:
    • global access is here,
    • heterogeneity will remain and cause errors,
    • ubiquitous access is near.

  21. Optional Discussion Points
  • Interfaces
  • Personalization
  • Heterogeneity
  • Computer scientists and their customers
  • Data versus relationships
  • Disruptive factors

  22. New User-Interface Settings
  The new generation:
  • is more comfortable with screen displays,
  • can navigate to analyses and backup material,
  • considers paper heavy and awkward,
  • is poor in handwriting and spelling,
  • is facile with brief keyboard messages,
  • expects simple voice-command technology.

  23. Personalization: Two Models
  • Everything about an individual:
    • learn all about the individual;
    • slow, delayed, lags.
  • An individual as a member of groups (context):
    • learn the likely memberships { 8th grader, ...; carpenter, ...; opera goer, ...; ... },
    • learn and assign knowledge to each group,
    • inherit the knowledge collected in those groups,
    • so that the individual also benefits.
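The group-membership model can be sketched as profile inheritance: an individual's interest profile is merged from the groups they likely belong to, so knowledge learned from the whole group benefits each member immediately. The group names and topic weights below are illustrative assumptions.

```python
# Sketch of group-based personalization: inherit interest weights from
# likely group memberships instead of learning a profile from scratch.

GROUP_KNOWLEDGE = {
    "carpenter": {"tools": 0.9, "wood": 0.8},
    "opera goer": {"music": 0.9, "theater": 0.7},
}

def inherited_profile(memberships):
    """Merge group profiles, keeping the strongest weight per topic."""
    profile = {}
    for group in memberships:
        for topic, weight in GROUP_KNOWLEDGE.get(group, {}).items():
            profile[topic] = max(profile.get(topic, 0.0), weight)
    return profile

profile = inherited_profile(["carpenter", "opera goer"])
```

This avoids the slow, lagging cold start of the per-individual model: a new user gets a usable profile as soon as plausible memberships are assigned.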

  24. Interoperation/Interoperability
  • Heterogeneity is a fact, and attempts at enforcing consistency are misguided.
  • Natural consistency will be an outcome of collaboration.

  25. Data and Their Relationships
  • Data are verifiable first-order objects:
    • observable;
    • automatic acquisition is common.
  • Relationships are also first-order objects:
    • defined by metadata in context { schemas, references, dependencies, is-a's, causality, ... };
    • hard to discover;
    • instances are verifiable in context;
    • needed for exploitation.

  26. Customers and Computer Scientists
  • Mutual arrogance, fed by misunderstandings.
  • Differing scientific paradigms:
    • mathematical: formal, definite;
    • social, biological: case-based, indefinite.

  27. Disruptive Factors [June 2003 NSF meeting]
  • Technologies:
    • ubiquitous access,
    • community empowerment,
    • contribution of data and semantics,
    • machine translation of modest quality.
  • Sociological:
    • imposed privacy constraints,
    • TIA reactions, national and international,
    • commercial pressures: skimming the cream.

  28. Roadblocks [Y.T. Chien]
  • Lack of a business model.
  • Matching technology to user needs.
  • Defining a research pipeline [NAS HPCC report].
