An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature
Walter J. Trybula, Ph.D., IEEE Fellow
Ronald E. Wyllys, Ph.D.
ASIS 2000 – Chicago, Illinois
November 14, 2000

Introduction
• Data volume is growing, and sources of information are becoming more diverse.
• This information needs to be evaluated.
• Some tools claim to be able to find information in textbases.
• An investigation of existing tools would provide a measure of their ability.
• If such tools work, it might be possible to discover new knowledge.¹
¹ Described by Swanson as "Undiscovered Public Knowledge."
w.trybula@ieee.org

Objective/Goals
• Provide a means of testing the existing instruments to determine their ability to “find” knowledge.
• Determine whether any of these instruments provide useful insight into the data.
• Evaluate the findings of domain experts to determine whether the instruments are helpful.
• Develop recommendations based on the results of the experiments.

Overview of Process
• Selected a technical area with known commonality (lithography masks).
• Collected the most recent reports available.
• Compiled the reports into a textbase for analysis by text mining tools.
• Had domain experts evaluate the results.
• Analyzed their conclusions and drew recommendations for future directions.

Example of Commonality [figure]

Selection of Information
• Information was collected from leading researchers:
  • Asian efforts on X-ray technology.
  • U.S. efforts on X-ray technology.
  • European efforts on Ion Projection Lithography.
  • U.S. efforts on Electron Projection Lithography.
  • U.S. efforts on Extreme Ultraviolet technology.
• The data consisted of each group's annual update on technology progress, provided for yearly review.
• All reports, presentations, and data were assembled into a single textbase for analysis.

Sources of Data
U.S., Europe, Asia
• Concerns:
  • Language
  • Terminology
  • Program (format)

Text Mining Tools
• Selected three types of text mining instruments available for desktop operation:
  • Key terms identified with pointers to the text
  • Excerpt presentation format
  • Hierarchical tree-structure presentation
• Did not include Self-Organizing Maps (SOMs).
• Included a search engine (AltaVista) for baseline evaluation of the results.

Text Mining Tools: tool that returns key terms [figure]
Text Mining Tools: tool that returns excerpts [figure]
Text Mining Tools: tool that returns a hierarchy [figure]

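To make the first tool type concrete ("key terms identified with pointers to text"), here is a minimal Python sketch. It is not the method of any tool evaluated in the study, none of which are named in these slides. The stop-word list, the ranking rule (distinct documents first, then total occurrences), and the three sample reports are all invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real tool would use a much larger one.
STOP_WORDS = {"the", "an", "of", "for", "and", "is", "are", "in", "on", "to", "with"}

# Invented sample textbase; not taken from the study's data.
TEXTBASE = {
    "report_a": "Mask defects limit x-ray lithography. Mask repair is slow.",
    "report_b": "Electron projection lithography needs a thin mask membrane.",
    "report_c": "Ion projection lithography uses a stencil mask.",
}


def build_term_index(textbase):
    """Map each term to a list of (doc_id, char_offset) pointers into the text."""
    index = defaultdict(list)
    for doc_id, text in textbase.items():
        # Words of two or more characters; hyphens are allowed inside a word.
        for match in re.finditer(r"[a-z][a-z\-]+", text.lower()):
            term = match.group()
            if term not in STOP_WORDS:
                index[term].append((doc_id, match.start()))
    return index


def key_terms(index, top_n=3):
    """Rank terms by distinct documents, then total occurrences, then alphabetically."""
    def rank(term):
        pointers = index[term]
        distinct_docs = len({doc_id for doc_id, _ in pointers})
        return (-distinct_docs, -len(pointers), term)

    return sorted(index, key=rank)[:top_n]


index = build_term_index(TEXTBASE)
for term in key_terms(index):
    print(term, index[term])
```

Each pointer is a document identifier plus a character offset, so a reader can jump from a key term straight to its context. As the later slides note, deciding which returned terms actually matter still falls to a domain expert.
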
Results
• No method provided any novel results. There was some difficulty with mixed-format documents.
• Domain experts were required to evaluate the output and determine the importance of the delivered information.
• Graphical presentation of information was preferred over simple text.
• The search engine provided many pointers to occurrences of search terms.
• There was no evidence that this approach provided any novel knowledge.

Conclusions
• Text mining instruments are in a developmental stage and need refinement to be more useful.
• Text mining instruments must be able to handle data in various formats, e.g., documents, spreadsheets, and presentations.
• Without a defined goal of what data will be delivered, there is no commonality among the various instruments.
• Experts had difficulty retrieving information that was known to be present, because of the methodology used to evaluate information in the textbase.
• There must be a cohesive direction provided for the development of these instruments.

Future Directions – Information Needs (Recommendations)
• An instrument that evaluates the text in the textbase and provides an accurate representation of the information contained therein.
• An instrument that provides this information in a manner that can be accurately and quickly evaluated by the intended user.
• An instrument that draws the best elements from existing work and provides information based on proven methodologies. (In rapidly evolving technologies, efforts in one area may ignore developments in others. This is not acceptable.)

Data Mining Process (Recommendations)
Start with existing methodology.

Text Mining Process (Recommendations)
Develop new methodology from existing ones.

Future Directions – Instrument Needs (Recommendations)
• There needs to be a cohesive direction for future work. Existing development must draw on the knowledge developed in the Library Science field.
• Text mining functionality can be built from data mining. A key concern will remain the method of presenting the results.
• There needs to be some agreement on the purpose of text mining instruments:
  • What is the purpose of “mining” text?
  • What kind of user will there be?
  • What is the anticipated outcome?
• Consider applying the latest software developments for information sharing, e.g., Groove and Napster.

Challenges (Recommendations)
• Establish a “goal” for the results of text mining. What will be accomplished?
• Drive toward widespread application, e.g., desktop and handheld applications.
• Incorporate the latest hardware developments, e.g., distributed and parallel processing and wireless communications.
• Deliver what the intended user needs.
• Don’t reinvent the “wheel.”
• Have the Library Science, Information Science, and Computer Science communities work together.

Acknowledgements
• Dean Brooke Sheldon, Sanda Erdelez, Mary Lynn Rice-Lively (GSLIS, University of Texas at Austin).
• John Konopka of IBM.
• The International SEMATECH team, including Scott Mackay, Mark Mason, Phil Seidel, David Stark.
• The various technology champions for their efforts in providing the latest technology information.