CSC 9010: Text Mining Applications Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851
So What Next? • Evaluating systems • Systems available • Some good resources
Evaluating Text Mining Systems • There are dozens of text mining tools and systems available • commercial • open source • research • How do you decide which to use?
Determine Information Need • First step: what are you trying to find out? • Locate a specific piece of information? • Locate and capture a large amount of specific information? • Locate a specific document? • Get the gist of one or more documents? • Organize documents into groups? • Find out something about the overall domain which is reflected in a set of documents? • ???
Determine Environment • What operating system? • What document formats? • ASCII or something richer? • What level of software maturity? • COTS, with support available, maybe already tuned for your specific problem • Open source or other fairly stable • Research tool • What is the cost justification?
Thinking About Information Needs • How specific is your need? • How much do you know already? • How big a corpus? How well-defined? • One-time question or continuing? • Incremental or episodic?
Information Extraction Tools Extract specific information, probably from a large number of documents. • What's the typical precision and recall? • KB info: • What entities are already defined? • How easy is it to add enumerated lists? • How easy is it to add patterns? • What document formats does it accept? • Performance?
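One way to put numbers on the precision and recall question is to run a candidate tool over a small hand-annotated sample and compare its output with the answer key. The sketch below assumes the tool's output and the answer key can both be reduced to sets of strings; the function name and the sample entities are made up for illustration and do not come from any particular product.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: what fraction of what the tool returned is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct answers the tool found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample: entities a tool extracted vs. a hand-built answer key.
extracted = {"GlaxoSmithKline", "IBM", "Sheffield", "Tuesday"}
gold = {"GlaxoSmithKline", "IBM", "Sheffield", "Oracle", "SPSS"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Here three of the four extracted items are correct (precision 0.75) and three of the five expected items were found (recall 0.60). A vendor's "typical" figures are worth re-measuring this way on your own documents.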
Document Retrieval Need a specific document or some information • For spidering: • Coverage, including kinds of documents • Performance, which affects refresh speed • Flexibility/configuration of spiders • Special needs? (focused crawling) • For retrieval: • Relevance ranking • Performance • Richness of query engine • Precision and recall • Query broadening and narrowing • For both: ease of use
Document Categorization You need to sort your documents • Does the system perform in real time? • How many categories total can it handle? • How many categories/document? Flat or hierarchical? • Categories defined automatically or by hand? • Automatically: • Assumes significant vocabulary differences among different groups • Requires training examples • By hand: • Requires time to do it! • Assumes readily identifiable characteristics to distinguish groups
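To make the "automatically" branch concrete, here is a minimal sketch of a word-count (naive Bayes style) categorizer. It is only an illustration of why training examples and vocabulary differences matter, not how any of the commercial systems work; the function names, categories, and training documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build per-category word counts from (tokens, category) training examples."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for tokens, category in labeled_docs:
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, tokens):
    """Return the category with the highest log-probability for the tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing so unseen (category, word) pairs are not zero.
                count = word_counts[category][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training examples with clearly different vocabularies.
training = [
    ("gene protein pathway".split(), "biology"),
    ("protein binding assay".split(), "biology"),
    ("patent claim filing".split(), "legal"),
    ("patent attorney claim".split(), "legal"),
]
model = train(training)
print(classify(model, "protein assay results".split()))  # biology
print(classify(model, "patent filing".split()))          # legal
```

If the two groups shared most of their vocabulary, the scores would be nearly tied, which is the failure mode the slide warns about.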
Document Clustering What is going on in this domain? • What features of the documents are used to cluster? Linguistic? Semantic? TF*IDF? • What methods are used for clustering? (How do we define "similar"?) • Any capability for incorporating domain knowledge? • Performance • Incremental? Or do you have to start over again to add new documents?
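As a reminder of what "TF*IDF" and "similar" can mean in practice, the sketch below weights terms by term frequency times inverse document frequency and compares documents by cosine similarity. This is one common formulation among several, chosen here for illustration; the function names and the three toy documents are assumptions, and a real clustering tool may define both features and similarity differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical mini-corpus: two biology documents and one legal document.
docs = [
    "gene protein expression pathway".split(),
    "protein gene binding pathway".split(),
    "patent claim filing attorney".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share three terms, so similarity is positive
sim_02 = cosine(vecs[0], vecs[2])  # share no terms, so similarity is 0.0
print(sim_01 > sim_02)  # True
```

A clustering method then groups documents whose pairwise similarity is high; asking a vendor which weighting and which similarity measure they use is exactly the question on this slide.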
Document Summarization What do I have? • Sentence extraction or capture and generate? • How much can it be shortened? • How many documents at once? • Sentence extraction methods are heavily dependent on the method used to identify "important" words.
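The dependence on how "important" words are identified shows up clearly in even the simplest sentence extractor. The sketch below assumes importance is just word frequency after removing a small stopword list; the stopword list, function name, and sample text are all illustrative choices, and swapping in a different importance measure would change which sentences are kept.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "are"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]


text = (
    "Text mining finds patterns in text. "
    "Mining text requires good tools. "
    "The weather was pleasant."
)
print(summarize(text, 1))  # ['Text mining finds patterns in text.']
```

The off-topic weather sentence scores lowest because none of its words recur elsewhere in the text.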
Grab Bag of Systems Available: Entity or Information Extraction • AeroText: Lockheed Martin • GATE: U of Sheffield • Sophia: CELI • iMiner: IBM • ClearTag: ClearForest • Thing Finder: Inxight • LexiQuest: SPSS • Faustus/TextPRO: SRI
Categorization/Clustering • Semio: Entrieva • Oracle Text: Oracle • Inxight Categorizer: Inxight • Verity K2: Verity • Autonomy • ClearForest • LexiMine: SPSS • iMiner, Lotus Discovery Server: IBM
Summarizing • All over the place! • Every search engine • Mac OS X 10.2 and later • Many others
What's Happening • Some specific domains are very hot or interesting or intriguing • Expertise finder • Patent retrieval, visualization • Reputation Minder • Biological text mining • Semantic web • In fact, anything web-related • ??
What's Happening • Some technologies are also gaining momentum: • Taxonomy identification/extraction • Question answering • Automatic markup: for the semantic web, for instance • Integrated domain-based and statistical approaches • Machine learning of KBs
Some Useful Resources: Links • Portals of text mining links, kept reasonably up to date: • filebox.vt.edu/users/wfan/text_mining.html • www.cs.utexas.edu/users/pebronia/text-mining • A really excellent overview paper, still useful although it dates from 2001: • www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf • Best site to start with for software, conferences, etc.: • www.kdnuggets.com/index.html
Useful Resources: Conferences • AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything. • KDD: Text Mining often included as a form of data mining, especially more statistical approaches. KDD cup sometimes text based. • SIGIR: Lots of information retrieval • ACL: Lots of linguistic-based info, especially things like entity recognition and tagging. • Data mining conferences: often include text mining component. ICDM, for example. • Domain-specific conferences: often include a text mining component too.
So Where Now? • You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied. • Where do you think it will be in 10 years, and what will we be doing with it?