WordSieve: Learning Task Differentiating Keywords Automatically
Travis Bauer
Sandia National Laboratories
(Research discussed today was done at Indiana University)
Learning Task Contexts: Calvin
• Learn what characterizes a user’s task contexts
• Unobtrusive Observing
• Keyword Extraction
• Index based on Context
Currently Used Algorithms
• TFIDF
• Latent Semantic Analysis
• Log-Entropy
Currently Used Algorithms
• TFIDF
  • "One of the most successful and well tested techniques in Information Retrieval." - Pazzani
  • Syskill & Webert (Pazzani '96)
  • Hierarchical Feature Map (Merkl '97)
  • Learning in Document Filtering (Callan '98)
  • Topic Detection (Schultz '99)
  • Remembrance Agent (Rhodes '00)
  • Lexical Signatures (Park '02)
• Latent Semantic Analysis
• Log-Entropy
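To make the TFIDF baseline concrete, here is a minimal sketch of one common variant (length-normalized term frequency times log inverse document frequency). The slides do not say which variant was used in the comparison, so the formula, the function name `tfidf`, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Toy documents (hypothetical, loosely echoing the browsing topics).
docs = [
    "gore campaign speech gore".split(),
    "bush campaign speech".split(),
    "thai curry recipe".split(),
]
w = tfidf(docs)
```

Note that the document frequencies require the whole collection up front, which is the "comprehensive statistics" property the talk contrasts with stream processing.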
Currently Used Algorithms
• TFIDF
• Latent Semantic Analysis
  • Well known, popular, well covered in the literature
  • Grading Essay Tests
  • Taking Physics tests
  • Taking synonym exams
  • Cross Linguistic IR (Dumais '97)
  • Assigning papers for peer review (Dumais '92)
  • Information Filtering (Foltz '90)
• Log-Entropy
Currently Used Algorithms
• TFIDF
• Latent Semantic Analysis
• Log-Entropy
  • Not used as much for Personal Information Retrieval
  • Higher overhead than TFIDF
  • Indexes based on the distribution of terms across documents, which potentially gives better performance
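The "distribution of terms across documents" point can be illustrated with the standard log-entropy weighting: a local weight log(1 + tf) multiplied by a global weight 1 + Σ_j p_ij log p_ij / log N, where p_ij is the share of term i's occurrences that fall in document j. This is a sketch of the textbook formula, not necessarily the exact configuration evaluated in the talk; `log_entropy` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def log_entropy(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("log-entropy needs at least two documents")
    counts = [Counter(doc) for doc in docs]
    # Global frequency: total occurrences of each term in the collection.
    gf = Counter()
    for c in counts:
        gf.update(c)
    global_w = {}
    for term, total in gf.items():
        # Sum of p * log(p) over the documents that contain the term.
        h = sum(
            (c[term] / total) * math.log(c[term] / total)
            for c in counts
            if term in c
        )
        global_w[term] = 1.0 + h / math.log(n_docs)
    return [
        {t: math.log(1 + tf) * global_w[t] for t, tf in c.items()} for c in counts
    ]


# Toy documents (hypothetical).
docs = [
    "speech gore gore".split(),
    "speech bush".split(),
    "speech curry".split(),
]
w = log_entropy(docs)
```

A term spread evenly over every document ("speech" here) gets a global weight near zero, while a term concentrated in one document ("gore") keeps its full local weight. The extra pass over all documents per term is one source of the higher overhead relative to TFIDF.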
Comparison to Current Techniques
• Current Techniques
  • Static Corpora
  • Comprehensive Statistics
• WordSieve
  • Neural Network-like processing
  • Stream of data
  • Local learning
  • Competitive Learning
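The contrast above (a stream instead of a static corpus, local updates instead of comprehensive statistics, competition for limited slots) can be sketched with a toy tracker. This is NOT the WordSieve algorithm: the slides do not give its update rules, so the class `StreamKeywordTracker`, its `capacity` and `decay` parameters, and the replacement rule are all hypothetical and only illustrate the general idea.

```python
from collections import Counter


class StreamKeywordTracker:
    """Toy illustration of competitive, local learning over a document stream."""

    def __init__(self, capacity=4, decay=0.5):
        self.capacity = capacity  # fixed number of term slots ("nodes")
        self.decay = decay        # per-document decay of every activation
        self.activation = {}

    def observe(self, doc_tokens):
        # Local learning: only the current document is consulted.
        self.activation = {t: a * self.decay for t, a in self.activation.items()}
        for term, count in Counter(doc_tokens).items():
            if term in self.activation:
                self.activation[term] += count
            elif len(self.activation) < self.capacity:
                self.activation[term] = float(count)
            else:
                # Competition: a newcomer takes the weakest slot only if stronger.
                weakest = min(self.activation, key=self.activation.get)
                if count > self.activation[weakest]:
                    del self.activation[weakest]
                    self.activation[term] = float(count)

    def top(self, k=3):
        return sorted(self.activation, key=self.activation.get, reverse=True)[:k]


# Hypothetical stream: two politics pages followed by three cooking pages.
stream = [
    ["gore", "speech", "vote"],
    ["gore", "campaign", "vote"],
    ["thai", "curry", "rice"],
    ["thai", "curry", "noodle"],
    ["thai", "curry", "basil"],
]
tracker = StreamKeywordTracker(capacity=4, decay=0.5)
for doc in stream:
    tracker.observe(doc)
```

After the stream shifts topic, the politics terms decay and lose their slots to the cooking terms, without any collection-wide counts ever being computed. The result depends on document order, which matches the talk's claim that an ordering of the documents can stand in for comprehensive statistics.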
WordSieve Concept (diagram labels: User Browsing; Attributes; Term Activation; Priming)
WordSieve 1 (diagram labels: Doc Stream; Words Currently Occurring Frequently; Words Occurring in Document Sequences; Words Absent in Document Sequences; User Profile; Context Profile)
WordSieve 2 (diagram labels: Doc Stream; Words Currently Occurring Frequently; Words Reflecting Context; User Profile; Context Profile)
Web Browsing Data Set
• Sixteen Users
• Four Topics, 10 minutes Each
  • Political Life Al Gore
  • Political Life George Bush
  • Traditional Indonesian Cooking
  • Traditional Thai Cooking
• Categorized Document Set
• Automatically Generated Queries
Contributions
• It is possible to extract context-differentiating terms from document streams using unsupervised competitive learning.
• Comprehensive statistics are not necessary in the described situations, given an ordering of the documents.
• Performance is comparable to LSI and better than Log-Entropy and TFIDF.
Potential Next Steps
• WordSieve
  • Automate Parameter Optimization
  • Co-occurrence of terms
• Other Domains
  • Multi-dimensional data stream
  • Machine Vision
Support
This work was conducted under the advisement of David Leake at Indiana University. It was sponsored in part by the GAANN fellowship. The original version of the personal information agent was designed and written with partial support from NASA under award No. NCC 2-1035.
For More Information
Travis Bauer
www.cs.indiana.edu/~trbauer/publications.htm
Usenet Data Set
Three sets of 5 newsgroups
• alt.atheism, talk.religion.misc, soc.religion.christian, rec.sport.baseball, rec.sport.hockey
• comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.autos, rec.motorcycles
• talk.politics.guns, talk.politics.misc, sci.electronics, sci.med, sci.space
• Categorized Document Set
• Automatically Generated Queries