TIDES IFE-Bio KickOff Meeting
David Anderson, Laurie Damianos, David Day, Lynette Hirschman, Robyn Kozierok, Scott Mardis, Tom McEntee, Chad McHenry, Michael Merideth, Keith Miller, Bev Nunan, Jay Ponte, George Wilson, Flo Reeder, Steve Wohlever
October 17, 2001
Agenda
• Current Status and Experiments (Laurie)
• User Feedback on MiTAP and Exercise (Eric)
• Lessons Learned (Laurie)
• Architecture Briefing (Jay & Scott)
• Geospatial Processing (George)
• Schedule (Jay)
• Issues and Discussion (All)
Status of MiTAP
• Availability: excellent
  • Available ~100% to users inside and outside the firewall
  • 12 individual user accounts, 6 group accounts
  • 8 daily users on average, mostly repeat users
• Data capture: rich & dynamic
  • ~70 working sources, new source added in 30 min
  • Average 5.8K msgs/day, 1 min latency
  • 250K msgs total in system
• Analysis tools: improving
  • Messages in 6 languages (with COTS translation)
  • Sorted into 173 newsgroups
  • Color-coded tagging (pers/org/loc/disease)
  • Popup summarization
• Product: need to understand how system is being used
MiTAP Activity: Messages and Users Over Time
[Figure: chart of messages and users over time, with the "Aug Experiment" and "Attack on America" periods marked]
Feedback from Eric
• Report on Bio-Threats
• Deployment for N2
• MiTAP Status
• Utility
• Usability
• Accessibility
Lessons Learned
Availability
• User accounts for production system
• No training needed (instructions available on website)
• Stronger security (e.g., intrusion detection)
• Better back-up, monitoring of throughput
• More processing power
Capture
• Reduced latency on scheduled downloads and spidering, hourly capture of headlines
• Distributed capture processing
• Better capture of formatted sources
• Some badly filtered, excess volume causes backlog
• Poor zoning/formatting/decoding of some sources
Lessons Learned (2)
Analysis
• Improved search (e.g., by date/relevance, popups, integrated with news server)
• Improved “normalization” of names, regions
• Too much data! - need better filtering, topic detection & clustering, summarization
• Better MT, support for Arabic
• Q&A
• Geospatial & temporal visualization
• Advanced search
• Better information extraction
Lessons Learned (3)
Product
• No environment for preparing reports
• Workspace
• Drag & drop repository
• Editing capabilities
• Multidoc summarization
• Collaboration feature (chat & shared workspace)
Catalyst Update: Recent work
• Usability for developers
  • Logger
  • Configuration file refinements
• Improvements for distributed systems
  • Redesign of I/O polling procedures
  • Explicit synchronization feature for Language Processor developers
Logger
[Diagram labels: catlogger; Tokenize; Sentence Tagger; Entity Extraction; MetaData; Documents; Word.Text; Word.POS; Sentence; Entities]
In progress
• Usability for developers
  • Monitor (system status capability)
  • Native XML I/O! (for ease of debugging & for lightweight Catalyst)
• Information retrieval
  • Integration between Catalyst and new IR engine
  • Pushing stream filters toward archived streams
• Documentation
Monitor
[Diagram labels: Monitor; Tokenize; Sentence Tagger; Entity Extraction; MetaData; Documents; Word.Text; Word.POS; Sentence; Entities]
XML I/O
[Diagram: at present, an XML doc passes through "XML to Catalyst", Event Extraction, and "Catalyst to XML" to yield an XML doc; with the XML I/O feature, Event Extraction takes an XML doc in and puts an XML doc out directly]
Easier to debug!
XML I/O
[Diagram: with the XML I/O feature, Catalyst processes exchange XML with a non-Catalyst process through a wrapper process]
Easier path to integrate existing language processing systems!
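The wrapper idea can be illustrated with a small round trip between in-memory annotations and XML. This is only a sketch: the slides do not describe Catalyst's actual XML format, so the element and attribute names (`doc`, `text`, `annotation`, `type`, `start`, `end`) and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def to_xml(text, annotations):
    """Serialize text plus standoff annotations (type, start, end) to an XML string.

    The element names used here are hypothetical, not Catalyst's real schema.
    """
    doc = ET.Element("doc")
    ET.SubElement(doc, "text").text = text
    for kind, start, end in annotations:
        ET.SubElement(doc, "annotation", type=kind, start=str(start), end=str(end))
    return ET.tostring(doc, encoding="unicode")

def from_xml(xml_string):
    """Parse the XML produced by to_xml back into (text, annotations)."""
    doc = ET.fromstring(xml_string)
    text = doc.findtext("text")
    annotations = [
        (a.get("type"), int(a.get("start")), int(a.get("end")))
        for a in doc.iter("annotation")
    ]
    return text, annotations
```

A non-Catalyst process would only need to read and write this XML; the wrapper process would call functions like these on either side of it.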
Archived streams
Filter criteria must be pushed upstream from their origination point toward the indices, so that processing is reduced to little more than is absolutely necessary.
[Diagram labels: Question Answering Application (origination point); filter criteria; Answer Extraction; Candidate Selection; Index Refinement; Coreference; Indices; XML doc]
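The benefit of pushing filter criteria upstream can be sketched with a toy pipeline. This is an illustrative assumption, not Catalyst code: `Doc`, `candidate_selection`, and both query functions are hypothetical names, and a plain list stands in for the indices. Each query function also returns how many documents the downstream stage had to handle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: int
    source: str
    text: str

def candidate_selection(docs, term):
    """Stand-in for a costly downstream stage: keep documents mentioning the term."""
    return [d for d in docs if term in d.text.lower()]

def query_filter_late(index, term, source):
    # Every document flows through the downstream stage; the filter runs last.
    processed = candidate_selection(index, term)
    return [d for d in processed if d.source == source], len(index)

def query_filter_pushed(index, term, source):
    # The filter runs at the index, so the downstream stage sees fewer documents.
    narrowed = [d for d in index if d.source == source]
    return candidate_selection(narrowed, term), len(narrowed)
```

Both functions return the same documents; the pushed-down version simply hands less data to the later stages.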
For the Midterm - 12/12/2001
• Monitor
• XML I/O support in the Catalyst library
• Lightweight Catalyst design
• Documentation
Catalyst collaborations
• Qanda
  • Catalyst-based Qanda used for TREC
  • Catalyst-based Qanda deployed at AFIWC
• Information retrieval
  • Archived annotation streams (for creating IR indexes)
  • Seekable streams (for processing IR queries)
• Other projects
  • ACE/Alembic (Information Extraction)
  • Audio hot-spotting (Speech Retrieval)
  • Reading-comp (Question Answering)
Document Management
• Process scheduling
• System linkage
• Inter-site cooperation support
• User features
Process Scheduling
• Problem: MiTAP needs the ability to prioritize sources
  • ‘Catching up’ on a new source shouldn’t prevent timely processing of an important existing source
• Solution:
  • Preprocessing daemon will notify scheduler of incoming content
  • Scheduler assigns jobs to available resources based on priority
• Status:
  • Prototype scheduler delivered (Ponte)
  • Preprocessing daemon rewrite in mid-November (Wohlever)
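The notify-then-assign design above can be sketched with a priority queue. This is a minimal illustration, not the prototype scheduler itself: the `Scheduler` class, its method names, and the convention that a lower number means higher priority are all assumptions.

```python
import heapq
import itertools

class Scheduler:
    """Toy priority scheduler (hypothetical API, not the delivered prototype)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so jobs of equal priority stay first-in, first-out.
        self._counter = itertools.count()

    def notify(self, source, message, priority):
        """Called by the preprocessing daemon; a lower number means more urgent."""
        heapq.heappush(self._heap, (priority, next(self._counter), source, message))

    def next_job(self):
        """Return the most urgent (source, message) pair, or None if nothing is queued."""
        if not self._heap:
            return None
        _, _, source, message = heapq.heappop(self._heap)
        return source, message
```

With this arrangement, a backlog queued from a new source at low priority does not delay a message that arrives later from a high-priority source.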
System Linkage
• Problem: Ever notice how new features tend to only apply to new content?
  • MiTAP is not flexible - difficult to:
    • Reprocess and repost a message that has errors
    • Find the original source document
    • Etc.
  • Currently, retroactive changes require 11th hour hacking (or sometimes 12th hour hacking)
• Solution: Keep database of linkage information to make the system more flexible
• Status:
  • Additional information currently being logged
  • Linkage database - March
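One way to picture the linkage database is a table that ties each posted message back to its source document. The sketch below assumes SQLite and an invented schema; the table name, column names, and function names are hypothetical, since the slides do not specify the design.

```python
import sqlite3

def open_linkage_db(path=":memory:"):
    """Open (or create) a linkage database with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS linkage ("
        " message_id TEXT PRIMARY KEY,"
        " newsgroup TEXT NOT NULL,"
        " source_url TEXT NOT NULL,"
        " raw_path TEXT NOT NULL)"
    )
    return conn

def record_link(conn, message_id, newsgroup, source_url, raw_path):
    """Record (or update) where a posted message came from."""
    conn.execute(
        "INSERT OR REPLACE INTO linkage VALUES (?, ?, ?, ?)",
        (message_id, newsgroup, source_url, raw_path),
    )
    conn.commit()

def find_source(conn, message_id):
    """Return (source_url, raw_path) for a message, or None if it was never logged."""
    return conn.execute(
        "SELECT source_url, raw_path FROM linkage WHERE message_id = ?",
        (message_id,),
    ).fetchone()
```

Given such a record, reprocessing a message with errors amounts to looking up its raw file and running it through the pipeline again.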
Inter-site Cooperation Support
• Problem: Collaboration with other TIDES contractors who have large legacy systems
  • Issue of communication more than scalability
• Solution:
  • Linkage database for annotations, similar to the one used for system maintenance
  • Web client-server communication
  • Path to scalable solution w/ richer interactions
• Status:
  • Data management - January
  • Communications: investigation of relevant protocols and preliminary design - completed
  • Native XML support for Catalyst - December
User features
• Problem: MiTAP helps you find good information, then what?
• Solution:
  • Web-accessible support for user views and data organization to assist in reporting and analysis
  • Automated view construction/feedback incorporating additional TIDES technologies
• Status:
  • Schema for v.1 of workspace developed (Ponte, Anderson)
  • Supporting code in progress (Ponte)
  • Prototype - December
Geo-Spatial Normalization - Goal
Goal:
  We have: Text containing place names
  We want: Points on maps
Process:
  • Extract place names
  • Look up places on a list
  • Determine Lat-Long
  • Display
Example: Seattle 47.6 N 122.317 W
• Problems:
  • Place name not on list
  • More than one place with same name
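The lookup step and its two failure modes can be sketched with a small in-memory place list. This is an assumption-laden illustration: `GAZETTEER` and `look_up` are hypothetical names, Seattle's coordinates are the ones on the slide (west longitude written as negative), and the two Paris entries are approximate values added only to show an ambiguous name.

```python
# Hypothetical place list: lowercase name -> list of (label, latitude, longitude).
# Seattle comes from the slide; the Paris coordinates are approximate illustrations.
GAZETTEER = {
    "seattle": [("Seattle", 47.6, -122.317)],
    "paris": [("Paris, France", 48.86, 2.35), ("Paris, Texas", 33.66, -95.56)],
}

def look_up(place_name):
    """Return (status, matches); status is 'resolved', 'ambiguous', or 'not_found'."""
    matches = GAZETTEER.get(place_name.strip().lower(), [])
    if not matches:
        return "not_found", []   # problem 1: place name not on list
    if len(matches) > 1:
        return "ambiguous", matches   # problem 2: more than one place with same name
    return "resolved", matches
```

Only the "resolved" case can go straight to display; the other two statuses correspond to the problems listed on the slide.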
Geo-Spatial Normalization - Solution
Solution: Part 1: A significant portion of the references can be resolved using easy methods.
  Unambiguous: Seattle, Toulouse
  Ambiguous: Paris, Washington
  Disambiguated: Paris, Texas; The State of Washington
Solution: Part 2: Use the “easily resolved” references as training data for a machine learning classifier which will distinguish the rest.
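Part 2 can be pictured as follows: contexts of mentions that were already disambiguated become labeled training examples, and an ambiguous mention is assigned the sense whose training contexts it most resembles. The slides do not say which classifier is planned, so a simple word-overlap scorer stands in for it here; `train`, `classify`, and the example sentences are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Build per-sense word counts from (sense, context) pairs of easily resolved mentions."""
    counts = defaultdict(Counter)
    for sense, context in examples:
        counts[sense].update(tokenize(context))
    return counts

def classify(counts, context):
    """Pick the sense whose training contexts share the most words with this context.

    A word-overlap score is a stand-in for a real machine learning classifier.
    """
    words = tokenize(context)
    scores = {sense: sum(c[w] for w in words) for sense, c in counts.items()}
    return max(scores, key=scores.get)
```

For example, training on one context containing "Paris, Texas" and one containing "Paris, France" lets the scorer label a bare "Paris" by the words around it.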
Geo-Spatial Normalization - Plans
• For MidTerm (Dec. 12, 2001)
  • Detect a significant portion of the “easily resolvable” references
  • Display with some map tool
    • Web delivery desirable
• After MidTerm (May, 2002)
  • Try to find more “easily resolvable” references
  • Do the machine learning part
  • Integrate with other mapping tools
Issues and Discussion
• How is MiTAP currently being used?
  • Who are the users?
  • What are the users doing?
  • What do users want?
• Prioritization of issues
  • Integrated feasibility experiment versus operational prototype:
    • Possible deployment vs integration of other TIDES technologies
    • (Do we need to adjust our priorities?)
  • Along what dimensions should we optimize?
    • Availability, capture, analysis, presentation