
Conference Tracker (project presentation)



Presentation Transcript


1. Conference Tracker (project presentation)
Andy Carlson, Vitor Carvalho, Kevin Killourhy, Mohit Kumar

2. Overview
• Goal: to find and gather salient details about conferences and workshops:
  • Submission deadline
  • Location
  • Home page … and others
• Preliminary results: succeeded in autonomously finding conferences, submission deadlines, locations, and homepages, although not without error; approaches ranged from bootstrapping to focused crawling.

3. Conference Tracker Modules
The tracker has four modules; each group member worked primarily on the design and implementation of a particular component.

4. Bootstrapped Conference Acronym Discovery
• Goal: find conference acronyms
• Examples: ICML2006, IJCAI01, SIGMOD'98
• Discovers patterns of the form "token token _____ token token" that frequently have acronyms in the blank
• Redundant features: web page text, morphology
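Morphology is the cheap, high-precision cue here: conference acronyms look like a run of capitals plus a year. A minimal Perl sketch (Perl being the team's prototyping tool, per the Useful Resources slide); the exact separator and year rules below are assumptions, not the project's actual filter:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical morphology filter: a run of capital letters, an optional
# separator (hyphen, apostrophe, or space), then a 2- or 4-digit year.
# The project's actual rules are not shown in the slides.
sub looks_like_acronym {
    my ($token) = @_;
    return $token =~ /^[A-Z]{2,}[-' ]?(?:\d{4}|\d{2})$/;
}

for my $candidate ('ICML2006', 'IJCAI01', "SIGMOD'98", 'denver') {
    printf "%-10s %s\n", $candidate,
        looks_like_acronym($candidate) ? 'acronym' : 'rejected';
}
```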

5. Seed Conferences
• We start by searching for:
  • "academic conferences including"
  • "academic conferences such as"
  • "and other academic conferences"
  • "or other academic conferences"
• This yields the seeds SC2001 and WWW2003

6. Finding Patterns
• Searching for "SC2001" and "WWW2003" yields these ten most frequent patterns:
  • QUESTIONS ABOUT ___ MAY BE
  • PAPERS AT ___ IN DENVER
  • GATHER AT ___ TO DEFINE,
  • TRIP TO ___ PC MEETING
  • PREVIOUS MESSAGE: ___ BEOWULF PARTY
  • FWD FW ___ CALL FOR
  • TO OFFER ___ CLUSTER TUTORIAL
  • FWD AGENTS ___ WORKSHOP ON
  • EXHIBIT AT ___ TO FEATURE
  • 1 0 ___ 1 1

7. Finding More Acronyms
• Searching for these new patterns yields more acronyms: HFES2003, ICKM2005, SC2000, SCCG2000, SPLIT 2001, SVC05, WWW2002, WWW2004, WWW2005

8. Repeat…
• Repeating the process for 5 cycles yields 95 conference acronyms:
AAAI-05, AAAI'05, AAAI-2000, AAAI-98, AAMAS 2002, AAMAS 2005, ACL 2005, ACSAC 2002, ADMKD'2005, AGENTS 1999, AIAS 2001, AMPT95, AMST 2002, APOCALYPSE 2000, AVI2004, AWESOS 2004, BABEL01, CASCON 1999, CASCON 2000, CHI2006, CHI 2006, CHI97, CHI99, CITSA 2004, COMPCON 93, CSCW2000, EACL06, ECOOP 2002, ECOOP 2003, ECSCW 2001, EDMEDIA 2001, EDMEDIA 2002, EDMEDIA 2004, EMBODY2, ES2002, ESANN 2002, ESANN 2004, GECCO 2000, GWIC'94, HFES2003, HT05, HT'05, IAT99, ICKM2005, ICSM 2003, IFCS 2004, IJCAI-03, IJCAI05, IJCAI 2001, IJCAI 2005, IJCAI91, IJCAI95, ISCSB 2001, LICS 2001, MEMOCODE 2004, METRICS02, MIDDLEWARE 2003, NORDICHI 2002, NUFACT05, NWPER'04, NWPER'2000, OOPSLA'98, PARCO2003, PARLE'93, PKI04, PODC 2005, POPL'03, PROGRESS 2003, PRORISC 2002, PRORISC 2003, PRORISC 2004, PRORISC 2005, PROSODY 2002, RIAO 94, ROMANSY 2002, SAC 2004, SAC2005, SC2000, SC2001, SCCG2000, SIGDOC'93, SIGGRAPH'83, SIGIR 2001, SPIN97, SPLIT 2001, SPS 2004, SVC05, UML'2000, WOTUG 16, WOTUG 19, WWW2002, WWW2003, WWW2004, WWW2005, WWW2006
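The loop itself is small. A condensed sketch under two loud assumptions: query_web() is a hypothetical stand-in for whatever search API is used (stubbed out here), and the "keep the ten most frequent patterns" cut-off mirrors the list on slide 6 rather than a documented design choice:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stub standing in for a search API: should return text
# snippets that mention the query string.
sub query_web { my ($query) = @_; return (); }

# Same illustrative morphology filter as before.
sub looks_like_acronym { return $_[0] =~ /^[A-Z]{2,}[-' ]?\d{2,4}$/ }

my %acronyms = map { $_ => 1 } qw(SC2001 WWW2003);    # the two seeds

for my $cycle (1 .. 5) {
    # Step 1: search for known acronyms and count the
    # "token token ___ token token" contexts around each occurrence.
    my %patterns;
    for my $acr (keys %acronyms) {
        for my $snippet (query_web($acr)) {
            while ($snippet =~ /(\S+ \S+) \Q$acr\E (\S+ \S+)/g) {
                $patterns{"$1 ___ $2"}++;
            }
        }
    }

    # Step 2: search for the most frequent patterns and harvest what
    # fills the blank, keeping only morphologically plausible tokens.
    my @top = (sort { $patterns{$b} <=> $patterns{$a} } keys %patterns)[0 .. 9];
    for my $pat (grep { defined } @top) {
        my ($left, $right) = split / ___ /, $pat;
        for my $snippet (query_web(qq{"$left" "$right"})) {
            if ($snippet =~ /\Q$left\E (\S+(?: \d{4})?) \Q$right\E/) {
                $acronyms{$1} = 1 if looks_like_acronym($1);
            }
        }
    }
}

print "$_\n" for sort keys %acronyms;
```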

9. Best Patterns
• Most productive patterns:
  • "cfp ___ workshop"
  • "proceedings of ___ pages"
  • "for the ___ workshop"

10. Bootstrapped Acronym Discovery: Conclusions
• Using morphology to find only conference acronyms gave 100% precision but low recall (all acronyms discovered were conferences or workshops)
• Bootstrapping from a generic set of queries can take us from 2 to 95 acronyms
• To boost recall, we need some method of focusing on the best patterns

11. Name/Page Finder (Algorithm)
• Supplied with an acronym/year pair (e.g., SAC'04), finds the corresponding conference and its homepage (Selected Areas in Cryptography / http://vlsi.uwaterloo.ca/~sac04)
• Search Google for "SAC 04" and "SAC 2004" (10 results each)
• Extract potential conference names (using capitalization heuristics)
• Score each web page and potential conference name, then select the highest-scoring page/name pair
• Each name and page is scored on (see the sketch below):
  • Heuristics (e.g., acronym embedded in the name, title contains the acronym)
  • Inclusion of words distinctive to conference names and pages
• Distinctive words are determined using TF-IDF scoring, and word counts are updated after each acronym.
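A minimal sketch of the scoring step. The feature weights and the toy TF-IDF table are invented for illustration; only the two heuristics and the "distinctive words" idea come from the slide:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy TF-IDF-style weights for words distinctive to conference names;
# in the project these counts are updated after each acronym.
my %word_score = (conference => 1.5, international => 1.2, workshop => 1.0);

# Illustrative scorer for a (candidate name, page title) pair.
sub score_pair {
    my ($acronym, $name, $page_title) = @_;
    (my $base = $acronym) =~ s/[^A-Za-z]//g;        # SAC'04 -> SAC
    my $score = 0;
    $score += 2 if embedded_acronym($base, $name);  # acronym spelled by name
    $score += 1 if $page_title =~ /\b\Q$base\E\b/;  # title contains acronym
    $score += $word_score{lc $_} // 0 for split /\W+/, $name;
    return $score;
}

# True if the initials of the name's longer words spell the acronym,
# e.g. SAC <- "Selected Areas in Cryptography" (skipping "in").
sub embedded_acronym {
    my ($base, $name) = @_;
    my $initials = join '', map { uc substr($_, 0, 1) }
                   grep { length > 2 } split /\s+/, $name;
    return index($initials, $base) >= 0;
}

print score_pair("SAC'04", 'Selected Areas in Cryptography',
                 'SAC 2004 Home Page'), "\n";       # prints 3
```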

12. Name/Page Finder (Results)
• Evaluation within Conference Tracker:
  • Given the output of the Acronym Finder, find the name/homepage for all of the acronym/year pairs.
  • When both the homepage and the name are completely right, the result is labeled all-correct. If the name is correct but the homepage is wrong, it is labeled name-correct.
• Evaluation as a stand-alone component:
  • Given a set of 27 manually collected acronyms for conferences with homepages in 2006, repeat the above procedure.

13. Location Finder – Approach (Focused Crawling)
• Motivation: Sergey Brin's approach to extracting author/book-title pairs
• Observation: searching for <Conference Name> <Location> returns the conference main page or similar pages.
• Pattern observation: these pages state the full name of the conference in close proximity to the conference location.
• Generalized pattern: proximity, currently defined as a window of 200 characters.
• Algorithm (sketched below):
  • Query Google with the conference long name and year
  • Use the top URLs to look for locations in proximity of the conference long name (currently using the topmost result only)
  • Use heuristics to assess whether the page contains the conference location or is a list of such conference/location pairs
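A minimal sketch of the proximity idea: locate the conference's long name in the page text and scan a 200-character window after it for a known location. The slides describe the annotator as a dictionary look-up; %known_locations and the one-sided window here are simplifications:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy dictionary standing in for the location annotator.
my %known_locations = map { $_ => 1 } ('Denver', 'Edinburgh', 'Banff');

# Find the long name, then look for a dictionary location within a
# 200-character window after it (the slides' proximity operator).
sub find_location {
    my ($page_text, $long_name) = @_;
    my $pos = index $page_text, $long_name;
    return undef if $pos < 0;
    my $window = substr $page_text, $pos + length($long_name), 200;
    for my $loc (keys %known_locations) {
        return $loc if $window =~ /\b\Q$loc\E\b/;
    }
    return undef;
}

my $text = "The International Conference on Machine Learning will be "
         . "held in Banff, Alberta, Canada.";
print find_location($text, 'International Conference on Machine Learning')
      // 'not found', "\n";                          # prints Banff
```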

14. Location Finder – Pros & Cons
• Pros:
  • Quite general approach, because of the proximity operator
  • Scalable approach
• Cons:
  • Depends on the Google query results; query "crafting" is important
  • Dependent on finding the home page or a similar page for the conference
  • Needs location annotators

15. Location Finder – Test Results
• 13 conferences & workshops (IEEE & ACM), using the full name to query Google and the top link for extraction:
  • Correct: 7
  • Partially correct: 1
  • No result: 5
• Reasons for failures:
  • Annotator coverage: 1 (the partially correct case)
  • Name in an image: 4
  • Text extraction from the web page: 1

16. Location Finder – Improvements
• Use co-training:
  • Redundancy on the web is not being exploited
  • The model is not probabilistic (currently just the top link is used for extraction)
• Better location annotator:
  • Currently a simple dictionary look-up (use Minorthird/BBN)
• Intelligent, adaptable window

17. Submission Date
• Task: find the paper submission deadline
• Google: "call for papers conferenceName conferenceAcronym year submission deadline" and similar queries
• Two types of pages to process: pages with CFP lists and regular conference pages
• Most of the time, there is no sentence structure
• Idea: proximity of keywords (submission, deadline, conference name, year, etc.)

  18. Lists of CFP

  19. Conference Dates Page

20. Submission Date
• Hand-tuned entity recognizer for dates (sketched below):
  • Several heuristics and regular expressions
  • No learning
• Rank by the date "closest" to keywords
  • Some keywords: submission, deadline, conference acronym, year
• Precision:
  • All conferences: top 1 = 2%, top 3 = 5.8%, event dates = 13.4%
  • More recent conferences (SIGIR, ICML, KDD, 2003-2006): top 1 = 50%, top 3 = 75%
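A minimal sketch of this step: a simplified date regex (a stand-in for the hand-tuned recognizer, which the slides say uses several heuristics) and ranking by distance to the nearest deadline keyword:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-ins for the hand-tuned patterns.
my $month = qr/Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|
               Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|
               Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?/x;
my $date_re = qr/\b$month\s+\d{1,2}(?:,?\s*\d{4})?\b/;
my $keyword = qr/\b(?:submission|deadline|due)\b/i;

# Return the dates in $text, closest-to-a-keyword first.
sub rank_dates {
    my ($text) = @_;
    my @keyword_pos;
    push @keyword_pos, $-[0] while $text =~ /$keyword/g;
    my @dates;
    while ($text =~ /($date_re)/g) {
        my ($date, $pos) = ($1, $-[1]);
        my ($best) = sort { $a <=> $b } map { abs($_ - $pos) } @keyword_pos;
        push @dates, [$date, $best // 1e9];
    }
    return map { $_->[0] } sort { $a->[1] <=> $b->[1] } @dates;
}

my @ranked = rank_dates(
    "Conference: July 24-28, 2005. Paper submission deadline: February 18, 2005."
);
print "$_\n" for @ranked;   # February 18, 2005 ranks above July 24
```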

21. Submission Date
• Problems:
  • Main conference vs. workshop/tutorial dates
  • Co-located conferences
  • Same conference, but a previous year
  • Actual conference event dates
  • Changes of deadlines
• Hard to evaluate: we just couldn't find the deadline for some old conferences

22. Overall Results
• Acronym finder: 100% precision
• Name/page finder: 49% names correct; 23% names & URLs correct (85% on vetted data)
• Location finder: 21% locations correct; 38% lists, 30% none, 11% wrong
• Date finder: 2% completely right; 5.8% in top 3; 13.4% event dates

23. Lessons Learned
• If we really are learning, then reconsider earlier decisions in light of new knowledge:
  • Pass 1: AAAI = Holger Hoos and Thomas Stuetzle, IJCAI Workshop
  • Pass 2: AAAI = National Conference on Artificial Intelligence
• Supplement creative learning algorithms with simple, focused crawling
• Don't underestimate the time it takes to build foundational tools before "learning"

24. Useful Resources
• Perl:
  • Rapid prototyping
  • Packages/extensions
  • Quick/dirty text manipulation
• Shell scripts and Unix tools: grep, sed, bash, lynx ...
• Google:
  • Wildcards (*) and date ranges (2003..2006)
  • Cached web pages

25. What's Next?
• Failure notifications from later components could propagate backward.
• All components could be smarter about how far down Google's results to descend (i.e., keep going as long as the results provide valuable information).
• Given good name/acronym/location/date sets, we could look for lists.
