Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics
2a. Course projects
Jonas Kuhn, Universität Potsdam, 2007
Course requirements
• Exercises (not graded)
• 2-3 larger programming assignments (to be handed in; graded)
• Participation in a “project team” (2-5 members each)
• Connection to an overall course project (see below)
• Research on a sub-topic (literature and/or available tools/resources)
• (Short) presentation of results / possibly a small tutorial on techniques of general interest
• Experiments with tools, or programming
• Documentation of the project work (broken down by participant)
The Spock Challenge
• The Entity Resolution Problem
• A common problem that we face is that there are many people with the same name. Given that, how do we distinguish a document about Michael Jackson the singer from Michael Jackson the football player?
• World-wide contest for a software solution
• http://challenge.spock.com/
• Winning team receives $50,000 prize
• (NOTE RULES! “Upon acceptance of the prize, the winning Software Submissions and all source code and algorithms related thereto becomes the sole and exclusive property of Spock.”)
The Spock Challenge
• With billions of documents and people on the web, we need to identify and cluster web documents accurately to the people they are related to.
• Mapping these named entities from documents to the correct person is the essence of the Spock Challenge.
The Spock Challenge
• Data set
• The complete data set is divided into training and test sets containing roughly 25,000 and 75,000 documents, respectively.
• Along with a set of documents we've included a set of target names. You can assume that each document contains only one of the target names (even though most documents contain many names).
• The challenge is to partition all the documents relevant to a target name by their referent. Consider the following two documents with the target name "Michael Jackson":
  Michael Jackson - The King of Pop or Wacko Jacko?
  Michael Jackson statistics - pro-football-reference.com
  The referents of these articles are the pop star and football player, respectively. We've included the ground truth for the training set so you have something to compare against.
The Spock Challenge
• Test/Application
• Once you're done training, you can run your algorithm on the test set and submit your results on this site. (http://challenge.spock.com/)
• We will provide instant feedback in the form of a percentage rank score (using the F-measure). This way you can see how you stack up against the other teams. What good is a problem without a little competition?
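The slide only says that submissions are scored "using the F-measure"; the exact variant used by the challenge is not given here. As a rough illustration, the sketch below computes a pairwise F-measure for a partition of documents, which is one common way to score a clustering against ground truth. The function names, the document ids, and the toy labels are all hypothetical, and the pairwise definition itself is an assumption rather than the official scoring rule.

```python
from itertools import combinations


def same_cluster_pairs(assignment):
    """Return the set of unordered document pairs that share a cluster label."""
    docs = sorted(assignment)
    return {
        (a, b)
        for a, b in combinations(docs, 2)
        if assignment[a] == assignment[b]
    }


def pairwise_f_measure(gold, predicted, beta=1.0):
    """Pairwise precision, recall, and F-measure of a predicted partition.

    Both arguments map document ids to cluster labels; only the grouping
    matters, not the label values themselves.
    """
    gold_pairs = same_cluster_pairs(gold)
    pred_pairs = same_cluster_pairs(predicted)
    correct = len(gold_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical toy data: four documents for the target name "Michael Jackson".
gold = {"doc1": "singer", "doc2": "singer", "doc3": "football", "doc4": "football"}
predicted = {"doc1": 0, "doc2": 0, "doc3": 0, "doc4": 1}
precision, recall, f1 = pairwise_f_measure(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In the toy data the system wrongly puts one football document into the singer cluster, so only one of its three same-cluster pairs is correct, while one of the two gold pairs is recovered.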
Course projects inspired by the Spock Challenge
• Experiment with various (mostly statistical) NLP techniques on the data set
• Any ideas?
Sub-tasks (we need a team for each)
• State of the Art in Entity Resolution (a.k.a. deduplication, or merge-purge)
• Clustering
  • Starting point: Manning/Schütze 1999, ch. 14
• Information/Document Retrieval (?)
  • Starting point: Manning/Schütze 1999, ch. 15
• Term weighting techniques
• Possibly build additional data sets
• Named Entity Detection
• Coreference Resolution
• Parsing, Semantic Role Labelling
• Using WordNet (and other ontological resources)
• Using Wikipedia (and other encyclopaedic resources)
• Word Sense Disambiguation (possibly similar techniques)
• Software Integration, Testing
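To make the clustering and term-weighting sub-tasks a little more concrete, here is a minimal baseline sketch: TF-IDF term weighting, cosine similarity, and single-link agglomerative clustering with a similarity cut-off. It is only one possible starting point, not a method prescribed by the course or by the challenge. The function names, the toy documents, and the threshold of 0.05 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map each document id to a sparse TF-IDF vector (dict term -> weight)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Terms that occur in every document (e.g. the target name) get weight 0.
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_link_clusters(vectors, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters = [[doc_id] for doc_id in sorted(vectors)]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy documents, all containing the target name "Michael Jackson".
docs = {
    "d1": "Michael Jackson the King of Pop released a new album",
    "d2": "Pop singer Michael Jackson album tops the charts",
    "d3": "Michael Jackson football statistics for the season",
    "d4": "Football player Michael Jackson season statistics and games",
}
clusters = single_link_clusters(tfidf_vectors(docs), threshold=0.05)
print(clusters)
```

On these four toy documents the sketch groups the two pop-music texts and the two football texts. Because the target name appears in every document, its IDF weight is zero and it does not influence the similarity. The quadratic pairwise search is fine for a handful of documents but would need a more efficient implementation for a data set of the size described above.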