Using Corpora and Evaluation Tools
Diana Maynard, Kalina Bontcheva
http://gate.ac.uk/
http://nlp.shef.ac.uk/
March 2004

Corpus structure
• Located in gatecorpora in CVS
• Each directory under gatecorpora holds a corpus, e.g. gatecorpora/ace
• Each corpus can have sub-parts, e.g. ace/bnews
• Each (sub-)corpus has a clean and a marked directory; both are important
• clean holds the unannotated texts, while marked holds the human-annotated versions
• There may also be a processed subdirectory; unlike the other two, this is a datastore
• Corresponding files in each subdirectory must have the same name

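The same-name requirement can be checked mechanically. The following is a minimal Python sketch, not part of the gatecorpora utilities; the function name compare_corpus_dirs is hypothetical, and the sketch assumes only the clean/marked layout described above.

```python
from pathlib import Path


def compare_corpus_dirs(corpus_dir):
    """Return (only_in_clean, only_in_marked) as sorted lists of file names.

    Assumes corpus_dir contains 'clean' and 'marked' subdirectories,
    as described in the slide above.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    return sorted(clean - marked), sorted(marked - clean)
```

Both returned lists are empty when every clean file has a marked counterpart with the same name, and vice versa.
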
Tools for corpus manipulation
• Many tools are available in gatecorpora/utilities and in subdirectories of each corpus
• Many of the corpora (e.g. MUC, ACE) come in different formats (e.g. inline vs. standoff markup) and have been converted to GATE-style annotations
• There are also tools for e.g. counting things and changing annotation names (mostly JAPE grammars)

Corpora available
• MUC7 (newswires)
• MUSE (news texts from the web)
• ACE
• ACE Chinese
• ACE Arabic
• Romanian (news texts; 1984)
• CMU seminars
• Jobs
• CONLL'03 (part of Reuters with NEs)
• Bulgarian (news)

MUC 7 corpus
• Newswires used in the official MUC 7 evaluation
• Data available in MUC format and GATE format
• Annotation types: Person, Location, Organization, Money, Percent, Date, Time
• Divided into training and test sets

MUSE corpus
• News texts from various websites (BBC, Guardian, etc.)
• Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
• Annotation guidelines differ slightly from MUC, e.g. people's titles are included in names
• Available from gatecorpora/news in various subdirectories

ACE corpus
• 3 types of text: newswire, broadcast news and newspaper
• Broadcast news and newspaper texts are available both as ground truth and as original (degraded) texts
• Annotation types: Person, Organisation, Location, GPE, Facility
• Some annotations have roles to indicate metonymous usage
• Guidelines are different from MUC and MUSE
• Available from gatecorpora/ace in various subdirectories

Multilingual ACE
• As for ACE, but in Chinese and Arabic
• Texts are in UTF-8
• No degraded versions of these texts
• Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/

CMU Seminars & Jobs
• Corpora frequently used to evaluate relation extraction and wrapper induction systems
• Located in gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
• Converted into GATE XML, ready for use

CONLL'03 shared task
• Corpus used in the CONLL'03 shared task for evaluating NE recognition
• In English, part of the Reuters corpus
• Markup is e.g. <I-LOC>, not converted to MUSE tags
• Use reuterstogate.jape to convert to MUSE tags
• Located in gatecorpora/ReutersWithNamedEntities

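To illustrate the kind of tag mapping such a conversion involves, here is a small Python sketch. It is not the contents of reuterstogate.jape: the mapping table and the function name muse_type are assumptions made for illustration, using the MUSE annotation type names listed earlier. Tags with no obvious MUSE counterpart (e.g. MISC) are left unmapped.

```python
import re

# Hypothetical mapping from CoNLL'03 entity codes to MUSE-style annotation types.
CONLL_TO_MUSE = {"PER": "Person", "ORG": "Organisation", "LOC": "Location"}

# Accepts tags written either as "<I-LOC>" or as "I-LOC".
TAG_RE = re.compile(r"^<?([IB])-([A-Z]+)>?$")


def muse_type(tag):
    """Return the MUSE-style type for a CoNLL'03 tag, or None if unmapped."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    return CONLL_TO_MUSE.get(match.group(2))
```
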
Regression Test
• At corpus level: the corpus benchmark tool
• Tracks the system's performance over time

How it works
• Uses the clean, marked, and processed directories
• corpus_tool.properties must be in the directory from which GATE is executed
• It specifies configuration information about:
    • Which annotation types are to be evaluated
    • The threshold below which to print out debug info
    • Input set name and key set name
• Modes:
    • Default: regression testing
    • Human-marked against already stored, processed results
    • Human-marked against current processing results

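The threshold idea can be sketched in Python as follows. This is an illustration only, not the corpus benchmark tool's implementation: it assumes the threshold is compared against a per-document strict F-measure (the slides do not say which score is used), and the function names and the (start, end, type) annotation representation are hypothetical.

```python
def f_measure(key, response):
    """Strict F1 between two sets of (start, end, type) annotations."""
    if not key and not response:
        return 1.0  # nothing to find and nothing found
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def documents_below_threshold(key_docs, response_docs, threshold):
    """Map document name -> score for documents scoring under the threshold.

    key_docs holds the human-marked annotations per document;
    response_docs holds the system output (stored or freshly produced).
    """
    flagged = {}
    for name, key in key_docs.items():
        score = f_measure(key, response_docs.get(name, set()))
        if score < threshold:
            flagged[name] = score
    return flagged
```

Passing stored results as response_docs corresponds to comparing human-marked against processed data, while passing fresh output corresponds to comparing against current processing results.
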
Conclusion
• This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
• More information: http://gate.ac.uk/