Project Presentation

Lin572 Advanced Statistical Methods in NLP
Project Presentation
Team Members: Anna Tinnemore, Gabriel Neer, Yow-Ren Chiang

PART 3 MaxEnt (yipee!)

The Good Stuff:
• Simple feature templates and extraction
• Elegant data structures for storage and easy access
• Pretty good results!

The Bad Stuff: • Hmmm. . . .

Features
• A few short loops collected the most relevant context features
• No long-winded feature templates
• Easy-access hashes
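The slides do not list the features themselves. As a rough illustration of what "a few short loops" and "easy-access hashes" could look like, the sketch below collects context features for each word of a sentence into a hash. The sub name extract_features and the feature names (curW, prevW, nextW, suf2) are assumptions chosen for illustration, not the team's actual templates.

```perl
use strict;
use warnings;

# Hypothetical feature extractor: returns one hash ref of features per word.
# Feature names are illustrative assumptions, not taken from the slides.
sub extract_features {
    my @words = @_;
    my @features;
    for my $i (0 .. $#words) {
        my %f;
        $f{"curW=$words[$i]"} = 1;
        # Neighbouring words, with boundary markers at the sentence edges.
        $f{ 'prevW=' . ($i > 0       ? $words[$i - 1] : 'BOS') } = 1;
        $f{ 'nextW=' . ($i < $#words ? $words[$i + 1] : 'EOS') } = 1;
        # Two-character suffix, only for words longer than two characters.
        $f{ 'suf2=' . substr($words[$i], -2) } = 1 if length($words[$i]) > 2;
        push @features, \%f;
    }
    return @features;
}

my @feats = extract_features(qw(the dog barks));
print join(' ', sort keys %{ $feats[1] }), "\n";
# prints: curW=dog nextW=barks prevW=the suf2=og
```

Storing each feature as a hash key makes lookup and counting a single hash access, which is presumably what "easy-access hashes" refers to.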

Decent Results
• Mid-nineties, increasing with the size of the training data
• Result

PART 4 Task 2 Bagging

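Tie Function
    use Tie::File;
    use Fcntl;

    # Treat the file as an array of lines (read-only).
    tie my @lines, 'Tie::File', $file_name, mode => O_RDONLY
        or die "Can't tie $file_name: $!";

    for my $bag_num (1 .. $B) {
        # The Nth bag from file foo.txt becomes foo.txt-bagN, etc.
        my $bag_name = "$file_name-bag$bag_num";
        open (BAG, ">$bag_name") or die "Can't open $bag_name for writing: $!";
        # Draw as many lines as the file has, with replacement.
        for (1 .. @lines) {
            # Pick random line of file.
            my $line = $lines[ rand @lines ];
            print BAG "$line\n";    # Output to the bag.
        }
        close BAG;
    }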
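The sampling step can also be tried without touching the file system. The following is a minimal sketch, assuming the lines are already in memory; the sub name bootstrap_sample and the sample lines are hypothetical, and the fixed seed is only there so that a run can be repeated.

```perl
use strict;
use warnings;

# Hypothetical helper: draw as many items as were passed in, with replacement.
sub bootstrap_sample {
    my @items = @_;
    my @bag;
    # rand @items yields a float in [0, N); used as an index it is truncated.
    push @bag, $items[ rand @items ] for 1 .. @items;
    return @bag;
}

srand(42);    # fixed seed so the same bag can be regenerated
my @lines = ('a/DT b/NN', 'c/VB d/NN', 'e/JJ f/NN');
my @bag   = bootstrap_sample(@lines);
print scalar(@bag), " lines drawn\n";    # prints: 3 lines drawn
```

Because the draw is with replacement, a bag usually repeats some lines and omits others, which is what makes the taggers trained on different bags disagree.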

Combination • VOTING!!

Step 1:
    # Loop through file and remember words. Keep them grouped by sentence.
    while (<FILE>) {
        my @word_tags = split;    # whitespace-separated word/tag tokens
        my @words;                # fresh array for each sentence
        foreach (@word_tags) {
            my @wordtag = split /\//;
            push (@words, ($wordtag[0]));
        }
        push (@sentences, (\@words));
    }

Step 2:
    # Go through each file and, for each word, increase the count of its tag.
    for my $file (@ARGV) {
        open (FILE, '<', $file) or die "Can't open $file for reading: $!";
        my $tag_index = 0;
        while (<FILE>) {
            my @word_tags = split;
            foreach (@word_tags) {
                my @wordtag = split /\//;
                my $tag = $wordtag[1];
                $tags[$tag_index]->{$tag}++;
                $tag_index++;
            }
        }
        close FILE;
    }

Step 3:
    # Go through the sentences and print out each word/tag pair.
    my $tag_index = 0;
    foreach my $sent (@sentences) {
        foreach my $word (@$sent) {
            my $tag = max_tag($tags[$tag_index]);
            $tag_index++;
            print "$word/$tag ";
        }
        print "\n";
    }

Finding the “Best Tag”
    # Find the tag with the highest count.
    sub max_tag {
        my $tag_hash = shift;
        (my $tag) = keys %$tag_hash;
        my $tag_count = $tag_hash->{$tag};
        foreach (keys %$tag_hash) {
            if ($tag_hash->{$_} > $tag_count) {
                $tag = $_;
                $tag_count = $tag_hash->{$tag};
            }
        }
        return $tag;
    }
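When two tags receive the same number of votes, the tag returned by max_tag depends on the order in which keys lists the hash entries, and Perl does not guarantee that order. A minimal sketch of a variant with a reproducible tie-break follows; the name max_tag_stable and the alphabetical tie-break rule are assumptions for illustration, not part of the original tool.

```perl
use strict;
use warnings;

# Hypothetical variant of max_tag: the highest count wins, and ties are
# broken alphabetically so repeated runs give the same answer.
sub max_tag_stable {
    my $tag_hash = shift;
    my ($best) = sort {
        $tag_hash->{$b} <=> $tag_hash->{$a}    # higher count first
            or $a cmp $b                       # then alphabetical order
    } keys %$tag_hash;
    return $best;
}

# Example votes for a single word position.
my %votes = (NN => 2, VB => 2, JJ => 1);
print max_tag_stable(\%votes), "\n";    # NN and VB tie; prints: NN
```

Sorting is slower than the single pass in max_tag, but the number of distinct tags per word is small, so the difference should rarely matter.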

Procedure
• Creating bootstrap samples
  • The file is treated as an array of lines
  • N random array indices are selected and each corresponding line is output to a file
• Combine_tool.pl
  • Opens the file corresponding to its first argument
  • Reads in all words, aggregated by sentence
  • An array of tag hashes is created
  • For each file in its argument list, opens that file and reads the tags sequentially
  • The hash item corresponding to the tag at the appropriate index of the tag array is incremented
  • For each index, the hash label with the highest count is chosen as the correct tag
  • Re-associates the tags with their words
  • Prints out the word/tag pairs
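The combination procedure can be sketched end to end on in-memory data. This is an illustrative sketch, not Combine_tool.pl itself: the sub name vote_tags and the sample sentences are hypothetical, each array passed in stands in for the contents of one tagged output file, and ties are broken alphabetically as an added assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: takes several tagger outputs (array refs of lines in
# "word/tag word/tag ..." format) and returns the majority-voted lines.
sub vote_tags {
    my @outputs = @_;
    my (@sentences, @tags);

    # Remember the words of the first output, grouped by sentence.
    for my $line (@{ $outputs[0] }) {
        my @words;
        for my $token (split ' ', $line) {
            my ($word) = split m{/}, $token;
            push @words, $word;
        }
        push @sentences, \@words;
    }

    # Count, per word position, how often each tag was proposed.
    for my $output (@outputs) {
        my $tag_index = 0;
        for my $line (@$output) {
            for my $token (split ' ', $line) {
                my (undef, $tag) = split m{/}, $token;
                $tags[$tag_index++]{$tag}++;
            }
        }
    }

    # Re-associate the winning tag with each word.
    my $tag_index = 0;
    my @voted;
    for my $sent (@sentences) {
        my @pairs;
        for my $word (@$sent) {
            my $counts = $tags[$tag_index++];
            my ($best) = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                         keys %$counts;
            push @pairs, "$word/$best";
        }
        push @voted, join(' ', @pairs);
    }
    return @voted;
}

my @result = vote_tags(
    [ 'the/DT dog/NN barks/VBZ', 'time/NN flies/VBZ' ],
    [ 'the/DT dog/NN barks/NNS', 'time/NN flies/NNS' ],
    [ 'the/DT dog/VB barks/VBZ', 'time/VB flies/VBZ' ],
);
print "$_\n" for @result;
# prints:
# the/DT dog/NN barks/VBZ
# time/NN flies/VBZ
```

The sketch relies on every output containing the same words in the same order, so that a single running index lines up the votes for each word.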

Result
