Project Part 2 LING 572 Fei Xia 1/26/06
NLP Packages
• FST: Carmel, AT&T toolkit
• TBL: fnTBL
• MaxEnt:
• DT: C4.5
• Boosting: AdaBoost
• LM: SRI LM
• MT: GIZA++, Pharaoh, …
Main steps
• Download and compile the package, and test the code with given examples.
    • License, citation
    • Compilers, libraries, operating system
• Create your own test data, write a few wrappers/converters, and test the code.
    • Fix bugs
• Understand the main algorithm of the package:
    • Read README files, tutorials, and related papers
    • Check the source code.
• Modify and improve the package
• Run experiments
Using fnTBL
• Download and compile the package, and test the code: (< 1 hour)
• Create your own test data, write a few wrappers/converters, and test the code: (about 6 hrs, my time)
• Understand the main algorithm of the package: (?? hrs)
• Modify and improve the package: (?? hrs)
• Run experiments: (computer time)
    • 12 experiments
Main tasks
• Understand the code:
    • Core algorithm: fnTBL-1.1/src
    • POS tagger: perl_code/pos-train.prl and pos-apply.prl
    • A wrapper: perl_code/build_TBL_tagger1.pl
• Modify the code:
    • Here you don’t need to change the core algorithm.
    • Add a new way of treating unknown words.
• In Report2, explain the algorithms and your modification.
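The slides leave the unknown-word treatment open, so the following is only a minimal sketch of one generic idea, not the method the assignment asks for and not anything fnTBL does internally: words that are rare in training are mapped to a placeholder, and the same mapping is applied to unseen words at test time. The `UNK` token, the `min_count` threshold, and both function names are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token, not part of fnTBL


def build_vocab(sentences, min_count=2):
    """Return the set of words seen at least min_count times in training."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def replace_unknown(sentence, vocab):
    """Map every out-of-vocabulary word to the UNK placeholder."""
    return [word if word in vocab else UNK for word in sentence]


# Toy data, for illustration only.
train = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
vocab = build_vocab(train, min_count=2)
print(replace_unknown(["the", "bird", "sings"], vocab))
```

A converter like this would sit in a wrapper around the tagger rather than in the core algorithm, which matches the note that the core code does not need to change.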
Main tasks (cont)
• Run the code with different settings
    • Corpus size: 1K, 5K, 10K, 40K
    • Feature templates: all the types or a subset
    • Treatment of unknown words
• Summarize these runs in Report1.
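The corpus-size setting could be prepared with a small helper like the one below. It is a sketch under two assumptions that the slides do not state: the sizes count sentences, and each smaller training set is a prefix of the larger one. The name `make_subsets` is hypothetical.

```python
# Corpus sizes from the slide, interpreted as numbers of sentences (assumption).
SIZES = {"1K": 1000, "5K": 5000, "10K": 10000, "40K": 40000}


def make_subsets(sentences, sizes=SIZES):
    """Return {label: first n sentences} for each requested size."""
    subsets = {}
    for label, n in sizes.items():
        if n > len(sentences):
            raise ValueError(f"corpus has fewer than {n} sentences")
        subsets[label] = sentences[:n]
    return subsets


# Dummy corpus standing in for the real training data.
corpus = [f"sent{i}" for i in range(40000)]
subsets = make_subsets(corpus)
print({label: len(sents) for label, sents in subsets.items()})
```

Using prefixes keeps the smaller sets nested inside the larger ones, so accuracy differences across rows reflect corpus size rather than a different sample.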
Report1

# of     standard        fewer feature    w/ simple treatment
sents    case            types            for unknown words
         (tagger1.pl)    (tagger2.pl)     (tagger3.pl)
=============================================================
1K       a11             a12              a13
5K       a21             a22              a23
10K      a31             a32              a33
40K      a41             a42              a43

Replace each cell with a(b, c, d):
• a: tagging accuracy
• b: # of lexical rules
• c: # of context rules
• d: running time
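One cell of the table could be computed and formatted as sketched below. Tagging accuracy is taken here as the fraction of tokens whose predicted tag matches the gold tag; the function names, the four-decimal format, and reporting running time in seconds are assumptions, and the numbers in the example are dummy values, not results.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("no tokens to score")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def format_cell(accuracy, n_lexical, n_context, seconds):
    """Render one table cell in the a(b, c, d) layout."""
    return f"{accuracy:.4f}({n_lexical}, {n_context}, {seconds:.0f}s)"


# Dummy tags and rule counts, for illustration only.
gold = ["DT", "NN", "VBZ", "NN"]
pred = ["DT", "NN", "VBZ", "JJ"]
acc = tagging_accuracy(gold, pred)
print(format_cell(acc, 10, 20, 93.4))
```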
Files for the project
• Files given to you:
    • fnTBL-1.1.linux.tar.gz
    • params/
    • data/
    • perl_code/
• Files that will be produced by you:
    • new_params/: feature templates
    • new_perl_code/: build_TBL_tagger3.pl, pos-train3.prl and pos-apply3.prl
    • report/: Report1 and Report2
    • result/: a11/, a12/, …, a43/
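The twelve result directories follow the row and column indices of the Report1 table: the first digit is the corpus size and the second is the tagger variant. A minimal sketch of that mapping follows. The function name is hypothetical, and the tagger labels are the shorthand used in the table, not necessarily the real script paths.

```python
from itertools import product

SIZES = ["1K", "5K", "10K", "40K"]                      # table rows
TAGGERS = ["tagger1.pl", "tagger2.pl", "tagger3.pl"]    # table columns


def experiment_grid():
    """List (result_dir, corpus_size, tagger) for all 12 experiments."""
    grid = []
    for (i, size), (j, tagger) in product(enumerate(SIZES, 1),
                                          enumerate(TAGGERS, 1)):
        grid.append((f"result/a{i}{j}", size, tagger))
    return grid


for result_dir, size, tagger in experiment_grid():
    print(result_dir, size, tagger)
```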