Fex Feature Extractor - v2

Topics • Vocabulary • Syntax of scripting language • Feature functions • Operators • Examples • POS tagging • Input Formats

Vocabulary
• example: A list of active records for which Fex produces a single SNoW example. Usually a sentence.
• record: A single position in an example (sentence). It contains a list of fields, each holding a different kind of information, e.g. NLP: word, tag; vision: color; etc.
• Raw input to Fex: A list of valid examples (raw sentences, tagged corpora, etc.).
• Fex’s output: Lexical features are written to the lexicon file. Their corresponding numeric IDs are written to the example file.
• feature function: A relation among one or more records.

Example: Feature Functions

Script Syntax
• A Fex script file contains a list of definitions, each of which rewrites the given observation into a set of active features.
• Definition format (terms in parentheses are optional):
  target (inc) (loc): FeatureFunc ([left, right])
• target - Target index or word. To treat each record in the observation as a target, use -1. This is a macro for “all words”.
• inc - Include the target word instead of the placeholder (*) in some features.
• loc - Generate features with location relative to the target.
• FeatureFunc - A feature function defined in terms of certain unary and n-ary relations, and operators.
• left - Left offset of the scope for generating features. Negative values are left of the target, positive values to the right.
• right - Right offset of the scope.
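The definition format above can be read mechanically. The following is a minimal sketch of a parser for one definition line, written only from the format shown on the slide. The names `Definition` and `parse_definition` are hypothetical and are not part of Fex, and the real Fex parser may accept more than this regular expression does.

```python
import re
from typing import NamedTuple, Optional


class Definition(NamedTuple):
    target: str            # target word or index; "-1" means every record
    include_target: bool   # True if the "inc" flag is present
    with_location: bool    # True if the "loc" flag is present
    feature_func: str      # e.g. "w", "lab(t)", "coloc(w,t)"
    left: Optional[int]    # left offset of the scope, if given
    right: Optional[int]   # right offset of the scope, if given


# Hypothetical pattern for: target (inc) (loc): FeatureFunc ([left, right])
_DEF_RE = re.compile(
    r"^\s*(?P<target>[^\s:]+)"
    r"(?P<flags>(?:\s+(?:inc|loc))*)"
    r"\s*:\s*"
    r"(?P<func>.+?)"
    r"(?:\s*\[\s*(?P<left>-?\d+)\s*,\s*(?P<right>-?\d+)\s*\])?"
    r"\s*$"
)


def parse_definition(line: str) -> Definition:
    """Split one script definition into its parts."""
    match = _DEF_RE.match(line)
    if match is None:
        raise ValueError(f"not a valid definition: {line!r}")
    flags = match.group("flags").split()
    left, right = match.group("left"), match.group("right")
    return Definition(
        target=match.group("target"),
        include_target="inc" in flags,
        with_location="loc" in flags,
        feature_func=match.group("func"),
        left=int(left) if left is not None else None,
        right=int(right) if right is not None else None,
    )


print(parse_definition("-1 loc: t [-2,2]"))
print(parse_definition("dog: w(x=is)"))
```

The scope is optional in this sketch, so a definition such as `-1: lab(t)` parses with `left` and `right` set to `None`.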

Basic Feature Functions
(Columns: Type Def, Fex Notation, Interpretation, Output to Lexicon)
• Label, lab: Produces a label feature. Output to lexicon: lab[target word]; with lab(t): lab[target tag].
• Word, w: Active if the word(s) in the current record is within scope. Output to lexicon: w[current word].
• Tag (pos), t: Active if the tag(s) in the current record is within scope. Output to lexicon: t[current tag].
• Vowel, v: Active if the word(s) in the current record begin with a vowel. Output to lexicon: v[initial vowel].
• Prefix, pre: Active if the word(s) in the current record begin with a prefix in a given list. Output to lexicon: pre[active prefix].
• Suffix, suf: Active if the word(s) in the current record end with a suffix in a given list. Output to lexicon: suf[the active suffix].
• Baseline, base: Active if a baseline tag from a prepared list exists for the word(s) in the current record. Output to lexicon: base[baseline tag].
• Lemma, lem: Active if a lemma from the WordNet database exists for the word(s) in the current record. Output to lexicon: lem[active lemma].

Example
• Sentence = “(DET The) (NN dog) (V is) (JJ mad)”
• Method 1
  Script Def:
    dog: w [-1,1]
    dog: t [1,2]
  Output to lexicon:
    10001 w[The]
    10002 w[is]
    10003 t[V]
    10004 t[JJ]
  Output to example file:
    10001, 10002, 10003, 10004:
• Method 2
  Script Def:
    -1: lab
    -1: w [-1,1]
    -1: t [1,2]
  Output to lexicon:
    10001 w[The]
    10002 w[is]
    10003 t[V]
    10004 t[JJ]
  Output to example file:
    1, 10001, 10002, 10003, 10004:
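The following is a minimal sketch, not Fex itself, that reproduces the lexicon and example-file output of method 1 for the sentence above. It assumes that the target record is skipped when no `inc` flag is given and that lexicon IDs are handed out sequentially from 10001 in order of first appearance. Both assumptions are inferred from the slide's output, and the names `window_features` and `Lexicon` are hypothetical.

```python
sentence = [("DET", "The"), ("NN", "dog"), ("V", "is"), ("JJ", "mad")]


def window_features(records, target_index, field, left, right):
    """Return features such as w[The] for records in [left, right] around the target."""
    column = {"t": 0, "w": 1}[field]  # records are (tag, word) pairs
    features = []
    for offset in range(left, right + 1):
        position = target_index + offset
        # Assumption: the target itself and positions outside the sentence are skipped.
        if offset == 0 or not 0 <= position < len(records):
            continue
        features.append(f"{field}[{records[position][column]}]")
    return features


class Lexicon:
    """Maps each lexical feature to a numeric ID, allocated on first use."""

    def __init__(self, first_id=10001):
        self.ids = {}
        self.next_id = first_id

    def id_for(self, feature):
        if feature not in self.ids:
            self.ids[feature] = self.next_id
            self.next_id += 1
        return self.ids[feature]


target = [word for _, word in sentence].index("dog")
lexicon = Lexicon()

# dog: w [-1,1]  followed by  dog: t [1,2]
features = (window_features(sentence, target, "w", -1, 1)
            + window_features(sentence, target, "t", 1, 2))
example_line = ", ".join(str(lexicon.id_for(f)) for f in features) + ":"

for feature, feature_id in lexicon.ids.items():
    print(feature_id, feature)
print(example_line)
```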

Operators & Complex Functions
• (X) operator - Indicates that a feature is active without any specific instantiation.
  Script Def: dog: v(X) [-1,1]
  Output to Lexicon: 10001 v[]
• (x=y) operator - Creates an active feature iff the active instantiation matches the given argument.
  Script Def: dog: w(x=is)
  Output to Lexicon: 10001 w[is]
• Sentence = “(DET The) (NN dog) (V is) (JJ mad)”

Operators & Complex Functions
• & operator - Conjunction of two features: produces a new feature which is active iff the record fulfills both constituent features.
  Script Def: dog: w&t [-1,-1]
  Output to Lexicon: 10001 w[The]&t[DET]
• | operator - Disjunction of two features: outputs a feature for each term of the disjunction that is active in the current record.
  Script Def: dog: w|t [-1,-1]
  Output to Lexicon: 10001 w[The] 10002 t[DET]
• Sentence = “(DET The) (NN dog) (V is) (JJ mad)”
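The difference between the two operators can be illustrated on the single record to the left of "dog". This is a small sketch under the assumption that a record is a mapping from field name to value. The helper names `basic`, `conjunction` and `disjunction` are hypothetical and only mimic the lexicon strings shown above.

```python
# The record at offset -1 from the target "dog".
record = {"w": "The", "t": "DET"}


def basic(field, rec):
    """Return the basic feature for a field, or nothing if the field is absent."""
    value = rec.get(field)
    return [f"{field}[{value}]"] if value is not None else []


def conjunction(field_a, field_b, rec):
    """One combined feature, active only if both constituent features are active."""
    return [f"{a}&{b}" for a in basic(field_a, rec) for b in basic(field_b, rec)]


def disjunction(field_a, field_b, rec):
    """One separate feature for each term that is active in the record."""
    return basic(field_a, rec) + basic(field_b, rec)


print(conjunction("w", "t", record))  # features for  dog: w&t [-1,-1]
print(disjunction("w", "t", record))  # features for  dog: w|t [-1,-1]
```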

Operators & Complex Functions
• coloc function - Consecutive feature function: takes two or more features as arguments to produce a consecutive collocation over two or more records. The order of the arguments is preserved in the active feature.
  Script Def: mad: coloc(w, t) [-3,-1]
  Output to Lexicon: 10001 w[The]-t[NN] 10002 w[dog]-t[V]
• scoloc function - Sparse consecutive feature function: operates similarly to coloc, except that active collocations need not be consecutive. However, the order of the arguments is still preserved in determining whether a feature is active.
  Script Def: mad: scoloc(w,t) [-3,-1]
  Output to Lexicon: 10001 w[The]-t[NN] 10002 w[dog]-t[V] 10003 w[The]-t[V]
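The two collocation functions differ only in which record tuples they consider. The sketch below assumes that coloc slides a window of adjacent records over the scope, while scoloc takes every order-preserving selection of records from the scope. That reading matches the lexicon output on this slide, but the code is an illustration with hypothetical names, not the Fex implementation, and it does not reproduce the order in which Fex assigns IDs.

```python
from itertools import combinations

sentence = [
    {"t": "DET", "w": "The"},
    {"t": "NN", "w": "dog"},
    {"t": "V", "w": "is"},
    {"t": "JJ", "w": "mad"},
]


def scope(records, target_index, left, right):
    """Records whose offset from the target lies in [left, right]."""
    lo = max(0, target_index + left)
    hi = min(len(records) - 1, target_index + right)
    return records[lo:hi + 1]


def render(fields, chosen):
    """Join one feature per record, keeping the argument order."""
    return "-".join(f"{field}[{rec[field]}]" for field, rec in zip(fields, chosen))


def coloc(fields, records):
    """Collocations over adjacent records only."""
    n = len(fields)
    return [render(fields, records[i:i + n]) for i in range(len(records) - n + 1)]


def scoloc(fields, records):
    """Collocations over any records in the scope, in their original order."""
    return [render(fields, chosen) for chosen in combinations(records, len(fields))]


target = 3  # "mad"
window = scope(sentence, target, -3, -1)
print(coloc(["w", "t"], window))
print(scoloc(["w", "t"], window))
```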

Example: POS tagging
• Useful features for POS tagging:
  • The preceding word is tagged c.
  • The following word is tagged c.
  • The word two before is tagged c.
  • The word two after is tagged c.
  • The preceding word is tagged c and the following word is tagged t.
  • The preceding word is tagged c and the word two before is tagged t.
  • The following word is tagged c and the word two after is tagged t.
  • The current word is w.
  • The most probable part of speech for the current word is c.

Given the sentence:
• (t1 The) (t2 dog) (t3 ran) (t4 very) (t5 quickly)
• The following Fex script will produce the features listed on the previous slide:
  -1: lab(t)
  -1 loc: t [-2,2]
  -1: coloc(t,t,t) [-2,2]
  -1 inc: w[0,0]
  -1: base[0,0]
• To do POS tagging, an example needs to be generated for each word in the observation.

For the third word, “ran”, the script produces the following output:
• Script: -1: lab(t)
  Lexicon output: 1 lab[t3]
• Script: -1 loc: t [-2,2]
  Lexicon output: 10001 t[t1_*] 10002 t[t2*] 10003 t[*t4] 10004 t[*_t5]
• Script: -1: coloc(t,t,t) [-2,2]
  Lexicon output: 10005 t[t1]-t[t2]-* 10006 t[t2]-*-t[t4] 10007 *-t[t4]-t[t5]
• Script: -1 inc: w [0,0]
  Lexicon output: 10008 w[ran]
• Script: -1: base [0,0]
  Lexicon output: 10009 base[V]
• And an example in the example file:
  1, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009:
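The location and collocation strings above follow a visible pattern: `*` stands for the target and each `_` marks one skipped position between a tag and the target. The sketch below reconstructs that pattern for the word "ran". The formatting rule is inferred from the slide's output only, and the function names are hypothetical.

```python
tags = ["t1", "t2", "t3", "t4", "t5"]  # tags of: The dog ran very quickly


def loc_feature(tags, target, offset):
    """Tag feature with its location relative to the target, marked by '*'."""
    gap = "_" * (abs(offset) - 1)  # one '_' per position skipped
    tag = tags[target + offset]
    return f"t[{tag}{gap}*]" if offset < 0 else f"t[*{gap}{tag}]"


def loc_features(tags, target, left, right):
    """Location features for every in-range offset except the target itself."""
    return [
        loc_feature(tags, target, offset)
        for offset in range(left, right + 1)
        if offset != 0 and 0 <= target + offset < len(tags)
    ]


def tag_trigrams(tags, target, left, right):
    """coloc(t,t,t): consecutive tag triples in scope, target replaced by '*'."""
    lo = max(0, target + left)
    hi = min(len(tags) - 1, target + right)
    features = []
    for start in range(lo, hi - 1):  # start + 2 must not exceed hi
        parts = ["*" if i == target else f"t[{tags[i]}]"
                 for i in range(start, start + 3)]
        features.append("-".join(parts))
    return features


target = 2  # "ran"
print(loc_features(tags, target, -2, 2))
print(tag_trigrams(tags, target, -2, 2))
```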

Input Formats
• Fex can presently accept data in two formats:
  • w1 w2 w3 w4 …
  • (t1 w1) (t2 w2) (t3 w3) (t4 w4) …
  • w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a) …
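A reader for these input lines could look like the sketch below. It assumes that a bare token is a word without a tag, that a parenthesized group holds a tag followed by a word, and that a semicolon inside a group separates a list of tags from a list of words. The semicolon reading is a guess based on the third line above, not documented behavior, and `parse_line` is a hypothetical name.

```python
import re

# Either a parenthesized group or a bare token.
_TOKEN_RE = re.compile(r"\(([^()]*)\)|(\S+)")


def parse_line(line):
    """Turn one input line into a list of records with 'tags' and 'words'."""
    records = []
    for match in _TOKEN_RE.finditer(line):
        group, bare = match.group(1), match.group(2)
        if bare is not None:
            # Untagged word, as in: w1 w2 w3
            records.append({"tags": [], "words": [bare]})
        elif ";" in group:
            # Assumption: "tags; words", as in: (t3 t3a; w3)
            tag_part, word_part = group.split(";", 1)
            records.append({"tags": tag_part.split(), "words": word_part.split()})
        else:
            # Tag followed by word, as in: (t2 w2)
            fields = group.split()
            records.append({"tags": fields[:1], "words": fields[1:]})
    return records


for record in parse_line("w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a)"):
    print(record)
```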

Using Fex (command line)
fex [options] script-file lexicon-file corpus-file example-file
Options:
• -t: target file
  • The file must not contain any empty lines.
  • Each target goes on a separate line.
• -r: test mode
  • Does not create new features.
• -h, -I
  • Creates a histogram of active features.

Using Fex (command line)
• Target file = targ:
  dog
  cat
• Script file = script:
  -1 : lab
  -1 : w [-1,-1]
  -1 : t [-1,-1]
• Corpus file = corpus:
  (DET The) (NN dog) (V is) (JJ mad)
• Lexicon file = lexicon
• Example file = example
• Command:
  fex -t targ script lexicon corpus example

SNoW

Word representation

Restrictions on the learning approach
• Multi-class
• Variable number of features
  • per class
  • per example
• Efficient learning
• Efficient evaluation

SNoW
• Network of threshold gates.
• Target nodes represent class labels.
• Input nodes (features) and links are allocated in a data-driven way (on the order of 10^5 input features for many target nodes).
• Each sub-network (target node) is learned autonomously as a function of the features.
• A presented example is positive for one network and negative for the others (depends on the algorithm).
• Allocation of nodes (features) and links is data-driven: a link between feature fi and target tj is created only when fi was active with target tj.
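The data-driven allocation described above can be sketched with a dictionary of per-target weight tables. The class below is an illustrative toy, not the SNoW implementation: it uses a Winnow-style multiplicative update, and its parameter values (`alpha`, `beta`, `threshold`, `init_weight`) are arbitrary assumptions chosen for the example.

```python
class SparseWinnowNetwork:
    """Toy sparse network: one threshold gate per target, links created on demand."""

    def __init__(self, alpha=1.5, beta=0.5, threshold=1.0, init_weight=1.0):
        self.alpha = alpha              # promotion factor (assumed value)
        self.beta = beta                # demotion factor (assumed value)
        self.threshold = threshold      # gate threshold (assumed value)
        self.init_weight = init_weight  # weight of a newly allocated link
        self.weights = {}               # target -> {feature: weight}

    def activation(self, target, features):
        links = self.weights.get(target, {})
        return sum(links.get(feature, 0.0) for feature in features)

    def train(self, label, features):
        # Allocate a link only when the feature is active together with this target.
        own_links = self.weights.setdefault(label, {})
        for feature in features:
            own_links.setdefault(feature, self.init_weight)

        # Each target sub-network is updated independently of the others.
        for target, links in self.weights.items():
            predicted = self.activation(target, features) >= self.threshold
            positive = target == label
            if positive and not predicted:
                factor = self.alpha     # promote on a missed positive
            elif predicted and not positive:
                factor = self.beta      # demote on a false positive
            else:
                continue
            for feature in features:
                if feature in links:
                    links[feature] *= factor


net = SparseWinnowNetwork()
net.train(1, [10001, 10002])
net.train(2, [10002, 10003])
print(net.weights)
```

After the second call, target 1 has no link to feature 10003, because that feature was never active in an example labeled 1.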

Word prediction using SNoW
• Target nodes: each word in the set of candidate words is a target node.
• Input nodes: an input node for feature fi is allocated only if fi was active with some target.
• Decision task: we need to choose one target among all possible candidates.
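The decision task amounts to comparing the activations of the candidate target nodes and picking the largest. The sketch below shows that step in isolation. The weights, feature strings and candidate words are made-up illustrative values, and `predict` is a hypothetical helper rather than a SNoW function.

```python
# Hypothetical learned weights: candidate word -> {feature: weight}.
weights = {
    "their": {"w[house]": 1.2, "t[NN]": 0.8},
    "there": {"w[is]": 1.5, "t[V]": 0.7},
}


def activation(target, active_features):
    """Sum the weights of the active features linked to this target."""
    links = weights.get(target, {})
    return sum(links.get(feature, 0.0) for feature in active_features)


def predict(candidates, active_features):
    """Choose the candidate whose target node has the highest activation."""
    return max(candidates, key=lambda target: activation(target, active_features))


print(predict(["their", "there"], ["w[is]", "t[V]"]))
```

Restricting the comparison to the candidate set, rather than to all target nodes, is what makes this a choice among confusable words.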

SNoW (command line)
snow -train -I inputfile -F networkfile [-ABcdePrsTvW]
snow -test -I inputfile -F networkfile [-bEloRvw]
Architecture:
• Winnow: -W [, , , init weight] :targets
• Perceptron: -P [, , init weight] :targets
• NB: -B :targets

SNoW parameters (testing)
• -b <k> : smoothing for NB
• -w <k> : smoothing for W, P
Output modes:
• -E : error file
• -o < accuracy | winners | allpred | allact | allboth > : details for the output
• -R : results file (stdout)

File Format (Example file)
6, 10034, 10141, 10151, 10158, 10179:
177, 10034, 10035, 10047:
With weights:
6, 10034(1), 10141(1.5), 10151(0.4), 10158(2), 10179(0.1):
177, 10034(2), 10035(4), 10047(0.6):
Only active features appear in an example.
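Both variants of the example line can be read with the same small routine. The sketch below assumes that a feature written without a parenthesized weight has weight 1.0, which is an assumption of this example rather than something stated on the slide. `parse_example` is a hypothetical helper name.

```python
import re

# A feature ID, optionally followed by a weight in parentheses.
_FEATURE_RE = re.compile(r"^(\d+)(?:\(([^)]+)\))?$")


def parse_example(line):
    """Parse one example line into {feature_id: weight}, in file order."""
    body = line.strip()
    if body.endswith(":"):
        body = body[:-1]  # drop the terminating colon
    features = {}
    for field in body.split(","):
        field = field.strip()
        if not field:
            continue
        match = _FEATURE_RE.match(field)
        if match is None:
            raise ValueError(f"malformed feature: {field!r}")
        weight = match.group(2)
        # Assumption: a missing weight means 1.0.
        features[int(match.group(1))] = float(weight) if weight else 1.0
    return features


print(parse_example("6, 10034, 10141, 10151, 10158, 10179:"))
print(parse_example("177, 10034(2), 10035(4), 10047(0.6):"))
```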

File Format (Network file)
NB:
target 111 0 1 135 1 naivebayes 0 0.1 0.5
111 : 0 : 10020 : 4 0 -3.518980417
111 : 0 : 10021 : 1 0 -4.905274778
Winnow:
target 111 1 1 135 1562 winnow 0 1.1 0.9 15 1
111 : 0 : 10020 : 4 1 1.1
111 : 0 : 10021 : 1 0 1
Perceptron:
target 111 2 1 2701 perceptron 0 0.1 4 0.2
111 : 0 : 10020 : 4 1 0.3
111 : 0 : 10021 : 1 0 0.2

File Format (Error file)
Algorithms: Perceptron: (1, 30, 0.05)
Targets: 3, 53, 73
Ex: 8 Prediction: 3 Label: 53
  3: 0.5866
  53: 0.2592*
  73: 0.1192
Ex: 15 Prediction: 3 Label: 73
  3: 0.5987
  73: 0.001229*
  53: 0.0002248
