Matching References to Headers in PDF Papers • Tan Yee Fan • 2007 December 19 • WING Group Meeting
Task • Corpus • The ACL Anthology contains a collection of PDF papers • Task • For each paper P, which papers are cited by P? • Gold standard data obtained from Dragomir • e.g., P00-1002 ==> P98-2144
Header and References • Header of paper (HeaderParse) • Paper title, author names, etc. • Reference section (ParsCit) • Paper title, author names, publication venue, etc. • Each header and each reference is seen as a record • Title, authors, venue
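The parsed headers and references share the same three-field structure. Below is a minimal sketch of how such a record might be represented in Java; the class and field names are illustrative, not the actual output format of HeaderParse or ParsCit.

// Illustrative record type; not the actual HeaderParse/ParsCit output format.
public class PaperRecord {
    public final String title;
    public final String authors;
    public final String venue;

    public PaperRecord(String title, String authors, String venue) {
        this.title = title;
        this.authors = authors;
        this.venue = venue;
    }

    // All fields concatenated into a single string, as used later for indexing.
    public String allFields() {
        return title + " " + authors + " " + venue;
    }
}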
System Overview • Indexing: the header records, with all fields concatenated into a single string, are added to a Lucene index for token matching • Querying: each reference record is issued as a query against the Lucene index, using OR matching (the default in Lucene) • The returned header records are passed, together with the reference record, to the matching algorithm
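A minimal sketch of this indexing and querying step, assuming a recent Lucene release (the 2007 system would have used an older API) and the illustrative PaperRecord class above; the field name "content" and the sample records are made up.

import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class HeaderIndex {
    public static void main(String[] args) throws Exception {
        List<PaperRecord> headers = List.of(
            new PaperRecord("Example Paper Title", "A. Author", "ACL 2000"));
        PaperRecord reference =
            new PaperRecord("Example paper title", "A Author", "Proc. ACL");

        // Index each header record as one document, all fields in a single text field.
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (PaperRecord h : headers) {
                Document doc = new Document();
                doc.add(new TextField("content", h.allFields(), Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Query with the reference record; QueryParser defaults to OR matching.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        ScoreDoc[] hits = searcher.search(
            parser.parse(QueryParser.escape(reference.allFields())), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("content"));  // candidate headers
        }
    }
}

The returned candidate headers are then paired with the reference record and handed to the matching algorithm described next.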
Record Matching • A header record and a reference record each carry TITLE, AUTHOR, and VENUE fields • Every header-reference pair forms an instance with nine features: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN, TITLE_SIM, AUTHOR_SIM, VENUE_SIM • The classification target is MATCH/MISMATCH
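A sketch of how the nine features for one header-reference pair could be computed. The slide does not say whether the length features count characters or tokens, nor which similarity measure is used; character lengths and token Jaccard overlap are used here purely as placeholders.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PairFeatures {
    // Nine features for a header-reference pair, in the order listed above.
    static double[] features(PaperRecord header, PaperRecord reference) {
        return new double[] {
            Math.min(header.title.length(), reference.title.length()),     // TITLE_MIN_LEN
            Math.max(header.title.length(), reference.title.length()),     // TITLE_MAX_LEN
            Math.min(header.authors.length(), reference.authors.length()), // AUTHOR_MIN_LEN
            Math.max(header.authors.length(), reference.authors.length()), // AUTHOR_MAX_LEN
            Math.min(header.venue.length(), reference.venue.length()),     // VENUE_MIN_LEN
            Math.max(header.venue.length(), reference.venue.length()),     // VENUE_MAX_LEN
            similarity(header.title, reference.title),                     // TITLE_SIM
            similarity(header.authors, reference.authors),                 // AUTHOR_SIM
            similarity(header.venue, reference.venue)                      // VENUE_SIM
        };
    }

    // Placeholder similarity: Jaccard overlap of lower-cased whitespace tokens.
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb);  // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }
}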
Experiment Setup • Data • Reference records: papers divided into training set and test set (50% each) • Header records: same set of papers used for training and testing • Learning algorithm • SMO in Weka (an SVM implementation)
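A minimal sketch of training and evaluating Weka's SMO on these instances; the ARFF file names are hypothetical, and each instance is assumed to hold the nine features above with the MATCH/MISMATCH class as the last attribute.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainMatcher {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("match_train.arff");  // hypothetical file
        Instances test  = DataSource.read("match_test.arff");   // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);          // MATCH/MISMATCH label
        test.setClassIndex(test.numAttributes() - 1);

        SMO smo = new SMO();             // Weka's sequential minimal optimization SVM
        smo.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);
        System.out.println(eval.toSummaryString());
    }
}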
Bootstrapping the Training Data • Problem • Gold standard data specifies mappings at the paper-to-paper level, but not which reference in the citing paper corresponds to the cited paper • Solution • Hand-labeled a small set of reference-header pairs from 6 papers • Train an SVM on this small bootstrap set • On the training set, if the gold standard specifies P1 -> P2, then use this SVM to classify the reference-header pairs between P1 and P2 • Retrain the SVM on the reference-header pairs from the training and bootstrap sets combined
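A sketch of this bootstrapping loop, assuming Weka's SMO, Instances, and Instance classes; PaperPair, pairsFor(), and the gold-pair list are hypothetical placeholders for the paper-level gold mappings and for the candidate reference-header pairs between two papers.

// Sketch only: PaperPair and pairsFor() are hypothetical helpers.
static SMO bootstrap(Instances bootstrapSet, List<PaperPair> goldPairs) throws Exception {
    SMO seed = new SMO();
    seed.buildClassifier(bootstrapSet);                 // train on the hand-labeled pairs

    Instances expanded = new Instances(bootstrapSet);   // copy of the bootstrap set
    for (PaperPair gold : goldPairs) {                  // gold standard says P1 -> P2
        for (Instance pair : pairsFor(gold.citing, gold.cited)) {
            pair.setClassValue(seed.classifyInstance(pair));  // let the seed SVM label the pair
            expanded.add(pair);
        }
    }

    SMO matcher = new SMO();
    matcher.buildClassifier(expanded);                  // retrain on the combined data
    return matcher;
}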
Experimental Results • Used the ACL subset (2176 PDF papers) • Skipped: 142 reference sections, 202 paper headers • If the classifier decides that a reference in P1 matches the header of P2, then P1 -> P2 • Results (on paper-to-paper mappings) • P = 0.901, R = 0.696, F = 0.785 • P = 0.898, R = 0.767, F = 0.827 (with manually cleaned header records)
Cost-utility Framework • [Diagram: records r1..r6 by features f1..f5; each cell is either a known value or a value that can be acquired] • Acquiring feature fi incurs cost ci and yields utility ui
Record Matching (with acquisition costs) • Each header-reference instance has [1] given information and [2] information that can be acquired at a cost • Given information [1]: the MIN_LEN and MAX_LEN features • Acquirable at a cost [2]: the TITLE_SIM, AUTHOR_SIM, and VENUE_SIM features • Training data: assume all feature values and their acquisition costs are known • Testing data: assume [1] is known, but the feature values and acquisition costs in [2] are unknown • Costs: set to MIN_LEN * MAX_LEN • The class label remains MATCH/MISMATCH
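The MIN_LEN * MAX_LEN cost presumably reflects the work of computing a pairwise string similarity, which typically grows with the product of the two string lengths. A trivial illustration; the method name is made up.

// Cost of acquiring one similarity feature, e.g. TITLE_SIM, from the
// corresponding length features, which are part of the given information [1].
static double acquisitionCost(double minLen, double maxLen) {
    return minLen * maxLen;   // e.g. TITLE_MIN_LEN * TITLE_MAX_LEN
}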
Costs and Utilities • Costs • Trained 3 models (using M5'), treating cost prediction as regression • Utilities • Trained 2^3 = 8 classifiers, each predicting match/mismatch using only a particular subset of the known feature values • For a test instance with a missing feature value F • Get the confidence of the appropriate classifier without F • Get the expected confidence of the appropriate classifier with F • The utility is the difference between the two confidence scores • Note • Similar to the approach of Saar-Tsechansky et al.
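A sketch of the utility estimate for one missing feature value F, assuming Weka classifiers keyed by the subset of SIM features they were trained on; estimateExpectedConfidence() is a hypothetical placeholder for the expectation over F's possible values, and each instance is assumed to be projected onto the relevant classifier's feature set.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import weka.classifiers.functions.SMO;
import weka.core.Instance;

public class UtilityEstimate {
    // classifiers maps each subset of known SIM features to the classifier
    // trained on exactly those features (2^3 = 8 classifiers in total).
    static double utility(Instance x, Set<String> knownSims, String f,
                          Map<Set<String>, SMO> classifiers) throws Exception {
        double confWithout = maxProb(classifiers.get(knownSims).distributionForInstance(x));

        Set<String> withF = new HashSet<>(knownSims);
        withF.add(f);
        double confWith = estimateExpectedConfidence(classifiers.get(withF), x, f);

        return confWith - confWithout;   // utility of acquiring F
    }

    // Confidence = probability assigned to the predicted (most likely) class.
    static double maxProb(double[] dist) {
        double m = 0.0;
        for (double p : dist) m = Math.max(m, p);
        return m;
    }

    // Hypothetical placeholder: expected confidence once F has been acquired.
    static double estimateExpectedConfidence(SMO clf, Instance x, String f) throws Exception {
        return maxProb(clf.distributionForInstance(x));  // crude stand-in for the expectation
    }
}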
Results • [Plots] Matching performance as an increasing proportion of feature values is acquired, shown without cleaning of header records and with manual cleaning of header records