
Matching References to Headers in PDF Papers



Presentation Transcript


  1. Matching References to Headers in PDF Papers Tan Yee Fan 2007 December 19 WING Group Meeting

  2. Task • Corpus • ACL Anthology contains a collection of PDF papers • Task • For each paper P, which papers are cited by P? • Gold standard data obtained from Dragomir • e.g., P00-1002 ==> P98-2144

  3. Header and References • Header of paper (HeaderParse) • Paper title, author names, etc. • Reference section (ParsCit) • Paper title, author names, publication venue, etc. • Each header and each reference is seen as a record • Title, authors, venue

  4. System Overview • Header records → Indexing: all fields concatenated into a single string, perform token matching → Lucene index • Reference record → Querying: OR matching (default in Lucene) → Returned headers → Matching algorithm
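The indexing and querying steps above can be sketched with a toy inverted index. This is an illustrative stand-in, not Lucene itself: tokenization is plain lowercased whitespace splitting, and the record contents are hypothetical (only the paper IDs P00-1002 and P98-2144 come from the slides).

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; Lucene's analyzers are more elaborate.
    return text.lower().split()

def build_index(header_records):
    # Concatenate all fields of each header record into a single string
    # and index its tokens (token -> set of record ids).
    index = defaultdict(set)
    for rec_id, fields in header_records.items():
        for token in tokenize(" ".join(fields)):
            index[token].add(rec_id)
    return index

def query(index, reference_fields):
    # OR matching: return every header record sharing at least one token
    # with the reference record, ranked by number of shared tokens.
    hits = defaultdict(int)
    for token in set(tokenize(" ".join(reference_fields))):
        for rec_id in index.get(token, ()):
            hits[rec_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

# Hypothetical header records keyed by Anthology ID.
headers = {
    "P98-2144": ["Some Parsing Paper", "A. Author", "ACL"],
    "P00-1002": ["Another Topic", "B. Writer", "ACL"],
}
index = build_index(headers)
print(query(index, ["Some Parsing Paper", "A. Author"]))  # → ['P98-2144']
```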

  5. Record Matching • Header record: TITLE, AUTHOR, VENUE • Reference record: TITLE, AUTHOR, VENUE • Header-reference pair (instance): TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN, TITLE_SIM, AUTHOR_SIM, VENUE_SIM → MATCH/MISMATCH?
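Feature extraction for one header-reference pair can be sketched as below. The slides do not say which similarity measure the *_SIM features use, so token Jaccard here is an assumption, and the MIN_LEN/MAX_LEN features are interpreted as token counts (also an assumption); the example records are made up.

```python
def token_jaccard(a, b):
    # Hypothetical similarity: the slides do not specify the *_SIM metric.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pair_features(header, reference):
    # One instance per header-reference pair, as on slide 5:
    # per-field MIN_LEN, MAX_LEN (token counts, assumed) and a similarity.
    feats = {}
    for field in ("TITLE", "AUTHOR", "VENUE"):
        h, r = header[field], reference[field]
        lh, lr = len(h.split()), len(r.split())
        feats[f"{field}_MIN_LEN"] = min(lh, lr)
        feats[f"{field}_MAX_LEN"] = max(lh, lr)
        feats[f"{field}_SIM"] = token_jaccard(h, r)
    return feats

# Hypothetical records.
header = {"TITLE": "Matching References to Headers",
          "AUTHOR": "Tan Yee Fan", "VENUE": "WING"}
reference = {"TITLE": "matching references to headers",
             "AUTHOR": "Y. F. Tan", "VENUE": "WING meeting"}
print(pair_features(header, reference))
```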

  6. Experiment Setup • Data • Reference records: papers divided into training set and test set (50% each) • Header records: same set of papers used for training and testing • Learning algorithm • SMO in Weka (an SVM implementation)

  7. Bootstrapping the Training Data • Problem • Gold standard data specifies mappings at the paper-to-paper level, but not which individual reference matches which header • Solution • Hand-labeled a small set of reference-header pairs from 6 papers • Train an SVM on this small bootstrap set • On the training set, if the gold standard specifies P1 -> P2, then use the SVM to classify the reference-header pairs of P1 and P2 • Retrain the SVM using the reference-header pairs combined from the training and bootstrap sets
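The label-expansion step above can be sketched as follows. The real system uses an SVM trained on the hand-labeled bootstrap set; the exact-title-match classifier here is only a stub, and all record contents are hypothetical.

```python
def expand_training_pairs(gold_map, references, headers, classifier):
    # For each gold mapping P1 -> P2, let the (bootstrap-trained) classifier
    # decide which individual reference in P1 matches the header of P2;
    # the gold standard itself only links papers, not references.
    labeled = []
    for p1, p2 in gold_map:
        header = headers[p2]
        for ref in references[p1]:
            labeled.append((ref, header, classifier(ref, header)))
    return labeled

# Stub classifier for illustration: exact (case-insensitive) title match.
def stub_classifier(ref, header):
    return ref["TITLE"].lower() == header["TITLE"].lower()

gold = [("P00-1002", "P98-2144")]
refs = {"P00-1002": [{"TITLE": "Some Parsing Paper"},
                     {"TITLE": "Unrelated Work"}]}
hdrs = {"P98-2144": {"TITLE": "some parsing paper"}}
for ref, hdr, label in expand_training_pairs(gold, refs, hdrs, stub_classifier):
    print(ref["TITLE"], "->", label)
```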

  8. Experimental Result • Used the ACL subset (2176 PDF papers) • Skipped: 142 reference sections, 202 paper headers • If classifier considers a reference in P1 matches header of P2, then P1 -> P2 • Results (on paper to paper mappings) • P = 0.901, R = 0.696, F = 0.785 • P = 0.898, R = 0.767, F = 0.827 (with manually cleaned header records)
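The reported F-scores follow from the stated precision and recall via the standard harmonic mean, F = 2PR/(P+R), which can be checked directly:

```python
def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Check the two F-scores reported on slide 8.
print(round(f1(0.901, 0.696), 3))  # → 0.785
print(round(f1(0.898, 0.767), 3))  # → 0.827
```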

  9. Cost-Utility Framework • [diagram: features f1–f5 with costs c1–c5 and utilities u1–u5, over r1–r6] • Legend: cost of acquiring fi; utility of acquiring fi; feature fi; known value; value that can be acquired

  10. Record Matching • [1] Given information • [2] Information that can be acquired at a cost • Training data: assume all feature-values and their acquisition costs are known • Testing data: assume [1] is known, but the feature-values and their acquisition costs in [2] are unknown • Costs: set to MIN_LEN * MAX_LEN • Header-reference pair (instance): TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN, TITLE_SIM, AUTHOR_SIM, VENUE_SIM → MATCH/MISMATCH?

  11. Costs and Utilities • Costs • Trained 3 models (using M5’), treated as a regression problem • Utilities • Trained 2^3 = 8 classifiers (each predicts match/mismatch using only the known feature-values) • For a test instance with a missing feature-value F • Get the confidence of the appropriate classifier without F • Get the expected confidence of the appropriate classifier with F • Utility is the difference between the two confidence scores • Note • Similar to Saar-Tsechansky et al.
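The utility computation on this slide can be sketched as an expected-confidence difference. Everything below is an illustrative stand-in: the slides train 2^3 classifiers (one per subset of known features) and use SVM confidences, whereas here the two classifiers are stub lambdas and the distribution over the missing feature-value is made up.

```python
def utility(instance, feature, clf_without, clf_with, expected_values):
    # Utility of acquiring `feature` = expected confidence of the classifier
    # that uses it, minus the confidence of the classifier that does not.
    conf_without = clf_without(instance)
    conf_with = sum(p * clf_with({**instance, feature: v})
                    for v, p in expected_values)
    return conf_with - conf_without

# Stub classifiers standing in for the trained models.
clf_without = lambda inst: 0.6
clf_with = lambda inst: 0.9 if inst["TITLE_SIM"] > 0.5 else 0.7

inst = {"AUTHOR_SIM": 0.8}  # TITLE_SIM not yet acquired
# Hypothetical (value, probability) distribution for the missing TITLE_SIM.
dist = [(0.9, 0.5), (0.1, 0.5)]
print(round(utility(inst, "TITLE_SIM", clf_without, clf_with, dist), 2))  # → 0.2
```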

  12. Results • [two plots: without cleaning of header records vs. with manual cleaning of header records; x-axis: increasing proportion of feature-values acquired]

  13. Thank You
