Download Estimation for KDD Cup 2003 • Janez Brank and Jure Leskovec • Jožef Stefan Institute, Ljubljana, Slovenia
Task Description • Inputs: • Approx. 29000 papers from the “high energy physics – theory” area of arxiv.org • For each paper: • Full text (TeX file, often very messy) • Metadata in a nice, structured file (authors, title, abstract, journal, subject classes) • The citation graph (excludes citations pointing outside our dataset)
Task Description • Inputs (continued): • For papers from a 6-month period (the training set, 1566 papers): the number of times each paper was downloaded during its first two months in the archive • Problem: • For papers from a 3-month period (the test set, 678 papers), predict the number of downloads in their first two months in the archive • Only the 50 most frequently downloaded papers from each month will be used for evaluation!
Our Approach • Textual documents have traditionally been treated as “bags of words” • The number of occurrences of each word matters, but the order of the words is ignored • Efficiently represented by sparse vectors • We extend this to include other items besides words (“bag of X”) • Most of our work was spent trying various features and adjusting their weights (more on that later) • We use support vector regression to train a linear model, which is then used to predict the download counts of test papers
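The whole pipeline fits in a few lines. A minimal sketch, assuming scikit-learn as a stand-in for the SVM implementation actually used in 2003; the toy texts, indegree values, and download counts are made up:

```python
# "Bag of X" features (words plus other items) fed to linear SV regression.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

texts = ["the holographic principle and entropy bounds",
         "brane new world scenarios in string theory"]
indegree = [[12.0], [3.0]]          # one extra "X" item per paper
downloads = [2927.0, 1351.0]        # training targets

words = CountVectorizer().fit_transform(texts)   # bag of words (sparse)
X = hstack([words, indegree])                    # extend to a "bag of X"
model = LinearSVR(C=1.0).fit(X, downloads)       # linear model via SV regression
```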
A Few Initial Observations • Our predictions will be evaluated on the 50 most downloaded papers from each month (about 20% of all papers from these months) • It’s OK to be horribly wrong on other papers • Thus we should be optimistic, treating every paper as if it were in the top 20% • Maybe we should train the model using only the 20% most downloaded training papers • Actually, 30% usually works a little better • To evaluate a model, we look at the 20% most downloaded test papers
Cross-Validation • Split the 1566 labeled papers into 10 folds • In each round, train on 9 folds (approx. 1409 papers), restricted to their 30% most frequently downloaded papers (approx. 423) • Evaluate on the remaining fold (approx. 157 papers), restricted to its 20% most frequently downloaded papers (approx. 31) • Lather, rinse, repeat (10 times); report the average error
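A rough sketch of this protocol, assuming NumPy and scikit-learn; `train_model` and `avg_error` are hypothetical helpers standing in for the learner and the error measure:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, n_folds=10):
    """y: NumPy array of download counts; X: any row-indexable feature matrix."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
        # Train only on the ~30% most downloaded papers in the training folds.
        tr = train_idx[np.argsort(y[train_idx])[-int(0.3 * len(train_idx)):]]
        # Evaluate only on the ~20% most downloaded papers in the held-out fold.
        te = test_idx[np.argsort(y[test_idx])[-int(0.2 * len(test_idx)):]]
        model = train_model(X[tr], y[tr])                 # hypothetical helper
        errors.append(avg_error(model, X[te], y[te]))     # hypothetical helper
    return np.mean(errors)
```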
A Few Initial Observations • We are interested in the downloads within 60 days of inclusion in the archive • Most of the downloads occur within the first few days, perhaps a week • Most are probably coming from the “What’s new” page, which contains only: • Author names • Institution name (rarely) • Title • Abstract • Citations probably don’t directly influence downloads in the first 60 days • But they show which papers are good, and the readers perhaps sense this in some other way from the authors / title / abstract
The Rock Bottom • The trivial model: always predict the average download count (computed on the training data) • Average download count: 384.2 • Average error: 152.5 downloads
Abstract • Abstract: use the text of the abstract and title of the paper in the traditional bag-of-words style • 19912 features • No further feature selection etc. • This part of the vector was normalized to unit length (Euclidean norm = 1) • Average error: 149.4
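A sketch of this feature block, assuming scikit-learn; the toy abstracts are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

abstracts = ["we study the holographic principle in de sitter space",
             "a brane new world scenario is proposed"]
bow = CountVectorizer().fit_transform(abstracts)   # raw term counts (sparse)
bow = normalize(bow, norm="l2")                    # Euclidean norm = 1 per paper
```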
Author • One attribute for each possible author • Preprocessing to tidy up the original metadata: “Y.S. Myung and Gungwon Kang” → myung-y, kang-g • The feature xa is nonzero iff a is one of the authors of paper x • This part is normalized to unit length • 5716 features • Average error: 146.4
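The exact normalization rules are not given on the slide; a plausible guess at the mapping illustrated above (“Y.S. Myung” → “myung-y”):

```python
import re

def author_key(name):
    # Keep only alphabetic tokens, e.g. "Y.S. Myung" -> ["Y", "S", "Myung"].
    parts = re.findall(r"[A-Za-z]+", name)
    surname, first = parts[-1], parts[0]
    return f"{surname.lower()}-{first[0].lower()}"

print(author_key("Y.S. Myung"))    # myung-y
print(author_key("Gungwon Kang"))  # kang-g
```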
Address • Intuition: people are more likely to download a paper if the authors are from a reputable institution • Admittedly, the “What’s new” page usually doesn’t mention the institution • Nor is it provided in the metadata, so we had to extract it from the TeX files (messy!) • Words from the address are represented using the bag-of-words model • But they get their own namespace, separate from the abstract and title words • This part of the vector is also normalized to unit length • Average error: 154.0 (→ worse than useless)
Abstract, Author, Address • We used Author + Abstract (“AA” for short) as the baseline for adding new features
Using the Citation Graph • InDegree, OutDegree • These are quite large in comparison to the text-based features (average indegree ≈ 10) • We must use weighting, otherwise they will appear too important to the learner • InDegree is useful • OutDegree is largely useless (which is reasonable) • [Results chart: AA + InDegree]
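The weighting itself is simple; a sketch in which each feature block is scaled before the blocks are concatenated (the example weight of 0.005 for InDegree is the one quoted later in the slides, everything else is illustrative):

```python
from scipy.sparse import hstack

def combine(blocks_with_weights):
    """blocks_with_weights: list of (sparse feature block, weight) pairs."""
    return hstack([weight * block for block, weight in blocks_with_weights])

# e.g. combine([(abstract_bow, 1.0), (author_bow, 1.0), (indegree_col, 0.005)])
```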
Using the Citation Graph • InLinks = add one feature for each paper i; it will be nonzero in vector x iff paper x is referenced by paper i • Normalize this part of the vector to unit length • OutLinks = the same, nonzero iff x references i (results on the next slide)
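A minimal sketch of building the InLinks block from a toy citation list, assuming SciPy and scikit-learn:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

n_papers = 4
cites = [(1, 0), (2, 0), (3, 1)]        # (citing paper i, cited paper x)

rows = [x for i, x in cites]            # feature belongs to the cited paper x ...
cols = [i for i, x in cites]            # ... in the column of the citing paper i
inlinks = csr_matrix((np.ones(len(cites)), (rows, cols)),
                     shape=(n_papers, n_papers))
inlinks = normalize(inlinks, norm="l2")  # unit Euclidean length per paper
```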
Using the Citation Graph • Use HITS to compute a hub value and an authority value for each paper (→ two new features) • Compute PageRank and add this as a new feature • Bad: all links point backwards in time (unlike on the web), so PageRank accumulates in the earlier years • InDegree, Authority, and PageRank are strongly correlated → no improvement over previous results • Hub is strongly correlated with OutDegree, and is just as useless
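These graph scores are easy to reproduce today with networkx (a modern convenience, not the implementation used at the time); a sketch on a toy citation graph:

```python
import networkx as nx

g = nx.DiGraph([(1, 0), (2, 0), (3, 1)])   # edge i -> x means "paper i cites paper x"
hubs, authorities = nx.hits(g)              # two extra features per paper
pagerank = nx.pagerank(g)                   # one extra feature per paper
```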
Journal • The “Journal” field in the metadata indicates that the paper has been (or will be?) published in a journal • Present in about 77% of the papers • Already in standardized form, e.g. “Phys. Lett.” (never “Physics Letters”, “Phys. Letters”, etc.) • There are over 50 journals, but only 4 have more than 100 training papers • Papers from some journals are downloaded more often than from others (average downloads): • JHEP 248, J. Phys. 104, global average 194 • Introduce one binary feature for each journal (+ one for “missing”)
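A possible encoding of the Journal field; the journal names are just the examples from the slide:

```python
journals = ["Phys. Lett.", "JHEP", None, "J. Phys."]   # None = missing field
vocab = sorted({j for j in journals if j is not None})

def journal_features(j):
    row = [1 if j == name else 0 for name in vocab]
    row.append(1 if j is None else 0)        # the extra "missing" indicator
    return row

print(journal_features("JHEP"))   # [0, 1, 0, 0]
print(journal_features(None))     # [0, 0, 0, 1]
```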
Miscellaneous Statistics • TitleCc, TitleWc: number of characters/words in the title • The most frequently downloaded papers have relatively short titles: • The holographic principle (2927 downloads) • Twenty Years of Debate with Stephen (1540) • Brane New World (1351) • A tentative theory of large distance physics (1351) • (De)Constructing Dimensions (1343) • Lectures on supergravity (1308) • A Short Survey of Noncommutative Geometry (1246)
Miscellaneous Statistics • Average error: 119.561 with TitleCc weight = 0.02 • The model says that the number of downloads decreases by 0.96 for each additional letter in the title :-) • TitleWc is useless
Miscellaneous Statistics • AbstractCc, AbstractWc: number of characters/words in the abstract • Both useless • Number of authors (useless) • Year (actually Year – 2000) • Almost useless (reduces error from 119.56 to 119.28)
Clustering • Each paper was represented by a sparse vector (bag-of-words, using the abstract + title) • Use 2-means to split the collection into two clusters, then split each of them recursively • Stop splitting if one of the two clusters would have < 600 documents • We ended up with 18 clusters • Hard to say if they’re meaningful (ask a physicist?) • Introduce one binary feature for each cluster (→ useless) • Also add a feature (ClusDlAvg) containing the average number of downloads over all training documents from the same cluster • Reduces error from 119.59 to 119.30
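A sketch of the recursive 2-means splitting, assuming scikit-learn's KMeans; the 600-document threshold is the one from the slide:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisect(vectors, indices, min_size=600, clusters=None):
    """vectors: sparse bag-of-words matrix; indices: NumPy array of row ids."""
    if clusters is None:
        clusters = []
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) < min_size or len(right) < min_size:
        clusters.append(indices)             # stop: keep this set as one cluster
    else:
        bisect(vectors, left, min_size, clusters)
        bisect(vectors, right, min_size, clusters)
    return clusters

# e.g. clusters = bisect(bow_vectors, np.arange(bow_vectors.shape[0]))
```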
Tweaking and Tuning • AA + 0.005 InDegree + 0.5 InLinks + 0.7 OutLinks + 0.3 Journal + 0.02 TitleCc/5 + 0.6 (Year – 2000) + 0.15 ClusDlAvg: 29.544 / 119.072 • The “C” parameter for SVM regression was fixed at 1 so far • C = 0.7, AA + 0.006 InDegree + 0.7 InLinks + 0.85 OutLinks + 0.35 Journal + 0.03 TitleCc/5 + 0.3 ClusDlAvg: 31.805 / 118.944 • This is the one we submitted
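A sketch of what the submitted configuration amounts to, again with scikit-learn standing in for the original SVM regression package; the block names and the helper are illustrative, and the TitleCc/5 scaling is folded into its weight:

```python
from scipy.sparse import hstack
from sklearn.svm import LinearSVR

weights = {"indegree": 0.006, "inlinks": 0.7, "outlinks": 0.85,
           "journal": 0.35, "titlecc": 0.03, "clusdlavg": 0.3}

def build_matrix(blocks):
    """blocks: dict mapping block name -> sparse feature block."""
    parts = [blocks["abstract"], blocks["author"]]              # the "AA" baseline
    parts += [weights[name] * blocks[name] for name in weights]
    return hstack(parts)

# model = LinearSVR(C=0.7).fit(build_matrix(train_blocks), train_downloads)
```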
Conclusions • It’s a nasty dataset! • The best model is still disappointingly inaccurate • …and not so much better than the trivial model • Weighting the features is very important • We tried several other features (not mentioned in this presentation) that were of no use • Whatever you do, there’s still so much variance left • SVM learns well enough here, but it can’t generalize well • It isn’t the trivial sort of overfitting that could be removed simply by decreasing the C parameter in SVM’s optimization problem
Further Work • What is it that influences readers’ decisions to download a paper? • We are mostly using things they can see directly: author, title, abstract • But readers are also influenced by their background knowledge: • Is X currently a hot topic within this community? (→ Will reading this paper help me with my own research?) • Is Y a well-known author? How likely is the paper to be any good? • It isn’t easy to capture these things, and there is a risk of overfitting