Current work on CitEc

Current work on CitEc José Manuel Barrueco Cruz http://www.uv.es/~barrueco Thomas Krichel http://openlib.org/home/krichel

Data • Papers from RePEc dataset • 31139 Working Papers • 15145 Journal Articles • all of them available online, not all are free • More than 90% of them are in PDF or PostScript formats

Harvesting • Perl script that: • Reads the RePEc data • Downloads the documents full text • Converts them to ASCII (using pstotext) • Tries to find a Reference section

Test on 1000 documents • 13% are not found in the URL specified • 3% are not it PDF or PS • 15% give errors in the pstotext conversion • 9% are converted but a reference section can not be found • 60% were successfully converted

Parsing problems of CiteSeer • Publication date. When a reference contains more than one year it is discarded • Source of publication, i.e. working papers series or journals titles is not parsed be CiteSeer. We will need to add code with a list of all journals and working paper series.

To do • Study of citation patterns • Use of data in user services • Use of data in logging and registration services

Thank you for your attention. Contact José Manuel Barrueco Cruz for more information

Current work on CitEc