70 likes | 76 Views
Current work on CitEc. José Manuel Barrueco Cruz http://www.uv.es/~barrueco Thomas Krichel http://openlib.org/home/krichel. Data. Papers from RePEc dataset 31139 Working Papers 15145 Journal Articles all of them available online, not all are free
E N D
Current work on CitEc José Manuel Barrueco Cruz http://www.uv.es/~barrueco Thomas Krichel http://openlib.org/home/krichel
Data • Papers from RePEc dataset • 31139 Working Papers • 15145 Journal Articles • all of them available online, not all are free • More than 90% of them are in PDF or PostScript formats
Harvesting • Perl script that: • Reads the RePEc data • Downloads the documents full text • Converts them to ASCII (using pstotext) • Tries to find a Reference section
Test on 1000 documents • 13% are not found in the URL specified • 3% are not it PDF or PS • 15% give errors in the pstotext conversion • 9% are converted but a reference section can not be found • 60% were successfully converted
Parsing problems of CiteSeer • Publication date. When a reference contains more than one year it is discarded • Source of publication, i.e. working papers series or journals titles is not parsed be CiteSeer. We will need to add code with a list of all journals and working paper series.
To do • Study of citation patterns • Use of data in user services • Use of data in logging and registration services
Thank you for your attention. Contact José Manuel Barrueco Cruz for more information