70 likes | 79 Views
This project focuses on improving data harvesting and parsing methods for citation pattern analysis. It involves extracting full-text documents, converting them to ASCII format, and parsing reference sections to study citation patterns effectively. The goal is to enhance user services and logging/registration based on data insights.
E N D
Current work on CitEc José Manuel Barrueco Cruz http://www.uv.es/~barrueco Thomas Krichel http://openlib.org/home/krichel
Data • Papers from RePEc dataset • 31139 Working Papers • 15145 Journal Articles • all of them available online, not all are free • More than 90% of them are in PDF or PostScript formats
Harvesting • Perl script that: • Reads the RePEc data • Downloads the documents full text • Converts them to ASCII (using pstotext) • Tries to find a Reference section
Test on 1000 documents • 13% are not found in the URL specified • 3% are not it PDF or PS • 15% give errors in the pstotext conversion • 9% are converted but a reference section can not be found • 60% were successfully converted
Parsing problems of CiteSeer • Publication date. When a reference contains more than one year it is discarded • Source of publication, i.e. working papers series or journals titles is not parsed be CiteSeer. We will need to add code with a list of all journals and working paper series.
To do • Study of citation patterns • Use of data in user services • Use of data in logging and registration services
Thank you for your attention. Contact José Manuel Barrueco Cruz for more information