
NEP

Thomas Krichel LIU & HГУ 2007 – 11 — 21. NEP. acknowledgments. Thanks to organizers Andrew M. Cox. I thank everybody involved in RePEc and NEP, as well as JISC. current awareness.


Presentation Transcript


  1. Thomas Krichel LIU & HГУ 2007-11-21 NEP

  2. acknowledgments • Thanks to organizers Andrew M. Cox. • I thank everybody involved in RePEc and NEP, as well as JISC.

  3. current awareness • A “Current Awareness” aka “Selective Dissemination of Information” service informs users about new documents in their area of interest. • It is a niche activity that has been neglected by the general search engines.

  4. common features • Current awareness generates a double classification in • subject ... • time ... • matter.

  5. types of current awareness • There is no official typology that I know of. • I distinguish • common classification based • keyword based • usage history based • others • Most of them have a problem when it comes to academic information.

  6. common classification based • You have a set of classifications, say “sports”, “national politics”, “gossip”. A computer filters documents into the classes. • Works very well as long as the vocabularies in documents of different classes are very distinct. • Example: Google News • Academic information tends to be difficult to classify by computer.

  7. keyword based • A user builds a profile of her interest and files it with a service provider. The provider will mail documents that contain keywords. • Example: Nexis contents alert. • Works well if one gets the keywords right. Unfortunately this is difficult. • For academic documents, getting the keywords right seems next to impossible.

  8. usage history based • A user logs into a system. The system tracks usage and builds a profile of the user's interests. • Works very badly if user interests change. • Example: amazon.com suggestions. • For academic documents there is no unified system the usage of which a computer could watch.

  9. academic current awareness • All three types work badly with academic documents. • But current awareness is the essence of academic publishing. • Academic journals are, in essence, nothing but current awareness services. • But there is more and more work published outside of formal academic journals.

  10. a new approach • In some academic fields, rather than publishing only in journals, authors publish through other channels. • This tendency is likely to grow. • If there is a library that fills with records about these documents, we can build a new type of current awareness system. • We have to have a digital library first.

  11. the RePEc digital library • RePEc is a digital library about economics. • Founded by Thomas Krichel in 1997, based on work he started in 1993. • It is compiled from over 800 sources. • Currently describes over 500,000 items. • It has a current awareness system.

  12. introducing NEP • “NEP: New Economics Papers” is a current awareness system for RePEc. • It is based at http://nep.repec.org • Selection into topics is done manually by volunteers.

  13. who volunteers for NEP? • Many volunteers are junior academics. • They have good incentives. They • need to be aware of latest literature • are absent from some informal circulation channels of senior academics • need to “get their name around” among peers • Some are senior people but at second-tier institutions.

  14. NEP service model • There is a basic model behind this service we could call the “NEP Service Model”. • It is characterized by • two stages... • flat report space...

  15. general two stage setup • First stage: A general editor compiles a list of all new papers. This forms an issue of the “allport”. • Second stage: A group of subject editors filter the new allport issue into subject reports. Each editor does this independently for the subject reports she looks after.

  16. a flat space • There is a series of reports. Each report has a number of issues over time. • There is an “allport”, a report that contains all papers that are new in the period covered by the issue.
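The two-stage setup over a flat report space can be sketched as follows. This is a minimal illustration, not the NEP implementation: the `Issue` record, the function names, the report name `nep-mon` and the `keep` callables (which stand in for an editor's manual judgment) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Issue:
    report: str          # "nep-all" or the name of a subject report
    date: str            # issue date, fixed in the first stage
    papers: List[dict]   # paper records; the "title" key is illustrative


def first_stage(new_papers: List[dict], date: str) -> Issue:
    """The general editor's list of all new papers becomes an allport issue."""
    return Issue(report="nep-all", date=date, papers=list(new_papers))


def second_stage(allport: Issue,
                 editors: Dict[str, Callable[[dict], bool]]) -> List[Issue]:
    """Each editor filters the same allport issue independently.

    `editors` maps a report name to a predicate standing in for the
    editor's manual selection. The report space is flat: every subject
    issue is derived directly from the allport issue, never from
    another subject report.
    """
    return [
        Issue(report=name, date=allport.date,
              papers=[p for p in allport.papers if keep(p)])
        for name, keep in editors.items()
    ]


papers = [{"title": "Inflation targeting"}, {"title": "Auction design"}]
allport = first_stage(papers, "2007-11-21")
issues = second_stage(allport, {"nep-mon": lambda p: "Inflation" in p["title"]})
```

Each subject issue inherits its date from the allport issue, so subject and allport issues can be matched up later.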

  17. first stage in NEP • General editor compiles a list of recent additions to the RePEc working papers data. • Computer generated • Journal articles are excluded • Examined by the General Editor (GE, a person) • This list forms an issue of nep-all.

  18. first stage in NEP • nep-all contains all new papers • Circulated to • nep-all subscribers • editors of subject-reports

  19. second stage • Each editor creates, independently, a subject report for her subject. She does this by removing from nep-all the papers that do not belong to her subject. • A subject report issue (SRI) is the result of this process. • There have been over 50,000 SRIs issued through the lifetime of NEP to date.

  20. history • There are basically two phases in NEP: the pre-ernad phase, 1998 to 2004, and the post-ernad phase. • I will deal with pre-ernad history here. • Some research on NEP has been conducted in the pre-ernad phase. • This has informed the work that went into ernad.

  21. early history • System was conceived by Thomas Krichel • Name “NEP” by Sune Karlsson • Implemented by José Manuel Barrueco Cruz. • Started to run in May 1998.

  22. starting setup • First the system was all email based. • The nep-all was composed as an email. • It was sent to editors as an email. • Editors used whatever tool they used to compose the email.

  23. web interface • John S. Irons issued the first web interface for report composition on 2000-02-01. • This would just compose the report. • Editors would still cut and paste the results of the form into email clients.

  24. historic mail support • First mail support was given by mailbase.ac.uk. • When this was closed in 2000-11, NEP moved to jiscmail.ac.uk. • Since the mailing list service was only supposed to be for the UK academic community, it was deemed not sustainable. • Thomas Krichel started hosting lists on 2002-11-16. It is a nightmare.

  25. Aeroflot document • The Aeroflot document was a thinking piece that Thomas Krichel wrote as early as 2001-09 at http://openlib.org/home/krichel/work/aeroflot.html • This paper already sets out ideas for what would be ernad. • At that time the Siberian RePEc team promised help with building such a system.

  26. discover disaster • In 2002-2003 Jeremiah Cochise Trinidad Christensen and Thomas Krichel were the first people to try to get a systematic picture of how NEP works. • They discover that this is exceedingly difficult.

  27. mail log parsing • Logs were not moved from Mailbase to JISCMail. • Mailbase removed the logs in 2002-11. Thomas Krichel got them just before they were destroyed. • The mail logs were the only source for historic NEP information.

  28. parsing targets • handles: severely compromised by cut-n-paste operations, editor locales, etc. • date of issue: editors were free to set dates, nep-all dates may not be preserved • time of issue: an email is almost impossible to time.

  29. state of pre-ernad data • After a regular expressions orgy, we can get some approximate idea about the handles that were used. • Thus the thematic component is roughly intact. • We have a problem with a bug in the discovery program that made many papers appear several times in nep-all. This makes it difficult to associate subject and allport issues.
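As a rough idea of what such handle recovery might look like, here is a small sketch. It is not the code used for the actual analysis. The pattern assumes the usual RePEc handle shape `RePEc:archive:series:paper` with a three-letter archive code and a six-character series code, and it tolerates stray whitespace and odd capitalization of the kind that cut-and-paste introduces.

```python
import re

# Assumed handle shape: RePEc:<3-letter archive>:<6-char series>:<paper id>.
# Whitespace around the colons is tolerated because pasted text is often mangled.
HANDLE_RE = re.compile(
    r"repec\s*:\s*([a-z]{3})\s*:\s*([a-z0-9]{6})\s*:\s*([^\s<>\"']+)",
    re.IGNORECASE,
)


def extract_handles(mail_body):
    """Return normalized handles in order of first appearance, without repeats."""
    handles = []
    seen = set()
    for archive, series, paper in HANDLE_RE.findall(mail_body):
        # Strip trailing punctuation that belongs to the sentence, not the handle.
        paper = paper.rstrip(".,;)")
        handle = f"RePEc:{archive.lower()}:{series.lower()}:{paper}"
        if handle not in seen:
            seen.add(handle)
            handles.append(handle)
    return handles
```

Dropping repeats within one mail matters here because of the discovery bug mentioned above, which made many papers appear several times.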

  30. state of pre-ernad data • Timing of emails is extremely difficult, even with full headers. • The logs of the Mailbase system only have times for when the email client said it sent the mail. This is the local editor's PC time, which can be years out of whack! • We still have some data for research...
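One way to at least flag unusable timestamps is to read the client-supplied `Date` header and check it against a plausible window. The sketch below uses Python's standard `email` package. The window bounds are assumptions taken from the dates mentioned in this talk (NEP started in May 1998; the Mailbase logs were removed in 2002-11), and the function names are made up for the example.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumed lower bound: NEP started to run in May 1998.
NEP_START = datetime(1998, 5, 1, tzinfo=timezone.utc)


def claimed_send_time(raw_mail):
    """Return the time the sender's mail client claims, or None if unreadable."""
    header = message_from_string(raw_mail).get("Date")
    if header is None:
        return None
    try:
        stamp = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return None
    if stamp.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def is_plausible(stamp, archived_before):
    """True if the claimed time lies between the NEP start and the archive date."""
    return stamp is not None and NEP_START <= stamp <= archived_before
```

This only rejects clocks that are wildly wrong. A clock that is off by a few days still passes, so the result remains approximate.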

  31. research conducted on NEP • Most of the research conducted on NEP has been done in the pre-ernad phase. • The difficulties of some of this work have informed the construction of ernad.

  32. Chu and Krichel (2003) • Heting Chu & Thomas Krichel (2003) “NEP: Current awareness service of the RePEc digital library”. http://www.dlib.org/dlib/december03/chu/12chu.html vaguely talks about NEP. Notes that there is a problem of timeliness in the subject report issue, despite the very shaky data.

  33. Barrueco Cruz et al. (2003) • José Manuel Barrueco Cruz, Thomas Krichel and Jeremiah Cochise Trinidad-Christensen “Organizing Current Awareness in a Large Digital Library” http://openlib.org/home/krichel/papers/espoo.pdf have two themes • overlap between reports... • coverage ratio... • as well as history and suggestions.

  34. overlap • Barrueco Cruz et al. (2003) argue that overlap occurs not when the same papers appear in two reports, but when the two reports are read by the same readers. • They have data on pairwise overlap between reports, based on crude membership data.
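A reader-based notion of overlap can be illustrated with subscriber sets. The sketch below simply counts shared subscribers for every pair of reports. The exact measure used in the paper is not given here, so this count is an assumption, as are the report names and addresses in the usage example.

```python
from itertools import combinations


def pairwise_overlap(subscribers):
    """Count shared subscribers for each pair of reports.

    `subscribers` maps a report name to the set of its subscribers.
    Returns a dict keyed by (report_a, report_b) with names in sorted order.
    """
    return {
        (a, b): len(subscribers[a] & subscribers[b])
        for a, b in combinations(sorted(subscribers), 2)
    }


membership = {
    "nep-mac": {"ann@example.org", "bob@example.org", "cat@example.org"},
    "nep-mon": {"bob@example.org", "cat@example.org", "dan@example.org"},
    "nep-his": {"eve@example.org"},
}
overlap = pairwise_overlap(membership)
```

Pairs with a high count are the ones where duplicate announcements reach the same reader.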

  35. overlap puzzle • Here is a puzzle to think about • If a person will be interested in two subject areas because they are close, she will subscribe to both reports. • But since they are thematically close, she will sometimes receive the same papers twice. • With mail technology and asynchronous issue generation, this appears difficult to solve.

  36. coverage ratio • We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report. • We can define this ratio • for each nep-all issue • for a subset of nep-all issues • for NEP as a whole
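Read as a share, the coverage ratio for any set of nep-all issues is the number of their papers announced in at least one subject report, divided by the number of papers in those issues. A minimal sketch, with made-up paper identifiers and report names:

```python
def coverage_ratio(allport_papers, subject_reports):
    """Share of allport papers announced in at least one subject report.

    `allport_papers` is an iterable of paper handles from one or more
    nep-all issues; `subject_reports` maps a report name to the handles
    it announced. Returns None when there are no allport papers.
    """
    all_papers = set(allport_papers)
    if not all_papers:
        return None
    announced = set()
    for handles in subject_reports.values():
        announced.update(handles)
    # Only count announcements of papers that are in the allport set.
    return len(all_papers & announced) / len(all_papers)


ratio = coverage_ratio(
    ["p1", "p2", "p3", "p4"],
    {"nep-mac": ["p1", "p2"], "nep-mon": ["p2", "p3"]},
)
```

Here three of the four papers are announced somewhere, so `ratio` is 0.75. Passing the papers of a single issue, of several issues, or of all issues gives the three variants listed above.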

  37. coverage ratio theory & evidence • Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. • However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad is • The coverage ratio of different nep-all issues varies a great deal. • Overall, it remains at around 70%. • We need some theory as to why.

  38. Krichel & Bakkalbaşı (2005) • Thomas Krichel and Nisa Bakkalbaşı “Developing a predictive model of editor selectivity in a current awareness service of a large digital library”. http://openlib.org/home/krichel/papers/boston.pdf

  39. coverage ratio theories • Krichel & Bakkalbaşı (2005) build two theories of the observations of Barrueco Cruz et al. (2003) • They are • Target-size theory • Quality theory • descriptive quality • substantive quality

  40. theory 1: target size theory • When editors compose a report issue, they have a size of the issue in mind. • If the nep-all issue is large, editors will take a narrow interpretation of the report subject. • If the nep-all issue is small, editors will take a wide interpretation of the report subject.

  41. target size theory & static coverage • There are two things going on • The opening of new subject reports improves the coverage ratio. • The expansion of RePEc implies that the size of nep-all, though varying in the short-run, grows in the long run. Target size theory implies that the coverage ratio deteriorates. • The static coverage ratio is the result of both effects canceling out.
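A back-of-the-envelope bound may make the two opposing effects concrete. The symbols are assumptions introduced here, not taken from the cited paper: $R$ is the number of subject reports, $T$ the target size editors have in mind (assumed to cap the size of each issue), and $N$ the size of the nep-all issue. Since $R$ issues of at most $T$ papers announce at most $RT$ distinct papers, the coverage ratio $c$ satisfies

```latex
c \le \min\left(1, \frac{R\,T}{N}\right)
```

Opening reports raises $R$ and loosens the bound; the growth of RePEc raises $N$ and tightens it. If both grow at similar rates, the bound stays roughly where it is, which is consistent with a static coverage ratio.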

  42. theory 2: quality theory • George W. Bush version of quality theory • Some papers are rubbish. They will not get announced. • The amount of rubbish in RePEc remains constant. • This implies constant coverage. • Reality is slightly more subtle.

  43. 2 versions of quality theory • Descriptive quality theory: papers that are badly described • misleading titles • no abstract • languages other than English • Substantive quality theory: papers that are well described, but not good • from unknown authors • issued by institutions with unenviable research reputation

  44. practical importance • We do care whether one or the other theory is true. • Target size theory implies that NEP should open more reports to achieve perfect coverage. • Quality theory suggests that opening more reports will have little to no impact on coverage. • Since operating more reports is costly, there should be an optimal number of reports.

  45. results • Krichel & Bakkalbaşı (2005) build a binary logistic regression analysis model. • They find positive evidence for both target size and quality theory. • The NEP editors don't like the results. They insist that they only filter by topic.

  46. Bátiz-Lazo & Krichel (2005) • Bernardo Bátiz-Lazo and Thomas Krichel “On-line distribution of working papers through NEP: A Brief Business History” http://openlib.org/home/krichel/papers/kassel.pdf has an early history of NEP that covers organizational details I don't talk about here.

  47. ernad • stands for editing reports on new academic documents. • Software system designed by Thomas Krichel at http://openlib.org/home/krichel/work/altai.html. • Software written in Perl by Roman D. Shapiro. Cost $2000. • Started to work after 2004-12.

  48. cut editor freedom I • Editors no longer send mail to lists. • Only one email address sends mail. • But the mail appears to come from the editor: From: Marcus Desjardin <ernad@nep.repec.org> Reply-To: Marcus Desjardin <desjardin@econ.louvain.be>

  49. cut editor freedom II • Editors can no longer edit report issue emails, e.g. add announcements of conferences. • They are generated from XML files into standardized text and HTML files bound together by MIME multipart/alternative. • They cannot change dates of issue.
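The mail layout described on the last two slides can be reproduced with Python's standard `email` package. This is only an illustration of the header trick and the multipart/alternative structure; ernad itself is written in Perl. The function name, the list address and the message bodies are hypothetical.

```python
from email.message import EmailMessage
from email.utils import formataddr

SENDER_ADDRESS = "ernad@nep.repec.org"  # the single sending address


def build_issue_mail(editor_name, editor_address, list_address,
                     subject, text_body, html_body):
    """Build a report issue mail that looks like it comes from the editor."""
    msg = EmailMessage()
    # The editor's name is shown, but the address is the system's own.
    msg["From"] = formataddr((editor_name, SENDER_ADDRESS))
    # Replies go to the editor's real address.
    msg["Reply-To"] = formataddr((editor_name, editor_address))
    msg["To"] = list_address
    msg["Subject"] = subject
    # Plain text first, then HTML: together they form multipart/alternative.
    msg.set_content(text_body)
    msg.add_alternative(html_body, subtype="html")
    return msg


mail = build_issue_mail(
    "Marcus Desjardin", "desjardin@econ.louvain.be",
    "nep-xyz@example.org",              # hypothetical list address
    "nep-xyz issue",                    # hypothetical subject
    "1. A paper title",
    "<ol><li>A paper title</li></ol>",
)
```

Mail clients that cannot show HTML fall back to the plain text part, while both parts carry the same standardized content.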

  50. help editors • Provide a simple-to-use interface for the composition of reports • provide an easy to scroll input • allow for easy sorting of report • do a better job at pretty-printing • Get ready for the introduction of pre-sorting • Actually presorting was only introduced in 2005-08.
