
Scholarly Communication II: Citation Analysis and Reference Linking

CS 502 – 20020430. Carl Lagoze – Cornell University.


Presentation Transcript


  1. Scholarly Communication II: Citation Analysis and Reference Linking • CS 502 – 20020430 • Carl Lagoze – Cornell University

  2. Recalling the themes • Basic assumptions are broken • Expensive distribution • Distinctions between publishers, authors, readers • Basic assumptions remain • Need for quality • Need for people to make money • Reward system: tenure and promotion • Changing context • Readers' habits • Increase of scholarly output • Computing and network power

  3. Unbundling content from services Acks. P. Ginsparg

  4. Signs of Change - Readers … there’s a sense in which the journal articles prior to the inception of the electronic abstracting and indexing database may as well not exist, because they are so difficult to find. Now that we are starting to see … full-text showing up online, I think we are very shortly going to cross a sort of critical mass boundary where those publications that are not instantly available in full-text will become kind of second-rate in a sense, not because their quality is low, but just because people will prefer the accessibility of things they can get right away. Clifford Lynch 1997

  5. Signs of Change - Publishers • Electronic versions of existing journals • Licensing arrangements to libraries • http://campusgw.library.cornell.edu/cgi-bin/dj.cgi?section=ejournal&URL=SerialsSearch • Problems • License bundling • Inflate costs and maintain economic model • Force libraries to subscribe regardless of interest • Longevity dependent on license continuity

  6. Signs of Change - Publishers • Electronic Journals • D-Lib Magazine – http://www.dlib.org • Journal of Digital Information (JODI) – http://journals.ecs.soton.ac.uk/jodi/ • Journal of Electronic Publishing (JEP) – http://www.press.umich.edu/jep/ • The economic models are not established

  7. Signs of Change – Publishers and Libraries • JSTOR • http://www.jstor.org • Recognition of reality • Archival journal storage is expensive for libraries • Shelf space crisis forces libraries to choose between • Keeping archival issues of serials • Continuing subscriptions for new issues • Building expensive new buildings • Archival copies have limited economic value to publishers • Cooperative non-profit model among publishers/foundation (Mellon)/libraries • Sliding window to digitize old issues of serials and provide ready access services

  8. Signs of Change – Libraries & Professional Societies • HighWire Press – http://highwire.stanford.edu • Realities • Many professional societies and journals are “Mom & Pop” operations • Technical and economic cost of electronic publishing is often prohibitively high • Solution • HighWire acts as a brokering service to provide electronic publishing technology for small professional societies and journals • Pooling technology allows creation of higher-level services (e.g., reference linking among journals)

  9. Signs of Change - Scholars • Eprint repositories • Author self-archiving gives scholars control over their intellectual output • Harnad’s “subversive proposal” • Direct descendant of traditional pre-print sharing in print form among scholars • Examples • arXiv – http://arxiv.org • ePrints – http://www.eprints.org • California Digital Library scholarly publishing archive - http://repositories.cdlib.org/ • Related Issues • Publisher agreements – some journals refuse to publish anything that has been posted as an eprint

  10. Signs of Change – Computer Scientists • Automatic creation of traditional journal services • Citation analysis • Reviewing

  11. Concepts – References and Citations [Figure: three linked documents, Doc1, Doc2, and Doc3] • Doc1 references: (Doc2, Doc3) • Doc1 citations: (Doc2)

  12. Concepts – References and Citations • # of references of a document is finite, stable, and easy to determine/compute • # of citations of a document is dynamic, impossible to compute (infinite) • Generally, references are at the work or manifestation level, NOT at the item level
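The asymmetry on this slide can be sketched in a few lines: a document's reference list is fixed when it is written, while its citation list is a view over whatever corpus has been collected so far and grows as new documents appear. The corpus and the names `A`, `B`, `C`, and `invert_references` below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def invert_references(references):
    """Build a cited-by index from a mapping of document -> list of references."""
    cited_by = defaultdict(set)
    for doc, refs in references.items():
        for ref in refs:
            cited_by[ref].add(doc)
    return dict(cited_by)

# Hypothetical corpus: each document's reference list is fixed at publication.
references = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
}

cited_by = invert_references(references)
# Citation counts hold only for the corpus seen so far; adding a new
# document that references "A" would change A's count.
citation_counts = {doc: len(cited_by.get(doc, ())) for doc in references}
print(citation_counts)
```

Rerunning the inversion after every crawl is the simplest design; an incremental index would be needed at scale.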

  13. Citation Analysis • Understanding citation patterns among scholarly journals • Quality metric • Cost/benefit analysis – what “basic” journals should a library have in its holdings • Eugene Garfield – “Father of citation analysis” • Science Citation Index • Origins circa the 1950s • Hand analysis of printed journals showing patterns of citations into and out from journals

  14. Results of citation analysis acks. Garfield, Science, 1972

  15. Citation analysis in the digital age • Automatic citation linking among papers in arXiv • Citebase (Open Citations Project) • http://citebase.eprints.org/cgi-bin/search?submit=1&author=Hawking%2C%20S%20W%20 • Scientometrics - Automation of methods reveals lots of data • Longevity of interest in paper • Journal and ePrint citation patterns • Automatic citation analysis as a reviewing tool?

  16. Are papers downloaded then cited or cited then downloaded? (2) • If all these time differences are plotted, the above graph is produced. Acks: S. Harnad

  17. Citation Latencies • The raw data show that the latency of the citation peak has been reducing over the period of the archive Acks: S. Harnad

  18. Author Impact Quartiles • High impact authors update more than medium or low • High and medium impact authors deposit more papers than low Acks: S. Harnad

  19. Citation Quality • Papers generally cite papers of like impact Acks: S. Harnad

  20. Citation Spread • A small number of papers receive a very large number of citations Acks: S. Harnad

  21. How Paper Impact Affects Usage • Higher impact papers have a longer download life expectancy. Acks: S. Harnad

  22. What is the correlation between citations and downloads? • There is a significant positive correlation between citations and downloads for high impact papers. Acks: S. Harnad
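As a rough illustration of how such a correlation could be measured, the sketch below computes a Pearson coefficient by hand. The counts are hypothetical and are not taken from the slides or from S. Harnad's data; the slides do not say which statistic was actually used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-paper counts, for illustration only.
citations = [1, 2, 3, 4, 5]
downloads = [12, 19, 33, 41, 48]
r = pearson(citations, downloads)
print(f"r = {r:.3f}")
```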

  23. full text reference linking

  24. Who is Who Books in Print Amazon.com extended services

  25. Static Linking • Fixed URLs in references • All the associated problems with URLs • Persistent link through document “footprint” • Robust URLs – Berkeley

  26. General Idea: Enhance URLs with “signatures” • Add to a URL a “signature”, a small piece of document content. • When “traditional” (i.e., address-based) dereferencing fails, do “signature-based” (i.e., content-based) dereferencing: • Pass the signature to some search service, and hope that the target will be prominent among a very small result set. • Two issues: • Computing small, yet effective and robust signatures • Adding them innocuously to hyperlinks Acks: Phelps/Wilensky

  27. Computing Small, Robust Signatures • “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness. • “a TF-IDF-like” measure • Easy to compute and use. • Question: How big a signature is needed to locate a document more or less uniquely on the Web? • Inktomi says there are approximately 1 billion web pages now. Acks: Phelps/Wilensky

  28. Answer: 5 words! • I.e., a signature of 5 words will, in most cases, cause search engines to return the target document within the top few hits. • Actually, a smaller signature will probably do just to locate exact matches, but extra length provides robustness and room for growth. • Martin and Holte (1998) and Bharat and Broder (1998) demonstrate summary queries and strong queries, respectively, which use rarity of words (and, possibly, phrases) to locate specific documents. • Our variation focuses on robustness + rarity. Acks: Phelps/Wilensky
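A minimal sketch of a lexical signature as described in slides 27 and 28, assuming a plain TF-IDF score over a small in-memory corpus. This is not the Phelps/Wilensky implementation: their robustness filters are omitted, and the function names, smoothing, and toy corpus are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(target, corpus, n=5):
    """Return the n terms of `target` with the highest TF-IDF-like score."""
    docs = [set(tokenize(d)) for d in corpus]
    total = len(docs)
    tf = Counter(tokenize(target))

    def score(term):
        df = sum(1 for d in docs if term in d)
        # +1 smoothing (an assumption) avoids division by zero for unseen terms.
        return tf[term] * math.log((total + 1) / (df + 1))

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:n]

# Hypothetical three-document corpus.
corpus = [
    "the digital library links the journals",
    "the journals cite the articles",
    "the endeavour expedition amplifies the digital library",
]
signature = lexical_signature(corpus[2], corpus, n=3)
print(signature)
```

Words found in every document score zero and drop out, so the signature is drawn from the rare terms, which is the property the slides rely on.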

  29. Some Examples Acks: Phelps/Wilensky • Signature for Randy Katz’s home page was • “Californa ISRG Culler rimmed gaunt” • Here is what happens when we feed this signature to HotBot: Gives same result after correcting typo.

  30. Another Example Acks: Phelps/Wilensky • Signature for Endeavour home page is • “amplifies Endeavour leverages Charting Expedition” • Here is what happens when we feed this signature to Google:

  31. Why Does This Work? • If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small. • In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well. • Many documents contain very infrequently used words. • There is lots of room for independence to be off, and to play with term selection for robustness, etc. Acks: Phelps/Wilensky
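The independence argument can be made concrete with a back-of-the-envelope calculation. It assumes the figure of about one billion web pages from slide 27, that each chosen term occurs in about 100,000 pages, and that terms are distributed independently; the symbols N, d, k, and E_k are introduced here only for illustration.

```latex
% Assumptions: N \approx 10^{9} pages, each term occurs in d = 10^{5} pages,
% and terms are distributed independently.
% E_k is the expected number of pages that contain all k terms by chance.
\[
E_k = N \left(\frac{d}{N}\right)^{k}, \qquad
E_3 = 10^{9}\,\bigl(10^{-4}\bigr)^{3} = 10^{-3}, \qquad
E_5 = 10^{9}\,\bigl(10^{-4}\bigr)^{5} = 10^{-11}.
\]
```

Even with three such terms, the expected number of accidental matches is far below one, which leaves the margin the slide mentions for independence to fail.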

  32. Persistent Linking • CrossRef • Uses Digital Object Identifiers (DOIs) • A type of URN (Handle) • Cooperative agreement among publishers • Publishers control the resolution mechanism • Can go to full-text, other services, or charging mechanism • Example • http://www.crossref.org/demos/springer/springer.htm

  33. Dynamic & Context Sensitive Linking • Problem – Link behavior should not be the same for all • Solution - Link contains metadata rather than an identifier • OpenURL – standard for incorporating metadata into a URL • http://sfx.aaa.edu/menu?genre=article&issn=1234-5678&volume=12&issue=3&spage=1&epage=8&date=1998&aulast=Smith&aufirst=Paul • SFX • System for locally resolving an OpenURL to extended and localized services • http://www.sfxit.com/
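A minimal sketch of the idea that the link carries metadata rather than an identifier, using only Python's standard `urllib.parse`. The resolver base and query keys are copied from the example URL on this slide; the helper names and the second resolver address (`resolver.example.edu`) are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

def build_openurl(resolver_base, metadata):
    """Append citation metadata to a resolver base URL as a query string."""
    return f"{resolver_base}?{urlencode(metadata)}"

def parse_openurl(url):
    """Split an OpenURL into its resolver address and its citation metadata."""
    parts = urlsplit(url)
    resolver = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return resolver, dict(parse_qsl(parts.query))

# Citation metadata taken from the example on the slide.
citation = {
    "genre": "article", "issn": "1234-5678", "volume": "12", "issue": "3",
    "spage": "1", "epage": "8", "date": "1998",
    "aulast": "Smith", "aufirst": "Paul",
}

url = build_openurl("http://sfx.aaa.edu/menu", citation)
resolver, metadata = parse_openurl(url)

# The same metadata can be sent to another institution's resolver
# (hypothetical address), which may offer different services.
local_url = build_openurl("http://resolver.example.edu/menu", metadata)
print(url)
print(local_url)
```

Because only the resolver base changes, each institution can decide locally what the same citation resolves to.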

  34. Researchindex – automatic interlinking on the web • http://researchindex.org • Selective web crawling to gather CS resources • Heuristics and AI techniques to establish services • Searching • Reference linking • Why do we need metadata for textual documents?

  35. Automatic Reviewing Techniques • Traditional Collaborative Filtering • Estimate what score a reviewer might give to an item that he/she has not scored yet • Frequently used by recommender systems • Collaborative quality filtering • http://www.cs.berkeley.edu/~tracyr/project/ • Attempts to automatically determine which reviewers are "good" in an open reviewing system, in order to provide the same (or better) benefits as peer review
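The traditional collaborative filtering estimate mentioned above can be sketched as a similarity-weighted average of other reviewers' scores. The cosine similarity over shared items, the function names, and the sample scores are assumptions; recommender systems use many variants of this scheme.

```python
import math

def similarity(a, b):
    """Cosine similarity over the items both reviewers have scored."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(scores, reviewer, item):
    """Estimate `reviewer`'s score for `item` from similar reviewers' scores."""
    weighted, total = 0.0, 0.0
    for other, theirs in scores.items():
        if other == reviewer or item not in theirs:
            continue
        weight = similarity(scores[reviewer], theirs)
        weighted += weight * theirs[item]
        total += weight
    return weighted / total if total else None

# Hypothetical review scores: reviewer -> {item: score}.
scores = {
    "ann": {"p1": 5, "p2": 3},
    "bob": {"p1": 5, "p2": 3, "p3": 4},
    "cat": {"p1": 5, "p2": 3, "p3": 2},
}
estimate = predict(scores, "ann", "p3")
print(estimate)
```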

  36. Collaborative Quality Filtering Algorithm • Assume true value of an item is the asymptotic average of review scores • Good reviewers are those who consistently predict this average • Normalize according to # of reviews of an item, # of reviews by reviewer, review latency • Adjust by “expertise” • Use similarity of term vectors of items reviewed
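A minimal sketch of the first two steps on this slide, assuming the average of the scores collected so far stands in for the asymptotic average. The normalization and "expertise" adjustments are left out, the quality formula `1 / (1 + mean error)` is an assumption rather than the published algorithm, and the review data is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def item_values(reviews):
    """Estimate each item's value as the mean of its review scores."""
    by_item = defaultdict(list)
    for _reviewer, item, score in reviews:
        by_item[item].append(score)
    return {item: mean(item_scores) for item, item_scores in by_item.items()}

def reviewer_quality(reviews):
    """Rate reviewers by how closely their scores track the item means."""
    values = item_values(reviews)
    errors = defaultdict(list)
    for reviewer, item, score in reviews:
        errors[reviewer].append(abs(score - values[item]))
    # Assumed scoring rule: 1.0 for perfect agreement, lower as error grows.
    return {r: 1.0 / (1.0 + mean(errs)) for r, errs in errors.items()}

# Hypothetical (reviewer, item, score) triples.
reviews = [
    ("ann", "p1", 4), ("bob", "p1", 4), ("cat", "p1", 1),
    ("ann", "p2", 2), ("bob", "p2", 2), ("cat", "p2", 5),
]
quality = reviewer_quality(reviews)
print(quality)
```

Note that each reviewer's own score is included in the item mean here; excluding it would be a reasonable refinement when items have few reviews.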
