Online Autonomous Citation Management for CiteSeer

CSE598B Course Project
By Huajing Li

Introduction to CiteSeer
• Software package developed at NEC-Labs
• Domain-independent software for Autonomous Citation Indexing (ACI)
• Focus is on scholarly publications in electronic format (PS / PDF and variants)
• Performs:
  • Document discovery / retrieval / parsing
  • Automatic citation extraction
  • Document & citation indexing / search

[Figure: system diagram. Labels: Crawler; Document URL; Retrieval; Document (PDF/PS); Conversion; Document (Plain Text); Web Server; Parsing & Meta-Data Extraction; Document Database; File System; Meta-Data Database; PDBM_File & Chunk Tables; Indexes; Document Body Text; N Citation Texts; N Citation Meta-data Sets (CID, GID, Title, Authors, etc.); Document Meta-data Set (DID, Title, Authors, etc.); Indexing]

Submitting Documents
• The output of a crawl or a user submission is the URL of a page linking to the document.
• These URLs are stored in the Paper Table.
• The Paper Table maintains a status for each document:
  • Downloaded / undownloaded
  • Processed / unprocessed
  • Other processing errors (tooshort / noreference / etc.)
• CiteSeer regularly scans this table to start the download of new documents.
• Only documents that match the typical pattern of scholarly publications are eventually added to the collection.
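
The status bookkeeping described on this slide can be sketched as follows. This is a minimal in-memory illustration, not CiteSeer's actual schema: the class `PaperTable`, the `Status` enum and the method names are hypothetical, and only the status labels come from the slide.

```python
from enum import Enum


class Status(Enum):
    # Status labels taken from the slide; the enum itself is hypothetical.
    UNDOWNLOADED = "undownloaded"
    DOWNLOADED = "downloaded"
    PROCESSED = "processed"
    TOO_SHORT = "tooshort"
    NO_REFERENCE = "noreference"


class PaperTable:
    """Toy stand-in for the Paper Table: one status per submitted URL."""

    def __init__(self):
        self.rows = {}  # url -> Status

    def submit(self, url):
        # A crawl or a user submission adds a URL; resubmission is a no-op.
        self.rows.setdefault(url, Status.UNDOWNLOADED)

    def pending(self):
        # The periodic scan: URLs that still need to be downloaded.
        return [u for u, s in self.rows.items() if s is Status.UNDOWNLOADED]

    def mark(self, url, status):
        self.rows[url] = status
```

A scan loop would call `pending()`, attempt each download, and then `mark()` the outcome, so that rejected documents (for example `TOO_SHORT`) are not retried on the next scan.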

Document Structure Identification
Metadata fields come from three sources: the document header, system info, and the citation graph.
• Title
• Subject (keywords)
• Description (abstract)
• Author names
• Author affiliations
• Author address, email, phone, homepage URL
• Publication date, publication number
• Archive date
• Contributor
• Type
• Format
• Identifier
• Source
• Publisher
• Journal / Conference
• Pages
• Relation
• References
• Is Referenced By

Citations Grouping
• Citations to the same document share a common Group ID
• Each Group ID has a set of keys associated with it, based on citation information
  • {authorkey1-titlekey; … authorkey2-titlekey}
• There is one authorkey for every single word in the author information
• For a given citation, the titlekey is unique and is the concatenation of all title words

Citations Grouping
• For a newly discovered citation:
  • Extract
    • Authors: C. Lee Giles, S. Lawrence
    • Title: “Good Paper Title”
  • Generate keys {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
  • Try to match at least one of them with an existing Group ID key
  • If there is a match, add this citation (Citation ID) to the group
  • Otherwise, create a new Group ID for this citation
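
The key generation and matching steps above can be sketched in a few lines. This is an illustration under stated assumptions, not CiteSeer's code: the function and class names are hypothetical, normalization (lowercasing, dropping punctuation) is assumed, single-letter initials are skipped because the slide's example omits "C." and "S.", and a matched citation's keys are assumed to be added to its group.

```python
import re


def title_key(title):
    # Concatenate all title words (lowercased, punctuation dropped).
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))


def citation_keys(authors, title):
    # One authorkey per word of author information; initials are skipped
    # (assumption inferred from the slide's example).
    tk = title_key(title)
    words = set()
    for name in authors:
        for word in re.findall(r"[a-z]+", name.lower()):
            if len(word) > 1:
                words.add(word)
    return {f"{w}-{tk}" for w in words}


class CitationGrouper:
    """Hypothetical in-memory grouper mapping keys to Group IDs."""

    def __init__(self):
        self.key_to_group = {}  # authorkey-titlekey -> Group ID
        self.groups = {}        # Group ID -> list of Citation IDs
        self.next_gid = 1

    def add(self, cid, authors, title):
        keys = citation_keys(authors, title)
        # A match on at least one key is enough to join an existing group.
        gid = next((self.key_to_group[k] for k in keys
                    if k in self.key_to_group), None)
        if gid is None:
            gid = self.next_gid
            self.next_gid += 1
            self.groups[gid] = []
        self.groups[gid].append(cid)
        for k in keys:
            self.key_to_group.setdefault(k, gid)
        return gid
```

With this sketch, a second citation that lists only "S. Lawrence" but carries the same title joins the group created by the first, because the key `lawrence-goodpapertitle` already exists.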

Linking Citations to Documents
• Citation ID -> Group ID
  • We just saw that …
• Document ID -> Group ID
  • Based on the document’s metadata, generate authorkey-titlekey in the same way and try to match a Group ID key generated from the citations
  • Document metadata can be erroneous, so successful mapping often happens AFTER correction by users

Problems of the Current Approach
• There is no guarantee that the most similar citation contains the best metadata
• Building the citation graph is a time-intensive, offline task
• Due to batch clustering, the addition of a single citation requires rebuilding the entire citation graph to include the new instance
• The so-called canonical metadata is fixed to the document record

Goals of the New Citation Management System
• Provide better document metadata
• Reduce the cost of maintenance
• Use online citation matching so that the citation graph can be adjusted immediately based on a single new citation
• Provide a fluid framework for building canonical metadata in which all evidence is always considered
• Allow the development of flexible APIs into the CiteSeer citation graph system
• Maintain data security despite an open, wiki-like approach to user-contributed metadata changes
• Provide better citation matching than the current system

Prototype Overview
[Figure: prototype architecture diagram. Labels: Query; Query Handler; Edge DB (SQL); Document Metadata Index; Citation Metadata (XML); Citation Resolver; Citation Metadata Index; Document Metadata (XML). Annotation: "May ultimately be located in separate service"]

Edge DB
• One simple table containing one edge per row:
  • Id: citation handle (equivalent to CID)
  • citingDoc: citing document handle
  • citedDoc: cited document handle
• Row-level locking
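
A minimal sketch of such an edge table, using Python's built-in `sqlite3` module purely for illustration. The slide does not name the SQL engine; SQLite locks the whole database rather than individual rows, so the row-level locking mentioned above would need an engine that supports it. The column names follow the slide, while the column types and the helper functions are assumptions.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edge (
        id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
        citingDoc TEXT NOT NULL,     -- citing document handle
        citedDoc  TEXT NOT NULL      -- cited document handle
    )
""")


def add_edge(cid, citing, cited):
    # One new citation adds exactly one row; nothing else is rebuilt.
    with conn:
        conn.execute("INSERT INTO edge VALUES (?, ?, ?)", (cid, citing, cited))


def references(doc):
    # Documents that `doc` cites.
    rows = conn.execute(
        "SELECT citedDoc FROM edge WHERE citingDoc = ? ORDER BY id", (doc,))
    return [r[0] for r in rows]


def cited_by(doc):
    # Documents that cite `doc`.
    rows = conn.execute(
        "SELECT citingDoc FROM edge WHERE citedDoc = ? ORDER BY id", (doc,))
    return [r[0] for r in rows]
```

Storing one edge per row means that adding a citation touches a single row, which is consistent with the goal of adjusting the citation graph online rather than through a batch rebuild.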

Matching Citations and Docs
• Exact string matching across disparate metadata fields is far too optimistic; better matching criteria are needed
• Lucene provides two methods out of the box:
  • Match based on Levenshtein distance
    • Specify an arbitrary distance cut-off per field
    • Choose the most similar match out of the returned set
  • Cut out the middleman: similarity-based matching
    • Specify an arbitrary similarity threshold
    • Choose the most similar match out of the returned set over the threshold
• Criteria are to be determined through empirical tests using the prototype system.
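
To make the threshold idea concrete, here is a small pure-Python sketch. It does not use Lucene and does not reproduce Lucene's scoring; it computes a plain Levenshtein distance, converts it into a normalized similarity, and picks the best candidate above a threshold. The function names and the 0.8 default are assumptions for illustration; the slide leaves the actual criteria to empirical tests.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    # Normalized to [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def best_match(query, candidates, threshold=0.8):
    # Keep candidates at or above the threshold, then take the most similar.
    scored = [(similarity(query.lower(), c.lower()), c) for c in candidates]
    scored = [sc for sc in scored if sc[0] >= threshold]
    return max(scored)[1] if scored else None
```

For example, a title with two transposed letters ("Good Paper Titel") still resolves to "Good Paper Title" at a threshold of 0.8, whereas an exact string match would fail.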
