Online Autonomous Citation Management for CiteSeer
CSE598B Course Project
By Huajing Li
Introduction to CiteSeer
• Software package developed at NEC-Labs
• Domain-independent software for Automatic Citation Indexing (ACI)
• Focus is on scholarly publications in electronic format (PS / PDF and variants)
• Performs:
  • Document discovery / retrieval / parsing
  • Automatic citation extraction
  • Document and citation indexing / search
[Diagram: CiteSeer document processing pipeline. Labels shown: Crawler; Document URL; Retrieval; Document (PDF/PS); Conversion; Document (Plain Text); Parsing & Meta-Data Extraction; Document Database; File System; Meta-Data Database; PDBM_File & Chunk Tables; Indexing; Indexes; Web Server. Per document: Document Body Text; Document Meta-data Set (DID, Title, Authors, etc.); N Citation Texts; N Citation Meta-data Sets (CID, GID, Title, Authors, etc.).]
Submitting Documents
• The output of a crawl or a user submission is the URL of a page linking to a document
• These URLs are stored in the Paper Table
• The Paper Table maintains the status of each document:
  • Downloaded / undownloaded
  • Processed / unprocessed
  • Other processing errors (tooshort / noreference / etc.)
• CiteSeer regularly scans this table to start downloading new documents
• Only documents matching the typical pattern of scholarly publications are eventually added to the collection
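The status tracking described on this slide can be pictured with a small Python sketch. Everything named here is hypothetical: the `Status` enum collapses the slide's separate flags into one field for brevity, and `classify` with its `min_words` threshold and its reference test are illustrative assumptions, not CiteSeer's actual acceptance rules.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Status names mirror the slide; merging them into one enum is an assumption.
    UNDOWNLOADED = "undownloaded"
    DOWNLOADED = "downloaded"
    PROCESSED = "processed"
    TOO_SHORT = "tooshort"
    NO_REFERENCE = "noreference"


@dataclass
class PaperRow:
    url: str                              # URL of the page linking to the document
    status: Status = Status.UNDOWNLOADED  # new rows start out undownloaded


def pending_downloads(paper_table):
    """Return the rows a periodic scan would pick up for download."""
    return [row for row in paper_table if row.status is Status.UNDOWNLOADED]


def classify(text, min_words=500):
    """Hypothetical acceptance check applied to the converted plain text."""
    if len(text.split()) < min_words:
        return Status.TOO_SHORT
    if "references" not in text.lower():
        return Status.NO_REFERENCE
    return Status.PROCESSED
```

A scan would call `pending_downloads` on the table, fetch each document, and store the result of `classify` back into the row.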
Document Structure Identification
Metadata sources (legend): From document header; System info; From citation graph
• Title
• Subject (keywords)
• Description (abstract)
• Author names
• Author affiliations
• Author address, email, phone, homepage URL
• Publication date, publication number
• Archive date
• Contributor
• Type
• Format
• Identifier
• Source
• Publisher
• Journal/Conference
• Pages
• Relation
• References
• Is Referenced By
Citations Grouping
• Citations to the same document share a common Group ID
• Each Group ID has a set of keys associated with it, derived from the citation information:
  {authorkey1-titlekey; … authorkey2-titlekey}
• Every word in the author information yields one authorkey
• For a given citation, the titlekey is unique: it is the concatenation of all title words
Citations Grouping
• For a newly discovered citation:
  • Extract
    • Authors: C. Lee Giles, S. Lawrence
    • Title: “Good Paper Title”
  • Generate keys {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
  • Try to match at least one of them against an existing Group ID key
  • If there is a match, add this citation (Citation ID) to the group
  • Otherwise, create a new Group ID for this citation
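The key-based grouping on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not CiteSeer's implementation: `GroupIndex` and the helper names are hypothetical, single-letter initials are skipped because the slide's example produces no key for "C." or "S.", and the ASCII-only tokenization is a simplification.

```python
import re
from itertools import count


def title_key(title):
    """Concatenate all title words, lowercased, dropping punctuation."""
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))


def author_keys(authors):
    """One key per author word; single-letter initials are skipped
    (assumption inferred from the slide's example)."""
    words = re.findall(r"[a-z]+", " ".join(authors).lower())
    return [w for w in words if len(w) > 1]


def citation_keys(authors, title):
    """Build the set of authorkey-titlekey strings for one citation."""
    tkey = title_key(title)
    return {f"{akey}-{tkey}" for akey in author_keys(authors)}


class GroupIndex:
    """Hypothetical in-memory index from keys to Group IDs."""

    def __init__(self):
        self._key_to_gid = {}   # authorkey-titlekey -> Group ID
        self._members = {}      # Group ID -> list of Citation IDs
        self._next_gid = count(1)

    def add(self, cid, authors, title):
        """Assign a citation to an existing group if any key matches,
        otherwise create a new group. Returns the Group ID."""
        keys = citation_keys(authors, title)
        gid = next((self._key_to_gid[k] for k in keys if k in self._key_to_gid), None)
        if gid is None:
            gid = next(self._next_gid)
        for k in keys:
            self._key_to_gid.setdefault(k, gid)
        self._members.setdefault(gid, []).append(cid)
        return gid
```

With this sketch, a second citation listing only "S. Lawrence" and the same title lands in the same group, because it shares the key `lawrence-goodpapertitle`.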
Linking Citations to Documents
• Citation ID -> Group ID
  • We just saw that …
• Document ID -> Group ID
  • Based on the document’s metadata, generate authorkey-titlekey keys in the same way and try to match a Group ID key generated from the citations
  • Document metadata can be erroneous, so a successful mapping often happens only AFTER correction by users
Problems of the Current Approach
• There is no guarantee that the most similar citation contains the best metadata
• Building the citation graph is a time-intensive, offline task
• Because clustering is done in batch, adding a single citation requires rebuilding the entire citation graph
• The so-called canonical metadata is fixed to the document record
Goals of the New Citation Management System
• Provide better document metadata
• Reduce the cost of maintenance
• Use online citation matching so that the citation graph can be adjusted immediately based on a single new citation
• Provide a fluid framework for building canonical metadata in which all evidence is always considered
• Allow the development of flexible APIs into the CiteSeer citation graph system
• Maintain data security despite an open, wiki-like approach to user-contributed metadata changes
• Provide better citation matching than the current system
Prototype Overview
[Diagram: prototype architecture. Labels shown: Query; Query Handler; Citation Resolver; Edge DB (SQL); Document Metadata (XML); Document Metadata Index; Citation Metadata (XML); Citation Metadata Index. Annotation: "May ultimately be located in separate service".]
Edge DB
• One simple table containing one edge per row:
  • Id: citation handle (equivalent to CID)
  • citingDoc: citing document handle
  • citedDoc: cited document handle
• Row-level locking
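A minimal sketch of such an edge table, using Python's built-in `sqlite3` module purely for illustration. The column names come from the slide; the table name `edges`, the index, the nullable `citedDoc`, and the helper functions are assumptions. SQLite locks at the database level, so the row-level locking the slide calls for would need a different engine in practice.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edges (
        id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
        citingDoc TEXT NOT NULL,     -- citing document handle
        citedDoc  TEXT               -- cited document handle; NULL if unresolved (assumption)
    )
    """
)
# Assumed index so "who cites this document?" lookups avoid a full scan.
conn.execute("CREATE INDEX idx_edges_cited ON edges (citedDoc)")


def add_edge(conn, cid, citing_doc, cited_doc=None):
    """Insert one citation edge; the 'with' block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO edges (id, citingDoc, citedDoc) VALUES (?, ?, ?)",
            (cid, citing_doc, cited_doc),
        )


def cited_by(conn, doc):
    """Return handles of documents that cite 'doc', sorted for stable output."""
    rows = conn.execute(
        "SELECT citingDoc FROM edges WHERE citedDoc = ? ORDER BY citingDoc",
        (doc,),
    )
    return [r[0] for r in rows]
```

Because each citation is a single row, adding a new citation is one insert rather than a rebuild of the whole graph.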
Matching Citations and Docs
• Exact string matching across disparate metadata fields is far too optimistic; better matching criteria are needed
• Lucene provides two methods out of the box:
  • Match based on Levenshtein distance
    • Specify an arbitrary distance cut-off per field
    • Choose the most similar match out of the returned set
  • Cut out the middleman: similarity-based matching
    • Specify an arbitrary similarity threshold
    • Choose the most similar match over the threshold out of the returned set
• Criteria to be determined through empirical tests using the prototype system
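The idea of threshold-based fuzzy matching can be illustrated with a plain-Python stand-in. This does not use Lucene or reproduce its API: `levenshtein`, `similarity`, `best_match`, the length-normalized similarity, and the default threshold of 0.8 are all assumptions chosen for the sketch, and the slide leaves the real criteria to empirical testing.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize edit distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(query, candidates, threshold=0.8):
    """Return the most similar candidate at or above the threshold, else None."""
    scored = [(similarity(query.lower(), c.lower()), c) for c in candidates]
    scored = [s for s in scored if s[0] >= threshold]
    return max(scored, key=lambda s: s[0])[1] if scored else None
```

A per-field variant would call `best_match` once per metadata field (title, authors, and so on) with its own threshold.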