Master’s Thesis Defense • Bibliographic Tools In The Context Of WWW And LaTeX • Munushree Thummala • Committee members: Dr. Prabhaker Mateti (Advisor), Dr. Thomas Hartrum, Dr. T.K. Prasad
Agenda • Introduction • BibTeX Primer • Bibliographic Tool Survey • Requirements for the BibTeX Tools • Design Discussion • Conclusion • Future Work • Questions & Answers Session • Demonstration
Introduction • Preparing academic papers • Collecting bibliographic entries • Tools used to prepare the papers • Common problems
BibTeX Primer • What is BibTeX? • Helps authors prepare the References section of their documents • Defines entry types and required/optional fields • Uses “style” files to define the format of references • Standards for publications are specified in style files • Used with LaTeX • LaTeX collects the \cite{} commands in the .tex file • BibTeX extracts the corresponding references from the .bib file • BibTeX formats and sorts them according to the .bst style • The output of the BibTeX program is LaTeX-formatted text
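The first step of the workflow above, collecting citation keys from a .tex file, can be sketched in a few lines of Python. This is only an illustration: real LaTeX records the keys in the .aux file, which BibTeX then reads, and the helper name `collect_cite_keys` and the sample text are assumptions made for this sketch.

```python
import re

# Matches \cite{...} and the optional-argument form \cite[p.~3]{...}.
CITE_RE = re.compile(r"\\cite(?:\[[^\]]*\])?\{([^}]*)\}")

def collect_cite_keys(tex_source):
    """Return citation keys in order of first appearance, without repeats."""
    keys = []
    for group in CITE_RE.findall(tex_source):
        # One \cite may list several comma-separated keys.
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys

sample = r"See \cite{Thummala-2007} and \cite[p.~3]{lawrence01access, Thummala-2007}."
print(collect_cite_keys(sample))  # ['Thummala-2007', 'lawrence01access']
```

BibTeX would then look up each collected key in the .bib file and format the matching entries according to the .bst style.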
Sample BibTeX entry
@mastersthesis{Thummala-2007,
  author = {Munushree Thummala},
  title = {Bibliographic tools in the context of WWW and \latex},
  month = {November},
  year = {2007},
  school = {Wright State University},
  OPTkey = {},
  OPTtype = {},
  OPTaddress = {},
  OPTnote = {},
  OPTannote = {},
  advisor = {Prabhaker Mateti}
}
Contribution Of Thesis • Evaluation of Bibliographic tools • BiBTeX to Database Suite of Tools • Database to store BibTeX entries • LoadBiBTeX • BibSearch • Discovery of Duplicate BiBTeX entries • Normalization of BiBTeX entries • Text to BiBTeX Translation • TextToBiBTeX command line tool & API • PDFrefsToBiBTeX command line tool • Integration of TextToBiBTeX into Aigaion
Bibliographic Tools • There are 100+ tools • In this thesis: 87 are reviewed • Tools were evaluated for the following: • Formats supported • Navigating, Searching and Sorting capabilities • Ease of maintaining bibliographic entries • Duplicate discovery • Import/Export to other formats
Bibliographic Tools • Web browser based tools • Aigaion, Bibsonomy, CiteULike, Zotero, BibORB, Basilic, PubsOnline, etc. • Desktop/Small scale tools • JabRef, KBibTeX, TkBibTeX, BibDB, BibEdit, Open Office Bibliographic Manager, Tellico, etc. • Commercial tools • Scholar’s Aid, Bookends, NotaBene, ProCite, etc. • Utilities • Bib2html, Bibclean, Bp, Bibdup, Sixpack, etc.
A Few Notable Tools • Aigaion • Zotero • Bibsonomy • JabRef
Aigaion • Web application, Open source • Easy to use • Supports basic editing features • Supports Multiple Users • Native format is BiBTeX • Organizes references by Topics & Sub Topics • Maintains a list of authors to eliminate duplication • Duplicate discovery present in import feature
Zotero • Firefox Browser Extension • Easy to use • Organizes entries in collections • Captures bibliographic entries from websites automatically • Some drawbacks • Loses BibTeX citation keys and custom fields while importing • Not well suited for managing BibTeX bibliographies • Local storage
Bibsonomy • Web browser based, hosted service • Easy to use • References • Users upload refs and bookmarks to Bibsonomy • Made available to other users • Tagged with keywords for categorization and search • Can be exported as BiBTeX • Browser shortcuts to capture entries from web
JabRef • Desktop Application • Easy to use • Multiple .bib files can be edited • Search online: • CiteSeer, Medline, IEEE Xplore, arXiv.org • Native format is BibTeX • Auto-generates BibTeX keys • Imports/Exports multiple formats
CiteULike • Web browser based, hosted service • Easy to use • References • Users upload refs to CiteULike • Made available to other users • Tagged with keywords for categorization and search • Can be exported as BibTeX • Browser shortcuts to • capture entries from web • cite the current article
Requirements for New Tools • Text to BiBTeX translation • Translating free style text into BibTeX • Customizing the translation • Certainty of Recognition measure • Extract references section from PDF papers • Provide an API for other developers to integrate free style translation into their applications • Command line invocation • GUI also • Normalized BiBTeX output
Requirements (Contd. 2) • Database of Bibliographic entries • Database to store BiBTeX files • Tool to Detect duplicates • Command line invocation • Normalized BiBTeX output
Requirements (Contd. 3) • Search and Generate BiBTeX files • Flexible searches • Command line invocation • Outputs BiBTeX format • Normalized BiBTeX output • Platform Independent
Database on Local Machine • Tables to store • BiBTeX entries • lookup data for text to BiBTeX translation • search index data for fast and flexible searching
Database Of BibTeX Entries • A schema to store BibTeX entries, including string macros • Ability to specify a tag for each entry • Tag defaults to the .bib filename
Database Of Lookup Data • A database schema to store lookup tables • Lookup Tables: • Author Sub Names • Journal Names • Publishers • Cities • States • Months • Organizations
Database Of Search Indexes • A database schema to store BibTeX search index data • Stores data as a sequence of tokens • Provides the ability to search • Any field(s) • Any keyword(s) • Citation key is also stored as tokens
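A token-based index of this kind can be illustrated with an in-memory sketch. The thesis stores the index in database tables; the dictionary layout, the function names (`tokenize`, `build_index`, `search`), and the pseudo-field name `key` below are assumptions made for illustration, and filtering by tag is left out.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case a value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(entries):
    """Map (field, token) to the set of citation keys containing that token."""
    index = defaultdict(set)
    for key, fields in entries.items():
        for token in tokenize(key):          # the citation key is indexed too
            index[("key", token)].add(key)
        for field, value in fields.items():
            for token in tokenize(value):
                index[(field, token)].add(key)
    return index

def search(index, fields, keywords):
    """Return keys for which every keyword occurs in at least one listed field."""
    result = None
    for word in keywords:
        for token in tokenize(word):
            hits = set()
            for field in fields:
                hits |= index.get((field, token), set())
            result = hits if result is None else result & hits
    return sorted(result or [])

entries = {
    "Thummala-2007": {"author": "Munushree Thummala", "year": "2007"},
    "Lawrence-2001": {"author": "Steve Lawrence", "year": "2001"},
}
index = build_index(entries)
print(search(index, ["author"], ["Steve", "Lawrence"]))  # ['Lawrence-2001']
print(search(index, ["key"], ["2007"]))                  # ['Thummala-2007']
```

Indexing by (field, token) pairs is one simple way to support both field-restricted and keyword searches from the same table.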
LoadBiBTeX Tool • Loads BiBTeX files into the database and updates the search index tables • Loads the lookup tables used by Text to BiBTeX tool • Detects duplicates
LoadBiBTeX – Loads BibTeX Files • Program Usage • LoadBiBTeX –loadentries –bibtag thesis2007 –bibfile thesis.bib • Entries with errors are not loaded; they are reported in the output • Updates the index tables used by the BibSearch tool
LoadBiBTeX – Populate Lookup Tables • Program Usage • LoadBiBTeX –loadauthors –loadpublishers –loadjournals –bibfile thesis.bib • Only new values are loaded • The above command does not load the BibTeX entries
LoadBiBTeX – Duplicate Discovery • Program Usage • LoadBiBTeX –dupdisc –bibtag thesis2007 –bibfile thesis.bib • The BibTeX entries in thesis.bib are read and compared with the database entries tagged thesis2007 • Entries considered to be duplicates are displayed to the user
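The slides do not state how two entries are judged to be duplicates. As a rough sketch under that caveat, one plausible approach compares a fingerprint built from the title and year; the `fingerprint` rule, the function names, and the sample data below are assumptions, not the criteria LoadBiBTeX actually uses.

```python
import re

def fingerprint(entry):
    """Reduce an entry to a comparison key: cleaned title words plus year."""
    title = re.sub(r"[^a-z0-9 ]", "", entry.get("title", "").lower())
    return (" ".join(title.split()), entry.get("year", "").strip())

def find_duplicates(incoming, stored):
    """Return (incoming_key, stored_key) pairs whose fingerprints match."""
    seen = {}
    for key, entry in stored.items():
        seen.setdefault(fingerprint(entry), key)
    pairs = []
    for key, entry in incoming.items():
        match = seen.get(fingerprint(entry))
        if match is not None:
            pairs.append((key, match))
    return pairs

stored = {
    "Thummala-2007": {
        "title": "Bibliographic tools in the context of WWW and \\latex",
        "year": "2007",
    },
}
incoming = {
    "thummala07": {
        "title": "{Bibliographic} Tools in the Context of {WWW} and \\latex",
        "year": "2007",
    },
    "Lawrence-2001": {"title": "Access to Scientific Literature", "year": "2001"},
}
print(find_duplicates(incoming, stored))  # [('thummala07', 'Thummala-2007')]
```

Ignoring case, braces, and punctuation in the title lets entries that differ only in formatting or citation key be flagged for the user to review.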
BibSearch – Searching The Database • Program Usage • BibSearch –bibtag thesis2007 –fields author –keywords Donald Knuth • The database is searched for entries with the tag “thesis2007” and the words “Donald” and “Knuth” in the “author” field • The resulting BiBTeX entries and any required @String constructs are normalized and written to the output
Normalization • Make BibTeX entries consistent • Some of the rules • Citation keys are consistent • Fields are enclosed in {} to preserve formatting • Month field abbreviations are expanded • Missing required fields are indicated to the user appropriately • Fields are written in a consistent order • Where is it implemented? • In whichever tool a particular rule makes sense • Spread across TextToBiBTeX, LoadBiBTeX, BibSearch
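A few of these rules can be sketched as follows. This is a simplified illustration, not the thesis implementation: the `LastName-Year` key scheme is inferred from the sample entries, and the `REQUIRED` and `FIELD_ORDER` tables, the empty-field marker for missing required fields, and the function name `normalize` are assumptions.

```python
MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "may": "May", "jun": "June", "jul": "July", "aug": "August",
    "sep": "September", "oct": "October", "nov": "November", "dec": "December",
}
# Hypothetical tables covering a single entry type, for illustration only.
REQUIRED = {"mastersthesis": ["author", "title", "school", "year"]}
FIELD_ORDER = ["author", "title", "month", "year", "school"]

def normalize(entry_type, fields):
    """Return one normalized BibTeX entry as text."""
    entry_type = entry_type.lower()
    out = {name.lower(): str(value).strip() for name, value in fields.items()}
    if "month" in out:  # expand abbreviations such as "Nov"
        out["month"] = MONTHS.get(out["month"][:3].lower(), out["month"])
    for name in REQUIRED.get(entry_type, []):
        out.setdefault(name, "")  # an empty field marks a missing required value
    # Assumes "First Last" author names; "Last, First" is not handled here.
    words = out.get("author", "").split(" and ")[0].split()
    last_name = words[-1] if words else "Unknown"
    key = f"{last_name}-{out.get('year', '')}"
    ordered = [n for n in FIELD_ORDER if n in out]
    ordered += sorted(n for n in out if n not in FIELD_ORDER)
    lines = [f"@{entry_type.upper()}{{{key},"]
    lines += [f"  {name.upper()} = {{{out[name]}}}," for name in ordered]
    lines.append("}")
    return "\n".join(lines)

print(normalize("mastersthesis", {
    "title": "Bibliographic tools in the context of WWW and \\latex",
    "year": 2007,
    "month": "Nov",
    "author": "Munushree Thummala",
}))
```

In this sketch the missing `school` field appears as `SCHOOL = {}`, so the gap is visible in the output rather than silently dropped.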
Normalization (Example 2)
• Input:
@mastersthesis{Thummala2007,
  title = "Bibliographic tools in the context of WWW and \latex",
  year = 2007,
  school = "Wright State University",
  month = "Nov",
  author = "Munushree Thummala",
  advisor = "Prabhaker Mateti",
}
• Normalized output:
@MASTERSTHESIS{Thummala-2007,
  AUTHOR = {{Munushree} {Thummala}},
  TITLE = {{Bibliographic} tools in the context of {WWW} and \latex},
  MONTH = {November},
  YEAR = {2007},
  SCHOOL = {{Wright} {State} {University}},
  ADVISOR = {{Prabhaker} {Mateti}},
}
Normalization (Example 3)
• Input:
@InCollection{ lawrence01access,
  author = "Steve Lawrence",
  title = "Access to Scientific Literature",
  journal = "The {\it Nature} Yearbook of Science and Technology",
  editor = "Declan Butler",
  publisher = "Macmillan",
  address = "London, England",
  pages = "86-88",
  year = 2001
}
• Normalized output:
@INCOLLECTION{ Lawrence-2001,
  AUTHOR = {{Steve} {Lawrence}},
  TITLE = {{Access} to {Scientific} {Literature}},
  BOOKTITLE = {},
  YEAR = {2001},
  JOURNAL = {The {\it Nature} {Yearbook} of {Science} and {Technology}},
  EDITOR = {{Declan} {Butler}},
  PUBLISHER = {{Macmillan}},
  ADDRESS = {{London}, {England}},
  PAGES = {86-88},
}
Text to BibTeX Translation • What are Free Style References and where would authors find them? • References at the end of academic papers • References on Internet sites like CiteSeer • A jotted-down text description • How do authors benefit from this translation? • No need to manually convert to BibTeX • Significantly better accuracy • Speeds the process of translating multiple references
Text to BiBTeX Translation (Contd. 2) • Ways to translate free style text • Write a routine to analyze the strings and guess the fields • Develop • Language Grammar • Recursive Descent Parser • Which method did we pick? • Recursive Descent Parsing • Tried other methods with varying degrees of success
Text to BibTeX Translation (Contd. 3) • How does the Parser work? • Extent = A sequence of tokens • Field type = An extent that matches the set of okTokens for that field and ends when a notOkToken (including a delimiting token) is hit • Backtrack: If the current token in an extent does not match the field, the parser backtracks to the first token of the extent and tries to match the other field types from there • Unrecognized: If the current token does not match any field type, it is appended to the unrecognized field list and the process is repeated starting at the next token
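The extent, backtrack, and unrecognized steps described above can be sketched as a small loop. This is a toy version with only two field types; the token rules, the well-formedness checks, and all names (`FIELD_TYPES`, `parse`, `tokenize`) are assumptions for illustration and not the grammar used by TextToBiBTeX.

```python
import re

PAGE_WORDS = {"pages", "pp.", "p."}
DELIMITERS = {","}

def is_range(tok):
    return re.fullmatch(r"\d+(-{1,2}\d+)?", tok) is not None

def is_year(tok):
    return re.fullmatch(r"\d{4}", tok) is not None and 1900 <= int(tok) <= 2100

# Each field type: name, okToken test, and a check on the finished extent.
FIELD_TYPES = [
    ("pages",
     lambda t: t.lower() in PAGE_WORDS or is_range(t),
     lambda ext: len(ext) == 2 and ext[0].lower() in PAGE_WORDS and is_range(ext[1])),
    ("year", is_year, lambda ext: len(ext) == 1),
]

def tokenize(text):
    """Split on whitespace and keep commas as separate delimiter tokens."""
    return re.findall(r"[^\s,]+|,", text)

def parse(tokens):
    fields, unrecognized = {}, []
    start = 0
    while start < len(tokens):
        if tokens[start] in DELIMITERS:
            start += 1
            continue
        for name, ok, well_formed in FIELD_TYPES:
            end = start
            while end < len(tokens) and ok(tokens[end]):  # grow the extent
                end += 1
            extent = tokens[start:end]
            if name not in fields and extent and well_formed(extent):
                fields[name] = " ".join(extent)
                start = end
                break
            # Otherwise backtrack: the next field type begins again at `start`.
        else:
            unrecognized.append(tokens[start])  # no field type matched
            start += 1
    return fields, unrecognized

print(parse(tokenize("Knuth, pp. 10-20, 1997, pp. draft")))
```

In the sample, the token `1997` is first tried as a pages extent, fails the well-formedness check, and is then matched as a year after backtracking, while `Knuth`, the stray `pp.`, and `draft` end up in the unrecognized list.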
Text to BibTeX Translation (Contd. 4) • How is a series of tokens recognized as a field? • Author, Journal fields – lookup table and heuristics • Title field – quoted strings or heuristics • Pages field – [PAGES.|PP.|P.] <number [–][–number]> • Year field – a four-digit number between 1900 and 2100 • Volume field – [VOL. | VOLUME] <number> • Number field – [NO. | NUMBER] <number> • Abbrev field – <volume>(<number>):<startpage>–[-]<endpage> • Edition field – EDITION <number> or <number> EDITION • Publisher, Place, State fields – lookup table
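The pattern-based fields listed above can be approximated with regular expressions. The expressions below are one reading of the slide notation (for example, accepting either one or two dashes in a page range and treating the period after PAGES as optional); they and the function name `classify` are assumptions, and the lookup-table fields are not covered.

```python
import re

# Order matters: the more specific abbreviated form is tried first.
PATTERNS = [
    ("abbrev", re.compile(r"(\d+)\((\d+)\):(\d+)[-–]{1,2}(\d+)")),
    ("pages", re.compile(r"(?:PAGES\.?|PP\.|P\.)\s*(\d+)(?:[-–]{1,2}(\d+))?", re.I)),
    ("volume", re.compile(r"(?:VOL\.|VOLUME)\s*(\d+)", re.I)),
    ("number", re.compile(r"(?:NO\.|NUMBER)\s*(\d+)", re.I)),
    ("edition", re.compile(r"EDITION\s*(\d+)|(\d+)\s*EDITION", re.I)),
    ("year", re.compile(r"19\d\d|20\d\d|2100")),  # 1900 through 2100
]

def classify(fragment):
    """Return the first field type whose pattern matches the whole fragment."""
    text = fragment.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return None

for sample in ["pp. 86-88", "12(3):45--67", "vol. 2", "No. 4", "2 edition", "2001", "1850"]:
    print(sample, "->", classify(sample))
```

With these patterns the samples are classified as pages, abbrev, volume, number, edition, and year respectively, and `1850` is left unclassified because it falls outside the year range.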