JSTOR: Tools for Linguists
22nd June 2009
Michael Krot, Clare Llewellyn, Matt O’Donnell
Tools for Linguists
Aim: To create a set of workflows that can extract data from JSTOR, then process or visualize this data in ways that are useful for linguists.
Participants:
• JSTOR: Michael Krot, Clare Llewellyn
• U. Michigan: Matthew Brook O’Donnell
Data for Research Service
• The JSTOR archive:
  • 4.8M journal articles
    • 2.4M research articles
    • 1.6M review articles
  • ~14 billion words
  • 31M+ pages of OCR’d text
• Multidisciplinary
  • Content is organized into 50 disciplines
• High-quality bibliographic and structural metadata
  • Including 40M+ parsed reference citations
• The Data for Research service brings much of this content within easy reach of researchers
  • Powerful search tools
  • Convenient data retrieval options
Data for Research Service
• A self-serve tool for obtaining research data from the JSTOR archive
  • Provided through a web interface that lets researchers identify content of interest in the JSTOR archive and retrieve associated datasets for research purposes
• A researcher-oriented exploration tool complementing the search and browse capabilities of the main JSTOR site
  • Exposes additional fields for enhanced searching and results filtering
  • Provides data visualizations for viewing aggregate and document-level data
  • Links to the main JSTOR site are provided for documents in search results
    • Authentication and authorization are required to view article contents
Data for Research Service
• Application Programming Interface (API)
  • Provides support for programmatic searching and data retrieval
  • Uses RESTful protocols for ease of use
    • Plain URL requests, XML responses
• Standards-based search protocol
  • SRU (Search/Retrieve via URL)
    • Lightweight successor to the Z39.50 protocol
  • CQL (Contextual Query Language)
    • Formal language defining search syntax
• Data retrieval using a simple REST protocol
  • Provides access to the back-end content repository
  • Resource Oriented Architecture (ROA)
    • Stateless – requests contain all relevant information
    • Uses HTTP methods (GET, POST) for operations
  • http://dfr.jstor.org/resource/<resource-id>?view=<view-id>
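As a rough illustration of the "plain URL requests" idea, the sketch below builds the two kinds of URL the slide describes. Only the host and the `/resource/<resource-id>?view=<view-id>` pattern come from the slide. The SRU endpoint address, the example resource id, the view name `wordcounts`, and the `dc.title` index are placeholders chosen for illustration, not confirmed DfR values. The SRU parameter names (`operation`, `version`, `query`, `maximumRecords`) are the standard SRU 1.1 ones.

```python
from urllib.parse import quote, urlencode

DFR_BASE = "http://dfr.jstor.org"  # host as shown on the slide


def resource_url(resource_id: str, view_id: str) -> str:
    """Build a data-retrieval URL following the pattern on the slide."""
    query = urlencode({"view": view_id})
    return f"{DFR_BASE}/resource/{quote(resource_id)}?{query}"


def sru_search_url(endpoint: str, cql_query: str, max_records: int = 10) -> str:
    """Build a standard SRU 1.1 searchRetrieve URL.

    `endpoint` is a placeholder: the slide does not give the SRU address.
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,            # a CQL expression
        "maximumRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical values, for illustration only.
print(resource_url("example-id", "wordcounts"))
print(sru_search_url("http://example.org/sru", 'dc.title = "corpus linguistics"', 5))
```

Because each request carries everything the server needs in the URL itself, the same strings can be pasted into a browser or fetched with any HTTP client.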
Data for Research Service
• Data views available in DfR Beta3
  • Bibliographic metadata
    • Dublin Core
  • Word frequencies
    • List of distinct words and their occurrence counts
  • N-grams (specifically, word n-grams)
    • An n-gram is a sub-sequence of n items from a given sequence
    • Bigrams, trigrams, and quadgrams are provided by DfR
  • Keywords
    • Auto-extracted keywords based on their TF*IDF weight
    • TF*IDF (Term Frequency * Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
  • References (citations out)
    • Raw text for identified references
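To make the word-frequency, n-gram, and TF*IDF views concrete, here is a minimal sketch of how such values can be computed from raw text. It is not DfR's implementation: the tokenizer and the particular TF*IDF variant (relative term frequency times the natural log of N/df) are assumptions, and DfR may tokenize and weight differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every contiguous sub-sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens):
    """Return one {word: weight} dict per document.

    weight = (count / document length) * ln(N / document frequency)
    """
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "The corpus of academic English",
    "The syntax of English clauses",
    "The corpus frequency list",
]
tokens = [tokenize(d) for d in docs]
print(Counter(tokens[0]))    # word frequencies for the first document
print(ngrams(tokens[0], 2))  # its bigrams
print(tf_idf(tokens)[0])     # its TF*IDF weights
```

A word that occurs in every document, such as "the" here, gets a weight of zero, which is why TF*IDF tends to surface terms specific to one document.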
Components for API Interaction
• ** Need to clarify the stuff from Bernie
• Primary Component – JSTOR API interface
  • Persistent SEASR web service
  • HTTP Listener
  • HTTP Responder
Tools and resulting data are most likely to be of interest to:
• Computational linguists
  • For use in a range of NLP applications; large discipline-specific datasets open up incredible options in computational semantics, tagging, parsing, text mining, etc.
  • Numerous applications for a JSTOR-derived academic n-gram set (the 1-million-word Brown corpus from the 1960s is still used as a source of frequency information!)
• Corpus and applied linguists
  • The study of distinctive vocabulary and phraseology (lexical patterns of 2+ grams) within and across academic disciplines is currently limited by the scarcity and small size of available data
  • Finding words and phrases distinctive to or strongly associated with specific disciplines (statistically identified ‘key words’) requires frequency information from large samples
  • Need for discipline-specific frequency lists in the teaching and testing of English for Academic Purposes (EAP)
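One common way to identify 'key words' statistically is the log-likelihood keyness score, which compares a word's frequency in a target corpus against a reference corpus. The slides do not say which statistic the project would use, so the sketch below is only one plausible choice, and the toy counts are invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness.

    a, b: frequency of the word in the target / reference corpus
    c, d: total number of tokens in the target / reference corpus
    """
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score


def keywords(target, reference, top=10):
    """Rank words that are relatively more frequent in `target`."""
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words overused in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Invented counts standing in for two discipline-specific frequency lists.
linguistics = Counter({"phoneme": 30, "the": 100})
economics = Counter({"phoneme": 1, "the": 100, "market": 30})
print(keywords(linguistics, economics))
```

The score grows with the size of the samples, which is one reason large frequency lists are needed before keyness results become reliable.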
Workflow
• Define the search terms to create the data set(s)
• Submit a query to the JSTOR API and receive a response
• Download the data set(s) for one or more of the data views
• Conduct analysis using SEASR components
• Create visualizations using SEASR components
Comparing the Data
• Different data sets:
  • Different searches in JSTOR, varying by:
    • Journal
    • Discipline
    • Dates
  • Compare your own data set with one from JSTOR
  • Use components to analyze or compare the data
    • Calculate differences between sets
    • Extract specific entities, for example proper nouns
    • Extract key differences
• Different data views:
  • Word counts
  • Bigrams
  • Trigrams
  • Quadgrams
  • Key terms
  • References
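A simple version of "calculate differences between sets" can be sketched in a few lines: given two word-count tables, list the vocabulary unique to each and compare the shared words on a per-million-words basis so that data sets of different sizes are comparable. This is a standalone illustration rather than a SEASR component, and the counts are invented.

```python
from collections import Counter


def per_million(counts):
    """Normalize raw counts to occurrences per million tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def compare(a, b):
    """Return (words only in a, words only in b, rate differences for shared words)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    rate_a, rate_b = per_million(a), per_million(b)
    shared = {word: rate_a[word] - rate_b[word] for word in set(a) & set(b)}
    return only_a, only_b, shared


# Invented counts, e.g. your own data set versus one retrieved from JSTOR.
own_data = Counter({"corpus": 3, "the": 7})
jstor_data = Counter({"market": 5, "the": 5})
only_own, only_jstor, shared = compare(own_data, jstor_data)
print(only_own, only_jstor, shared)
```

The same function works unchanged on bigram, trigram, or quadgram counts if the keys are tuples of words instead of single words.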
Visualizing the Data
• Use the visualization capabilities already in SEASR to display results:
  • Tables
  • Graphs
  • Clustering
  • Dendrograms
Progress
• Defining what we wanted to do
• Looking at what is already available
• Discussions with SEASR folks
• Producing a shared area for work at UIUC
• Work on making the JSTOR API accessible
• Re-defining what we want to do!
Experience
• SEASR staff very knowledgeable, helpful and responsive
• Learning curve
  • Easy to do the simple stuff
  • Can see the benefits of building our own components but cannot find the time to learn the skills
• Difficult to assign time – really need to build it into another project
Any questions / feedback?
• Contact details
  • michael.krot@jstor.org
  • clare.llewellyn@jstor.org
  • mbod@umich.edu