1 / 1

Architecture to Enable Large-scale Computational Analysis of Millions of Volumes

Architecture to Enable Large-scale Computational Analysis of Millions of Volumes. Yiming Sun, Stacy Kowalczyk , Beth Plale , J. Stephen Downie , Loretta Auvil , Boris Capitanu , Kirk Hess, Zong Peng , Guangchen Ruan , Aaron Todd, Jiaan Zeng. HTRC System Architecture.

gent
Download Presentation

Architecture to Enable Large-scale Computational Analysis of Millions of Volumes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Architecture to Enable Large-scale Computational Analysis of Millions of Volumes Yiming Sun, Stacy Kowalczyk, Beth Plale, J. Stephen Downie, Loretta Auvil, Boris Capitanu, Kirk Hess, ZongPeng, GuangchenRuan, Aaron Todd, Jiaan Zeng HTRC System Architecture HTRC Portal & Blacklight Algorithm: Entity Extraction Firewall VM Instances VM Instances VM Instances Cassandra cluster volume store Solr index Storage resources Blacklight Portal Security (OAuth2 WSO2 Identity Server) Application Submission Algorithms and Worksets Registry (WSO2 Governance Registry) • Date entities extracted and displayed using SIMILE • Location, person, time, date, money, organization, percentage can be extracted and displayed in tabular form • Entities extracted from Jane Austen collection • 2.7 million volumes in corpus • 3 stacks: production, development, sandbox • HTRC-provided non-consumptive algorithms HTRC Data API access interface Algorithm: Topic Modeling Audit High level apps Topic Modeling Token Count Log Likelihood Entity Extraction Prototype Sloan Cloud Compute resources Algorithm: Dunning Log Likelihood Authentication Server VM Manager Job Submission Module Job Manager Registry Service Corpus Repository • LDA-style topic analysis using Mallet • First and last lines of each page removed • End of line hyphenated words are joined • Tokens containing non-alphanumeric characters removed • Stop words filtered, with “not ” replaced by “not_” • 12 topics, 100 tokens per topic over Jane Austen collection • Support for large-scale non-consumptive algorithms • Allow researchers to submit algorithms and worksets to run as Map/Reduce jobs • Words more frequent in Charles Dickens collection (analysis) than in Jane Austen collection (reference)

More Related