[Figure: CiteSeerX architecture. Web application: load balancer, replicated servlet containers with session replication, Spring MVC framework (controller, MyCiteSeer, JSP views, data model), CAS and Acegi Security (authenticate, authorize), replicated user DB. Execution layer: load balancer, CSEL engines running CSEL scripts (control flow built from <sequence>, <flow>, <invoke>), task repositories with native load balancing (index servers, citation matching, data acquisition), central logger. Data layer: load balancer, file server distributor, replicated Fedora or DBMS, file system repository. Ingestion workflow: WWW, crawlers, crawl manager, receiver, BPEL engine, exact duplicate detector, XML file server, extraction of header, acknowledgments, figures, tables, citations.]

CiteSeerX: Next-Gen CiteSeer

Isaac G. Councill¹, Huajing Li², Levent Bolelli², Yang Song², Ziming Zhuang¹, Jian Huang¹, Yang Sun¹, Ding Zhou², Wang-Chien Lee², Anand Sivasubramaniam², C. Lee Giles¹,²
¹College of Information Sciences and Technology
²Department of Computer Science and Engineering
The Pennsylvania State University, University Park, PA 16802, USA

Supported in part by NSF CRI 0454052, NSF 0202007, Microsoft Research, and NASA

The CiteSeer Research Library was created in 1997 to demonstrate autonomous citation indexing (ACI), and has since grown to a collection of over 770,000 documents. CiteSeer currently receives over 1 million requests daily and serves over 1 terabyte of data every month. Having outgrown its original architecture, CiteSeer is being re-architected from the ground up.
Background
Legacy CiteSeer

Web Application
• MyCiteSeer - New Personal Content Portal
• Personal collections, RSS-like notifications, social bookmarking, social network facilities
• Personalized search settings
• Institutional data tracking possible
• Transparent document submission system
• J2EE Servlet Deployment
• Using Spring MVC Framework for improved organization and extensibility - generic model calls lower-tier execution system for data
• Secure SSO through Acegi Security and Central Authentication Service (CAS)

Execution System
• CSEL Distributed Execution Engine
• CiteSeer Execution Language: very simple BPEL-like XML scripting, sequential/parallel task flows
• CSEL scripts control complex service workflow
• Scalable and flexible: fine-tuned service replication; built-in load balancing, fail-over
• Component Task Provider Framework
• Base libraries provide high-performance container for auto-registering and executing component tasks, including all low-level CiteSeer functionality
• Scheduled batch events, e.g. near-dup detection
• High-Performance Object Communication
• Custom framework transfers large serialized Java objects in microseconds - connection type plug-ins
• Fedora Integration: Investigating low-level API
• Currently DBMS, pursuing benefits of Fedora

Data Ingestion
Configurable Ingestion Pipeline
• Extraction and classification algorithms as a collection of web services (WS), multiple languages possible
• Orchestration via BPEL (Business Process Execution Language), a standard backed by leading companies that supports complex workflows
• Results submitted as XML with URI pointers to file resources - local server provides file access
Continuous, Manageable Crawling
• Crawl results piped to ingestion receiver. Pipe can be closed at will - crawl manager queues
New and Improved Algorithms
• Tables, figures, acknowledgments, header
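
The sequential/parallel task flows described under Execution System can be illustrated with a minimal sketch. This is not the CSEL engine or its actual script syntax: the element names <sequence>, <flow> and <invoke> are taken from the architecture diagram, while the "task" attribute, the task names, and the TASKS registry are hypothetical and exist only for illustration. The sketch runs children of <sequence> one after another and children of <flow> concurrently, returning results in script order.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task registry; real CSEL component tasks are not shown in the poster.
TASKS = {
    "fetch": lambda: "fetched",
    "extract_header": lambda: "header",
    "extract_citations": lambda: "citations",
    "index": lambda: "indexed",
}

# Hypothetical script: the "task" attribute is an assumption, not CSEL syntax.
SCRIPT = """
<sequence>
  <invoke task="fetch"/>
  <flow>
    <invoke task="extract_header"/>
    <invoke task="extract_citations"/>
  </flow>
  <invoke task="index"/>
</sequence>
"""


def run(node, tasks):
    """Execute a script node and return its task results in script order."""
    if node.tag == "invoke":
        # Leaf: look up the named task and call it.
        return [tasks[node.get("task")]()]
    if node.tag == "sequence":
        # Children run one after another.
        results = []
        for child in node:
            results.extend(run(child, tasks))
        return results
    if node.tag == "flow":
        # Children run concurrently; pool.map keeps results in child order.
        with ThreadPoolExecutor(max_workers=max(1, len(node))) as pool:
            parts = list(pool.map(lambda child: run(child, tasks), node))
        return [result for part in parts for result in part]
    raise ValueError(f"unknown element: {node.tag}")


if __name__ == "__main__":
    print(run(ET.fromstring(SCRIPT.strip()), TASKS))
```

A production engine would also need the replication, load balancing, and fail-over that the poster lists; none of that is modeled here.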