The Cancer Imaging Archive (TCIA): Creating a Large Public Image Collection Lawrence R. Tarbox, PhD; Paul Koppel, PhD; Steve Moore, MS; Michael Pringle; Fred Prior, PhD Washington University in St. Louis, School of Medicine Justin Kirby Cancer Imaging Program, National Cancer Institute.
What is TCIA?
• Repository for collections of cancer-related images and associated data
• Both public and restricted-access collections
• Images, annotations, markup, clinical and research data
• Curated, de-identified, indexed, linked
• Managed and supported
• Focal point for collaborative research
See TCIA operations poster for more details
Current Statistics
• Over 1,000 registered users
• Over 3 million objects in 15 public collections (~2 TB)
• Over 12 million objects in 9 restricted-access collections (~6 TB)
• In the last year, for public collections:
  • Almost 6,000 searches
  • 4.5 TB of data downloaded (13.6 million images)
Original Technical Goals
• High availability
  • 99.5% uptime
  • Minimal maintenance windows: no more than 8 hours per month
• Scalable, with the following initial loads:
  • 20 users accessing 4 TB of data
  • 5 sites uploading data
Methods to Achieve Goals
• Parallelism
  • "Divide and conquer": servers dedicated to specific functions
  • "Many hands make light work": spread the load among multiple servers
• Redundant hardware
• Geographically dispersed sites
• Virtualization and live migration
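The load-spreading idea above can be sketched with a simple deterministic dispatcher. This is an illustration only, not TCIA's actual routing logic; the server names and `pick_server` function are hypothetical.

```python
import hashlib

# Hypothetical pool of function-dedicated servers (names are illustrative).
SERVERS = ["node-a", "node-b", "node-c"]

def pick_server(request_id: str) -> str:
    """Spread requests across the pool by hashing the request ID,
    so each server handles a stable share of the overall load."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

Hashing rather than random choice keeps repeated requests for the same object on the same server, which helps caching; a real deployment would instead use a load-balancing switch, as described later.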
NBIA Software Functions
• Receive
• Curate
• Final Prep
• Search
• Download
The default NBIA installation puts all functions in one server.
Divide and Conquer
[Diagram: multiple identical NBIA instances (Receive, Curate, Final Prep, Search, Download), split into Intake and Public tiers]
• Identical instances of NBIA dedicated to specific functions
• Grouped for now, but could be split later
• Logical separation of incoming and public data to protect repository integrity
Many Hands, Light Work
[Diagram: load-balancing switch in front of Intake and Public clusters, with database replication and shared storage within each]
Each group of functions runs in a cluster, with synchronized databases and shared image storage within the group.
Additional Functions
[Diagram: load-balancing switch fronting mirrored Web, Wiki, Issue Tracker, and LDAP servers alongside the Intake and Public clusters, each with shared storage]
• Mirrored web servers for static info pages, dashboard, and user registration
• Mirrored LDAP servers provide a common user ID directory
• Development/staging servers and metadata repositories not shown
Availability and Scaling
• Critical functions in clusters
  • Databases mirrored within the clusters
  • Load balancer directs traffic to the least-loaded node, skipping nodes that are down
• Add nodes as loading demands; additional clusters as needed
• Redundant, mirrored, shared storage, with mirrors geographically distributed
• Redundant load-balancing switches
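The routing rule above (least-loaded node, skipping nodes that are down) can be sketched as follows. The `Node` structure and `route` function are assumptions for illustration; a real L7 switch implements this in hardware/firmware.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    load: float     # e.g. active connections, normalized
    healthy: bool   # result of the balancer's health check

def route(nodes: list[Node]) -> Node:
    """Direct traffic to the least-loaded healthy node,
    skipping nodes that are down."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy nodes in cluster")
    return min(candidates, key=lambda n: n.load)
```

For example, with nodes at loads 0.7 (up), 0.2 (down), and 0.4 (up), traffic goes to the 0.4 node: the lightest-loaded node is ignored because it failed its health check.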
Virtualization
• Scalable, with minimized startup costs
• Grouped functions each run in a virtual machine
• Server-locked for clustered function groups
  • No need for live migration, since other nodes in the cluster fill the void when a node is down
• Floating for non-clustered functions
  • Live-migrate for server maintenance
  • Snapshots in case migration fails
• Identical servers ease maintenance
Current Hardware
• Three 12-core servers, 2.66 GHz, mirrored 600 GB 15K RPM SAS drives, 48 GB memory (two additional 24 GB servers joining the 'mini private cloud')
• Three-head clustered storage system with multiple RAID 6 FC-connected storage arrays using near-line SAS disks; maximum capacity 24 PB
• Redundant L7 (application-level) load-balancing switches with SSL offload
Other Enhancements
• TCIA branding of the NBIA software
• Self-service account registration
• DICOM "TagSniffer" to assist in curation
• Intake improvements; newer version of Clinical Trial Processor (CTP)
• Statistics dashboard and analytics
• Download Manager improvements (automatic retries, bug fixes)
• Metadata Query Tool (in development)
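The idea behind a curation "tag sniffer" is to scan DICOM header fields for values that may carry protected health information before data is published. A minimal sketch, assuming plain keyword/value pairs rather than real DICOM attributes; the `SUSPECT_KEYWORDS` list and `sniff` function are illustrative, not the actual TagSniffer rules.

```python
import re

# Illustrative heuristics only: flag tags whose keyword suggests identity,
# or whose value looks like a DICOM DA-format date (YYYYMMDD).
SUSPECT_KEYWORDS = ("Name", "Address", "BirthDate", "ID")
DATE_PATTERN = re.compile(r"\b(19|20)\d{6}\b")

def sniff(tags: dict[str, str]) -> list[str]:
    """Return the tag keywords whose name or value looks identifying."""
    flagged = []
    for keyword, value in tags.items():
        if any(k in keyword for k in SUSPECT_KEYWORDS):
            flagged.append(keyword)
        elif DATE_PATTERN.search(value):
            flagged.append(keyword)
    return flagged
```

For example, `sniff({"PatientName": "DOE^JANE", "Modality": "CT", "StudyDate": "20110315"})` flags `PatientName` and `StudyDate` but not `Modality`, giving a curator a shortlist of attributes to review for de-identification.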
Summary
• Parallelism provides reliability and scalability
• Virtualization makes parallelism economical and simplifies maintenance
• Dividing servers between sites further improves reliability
• All of this can be done with free/open-source software, if desired