
The Cancer Imaging Archive (TCIA): Creating a Large Public Image Collection


Presentation Transcript


1. The Cancer Imaging Archive (TCIA): Creating a Large Public Image Collection
Lawrence R. Tarbox, PhD; Paul Koppel, PhD; Steve Moore, MS; Michael Pringle; Fred Prior, PhD (Washington University in St. Louis, School of Medicine)
Justin Kirby (Cancer Imaging Program, National Cancer Institute)

2. What is TCIA?
• Repository for collections of cancer-related images and associated data
• Both public and restricted-access collections
• Images, annotations, markup, clinical and research data
• Curated, de-identified, indexed, linked
• Managed and supported
• Focal point for collaborative research
See the TCIA operations poster for more details.

3. Current Statistics
• Over 1,000 registered users
• Over 3 million objects in 15 public collections (~2 TB)
• Over 12 million objects in 9 restricted-access collections (~6 TB)
• In the last year, for public collections:
  • Almost 6,000 searches
  • 4.5 TB of data downloaded (13.6 million images)

4. Original Technical Goals
• High availability
  • 99.5% uptime
  • Minimal maintenance windows: no more than 8 hours per month
• Scalable, with the following initial loads:
  • 20 users accessing 4 TB of data
  • 5 sites uploading data
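As a quick sanity check on those targets, the short Python sketch below converts the 99.5% uptime goal into a monthly unplanned-downtime budget. The 730-hour average month and the assumption that planned maintenance is excluded from the uptime figure are mine, not the slide's:

```python
# Back-of-the-envelope downtime budget implied by the slide's targets.
HOURS_PER_MONTH = 8760 / 12      # average month, ~730 h (assumption)

uptime_target = 0.995            # 99.5% uptime goal from the slide
unplanned_budget = (1 - uptime_target) * HOURS_PER_MONTH
planned_budget = 8               # maintenance-window cap from the slide

print(f"Unplanned downtime allowed: {unplanned_budget:.1f} h/month")  # ~3.7
print(f"Planned maintenance allowed: {planned_budget} h/month")
# Assumes planned maintenance windows are excluded from the uptime
# figure, which the slide does not state explicitly.
```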

5. Methods to Achieve Goals
• Parallelism
  • "Divide and conquer": servers dedicated to specific functions
  • "Many hands make light work": spread the load among multiple servers
• Redundant hardware
• Geographically dispersed sites
• Virtualization and live migration

6. NBIA Software Functions
• Receive
• Curate
• Final Prep
• Search
• Download
The default NBIA installation puts all functions in one server.

7. Divide and Conquer
[Diagram: multiple identical NBIA instances, each carrying the full Receive / Curate / Final Prep / Search / Download stack but dedicated to a specific function, grouped into Intake and Public groups.]
• Identical instances of NBIA dedicated to specific functions
• Functions grouped for now, but could be split later
• Logical separation of incoming and public data to protect repository integrity
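NBIA does not actually expose a per-node role switch like the one below; this is only a hypothetical Python sketch of the deployment pattern the slide describes: every node runs the same software image, and per-node configuration decides which functions it serves.

```python
# Hypothetical sketch of "identical instances, dedicated functions":
# every node runs the same NBIA build, and a per-node role set decides
# which functions are enabled. Node names and the role mechanism are
# illustrative assumptions, not part of NBIA.

ALL_FUNCTIONS = {"receive", "curate", "final_prep", "search", "download"}

NODES = {
    # Intake group: handles unvetted uploads, isolated from the public side.
    "intake-1": {"receive", "curate", "final_prep"},
    "intake-2": {"receive", "curate", "final_prep"},
    # Public group: serves end-user search and download only.
    "public-1": {"search", "download"},
    "public-2": {"search", "download"},
}

def enabled(node: str, function: str) -> bool:
    """True if the given node should handle the given NBIA function."""
    assert function in ALL_FUNCTIONS, f"unknown function: {function}"
    return function in NODES[node]

print(enabled("intake-1", "receive"))   # True
print(enabled("public-1", "receive"))   # False
```

Because every instance is identical, a group can later be split further (say, a dedicated download cluster) by changing configuration rather than the software image.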

8. Many Hands, Light Work
[Diagram: a load-balancing switch in front of the Intake and Public clusters, with database replication within each cluster and shared image storage per group.]
Each group of functions runs as a cluster, with synchronized databases and shared image storage within the group.

9. Additional Functions
[Diagram: the load-balancing switch also fronts mirrored web servers, a wiki, an issue tracker, and mirrored LDAP servers alongside the Intake and Public clusters and their shared storage.]
• Mirrored web servers for static info pages, the dashboard, and user registration
• Mirrored LDAP servers provide a common user ID directory
• Development/staging servers and metadata repositories not shown

10. Availability and Scaling
• Critical functions in clusters
  • Databases mirrored within the clusters
  • Load balancer directs traffic to the least loaded node, skipping nodes that are down
  • Add nodes as loading demands; additional clusters as needed
• Redundant, mirrored, shared storage
  • Mirrors geographically distributed
• Redundant load-balancing switches
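The slide's routing rule (send each request to the least loaded healthy node) fits in a few lines of Python. The node names and the connection-count load metric below are illustrative assumptions; the slide does not say what metric the switch actually uses.

```python
# Minimal sketch of the balancing rule on the slide: route traffic to
# the least loaded node, skipping nodes that fail their health check.
from typing import Optional

nodes = {
    "public-1": {"healthy": True,  "active_connections": 12},
    "public-2": {"healthy": True,  "active_connections": 4},
    "public-3": {"healthy": False, "active_connections": 0},   # down
}

def pick_node() -> Optional[str]:
    """Return the healthy node with the fewest active connections."""
    healthy = {n: s for n, s in nodes.items() if s["healthy"]}
    if not healthy:
        return None   # whole cluster down; another cluster or site takes over
    return min(healthy, key=lambda n: healthy[n]["active_connections"])

print(pick_node())   # -> "public-2"
```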

11. Virtualization
• Scalable, with minimized startup costs
• Grouped functions each run in a virtual machine
  • Server-locked for clustered function groups: no need for live migration, since other nodes in the cluster fill the void when a node is down
  • Floating for non-clustered functions: live-migrate for server maintenance, with snapshots in case migration fails
• Identical servers ease maintenance

12. Current Hardware
• Three 12-core servers, 2.66 GHz, mirrored 600 GB 15K RPM SAS drives, 48 GB memory (two additional 24 GB servers joining the "mini private cloud")
• Three-head clustered storage system with multiple RAID 6 FC-connected storage arrays with near-line SAS disks, maximum capacity 24 PB
• Redundant L7 (application-level) load-balancing switches with SSL offload

13. Other Enhancements
• TCIA branding of the NBIA software
• Self-service account registration
• DICOM "TagSniffer" to assist in curation
• Intake improvements, including a newer version of the Clinical Trial Processor (CTP)
• Statistics dashboard and analytics
• Download Manager improvements (automatic retries, bug fixes)
• Metadata Query Tool (in development)
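The snippet below is not the actual TagSniffer implementation, just a minimal pydicom sketch of the underlying curation idea: walk the headers of incoming DICOM files and flag tags that commonly carry identifying text. The tag list, directory path, and output format are illustrative assumptions.

```python
# Minimal sketch of a curation "tag sniffer" (not the actual CTP
# TagSniffer): scan DICOM headers and report values in tags that
# de-identification reviews commonly inspect.
from pathlib import Path
import pydicom

SUSPECT_KEYWORDS = {            # illustrative, not an official list
    "PatientName", "PatientID", "InstitutionName",
    "ReferringPhysicianName", "StudyDescription",
}

def sniff(directory: str) -> None:
    """Print any suspect header values found under the given directory."""
    for path in Path(directory).rglob("*.dcm"):
        ds = pydicom.dcmread(path, stop_before_pixels=True)  # skip pixel data
        for elem in ds:
            if elem.keyword in SUSPECT_KEYWORDS and elem.value:
                print(f"{path.name}: {elem.keyword} = {elem.value!r}")

# Hypothetical usage: sniff("/data/intake/incoming")
```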

14. Summary
• Parallelism provides reliability and scalability
• Virtualization makes parallelism economical and simplifies maintenance
• Dividing servers between sites further improves reliability
• All of this can be done with free/open-source software, if desired

  15. Questions?
