Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Update on Content-Based Image Retrieval Technology: incremental algorithmic advances making deployment in surgical pathology an increasingly viable proposition Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology University of Michigan Health System ulysses@umich.edu

Disclosures* • Aperio: • Technical Advisory Board and Shareholder *Listed for completeness only; this presentation does not contain proprietary or commercial vendor content.

Thesis Statement The availability of digital whole slide data sets represents an enormous opportunity to carry out new forms of numerical and data-driven query, in modes not based on textual, ontological or lexical matching. • Search image repositories with whole images or image regions of interest • Carry out search in real-time via use of scalable computational architectures • Resultant Surface Map or gallery of matching images [Diagram: Extraction from image repositories based upon spatial information → …001011010111010111.. → Analysis of data in the digital domain]

Overview • Brief Overview and History of the CBIR Realm • Some Specific Discussion on Model-Free Pattern Recognition • An update from where the field stood last year • Computational realities & performance improvements • Specific examples • Interactive exploration of image searching with Model-Free tools • HistoQuery • HistoMine

A Quick History of Content-Based Image Retrieval • 1970s: Corona Satellite Remote Sensing Initiative • Film-based • Resultant analog content, when digitized, represented Gigabytes of data (consider the computational burden for 1972…) • Several numerical approaches devised to quickly crunch data • Many approaches based on conventional image analysis: one or more specific algorithms developed for each feature to be extracted / identified • Technically challenging • Time consuming • Computationally expensive • The term CBIR first coined in 1992 by T. Kato to describe automatic retrieval of images from a database. • Many CBIR modalities

Present Commercial Use of CBIR • Not to identify image matches but to exclude classes of imagery in web-based image searching • Google Image Search with “Safe mode” activated • Easier to exclude whole classes of images than to select specific precise matches • Reduced to practice for small-scale real time search • ~10^2 images queried per submission (post lexical selection)

CBIR Techniques (model-based) • Color Operators • Texture operators • Shape • Spectral information • Frequency and phase domain information There are at least several thousand major classes of conventional image analysis operations, with most exhibiting the common trait of requiring some degree of application tuning for the intended use-case. Hence, this class of approaches should not be generally viewed as turnkey solutions.

CBIR Techniques (model-free) • “Genetic” Image Exploration • Originally designed to analyze multispectral satellite data • Semi-autonomous systems that employ a decision-tree to search a known repertoire of conventional image analysis algorithms for the most sensitive and specific combination of algorithms that fits the query predicate • is representative • (Los Alamos National Labs) • Open Microscopy Environment (OME); Ilya Goldberg – NIA • Autonomous operation comes at a price: the need for significant computational throughput in training mode (e.g. slow…)

From: http://openmicroscopy.org/site/support/omero4

CBIR Sub Modalities • QBVE (Query by visual example) • Searches for a near-exact example • QBVP (Query by visual prototype) • Searches for a region with similar sub-regions as the predicate • MPE (Minimum probability of error) • Searches for the statistical minimum of cumulative difference errors for each constitutive component feature All of the above search modalities can be carried out with either model-based or model-free approaches.

CBIR Operational Modes • Query by Example • Find pictures that contain this snippet / ROI • Semantic Retrieval • Find pictures like adenocarcinoma • Like this adenocarcinoma • Multimodal Retrieval • Search for matches based on imagery data combined with other search metrics • High-throughput “omics” data, etc. • Patient clinical outcomes and therapeutic response data • Other imaging modalities

Definition • Content-Based Image Retrieval (CBIR): • Within the context of an image-based repository, searching for matching predicates with image-based operators in lieu of text matching • Reverse Metadata Lookup (RML): • Using the cohort of returned images from a CBIR query to generate a list of associated metadata concept terms • Anatomic frame of reference • Prior diagnoses • Differential Diagnosis
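
As a rough illustration of Reverse Metadata Lookup, the sketch below tallies the metadata terms attached to a cohort of images returned by a CBIR query and ranks them by frequency. The record layout (`site` and `diagnosis` keys), the sample values and the function name `reverse_metadata_lookup` are assumptions made for illustration only; they are not taken from any system described here.

```python
from collections import Counter

def reverse_metadata_lookup(cohort, fields=("site", "diagnosis")):
    """Rank metadata terms by how often they occur in a returned image cohort."""
    ranked = {}
    for field in fields:
        # Count each term for this field across all returned images
        counts = Counter(rec[field] for rec in cohort if field in rec)
        ranked[field] = counts.most_common()
    return ranked

# Hypothetical cohort returned by a CBIR query (illustrative values only)
cohort = [
    {"image_id": 1, "site": "colon", "diagnosis": "adenocarcinoma"},
    {"image_id": 2, "site": "colon", "diagnosis": "adenocarcinoma"},
    {"image_id": 3, "site": "colon", "diagnosis": "tubular adenoma"},
]
result = reverse_metadata_lookup(cohort)
print(result["diagnosis"])
```

The most frequent terms can then seed an anatomic frame of reference or a differential diagnosis list.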

CBIR Techniques • Model-Based Algorithmic approaches • Specific to intended subject matter • Brittle • May require deep domain programming knowledge • Model-free approaches • Agnostic to underlying subject matter • Robust • Domain programming knowledge is not required • Ideal for ground truth operations

Conventional Image Analysis • At present, confined to specific use-cases: • Quantitative IHC • FDA validation linked to each use-case • Not reduced to practice as an integral tool of the “pathologist’s workstation” • Not capable of searching 1 million or more whole slide images in real-time

The Challenge That Is Pathology CBIR • Start with some conservative initial assumptions, concerning a prototypic image repository, in terms of search potential: • Ability to search 10 years of data • 1000 slides/day → 200,000 slides/year • 500 MB of compressed whole slide data/slide • Operational goal of being able to: • Search in real-time • Re-index the database every evening, such that searches carried out the next day are current

The Challenge That Is Pathology CBIR • Net storage required for ten years’ worth of data: • 1 Billion Megabytes • 10^6 Gigabytes • 10^3 Terabytes • 10^0 Petabytes → 1 Petabyte • Current conservative enterprise storage is $2000/Terabyte • The full Petabyte would cost $2M • A single Genetic-type search across all images, assuming 5-50 seconds of computation / slide, would be: • 200,000 slides/year * 10 years * 5 seconds/slide = 10 million+ seconds • This is 6 log too slow • 16.5 weeks or about 3 searches per year • (original Apple 2e: 78 years) • So we would need to save our queries for those “really important” image searches…. • Conventional Vector Quantization (VQ), which is ~100 times faster, is still not fast enough: about 27.8 hours per feature search • Yet another 4-5 log of performance is required… • Two ways to address this: • 10,000-100,000 parallel processors or • better algorithms
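
The storage and search-time estimates follow directly from the stated assumptions, and the short sketch below recomputes them. It uses the lower bound of 5 seconds per slide, decimal storage units (1 TB = 10^6 MB) and the quoted ~100x speed-up for conventional VQ; the variable names are illustrative.

```python
# Assumptions taken from the slides
SLIDES_PER_YEAR = 200_000
YEARS = 10
MB_PER_SLIDE = 500
USD_PER_TB = 2_000
SECONDS_PER_SLIDE = 5   # lower bound of the 5-50 s range
VQ_SPEEDUP = 100        # conventional VQ is quoted as ~100x faster

total_slides = SLIDES_PER_YEAR * YEARS

# Storage: decimal units, 1 TB = 1,000,000 MB
storage_mb = total_slides * MB_PER_SLIDE
storage_tb = storage_mb / 1_000_000
storage_cost = storage_tb * USD_PER_TB

# Time for one exhaustive search across the whole repository
search_seconds = total_slides * SECONDS_PER_SLIDE
search_weeks = search_seconds / (7 * 24 * 3600)
vq_hours = search_seconds / VQ_SPEEDUP / 3600

print(f"storage: {storage_mb:,} MB = {storage_tb:,.0f} TB, cost ${storage_cost:,.0f}")
print(f"genetic-type search: {search_seconds:,} s = {search_weeks:.1f} weeks")
print(f"conventional VQ search: {vq_hours:.1f} hours")
```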

On Current Technology… • Modern computational throughput continues to increase, with this capability representing an opportunity for perhaps 1-2 log performance increase in the next decade • With a one-log increase, we are still left with a five-log gap that needs to be made up by improved algorithmic performance.

A Brief Overview: Conventional Vector Quantization (VQ) • Original Image • Division of image into local domains • Extraction of Local Domain Composite Vectors • Vectorization of each local kernel: $V_K = \sum \{ [L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}} \}$ • Individual assessment of each vector dimension [Figure: example vector code 38857448643]

Conventional Vector Quantization • Vectorization of each local kernel: $V_K = \sum \{ [L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}} \}$ • Query Against library (Vocabulary) of Established Vectors • Previously Identified Vector (Established Vocabulary) • Novel Vector • Assignment of a unique serial number and inclusion into global vocabulary • Assembly of compressed dataset [Figure: grid of vector serial numbers, e.g. 38857448643]

VQ-Based Image Compression as the Original Predicate for Carrying Out Image-Based Search [Figure: Raw Data → Compressed data (grid of vector serial numbers) → Restored Data] The spatially-preserved organization of the encoded data represents a many-fold decrease in overall search dataset size, thus providing a significant computational opportunity for accelerated search. Additionally, the vectors identified as contributing to a match may be visually interrogated for confirmation of their predictive morphologic content.
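
The VQ pipeline outlined above (divide the image into local domains, vectorize each one, look it up in a vocabulary, assign a serial number to novel vectors, assemble the compressed dataset) can be sketched in a few lines. This is a deliberately simplified, lossless toy that uses exact matching on small integer tiles; practical VQ typically matches each vector to its nearest vocabulary entry rather than requiring equality. The function names `vq_encode` and `vq_decode` and the 2 x 2 tile size are assumptions for illustration.

```python
def vq_encode(image, tile=2, vocabulary=None):
    """Encode a 2D image (list of equal-length rows) as a grid of vocabulary serial numbers."""
    vocabulary = {} if vocabulary is None else vocabulary
    rows, cols = len(image), len(image[0])
    encoded = []
    for r in range(0, rows - tile + 1, tile):
        encoded_row = []
        for c in range(0, cols - tile + 1, tile):
            # Flatten the local domain into a vector (a tuple, so it is hashable)
            vec = tuple(image[r + dr][c + dc] for dr in range(tile) for dc in range(tile))
            # Novel vectors receive the next serial number; known vectors reuse theirs
            serial = vocabulary.setdefault(vec, len(vocabulary))
            encoded_row.append(serial)
        encoded.append(encoded_row)
    return encoded, vocabulary

def vq_decode(encoded, vocabulary, tile=2):
    """Restore the image from serial numbers and the vocabulary."""
    lookup = {serial: vec for vec, serial in vocabulary.items()}
    image = []
    for encoded_row in encoded:
        block_rows = [[] for _ in range(tile)]
        for serial in encoded_row:
            vec = lookup[serial]
            for dr in range(tile):
                block_rows[dr].extend(vec[dr * tile:(dr + 1) * tile])
        image.extend(block_rows)
    return image

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
encoded, vocab = vq_encode(image)
restored = vq_decode(encoded, vocab)
print(encoded, len(vocab))
```

Because the encoded grid keeps the spatial layout of the tiles, a search can operate on the much smaller grid of serial numbers instead of on raw pixels.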

Recent Model-Free Approach Developments • A number of promising algorithms being developed • Genetic image analysis algorithm selection • Support Vector Machines (SVM) • Principal Component Analysis • High-dimensional reduction approaches • Spatially-invariant VQ (SiVQ)

VQ Revisited and SiVQ Q: What is conventional VQ’s greatest weakness? A: Too many vectors are required to represent a single atomic morphologic feature • (promiscuity of vector set growth with continued training)

Conventional VQ Vector Growth during training

A Matter of Degrees of Freedom… • Candidate Feature • How many ways can this be sampled?

How Many Ways Can A Candidate Feature Be Matched During Training? • Y Translational Freedom • X Translational Freedom • Rotational Freedom

In VQ, it may be the same feature, but there is an excessive number of ways to sample it • Typical Feature Vector: • 25 x 25 pixels (x by y) or larger • → 625 translational degrees of freedom • Effective radius of 12.5 pixels • After Nyquist rotational sampling (2x spatial frequency) • 2 x (2 x 12.5 x π) → 79 separate rotations • 3 color planes • 2 mirror symmetries • At least 20 possible semi-discrete length-scale Nyquist samples • All together, there are at least 625 x 79 x 3 x 2 x 20 = 5,925,000 possible ways to represent one possible vector (assuming twenty fixed magnifications in use) • This explains the non-asymptotic (unbounded) vector growth observed for some histology patterns. • Multispectral data (e.g. 28 vs. 3 bands) will further multiply the diagnostic power of SiVQ vectors (55,300,000 degrees of freedom / vector)
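
The degrees-of-freedom count is a plain product of the individual factors, which the sketch below reproduces. The rotation count of 79 is taken as quoted rather than rederived, and the multispectral figure simply swaps the 3 color planes for 28 bands; the variable names are illustrative.

```python
import math

tile_side = 25
translations = tile_side * tile_side   # 625 x/y placements
rotations = 79                         # value quoted on the slide
color_planes = 3
mirror_symmetries = 2
length_scales = 20                     # twenty fixed magnifications

rgb_dof = math.prod(
    [translations, rotations, color_planes, mirror_symmetries, length_scales]
)

# Replace the 3 color planes with 28 spectral bands
multispectral_dof = rgb_dof // color_planes * 28

print(f"{rgb_dof:,} ways per RGB vector")
print(f"{multispectral_dof:,} ways per 28-band vector")
```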

Update from 2008 • Faster performance possible • Ground truth cancer detection possible • True model-free operation demonstrated • (works on any subject matter) • Additional reduction in degrees of freedom • faster

Transformation of the coordinate system • Adjacency Problem • New system with two degrees of freedom • Rotational • Mirror image

Transformation of the coordinate system

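$i, j \equiv i\!\rightarrow,\ j\!\downarrow \qquad \Theta = 360^\circ / 32$

$$
a_{i,j} + a_{i+1,j}\,x + a_{i,j+1}\,y + a_{i+2,j}\,x^2 + a_{i+1,j+1}\,xy + a_{i,j+2}\,y^2 + a_{i+2,j+1}\,x^2 y + a_{i+1,j+2}\,x y^2 + a_{i+2,j+2}\,x^2 y^2 + a_{i+3,j}\,x^3 + a_{i,j+3}\,y^3 + a_{i+3,j+1}\,x^3 y + a_{i+1,j+3}\,x y^3 + a_{i+3,j+2}\,x^3 y^2 + a_{i+2,j+3}\,x^2 y^3 + a_{i+3,j+3}\,x^3 y^3
$$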
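
The 16-term expression above has the general form of a bicubic polynomial in x and y, a form commonly used to interpolate pixel values at non-integer coordinates. Whether it serves exactly that purpose here is an assumption. The sketch below evaluates such a polynomial from a 4 x 4 coefficient table and lists the 32 rotation angles implied by Θ = 360°/32; the coefficient indexing `a[m][n]` (offsets from i, j) and the function name are illustrative.

```python
def bicubic_poly(a, x, y):
    """Evaluate the sum over m, n in 0..3 of a[m][n] * x**m * y**n."""
    return sum(a[m][n] * x**m * y**n for m in range(4) for n in range(4))

# Angular step between successive sampled rotations
STEP_DEG = 360 / 32
angles = [k * STEP_DEG for k in range(32)]

# Placeholder coefficients (all ones) purely to exercise the function
a = [[1] * 4 for _ in range(4)]
print(bicubic_poly(a, 2, 1), angles[:3])
```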

Degree of Freedom I: Recognition Across Mirror Symmetries

Degree of Freedom II: Rotationally Invariant Recognition

Rotationally Invariant Recognition
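
One simple way to picture recognition that is invariant to rotation and mirror image is to describe a feature by samples taken around a ring, then compare a query ring against every cyclic shift of a candidate and of its reversal. The sketch below does exactly that with a squared-difference score. It is a toy illustration of the idea only, not the SiVQ implementation, and the ring representation and function names are assumptions.

```python
def ring_distance(a, b):
    """Sum of squared differences between two equal-length rings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def invariant_distance(query, candidate):
    """Smallest distance over all cyclic shifts (rotation) and the reversed ring (mirror)."""
    n = len(candidate)
    best = float("inf")
    for ring in (candidate, candidate[::-1]):
        for shift in range(n):
            rotated = ring[shift:] + ring[:shift]
            best = min(best, ring_distance(query, rotated))
    return best

query = [0, 1, 2, 3, 4, 5, 6, 7]
rotated_copy = query[3:] + query[:3]    # same ring, rotated
mirrored_copy = query[::-1]             # same ring, mirrored
unrelated = [7, 0, 7, 0, 7, 0, 7, 0]

print(invariant_distance(query, rotated_copy))
print(invariant_distance(query, mirrored_copy))
print(invariant_distance(query, unrelated))
```

A single stored ring then stands in for every rotated and mirrored variant that a conventional vocabulary would have to store separately.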

Further Possible Reductions in Degrees of Freedom (2009) • Length Scale • Up to 20x impact on search space (40:2 magnification ratio) • Dynamic Range (contrast) • 3x impact on search space • Black Level Offset (brightness) • 5x impact on search space • Biased distortion ellipsoid compression of fundamental circular vectors • 30x (both angle of axis and degree of distortion) • Total further reductions: at least 9000, or approximately 4 orders of magnitude.

Total Realized Search Space Reductions (2009) • RGB Images • 5,925,000 * 10^4 = ~60 * 10^9 • (60 billion equivalent Cartesian vectors) • Multispectral/multiplanar images • 55,300,000 * 10^4 = ~553 * 10^9 • (553 billion equivalent Cartesian vectors) • Computational performance is improved linearly by the reduction of required comparisons for each matching class (at least 60 billion times faster search for the predicate of interest) • In many cases, a complete feature descriptor can be described with as few as a single vector.
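
The totals can be checked by multiplying the individual reduction factors quoted for length scale, dynamic range, black level offset and ellipsoid distortion, rounding the product to the nearest order of magnitude, and applying it to the two degrees-of-freedom counts. The sketch below does that arithmetic; the dictionary keys and variable names are illustrative.

```python
import math

# Further reduction factors quoted for 2009
reductions = {
    "length scale": 20,
    "dynamic range": 3,
    "black level offset": 5,
    "ellipsoid distortion": 30,
}
further = math.prod(reductions.values())            # 9000
rounded = 10 ** round(math.log10(further))          # nearest order of magnitude

rgb_total = 5_925_000 * rounded
multispectral_total = 55_300_000 * rounded

print(f"further reduction: {further:,} (about {rounded:,})")
print(f"RGB: {rgb_total:,} equivalent Cartesian vectors")
print(f"multispectral: {multispectral_total:,} equivalent Cartesian vectors")
```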

Simple Use Case Already Reduced to Practice: Ground Truth Cancer Mapping • Useful for precisely identifying all areas of a whole-slide image that are involved by malignancy • Tumor quantization • Automated gating for LCM • Fiducial mapping for multi-modality fusion studies • As vectors are internally derived for each case, inter-slide variability from fixation and staining becomes inconsequential

Colon Cancer

Malignant Epithelium: One vector

Stroma: One Vector

Use-case: Automated bone marrow aspirate differential counting via model-free tools to attain speed and accuracy • Band detection with a single vector • Resistant to cell segmentation issues encountered with traditional image analysis

Some Additional Interactive Demonstrations…

Consequences of SiVQ • Use one spatially-invariant vector to do the work of millions or billions of spatially-constrained vectors • Millions or billions of times faster than conventional image matching • Enormously fewer vectors to store per feature archetype • 6-9 log increase in algorithmic performance (we only needed 4 log, so we have CPU to burn) • Implies an operational solution to the real-time requirement for large datasets • CBIR is essentially reduced to practice for a sizable contingent of texture-based whole slide image-retrieval use-cases • Emergent property: SiVQ works equally well on all structurally-repetitive data sets (e.g. remote sensing, Google-like image searches of the Web)

Opportunities and Future Work • CBIR development will continue • Many groups already demonstrating feasibility of real-time query capability • Activity at Rutgers, U. of Pittsburgh and Caltech • For the UofM Group: • Rapid dissemination of the algorithm and libraries via peer-reviewed publications and/or e-pubs • Extension of the discovery tool suite to support multiple-vector classification, similar to the approaches taken for prior VQ systems, with rapid follow-on publications • “Ground-Truth Engine” for integrative multimodality studies • Markov analysis module for automated identification of sets of vectors that optimize both sensitivity and specificity over a single vector • Activation of an open-architecture website that will provide a downloadable tool suite and a Web-based, real-time decision support environment for submitted images, operating in two general use-cases: • Surface classification with rare event detection (anything not classified as normal) • Differential diagnosis generation with return of matching images and associated metadata • Generation of a classification library of extensive “normal SiVQ vectors” for each organ system • Actively pursue collaboration to form a core team to adjudicate needed normal and abnormal vector classes

Closing Remarks • CBIR will continue to improve in performance and accuracy • Contemporary computation speed is, actually, quite adequate for many CBIR tasks • Much work remains to realize its full potential • SiVQ will likely be one of a plurality of compelling solutions in the Image Query / Decision-support armamentarium

Acknowledgements • Jerome Cheng, U. of Michigan • Funding: NIH CTSA (University of Michigan)
