A System for Large-scale, Content-based Web Image Retrieval - and the Semantics within Till Quack
Task • Create a content-based image retrieval system for the WWW • Large-scale: one order of magnitude larger than existing systems, i.e. O(10^6) items • Relevance Feedback • Explore and exploit the semantics within • Take large-scale, content-based image retrieval one step closer to commercial applications
Outline • Content-based Image Retrieval on the WWW • PART I: A System for Image Retrieval on the WWW • Features • Retrieval • Relevance Feedback • Software Design • PART II: The Semantics within • Identifying a Method to find Semantics • Data Mining for Semantic Clues • Frequent Itemset Mining and Association Rules • The Visual Link • Discussion & Demonstration • Conclusions & Outlook
Content-based Image Retrieval on the WWW • Characteristics of the data repository • Size: 4.2 billion documents in Google’s index • Diversity: Documents in any context, language • Control: Anybody can publish anything • Dynamics: Ever changing • System Requirements • FAST • SCALABLE • Make use of all the information available • Motivation for a new system • Existing systems • Either pure text (Google) • Or pure content-based • Large-Scale
PART I: A System for Large-scale, Content-based Image Retrieval on the WWW • Ullrich Moenich, Till Quack, Lars Thiele
Visual Features describe the Images • Global Features from MPEG-7 Standard • Currently no Segmentation • Reasons: Scalability and the diversity of the data • Texture Features • Edge Histogram Descriptor (EHD) • Histogram of quantized edge directions. 80 dimensions • Homogeneous Texture Descriptor (HTD) • Output of Gabor filter-bank. 62 dimensions • Color Features • Scalable Color Descriptor (SCD) • Color Histogram. 256, 128, 64 or 32 dimensions • Dominant Color Descriptor (DCD) • Up to 8 dominant colors (3D color space) and their percentages • 32 “dimensions” • “Bins” defined for each image
Collateral Text as an additional Feature • ALT Tag and Collateral Text around images • VERY uncontrolled annotation • Stemming: Porter Stemmer • Example: training -> train • More matching terms for boolean queries • But also some new ambiguities • train: to train [verb] / the train [noun]
Retrieval in 2 Steps 1. Text Retrieval 2. Visual Nearest Neighbor Search
Retrieval: Text • Options • Boolean query on inverted index • Vector Space Model • LSI etc. • Choice • Ranked boolean queries on inverted index • Ranking: tf*idf • Reasons • Speed • Sparsity of data: • 600 000 Keywords in total • 1 document: 10-50 words
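The ranked boolean query described on this slide can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names `build_index` and `ranked_and_query`, the AND semantics, and the particular tf*idf variant (raw term frequency times log(N / document frequency)) are all assumptions made for the example.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for w in words:
            index[w][doc_id] = index[w].get(doc_id, 0) + 1
    return index

def ranked_and_query(index, n_docs, terms):
    """Boolean AND over all terms, then rank the hits by summed tf*idf."""
    postings = [index.get(t, {}) for t in terms]
    if not postings:
        return []
    hits = set(postings[0])
    for p in postings[1:]:
        hits &= set(p)
    scores = {}
    for d in hits:
        # tf = occurrences in the document, idf = log(N / document frequency)
        scores[d] = sum(p[d] * math.log(n_docs / len(p)) for p in postings)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "img1": ["red", "shoe", "shoe"],
    "img2": ["shoe", "walk"],
    "img3": ["train", "station"],
}
index = build_index(docs)
print(ranked_and_query(index, len(docs), ["shoe"]))
```

Because each document carries only 10-50 words, the posting lists stay short, which is what makes this lookup cheap compared with a dense vector space model.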
Retrieval – Visual Features (MPEG-7) • K-Nearest Neighbor search (K-NN) • Find the K closest candidates c_i to the query image q in a vector space • Distance: Minkowski metrics for the distance d(c_i, q), namely the L1 and L2 norms • Most MPEG-7 descriptors are high-dimensional vectors • The “dimensionality curse” applies • High-dimensional spaces behave “weirdly” • In particular, the distances are not very meaningful
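A brute-force version of this K-NN search can be sketched in a few lines. The names `minkowski` and `knn` and the toy two-dimensional vectors are illustrative assumptions; real MPEG-7 descriptors have tens to hundreds of dimensions.

```python
import heapq

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: L1 norm, p=2: L2 norm)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn(query, candidates, k, p=2):
    """Return the ids of the k candidates closest to the query.

    candidates maps an image id to its feature vector.
    """
    return heapq.nsmallest(
        k, candidates, key=lambda cid: minkowski(candidates[cid], query, p)
    )

candidates = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(knn([0.9, 0.9], candidates, k=2))
```

This compares the query with every database entry, which is exactly the linear scan that the clustering step is meant to avoid.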
Retrieval – Challenges for Visual Features • We have several (visual) feature types. How can we combine them? • Our database is very large. How can we search it fast enough? • i.e. how can we avoid comparing the query vector with each database entry?
A Combined Distance for the MPEG-7 Features • We use a combined distance of all the visual feature types • The individual distances fall in different ranges and follow different distributions • The distributions were transformed to a normal distribution in the range [0,1] • The distances are then combined linearly
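The idea of bringing per-feature distances onto a common scale before a linear combination can be sketched as below. Note the loud assumption: the slide describes a transformation to a normal distribution in [0,1], whereas this example swaps in plain min-max scaling as a simpler stand-in. The names `normalize` and `combine` and the weights are hypothetical.

```python
def normalize(distances):
    """Min-max scale a list of distances to [0, 1].

    Assumption: a stand-in for the distribution transform on the slide.
    """
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]

def combine(per_feature, weights):
    """Linearly combine normalized distances.

    per_feature maps a feature type to a list of distances, one per candidate.
    weights maps the same feature types to their combination ratios.
    """
    normed = {f: normalize(ds) for f, ds in per_feature.items()}
    n = len(next(iter(normed.values())))
    return [sum(weights[f] * normed[f][i] for f in normed) for i in range(n)]

per_feature = {"EHD": [10.0, 20.0, 30.0], "SCD": [1.0, 3.0, 2.0]}
print(combine(per_feature, {"EHD": 0.5, "SCD": 0.5}))
```

Without the normalization step, the feature type with the largest raw distances (EHD in this toy data) would dominate the sum regardless of the weights.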
Clustering speeds up the search • Problem • Millions of items in DB • Linear search over the whole dataset too slow • Looking only for the K nearest neighbors anyway • (One) Solution • Partition the data into clusters, each identified by a representative, the centroid • Only search the cluster whose centroid is closest to query q • K-Means clustering algorithm • Not the best, in particular in high-dimensional spaces • But fast! • Problem with Clustering: • A query at the border of a cell does not find all the nearest neighbors • Simple Solution: Overlapping Clusters • Problem: Scalability • Original data: 7 GB • Overlapping data: 50 GB
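The search restriction described above can be sketched as follows, assuming the centroids have already been computed (for example by K-Means, which is not reproduced here). The function names and the toy data are illustrative assumptions.

```python
def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_centroid(v, centroids):
    """Index of the centroid closest to vector v."""
    return min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))

def build_clusters(vectors, centroids):
    """Assign every vector id to the cluster of its nearest centroid."""
    clusters = {c: [] for c in range(len(centroids))}
    for vid, v in vectors.items():
        clusters[nearest_centroid(v, centroids)].append(vid)
    return clusters

def search(query, vectors, centroids, clusters, k):
    """K-NN restricted to the single cluster whose centroid is closest."""
    members = clusters[nearest_centroid(query, centroids)]
    return sorted(members, key=lambda vid: l2(vectors[vid], query))[:k]

vectors = {"a": [0, 0], "b": [1, 0], "c": [10, 10], "d": [11, 10]}
centroids = [[0.5, 0], [10.5, 10]]
clusters = build_clusters(vectors, centroids)
print(search([0.9, 0], vectors, centroids, clusters, k=1))
```

Only the members of one cluster are compared with the query, so a true neighbor that sits just across a cell border is never seen. That is the border problem named on the slide.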
Relevance Feedback Improves the Results • Relevance feedback: User input to improve search results, iteration by iteration • i.e. the user selects “good matches” • We obtain the following information: • A new query vector which is a combination of the relevant images = Query Vector Movement • The ratios for the combination of the feature types
Relevance Feedback: Query Vector Movement • Construct the query vector q_n from the images selected in iteration n • Notation: vector component k; feature type f (EHD, SCD, HTD); i = 1...M relevant images • The final, new query vector is q = 0.75 * q_n + 0.25 * q_{n-1} • i.e. move from the old query vector towards the new vector
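The update rule on this slide can be sketched as below. The 0.75 / 0.25 mix is taken from the slide. How q_n is built from the M relevant images is not recoverable from the text, so the component-wise mean used here is an assumption, and the names `mean_vector` and `move_query` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of the relevant images' feature vectors.

    Assumption: the slide's exact construction of q_n is not given here.
    """
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

def move_query(prev_query, relevant, alpha=0.75):
    """q = alpha * q_n + (1 - alpha) * q_{n-1}."""
    q_n = mean_vector(relevant)
    return [alpha * a + (1 - alpha) * b for a, b in zip(q_n, prev_query)]

prev_query = [0.0, 0.0]
relevant = [[4.0, 0.0], [4.0, 8.0]]
print(move_query(prev_query, relevant))
```

In a full system this would run once per feature type f, since each descriptor lives in its own vector space.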
Relevance Feedback: Weight Adaptation • Which feature is most important for the given query? • The one for which all the relevant images are closest • Determine the ratios for the combination based on the average distance, e.g. for the EHD • and set
Implementation – Software and Hardware • Languages: C++ and Perl • Inline::CPP to connect layers • WWW: Apache and CGI • Relational DB: MySQL • Operating System: OS X • Hardware • Dual 2 GHz Apple G5, 2 GB RAM • Teran Terabyte Disk Array
Semantics: Combining Text and Visual Features • Our dataset is multi-modal • Keywords and several visual features • Not only valid for WWW data • Video: image+speech, • Bio-imagery: image+microscope setting, cell coloring fluid • Goal: Try to jointly use the different modes • Do semantic relations between the modes exist? • Learn something about these semantic relations • Improve the retrieval precision based on them • Challenges in our project: • Large-scale • Noisy and uncontrolled data • Only global visual features
Identifying a Method to find the Semantics • Related work • Latent Semantic Indexing (LSI) [Westerveld 2000] • Problem: O(N^2 m^3), N = documents + terms, m = concept space • Statistical models [Barnard, Forsyth 2001-2004] • Problem: runtime, “several hours for several thousand images” • Problem: It is a (rather strict, hierarchical) model • Others • Neural networks (SOM etc.) • Hidden Markov Models • Often: Classification • We don’t know our classes, or: there are just too many • We can’t train them either (data too diverse and noisy) • Most of the methods above were only tested on relatively small, supervised datasets • There is one more option …
Method: Data Mining for Semantic Clues • Mine the data for patterns • Find them only where they exist • Deduce Rules from them • Scalable methods available • Frequent Itemset Mining and Association Rules • Classic Application: Market baskets, Census data … • Some works on Multimedia data • [Zaïane 98]: Datacubes with appended keywords • [Tešić et al. 03]: Perceptual associations (texture) within images
Frequent Itemsets and Association Rules • Itemset I • Transaction T • Database D • Support of Itemset A • A is called frequent if • Rule • Support of a Rule • Statistical significance • Confidence of a Rule • Strength of implication • Maximum likelihood estimate that B is true given that A is true
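The terms listed on this slide have standard definitions in the association rule literature, written out here for reference. This is a reconstruction from the standard definitions, not a copy of the slide's formulas; the symbols n (number of transactions) and p (number of items) follow the notation used for the APriori runtime, and the strict inequality matches the "support > minsupp" wording used elsewhere in the deck (many texts use ≥ instead).

```latex
\begin{align*}
\mathcal{I} &= \{i_1, \dots, i_p\} && \text{set of all items} \\
T &\subseteq \mathcal{I} && \text{transaction} \\
D &= \{T_1, \dots, T_n\} && \text{database of transactions} \\
\operatorname{supp}(A) &= \frac{|\{T \in D : A \subseteq T\}|}{|D|} && \text{support of itemset } A \subseteq \mathcal{I} \\
A \text{ frequent} &\iff \operatorname{supp}(A) > \mathit{minsupp} \\
A \rightarrow B, &\quad A, B \subseteq \mathcal{I},\ A \cap B = \emptyset && \text{rule} \\
\operatorname{supp}(A \rightarrow B) &= \operatorname{supp}(A \cup B) && \text{statistical significance} \\
\operatorname{conf}(A \rightarrow B) &= \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)} \approx P(B \mid A) && \text{strength of implication}
\end{align*}
```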
Example & Advantages • Example: Market Baskets • Rule {Diaper, Milk} -> {Beer} • Advantages • Human readable • Can be edited • Fast Algorithms available • Note: Associations are not correlations • The same concept, just simpler • Associations and correlations: [Brin, Motwani, Silverstein 98]
Using FIMI to find the itemsets • Frequent Itemset Mining (FIMI) • Find frequent itemsets with support > minsupp • Minimal support minsupp given by “an expert” • First Algorithm: APriori [Agrawal et al. 93] • Basic Idea: If an itemset is frequent, all its subsets must be frequent (Monotonicity) • k passes over the dataset for itemsets of length k • ~O(knp) for n transactions, p items, itemsets of length k • Today’s algorithms • Rely on the same basic principle • But much faster (Main Reason: Data structures) • Usually only 2 database passes • ~linear runtime • State-of-the-art algorithm overview: FIMI’03 • We used: fpmax* [Grahne, Zhu: Nov 03]
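The APriori principle named above (one pass per itemset length, pruning by monotonicity) can be sketched compactly. This is a small teaching version, not fpmax* and not the code used in the system; the function name `apriori` and the toy baskets are assumptions.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return {itemset: support} for all itemsets with support > minsupp."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(1 for t in sets if itemset <= t) / n

    items = {i for t in sets for i in t}
    level = {frozenset([i]) for i in items}  # candidates of length 1
    frequent = {}
    k = 1
    while level:
        # One pass over the data per itemset length k.
        current = {}
        for c in level:
            s = support(c)
            if s > minsupp:
                current[c] = s
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates and drop any
        # candidate with an infrequent k-subset (monotonicity).
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer"},
    {"diaper", "milk"},
    {"bread"},
]
freq = apriori(baskets, minsupp=0.4)
rule_conf = freq[frozenset({"diaper", "milk", "beer"})] / freq[frozenset({"diaper", "milk"})]
print(round(rule_conf, 3))
```

The last two lines derive the confidence of the rule {Diaper, Milk} -> {Beer} from the mined supports, which is how rules are obtained once the frequent itemsets are known.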
Diapers and Beer !!? • Application to the domain of Multimedia data: • Formulate images as transactions • Low-level clusters serve as a dimensionality reduction for the visual features • We find associations of visual features (clusters) and keywords • From these associations we deduce semantic rules • Advantages • Comparatively low computational complexity • Other data sources can be integrated in the same manner (e.g. long-term relevance feedback) • Challenges • Noisy, uncontrolled data • Associations within keywords much stronger than associations between keywords and visual features • Uneven distribution of cluster sizes (K-Means problem)
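Formulating an image as a transaction might look like the sketch below: the items are the image's keywords plus, for each visual feature type, the id of the cluster the image falls into. The item naming scheme (`kw:` and feature prefixes) and the function name `image_to_transaction` are assumptions made for illustration.

```python
def image_to_transaction(keywords, cluster_ids):
    """Turn one image into an itemset for frequent itemset mining.

    keywords: stemmed words from the image's collateral text.
    cluster_ids: feature type -> id of the image's low-level cluster.
    """
    items = {f"kw:{w}" for w in keywords}
    items |= {f"{feature}:{cid}" for feature, cid in cluster_ids.items()}
    return frozenset(items)

t = image_to_transaction(["shoe", "walk"], {"ehd": 14, "scd": 12})
print(sorted(t))
```

Replacing each high-dimensional vector by a single cluster id is what the slide calls dimensionality reduction: an 80-dimensional EHD vector becomes one item.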
Characteristics of the Itemsets and Rules • There are associations • Within text {shoe} -> {walk} • Within visual clusters {EHD 14} -> {SCD 12} • Between text and visual clusters {shoe} -> {EHD 14} • Measure for interestingness or choice of rules from FI • Confidence? • Statistical Criteria? • Background Knowledge? (Example: pregnant -> Woman: 100% confidence) • Our “Background Knowledge”: Rules that connect keywords and low-level features are more interesting • Since this is known, the mining can be adapted and made even faster
Selecting Interesting Low-Level Clusters based on Rules • Clusters were introduced to partition the visual feature vector data and search only on certain clusters • Problem: We miss certain nearest neighbors if images for a concept are spread over several clusters • Unsatisfactory Solution: Overlapping Clusters • But association rules might find and solve this situation • Clusters are re-united • If number of images for concept in both clusters is >minsupp • Example: {shirt} -> {ehd249,ehd310} reunites these clusters for the initial keyword-query “shirt”! • This is scalable - unlike overlapping clusters • Another benefit is that more images labeled with the original keyword are “injected” into the results of K-NN search • Currently: One Keyword as high level semantic concept • Future: Find high level semantic concepts by mining associations within text first
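The re-uniting step can be sketched as a simple support count per keyword: every cluster that holds more than a minsupp fraction of the keyword's images is selected for the joint search. This is a simplified reading of the slide, not the system's rule mining; the function name `clusters_for_keyword`, the relative-support threshold, and the extra cluster id `ehd7` are assumptions.

```python
from collections import Counter

def clusters_for_keyword(image_clusters, minsupp):
    """Select the clusters to search jointly for one keyword.

    image_clusters: the cluster id of every image labeled with the keyword.
    Returns the clusters whose share of those images exceeds minsupp.
    """
    counts = Counter(image_clusters)
    n = len(image_clusters)
    return sorted(c for c, cnt in counts.items() if cnt / n > minsupp)

# Ten images labeled "shirt", spread over three EHD clusters.
shirt = ["ehd249"] * 4 + ["ehd310"] * 4 + ["ehd7"] * 2
print(clusters_for_keyword(shirt, minsupp=0.25))
```

Unlike overlapping clusters, nothing is duplicated on disk; the rule only decides which existing clusters the K-NN search should visit.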
The Visual Link • Another contribution, NOT related to Frequent Itemset Mining and Association Rules… • Since the search concept suggests visual nearest neighbor search with relevance feedback after an initial keyword search: • It would be nice to have a diverse selection of images for a given keyword on the first page of results • Images sorted not only by keyword ranking, but also based on visual feature information • Basic idea: For a given keyword query, build groups of images that are visually close. • Larger groups are more important • Show only one representative per group
The Visual Link: A Graph-Based Approach • Let I(Q) be a set of images matching a keyword query Q • Define a graph G(V,E) • i.e. images are visually linked if the distance between them is lower than a given threshold • Do a connected component analysis to find connected components C • For each component C find the “best” representative r_C • Re-rank results based on representatives r_C
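The graph construction and connected component analysis can be sketched with a small union-find. The threshold value, the L2 distance, and the function names are assumptions; choosing the "best" representative per component is left out because the slide does not say how it is picked.

```python
def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def visual_components(vectors, threshold):
    """Group images into connected components of the visual-link graph.

    Two images are linked if their distance is below the threshold.
    Returns the components, largest first.
    """
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if l2(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

vectors = {"a": [0, 0], "b": [1, 0], "c": [2, 0], "d": [10, 10]}
print(visual_components(vectors, threshold=1.5))
```

Images a and c are not directly linked but end up in one component through b. The pairwise loop is quadratic in the number of matching images, which is the cost the approximation slide sets out to avoid.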
The Visual Link: An Approximation • Problem: Distance calculations for the graph take too long • Clusters cannot be used • Loading individual vectors takes a lot of time • Solution: • Approximate distance • Idea: If images are in the same cluster and in the same distance range to the centroid, the probability that they are “close” is high • New definition for visually linked • If in the same cluster and the same range of relative distance to its centroid • Can be encoded in a relational DB! And comes at nearly no extra cost in creation
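Under the approximate definition, grouping reduces to bucketing on a two-part key, which is why it maps so easily onto a relational table. The sketch below assumes a relative distance already scaled to [0, 1] and four equal-width ranges; both choices, and the names `link_key` and `group_by_link`, are illustrative assumptions.

```python
from collections import defaultdict

def link_key(cluster_id, rel_distance, n_ranges=4):
    """Key shared by images that count as visually linked.

    rel_distance: distance to the cluster centroid, assumed scaled to [0, 1].
    """
    bucket = min(int(rel_distance * n_ranges), n_ranges - 1)
    return (cluster_id, bucket)

def group_by_link(images, n_ranges=4):
    """Group images by (cluster, distance range), largest group first.

    images maps an image id to (cluster_id, rel_distance).
    """
    groups = defaultdict(list)
    for image_id, (cluster_id, rel_distance) in images.items():
        groups[link_key(cluster_id, rel_distance, n_ranges)].append(image_id)
    return sorted(groups.values(), key=len, reverse=True)

images = {"a": (7, 0.10), "b": (7, 0.20), "c": (7, 0.90), "d": (3, 0.15)}
print(group_by_link(images))
```

No feature vectors are loaded at query time: the key can be stored as two columns when the clusters are built, and the grouping becomes a single grouped query over the keyword's result set.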
Discussion: Precision • Measuring the quality of such a large-scale system is difficult • Precision/Recall measure not possible: ground truth not known • C: correct results • D: Desired results • A: Actual results • We measure the precision based on user questioning
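With the three sets named on this slide, the usual definitions read as follows. This is the textbook formulation, written out here because the slide's own formula is not in the text; it assumes C is the overlap of the desired and the actual results.

```latex
\[
C = A \cap D, \qquad
\text{precision} = \frac{|C|}{|A|}, \qquad
\text{recall} = \frac{|C|}{|D|}
\]
```

Recall needs the full set D of desired results, which is unknown for a web-scale collection. Precision only needs judgments on the returned set A, so it can be estimated by asking users.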
Before we continue … some numbers • Number of Images: 3 006 660 • Size of Image data: 111 GB • Feature Extraction: 15 days (dual 2 GHz CPU, 2 GB RAM) • Number of distinct keywords: 680 256 • Size of inverted keyword index table: 50 260 345 lines • MySQL database size: 23 GB
And now … the moment you’ve all been waiting for … • The Demo of Cortina
Conclusions • A system with over 3 million items was implemented • Probably the largest CBIR System to date? • A retrieval concept was introduced • A keyword query followed by relevance feedback and visual nearest neighbor search • Superior to existing retrieval concepts (query by keyword or query by example) • Data mining to explore and exploit semantics in large-scale systems was introduced
Outlook • Many extensions and improvements possible • Segmentation • Or maybe rather some simple tiling • Indexing • K-Means should be replaced • Suggestion: VA-File based approach [Manjunath,Tesic 03] • Association Rule Mining • Multilevel Approach • First keywords for high level semantic concepts • Then visual features
Thanks • Ullrich Moenich and Lars Thiele
Which Rules are of Interest? • There are associations • Within text {shoe} -> {walk} • Within visual clusters {EHD 14} -> {SCD 12} • Between text and visual clusters {shoe} -> {EHD 14, SCD 12} • There are long and short rules • Short rules have higher support by the nature of the problem • Long rules contain more (precise) information about the semantics • Measure for interestingness or choice of rules from FI • Confidence? • Statistical Criteria? • Background Knowledge? (Example: pregnant -> Woman)
Characteristics and Challenges • Chosen criteria • Mainly interested in rules {keywords} -> {visual feature clusters} (our “Background Knowledge”) • Support, confidence • Mine long and short rules • Restriction of the problem: Mine for frequent itemsets per keyword • i.e. all images (= transactions) for a given keyword • This means • We avoid being distracted by associations within keywords • The method is made even more scalable • The keyword as a placeholder for a semantic concept • A keyword does not always stand for a single semantic concept • Proposal for future versions: Multi-Level approach: • First {keywords} -> {keywords} rules to identify “real” semantic concepts • Then itemset mining per identified concept
Proposal: Semantic Clusters • Ultimate goal: Search some kind of “Semantic Clusters” instead of visual feature clusters • Proposal based on approach from Ester et al. 2002, 2003 • Clustering based on frequent itemsets, originally for text • Clustering criterion: minimize overlap