
Cross modality: Interaction between image, video and language - A Trinity and personal perspective






Presentation Transcript


  1. Cross modality: Interaction between image, video and language - A Trinity and personal perspective. Khurshid Ahmad, School of Computer Science and Statistics, Trinity College, Dublin. A seminar presentation

  2. Preamble • One key message in modern neuroscience is cross-modality & multi-sensory integration: • uni-modal areas in the brain, such as vision, process complex data received in a single mode – e.g. images • these areas interact with each other so that animals can deal with the world of multi-modal data • unimodal areas interact with hetero-modal areas – areas that are activated by two or more input modalities – which converge the outputs of the uni-modal systems to produce ‘higher cognitive’ behaviour: quantifying (enumeration and counting), retrieving images given linguistic cues, and vice versa

  3. Neural Correlates of Behaviour: Modality and Neuronal Correlation Neural underpinnings of multisensory integration: M. Alex Meredith (2002). ‘On the neuronal basis for multisensory convergence: a brief overview’. Cognitive Brain Research, Vol. 14, pp 31-40

  4. Preamble One key message in modern neuroscience is cross-modality: ‘Sensory information undergoes extensive associative elaboration and attentional modulation as it becomes incorporated in the texture of cognition’; cognitive processes are supposed to arise ‘from analogous associative transformations of similar sets of sensory inputs’ – differences in the resultant cognitive operations are determined by the anatomical and physiological properties of ‘the transmodal node that acts as the critical gateway for the dominant transformation’. Core synaptic hierarchy: primary sensory, upstream and downstream unimodal, and transmodal – heteromodal, paralimbic, and limbic – zones of the cerebral cortex. Thin arrows → monosynaptic connections; thick arrows → ‘massive connections’; broken arrows → motor output pathways. Mesulam, M.-Marsel (1998). ‘From sensation to cognition’. Brain, Vol. 121, pp 1013-1052

  5. Neural Correlates of Behaviour: Modality and Neuronal Correlation Neural underpinnings of multisensory motion integration: ‘In addition to […] modality-specific motion-processing areas, there are a number of brain areas that appear to be responsive to motion signals in more than one sensory modality […] the IPS, […] precentral gyrus can be activated by auditory, visual or tactile motion signals’ Soto-Faraco, S. et al (2004). ‘Moving Multisensory Research Along: Motion Perception Across Sensory Modalities’. Current Directions in Psychological Science, Vol. 13(1), pp 29-32

  6. Sensation and Cognition The highest synaptic levels of sensory-fugal processing are occupied by heteromodal, paralimbic and limbic cortices – collectively known as transmodal areas. Key anatomically distinct brain networks with communicating epicentres. Mesulam, M.-Marsel (1998). ‘From sensation to cognition’. Brain, Vol. 121, pp 1013-1052

  7. Uni- and Cross Modality @ Trinity

  8. Uni- and Cross Modality @ Trinity Other friends and colleagues: Trinity Centre for Neurosciences (Fiona Newell, Shane O’Mara, Hugh Garavan & Ian Robertson → cross modality and fMRI imaging); Linguistics and Phonetics (Ailbhe Ní Chasaide); Centre for Health Informatics (Jane Grimson)

  9. Uni/Cross Modality & Ontology @ Trinity The key problem for the evolving semantic web and the creation of large data repositories (memories for life in health care, infotainment) is the indexation and efficient retrieval of images – both still and moving – and the identification of key objects and events in the images. The visual features under-constrain an image, and supplemental, collateral, contextual knowledge is required to index the images: linguistic descriptions and motion features are two of the candidates. Above all, there must be a conceptual basis to any indexing scheme for it to be robust against changes in the subject domain and changes in the user perspective.

  10. Uni/Cross Modality & Ontology @ Trinity The key term in distributed and soft computing for a conceptual basis is ontology: a consensus amongst a group of people (system developers, domain experts and end-users) about what there is. We have had a seminar where we discussed the philosophical, formal, linguistic, computational and inter-operability issues related to ontology systems. There is a work programme evolving under the co-ordination of Declan O’Sullivan. The intention is to see the fit between the work of the ontology consortium and that of the folks working on video annotation.

  11. Uni/Cross Modality & Ontology @ Trinity The intention is to see the fit between the work of the ontology consortium and that of the folks working on video annotation: a system that works in a distributed environment, interacts with users on a variety of devices, and allows access to and update of large repositories of life- and mission-critical data. We have tremendous opportunities: (a) major government initiatives in health-care → an integrated system for text and images related to patients, accessible to authorised users on a range of mobile devices; (b) major opportunities in animation and surveillance; (c) key applications in mini-robotics systems; (d) the opening up of TCIN in clinical care; (e) ageing initiatives

  12. Uni/Cross Modality & Ontology @ Trinity • There are key groups in the College that can contribute to knowledge in computing and to the advancement of key disciplines – health care, neurosciences. This is a win-win opportunity for all. • Communications and Value Chain Centre • Intelligent Systems Cluster in CS (Ontology, Linguistics, Graphics, Vision) • Theory and Architecture Cluster (Formal Methods) • Distributed Systems Cluster (Ubiquitous Systems) • Vision and Speech Groups in EE • Statistics Cluster (Bayesian Reasoning)

  13. Uni/Cross Modality & Ontology @ Trinity The key message here is this: Trinity is good at good science; Trinity has substantial expertise and potential in building novel computing systems; Trinity has demonstrable ability to deal with real-world audio/video systems; all the key players involved have a peer-reviewed track record. We have the critical mass – or have the desire to create one!

  14. Preamble Neural computing systems are trained on the principle that if a network can compute then it will learn to compute; most neural computing systems are single-net, cellular systems. Lesson from biology: no network is an island; the bell tolls across networks

  15. Preamble Neural computing systems are trained on the principle that if a network can compute then it will learn to compute. Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously or sequentially, then the multi-net will learn to compute.

  16. Preamble Multi-net neural computing systems can be traced back to the hierarchical mixture-of-experts systems originally reported by Jordan, Jacobs and Barto. In turn, these systems relate to a broader family of systems – the mixtures of ‘X’. Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250.

  17. Preamble One key message in modern neuroscience is multi-modality: My work has been in the multi-net simulation of language development; aphasia; numerosity; cross-modal retrieval; attention and automatic video annotation.

  18. Learning to Compute: Cross-Modal Interaction and Spatial Attention The key to spatial attention is that different stimuli, visual and auditory, help to identify the spatial location of the object generating the stimuli. One argument is that there may be a neuronal correlate of such crossmodal interaction between two stimuli. Information related to the location of the stimulus (where) and identifying the stimulus (what) appears to have correlates at the neuronal level in the so-called dorsal and ventral streams in the brain.

  19. Learning to Compute: Numerosity, Number Sense and ‘Numerons’ A number of other animal species appear to have the ‘faculty’ of visual enumeration or subitisation. The areas identified have ‘homologs’ in the human brain. Measurements are a tad problematic in neurobiology

  20. Learning to Compute: Numerosity, Number Sense and ‘Numerons’ The ‘Edge’ Effect ‘Monkeys watched two displays (first sample, then test) separated by a 1-s delay. [The displays varied in shape, size, texture and so on.] They were trained to release a lever if the displays contained the same number of items. Average performance of both monkeys was significantly better than chance for all tested quantities, with a decline when tested for higher quantities similar to that seen in humans performing comparable tasks.’ Andreas Nieder, David J. Freedman, Earl K. Miller (2002). ‘Representation of the Quantity of Visual Items in the Primate Prefrontal Cortex’. Science, Vol. 297, pp 1709-11.

  21. Computing to Learn • Neural computing systems are trained on the principle that if a network can compute then it will learn to compute. • Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously, then the multi-net will learn to compute.

  22. Computing to Learn: Unsupervised Self Organisation • Combining multiple modes of information using unsupervised neural classifiers • Two SOMs linked by Hebbian connections • One SOM learns to classify a primary modality of information • One SOM learns to classify a collateral modality of information • Hebbian connections associate patterns of activity in each SOM
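A minimal sketch of one such unsupervised classifier – a small self-organising map in plain NumPy, written for this transcript rather than taken from the authors' code. Two instances, one per modality, are what the Hebbian connections link:

    import numpy as np

    class Som:
        """A small self-organising map: one per modality (illustrative sketch)."""
        def __init__(self, rows, cols, dim, seed=0):
            rng = np.random.default_rng(seed)
            # grid coordinates of the nodes, and one codebook vector per node
            self.grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
            self.w = rng.normal(size=(rows * cols, dim))

        def winner(self, x):
            # best-matching unit: the node whose codebook vector is nearest to x
            return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

        def activity(self, x, sigma=1.0):
            # Gaussian pattern of activity centred on the winning node
            d = np.linalg.norm(self.grid - self.grid[self.winner(x)], axis=1)
            return np.exp(-d ** 2 / (2 * sigma ** 2))

        def train_step(self, x, lr=0.1, sigma=1.0):
            # pull codebook vectors towards x, weighted by grid neighbourhood
            self.w += lr * self.activity(x, sigma)[:, None] * (x - self.w)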

  23. Computing to Learn: Unsupervised Self Organisation • Sequential multi-net neural computing systems: SOMs and Hebbian connections trained synchronously. [Figure: primary vector → primary SOM ↔ bidirectional Hebbian network ↔ collateral SOM ← collateral vector]

  24. Computing to Learn: Unsupervised Self Organisation • Work under my supervision at Surrey includes the development of multi-net neural computing architectures for: • language development • language degradation • collateral images and texts • numerosity development • In the case of the latter two, the connections between modules are learnt too – cross-modal interaction via Hebbian connections

  25. Computing to Learn: Unsupervised Self Organisation • Hebbian connections associate neighbourhoods of activity • Not just a one-to-one linear association • Each SOM’s output is formed by a pattern of activity centred on the winning neuron for the primary and collateral input • Training is deemed complete when both SOM classifiers have learned to classify their respective inputs
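Continuing the sketch above (the Gaussian activity pattern is my assumption about how the ‘neighbourhoods of activity’ are represented), the Hebbian links can be strengthened between every pair of co-active nodes, not just between the two winners:

    # Reuses the Som class and numpy import from the earlier sketch.
    primary = Som(8, 8, dim=20)      # e.g. visual feature vectors
    collateral = Som(8, 8, dim=50)   # e.g. keyword/term vectors
    hebb = np.zeros((64, 64))        # link strength: primary node -> collateral node

    def hebbian_step(xp, xc, eta=0.05):
        # activity patterns centred on each map's winner for a paired input
        ap = primary.activity(xp)
        ac = collateral.activity(xc)
        # outer product: every co-active (primary, collateral) node pair is
        # strengthened in proportion to the product of its activations
        hebb[:] += eta * np.outer(ap, ac)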

  26. Computing to Learn: The Development of Numerosity An unsupervised multinet alternative. Hebbian connections run from the winning node of the magnitude-representation SOFM to all nodes of the verbal SOFM (a), and vice versa (b). During training those connections are strengthened based on the activations of the node pairs. Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

  27. Computing to Learn: Image and Collateral Texts An unsupervised multinet alternative. [Figure: number words → verbal SOFM ↔ Hebbian connections ↔ magnitude SOFM] Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

  28. Computing to Learn: The Development of Numerosity An unsupervised multinet alternative: Simulating Fechner’s Law. Ahmad K., Casey, M. & Bale, T. (2002). Connectionist Simulation of Quantification Skills. Connection Science, vol. 14(3), pp. 165-201.
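For reference, Fechner’s law states that perceived magnitude grows logarithmically with stimulus intensity, S = k·ln(I/I0). A toy illustration (mine, not the cited simulation; reuses the numpy import from the sketches above):

    # Equal stimulus *ratios* give equal sensation *steps*: discrimination
    # compresses as magnitude grows, the signature such simulations reproduce.
    k, I0 = 1.0, 1.0
    for I in [1, 2, 4, 8, 16]:
        print(I, k * np.log(I / I0))  # sensation climbs by ~0.69 per doubling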

  29. Computing to Learn: The Development of Numerosity A ‘Hebbian-like learning rule’ that ‘resembles [..] Kohonen learning rule’: a confirmation of the results of Nieder & Miller. Verguts, Tom, & Fias, Wim (2004). ‘Representation of Numbers in Animals and Humans: A Neural Model’. Journal of Cognitive Neuroscience, Vol. 16(9), pp 1493-1504

  30. Computing to Learn: Image and Collateral Texts • Images have traditionally been indexed with short texts describing the objects within the image. The accompanying text is sometimes described as collateral to the image. • The ability to use collateral texts for building computer-based image retrieval systems will help in dealing with image collections that can now be stored digitally. • Theoretically, the manner in which we grasp the relationship between the ‘features’ of the image and the ‘features’ of the collateral text relates back to cross-modality.

  31. Computing to Learn: Image and Collateral Texts • The approximate locations of [lateral] regions where information about object form, motion and object-use-associated motor patterns may be stored. • Information from an increasing number of sources may be integrated in the temporal lobes, with specificity increasing along the posterior-to-anterior axis. • Specific regions of the left inferior parietal cortex and the polar region of the temporal lobes may be involved differentially in retrieving, monitoring, selecting and maintaining semantic information. Alex Martin and Linda L. Chao (2001). Semantic memory and the brain: structure and processes. Current Opinion in Neurobiology, Vol. 11, pp 194-201

  32. Computing to Learn: Image and Collateral Texts • Activation of the fusiform gyrus when subjects retrieve color word associates has recently been replicated in two additional studies • Activation in a similar region has been reported during the spontaneous generation of color imagery in auditory color-word synaesthetes. Alex Martin and Linda L. Chao (2001). Semantic memory and the brain: structure and processes. Current Opinion in Neurobiology, Vol. 11, pp 194-201

  33. Computing to Learn: Image and Collateral Texts • In principle, image collections can be indexed by the visual features of the content alone (colour, texture, shapes, edges). Content-based image retrieval has not been a resounding success. [Figure: visual similarity (similar colours) vs conceptual similarity (balls / fruits)] K. Ahmad, B. Vrusias, and M. Zhu. ‘Visualising an Image Collection?’ In (Eds.) Ebad Banisi et al. Proceedings of the 9th International Conference Information Visualisation (London 6-8 July 2005). Los Alamitos: IEEE Computer Society Press. pp 268-274.

  34. Computing to Learn: Image and Collateral Texts • We have developed a multi-net system that learns to classify images within an image collection, where each image has a collateral text, based on the common visual features and the verbal features of the collateral text. • The multi-net can also learn to correlate images and their collateral texts using Hebbian links – this means that one image may be associated with more than one collateral text and vice versa. Ahmad, K., Casey, M., Vrusias, B., & Saragiotis, P. Combining Multiple Modes of Information using Unsupervised Neural Classifiers. In (Eds.) Terry Windeatt and Fabio Roli. Proc. 4th Int. Workshop, MCS 2003. LNCS 2709. Heidelberg: Springer-Verlag. pp 236-245.

  35. Computing to Learn: Image and Collateral Texts Automatic Image Annotation and Illustration. Hebbian connections run from the winning node of the text SOFM to all nodes of the image SOFM (a), and vice versa (b). During training those connections are strengthened based on the activations of the node pairs. Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

  36. Computing to Learn: Image and Collateral Texts Automatic Image Annotation and Illustration. [Figure: keywords → text SOFM ↔ Hebbian connections ↔ image SOFM] Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

  37. Computing to Learn: Image and Collateral Texts Automatic Image Annotation and Illustration. [Tables: different SOFM configurations used in simulations; the optimum SOFM configuration]

  38. Computing to Learn: Image and Collateral Texts Automatic Image Annotation and Illustration

  39. Computing to Learn: Image and Collateral Texts The performance of the multinet system was compared with a ‘monolithic’ single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM. The performance of the two networks was compared using a ratio of the precision (p) and recall (r) statistics, the effectiveness ratio F = 1 / (α/p + (1 − α)/r); we use α = 0.5, which makes F the harmonic mean of p and r.
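A minimal sketch of the measure, assuming the van Rijsbergen-style effectiveness formulation reconstructed above:

    def effectiveness(p, r, alpha=0.5):
        # F = 1 / (alpha/p + (1 - alpha)/r); alpha = 0.5 gives the
        # harmonic mean of precision and recall (the familiar F1 score)
        return 1.0 / (alpha / p + (1.0 - alpha) / r)

    print(effectiveness(0.8, 0.6))  # ~0.686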

  40. Computing to Learn: Image and Collateral Texts • The Hemera ‘PhotoObjects’ collection was used as the primary dataset for our experiments. • The collection contains about 50,000 photo objects (single-object images with no background), and has been used extensively for image analysis. • Each image (object) in the collection has associated keywords attached, and is characterised by a general category type.

  41. Computing to Learn: Image and Collateral Texts Hemera Collection: training subset used – 1,151 images randomly selected from the 50,000 objects

  42. Computing to Learn: Image and Collateral Texts Visualising the clusters formed by the text-based SOFM. Visualising the clusters formed by the image-based SOFM.

  43. Computing to Learn: Image and Collateral Texts Automatic Image Annotation and Illustration • An SOFM cannot, by itself, output the discrete categories inherent in the training data. Recently, a sequential clustering scheme has been suggested: produce the initial categorisation using an SOFM and then cluster the output using conventional clustering algorithms like k-means, hierarchical clustering, fuzzy c-means and so on. • We have obtained the best results with SOFM + k-means clustering.
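A sketch of that sequential scheme, clustering the trained map’s codebook vectors with scikit-learn’s k-means (the original work may have used a different implementation):

    from sklearn.cluster import KMeans

    som = Som(10, 10, dim=50)  # a trained text- or image-based SOFM (see sketch above)
    labels = KMeans(n_clusters=12, n_init=10).fit_predict(som.w)
    # labels[i] is the discrete category of SOM node i; a new input x then
    # inherits the category of its best-matching unit:
    # category = labels[som.winner(x)]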

  44. Computing to Learn: Image and Collateral Texts • The visual features proved too generic to be useful for classification. • Precision and recall figures were persistently below 0.5 for both metrics. • The results, however, were good for visually well-defined objects like coins. • This perhaps explains the poor performance of some computer vision systems.

  45. Computing to Learn: Image and Collateral Texts • Textual descriptors are much better for categorisation, with precision and recall both quite high. [Figures: text-based categorisation vs image-based categorisation]

  46. Computing to Learn: Image and Collateral Texts The performance of the multinet system was compared with a ‘monolithic’ single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

  47. Computing to Learn: Image and Collateral Texts The performance of the multinet system was compared with a ‘monolithic’ single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM. Hemera Data Set: single objects + no background

  48. Computing to Learn: Image and Collateral Texts The performance of the multinet system was compared with a ‘monolithic’ single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM. Corel Data Set: multiple objects + background

  49. Computing to Learn: Image and Collateral Texts Automatic Image Illustration through Hebbian cross-modal linkage. [Figure: text query → matched text → retrieved image]

  50. Computing to Learn: Image and Collateral Texts Automatic Image Annotation through Hebbian cross-modal linkage. [Figure: query image → matched image → retrieved text]
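Tying the sketches together, retrieval through the Hebbian cross-links might look as follows (assuming, as in the earlier sketch, that the image map is the ‘primary’ SOM and the text map the ‘collateral’ one):

    def annotate(query_image_vec, top=3):
        # image-map winner, then the text nodes it is most strongly linked to
        i = primary.winner(query_image_vec)
        return np.argsort(hebb[i])[::-1][:top]     # look up keywords stored at these nodes

    def illustrate(query_text_vec, top=3):
        # text-map winner, then the image nodes it is most strongly linked to
        j = collateral.winner(query_text_vec)
        return np.argsort(hebb[:, j])[::-1][:top]  # look up images stored at these nodes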
