
Image-Language Association: are we looking at the right features?


Presentation Transcript


  1. Image-Language Association: are we looking at the right features?
  Katerina Pastra, Language Technology Applications, Institute for Language and Speech Processing, Athens, Greece

  2. The pervasive digital video context
  IPTV/iTV, file-swapping (P2P) networks (video files & video blogs), video search engines, conversational robots, MM presentation systems...
  All point to auto-analysis of image-language relations (equivalence, complementarity, independence) for access to and generation of MM content.

  3. Overview
  Focus on the semantic equivalence relation = Multimedia Integration = image-language association
  • Brief review of state-of-the-art association mechanisms – feature sets used
  • The OntoVis feature set suggestion
  • Using OntoVis in the VLEMA prototype
  • Prospects for going from 3D to 2D
  • Future plans and conclusions

  4. Association Mechanisms in prototypes
  Intelligent MM systems, from SHRDLU to the conversational robots of the new millennium (Pastra and Wilks 2004):
  • Simulated or manually abstracted visual input is used, to avoid difficulties in image analysis
  • Integration resources are used with a priori known associations (e.g. image X on screen is a “ball”), or allow simple inferences (e.g. matching an input image to an object-model in the resource, which is in turn linked to a concept/word), to avoid difficulties in associating V-L
  • Applications are restricted to blocksworlds/miniworlds, hence scaling issues

  5. Association algorithms
  To be embedded in prototypes:
  • Probabilistic approaches for learning (e.g. Barnard et al. 2003): use word/phrase + image/image-region feature-value vectors; require properly annotated corpora (IBM, Pascal etc.)
  • Logic-based approaches (e.g. Dasiopoulou et al. 2004): use feature-augmented ontologies; match low-level image features to leaf nodes
  • Use of both approaches has been reported too (Srikanth et al. 2005)
  Scaling? Feature set used: shape, colour, texture, position, size
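
  To make the probabilistic route concrete, here is a minimal sketch of word-region association via co-occurrence counts, in the spirit of Barnard et al. (2003) but not their published model: the corpus format, the pre-computed visual clusters (`centroids`), and the feature vectors are all illustrative assumptions.

```python
# Minimal sketch of probabilistic word-region association via
# co-occurrence counts (in the spirit of Barnard et al. 2003, not the
# published model). `centroids` maps a cluster id to a prototype
# feature vector (shape, colour, texture, position, size).
from collections import Counter, defaultdict

def nearest_cluster(region, centroids):
    """Assign a region's feature vector to the closest visual cluster."""
    return min(centroids,
               key=lambda c: sum((r - v) ** 2 for r, v in zip(region, centroids[c])))

def train(corpus, centroids):
    """corpus: iterable of (caption_words, region_vectors) pairs.
    Returns P(word | cluster) estimated from raw co-occurrence counts."""
    counts = defaultdict(Counter)
    for words, regions in corpus:
        for region in regions:
            # every caption word is counted as co-occurring with the region
            counts[nearest_cluster(region, centroids)].update(words)
    probs = {}
    for cluster, word_counts in counts.items():
        total = sum(word_counts.values())
        probs[cluster] = {w: n / total for w, n in word_counts.items()}
    return probs

def annotate(region, centroids, probs, k=3):
    """Label an unseen region with its k most probable words."""
    dist = probs.get(nearest_cluster(region, centroids), {})
    return sorted(dist, key=dist.get, reverse=True)[:k]
```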

  6. The quest for the appropriate f-set
  Cognitive thesis: no feature set is fully representative of the characteristics of an object, but one may be more or less successful in fixing the reference of the corresponding concept (word).
  Constraints in defining an f-set:
  • Features must be distinctive of object classes (at the basic level)
  • Feature values must be detectable by image analysis modules

  7. The OntoVis suggestion
  A domain model: ontology + KBase for static indoor scenes (sitting rooms in 3D; XI KR language)
  Feature set suggested:
  • physical structure: the number of parts into which an object is expected to be decomposed in different dimensions
  • visually verifiable functionality: visual characteristics an object may have which are related to its function
  • interrelations: relative location of objects, relative size
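
  The feature set can be pictured as structured KB entries. The actual KB is written in the XI knowledge-representation language; the Python rendering below, including the field names and the sofa values, is an expository assumption.

```python
# Illustrative Python rendering of an OntoVis-style entry. The real KB
# uses the XI KR language; field names and values here are assumptions.
from dataclasses import dataclass, field

@dataclass
class OntoVisEntry:
    concept: str
    # physical structure: expected number of parts per dimension
    parts_x: int
    parts_y: int
    parts_z: int
    # visually verifiable functionality (function-related visual cues)
    functional_cues: list = field(default_factory=list)
    # interrelations: relative location / relative size w.r.t. other objects
    interrelations: dict = field(default_factory=dict)

# Hypothetical entry: a 3-seat sofa decomposes into 3 parts along x.
SOFA = OntoVisEntry(
    concept="sofa",
    parts_x=3, parts_y=2, parts_z=1,
    functional_cues=["horizontal seating surface", "backrest"],
    interrelations={"floor": "rests_on"},
)
```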

  8. The OntoVis suggestion [figure: example 3D scene with x, y, z axes]

  9. OntoVis – KB examples: armchairs? stools?

  10. OntoVis – KB examples

  11. OntoVis F-set advantages
  • It generalizes over visual appearance differences (e.g. different styles of sofas)
  • It goes beyond viewpoint (view angle + distance) differences
  • It can be used to reason on object identity by analogy (e.g. to describe “sofa-like” objects if not certain)
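
  The analogy point can be sketched as a structural match score over KB entries, reusing the illustrative `OntoVisEntry` above; the distance measure and threshold are assumptions, not VLEMA's actual reasoning.

```python
# Sketch of reasoning on object identity by analogy: pick the entry
# whose expected part structure is closest to what segmentation found,
# and hedge with an "X-like" label when the match is not exact.

def structure_score(observed, entry):
    """observed: dict of part counts per dimension from segmentation.
    Lower score = closer structural match."""
    return (abs(observed["parts_x"] - entry.parts_x)
            + abs(observed["parts_y"] - entry.parts_y)
            + abs(observed["parts_z"] - entry.parts_z))

def name_object(observed, kb, exact_threshold=0):
    """kb: iterable of OntoVisEntry objects (see the sketch above)."""
    best = min(kb, key=lambda entry: structure_score(observed, entry))
    if structure_score(observed, best) <= exact_threshold:
        return best.concept                   # confident identification
    return best.concept + "-like object"      # description by analogy
```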

  12. Using OntoVis
  VLEMA: a Vision-Language intEgration MechAnism
  • Input: automatically reconstructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)
  • Integration task: medium translation from images (3D sitting rooms) to text (what and where, in English)
  • Domain: estate surveillance
  • Horizontal prototype
  • Implemented in shell scripting and Prolog

  13. The Input

  14. System Architecture
  Components: OntoVis + KB; Object Segmentation; Object Naming; Data Transformations; Description generation (“…a heater … and a sofa with 3 seats…”)
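
  One way the named components could fit together is sketched below; the actual prototype is shell scripting plus Prolog, so the stage order, function names, and stub bodies are all assumptions (`name_object` is the analogy sketch from slide 11).

```python
# Hedged sketch of how the components on this slide could fit together;
# every body below is a placeholder, not VLEMA's real code.

def transform_input(vrml_path):
    """Data Transformations: read the VRML scene into geometry records
    (stubbed: one record per non-empty line of the file)."""
    with open(vrml_path) as f:
        return [line.strip() for line in f if line.strip()]

def segment_objects(scene):
    """Object Segmentation: group geometry into candidate objects
    (stubbed: the whole scene as a single candidate)."""
    return [scene]

def vlema(vrml_path, kb):
    scene = transform_input(vrml_path)
    candidates = segment_objects(scene)
    # Object Naming: match each candidate's part structure against the
    # OntoVis KB (part counts stubbed here; a real system would measure them).
    names = [name_object({"parts_x": 1, "parts_y": 1, "parts_z": 1}, kb)
             for _ in candidates]
    # Description: verbalize "what and where" in English.
    return "We can see " + " and ".join(names) + "."
```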

  15. The Output
  Wed Jul 7 13:22:22 GMTDT 2004, VLEMA V1.0, Katerina Pastra @ University of Sheffield
  Description of the automatically constructed VRML file “development-scene.wrl”:
  This is a general view of a room. We can see the front wall, the left-side wall, the floor, a heater on the lower part of the front wall and a sofa with 3 seats. The heater is shorter in length than the sofa. It is on the right of the sofa.
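
  The relational sentences in this output can be illustrated with a small template sketch; the bounding-box-style inputs and the "larger x means further right" convention are assumptions, and VLEMA's actual Prolog rules are not reproduced here.

```python
# Sketch of templating relative-size and relative-location sentences
# from two objects' measurements (illustrative inputs, not VLEMA code).

def relational_sentences(a, b):
    """a, b: dicts with 'name', 'length', and 'x' (horizontal position);
    assumes a larger x coordinate means further right in the view."""
    sentences = []
    if a["length"] < b["length"]:
        sentences.append(f"The {a['name']} is shorter in length than the {b['name']}.")
    if a["x"] > b["x"]:
        sentences.append(f"It is on the right of the {b['name']}.")
    return " ".join(sentences)

print(relational_sentences({"name": "heater", "length": 1.2, "x": 3.0},
                           {"name": "sofa", "length": 2.0, "x": 1.0}))
# -> The heater is shorter in length than the sofa. It is on the right of the sofa.
```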

  16. Future Plans & Conclusions
  • Extension of OntoVis and testing in VRML worlds
  • Modular description of clusters/parts (not relying just on their number in each dimension)
  • Exploration of portability of the f-set to 2D images; initial signs of feasibility: cf. research on detecting spatial relations in 2D, structure identification in 2D, and algorithms for 3D reconstruction from photographs
  OntoVis: complementary or alternative to current approaches? There are indications of OntoVis scalability and feasibility that are worth further exploration. To what extent is it scalable even in 3D?
