Combining Multiple Representations on the TRECVID Search Task

Arjen P. de Vries, Thijs Westerveld, Tzvetanka I. Ianeva. Introduction: video retrieval should take advantage of information from all available sources and modalities, but so far ASR is best for almost any query.

Presentation Transcript


  1. Combining Multiple Representations on the TRECVID Search Task. Arjen P. de Vries, Thijs Westerveld, Tzvetanka I. Ianeva. ICASSP, May 21 2004

  2. Introduction
     • Video retrieval should take advantage of information from all available sources and modalities
     • …but so far ASR is best for almost any query
     • LL11@TRECVID2003: Combining information sources
     • Different models/modalities
     • Multiple example images

  3. ‘Language Modelling’ approach to IR [Diagram labels: Docs, Models]

  4. Retrieval: calculate the conditional probabilities of observing the query samples given each model in the collection [Diagram: Query scored against the Models as P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)]
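The ranking step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes query samples are independent given the model, so log P(Q|M) is a sum over samples, and it uses toy term-probability dictionaries with a hypothetical `floor` value for unseen samples. The names `log_query_likelihood` and `rank_models` are invented for the example.

```python
import math

def log_query_likelihood(query, model, floor=1e-9):
    """log P(Q|M), assuming query samples are independent given the model.

    `model` maps a sample (here: a term) to its probability; `floor` is a
    hypothetical stand-in for proper smoothing of unseen samples.
    """
    return sum(math.log(model.get(sample, floor)) for sample in query)

def rank_models(query, models):
    """Return model names sorted by decreasing log P(Q|M)."""
    scores = {name: log_query_likelihood(query, m) for name, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection of three document models (made-up probabilities).
models = {
    "M1": {"dow": 0.4, "jones": 0.4, "rise": 0.2},
    "M2": {"baseball": 0.5, "game": 0.5},
    "M3": {"dow": 0.1, "jones": 0.1, "points": 0.8},
}
ranking = rank_models(["dow", "jones"], models)
```

Working in log space avoids numerical underflow when a query contains many samples, which matters for visual queries made of many pixel blocks.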

  5. Static Model
     • Indexing: estimate a Gaussian Mixture Model from each keyframe (using EM)
     • Fixed number of components (C=8)
     • Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>
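Once a mixture has been estimated for a keyframe, scoring a query image against it amounts to summing the log density of the query's feature vectors under that mixture. The sketch below shows only this evaluation step, not EM fitting. It assumes diagonal covariances (the slides do not state the covariance structure of the baseline model), and all parameter values and function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_density(x, weights, means, variances):
    """log sum_c w_c N(x; mean_c, var_c), computed with log-sum-exp."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def log_likelihood(samples, gmm):
    """Sum of per-sample log densities; gmm = (weights, means, variances)."""
    return sum(gmm_log_density(x, *gmm) for x in samples)

# Two toy 2-component models over 2-dimensional features (made-up values;
# the slides use C=8 components and <x,y,DCT> feature vectors).
gmm_near = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
gmm_far = ([0.5, 0.5], [[10.0, 10.0], [12.0, 12.0]], [[1.0, 1.0], [1.0, 1.0]])
query_samples = [[0.1, 0.2], [0.9, 1.1]]
near_score = log_likelihood(query_samples, gmm_near)
far_score = log_likelihood(query_samples, gmm_far)
```

A model whose components lie close to the query samples receives the higher log-likelihood, so `gmm_near` would be ranked above `gmm_far` for this query.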

  6. Dynamic Model
     • Indexing: GMM of multiple frames (N=29) around keyframe
     • Feature vectors extended with time-stamp in [0,1]: <x,y,t,DCT>
     [Figure: time axis labelled 0, .5, 1]
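The time-stamp extension can be illustrated with a short sketch. It assumes the N frames are evenly spaced and mapped linearly onto [0,1], so that with N=29 the middle frame lands at t=0.5; the slides give only the range and N, so this mapping and the helper names are assumptions.

```python
def normalised_timestamps(n_frames):
    """Map frame indices 0..n_frames-1 linearly onto [0, 1]."""
    if n_frames == 1:
        return [0.5]  # a single frame sits in the middle of the interval
    return [i / (n_frames - 1) for i in range(n_frames)]

def add_time_dimension(frames):
    """Turn per-frame <x, y, DCT> vectors into <x, y, t, DCT> vectors.

    `frames` is a list with one entry per frame; each entry is a list of
    (x, y, dct) tuples, where dct is a sequence of coefficients.
    """
    stamps = normalised_timestamps(len(frames))
    return [
        (x, y, t) + tuple(dct)
        for t, blocks in zip(stamps, frames)
        for (x, y, dct) in blocks
    ]

# Toy shot: 29 frames, one pixel block per frame, two made-up DCT coefficients.
frames = [[(0.25, 0.75, (1.0, 2.0))] for _ in range(29)]
vectors = add_time_dimension(frames)
```

All vectors from the 29 frames are pooled before fitting one mixture, which is where the extra training data mentioned on the advantages slide comes from.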

  7. Dynamic Model

  8. Dynamic Model Advantages
     • More training data for models
     • Reduced dependency upon selecting appropriate keyframe
     • Some spatio-temporal aspects of shot are captured
     • (Dis-)appearance of objects

  9. Experimental Set-up
     • Build models for each shot
     • Static, Dynamic, Language
     • Build queries from topics
     • Construct simple keyword text query
     • Select visual example
     • Rescale and compress example images to match video size and quality

  10. Combining Modalities
     • Independence assumption textual/visual
     • P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
     • Combination works if both runs are useful [CWI:TREC:2002]
     • Dynamic run more useful than static run
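Under the independence assumption above, the product of the two probabilities becomes a sum of log scores. The sketch below ranks shots on that sum; the shot identifiers and score values are made up, and `rank_shots` is a hypothetical helper, not part of any system described in the slides.

```python
def rank_shots(text_log_scores, visual_log_scores):
    """Rank shots by log P(Qt|LM) + log P(Qv|GMM).

    Both arguments map shot id -> log probability. Only shots scored in
    both modalities are ranked (an assumption made for this sketch).
    """
    shared = text_log_scores.keys() & visual_log_scores.keys()
    combined = {s: text_log_scores[s] + visual_log_scores[s] for s in shared}
    return sorted(combined, key=combined.get, reverse=True)

# Made-up log scores for three shots.
text_scores = {"shot_a": -2.0, "shot_b": -5.0, "shot_c": -3.0}
visual_scores = {"shot_a": -10.0, "shot_b": -4.0, "shot_c": -4.0}
ranking = rank_shots(text_scores, visual_scores)
```

In this toy case `shot_a` is best on text alone but falls to last place once the visual score is added, which illustrates why the combination only helps when both runs are useful.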

  11. Combining Modalities: Dynamic gives higher initial precision

  12. Dow Jones Topic (120)

  13. Dow Jones Topic (120): text query “Dow Jones Industrial Average rise day points” [Figure: text query + visual example = combined query]

  14. Dow Jones Topic (120)

  15. Arafat Topic (103)

  16. Arafat Topic (103)

  17. Baseball Topic (102); Basketball Topic (101)

  18. Basketball Topic

  19. Merging Run Results

  20. Merging Run Results
     • Combining (conflicting) examples is difficult [CWI:TREC:2002]
     • Single example → miss relevant shots
     • Round-Robin Merging
     [Figure: two ranked lists (ranks 1 to 10) interleaved into a combined list: 1, 1, 2, 2, 3, 3, 4, 4, …]
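Round-robin merging as pictured on this slide can be sketched as follows: take rank 1 from every run, then rank 2 from every run, and so on. Skipping shots that were already emitted by another run is an assumption of this sketch (the slide does not say how duplicates are handled), and the function name and `limit` parameter are hypothetical.

```python
def round_robin_merge(runs, limit=None):
    """Interleave ranked lists rank by rank, keeping first occurrences only.

    `runs` is a list of ranked lists of shot ids, one list per example.
    `limit`, if given, truncates the merged list to that many entries.
    """
    merged, seen = [], set()
    longest = max((len(run) for run in runs), default=0)
    for rank in range(longest):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                merged.append(run[rank])
    return merged if limit is None else merged[:limit]

# Two made-up runs, one per visual example.
run_a = ["s1", "s2", "s3"]
run_b = ["s4", "s1", "s5"]
merged = round_robin_merge([run_a, run_b])
```

Because the merge uses only rank positions, it sidesteps the problem of comparing scores produced by different (possibly conflicting) examples.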

  22. Flames Topic (112)

  23. Flames Topic (112)

  24. Conclusions
     • For most topics, neither the static nor the dynamic visual model captures the user information need sufficiently…
     • …averaged over 25 topics, however, it is better to use both modalities than ASR only
     • Working hypothesis: matching against both modalities gives robustness

  25. Conclusions
     • Dynamic captures visual similarity better
     • Thanks to spatio-temporal aspects?
     • Experiments with full covariance matrix for <x,y,t>-dims
     • Static model of keyframe (KF) is too fragile
     • Dependency on single KF?
     • To be tested by ranking max(all I-frames in shot)
     • Not enough training data?

  26. Conclusions
     • Visual aspects of an information need are best captured by using multiple examples
     • Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics
