Correcting Cuboid Corruption For Action Recognition In Complex Environment
Syed Zain Masood, Adarsh Nagaraja, Nazar Khan, Jiejie Zhu and Marshall Tappen
University of Central Florida
Action Sequences
• Can be broadly divided into:
• Activity: the person of interest performing the action
• Background: context and/or clutter
• Simple datasets: the background is uninteresting
• Complex datasets: context can be useful
Complex Action Sequences
• Most action recognition approaches treat the action recognition problem holistically.
• Systems are designed to make intelligent decisions when selecting features.
• Complexity is added until the goal is achieved.
Issues with Holistic Methods
• Little insight into the decision-making process of these complex systems.
• Most complex datasets have strong contextual cues. How well would a system perform on actions with unrelated complex backgrounds?
Our Approach
• Goal
• Examine action recognition in a way that separates action from context
• Is the system able to make an intelligent decision when confronted with adverse context?
• Purpose
• Useful for measuring how much context matters
• Helps improve the handling of background clutter
• Avoids complexity that adds nothing to recognition performance, and thus gains efficiency
Our Approach
• Problem:
• Current datasets contain strong contextual cues
• Solution:
• Create a new dataset where the activity appears without strong, relevant context
• Building on older datasets makes it possible to benchmark against earlier work
UCF Weizmann Dynamic Dataset
• Simple actions from the Weizmann Action Dataset
• Complex backgrounds from YouTube
• Matte the action onto the complex background [1] (see the sketch below)
• Dataset available at: http://www.cs.ucf.edu/~smasood/datasets/UCFWeizmannDynamic.zip
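The per-frame compositing behind this matting step reduces to the standard matting equation from [1], I = αF + (1 − α)B. Below is a minimal sketch in Python, assuming float image arrays and a precomputed alpha matte; the function name is ours, not from the paper.

```python
import numpy as np

def composite_frame(foreground, background, alpha):
    """Composite an actor frame onto a new background using a per-pixel
    alpha matte in [0, 1]: I = alpha * F + (1 - alpha) * B.
    `foreground` and `background` are (H, W, 3) float arrays; `alpha` is (H, W)."""
    a = alpha[..., np.newaxis]  # broadcast the matte over color channels
    return a * foreground + (1.0 - a) * background
```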
UCF Weizmann Dynamic
• No humans in the background
• Backgrounds selected randomly for matting
• This ensures the background is unhelpful
• In some cases it may even be detrimental, e.g. different actions sharing the same complex background
Testing Methodology
• Baseline performance
• A basic “bag-of-words” system (sketched below)
• Tuned to perform as well as a number of recently published systems
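A minimal sketch of such a bag-of-words pipeline: k-means over cuboid descriptors to build a vocabulary, per-video word histograms, and a linear SVM. This is a generic reconstruction under our own assumptions (descriptor extraction happens elsewhere), not the authors' exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(all_descriptors, k=500):
    """Cluster cuboid descriptors (one row per cuboid) into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def video_histogram(vocab, descriptors):
    """Quantize one video's descriptors and return a normalized word histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Training (descriptor lists and labels assumed given):
# vocab = build_vocabulary(np.vstack(train_descriptors))
# X = np.array([video_histogram(vocab, d) for d in train_descriptors])
# clf = LinearSVC().fit(X, train_labels)
```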
Baseline Performance
• Significant drop in performance
• The system is completely unable to deal with clutter
Why Does Performance Degrade?
• Is it the matting process?
• No: tests on action sequences matted onto a gray background show 94% recognition.
• The change from a simple to a complex background is the only difference between the datasets.
• So background cues are contributing significantly to the recognition process.
How to Remove the Effect of the Background?
• Experiment #1:
• Isolate the actor using the available ground-truth masks
• Prune background interest points (see the sketch below)
• With no background clutter, results should be comparable to those on the Weizmann dataset
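Pruning here can be as simple as dropping every interest point whose center falls off the actor. A minimal sketch, assuming points are (x, y, t) tuples and masks[t] is the binary ground-truth actor mask for frame t (this representation is our assumption):

```python
def prune_interest_points(points, masks):
    """Keep only interest points whose (x, y, t) center lies on the actor,
    where masks[t][y, x] is True on actor pixels."""
    return [(x, y, t) for (x, y, t) in points if masks[t][y, x]]
```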
Experiment #1: Background Pruning
• Interestingly, results improve, but not by as much as they dropped
• Simple background pruning alone does not solve the problem
Background Pruning Limitations
• Out-of-place actions leave background clutter inside the cuboids at “good” interest point locations.
• Interest point pruning eliminates spatial, but not temporal, background clutter.
How to Overcome This Limitation?
• Removing the background information within the cuboids themselves might help
• Experiment #2:
• Cuboid masking: zero out the background inside each cuboid (see the sketch below)
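A minimal sketch of cuboid masking: cut the spatio-temporal volume around an interest point and zero out everything the actor masks mark as background. The array layout and half-sizes are our assumptions, and the point is assumed to lie far enough from the volume borders.

```python
import numpy as np

def extract_masked_cuboid(video, masks, x, y, t, half=(8, 8, 4)):
    """Cut a cuboid around interest point (x, y, t) and zero its background
    voxels. `video` and `masks` are (T, H, W) arrays; `half` holds
    half-sizes in x, y, t."""
    hx, hy, ht = half
    cuboid = video[t - ht:t + ht, y - hy:y + hy, x - hx:x + hx].astype(float)
    support = masks[t - ht:t + ht, y - hy:y + hy, x - hx:x + hx]
    return cuboid * support  # background voxels become zero
```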
Cuboid Masking Results
• Background pruning of interest points combined with masking within the “good” interest point cuboids achieves comparable results
The Next Steps
• All the experiments above used ground-truth annotations.
• Now that we have identified the problem:
• Replace the ground-truth actor masks with automatic localization of the actor.
• Test the system on a well-known complex dataset where context might be helpful.
Automatic Localization
• We combine:
• An off-the-shelf human detector [4,5]
• A saliency detection method [6]
• (One plausible fusion is sketched below)
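How the two cues are combined admits several designs; the intersection rule below is our assumption, not necessarily the paper's exact scheme. It keeps pixels that are both salient and covered by some human detection.

```python
import numpy as np

def localize_actor(saliency, detections, thresh=0.5):
    """Fuse a saliency map (H x W, values in [0, 1]) with detector boxes
    [(x0, y0, x1, y1), ...] into a binary actor mask."""
    salient = saliency >= thresh
    boxes = np.zeros(saliency.shape, dtype=bool)
    for x0, y0, x1, y1 in detections:
        boxes[y0:y1, x0:x1] = True  # mark pixels inside each detection
    return salient & boxes
```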
Automatic Localization
• Automatic localization is not as accurate on the UCF Weizmann Dynamic dataset
• Still, the best performance is achieved by using both interest point pruning and cuboid masking on top of the automatic localization
UCF Sports Dataset
• Reasons for selecting this dataset:
• Small size
• Good resolution
• Ground-truth actor masks available
UCF Sports Dataset
• Experiment using ground-truth masks:
• Using both techniques gives the best performance
UCF Sports Dataset
• Experiment using automatic localization:
• Automatic localization results are not as degraded as on the UCF Weizmann Dynamic dataset
• Again, cuboid masking yields the best performance
What Have We Learned?
• Holistic approaches suffer without good context.
• Localization is important, so localization methods need to improve.
• Correct use of localization is essential.
• Once we can localize well, we can bring context back as an additional cue.
References
[1] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:228–242, 2008.
[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[3] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. In Computer Vision and Pattern Recognition (CVPR), pages 461–468, 2009.
[4] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645, 2010.
[6] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. In CVPR, pages 2376–2383. IEEE, 2010.
Q & A
• This work was supported by NSF grants IIS-0905387 and IIS-0916868.