DAP3D-Net: Where, What and How Actions Occur in Videos?

¹northumbria university
²Northumbria University

Details

09:55 - 10:00 | Tue 30 May | Room 4311/4312 | TUA3.1

Session: Computer Vision 1

Abstract

Action parsing in videos with complex scenes is an interesting but challenging task in computer vision. In this paper, we propose a novel deep model based on 3D CNN (convolutional neural network) and LSTM (long short-term-memory) module with a multi-task learning manner for effective Deep Action Parsing (DAP3D-Net) in videos. Particularly in the training phase, each action clip, sliced to several short consecutive segments, is fed into 3D CNN followed by LSTM to model whole action dynamic information, that action localization, classification and attributes learning can be jointly optimized via our deep model. Once the DAP3DNet is trained, for an upcoming test video, we can describe each individual action in the video simultaneously as: Where the action occurs, What the action is and How the action is performed. To well demonstrate the effectiveness of the proposed DAP3D-Net, we also contribute a new Numerous Category Aligned Synthetic Action dataset, i.e., NASA, which consists of 200; 000 action clips of 300 categories and with 33 pre-defined action attributes in two hierarchical levels (i.e., low-level attributes of basic body part movements and high-level attributes related to action motion). We learn DAP3D-Net using the NASA dataset and then evaluate it on our collected Human Action Understanding (HAU) dataset and the public THUMOS dataset. Experimental results show that our approach can accurately localize, categorize and describe multiple actions in realistic videos.