Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Guiyu Liu1, Jiuchao Qian2, Fei Wen2, Xiaoguang Zhu2, Rendong Ying2, Peilin Liu2

  • 1Shanghai Jiao Tong University
  • 2Shanghai Jiao Tong University

Details

11:15 - 11:30 | Tue 5 Nov | L1-R7 | TuAT7.2

Session: Computer Vision and Applications I

Abstract

Action recognition has wide applications in assisted living, health monitoring, surveillance, and human-computer interaction. Among traditional action recognition methods, RGB video-based ones are effective but computationally inefficient, while skeleton-based ones are computationally efficient but do not exploit low-level detail information. This work considers action recognition based on multimodal fusion of the 3D skeleton and the RGB image. We design a neural network that takes a 3D skeleton sequence and a single middle frame from an RGB video as input. Specifically, our method selects one frame from a video and extracts spatial features from it using two attention modules: a self-attention module and a skeleton-attention module. Temporal features are then extracted from the skeleton sequence via a Bi-LSTM subnetwork. Finally, the spatial and temporal features are combined by a feature fusion network for action classification. A distinct feature of our method is that it uses only a single RGB frame rather than an entire RGB video; accordingly, it has a lightweight architecture and is more efficient than RGB video-based methods. Comparative evaluation on two public datasets, NTU-RGBD and SYSU, demonstrates that our method achieves competitive performance compared with state-of-the-art methods.
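The pipeline described above — attention-weighted spatial features from a single RGB frame, temporal features pooled from the skeleton sequence, then concatenation and a classification head — can be illustrated with a minimal NumPy sketch. All shapes, weight matrices, and the mean-pooling stand-in for the Bi-LSTM are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- toy inputs (shapes are assumptions, not the paper's) ---
T, J = 20, 25          # skeleton sequence: 20 frames, 25 joints in 3D
H, W, C = 8, 8, 32     # flattened RGB feature map from some CNN backbone
skeleton = rng.standard_normal((T, J * 3))
frame_feat = rng.standard_normal((H * W, C))

# Skeleton-attention (one possible interpretation): project the
# middle-frame skeleton into a query that weights spatial locations
# of the single-frame feature map.
W_q = rng.standard_normal((J * 3, C)) * 0.1
query = skeleton[T // 2] @ W_q            # (C,) query vector
attn = softmax(frame_feat @ query)        # (H*W,) attention over locations
spatial = attn @ frame_feat               # (C,) attended spatial feature

# Temporal branch: mean pooling over time stands in for the Bi-LSTM.
temporal = skeleton.mean(axis=0)          # (J*3,) temporal feature

# Fusion network stand-in: concatenate both modalities, linear head.
num_classes = 60                          # e.g. NTU-RGBD has 60 classes
fused = np.concatenate([spatial, temporal])
W_cls = rng.standard_normal((fused.size, num_classes)) * 0.1
logits = fused @ W_cls                    # (num_classes,) class scores
```

The key design point the sketch mirrors is that the spatial branch consumes only one frame, so the per-video cost of the RGB pathway is constant rather than proportional to video length.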