SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Overview of SkateFormer's partition-specific attention strategy. SkateFormer partitions joints and frames based on different types of skeletal-temporal relation (4 Skate-Types) and performs skeletal-temporal self-attention (Skate-MSA) within each partition.

Abstract

We propose the Skeletal-Temporal Transformer (SkateFormer) for skeleton-based action recognition. It uses a partition-specific attention strategy (Skate-MSA) that captures skeletal-temporal relations while reducing computational complexity.
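To illustrate why attending within partitions lowers cost, here is a minimal sketch that splits a grid of (frame, joint) tokens into partitions and counts attention pairs. It rests on several assumptions not stated on this page: contiguous joint indices stand in for physically neighboring joints, strided indices stand in for distant joints and global motion, and the numbering of the four combinations in `SKATE_TYPES` is a guess. All function names are hypothetical and are not taken from the authors' code.

```python
from itertools import product


def partition_axis(n, group, mode):
    """Split indices 0..n-1 into groups of size `group`.

    mode="neighbor": contiguous blocks (adjacent indices stay together).
    mode="distant":  strided groups (indices spaced n // group apart).
    """
    assert n % group == 0, "toy sketch assumes an exact split"
    num_groups = n // group
    if mode == "neighbor":
        return [list(range(g * group, (g + 1) * group)) for g in range(num_groups)]
    if mode == "distant":
        return [list(range(g, n, num_groups)) for g in range(num_groups)]
    raise ValueError(mode)


# Assumed mapping: (joint_mode, frame_mode) for each of the 4 Skate-Types.
SKATE_TYPES = {
    1: ("neighbor", "neighbor"),
    2: ("distant", "neighbor"),
    3: ("neighbor", "distant"),
    4: ("distant", "distant"),
}


def skate_type_partitions(num_frames, num_joints, frame_group, joint_group,
                          joint_mode, frame_mode):
    """Return a list of partitions, each a list of (frame, joint) tokens."""
    frame_groups = partition_axis(num_frames, frame_group, frame_mode)
    joint_groups = partition_axis(num_joints, joint_group, joint_mode)
    return [list(product(fg, jg)) for fg, jg in product(frame_groups, joint_groups)]


def attention_pairs(partitions):
    """Number of query-key pairs when attention stays inside each partition."""
    return sum(len(p) ** 2 for p in partitions)


# Toy sizes: 8 frames, 4 joints, partitions of 4 frames x 2 joints.
parts = skate_type_partitions(8, 4, 4, 2, *SKATE_TYPES[4])
full_pairs = (8 * 4) ** 2          # attention over all tokens at once
part_pairs = attention_pairs(parts)  # attention restricted to partitions
print(full_pairs, part_pairs)
```

With these toy sizes, each token attends to 8 tokens instead of 32, so the pair count drops by a factor of 4 for any of the four types.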

We introduce a range of augmentation techniques and an effective positional embedding method, named Skate-Embedding, which combines skeletal and temporal features. This method significantly enhances action recognition performance by forming an outer product between learnable skeletal features and fixed temporal index features.
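The outer product described above can be sketched as follows. This is an illustration under assumptions: the fixed temporal index features are taken to be sinusoidal functions of the frame index, the product is taken per channel across the frame and joint axes, and the learnable skeletal features are represented by random initial values. The function names are hypothetical.

```python
import math
import random


def temporal_index_features(num_frames, channels):
    """Fixed (non-learned) sinusoidal features of the frame index (assumed form)."""
    feats = []
    for t in range(num_frames):
        row = []
        for c in range(channels):
            freq = 1.0 / (10000 ** (2 * (c // 2) / channels))
            row.append(math.sin(t * freq) if c % 2 == 0 else math.cos(t * freq))
        feats.append(row)
    return feats


def skate_embedding(skeletal, temporal):
    """Per-channel outer product: out[t][v][c] = temporal[t][c] * skeletal[v][c]."""
    return [
        [[t_row[c] * s_row[c] for c in range(len(s_row))] for s_row in skeletal]
        for t_row in temporal
    ]


num_frames, num_joints, channels = 4, 3, 4
rng = random.Random(0)
# Stand-in for learnable skeletal features: one vector per joint.
skeletal = [[rng.uniform(-1.0, 1.0) for _ in range(channels)] for _ in range(num_joints)]
temporal = temporal_index_features(num_frames, channels)
emb = skate_embedding(skeletal, temporal)  # shape: frames x joints x channels
```

In a trained model the skeletal features would be updated by gradient descent while the temporal features stay fixed, so every (frame, joint) position receives a distinct embedding from only `num_joints * channels` learned values.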

SkateFormer sets a new state-of-the-art for action recognition performance across multiple modalities (4-ensemble condition) and single modalities (joint, bone, joint motion, bone motion), showing notable improvement over the most recent state-of-the-art methods. It also establishes a new state-of-the-art in interaction recognition, a sub-field of action recognition.

Network Architecture

Top. The overall framework of SkateFormer.
Bottom left. The Skate-MSA of SkateFormer.
Bottom right. Skate-Type partition and reverse.

Discussion on Various Partition-Based Approaches

Comparison with existing transformer-based methods. S-Type, T-Type, S-Attn, T-Attn, T-Conv, and ST-Attn indicate 'physically neighboring joints', 'local motion', 'skeletal attention', 'temporal attention', 'temporal convolution', and 'skeletal-temporal attention', respectively.

Ensemble Performance Comparison

Top-1 accuracy of skeleton-based action recognition methods. (i) E1 - joint modality only; (ii) E2 - joint + bone modalities; and (iii) E4 - joint + bone + joint motion + bone motion modalities.
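As a rough sketch of how the E1, E2, and E4 conditions differ, the snippet below fuses per-modality class scores by a weighted sum and takes the top-1 class. Score-level fusion with equal weights is an assumption here (it is a common convention for multi-stream skeleton models, not something this page specifies), and the scores are made-up toy values, not results.

```python
def ensemble_scores(scores_by_modality, modalities, weights=None):
    """Weighted sum of class-score vectors over the chosen modalities."""
    if weights is None:
        weights = [1.0] * len(modalities)  # assumed: equal weighting
    num_classes = len(scores_by_modality[modalities[0]])
    fused = [0.0] * num_classes
    for name, w in zip(modalities, weights):
        for k, s in enumerate(scores_by_modality[name]):
            fused[k] += w * s
    return fused


def top1(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy class scores for one sample and three classes (illustrative only).
scores = {
    "J": [0.5, 0.4, 0.1],   # joint
    "B": [0.2, 0.7, 0.1],   # bone
    "JM": [0.3, 0.3, 0.4],  # joint motion
    "BM": [0.2, 0.5, 0.3],  # bone motion
}

e1 = top1(ensemble_scores(scores, ["J"]))
e2 = top1(ensemble_scores(scores, ["J", "B"]))
e4 = top1(ensemble_scores(scores, ["J", "B", "JM", "BM"]))
```

In this toy case the joint stream alone picks class 0, while adding the bone stream shifts the fused prediction to class 1, which is the kind of correction an ensemble is meant to provide.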

Top-1 accuracy of skeleton-based interaction recognition methods. Human interaction recognition is a sub-field of skeleton-based action recognition that focuses on scenarios where two or more individuals take part in a single action.

Computational Complexity Comparison

Comparison of SkateFormer with other methods in terms of parameter count, FLOPs, inference time, and average top-1 accuracy for the joint modality.

Single Modality Comparison

(i) J - joint modality; (ii) B - bone modality; (iii) JM - joint motion modality; and (iv) BM - bone motion modality.
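For readers unfamiliar with these four inputs, the sketch below derives them from raw joint coordinates using the definitions commonly used in skeleton-based recognition: a bone is a joint minus its parent joint, and motion is the difference between consecutive frames. These definitions, the zero-padding of the last frame, and the toy 3-joint skeleton are assumptions for illustration; this page does not state how the modalities are computed.

```python
def bone_modality(joints, parents):
    """Bone vector of each joint: its position minus its parent's position."""
    return [
        [[a - b for a, b in zip(frame[v], frame[parents[v]])] for v in range(len(frame))]
        for frame in joints
    ]


def motion_modality(seq):
    """Frame-to-frame difference x[t+1] - x[t]; the last frame is zero-padded."""
    out = []
    for t in range(len(seq)):
        if t + 1 < len(seq):
            out.append([[b - a for a, b in zip(cur, nxt)]
                        for cur, nxt in zip(seq[t], seq[t + 1])])
        else:
            out.append([[0 for _ in joint] for joint in seq[t]])
    return out


# Toy sequence: 2 frames, 3 joints, 2-D coordinates, indexed [frame][joint][dim].
J = [
    [[0, 0], [1, 0], [1, 2]],
    [[0, 1], [1, 1], [2, 3]],
]
parents = [0, 0, 1]  # hypothetical skeleton: joint 0 is the root (its own parent)

B = bone_modality(J, parents)   # bone
JM = motion_modality(J)         # joint motion
BM = motion_modality(B)         # bone motion
```

Each modality has the same shape as the joint input, so the same network can be trained separately on J, B, JM, and BM.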

Analysis of Partition-Specific Attention

The activation level of Skate-Types according to action labels. For 'sitting down' (action label: 7) and 'standing up' (action label: 8), Skate-Type-3 or Skate-Type-4 exhibited pronounced activations, while 'reading' (action label: 10) and 'writing' (action label: 11) prominently activated Skate-Type-1.

BibTeX

@misc{do2024skateformer,
      title={SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition},
      author={Jeonghyeok Do and Munchurl Kim},
      year={2024},
      eprint={2403.09508},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}