A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants

Soichiro Okazaki, Quan Kong, Tomoaki Yoshinaga

2021

Abstract

In this paper, we propose a system for audio-visual scene classification based on a multi-modal ensemble of three feature types: (1) log-mel spectrogram audio features extracted by CNN variants from the audio modality; (2) frame-wise image features extracted by CNN variants from the video modality; and (3) frame-wise image features extracted by OpenAI CLIP models, which are trained on a large-scale web-crawled dataset of paired text and images under a contrastive learning framework. We trained the three models separately and combined them in an ensemble weighted by the class-wise confidences of each model's semantic outputs. As a result, our ensemble system achieved 0.149 log-loss (official baseline: 0.658) and 96.1% accuracy (official baseline: 77.0%) on the TAU Audio-Visual Urban Scenes 2021 dataset used in DCASE2021 Challenge Task 1B.
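The abstract describes fusing the three models' outputs with a class-wise confidence weighting. The sketch below is a minimal illustration of that idea, not the authors' exact implementation: it assumes each model produces softmax probabilities over scene classes and that per-class weights (e.g. derived from per-class validation performance) are available; all names and values are hypothetical.

import numpy as np

def classwise_weighted_ensemble(probs_list, classwise_weights):
    """Fuse per-model class probabilities with class-wise weights.

    probs_list: list of arrays, each (n_samples, n_classes), softmax outputs
                of one model (audio CNN, video CNN, CLIP-based model).
    classwise_weights: array (n_models, n_classes), e.g. per-class validation
                       confidence of each model (hypothetical choice).
    """
    probs = np.stack(probs_list, axis=0)                  # (n_models, n_samples, n_classes)
    w = classwise_weights / classwise_weights.sum(axis=0, keepdims=True)
    fused = (probs * w[:, None, :]).sum(axis=0)           # class-wise weighted sum over models
    fused /= fused.sum(axis=1, keepdims=True)             # renormalize to valid probabilities
    return fused

# Toy usage: 3 models, 4 clips, 10 scene classes (all values synthetic).
rng = np.random.default_rng(0)
probs_list = [rng.dirichlet(np.ones(10), size=4) for _ in range(3)]
classwise_weights = rng.uniform(0.5, 1.0, size=(3, 10))
fused = classwise_weighted_ensemble(probs_list, classwise_weights)

Other weighting schemes (e.g. a single scalar weight per model) would also fit the same fusion interface; the class-wise form simply lets a model that is reliable only for certain scenes contribute more on exactly those classes.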


Results

Identifier  Dataset            Task  Accuracy (%)  Log-loss
E02         tau_avsc_2021_dev  AVSC  96.1          0.149
E01         tau_avsc_2021_dev  AVSC  95.8          0.238