A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants
Soichiro Okazaki, Quan Kong, Tomoaki Yoshinaga
2021
Abstract
In this paper, we propose a system for audio-visual scene classification based on a multi-modal ensemble of three feature types: (1) log-mel spectrogram audio features extracted by CNN variants from the audio modality; (2) frame-wise image features extracted by CNN variants from the video modality; (3) additional frame-wise image features extracted by OpenAI CLIP models, which are trained on a large-scale dataset of web-crawled text and paired images under a contrastive learning framework. We trained the three models separately and formed an ensemble weighted by the class-wise confidences of each model's semantic outputs. As a result, our ensemble system reached 0.149 log-loss (official baseline: 0.658 log-loss) and 96.1% accuracy (official baseline: 77.0% accuracy) on the TAU Audio-Visual Urban Scenes 2021 dataset, which is used in DCASE2021 Challenge Task 1B.
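A class-wise confidence-weighted ensemble of this kind can be sketched as follows. This is a minimal illustration, assuming softmax probability outputs from the three models and per-class weights derived from validation confidences; the array names, weight values, and random inputs are hypothetical and not the authors' exact implementation.

```python
import numpy as np

# Hypothetical softmax outputs from the three models for a batch of clips,
# shape (n_clips, n_classes); values here are illustrative only.
n_clips, n_classes = 4, 10
rng = np.random.default_rng(0)
audio_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)   # CNN on log-mel spectrograms
image_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)   # CNN on video frames
clip_probs  = rng.dirichlet(np.ones(n_classes), size=n_clips)   # CLIP image features

# Hypothetical class-wise weights per model (e.g., derived from each model's
# per-class validation confidence), normalized to sum to 1 for each class.
weights = rng.random((3, n_classes))
weights /= weights.sum(axis=0, keepdims=True)

# Weighted ensemble: each class probability is a per-class weighted average
# of the three models' predictions, renormalized per clip.
stacked = np.stack([audio_probs, image_probs, clip_probs])       # (3, n_clips, n_classes)
ensemble = (weights[:, None, :] * stacked).sum(axis=0)           # (n_clips, n_classes)
ensemble /= ensemble.sum(axis=1, keepdims=True)

predicted_scene = ensemble.argmax(axis=1)
```

In this sketch, a model that is more reliable for a given scene class contributes more to that class's ensemble score, which is the intuition behind weighting by class-wise confidence rather than using a single global weight per model.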
Results
| Identifier | Dataset | Task | Cross-validation set | Performance | Rank |
|---|---|---|---|---|---|
| E02 | tau_avsc_2021_dev | AVSC | | | |
| E01 | tau_avsc_2021_dev | AVSC | | | |