Network architecture

Attributes of sound inherent to objects can provide valuable cues to learn rich representations for object detection and tracking. Furthermore, the co-occurrence of audiovisual events in videos can be exploited to localize objects over the image field by solely monitoring the sound in the environment. Thus far, this has only been feasible with a static camera and for single-object detection. Moreover, the robustness of these methods has been limited, as they primarily rely on RGB images, which are highly susceptible to illumination and weather changes. In this work, we present the novel self-supervised MM-DistillNet framework consisting of multiple teachers that leverage diverse modalities, including RGB, depth, and thermal images, to simultaneously exploit complementary cues and distill knowledge into a single audio student network.

We propose the new MTA loss function that facilitates the distillation of information from multimodal teachers in a self-supervised manner. Additionally, we propose a novel self-supervised pretext task for the audio student that removes the need for labor-intensive manual annotations. We introduce a large-scale multimodal dataset with over 113,000 time-synchronized frames of RGB, depth, thermal, and audio modalities. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods while detecting multiple objects using only sound during inference, even while moving.

How Does It Work?

Figure: Overview of our proposed MM-DistillNet framework. The cross-modal MM-DistillNet distills knowledge from multimodal visual teachers, exploiting their complementary cues, into an audio student. During inference, the model detects and tracks multiple objects in the visual frame using only audio as input.

Our framework consists of multiple teacher networks, each of which takes a specific modality as input. We use RGB, depth, and thermal images to maximize the complementary cues that we can exploit (appearance, geometry, reflectance). Each of these modalities has its own benefits and drawbacks. RGB images work well in well-illuminated conditions but poorly at night, whereas thermal images work better in low-illumination conditions. Depth images help discern multiple objects better than RGB or thermal images. Our framework can also be used with other modalities such as LiDAR or RADAR, which may improve performance in conditions where RGB, depth, and thermal fail.
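To illustrate how complementary teachers might be turned into a single set of targets for a student, the sketch below pools the boxes predicted by the RGB, depth, and thermal teachers and removes duplicates with greedy non-maximum suppression. The fusion rule, the function name fuse_teacher_boxes, the thresholds, and the example boxes are all assumptions made for illustration; the text above does not specify how the teachers' outputs are merged.

```python
from typing import Dict, List, Tuple

# A detection is (x1, y1, x2, y2, score). All values below are illustrative.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_teacher_boxes(per_teacher: Dict[str, List[Box]],
                       score_thr: float = 0.5,
                       iou_thr: float = 0.5) -> List[Box]:
    """Pool boxes from all teachers, drop low scores, then apply greedy NMS."""
    pooled = [b for boxes in per_teacher.values() for b in boxes
              if b[4] >= score_thr]
    pooled.sort(key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


# Hypothetical night-time frame: the RGB teacher is unsure, thermal is confident,
# and the depth teacher sees a second object the others miss.
night_frame = {
    "rgb": [(100.0, 80.0, 220.0, 160.0, 0.30)],
    "depth": [(102.0, 82.0, 218.0, 158.0, 0.60),
              (300.0, 90.0, 380.0, 150.0, 0.55)],
    "thermal": [(101.0, 81.0, 221.0, 161.0, 0.90)],
}
pseudo_labels = fuse_teacher_boxes(night_frame)
```

In this hypothetical frame, the low-confidence RGB box is discarded, the thermal box suppresses the overlapping depth box, and the second object seen only by the depth teacher is retained, so two boxes remain.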

The teachers are first individually trained on diverse pre-existing datasets to predict bounding boxes in their respective modalities. We then train the audio student network to learn the mapping from sounds recorded by a microphone array to the bounding box coordinates of the combined teachers' prediction, using only unlabeled videos. To do this, we present the novel Multi-Teacher Alignment (MTA) loss to simultaneously exploit complementary cues and distill object detection knowledge from multimodal teachers into the audio student network in a self-supervised manner. During inference, the audio student network detects and tracks objects in the visual frame using only sound as input. Additionally, we present a self-supervised pretext task for initializing the audio student network, which avoids reliance on labor-intensive manual annotations and accelerates training.
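The idea of distilling from several teachers into one student can be sketched with a generic multi-teacher loss. The example below is not the MTA loss itself, whose exact form is not given here; it is a common distillation pattern that turns each feature vector into a probability distribution and averages the KL divergence between every teacher and the student. The function names and the toy feature values are hypothetical.

```python
import math
from typing import Dict, List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of activations."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions; terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_teacher_loss(student_feat: List[float],
                       teacher_feats: Dict[str, List[float]]) -> float:
    """Average KL divergence from each teacher's distribution to the student's."""
    q = softmax(student_feat)
    losses = [kl_divergence(softmax(t), q) for t in teacher_feats.values()]
    return sum(losses) / len(losses)


# Toy flattened features: all three teachers respond most strongly at index 1.
teachers = {
    "rgb": [0.2, 1.5, 0.1, 0.3],
    "depth": [0.1, 1.2, 0.4, 0.2],
    "thermal": [0.0, 1.8, 0.2, 0.1],
}
matched = multi_teacher_loss([0.1, 1.5, 0.2, 0.2], teachers)     # agrees with teachers
mismatched = multi_teacher_loss([1.5, 0.1, 0.2, 0.2], teachers)  # peaks elsewhere
```

A student whose features peak where the teachers' features peak receives a lower loss than one that peaks elsewhere, which is the signal that lets the audio network learn from unlabeled video.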

To facilitate this work, we collected a large-scale driving dataset with over 113,000 time-synchronized frames of RGB, depth, thermal, and multi-channel audio modalities. We present extensive experimental results comparing the performance of our proposed MM-DistillNet with existing methods as well as baseline approaches, which show that it substantially outperforms the state of the art. More importantly, for the first time, we demonstrate the capability to detect and track objects in the visual frame using only sound as input, without any metadata and even while moving in the environment. We also present detailed ablation studies that highlight the novelty of the contributions that we make.

Multimodal Audio-Visual Detection Dataset (MAVD)

We introduce the Multimodal Audio-Visual Detection Dataset for autonomous driving, which provides 113,283 time-synchronized frames of audio, RGB, depth, and thermal data. The dataset was gathered from 24 car drives covering nearly 300 km over 3 months and at 20 different locations. Each drive lasts about half an hour on average. We recorded data in diverse scenarios ranging from highways to densely populated urban areas and small towns. The recordings include high traffic density, freeway driving, and multiple traffic lights (involving transitions from static to driving conditions). To capture diverse noise conditions, we recorded sounds not only during conventional city driving but also near trams and while going through tunnels.
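As a rough sense of scale, the figures above can be combined in a back-of-the-envelope calculation. This is only an estimate: it treats "nearly 300 km" as 300 km and the average drive as exactly half an hour, and it says nothing about the actual recording or sampling rates of the sensors, which are not stated here.

```python
# Figures quoted in the dataset description (rounded where the text is approximate).
frames = 113_283
drives = 24
avg_drive_hours = 0.5   # "half an hour on average", taken as exact
distance_km = 300       # "nearly 300 km", taken as 300

total_hours = drives * avg_drive_hours              # total driving time
frames_per_second = frames / (total_hours * 3600)   # released frames per second of driving
avg_speed_kmh = distance_km / total_hours           # average speed over all drives

print(f"{total_hours:.0f} h of driving, "
      f"about {frames_per_second:.1f} synchronized frames per second, "
      f"about {avg_speed_kmh:.0f} km/h average speed")
```

Under these assumptions, the dataset corresponds to roughly 12 hours of driving at an average of about 25 km/h, which is consistent with a mix of urban traffic, traffic-light stops, and freeway segments.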

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement, which can be downloaded here. If you report results based on the MAVD dataset, please consider citing the paper mentioned in the Publications section.


Coming soon!

Code and Models

A PyTorch implementation of this project is available in our GitHub repository for academic use and is released under the GPLv3 license. The pretrained MM-DistillNet model, released under the same terms, can be downloaded here. For any commercial purpose, please contact the authors.


Publications

Francisco Rivera, Juana Valeria Hurtado, Abhinav Valada,
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

(Pdf) (Bibtex)