Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

[Accepted by CVPR 2024]


Why a Long-Untrimmed Audio-Visual Segmentation Dataset?

Existing audio-visual segmentation (AVS) datasets typically focus on short, trimmed videos with only one pixel-level mask annotation per one-second video clip. In contrast, for untrimmed videos, the sound duration, the start- and end-sounding time positions, and the visual deformation of audible objects vary significantly. We therefore observe that current AVS models trained on trimmed videos may struggle to segment sounding objects in long videos. To investigate the feasibility of grounding audible objects in videos along both the temporal and spatial dimensions, we introduce the Long-Untrimmed Audio-Visual Segmentation dataset (LU-AVS), which includes precise frame-level annotations of sound emission times and provides exhaustive mask annotations for all frames.

Samples from our Long-Untrimmed Audio-Visual Segmentation (LU-AVS) dataset. Unlike previous AVS datasets, LU-AVS is specifically crafted to explore the challenges inherent in the AVS task for long, untrimmed videos. It features detailed start- and end-sounding positions in the temporal dimension, along with comprehensive mask and bounding box annotations in the spatial dimension. The examples show that each video in our dataset contains numerous audible segments, characterized by diverse durations and varying start- and end-sounding positions. Additionally, within a single video, the same object may shift notably in space and undergo deformation.


What is the audio-visual segmentation task?

Task Definition. Audio-visual segmentation aims to localize audible objects by a pixel-level map for a given audio-visual pair.

Challenges in Long-Untrimmed Videos. (1) The start- and end-sounding frames of audible objects are uncertain.
(2) The sounding duration of the target objects varies significantly across different videos.

What is LU-AVS dataset?

To investigate the AVS task on long, untrimmed videos, we propose a large-scale Long-Untrimmed Audio-Visual Segmentation (LU-AVS) dataset. LU-AVS comprises 6.6K untrimmed videos covering 78 categories, with 10M pixel-level annotation masks indicating the audible objects. Moreover, we augment the dataset with bounding box annotations, covering both the categories already marked with masks and those that are challenging to annotate with masks. The extended LU-AVS dataset contains about 7.2K untrimmed videos and more than 88 categories. More details about the LU-AVS dataset are given below.

Basic information

  • Videos are collected from YouTube.
  • 6.6K videos with mask annotations (78 categories), extended to 7.2K videos spanning 88 categories with bounding boxes.
  • Spanning a wide range of domains, including Animal, Human, Music, Sport, Tool, Vehicle, and others.

Characteristics

  • Long average video duration: 41.97s.
  • Long average sounding duration: 16.03s.
  • Over half of the videos have multiple audible segments.
  • Target objects have notable shifts and undergo deformation.

Personal data/Human subjects

Videos in LU-AVS are publicly available on YouTube and were annotated by professional annotators. The details of the annotation procedure are explained in our paper.

Video examples


Some annotated examples in the LU-AVS dataset. Each color represents an audible segment type.

Audible segments in Example 1:
(1) Female singing: start frame 68, end frame 482
(2) Female singing: start frame 651, end frame 783
(3) Female singing: start frame 938, end frame 998
(4) Playing acoustic guitar: start frame 0, end frame 631
(5) Playing acoustic guitar: start frame 769, end frame 1079

Audible segments in Example 2:
(1) Car passing: start frame 177, end frame 621
(2) Skidding: start frame 0, end frame 110

Audible segments in Example 3:
(1) Airplane flyby: start frame 0, end frame 935

LU-AVS Dataset

Dataset Analysis and Statistics

Overview of LU-AVS Dataset:

Comparison with existing AVS datasets. Statistics of publicly available AVS datasets. Compared to the existing AVS datasets, the newly curated LU-AVS possesses a greater number of mask annotations and a broader range of categories. Additionally, it incorporates two forms of spatial annotations: for video samples where pixel-level annotations are hard to obtain (denoted as 'Hard'), we provide bounding box annotations, whereas 'Normal' samples are labeled with both bounding boxes and masks. More importantly, the average durations of videos and their audible segments in LU-AVS are both longer than those in other AVS datasets. This demonstrates that LU-AVS facilitates the exploration of the challenges posed by untrimmed videos in the AVS task.



Sounding durations of each audible segment class in the LU-AVS dataset, sorted in descending order, with colors indicating audible segment types. The categories framed with dashed lines include only bounding box annotations, while the other categories contain both mask and bounding box annotations. Overall, the category distribution of the dataset spans a wide range of domains, including Animal, Human, Music, Sport, Tool, and Vehicle.


Temporal Characteristics:

Statistics on the temporal structure of LU-AVS. (a) and (b) show the duration distributions of videos and audible segments, respectively. As the two figures suggest, video durations and audible segment durations vary widely. (c) presents the distribution of the start- and end-sounding positions of audible segments along the temporal dimension, revealing that this distribution is broad. (d) shows statistics on the number of audible segments/tubes per video, indicating that over half of the videos contain multiple audible segments.


Spatial Characteristics:

Statistics on the spatial distribution of sounding objects in our LU-AVS dataset. (a) and (b) illustrate the size and aspect ratio distributions of annotated bounding boxes, respectively. (c) shows the spatial distribution of the centroids of object masks. This diverse spatial distribution mitigates the data bias problem and also increases the challenge of tracking objects in videos.
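For readers who wish to reproduce such spatial statistics from the released annotations, a minimal Python sketch is given below. It assumes per-frame binary masks and (x_min, y_min, x_max, y_max) boxes as loaded from the annotation files; the function names and normalization choices are ours, not part of an official toolkit.

    import numpy as np

    def bbox_size_and_aspect(bbox, img_w, img_h):
        """Relative area and aspect ratio (w/h) of a box given as
        (x_min, y_min, x_max, y_max) in pixel coordinates."""
        x0, y0, x1, y1 = bbox
        w, h = max(x1 - x0, 1), max(y1 - y0, 1)
        return (w * h) / (img_w * img_h), w / h

    def mask_centroid(mask):
        """Normalized (x, y) centroid of a binary mask of shape (H, W)."""
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            return None  # silent / empty frame
        return xs.mean() / mask.shape[1], ys.mean() / mask.shape[0]

    # Example on a toy mask and box.
    mask = np.zeros((100, 100), dtype=np.uint8)
    mask[40:60, 20:80] = 1
    print(mask_centroid(mask))                               # approx (0.495, 0.495)
    print(bbox_size_and_aspect((20, 40, 80, 60), 100, 100))  # (0.12, 3.0)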

Strong Baselines for Benchmarking

Different Tasks Adapted to LU-AVS, A Strong Baseline for LU-AVS, Experimental Results, and Challenge Analysis

To demonstrate the necessity of LU-AVS and fully explore its challenges, we investigate the performance of existing audio-visual segmentation (AVS), audio-visual localization (AVL), audio-visual event localization (AVE), and spatio-temporal video grounding (STVG) methods on our LU-AVS dataset. Based on this analysis, we introduce a simple yet effective framework to serve as a baseline reference for grounding audible objects in long videos.


More details are in the [Paper] and [Supplementary].

Task Definition:

Audio-Visual Segmentation Methods (AVS):

The goal of the AVS task is to segment audible objects in an image based on a given audio-visual pair. Existing methods are designed for datasets with a fixed input format of 10 frames corresponding to 10 s of audio. To adapt these methods to untrimmed videos, we modify the input to one second of audio and five uniformly sampled frames from that segment, allowing audio-visual interactions within each per-second segment.
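A minimal sketch of this per-second slicing strategy is given below. The frame-rate handling, rounding, and function names are our own assumptions rather than the exact preprocessing code used in our experiments.

    import numpy as np

    def slice_into_second_segments(num_frames, fps, frames_per_segment=5):
        """Split an untrimmed video into 1-second segments and uniformly
        sample `frames_per_segment` frame indices from each segment.
        Returns a list of index arrays, one per 1-second segment; the
        boundary handling here is an assumption, not the released code."""
        segments = []
        num_seconds = int(np.floor(num_frames / fps))
        for sec in range(num_seconds):
            start = int(round(sec * fps))
            end = min(int(round((sec + 1) * fps)), num_frames)  # exclusive
            # uniformly sample frame indices inside this 1-second window
            idx = np.linspace(start, end - 1, frames_per_segment).round().astype(int)
            segments.append(idx)
        return segments

    # Example: a ~42 s video at 24 fps -> 41 one-second segments,
    # each paired with 1 s of audio and 5 sampled frames.
    segs = slice_into_second_segments(num_frames=1007, fps=24)
    print(len(segs), segs[0])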

Audio-Visual Localization Methods (AVL):

The AVL task also focuses on locating audible objects in the spatial dimension; however, AVL presents sounding regions as heatmaps. For comparison, at test time we convert the heatmaps into bounding boxes, following prior works. Thanks to our bounding box annotations, we can evaluate AVL methods on LU-AVS in a unified manner. Similar to the modification for AVS methods, we slice the videos into 1-second segments to fit the AVL methods.
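Below is a minimal sketch of one common heatmap-to-box conversion (thresholding the normalized map and taking the extent of the above-threshold region). The threshold value and normalization are illustrative assumptions, not necessarily the exact protocol used in our evaluation.

    import numpy as np

    def heatmap_to_bbox(heatmap, threshold=0.5):
        """Convert a sounding-region heatmap (H, W) into a single bounding box
        (x_min, y_min, x_max, y_max) for bbox-based evaluation.
        Simple scheme (an assumption): keep pixels whose min-max normalized
        response exceeds `threshold` and take their bounding extent.
        Returns None if nothing exceeds the threshold (silent prediction)."""
        hm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
        ys, xs = np.where(hm >= threshold)
        if len(xs) == 0:
            return None
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

    # Example: a toy 4x4 heatmap with a hot 2x2 region in one corner.
    hm = np.zeros((4, 4))
    hm[2:, 2:] = 1.0
    print(heatmap_to_bbox(hm))  # (2, 2, 3, 3)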

Audio-Visual Event Localization Methods (AVE):

The AVE task aims to determine the temporal segments in which the target object is both audible and visible. Since these methods do not segment audible objects in the spatial dimension, they cannot be applied to segmenting objects in images. They are also developed on videos with fixed durations.

Spatio-temporal Video Grounding Methods (STVG):

Given a query sentence, STVG methods are required to track the target object in the video in both the spatial (bounding boxes) and temporal (start and end time positions) dimensions. Unlike the AVS task, which may require segmenting multiple sounding objects in a video, STVG methods only need to find and track one target object per text-video pair. To explore the performance of STVG methods on LU-AVS, we modify these text-guided methods to adapt them to our task.
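As a purely illustrative sketch, one way such an adaptation could look is to project per-second audio features into the embedding space that a text-conditioned grounding model expects, so that the audio stream plays the role of the query sentence. The module name and dimensions below are hypothetical and do not correspond to any specific STVG architecture.

    import torch
    import torch.nn as nn

    class AudioQueryAdapter(nn.Module):
        """Hypothetical adapter: projects per-second audio features into the
        query-embedding space of a text-conditioned STVG model, so audio can
        replace the encoded sentence. Dimensions are illustrative assumptions."""

        def __init__(self, audio_dim=128, query_dim=256):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(audio_dim, query_dim),
                nn.ReLU(inplace=True),
                nn.Linear(query_dim, query_dim),
            )

        def forward(self, audio_feat):  # (T, audio_dim), one feature per second
            return self.proj(audio_feat)  # (T, query_dim), used in place of text tokens

    # Usage sketch: feed the projected audio features to the grounding head
    # wherever the original method consumed encoded text tokens.
    adapter = AudioQueryAdapter()
    audio_feat = torch.randn(42, 128)   # e.g., 42 one-second audio features
    query_tokens = adapter(audio_feat)  # (42, 256)
    print(query_tokens.shape)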


Challenges Imposed by LU-AVS Dataset

Based on the above experimental results, we summarize the dataset challenges and adaptability of existing methods as follows:

(1) For long videos in LU-AVS, the sounding duration and the start- and end-sounding time positions are uncertain.
Therefore, both spatial and temporal localization of audible objects are necessary for LU-AVS. Existing AVS methods, developed on trimmed videos, struggle to achieve temporal localization and show limited adaptability to long videos.

(2) Unlike trimmed videos, which always feature audible objects, untrimmed videos contain a high proportion of silent segments. Hence, existing AVL methods trained on LU-AVS tend to overlook the audible objects. This suggests that model development on LU-AVS requires greater emphasis on audio.

(3) Similar to the STVG task, the exhaustive annotations in LU-AVS place a high demand on consistent spatial and temporal localization of audible objects, requiring methods to jointly and effectively model spatial, temporal, and audio-visual interactions.




Download

Dataset publicly available for research purposes

Data download


Resources: All videos can be downloaded from Hugging Face and Baidu Pan (Extraction Code: rx8p).

Annotations are available at Hugging Face and Baidu Pan (Extraction Code: rx8p).

The mask_bbox_jsons folder contains JSON files with both mask and bbox annotations. The bbox_jsons folder contains JSON files with only bbox annotations.

The structure of the annotation files is described below:

Instructions for Use: The annotation directory for each video segment follows the format: main_dir/{video Name}__{label}__st__{start frame index}__et__{end frame index}
Note: Spaces in the label are replaced with _.
For masks.npy or bbox.npy, the first annotation corresponds to the segment's starting frame. For example, if an audible segment starts at frame 23, the first mask in masks.npy is the annotation for frame 23.
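A minimal sketch of how these conventions can be used to recover the annotation for a particular frame is given below; the directory layout follows the description above, while the helper names are our own.

    import os
    import numpy as np

    def parse_segment_dir(segment_dir):
        """Parse a segment directory name of the form
        {video name}__{label}__st__{start frame index}__et__{end frame index}.
        Spaces in the label are stored as '_', as noted above; we assume the
        video name and label contain no double underscores themselves."""
        name = os.path.basename(segment_dir.rstrip("/"))
        video_name, label, _, start, _, end = name.split("__")
        return video_name, label.replace("_", " "), int(start), int(end)

    def mask_for_frame(segment_dir, frame_index):
        """Return the mask annotated for a global frame index within this
        audible segment. The first entry of masks.npy corresponds to the
        segment's starting frame (e.g., frame 23 if the segment starts at 23)."""
        _, _, start, end = parse_segment_dir(segment_dir)
        assert start <= frame_index <= end, "frame outside this audible segment"
        masks = np.load(os.path.join(segment_dir, "masks.npy"))
        return masks[frame_index - start]

    # Usage sketch (hypothetical path):
    # mask = mask_for_frame("main_dir/video123__female_singing__st__68__et__482", 100)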


Publication(s)

If you find our work useful in your research, please cite our paper.

        
        @inproceedings{liu2024benchmarking,
          title={Benchmarking audio visual segmentation for long-untrimmed videos},
          author={Liu, Chen and Li, Peike and Yu, Qingtao and Sheng, Hongwei and Wang, Dadong and Li, Lincheng and Yu, Xin},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          pages={22712--22722},
          year={2024}
        }
        

Disclaimer

The released LU-AVS dataset is curated, and may therefore exhibit a potential correlation between instruments and geographical areas. This issue warrants further research and consideration.


Important Notice

Considering the copyright issues of some videos, we have excluded controversial data (10%). If any controversial aspects regarding our videos remain, please contact us as soon as possible, and we will withdraw the relevant content.


Copyright Creative Commons License

All datasets and benchmarks on this page are copyright by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.