Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

[Accepted by CVPR 2024]


Why Long-Untrimmed Audio-visual Segmentation Dataset?

Existing audio-visual segmentation (AVS) datasets typically focus on short, trimmed videos with only a single pixel-level mask annotation per one-second video clip. In contrast, in untrimmed videos the sound duration, the start- and end-sounding time positions, and the visual deformation of audible objects vary significantly. As a result, we observe that current AVS models trained on trimmed videos can struggle to segment sounding objects in long videos. To investigate the feasibility of grounding audible objects in videos along both the temporal and spatial dimensions, we introduce the Long-Untrimmed Audio-Visual Segmentation dataset (LU-AVS), which includes precise frame-level annotations of sound emission times and provides exhaustive mask annotations for all frames.

Samples from our Long-Untrimmed Audio-Visual Segmentation dataset. Different from previous AVS datasets, LU-AVS is specifically crafted to explore the challenges inherent in the AVS task for long-untrimmed videos. It features detailed start- and end-sounding positions in the temporal dimension, along with comprehensive mask and bounding box annotations in the spatial dimension. The examples show that our dataset contains numerous audible segments in each video, characterized by diverse durations and varying start- and end-sounding positions. Additionally, within a single video, the same object may shift notably in space and undergo deformation.


What is the audio-visual segmentation task?

Task Definition. Audio-visual segmentation aims to localize audible objects by a pixel-level map for a given audio-visual pair.

Challenges in Long-Untrimmed Videos. (1) The start- and end-sounding frames of audible objects are uncertain.
(2) The sounding duration of the target objects varies significantly among different videos.

What is LU-AVS dataset?

To investigate the AVS task on long-untrimmed videos, we propose a large-scale Long-Untrimmed Audio-Visual Segmentation (LU-AVS) dataset. LU-AVS comprises 6.6K untrimmed videos covering 78 categories, with 10M pixel-level mask annotations indicating the audible objects. Moreover, we augment the dataset with bounding box annotations, covering both the previously mask-annotated categories and those that are challenging to annotate with masks. The extended LU-AVS dataset contains about 7.2K untrimmed videos and more than 88 categories. More details about the LU-AVS dataset are given below.

Basic information

  • Videos are collected from YouTube.
  • 7.2K videos spanning 88 categories.
  • 6.2K annotated masks and 7.3K bounding boxes
  • Spanning a wide range of domains, including Animal, Human, Music, Sport, Tool, Vehicle, and others

Characteristics

  • Total video duration is 84.6 hours; total sounding duration is 73.3 hours.
  • Long average video duration of 41.97s.
  • Long average sounding duration of 16.03s.
  • Over half of the videos have multiple audible segments.
  • Target objects have notable shifts and undergo deformation.

Personal data/Human subjects

Videos in LU-AVS are publicly available on YouTube and were annotated by professional annotators. The details of the annotation procedure are explained in our paper. Our dataset does not contain personally identifiable information or offensive content.

Video examples


Some annotated examples in the LU-AVS dataset. Each color represents an audible segment type.

Information of the audible segments (example 1):
(1) Female singing:
     start frame 68; end frame 482
(2) Female singing:
     start frame 651; end frame 783
(3) Female singing:
     start frame 938; end frame 998
(4) Playing acoustic guitar:
     start frame 0; end frame 631
(5) Playing acoustic guitar:
     start frame 769; end frame 1079
Information of the audible segments (example 2):
(1) Car passing:
     start frame 177; end frame 621
(2) Skidding:
     start frame 0; end frame 110
Information of the audible segments (example 3):
(1) Airplane flyby:
     start frame 0; end frame 935

LU-AVS Dataset

Dataset Analysis and Statistics

Overview of LU-AVS Dataset:

Comparison with existing AVS datasets. Statistics of publicly available AVS datasets. Compared to existing AVS datasets, the newly curated LU-AVS possesses a greater number of mask annotations and a broader range of categories. Additionally, it incorporates two forms of spatial annotations. For video samples where pixel-level annotations are hard to obtain (denoted as 'Hard'), we provide bounding box annotations; in contrast, 'Normal' samples are labeled with both bounding boxes and mask annotations. More importantly, the average durations of videos and their audible segments in LU-AVS are both longer than those in other AVS datasets. This demonstrates that LU-AVS facilitates the exploration of the challenges posed by untrimmed videos in the AVS task.



Sounding durations of each audible segment class in the LU-AVS dataset, sorted in descending order, with colors indicating audible segment types. Category labels framed with dashed lines include only bounding box annotations, while the other categories contain both mask and bounding box annotations. Overall, the category distribution of the dataset spans a wide range of domains, including Animal, Human, Music, Sport, Tool, and Vehicle.


Temporal Characteristics:

Statistics on the temporal structure of LU-AVS. (a) and (b) show the duration distributions of videos and audible segments, respectively. As the two figures suggest, both video durations and audible segment durations vary widely. (c) presents the distribution of the start- and end-sounding positions of audible segments in the temporal dimension, revealing that the distribution of start and end times of audible segments is broad. (d) shows statistics on the number of audible segments/tubes per video, indicating that over half of the videos have multiple audible segments.


Spatial Characteristics:

Statistics on the spatial distribution of sounding objects in our LU-AVS dataset. (a) and (b) illustrate the size and aspect ratio distributions of the annotated bounding boxes, respectively. (c) shows the spatial distribution of the centroids of object masks. This diverse spatial distribution of objects mitigates the data bias problem and also increases the challenge of tracking objects in videos.

Strong Baselines for Benchmarking

Different Tasks Adapted to LU-AVS, A Strong Baseline for LU-AVS, Experimental Results and Challenge Analysis

To show the necessity and fully explore the challenges of LU-AVS, we investigate the performance of existing audio-visual segmentation (AVS), audio-visual localization (AVL), audio-visual event localization (AVE), and spatio-temporal video grounding (STVG) methods on our LU-AVS dataset. Based on the analysis of the above methods, we introduce a simple yet effective framework to provide a base reference for long video audible object grounding in the future.


More details are in the [Paper] and [Supplementary].

Task Definition:

Audio-Visual Segmentation Methods (AVS):

The goal of the AVS task is to segment audible objects in an image based on a given audio-visual pair. Existing methods are designed for datasets with a fixed input format of 10 frames corresponding to 10s of audio. To adapt these methods to untrimmed videos, we modify the input to one second of audio and five uniformly sampled frames from that segment, allowing audio-visual interactions within each per-second segment.
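
As a rough illustration of this input adaptation (not the released preprocessing code), the sketch below slices an untrimmed video into one-second segments and uniformly samples five frames from each; the function name, the fps argument, and the 30 fps example are placeholder assumptions.

# Illustrative sketch only: slice an untrimmed video into 1-second
# segments and uniformly sample 5 frames from each segment.
# `num_frames`, `fps`, and the 30 fps example below are placeholders.
import numpy as np

def make_segments(num_frames, fps, frames_per_segment=5):
    """Return a list of frame-index lists, 5 indices per 1-second segment."""
    segments = []
    num_seconds = int(np.ceil(num_frames / fps))
    for sec in range(num_seconds):
        start = int(sec * fps)
        end = min(int((sec + 1) * fps), num_frames)
        # uniformly sample frame indices within this 1-second window
        idx = np.linspace(start, end - 1, frames_per_segment).round().astype(int)
        segments.append(idx.tolist())
    return segments

# Example: a 30 fps video with 1079 frames yields 36 one-second segments,
# each paired with the corresponding 1 s of audio.
print(make_segments(1079, 30)[:2])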

Audio-Visual Localization Methods (AVL):

The AVL task also focuses on locating audible objects in the spatial dimension. However, AVL presents sounding regions as heatmaps. For comparison, at test time we convert the heatmaps into bounding boxes, following prior works. Thanks to our bounding-box annotations, we can evaluate AVL methods on LU-AVS in a unified manner. Similar to the modification of the AVS methods, we slice the videos into 1-second segments to fit the AVL methods.
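
A common recipe for this conversion is to threshold the heatmap and take the bounding box of the activated region. The sketch below follows that recipe under assumed settings; in particular, the relative 0.5 threshold is an illustrative choice, not the value used in our experiments.

# Illustrative sketch: convert a localization heatmap into a single
# bounding box by thresholding. The relative 0.5 threshold is assumed.
import numpy as np

def heatmap_to_bbox(heatmap, rel_threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) of the thresholded region."""
    mask = heatmap >= rel_threshold * heatmap.max()
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # nothing activated: treat as "no sounding object"
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Example usage on a random heatmap
bbox = heatmap_to_bbox(np.random.rand(224, 224))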

Audio-Visual Event Localization Methods (AVE):

The AVE task aims to determine the temporal segments in which the target object is both audible and visible. Since these methods do not segment audible objects in the spatial dimension, they cannot be applied to segmenting objects in images. They are also developed for videos with fixed durations.

Spatio-temporal Video Grounding Methods (STVG):

Given a query sentence, STVG methods are required to track the target object in the video in both the spatial (bounding boxes) and temporal (start and end time positions) dimensions. Unlike the AVS task, which may require segmenting multiple sounding objects in a video, STVG methods only need to find and track one target object for each text-video pair. To explore the performance of STVG methods on LU-AVS, we modify these text-guided methods to adapt them to our task.

A Strong LU-AVS Baseline:

The overall architecture of our strong baseline. It first learns visual and audio features separately and then establishes visual-audio associations, which enables us to explicitly dissect the impacts of the visual and audio branches.


Different from previous AVS datasets, the LU-AVS dataset introduces the unique challenge of detecting and segmenting audible objects in long videos. In untrimmed videos, the target objects may start or stop emitting sound at arbitrary positions, and the sounding durations of objects vary across the dataset. To achieve the spatial and temporal localization of audible objects, we introduce a simple framework to provide a base reference for long-video audible object grounding in the future.
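
For intuition only, the snippet below sketches such a two-branch design: separate visual and audio encoders followed by cross-modal attention and simple temporal/mask heads. The feature dimensions, encoder choices, and prediction heads are illustrative assumptions and do not reproduce the baseline released with the paper.

# Minimal illustrative sketch of a two-branch baseline (not the paper's code):
# separate visual/audio projections followed by cross-modal attention.
import torch
import torch.nn as nn

class TwoBranchBaseline(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # placeholder projections; real models would use stronger backbones
        self.visual_proj = nn.Linear(2048, dim)   # per-frame visual features
        self.audio_proj = nn.Linear(128, dim)     # per-second audio features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.temporal_head = nn.Linear(dim, 2)    # sounding vs. silent logits per segment
        self.mask_head = nn.Linear(dim, dim)      # queries for a mask decoder

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, T, 2048), audio_feats: (B, T, 128)
        v = self.visual_proj(visual_feats)
        a = self.audio_proj(audio_feats)
        # audio queries attend to visual features to build associations
        fused, _ = self.cross_attn(a, v, v)
        return self.temporal_head(fused), self.mask_head(fused)

# Example: one video represented by 40 one-second segments
t_logits, mask_queries = TwoBranchBaseline()(torch.randn(1, 40, 2048),
                                             torch.randn(1, 40, 128))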


Benchmarking Results and Analysis

Benchmarking results on the LU-AVS dataset. For all evaluation metrics, higher values indicate better performance. Notably, AVL methods present their spatial localization results as heatmaps; for comparison, we convert the heatmaps to bounding boxes, following existing works. Additionally, AVE methods focus on temporal localization. Here, m_tIoU represents the segmentation accuracy within the ground-truth temporal range, and m_vIoU indicates the segmentation accuracy over the temporal union of the predicted and ground-truth durations. Besides, we replace the text branch in the STVG methods with an audio branch for spatio-temporal audible object grounding.
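
To make the two temporal-aware metrics concrete, the sketch below computes per-video tIoU- and vIoU-style scores as they are commonly defined in the spatio-temporal grounding literature: tIoU averages the frame-level IoU over the ground-truth temporal range, while vIoU accumulates the frame-level IoU over the temporal intersection and normalizes by the temporal union. The exact averaging protocol of the benchmark may differ, so treat this as an assumption-laden illustration.

# Illustrative sketch of tIoU/vIoU-style scores; the definitions follow the
# spatio-temporal grounding literature and are not copied from the paper.
import numpy as np

def frame_iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def tiou_viou(pred_masks, gt_masks, pred_span, gt_span):
    """pred_masks / gt_masks: dict frame_idx -> boolean mask (H, W);
    pred_span / gt_span: (start_frame, end_frame), inclusive."""
    gt_frames = set(range(gt_span[0], gt_span[1] + 1))
    pred_frames = set(range(pred_span[0], pred_span[1] + 1))
    inter_frames = gt_frames & pred_frames
    union_frames = gt_frames | pred_frames
    empty = np.zeros_like(next(iter(gt_masks.values())), dtype=bool)
    # tIoU-style: average spatial IoU over the ground-truth temporal range
    tiou = sum(frame_iou(pred_masks.get(f, empty), gt_masks[f]) for f in gt_frames) / len(gt_frames)
    # vIoU-style: spatial IoU accumulated over the temporal intersection,
    # normalized by the temporal union
    viou = sum(frame_iou(pred_masks[f], gt_masks[f]) for f in inter_frames) / len(union_frames)
    return tiou, viou

# Toy example: ground truth covers frames 0-2, prediction covers frames 1-3
gt = {f: np.ones((4, 4), dtype=bool) for f in range(0, 3)}
pr = {f: np.ones((4, 4), dtype=bool) for f in range(1, 4)}
print(tiou_viou(pr, gt, (1, 3), (0, 2)))  # ~0.67, 0.5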


Challenges Imposed by LU-AVS Dataset

Based on the above experimental results, we summarize the dataset challenges and adaptability of existing methods as follows:

(1) For long videos in LU-AVS, the sounding duration and the start- and end-sounding time positions are uncertain.
Therefore, both the spatial and temporal localization of audible objects are necessary for LU-AVS. Existing AVS methods, developed on trimmed videos, struggle to achieve temporal localization and show limited adaptability to long videos.

(2) Unlike trimmed videos, which almost always feature audible objects, untrimmed videos contain a high proportion of silent segments. Hence, existing AVL methods trained on LU-AVS tend to overlook the audible objects. This suggests that greater emphasis on audio is required when developing models on LU-AVS.

(3) Similar to the STVG task, the exhaustive annotations in LU-AVS place a high demand on achieving consistent spatial and temporal localization of audible objects, requiring methods to effectively and jointly model spatial, temporal, and audio-visual interactions.




Download

Dataset publicly available for research purposes

Data download


Resources (Anonymous Now):

Annotations (train, val and test set): available for download at GitHub (Anonymous Now).

The format of the annotation JSON file is shown below:

The annotated JSON file of audible segments in a video (0YMv3RUxGQM.json).
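
Since only the file name is referenced here, the snippet below simply loads the annotation file and prints its top-level structure; the fields suggested in the comment (category, start/end frame, per-frame masks or boxes) are hypothetical guesses based on the annotations described above, not the actual schema.

# Load an LU-AVS annotation file and inspect its structure. The expected
# fields (e.g. category, start/end frame, per-frame masks or boxes) are
# hypothetical; consult the released schema for the authoritative format.
import json

with open("0YMv3RUxGQM.json") as f:
    ann = json.load(f)

# Print the top-level keys (or list length) to discover the actual layout.
print(list(ann.keys()) if isinstance(ann, dict) else (type(ann), len(ann)))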

Publication(s)

If you find our work useful in your research, please cite our paper.

        
        @ARTICLE{
          title={Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos},
          author={anonymous},
          year={2024},
        }
        

Disclaimer

The released LU-AVS dataset is curated and may therefore contain potential correlations between instruments and geographical areas. This issue warrants further research and consideration.


Copyright Creative Commons License

All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.