Using Diffusion Priors for Video Amodal Segmentation

Carnegie Mellon University
CVPR 2025

We tackle the problem of video amodal segmentation and content completion: given a modal (visible) object sequence in a video, we develop a two-stage method that generates its amodal (visible + invisible) masks and RGB content. Here, we show one such example of an unseen deformable object category, 'laptop', that undergoes complete occlusion.

In-the-wild Gallery

Abstract

Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. The few amodal methods that exist are limited: single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects.

To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple: we repurpose these models to condition on a sequence of modal mask frames of an object, along with contextual pseudo-depth maps, to learn which object boundaries may be occluded and should therefore be extended to hallucinate the complete extent of the object. This is followed by a content-completion stage that inpaints the occluded regions of the object.

We benchmark our approach against a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.
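To make the conditioning concrete, here is a minimal Python sketch of how the inputs to the two stages could be assembled; the tensor shapes, placeholder values, and variable names are illustrative assumptions rather than the released implementation (in practice, the pseudo-depth comes from an off-the-shelf monocular depth estimator run on the RGB frames).

# Hypothetical sketch of input assembly; shapes and values are assumptions.
import torch

T, H, W = 16, 256, 256                                  # frames, height, width (assumed)
rgb = torch.rand(T, 3, H, W)                            # RGB video frames
modal_masks = (torch.rand(T, 1, H, W) > 0.5).float()    # visible-object (modal) masks
pseudo_depth = torch.rand(T, 1, H, W)                   # stand-in for the output of a
                                                        # monocular depth estimator

# Mask-generation stage: condition on per-frame (modal mask, pseudo-depth) pairs.
mask_cond = torch.cat([modal_masks, pseudo_depth], dim=1)    # (T, 2, H, W)

# Content-completion stage: condition on the modal RGB content of the object,
# i.e. the RGB frames with everything outside the visible mask zeroed out.
modal_rgb = rgb * modal_masks                                # (T, 3, H, W)

print(mask_cond.shape, modal_rgb.shape)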

Video

Comparison on SAIL-VOS

We evaluate our segmentation method on the SAIL-VOS dataset. Our model achieves high-fidelity shape completion across diverse categories, handling both rigid and deformable objects effectively.

Qualitative comparison panels: RGB Image, Modal, Convex, ConvexR, PCNet-M, pix2gestalt, VideoMAE, 3D UNet, Ours, Amodal-GT.

Comparison on TAO-Amodal

For zero-shot evaluation, we test on the real-world TAO-Amodal dataset. Despite being trained exclusively on synthetic data, our model generalizes well to real-world scenarios, even for unseen object categories.

Qualitative comparison panels: RGB Image, Modal, PCNet-M, pix2gestalt, VideoMAE, 3D UNet, Ours, Amodal-GT.

Comparison on MOVi-B/D

We also benchmark our method on the MOVi-B and MOVi-D datasets. These datasets present challenges such as strong camera motion and frequent full occlusion. Our method consistently outperforms state-of-the-art baselines, maintaining robustness without relying on additional inputs like camera poses or optical flow.

Qualitative comparison panels: RGB Image, Modal, VideoMAE, EoRaS, Ours, Amodal-GT.

How does it work?

The first stage of our pipeline generates amodal masks {A_t} for an object, given its modal masks {M_t} and pseudo-depth maps of the scene {D_t} (obtained by running a monocular depth estimator on the RGB video sequence {I_t}). The predicted amodal masks from the first stage are then passed to the second stage, along with the modal RGB content of the occluded object under consideration. The second stage inpaints the occluded region and outputs the amodal RGB content {C_t} for the object. Both stages employ a conditional latent diffusion framework with a 3D UNet backbone.
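Below is a minimal, self-contained sketch of how the two stages chain together at inference time and how conditioning is supplied by channel-wise concatenation. It assumes a toy pixel-space denoiser and a crude sampling rule: the class Tiny3DUNet, the sample() update, and all shapes are illustrative assumptions, not the actual pipeline, which denoises in a VAE latent space with a pretrained video diffusion backbone (omitted here for brevity).

import torch
import torch.nn as nn

class Tiny3DUNet(nn.Module):
    """Toy stand-in for the 3D UNet denoiser (convolutions only; no attention,
    no timestep embedding)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv3d(32, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x, t):
        # A real denoiser would also embed the diffusion timestep t.
        return self.net(x)

@torch.no_grad()
def sample(denoiser, cond, out_ch, steps=50):
    # Crude sampler: start from noise and iteratively refine, concatenating the
    # conditioning video along the channel axis at every step.
    B, _, T, H, W = cond.shape
    x = torch.randn(B, out_ch, T, H, W)
    for t in reversed(range(steps)):
        pred = denoiser(torch.cat([x, cond], dim=1), t)
        x = x - pred / steps        # placeholder update standing in for a real diffusion step
    return x

# Stage 1: modal masks {M_t} + pseudo-depth {D_t}  ->  amodal masks {A_t}
B, T, H, W = 1, 8, 64, 64
modal_masks  = (torch.rand(B, 1, T, H, W) > 0.5).float()
pseudo_depth = torch.rand(B, 1, T, H, W)
stage1 = Tiny3DUNet(in_ch=1 + 2, out_ch=1)       # noisy mask + (modal mask, depth)
amodal_masks = sample(stage1, torch.cat([modal_masks, pseudo_depth], dim=1), out_ch=1)

# Stage 2: amodal masks {A_t} + modal RGB content  ->  amodal RGB content {C_t}
modal_rgb = torch.rand(B, 3, T, H, W) * modal_masks
stage2 = Tiny3DUNet(in_ch=3 + 4, out_ch=3)       # noisy RGB + (amodal mask, modal RGB)
amodal_rgb = sample(stage2, torch.cat([amodal_masks, modal_rgb], dim=1), out_ch=3)

print(amodal_masks.shape, amodal_rgb.shape)      # (1, 1, 8, 64, 64), (1, 3, 8, 64, 64)

The point the sketch illustrates is the design choice of concatenating the conditioning video with the sample being denoised along the channel axis, so a 3D UNet can propagate shape evidence across frames while completing the occluded regions.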

BibTeX

@article{chen2024diffvas,
  title={Using Diffusion Priors for Video Amodal Segmentation},
  author={Kaihua Chen and Deva Ramanan and Tarasha Khurana},
  year={2024},
  archivePrefix={arXiv},
  eprint={2412.04623},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.04623}
}