Using Diffusion Priors for Video Amodal Segmentation

Kaihua Chen, Deva Ramanan, Tarasha Khurana

Video Versions of Paper Figures

Figure 1

In this work, we tackle the problem of video amodal segmentation and content completion: given a modal (visible) object sequence in a video, we develop a two-stage method that generates its amodal (visible + invisible) masks and RGB content. We capitalize on the shape and temporal-consistency priors baked into video foundation models by their large-scale pretraining. Finetuning these models enables us to infer the complete shapes and RGB details of objects that undergo occlusion. Our method effectively handles severe occlusions and generalizes across diverse object categories, achieving state-of-the-art results on synthetic and real-world datasets. We show one such example of an unseen deformable object category, 'laptop', that undergoes complete occlusion.

Figure 2

Model pipeline for amodal segmentation and content completion. The first stage of our pipeline generates amodal masks {At} for an object, given its modal masks {Mt} and the pseudo-depth of the scene {Dt} (obtained by running a monocular depth estimator on the RGB video sequence {It}). The predicted amodal masks from the first stage are then sent as input to the second stage, along with the modal RGB content of the occluded object under consideration. The second stage then inpaints the occluded region and outputs the amodal RGB content {Ct} for the occluded object. Both stages employ a conditional latent diffusion framework with a 3D UNet backbone. The conditioning inputs are encoded into latent space via a VAE encoder, concatenated, and processed by a 3D UNet with interleaved spatial and temporal blocks. CLIP embeddings of {Mt} and of the modal RGB content {VO} provide cross-attention cues for the first and second stages, respectively. Finally, the VAE decoder translates outputs back to pixel space.
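To make the conditioning scheme concrete, below is a minimal PyTorch sketch of the data flow described above: VAE-encode the conditionings, concatenate them with the noisy latent along channels, and denoise with a 3D UNet that cross-attends to CLIP tokens. All module names, tensor sizes, and the CLIP stand-in here are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyVAEEncoder(nn.Module):
    """Stand-in for the VAE encoder mapping frames to latents (8x spatial downsample)."""
    def __init__(self, in_ch=3, z_ch=4):
        super().__init__()
        self.net = nn.Conv3d(in_ch, z_ch, kernel_size=(1, 8, 8), stride=(1, 8, 8))
    def forward(self, x):              # x: (B, C, T, H, W)
        return self.net(x)             # -> (B, z_ch, T, H/8, W/8)

class TinyUNet3D(nn.Module):
    """Stand-in for the 3D UNet with interleaved spatial/temporal blocks."""
    def __init__(self, in_ch, out_ch, ctx_dim=512):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.attn = nn.MultiheadAttention(64, num_heads=4, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)
        self.out = nn.Conv3d(64, out_ch, kernel_size=1)
    def forward(self, z, ctx):         # z: (B, C, T, h, w); ctx: (B, L, ctx_dim)
        h = self.temporal(self.spatial(z)).relu()
        B, C, T, H, W = h.shape
        q = h.permute(0, 2, 3, 4, 1).reshape(B, T * H * W, C)
        a, _ = self.attn(q, ctx, ctx)  # cross-attention cues from CLIP tokens
        h = (q + a).reshape(B, T, H, W, C).permute(0, 4, 1, 2, 3)
        return self.out(h)             # predicted latent for the amodal masks

# Stage 1: amodal masks conditioned on modal masks + pseudo-depth.
vae = TinyVAEEncoder()
B, T, H, W = 1, 8, 64, 64
noisy_latent = torch.randn(B, 4, T, H // 8, W // 8)
modal_masks  = torch.rand(B, 3, T, H, W)   # {Mt}, replicated to 3 channels
pseudo_depth = torch.rand(B, 3, T, H, W)   # {Dt} from a monocular depth estimator
cond = torch.cat([vae(modal_masks), vae(pseudo_depth), noisy_latent], dim=1)
clip_tokens = torch.randn(B, 16, 512)      # stand-in for CLIP embeddings of {Mt}
unet = TinyUNet3D(in_ch=cond.shape[1], out_ch=4)
amodal_latent = unet(cond, clip_tokens)    # decoded to {At} by the VAE decoder
```

The second stage follows the same pattern, swapping the conditionings for the stage-1 amodal masks plus the modal RGB content.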

Figure 3

Modal-amodal RGB training pair for content completion. The left frame displays the partially occluded modal RGB content, generated by overlaying amodal masks (black regions) onto the amodal object to disrupt its visual integrity. The right frame shows the original, unoccluded amodal RGB object.
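A minimal sketch of how such a pair can be constructed, assuming we have the unoccluded amodal RGB object and a binary occluder mask; the paper's exact mask-sampling procedure may differ.

```python
import numpy as np

def make_training_pair(amodal_rgb: np.ndarray, occluder_mask: np.ndarray):
    """amodal_rgb: (H, W, 3) uint8; occluder_mask: (H, W) bool, True = occluded.

    Returns (modal_rgb, amodal_rgb): the input with occluded pixels blacked
    out, and the original unoccluded target."""
    modal_rgb = amodal_rgb.copy()
    modal_rgb[occluder_mask] = 0   # black regions disrupt visual integrity
    return modal_rgb, amodal_rgb

# Toy usage: occlude the left half of a random "object".
rgb = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
mask = np.zeros((128, 128), dtype=bool)
mask[:, :64] = True
modal, target = make_training_pair(rgb, mask)
```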

Figure 5

Columns: Modal Content, pix2gestalt, Ours.
Temporal consistency comparison with an image amodal segmentation method. We highlight the lack of temporal coherence in a single-frame diffusion-based method, pix2gestalt, for both the predicted amodal segmentation mask and the RGB content of the occluded person in the example shown. By leveraging temporal priors, our approach achieves significantly higher temporal consistency across occlusions.
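One simple proxy for the temporal coherence discussed here (our illustrative choice, not necessarily the paper's metric) is the mean IoU between consecutive predicted masks: a temporally stable method scores higher on slowly moving objects.

```python
import numpy as np

def temporal_mask_consistency(masks: np.ndarray) -> float:
    """masks: (T, H, W) boolean amodal masks for one object track."""
    ious = []
    for t in range(len(masks) - 1):
        inter = np.logical_and(masks[t], masks[t + 1]).sum()
        union = np.logical_or(masks[t], masks[t + 1]).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

# Toy usage: a static mask is perfectly consistent (score 1.0).
m = np.zeros((5, 32, 32), dtype=bool)
m[:, 8:24, 8:24] = True
print(temporal_mask_consistency(m))  # 1.0
```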

Figure 6.1

Columns: RGB Image, Modal, VideoMAE, PCNet-M, Ours, Amodal-GT.
Qualitative comparison of amodal segmentation methods on SAIL-VOS and TAO-Amodal. Our method leverages strong shape priors, such as for humans and chairs, to generate clean and realistic object shapes. It also excels at handling heavy occlusions: even when objects are nearly fully occluded (e.g., "chair" in the second row), our method achieves high-fidelity shape completion by utilizing temporal priors. Note that TAO-Amodal contains out-of-frame occlusions, which none of the methods are trained for, yet our method is able to handle such cases.

Figure 6.2

Columns: RGB Image, Modal, VideoMAE, EoRaS, Ours, Amodal-GT.
Qualitative comparison of amodal segmentation methods on MOVi-B/D. Our method leverages robust shape priors for boots and teapots, ensuring consistent shapes even under significant camera movement and near-complete occlusion of the objects.

Figure 7

Qualitative results for content completion. Although our content completion module, initialized from pretrained SVD weights, is finetuned solely on synthetic SAIL-VOS, it achieves photorealistic, high-fidelity object inpainting even in real-world scenarios. Furthermore, our method can complete unseen categories, such as giraffes and plastic bottles, likely due to its ability to transfer styles and patterns from the visible parts of objects to occluded areas in the current or neighboring frames. We show examples from TAO-Amodal (top) and in-the-wild YouTube videos (bottom).

Figure 8

We show an example of multi-modal generation from our diffusion model. Since there are multiple plausible explanations for the person's shape in the occluded region, our model predicts two such plausible amodal masks (with the person's occluded legs in two different orientations).
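Such diverse samples arise because the same conditioning can be denoised from different initial noise. The sketch below illustrates this with a hypothetical wrapper `sample_amodal_masks` around the stage-1 sampler; it is not the authors' API, and the reverse-diffusion loop is elided.

```python
import torch

def sample_amodal_masks(cond_latent: torch.Tensor, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    # Different seed -> different initial noise -> different plausible mode.
    noise = torch.randn(cond_latent.shape, generator=gen)
    # ... run the reverse diffusion process conditioned on cond_latent ...
    return noise  # placeholder for the denoised amodal-mask latent

cond = torch.zeros(1, 4, 8, 8, 8)
variant_a = sample_amodal_masks(cond, seed=0)  # e.g., legs in one orientation
variant_b = sample_amodal_masks(cond, seed=1)  # e.g., legs in another orientation
```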

Figure 11

Rows alternate between results w/o ours and w/ ours.
4D reconstruction results. Without amodal completion by our method, the 4D reconstruction exhibits blank regions and unrealistic artifacts in occluded areas, such as the person's back and leg. The occluded portions vary over time, which confuses SV4D and disrupts its understanding of the object's 3D structure. In contrast, using completed objects from our method significantly improves reconstruction quality, producing clearer and more consistent novel views.

Figure 12

Columns: Source, Manipulated.
Scene manipulation examples. Using de-occluded objects from our method, we can reposition and reorder them to create new scenes. In the top rows, the relationship between the person and the soccer ball is altered, changing the scene from "the person is juggling" to "the person places the soccer ball aside and practices a juggling posture." In the bottom rows, the middle giraffe is moved to the front and its position is adjusted.
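A minimal back-to-front compositing sketch for this kind of manipulation: de-occluded objects are treated as RGBA layers that can be re-ordered and re-positioned over the background. The layer representation here is our assumption for illustration, not the paper's implementation.

```python
import numpy as np

def composite(background: np.ndarray, layers):
    """background: (H, W, 3) float in [0, 1].
    layers: list of (rgb, alpha, (dy, dx)) tuples, ordered back to front,
    with rgb: (h, w, 3) and alpha: (h, w) in [0, 1]."""
    out = background.copy()
    for rgb, alpha, (dy, dx) in layers:
        h, w = alpha.shape
        region = out[dy:dy + h, dx:dx + w]
        a = alpha[..., None]
        out[dy:dy + h, dx:dx + w] = a * rgb + (1 - a) * region  # alpha blend
    return out

# Toy usage: re-staging a scene is just re-ordering/offsetting the layers.
bg = np.ones((64, 64, 3)) * 0.5
obj = (np.zeros((16, 16, 3)) + [1.0, 0.8, 0.2], np.ones((16, 16)), (10, 10))
scene = composite(bg, [obj])
```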

Figure 13

Qualitative results for pseudo-ground truth of TAO-Amodal masks. Leveraging the amodal bounding box as a strong prior, our method demonstrates versatility across diverse categories, such as persons, tractors, and bottles, and generalizes well to unseen categories like snowboards and horses. This high-quality pseudo-ground truth can semi-automate the manual annotation of amodal masks in real-world videos.

Figure 15 & 16

Columns: RGB Image, Modal, Convex, ConvexR, PCNet-M, pix2gestalt, VideoMAE, 3D UNet, Ours, Amodal-GT.
Qualitative results on SAIL-VOS.

Figure 17 & 18

Columns: RGB Image, Modal, PCNet-M, pix2gestalt, VideoMAE, 3D UNet, Ours, Amodal-GT.
Qualitative results on TAO-Amodal.

Figure 19

Columns: RGB Image, Modal, VideoMAE, EoRaS, Ours, Amodal-GT.
Qualitative results on MOVi-B/D.

Figure 20

Qualitative results for amodal content completion in in-the-wild scenarios.