GenMOJO: Robust Multi-Object 4D Generation for In-the-wild Videos

1Carnegie Mellon University    2Toyota Research Institute
*Equal Contribution

GenMOJO can reconstruct and synthesize scene-level Gaussians from multi-object monocular videos with complex motions.

Abstract

We address the challenge of generating dynamic 4D scenes from monocular multi-object videos with heavy occlusions and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing view-synthesis models excel at novel view generation for isolated objects, they struggle with full scenes due to their complexity and data demands. To overcome this, GenMOJO decomposes scenes into individual objects, optimizing a differentiable set of deformable Gaussians per object while capturing 2D occlusions from a 3D perspective through joint Gaussian splatting. Joint splatting ensures occlusion-aware rendering losses in observed frames, while explicit object decomposition allows the usage of object-centric diffusion models for object completion in unobserved viewpoints. To reconcile the differences between object-centric priors and the global frame-centric coordinate system of the video, GenMOJO employs differentiable transformations to unify the rendering and generative constraints within a single framework. The result is a model capable of generating 4D objects across space and time while producing 2D and 3D point tracks from monocular videos. To rigorously evaluate the quality of scene generation and the accuracy of the motion under multi-object occlusions, we introduce MOSE-PTS, a subset of the challenging MOSE benchmark, which we annotated with high-quality 2D point tracks. Quantitative evaluations and perceptual human studies confirm that GenMOJO generates more realistic novel views of scenes and produces more accurate point tracks compared to existing approaches.
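
To make the object-decomposed representation concrete, below is a minimal PyTorch-style sketch of the idea described above: each object keeps its own deformable Gaussians in a canonical, object-centric frame, a per-frame transform maps them into the shared world frame, and all objects are concatenated for joint splatting. The class and function names (ObjectGaussians, compose_scene, world_centers) and the simplified parameterization are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (PyTorch) of the object-decomposed representation described
# above: each object keeps its own deformable Gaussians in a canonical,
# object-centric frame; a per-frame rigid transform maps them into the shared
# world frame, where all objects are splatted jointly so that inter-object
# occlusions are handled correctly. All names here are illustrative.

import torch


class ObjectGaussians(torch.nn.Module):
    """Deformable 3D Gaussians for one object, kept in its own canonical frame."""

    def __init__(self, num_gaussians: int, num_frames: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.randn(num_gaussians, 3) * 0.1)   # canonical centers
        self.log_scale = torch.nn.Parameter(torch.zeros(num_gaussians, 3))  # anisotropic scales
        self.color = torch.nn.Parameter(torch.rand(num_gaussians, 3))       # per-Gaussian color
        self.opacity = torch.nn.Parameter(torch.zeros(num_gaussians, 1))    # pre-sigmoid opacity
        # Per-frame deformation of the canonical centers (the "deformable" part).
        self.delta = torch.nn.Parameter(torch.zeros(num_frames, num_gaussians, 3))
        # Per-frame object-to-world rigid transform (rotation kept as a raw 3x3 here for brevity;
        # a proper rotation parameterization, e.g. a quaternion, would be used in practice).
        self.R = torch.nn.Parameter(torch.eye(3).repeat(num_frames, 1, 1))
        self.t = torch.nn.Parameter(torch.zeros(num_frames, 3))

    def world_centers(self, frame: int) -> torch.Tensor:
        """Deform in the object frame, then map into the shared world frame."""
        mu_t = self.mu + self.delta[frame]                                    # (N, 3) deformed centers
        return mu_t @ self.R[frame].transpose(0, 1) + self.t[frame]           # (N, 3) world-frame centers


def compose_scene(objects, frame: int):
    """Concatenate all objects' Gaussians in the world frame for joint splatting."""
    centers = torch.cat([obj.world_centers(frame) for obj in objects], dim=0)
    colors = torch.cat([obj.color for obj in objects], dim=0)
    opacities = torch.cat([torch.sigmoid(obj.opacity) for obj in objects], dim=0)
    return centers, colors, opacities
```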


Method Overview

Given a monocular multi-object video, GenMOJO decomposes the scene into individual objects and optimizes a set of deformable 3D Gaussians for each one. The per-object Gaussians are mapped by differentiable transformations into a shared world frame and splatted jointly, so rendering losses on observed frames respect inter-object occlusions, while object-centric diffusion priors complete each object from unobserved viewpoints.

Arbitrary View Synthesis

GenMOJO can synthesize views from arbitrary camera poses at any given timestep, even for long videos. We present examples where we render the scene from the reference camera (outlined in blue) and from novel viewpoints (outlined in orange). The original input video is shown in the leftmost column for reference; a minimal rendering sketch follows the examples.

Columns: Input Video · Original View · Novel View 1 · Novel View 2
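
As a usage-style sketch of rendering from an arbitrary pose and timestep, the snippet below builds a simple orbit camera and transforms the composed world-frame Gaussians into its coordinate frame. It reuses the hypothetical ObjectGaussians / compose_scene helpers from the sketch after the abstract; the camera construction, the frame index, and the final rasterizer step are illustrative assumptions rather than the actual pipeline.

```python
# A minimal usage sketch: render the composed scene from an arbitrary camera
# pose at a chosen timestep, reusing the hypothetical ObjectGaussians /
# compose_scene helpers sketched above. Everything here is illustrative.

import math
import torch


def orbit_camera(azimuth_deg: float, radius: float = 3.0) -> torch.Tensor:
    """World-to-camera extrinsics for a camera orbiting the scene origin
    (the camera looks down its -z axis in this simplified convention)."""
    a = math.radians(azimuth_deg)
    # Camera position on a circle around the up (y) axis.
    cam_pos = torch.tensor([radius * math.sin(a), 0.0, radius * math.cos(a)])
    # Rotation that yaws the camera to face the origin (simplified: no tilt).
    R = torch.tensor([
        [math.cos(a), 0.0, -math.sin(a)],
        [0.0,         1.0,  0.0],
        [math.sin(a), 0.0,  math.cos(a)],
    ])
    t = -R @ cam_pos
    E = torch.eye(4)
    E[:3, :3], E[:3, 3] = R, t
    return E  # (4, 4) world-to-camera matrix


# `objects` is assumed to be a list of the ObjectGaussians sketched after the abstract.
frame = 42                       # any timestep of the video
extrinsics = orbit_camera(60.0)  # a novel view 60° away from the reference azimuth
centers, colors, opacities = compose_scene(objects, frame)               # world-frame Gaussians
centers_h = torch.cat([centers, torch.ones(len(centers), 1)], dim=1)     # homogeneous coordinates
centers_cam = (centers_h @ extrinsics.transpose(0, 1))[:, :3]            # camera-frame centers
# centers_cam, colors, opacities (together with scales/rotations) would then be
# fed to a differentiable Gaussian rasterizer to produce the novel-view image.
```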

Motion Trajectory Comparisons

We visualize the motion of GenMOJO's Gaussians corresponding to the rendered pixels by projecting their 3D trajectories to 2D in the camera view. The motion of our Gaussians is highly accurate even in complex videos with fast motion and heavy occlusions; a minimal projection sketch is given below the comparison.

Columns: Ground Truth · Ours · CoTracker V3 · Shape-of-Motion
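
Here is a minimal sketch of how 2D point tracks can be read out of such a representation: project each Gaussian's world-frame center at every timestep into the reference camera with a pinhole model. The world_centers helper refers to the hypothetical sketch after the abstract, and the intrinsics and camera convention are assumptions for illustration.

```python
# Minimal sketch of extracting 2D point tracks from the optimized Gaussians:
# take each Gaussian's world-frame center at every timestep and project it into
# the reference camera with a simple pinhole model. `world_centers` is the
# hypothetical helper sketched after the abstract; intrinsics are illustrative.

import torch


def project_pinhole(points_cam: torch.Tensor, fx: float, fy: float,
                    cx: float, cy: float) -> torch.Tensor:
    """Project camera-frame 3D points to pixels (OpenCV convention: +z forward)."""
    x, y, z = points_cam.unbind(dim=-1)
    return torch.stack([fx * x / z + cx, fy * y / z + cy], dim=-1)


def gaussian_tracks(obj, extrinsics: torch.Tensor, num_frames: int,
                    fx: float = 500.0, fy: float = 500.0,
                    cx: float = 320.0, cy: float = 240.0) -> torch.Tensor:
    """2D trajectory of every Gaussian of one object over all frames, shape (T, N, 2)."""
    tracks = []
    for frame in range(num_frames):
        centers = obj.world_centers(frame)                                    # (N, 3) world frame
        centers_h = torch.cat([centers, torch.ones(len(centers), 1)], dim=1)  # homogeneous coords
        centers_cam = (centers_h @ extrinsics.transpose(0, 1))[:, :3]         # reference-camera frame
        tracks.append(project_pinhole(centers_cam, fx, fy, cx, cy))
    return torch.stack(tracks)  # (T, N, 2) pixel trajectories
```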

4D Generation Comparisons

We show comparisons between GenMOJO and other video-to-4D baselines. Our method preserves the correct geometric relationships between objects in complex multi-object videos and avoids inter-penetration artifacts. The original input video is presented in the leftmost column for reference.

Columns: Input Video · DreamScene4D · GenMOJO (Ours)