Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

arXiv

Abstract

Video diffusion models have exhibited tremendous progress in various video generation tasks. However, existing models struggle to capture latent physical knowledge, failing to infer physical phenomena that are challenging to articulate with natural language. Generating videos that follow fundamental physical laws remains an open challenge. To address this challenge, we propose a novel method to teach video diffusion models latent physical phenomenon knowledge, enabling the accurate generation of physically informed phenomena. Specifically, we first pre-train Masked Autoencoders (MAE) to reconstruct physical phenomena, yielding output embeddings that encapsulate latent physical phenomenon knowledge. Leveraging these embeddings, we generate pseudo-language prompt features based on the aligned spatial relationships between the CLIP vision and language encoders. In particular, since diffusion models typically use CLIP's language encoder for text prompt embeddings, our approach integrates the CLIP visual features informed by latent physical knowledge into a quaternion hidden space. This enables the modeling of spatial relationships to produce physical-knowledge-informed pseudo-language prompts. By incorporating these prompt features and fine-tuning the video diffusion model in a parameter-efficient manner, physically informed videos are successfully generated. We validate our method extensively through both numerical simulations and real-world observations of physical phenomena, demonstrating its remarkable performance across diverse scenarios.
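For concreteness, the snippet below gives a minimal PyTorch sketch of the MAE masked-reconstruction pre-training step on physical-phenomenon frames. The mask ratio and the patchify/encoder/decoder interfaces are illustrative assumptions rather than the exact configuration used here.

    import torch
    import torch.nn as nn

    def mae_pretrain_step(encoder, decoder, patchify, frames, mask_ratio=0.75):
        # Patchify a batch of physical-phenomenon frames: (B, N, patch_dim).
        patches = patchify(frames)
        B, N, D = patches.shape
        n_keep = int(N * (1.0 - mask_ratio))
        # Randomly keep a small visible subset of patches per sample.
        order = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep_idx, mask_idx = order[:, :n_keep], order[:, n_keep:]
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        # Encoder output: embeddings that encapsulate latent physical knowledge.
        latent = encoder(visible)
        # Decoder predicts the masked patches from the visible latents.
        pred = decoder(latent, keep_idx, mask_idx)
        target = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        # Reconstruction loss is computed on the masked patches only.
        loss = nn.functional.mse_loss(pred, target)
        return loss, latent

As in standard MAE practice, the decoder would typically be discarded after pre-training, and the encoder's output embeddings serve as the latent physical phenomenon knowledge passed to the prompt-generation stage.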

Method

Our goal is to equip the video diffusion model with latent physical phenomenon knowledge, enabling it to generate physically plausible phenomena from an initial frame.

Overview of our proposed method, which aims to teach the Stable Video Diffusion model with latent physical phenomenon knowledge. We first leverage an MAE to reconstruct masked physical phenomena. Then, the CLIP vision features of the initial frame are propagated into a quaternion hidden space to produce pseudo-language embeddings enriched with latent physical phenomenon knowledge. By incorporating these embeddings and applying parameter-efficient fine-tuning, the model generates physically consistent phenomena. A sketch of the prompt-generation stage follows below.
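The following is a minimal sketch, assuming PyTorch, of how CLIP vision features of the initial frame can be fused with the MAE physical embeddings and propagated through a quaternion hidden space (via the Hamilton product) to yield pseudo-language prompt embeddings. The layer sizes, fusion operator, and module names (QuaternionLinear, PseudoPromptGenerator) are illustrative assumptions, not the exact design.

    import torch
    import torch.nn as nn

    class QuaternionLinear(nn.Module):
        """Linear map in a quaternion hidden space via the Hamilton product."""
        def __init__(self, in_features, out_features):
            super().__init__()
            assert in_features % 4 == 0 and out_features % 4 == 0
            qi, qo = in_features // 4, out_features // 4
            # One weight matrix per quaternion component.
            self.r = nn.Parameter(torch.randn(qi, qo) * 0.02)
            self.i = nn.Parameter(torch.randn(qi, qo) * 0.02)
            self.j = nn.Parameter(torch.randn(qi, qo) * 0.02)
            self.k = nn.Parameter(torch.randn(qi, qo) * 0.02)

        def forward(self, x):
            # Split the input into its four quaternion components.
            xr, xi, xj, xk = torch.chunk(x, 4, dim=-1)
            # Hamilton product (x * w), expanded component-wise.
            r = xr @ self.r - xi @ self.i - xj @ self.j - xk @ self.k
            i = xr @ self.i + xi @ self.r + xj @ self.k - xk @ self.j
            j = xr @ self.j - xi @ self.k + xj @ self.r + xk @ self.i
            k = xr @ self.k + xi @ self.j - xj @ self.i + xk @ self.r
            return torch.cat([r, i, j, k], dim=-1)

    class PseudoPromptGenerator(nn.Module):
        """Fuses CLIP vision features with MAE physical embeddings and maps them,
        through a quaternion hidden space, to pseudo-language prompt embeddings."""
        def __init__(self, vis_dim=1024, mae_dim=768, prompt_dim=1024):
            super().__init__()
            self.mae_proj = nn.Linear(mae_dim, vis_dim)
            self.quat = QuaternionLinear(vis_dim, prompt_dim)
            self.out_proj = nn.Linear(prompt_dim, prompt_dim)

        def forward(self, clip_vis_feats, mae_phys_emb):
            # clip_vis_feats: (B, N, vis_dim) tokens of the initial frame.
            # mae_phys_emb:   (B, M, mae_dim) MAE embeddings with latent physical knowledge.
            phys = self.mae_proj(mae_phys_emb).mean(dim=1, keepdim=True)
            fused = clip_vis_feats + phys        # inject physical knowledge into vision tokens
            hidden = self.quat(fused)            # model spatial relations in quaternion space
            return self.out_proj(hidden)         # pseudo-language prompt embeddings

The resulting pseudo-language prompt tokens stand in for the usual text-conditioning embeddings of the diffusion model, which is then fine-tuned in a parameter-efficient manner (e.g., adapter- or LoRA-style updates) so that the pre-trained backbone is largely preserved.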

Video Generation Results

Qualitative video comparisons on eight scenarios: "Bernard fluid simulation", "Cylinder Fluid Simulation", "DamBreak Fluid Simulation", "DepthCharge Fluid Simulation", "Typhoon_202001", "Typhoon_202009", "Typhoon_202102", and "Typhoon_202204". For each scenario, results are shown for Ground Truth, LoRA+SVD, Stable Video Diffusion (SVD), Tune-A-Video, and Ours.

Quantitative Performance

Quantitative evaluation results on the fluid simulation dataset (left) and the real-world typhoon dataset (right). SVD: Stable Video Diffusion. LoRA*: SVD+LoRA. TAV: Tune-A-Video. SimDA: Simple Diffusion Adapter. To evaluate the generated physical phenomena, we consider eight essential metrics: RMSE, SSIM, Stream Function Error (SFE), Smoothness Error (SE), Gradient Smoothness (GS), Continuity Score (CS), Q-Criterion Error (QCE), and Vorticity Error (VE).
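As a reference for how the physics-oriented metrics can be evaluated, the NumPy snippet below sketches two of them (RMSE and Vorticity Error) for 2D velocity fields sampled on a regular grid; the exact definitions and discretizations used in the paper may differ.

    import numpy as np

    def rmse(pred, target):
        """Root-mean-square error between generated and ground-truth frames."""
        return float(np.sqrt(np.mean((pred - target) ** 2)))

    def vorticity(u, v, dx=1.0, dy=1.0):
        """Scalar vorticity w = dv/dx - du/dy for a 2D velocity field of shape (H, W)."""
        dv_dx = np.gradient(v, dx, axis=1)
        du_dy = np.gradient(u, dy, axis=0)
        return dv_dx - du_dy

    def vorticity_error(u_pred, v_pred, u_true, v_true):
        """Mean absolute difference between predicted and ground-truth vorticity fields."""
        return float(np.mean(np.abs(vorticity(u_pred, v_pred) - vorticity(u_true, v_true))))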

BibTeX


    @article{cao2024teaching,
      title={Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge},
      author={Cao, Qinglong and Wang, Ding and Li, Xirui and Chen, Yuntian and Ma, Chao and Yang, Xiaokang},
      journal={arXiv preprint arXiv:2411.11343},
      year={2024}
    }