
I Think, Therefore I Diffuse:
Enabling Multimodal In-Context Reasoning in Diffusion Models

HKUST, Snap Inc.

                                                                                                                                                                                                                                      

TL;DR

1. Align the VLM to an LLM decoder instead of a diffusion decoder.
2. This is based on the finding that the LLM decoder shares the same input feature space as the diffusion decoder.
3. ThinkDiff-LVLM aligns deep features of the LVLM's generated tokens, rather than of its input tokens, to the decoders (see the sketch below).
4. This transfers reasoning capabilities to diffusion decoders: generated tokens are answers, while input tokens are only questions.
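
To make points 3 and 4 concrete, here is a minimal sketch of extracting deep features of a language model's generated (answer) tokens rather than its input (question) tokens. It uses GPT-2 as a small stand-in for the LVLM, which is purely an assumption for illustration; ThinkDiff-LVLM's actual models and feature hooks may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Q: a lego duck, a lego bear, then a lego ...? A:"
inputs = tok(question, return_tensors="pt")

with torch.no_grad():
    out = lm.generate(**inputs, max_new_tokens=8, do_sample=False,
                      pad_token_id=tok.eos_token_id,
                      output_hidden_states=True, return_dict_in_generate=True)

# out.hidden_states has one entry per generation step; each entry holds per-layer
# states. Step 0 covers the input (question) tokens; later steps cover the newly
# generated (answer) tokens, one token per step when the KV cache is used.
input_feats = out.hidden_states[0][-1]                                   # (1, len_question, H)
answer_feats = torch.cat([s[-1] for s in out.hidden_states[1:]], dim=1)  # (1, ~len_answer, H)

# ThinkDiff-LVLM aligns features like answer_feats (the model's answer), not
# input_feats (the question), before handing them to the diffusion decoder.
print(input_feats.shape, answer_feats.shape)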

Abstract


This paper presents ThinkDiff, a novel alignment paradigm that enables multimodal in-context understanding and reasoning capabilities in text-to-image diffusion models by integrating the capabilities of vision-language models (VLMs). Directly aligning VLMs with diffusion decoders via diffusion loss requires complex and costly reasoning-based data pairs with multimodal inputs and image outputs. Instead, ThinkDiff leverages vision-language training as a proxy task, aligning VLMs to a large language model (LLM) decoder. This proxy task is feasible because the LLM decoder shares the same input feature space as diffusion decoders that use the corresponding LLM encoder for text embedding. As a result, alignment with diffusion decoders can be achieved by alignment with the LLM decoder. ThinkDiff effectively transfers multimodal in-context understanding and reasoning capabilities from VLMs to diffusion models, eliminating the need for complex reasoning-based multimodal datasets by using only readily available image-text pairs for training. Experimental results demonstrate that ThinkDiff significantly improves performance on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, raising the best accuracy from 19.2% to 46.3%, with only 5 hours of training on 4 A100 GPUs.


In this paper, we propose ThinkDiff, a novel alignment paradigm that leverages vision-language training for diffusion alignment. Instead of directly aligning VLMs with diffusion decoders, this proxy task aligns VLMs with large language model (LLM) decoders, requiring only widely available image-text pairs for training and eliminating the need for complex multimodal reasoning datasets.
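
A minimal PyTorch-style sketch of this proxy training, assuming a small two-layer aligner and a frozen T5 as the LLM; the module names, feature dimensions, and T5 checkpoint are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class Aligner(nn.Module):
    # Maps VLM token features into the T5 input feature space (dimensions are assumptions).
    def __init__(self, vlm_dim=4096, t5_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vlm_dim, t5_dim), nn.GELU(),
                                 nn.Linear(t5_dim, t5_dim))
    def forward(self, feats):
        return self.net(feats)

t5_name = "google/flan-t5-base"  # small stand-in for the T5-XXL encoder used by FLUX / SD3
tok = AutoTokenizer.from_pretrained(t5_name)
t5 = T5ForConditionalGeneration.from_pretrained(t5_name).eval()
for p in t5.parameters():        # the LLM decoder stays frozen; only the aligner is trained
    p.requires_grad_(False)

aligner = Aligner(vlm_dim=4096, t5_dim=t5.config.d_model)

def proxy_loss(vlm_token_feats, captions):
    # vlm_token_feats: (B, L, vlm_dim) deep features from a frozen VLM for an image.
    # The aligned features stand in for T5 encoder outputs, and the frozen T5 decoder
    # is asked to reproduce the caption, giving an ordinary cross-entropy loss.
    aligned = aligner(vlm_token_feats)                                    # (B, L, d_model)
    labels = tok(captions, return_tensors="pt", padding=True).input_ids
    labels[labels == tok.pad_token_id] = -100                             # ignore padding
    enc = BaseModelOutput(last_hidden_state=aligned)
    return t5(encoder_outputs=enc, labels=labels).loss

# Example step: loss = proxy_loss(torch.randn(2, 16, 4096), ["a dog on the beach", "a red car"])

Because the supervision is an ordinary caption loss, only image-text pairs are needed; no diffusion loss and no reasoning-based multimodal dataset is involved.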



General idea

Recent advanced diffusion models such as FLUX, Stable Diffusion 3, and PixArt-α adopt the encoder of an encoder-decoder LLM, T5, as their prompt encoder. This shared text encoder establishes a shared input feature space for both diffusion decoders and LLM decoders.




Therefore, aligning with diffusion decoders can be accomplished through the proxy task of aligning with LLM decoders via vision-language training.
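
A small demonstration of this shared input space, using flan-t5-base as a stand-in for the T5-XXL text encoder employed by FLUX, Stable Diffusion 3, and PixArt-α (the checkpoint choice and prompt are assumptions for illustration):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").eval()

prompt = "a corgi wearing sunglasses on the beach"
ids = tok(prompt, return_tensors="pt").input_ids

# 1) T5 encoder features: this is exactly the kind of tensor a T5-conditioned
#    diffusion decoder receives as its prompt embedding.
enc = t5.encoder(input_ids=ids)
prompt_embeds = enc.last_hidden_state            # (1, seq_len, d_model)

# 2) The same features also drive the T5 *language* decoder (output content is
#    incidental; the point is that both decoders read from one feature space).
with torch.no_grad():
    out = t5.generate(encoder_outputs=enc, max_new_tokens=20)
print(prompt_embeds.shape, tok.decode(out[0], skip_special_tokens=True))

# Because the input space is shared, aligning a VLM to the LLM decoder with cheap
# caption supervision also aligns it to the diffusion decoder.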

The figure below illustrates the difference between reconstruction-based diffusion finetuning and ThinkDiff.
(a) Reconstruction-based diffusion finetuning integrates image features using a diffusion loss, focusing on pixel-level image reconstruction without reasoning.
(b) ThinkDiff aligns a VLM to an LLM decoder by vision-language training on image-caption datasets. In inference (dotted lines), it transfers multimodal in-context reasoning capabilities from the VLM to a diffusion decoder.

[Figure: (a) Reconstruction-based diffusion finetuning vs. (b) ThinkDiff alignment via vision-language training; dotted lines denote inference]
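
For contrast with (b), here is a sketch of the reconstruction-based objective in (a), written as the standard noise-prediction loss used to finetune latent diffusion models; the function signature and the DDPM-style epsilon target are assumptions, not a specific method's code.

import torch
import torch.nn.functional as F

def reconstruction_finetune_loss(unet, image_latents, cond_features, scheduler):
    # Standard noise-prediction (epsilon) objective: the model learns to reconstruct
    # the target latents from noise, which rewards pixel-level copying of the
    # conditioning image rather than reasoning about it.
    noise = torch.randn_like(image_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (image_latents.shape[0],), device=image_latents.device)
    noisy_latents = scheduler.add_noise(image_latents, noise, t)
    pred = unet(noisy_latents, t, encoder_hidden_states=cond_features).sample
    return F.mse_loss(pred, noise)

# ThinkDiff (b) never touches this loss; it optimizes the caption cross-entropy
# shown in the alignment sketch above.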

Multimodal in-context reasoning generation

Given two images and three words, ThinkDiff-LVLM accurately captures the implicit logic in the inputs and generates an image corresponding to the third word while retaining common attributes from the input images. In contrast, the compared methods often fail to interpret the inputs correctly, leading to inaccurate and lower-quality results. Ground-truth text is provided for reference.


[Figure: Multimodal in-context reasoning generation results]




Multimodal in-context composition

ThinkDiff-CLIP follows the same alignment paradigm that leverages vision-language training for diffusion alignment: instead of directly aligning with diffusion decoders, the proxy task aligns with large language model (LLM) decoders, requiring only widely available image-text pairs for training and eliminating the need for complex multimodal reasoning datasets.

🌟Single image + text for video

ThinkDiff-CLIP is agnostic to the diffusion decoder and is versatile for integrating models like CogVideoX-5B, a text-to-video diffusion model. A background image is fed to the vision encoder and aligner network, along with a text prompt, and then to the CogVideoX decoder. The model generates a coherent video by seamlessly integrating the image and text, demonstrating ThinkDiff-CLIP's flexibility and broad applicability for multimodal generation tasks.
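
A rough sketch of this plug-and-play use. The pooling of CLIP patch tokens, the aligner and its dimensions, and the combined sequence length are assumptions for illustration; passing precomputed prompt embeddings follows the usual diffusers convention for T5-conditioned pipelines and may differ from the actual implementation.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel
from diffusers import CogVideoXPipeline

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Aligner mapping CLIP vision features (1024-d) into the T5 prompt space (4096-d); a sketch.
aligner = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

image = Image.open("background.jpg")
pixels = proc(images=image, return_tensors="pt").pixel_values
patch_feats = clip(pixel_values=pixels).last_hidden_state[:, 1:, :]       # (1, 256, 1024)
# Pool the 16x16 patch grid into 16 image tokens so the combined sequence stays
# within the pipeline's default text budget (226 tokens) -- a simplification.
img_tokens = patch_feats.reshape(1, 16, 16, 1024).mean(dim=2)             # (1, 16, 1024)
img_embeds = aligner(img_tokens).to(torch.bfloat16)                       # (1, 16, 4096)

prompt = "gentle snow starts to fall over this scene"
ids = pipe.tokenizer(prompt, return_tensors="pt", padding="max_length",
                     truncation=True, max_length=210).input_ids
txt_embeds = pipe.text_encoder(ids).last_hidden_state.to(torch.bfloat16)  # (1, 210, 4096)

# Aligned image tokens + text embeddings condition the video decoder together.
prompt_embeds = torch.cat([img_embeds, txt_embeds], dim=1)                # (1, 226, 4096)
video = pipe(prompt_embeds=prompt_embeds,
             negative_prompt_embeds=torch.zeros_like(prompt_embeds),
             num_inference_steps=50, num_frames=49).frames[0]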

🌟Single image + text

Figures below show results with a single image as input. FLUX Ultra, possibly finetuned by reconstruction-based diffusion training, performs well in "copy-pasting" the input image (FLUX Ultra + I), but struggles to maintain coherence when an additional text prompt is included (FLUX Ultra + I + T). In contrast, ThinkDiff-CLIP excels at understanding the semantic details of the input image and effectively integrates both image and text to generate logically coherent outputs (Ours + I and Ours + I + T).

[Figure: Single image + text composition results]

🌟Two images

ThinkDiff-CLIP is flexible and can handle multiple images and text prompts. It can combine semantic details from two images in a reasonable and coherent manner.



[Figure: Two-image composition results]

🌟Two images + text

With an additional text prompt (Ours + 2I + T), ThinkDiff-CLIP effectively incorporates the prompt into the generation.

[Figure: Two images + text composition results]

BibTeX



@article{mi2025thinkdiff,
  title={I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models},
  author={Mi, Zhenxing and Wang, Kuan-Chieh and Qian, Guocheng and Ye, Hanrong and Liu, Runtao and Tulyakov, Sergey and Aberman, Kfir and Xu, Dan},
  journal={arXiv preprint arXiv:2502.10458},
  year={2025}
}