DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

Yuhao Jia, Wenhan Tan

Published: 2024/3/11

Abstract

Diffusion-driven text-to-image (T2I) generation has achieved remarkable progress in recent years. To improve T2I models' capability in numerical and spatial reasoning, recent methods employ layouts as an intermediary that bridges large language models (LLMs) and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, limiting accessibility and scalability, and they struggle to generate images from prompts involving multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout-prediction stage is divided into numerical & spatial reasoning and bounding-box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image generation stage is divided into two steps that synthesize objects in order of increasing difficulty. Experiments on the HRS and NSR-1K benchmarks show that our method outperforms previous approaches by notable margins. In addition, visual results and a user study demonstrate that our approach significantly improves perceptual quality, especially when generating multiple objects from complex textual prompts.
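To make the staged decomposition concrete, below is a minimal Python sketch of the pipeline shape the abstract describes. All function names, the stub outputs, and the difficulty heuristic are illustrative assumptions, not the authors' implementation; a real system would replace the stubs with LLM calls and a layout-conditioned diffusion model.

```python
from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x: float
    y: float
    w: float
    h: float  # normalized [0, 1] coordinates

# --- Stage 1: layout prediction, split into two subtasks ---

def reason_counts_and_relations(prompt: str) -> str:
    """Subtask 1a: have an (even lightweight) LLM enumerate object
    counts and pairwise spatial relations as plain text, before any
    coordinates are produced."""
    # Placeholder: a real system would query an LLM here.
    return "2 cats; cat_2 is to the right of cat_1"

def plan_bounding_boxes(reasoning: str) -> list[Box]:
    """Subtask 1b: translate the textual reasoning into concrete
    bounding boxes (the visual-planning step)."""
    # Placeholder layout consistent with the reasoning above.
    return [Box("cat", 0.10, 0.30, 0.35, 0.40),
            Box("cat", 0.55, 0.30, 0.35, 0.40)]

# --- Stage 2: layout-to-image generation, easy objects first ---

def split_by_difficulty(boxes: list[Box]) -> tuple[list[Box], list[Box]]:
    """Hypothetical heuristic: treat large boxes as 'easy' and small
    ones as 'difficult'; the paper's actual criterion may differ."""
    easy = [b for b in boxes if b.w * b.h >= 0.10]
    hard = [b for b in boxes if b.w * b.h < 0.10]
    return easy, hard

def generate(prompt: str) -> None:
    reasoning = reason_counts_and_relations(prompt)
    boxes = plan_bounding_boxes(reasoning)
    easy, hard = split_by_difficulty(boxes)
    # Pass 1 would synthesize the easy objects; pass 2 would refine
    # the image so the difficult objects match their planned boxes.
    print("pass 1 (easy):", easy)
    print("pass 2 (hard):", hard)

if __name__ == "__main__":
    generate("two cats sitting side by side")
```

The point of the decomposition is that each stub solves a narrower problem than end-to-end layout prediction, which is what lets smaller LLMs reach accuracy comparable to large-scale ones.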