Leveraging Foundation Models for Causal Generative Modeling

2026-05-22 | Machine Learning

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors present FM-CGM, a modular framework that uses pretrained foundation models to reason about and edit images according to cause-and-effect relationships, without additional training. The method splits the pipeline into three components: a concept extractor that identifies key concepts in an image, a concept manipulator that changes those concepts, and a counterfactual generator that produces new images reflecting the changes. They also introduce Causal Semantic Guidance (CSG), a cross-attention-based mechanism that propagates an edit to causally dependent concepts while leaving unrelated regions unchanged. Their experiments indicate that the system can identify plausible causal structures and generate faithful counterfactual images.

causal generative modeling, pretrained foundation models, zero-shot reasoning, concept extraction, counterfactual generation, text-to-image diffusion, causal inference, cross-attention, semantic intervention
Authors
Aneesh Komanduri, Xintao Wu
Abstract
Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
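
To make the idea of an intervention that propagates to descendant concepts while other concepts stay invariant more concrete, here is a minimal Python sketch. It is not the authors' implementation: the causal graph, the concept names, and the helper functions `descendants` and `split_concepts` are all hypothetical, and the sketch covers only the graph bookkeeping, not the reasoning model, the diffusion model, or the cross-attention guidance itself.

```python
from collections import deque


def descendants(graph, node):
    """Return every node reachable from `node` in a DAG given as {parent: [children]}."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def split_concepts(graph, concepts, intervened):
    """Partition concepts into those an intervention may change and those held invariant."""
    # The intervened concept and all of its descendants are allowed to change.
    affected = {intervened} | descendants(graph, intervened)
    # Everything else (ancestors and unrelated concepts) should be preserved.
    invariant = [c for c in concepts if c not in affected]
    return sorted(affected), invariant


# Hypothetical toy causal graph over image concepts (an assumption for illustration).
toy_graph = {"weather": ["ground", "sky"], "ground": ["reflections"]}
toy_concepts = ["weather", "sky", "ground", "reflections", "building"]

affected, invariant = split_concepts(toy_graph, toy_concepts, "ground")
print("affected:", affected)
print("invariant:", invariant)
```

In this toy setup, intervening on `ground` marks `ground` and `reflections` as affected, while `weather`, `sky`, and `building` are treated as invariant. An intervention changes only downstream concepts, so the ancestor `weather` is left untouched.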