CF-World

Are Text-to-Image Models Inductivist Turkeys?
A Counterfactual Benchmark for Causal Reasoning

The "Inductivist Turkey" Dilemma: We argue that current text-to-image models rely on memorized statistical correlations (priors) rather than a genuine understanding of objective world laws.
CF-World Benchmark: We introduce a novel benchmark with a three-level progressive framework (Factual, Explicit, and Implicit Counterfactuals) to rigorously test if models can break free from conventional priors.
CF-Eval Pipeline: We propose an automated evaluation system featuring new metrics like Prior Resistance Rate (PRR) and Reasoning Retention Rate (RRR) to accurately quantify true reasoning capabilities.
Findings & Root Cause: Extensive evaluations show SOTA models fail on counterfactuals due to a lack of logical decoupling. High-dimensional statistical priors force them to rely on language priors (concept lock-in) rather than separating true causal variables from visual attributes.

Overview: Testing the Reasoning Limits of Text-to-Image Models

Current text-to-image (T2I) models consistently generate high-quality images that conform to human common sense. However, a critical question remains: does this seemingly perfect understanding stem from a genuine grasp of objective physical laws and causal logic, or is it merely sophisticated pattern matching and rote memorization of high-frequency co-occurrences in massive training datasets?

To answer this, we introduce CF-World, a novel evaluation framework designed to test the true reasoning capabilities of T2I models. As our benchmark reveals, these models suffer from severe "concept lock-in." Because they primarily learn pixel-level correlations, they struggle to decouple independent causal variables from basic visual attributes. When tasked with generating images that systematically contradict real-world priors, T2I models default to familiar statistical habits rather than demonstrating true logical deduction.

Figure 1: Evaluating T2I models across factual and counterfactual scenarios. While models like FLUX.2-dev, Qwen-Image, and Nano Banana perform well on standard factual prompts (L1), their coherence drops sharply on explicit (L2) and implicit (L3) counterfactuals, demonstrating a failure to genuinely reason beyond learned correlations.

The CF-World Benchmark: Probing Genuine Causal Reasoning

The philosopher Bertrand Russell proposed the famous thought experiment of the Inductivist Turkey: observing daily feedings, the turkey infers an unbreakable law, that food will always arrive, until Thanksgiving exposes its "world knowledge" as a mere statistical correlation rather than a true understanding of the farmer's intent.

To rigorously determine whether T2I models are simply "inductivist turkeys" regurgitating memorized priors, CF-World utilizes a progressive, three-tiered evaluation structure. By systematically altering rules and removing explicit instructions, this framework isolates different cognitive capabilities.

Figure 2: Text-to-image models as inductivist turkeys. The progressive L1-to-L3 prompting design reveals how models fail to deduce causal changes (like water turning to ice at room temperature under altered physics) when explicit visual instructions are removed.
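
As a rough illustration of the three-tiered structure, each scientific principle can be pictured as a small record holding one prompt per level. The sketch below is not the benchmark's actual schema: the class names, field names, and prompt wording are all hypothetical, loosely modeled on the water-to-ice example in Figure 2. It shows the key design idea, that L2 states both the altered rule and its visual outcome, while L3 states only the altered rule and leaves the outcome to be deduced.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three progressive levels named in the benchmark."""
    L1_FACTUAL = "factual"
    L2_EXPLICIT_CF = "explicit counterfactual"
    L3_IMPLICIT_CF = "implicit counterfactual"


@dataclass(frozen=True)
class PromptTriple:
    """Hypothetical container: one principle, one prompt per level."""
    principle: str
    prompts: dict  # maps Level -> prompt text

    def prompt_for(self, level: Level) -> str:
        return self.prompts[level]


# Illustrative wording only; these are NOT prompts from the dataset.
example = PromptTriple(
    principle="Water is liquid at room temperature",
    prompts={
        # L1: the ordinary world, no altered rule.
        Level.L1_FACTUAL: "A glass of water on a table at room temperature.",
        # L2: altered rule AND the visual outcome are both spelled out.
        Level.L2_EXPLICIT_CF: (
            "In a world where water freezes at room temperature, "
            "a glass of water on a table has turned into solid ice."
        ),
        # L3: altered rule only; the model must deduce the outcome itself.
        Level.L3_IMPLICIT_CF: (
            "In a world where water freezes at room temperature, "
            "a glass of water on a table."
        ),
    },
)

for level in Level:
    print(f"{level.name}: {example.prompt_for(level)}")
```

Keeping the three prompts together per principle makes it straightforward to compare a model's scores across levels for the same underlying rule.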

The Three-Level Progressive Prompting Design

Scale and Disciplinary Diversity

At its core, CF-World is built upon 1,091 fundamental scientific principles, which are systematically expanded into 3,273 carefully crafted prompts (1,091 × 3, consistent with one prompt per principle at each of the three levels). To ensure a comprehensive evaluation of objective world knowledge, the dataset spans eight distinct disciplines.


CF-Eval: Automated VLM Evaluation Pipeline

To quantify generative capabilities at scale, we introduce CF-Eval, an automated Vision-Language Model (VLM) pipeline that evaluates images through a structured, three-step process:

Figure 3: VLM-based scoring pipeline. The multi-dimensional scoring pipeline features a sequential thresholding mechanism (S_L1 ≥ 0.5) and two metrics, Prior Resistance Rate (PRR) and Reasoning Retention Rate (RRR).
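
The sequential thresholding idea can be sketched in a few lines. This is only an illustration under stated assumptions: the S_L1 ≥ 0.5 gate comes from the Figure 3 caption, but the record layout, the function names, and the toy scores below are hypothetical, and the exact formulas for PRR and RRR are not given on this page, so the sketch stops at the gated averages rather than claiming to compute either metric.

```python
from statistics import mean

L1_THRESHOLD = 0.5  # the S_L1 >= 0.5 gate named in the Figure 3 caption


def passes_gate(sample, threshold=L1_THRESHOLD):
    """A sample counts toward counterfactual scoring only if its
    factual (L1) image was scored at or above the threshold."""
    return sample["s_l1"] >= threshold


def gated_summary(samples):
    """Average the L2 and L3 scores over samples that pass the L1 gate.

    ASSUMPTION: averaging gated scores is one plausible way to use the
    gate; it is not the page's definition of PRR or RRR.
    """
    kept = [s for s in samples if passes_gate(s)]
    if not kept:
        return {"n_kept": 0, "l2_given_l1_pass": None, "l3_given_l1_pass": None}
    return {
        "n_kept": len(kept),
        "l2_given_l1_pass": mean(s["s_l2"] for s in kept),
        "l3_given_l1_pass": mean(s["s_l3"] for s in kept),
    }


# Toy VLM scores in [0, 1]; invented for illustration, not benchmark data.
toy_samples = [
    {"s_l1": 0.9, "s_l2": 0.6, "s_l3": 0.3},
    {"s_l1": 0.4, "s_l2": 0.8, "s_l3": 0.7},  # fails the L1 gate, excluded
    {"s_l1": 0.7, "s_l2": 0.2, "s_l3": 0.1},
]

summary = gated_summary(toy_samples)
print(summary)
```

The point of gating on L1 is that a counterfactual failure is only informative when the model could render the factual scene in the first place; otherwise a low L2 or L3 score may reflect a rendering problem rather than a reasoning one.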

Key Results

Extensive evaluations across 13 state-of-the-art models reveal a stark reality: while models excel at factual generation, their performance collapses on counterfactuals. Our key findings include:

Table 1: Main evaluation results on the CF-World dataset. All metrics are scaled to 0-1. PRR and RRR are calculated to quantify reasoning robustness. The best performing open-weight models in each column are highlighted in blue, while the best proprietary models are highlighted in pink.

                    Qwen3-VL-235B                   Gemini-3-Pro
Model               L1    L2    L3    PRR↑  RRR↑    L1    L2    L3    PRR↑  RRR↑

Open-Source Models
SANA 1.5            0.83  0.36  0.23  0.40  0.38    0.75  0.29  0.17  0.33  0.32
Janus-Pro-7B        0.80  0.29  0.21  0.32  0.39    0.69  0.21  0.11  0.25  0.24
Show-o2             0.77  0.32  0.20  0.36  0.35    0.66  0.25  0.14  0.31  0.28
Z-image             0.82  0.38  0.21  0.42  0.34    0.75  0.33  0.16  0.38  0.28
Lumina-DiMOO        0.76  0.33  0.20  0.38  0.35    0.70  0.29  0.17  0.35  0.32
BAGEL               0.80  0.29  0.17  0.32  0.32    0.73  0.29  0.15  0.34  0.28
BAGEL-CoT           0.88  0.43  0.29  0.46  0.44    0.82  0.41  0.26  0.45  0.41
OmniGen2            0.76  0.32  0.19  0.37  0.34    0.70  0.29  0.18  0.35  0.33
FLUX.2-dev          0.81  0.42  0.26  0.47  0.40    0.83  0.48  0.28  0.53  0.40
Qwen-Image          0.84  0.35  0.24  0.38  0.41    0.80  0.37  0.23  0.41  0.38

Closed-Source Models
Nano Banana         0.93  0.64  0.55  0.66  0.69    0.88  0.64  0.52  0.68  0.65
Nano Banana Pro     0.95  0.67  0.58  0.69  0.71    0.93  0.76  0.67  0.79  0.77
GPT-Image-1.5       0.92  0.66  0.49  0.69  0.60    0.91  0.73  0.55  0.77  0.64
Seedream 5.0        0.91  0.63  0.50  0.66  0.63    0.89  0.72  0.61  0.76  0.72
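
To make the size of the collapse concrete, the short script below averages the L1 and L3 scores from the Qwen3-VL-235B columns of Table 1 for each model group. The numbers are copied from the table; the group averages are our own illustrative summary, not a metric defined by the benchmark, and the variable and function names are arbitrary.

```python
from statistics import mean

# (L1, L3) scores from the Qwen3-VL-235B columns of Table 1.
OPEN_SOURCE = {
    "SANA 1.5": (0.83, 0.23),
    "Janus-Pro-7B": (0.80, 0.21),
    "Show-o2": (0.77, 0.20),
    "Z-image": (0.82, 0.21),
    "Lumina-DiMOO": (0.76, 0.20),
    "BAGEL": (0.80, 0.17),
    "BAGEL-CoT": (0.88, 0.29),
    "OmniGen2": (0.76, 0.19),
    "FLUX.2-dev": (0.81, 0.26),
    "Qwen-Image": (0.84, 0.24),
}
CLOSED_SOURCE = {
    "Nano Banana": (0.93, 0.55),
    "Nano Banana Pro": (0.95, 0.58),
    "GPT-Image-1.5": (0.92, 0.49),
    "Seedream 5.0": (0.91, 0.50),
}


def group_means(rows):
    """Return (mean L1, mean L3) over all models in a group."""
    l1 = mean(scores[0] for scores in rows.values())
    l3 = mean(scores[1] for scores in rows.values())
    return l1, l3


open_l1, open_l3 = group_means(OPEN_SOURCE)
closed_l1, closed_l3 = group_means(CLOSED_SOURCE)

print(f"Open-source:   L1 = {open_l1:.3f}, L3 = {open_l3:.3f}")
print(f"Closed-source: L1 = {closed_l1:.3f}, L3 = {closed_l3:.3f}")
```

On these figures the two groups differ only modestly on factual prompts but widely on implicit counterfactuals, which is the pattern the key findings describe.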


Why Models Fail: A Decoupling Perspective

Our mechanistic investigation reveals that the precipitous degradation in counterfactual scenarios stems from a fundamental inability to decouple. Because current T2I models primarily learn pixel co-occurrences, they struggle to separate independent causal variables (logical reasoning) from basic attribute modules (visual recombination). To validate this deep-rooted entanglement, we designed three targeted experiments:

Model           Rule Decoupling       Attribute Decoupling  De-nominalization
                Fact.     CF          Fact.     CF          L2        De-norm
SANA 1.5        0.31      0.30        0.94      0.83        0.36      0.37
Janus-Pro-7B    0.19      0.07        0.97      0.83        0.29      0.30
Show-o2         0.39      0.37        0.92      0.80        0.32      0.37
Z-image         0.61      0.53        0.98      0.89        0.38      0.43
Lumina-DiMOO    0.38      0.34        0.97      0.82        0.33      0.35
BAGEL           0.29      0.22        0.98      0.83        0.29      0.31
BAGEL-CoT       0.38      0.32        0.97      0.90        0.43      0.44
OmniGen2        0.33      0.25        0.96      0.81        0.32      0.35
FLUX.2-dev      0.53      0.52        0.99      0.90        0.42      0.51
Qwen-Image      0.40      0.40        0.97      0.86        0.35      0.37
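
One simple way to read the De-nominalization columns is to compute, per model, how much the De-norm score differs from the plain L2 score. The snippet below does this with the values copied from the table above; the "gain" quantity and all names in the code are our own illustrative choices rather than a metric defined by the authors.

```python
# (L2, De-norm) scores copied from the De-nominalization columns above.
DENORM = {
    "SANA 1.5": (0.36, 0.37),
    "Janus-Pro-7B": (0.29, 0.30),
    "Show-o2": (0.32, 0.37),
    "Z-image": (0.38, 0.43),
    "Lumina-DiMOO": (0.33, 0.35),
    "BAGEL": (0.29, 0.31),
    "BAGEL-CoT": (0.43, 0.44),
    "OmniGen2": (0.32, 0.35),
    "FLUX.2-dev": (0.42, 0.51),
    "Qwen-Image": (0.35, 0.37),
}

# Illustrative "gain": De-norm score minus L2 score, rounded to the
# table's two-decimal precision to avoid floating-point noise.
gains = {
    model: round(de_norm - l2, 2)
    for model, (l2, de_norm) in DENORM.items()
}

largest = max(gains, key=gains.get)

for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<14} {gain:+.2f}")
print(f"Largest gain: {largest}")
```

Every model scores at least slightly higher in the De-norm column than in the L2 column, though for most of them the difference is only a few hundredths.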