CF-World

Are Text-to-Image Models Inductivist Turkeys?
A Counterfactual Benchmark for Causal Reasoning

The "Inductivist Turkey" Dilemma: We argue that current text-to-image models rely on memorized statistical correlations (priors) rather than a genuine understanding of objective world laws.

CF-World Benchmark: We introduce a novel benchmark with a three-level progressive framework (Factual, Explicit Counterfactual, and Implicit Counterfactual) that rigorously tests whether models can break free from conventional priors.

CF-Eval Pipeline: We propose an automated evaluation system with new metrics, including Prior Resistance Rate (PRR) and Reasoning Retention Rate (RRR), to quantify genuine reasoning capability.

Findings & Root Cause: Extensive evaluations show that state-of-the-art (SOTA) models fail on counterfactual prompts because they lack logical decoupling. High-dimensional statistical priors push them to fall back on language priors (concept lock-in) instead of separating the true causal variables from visual attributes.


	Qwen3-VL-235B					Gemini-3-Pro
Model	L1	L2	L3	PRR↑	RRR↑	L1	L2	L3	PRR↑	RRR↑
Open-Source Models
SANA 1.5	0.83	0.36	0.23	0.40	0.38	0.75	0.29	0.17	0.33	0.32
Janus-Pro-7B	0.80	0.29	0.21	0.32	0.39	0.69	0.21	0.11	0.25	0.24
Show-o2	0.77	0.32	0.20	0.36	0.35	0.66	0.25	0.14	0.31	0.28
Z-image	0.82	0.38	0.21	0.42	0.34	0.75	0.33	0.16	0.38	0.28
Lumina-DiMOO	0.76	0.33	0.20	0.38	0.35	0.70	0.29	0.17	0.35	0.32
BAGEL	0.80	0.29	0.17	0.32	0.32	0.73	0.29	0.15	0.34	0.28
BAGEL-CoT	0.88	0.43	0.29	0.46	0.44	0.82	0.41	0.26	0.45	0.41
OmniGen2	0.76	0.32	0.19	0.37	0.34	0.70	0.29	0.18	0.35	0.33
FLUX.2-dev	0.81	0.42	0.26	0.47	0.40	0.83	0.48	0.28	0.53	0.40
Qwen-Image	0.84	0.35	0.24	0.38	0.41	0.80	0.37	0.23	0.41	0.38
Closed-Source Models
Nano Banana	0.93	0.64	0.55	0.66	0.69	0.88	0.64	0.52	0.68	0.65
Nano Banana Pro	0.95	0.67	0.58	0.69	0.71	0.93	0.76	0.67	0.79	0.77
GPT-Image-1.5	0.92	0.66	0.49	0.69	0.60	0.91	0.73	0.55	0.77	0.64
Seedream 5.0	0.91	0.63	0.50	0.66	0.63	0.89	0.72	0.61	0.76	0.72
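
To make the gap between levels concrete, the short sketch below computes the absolute score drop from L1 to L2 and from L1 to L3 for a few rows, using only numbers copied from the Qwen3-VL-235B columns of the table above. The helper name `level_drops` and the choice of rows are illustrative assumptions. The sketch does not reproduce PRR or RRR, whose exact definitions are not given here.

```python
# Scores copied from the Qwen3-VL-235B columns of the table above: (L1, L2, L3).
scores = {
    "BAGEL": (0.80, 0.29, 0.17),
    "BAGEL-CoT": (0.88, 0.43, 0.29),
    "FLUX.2-dev": (0.81, 0.42, 0.26),
    "Nano Banana Pro": (0.95, 0.67, 0.58),
}


def level_drops(l1, l2, l3):
    """Return the absolute drops L1->L2 and L1->L3, rounded to 2 decimals.

    Hypothetical helper for illustration; this is not one of the CF-Eval metrics.
    """
    return round(l1 - l2, 2), round(l1 - l3, 2)


drops = {name: level_drops(*vals) for name, vals in scores.items()}

for name, (d12, d13) in drops.items():
    print(f"{name:16s} L1->L2 drop: {d12:.2f}   L1->L3 drop: {d13:.2f}")
```

Raw differences are used because they follow directly from the reported columns and need no assumption about how per-sample scores are aggregated.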

	Rule Decoupling		Attribute Decoupling		De-nominalization
Model	Fact.	CF	Fact.	CF	L2	De-nom.
SANA 1.5	0.31	0.30	0.94	0.83	0.36	0.37
Janus-Pro-7B	0.19	0.07	0.97	0.83	0.29	0.30
Show-o2	0.39	0.37	0.92	0.80	0.32	0.37
Z-image	0.61	0.53	0.98	0.89	0.38	0.43
Lumina-DiMOO	0.38	0.34	0.97	0.82	0.33	0.35
BAGEL	0.29	0.22	0.98	0.83	0.29	0.31
BAGEL-CoT	0.38	0.32	0.97	0.90	0.43	0.44
OmniGen2	0.33	0.25	0.96	0.81	0.32	0.35
FLUX.2-dev	0.53	0.52	0.99	0.90	0.42	0.51
Qwen-Image	0.40	0.40	0.97	0.86	0.35	0.37
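
As a rough summary of the decoupling table above, the sketch below averages the four Rule Decoupling and Attribute Decoupling columns over the ten open-source models. It assumes that "Fact." denotes factual prompts and "CF" denotes counterfactual prompts, following the benchmark's level names. The helper `column_means` is a hypothetical name introduced for illustration.

```python
from statistics import mean

# Rows copied from the table above:
# (rule Fact., rule CF, attribute Fact., attribute CF).
rows = {
    "SANA 1.5": (0.31, 0.30, 0.94, 0.83),
    "Janus-Pro-7B": (0.19, 0.07, 0.97, 0.83),
    "Show-o2": (0.39, 0.37, 0.92, 0.80),
    "Z-image": (0.61, 0.53, 0.98, 0.89),
    "Lumina-DiMOO": (0.38, 0.34, 0.97, 0.82),
    "BAGEL": (0.29, 0.22, 0.98, 0.83),
    "BAGEL-CoT": (0.38, 0.32, 0.97, 0.90),
    "OmniGen2": (0.33, 0.25, 0.96, 0.81),
    "FLUX.2-dev": (0.53, 0.52, 0.99, 0.90),
    "Qwen-Image": (0.40, 0.40, 0.97, 0.86),
}


def column_means(table):
    """Average each of the four score columns over all models."""
    columns = zip(*table.values())  # transpose rows into columns
    return tuple(round(mean(col), 3) for col in columns)


rule_fact, rule_cf, attr_fact, attr_cf = column_means(rows)

print(f"Rule decoupling:      Fact. {rule_fact:.3f}   CF {rule_cf:.3f}")
print(f"Attribute decoupling: Fact. {attr_fact:.3f}   CF {attr_cf:.3f}")
```

An unweighted mean over models is only a coarse aggregate; it treats every model equally and says nothing about per-prompt variance.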


Overview: Testing the Reasoning Limits of Text-to-Image Models

The CF-World Benchmark: Probing Genuine Causal Reasoning

The Three-Level Progressive Prompting Design

Scale and Disciplinary Diversity

Data Distribution & Prompt Examples

CF-Eval: Automated VLM Evaluation Pipeline

Key Results

Interactive Qualitative Comparison

Why Models Fail: A Decoupling Perspective
