Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan

Tsinghua University

International Conference on Machine Learning (ICML), 2026

Paper Dataset GitHub arXiv

Taxonomy of SCSR tasks in SSI-Bench — Taxonomy of *Constrained-Manifold Spatial Reasoning* (SCSR) tasks in SSI-Bench (1,000 samples; 1,160 unique images; average template prompt length 2,062 characters). #UI: number of unique images; #TPL: template prompt length (characters, including spaces and punctuation).

Dataset composition pie chart — Distribution of categories in SSI-Bench.

Category

Task

Original

Candidates

Your answer (ranking)

Click on column headers to sort the results

Proprietary

Open-Source

Baseline

Loading leaderboard…

Results are reported in terms of taskwise accuracy and pairwise accuracy on SSI-Bench. Compared with prior spatial benchmarks in largely unconstrained settings, accuracies are substantially lower, suggesting that constrained-manifold reasoning is harder and less amenable to 2D shortcut cues.

Constrained-manifold spatial reasoning remains hard

Even strong VLMs are far from human performance on SSI-Bench: the best model reaches 33.60% average taskwise accuracy, while humans achieve 91.60%. The random-ranking baseline is 12.85%.

Advanced open-source models still trail proprietary counterparts

A consistent gap separates open-source and proprietary models across geometric and topological tasks. Proprietary systems lead the leaderboard (~30%), while top open-weight models stay around ~20%.

Progress is incremental; scaling is inconsistent

Performance improves over time across major lineages, but gains are uneven and often modest (e.g., Gemini-2.5 → Gemini-3). Within-family scaling yields limited gains (e.g., Qwen3-VL: 19.20% → 21.90%).

Examples from the leaderboard: Gemini-3-Flash (33.60%) and Gemini-3-Pro (29.50%) lead the proprietary group, while GLM-4.6V (22.20%) and Qwen3-VL-235B-A22B (21.90%) are among the strongest open models. Overall performance indicates that SCSR remains broadly unsolved.

We compare matched settings under the same evaluation protocol: Gemini-3-Pro with two thinking levels (high vs. low) and Qwen3-VL-30B-A3B with two variants (Thinking vs. Instruct) on the full benchmark.

Relationship between thinking-token usage and accuracy, and sub-category level effects of thinking on SCSR. — *Left:* thinking-token usage vs. accuracy. *Right:* sub-category effects of thinking on **SCSR**.

Modest gains: Gemini-3-Pro improves from 27.1% (low) to 29.5% (high); Qwen3-VL-30B-A3B improves from 20.6% (Instruct) to 22.5% (Thinking).
Token usage is a weak proxy: accuracy is non-monotonic and often peaks at moderate usage, where extra tokens can reflect uncertainty rather than better inference. Very high usage can correspond to longer deliberation over an incorrect structural hypothesis.
Task-dependent effects: thinking helps more when evidence is stable, but can be mixed or negative on 3D-consistency bottlenecks (e.g., Multi-View and Volume).

Takeaway: thinking provides incremental benefits, but doesn't resolve dominant failure modes that require stable 3D grounding and cross-view correspondence.

To diagnose bottlenecks, we manually inspect sampled questions with Gemini-3-Pro as a representative model, using its reasoning traces to categorize failures.

Illustration of four common error types on SSI-Bench. — Representative error types in VLM spatial reasoning on **SSI-Bench**.

Member-extent: over-/under-extending members under occlusion and clutter, breaking endpoint-based comparisons.
Object recognition: misidentifying components/nodes and coarse orientations, hurting criteria like Ground Angle.
Computation & logic: optimizing the wrong quantity (e.g., 2D area vs. 3D volume) or applying invalid simplifications.
3D spatial logic: weak depth and cross-view correspondence; unstable relational composition in Multi-View settings.

Overall, the performance gap reflects both imperfect visual grounding and limitations in globally consistent 3D reconstruction under manifold constraints—core capabilities for SCSR.

Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

Dataset

Overview

Dataset Viewer

Construction Pipeline

Leaderboard

Analysis

Key Results

Impact of Thinking on SCSR

Error Analysis