Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

We introduce SSI-Bench, a benchmark constructed from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints.

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan

Tsinghua University

SSI-Bench teaser

Dataset

Explore the CMSR task taxonomy and dataset composition, browse samples in the interactive viewer, and see how SSI-Bench is constructed.

Overview

Taxonomy of Constrained-Manifold Spatial Reasoning (CMSR) tasks in SSI-Bench (1,000 samples; 1,160 unique images; average template prompt length 2,062 characters). #UI: number of unique images; #TPL: template prompt length (characters, including spaces and punctuation).
Distribution of categories in SSI-Bench (pie chart).
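
For readers who want to reproduce the composition statistics above (sample count, unique images, average template prompt length, and category distribution), a minimal sketch is given below. The file name ssi_bench.json and the field names images, template_prompt, and category are assumptions about the release format, not the actual schema.

```python
# Minimal sketch for reproducing the Overview statistics. The file name
# "ssi_bench.json" and the field names ("images", "template_prompt",
# "category") are assumed for illustration and may differ from the release.
import json
from collections import Counter
from statistics import mean

with open("ssi_bench.json") as f:
    samples = json.load(f)  # expected: a list of per-sample dicts

num_samples = len(samples)                                         # reported: 1,000
unique_images = {img for s in samples for img in s["images"]}      # reported: 1,160
avg_prompt_len = mean(len(s["template_prompt"]) for s in samples)  # reported: ~2,062 chars

print(num_samples, len(unique_images), round(avg_prompt_len))

# Category distribution behind the pie chart.
for category, count in Counter(s["category"] for s in samples).most_common():
    print(f"{category}: {count} ({100 * count / num_samples:.1f}%)")
```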

Dataset Viewer

Interactive sample viewer: each item shows a question, the original structure view, and the candidates.

Construction Pipeline

Construction pipeline

Leaderboard

Performance comparison of different models on SSI-Bench. Dark highlighting indicates the best result within each category; light highlighting indicates the second-best.

Models are grouped into Proprietary, Open-Source, and Baseline categories.

Analysis

Key results, thinking impact, and representative failure modes on CMSR—highlighting where current VLMs succeed, where they fail, and what is still missing for constraint-consistent 3D reasoning.

Key Results

Results are reported as taskwise accuracy and pairwise accuracy on SSI-Bench. Model accuracies are substantially lower than on prior spatial benchmarks set in largely unconstrained settings, suggesting that constrained-manifold reasoning is harder and less amenable to 2D shortcut cues.
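As one illustrative reading of these metrics, the sketch below treats taskwise accuracy as the mean of per-task exact-match accuracies and pairwise accuracy as the fraction of candidate pairs ordered consistently with the gold ranking. Both definitions and the record layout are assumptions for illustration; the paper's formal definitions take precedence.

```python
# Hedged sketch of the two metrics, assuming each record carries a task label,
# a gold ranking over candidates, and a predicted ranking. Both definitions
# below are illustrative readings, not the benchmark's official specification.
from collections import defaultdict
from itertools import combinations

def taskwise_accuracy(records):
    """Mean over tasks of each task's exact-match accuracy (assumed definition)."""
    per_task = defaultdict(list)
    for r in records:
        per_task[r["task"]].append(r["pred_ranking"] == r["gold_ranking"])
    return sum(sum(v) / len(v) for v in per_task.values()) / len(per_task)

def pairwise_accuracy(records):
    """Fraction of candidate pairs ordered consistently with the gold ranking."""
    correct = total = 0
    for r in records:
        gold = {c: i for i, c in enumerate(r["gold_ranking"])}
        pred = {c: i for i, c in enumerate(r["pred_ranking"])}
        for a, b in combinations(r["gold_ranking"], 2):
            total += 1
            correct += (gold[a] < gold[b]) == (pred[a] < pred[b])
    return correct / total

# Toy example (hypothetical records, illustrative only).
records = [
    {"task": "Volume", "gold_ranking": ["A", "B", "C"], "pred_ranking": ["A", "C", "B"]},
    {"task": "Multi-View", "gold_ranking": ["B", "A"], "pred_ranking": ["B", "A"]},
]
print(taskwise_accuracy(records), pairwise_accuracy(records))  # 0.5 0.75
```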
Constrained-manifold spatial reasoning remains hard
Even strong VLMs are far from human performance on SSI-Bench: the best model reaches 33.60% average taskwise accuracy, while humans achieve 91.60%. The random-ranking baseline is 12.85%.
Advanced open-source models still trail proprietary counterparts
A consistent gap separates open-source and proprietary models across geometric and topological tasks. Proprietary systems lead the leaderboard (~30%), while top open-weight models stay around 20%.
Progress is incremental; scaling is inconsistent
Performance improves over time across major lineages, but gains are uneven and often modest (e.g., Gemini-2.5 → Gemini-3). Within-family scaling yields limited gains (e.g., Qwen3-VL: 19.20% → 21.90%).
Examples from the leaderboard: Gemini-3-Flash (33.60%) and Gemini-3-Pro (29.50%) lead the proprietary group, while GLM-4.6V (22.20%) and Qwen3-VL-235B-A22B (21.90%) are among the strongest open models. Overall performance indicates that CMSR remains broadly unsolved.

Impact of Thinking on CMSR

We compare matched settings under the same evaluation protocol: Gemini-3-Pro with two thinking levels (high vs. low) and Qwen3-VL-30B-A3B with two variants (Thinking vs. Instruct) on the full benchmark.
Left: relationship between thinking-token usage and accuracy. Right: sub-category-level effects of thinking on CMSR.
  • Modest gains: Gemini-3-Pro improves from 27.1% (low) to 29.5% (high); Qwen3-VL-30B-A3B improves from 20.6% (Instruct) to 22.5% (Thinking).
  • Token usage is a weak proxy: accuracy is non-monotonic in token usage and often peaks at moderate usage, where extra tokens can reflect uncertainty rather than better inference (see the binning sketch after this list). Very high usage can correspond to longer deliberation over an incorrect structural hypothesis.
  • Task-dependent effects: thinking helps more when evidence is stable, but can be mixed or negative on 3D-consistency bottlenecks (e.g., Multi-View and Volume).
Takeaway: thinking provides incremental benefits, but doesn't resolve dominant failure modes that require stable 3D grounding and cross-view correspondence.
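One simple way to probe the weak, non-monotonic relationship between thinking-token usage and accuracy is to bin questions by token count and compare per-bin accuracy. The sketch below assumes per-question evaluation logs with a token count and a correctness flag; that log format is an assumption, not a released artifact.

```python
# Minimal sketch, assuming per-question evaluation logs with a thinking-token
# count and a correctness flag (this log format is an assumption, not a
# released artifact). Bins questions by token usage and reports per-bin accuracy.
import numpy as np

def accuracy_by_token_bin(token_counts, correct, n_bins=5):
    tokens = np.asarray(token_counts, dtype=float)
    hits = np.asarray(correct, dtype=float)
    # Quantile edges so each bin holds roughly the same number of questions.
    edges = np.quantile(tokens, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, tokens, side="right") - 1, 0, n_bins - 1)
    return [
        (edges[b], edges[b + 1], hits[bin_ids == b].mean())
        for b in range(n_bins)
        if np.any(bin_ids == b)
    ]

# Example with synthetic numbers (illustrative only).
rng = np.random.default_rng(0)
toks = rng.integers(200, 8000, size=500)
corr = rng.random(500) < 0.3
for lo, hi, acc in accuracy_by_token_bin(toks, corr):
    print(f"{lo:6.0f}-{hi:6.0f} tokens: accuracy {acc:.2%}")
```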

Error Analysis

To diagnose bottlenecks, we manually inspect sampled questions with Gemini-3-Pro as a representative model, using its reasoning traces to categorize failures.
Illustration of four representative error types in VLM spatial reasoning on SSI-Bench.
  • Member-extent: over-/under-extending members under occlusion and clutter, breaking endpoint-based comparisons.
  • Object recognition: misidentifying components/nodes and coarse orientations, hurting criteria like Ground Angle.
  • Computation & logic: optimizing the wrong quantity (e.g., 2D area vs. 3D volume) or applying invalid simplifications.
  • 3D spatial logic: weak depth and cross-view correspondence; unstable relational composition in Multi-View settings.
Overall, the performance gap reflects both imperfect visual grounding and limitations in globally consistent 3D reconstruction under manifold constraints—core capabilities for CMSR.