✨ CVPR 2026

StructXLIP

Enhancing Vision-language Models with Multimodal Structural Cues

Zanxi Ruan1 · Songqun Gao2 · Qiuyu Kong1,3 · Yiming Wang4 · Marco Cristani1,5

1University of Verona   2University of Trento   3Sapienza University of Rome   4Fondazione Bruno Kessler   5Reykjavik University

📄 Paper 💻 Code 🤗 Models 📚 BibTeX

Structure-Centric
Vision-Language Alignment

Edge-based cues are fundamental to visual understanding. We extend this principle to vision-language alignment by explicitly modeling structural information across modalities, improving fine-tuning on long, detail-rich captions for cross-modal retrieval. We propose StructXLIP (pronounced as /strʌk slɪp/), a structure-centric fine-tuning paradigm that extracts edge maps as structural proxies and filters captions to emphasize structural content. Three auxiliary losses enforce global edge–text alignment, local region–phrase matching, and consistency between edge and RGB representations. StructXLIP consistently outperforms recent approaches across diverse retrieval benchmarks and serves as a plug-and-play enhancement for vision-language models.

StructXLIP teaser figure

Two-Stage Fine-Tuning
with Structural Cues

StructXLIP operates in two stages. First, it extracts complementary structural views: edge maps for images via an edge detector, and structure-centric captions via lexicon filtering that removes appearance-related terms (colors, materials, textures). Second, it aligns these views at multiple granularities via three auxiliary losses added on top of the standard CLIP objective.
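The caption half of stage one can be sketched as a simple lexicon pass. The word list below is purely illustrative (the paper's actual lexicon is not reproduced here), and the edge half would typically call an off-the-shelf detector such as OpenCV's Canny:

```python
import re

# Hypothetical appearance lexicon; the paper's actual word list is not shown here.
APPEARANCE_TERMS = {
    "red", "blue", "green", "yellow", "wooden", "metallic",
    "plastic", "furry", "glossy", "striped",
}

def filter_caption(caption: str) -> str:
    """Drop appearance-related words (colors, materials, textures),
    keeping structure-centric content such as objects and layout."""
    tokens = re.findall(r"\w+|\S", caption)
    kept = [t for t in tokens if t.lower() not in APPEARANCE_TERMS]
    # Re-attach punctuation that removal left floating.
    return re.sub(r"\s+([.,])", r"\1", " ".join(kept))

caption = "A red wooden chair stands next to a striped rug near the window."
print(filter_caption(caption))
# -> A chair stands next to a rug near the window.
```

A real implementation would use POS tagging or a curated lexicon rather than exact word matching, but the principle is the same: the filtered caption T′ retains layout and object structure while shedding appearance detail.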

StructXLIP method overview
Figure 1. Overview of StructXLIP fine-tuning. (a) Structure-centric multimodal extraction: edge maps are generated from original images via an edge detector; captions undergo lexicon filtering to remove appearance terms, producing structure-centric text T′. (b) Structure-centric multimodal alignment: three auxiliary losses enforce global structural alignment (ℒ_{I′,T′}), local compositional alignment (ℒ^{local}_{I′,T′}), and consistency regularization (ℒ_{I,I′}). At inference, only the original image and caption are needed — no edge extraction or filtering required.

Three Structure-Centric Losses

① Structure-centric image-text alignment
ℒ_{I′,T′}

Aligns edge maps and structure-centric captions via symmetric InfoNCE.
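A minimal NumPy sketch of such a symmetric InfoNCE objective, on toy embeddings (the temperature value and batch-level formulation are standard CLIP-style assumptions, not the paper's exact implementation):

```python
import numpy as np

def symmetric_infonce(edge_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over edge/text embeddings.
    edge_emb, text_emb: (N, d) arrays; row i of each forms a matched pair."""
    e = edge_emb / np.linalg.norm(edge_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = e @ t.T / tau                       # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))

    def ce(lg):                                  # cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the edge->text and text->edge directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
loss_aligned = symmetric_infonce(x, x)                       # matched pairs: low loss
loss_random = symmetric_infonce(x, rng.normal(size=(4, 8)))  # mismatched: higher loss
print(loss_aligned < loss_random)
```

In training this would run on encoder outputs for edge maps I′ and filtered captions T′, pulling matched pairs together against in-batch negatives.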

② Local structure-centric image-text alignment
ℒ^{local}_{I′,T′}

Matches textual phrases with semantically related edge regions for fine-grained alignment.
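One simple way to realize phrase-to-region matching is max-similarity pooling: each phrase embedding attends to its best-matching edge-region embedding. This is a simplified stand-in for the local loss, not the paper's exact formulation:

```python
import numpy as np

def local_alignment_score(phrase_emb, region_emb):
    """Toy phrase-to-region matching: each phrase picks its most similar
    edge region; the score averages those best matches.
    phrase_emb: (P, d), region_emb: (R, d), rows assumed L2-normalized."""
    sim = phrase_emb @ region_emb.T     # (P, R) cosine similarities
    best = sim.max(axis=1)              # best-matching region per phrase
    return best.mean()

# Two phrases, three regions: phrase 0 matches region 1, phrase 1 matches region 2.
phrases = np.eye(4)[[1, 2]]
regions = np.eye(4)[[0, 1, 2]]
print(local_alignment_score(phrases, regions))  # -> 1.0, every phrase finds its region
```

Turning such a score into a loss (e.g., via a contrastive or margin objective over mismatched phrase-region pairs) yields the fine-grained gradient signal the local term provides.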

③ Consistency
regularization
ℒ_{I,I′}

Anchors edge representations to the original RGB embeddings, preventing structural drift during fine-tuning.
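A minimal sketch of such an anchoring term, here as a cosine-distance penalty between edge and RGB embeddings of the same images (a plausible stand-in for ℒ_{I,I′}, not necessarily the paper's exact form):

```python
import numpy as np

def consistency_loss(edge_emb, rgb_emb):
    """Penalize drift of edge-branch embeddings away from the RGB embeddings
    of the same images, via mean cosine distance."""
    e = edge_emb / np.linalg.norm(edge_emb, axis=1, keepdims=True)
    r = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    return float((1.0 - (e * r).sum(axis=1)).mean())

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
print(consistency_loss(z, z) < 1e-9)  # identical embeddings: zero loss -> True
```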

Full Objective

ℒ_total = ℒ_{I,T} + λ₁ ℒ_{I′,T′} + λ₂ ℒ_{I,I′} + λ₃ ℒ^{local}_{I′,T′}

Why It Works

By the Data Processing Inequality, I_MI(I′; T′) ≤ I_MI(I, T): the structural views carry no more mutual information than the original image-caption pairs. The auxiliary losses therefore optimize a harder objective, providing persistent gradients when the main loss flattens and acting as an implicit regularizer that expands the effective search space and improves convergence stability.
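The inequality can be checked numerically on a toy discrete distribution: merging states of I and T (a deterministic processing step, standing in for edge extraction and caption filtering) can only shrink the mutual information. The distribution below is purely illustrative:

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Toy joint over (I, T) with 4 states each, correlated along the diagonal.
joint_IT = np.full((4, 4), 0.02)
np.fill_diagonal(joint_IT, 0.19)
joint_IT /= joint_IT.sum()

# "Processing": merge states {0,1} and {2,3} on both sides (I -> I', T -> T').
joint_proc = joint_IT.reshape(2, 2, 2, 2).sum(axis=(1, 3))

print(mutual_info(joint_proc) <= mutual_info(joint_IT))  # DPI holds -> True
```

The structural views (I′, T′) play the role of the merged variables here: whatever alignment signal survives the coarsening is, by construction, harder to extract, which is exactly why the auxiliary losses keep supplying gradients after the main objective saturates.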


State-of-the-Art
Cross-Modal Retrieval

StructXLIP is evaluated on four benchmarks: SKETCHY (46k fashion images), INSECT (6k fine-grained biology), DOCCI (15k general scenes, avg. 136 words/caption), and DCI (7.8k dense captions, avg. 1000+ words).

All cells report R@1 / R@5. Δ is StructXLIP's gain over the best baseline in each column.

| Method | SKETCHY T→I | SKETCHY I→T | INSECT T→I | INSECT I→T | DOCCI T→I | DOCCI I→T | DCI T→I | DCI I→T |
|---|---|---|---|---|---|---|---|---|
| Long-CLIP (ECCV'24) | 54.32 / 80.14 | 52.76 / 80.31 | 8.20 / 23.83 | 9.41 / 24.78 | 64.49 / 87.67 | 63.08 / 87.45 | 59.23 / 80.89 | 60.13 / 81.44 |
| FineLIP (CVPR'25) | 40.59 / 71.16 | 40.33 / 72.11 | 8.46 / 23.32 | 6.86 / 23.75 | 67.80 / 90.22 | 66.39 / 89.12 | 66.13 / 85.34 | 64.58 / 84.59 |
| SmartCLIP (CVPR'25) | 50.73 / 81.09 | 51.30 / 80.83 | 4.84 / 16.84 | 4.66 / 15.46 | 74.92 / 94.08 | 74.91 / 94.04 | 69.88 / 86.64 | 70.94 / 87.04 |
| GOAL (CVPR'25) | 63.21 / 87.13 | 62.44 / 87.82 | 8.81 / 24.35 | 8.55 / 25.91 | 79.47 / 96.65 | 79.43 / 96.14 | 72.64 / 89.89 | 72.84 / 90.50 |
| StructXLIP (Ours) | 69.86 / 90.85 | 68.22 / 90.67 | 9.93 / 26.60 | 9.50 / 26.60 | 83.04 / 97.06 | 81.59 / 96.94 | 75.90 / 90.00 | 74.39 / 89.90 |
| Δ | ↑6.65 / ↑3.72 | ↑5.78 / ↑2.85 | ↑1.12 / ↑2.25 | ↑0.09 / ↑0.69 | ↑3.57 / ↑0.41 | ↑2.16 / ↑0.80 | ↑3.26 / ↑0.11 | ↑1.55 / ↓0.60 |
Qualitative retrieval results
Figure 2. (a) Qualitative retrieval results on DOCCI. StructXLIP retrieves structurally consistent images with correct layouts. (b) Grad-CAM attention maps: StructXLIP shows sharper, more localized attention on described objects (dress silhouette, tree branch on floor).

Key Contributions

  • 01 We show that injecting structure-centric, multimodal cues into contrastive learning substantially improves long-text vision-language alignment.
  • 02 StructXLIP outperforms recent fine-tuning methods without architectural complexity or inference overhead.
  • 03 A rigorous mutual information & optimization analysis (three lemmas + theorem) explains why structure-centric alignment stabilizes fine-tuning.
  • 04 Our structure-centric losses ℒ* serve as a universal plug-and-play booster across LoRA, DoRA, GOAL, SmartCLIP, SigLIP2, and more.
  • 05 Extensive experiments confirm robustness to different edge extractors (Canny, LoG, HED, LAD, P2S) and a strong inductive bias in low-data regimes.

BibTeX

If you find our work useful, please consider citing:

@inproceedings{ruan2026StructXLIP,
  title     = {StructXLIP: Enhancing Vision-Language Models
               with Multimodal Structural Cues},
  author    = {Ruan, Zanxi and Gao, Songqun and Kong, Qiuyu
               and Wang, Yiming and Cristani, Marco},
  booktitle = {Proceedings of the IEEE/CVF Conference on
               Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}