✨ CVPR 2026

StructXLIP

Enhancing Vision-language Models with Multimodal Structural Cues

Zanxi Ruan1 · Songqun Gao2 · Qiuyu Kong1,3 · Yiming Wang4 · Marco Cristani1,5

1University of Verona   2University of Trento   3Sapienza University of Rome   4Fondazione Bruno Kessler   5Reykjavik University

📄 Paper 💻 Code 🤗 Models 📚 BibTeX

Structure-Centric
Vision-Language Alignment

Edge-based cues are fundamental to visual understanding. We extend this principle to vision-language alignment by explicitly modeling structural information across modalities, improving fine-tuning on long, detail-rich captions for cross-modal retrieval. We propose StructXLIP (pronounced as /strʌk slɪp/), a structure-centric fine-tuning paradigm that extracts edge maps as structural proxies and filters captions to emphasize structural content. Three auxiliary losses enforce global edge–text alignment, local region–phrase matching, and consistency between edge and RGB representations. StructXLIP consistently outperforms recent approaches across diverse retrieval benchmarks and serves as a plug-and-play enhancement for vision-language models.

StructXLIP teaser figure

Two-Stage Fine-Tuning
with Structural Cues

StructXLIP operates in two stages. First, it extracts complementary structural views: edge maps for images via an edge detector, and structure-centric captions via lexicon filtering that removes appearance-related terms (colors, materials, textures). Second, it aligns these views at multiple granularities via three auxiliary losses added on top of the standard CLIP objective.
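The caption half of stage one can be sketched as a simple lexicon pass. The word list below is purely illustrative (the paper's actual lexicon is not reproduced here), and the edge half would typically call an off-the-shelf detector such as OpenCV's Canny:

```python
import re

# Hypothetical appearance lexicon; the paper's actual word list is not shown here.
APPEARANCE_TERMS = {
    "red", "blue", "green", "yellow", "wooden", "metallic",
    "plastic", "furry", "glossy", "striped",
}

def filter_caption(caption: str) -> str:
    """Drop appearance-related words (colors, materials, textures),
    keeping structure-centric content such as objects and layout."""
    tokens = re.findall(r"\w+|\S", caption)
    kept = [t for t in tokens if t.lower() not in APPEARANCE_TERMS]
    # Re-attach punctuation that removal left floating.
    return re.sub(r"\s+([.,])", r"\1", " ".join(kept))

caption = "A red wooden chair stands next to a striped rug near the window."
print(filter_caption(caption))
# -> A chair stands next to a rug near the window.
```

A real implementation would use POS tagging or a curated lexicon rather than exact word matching, but the principle is the same: the filtered caption T′ retains layout and object structure while shedding appearance detail.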

StructXLIP method overview
Figure 1. Overview of StructXLIP fine-tuning. (a) Structure-centric multimodal extraction: edge maps are generated from original images via an edge detector; captions undergo lexicon filtering to remove appearance terms, producing structure-centric text T′. (b) Structure-centric multimodal alignment: three auxiliary losses enforce global structural alignment (ℒ_{I′,T′}), local compositional alignment (ℒ^{local}_{I′,T′}), and consistency regularization (ℒ_{I,I′}). At inference, only the original image and caption are needed — no edge extraction or filtering required.

Three Structure-Centric Losses

① Structure-centric image-text alignment
ℒ_{I′,T′}

Aligns edge maps and structure-centric captions via symmetric InfoNCE.
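A minimal NumPy sketch of such a symmetric InfoNCE objective, on toy embeddings (the temperature value and batch-level formulation are standard CLIP-style assumptions, not the paper's exact implementation):

```python
import numpy as np

def symmetric_infonce(edge_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over edge/text embeddings.
    edge_emb, text_emb: (N, d) arrays; row i of each forms a matched pair."""
    e = edge_emb / np.linalg.norm(edge_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = e @ t.T / tau                       # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))

    def ce(lg):                                  # cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the edge->text and text->edge directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
loss_aligned = symmetric_infonce(x, x)                       # matched pairs: low loss
loss_random = symmetric_infonce(x, rng.normal(size=(4, 8)))  # mismatched: higher loss
print(loss_aligned < loss_random)
```

In training this would run on encoder outputs for edge maps I′ and filtered captions T′, pulling matched pairs together against in-batch negatives.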

② Local structure-centric image-text alignment
ℒ^{local}_{I′,T′}

Matches textual phrases with semantically related edge regions for fine-grained alignment.
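One simple way to realize phrase-to-region matching is max-similarity pooling: each phrase embedding attends to its best-matching edge-region embedding. This is a simplified stand-in for the local loss, not the paper's exact formulation:

```python
import numpy as np

def local_alignment_score(phrase_emb, region_emb):
    """Toy phrase-to-region matching: each phrase picks its most similar
    edge region; the score averages those best matches.
    phrase_emb: (P, d), region_emb: (R, d), rows assumed L2-normalized."""
    sim = phrase_emb @ region_emb.T     # (P, R) cosine similarities
    best = sim.max(axis=1)              # best-matching region per phrase
    return best.mean()

# Two phrases, three regions: phrase 0 matches region 1, phrase 1 matches region 2.
phrases = np.eye(4)[[1, 2]]
regions = np.eye(4)[[0, 1, 2]]
print(local_alignment_score(phrases, regions))  # -> 1.0, every phrase finds its region
```

Turning such a score into a loss (e.g., via a contrastive or margin objective over mismatched phrase-region pairs) yields the fine-grained gradient signal the local term provides.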

③ Consistency
regularization
ℒ_{I,I′}

Anchors edge representations to the original RGB embeddings, preventing structural drift during fine-tuning.
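A minimal sketch of such an anchoring term, here as a cosine-distance penalty between edge and RGB embeddings of the same images (a plausible stand-in for ℒ_{I,I′}, not necessarily the paper's exact form):

```python
import numpy as np

def consistency_loss(edge_emb, rgb_emb):
    """Penalize drift of edge-branch embeddings away from the RGB embeddings
    of the same images, via mean cosine distance."""
    e = edge_emb / np.linalg.norm(edge_emb, axis=1, keepdims=True)
    r = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    return float((1.0 - (e * r).sum(axis=1)).mean())

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
print(consistency_loss(z, z) < 1e-9)  # identical embeddings: zero loss -> True
```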

Full Objective

ℒ_total = ℒ_{I,T} + λ₁ ℒ_{I′,T′} + λ₂ ℒ_{I,I′} + λ₃ ℒ^{local}_{I′,T′}

Why It Works

By the Data Processing Inequality, I_MI(I′; T′) ≤ I_MI(I, T): the structural views carry no more mutual information than the original image-caption pairs. The auxiliary losses therefore optimize a harder objective, providing persistent gradients when the main loss flattens and acting as an implicit regularizer that expands the effective search space and improves convergence stability.
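The inequality can be checked numerically on a toy discrete distribution: merging states of I and T (a deterministic processing step, standing in for edge extraction and caption filtering) can only shrink the mutual information. The distribution below is purely illustrative:

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Toy joint over (I, T) with 4 states each, correlated along the diagonal.
joint_IT = np.full((4, 4), 0.02)
np.fill_diagonal(joint_IT, 0.19)
joint_IT /= joint_IT.sum()

# "Processing": merge states {0,1} and {2,3} on both sides (I -> I', T -> T').
joint_proc = joint_IT.reshape(2, 2, 2, 2).sum(axis=(1, 3))

print(mutual_info(joint_proc) <= mutual_info(joint_IT))  # DPI holds -> True
```

The structural views (I′, T′) play the role of the merged variables here: whatever alignment signal survives the coarsening is, by construction, harder to extract, which is exactly why the auxiliary losses keep supplying gradients after the main objective saturates.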


State-of-the-Art
Cross-Modal Retrieval

StructXLIP is evaluated on four benchmarks: SKETCHY (46k fashion images), INSECT (6k fine-grained biology), DOCCI (15k general scenes, avg. 136 words/caption), and DCI (7.8k dense captions, avg. 1000+ words).

All cells report R@1 / R@5. Δ is StructXLIP's gain over the best baseline in each column.

| Method | SKETCHY T→I | SKETCHY I→T | INSECT T→I | INSECT I→T | DOCCI T→I | DOCCI I→T | DCI T→I | DCI I→T |
|---|---|---|---|---|---|---|---|---|
| Long-CLIP (ECCV'24) | 54.32 / 80.14 | 52.76 / 80.31 | 8.20 / 23.83 | 9.41 / 24.78 | 64.49 / 87.67 | 63.08 / 87.45 | 59.23 / 80.89 | 60.13 / 81.44 |
| FineLIP (CVPR'25) | 40.59 / 71.16 | 40.33 / 72.11 | 8.46 / 23.32 | 6.86 / 23.75 | 67.80 / 90.22 | 66.39 / 89.12 | 66.13 / 85.34 | 64.58 / 84.59 |
| SmartCLIP (CVPR'25) | 50.73 / 81.09 | 51.30 / 80.83 | 4.84 / 16.84 | 4.66 / 15.46 | 74.92 / 94.08 | 74.91 / 94.04 | 69.88 / 86.64 | 70.94 / 87.04 |
| GOAL (CVPR'25) | 63.21 / 87.13 | 62.44 / 87.82 | 8.81 / 24.35 | 8.55 / 25.91 | 79.47 / 96.65 | 79.43 / 96.14 | 72.64 / 89.89 | 72.84 / 90.50 |
| StructXLIP (Ours) | 69.86 / 90.85 | 68.22 / 90.67 | 9.93 / 26.60 | 9.50 / 26.60 | 83.04 / 97.06 | 81.59 / 96.94 | 75.90 / 90.00 | 74.39 / 89.90 |
| Δ | ↑6.65 / ↑3.72 | ↑5.78 / ↑2.85 | ↑1.12 / ↑2.25 | ↑0.09 / ↑0.69 | ↑3.57 / ↑0.41 | ↑2.16 / ↑0.80 | ↑3.26 / ↑0.11 | ↑1.55 / ↓0.60 |
Qualitative retrieval results
Figure 2. (a) Qualitative retrieval results on DOCCI. StructXLIP retrieves structurally consistent images with correct layouts. (b) Grad-CAM attention maps: StructXLIP shows sharper, more localized attention on described objects (dress silhouette, tree branch on floor).

Key Contributions

  • 01 We show that injecting structure-centric, multimodal cues into contrastive learning substantially improves long-text vision-language alignment.
  • 02 StructXLIP outperforms recent fine-tuning methods without architectural complexity or inference overhead.
  • 03 A rigorous mutual information & optimization analysis (three lemmas + theorem) explains why structure-centric alignment stabilizes fine-tuning.
  • 04 Our structure-centric losses ℒ* serve as a universal plug-and-play booster across LoRA, DoRA, GOAL, SmartCLIP, SigLIP2, and more.
  • 05 Extensive experiments confirm robustness to different edge extractors (Canny, LoG, HED, LAD, P2S) and a strong inductive bias in low-data regimes.

BibTeX

If you find our work useful, please consider citing:

@inproceedings{ruan2026StructXLIP,
  title     = {StructXLIP: Enhancing Vision-Language Models
               with Multimodal Structural Cues},
  author    = {Ruan, Zanxi and Gao, Songqun and Kong, Qiuyu
               and Wang, Yiming and Cristani, Marco},
  booktitle = {Proceedings of the IEEE/CVF Conference on
               Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}