Final Research Report
Self-Supervised Learning ViT Models
Expert-grounded attention study testing whether vision models look at the architectural evidence human experts mark as diagnostic.
Abstract
Self-supervised vision models achieve strong downstream performance, but high classification accuracy alone does not reveal whether those models attend to the same visual evidence that human experts consider diagnostically important. This project studies that question in the WikiChurches setting, where expert bounding boxes identify architectural features such as arches, windows, towers, and facade elements that matter for style recognition.
The report evaluates seven vision models spanning self-distillation, masked autoencoding, and multimodal contrastive pretraining, plus a supervised CNN baseline. It then measures attention alignment against 631 expert boxes on 139 annotated church images using IoU, Coverage, MSE, KL divergence, and Earth Mover's Distance (EMD).
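To make the alignment metrics concrete, the sketch below shows one plausible way to score a single attention map against the union of expert boxes. It is an illustration under stated assumptions, not the report's implementation: the function names are hypothetical, boxes are assumed to be integer pixel coordinates `(x0, y0, x1, y1)`, IoU@90 is assumed to mean thresholding attention at its 90th percentile, Coverage is assumed to be the share of attention mass inside the boxes, and the KL direction (expert mask relative to attention) is an assumption.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize (x0, y0, x1, y1) pixel boxes into one boolean union mask."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def iou_at_percentile(attention, mask, percentile=90.0):
    """IoU between pixels at or above the attention percentile and the expert mask."""
    threshold = np.percentile(attention, percentile)
    hot = attention >= threshold
    union = np.logical_or(hot, mask).sum()
    return float(np.logical_and(hot, mask).sum() / union) if union else 0.0

def coverage(attention, mask):
    """Fraction of total attention mass that falls inside the expert mask."""
    total = attention.sum()
    return float(attention[mask].sum() / total) if total > 0 else 0.0

def kl_divergence(attention, mask, eps=1e-8):
    """KL(expert || attention) after normalizing both maps to sum to 1.

    The direction and the eps smoothing are assumptions of this sketch.
    """
    p = mask.astype(np.float64) + eps
    q = attention.astype(np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Note that a percentile threshold behaves poorly on maps with many tied values: if more than 90% of pixels share the minimum value, the threshold equals that minimum and every pixel passes. Smooth, upsampled attention maps largely avoid this.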
Results
DINOv3 is the strongest frozen spatial-prior model.
It leads the default-method benchmark on IoU@90, Coverage, KL, and EMD, and is the only model to clear all calibrated continuous baselines across MSE, KL, and EMD.
Fine-tuning helps, but unevenly.
CLIP gains the most after full fine-tuning, while MAE's largest shift appears around Renaissance pediment geometry. DINO mostly preserves an already useful frozen alignment.
The easy cases are shared across model families.
Frozen DINOv3 alignment predicts where CLIP improves after fine-tuning. This suggests that difficulty often comes from the image rather than the model: hard examples tend to be hard because the expert-marked feature is small, thin, peripheral, or visually entangled.
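One way to check a cross-model relationship of this kind is a per-image rank correlation between one model's frozen alignment score and another model's change after fine-tuning. The sketch below is a hypothetical illustration, not the analysis used in the report; `rank` and `spearman` are names introduced here, and the inputs are assumed to be equal-length arrays of per-image scores.

```python
import numpy as np

def rank(values):
    """Ordinal ranks (0 = smallest); ties are broken by position, not averaged."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])
```

A value near 1 would indicate that images where the frozen model already aligns well are also the images where the fine-tuned model gains most. Because ties are not averaged here, a library routine with proper tie handling is preferable for real score distributions.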
Head specialization is sparse and descriptive.
DINO-family dominant heads stay stable, MAE is partly reshaped, and CLIP reorganizes toward later adapted heads. The clearest alignments sit on larger structures such as portals, arches, and rose windows.