6.8300 Final Project: Taming CLIP’s Captioning Bias: A COCO-Driven Analysis and Permutation Ensemble
Abstract: Vision-language models like CLIP struggle with multi-object scenes, often favoring prominent objects or those mentioned first in a caption. Using real-world COCO images, we show that CLIP’s caption-matching accuracy drops from 91.23% to 87.45% when the order of objects in the caption is reversed. To address this, we explore a post-hoc mitigation: a permutation ensemble that averages CLIP’s matching scores over all object orderings, which improves robustness to caption order and recovers accuracy to 90.04%. Our findings reveal a persistent order bias and offer a simple, effective strategy for improving CLIP’s reliability in complex scenes.
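To make the idea concrete, below is a minimal sketch of the permutation-ensemble scoring step, assuming captions are built from a list of object names with a hypothetical "a photo of X and Y" template and using the Hugging Face `CLIPModel`/`CLIPProcessor` API; this is an illustration of the technique, not the project’s actual code.

```python
from itertools import permutations

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def permutation_ensemble_score(image: Image.Image, objects: list[str]) -> float:
    """Average CLIP image-text similarity over all orderings of the objects."""
    # Hypothetical caption template; the project's exact wording may differ.
    captions = [
        "a photo of " + " and ".join(order) for order in permutations(objects)
    ]
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_permutations); averaging over the
    # permutation axis removes the dependence on object order.
    return outputs.logits_per_image.mean().item()


# Example usage: score candidate object sets for one COCO image and keep the best.
# image = Image.open("coco_example.jpg")
# score = permutation_ensemble_score(image, ["a dog", "a frisbee"])
```

Because the averaged score is invariant to how the objects are listed, a caption set scored this way cannot gain or lose rank purely from ordering effects, which is the robustness the ensemble is meant to provide.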