Multi-Modal LLM Benchmarking Framework

Tech: InstructBLIP, CLIP, Transformers, MS-COCO

  • Fine-tuned four vision-language models (InstructBLIP variants, CLIP, and a custom ResNet-LSTM) on the MS-COCO dataset, achieving a state-of-the-art CIDEr score of 145.8 with improved inference speed.
  • Implemented a comparative evaluation pipeline using SPICE, CIDEr, BLEU, and METEOR metrics across 10K+ image-caption pairs, identifying the optimal model for medical image description tasks.
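
The comparative evaluation step can be illustrated with a minimal sketch. It is not the project's actual pipeline: it assumes captions are stored as plain dictionaries, uses a simplified sentence-level BLEU (clipped n-gram precision with a brevity penalty, whitespace tokenization) as a stand-in for the full metric suite, and the names `bleu`, `rank_models`, and the toy model labels are hypothetical. SPICE, CIDEr, and METEOR would normally come from an established caption-evaluation toolkit rather than hand-written code.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=2):
    """Simplified sentence-level BLEU (assumption: whitespace tokenization,
    uniform n-gram weights, no smoothing)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length to the candidate.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def rank_models(predictions, references, max_n=2):
    """Rank models by mean score, best first.

    predictions: {model_name: {image_id: caption}}   (hypothetical layout)
    references:  {image_id: [reference captions]}
    """
    scores = {}
    for model, captions in predictions.items():
        per_image = [bleu(captions[i], references[i], max_n) for i in references]
        scores[model] = sum(per_image) / len(per_image)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data for illustration only; not results from the project.
references = {
    "img1": ["a dog runs on the beach", "a dog is running on sand"],
    "img2": ["two people ride bicycles", "a couple cycling down a road"],
}
predictions = {
    "model_a": {"img1": "a dog runs on the beach", "img2": "two people ride bicycles"},
    "model_b": {"img1": "a cat sleeps indoors", "img2": "two people ride horses"},
}
ranking = rank_models(predictions, references)
```

Here `ranking[0]` holds the best-scoring model and its mean score. Averaging per-image scores keeps the sketch short; corpus-level aggregation, as BLEU is usually reported, would pool n-gram counts across all images instead.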