TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering

Phimsiri S.; Sunpawatr S.; Cherdchusakulchai R.; Kiawjak P.; Tosawadi T.; Tungjitnob S.; Trairattanapa V.; Vatathanavaro S.; Kudisthalert W.; Utintu C.; Saetan W.; Kongsawat N.; Borisuitsawat P.; Mahakijdechachai K.; Su-Inn N.; Thamwiwatthana E.; Suttichaya V.
Mahidol University
2026-04-16
2026-04-16
2025-01-01
Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025 (2025), 5358-5365
https://repository.li.mahidol.ac.th/handle/123456789/116233

Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL.

Computer Science
Conference Paper
SCOPUS
10.1109/ICCVW69036.2025.00559
2-s2.0-105035187032
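
The abstract names bounding-box-based cropping as part of the spatially guided visual prompting but does not spell out the procedure. The following is only a minimal sketch of what such a crop step might look like, not the authors' implementation: the function names `expand_and_clamp` and `crop_rows`, the 20% default margin, and the list-of-rows image representation are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels; x_max and y_max are exclusive.
Box = Tuple[int, int, int, int]


def expand_and_clamp(box: Box, image_size: Tuple[int, int], margin: float = 0.2) -> Box:
    """Grow a box by `margin` of its width/height on each side, clamped to the image.

    `image_size` is (width, height). The margin value is a hypothetical
    default; the source does not state how much context is kept.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive area")
    pad_x = int(round((x1 - x0) * margin))
    pad_y = int(round((y1 - y0) * margin))
    return (
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    )


def crop_rows(pixels: Sequence[Sequence[int]], region: Box) -> List[List[int]]:
    """Crop a row-major image (a sequence of pixel rows) to `region`."""
    x0, y0, x1, y1 = region
    return [list(row[x0:x1]) for row in pixels[y0:y1]]
```

In a pipeline of this kind, the cropped region would typically be passed to the vision-language model alongside the full frame, so the model sees both the target road user and its surrounding scene. Whether TrafficInternVL does exactly this is not stated in the abstract; the released repository is the authoritative reference.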