TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering

Phimsiri S.; Sunpawatr S.; Cherdchusakulchai R.; Kiawjak P.; Tosawadi T.; Tungjitnob S.; Trairattanapa V.; Vatathanavaro S.; Kudisthalert W.; Utintu C.; Saetan W.; Kongsawat N.; Borisuitsawat P.; Mahakijdechachai K.; Su-Inn N.; Thamwiwatthana E.; Suttichaya V.

dc.contributor.author	Phimsiri S.
dc.contributor.author	Sunpawatr S.
dc.contributor.author	Cherdchusakulchai R.
dc.contributor.author	Kiawjak P.
dc.contributor.author	Tosawadi T.
dc.contributor.author	Tungjitnob S.
dc.contributor.author	Trairattanapa V.
dc.contributor.author	Vatathanavaro S.
dc.contributor.author	Kudisthalert W.
dc.contributor.author	Utintu C.
dc.contributor.author	Saetan W.
dc.contributor.author	Kongsawat N.
dc.contributor.author	Borisuitsawat P.
dc.contributor.author	Mahakijdechachai K.
dc.contributor.author	Su-Inn N.
dc.contributor.author	Thamwiwatthana E.
dc.contributor.author	Suttichaya V.
dc.contributor.correspondence	Phimsiri S.
dc.contributor.other	Mahidol University
dc.date.accessioned	2026-04-16T18:49:06Z
dc.date.available	2026-04-16T18:49:06Z
dc.date.issued	2025-01-01
dc.description.abstract	Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for Track 2 of the AI City Challenge 2025. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning that updates only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL
dc.identifier.citation	Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025 (2025), 5358-5365
dc.identifier.doi	10.1109/ICCVW69036.2025.00559
dc.identifier.scopus	2-s2.0-105035187032
dc.identifier.uri	https://repository.li.mahidol.ac.th/handle/123456789/116233
dc.rights.holder	SCOPUS
dc.subject	Computer Science
dc.title	TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering
dc.type	Conference Paper
mu.datasource.scopus	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105035187032&origin=inward
oaire.citation.endPage	5365
oaire.citation.startPage	5358
oaire.citation.title	Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025
oairecerif.author.affiliation	Carnegie Mellon University
oairecerif.author.affiliation	Mahidol University
oairecerif.author.affiliation	King Mongkut's Institute of Technology Ladkrabang
oairecerif.author.affiliation	AI and Robotics Ventures

Collections

Scopus 2025
