TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering
| dc.contributor.author | Phimsiri S. | |
| dc.contributor.author | Sunpawatr S. | |
| dc.contributor.author | Cherdchusakulchai R. | |
| dc.contributor.author | Kiawjak P. | |
| dc.contributor.author | Tosawadi T. | |
| dc.contributor.author | Tungjitnob S. | |
| dc.contributor.author | Trairattanapa V. | |
| dc.contributor.author | Vatathanavaro S. | |
| dc.contributor.author | Kudisthalert W. | |
| dc.contributor.author | Utintu C. | |
| dc.contributor.author | Saetan W. | |
| dc.contributor.author | Kongsawat N. | |
| dc.contributor.author | Borisuitsawat P. | |
| dc.contributor.author | Mahakijdechachai K. | |
| dc.contributor.author | Su-Inn N. | |
| dc.contributor.author | Thamwiwatthana E. | |
| dc.contributor.author | Suttichaya V. | |
| dc.contributor.correspondence | Phimsiri S. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2026-04-16T18:49:06Z | |
| dc.date.available | 2026-04-16T18:49:06Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for the AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL | |
| dc.identifier.citation | Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025 (2025), 5358-5365 | |
| dc.identifier.doi | 10.1109/ICCVW69036.2025.00559 | |
| dc.identifier.scopus | 2-s2.0-105035187032 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/116233 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Computer Science | |
| dc.title | TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering | |
| dc.type | Conference Paper | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105035187032&origin=inward | |
| oaire.citation.endPage | 5365 | |
| oaire.citation.startPage | 5358 | |
| oaire.citation.title | Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025 | |
| oairecerif.author.affiliation | Carnegie Mellon University | |
| oairecerif.author.affiliation | Mahidol University | |
| oairecerif.author.affiliation | King Mongkut's Institute of Technology Ladkrabang | |
| oairecerif.author.affiliation | AI and Robotics Ventures |
