TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering

dc.contributor.author: Phimsiri S.
dc.contributor.author: Sunpawatr S.
dc.contributor.author: Cherdchusakulchai R.
dc.contributor.author: Kiawjak P.
dc.contributor.author: Tosawadi T.
dc.contributor.author: Tungjitnob S.
dc.contributor.author: Trairattanapa V.
dc.contributor.author: Vatathanavaro S.
dc.contributor.author: Kudisthalert W.
dc.contributor.author: Utintu C.
dc.contributor.author: Saetan W.
dc.contributor.author: Kongsawat N.
dc.contributor.author: Borisuitsawat P.
dc.contributor.author: Mahakijdechachai K.
dc.contributor.author: Su-Inn N.
dc.contributor.author: Thamwiwatthana E.
dc.contributor.author: Suttichaya V.
dc.contributor.correspondence: Phimsiri S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2026-04-16T18:49:06Z
dc.date.available: 2026-04-16T18:49:06Z
dc.date.issued: 2025-01-01
dc.description.abstract: Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for the AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL
dc.identifier.citation: Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025 (2025), 5358-5365
dc.identifier.doi: 10.1109/ICCVW69036.2025.00559
dc.identifier.scopus: 2-s2.0-105035187032
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/116233
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.title: TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering
dc.type: Conference Paper
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105035187032&origin=inward
oaire.citation.endPage: 5365
oaire.citation.startPage: 5358
oaire.citation.title: Proceedings 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2025
oairecerif.author.affiliation: Carnegie Mellon University
oairecerif.author.affiliation: Mahidol University
oairecerif.author.affiliation: King Mongkut's Institute of Technology Ladkrabang
oairecerif.author.affiliation: AI and Robotics Ventures
