Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach

dc.contributor.author: Khlaisamniang P.
dc.contributor.author: Kerdthaisong K.
dc.contributor.author: Vorathammathorn S.
dc.contributor.author: Yongsatianchot N.
dc.contributor.author: Phimsiri H.
dc.contributor.author: Chinkamol A.
dc.contributor.author: Thitseesaeng T.
dc.contributor.author: Veerakanjana K.
dc.contributor.author: Kachai K.
dc.contributor.author: Ittichaiwong P.
dc.contributor.author: Saengja T.
dc.contributor.correspondence: Khlaisamniang P.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2025-10-12T18:07:57Z
dc.date.available: 2025-10-12T18:07:57Z
dc.date.issued: 2025-01-01
dc.description.abstract: Accurate estimation of nutritional information from food images remains a challenging problem. Most existing approaches rely on deep image models fine-tuned with extensive food annotations or require detailed user inputs (e.g., portion size, cooking method), both of which are prone to error. Motivated by the workflow of nutrition experts, we propose a two-step prompting framework leveraging off-the-shelf Multimodal Large Language Models (MLLMs). The first step deconstructs the dish into its components, listing major ingredients, portion sizes, and cooking details, while the second step computes total calories and macronutrients. This approach alleviates the need for heavy fine-tuning or large ingredient databases by instead harnessing the compositional reasoning capabilities of general MLLMs. We evaluate the method on both a subset of the Nutrition5k dataset (Nutrition320) and real-world samples from the Gindee application (Gindee121), achieving more accurate estimates than one-step direct queries. Additional experiments with visual prompts (bounding boxes, segmentation masks) further demonstrate the robustness and adaptability of our approach. Notably, our findings reveal that guiding MLLMs through a structured two-step reasoning process, separating 'what is on the plate' from 'how it translates nutritionally', substantially improves the reliability of image-based macronutrient estimation.
dc.identifier.citation: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2025), 482-491
dc.identifier.doi: 10.1109/CVPRW67362.2025.00053
dc.identifier.eissn: 21607516
dc.identifier.issn: 21607508
dc.identifier.scopus: 2-s2.0-105017845301
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/112492
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.subject: Engineering
dc.title: Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach
dc.type: Conference Paper
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105017845301&origin=inward
oaire.citation.endPage: 491
oaire.citation.startPage: 482
oaire.citation.title: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
oairecerif.author.affiliation: King's College London
oairecerif.author.affiliation: Chulalongkorn University
oairecerif.author.affiliation: Siriraj Hospital
oairecerif.author.affiliation: Vidyasirimedhi Institute of Science and Technology
oairecerif.author.affiliation: Thammasat School of Engineering
oairecerif.author.affiliation: National Health Security Office
oairecerif.author.affiliation: Artificial Intelligence Association of Thailand
oairecerif.author.affiliation: PreceptorAI Team
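
As a reading aid for the abstract above, the following is a minimal sketch of the described two-step prompting flow, assuming the OpenAI Python SDK as the off-the-shelf MLLM backend. The model name, prompt wording, and the encode_image/ask/two_step_estimate helpers are illustrative assumptions, not the authors' prompts or code; the abstract also leaves open whether the image is re-sent in step 2, so here step 2 reasons only over the step-1 decomposition text.

# Illustrative sketch (not the authors' code): two-step nutrition estimation
# with an off-the-shelf multimodal LLM. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; model name and prompts are placeholders.
import base64
from typing import Optional, Tuple
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    # Read a local food photo and base64-encode it for the API payload.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask(prompt: str, image_b64: Optional[str] = None, model: str = "gpt-4o") -> str:
    # Send one text prompt (plus an optional image) and return the reply text.
    content = [{"type": "text", "text": prompt}]
    if image_b64 is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def two_step_estimate(image_path: str) -> Tuple[str, str]:
    img = encode_image(image_path)
    # Step 1: decompose the dish, nutritionist-style (ingredients,
    # portion sizes, cooking method).
    decomposition = ask(
        "List the major ingredients in this dish, with an estimated portion "
        "size in grams and the likely cooking method for each.",
        image_b64=img,
    )
    # Step 2: translate the decomposition into calories and macronutrients.
    nutrition = ask(
        "Given this ingredient breakdown:\n" + decomposition + "\n"
        "Estimate total calories, protein (g), carbohydrates (g), and fat (g). "
        "Answer as JSON with keys calories, protein_g, carbs_g, fat_g."
    )
    return decomposition, nutrition

if __name__ == "__main__":
    parts, macros = two_step_estimate("food_photo.jpg")
    print(parts)
    print(macros)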
