Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach

dc.contributor.author: Khlaisamniang P.
dc.contributor.author: Kerdthaisong K.
dc.contributor.author: Vorathammathorn S.
dc.contributor.author: Yongsatianchot N.
dc.contributor.author: Phimsiri H.
dc.contributor.author: Chinkamol A.
dc.contributor.author: Thitseesaeng T.
dc.contributor.author: Veerakanjana K.
dc.contributor.author: Kachai K.
dc.contributor.author: Ittichaiwong P.
dc.contributor.author: Saengja T.
dc.contributor.correspondence: Khlaisamniang P.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2025-10-12T18:07:57Z
dc.date.available: 2025-10-12T18:07:57Z
dc.date.issued: 2025-01-01
dc.description.abstract: Accurate estimation of nutritional information from food images remains a challenging problem. Most existing approaches rely on deep image models fine-tuned with extensive food annotations or require detailed user inputs (e.g., portion size, cooking method), both of which are prone to error. Motivated by the workflow of nutrition experts, we propose a two-step prompting framework leveraging off-the-shelf Multimodal Large Language Models (MLLMs). The first step deconstructs the dish into its components, listing major ingredients, portion sizes, and cooking details, while the second step computes total calories and macronutrients. This approach alleviates the need for heavy fine-tuning or large ingredient databases by instead harnessing the compositional reasoning capabilities of general MLLMs. We evaluate the method on both a subset of the Nutrition5k dataset (Nutrition320) and real-world samples from the Gindee application (Gindee121), achieving more accurate estimates than one-step direct queries. Additional experiments with visual prompts (bounding boxes, segmentation masks) further demonstrate the robustness and adaptability of our approach. Notably, our findings reveal that guiding MLLMs through a structured two-step reasoning process, separating 'what is on the plate' from 'how it translates nutritionally', substantially improves the reliability of image-based macronutrient estimation.
dc.identifier.citation: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2025), 482-491
dc.identifier.doi: 10.1109/CVPRW67362.2025.00053
dc.identifier.eissn: 21607516
dc.identifier.issn: 21607508
dc.identifier.scopus: 2-s2.0-105017845301
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/112492
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.subject: Engineering
dc.title: Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach
dc.type: Conference Paper
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105017845301&origin=inward
oaire.citation.endPage: 491
oaire.citation.startPage: 482
oaire.citation.title: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
oairecerif.author.affiliation: King's College London
oairecerif.author.affiliation: Chulalongkorn University
oairecerif.author.affiliation: Siriraj Hospital
oairecerif.author.affiliation: Vidyasirimedhi Institute of Science and Technology
oairecerif.author.affiliation: Thammasat School of Engineering
oairecerif.author.affiliation: National Health Security Office
oairecerif.author.affiliation: Artificial Intelligence Association of Thailand
oairecerif.author.affiliation: PreceptorAI Team
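
As a reading aid for the abstract above, the following is a minimal sketch of the described two-step prompting flow, assuming the OpenAI Python SDK as the off-the-shelf MLLM backend. The model name, prompt wording, and the encode_image/ask/two_step_estimate helpers are illustrative assumptions, not the authors' prompts or code; the abstract also leaves open whether the image is re-sent in step 2, so here step 2 reasons only over the step-1 decomposition text.

# Illustrative sketch (not the authors' code): two-step nutrition estimation
# with an off-the-shelf multimodal LLM. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; model name and prompts are placeholders.
import base64
from typing import Optional, Tuple
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    # Read a local food photo and base64-encode it for the API payload.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask(prompt: str, image_b64: Optional[str] = None, model: str = "gpt-4o") -> str:
    # Send one text prompt (plus an optional image) and return the reply text.
    content = [{"type": "text", "text": prompt}]
    if image_b64 is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def two_step_estimate(image_path: str) -> Tuple[str, str]:
    img = encode_image(image_path)
    # Step 1: decompose the dish, nutritionist-style (ingredients,
    # portion sizes, cooking method).
    decomposition = ask(
        "List the major ingredients in this dish, with an estimated portion "
        "size in grams and the likely cooking method for each.",
        image_b64=img,
    )
    # Step 2: translate the decomposition into calories and macronutrients.
    nutrition = ask(
        "Given this ingredient breakdown:\n" + decomposition + "\n"
        "Estimate total calories, protein (g), carbohydrates (g), and fat (g). "
        "Answer as JSON with keys calories, protein_g, carbs_g, fat_g."
    )
    return decomposition, nutrition

if __name__ == "__main__":
    parts, macros = two_step_estimate("food_photo.jpg")
    print(parts)
    print(macros)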
