Title: Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach
Authors: Khlaisamniang P., Kerdthaisong K., Vorathammathorn S., Yongsatianchot N., Phimsiri H., Chinkamol A., Thitseesaeng T., Veerakanjana K., Kachai K., Ittichaiwong P., Saengja T.
Affiliation: Mahidol University
Type: Conference Paper
Date issued: 2025-01-01
Date deposited: 2025-10-12
Source: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2025), 482-491
ISSN: 2160-7508; eISSN: 2160-7516
URI: https://repository.li.mahidol.ac.th/handle/123456789/112492
DOI: 10.1109/CVPRW67362.2025.00053
Indexed in: SCOPUS (2-s2.0-105017845301)
Subjects: Computer Science; Engineering

Abstract: Accurate estimation of nutritional information from food images remains a challenging problem. Most existing approaches rely on deep image models fine-tuned with extensive food annotations, or require detailed user inputs (e.g., portion size, cooking method), both of which are prone to error. Motivated by the workflow of nutrition experts, we propose a two-step prompting framework leveraging off-the-shelf Multimodal Large Language Models (MLLMs). The first step deconstructs the dish into its components, listing major ingredients, portion sizes, and cooking details; the second step computes total calories and macronutrients from that component list. This approach avoids the need for heavy fine-tuning or large ingredient databases by instead harnessing the compositional reasoning capabilities of general-purpose MLLMs. We evaluate the method on both a subset of the Nutrition5k dataset (Nutrition320) and real-world samples from the Gindee application (Gindee121), achieving more accurate estimates than one-step direct queries. Additional experiments with visual prompts (bounding boxes, segmentation masks) further demonstrate the robustness and adaptability of our approach. Notably, our findings reveal that guiding MLLMs through a structured two-step reasoning process, separating "what is on the plate" from "how it translates nutritionally", substantially improves the reliability of image-based macronutrient estimation.
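
The abstract describes the two-step framework only in prose, so a minimal sketch may help make the flow concrete. The code below is an illustration, not the authors' implementation: this record does not name a model provider or give prompt wording, so the OpenAI Python SDK, the "gpt-4o" model name, and both prompt texts are assumptions chosen purely to show the step-1 (decompose the dish) / step-2 (compute nutrition from the component list) structure.

# Minimal sketch of the two-step prompting idea. NOT the authors' exact
# prompts or model: the SDK, model name, and prompt texts below are
# assumptions for illustration only.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str, image_b64: str | None = None) -> str:
    """Send one chat turn, optionally attaching a food image."""
    content = [{"type": "text", "text": prompt}]
    if image_b64 is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


def estimate_nutrition(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Step 1: deconstruct the dish -- "what is on the plate".
    components = ask(
        "List the major ingredients visible in this dish, with an "
        "estimated portion size (in grams) and cooking method for each.",
        image_b64,
    )

    # Step 2: "how it translates nutritionally" -- no image needed; the
    # model reasons over the structured description produced in step 1.
    return ask(
        "Given these dish components, estimate the total calories and "
        f"macronutrients (protein, fat, carbohydrate):\n{components}"
    )

Forcing the model to commit to an explicit component list before any nutrition arithmetic mirrors the nutritionist workflow the abstract describes; in a real system one would likely also constrain step 1 to emit machine-parseable JSON rather than free text.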