Single Image-Based Food Volume Estimation Using Monocular Depth-Prediction Networks

Graikos A., Charisis V., Iakovakis D., Hadjidimitriou S., Hadjileontiadis L. (2020) Single Image-Based Food Volume Estimation Using Monocular Depth-Prediction Networks. In: Antona M., Stephanidis C.

 

In this work, we present a system that estimates food volume from a single input image by utilizing the latest advancements in monocular depth estimation. We employ a state-of-the-art monocular depth-prediction network architecture, trained exclusively on videos obtained from the publicly available EPIC-KITCHENS dataset and our own collected food-video dataset. Alongside it, an instance segmentation network is trained on the UNIMIB2016 food-image dataset to detect and produce segmentation masks for each of the foods depicted in a given image. Combining the predicted depth map, the segmentation masks, and known camera intrinsic parameters, we generate three-dimensional (3D) point cloud representations of the target food objects and approximate their volumes with our point-cloud-to-volume algorithm. We evaluate our system on a test set consisting of images portraying various foods with their measured volumes, as well as combinations of foods placed in a single image.
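The pipeline described above combines a predicted depth map, a per-food segmentation mask, and the camera intrinsics to build a 3D point cloud, whose volume is then approximated. The sketch below illustrates the two geometric steps under a standard pinhole camera model; the grid-based height-map volume estimate is only an illustrative stand-in for the paper's point-cloud-to-volume algorithm, and all names and parameters are hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-space points
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    v, u = np.nonzero(mask)          # pixel coordinates inside the food mask
    z = depth[v, u]                  # predicted metric depth at those pixels
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def volume_from_points(points, cell=0.005):
    """Crude volume estimate (illustrative, not the paper's algorithm):
    bin points into a 2D grid on the x-y plane and sum, per grid cell,
    the maximum height of the food surface above the deepest observed
    point (taken here as a proxy for the table plane)."""
    base = points[:, 2].max()        # assume the largest depth ~ table level
    ij = np.floor(points[:, :2] / cell).astype(int)
    heights = {}
    for key, z in zip(map(tuple, ij), points[:, 2]):
        h = base - z                 # height of the surface above the base
        heights[key] = max(heights.get(key, 0.0), h)
    return sum(heights.values()) * cell * cell
```

In practice the base plane would come from detecting the plate or table rather than from the food points themselves, and the per-cell grid resolution trades off noise sensitivity against bias for small food items.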

 

