Product-ProtoNet

A simple architecture for classifying supermarket products, using just a few example images

Abstract

Airlab, a collaboration between TU Delft and Ahold Delhaize, is developing Albert, a robot tailored to work in a complex supermarket environment. Key to Albert is a product detection and classification module that tells it which products to grasp and where they are located on a shelf. Albert’s existing YOLO‑based product detector has a significant issue: adding new products without re‑training the whole model is impossible. In a dynamic supermarket environment with an ever‑changing stock, this is a major limitation.

This problem is the main focus of this paper. It is addressed through few‑shot learning, which predicts the similarity between query and target products. Adding a new product then only requires supplying new target images. Few‑shot learning also requires significantly less training data. In supermarkets with 300,000 different products, needing only a few images per product is a major advantage. For this reason, this paper aims to deploy a few‑shot model that classifies products as either the target class or a non‑target class for Albert’s picking task, and defines the following research question: “What few‑shot classifier can identify products in a supermarket environment, is able to detect non‑target classes, and best meets the requirements of deployment on a robotic platform like Albert?”

This paper first analyses the potential of using TRIDENT and P>M>F, two state‑of‑the‑art few‑shot models, for deployment on Albert, and evaluates them against the requirements of this paper. P>M>F performs better on all requirements, which makes it the preferred model for Albert. However, it still requires adjustments to work well: its inference time is too high for use on Albert, and it cannot classify a query image as not being the target product.

For this reason, this paper uses P>M>F’s two key ideas to construct Product‑ProtoNet, a new Albert‑suitable few‑shot model: 1) using a good pre‑trained feature extractor; and 2) comparing query images to a set of classes and matching only to the likeliest. P>M>F uses a ProtoNet model for classification that essentially does this. Like ProtoNet, Product‑ProtoNet constructs class prototypes from one or more example images per class. Product‑ProtoNet then uses a sigmoid classifier to predict whether a query image has the same class as a prototype. It compares each query image to a set of similar class prototypes (helper prototypes) and assigns it to the likeliest class. Product‑ProtoNet uses a ViT pre‑trained with DINO to extract image features. To bring down inference time, Product‑ProtoNet computes product prototypes before deployment.
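
The classification step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the use of cosine similarity, the `scale`, `bias` and `threshold` values, and the toy 3‑dimensional vectors standing in for DINO ViT features are all assumptions made for illustration only.

```python
import math

def mean_prototype(embeddings):
    """Average equal-length feature vectors into one class prototype."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity (assumed metric; the abstract does not name one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def is_target(query, target_name, prototypes,
              scale=10.0, bias=-5.0, threshold=0.5):
    """Score the query against the target and helper prototypes.

    The query counts as the target only if the target prototype is the
    likeliest match AND its sigmoid score reaches the threshold.
    scale, bias and threshold are hypothetical values, not from the paper.
    """
    scores = {
        name: sigmoid(scale * cosine_similarity(query, proto) + bias)
        for name, proto in prototypes.items()
    }
    best = max(scores, key=scores.get)
    return (best == target_name and scores[best] >= threshold), scores

# Offline step: prototypes are computed once, before deployment.
# Toy support embeddings (hypothetical stand-ins for extracted features).
SUPPORT = {
    "cola": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "cola_zero": [[0.7, 0.7, 0.0], [0.6, 0.8, 0.1]],   # helper class
    "orange_juice": [[0.0, 0.1, 1.0]],                 # helper class
}
PROTOTYPES = {name: mean_prototype(vecs) for name, vecs in SUPPORT.items()}

# Online step: only the query embedding has to be compared at run time.
match, scores = is_target([0.95, 0.05, 0.05], "cola", PROTOTYPES)
print(match, scores)
```

In this sketch, adding a new product only means adding one more entry to `SUPPORT` and recomputing its prototype; nothing is re‑trained. A query that resembles a helper class more than the target, or that scores low against every prototype, is rejected as non‑target.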

With an accuracy of 99.1% on product classes seen during training and 99.8% on novel classes in a realistic supermarket setting, a low inference time of 2.89 ms, and a memory usage below 4 GB, Product‑ProtoNet is the only model that passes all requirements of this paper. When deployed on Albert, Product‑ProtoNet successfully guides the robot to the right product in 97% of attempts. This makes Product‑ProtoNet the only few‑shot classifier evaluated in this paper that can identify products in a supermarket environment, is able to detect non‑target classes, and meets the requirements of deployment on a robotic platform like Albert.
