Jan 14, 2021 • **9 min read**

A coin recognition system is a great showcase of the power of modern deep learning image processing models. While coins themselves are relatively simple objects, many coins look very similar, and it is surprisingly challenging to build a system that can reliably identify a particular coin. That is why this task frequently shows up in Kaggle competitions.

At Grid Labs, we decided to give this problem another shot with the latest visual search techniques. Our goal is to enable real-time recognition of any coin from our collection on a mobile device.

In this blog post, we will explore how to implement a coin recognition system end-to-end: from the dataset collection to model design and training and service deployment.

On the surface, coin recognition can be viewed as a classification problem. We can define two classes per coin for reverse and obverse and train a classifier to predict a coin class. This task can be accomplished even with an AutoML approach given a sufficient number of coin images in the dataset.

However, this approach has a significant flaw — adding new coins to the catalog will require us to re-train the model to recognize new classes, which is a major operational hurdle.

A much better way is to approach this task as a similarity search problem. We will train a visual model that represents each coin as a point in a high-dimensional vector space in such a way that images of the same coin are clustered together and separated from images of other coins. A picture snapped by a user's mobile phone will be encoded by the model into a point in this vector space, and the nearest vectors to this point will be returned as the search result. This process is called k-nearest-neighbors (KNN) search, and it is supported by specialized indexes that make such searches fast.

If our model is trained well, it will be able to properly separate and cluster images of coins it has never seen. This way, we will not need to retrain the model when new coins are added to our catalog; we just add another embedding to our vector index.
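To make the idea concrete, here is a minimal sketch of such a catalog, assuming embeddings are already available as plain vectors. The `CoinCatalog` class, the 4-dimensional toy vectors, and the labels are all hypothetical and only illustrate that adding a coin means adding a vector; this is an exact brute-force search, not a production index.

```python
import numpy as np


class CoinCatalog:
    """Toy in-memory vector index: one embedding per catalog image."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def add(self, label, embedding):
        # Adding a coin only appends a vector; the model itself is untouched.
        vec = np.asarray(embedding, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, vec])
        self.labels.append(label)

    def search(self, query, k=3):
        # Exact k-nearest-neighbors search, ranked by cosine similarity.
        q = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        scores = self.vectors @ q / np.maximum(norms, 1e-12)
        top = np.argsort(-scores)[:k]
        return [(self.labels[i], float(scores[i])) for i in top]


catalog = CoinCatalog(dim=4)
catalog.add("1 euro / obverse", [0.9, 0.1, 0.0, 0.1])
catalog.add("1 euro / reverse", [0.1, 0.9, 0.1, 0.0])
# A new coin joins the catalog later, with no retraining step.
catalog.add("5 rupees / obverse", [0.0, 0.1, 0.9, 0.2])

results = catalog.search([0.05, 0.15, 0.8, 0.25], k=2)
print(results)
```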

A journey of a thousand miles begins with a single dataset. We have chosen to be adventurous and decided to create a coin dataset from scratch. As a base, we have used a small personal numismatic collection: about 400 coins from 38 countries. It turns out that with modern models even this modest dataset was quite enough to achieve high-quality coin recognition.

Each coin was photographed from both sides using light and dark backgrounds:

The dataset was manually labeled with the country, currency, denomination, and side of the coin (obverse or reverse). Based on this information, we defined the coin class. The basic rules for defining the class were:

- Each side of the coin is a different class.
- We treat coins that differ only in an issue year as one class.
- Coins of the same country, currency, and denomination that look different because of a special collector's release are treated as different classes.

We prepared a testing dataset that follows the way people take pictures of their coins in real life: held in a hand, against complex multi-color backgrounds.
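The labeling rules above can be sketched as a small function that derives a class key from the labels. The `CoinPhoto` record, its field names, and the `design` tag used to distinguish special releases are hypothetical; the original labeling format is not described in this post.

```python
from typing import NamedTuple


class CoinPhoto(NamedTuple):
    country: str
    currency: str
    denomination: str
    side: str                 # "obverse" or "reverse"
    year: int
    design: str = "regular"   # hypothetical tag for special releases


def coin_class(photo: CoinPhoto) -> str:
    # The issue year is deliberately left out of the class key,
    # while the side and the design variant are part of it.
    parts = (photo.country, photo.currency, photo.denomination,
             photo.design, photo.side)
    return "/".join(p.strip().lower() for p in parts)


older = CoinPhoto("Germany", "euro", "1", "obverse", 2002)
newer = CoinPhoto("Germany", "euro", "1", "obverse", 2015)
back = CoinPhoto("Germany", "euro", "1", "reverse", 2002)
special = CoinPhoto("Germany", "euro", "1", "obverse", 2002, design="special")

print(coin_class(older))
```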

Coins can be photographed in a variety of conditions: different photos may have different contrast and lighting, noise from the camera, photos may be taken with various backgrounds. Backgrounds can seriously interfere with coin recognition and our model should be prepared to handle those variations. Additionally, on many images, the coin represents only a small part of the image. Deep learning models require a fixed, relatively small size of the input image, so naive resizing of the input image can lose essential coin features. This means that before feeding the image to the coin recognition model we need to locate the coin and remove the coin’s background.

This is the job of the segmentation model.

Segmentation models assign each pixel in the image to a particular class with some probability, thus creating a semantic mask of the image, identifying objects of interest. In our case, we can train a simple binary segmentation model for “coin” and “background” classes.

To prepare a training dataset for the segmentation model, we used a combination of manual and automated labeling. We manually labeled about 100 wild images and also leveraged the Hough Circle Transform to produce sufficiently good segmentation masks for about 300 additional images. Next, we applied a set of data augmentation techniques to improve data variability with the backgrounds we expect: tables, tablecloths, people's hands, etc.

Here is a high-level recipe for this augmentation:

- For the coin image, add random padding ranging from 1:1 to 1:36 of the original.
- Crop a part of the background image with random size and position.
- Apply random transformations (rotation, blur, and brightness) to the coin and background images separately.
- Merge coin and background images using a segmentation mask.
- Use blending with Gaussian and Laplacian pyramids to avoid sharp pixel transitions between background and coin.
- Apply augmentations like random noise and shadows to the generated image.
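A simplified sketch of the cropping and merging steps of this recipe is shown below. It is not the pipeline used for the project: the function names are hypothetical, a small box blur of the mask stands in for Gaussian and Laplacian pyramid blending, and padding, random transformations, noise, and shadows are left out.

```python
import numpy as np


def random_crop(background, height, width, rng):
    # Pick a random window of the requested size inside the background.
    max_y = background.shape[0] - height
    max_x = background.shape[1] - width
    y = int(rng.integers(0, max_y + 1))
    x = int(rng.integers(0, max_x + 1))
    return background[y:y + height, x:x + width]


def soften(mask, radius=1):
    # Box-blur the binary mask so the coin edge fades into the background.
    padded = np.pad(mask.astype(np.float32), radius, mode="edge")
    size = 2 * radius + 1
    h, w = mask.shape
    acc = np.zeros((h, w), dtype=np.float32)
    for dy in range(size):
        for dx in range(size):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (size * size)


def composite(coin, mask, background, rng):
    # Merge the coin into a random background patch using the soft mask.
    h, w = mask.shape
    patch = random_crop(background, h, w, rng).astype(np.float32)
    alpha = soften(mask)[..., None]
    blended = alpha * coin.astype(np.float32) + (1.0 - alpha) * patch
    return blended.astype(np.uint8), (mask > 0).astype(np.uint8)


rng = np.random.default_rng(0)
coin = np.full((8, 8, 3), 200, dtype=np.uint8)          # flat grey "coin"
yy, xx = np.mgrid[0:8, 0:8]
mask = ((yy - 3.5) ** 2 + (xx - 3.5) ** 2 <= 9).astype(np.uint8)
background = np.full((32, 32, 3), 50, dtype=np.uint8)   # flat dark "table"

image, target_mask = composite(coin, mask, background, rng)
```

The function returns both the generated image and the matching ground-truth mask, so each synthetic sample can be used directly as a training pair for the segmentation model.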

Here are some examples of the generated images:

For the segmentation model, we used the classical U-Net architecture with a pre-trained EfficientNet-b4 encoder.

We trained the U-Net with the Dice loss. The Dice coefficient is computed as (2 * area of overlap between the predicted mask and the ground truth) / (area of predicted mask + area of ground truth), and the Dice loss is one minus this coefficient.

To evaluate the quality of the segmentation, we used the traditional Intersection over Union (IoU) metric.
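Both quantities are easy to compute on small arrays. The sketch below uses the common convention that the Dice loss is one minus the Dice coefficient and applies a probability threshold before computing IoU; the function names and the toy 2x2 masks are illustrative only.

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-7):
    # 2 * overlap / (area of prediction + area of ground truth).
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def dice_loss(pred, target):
    # Perfect overlap gives a coefficient of 1, hence a loss of 0.
    return 1.0 - dice_coefficient(pred, target)


def iou(probabilities, target, threshold=0.5):
    # Binarize predicted probabilities, then intersection over union.
    pred = np.asarray(probabilities) >= threshold
    target = np.asarray(target).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


probs = np.array([[0.9, 0.4],
                  [0.2, 0.8]])
truth = np.array([[1, 1],
                  [0, 0]])

print(dice_loss(probs, truth), iou(probs, truth, 0.3), iou(probs, truth, 0.5))
```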

Even with our small dataset, we achieved pretty decent metrics:

| Threshold | 0.3 | 0.5 |
|---|---|---|
| IoU without data augmentation | 0.952 | 0.949 |
| IoU with data augmentation | 0.976 | 0.960 |

To improve the segmentation quality, we used a refinement approach in which the segmentation model is applied twice: first for object detection and then for the segmentation itself.

First, we predict the mask on the original image, filter the mask, and derive the rectangular bounding box. Then, we crop the image and predict with the same model on the cropped image. This approach helped to deal with pictures with small coins and coins with holes.
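The two-pass procedure can be sketched as follows. Here `predict_mask` is a hypothetical callable that stands in for the trained U-Net and returns per-pixel coin probabilities; the mask filtering step and the resizing of the crop to the model input size are omitted for brevity.

```python
import numpy as np


def mask_bbox(mask, margin=0):
    # Bounding box (top, bottom, left, right) of the non-zero mask pixels.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    top = max(int(ys.min()) - margin, 0)
    bottom = min(int(ys.max()) + 1 + margin, h)
    left = max(int(xs.min()) - margin, 0)
    right = min(int(xs.max()) + 1 + margin, w)
    return top, bottom, left, right


def refine(image, predict_mask, threshold=0.5, margin=2):
    # Pass 1: coarse mask on the full image, used only to locate the coin.
    coarse = predict_mask(image) >= threshold
    box = mask_bbox(coarse, margin)
    if box is None:
        return coarse
    top, bottom, left, right = box
    # Pass 2: run the same model on the crop, where the coin fills the frame.
    fine = predict_mask(image[top:bottom, left:right]) >= threshold
    full = np.zeros_like(coarse)
    full[top:bottom, left:right] = fine
    return full


def toy_model(img):
    # Stand-in "model": treats bright pixels as coin pixels.
    return (img > 128).astype(np.float32)


image = np.zeros((10, 10), dtype=np.uint8)
image[4:7, 5:8] = 255
refined = refine(image, toy_model)
```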

When the segmentation model predicts a coin in several places, we choose the largest area. This helps when there are multiple coins in the photo and we need to choose one. It also helps to eliminate noise caused by random small objects being mistaken for coins.
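One simple way to pick the largest area is to label connected regions of the binary mask and keep the biggest one. The sketch below does this with a breadth-first search over 4-connected pixels; the function name is hypothetical, and image libraries offer faster connected-component routines.

```python
from collections import deque

import numpy as np


def largest_component(mask):
    """Keep only the largest 4-connected region of a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Flood-fill one region starting from this unvisited pixel.
        seen[sy, sx] = True
        queue = deque([(sy, sx)])
        region = []
        while queue:
            y, x = queue.popleft()
            region.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(region) > len(best):
            best = region
    out = np.zeros_like(mask)
    for y, x in best:
        out[y, x] = True
    return out


predicted = np.zeros((6, 6), dtype=np.uint8)
predicted[1:4, 1:4] = 1   # the coin we want to keep
predicted[5, 5] = 1       # a small false detection
cleaned = largest_component(predicted)
```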

Comparison of Hough Transform and U-Net.

For the coin similarity model, we also used EfficientNet, as at the time of writing it is considered a state-of-the-art CNN backbone. As we have only a modest dataset of about 1000 images, we chose EfficientNet-b0, the smallest member of the family.

After some experimentation, we settled on an embedding size of 64, which means that every coin in our dataset will be represented as 64 numbers. We added a couple of fully connected layers to our backbone to perform this embedding.

We used ArcFace loss with the parameters margin=0.2, scale=25.0 to train the model. ArcFace loss proves useful for metric learning problems, as it is designed to strongly separate a large number of classes where each class has a small number of samples. This is exactly the situation we have with our dataset of many different coins with a small number of images per coin.

During training, we pass all the available images through our coin segmentation model to remove the image background and focus training on the coins themselves. We also apply some augmentations: blur, random noise and perspective distortion (ISONoise, IAAPerspective, IAAAdditiveGaussianNoise), random contrast and brightness, shadows, flips, and rotations. All images were resized to 256x256 pixels.

Since the CNN layers in our model were already pre-trained and the fully connected layers were not, we used separate schedulers and optimizers for the CNN and FC + ArcFace parts of the network. Standard categorical cross-entropy is used as the final classification loss.

Following best practices, we split our dataset into training and validation parts during the preprocessing stage. After each training epoch, we calculate the loss on the validation set to detect when the model starts overfitting.
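To show what the margin and scale parameters do, here is a simplified NumPy sketch of the ArcFace logits followed by cross-entropy. It is an illustration under assumptions, not the training code: the function names and toy inputs are hypothetical, and the special handling that practical implementations add for angles close to π is omitted.

```python
import numpy as np


def arcface_logits(embeddings, weights, labels, margin=0.2, scale=25.0):
    # Normalize embeddings and class centers so that dot products are cosines.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cosine = np.clip(emb @ w.T, -1.0, 1.0)
    theta = np.arccos(cosine)
    rows = np.arange(len(labels))
    adjusted = cosine.copy()
    # Add the angular margin only to the angle of the true class.
    adjusted[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * adjusted


def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])   # two toy samples
weights = np.array([[1.0, 0.2], [0.1, 1.0]])      # one center per class
labels = np.array([0, 1])

plain = arcface_logits(embeddings, weights, labels, margin=0.0)
with_margin = arcface_logits(embeddings, weights, labels, margin=0.2)
print(cross_entropy(plain, labels), cross_entropy(with_margin, labels))
```

The margin makes the true class harder to score well on, so the loss stays higher until embeddings of the same class sit within a tighter angle around their class center.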

With the trained similarity model, we can use the embeddings it produces as vector representations of the coins. Together, these representations form a coin vector space in which similar coins are clustered together.

This makes it easy to recognize a coin: all we have to do is convert the snapped picture into a vector in this vector space using our trained model and find its nearest neighbors, which will be the result of our search. The notion of “nearest neighbors” requires a strict mathematical definition of distance between vectors. Classical Euclidean distance is not optimal here because ArcFace separates points in vector space using angles between vector representations rather than linear distances. Thus, we use cosine distance, which is based on the cosine similarity of the two vectors:

$$
\begin{aligned}
\text{similarity}(A,B) = \frac{A \cdot B}{\lVert A \rVert \times \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}
\end{aligned}
$$
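A short numeric illustration of why the angle matters more than the length of the vectors: in the toy example below, `b` points in exactly the same direction as `a` but is ten times longer, while `c` has a similar length but a different direction. The vectors are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))


a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, larger magnitude
c = np.array([3.0, 2.0, 1.0])   # similar magnitude, different direction

# Cosine similarity ranks b as the closest match to a,
# while Euclidean distance would rank c first.
print(cosine_similarity(a, b), cosine_similarity(a, c))
print(euclidean_distance(a, b), euclidean_distance(a, c))
```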

In general, searching in vector space is very computationally intensive. It requires you to find the distances between the representation of your snapped picture, a.k.a. the “anchor”, and all other points in the vector space. The cost of each distance calculation is proportional to the dimension of the vector space, and even with a modest number of points to search and a relatively low dimension, an exact search can take seconds.

Because of that, an exact vector search is impractical for most real-life scenarios. In practice, we search for the nearest vectors approximately using pre-computed indexes. There are a lot of great libraries that implement a wide spectrum of trade-offs between the speed and accuracy of vector space search.

We chose Faiss as the vector index implementation. For us, its main advantage is GPU acceleration, which significantly reduces the time spent on distance calculations. We use Milvus as a service that wraps several popular approximate-nearest-neighbors libraries, such as Faiss, NMSLIB, and Annoy, behind intuitive APIs and lets you choose an index type based on your scenario. Milvus is fully containerized, which adds to the convenience.

When the model is trained, it optimizes its loss as defined by the training algorithm. However, it remains to be seen how well this optimization serves our goal of recognizing coins.

We reserved a validation dataset with pictures that were not involved in the training and that also contains coin classes the model is unaware of. This way, we check that the model can capture essential coin features and generalize coin similarity beyond the types of coins seen in training. We split this dataset into “catalog” and “query” parts to be used in the evaluation.

After each epoch, we vectorize the “catalog” images using the latest model generation, index them and perform a nearest neighbor search using each image from the “query” dataset.

As the main metric, we use 1-Recall@k, where k = 1, 2, 5, 10, 20. This metric counts a prediction as correct if the correct coin is found among the top k search results.

| k | 1 | 2 | 5 | 10 | 20 |
|---|---|---|---|---|---|
| 1-Recall@k | 0.90 | 0.91 | 0.96 | 0.99 | 1 |

As you can see, we get the correct coin as the first search result in 90% of cases, and within the top 10 results in 99% of cases.
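The catalog/query evaluation described above can be sketched in a few lines. This is an illustrative brute-force version with hypothetical names and toy 2-dimensional embeddings; it counts a query as a hit when its class label appears among the labels of its top k catalog neighbors.

```python
import numpy as np


def recall_at_k(query_vecs, query_labels, catalog_vecs, catalog_labels, ks=(1, 2, 5)):
    # Normalize so that the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    ranking = np.argsort(-(q @ c.T), axis=1)           # best match first
    ranked_labels = np.asarray(catalog_labels)[ranking]
    hits = ranked_labels == np.asarray(query_labels)[:, None]
    # A query counts as correct if any of its top-k neighbors has its label.
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}


catalog_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
catalog_labels = ["a", "b", "c"]
query_vecs = np.array([[0.9, 0.1], [0.6, 0.8]])
query_labels = ["a", "b"]

scores = recall_at_k(query_vecs, query_labels, catalog_vecs, catalog_labels, ks=(1, 2))
print(scores)
```

In this toy setup the second query is nearest to the wrong catalog entry but finds the right one in second place, so the metric rises from 0.5 at k = 1 to 1.0 at k = 2.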

Let's look at some real-life examples of the coin recognition model in action:

Let's also look at some of the results for the coins which are not in the index, so the model tries to find something similar:

In this blog post, we described how to build a coin recognition system from scratch. With the power of modern deep learning-based visual models, it is possible to build high-quality visual search systems even with modest amounts of training data. Modern vector search services and APIs, like Milvus, greatly simplify the engineering and deployment aspects of such projects.

At Grid Labs, we are always eager to try out new approaches and emerging technologies and find their application in real-life business problems. Visual search systems are quickly becoming mainstream and find their application in many enterprises across the industries.

If you have questions about visual search and want to learn more about those kinds of systems, don’t hesitate to reach out or leave a comment!

Happy searching!