Street-to-Shop: Is the Future of Fashion Visual Search?
May 28, 2021 • 14 min read
Ever tried to stretch your imagination and picture what your bedroom would look like with that lovely new wallpaper? What about new tiles for a bathroom? It sounds like a tedious task, especially as you go through dozens of design options, combining and comparing them. Well, no more! Modern augmented reality powered by deep learning allows you to take a picture of your room and visualize it with new wall paint, wallpaper, tiles or flooring.
At Grid Labs, we like to take a stab at interesting problems like that and find approaches to solve them. In this blog post, we report on how to build a system that takes a photo of your room and a texture of the desired wallpaper or paint and magically produces a visualization of your room in a new style.
Truth be told, there is no real magic in this solution, but there is quite a bit of pretty cool image processing and deep learning, so read on!
So, how do we achieve the magical experience in the picture above? The main idea is that we have to analyse the image and distinguish the walls from the rest of the room. We also have to understand the perspective of the room in order to project our texture correctly and, to make things look real, we have to transfer the shadows from the original image to the simulated one.
All this processing comprises a full room processing pipeline:
Now, let's look at those steps in more detail.
To correctly apply the texture to the walls we have to consider two critical pieces of data: room layout and room segmentation. We will use separate deep learning models for those concerns:
For layout detection we employed the model described in the paper Physics Inspired Optimization on Semantic Transfer Features: An Alternative Method for Room Layout Estimation, which was presented at CVPR 2017. The model was trained on the LSUN dataset. The authors claim that they achieved an mIOU of 0.75 on the validation set for the layout estimation problem. Other metrics, such as pixel error (the ratio of mislabelled pixels to all pixels) and corner error (the Euclidean distance between estimated corner coordinates and ground truth), are also reported to beat the state-of-the-art models of that time. IOU stands for intersection over union, a metric that measures how well the prediction covers the pixels of the ground truth room layout. Below you can see a few examples of ground truth bounding boxes (blue), predicted bounding boxes (red) and the corresponding IOU in each case.
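To make the three metrics concrete, here is a minimal sketch in Python with NumPy. The function names are ours, not the paper's, and we assume boxes are given as `(x_min, y_min, x_max, y_max)` and that predicted corners are already matched one-to-one with ground truth corners. Benchmark implementations may normalise these values differently.

```python
import numpy as np

def box_iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pixel_error(pred_labels, true_labels):
    """Ratio of mislabelled pixels to all pixels."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(pred_labels != true_labels))

def corner_error(pred_corners, true_corners):
    """Mean Euclidean distance between matched corners (assumed pre-matched)."""
    diff = np.asarray(pred_corners, float) - np.asarray(true_corners, float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))
```

For example, two 2x2 boxes that overlap in a 1x1 square share 1 unit of area out of 7 in their union, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.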
For the room segmentation task we evaluated a number of available models:
|ResNet50-dilated + |
The models were pre-trained on two datasets: ADE20K and COCO. ADE20K consists of scene-centric images with 150 semantic categories, which include stuff like sky, road and grass, and discrete objects like person, bed, etc. COCO has 180 classes.
To compare the models’ performance we used two key metrics: mIOU and pixel-wise accuracy. As those metrics are averaged across all classes in the corresponding datasets, we can use them to compare the segmentation quality of these models.
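As an illustration of how these two segmentation metrics can be computed, the sketch below builds a confusion matrix from integer label maps and derives both numbers from it. This is our own minimal version, not the evaluation code shipped with the models; the function names are hypothetical, and we assume classes that appear in neither the prediction nor the ground truth are skipped when averaging IOU.

```python
import numpy as np

def confusion_matrix(pred, true, num_classes):
    """Rows are ground truth classes, columns are predicted classes."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    true = np.asarray(true, dtype=np.int64).ravel()
    # Encode each (true, pred) pair as a single index and count occurrences.
    counts = np.bincount(true * num_classes + pred, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Share of pixels whose predicted class equals the ground truth class."""
    return float(np.trace(cm) / cm.sum())

def mean_iou(cm):
    """Per-class IOU averaged over the classes present in pred or true."""
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0
    return float(np.mean(inter[present] / union[present]))
```

On a 2x2 toy example with ground truth `[[0, 0], [1, 1]]` and prediction `[[0, 1], [1, 1]]`, three of four pixels are correct, the per-class IOUs are 1/2 and 2/3, and their mean is 7/12.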
The state-of-the-art DeepLab model with a ResNet200 backbone trained on ADE20K shows the best results. The second model, with a Pyramid Pooling Module, has lower metrics, and the DeepLab model comes a distant third. Consequently, we chose the two models pretrained on ADE20K.
We can apply our textures as is or in a tiled mode. To obtain a natural-looking size for a texture representing tiles, we have to scale our input image appropriately. There are models that can estimate the depth of the room, but here we decided to take a shortcut and made our textures 20 times smaller than the image of the room to get a tile size of roughly 30x30 cm (12x12 inch). We can also double the size of the tiles as needed for visualization.
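The tiling shortcut can be sketched in a few lines of NumPy. This is an illustration under our own assumptions rather than the production code: we take "20 times smaller" to mean a square tile whose side is 1/20 of the room image width, we assume an H x W x 3 texture array, and we use nearest-neighbour resizing to keep the example dependency-free. The names `resize_nearest` and `tile_texture` are hypothetical.

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of an H x W x C image."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]

def tile_texture(texture, room_h, room_w, scale_divisor=20, double=False):
    """Shrink the texture to one tile and repeat it to cover the room image."""
    # Assumption: the tile side is the room image width divided by 20.
    tile_side = max(1, room_w // scale_divisor)
    if double:
        tile_side *= 2  # optional larger tiles for visualization
    tile = resize_nearest(texture, tile_side, tile_side)
    reps_y = -(-room_h // tile_side)  # ceiling division
    reps_x = -(-room_w // tile_side)
    tiled = np.tile(tile, (reps_y, reps_x, 1))
    return tiled[:room_h, :room_w]  # crop to the exact room size
```

For a 640x480 room photo this gives 32-pixel tiles, or 64-pixel tiles with `double=True`. A depth-aware approach would instead vary the tile size with distance from the camera.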
The general process of applying a texture to the walls is displayed in the image below. The output of the Layout model gives us a heatmap with the detected wall edges (the room layout). This room layout is used to calculate the coordinates of the wall planes. First, we take the contours of the wall edges and form wall planes, using heuristics to approximate the corners of these planes. Next, to fill the gap between adjacent walls, we calculate the average distance between the walls’ contours and use it to find the room corner. Once the wall planes are found, we transfer the texture onto them using a homography projection. In the last step, the output mask of the Segmentation model is used to remove furniture and decor from the textured wall planes.
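The last two steps, homography projection and masking, can be sketched with plain NumPy. This is a simplified illustration, not our pipeline code: it assumes a single wall whose four corners are already known and ordered top-left, top-right, bottom-right, bottom-left as `(x, y)` pairs, it assumes a binary `wall_mask` that is 1 where the segmentation model sees bare wall, and it samples the texture with nearest-neighbour lookup. In practice a library routine such as OpenCV's `cv2.warpPerspective` would usually do the warping.

```python
import numpy as np

def homography_from_points(src, dst):
    """3x3 matrix H with H @ [x, y, 1] ~ [u, v, 1] for four point pairs."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, with H[2, 2] fixed to 1.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, float), np.array(rhs, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_texture_onto_wall(room, texture, wall_corners, wall_mask):
    """Paint the texture onto one wall plane, keeping non-wall pixels intact."""
    th, tw = texture.shape[:2]
    tex_corners = [(0, 0), (tw - 1, 0), (tw - 1, th - 1), (0, th - 1)]
    # Inverse warping: map every room pixel back into texture coordinates.
    h_inv = homography_from_points(wall_corners, tex_corners)
    h, w = room.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    u, v, s = h_inv @ pts
    with np.errstate(divide="ignore", invalid="ignore"):
        u = (u / s).reshape(h, w)
        v = (v / s).reshape(h, w)
        # Pixels that land inside the texture rectangle belong to the wall quad.
        inside = (u >= 0) & (u <= tw - 1) & (v >= 0) & (v <= th - 1)
    # The segmentation mask removes furniture and decor from the wall plane.
    inside &= np.asarray(wall_mask).astype(bool)
    out = room.copy()
    ui = np.rint(u[inside]).astype(int)
    vi = np.rint(v[inside]).astype(int)
    out[inside] = texture[vi, ui]
    return out
```

Calling `warp_texture_onto_wall` once per detected wall plane, each time on the previous output, would texture the whole room; the texture passed in can be the tiled image rather than the raw one.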
Images without shadows look flat and unnatural. To achieve a more natural look, we need to add light and shadows to our textured walls. The key technique is to work with the HSV representation of the original image. The V channel (sometimes called B) holds the pixel brightness, which correlates with the intensity of the shadows.
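One simple way to use the V channel for this is sketched below. It is our own minimal illustration and not necessarily the exact procedure used in the pipeline: we assume 8-bit RGB inputs, we assume the median wall brightness is a reasonable stand-in for "unshaded" wall, and the function names are hypothetical. It relies on the fact that the HSV value of a pixel is the maximum of its R, G and B components, so multiplying all three by the same gain scales V and leaves hue and saturation unchanged.

```python
import numpy as np

def value_channel(rgb):
    """HSV value (V) of a float RGB image: the per-pixel maximum channel."""
    return rgb.max(axis=-1)

def transfer_shading(original, textured, wall_mask):
    """Darken or brighten the textured walls following the original V channel."""
    orig = original.astype(float) / 255.0
    tex = textured.astype(float) / 255.0
    mask = np.asarray(wall_mask).astype(bool)
    v = value_channel(orig)
    # Assumption: the median wall brightness represents an unshaded wall.
    reference = max(float(np.median(v[mask])), 1e-6)
    gain = v / reference
    # A uniform RGB gain changes V only (before clipping to the valid range).
    shaded = tex * gain[..., None]
    out = np.where(mask[..., None], shaded, tex)
    return (np.clip(out, 0.0, 1.0) * 255.0).round().astype(np.uint8)
```

Because the gain comes straight from the original wall brightness, a dark stripe in existing wallpaper is treated exactly like a shadow.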
The basic approach to transferring shadows is the following:
This approach produces pretty smooth results (picture above), but it has some important flaws worth mentioning. When dealing with walls that have strong patterns (say, stripes), the darkness of the darker parts of the pattern is transferred as if it were a shadow. To fix this problem, we tried several techniques:
Because the data pipeline we use is pretty long, our original solution suffered from performance issues. Image processing took from 40 seconds to 2 minutes, which was clearly not acceptable for real customers. The Layout and Segmentation networks were the top offenders. We employed several techniques to optimize the performance of the pipeline:
Those measures allowed us to cut inference time to 1-2 seconds per image, which is already good enough for real-time use.
We compared the results of our model with a leading commercially available product and found that our model produces results of comparable quality:
In this example we can see how different models work with shades and lighting.
In this blog post, we described a practical solution for the core of a room visualisation system that can power your augmented reality application. As we continue working on our models, we see a lot of opportunities to expand the system's features and improve its quality and performance, such as training our own segmentation and depth perception models, as well as applying neural network optimizations to improve inference times.
Meanwhile, if you are interested in adding computer vision and image analytics to your application, don’t hesitate to reach out!