The world of AI is constantly evolving, and recent advances in text-to-image models like DALL-E and Stable Diffusion have been truly groundbreaking. These models are already transforming how artists and designers work, enabling them to experiment with their ideas and create stunning illustrations at lightning-fast speeds. But the impact of generative AI goes far beyond just the creative realm. These cutting-edge models are also poised to revolutionize the way we shop online.
At Grid Labs, we are constantly experimenting with emerging AI technologies that might help our clients, and in our recent blog post about Generative AI in Digital Commerce we already discussed some of the applications of generative AI. In this blog post, we will show the results of our experiments with text-to-image generative models, and in particular, we'll explore how brands can leverage this technology for content creation and product visualization.
In the current era of digital commerce, the role of product images and videos cannot be overstated, especially when it comes to shaping the overall customer experience. Brands and retailers understand the importance of investing significant resources in elaborate photoshoots that feature attractive human models, captivating lifestyle images set in immersive environments, and innovative video production techniques. They also strive to incorporate cutting-edge 3D rendering technologies to showcase their products in the most visually appealing way possible. However, the expense and time associated with live photoshoots, and the high costs and limitations of 3D rendering, restrict the extent to which these creative efforts can be scaled and personalized. Thankfully, generative AI has emerged as a potential game-changer in this landscape. By streamlining and automating the content creation process, as well as unlocking new avenues for product visualization and hyper-personalization, generative AI has the power to revolutionize the way brands and retailers operate in the digital commerce space.
By replacing costly 3D rendering pipelines and enabling designers and marketers with tools powered by generative AI, we can significantly speed up and scale the creation of images across websites and marketing channels. This also means that we can generate a lot more content in the same amount of time and with the same resources, while opening the door for higher levels of content personalization.
Besides reducing costs and scaling content creation, generative AI also improves the quality of product visualizations and unlocks new capabilities, such as generating personalized outfit images or even providing realistic try-on experiences with much higher quality than traditional 3D and AR technology.
Generative AI has become an essential tool for producing stunning illustrations and designing new products. While its ability to create without boundaries is impressive, there are also use cases where AI is required to generate images within strict constraints. For example, customizing products using available colors and materials while maintaining the overall design, or, in our case, visualizing products in different contexts while retaining all details.
Generative models such as DALL-E or Stable Diffusion do not provide mechanisms for accurately reproducing existing objects in a new context. However, recent advancements in text-to-image generative AI have focused on controllable generation and are already making significant progress toward achieving these capabilities.
In the following sections, we will look at product visualization examples generated using various approaches, starting with relatively simple techniques based on the inpainting capabilities of generative AI models and then moving on to more sophisticated re-contextualization methods, which allow us to reconstruct products in a completely new context.
Models like DALL-E and Stable Diffusion can perform inpainting: you take a photo of your product and generate a new background around it from a text prompt. This approach enables us to create stunning, customized product visualizations. And even though you cannot change the angle, lighting, etc., of the object itself, these models are able to adapt the generated scene to the lighting of your product. Let’s take a look at several examples below generated using the Stable Diffusion model.
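To make the background-replacement idea concrete, here is a minimal sketch, not the exact setup used for the examples in this post. It assumes the product has already been cut out and is described by an alpha matte. The helper names `background_mask` and `repaint_background` are hypothetical, and `model_id` stands for whichever inpainting checkpoint you have access to. The only library call is the Hugging Face diffusers `StableDiffusionInpaintPipeline`, whose mask convention is that white pixels are repainted and black pixels are kept.

```python
from typing import List

Mask = List[List[int]]  # one value per pixel, 0..255


def background_mask(alpha: Mask, threshold: int = 128) -> Mask:
    """Return an inpainting mask: 255 where the background should be
    regenerated, 0 where the product should be kept."""
    return [[255 if a < threshold else 0 for a in row] for row in alpha]


def repaint_background(product_image, mask_image, prompt: str, model_id: str):
    """Regenerate the masked background with a diffusers inpainting pipeline.

    product_image and mask_image are PIL images of the same size.
    model_id is an assumption: pass the inpainting checkpoint you use.
    """
    # Imported lazily so the mask helper above works without diffusers installed.
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id)
    result = pipe(prompt=prompt, image=product_image, mask_image=mask_image)
    return result.images[0]


# Toy 3x4 alpha matte: the product occupies the two middle columns.
alpha = [
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 200, 90, 0],
]
mask = background_mask(alpha)
```

In this sketch the semi-transparent pixel with alpha 90 falls below the threshold and is treated as background, so the threshold decides how much of the product's soft edge the model is allowed to repaint.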
However, one of the common issues with inpainting is that it can change your object during image generation. As shown above, it is possible to achieve great results, but this is not always the case. Let’s take a look at one of the failed examples below, generated for a 3D model of custom Nike shoes (Nike By You).
This is where adapter networks like ControlNet and T2I-Adapter can help achieve more consistent results. We can train adapters to control various aspects of image generation such as composition, semantics, human poses, and more. These adapters can then be connected to a pre-trained model such as Stable Diffusion, with surprisingly good results.
One such adapter can control the image generation process using object edges. For instance, a sketch of a shoe image can be automatically generated and used as a control image during the inpainting process:
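As a rough illustration of how such a control image could be derived automatically, the sketch below computes a binary edge map from a grayscale image with a plain Sobel filter. This is a stand-in chosen for brevity: edge-conditioned adapters are commonly fed Canny edges, and the function name `edge_control_image` and the threshold value are assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # grayscale image, values 0..255


def edge_control_image(gray: Image, threshold: int = 128) -> Image:
    """Binary edge map: 255 where the Sobel gradient magnitude exceeds
    the threshold, 0 elsewhere. Border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient (right column minus left column).
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical gradient (bottom row minus top row).
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 255
    return edges


# Toy image: dark left half, bright right half, so one vertical edge.
gray = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = edge_control_image(gray)
```

The resulting map marks the product outline, which the adapter then uses to keep the generated object's shape aligned with the original.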
Inpainting is a very powerful tool, and as we see above, it works well for certain products and use cases. However, it has limitations: you cannot modify the product itself (e.g. rotate it) unless you have a 3D model of it, and you cannot change the product lighting or shadows. It is also not effective for more complex scenarios such as outfit generation.
Researchers are also pursuing an alternative approach: fine-tuning text-to-image models to visualize objects in different contexts. One popular method is DreamBooth, which fine-tunes a text-to-image model using only a few images of a subject paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., “a photo of a [V] teapot”). The model learns to associate the unique identifier with that particular subject, and it can then be used to synthesize completely new photorealistic images of the subject.
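The prompt structure described above can be sketched in a few lines. The function names are hypothetical, and the identifier `sks` is only an arbitrary placeholder for the unique token (the "[V]" in the example). The class-only prompt is included because the DreamBooth method also uses plain class prompts to preserve what the model already knows about the class; whether you need it depends on your training setup.

```python
from typing import Tuple


def dreambooth_prompts(class_name: str, identifier: str) -> Tuple[str, str]:
    """Return (instance_prompt, class_prompt) for DreamBooth-style fine-tuning."""
    # Paired with the few photos of the specific subject.
    instance_prompt = f"a photo of a {identifier} {class_name}"
    # Describes the generic class, without the unique identifier.
    class_prompt = f"a photo of a {class_name}"
    return instance_prompt, class_prompt


def contextual_prompt(class_name: str, identifier: str, context: str) -> str:
    """Prompt for placing the learned subject in a new context after fine-tuning."""
    return f"a photo of a {identifier} {class_name} {context}"


# "sks" is an arbitrary placeholder token, not a required value.
instance_prompt, class_prompt = dreambooth_prompts("teapot", "sks")
scene_prompt = contextual_prompt("teapot", "sks", "on a kitchen table")
```

After fine-tuning, only the context part of the prompt changes from image to image, while the identifier and class name stay fixed.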
DreamBooth-like methods can produce stunning results. However, they are not always sufficient for product visualization, as they are unable to accurately preserve some of the finer, often important, details of an object. In the teapot example above, you can easily notice the discrepancies in color and shape. This limitation is not easily overcome when fine-tuning the model with only a few images (only 4 training images of the teapot were provided), but we believe the technology will develop rapidly. Further, the results improve as the number of training images increases. For example, according to our experiments, 20-50 images are likely enough to produce great results. Below, you can see images generated by Stable Diffusion fine-tuned at Grid Labs on 50 images of Nike shoes.
The results look impressive, and in most cases, it’s hard to find inaccuracies when comparing the original and generated images side-by-side. These results demonstrate the incredible potential of generative AI models for product visualization. That said, this approach may not work for some products, especially those with complex patterns or text, and not everyone has dozens of diverse product photos for model fine-tuning. Another potential issue is that even small discrepancies with real products can be critical for some businesses or use cases. We cover a few techniques that resolve some of these issues in the next section. Below are several more examples for furniture and clothing items we generated using the same DreamBooth approach:
Generated images using fine-tuned Stable Diffusion:
DreamBooth also works for clothing items, which inpainting cannot handle due to the nature of these products. Here are examples for a sweater:
Generated images using fine-tuned Stable Diffusion:
Similar to what we did with inpainting earlier, we can combine DreamBooth with ControlNet, which allows us to preserve the details of the product. Unlike inpainting, however, DreamBooth also allows us to add details to the product, such as shadow and lighting treatments.
ControlNet models are not limited to edges; there are several types. One of them controls the poses of people in generated images. Here are examples of pose-guided generation (DreamBooth + ControlNet) for the sweater we generated earlier.
Besides some of the inpainting and DreamBooth limitations we already mentioned (and as we have seen, some of them can be resolved using ControlNet or bigger training datasets), there are a few that require manual or semi-automated post-processing by digital artists.
One of the most challenging tasks for any generative AI model is to accurately replicate hands (see example below). In practice, you would need to generate several images and choose the one with the most natural-looking hands, or redraw part of the image using inpainting and ControlNet.
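The "generate several and pick the best" workflow can be sketched as a simple best-of-N loop. Both callables are assumptions: `generate` stands for your image generation call with a fixed random seed, and `score` stands for whatever judges the hands, which in practice is often a human reviewer rather than an automatic metric.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    seeds: Iterable[int],
) -> Tuple[int, T]:
    """Generate one candidate per seed and return the (seed, image) pair
    with the highest score."""
    best_seed, best_image, best_score = None, None, float("-inf")
    for seed in seeds:
        image = generate(seed)
        candidate_score = score(image)
        if candidate_score > best_score:
            best_seed, best_image, best_score = seed, image, candidate_score
    if best_seed is None:
        raise ValueError("seeds must not be empty")
    return best_seed, best_image


# Stand-ins for a real generator and a real quality judgment.
fake_scores = {"image-1": 0.2, "image-2": 0.9, "image-3": 0.5}
seed, image = best_of_n(
    generate=lambda s: f"image-{s}",
    score=lambda img: fake_scores[img],
    seeds=[1, 2, 3],
)
```

Returning the seed alongside the image makes the chosen result reproducible, which helps if the image later needs a partial redraw.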
Another problem for text-to-image models is generating text. Text generation issues can also be fixed by a digital artist with the help of ControlNet, but there may soon be a simpler solution. For example, one of the recent models, DeepFloyd IF, shows great promise in the area of text generation.
In the example of the Nike shoes above, we generated only close-up images. This way, we can achieve higher accuracy of the generated object. However, using the out-painting feature of text-to-image models, we can expand our generated images to make larger lifestyle images.
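Out-painting can be set up as inpainting on an enlarged canvas. The sketch below, with the hypothetical helper `outpaint_canvas`, pads a close-up image on all sides and builds the matching mask, using the convention that 255 marks pixels to be generated and 0 marks pixels to keep. The pad size and fill value are assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale for brevity, values 0..255


def outpaint_canvas(image: Image, pad: int, fill: int = 0) -> Tuple[Image, Image]:
    """Place the image in the center of a larger canvas and return
    (canvas, mask), where the mask is 255 on the new border area."""
    h, w = len(image), len(image[0])
    new_h, new_w = h + 2 * pad, w + 2 * pad
    canvas = [[fill] * new_w for _ in range(new_h)]
    mask = [[255] * new_w for _ in range(new_h)]
    for y in range(h):
        for x in range(w):
            canvas[y + pad][x + pad] = image[y][x]
            mask[y + pad][x + pad] = 0  # keep the original close-up
    return canvas, mask


# Toy 2x2 close-up expanded by one pixel on each side.
close_up = [[10, 20], [30, 40]]
canvas, mask = outpaint_canvas(close_up, pad=1)
```

The canvas and mask can then be passed to an inpainting model, which fills the border with a scene that matches the prompt and the original close-up.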
To unlock all the capabilities of generative AI, you need a combination of several models, and sometimes fine-tuning these models with your data. Depending on the domain and use cases, different components of the generative AI ecosystem can be used to provide generative AI capabilities to content creators.
As shown earlier, open-source models such as Stable Diffusion can be used for inpainting and re-contextualization approaches. ControlNet models can be used for more precise and controlled image generation. At the same time, various pre-processing (e.g. accurate automatic background removal before inpainting) and post-processing methods can be used by the content creators to achieve desired results.
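As one example of a post-processing step, the sketch below pastes the original product pixels back over a generated image so that the product cannot drift from the source photo. This is an assumption about what such a step might look like, not a description of a specific Grid Labs pipeline. The helper name `restore_product` is hypothetical, and the mask uses the same convention as inpainting (non-zero for regenerated background, 0 for product).

```python
from typing import List

Image = List[List[int]]  # grayscale for brevity, values 0..255


def restore_product(original: Image, generated: Image, mask: Image) -> Image:
    """Keep generated pixels where the mask is non-zero (background) and
    restore original pixels where the mask is 0 (product)."""
    return [
        [gen if m else orig for orig, gen, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


# Toy row: the middle pixel belongs to the product and must be preserved.
original = [[1, 2, 3]]
generated = [[9, 8, 7]]
mask = [[255, 0, 255]]
final = restore_product(original, generated, mask)
```

A hard paste like this can leave visible seams at the product boundary, so a real workflow would likely blend the edge or leave that judgment to the content creator.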
Compared to the world of large language models (LLMs), where open-source models are far behind SaaS products such as those offered by OpenAI, generative AI for images has many open-source models and tools. By leveraging these open-source components, we can build and customize generative AI studios for your needs, and deploy them on any cloud platform.
Virtual try-on is something that all apparel brands want to achieve, but none have delivered truly convincing results. Previous solutions based on generative adversarial networks (GANs) were not precise enough, and 3D/AR solutions still look less than impressive. By combining different approaches, such as DreamBooth, ControlNet, and inpainting, it seems we will finally achieve a realistic virtual try-on experience very soon. Here are several examples of shoes from early research we are doing at Grid Labs:
Another exciting area is video synthesis (DreamPose, Text2Video), which is obviously a more complex process than image visualizations. We are still at a very early stage, but the technology is evolving very fast.
Generative AI has shown tremendous promise in the realm of content creation and product visualization. Certain approaches have proven to work exceptionally well, such as inpainting and adapters to control various aspects of the image generation process. On the other hand, in the case of approaches such as DreamBooth, there are still limitations, and further research is required for certain use cases. However, there is optimism that this year will bring significant advancements, elevating the technology to even greater heights.
Designers are anticipated to embrace AI tools to accelerate content creation, leveraging the power of generative AI to streamline their workflows and produce captivating visuals more efficiently. Furthermore, as the potential of visual generative AI becomes increasingly recognized, more domains and industries are expected to embrace and harness its capabilities, extending beyond the confines of traditional use cases.
With continuous technical advancements and a wider adoption across various sectors, generative AI is poised to revolutionize the way brands and businesses present their products, creating immersive and hyper-realistic visual experiences for customers. The future holds immense potential for generative AI, and its transformative impact on the world of content creation and product visualization is only just beginning.