For this use case, we assume a real-time strategy game set in the near future, and explore the generation of various isometric building sprites for it. We start by generating the initial designs in Midjourney using just the text prompt, and then examine how these initial designs can be modified.
Midjourney V4 does an excellent job generating isometric building designs based on basic prompts that contain only the building type and several style instructions:
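A basic prompt of this kind might look as follows (the building type and style descriptors here are illustrative, not the exact prompts used to produce the images in this post):

```
/imagine isometric power plant building, cyberpunk style, white background --v 4
```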
The designs generated from these basic prompts vary considerably in style, even when the chaos parameter is set to its minimum. To create several buildings in approximately the same style, we usually need to generate dozens of designs and manually pick the matching variants.
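The chaos parameter is controlled with the `--chaos` flag, which accepts values from 0 to 100; higher values produce more unusual and varied result grids. Even with a hypothetical prompt like the following, pinned to the minimum chaos, the style still drifts from one generation to the next:

```
/imagine isometric factory building, cyberpunk style, white background --v 4 --chaos 0
```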
This basic approach also works well for more complex prompts that combine several instructions about the building structure or use relatively uncommon or ambiguous concepts and terms:
The examples below demonstrate how the “default” designs can be customized with various types of damage and degradation. The variability of the generated designs decreases as we add more specific instructions about the style and scene layout, so it becomes easier to create a collection of buildings with a certain consistent style.
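Damage and degradation instructions can simply be appended to the baseline prompt. A sketch of such a prompt (the specific wear-and-tear terms are illustrative):

```
/imagine isometric factory building, cyberpunk style, heavily damaged,
broken windows, rusty metal, cracked concrete, white background --v 4
```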
Similar to wear and tear, we can control the weather and lighting conditions. The instructions that specify the season, weather, and light also provide a powerful way to decrease the style variability and generate consistent collections of buildings.
Controlling the minor details, and specifically, colors, can be a challenging problem. Consider the following baseline designs:
Extending this baseline prompt with requests for minor colored elements results in major changes to the overall color theme:
These correlations are generally difficult to suppress using negative prompts (as illustrated in the examples below) or other prompt engineering techniques. However, these issues can be addressed by combining manual image editing with reference-based generation, as we discuss later in this blog post.
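Negative prompts in Midjourney are expressed with the `--no` parameter, which tells the model to avoid the listed concepts. For example, a hypothetical attempt to keep a requested accent color from spreading to the whole building might look like this, though, as noted above, it often fails to fully decouple the correlated colors:

```
/imagine isometric office building, cyberpunk style, red doors,
white background --v 4 --no red walls, red roof
```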
Controlling the quantitative parameters of the building and scene is also a challenging task, especially for complex prompts with multiple instructions. In many cases, the desired results can be achieved, but with a low success rate (only a small percentage of the generated designs meet the specification):
We conclude the overview of the initial design generation capabilities with a few examples of mixing multiple styles. We can start with engineering a prompt for an alternative style such as Warcraft-like fantasy:
This new style, and the previously used cyberpunk style, can be combined in a single prompt:
The contribution of different styles can also be controlled using prompt weights. For example, we can make the fantasy style two times more important than the cyberpunk style, as depicted below:
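Prompt weights are expressed using Midjourney's multi-prompt syntax: a double colon `::` splits the prompt into parts, and a number after the double colon sets the relative weight of the preceding part. Making the fantasy style twice as important as the cyberpunk style could look like this (the building description itself is illustrative):

```
/imagine isometric castle building, white background, fantasy style::2 cyberpunk style::1 --v 4
```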
Midjourney provides extremely impressive capabilities for generating the initial designs based on structural and style instructions. However, it can be challenging to create collections of buildings with the same style, as well as adjust minor details, using only the textual prompts. In this section, we explore several techniques that help to address these limitations.
Midjourney provides the built-in Variations feature that can be used to generate alternative designs based on a specific initial design. However, this feature does not allow you to set the chaos parameter, and can be used only to generate small variations:
The alternative approach is to use the Image Reference feature that provides much higher variability by default, and also allows setting the chaos parameter explicitly:
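With the Image Reference feature, the URL of a reference image is placed at the beginning of the prompt, before the text part, and parameters such as `--chaos` can be added as usual. A sketch of such a prompt (the URL is a placeholder for an uploaded reference image):

```
/imagine https://example.com/reference-building.png isometric factory building,
cyberpunk style, white background --v 4 --chaos 40
```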
The color correlation problem can be alleviated by replacing the problematic instructions in the prompt with a manually modified reference image. This produces reasonably good results for many applications:
All methods described above rely on manipulations of the conditioning signal in the stable diffusion model that backs the Midjourney services. The level of control over the composition, details, and variability that can be achieved using this approach is somewhat limited. The alternative option is to fine-tune the diffusion models based on manually selected, edited, or drawn reference images. This approach was developed in the DreamBooth paper, and productized in the Scenario service.
The fine-tuning approach can be illustrated with the following example. We start by generating multiple initial designs and manually selecting a small set (usually 10-20 images) in the same style. This allows us to accurately control both the style and the composition/structure variability:
The model fine-tuned on such a training set can be used to sample style-consistent designs based on short prompts such as the following:
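A hypothetical example of such a prompt, assuming (as in the DreamBooth setup) that the fine-tuned style is bound to a dedicated identifier token during training:

```
isometric sks building, water tower, white background
```

Because the style is baked into the model weights, the prompt only needs to specify the building type and basic layout; the lengthy style instructions from the earlier prompts are no longer required.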
The general-purpose, pre-trained text-to-image and image-to-image generative models provide very impressive capabilities for game asset generation. Services like Midjourney and Scenario make these models very accessible and enable extremely productive asset development workflows. The techniques described in this blog post help to improve the control over the generation process and address some of the typical needs such as the generation of style-consistent collections of assets. We anticipate that the capabilities of the generative AI services, as well as applied no-code techniques for using them, will rapidly evolve in the next few years, revolutionizing the design and game development industries.