Herbie Bradley1,2,3, Andrew Dai4, Jenny Zhang5,6, Jeff Clune5,6, Kenneth Stanley7, Joel Lehman1,2

1CarperAI, 2Stability AI, 3University of Cambridge, 4Aleph Alpha, 5University of British Columbia, 6Vector Institute, 7Maven

Figure 1: Maps showing the diversity across genre and tone (x and y axes) and quality (color of each grid cell) of generated poems from GPT-4 using our method, QDAIF, compared with a simple independent sampling baseline. The diversity and quality metrics are also obtained from GPT-4. White cells are unfilled.

Introduction

Human innovation involves not only a generative capacity for creativity, but also the ability to evaluate the subjective quality of new ideas and artifacts. Great ideas are rarely generated all at once out of whole cloth, but rather gradually emerge through divergent chains of elaboration and revision. To successfully navigate such a tree of ideas, the creator must somehow evaluate which stepping stones in a chain are worth pursuing further, a question that can be highly subjective, especially in domains with artistic or literary dimensions.

Until now, even if AI could provide candidates, the hope for such subjectively tinged evaluation lay firmly with humans. However, the emerging language model (LM) technology of recent years means that the computer can also play the role of evaluator, even when such evaluation is in part subjective. In this way, for the first time an entire ideation process that returns a diverse set of interesting options can in principle be automated. However, this process cannot be run by LMs entirely on their own; it requires chaining a search algorithm and model calls together in a nuanced way. This blog post highlights one way to achieve this full potential: to combine LMs with the field of quality diversity (QD), which centers on how to design search processes that produce high-quality solutions spanning a design space.

Haiku, “dark” tone, 9/10 quality:

Silent shadows loom,
Ghastly whispers pierce the night,
Dreadful fate awaits.

Limerick, “mysterious” tone, 9/10 quality:

There once was a place of intrigue
Where shadows danced without fatigue
The wind it did howl
As secrets did prowl
But what lay beneath, none could besiege.

Table 1: Examples of diverse, high-quality generated poems at the end of poetry evolution with QDAIF using GPT-4.

The main insight of QD algorithms is to explicitly maintain and seek out high-quality, diverse responses (most often through hand-designed measures of diversity and quality). However, we show that we can harness LMs to create a powerful new class of algorithms, which we name Quality Diversity through AI Feedback (QDAIF), that can explore and return diverse responses to an LM prompt without any hand-designed diversity measures or the need to fine-tune models (although the approach could also be used for LMs to self-improve by generating fine-tuning data). In this way, we take a step closer to LMs that can independently search and innovate, one of the linchpin abilities of humans that allows them to create culture and science.

The rest of this blog post describes the basic technologies in more detail, then highlights the potential of QDAIF through experiments in two creative writing domains.

ELM and OpenELM

Evolution Through Large Models (ELM) is a recent approach for creating evolutionary algorithms that use LMs to intelligently refine model-generated text, including code or creative writing. In the original paper, evolution leveraged special diff-trained language models to evolve Python code. Later, a collaborative project with CarperAI researchers built on this work to develop LMX (language model crossover), a way to evolve arbitrary text representations (e.g. mathematical expressions, sentences, Python programs, and prompts for text-to-image models) without needing a specially-trained language model, powered by the effectiveness of in-context learning. What is exciting about these kinds of LM-based evolutionary algorithms is that LMs enable intelligent search and exploration among a population of possible solutions, because LMs contain a learned prior about what changes to a piece of writing or code would be interesting and plausible. In this way, ELM helps evolutionary algorithms move away from the undirected genetic search of biological evolution and towards the directed, memetic search that characterizes human innovation.
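To make the idea concrete, here is a minimal sketch of an LMX-style variation operator in Python. The complete function is a hypothetical stand-in for any LM completion call (API-based or local), and the prompt format shown is one simple possibility rather than the exact format used in the LMX work.

import random

def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LM completion call (API-based or local)."""
    raise NotImplementedError

def lmx_offspring(parents: list[str], num_examples: int = 3) -> str:
    """LMX-style variation: concatenate a few existing solutions as in-context
    examples, then treat the LM's continuation as a new candidate solution."""
    examples = random.sample(parents, k=min(num_examples, len(parents)))
    prompt = "".join(f"Example: {text}\n###\n" for text in examples)
    prompt += "Example:"  # the LM continues with a new, related example
    return complete(prompt).split("###")[0].strip()

Because the offspring is simply whatever the LM writes next after seeing several parents, the operator needs no gradient updates or specially trained model; the quality of variation comes entirely from in-context learning.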

To help catalyze this new and exciting area of LM research, CarperAI earlier released OpenELM, an open-source Python library that enables easy exploration of combinations of LMs and evolutionary algorithms. OpenELM implements several LM-based variation operators, including diff models, prompt-based mutation, and LMX, and is designed to accommodate users with limited compute through inference optimizations and by integrating API-based LMs alongside models run locally on a GPU. OpenELM is what we use to implement the experiments described in this blog post.

Quality Diversity Algorithms

Most evolutionary algorithms aim to optimize a population towards a single high-quality solution. In contrast, quality diversity algorithms aim to return a diverse set of high-quality solutions. So, in addition to supplying a measure of quality (e.g. a fitness function), QD algorithms also require some measure of the desired kinds of diversity. For example, in the ELM paper, a QD algorithm evolves Python programs that construct locomoting robots in a 2D physics environment, where the height, width, and mass of the robot were chosen as the dimensions of desired diversity. The result of such a QD search was many robots of different sizes and shapes capable of competent motion. While in that experiment diversity was measured in continuous space (e.g. mass), diversity dimensions can also be categorical, such as the writing style of a piece of text (as in the first experiment in this blog post).

As in the ELM paper, the experiments described here apply MAP-Elites, a simple QD algorithm that maintains a population in the form of a grid (also called the map) spanning a space of desired diversity (e.g. height, width, and mass of a robot). During evolution, as each potential solution is generated, its quality is evaluated and it is mapped into a particular cell of the grid (based on the diversity measures). This new solution is compared against the current occupant of the cell corresponding to its diversity characteristics (e.g. compared to a robot of similar size and mass), and replaces the current occupant if it has a higher quality score (it becomes the new “elite” in that cell of the map). In this way, over generations of evolution, the map fills up with both high-performing and diverse solutions.
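For concreteness, here is a minimal sketch (in Python, with names of our own choosing) of the MAP-Elites bookkeeping just described: a grid keyed by diversity coordinates, where a new solution replaces the current elite only if it scores higher.

from typing import Any

# The map: diversity coordinates (cell index) -> (quality, solution) of the current elite.
elites: dict[tuple[int, ...], tuple[float, Any]] = {}

def insert_into_map(solution: Any, quality: float, cell: tuple[int, ...]) -> None:
    """Keep the new solution only if its cell is empty or it beats the current elite."""
    incumbent = elites.get(cell)
    if incumbent is None or quality > incumbent[0]:
        elites[cell] = (quality, solution)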

One limitation of current QD algorithms is that they require supplying both a quantitative measure of quality (to judge how good each solution is) and quantitative measures of desired diversity (to drive search to explore many different kinds of solutions). Usually such measures are hand-coded, which limits the applicability of QD algorithms to complex domains like creative writing. This blog post shows that LMs can supply quality and diversity feedback to overcome this limitation, helping to improve upon the default responses of LMs.

The Potential of AI Feedback

Recent months have seen an explosion of research that leverages LMs to provide feedback on the training, evaluation, or problem-solving capabilities of other LMs. For LM training, the use of AI feedback in supervised or RL finetuning is becoming increasingly popular, following Anthropic’s work on Constitutional AI, in which LM-generated critiques, refinements, and preferences over text generations are used to finetune models to perform better on both helpfulness and harmlessness metrics.

One particularly promising direction for AI feedback is the use of LMs to rate and evaluate their own outputs, with the goal of improving the original output in an iterative refinement process. Self-Refine shows that on a dialogue response task, asking LMs to rate their output with a numeric score across 10 dimensions of quality, including helpfulness and consistency, can produce feedback that improves the refined output by 10% according to human evaluators. These kinds of self-improvement results relate to the notion of generation-discrimination gaps: in many domains, it can be easier for a model to evaluate the quality of a generation than to generate the text in the first place.

QDAIF builds upon this capability of LMs to critique and evaluate outputs. An LM judges both the quality and diversity of its own (or another LM’s) responses, and those judgments serve as the core of a divergent, exploratory search process.

Quality Diversity through AI Feedback

Bringing these threads together (QD, ELM, and AI feedback), this section introduces a simple QDAIF algorithm. In this algorithm, we first decide which aspects of quality and diversity we care about for a given domain, and how to prompt an LM to measure them. This can involve prompt engineering to get an LM to evaluate quality and diversity, or the use of a custom fine-tuned model for a particular domain. Then, we select an LM-based variation operator (a way for an LM to riff on existing solutions, such as LMX or a prompt-based instruction like “Create a small variation of this sentence”), and seed the algorithm with one or more initial (and potentially low-quality) solutions. In a domain such as creative writing, these might be examples of stories you want the algorithm to build from.

Next, we define a map over these diversity measures and begin a MAP-Elites loop:

  1. Creating offspring with the language model variation operator
  2. Evaluating those offspring for quality and diversity with the AI feedback measures
  3. Placing each solution into its corresponding cell in the map if it is better than the cell’s current occupant.

Over time, the map fills up with diverse and high quality solutions, relying only on AI feedback (i.e. without hand-coding quantitative diversity or quality measures).
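As a concrete illustration, the following is a minimal sketch of this loop in Python. The mutate, evaluate_quality, and evaluate_diversity functions are placeholders for whatever LM calls the practitioner chooses (e.g. an LMX or rewrite prompt for variation, and AI feedback prompts for quality and diversity); these names are our own and do not come from OpenELM.

import random
from typing import Callable

def qdaif(
    seeds: list[str],
    mutate: Callable[[str], str],                # LM variation operator (e.g. LMX or a rewrite prompt)
    evaluate_quality: Callable[[str], float],    # AI feedback: numeric quality score
    evaluate_diversity: Callable[[str], tuple],  # AI feedback: cell index (e.g. (genre, tone))
    steps: int = 2000,
) -> dict:
    elites: dict[tuple, tuple[float, str]] = {}  # cell -> (quality, solution)

    def insert(text: str) -> None:
        quality, cell = evaluate_quality(text), evaluate_diversity(text)
        if cell not in elites or quality > elites[cell][0]:
            elites[cell] = (quality, text)

    for seed in seeds:      # seed the map with initial (possibly low-quality) solutions
        insert(seed)
    for _ in range(steps):  # main loop: select an elite, vary it, evaluate, (maybe) insert
        parent = random.choice(list(elites.values()))[1]
        insert(mutate(parent))

    qd_score = sum(quality for quality, _ in elites.values())  # sum of fitnesses in the map
    print(f"QD score: {qd_score:.2f}, niches filled: {len(elites)}")
    return elites

The QD score reported at the end (the sum of fitnesses across filled cells) is the same summary statistic used to compare methods in the experiments below.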

Evolution in Creative Writing Domains

To highlight the potential of QDAIF, we conduct experiments in two domains, using different models. We first show how feedback from GPT-4 can be used to evolve poems that vary over genre and emotional tone. Next, we describe supplementary experiments that provide some evidence of the approach’s generality, using Aleph Alpha’s models to write movie reviews that vary in how positively they judge the movie.

Evolving Poetry With GPT-4

To demonstrate how QDAIF can be applied to complex creative domains where diverse and high-performing generations are useful, we developed a MAP-Elites domain for poetry. Here, we use categorical variables to define two diversity dimensions: the genre and tone of the poem. Each cell in the map has two integer coordinates, which map to a particular discrete combination of genre and tone. Our genres are: "haiku", "sonnet", "ballad", "limerick", and "hymn", while our tones are: "happy", "dark", "mysterious", "romantic", and "reflective".

During evolution, a poem is randomly chosen from the map, and to generate a new solution, we prompt the LM to translate the chosen poem into a target genre and tone. In our experiments, these targets are chosen randomly, but future work could investigate guided evolution by prompting the model to explicitly target unexplored areas of the map.

We use GPT-4 both to identify the genre and tone of a poem and to measure its quality. To obtain a quality rating, we ask the model to evaluate the quality of the poem on a scale from 1 to 10. For diversity ratings, we ask the model which genre and tone the poem is closest to, out of the lists of genres and tones (given above) that define our map. For both diversity and quality, we ask the model to provide its output in JSON form with pre-specified keys, to make it easy to parse.
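For illustration, a single evaluation call might look like the sketch below, using the pre-1.0 OpenAI Python client available at the time of writing; the exact prompt wording is our illustrative guess, not necessarily the prompt used in our experiments.

import json
import openai  # assumes the pre-1.0 openai client with OPENAI_API_KEY set in the environment

GENRES = ["haiku", "sonnet", "ballad", "limerick", "hymn"]
TONES = ["happy", "dark", "mysterious", "romantic", "reflective"]

def evaluate_poem(poem: str) -> dict:
    """Ask GPT-4 for a 1-10 quality rating plus the closest genre and tone, as JSON."""
    prompt = (
        f"Here is a poem:\n{poem}\n\n"
        f"Rate its quality from 1 to 10, choose the closest genre from {GENRES} "
        f"and the closest tone from {TONES}, and briefly explain your reasoning. "
        'Answer in JSON with keys "reasoning", "quality", "genre", and "tone".'
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A robust implementation would validate the output and retry on malformed JSON.
    return json.loads(response["choices"][0]["message"]["content"])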

We found that these quality and diversity ratings were highly consistent: repeated prompting of GPT-4 with the same poem would give quality ratings varying by at most a single point. The correlation between these AI-generated ratings and our own qualitative impressions of quality was high, and improved further once we changed the prompt to ask the model to elaborate on its reasoning.

Figure 2: Quality-Diversity score and number of niches filled in the map (i.e. how many combinations of genres and tones have been discovered, out of 25) over time, during a 2000-step run of our poetry domain with GPT-4. The targeted baseline randomly selects target genres and tones and generates poems from scratch, with no evolution from previous poems, while the untargeted baseline simply generates 2000 random poems.

We ran MAP-Elites for 2000 steps with this setup, and compared our results against two baselines: targeted random sampling and purely random sampling. The former consists of randomly selecting 2000 combinations of genres and tones, asking GPT-4 to generate a high-quality poem with each target genre and tone, then filling up a map with these generations to measure a QD score, defined as the sum of all fitnesses in the map. The purely random baseline simply generates 2000 random poems and places them in a map. Note that asking GPT-4 to generate a poem with a certain genre and tone does not mean it will generate a poem that another LM call evaluates as having that genre and tone; these kinds of failures are why none of the methods successfully fills all 25 niches.

The results in Figure 2 demonstrate that QDAIF can generate more diverse and high-quality creative writing compared with even a strong baseline of targeted random sampling. Table 1 shows examples of two of our highest quality generations from the QDAIF map, targeting haikus with a “dark” tone, and limericks with a “mysterious” tone respectively.

As shown in Figure 1, QDAIF successfully fills out the map with high-quality poems, with only a few cells left empty. In contrast, randomly sampling poems from GPT-4 by default produces a map with only a handful of filled cells, demonstrating that simply asking a language model for generic pieces of text often results in a surprising lack of diversity compared with the full capabilities of the model. This phenomenon is related to mode collapse in LMs. In many domains, including poetry, users often care about more than just high-quality outputs when asking a model to generate text; we also want some level of diversity, and QDAIF provides a path towards fulfilling this goal.

Sentiment of Movie Reviews

As a supporting demonstration of QDAIF’s generality, we also earlier applied MAP-Elites to a creative writing environment where the task is to evolve movie reviews for the late-1980s action movie “Die Hard”. We measure quality by computing the cosine similarity between the embeddings of a reference string (“Movie review for the film ‘Die Hard’”) and the generated movie review, using the Luminous-Explore 13B embedding model, as a type of AI feedback on whether or not the generated text is a realistic movie review. We measure diversity with the generated review’s sentiment (i.e. how positive or negative the review is), by prompting an instruction-tuned model to evaluate the general sentiment of the review text and computing a score from the log-probabilities of “positive” and “negative”.
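A rough sketch of both feedback signals is shown below; embed and token_logprob are hypothetical stand-ins for the embedding and completion endpoints (we omit the actual Aleph Alpha client calls), and the sentiment prompt wording is illustrative rather than the exact prompt we used.

import math
import numpy as np

REFERENCE = "Movie review for the film 'Die Hard'"

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for the embedding model (e.g. Luminous-Explore 13B)."""
    raise NotImplementedError

def token_logprob(prompt: str, completion: str) -> float:
    """Hypothetical stand-in: log-probability the LM assigns to `completion` after `prompt`."""
    raise NotImplementedError

def quality(review: str) -> float:
    """Quality feedback: cosine similarity between the review and the reference string."""
    a, b = embed(REFERENCE), embed(review)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentiment(review: str) -> float:
    """Diversity feedback: normalized probability of 'positive' vs. 'negative', in [0, 1]."""
    prompt = f"Review: {review}\nIs the sentiment of this review positive or negative? Answer:"
    p_pos = math.exp(token_logprob(prompt, " positive"))
    p_neg = math.exp(token_logprob(prompt, " negative"))
    return p_pos / (p_pos + p_neg)

The sentiment score is then discretized into bins to form the diversity dimension of the map.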

We use Aleph Alpha’s Luminous models as our language model variation operator, with LMX crossover, where each prompt consists of several movie review examples of the following form:

Here is a random example of a review for the movie “Die Hard”:
{review}
###

We compare our results from QDAIF against a simple baseline, which samples movie reviews using a fixed, hand-written 3-shot prompt. We then measure quality and diversity for each baseline generation in the same manner described above, fill out a map, and calculate QD scores to compare against QDAIF.

Here are some example results from our QDAIF evolution:

QDAIF, bin index 1/20 (negative sentiment), iteration 200:
So it should come as no surprise that the new Die Hard should be exactly the opposite.\nIt is nowhere near as good as the first Die Hard, which doesn’t have to be very good considering its amazing record, and the plot is.

QDAIF, bin index 19/20 (positive sentiment), iteration 100:
In the new Die Hard the hero, John McClane, finds himself in Europe to rescue his estranged wife, Holly, and his son, Jack, who has been kidnapped by Russian terrorists and taken to Moscow. All three are in danger.

QDAIF, bin index 19/20 (positive sentiment), iteration 500:
This is review for the movie Die Hard.

Interestingly, we can see that in some cases, the generated reviews converge to become more similar to the reference string used to define the fitness function, “Movie review for the film ‘Die Hard’”, rather than consisting of an actual movie review. This indicates a type of reward hacking, where the effect of optimizing the fitness function does not align with the intended goal of making texts similar to realistic movie reviews.

This highlights a general issue when using AI feedback, which is that increased optimization pressure against fixed AI-driven evaluations can lead to adversarial examples (i.e. discovering ways to score high on the evaluation by exploiting a weakness in the evaluation), thus motivating research into making AI feedback more robust, or understanding how much optimization can be applied before the AI evaluation can no longer be trusted. 

Figure 3: QD score history of comparable runs in the movie review writing domain. Mean and standard error are computed over 5 seeds for each run setting.

We see notable differences between QDAIF and the baseline in the resulting corpus of generated movie reviews. First, Figure 3 shows that QDAIF improves the map’s QD score more than the baseline does, especially with more evolution steps. In addition, the qualitative examples shown above illustrate the effects of optimization pressure on the movie reviews discovered by QDAIF. For more raw example movie reviews and subjective descriptions of them, please check out the supplementary material.

To study how our movie review setup generalizes to additional diversity axes, we also ran experiments with two diversity axes, adding a second axis that evaluates whether or not the generated movie review focuses on film characters. The overall number of niches in the resulting map is much higher than in the 1D case (400 vs 50).

Figure 4: An animation of the map fitness improving over time on our 2D movie review task. The “cross” pattern in the center is an artifact of our use of AI feedback for sentiment and topic alignment.

To obtain diversity feedback for this new dimension, we prompt the instruction-tuned model to evaluate whether or not the movie review focuses on film characters (by answering either “yes” or “no”). As with sentiment diversity, the score is the normalized probability of the model predicting one of the answer options (i.e. “yes”).

Exploring the whole map in this case is more difficult due to the greater number of bins (400) in this 2-dimensional map. However, the animation in Figure 4 shows that MAP-Elites can still successfully fill out the map with high quality examples.

Conclusion

Quality Diversity through AI Feedback (QDAIF) highlights the potential for building powerful search algorithms through LM feedback: algorithms that can explore and refine diverse possible solutions to nuanced qualitative problems, including creative domains like poetry. QDAIF achieves this by iteratively building upon what it has discovered so far. QDAIF can improve an LM’s baseline capabilities, and we believe it could also generate fine-tuning data to help a model improve.

We acknowledge that both the safety of generally capable AI and its potential existential risks pose real and significant challenges, and that open-ended AI algorithms may have risks particular to them, yet we believe that appropriately managed development of open-ended AI has the potential for tremendous societal benefit.

While experiments so far have been conducted only with language models, and in conjunction with relatively simple QD algorithms (like MAP-Elites), we are excited about applying QDAIF to multimodal models (and foundation models in general), and about inventing new kinds of open-ended search algorithms designed specifically with foundation models in mind. We believe that such algorithms are a potential remedy to a critical weak point of current LMs: their limited ability to generate useful and diverse new artifacts and solutions that extend far beyond the distribution of their training data.

We plan to develop this work further into a publication—coming soon! Supplementary material to this post can be found here.

Acknowledgements

To cite this blog post, please use:

H. Bradley, A. Dai, J. Zhang, J. Clune, K. Stanley, J. Lehman. (May 2023). Quality Diversity through AI Feedback. CarperAI Blog. https://carper.ai/quality-diversity-through-ai-feedback/.

@article{bradley2023qdaif,
  title="Quality Diversity through AI Feedback",
  author="Bradley, Herbie and Dai, Andrew and Zhang, Jenny and Clune, Jeff and Stanley, Kenneth and Lehman, Joel",
  journal="CarperAI Blog",
  year="2023",
  month="May",
  url="https://carper.ai/quality-diversity-through-ai-feedback/"
}

This work was a collaboration between CarperAI, Stability AI, and Aleph Alpha.

We acknowledge the efforts of Aleph Alpha’s contributors: 

  • Koen Oostermeijer (visualizations and analysis)
  • Marco Bellagente & Hannah Teufel (contributions and technical support for OpenELM)
  • Souradeep Nanda & Jan Zierstek (general feedback and guidance)