CarperAI is happy to announce the paper and 0.9 release of OpenELM! OpenELM is an open-source library that enables evolutionary search with language models in both code and natural language. 

This release is intended to be mostly feature-complete, and we plan to push a 1.0 release by the end of the summer. The OpenELM paper was published at GPTP 2023.

If you’d like to contribute to further development or experiments, or just have questions, please join the #openelm channel on the CarperAI Discord server!


ELM stands for Evolution Through Large Models, a technique from a 2022 OpenAI paper. The paper demonstrated that large language models can act as intelligent operators of variation in an evolutionary algorithm, enabling diverse, high-quality generation of text in domains not seen in the language model’s training set.

In recent years, LLMs have demonstrated the ability to critique and iteratively refine their own outputs. This capability can be leveraged to improve an LLM’s problem-solving ability, and it highlights the potential for LLMs to act as intelligent search operators in the space of language and code.

In this way, evolutionary algorithms can benefit from LLMs that provide an intelligent engine of variation in domains such as plain-text code generation (e.g. evolving pure Python code) and creative writing.

We build on this work to develop an extensive library, OpenELM, with the following goals:

  1. Release an open-source version of ELM with its associated diff models.
  2. Integrate with both open-source language models (run locally or on Colab) and with closed models via paid APIs, such as the OpenAI API.
    We want to support users with many different compute profiles!
  3. Provide a simple interface to a range of example environments for evolutionary search, to let users adapt these easily for their domain.
  4. Demonstrate the potential of evolution with LLMs.

OpenELM Features

LLM integration with evolutionary algorithms 

We focus primarily on quality-diversity (QD) algorithms such as MAP-Elites. These algorithms divide a behavior space into niches that together cover the range of possible solutions. They maintain diversity in the evolutionary population by keeping the best individual found in each niche: an elite can only be replaced by a fitter individual from the same niche, never by fitter individuals from other niches.

OpenELM supports MAP-Elites, CVT-MAP-Elites, and Deep Grid MAP-Elites, as well as a simple genetic algorithm baseline.
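To make the niche mechanism concrete, here is a minimal sketch of a MAP-Elites loop in plain Python. It is not OpenELM's implementation. The function names, the 1-D behavior space, the toy fitness function, and the Gaussian mutation are all assumptions chosen for illustration. In OpenELM, the mutation step would be carried out by a language model.

```python
import random

def map_elites(fitness, descriptor, mutate, seed, n_bins=10, steps=2000, rng=None):
    """Toy MAP-Elites over a 1-D behavior space with descriptor values in [0, 1]."""
    rng = rng or random.Random(0)
    archive = {}  # niche index -> (fitness, individual)

    def try_insert(individual):
        # Discretize the behavior descriptor into a niche index.
        niche = min(int(descriptor(individual) * n_bins), n_bins - 1)
        fit = fitness(individual)
        # An individual only competes with the current elite of its own niche.
        if niche not in archive or fit > archive[niche][0]:
            archive[niche] = (fit, individual)

    try_insert(seed)
    for _ in range(steps):
        # Pick a random elite as the parent, vary it, and try to place the child.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive

def gaussian_mutate(individual, rng):
    # Hypothetical mutation operator: add noise and clamp each value to [0, 1].
    return tuple(min(max(v + rng.gauss(0, 0.1), 0.0), 1.0) for v in individual)

# Toy problem: the first coordinate is the behavior, the second drives fitness.
archive = map_elites(
    fitness=lambda ind: -(ind[1] - 0.5) ** 2,
    descriptor=lambda ind: ind[0],
    mutate=gaussian_mutate,
    seed=(0.5, 0.0),
)
for niche in sorted(archive):
    fit, individual = archive[niche]
    print(niche, round(fit, 4), individual)
```

Each niche ends up holding the fittest individual that has landed in it, regardless of how fit the elites in other niches are.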

Figure 1: An abstract schematic of the OpenELM process. Green boxes indicate steps which, in OpenELM, may be wholly or partly carried out by a large language model, or an arbitrary combination of large language models. The "selection" box represents the genetic or quality-diversity algorithm itself. The evaluation and generation steps shown here are an example of using AI feedback to evaluate generated poems.

A flexible set of LLM-based evolutionary operators

The most basic LLM-based evolutionary operator we might imagine is prompt-based mutation, in which an LLM is simply prompted with an example of an individual (i.e. a program or piece of text) and instructed to modify it in some way.
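A prompt-based mutation operator can be sketched in a few lines. The template wording, the `prompt_mutation` helper, and the `fake_llm` stand-in below are hypothetical and are not part of OpenELM's API. Any callable that maps a prompt string to a completion string could take the place of `fake_llm`.

```python
from typing import Callable

# Hypothetical template: show the individual, then ask for a modified version.
MUTATION_TEMPLATE = (
    "Here is a Python program:\n\n"
    "{individual}\n\n"
    "Instruction: {instruction}\n"
    "Rewritten program:\n"
)

def prompt_mutation(individual: str, instruction: str, llm: Callable[[str], str]) -> str:
    """Ask the model to modify one individual and return its completion."""
    prompt = MUTATION_TEMPLATE.format(individual=individual, instruction=instruction)
    return llm(prompt).strip()

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return "def step():\n    return 2\n"

child = prompt_mutation("def step():\n    return 1", "Make the step larger.", fake_llm)
print(child)
```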

For code, the original ELM work demonstrated an alternative mutation operator: diff models. These models take in a piece of code and a commit message describing a desired change, and create a new piece of code with the change applied. We described how we trained our own diff models in a prior blog post, and OpenELM supports these models for code environments.
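The sketch below illustrates the inputs and outputs of a diff model. The special tokens in `build_diff_prompt` are an assumption about the prompt layout and should be checked against the model card of the diff model in use. `reference_diff` uses Python's standard `difflib` to produce a unified diff, which shows the kind of text such a model is expected to emit. Both helper names are hypothetical.

```python
import difflib

def build_diff_prompt(filename: str, before: str, commit_message: str) -> str:
    # ASSUMPTION: this tag layout is illustrative; verify it for your model.
    return f"<NME> {filename}\n<BEF> {before}\n<MSG> {commit_message}\n<DFF>"

def reference_diff(filename: str, before: str, after: str) -> str:
    """Unified diff from `before` to `after`, computed with the standard library."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=filename,
            tofile=filename,
        )
    )

before = "def area(r):\n    return 3.14 * r * r\n"
after = "import math\n\ndef area(r):\n    return math.pi * r * r\n"

prompt = build_diff_prompt("geometry.py", before, "Use math.pi instead of 3.14")
target = reference_diff("geometry.py", before, after)
print(prompt)
print(target)
```

In an evolutionary loop, the model's diff would then be applied to the parent program to obtain the child.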

Finally, what about combining multiple individuals via a crossover operator? In a separate paper, we developed a technique called LMX, inspired by few-shot prompting, demonstrating how LLMs can be effective crossover operators in an evolutionary search. OpenELM supports LMX along with several variations, such as crossover by sampling parents from only nearby cells in MAP-Elites.
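As a rough sketch of the idea, crossover in the style of LMX can be written as a few-shot prompt: the parents are listed as examples and the model's continuation is taken as the child. The helper names, the "Example: " prefix, the grid-distance rule for "nearby" cells, and the `fake_llm` stand-in are assumptions for illustration, not OpenELM's actual interface.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

def sample_nearby_parents(
    archive: Dict[Tuple[int, int], str],
    cell: Tuple[int, int],
    k: int,
    radius: int,
    rng: random.Random,
) -> List[str]:
    """Sample up to k elites whose grid cell lies within `radius` of `cell`."""
    near = [
        individual
        for (i, j), individual in archive.items()
        if max(abs(i - cell[0]), abs(j - cell[1])) <= radius
    ]
    return rng.sample(near, min(k, len(near)))

def lmx_crossover(parents: Sequence[str], llm: Callable[[str], str],
                  prefix: str = "Example: ", sep: str = "\n") -> str:
    """List the parents as few-shot examples and let the model write the next one."""
    prompt = sep.join(prefix + p for p in parents) + sep + prefix
    completion = llm(prompt)
    # Keep only the first generated line as the child.
    return completion.strip().split(sep)[0]

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return "a bright red fox\nExample: something else"

archive = {(0, 0): "a red fox", (0, 1): "a bright moon", (5, 5): "a cold sea"}
parents = sample_nearby_parents(archive, cell=(0, 0), k=2, radius=1, rng=random.Random(0))
child = lmx_crossover(parents, fake_llm)
print(parents, "->", child)
```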

Language model support, efficiency, and safety

OpenELM’s language models are instantiated as LangChain classes by default, which means that OpenELM can support practically any existing LLM API, as well as models run on your local GPU via Hugging Face Transformers.

We also provide optional NVIDIA Triton Inference Server support, intended for use cases where low latency on 8 or more GPUs is important. Finally, for code generation domains, we provide a sandbox environment, consisting of a container server backed by gVisor (a container runtime that introduces an additional barrier between the host and the container) as well as a heuristic-based safety guard.
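To give a flavor of what a heuristic safety check might look like, the sketch below statically scans generated Python for a few risky imports and calls before anything is executed. It is a hypothetical example and not the guard shipped with OpenELM. The blocklists are illustrative, and a check like this is easy to bypass, so it complements a sandbox rather than replacing one.

```python
import ast

# Illustrative blocklists; a real deployment would tune these carefully.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil", "sys"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def looks_unsafe(source: str) -> bool:
    """Return True if the code fails to parse or uses a blocked import or call."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True  # Reject code that cannot even be parsed.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BLOCKED_MODULES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                return True
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                return True
    return False

print(looks_unsafe("import os\nos.remove('x')"))
print(looks_unsafe("def f(x):\n    return x + 1"))
```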

Baseline Environments

We include a variety of environments with OpenELM to allow users to flexibly build on the library for their own use cases:

  1. Sodarace. Sodarace is a 2D physics-based simulation of robots moving across a variety of terrains, demonstrated in the ELM paper. These robots are created by Python programs generated by an LLM. This environment shows OpenELM’s ability to start with a single seed and bootstrap a language model to new capabilities in code domains.
  2. Image Generation. OpenELM can evolve over generated images by generating code that returns NumPy arrays containing the images. This serves as a simple test environment for code generation.
  3. Programming Puzzles. OpenELM can be used to generate diverse solutions to programming puzzles. This environment can be extended to co-evolve both the problem and the solution at the same time.
  4. Prompts. OpenELM contains a generic environment suitable for evolving prompts for language models, customizable to the desired domain with LangChain templates.
  5. Poetry. OpenELM includes a poetry environment that demonstrates the use of LLMs to evaluate both the quality and diversity of generated creative writing, as described in a recent CarperAI blog post on Quality-Diversity with AI Feedback (QDAIF).

Figure 2: Heatmap of elites from the prompt evolution environment. Shown here is a heatmap of the fitness for the map after 200 generations for the “largest animal” instruction-induction task. The shortest prompts (at the top of the map) are unable to effectively communicate the task and receive low fitness scores. The best prompts found are medium-length prompts in the middle of the map. The highest performing elites have neutral sentiment, but effective prompts are found spanning the entire range of sentiment.
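A map like the one in Figure 2 requires assigning each prompt to a cell based on its length and sentiment. The sketch below shows one way this binning could work. Measuring length in characters, the [-1, 1] sentiment range, the bin counts, and the `sentiment_fn` callable are all assumptions; in practice the sentiment score would come from a classifier or an LLM.

```python
from typing import Callable, Tuple

def to_bin(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a value in [low, high] to a bin index in [0, n_bins - 1], clamping outliers."""
    frac = (value - low) / (high - low)
    return min(max(int(frac * n_bins), 0), n_bins - 1)

def prompt_niche(
    prompt: str,
    sentiment_fn: Callable[[str], float],
    max_len: int = 200,
    n_bins: int = 10,
) -> Tuple[int, int]:
    """Return the (length bin, sentiment bin) cell for a prompt."""
    length_bin = to_bin(len(prompt), 0, max_len, n_bins)
    sentiment_bin = to_bin(sentiment_fn(prompt), -1.0, 1.0, n_bins)
    return (length_bin, sentiment_bin)

# Hypothetical sentiment scorer that treats every prompt as neutral.
neutral = lambda text: 0.0
print(prompt_niche("Name the largest animal.", neutral))  # (1, 5)
```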

Library Roadmap

1.0 Release

  • Integration of fine-tuning code with the evolutionary loop, both in the standard way and using LoRA adapters. This will allow users to continually specialize their LLMs to the target domain during evolution, which should greatly increase sample efficiency.
  • Improved inference efficiency via integration with DeepSpeed Inference, providing another acceleration option alongside Triton Inference Server.
  • Improved API for co-evolution of an environment with its population.
  • ReadTheDocs documentation.

We hope you will enjoy using OpenELM, and we look forward to seeing what the community does with it!