Written stories paired with critiques are a rich source of data for preference learning: a critique is an information-dense signal of a reader's preferences about story content. With Contrastive Anecdote Review Pretraining (CARP for short) we presented the Story-Critique dataset of passage/critique pairs (e.g. the passage "Geese are better then ducks" paired with the critique "Should be than not then"), along with the CARP model. CARP was trained on the Story-Critique dataset to produce embeddings of passages and reviews such that high similarity between a passage embedding and a review embedding generally indicates a review that fits the passage. With a measure of how well a review fits a story, we in turn get a measure of how well a story satisfies a given preference. The model and checkpoint are publicly available, and the dataset can be shared on request. Paper is available here.
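The core mechanic can be sketched with cosine similarity between embedding vectors. This is a minimal illustration only: the toy vectors below stand in for the outputs of CARP's learned passage and review encoders, which the real model produces from text.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for encoder outputs (hypothetical values; the real
# CARP encoders map text to learned embedding vectors).
passage_embedding = np.array([0.9, 0.1, 0.2])
fitting_review = np.array([0.8, 0.2, 0.1])     # review that fits the passage
unrelated_review = np.array([-0.1, 0.9, -0.3])  # review that does not fit

# A fitting review scores higher against the passage than an unrelated one.
assert cosine_similarity(passage_embedding, fitting_review) > \
       cosine_similarity(passage_embedding, unrelated_review)
```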
The direction we wanted to move in after CARP was to use its similarity scores to guide text generation with preferences. By embedding a passage and a preference (where the preference is treated as a review), we can use the similarity score as a reward for how well a model met the specified preference. For example, suppose we had the preference "the protagonist should be happy". This setup would naturally punish passages like "the goose was sad" and reward passages such as "the goose was happy". CARP alone was not capable of this, so we designed CARP-CoOp as a step towards this goal. The "CoOp" in the name stands for context optimization: rather than simply feeding predetermined reviews to CARP's review encoder to compare stories against, we use context optimization to tune the review before encoding it. Paper is available here.
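The similarity-as-reward idea can be sketched as follows. This is a toy, assumption-laden example: a bag-of-words embedder over a tiny hand-picked vocabulary stands in for CARP's neural encoders, and `preference_reward` is a hypothetical helper, not part of any released API.

```python
import numpy as np

def embed(text, vocab):
    # Hypothetical bag-of-words embedder standing in for CARP's learned
    # encoders: counts vocabulary words occurring in the text.
    return np.array([text.lower().split().count(w) for w in vocab], dtype=float)

def preference_reward(passage, preference, vocab):
    # Cosine similarity between passage and preference embeddings,
    # used as a scalar reward for how well the passage meets the preference.
    p, r = embed(passage, vocab), embed(preference, vocab)
    denom = np.linalg.norm(p) * np.linalg.norm(r)
    return float(p @ r / denom) if denom else 0.0

vocab = ["goose", "happy", "sad", "protagonist"]
pref = "the protagonist should be happy"

# The passage matching the preference earns the larger reward.
assert preference_reward("the goose was happy", pref, vocab) > \
       preference_reward("the goose was sad", pref, vocab)
```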
Collecting human preferences on machine-generated content at scale is hard. Several platforms exist for it, but we have found them insufficient for our research. To build something that works synergistically with CARP and the reinforcement learning setup we plan for Gyarados, we decided to make CHEESE. CHEESE's function is spelled out in its full name: a Coadaptive Harness for Effective Evaluation, Steering and Enhancement of content generation models. What sets CHEESE apart from other data labelling solutions is its interactivity and modularity: users can apply it to all sorts of human-in-the-loop setups, where labellers work together with a model to label data (coadaptive), score or label a model's generations directly (evaluation), or give the model immediate feedback on how to improve its generations (steering and enhancement).
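The evaluation mode described above amounts to routing a model's generations past a labeller and collecting the feedback. The sketch below is purely illustrative and does not use CHEESE's actual API; `collect_feedback` and the simulated labeller are hypothetical names invented for this example.

```python
from queue import Queue

def collect_feedback(generations, label_fn):
    # Generic human-in-the-loop labelling loop (illustrative only; CHEESE's
    # real interface differs). Each generation is queued, shown to a
    # labeller, and the resulting label is collected for later training.
    tasks = Queue()
    for g in generations:
        tasks.put(g)
    labelled = []
    while not tasks.empty():
        text = tasks.get()
        labelled.append((text, label_fn(text)))
    return labelled

# Simulated labeller standing in for a human: prefers happy geese.
data = collect_feedback(
    ["the goose was happy", "the goose was sad"],
    lambda t: 1 if "happy" in t else 0,
)
```

In a real deployment the `label_fn` would be an interactive client rather than a lambda, and the labels could feed either a reward model or direct steering of the generator.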
Typically, when you want to adapt a pre-trained generative model (e.g. a language model producing text), you need a large, curated dataset to fine-tune it on. In recent years an alternative approach has emerged: rewarding or punishing the model for its generations. There are many ways to set up this kind of reinforcement learning task, but we could not find a framework suitable for our experiments. This is why we are working on TRLX (Transformer Reinforcement Learning X), which builds on top of Hugging Face's existing transformers library to provide a scalable framework for reinforcement learning on large models.
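The reward-or-punish training loop can be illustrated at toy scale. Below, a one-token "language model" (a dictionary of token probabilities) is nudged by a REINFORCE-style update; this is a sketch of the general idea only, not TRLX's API, which handles full transformer policies and more sophisticated algorithms such as PPO.

```python
import random

def reinforce_step(policy, reward_fn, lr=0.1):
    # Toy REINFORCE-style update: sample a token from the policy,
    # score it with the reward function, and shift probability mass
    # towards (or away from) the sampled token.
    token = random.choices(list(policy), weights=policy.values())[0]
    reward = reward_fn(token)
    policy[token] = max(policy[token] + lr * reward, 1e-6)
    total = sum(policy.values())
    for t in policy:          # renormalize into a distribution
        policy[t] /= total
    return token, reward

random.seed(0)
policy = {"happy": 0.5, "sad": 0.5}
# Reward "happy" generations, punish "sad" ones.
for _ in range(200):
    reinforce_step(policy, lambda tok: 1.0 if tok == "happy" else -1.0)
# After training, the policy strongly prefers the rewarded token.
assert policy["happy"] > policy["sad"]
```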
In CodeCARP we aim to model the programming preferences one might have, such as preferring one kind of solution over another or favouring a certain design pattern. The project involves developing and releasing relevant code-critique datasets, along with training and releasing large language models for code. CodeCARP models will eventually be combined with other approaches at CarperAI (such as CARP-CoOp) to develop novel programming assistants. In this regard, we hope CodeCARP will allow for a more fluid experience akin to pair programming, compared to current programming assistants.