We at CarperAI are happy to announce a new release today: CHEESE, a Co-adaptive Harness for Effective Evaluation, Steering, and Enhancement. We hope it will simplify human feedback data collection with a simple API that can turn any Gradio experiment into a feedback collection platform. This v0.1 release includes:
- An API for collecting feedback from a Gradio demo.
- Numerous examples, such as text completion reranking, image selection, and design feedback.
- Documentation detailing how to use CHEESE for your own tasks.
Read on to learn about the value of collecting human feedback and what it can be used for.
Improving Models With Feedback
Recent advancements in reinforcement learning from human feedback (RLHF for short) have driven breathtaking improvements in natural language generation. Just a few weeks ago, ChatGPT took the world by storm, wowing not only the research community but also the general public with its ability to follow instructions and execute complex textual tasks with relatively high accuracy (granted, it is far from perfect). This is in contrast to older, purely self-supervised models (lacking human feedback), which struggled to carry out instructions and tended to respond to questions with further questions rather than useful answers.
While this blog post will focus on the human feedback (HF) component of RLHF, first we will review how HF can improve models through RL. Improving a pre-trained model on an already collected dataset of human feedback involves the following two steps: training a reward model to provide a reward signal approximating human preferences and then fine-tuning the pre-trained model with RL on that signal. For more specifics on this process, you can check out this blog post from our friends at Hugging Face. For the RL training, several frameworks exist, including ours—TRLX, which was inspired heavily by TRL—and AI2’s framework RL4LMs. As for training the reward model, stay tuned!
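To make the reward-modeling step concrete, here is a minimal sketch of pairwise reward-model training with Hugging Face Transformers, assuming feedback has already been collected as chosen/rejected completion pairs. The gpt2 backbone, hyperparameters, and toy data below are illustrative choices, not a prescription for any particular project:

```python
# Minimal sketch of pairwise reward-model training (illustrative only).
# Assumes human feedback is available as (chosen, rejected) completion pairs.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_loss(chosen_texts, rejected_texts):
    # Score both completions and push the chosen score above the rejected one
    # (a Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected)).
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One toy update step; a real run would iterate over a collected dataset.
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
loss = pairwise_loss(["Q: What is 2+2? A: 4"], ["Q: What is 2+2? A: What is 4?"])
loss.backward()
optimizer.step()
```

The trained model's scalar scores then serve as the reward signal for the RL fine-tuning stage in a framework such as TRLX.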
While typically RLHF is used to improve language generation, we don’t think we should stop there. Any model that generates content stands to improve from human feedback. Most weaknesses of self-supervised generative models boil down to them not understanding intent, unless we speak their language and engineer our prompts so as not to be misunderstood. Some may find it fun to design complex and creative prompts, but for the average person this creates an unnecessary barrier to using generative models. Reducing this friction can make these models useful and accessible for more people.
The Hard Part
Training a reward model and then fine-tuning another model with the reward signal is well-trodden ground, but obtaining the necessary human feedback poses two main challenges. First, there is the financial cost of recruiting and paying human participants.
Getting feedback on art or question answering is one thing, but what if you need experts for a domain like medical imaging? The expenses can scale quite quickly, which creates a significant obstacle for independent researchers. One alternative is to collect data approximating human feedback from usage statistics (i.e. having users rate your models' generations in production) or by scraping publicly accessible forums that score submissions. Saving on contractor costs is appealing; however, in doing so you lose control over the type of feedback you receive.
The second challenge occurs when you are designing your annotation platform for collecting feedback. You want to minimize the risk of evaluators giving underspecified or overspecified feedback. This corresponds to asking for too little or too much information. The latter may sometimes seem appealing as more specific feedback could provide a stronger signal; however, it is also riskier as it makes the tasks longer and more challenging for the participants, resulting in a smaller overall dataset.
To solve this, a thoughtful UI is required for the annotation platform. A good interface should be as simple, intuitive, and engaging for participants as possible; however, it is not immediately apparent how to achieve this. This is likely why the value of a UX expert in machine learning has grown considerably in recent years. For corporations, this might mean hiring such an expert, but for independent researchers that is often not an option. There are many general-purpose labeling frameworks, but none that is fully open source and tailored for RLHF. This is where CHEESE comes in!
Easy UIs and CHEESE
Another place where fast, simple, and sleek UIs come in handy is for demoing trained models. Currently, the most popular API for creating demos is Gradio. Given its popularity, ease of use when working with generative models, and open source nature, we decided to build a framework for collecting human feedback around it: CHEESE! CHEESE can take any demo made using Gradio and turn it into an annotation platform for any kind of data. Provided you write a pipeline and a Gradio demo, CHEESE is very simple to use for a wide variety of tasks.
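To give a feel for the kind of Gradio demo CHEESE wraps, here is a minimal, self-contained rating demo. Note that this is not CHEESE's API: the generate() stub and the JSONL output below are placeholders of our own, and in practice CHEESE's pipeline and frontend classes take care of routing work to annotators and storing the results (see the documentation and examples in the repository for the actual interfaces):

```python
# A stand-alone Gradio rating demo of the kind CHEESE can turn into a full
# annotation platform. The generate() stub and the JSONL sink are
# illustrative placeholders, not part of CHEESE.
import json
import gradio as gr

def generate(prompt: str) -> str:
    # Placeholder for a real model call (e.g. a Hugging Face text-generation pipeline).
    return prompt + " ... (model completion goes here)"

def rate(prompt: str, completion: str, rating: int) -> str:
    # With CHEESE, feedback like this is collected into a structured dataset;
    # here we simply append it to a local JSONL file.
    with open("feedback.jsonl", "a") as f:
        f.write(json.dumps({"prompt": prompt, "completion": completion, "rating": rating}) + "\n")
    return "Feedback recorded."

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    completion = gr.Textbox(label="Completion")
    rating = gr.Slider(1, 5, step=1, label="Rating")
    status = gr.Textbox(label="Status")
    gr.Button("Generate").click(generate, inputs=prompt, outputs=completion)
    gr.Button("Submit feedback").click(rate, inputs=[prompt, completion, rating], outputs=status)

demo.launch()
```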
Beyond evaluating models with CHEESE, one can also use the acquired feedback to enhance or steer them. It is even possible to do this iteratively, further fine-tuning an already enhanced model after deploying it and collecting updated feedback. At Carper we use our TRLX framework for the enhancement component, since it has already proven scalable and modular enough to work well for large models and will soon be used for our Open-Instruct project. As an example use case for CHEESE, we can take a look at a model that generates architectural designs from text prompts.
RLHF for Design with Architext
The creation of a generative architectural design model democratizes design expertise, allowing everyone to generate valid, useful, and interesting design outputs. This is what Architext sets out to do. To this end, it works off of something anyone can use: language. By fine-tuning pretrained language models, Architext enables semantic generation of designs. It takes a language prompt representing a high-level description of some design, then tries to generate the appropriate geometry for that prompt. This ability to go from language to a structured representation (i.e. geometry) allows us to apply the model to a variety of downstream design applications and software.
In its current stage, Architext is limited to the design domain of residential floor plans. While it was trained on a modest synthetic dataset (~250,000 designs) covering a narrow design space, with only a handful of language annotations as prompts, the power of LLMs allows Architext’s generations to be diverse and creative, making it a great candidate for generative design.
The next step for Architext is to improve the model's robustness against complex and diverse prompts. Ideally, any prompt should be able to create a relevant design. To this end, we need more data paired with a large dataset of semantic annotations. Unfortunately, neither of these is found easily in the architecture domain. Datasets are scarce, rarely open sourced, and not very diverse. Annotations don’t really exist, as the field never anticipated their potential use in generative workflows. The procedure by which one would produce these annotations is rather simple: present a design, then elicit feedback of some kind. Is the design aesthetically pleasing? Does it satisfy a certain constraint? If not, what changes would be needed to satisfy the constraint? Human feedback opens up new and exciting fields of research and design is a great domain to explore it further.
That being said, feedback can be difficult to collect. One challenge is finding a proper interface that can handle collection and curation of the desired feedback. CHEESE works great here. It allows the user to easily interact with Architext models through an intuitive interface and captures a variety of human feedback. Our example in the repository shows how we use CHEESE to collect scalar rewards, open-ended design critiques, and satisfaction of specific design constraints. All this requires is defining a data class, pipeline, model, and frontend. CHEESE not only collects our data but curates it in a structured format, allowing the development of expert feedback datasets that can help us train better task-specific models across a plethora of design applications.
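As a rough sketch of what that data class might carry for the Architext task, consider the following; the field names are placeholders of our own, and the repository example defines the actual data class, pipeline, and frontend:

```python
# Illustrative feedback record for the Architext example. Field names are
# placeholders, not the data classes used in the repository's example.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DesignFeedbackElement:
    prompt: str                                  # language prompt given to Architext
    layout: Optional[str] = None                 # generated floor-plan geometry, serialized
    rating: Optional[int] = None                 # scalar reward, e.g. overall quality on a 1-10 scale
    critique: Optional[str] = None               # open-ended design critique
    constraints_met: List[bool] = field(default_factory=list)  # per-constraint satisfaction

# A pipeline populates prompt and layout from the model, the frontend shows
# them to the annotator, and the annotator's rating, critique, and constraint
# judgments are written back into the element before it is saved.
```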
With the release of CHEESE, we hope to empower independent researchers in collecting human feedback quickly and efficiently, so that any generative model can be improved with feedback.