CarperAI is releasing a series of diff models: models trained to predict a code diff, using millions of commits scraped from GitHub. We are releasing 3 models of different sizes, all fine-tuned from Salesforce’s CodeGen code synthesis models.

The dataset of diffs we scraped to train these models will be released separately in the near future. We hope these models will be useful for suggesting intelligent changes to existing code, controllable through a specific commit message describing the change. We will continue to iterate on our diff models, so stay tuned for further releases.

Read on for more details on how the models were trained, with benchmark results!

Introduction

A diff model is an autoregressive language model trained on edits to a piece of text, formatted in Unified Diff Format. These diff models can suggest, given a section of text and a description of the desired change, an intelligent change to the text that fits the description, marking the lines added, changed, and deleted in diff format. The primary use case for these models is for suggesting changes to code—as such, the models we are releasing are fine-tuned versions of models already trained on code datasets.
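For readers unfamiliar with Unified Diff Format, here is a minimal sketch that produces one with Python's standard difflib module. The file name and the two versions of the function are made up for illustration; our dataset used diffs taken from real commits, not difflib.

```python
import difflib

# Hypothetical "before" and "after" versions of a tiny file.
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"


def make_unified_diff(before: str, after: str, filename: str) -> str:
    """Return a unified diff turning `before` into `after`."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=filename,
        tofile=filename,
    )
    return "".join(lines)


diff_text = make_unified_diff(before, after, "math_utils.py")
print(diff_text)
```

Lines starting with "-" are removed, lines starting with "+" are added, and the "@@ ... @@" line gives the line ranges the hunk covers in the old and new file.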

In comparison to few-shot prompting of normal code generation models, diff models are specialized for suggesting intelligent changes to existing code, particularly for longer pieces of code and for cases where the change must follow a natural language description (provided in the form of a commit message).

Prior work by Microsoft Research (Li et al., 2022) and OpenAI (Ray and McCandlish, 2020 [1]; Lehman et al., 2022) identified the potential for diffs as a source of rich data on how to make changes to code, and trained models on diffs, but did not release any diff models or publish an analysis of how to obtain good performance.

[1] Alex Ray and Sam McCandlish, OpenAI. Independent contribution: Training diff models, 2020.

A Diff Dataset

Our dataset for this fine-tune consists of commits from GitHub, obtained using the Google BigQuery Public Dataset, a public, up-to-date snapshot of a huge number of open-source GitHub repositories. We filtered this dataset in BigQuery by repository star count, excluding repos with fewer than 100 stars, and further restricted the query to repositories with open-source non-copyleft licenses (e.g. MIT, Apache) and to commits with more than 10 characters in the commit message. We also restricted ourselves to a list of 22 popular programming, scripting, and markup languages, including Python, HTML, Bash scripts, SQL, and C++. This resulted in a dataset of 19 million commits after filtering.
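As a rough illustration of the filtering criteria above, the same rules can be written as a Python predicate. This is only a sketch: the actual filtering was done in BigQuery, the CommitRecord fields are hypothetical names, and the license and language sets are small illustrative subsets rather than the lists we used.

```python
from dataclasses import dataclass

# Illustrative subsets only; the real lists are not reproduced here.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}
LANGUAGES = {"Python", "HTML", "Shell", "SQL", "C++"}


@dataclass
class CommitRecord:
    """Hypothetical container for the commit metadata used in filtering."""
    repo_stars: int
    license: str
    language: str
    message: str


def keep_commit(commit: CommitRecord) -> bool:
    """Apply the four filters: stars, license, language, message length."""
    return (
        commit.repo_stars >= 100                     # drop repos with fewer than 100 stars
        and commit.license.lower() in PERMISSIVE_LICENSES
        and commit.language in LANGUAGES
        and len(commit.message) > 10                 # more than 10 characters
    )
```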

At this point we had the commit hashes, repository names, and other metadata for the commits we wanted in our dataset. We then ran git clone on every repository in our dataset and used a Python script to obtain the raw code files before each diff was applied, together with the diff itself in Unified Diff Format. These were processed into Apache Parquet format using Dask with Apache Arrow to load them efficiently into a dataframe, with one row per file changed (e.g. a diff affecting multiple files was split up). We kept only rows where the file plus diff was short enough to fit into the context of the language model.
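The "one row per file changed" step can be sketched as follows. This assumes the commit diff is in the format printed by git diff, where each file's section starts with a "diff --git" line; the function name is hypothetical and this is not our actual processing script.

```python
from typing import List


def split_diff_by_file(diff_text: str) -> List[str]:
    """Split a multi-file git diff into one diff string per file.

    Assumes every per-file section begins with a 'diff --git ' line,
    as in the output of `git diff` or `git show`.
    """
    chunks: List[str] = []
    current: List[str] = []
    for line in diff_text.splitlines(keepends=True):
        # A new file section starts: flush the previous one.
        if line.startswith("diff --git ") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


example = (
    "diff --git a/a.py b/a.py\n--- a/a.py\n+++ b/a.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
    "diff --git a/b.py b/b.py\n--- a/b.py\n+++ b/b.py\n@@ -1 +1 @@\n-y = 1\n+y = 2\n"
)
per_file = split_diff_by_file(example)
```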

From there, we processed the dataset into EleutherAI’s lm_dataformat, a utility to create compressed data files for efficient language model training. The final format of the data seen by the language model consisted of the filename changed by the diff, the file before changes, the commit message, and the diff itself, all concatenated together with delineating tags in between:

<NME> {filename}
<BEF> {file_before_changes}
<MSG> {commit_message}
<DFF> {diff}

The model is then typically prompted with everything up to <DFF>, but you can also optionally include the section heading of the unified diff format (the hunk header) immediately after <DFF>, which specifies exactly which lines the model should change. For example, appending @@ -1,3 +1,9 @@ after the diff tag would instruct the model to produce a hunk starting at line 1 that spans 3 lines in the original file and 9 lines in the new file, i.e. a net addition of 9 - 3 = 6 lines. We do not add these four tags as special tokens, since we prioritized leaving the tokenizer unchanged.
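The prompt layout described above can be sketched in a few lines of Python. The function names are hypothetical, and the exact whitespace between a tag and its content is an assumption based on the template shown; check it against the released models before relying on it.

```python
import re
from typing import Optional

# Matches a unified diff hunk header such as "@@ -1,3 +1,9 @@".
HUNK_RE = re.compile(r"@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def build_prompt(filename: str, file_before: str, commit_message: str,
                 hunk_header: Optional[str] = None) -> str:
    """Concatenate the fields with the four tags, stopping at <DFF>."""
    prompt = (
        f"<NME> {filename}\n"
        f"<BEF> {file_before}\n"
        f"<MSG> {commit_message}\n"
        f"<DFF>"
    )
    if hunk_header is not None:
        # Optionally steer the model towards specific lines.
        prompt += f" {hunk_header}"
    return prompt


def net_lines_added(hunk_header: str) -> int:
    """Return new_length - old_length for a hunk header."""
    match = HUNK_RE.match(hunk_header)
    if match is None:
        raise ValueError(f"not a hunk header: {hunk_header!r}")
    # A missing length means the range covers exactly one line.
    old_len = int(match.group(2)) if match.group(2) is not None else 1
    new_len = int(match.group(4)) if match.group(4) is not None else 1
    return new_len - old_len


prompt = build_prompt("parity.py", "def parity(b1, b2, b3, b4):\n    ...",
                      "Fixed bugs", hunk_header="@@ -1,3 +1,9 @@")
```

Here net_lines_added("@@ -1,3 +1,9 @@") evaluates to 6, matching the arithmetic in the example above.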

The final dataset consisted of 1.4 million files from 19 million commits, which resulted in 1.086 billion tokens after tokenizing with a modified GPT-2 tokenizer to include whitespace tokens—an average of 888 tokens per sample.

Fine-tuning CodeGen

The model suite we worked with as a base was Salesforce’s CodeGen series of models, which are decoder-only transformer language models trained to predict the next token in a sequence. These models were first pre-trained on The Pile, an 800GB dataset of diverse text released by EleutherAI, and then further trained on a large dataset of permissively licensed code from GitHub BigQuery in 6 programming languages, before finally being trained on Python-only code from the same source. Note that the code in these pre-training datasets will inevitably overlap to some degree with our diff dataset, although they do not contain diffs.

Salesforce have released variants of their models at 4 scales (350M, 2B, 6B, and 16B parameters) with 3 variants at each scale corresponding to the 3 different stages of pre-training described above. We chose to fine-tune the “mono” variants at each model scale, meaning the version trained on Python-only code in addition to multi-language code.

To fine-tune these models on our diff dataset, we used HuggingFace’s standard fine-tuning script with slight modifications to accommodate CodeGen’s architecture, using the default hyperparameters and without freezing any layers. To pre-process the data we concatenated each sample (file with changes) together in the format described above and cut the result into chunks of 2048 tokens (the context length of the CodeGen models). We then fine-tuned all of the model sizes with this dataset as an initial trial run and baseline for further experiments. For all fine-tuning experiments in this post, we used 64 Nvidia A100 GPUs. We thank Stability AI for access to their compute resources!
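The concatenate-and-chunk pre-processing can be illustrated with a minimal sketch over lists of token ids. Two details here are assumptions rather than a record of what we ran: the trailing partial chunk is dropped, and no separator token is inserted between samples.

```python
from typing import Iterable, List


def chunk_token_stream(samples: Iterable[List[int]],
                       chunk_size: int = 2048) -> List[List[int]]:
    """Concatenate tokenized samples and cut them into fixed-size chunks."""
    stream: List[int] = []
    for token_ids in samples:
        stream.extend(token_ids)
    # Assumption: drop the final partial chunk so every chunk is full length.
    usable = (len(stream) // chunk_size) * chunk_size
    return [stream[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Tiny example with chunk_size=4 instead of 2048.
chunks = chunk_token_stream([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_size=4)
```

Note that with this scheme a single sample can straddle two chunks, so a chunk does not always begin with a <NME> tag.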

To test a range of hyperparameters, we did a 12-run sweep with the 350M model across a range of learning rates and batch sizes, and settled on a learning rate of 3e-5 and a batch size of 1024 samples.

We then experimented with masking tokens in the loss computation, as described in the ELM paper. Specifically, we include only the tokens in the diff (including the tag <DFF>) in the loss, which is intended to encourage the model to predict the diff and not memorize the file and commit message. At inference time, the filename in <NME>, the file contents in <BEF>, and the commit message in <MSG> are all supplied by the prompt, and the diff following <DFF> is the only thing the model needs to generate. It is therefore natural to exclude the tokens before <DFF> from the loss computation. We fine-tuned the full suite of models with this modification to compare the results across model scale.
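A minimal sketch of this loss masking, in plain Python over token ids, is shown below. It relies on the convention that label positions set to -100 are skipped by the loss (this is the default ignore_index of PyTorch's CrossEntropyLoss, which HuggingFace causal language models also follow). The function names and the example token ids are hypothetical. Since the tags are not special tokens, <DFF> is assumed to appear as a short sequence of ordinary token ids.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_subsequence(seq: Sequence[int], pattern: Sequence[int]) -> int:
    """Return the first index where `pattern` occurs in `seq`, or -1."""
    n, m = len(seq), len(pattern)
    for i in range(n - m + 1):
        if list(seq[i:i + m]) == list(pattern):
            return i
    return -1


def mask_before_diff(input_ids: Sequence[int],
                     dff_tag_ids: Sequence[int]) -> List[int]:
    """Build labels that keep only <DFF> and the tokens after it."""
    start = find_subsequence(input_ids, dff_tag_ids)
    if start < 0:
        # Assumption: with no diff tag present, nothing contributes to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    return [IGNORE_INDEX] * start + list(input_ids[start:])


# Hypothetical ids: 90, 91 stand in for the tokens of "<DFF>".
labels = mask_before_diff([5, 6, 7, 90, 91, 8, 9], dff_tag_ids=[90, 91])
```

The input ids are left untouched, so the model still attends to the file and commit message; only the training signal is restricted to the diff.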

File Truncation

We also experimented with different ways of truncating the file before changes to fit more of it into the context length. Without any truncation, roughly half of the files in the original dataset fit into the 2048-token context length, for a total of 1.086 billion tokens. If we crop the file before changes to contain only the lines that appear in the diff, we can then fit 95% of the original dataset in the context, for a total of 2.181 billion tokens (see Figure 1). We hoped that including the extra data at the cost of some context in the file being changed would improve the model’s performance. However, we found that this experiment resulted in a model significantly worse than without truncation, likely because being able to see an entire class/function that a change relies on is important for modelling.
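One way to implement this kind of cropping is sketched below: read the old-file line range from each hunk header and keep only those lines of the original file. This is an illustrative reconstruction, not our exact script; in particular, it assumes that "the lines in the diff" means the full range each hunk covers, including its unchanged context lines.

```python
import re

# Captures the old-file start line and optional length from a hunk header.
OLD_RANGE_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@", re.MULTILINE)


def crop_to_hunks(file_before: str, diff_text: str) -> str:
    """Keep only the lines of `file_before` covered by the diff's hunks."""
    keep = set()
    for match in OLD_RANGE_RE.finditer(diff_text):
        start = int(match.group(1))
        # A missing length means the hunk covers exactly one line.
        length = int(match.group(2)) if match.group(2) is not None else 1
        keep.update(range(start, start + length))
    lines = file_before.splitlines(keepends=True)
    # Unified diff line numbers are 1-indexed.
    return "".join(line for i, line in enumerate(lines, start=1) if i in keep)


file_before = "a\nb\nc\nd\ne\nf\n"
diff_text = "@@ -2,3 +2,3 @@\n b\n-c\n+C\n d\n"
cropped = crop_to_hunks(file_before, diff_text)
```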

Results

To evaluate our models, we test their bug fixing capabilities on two tasks: 4-Parity, a simple toy benchmark where the model is required to fix basic bugs in a Python function to calculate the parity of a 4-bit sequence, and a more complex dataset of many synthetic and real Python bugs scraped from GitHub repositories by He et al. (2022). These benchmarks provide a simple testbed for whether diff LLMs can make multiple coordinated and effective changes to code.

For 4-Parity, we generate completions using a prompt consisting of the original function followed by the commit message <MSG> # Fixed bugs. We generate 3200 completions for each model, apply the resulting diff patches to the original function, execute the generated code, and report the percentage of generations where the generated 4-Parity function is correct across all test cases, at the best model temperature from {0.7, 0.8, 0.9}. We report results across 1-5 bugs synthetically introduced to the original function.
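The correctness check for a patched function can be sketched as follows. This is a simplified stand-in for our evaluation harness: the function name parity, its four-argument signature, and the helper names are assumptions, and applying the diff patch is omitted. Executing model-generated code with exec is unsafe outside a sandbox, and this sketch has no timeout for infinite loops.

```python
from itertools import product
from typing import List


def reference_parity(b1: int, b2: int, b3: int, b4: int) -> int:
    """Parity of a 4-bit sequence: 1 if the number of set bits is odd."""
    return (b1 + b2 + b3 + b4) % 2


def is_correct_parity(source: str, func_name: str = "parity") -> bool:
    """Run candidate source and compare it to the reference on all 16 inputs."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: only run untrusted code in a sandbox
        candidate = namespace[func_name]
        return all(
            candidate(*bits) == reference_parity(*bits)
            for bits in product((0, 1), repeat=4)
        )
    except Exception:
        # Syntax errors, missing function, or runtime errors count as failures.
        return False


def pass_percentage(sources: List[str]) -> float:
    """Percentage of candidate programs that are fully correct."""
    if not sources:
        return 0.0
    return 100.0 * sum(is_correct_parity(s) for s in sources) / len(sources)


good = "def parity(b1, b2, b3, b4):\n    return sum([b1, b2, b3, b4]) % 2\n"
buggy = "def parity(b1, b2, b3, b4):\n    return sum([b1, b2, b3]) % 2\n"
broken = "def parity(b1, b2, b3, b4)\n    return 0\n"
```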

For the latter task of real Python bugs, we filter the dataset down to 1000 bugs across several bug classes (e.g. a wrong binary operator or an incorrect variable name). We generate a diff for each bug and measure the exact string match accuracy between the generated function after applying the diff and the correct (bug-free) function. The commit message for this task is Fix {bug_class}, where the bug class might be, for example, “incorrect binary operator”. Note that in this case we do not execute the generated code to test it, since these bugs are scraped from many different GitHub repositories and execution would be impractical.
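Exact string match accuracy is simple to state in code. The sketch below compares strings verbatim; whether any whitespace normalization was applied before comparison is not specified here, so none is assumed, and the function name is hypothetical.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Fraction of patched functions identical to the bug-free function."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


accuracy = exact_match_accuracy(
    ["return a + b", "return a - b", "return x", "return y"],
    ["return a + b", "return a * b", "return x", "return y"],
)
```

Under this metric a patch that fixes the bug but also changes anything else in the function counts as incorrect.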

The results from 4-Parity, shown in Figure 2, demonstrate that our diff models can perform basic bug fixing with skill comparable to the CodeGen models, which are prompted with the bugged function followed by #Fixed bugs. There is a clear performance increase with scale, and the 350M diff model performs better at the bug fixing task. We can also see that the loss masking approach described above results in significantly better diff models on this task.

Table 1 shows the results from our diff models on the synthetic + real bugs benchmark, using exact match accuracy as a metric, where a generated solution is correct if it fixes the bug and otherwise does not change the program. We can see that the masked diff models perform slightly better

Qualitatively, we also evaluated the accuracy of the line numbers in the generated diff hunk, and noticed that the larger models do very well at generating line numbers that correspond to the lines the diff below actually changes. This opens the door to prompting the model with specific line numbers to change, add, or remove, allowing for more control over the code generation in comparison with a non-diff model.

We also noticed that diff models (especially the 2B and 6B) tend to do better when prompted with longer code generation tasks (such as fixing bugs in a large function), and that varying the prompt induces greater diversity in generated code in comparison with the normal CodeGen models.

In further work, we hope to examine in greater detail the enhanced diversity and localised mutation abilities that diff models offer over standard code generation models, across many model scales.

Accelerated Inference with Triton and FasterTransformer

We also investigated the use of Nvidia’s FasterTransformer (FT) framework with the Triton Inference Server using an FT backend to achieve significantly accelerated inference. FasterTransformer is a collection of fused CUDA kernels optimized for inference, written in C++. The Triton Inference Server is an optimized system for serving large language models at scale, in both multi-GPU and multi-node setups using Docker containers.

Converting the CodeGen models to FT involved significant technical work, since CodeGen is not supported natively in FT. We first converted the CodeGen weights to GPT-J format via a linear algebra trick, since GPT-J has a very similar architecture, building on Brendan Dolan-Gavitt’s work with the Fauxpilot framework. From there, we used the FT script to convert the GPT-J HuggingFace checkpoint into FT’s format, which can be run with the Triton server. We struggled to get this to run on our cluster (which does not use Docker), but eventually succeeded and achieved a significant speedup on inference of our models—in some cases up to an order of magnitude faster.

Our scripts to convert and run these models with FasterTransformer and Triton are available in the OpenELM library.

We hope that this work inspires others to take our models and experiment with the potential of diff-based code generation!

To cite this blog post, please use the following entry:

H. Bradley, H. Fan, H. Saini, R. Adithyan, S. Purohit, and J. Lehman. (Jan 2023). Diff Models - A New Way to Edit Code. CarperAI Blog. https://carper.ai/diff-model/.

Or

@article{bradley2023diffmodels,
title   = "Diff Models - A New Way to Edit Code",
author  = "Bradley, Herbie and Fan, Honglu and Saini, Harry and Adithyan, Reshinth and Purohit, Shivanshu and Lehman, Joel",
journal = "CarperAI Blog",
year    = "2023",
month   = "Jan",
url     = "https://carper.ai/diff-model/"
}

Change Log: Changed y-axis on Figure 2 to be clearer.

Acknowledgements

The CarperAI diff models team consisted of Herbie Bradley, Honglu Fan, Harry Saini, Reshinth Adithyan, Shivanshu Purohit, and Joel Lehman.

We thank Stability AI for providing compute resources.