Will AI Extinguish Code Developers?

"I don't have time to write all of those individual reports", my aunt told me a couple of weeks ago on a family dinner, "I let ChatGPT write the reports and then I just make some small adjustments to them. Way easier." She even shoulder shrugged like it was nothing while saying it. This indicates how well both OpenAI and Microsoft have succeeded at making their GPT models a commodity. After all LLMs are amazing, right? Or is there also a darker side? A fear that those who fail to board the AI rocket ship will be outperformed by their peers, or even worse replaced by a "machine"? In this post, we zoom in on one profession in particular that might be at risk: the code developers themselves.

AlphaCode 2, you say?

In order to do that, we have to look at the recent developments in the code generation corner of the LLM landscape. And while OpenAI's Codex model is ubiquitously present in solutions like GitHub Copilot, Pygma and Replit, there is a new kid on the block: AlphaCode 2. With the second generation of their AlphaCode model series, Google DeepMind does what they do best: impress.

So what is this AlphaCode 2, you might ask? Well, for starters, AlphaCode 2 is not a single model, but rather an entire system. Google themselves call it a "family of policy models", where each model is a Gemini (Pro) variant tuned with different hyperparameters. When given a coding problem, each of these individually tweaked models generates potential solutions to that problem - and collectively, we are talking about hundreds, thousands or even millions of potential solutions. According to their own technical report, Google then filters out code that doesn't compile or doesn't solve the problem (about 95% of the samples) and code that is too similar (by means of a clustering algorithm). Finally, a scoring or verification model is used to select the best solution out of the remaining samples. This scoring model is in fact another fine-tuned Gemini Pro model.

So in short, AlphaCode 2 generates a ton of different solutions, filters out code that doesn't work or is too similar, and then uses a scoring model to select the best solution out of the remaining candidates. I leave the judgement of whether this approach is completely genius or completely overkill to you, but what I will say is that Google is not the only one looking into this way of solving problems. Ironically, OpenAI - Google's biggest competitor in the GenAI space - actually coined this way of working in their own well-known "Let's Verify Step By Step" and "Training Verifiers to Solve Math Word Problems" papers. In those papers, they describe how this setup of working with a verifier leads to a performance boost similar to a 30x model size increase, and that verifiers are significantly easier to scale with increased data.
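To make that generate-filter-verify loop a bit more concrete, here is a minimal Python sketch of the idea. To be clear, this is not DeepMind's implementation: the function names (sample_candidate, passes_tests, behaviour_signature, verifier_score) and all numbers are invented placeholders standing in for the tuned Gemini Pro policy models, the execution filter, the clustering step and the Gemini Pro scoring model described above.

import random
from collections import defaultdict

# Toy stand-ins for the real components; everything here is a placeholder so
# the sketch runs on its own.
def sample_candidate(problem, model_id):
    return "def solve(x): return x * %d  # from policy model %d" % (random.randint(1, 5), model_id)

def passes_tests(candidate, problem):
    # Stage 1: drop candidates that don't compile or don't solve the sample
    # tests (the technical report mentions ~95% of samples are removed here).
    return random.random() > 0.95

def behaviour_signature(candidate):
    # Stage 2: cluster candidates that behave the same, e.g. by the outputs
    # they produce on generated test inputs; faked here via the constant used.
    return candidate.split("return x * ")[1][0]

def verifier_score(candidate, problem):
    # Stage 3: a separate scoring model ranks one representative per cluster.
    return random.random()

def alphacode2_style_solve(problem, n_policy_models=10, samples_per_model=100):
    # 1) Sample many candidate programs from a family of policy models.
    candidates = [sample_candidate(problem, m)
                  for m in range(n_policy_models)
                  for _ in range(samples_per_model)]
    # 2) Keep only the candidates that survive the execution filter.
    survivors = [c for c in candidates if passes_tests(c, problem)]
    # 3) Group behaviourally similar candidates and keep one per cluster.
    clusters = defaultdict(list)
    for c in survivors:
        clusters[behaviour_signature(c)].append(c)
    representatives = [group[0] for group in clusters.values()]
    # 4) Let the verifier pick the highest-scoring representative.
    return max(representatives, key=lambda c: verifier_score(c, problem), default=None)

print(alphacode2_style_solve("multiply x by some constant"))

The takeaway from the sketch is that sampling more candidates only pays off if the filter and the verifier are good enough to find the needle in the haystack - which is exactly why the scoring model is a fine-tuned model in its own right.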

How does it perform?

First of all, there's the performance on the Codeforces platform. For those who are unfamiliar with it, Codeforces hosts competitive programming contests that usually contain very challenging problems. As a reference point: GPT-4 could solve 0/10 of the easiest recent problems that were not in its training data, in other words problems it had never seen before. In contrast, it could solve 10/10 of the pre-2021 problems - problems the model had already seen during training. This strongly points to data contamination, as reported by Horace He in a post on X.

Source: Horace He - X

So now that we have an anchor point, let's see how AlphaCode 2 performs. And honestly, it blew GPT-4 out of the water: AlphaCode 2 solved 43% of the problems within 10 attempts and thereby performed better than 87% of the competition participants. Interesting to note is that the solve rate kept climbing even after generating 1 million potential solutions. This puts AlphaCode 2 between the Expert and Master ranks on the platform. Even more thrilling: in its two best-performing contests, AlphaCode 2 outperformed 99.5% of all competition participants. Note that, while these participants are humans, they are not your average hobby coders, but typically highly skilled and experienced competitive programmers.

AlphaCode 2 performs better than 87% of the competition participants

Source: Google DeepMind - AlphaCode 2 Technical Report

However, while these might look like very impressive numbers, we really need to ask whether we are looking at a fair "apples-to-apples" comparison here. Remember that we are comparing AlphaCode 2, a system in which thousands of Gemini Pro instances each generate candidate solutions, against a single GPT-4 model. In human terms, that would be like comparing a solution collaboratively produced by thousands of the greatest minds on earth to the solution of the one great mind that was left out of the group. Looking at it that way, underperforming would be more surprising than overperforming is impressive.
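As a side note on how to read "solved within 10 attempts": the AlphaCode 2 report works with its own submission budget, but a common way to reason about this kind of metric is the unbiased pass@k estimator from OpenAI's Codex paper (Chen et al., 2021, "Evaluating Large Language Models Trained on Code"). The numbers in the snippet below are made up purely to show how the estimate behaves; they are not figures from the report.

from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of P(at least one of k sampled solutions is correct),
    # given n generated samples of which c turned out to be correct
    # (Chen et al., 2021).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example (NOT from the AlphaCode 2 report): if 30 out of 1,000
# generated samples for a problem are correct, a random batch of 10 attempts
# already contains a correct one roughly 26% of the time.
print(round(pass_at_k(n=1000, c=30, k=10), 2))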

And apart from analytical performance, there's also a word to be said about compute and operating cost. Imagine that you are working on a very tough coding problem and you want some artificial help to solve it. Firing up thousands of Gemini Pro models might sound like an effective solution to you, but it is most likely not a financially viable option for the company you work for. Google also noticed the cost issue of AlphaCode 2, which is why it is currently not part of the foundation Gemini models themselves. However, while you might be inclined to think that "those crazy American companies are just throwing a boatload of resources at a problem and hoping that it sticks", there is some nuance to be made here. Research has shown that the use of verifiers results in approximately the same performance boost as a 30x model size increase (Training Verifiers to Solve Math Word Problems, K. Cobbe et al.). In other words, the trade-off between model size and the number of generated potential solutions will be paramount in scaling these types of LLM systems.
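To get a feel for why that trade-off matters, here is a back-of-the-envelope comparison between one call to a large model and a sample-and-verify setup on a smaller model. All prices and the survivor rate below are invented placeholders, not actual Gemini or GPT-4 pricing; the 5% survivor rate simply mirrors the roughly 95% filtering mentioned in the report.

# Hypothetical per-call costs; these are invented numbers, not real pricing.
COST_LARGE_MODEL_CALL = 0.050   # one generation from a big model
COST_SMALL_MODEL_CALL = 0.002   # one generation from a smaller policy model
COST_VERIFIER_CALL    = 0.001   # scoring one surviving candidate

def sample_and_verify_cost(n_samples, survivor_rate=0.05):
    # Generate n_samples candidates, then only score the fraction that
    # survives the compile-and-test filter.
    return n_samples * COST_SMALL_MODEL_CALL + n_samples * survivor_rate * COST_VERIFIER_CALL

for n in (100, 1_000, 100_000):
    print("%7d samples: $%10.2f  vs  one large-model call: $%.2f"
          % (n, sample_and_verify_cost(n), COST_LARGE_MODEL_CALL))

The point of the sketch: the sample-and-verify bill grows linearly with the number of candidates, so every extra order of magnitude of samples has to buy enough additional solve rate to justify itself, while a bigger model raises the per-call price only once.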

Expectations for the future

This being said, 2024 will be very interesting in terms of coding. If you can generate enough samples and test them - like AlphaCode 2 does - it becomes more a matter of brute-force compute than of elegant reasoning. The question is though: for which kinds of problems is this immense compute requirement justified and responsible? Is it justified when you're creating your next churn prediction model? How about using it to find a cure for a complex and deadly disease? Or to create new LLMs or pursue AGI?

Barreling on from those questions: are we going to be coding with tools like AlphaCode 2 in the future, or will they just replace us entirely? After all, while they aren't economically feasible now, technological evolution might make them feasible later on. Google themselves hope that models like these will be used as a tool by humans. And the fact that Codeforces scores land above the 90th percentile when AlphaCode 2 is paired with a human hints that there is still some synergy to be found between humans and machines. How that will translate to the future, however, remains to be seen.

My personal take as a Data Scientist/Data Engineer - read: I have no crystal ball to predict the future - is that these models might just end up as very powerful tools used by developers. Especially in the early stage, when the scaled-down, less performant versions are the only ones that are economically feasible to run. For everyday tasks like creating a forecasting model (data scientist) or building end-to-end pipelines (data engineer), I believe there will always need to be someone who instructs the LLM, can intervene when something goes wrong and can explain to the rest of the business what these models are actually doing. Already now, it is a big issue in companies that AI models can seem like a "black box" that is "hard to understand" for the business. Imagine how that will evolve if we let a machine do all the work without experts who actually understand what has been done and can convey that message to their peers. So while I think coders are probably not on the verge of extinction, I do think their skillset might need to change quite a bit. Perhaps communication skills will become even more important for those hardcore developers!

In a nutshell

  • By using state-of-the-art LLMs to generate a large number of potential solutions to a complex coding problem, and a verifier to select the best one, the ability to solve these problems can be increased tremendously
  • Going forward, the trade-off between larger models and this generation-verification mechanism will be important for the scalability of these LLM systems
  • While it looks like AlphaCode2 is already better at solving complex coding problems than most humans, there is still a synergy to be found between human and machine
  • The role of developers might change from hands-on coding to a role that revolves more around instructing LLMs, safeguarding best practices and explaining the results to the business

 

Sources