The quality of auto-generated code

Kevlin Henney and I talked through some ideas about GitHub Copilot, the tool for automatically generating code, built on GPT-3’s language model and trained on the code hosted on GitHub. This article poses some questions and (maybe) some answers, without trying to present any conclusions.

At first we wondered about code quality. There are plenty of ways to solve any given programming problem, but most of us have some ideas about what makes code “good” or “bad.” Is it readable? Is it well organized? Things like that. In a professional environment, where software needs to be maintained and modified over long periods of time, readability and organization count for a lot.

We know how to test whether or not code is correct (at least up to a certain limit). Given enough unit tests and acceptance tests, we can imagine a system that automatically generates code that is correct. Property-based testing might give us some additional ideas for building test suites robust enough to verify that the code works properly. But we don’t have methods to test for code that is “good.” Imagine asking Copilot to write a function that sorts a list. There are many ways to sort. Some are pretty good – quicksort, for example. Some of them are awful. But a unit test has no way of telling whether a function is implemented using quicksort, permutation sort (which runs in factorial time), sleep sort, or one of the other strange sorting algorithms that Kevlin has written about.
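To make that concrete, here is a minimal sketch (not from our discussion) of a correctness test for a hypothetical Copilot-generated sort_list function. The test passes whether the implementation underneath is quicksort, permutation sort, or sleep sort; it verifies what the function does, not how.

```python
import random

def sort_list(xs):
    # Stand-in for whatever implementation Copilot might generate.
    return sorted(xs)

def test_sort_list():
    for _ in range(100):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        result = sort_list(xs)
        # The output must be ordered...
        assert all(a <= b for a, b in zip(result, result[1:]))
        # ...and must agree with a trusted reference sort of the input.
        assert result == sorted(xs)

if __name__ == "__main__":
    test_sort_list()
    print("all checks passed")
```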

Do we care? Well, we do care about O(N log N) behavior versus O(N!). But let’s assume we have a way to solve that problem: if we can specify a program’s behavior precisely enough that we’re confident Copilot will write code that is correct and performs acceptably, do we then care about its aesthetics? Do we care whether it’s readable? Forty years ago we might have cared about the assembly language code generated by a compiler. But today we don’t, except for a few increasingly rare corner cases that usually involve device drivers or embedded systems. If I write something in C and compile it with gcc, I will realistically never look at the compiler’s output. I don’t need to understand it.
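As a hedged illustration of what “acceptably performing” could mean in such a specification, the correctness test above could be extended with a crude runtime budget. The sort_list name, the input size, and the two-second budget are assumptions for the sketch, not anything from the discussion.

```python
import random
import time

def sort_list(xs):
    # Stand-in for whatever implementation Copilot might generate.
    return sorted(xs)

def test_sort_list_performance():
    xs = [random.random() for _ in range(100_000)]
    start = time.perf_counter()
    sort_list(xs)
    elapsed = time.perf_counter() - start
    # A generous budget: an O(N log N) sort finishes well inside it, while an
    # O(N^2) -- let alone O(N!) -- approach on 100,000 items would not.
    assert elapsed < 2.0, f"sort took {elapsed:.2f}s"

if __name__ == "__main__":
    test_sort_list_performance()
    print("performance check passed")
```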

To get to that point, we may need a metalanguage for describing what we want the program to do that is almost as detailed as a modern high-level language. That could be what the future holds: an understanding of “prompt engineering” that lets us tell an AI system precisely what we want a program to do, rather than how to do it. Testing would become much more important, and so would understanding exactly the business problem to be solved. “Slinging code,” regardless of the language, would become less common.

But what if we never reach the point where we trust automatically generated code as much as we now trust the output of a compiler? Readability will be at a premium as long as humans need to read code. If we have to read the output from one of Copilot’s descendants to judge whether or not it will work, or if we have to debug that output because it mostly works but fails in some cases, then we will need it to generate code that is readable. Not that humans currently do a good job of writing readable code; but we all know how painful it is to debug code that isn’t readable, and we all have some idea of what “readability” means.

Second: Copilot was trained on the body of code in GitHub. At this point, all (or almost all) of it was written by humans. Some of it is good, high-quality, readable code; a lot of it isn’t. What if Copilot became so successful that Copilot-generated code came to constitute a significant percentage of the code on GitHub? The model will certainly need to be retrained from time to time. So now we have a feedback loop: Copilot trained on code that has been (at least partially) generated by Copilot. Does code quality improve? Or does it degrade? And, again: do we care?

This question can be argued either way. People who work with automated tagging for AI seem to take the position that iterative tagging leads to better results: i.e., after a tagging pass, use a human in the loop to check some of the tags, correct them where they are wrong, and then use that additional input in another training pass. Repeat as needed. That isn’t so different from current (non-automated) programming: write, compile, run, debug, as often as necessary to get something that works. The feedback loop lets you write good code.
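The sketch below is a toy, hedged illustration of that iterative loop, not anything described in the discussion: the “model” is a one-parameter classifier, and the “human” is simulated by a ground-truth function.

```python
import random

random.seed(0)

def true_label(x):                 # ground truth only the human reviewer knows
    return 1 if x > 0.6 else 0

def predict(x, threshold):         # the "model": a one-parameter classifier
    return 1 if x > threshold else 0

data = [random.random() for _ in range(500)]
threshold = 0.5                    # initial, untrained model
corrections = []

for round_no in range(3):
    # 1. Automatic tagging pass with the current model.
    tagged = [(x, predict(x, threshold)) for x in data]
    # 2. Human in the loop: review a sample and correct the wrong tags.
    sample = random.sample(tagged, 50)
    corrections += [(x, true_label(x)) for x, _ in sample]
    # 3. Retrain on the accumulated corrections (here: pick the threshold
    #    that best fits them), then repeat as needed.
    threshold = min((x for x, _ in corrections),
                    key=lambda t: sum(predict(x, t) != y for x, y in corrections))
    errors = sum(predict(x, threshold) != true_label(x) for x in data)
    print(f"round {round_no}: threshold={threshold:.3f}, errors on full set={errors}")
```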

A human-in-the-loop approach to training an AI code generator is one possible way of getting “good code” (whatever “good” means) – though it’s only a partial solution. Issues like indentation style, meaningful variable names, and the like are only the beginning. Assessing whether a body of code is structured into coherent modules, has well-designed APIs, and can easily be understood by maintainers is a harder problem. Humans can evaluate code with these qualities in mind, but it takes time. A human in the loop might help to train AI systems to design good APIs, but at some point the “human” part of the loop will start to dominate the rest.

If you look at this problem from the standpoint of evolution, you see something different. If you breed plants or animals (a highly selective form of evolution) for one desired quality, you will almost certainly see all the other qualities degrade: you’ll get large dogs with hips that don’t work, or dogs with flat faces that can’t breathe properly.

What direction will automatically generated code take? We don’t know. Our guess is that, without ways to measure “code quality” rigorously, code quality will probably degrade. Ever since Peter Drucker, management consultants have liked to say, “If you can’t measure it, you can’t improve it.” And we suspect that also applies to code generation: aspects of the code that can be measured will improve, aspects that can’t, won’t. Or, as the accounting historian H. Thomas Johnson put it, “Perhaps what you measure is what you get. More likely, what you measure is all you’ll get. What you don’t (or can’t) measure is lost.”

We can write tools to measure some superficial aspects of code quality, like obeying stylistic conventions. We already have tools that can “fix” fairly superficial quality problems like indentation. But again, that superficial approach doesn’t touch the harder parts of the problem. If we had an algorithm that could score readability, and restricted Copilot’s training set to code that scores in the 90th percentile, we would certainly see output that looks better than most human-written code. Even with such an algorithm, though, it’s still unclear whether it could determine whether variables and functions had appropriate names, let alone whether a large project was well structured.
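As a hedged sketch of what that filtering step could look like: the readability_score heuristic below (line length and comment density) is a deliberately crude stand-in we made up for illustration – the point of the paragraph above is precisely that a real readability metric is the hard, unsolved part.

```python
import statistics

def readability_score(source: str) -> float:
    # Toy proxy only: shorter lines and some comments score higher.
    lines = [l for l in source.splitlines() if l.strip()]
    if not lines:
        return 0.0
    avg_len = statistics.mean(len(l) for l in lines)
    comment_ratio = sum(l.lstrip().startswith("#") for l in lines) / len(lines)
    return max(0.0, 100 - avg_len) + 50 * comment_ratio

def top_percentile(corpus, percentile=90):
    # Keep only the snippets scoring at or above the given percentile.
    scored = sorted(corpus, key=readability_score)
    cutoff = int(len(scored) * percentile / 100)
    return scored[cutoff:]

corpus = [
    "def add(a, b):\n    # return the sum of two numbers\n    return a + b\n",
    "def f(x,y,z,w):\n    return ((x+y)*(z-w))//((x-z) or 1)+(w*y if x else z)\n",
    "def mean(xs):\n    # arithmetic mean of a list\n    return sum(xs) / len(xs)\n",
]
for snippet in top_percentile(corpus, percentile=90):
    print(snippet)
```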

And a third time: do we care? If we have a rigorous way to express what we want a program to do, we may never need to look at the underlying C or C++. At some point, one of Copilot’s descendants may not need to generate code in a “high-level language” at all: perhaps it will generate machine code for your target machine directly. And perhaps that target machine will be WebAssembly, the JVM, or something else that is highly portable.

Do we care whether tools like Copilot write good code? We do, until we don’t. Readability will matter as long as humans have a part to play in the debugging loop. The important question probably isn’t “do we care”; it’s “when will we stop caring?” When we can trust the output of a code model, we will see a rapid phase change. We will care less about the code, and more about describing the task (and appropriate tests for that task) correctly.
