AI won’t be landing a job in software engineering anytime soon, but it’s getting there
Last Sunday, we looked at OpenAI’s latest work, in which the company trained diffusion models to generate realistic synthetic images and subsequently reached a new state of the art in multiple image generation tasks. Today we are shifting gears and focusing on another significant recent development in the field of artificial intelligence: transformer models.
Transformer models came to the fore with Google’s open-source implementation of BERT. By addressing the shortcomings of RNNs and LSTMs, this deep learning architecture has revolutionized the field of natural language processing and generation. We first saw the power of these language models with OpenAI’s GPT-2, which, with its 1.5 billion parameters, produced news, stories, lyrics, and other pieces of text that could easily be mistaken for human work rather than the output of a language model. Soon after, GPT-3, the successor to GPT-2, borrowed all of the best elements of its predecessor and, with 175 billion parameters to back it up, produced work that seemed remarkably coherent, sophisticated, and factually grounded. Since the training dataset for this language model was essentially the entire Internet, we could ask it to produce just about anything that is publicly available in textual form online. Aside from stories, lyrics, news, and conversations, GPT-3 even wrote valid CSS and HTML code. The last of these, the ability of a language model to write code, is what we’ll focus on today.
A few days ago, a team of researchers from UC Berkeley, UChicago, UIUC, and Cornell published a paper assessing the ability of today’s best language models to write code. In the paper, titled Measuring Coding Challenge Competence with APPS, the researchers essentially put these language models in the shoes of a candidate going through a programming interview, where their ability to understand a given problem and code its solution is tested. To do this, the team presents a new dataset called the Automated Programming Progress Standard (APPS).
APPS assesses models not only on their ability to understand coding syntax, but also on their ability to understand problem descriptions and design algorithms that solve them. A model that performed well on APPS would demonstrate an ability to flexibly use data structures and programming techniques, as well as an ability to correctly interpret varied task specifications, follow instructions, and understand human intent.
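To see what this kind of evaluation involves in practice, here is a minimal sketch of how a candidate program can be checked against input/output test cases. The function name `run_test_cases` and the four-second timeout are my own illustrative choices, not details of the paper’s actual harness:

```python
import subprocess
import sys

def run_test_cases(solution_path, test_cases):
    """Run a candidate Python solution against (stdin, expected stdout)
    pairs and return the fraction of test cases it passes."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=4,  # guard against infinite loops in generated code
            )
            if result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a hung program counts as a failed test case
    return passed / len(test_cases)
```

Comparing trimmed stdout against the expected answer is the standard competitive-programming convention, which is why problems scraped from sites like Codeforces lend themselves to automated grading.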
The dataset consists of 10,000 coding problems divided into three categories (introductory, interview, competition) and written in plain English, as is generally expected in programming interviews today. The problems come from open-access sites such as Codewars, AtCoder, Kattis, and Codeforces, where programmers share coding challenges with one another. To validate generated solutions, the dataset contains 131,836 test cases and 232,444 ground-truth solutions written by humans in Python.
1. Introductory level. Problems that most programmers with 1-2 years of experience can address without the need for complicated algorithms. Examples of such problems include counting the number of occurrences of a substring, or finding whether a string is a palindrome. There are 3,639 problems rated at the introductory level and 1,000 in the test set.
2. Interview level. These problems are more algorithmic and difficult in nature, at the level of questions asked in challenging technical interviews. Examples include problems involving data structures such as trees or graphs, or problems that require modifying well-known algorithms. There are 5,000 problems classified at the interview level and 3,000 in the test set.
3. Competition level. These problems are more difficult still, at the level of advanced high school and collegiate programming competitions such as USACO, IOI, and ACM. There are 1,361 problems rated at the competition level and 1,000 in the test set.
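To make the introductory tier concrete, here is the kind of solution a model would be expected to produce for the palindrome problem mentioned above. This is a hypothetical reference solution written for illustration, not one taken from the dataset:

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    cleaned = [c.lower() for c in s if c.isalnum()]
    return cleaned == cleaned[::-1]

print(is_palindrome("Never odd or even"))  # True
print(is_palindrome("hello"))              # False
```

A few lines of string handling and a reversal: no clever algorithms required, which is exactly what distinguishes the introductory tier from the interview and competition tiers.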
The following image shows an excerpt from the dataset:
With the APPS dataset prepared, the researchers fine-tuned three of today’s leading language models: GPT-2, GPT-3, and GPT-Neo (a free, open-source alternative to the closed-source GPT-3). After training was completed, the models were evaluated and compared against one another.
The researchers found that while there are definite positives, understanding and solving coding problems remains a notoriously difficult task, even for the best language models we have today.
We fine-tune large language models on both GitHub and our training set, and find that the prevalence of syntax errors is decreasing exponentially. Newer models like GPT-Neo can pass around 15% of the test cases of introductory problems, so we are seeing machine learning models starting to learn how to code.
On the positive side, the models demonstrated the ability to understand the problem, write import statements, define classes, and construct program flow. Here’s a sample from GPT-2, the smallest of the three models, on a test problem for which it passed all 18/18 test cases:
And here is an example of what the GPT-3 produced for a separate problem.
The models did, of course, sometimes suffer from syntax errors, though the larger models were more resilient to them, and additional fine-tuning decreased the rate of syntax errors exponentially. There were also times when a solution produced by these models looked correct at first glance yet failed every test case once validated.
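One cheap way to measure the syntax-error rate described above is to attempt to parse each generated program before ever running it. Here is a minimal sketch using Python’s standard `ast` module; the function name is my own:

```python
import ast

def has_syntax_error(code: str) -> bool:
    """Return True if the generated Python code fails to parse."""
    try:
        ast.parse(code)
        return False
    except SyntaxError:
        return True

print(has_syntax_error("print('ok')"))       # False
print(has_syntax_error("def f(:\n  pass"))   # True
```

Parsing only catches malformed code, not logically wrong code, which is why test-case validation is still needed to expose the plausible-looking solutions that fail every test.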
The team believes that “memorization” of code blocks from the training set could be the cause here. The general idea for tackling such problems is that we need more trainable parameters. Overall, it is clear from these results that while language models have come a long way in conversational, creative, and formal writing, their ability to code is still lackluster. But it’s definitely getting there.
We evaluated state-of-the-art generative models on our benchmark and found the overall performance to be low. However, the prevalence of syntax errors has been decreasing exponentially as models are fine-tuned at larger scale, and recent models such as GPT-Neo can solve a number of introductory problems.
Going forward, the team envisions that as language models continue to grow and become more capable, concerns about malicious code generation and automation may arise. For that time, the APPS dataset offered here could prove a useful benchmark. For now, it doesn’t look like language models have a chance of landing a decent job in software engineering. More details can be found in the project’s GitHub repository or in the arXiv preprint.