
How good is ChatGPT at coding?

This article is part of our exclusive IEEE Journal Watch Series in cooperation with IEEE Xplore.

Programmers have spent decades writing code for AI models, and now, in a full-circle moment, AI is being used to write code. But how does an AI code generator compare to a human programmer?

A study published in the June issue of IEEE Transactions on Software Engineering evaluated code produced by OpenAI’s ChatGPT in terms of functionality, complexity, and security. The results show that ChatGPT has an extremely wide range of success when it comes to producing functional code—with success rates ranging from as poor as 0.66 percent to as good as 89 percent—depending on the difficulty of the task, the programming language, and a number of other factors.

While in some cases an AI generator can produce better code than a human, the analysis also reveals some concerns about the security of AI-generated code.

Yutian Tang is a lecturer at the University of Glasgow who was involved in the study. He notes that AI-based code generation can offer some benefits in terms of increasing productivity and automating software development tasks — but it’s important to understand the strengths and weaknesses of these models.

“By conducting a comprehensive analysis, we can discover potential issues and limitations that arise in ChatGPT-based code generation… (and) improve the generation techniques,” explains Tang.

To further investigate these limitations, his team decided to test GPT-3.5’s ability to solve 728 coding problems from the LeetCode testbed in five programming languages: C, C++, Java, JavaScript, and Python.

“A reasonable hypothesis for why ChatGPT might perform better on algorithmic problems prior to 2021 is that these problems appear frequently in the training dataset.” —Yutian Tang, University of Glasgow

Overall, ChatGPT was fairly good at solving problems in the different programming languages, but it was especially successful with problems that existed on LeetCode before 2021. For example, it was able to produce functional code for easy, medium, and hard problems with success rates of about 89, 71, and 40 percent, respectively.

“However, when it comes to algorithmic problems after 2021, ChatGPT’s ability to generate functionally correct code is limited. It sometimes fails to understand the meaning of questions, even for easy-level problems,” Tang notes.

For example, ChatGPT’s ability to generate functional code for “easy” coding problems dropped from 89 percent to 52 percent after 2021. And its ability to generate functional code for “hard” problems dropped from 40 percent to 0.66 percent after that time.

“A reasonable hypothesis for why ChatGPT might perform better on algorithmic problems before 2021 is that these problems appear frequently in the training dataset,” Tang says.

Basically, as coding has evolved, ChatGPT has not yet been exposed to new problems and solutions. It lacks the critical thinking skills of humans and can only solve problems it has encountered before. This may explain why it is so much better at solving older coding problems than newer ones.

“ChatGPT may generate incorrect code because it does not understand the meaning of algorithmic problems.” —Yutian Tang, University of Glasgow

Interestingly, ChatGPT is able to generate code that requires less execution time and less memory overhead than at least 50 percent of human-developed LeetCode solutions to the same problems.

The researchers also explored ChatGPT’s ability to fix its own coding errors after receiving feedback from LeetCode. They randomly selected 50 coding scenarios in which ChatGPT initially generated incorrect code because it did not understand the content of the problem at hand.

While ChatGPT was good at fixing compilation errors, it wasn’t very good at correcting its own bugs.

“ChatGPT may generate incorrect code because it does not understand the meaning of algorithmic problems. As such, this simple error feedback is not sufficient,” Tang explains.

The researchers also found that the code generated by ChatGPT contained a fair few vulnerabilities, such as missing null checks, though many of them were easy to fix. Their results also show that the generated C code was the most complex, followed by C++ and Python, the last of which has a complexity similar to that of human-written code.
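To make the "missing null check" vulnerability concrete, here is a minimal sketch in Python (function names are hypothetical, not taken from the study): code that dereferences a lookup result without first testing whether anything was found will crash on absent inputs, while the fixed version checks for None before use.

```python
def find_user(users, name):
    """Return the first user record matching name, or None if absent."""
    for user in users:
        if user["name"] == name:
            return user
    return None

def get_email_unsafe(users, name):
    # Vulnerable pattern: assumes the lookup always succeeds.
    # Raises TypeError (None is not subscriptable) if name is absent.
    return find_user(users, name)["email"]

def get_email_safe(users, name):
    # Fixed pattern: test for None before dereferencing the result.
    user = find_user(users, name)
    if user is None:
        return None
    return user["email"]
```

The fix is mechanical, which matches the study's observation that many of the detected vulnerabilities were easy to repair once pointed out.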

Tang says that, based on these results, it is important for developers using ChatGPT to provide additional information that will help it better understand problems and avoid security holes.

“For example, when they encounter more complex programming problems, developers can provide as much relevant knowledge as possible, and point out in the prompt potential vulnerabilities for ChatGPT to look out for,” Tang says.
