The Rise of Coding Assistants: Increasing Developer Productivity

The rapid growth of coding assistants built on large language models (LLMs) has sparked debate about the future of coding, with some even predicting the end of programmers. However, at Heroku we’re seeing a different trend – LLM companies are democratizing coding to the point that a wave of aspiring coders is entering the industry.

This influx of developers creates unique challenges for AI assistants. Experienced developers are better equipped to notice when an assistant is leading them down the wrong coding path, while less experienced developers may not realize that anything is wrong.

Furthermore, the quality of LLM-generated code can vary significantly depending on the choice of language and application area. Using an assistant to write a backend in Python or Java is usually not a problem; asking it to write Zig or a newer language like Mojo, however, may end in disappointment. For example, when asked how to install Python packages, an assistant will typically instruct the user to use pip or conda, because these are the most common tools in its training data. Meanwhile, many developers prefer Poetry for Python dependency management and packaging; AI tools rarely use or even acknowledge it unless the developer specifically asks.
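
As a quick illustration of that divergence (these are standard pip and Poetry commands, not the output of any particular assistant):

```shell
# What an assistant trained mostly on pip examples tends to suggest:
pip install requests

# What a Poetry-managed project actually expects:
poetry add requests
```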

Outdated information can also be a problem, especially when languages and frameworks try to phase out old features, or when once-popular libraries contain security vulnerabilities that have yet to be fixed.

Using data to enhance LLMs for software development

So how can we help assistants (and therefore developers) avoid such problems? First, it is important that current data informs the LLM. While larger organizations can leverage their existing codebases and operational data to partially solve this problem, comprehensive, up-to-date datasets are rarely fully available. In fact, 62% of employees worldwide say that AI trained on outdated data undermines their trust in the tool. Similarly, companies lack sufficient relevant data when they introduce new languages or try to change the status quo.

The necessary data curation can be addressed in a collaborative and community-focused way, following patterns established by open source projects. This does not mean that we need to agree on one set of best practices. On the contrary, it makes sense for the communities around existing languages, tools, and frameworks to each create a dataset, or framework knowledge base (FKB), specifically curated for their area of interest.

The optimal method for supplementing existing models with these FKBs is not yet clear. However, that should not stop anyone from producing them. Whether an FKB is placed entirely in the context window, accessed through retrieval-augmented generation (RAG), or used for fine-tuning, having current, accurate, relevant, and consistent data is the best first step.
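
To make the RAG option concrete, here is a minimal sketch of how an FKB might feed an assistant’s prompt. The FKBEntry shape, the keyword-overlap retrieval, and the helper names are all assumptions for illustration; a real system would use embeddings and an actual model call.

```python
# Hypothetical sketch: surfacing curated FKB guidance to an assistant via RAG.
from dataclasses import dataclass

@dataclass
class FKBEntry:
    topic: str      # e.g. "dependency management"
    guidance: str   # curated, up-to-date best-practice text

def retrieve(fkb: list[FKBEntry], query: str, k: int = 3) -> list[FKBEntry]:
    """Naive keyword overlap; a production system would use embeddings."""
    words = set(query.lower().split())
    scored = sorted(fkb, key=lambda e: -len(words & set(e.guidance.lower().split())))
    return scored[:k]

def build_prompt(fkb: list[FKBEntry], question: str) -> str:
    """Prepend the most relevant curated guidance to the user's question."""
    context = "\n".join(f"- {e.topic}: {e.guidance}" for e in retrieve(fkb, question))
    return f"Follow this curated guidance:\n{context}\n\nQuestion: {question}"

fkb = [FKBEntry("dependency management",
                "Use 'poetry add <package>' to install dependencies in this project.")]
print(build_prompt(fkb, "How do I install a package?"))
```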

Framework knowledge bases for LLMs

How would a user interact with these FKBs? Imagine a menu presented when the coding assistant is activated: the user selects one or more FKBs tailored to a specific purpose. Each FKB would contain a starter template for the assistant to work from, along with code samples, best-practice guides, and recommended libraries.
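
One possible shape for such an FKB record, mirroring the contents just described; the field names here are purely illustrative assumptions, not a standard:

```python
# Illustrative sketch only; no FKB schema has been standardized.
from dataclasses import dataclass, field

@dataclass
class FrameworkKnowledgeBase:
    name: str                      # e.g. "python-poetry"
    starter_template: str          # path or repo URL of the starting template
    code_samples: list[str] = field(default_factory=list)
    best_practice_guides: list[str] = field(default_factory=list)
    recommended_libraries: dict[str, str] = field(default_factory=dict)  # name -> why
```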

To maximize the chance of success, these datasets should be created and curated by experts in the relevant languages and frameworks. However, every journey begins with a single step, or in this case, with a person who has the ability and willingness to step forward and build the first FKB. There are several questions this pioneer would have to answer:

1. What license should an FKB use? Ideally, these datasets should be accessible to everyone. To ensure the widest possible use, choose something extremely permissive, such as the Unlicense or CC0.

2. Where should an FKB be stored? Ideally, somewhere people can collaborate on the dataset, with an easy way to modify or fork it if they disagree about certain choices. A GitHub repository is a good option because the template can be used directly and the platform supports stars, pull requests, and forks. A Hugging Face dataset could also work. Another idea is to include an ai-hints.json file directly in an existing repository (a sketch of what that file might contain follows this list).

3. What test data should be included for the LLM? A future challenge will be evaluating LLM performance against the dataset, and such evaluations require test data as well as training data. To address this, an FKB should include domain-specific examples that can serve as a test set for evaluating LLM performance (see the evaluation sketch after this list).
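
On question 2, here is one sketch of what an ai-hints.json file might contain. The schema below is entirely hypothetical; no such standard exists yet:

```json
{
  "language": "python",
  "package_manager": "poetry",
  "install_command": "poetry add <package>",
  "preferred_libraries": { "http": "httpx" },
  "avoid": ["calling pip directly in a Poetry-managed project"]
}
```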
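
And for question 3, a minimal sketch of how FKB-supplied test examples might drive an evaluation. The test-case shape and the substring-match pass criterion are assumptions chosen for brevity, and ask_assistant stands in for whatever model call is used:

```python
# Hypothetical evaluation harness over FKB-supplied test cases.
def evaluate(test_set: list[dict], ask_assistant) -> float:
    """Return the fraction of test cases whose expected snippet appears."""
    passed = sum(case["expected_snippet"] in ask_assistant(case["prompt"])
                 for case in test_set)
    return passed / len(test_set)

# Example test case an FKB might ship:
test_set = [{
    "prompt": "How do I add the requests library to this project?",
    "expected_snippet": "poetry add requests",
}]
```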

Coding assistants are becoming amazing tools

Fostering a collaborative, community-driven approach to gathering and sharing knowledge can make coding assistants incredible tools that empower developers of all skill levels. This journey begins with a collective effort by the technology community to create and improve datasets that keep pace with the ever-changing landscape of programming languages and tools. Together we can build a future where coding assistants simplify coding and inspire innovation across the global developer ecosystem.