Table of Links
Abstract and 1 Introduction
2. Prior conceptualisations of intelligent assistance for programmers
3. A brief overview of large language models for code generation
4. Commercial programming tools that use large language models
5. Reliability, safety, and security implications of code-generating AI models
6. Usability and design studies of AI-assisted programming
7. Experience reports and 7.1. Writing effective prompts is hard
7.2. The activity of programming shifts towards checking and unfamiliar debugging
7.3. These tools are useful for boilerplate and code reuse
8. The inadequacy of existing metaphors for AI-assisted programming
8.1. AI assistance as search
8.2. AI assistance as compilation
8.3. AI assistance as pair programming
8.4. A distinct way of programming
9. Issues with application to end-user programming
9.1. Issue 1: Intent specification, problem decomposition and computational thinking
9.2. Issue 2: Code correctness, quality and (over)confidence
9.3. Issue 3: Code comprehension and maintenance
9.4. Issue 4: Consequences of automation in end-user programming
9.5. Issue 5: No code, and the dilemma of the direct answer
10. Conclusion
A. Experience report sources
References
7. Experience reports
At present, there is little research on the user experience of programming with large language models beyond the studies we have summarised in Section 6. However, as the availability of such tools increases, professional programmers will gain long-term experience in their use. Many such programmers write about their experiences on personal blogs, which are then discussed in online communities such as Hacker News. Inspired by the potential for these sources to provide rich qualitative data, as demonstrated in prior work (Barik et al., 2015; Sarkar et al., 2022), we draw upon a selection of such experience reports. A full list of sources is provided in Appendix A; below we summarise their key points.
7.1. Writing effective prompts is hard
As with several other applications of generative models, a key issue is writing prompts that increase the likelihood of successful code generation. The mapping that these models learn between natural language and code is still very poorly understood. Through experimentation, some developers have derived heuristics for prompts that improve the quality of the code generated by the model. One developer, after building several applications and games with OpenAI’s code-davinci model (the second-generation Codex model), advises others to “number your instructions” and to create the “logic first”, before UI elements. Another, using Copilot to build a classifier for natural language statements, suggests providing “more detail” in response to a failure to generate correct code. For example, when asking Copilot to “binarize” an array fails, they rewrite the prompt as “turn it into an array where [the first value] is 1 and [the second value] is 0” – effectively pseudocode – which generates a correct result.
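A minimal sketch of this kind of prompt refinement is below, written as the comments a developer might offer Copilot. The array contents, the variable names, and the completion itself are illustrative assumptions on our part, not code reproduced from the report; the bracketed values in the original prompt were not disclosed.

```python
import numpy as np

labels = np.array(["positive", "negative", "positive", "negative"])

# Vague prompt -- the kind reported to fail:
#   "binarize the labels array"

# Detailed, pseudocode-like prompt -- the kind reported to succeed:
#   "turn it into an array where 'positive' is 1 and 'negative' is 0"
binary_labels = np.where(labels == "positive", 1, 0)

print(binary_labels)  # [1 0 1 0]
```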
Commenters on Hacker News are divided on the merits of efforts invested in developing techniques for prompting. While some see it as a new level of abstraction for programming, others see it as indirectly approaching more fundamental issues that ought to be solved with better tooling, documentation, and language design:
“You’re not coding directly in the language, but now you’re coding in an implicit language provided by Copilot. […] all it really points out is that code documentation and discovery is terrible. But I’m not for sure writing implicit code in comments is really a better approach than seeking ways to make discovery of language and library features more discoverable.”
“[…] the comments used to generate the code via GitHub Copilot are just another very inefficient programming language.”
“[Responding to above] There is nonetheless something extremely valuable about being able to write at different levels of abstraction when developing code. Copilot lets you do that in a way that is way beyond what a normal programming language would let you do, which of course has its own, very rigid, abstractions. For some parts of the code you’ll want to dive in and write every single line in painstaking detail. For others […] [Copilot] is maybe enough for your purposes. And being able to have that ability, even if you think of it as just another programming language in itself, is huge.”
Being indiscriminately trained on a corpus containing code of varying ages and (subjective) quality has drawbacks: developers encounter generated code that is technically correct, but that contains practices considered poor, such as unrolled loops and hardcoded constants. One Copilot user found that:
“Copilot […] has made my code more verbose. Lines of code can be liabilities. Longer files to parse, and more instances to refactor. Before, where I might have tried to consolidate an API surface, I find myself maintaining [multiple instances].”
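The following hypothetical sketch (our own, not taken from the report) illustrates the kind of verbosity being described: accepting suggestions block by block tends to produce repeated code with hardcoded constants, where a developer aiming to consolidate an API surface would instead write a single helper.

```python
import urllib.request

# Suggestion-by-suggestion style: near-identical blocks, hardcoded constants.
def fetch_users():
    with urllib.request.urlopen("https://api.example.com/users?limit=50") as r:
        return r.read()

def fetch_orders():
    with urllib.request.urlopen("https://api.example.com/orders?limit=50") as r:
        return r.read()

# Consolidated API surface: one helper, one place to refactor.
def fetch(resource, limit=50):
    url = f"https://api.example.com/{resource}?limit={limit}"
    with urllib.request.urlopen(url) as r:
        return r.read()
```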
Another Copilot user reflected on their experience of trying to generate code that uses the fastai API, which frequently changes:
“[…] since the latest version of fastai was only released in August 2020, GitHub Copilot was not able to provide any relevant suggestions and instead provided code for using older versions of fastai. […] To me, this is a major concern […] If we are using cutting edge tools […] Copilot has no knowledge of this and cannot provide useful suggestions.”
On the other hand, developers can also be exposed to better practices and APIs through these models. The developer who found that Copilot made their code more verbose also observed that:
“Copilot gives structure to Go errors. […] A common idiom is to wrap your errors with a context string [which can be written in an inconsistent, ad-hoc style] […] Since using Copilot, I haven’t written a single one of these error handling lines manually. On top of that, the suggestions follow a reasonable structure where I didn’t know structure had existed before. Copilot showed me how to add structure in my code in unlikely places. For writing SQL, it helped me write those annoying foreign key names in a consistent format […]
[Additionally,] One of the more surprising features has been [that] […] I find myself discovering new API methods, either higher-level ones or ones that are better for my use case.”
In order to discover new APIs, of course, the APIs themselves need to be well-designed. Indeed, in some cases the spectacular utility of large language models can be largely attributed to the fact that API designers have already done the hard work of creating an abstraction that is a good fit for real use cases (Myers & Stylos, 2016; Piccioni et al., 2013; Macvean et al., 2016). As a developer who used Copilot to develop a sentiment classifier for Twitter posts matching certain keywords remarks, “These kinds of things are possible not just because of co pilot [sic] but also because we have awesome libraries which have abstracted a lot of tough stuff.” This suggests that API design, not just for human developers but also as a target for large language models, will be important in the near- and mid-term future.
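A short sketch makes the point concrete. Using Hugging Face’s transformers library purely as an illustration – the report does not name the libraries its author relied on – a sentiment classifier reduces to a few lines, because the library designers have already chosen an abstraction that fits the task; a code-generating model only needs to reproduce the idiom.

```python
from transformers import pipeline

# The library's high-level abstraction does the hard work: model selection,
# tokenisation, and inference are all hidden behind a single call.
classifier = pipeline("sentiment-analysis")

tweets = ["Loving the new release!", "This update broke everything."]
for tweet, result in zip(tweets, classifier(tweets)):
    print(tweet, "->", result["label"], round(result["score"], 3))
```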
Moreover, breaking down a prompt at the ‘correct’ level of detail is emerging as an important developer skill. This requires at least some familiarity with, or a good intuition for, the APIs available. Breaking down prompts into steps so detailed that the programmer is effectively writing pseudocode can be viewed as an anti-pattern, and gives rise to the objection cited earlier that programming via large language models is simply a “very inefficient programming language”. We term this the problem of fuzzy abstraction matching. The problem of figuring out what the system can and cannot do, and of matching one’s intent and instructions to the capabilities of the system, is not new – it has been well documented in natural language interaction (Mu & Sarkar, 2019; Luger & Sellen, 2016). It is also observed in programming notation design as the ‘match-mismatch’ hypothesis (T. R. Green & Petre, 1992; Chalhoub & Sarkar, 2022). In the broadest sense, these can be seen as special cases of Norman’s “gulf of execution” (Hutchins et al., 1985), perhaps the central disciplinary problem of first- and second-wave (Bødker, 2015) human-computer interaction research: ‘how do I get the computer to do what I want it to do?’.
What distinguishes fuzzy abstraction matching from previous incarnations of this problem is the resilience to, and accommodation of, various levels of abstraction afforded by large language models. In previous natural language interfaces, or programming languages, the user needed to form an extremely specific mental model before they could express their ideas in machine terms. In contrast, large language models can generate plausible and correct results for statements at an extremely wide range of abstraction levels. In the context of programming assistance, this can range from asking the model to write programs based on vague and underspecified statements, requiring domain knowledge to solve, through to extremely specific and detailed instructions that are effectively pseudocode. This flexibility is ultimately a double-edged sword: it has a lower floor for users to start getting usable results, but a higher ceiling for getting users to maximum productivity.
In the context of programming activities, exploratory programming, where the goal is unknown or ill-defined (Kery & Myers, 2017; Sarkar, 2016), does not fit the framing of fuzzy abstraction matching (or indeed any of the variations of the gulf of execution problem). When the very notion of a crystallised user intent is questioned, or when the design objective is for the system to influence the intent of the user (as with much designerly and third-wave HCI work), the fundamental interaction questions change. One obvious role the system can play in these scenarios is to help users refine their own concepts (Kulesza et al., 2014) and decide what avenues to explore. Beyond noting that such activities exist, and fall outside the framework we have proposed here, we will not explore them in greater detail in this paper.