Table of Links
Abstract and 1 Introduction
2. Prior conceptualisations of intelligent assistance for programmers
3. A brief overview of large language models for code generation
4. Commercial programming tools that use large language models
5. Reliability, safety, and security implications of code-generating AI models
6. Usability and design studies of AI-assisted programming
7. Experience reports and 7.1. Writing effective prompts is hard
7.2. The activity of programming shifts towards checking and unfamiliar debugging
7.3. These tools are useful for boilerplate and code reuse
8. The inadequacy of existing metaphors for AI-assisted programming
8.1. AI assistance as search
8.2. AI assistance as compilation
8.3. AI assistance as pair programming
8.4. A distinct way of programming
9. Issues with application to end-user programming
9.1. Issue 1: Intent specification, problem decomposition and computational thinking
9.2. Issue 2: Code correctness, quality and (over)confidence
9.3. Issue 3: Code comprehension and maintenance
9.4. Issue 4: Consequences of automation in end-user programming
9.5. Issue 5: No code, and the dilemma of the direct answer
10. Conclusion
A. Experience report sources
References
2. Prior conceptualisations of intelligent assistance for programmers
What counts as ‘intelligent assistance’ can be the subject of some debate. Do we select only features that are driven by technologies that the artificial intelligence research community (itself undefined) would recognise as artificial intelligence? Do we include those that use expert-coded heuristics? Systems that make inferences a human might disagree with, or those with the potential for error? Mixed-initiative systems (Horvitz, 1999)? Or those that make the user feel intelligent, assisted, or empowered? While this debate is beyond the scope of this paper, we feel that to properly contextualise the qualitative difference made by large language models, a broad and inclusive approach to the term ‘intelligence’ is required.
End-user programming has long been home to inferential, or intelligent, assistance. The strategy of direct manipulation (Shneiderman & Norwood, 1993) is highly successful for certain limited, albeit useful, computational tasks, where the interface used to develop an information artefact (“what you see”, e.g., a text editor or an image editor) closely represents the artefact being developed (“what you get”, e.g., a text document or an image). However, this strategy cannot be straightforwardly applied to programs. Programs notate multiple possible paths of execution simultaneously, and they define “behaviour to occur at some future time” (Blackwell, 2002b). Rendering multiple futures in the present is a core problem of live programming research (Tanimoto, 2013), which aims to externalise programs as they are edited (Basman et al., 2016).
The need to bridge the abstraction gap between direct manipulation and multiple paths of execution led to the invention of programming by demonstration (PBD) (Kurlander et al., 1993; Lieberman, 2001; Myers, 1992). A form of inferential assistance, PBD allows end-user programmers to make concrete demonstrations of desired behaviour that are generalised into executable programs. Despite their promise, PBD systems have not achieved widespread success as end-user programming tools, although their idea survives in vestigial form as various “macro recording” tools, and the approach is seeing a resurgence with the growing commercialisation of “robotic process automation”.
Programming language design has long been concerned with shifting the burden of intelligence between programmer, program, compiler, and user. Programming language compilers, in translating between high-level languages and machine code, are a kind of intelligent assistance for programmers. The declarative language Prolog aspired to bring a kind of intelligence, where the programmer would only be responsible for specifying (“declaring”) what to compute, but not how to compute it; that responsibility was left to the interpreter. At the same time, the language was designed with intelligent applications in mind. Indeed, it found widespread use within artificial intelligence and computational linguistics research (Colmerauer & Roussel, 1996; Rouchy, 2006).
Formal verification tools use a specification language, such as Hoare triples (Hoare, 1969), and writing such specifications can be considered programming at a ‘higher’ level of abstraction. Program synthesis, in particular synthesis through refinement, aims to transform such specifications into correct executable code. However, the term “program synthesis” is also used more broadly: programs can be synthesised from sources other than higher-level specifications. Concretely, program synthesis by example, or simply programming by example (PBE), generates executable code from input-output examples. A successfully commercialised example of PBE is Excel’s Flash Fill (Gulwani, 2011), which synthesises string transformations in spreadsheets from a small number of examples.
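The core idea of PBE can be sketched in a few lines: search a space of candidate programs for one consistent with every given input-output pair. The sketch below is illustrative only; the candidate operations and names are hypothetical, and real systems such as Flash Fill use a far richer transformation language and efficient search (version-space algebras), not brute-force enumeration.

```python
# A minimal sketch of programming by example (PBE): enumerate a small,
# hypothetical space of string transformations and return the first one
# consistent with every input-output example.

def synthesise(examples):
    """examples: list of (input, output) string pairs."""
    # Hypothesis space: a handful of named string operations.
    candidates = [
        ("upper", lambda s: s.upper()),
        ("lower", lambda s: s.lower()),
        ("first-word", lambda s: s.split()[0]),
        ("last-word", lambda s: s.split()[-1]),
        ("initials", lambda s: "".join(w[0] for w in s.split())),
    ]
    for name, fn in candidates:
        if all(fn(inp) == out for inp, out in examples):
            return name, fn
    return None

# Usage: infer a transformation from two examples, apply it to new data.
name, fn = synthesise([("Ada Lovelace", "AL"), ("Alan Turing", "AT")])
print(name)                # initials
print(fn("Grace Hopper"))  # GH
```

Even this toy version exhibits the characteristic PBE trade-off: the fewer examples the user gives, the more programs remain consistent with them, so the system may generalise in unintended ways.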
The Cognitive Dimensions framework (T. R. Green, 1989; T. Green & Blackwell, 1998) identifies three categories of programming activity: authoring, transcription, and modification. Modern programmer assistance encompasses each of these. For example, program synthesis tools transform the direct authoring of code into the (arguably easier) authoring of examples. Intelligent code completions (Marasoiu et al., 2015) support the direct authoring of code. Intelligent support for reuse, such as smart code copy/paste (Allamanis & Brockschmidt, 2017), supports transcription, and refactoring tools (Hermans et al., 2015) support modification. Researchers have also investigated inferential support for navigating source code (Henley & Fleming, 2014), debugging (J. Williams et al., 2020), and selectively undoing code changes (Yoon & Myers, 2015). Additionally, intelligent tools can support learning (Cao et al., 2015).
Allamanis et al. (2018) review work at the intersection of machine learning, programming languages, and software engineering, which seeks to adapt methods first developed for natural language, such as language models, to source code. The emergence of large bodies of open source code, sometimes called “big code”, enabled this research area. Language models are sensitive to lexical features such as names, code formatting, and the order of methods, whereas traditional tools such as compilers or code verifiers are not. The “naturalness hypothesis” claims that “software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools”. Some support for this hypothesis comes from research that used n-gram models to build a code completion engine for Java that outperformed Eclipse’s built-in completion feature (Hindle et al., 2012, 2016). This approach can underpin recommender systems (such as code autocompletion), debuggers, code analysers (such as type checkers (Raychev et al., 2015)), and code synthesisers. We can expect the recent expansion in the capability of language models, discussed next, to magnify the effectiveness of these applications.
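To make the n-gram idea concrete, the sketch below trains a bigram model on a tiny, hypothetical token corpus and suggests the most frequent next token. This is a toy illustration of the statistical principle only; Hindle et al. trained higher-order n-gram models with smoothing on large Java corpora.

```python
# A toy bigram model for code completion: count, for each token, which
# token most often follows it in a (hypothetical) corpus, and suggest
# that token as the completion.
from collections import Counter, defaultdict

corpus = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( len ( items ) ) :",
]

# Count bigram frequencies: how often does token b follow token a?
bigrams = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

def complete(prev_token):
    """Suggest the most frequent token observed after prev_token."""
    followers = bigrams.get(prev_token)
    return followers.most_common(1)[0][0] if followers else None

print(complete("for"))    # i  (seen twice, vs. 'item' once)
print(complete("range"))  # (
```

The model captures exactly the lexical regularities the naturalness hypothesis points to: frequent local patterns in human-written code, invisible to a compiler, become completion suggestions.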