Let’s Talk Usability: Unpacking The User Experience Of AI-Assisted Programming

By News Room · Published 5 August 2025 (last updated 12:11 PM)

Table of Links

Abstract and 1 Introduction
2. Prior conceptualisations of intelligent assistance for programmers
3. A brief overview of large language models for code generation
4. Commercial programming tools that use large language models
5. Reliability, safety, and security implications of code-generating AI models
6. Usability and design studies of AI-assisted programming
7. Experience reports and 7.1. Writing effective prompts is hard
7.2. The activity of programming shifts towards checking and unfamiliar debugging
7.3. These tools are useful for boilerplate and code reuse
8. The inadequacy of existing metaphors for AI-assisted programming
8.1. AI assistance as search
8.2. AI assistance as compilation
8.3. AI assistance as pair programming
8.4. A distinct way of programming
9. Issues with application to end-user programming
9.1. Issue 1: Intent specification, problem decomposition and computational thinking
9.2. Issue 2: Code correctness, quality and (over)confidence
9.3. Issue 3: Code comprehension and maintenance
9.4. Issue 4: Consequences of automation in end-user programming
9.5. Issue 5: No code, and the dilemma of the direct answer
10. Conclusion
A. Experience report sources
References

6. Usability and design studies of AI-assisted programming

Vaithilingam et al. (2022) conducted a within-subjects comparative study (n=24) of GitHub Copilot, comparing its user experience to that of traditional autocomplete (specifically, the IntelliSense plugin, not the same as the IntelliCode feature mentioned previously). Participants failed to complete the tasks more often with Copilot than with IntelliSense, and there was no significant effect on task completion time. Perhaps unsurprisingly, the authors found that assessing the correctness of generated code is difficult and an efficiency bottleneck, particularly when the generated code has a fundamental flaw or inefficiency that leads the programmer on an ultimately unsuccessful ‘wild goose chase’ of repair or debugging. However, the overwhelming majority of participants (19 of 24) reported a strong preference for Copilot in a post-task survey. While participants were less confident about the code generated by Copilot, they almost universally (23 of 24) perceived it as more helpful, because it could generate useful starting points and save them the effort of searching online for documented solutions that could be the basis for reuse.

Ziegler et al. (2022) conducted a survey (n=2,047) of the perceived productivity of Copilot users in the USA. They matched responses to telemetric usage measurements from the Copilot add-in, including metrics such as how often a completion was shown, how often it was accepted, how often it persisted unchanged in the document for a certain period, how often it persisted with minor variations (measured, e.g., by Levenshtein distance), and so on. They found that the acceptance rate (the ratio of accepted suggestions to shown suggestions) was the strongest predictor of users’ perceived productivity due to Copilot. Fascinatingly, they found that the pattern of acceptance rates for all users in aggregate follows a daily and weekly “circadian” rhythm, such that users are more likely to accept Copilot completions outside of working hours and on weekends. However, for any given user, the acceptance rate depends on that user’s normal working hours: suggestions outside of their normal working hours are less likely to be accepted. Future work is needed to see whether this finding replicates and, if so, to establish how and why acceptance rates are so strongly affected by working hours.
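To make the metrics concrete, here is a minimal sketch of an acceptance rate and a Levenshtein-based persistence check. The event fields and the tolerance threshold are assumptions for illustration, not Ziegler et al.’s actual telemetry schema.

```python
from dataclasses import dataclass
from typing import Optional

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

@dataclass
class CompletionEvent:
    suggestion: str                   # completion text shown to the user
    accepted: bool                    # did the user accept it?
    text_later: Optional[str] = None  # document text at that site after a delay

def acceptance_rate(events: list[CompletionEvent]) -> float:
    """Accepted suggestions divided by shown suggestions."""
    return sum(e.accepted for e in events) / len(events)

def persisted_with_minor_edits(e: CompletionEvent, tolerance: float = 0.2) -> bool:
    """Did an accepted suggestion survive with only small changes?
    'Small' is a normalised edit distance below an arbitrary threshold."""
    if not e.accepted or e.text_later is None:
        return False
    return levenshtein(e.suggestion, e.text_later) / max(len(e.suggestion), 1) <= tolerance
```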

Xu, Vasilescu, & Neubig (2022) conducted a within-subjects study (n=31) comparing the programming experience with and without a code generation plugin. Their experimental plugin takes the form of a text field in which the user enters a natural language prompt; the system responds with a list of code snippets, and clicking a snippet inserts it at the cursor. This workflow differs from Copilot’s, where the ‘prompt’ is text within the source file and can contain a mix of natural language comments and code. The plugin supported both code generation (using a tree-based neural network) and code snippet retrieval (searching the programming forum Stack Overflow). Results from both generation and retrieval were shown in the same list, but visually demarcated. The authors found no significant effect of the plugin on task completion time or program correctness. They found that simple queries were more likely to be answered correctly through generation, that more complex queries requiring multiple steps were more likely to be answered correctly through retrieval, and that it was possible to predict which approach would succeed based on the word content of the queries. Further, they found that most (60%) of the natural language queries that participants wrote in their experiment were not sufficiently well specified for a human expert to write code implementing those intents. Retrieved snippets were edited more often than generated snippets, mostly to rename identifiers and choose different parameters. In a post-experiment survey, participants mostly reported feeling neutral or somewhat positive (30 of 31). These participants felt that the plugin was helpful for finding snippets they were aware of but could not recall, and less disruptive than using a browser, but the interaction worked better when the developer had pre-existing knowledge of the target APIs and frameworks, and it took experimentation to understand the “correct way” to formulate queries. There was no clear indication of preference between retrieval and generation.
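A minimal sketch of the combined workflow the study describes, with placeholder functions for the generator and retriever; the names and the interleaving policy are illustrative assumptions, since the paper does not specify a ranking.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Callable

@dataclass
class Snippet:
    code: str
    source: str  # 'generated' or 'retrieved', so the UI can visually demarcate them

def answer_query(query: str,
                 generate: Callable[[str], list[str]],   # e.g. a tree-based neural model
                 retrieve: Callable[[str], list[str]],   # e.g. a Stack Overflow search
                 k: int = 3) -> list[Snippet]:
    """Show generated and retrieved snippets in one list, preserving provenance."""
    generated = [Snippet(c, 'generated') for c in generate(query)[:k]]
    retrieved = [Snippet(c, 'retrieved') for c in retrieve(query)[:k]]
    # Interleave the two result streams; the snippet the user clicks would
    # then be inserted at the editor's cursor position.
    return [s for pair in zip_longest(generated, retrieved) for s in pair if s is not None]
```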

Jiang et al. (2022) developed an LLM-based tool for converting natural language statements to code. As in Xu, Vasilescu, & Neubig (2022), prompts are entered in a pop-up dialog invoked at the cursor from within a code editor, rather than as comments. In a study (n=14), participants were given a week to complete two website-building tasks with the tool while recording their screens, and were interviewed afterwards. As in other studies, participants saw utility in the tool for facilitating quick API lookups and for writing boilerplate code. The authors found that novice programmers’ queries were mainly natural language, whereas experts were more likely to mix code into their requests. While some queries were abstract and expressed high-level goals, most had low granularity, being “roughly equivalent to a line of code”. To cope with model failures, participants used a variety of strategies to reword their queries, such as reducing the scope of the request or replacing words with alternatives, but no particular strategy was observed to be more effective than any other. Participants struggled to form a mental model of what the model could understand and of the “syntax” of the language it required; this is precisely the fuzzy abstraction matching problem we described earlier, which the authors call an “uncanny valley”. The authors suggest possible solutions such as automated rewording of prompts, suggesting simpler tasks, suggesting task breakdowns, and better onboarding and tutorials.

Figure 5 – Searching for code snippets using Bing Developer Assistant. A result from Stack Overflow is shown. Note how the query “generate md5 hash from string @line” contains a hint about the identifier line, which is used to rewrite the retrieved snippet. Source: https://www.microsoft.com/en-us/research/publication/building-bing-developer-assistant/

Barke et al. (2022) studied how programmers (n=20) use GitHub Copilot to complete short programming tasks in Python, Rust, Haskell, and Java. Through analysis of screen recordings, the authors identified two primary modes of interaction with Copilot: acceleration, where the programmer has a well-formed intent and Copilot speeds up code authoring in “small logical units”, and exploration, where Copilot suggestions are used to assist the planning process, to “help them get started, suggest potentially useful structure and API calls, or explore alternative solutions”. In acceleration, long code suggestions, which take time to read and evaluate, can break the programmer’s flow. Participants developed heuristics for quickly scanning suggestions, such as looking for the presence of certain keywords. In exploration, participants were more likely to prompt using purely natural language comments, rather than a mix of comments and code. Moreover, these prompt comments were often ‘cleaned’ after a suggestion was accepted, which implies a form of ‘instruction language’ that is separate from ‘explanation language’.
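One can imagine the triage participants described looking something like the following sketch; the line-count cutoff and keyword check are illustrative assumptions, not a heuristic reported verbatim in the study.

```python
def quick_scan(suggestion: str, expected_keywords: list[str], max_lines: int = 10) -> bool:
    """Crude triage of a Copilot-style suggestion: skip suggestions too long
    to evaluate quickly, otherwise check that the API calls or keywords the
    programmer expects actually appear somewhere in the text."""
    if suggestion.count('\n') + 1 > max_lines:
        return False  # long suggestions break flow; defer rather than read
    return all(keyword in suggestion for keyword in expected_keywords)
```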

Al Madi (2022) compared the readability of code generated by Copilot with that of code written by human programmers in a user study (n=21). They found that model-generated code is comparable in complexity and readability to human-authored code.

The Bing Developer Assistant (Y. Wei et al., 2015; Zhang et al., 2016) (also referred to as Bing Code Search) was an experimental extension for Visual Studio initially released in 2015. It enabled an in-IDE, identifier-aware search for code snippets from forums such as Stack Overflow, and could rewrite retrieved code to use identifiers from the programmer’s current file. A user study (n=14) comparing task completion time on 45 short programming tasks with the extension versus regular web search found that on average 28% of the time was saved with the extension. Moreover, telemetry data gathered over three weeks (representing around 20,000 users and around 3,000 queries per day) showed that several programmers used the feature frequently. Some used it repeatedly for related problems in quick succession, showing its use in multi-step problems. Others issued the same query multiple times on separate days, suggesting that the speed of auto-completion was useful even when the programmer knew the solution.
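A minimal sketch of the identifier-aware rewriting that Figure 5 illustrates, assuming a simple ‘@identifier’ hint convention and a positional mapping from snippet variables to hints; the real system’s inference was more sophisticated than this.

```python
import re

def identifier_hints(query: str) -> list[str]:
    """Extract '@identifier' hints from a query, e.g.
    'generate md5 hash from string @line' -> ['line']."""
    return re.findall(r'@(\w+)', query)

def rewrite_snippet(snippet: str, snippet_vars: list[str], hints: list[str]) -> str:
    """Rename a retrieved snippet's variables to identifiers from the user's file."""
    for old, new in zip(snippet_vars, hints):
        snippet = re.sub(rf'\b{re.escape(old)}\b', new, snippet)
    return snippet

# A retrieved Stack Overflow snippet that hashes a variable named 'input_str':
retrieved = 'digest = hashlib.md5(input_str.encode("utf-8")).hexdigest()'
hints = identifier_hints('generate md5 hash from string @line')
print(rewrite_snippet(retrieved, ['input_str'], hints))
# -> digest = hashlib.md5(line.encode("utf-8")).hexdigest()
```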

