Transcript
Böckeler: My name is Birgitta Böckeler. I’m a distinguished engineer at Thoughtworks. For over one and a half years now, I’ve been in a global role at Thoughtworks as a subject matter expert and developer advocate for everything to do with using AI in the software delivery life cycle. Coding, of course, is the biggest part of that at the moment, and probably also for the foreseeable future. I’ve been using a lot of different coding assistants, trying to stay on top of all the developments that are happening. If you feel bad for not keeping up with this: I have a full-time job for this and I cannot keep up, so you don’t have to feel bad. I also talk to my colleagues about what they’re doing on the ground with our clients. With my talks about AI coding assistants, I always want to get the skeptics a bit more excited, and bring the enthusiasts a little bit more down to earth.
The History of Features
I’ll start with a quick history of the features. It all started with auto-suggest. You could write a comment or a method signature and it would start suggesting code for you. Here, in this case, it’s even suggesting the full method body for me. In some cases, it would just be one line. It felt like autocomplete on steroids. Then we got chat, in GitHub Copilot, for example, and in the other coding assistants. In the beginning, it was just the ability to ask questions directly in your IDE, so you wouldn’t have to go to ChatGPT in the browser or Google Search. Especially for more basic questions, it was already quite good. This was an actual question. I once asked it, is there a concept of static functions in Python? Because my native language is Java and I was writing Python for the first time, and I was trying to figure out questions like this. When you ask this on the internet, you might have to scroll through a lot of stuff until you get to what you actually want to know.
Then, next, we got more IDE integration of these features. Here you see that in GitHub Copilot we got additions to the Quick Fix menu, like “fix this using Copilot” or “explain this using Copilot”. There was also more ability from the chat to actually point at a file. Here, in this case, I’m pointing at one of the JavaScript files I had in my workspace, so I had a bit more control over where I pointed it and what context I gave it. Then came even more IDE integration. Now we don’t have to write comments that we then delete afterwards, or method signatures. You can open up little inline chat windows and give your instructions there. Then you get a little diff, so it will change one or two lines for me, and I can see what it was before and what it is afterwards.
Then we got even more powerful chat, where you could chat with the code base. Here I was asking questions like: can you see any tests for this particular dropdown box? This was a code base that was totally unfamiliar to me. In the background, it will turn that into a search query, in this case, and actually help me find places in the code base. I have definitely found that this is a step up from string search with Command+F or Control+F, and it can be quite useful.
In most cases it actually points me in the right direction. It’s not always perfectly the right place, but it’s become quite powerful. Interestingly, a lot of the coding assistants do this differently under the hood. There are a lot of different techniques for it, and it’s quite interesting to see the different results. The example here is from Cursor; under the hood they use an index of the code base, where they turn the code base into vector embeddings. What also started coming up more and more is context providers, so that when I’m in the chat, I can pull in other context. If you, as a developer, know which context providers are available, it can be quite useful to pull those in in the right situation. You see some examples here.
One that I often find quite interesting is the git diff, or the local changes: I can point it at that and, for example, say, give me a review of this. Web content can also be quite powerful: you can paste the URL of a site that has documentation for the library you’re using, and it will pull that in. Or what’s in the terminal, or maybe reference documentation. Then there is more and more integration with tools, like with Continue here: they started early on with integration with Jira or Confluence, which more and more coding assistants now have as well.
Then, at the same time, of course, the models evolved as well; I was only talking about the tools so far. This is by no means a detailed history of all the model developments over time. We started with GitHub Copilot, with Codex and GPT-3.5, then Copilot also went to GPT-4. There are a bunch of coding assistants that have their own models; Tabnine, for example, is actually even older than GitHub Copilot, so Copilot wasn’t the first of these AI coding assistants. I put the Claude Sonnet series here as a big oval on the right because, for multiple months now, stability, who would have thought, the Claude Sonnet models, 3.5 and 3.7, have been by far the most popular for coding. It’s been hard for other models to get to that level; they’re the most popular right now.
At the same time, you also see that arrow there at the bottom: the so-called reasoning capabilities of models have also evolved, and a lot of people say that when they’re planning or debugging, the reasoning capability can give it an extra boost. So, with all of these features: in October, November last year, these were the most common ones. The unit of assistance that I would get at the time was usually about the size of a smaller method, maybe a bit less. You would maybe go multiple lines at a time. Of course, people also go into the chat and maybe get a full class or something, but typically I would use it in smaller steps, because that also left me in control, and I could actually review along the way what was happening.
Potential Impact of Coding Assistants on Cycle Time
In terms of the impact, that’s the frequently asked question: what is actually the impact of this? When we think about coding assistant impact on cycle time, which is our common proxy variable for speed, I always thought about it like this. You think about how much of our cycle time we spend on coding. We have a very optimistic scenario here, because 40% is definitely quite high. In the talk by Trisha and Holly, they actually did a little poll, and most people said they spend less than 30% of their time on coding.
Then, let’s say that when you code, the coding assistant is actually useful to you 60% of the time, because it isn’t always useful. There are very complex tasks where it doesn’t work, or maybe you’re using a tech stack that the model is less familiar with. It just doesn’t always work; it’s hit and miss. Let’s say 60% of the time it’s useful. Then, whenever it is useful, we are 55% faster, which was that big number in GitHub’s marketing materials that freaked everybody out, and people thought they could now cut their teams in half. It was about task completion: 55% faster task completion.
If you have these numbers, then the impact on your cycle time would be about 13%. That’s also the ballpark of what I saw people seeing last year, or not even seeing. Because if you imagine, let’s say, you do have an 8% to 10% impact on your cycle time, I imagine most of your cycle time graphs look like this: very variable. You might not even see 8% show up. Everybody was anchored on the 55%, so everybody also thought, 8%, that’s nothing, that’s not worth it. But if it actually makes the team that much faster, I think that’s not so bad. Like I said, this was the ballpark that I was also seeing with our teams at clients where they were using coding assistants.
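To make the arithmetic concrete, here is a rough back-of-the-envelope version of that multiplication; the percentages are just the optimistic assumptions from the slide, not measurements:

```python
# Back-of-the-envelope estimate of coding assistant impact on cycle time,
# using the optimistic numbers from the slide.
coding_share = 0.40         # share of cycle time spent coding (optimistic; polls suggest under 30%)
assistant_useful = 0.60     # share of coding time where the assistant actually helps
speedup_when_useful = 0.55  # the "55% faster task completion" from GitHub's marketing

impact_on_cycle_time = coding_share * assistant_useful * speedup_when_useful
print(f"Estimated impact on cycle time: {impact_on_cycle_time:.0%}")  # roughly 13%
```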
GenAI Tooling – A Moving Target
GenAI tooling is a moving target. If you are in an organization where you had a nice setup to measure this, and you came to a similar conclusion, you have to start all over again, because now agents have entered the field. I’m specifically talking here about supervised agents. I will not be talking about Devin, for example, which is an agent that you give a task and it just goes off by itself autonomously and then comes back with a full commit, because I haven’t actually seen those work a single time yet. Some of you might have seen the YouTube videos of people picking apart the marketing videos of these agents and showing that, no, it didn’t actually solve the problem. What is quite interesting and already practically usable are these new agentic modes that we now have in coding assistants, where you as a developer are still in charge and continuously supervising, continuously looking at what is being done, adjusting, intervening, and so on. This is a little bit of a timeline here as well.
Interestingly, you see Cline here on the left: that is an open-source project that was already super popular before products like Windsurf or Cursor even had these modes. An open-source project was already useful in this way without all of the VC money that gets pumped into the other ones. The colored ones here are the ones that I have mostly used; that’s what the experience I will share next is based on. I mention a few others here that also have this agent mode; I can’t quite keep up right now with which coding assistant already has it and which doesn’t have it yet. I put GitHub Copilot in italics here because they still haven’t fully released their agentic mode; it’s only available in pre-release at the moment. In that space, they’re falling a little bit behind products like Windsurf and Cursor.
My experience is based on those, and almost all of the time when I use them, I use Claude Sonnet as the model; just keep that in mind when you hear my experiences. Also, 98% of the time I use them to edit existing code bases, not to build something like tic-tac-toe from scratch. I try to use them for what we actually do most of the time: when we change code, we change existing stuff. Here, now, the unit of assistance is growing in size. It’s actually changing multiple files for me. I put a question mark here because “problem”, of course, is a word that can mean a very small problem, just change this one function for me, or it can be a larger thing. I’ll get back to what problem size is suitable for this a little bit later, but basically agents introduce this thing where it’s not just a few lines of code at a time; we actually work on a larger context.
What is an Agent?
What is an agent? It’s a very widely used term right now, but not really specifically defined. I think we need to put qualifiers in front of agent now, the same way we put them in front of service. In the context of coding assistants, agent means that the coding assistant ultimately puts together a nicely orchestrated prompt for you, with context from your code base, whichever context providers you used, and so on, and sends that prompt to a large language model. The coding assistant has access to tools: it can read files, change files, execute commands, run tests. That’s either in the IDE or in a terminal session.
Some of these assistants also just run in your terminal. Now, what happens, and this is what always happens in agents, but here specifically for coding assistants, is that when it sends the prompt, the ultimate prompt, it doesn’t only say, this is what the user wants. It also sends the large language model a description of all the tools that it has available, almost like an API description: you can read files, change files, just let me know what else you need and I’ll do it for you. Then the large language model will come back and say, actually, I think I need to know what the results of the tests are right now; please run yarn test and give me the result. It will go back and forth like that. That’s how agents work.
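You can think of it as a simple loop. This is only an illustrative sketch, not how any particular assistant is implemented; the message format and the call_model placeholder are made up:

```python
import subprocess
from pathlib import Path

# The tools the assistant makes available, described to the model almost like an API.
TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def call_model(messages, tool_descriptions):
    """Placeholder for the actual LLM call. A real assistant sends the conversation
    plus the tool descriptions and gets back either a tool request or a final answer."""
    raise NotImplementedError("wire up your model API here")

def agent_loop(user_request):
    messages = [{"role": "user", "content": user_request}]
    while True:
        reply = call_model(messages, tool_descriptions=list(TOOLS))
        if reply["type"] == "tool_call":
            # e.g. {"type": "tool_call", "tool": "run_command", "arg": "yarn test"}
            result = TOOLS[reply["tool"]](reply["arg"])
            messages.append({"role": "tool", "content": result})  # feed the result back, go around again
        else:
            return reply["content"]  # the model says it is done
```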
Here’s a very simple little video. You don’t need to see the details of this. I just want you to get a feeling for it. I basically ask it to modify something in one of my functions. It’s changing my test file first, then it’s changing the actual implementation, then it’s running the tests. You can see here it puts together the command to run the tests. It sees, the test is green and it says we’re done. You can imagine this is, of course, even more useful when the test fails because it will immediately continue and say, there’s a failure. Let me actually immediately check what is that failure. It increases the automation that I have because previously when I was working with an assistant, I would probably have gone and copy and pasted or sent the failure from the terminal to the chat. Now it’s immediately picking up on that itself.
Here’s another example, this time in Windsurf; before it was Cursor. It’s actually going and doing web research for me. There’s an error I have about an incompatibility between libraries, and it’s browsing the web, reacting to what it finds, and making a suggestion for how to solve my problem. Another one that I really like is that most of these now also pick up on what they all call lint errors. What that actually means is whatever your IDE flags, let’s say the squiggly lines: it’s picking up on warnings, on errors, compile errors, transpile errors, and so on. We now also have more automation here, where it immediately double-checks what it did. In this particular case, you can maybe guess from these types of errors that it just totally messed up the syntax. It immediately gets the feedback from the IDE, as I would as a developer, and then tries to fix it for me right away. So I might have a higher success rate overall with the usefulness of these.
Agentic Modes
This is the next step change, these agentic modes. Coding now looks a little bit like this: I’m stretching while I watch the tool do stuff. I use it all the time, not just because it’s my job. It’s almost a little bit addictive. I saw somebody write on Reddit, I think, that it’s a little bit like a slot machine: either you win or you don’t win, and you always have to put in more money to try again. Now, it’s not that you can just use it for anything, or give it half a sentence and it will magically do what you want. But why I like it is that it does frequently really reduce my cognitive load.
I still feel in control, because I give it the plan, this is what I want to do, and then for the little details I don’t have to look up exactly how to do it. Sometimes it also helps me think through a design, when it starts doing something and I’m like, no, this doesn’t work, let’s roll back, I have a new plan. As Gene Kim says, it helps you create options. The more options you have, the more you might actually sometimes reduce risk or improve your design. It also helps me solve issues much faster, a lot of the time.
Before I get to what the catch is, or what the many catches are, I just want to finish the feature tour. What other features are now relevant, or even more relevant than before, with these agentic modes? One is MCP, which is exploding at the moment. MCP stands for Model Context Protocol. It’s a standard introduced by Anthropic to bring some standardization to those tools that I mentioned before that the agents get. Now I have my coding assistant, which, in the MCP world, is considered the MCP client. Then I can run MCP servers on my machine, like little programs; at least that’s how they’re mostly used at the moment, though you can also have remote MCP servers. Such a server can be a Python process or a Node process. Those programs can use the browser, or search Confluence, or query my particular test database, if you build a custom one for yourself, or add a comment to a Jira ticket, and so on.
Then, just like before, the coding assistant sends descriptions of these MCP server tools to the large language model. In a debugging session, for example, it might say, I need to query your test database to see what data we have right now while I try to find out what the bug might be. Here’s an example of using an MCP server called Playwright MCP, which was released by Microsoft as open source. Here I asked it to browse an application. I told the agent: browse the application, see what you find out about this feature and what’s happening, and create a markdown file that describes how the application works.
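To get a feeling for how small such a custom server can be, here is a minimal sketch in Python using the official MCP Python SDK’s FastMCP helper; the tool and the test database path are made-up examples, and the SDK’s API may have moved on, so check its documentation:

```python
import sqlite3

# Requires the official MCP Python SDK (the "mcp" package); the FastMCP helper
# shown here may change between SDK versions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("test-database")

@mcp.tool()
def query_test_db(sql: str) -> list:
    """Run a read-only SQL query against the local test database.
    The database path is a made-up example."""
    with sqlite3.connect("file:test.db?mode=ro", uri=True) as conn:
        return conn.execute(sql).fetchall()

if __name__ == "__main__":
    # The coding assistant (the MCP client) starts this process and can then
    # offer the query_test_db tool to the large language model.
    mcp.run()
```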
This is super powerful, but beware that MCP servers are really like Wild Wild West right now. Like I said, they run on your machine, and a lot of people are building them right now, putting them on GitHub, and you can go and clone other people’s MCP servers, run them on your machine. What can go wrong? This is like your IDE, and what you’re doing with the coding assistant is in the middle of your software supply chain. That’s a very valuable target. The tool descriptions might have malicious additional instructions, or very easily the implementation of the tool could just have a statement that just calls some attacker server or something like that, if you don’t look at what it’s doing.
As with all dependencies that you use, you have to check, do I trust the source? For example, do you trust Microsoft and their Playwright MCP server? In my case, I decided to do that. I also suspect that lots of these MCP servers at the moment are built in vibe coding mode. I’ll get to vibe coding. Who knows what they do?
The next feature, one that already existed before but is becoming much more important now, is custom instructions, or, as some assistants call them, custom rules. Basically, you can have things configured in your assistant that get added to the prompt every time you start a new session. For this one project that I have, for example, it contains things like: the backend code is in Python, the Python code is in the app folder, the build system we use is Poetry.
Then, over time, I noticed it kept forgetting to activate the virtual environment before executing commands. I then iterated on the instructions and said, remember to activate the virtual environment before any Python or Poetry command. You can iterate on these as you run into common problems. You can also share them across the team. Those are usually dotfiles in your workspace, so you can commit them to your repository. You can also have global instructions that you always want to use, or that are maybe more specific to yourself as an individual developer.
Things like, be short and concise, or use a certain tone. I know one of my colleagues told it to always speak like Yoda to him. I recently experimented with what you see at the bottom: when I say “/wrapup”, draft a commit message for me, and remind me to make sure that the server still starts, because there was a gap in the automated testing that we had, so I always wanted to do that before I committed. You can try these little pseudo commands for yourself.
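Put together, such a project rules file might look roughly like this; the file name and exact wording are just an example, and each assistant has its own convention for where these dotfiles live:

```
# Example project rules file; the file name depends on the assistant
# (for example .cursorrules or .windsurfrules), and this wording is just a sketch.
- The backend code is in Python; the Python code is in the app folder.
- The build system we use is Poetry.
- Remember to activate the virtual environment before any Python or Poetry command.
- When I say "/wrapup": draft a commit message for me and remind me to make sure
  that the server still starts.
```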
Remember, though, that there are no guarantees that these will work. They can mitigate a lot of the typical things where the AI tool might go wrong for you and really increase the usability. Who of you has never said something like this to a large language model? “Important, please always do the following, always, always, always, please, please, please”. There’s no guarantee that it will do what you want all the time, especially as your sessions get longer. The fuller the context window gets, even with a model that supports a very large context window, the less attention it will have on all of the details that are actually in there. There’s no guarantee.
Then, secondly, a lot of people put stuff like this into their custom instructions, like “follow best practices”. How do we know that that actually works? How do we know what the large language model considers best practices? It feels like, yes, I get higher-quality code when I say that, so I’m more in control, but it might not actually do anything. You would actually have to run an experiment to see: is it better or worse when I add it? In any case, it’s a super-useful tool for agents. But also here, an alarm about the software supply chain: this vulnerability was published, and they call it the Rules File Backdoor.
The thing is that people share these custom rules and custom instructions with each other on the internet, the same as MCP servers. There are whole websites that popped up for Cursor rules, where people share them with each other, and then you go and copy and paste them. For this vulnerability, they found that there are ways to put hidden characters into these rules that then lead the large language models to generate additional things as part of your instructions. As an example, they describe how it might put a script tag into your generated HTML code that calls an attacker’s server. Yes, Wild Wild West at the moment. Maybe type them all out by hand.
Finally, in this section, a few ways of working that are emerging right now with agents. The first one is to ask the agent to plan with you first. So don’t go directly with one sentence of what you want to do, but go in smaller steps and first say, let’s plan together, because then you also have a chance to review the plan before it goes off the rails: I want to do the following, let’s make a plan first. A lot of people use a model with reasoning capabilities in this step, because they find that it sometimes gives them better results. It’s more of a chain-of-thought type of activity, so a reasoning model might make it better.
The second one is small sessions, concrete instructions. I talked before about the context window and how the more stuff you have in there, the more it might lose attention. You actually have to start new sessions every now and then to get better results again. Also, concrete instructions. You can say, make it so I can toggle the visibility of the edit button, or you can say, add a new Boolean to the database, expose it through the following endpoint, and toggle visibility based on that. Just to give you an example of what I mean by concrete instructions. I’ve also found from my colleagues who are using these on the ground for real code bases that they sometimes go into quite a bit of detail. Of course, you have to balance at which point you’re going into so much detail that you could just as well write it yourself. As I said before, I also feel like this, in a nice way, actually reduces my cognitive load sometimes with the details. The thing you see on the right could be a result of the planning step as well.
Then, finally, what enables you to have smaller sessions without having to re-describe every single time what you’re currently working on is to use some form of memory. You can, for example, say, “Put the plan into a Markdown file, and always update it along the way as we’re working”, marking “this is done, this is done”, so that whenever you start a new session, you can say, “I’m working on this”, point to the file, “now continue with the following task”. Some coding assistants are starting to support something like this natively, where you actually have a memory feature, but it’s just as easy to do it like this, and then you also have something that you can come back to the next morning. I used to always have sticky notes on my desk with all of the things I needed to do for the story, and now there’s a new incentive to actually have this in a digital file, because the agent can also work with it.
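Such a plan file doesn’t need to be anything fancy. A sketch of what one might look like, with made-up tasks based on the earlier toggle example:

```
# plan.md (hypothetical working-memory file that the agent keeps updated)

Goal: toggle the visibility of the edit button per user

- [x] Add a new Boolean to the database
- [x] Expose it through the settings endpoint
- [ ] Toggle the button's visibility in the frontend based on that flag
- [ ] Update the tests and check that the server still starts
```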
Vibe Coding, and AI as a Teammate
Stepping back a bit from the feature details: the internet is now full of videos like this one, building Snake, for example, or, like I said, tic-tac-toe. It generates an HTML file, a JavaScript file, a CSS file. It runs a command that starts the server. Here in Windsurf, I can even have a browser preview in the editor, and I have a Snake game. The original video of this was two minutes; I sped it up, of course. When you see these videos, you also get statements like this one: the Anthropic CEO says that in three to six months, AI will be writing 90% of the code software developers were in charge of. Do I agree with that? Do I think I will only have to be in charge 10% of the time? It depends a lot on what that means. I can totally see how in six months, regardless of what happens, you can reinterpret that statement to say that he was right. Does it mean AI types 90% of the code for me? Does it mean it does so fully autonomously? Does it mean it does so without my intervention?
One term in particular captures this moment in time, statements like these, and this debate, and that’s, of course, vibe coding. People are already using it as a replacement term for AI-assisted coding in general. Andrej Karpathy tweeted this on February 2nd, and said, vibe coding is when you don’t even look at the code, you just tell the AI what you want, maybe even using voice instead of typing, and you iterate via the chat until it does what you meant. Then, about six weeks later, we had pages like this, a job board for vibe coders. It actually goes beyond pages like this: on actual job boards, there are job ads with the title vibe coder.
Also, about six weeks after the tweet, this happened on the internet. I don’t know if some of you saw this. This went pretty viral. This guy said, “My SaaS was built with Cursor, zero handwritten code. AI is no longer just an assistant, it’s also the builder. Now you can continue to whine about it or start building, P.S., yes, people pay for it”. Some people might have seen what happened two days later. “Guys, I’m under attack. Ever since I shared about this, random things are happening, maxed out usage on API keys, people bypassing the subscription, creating random shit on the database.
As you know, I’m not technical, so this is taking me longer than usual to figure out. There are just some weird people out there”. There was somebody who responded to this. You know how, when you read a tweet, you have a tone in your head of how the person said it? This guy responded and said, “I guess it’s because you just don’t know what you’re doing”. I didn’t hear it as aggressive or anything. I just heard, like, “Honey”. Of course, for us, when you go to QCon, it feels good to hear this. It’s like, “Our expertise still matters”.
If we look back at Karpathy’s definition, I also feel a little bit sorry for Andrej Karpathy for what he kicked off with this. Later in the tweet, he said, it’s not too bad for throwaway weekend projects, but still quite amusing. A lot of people were also introduced to what I showed you before, all of those features that now exist, through this term and this meme. Like I said, it’s become the replacement term for agent-assisted coding. But vibe coding is just one of the ways that you can use these tools. I sometimes go into that mode for certain things: maybe I’m just whipping up a quick utility tool, or I’m just asking the chat for 10 minutes to style my page a certain way.
Then I drop out of vibe coding mode and I delete 80% of the CSS because it’s unnecessary. It’s just one of the ways to do this. There shouldn’t be job ads with that title. If it’s not vibe coding, then, what are the other downsides of this? For us, it’s maybe obvious that vibe coding shouldn’t be what you do 100% of the time. In November 2023, I actually wrote a blog post about how you should maybe think of AI on your team as a character and give it some characteristics, to make it easier for you to understand when you should believe it and when you should trust it, the same as with a teammate. You might trust an intern’s advice less than that of your colleague who’s been doing backend for 20 years. You might also not trust your colleague who’s been doing backend for 20 years to help you with CSS.
That’s an intuitive thing that we do as humans. So I said, ok, the AI teammate is eager to help, very well-read, but inexperienced. Or, as I always say for D&D players: very high intelligence, very low wisdom, often stubborn, and won’t admit when they don’t know something. Or you could also say: when you point out that they didn’t know something, they’re very polite about it. It’s this character. I would say that now it has definitely become more sophisticated, maybe a little less stubborn sometimes, but still, most of these principles apply. It’s just that now it has access to a lot more stuff than it had before.
AI Missteps
One thing is amateurs building SaaS and not knowing what they need to do to protect an API endpoint. But let’s say you are on a professional team and you actually do have all those things about security and resiliency in your backlog that you need to build. Can we now just give those instructions to AI, like, protect the API endpoint, and let it go off on its own? I want to give you a few examples of the types of AI missteps or blunders where I, as the developer and supervisor, actually have to step in and hold it back or steer it a little bit. Maybe they give you some more data points about how your experience actually does still matter.
Then you can judge for yourself whether you think this will be fixed in three to six months. Here’s the first one. The tool was telling me: we’re hitting an out of memory error, let’s increase the memory limit. We were working on a Dockerfile, building a JavaScript application. I said, but why do we have the memory error? It said, yes, that’s a good question: we’re running npm install three times in our build process and it includes dev dependencies, and we shouldn’t do that. I was like, yes. This is something that can actually lead to pain further down the road. It’s a clear signal of an inefficient build process.
This type of brute force fix: “I know what to do”, very eager, it immediately fixes it, and I have to say, yes, but why? I know that this is not the right way to do it because I’ve been doing this for a few years. Here’s the next one. I said, “Looks good”, I’m always very polite to the AI, “I’m just curious why the tests are all passing, even though we didn’t change anything in the tests”. What I had asked it to do was a refactoring: merge two methods into one, because I realized I don’t need both of them.
Merge the parameter sets, somehow I had described it like that. It had changed my code, but the tests weren’t changed and they were still green. It said, “They still pass because we maintained backward compatibility by keeping the original method names as thin wrapper methods that call the new unified method”. This is actually relatively easy to spot in a code review as well, if you pay attention. It’s also something you can then, again, put into your rules and say, never do backwards compatibility with thin wrapper methods, and your rules file will grow and grow. It’s just one example of why you still very much have to look at and review your code, or it’s going to get bigger much faster.
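In code, that pattern looks roughly like this; the class and method names here are made up for illustration:

```python
class ReportService:
    def generate_report(self, data, fmt="pdf", include_summary=True):
        """The new unified method that merges the two old ones."""
        ...

    # What the AI did: keep the old method names as thin wrappers that delegate
    # to the new method, so the existing tests stay green without being updated.
    def generate_pdf_report(self, data):
        return self.generate_report(data, fmt="pdf")

    def generate_summary_report(self, data):
        return self.generate_report(data, include_summary=True)
```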
Tests in general are a whole thing. AI is actually quite good, in general, at generating tests. I’m not saying never do that, but there are a lot of caveats here. For example, I often see it create very verbose or redundant tests, or redundant test assertions, like putting additional assertions in even though they’re already covered in a lot of other tests. The more you have of those, the more brittle your tests will become over time. At some point you’ll have a test suite where every time you make a change, 20 tests fail. On the flip side of that: not enough tests.
Recently I asked it to change the behavior of an API endpoint. To make that happen, it also had to change some other files, but it only added a test for the API endpoint, which of course mocked the other things. I didn’t have test updates for the other things that had changed, so I have to be vigilant and see how good the test coverage actually is. Speaking of mocks, there’s a lot of mocking. Again, this is a very typical thing that we also started with as developers. I often see it mock even the data objects that I pass around between methods, which is a little too “unit” for me; there needs to be a little bit of functionality that actually still gets tested. Tests in the wrong places as well: sometimes it just doesn’t put them in the right test suite, which could, again, confuse me or the AI in the future about where tests actually are.
Another thing is that I’ve recently tried to get AI to use red tests as a tool, because that would be really useful for me as well, to understand if it did what I wanted it to do. I try to tell it to first adjust the test, then run the test so that it’s red, then stop so that I can look at it, and only then update the implementation. Because then I can look at it: does it fail for the right reason? That gives me really good confidence in the next step that it’s actually doing what I wanted. It’s really hard to get it to do that, to even do the red test first, let alone stop. It always wants to go immediately into the implementation.
Then, a final example, and this is representative of how it can mess up the design. The two green lines here are the ones that the AI added. It introduced a new string parameter to this chat constructor, and it’s getting the value for it from this knowledge manager component, which is already a dependency of the chat class. So it’s creating a more verbose interface when, in this case, that wasn’t necessary; that first line should actually go into the chat class. It’s a nice small example that I could put on one slide, but this can lead to a sprawl of parameters and dependencies in your code base if it cannot always understand what’s going on there.
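Reconstructed with made-up names (the parameter and the getter are hypothetical, only the chat class and knowledge manager come from the example), the change looked something like this:

```python
# What the AI generated: a new string parameter on the constructor, with the
# caller fetching the value from the knowledge manager it already passes in.
class Chat:
    def __init__(self, knowledge_manager, default_topic: str):
        self.knowledge_manager = knowledge_manager
        self.default_topic = default_topic

# Caller, the two added lines:
#   default_topic = knowledge_manager.get_default_topic()
#   chat = Chat(knowledge_manager, default_topic)

# Tidier alternative: derive the value inside the class from the dependency it
# already has, so the constructor interface stays small.
class Chat:
    def __init__(self, knowledge_manager):
        self.knowledge_manager = knowledge_manager
        self.default_topic = knowledge_manager.get_default_topic()
```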
Impact Radius of AI Blunders
I think about this in terms of the impact radius when it does these things and I’m either not paying attention or I don’t know any better, maybe because I’m fresh out of university; that has different impacts. There are some things on the commit level where I think it doesn’t actually matter that much, because they’re very obvious. You might actually end up slower rather than faster if AI just keeps creating code that doesn’t work, which still happens. Like I said, I’m very impressed by how things have evolved, I use it all the time, and it often does really cool things, but sometimes it also just totally fails.
Then, if I dig deeper and try to fix it, I might actually be slower rather than faster. But it’s obvious; it might not even end up in the commit. Secondly, there are things that, if I push them because I’m not paying attention, might create friction for the team or within the iteration. I was working on a developer workflow once, the one with the brute force fix of the Dockerfile, and it created a very complicated Node developer workflow. Then, at some point, I realized the hot reload wasn’t working anymore.
If you pushed that in a team, the others would immediately say, why is the build suddenly so slow, or why is my hot reload not working anymore? You also have more of a tendency to create larger commits with AI, because it creates so much code for you, which might lead to more merge problems. You also have larger change sets that you deploy, so they’re harder to debug and higher risk. Sometimes it also just makes wrong assumptions about the requirements, or builds too much, which then leads to a back and forth with the product owner or the QA and, again, creates friction on the team.
The most insidious one, of course, is the outer blast radius: code base lifetime or feature lifetime. Those are the design things that I talked about, or maybe the test suite getting more brittle or harder to understand, because that has an impact on long-term maintainability. You might say, but that doesn’t matter anymore because AI will help me fix it, but AI actually also needs a well-factored code base and expressive naming. If you have nicely modularized code, it’s also a lot easier for AI to reason: I have to make this change in this one module, and not in 10 places because it’s a big ball of mud.
There’s one study that I find particularly alarming, showing first signs of those more negative long-term impacts. This is data by GitClear. This is the data they first published in early 2024, about 2023. It’s a timeline; the very right, the last column, is 2023. They say you could argue that coding assistants started coming in in 2023, and they have these numbers about lines of code added, lines of code moved, copy and pasted, and churn. Churn is the one that you see go up like that. Churn, in their definition, means a line of code that was pushed and then either reverted or changed again within two weeks. You pushed it and then you realized: that doesn’t work, or that was unnecessary, or whatever.
About a year ago, they published this and they made some prediction of how this might change. Then they published the data again this year, about 2024. You can see all of those lines are basically going further up or down in the wrong direction. These are all signs potentially for more duplicate code, more churn, so more corrections within two weeks, and less refactoring, which might all come to haunt us in a year or two.
Working With AI Agents
GenAI is in our toolbox now. It’s not going away, even though I’m ending on these cautionary notes. I also want to highlight again, it is very useful if used in a responsible way. We need to figure out now as a profession how to use it effectively and sustainably. For example, as individuals, we have to fight this complacency and sunk cost fallacy. When you get a big component that does a lot more than you actually need it to do, and you’re like, “That’s nice. It does those things that we might need in the future”.
Then you just push it, and then you have to maintain that code. It sounds trivial, but I feel this all the time. I have to fight this all the time because it feels so like, “It’s there. I can just push it”. Review, review, review. Only use vibe coding when you have a really good reason. Then, think about your feedback loops, or like, what are we going to do with those feedback loops now? Can testing actually help us get more control over what’s going on? How do we, at the start of a session with an agent, actually define what is the feedback loop that tells us if this works? Know when to quit. When I feel like I’m losing control, I don’t know what’s going on anymore, I abort the session, or maybe restart, or I do it myself.
Then, as an organization or as a team, maybe we consider good old code quality monitoring tools, which maybe we sometimes didn’t use in the past because we thought we know what we’re doing. Now we have this new teammate. Maybe dust off SonarQube if you still have it running somewhere. I’m just saying that because I’ve seen a lot of teams who set it up at some point and then never actually looked at it. Tools like that can also do things like monitor duplicate code, one of the risks that we saw in the GitClear study.
CodeScene, for example, also has some new IDE integration features that show you in the IDE what your code health is, and if the changes of AI actually made it worse or better. Also, don’t shift right with AI. What I mean by that is that I see a lot of tools pop up now, for example, that do AI code review on your pull request. I wonder, why does it have to be in the pull request? Why not before I push? What can I put into my pre-commit script? I recently pushed some code and the pre-commit had a security scanning tool on and actually found a security vulnerability in the code that I had written with AI.
The earlier in your tool chain you can put this in, the less friction it creates on your team. Then, finally, create a culture where both experimentation and AI skepticism get rewarded: skeptics and enthusiasts unite. If you’re an enthusiast, don’t say to others, why are you so slow, you have AI, you should be faster. This is something that some organizations are doing to their workforce right now; they’re saying, you have to be this much faster and you have to use this. You can guess what happens: when people feel under pressure and they don’t know how to do it with AI, because it’s a moving target and we’re all learning it right now, they’re going to come up with shortcuts. Or they’re not going to fight their complacency, and just push the code because they’re under pressure to be faster.
Also, if you’re an enthusiast, praise the skeptics when they’re trying something that maybe two months ago they wouldn’t have tried. If you’re more of a skeptic, don’t say, why would you even think that could work? Actually, appreciate that your colleagues are trying some things because AI has some very unexpected features and downsides. Then, do praise when they’re actually having a closer look at the outputs and actually thinking about how to improve your team workflow.
Questions and Answers
Participant 1: It’s just that, as far as I know, a lot of these large language models are made to communicate with us, not to really solve coding or mathematical problems. The big elephant in the room is: are there any developments on the machine learning side towards these coding things? Because no matter how many times you change your prompt, if the model is meant for a different set of things, you will never get there. A year ago, I asked people at GitHub, and they said, yes, we’re using already existing machine learning models, but we don’t have any machine learning scientists working towards this. If we can ever solve this problem that way, I think it will only take longer to finish it. Are there any advancements in this area?
Böckeler: In the models, you mean like beyond large language models?
Participant 1: I mean, are there models specifically designed for coding, or specifically developed for these kinds of scenarios?
Böckeler: I don’t know of anything outside of the area of large language models. That doesn’t mean it’s not going on. Large language models think in tokens. The key here, and that’s what we’re seeing with the agents, is the integration with tools like an IDE that actually understands something about the structure of the code and that, for example, can give that feedback about the squiggly lines and things like that. I think Claude Sonnet is already quite good at what it does. There are just some things, like the examples that I showed here, where, at the moment, I cannot see how you can ever fully get over them when you use large language models.