Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I'm sitting down with Trisha Gee. Trisha, welcome. Great to see you again.
Trisha Gee: Thank you for having me.
Shane Hastie: Now, you and I know each other. And you are a regular on QCon and InfoQ, but there’s probably two or three people who haven’t come across you before. So who is Trisha?
Introductions [01:09]
Trisha Gee: I am a developer advocate for Gradle, a Java champion, author. So I’ve been a developer, mostly in the Java world, for about 20 odd years. And a lot of what I do as a developer advocate is really around talking about how to make developers’ lives a bit less rubbish, how to make things easier for them, how to smooth their journey. So this is how I’ve ended up talking more about developer productivity lately, and talking about what we as developers can do to be more productive, to unblock ourselves from the rubbish that we don’t want to do.
Shane Hastie: So we’re here today, partially because you’ve been talking and writing a fair amount about flaky tests lately. Why? And why should we care?
Why focus on flaky tests? [01:58]
Trisha Gee: Right. Excellent question. So one of the reasons I took the job at Gradle about two or three years ago is that they have a developer productivity tool, which looks at things like build times, and test times, and does some acceleration and stuff. But the thing that really caught my attention was that it identifies flaky tests. And I was like, yes. See? People don't care enough about flaky tests. When I used to work at LMAX, which is what? 15 years ago now, when I worked with Dave Farley, and Martin Thompson, and these various people who have appeared on InfoQ and QCon and stuff in the past, we had a tool which would identify flaky tests, based on my manual process of going through all the failing tests, re-running them all, seeing which ones were really failing and which ones were not, and then letting people know what they'd broken and what was flaky. Which reminds me, I should-
Shane Hastie: What is a flaky test?
Trisha Gee: Exactly. Good. I always forget to do that, because I'm so incensed about flaky tests, I forget to tell people what they are. So a flaky test, a non-deterministic test, an intermittent test, is a test where, with the same code, the same circumstances, the same environment, when you run it more than once, sometimes it passes and sometimes it fails. And probably, when you first start talking to developers about this, the first thing they say is, "We don't really have flaky tests". And then when you describe that to them, they go, "Oh, yes. We have some tests like that, but that's normal, isn't it?"
It's not uncommon, certainly, particularly when we're talking about integration tests, tests that need to talk to a database, an external service, these kinds of things. Longer running tests, UI tests, that kind of thing. It's not uncommon for a test to time out while it waits for something to come back. There are lots of different things which can contribute to the flakiness, and they're really hard to fix, which is why I think we have a tendency to go, "Oh, that's a flaky test. Okay". And then not think about it, because we've got more urgent things to worry about.
Shane Hastie: So we know what they are now. Why can’t we just say, “Well, yes. Ignore that?”
The Impact of Flaky Tests on CI and Developer Workflow [04:06]
Trisha Gee: We could. The first step is to identify your flaky tests. The reason why I find flaky tests really frustrating is that if you have a lot of automated tests, that's a great thing, it's a good position to be in. If you have a good set of automated regression tests and you run them regularly in CI, this is a good thing. Everyone tells us we should do this. But if you have a CI environment where lots of the builds are failing due to test failures, you don't know if those tests are really broken and something you should fix, or if this is one of those flaky tests that occasionally can't connect to a service or can't download the whatever. So you've got two different types of red. Red is, I need to fix this test or I need to fix this code, and red is, well, maybe I can ignore that.
If you can identify the flaky test, maybe you can ignore it. Maybe you can have it automatically rerun, so that if it passes once, then it's flaky and it doesn't fail the build, and ideally, you flag that somewhere else. You have a list of your intermittent tests. Or it does fail the build and you put it onto a list of tests you need to fix right now, because you don't tolerate flakiness. Or you run it in a different build, or on a different agent, a different environment, so that you have your flaky tests separate from your critical go-live tests. I think I've jumped ahead to what to do about them. But the problem is that if you have failing tests, and you don't know if it's a genuine failure or a flaky failure, all you know is that your build is red all the time, or a lot of the time, and you don't know where to focus your energy.
Or if you’ve got genuinely failing tests, because you broke something, or because you wrote the test wrong or something that you can do something about, you might be tempted to ignore those. Because you’re in this habit of ignoring tests, because flaky tests are in your build and we ignore them. So flaky tests are just a toxic thing for your build, because they stop you paying attention to this important information about the quality of your code, the quality of your tests.
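As a concrete illustration of the "run your flaky tests in a separate build" option Trisha mentions above, here is a minimal Gradle sketch using JUnit 5 tags. The task and tag names are invented for the example; the idea is simply that tests annotated with @Tag("flaky") are excluded from the main test task and run on their own, where a failure is reported but does not break the critical pipeline.

```groovy
// build.gradle -- sketch only; assumes JUnit 5 and a standard Java project layout
test {
    useJUnitPlatform {
        excludeTags 'flaky'            // the main build only runs trusted tests
    }
}

// Quarantined tests run in their own task, so flakiness stays visible
// without turning the go-live pipeline red.
tasks.register('flakyTest', Test) {
    description = 'Runs tests tagged @Tag("flaky") separately from the main build'
    group = 'verification'
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    useJUnitPlatform {
        includeTags 'flaky'
    }
    ignoreFailures = true              // report failures, don't fail the build
}
```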
Shane Hastie: The situation that you described, a timeout, or when a file’s not there, but it’s not really critical, these are real. What do I do to prevent the flakiness?
Common Causes and Solutions for Flaky Tests [06:26]
Trisha Gee: Yes, and that's why it's so hard. So there are a lot of different causes for flakiness. A very common one is UI tests or emulator tests. It takes a while to start up the UI. You're waiting for an element to appear and that kind of thing. So those kinds of timeouts, to some extent, I guess we might tolerate a certain amount of that, but you can overcome some of those problems by increasing your timeouts, for example, if you think that things will appear at some point. Of course that then leads to longer build times, and longer build times mean longer feedback cycles. And then you are waiting longer to figure out whether you broke the build or not, and then you start ignoring it anyway. But one answer for some of these types of things is literally just adjusting the timeout, or being smarter with timeouts, because there are some types of tests... Where I used to work, we always waited a fixed 30 seconds for an element to appear.
So if it appears in two seconds, you still wait the full 30 seconds to make sure the UI has appeared. So when you're using timeouts, you need to be very smart about them. You wait until the thing appears, and then you can move forward. If you have fixed timeouts, that is a sign that you're going to have slow tests and potentially flakiness in your tests. But there are other things as well. So things like, if you're waiting for services, if you're waiting for databases, you need to be quite smart about how you set up your preconditions. You might need to be pretty smart about it: okay, this whole suite of tests requires this database. Let's make sure we wait for that database at the beginning, but then we can run all the tests pretty quickly once we know the database is there, instead of each individual test having to start up and wait for a database.
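To make the "wait until the thing appears" point concrete, here is a small Java sketch assuming the Awaitility library is on the test classpath. The Page interface is a hypothetical stand-in for whatever a real UI or service test would query; the same polling pattern can be used once, at suite start-up, to wait for a database to become available.

```java
import static org.awaitility.Awaitility.await;

import java.time.Duration;

public class WaitingExamples {

    /** Hypothetical check; a real test would ask Selenium, Playwright, a JDBC ping, etc. */
    interface Page {
        boolean confirmationBannerVisible();
    }

    // Anti-pattern: a fixed wait always costs the full 30 seconds,
    // and still fails on the day the element takes 31 seconds to show up.
    static void fixedWait() throws InterruptedException {
        Thread.sleep(30_000);
    }

    // Conditional wait: polls the condition and returns as soon as it is true,
    // so a two-second render costs two seconds, not thirty.
    static void smartWait(Page page) {
        await().atMost(Duration.ofSeconds(30))
               .pollInterval(Duration.ofMillis(250))
               .until(page::confirmationBannerVisible);
    }
}
```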
Obviously, it depends if you're using Testcontainers and things like that. Different technologies impact this differently, but you get the idea that a lot of flakiness can be around timeouts. A lot of it can be genuine failures though. So one of the causes of flakiness we had when I worked with Dave Farley, once we eliminated the obvious ones like our UI timeouts and our database timeouts, was that we still had flakiness in a bunch of unrelated tests, and we couldn't figure out what the problem was. It turned out to be a race condition in the production code. And race conditions are really hard to find. You don't want your users identifying them, particularly when you're working on financial software. But if you have this huge set of automated tests and they're all running and hitting the QA environment at the same time, you can actually simulate a lot of the load that you will get in production, and find some of these production problems in your QA environment before they impact your users.
So flakiness might be a sign of an actual problem, whether it’s a race condition or whether it’s not correctly dealing with load, or not correctly dealing with contention, those kinds of things. And this is another reason why looking at flakiness is really, really important. It’s great. It’s very important from a developer morale point of view, from my point of view, but some of your flaky tests could well be caused by production code problems, so you really want to take that very seriously.
Shane Hastie: Let’s dig into that developer morale. Why? Why does a flaky test make me unhappy?
Flaky Tests and Developer Morale [09:43]
Trisha Gee: I could tell you why it makes me unhappy. So it can impact morale because, if your CI environment is red all the time... I'm a ticked-box kind of person, so my boxes aren't being ticked, so I feel like it's incomplete, right? I'm not getting that lovely endorphin rush from like, oh, I did it and it works. So there's that. But if it's always red, because something failed because of some flakiness, or because tests are failing and we're ignoring them, then it implies that we don't care about the quality of our code. And we're investing time writing the code. We're clearly investing time writing tests too, which means we're way ahead of the curve there, because there are some teams who are still not writing tests. So we invest a lot of time, and energy, and effort in these things. We're doing everything correctly, and yet at the end of the day, we're not doing that final step of paying attention to those tests.
So it feels a little bit like everything that we do doesn't matter. I know it sounds a little bit philosophical, but you put in all this effort to writing the code and writing the tests, and then you don't pay attention to the results, or someone else doesn't pay attention to the results, or someone else is introducing flaky tests. Then we start to get this feeling of, oh, well, maybe quality isn't that important. And maybe spending time writing good quality code, good quality tests just doesn't really matter as much. And maybe I am just a cog in a wheel, and I'm not really having a big impact on this team.
Shane Hastie: And certainly from my own experience, you start to feel that, okay, well, they don’t care. We just push it through every time anyway. Who gives a damn?
Trisha Gee: Exactly.
Shane Hastie: So you’ve given us some guidelines about things we can do about them. How do I convince others that this matters?
Convincing Others to Care About Flaky Tests [11:39]
Trisha Gee: That's a good question. I have visited a number of development teams lately where there's one person who jumps up and down about flaky tests, and feels frustrated that no one else really cares. One thing people can do, if they care about flaky tests, is to share some of the stuff I've written, or a podcast like this, so people go, oh, it's not just a case of ignoring it and getting on with stuff. It actually has a much bigger impact on the team than I was thinking. So sharing some of the problems clearly, in terms of the impact on morale, the impact on quality, that can motivate some people. One of the other things that helps with developers, though, is that we do sometimes like solid numbers. So flaky tests can lead to you having much longer build times. One of the ways to identify flaky tests is to get, for example, Maven or Gradle to automatically rerun your failing tests.
So if a test fails, you set retries to three or something. And of course these flaky tests are often integration tests which take a little bit longer to run. So it fails once. So then you rerun it to see if it's really failing, and then maybe it fails again. And then you rerun it to see if it's really, really failing, and then it passes. So that test now takes three times longer in your build than it should do, and that applies to every single flaky test. So for every flaky test, if you're doing reruns to identify them, or to ignore them because you know they're flaky, you are adding a lot of extra time to your build unnecessarily. So now you're having to wait a lot longer to find out, did my code work correctly? Is my pull request ready to go? Or whatever it is that you're waiting for.
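For reference, this is roughly what the automatic rerun Trisha describes looks like with Gradle's test-retry plugin (Maven users have a similar Surefire setting, rerunFailingTestsCount). The version number and the exact limits are illustrative; the useful part is that a pass-on-retry can be reported as flaky rather than silently swallowed.

```groovy
// build.gradle -- sketch only; check the plugin portal for the current version
plugins {
    id 'org.gradle.test-retry' version '1.5.8'
}

test {
    retry {
        maxRetries = 3                  // rerun a failed test up to three times
        maxFailures = 10                // stop retrying if the build is genuinely broken
        failOnPassedAfterRetry = false  // a pass on retry is reported as flaky, not as a failure
    }
}
```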
And of course if you're running it locally, you're running your build locally and then you have to go off and get coffee because it takes you 5, 10, 15 minutes. So addressing flaky tests can bring your build and test time down, which is really helpful. And obviously, the amount of time it brings it down depends on how long your build is, and how many tests you have, and how many integration tests. But if you have a tendency to lean towards a lot of integration and end-to-end UI tests rather than unit tests, then A, your build and test times will be longer. And B, your number of flaky tests will be significantly higher, because these are the types of tests which are generally more flaky. So if you can push your tests further down the testing pyramid towards unit tests, fast reliable tests, this will decrease your build time significantly, and it will also increase the reliability of your build.
So you're more likely to run it. And when it goes green, you go, oh, great, I've done something great and I can move forward. So sometimes using something as simple as build times can really help motivate developers to be like, oh, okay, this test class has 15 tests in it. Each one takes 20 seconds to run. Some of them are flaky. What if we could move half of those tests into unit tests, because they don't really need to be end-to-end tests? And then for the other ones, maybe we can tighten up the timeouts, or be smarter about how we run the tests so that they're less likely to be flaky. Then you've got a win across the board there. You've got fewer flaky tests, and your build and test time is lower.
Shane Hastie: This does require giving more thought to your test design though.
The Importance of Test Design [14:53]
Trisha Gee: Yes. The good thing is, there are a lot of well-known best practices in the testing space which will address a lot of this. So you do tend towards having unit tests where possible, so that you have fewer integration points and fast unit tests. A lot of people don't realize you can have sociable unit tests. So a unit test doesn't have to be one test, one class. It can be a bunch of classes, just as long as it's not also talking to a service or a database or something external. Testing 10 classes is no slower than testing one class. It's not starting up a whole bunch of class loaders or whatever. So push towards unit tests, but also standard testing best practice: things like being smart about setup and teardown, things like not competing for the same resources, making sure your tests are clearly separated and they're not dependent on each other.
They're not dependent in time or in data. All of these things are just good practices for writing good quality tests. Obviously, Kent Beck's written a whole bunch of stuff about testing and writing good tests. There was something he wrote, I think it was about fast unit tests, and there's a whole bunch of stuff in there about just good practice for writing tests. So if we follow good testing practices, which have been around for decades now, we can start to reduce some of these problems.
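A small Java sketch of the "sociable unit test" idea, with hypothetical domain classes: several real classes collaborate, but nothing touches a database, the network or the file system, so the test stays fast and deterministic.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

class SociableUnitTestExample {

    record LineItem(String sku, int quantity, long unitPriceCents) {}

    static class PricingPolicy {
        long discountCents(long subtotalCents) {
            return subtotalCents >= 100_00 ? subtotalCents / 10 : 0; // 10% off orders of 100.00 or more
        }
    }

    static class Basket {
        private final PricingPolicy policy = new PricingPolicy();

        long totalCents(List<LineItem> items) {
            long subtotal = items.stream()
                    .mapToLong(item -> item.quantity() * item.unitPriceCents())
                    .sum();
            return subtotal - policy.discountCents(subtotal);
        }
    }

    // "Sociable": LineItem, PricingPolicy and Basket are exercised together,
    // with no mocks and no external dependencies, so it runs in milliseconds
    // and gives the same answer every time.
    @Test
    void discountIsAppliedToLargeBaskets() {
        long total = new Basket().totalCents(List.of(new LineItem("BOOK", 2, 60_00)));
        assertEquals(108_00, total); // subtotal 120.00, minus the 10% discount
    }
}
```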
Shane Hastie: You did make the point that a lot of organizations haven’t even got to the point of doing much testing. Why?
The Testing Mindset is Valuable and Different [16:27]
Trisha Gee: Because testing is hard. Testing takes time. Testing also requires a different mindset. It doesn't necessarily require a different person, although QA engineers are a thing, test engineers are a thing, and they do have a very specific skillset. And I think it's really important to have those skillsets in your team, because when you're writing a test, you need to be able to think differently about things. When you're writing the code for "I'm going to place an order", you think about inputs and outputs, and you usually think about the happy path. Maybe you might think, what happens if I put zero in? But mostly you're getting from A to B as a developer.
And when you think about tests, it's not just A to B. For every single happy path, you've got probably a dozen unhappy paths. So every single piece of work that we do, when you start thinking about testing it, suddenly expands with this combinatorial explosion of different edge cases. And okay, what if it's null? What if it's zero? What if it's max int? Obviously these are simple cases. And then once you start talking about microservices interacting with each other in distributed systems, it becomes really difficult to think about what the edge cases are, and how the software should behave. And on top of that, often, when we get the requirements from whoever, whether it's from our lead engineer, or the business or whoever, they obviously haven't thought about all the different edge cases either. So when you think, okay, what should happen when it's null? It's not always in the spec. You have to either think, what should happen when it's null?
Or you have to go back to the business and say, “What happens when it’s null?” And they’re like, “What does that mean?” And then you have to explain, or you have to figure out the circumstances under which this value could be null from the user, explain that to the business and get them to think about what are the implications on the user flow? So testing is really hard, because it needs a lot of thinking. But I don’t think that should put us off, because as developers, we love hard problems. Hard problems is what we do. So I think we have to reframe it in our heads in that, yes, we could implement the functionality, and get our gold star and move on, or we could get triple gold stars by implementing the functionality, and then thinking about the really difficult bit of, how would I break this? What insane things will a user do that they should never do, that will make this behave in a weird way?
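As a tiny illustration of that combinatorial explosion, here is a hedged Java sketch using JUnit 5's parameterized tests. The placeOrder method and its rules are invented for the example; the point is that null, zero, negative and absurdly large values each get pinned down explicitly, alongside the single happy path.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.NullSource;
import org.junit.jupiter.params.provider.ValueSource;

class OrderQuantityEdgeCaseTest {

    // Hypothetical validation logic, standing in for real production code.
    static void placeOrder(Integer quantity) {
        if (quantity == null || quantity <= 0 || quantity > 1_000) {
            throw new IllegalArgumentException("quantity out of range: " + quantity);
        }
        // ...the happy path would go on to create the order...
    }

    // One happy path, many unhappy ones: each edge case is an explicit question
    // put to the code (and, ideally, to the business first).
    @ParameterizedTest
    @NullSource
    @ValueSource(ints = {0, -1, Integer.MAX_VALUE})
    void rejectsQuantitiesTheBusinessNeverIntended(Integer quantity) {
        assertThrows(IllegalArgumentException.class, () -> placeOrder(quantity));
    }
}
```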
And I think that's really interesting, but I think that when we're measured on how many features did you deliver? How many bugs have you fixed? Then we tend to want to run towards doing more features, instead of thinking deeply about doing a really good quality feature.
Shane Hastie: That gives us a nice segue into developer productivity. What is it? How do we measure it?
Measuring and Understanding Developer Productivity [19:19]
Trisha Gee: So I just recently finished reading Slow Productivity by Cal Newport, and there's a quote in that book that I use in a couple of my presentations, which is that, "It's amazing that in the 21st century we have no good measurements for knowledge work productivity". So knowledge work has obviously been the area of work which has expanded enormously throughout the 20th century, and well into the 21st century. Knowledge work obviously includes developers, designers, all of this kind of thinking, often sat-at-a-computer kind of work. It's really difficult to measure, because we could look at the output, lines of code, or commits or those kinds of things. And that kind of output is a proxy for productivity, because if you weren't sat at your computer writing code, you didn't write any lines of code and you didn't do any commits. But if you start measuring those sorts of things, developers are smart enough to be able to hack that system and optimize for things like lines of code and number of commits.
Because ultimately, lines of code and number of commits is not what makes the code that you write useful for the end user. So you need to be looking at a much bigger picture of, yes, did you deliver the feature? But did you deliver the right feature? Does the user use it? Does that actually impact the business in some positive way? Which of course you might not have much control over, but you might have the ability to say, “Have you had a UX person have a look at this, and make sure the users can find it and use it”. So productivity is a much bigger thing. And on top of that, a lot of it is about thinking. When I used to work with Dave Farley, we used to pair program a lot. And I’ve got my fingers on the keyboard and I’m trying to write code, because that’s what we do as developers.
And he sat right back and he's thinking. He's like, "Well, if we've got this module, and this module, and they have these messages between them and we want to add this feature, it makes sense if we reshape things this way. Or maybe it doesn't even belong in here". There's a lot of thinking and reasoning through the simplest way to add something into the system in a way which is maintainable and usable in the future. And it's very difficult to measure thinking time. At the end of three days of thinking, you might have two lines of code. How are you supposed to measure that? So there is a new-ish framework, or rather a framework for creating a framework for measuring productivity, called SPACE. And SPACE measures five aspects.
Well, SPACE suggests five axes for measurement. And you are generally supposed to pick three-ish axes, and one metric from each of those axes, to figure out, generally speaking, what your productivity looks like. And the goal of these kinds of frameworks is not to get some sort of concrete number, but to get that number to move in the direction you want. So S is things like satisfaction of the developer. A is activity, so that's things like lines of code, and it's things like commits, and the things we used to measure regularly. E is efficiency and flow, which I think is really interesting, because flow is a really important part of what we do as developers. We need the time to be in flow, which means fewer meetings, fewer interruptions, fewer Slack messages coming at us all the time. Oh, and C is communication. P is performance.
So performance includes things like the quality of the code. It’s not necessarily about your performance as a developer or about the speed of the code, it’s about how well the application is performing, I guess. So you’d pick three of those different areas. And you can see things like satisfaction, activity and flow are like three things that might almost be in contention with each other. Well, I guess flow and satisfaction might go together. So if you measure three things which are somewhat tugging in different directions, then overall you should get some sort of metric which is indicative of a more holistic view of a developer’s productivity. But these things are really difficult, because satisfaction for example, needs you to write good quality developer surveys, and pass out those surveys regularly, and crunch those numbers. And that’s a skill in its own right as well.
Shane Hastie: We’ve got a possible tool with SPACE, to help us begin to measure it. But what does productivity feel like for a developer?
What Does Productivity Feel Like? [23:54]
Trisha Gee: So this is why I’m really interested in productivity. If you hear the heads of organizations talk about productivity, what they want is the most bang for their buck. They want more lines of code out of their expensive developers, or they want fewer developers to write the same amount of code or whatever. For me as a developer, I think probably the most important thing is not being interrupted, and not having to context switch, so I can get that deep thinking, get in the flow, so that I can solve these kinds of problems. And you need that state to do things like write the good test that we were talking about. You need to be able to focus on solving a problem, and be able to perhaps make a list to the side of stuff you want to come back to like, what if this fails? What if this happens? What if the other?
And be able to come back to that and have everything still in your head as you go through and fix this problem. So you need time to be able to get in the flow, to not be interrupted, to have a lot of this stuff in your head. You also need good tools. And I know it’s easy for me to say. I have worked for two tool companies, like JetBrains and Gradle, but there’s a reason why. Because if you’ve got good tools, it helps take away a lot of the distractions. So the reason I love IntelliJ IDEA, is that I can do refactoring easily and my code will continue to compile. I don’t have to think, where do I have to hunt down where this variable is used? I don’t do that. I just rename the variable, and I just move forward, and I keep going.
And Gradle's Develocity tool, for example, helps decrease build times, and build and test times. And that's great too, because I've worked at a lot of places where running the full integration test suite will take 20 minutes. And yes, you can work on something else while you're waiting for that, but then you've lost the context of what you're working on in your context switching. So if we can bring those 20 minutes down to three minutes, by using parallelization, machine learning to figure out which tests you do and do not need to run, and caching and acceleration technologies, then I only have to wait three minutes to go, okay, oh, I broke this thing over there. Great. Let me fix that before I move on to the next thing. So tooling is a really important part of the productivity space, because we want the tools to do the stuff that we don't really want to do.
The Potential and Risks of AI Assisted Coding [26:06]
And obviously, I'm going to have to mention AI, but I don't really have any strong opinions on AI yet. AI does have the potential to help us be productive as well. At the moment, for me personally, when I'm using an AI coding assistant, and I've tried a few now, right now I still find it a bit distracting, because it's a different way of working for me. And I have to read a lot of code, and I'm not used to reading a lot of code as I'm thinking about what I want to do. I'm used to writing code. So there's definitely a skill, I think, in being productive with something like the AI coding assistants. However, it's also reopened up a lot of opportunities for me, because I don't write a lot of code these days, because I write books, and articles, and podcasts and things like that.
And I’ve got a few coding projects, which have been back-burnered for a long time, but they’re going to need me to do a lot of research around how do I write a plugin for this thing? Or how do I use this particular technology to do some UI stuff? I haven’t done any UI stuff for like 15 years, so I can’t remember how it works. And I could get an AI assistant to write the basic skeleton and the structure for that, take care of a lot of the boring stuff, while I think about, okay, fine, it functionally works. Now what should it do? How should it react? How should it behave? And then I use my 20 years experience to move the application in the direction I want it to go in, and I don’t have to worry about learning the latest nuances of Angular or whatever, because that’s been taken care of.
Shane Hastie: Well, previously on the podcast, Adam Sandman, talking about the use of the AI copilot tools, made a point about one study. And it is a single study, found in one context, but: 300% more code produced in the same time period, and 400% more bugs.
Trisha Gee: Yes. That’s a good stat. Yes. And then we come full circle to what we were talking about in terms of testing, and also in terms of productivity. If you measure productivity in terms of lines of code, of course an AI coding assistant is going to really help your productivity. You’ll get a lot more lines of code. If you get the AI assistant to write your production code and your test code, I feel a little bit like maybe… Would you let your toddler write your code, and your test code, and just let it off into the wild? No, you wouldn’t. Or an intern even. You would want to do some serious checking of that. So if you’re going to get AI to generate a whole bunch more code, you really need, in my opinion, a lot more good quality tests, not flaky tests, that run quickly, that give you the confidence that the code does what it’s supposed to do.
One of the things that people miss when they talk about automated tests, particularly people who like to write their tests after they've written the production code, is that the easy trap to fall into is to write a test that tests that the code does what it does. So you write a square, and the test tests that you have a square. Okay, that's fine, but you're supposed to be thinking about, is it behaving correctly? And what is the correct behavior? So I have a square. Okay, so what happens if I tell it to create a square with five sides? Oh, I didn't think about that. That's what you test for. Your tests are for saying, like we said before, what happens with null and zero? Your AI stuff can do that, but when I'm looking at a pull request, I look at the tests to find out what the behavior should be. And the tests will tell me things like, if this type of user places this kind of order, but they come from this organization and it has these rules, then don't let them place the order. That's the business requirements, and it's acceptance tests, really.
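To show the difference between a test that states what the code should do and one that merely restates what it does, here is a hedged Java sketch. All the names are hypothetical; what matters is that the test reads as a business rule a reviewer can check against the requirements.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class PlaceOrderRulesTest {

    enum Decision { ACCEPTED, REJECTED }

    // Hypothetical production code with one business rule baked in.
    static class OrderService {
        Decision place(String organisation, long amountCents) {
            if ("RESTRICTED_ORG".equals(organisation)) {
                return Decision.REJECTED; // restricted organisations may not place orders
            }
            return Decision.ACCEPTED;
        }
    }

    // This test states a requirement (users from a restricted organisation cannot
    // place orders) rather than mirroring the implementation. A test generated
    // purely from the existing code can only confirm that the code does what it does.
    @Test
    void restrictedOrganisationsCannotPlaceOrders() {
        assertEquals(Decision.REJECTED, new OrderService().place("RESTRICTED_ORG", 50_00));
    }
}
```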
And you can't generate acceptance tests from existing code, because all it can do is test that the code does what it does. You can't generate the tests that state what it should do. In theory, perhaps you can generate them from a well-defined Jira card or whatever, but you can't generate those sorts of tests from the existing code. So with AI, what you really want to do is have good quality tests that are checking it does what it's supposed to do, not checking what it does do. And because you've got a lot more code, you need to be running those tests faster. You need to have a much faster cycle, ideally by doing things like pushing them down the testing pyramid. But also, this is where something like Gradle's Develocity tool comes in really handy, because we can parallelize your tests. We can use predictive test selection to select just the tests that are relevant for those code commits. And we can use caching to not run a whole bunch of stuff that you don't need to run.
So if you’re going to have more code, if you’re going to have more tests, you need some sort of acceleration around running the code and running the tests. Otherwise, you’re going to grind to a halt, where your test suite takes all night to run, and you can’t do anything until the next day.
Shane Hastie: Been there.
Trisha Gee: Yes, exactly.
Shane Hastie: Trisha, a lot to explore, a lot to think about there. If people want to continue the conversation, where can they find you?
Trisha Gee: Everywhere. I have a website, trishagee.com. I am on Bluesky, Trisha G. I'm occasionally on X. I'm more on Bluesky these days, LinkedIn and Mastodon. Are there any more? I'm on most of those things, but you can reach out to me on those. The places I'm most active are LinkedIn and Bluesky.
Shane Hastie: Well, thanks so much. It was really good to see you again.
Trisha Gee: Thank you very much for having me. It was great.