Supporting Engineering Productivity for All

News Room | Published 19 August 2025

Transcript

Murphy-Hill: I think we all want our engineers to be productive. There are lots of things that we know about what makes engineers productive, but it’s unclear how to prioritize. Let me give you a couple of examples.

Suppose you’re an individual developer, you’re just an IC. You’ve got limited time to invest. If you had to choose one of these things, what would you choose? If you decided that you should spend some extra time seeking out the best tools and practices and migrating over to those, would you do that, or should you shut down email notifications during the day? We know both of these things improve productivity. If you had limited time, what would you choose? Or if you’re an eng manager, should you invest time in reducing the complexity of the system you’re working with, or should you try to give your developers more autonomy? Or if you’re an executive, should you invest in better development tools? Should you create your own? Or should you invest in less distracting office space? These are very different things to invest in. There are lots of investments we could make. What are the highest-priority investments?

At Google, one of the things we did was try to study productivity and specifically figure out what factors most strongly predicted developers’ self-rated productivity. Let me tell you about both halves of this question. For productivity factors, again, we intuitively know a lot of things that improve our own productivity. We went to the research literature, some of it from software engineering, some of it from the management literature. We condensed it all into 48 different factors that try to capture, as broadly as possible, what we know about what makes developers, and information workers more generally, productive.

For instance, we have a question like, my job allows me to make decisions about what methods I use to complete my work, a type of autonomy. People would agree with this or disagree, five-point scale. When it comes to self-rated productivity, we designed a question, actually came up with a bunch of questions, we refined them by talking to developers, getting what they were thinking of when they answered this question. We landed on a pretty simple question in the end, which was, I regularly reach a high level of productivity. This question is simple and straightforward enough. I think it continues to live on today in Google’s general employee survey. You might look at these types of questions and you might say, “I’m not sure about self-rated productivity. I really want to know how productive people actually are”. The alternatives aren’t great.

For instance, how much code did you write? That’s a way to measure productivity. Probably not the be-all and end-all. We do know that the answer to this question actually does map to objective measures of productivity. Developers who tend to agree with this question also tend to write more code than developers who don’t, both in terms of lines of code written and number of PRs submitted. We’ve essentially got a bunch of factors that we know predict productivity. We’ve got an outcome, which is a measure of productivity itself. Then we figure out how they correlate together and which ones have the strongest associations.
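
To make that correlation step concrete, here is a minimal sketch, with entirely hypothetical survey data, of how one might rank Likert-scale factors by how strongly they associate with self-rated productivity. The column names are invented for illustration, and the actual study used more careful statistical modeling than a simple rank correlation.

```python
# Hypothetical sketch, not the study's actual analysis code: rank-correlate each
# Likert-scale factor with self-rated productivity and sort by correlation strength.
import pandas as pd
from scipy.stats import spearmanr

# One row per developer; every column is a 1-5 Likert response. Names are invented.
responses = pd.DataFrame({
    "enthusiastic_about_job":  [5, 4, 2, 5, 3],
    "supportive_of_new_ideas": [4, 4, 3, 5, 2],
    "software_is_complex":     [2, 3, 4, 1, 5],
    "self_rated_productivity": [5, 4, 2, 5, 2],
})

outcome = responses["self_rated_productivity"]
rows = []
for factor in responses.columns.drop("self_rated_productivity"):
    rho, p_value = spearmanr(responses[factor], outcome)
    rows.append({"factor": factor, "rho": rho, "p_value": p_value})

ranked = pd.DataFrame(rows).sort_values("rho", ascending=False)
print(ranked)  # strongest associations at the top, weakest at the bottom
```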

I’m going to give you too many to read, and I’m going to zoom in on these. At a very high level, this is a ranked list. The top one is at the upper left. The least predictive is at the bottom right. Very broadly, the two columns on the left are significantly correlated with productivity. The ones at the right were not correlated at all. We ran this at three companies. At Google at the time, but we also ran it with National Instruments, and also ABB. A variety of different companies and company sizes and markets. We’ll just dive into maybe the first two here.

The most important thing that predicted productivity across the companies was, I’m enthusiastic about my job. This question, I feel, doesn’t get a lot of respect; I don’t see it in a ton of surveys. The management literature generally calls this employee engagement. It’s actually no surprise that it’s at the top. Intuitively for me, it’s what gets me out of bed in the morning. If I get out of bed excited about the work I’m going to do, that’s going to help me feel and be more productive. The second bit, people on my project are supportive of new ideas. I think you can also recognize this from the keynote, Project Aristotle, which was about psychological safety. I think psychological safety was second on the list.

Let’s go back to those questions I asked you. Individual developers, I said, should you seek out new tools and practices or should you shut down email and other notifications? If you answered seeking out the best tools and practices, give yourself a pat on the back. It turns out it did predict productivity in the surveys that we ran. Shutting down notifications actually didn’t seem to make much of a difference in terms of people’s productivity. Managers, so autonomy versus software complexity. Turns out autonomy, we had three questions about autonomy in different senses. My job allows me to make decisions about what methods I use to complete my work. My job allows me to make my own decisions about managing my time. My job allows me to use my personal judgment when carrying out my work. Three different flavors. Those really popped, were really important in terms of people’s self-rated productivity. My software is extremely complex. Again, near the bottom, didn’t make any significant difference across the three companies that we looked at.

Then, the third one, as an executive, should you invest in quieter office space or using the best tools and practices? It turns out the best tools and practices just eked out reducing interruptions. Actually, this one was a bit of a trick question. It turns out this ordering is right for ABB and Google, but the order is flipped for National Instruments. I’ve got lots of paper citations at the bottom. If you read our paper, you can figure out how we characterize these two different companies, and you can come in with your company and say, I think we’re more like ABB, or, I think we’re more like Google, and interpret the results based on that context. In fact, using the best tools and practices was the top one for Google, but this whole list is averaged across the three. Tools and practices were super important for Googlers.

How Productivity Varies – Conflict in Code Reviews

This ranked list is one way to think about individual engineers’ productivity, but it really masks how productivity drivers are different across different people. To illustrate in the remainder of the talk, I’m going to show you three different ways that productivity varies across different engineering populations. The first way is conflict in code review. Productivity factor number nine on our list was, “My project resolves conflicts quickly”.

If you look at an open-source survey that GitHub ran in 2017, you will have seen that people who experience negative behavior often quit open-source projects. It’s no surprise that conflicts happen and that resolving conflicts is important. We came into it through a lens of code review. Here’s a code review that I did on GitHub with someone I was working with, Sophie. Sophie issued this pull request and wanted my review on it. Eventually, we merged it. I gave her a little bit of feedback. These kinds of code reviews made us ask, does it matter that Sophie is a woman? Would I have given her the same amount of feedback if she was a man? Would I have given her feedback if she was older? Would I have given her different feedback if she was white instead of Asian? The previous literature in the social sciences suggests that people fill in blanks in their understanding of others based on biases that they have. The suspicion is, yes, it probably does, but the question is how?

Here we looked at a bunch of code reviews that folks at Google were doing. Our outcome here was what we call pushback in code review. We have high pushback code reviews, and we’ve got regular code reviews. High pushback code reviews are those with more than nine rounds of review, so more than nine back and forths. The reviewers spent at least 48 minutes reviewing. This is not wall clock time, this was active time, like clicking on things in the review tool, looking up documentation, and so forth. The author spends at least 112 minutes responding to that feedback. These numbers seem somewhat arbitrary, but they’re all 90th percentile. These are essentially the longest code reviews at Google, with the most rounds.
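
As a rough illustration of that definition, here is a hedged sketch of how one might flag high pushback reviews from review metadata, using 90th-percentile cutoffs on the three dimensions mentioned above. The DataFrame and column names are hypothetical, not Google’s actual schema.

```python
# Illustrative sketch of the "high pushback" definition described above: flag reviews
# at or above the 90th percentile on rounds, active reviewer time, and author time.
# The DataFrame and its column names are hypothetical, not Google's actual schema.
import pandas as pd

def flag_high_pushback(reviews: pd.DataFrame) -> pd.DataFrame:
    cutoffs = {
        "rounds": reviews["rounds"].quantile(0.90),                                    # ~9 rounds
        "reviewer_active_minutes": reviews["reviewer_active_minutes"].quantile(0.90),  # ~48 min
        "author_active_minutes": reviews["author_active_minutes"].quantile(0.90),      # ~112 min
    }
    is_high = pd.Series(True, index=reviews.index)
    for column, cutoff in cutoffs.items():
        is_high &= reviews[column] >= cutoff
    out = reviews.copy()
    out["high_pushback"] = is_high
    return out
```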

In total, high pushback code reviews only represent about 3% of all the reviews done at Google. They’re very contentious. We’ve also surveyed people about these reviews, and it turns out people are much more likely to say that these were negative interactions, interactions where they felt there was a lot of interpersonal conflict that was perhaps unnecessary. What we’re looking at here is whether folks from historically marginalized groups get more or less pushback than folks from majority groups. As I said, we’re looking at predictors here like gender, race and ethnicity, and age. We’re using the U.S. government categories for these.

Unfortunately, for instance, we don’t have gender data beyond binary categories. We know that a bunch of other things affect how much pushback someone gets. A large change tends to get more pushback than a very short change. If you have two reviewers, you’re going to get more feedback than if you just had one reviewer. If the author is more junior, they’re going to tend to get more feedback naturally anyway. If someone’s been at the company for less time, they’re going to tend to get more feedback too, because they’re just ramping up. We’re going to control for all of those things, and hopefully what’s left over is just the bias or differences in pushback for these different demographic groups. Then, a little bit of the nitty-gritty details: we used a mixed-effect binomial logistic regression here. We used 6 months of data at Google. We replicated it with another 6-month period and the results were quite consistent. It’s about 2 million code reviews and about 30,000 authors.
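
For readers who want to see the shape of such an analysis, here is a simplified sketch. The study used a mixed-effect binomial logistic regression; this sketch uses a plain logistic regression with similar fixed effects and no random effects, and every column name (continuing the hypothetical reviews DataFrame from the earlier sketch) is made up for illustration.

```python
# Simplified sketch of the kind of model described above. The study used a
# mixed-effect binomial logistic regression; for brevity this uses a plain logistic
# regression with similar fixed effects and no random effects. It continues the
# hypothetical `reviews` DataFrame from the earlier sketch, with invented columns
# for demographics and controls.
import numpy as np
import statsmodels.formula.api as smf

reviews["high_pushback"] = reviews["high_pushback"].astype(int)  # 0/1 outcome

formula = (
    "high_pushback ~ C(gender) + C(race_ethnicity) + C(age_band)"
    " + change_size + num_reviewers + C(author_level) + tenure_years"
)
model = smf.logit(formula, data=reviews).fit()

# Exponentiated coefficients are odds ratios: above 1 means higher odds of a
# high-pushback review relative to the baseline category, below 1 means lower.
print(np.exp(model.params))
```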

To cut to the chase, here’s what the results look like. I’m going to have to interpret them a little bit for you. What we have here at the top are odds ratios. 1 is the baseline, anything under 1 is less likely to get pushback, and anything more than 1 is more likely to get pushback. Let’s just look at female here. Women in particular are likely to get more pushback, about 21% higher odds compared to men. Similar situation with race or ethnicity here. Black engineers have 50% higher odds of having a high pushback code review than white engineers. Hispanic or Latinx engineers, about 15%. Native American engineers, non-significant. If you take a minute, you can probably guess why. There aren’t a ton of Native American engineers in tech, unfortunately. Asian engineers, about 42% higher odds of having a high pushback code review. We also looked at age here.

The baseline here is basically new college grads, people 18 to 24. When you look at age, basically the older you get, the more pushback you’re likely to get during your code review. In particular, recall, this is after we’ve already controlled for level and we’ve already controlled for tenure. We’re taking that out of the equation. Once you take that out of the equation, what you’re left with is folks who are 60-plus are about three times more likely to have a high pushback code review than someone in the new college grad category. It looks like there’s some bias happening in code reviews, and it’s not just a Google thing. This is something we’ve replicated elsewhere. This happens on GitHub as well. I think it’s just a natural thing we would expect where people’s biases come out. There’s no reason to expect that code reviews would be different. Here we’ve quantified it.

What we did in response is, on top of Google’s existing code review tool, we built an anonymous version. In the anonymous version, rather than showing author names, we use the nice Google Docs anonymous animals. Here we have an Anonymous Goose, Anonymous Otter, Anonymous Frog. You can click on the review. You can open the review. It doesn’t tell you who the author is. As a reviewer, this is intended to be essentially de-biasing. You can review the code without knowing the identity of the author and hopefully not be influenced by it. A few other features: you can also deanonymize pretty quickly. If you mouse over the anonymous name and click deanonymize, you can see who it is. It turns out that’s important. Engineers sometimes really need to urgently get in contact with the author, so this allows that. It also turns out it’s actually really rare.
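
Purely as an illustration of the idea, not the tool’s actual implementation, here is a sketch of how stable anonymous-animal pseudonyms could be assigned: each (review, author) pair always maps to the same animal, so a review thread stays coherent, while the real identity is only shown on an explicit deanonymize action. The identifiers and animal list are hypothetical.

```python
# Hypothetical sketch, not the tool's actual implementation: assign a stable
# "anonymous animal" pseudonym per (review, author) pair so a review thread stays
# coherent, and only reveal the real identity on an explicit deanonymize action.
import hashlib

ANIMALS = ["Goose", "Otter", "Frog", "Axolotl", "Narwhal", "Quokka"]

def anonymous_name(review_id: str, author_id: str) -> str:
    digest = hashlib.sha256(f"{review_id}:{author_id}".encode()).hexdigest()
    return f"Anonymous {ANIMALS[int(digest, 16) % len(ANIMALS)]}"

def display_author(review_id: str, author_id: str, deanonymize: bool = False) -> str:
    # Deanonymizing is an explicit user action (e.g. hover and click in the tool).
    return author_id if deanonymize else anonymous_name(review_id, author_id)

print(display_author("cl/12345", "sophie"))                    # e.g. "Anonymous Otter"
print(display_author("cl/12345", "sophie", deanonymize=True))  # "sophie"
```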

In the user study that I’m about to tell you about, less than 1% of the time did people actually click the deanonymize button. We ran a field study with this where we asked a bunch of questions about how well this worked in practice. We asked, how often can reviewers just guess author identities? How does anonymous code review change reviewers’ velocity? There was a worry at the time: an engineering director at Meta had seen our prior paper about GitHub, and he said, I think anonymous code reviews are a really bad idea because they’re going to slow down our engineers quite a bit.

That’s a question we can test. We can see, does this really slow down engineers or not? How does anonymous code review change review quality? How does it affect reviewers’ and authors’ perception of fairness? What are the biggest advantages and disadvantages from a reviewer’s perspective? Also, what features are important to this tool? I’m just going to talk about three of these. It turns out that people finish anonymous code reviews just as quickly as regular code reviews, so there’s no degradation in how long reviews take, so that’s good. Does it change review quality? It seems to be about the same based on the data that we have, but actually there was a little bit of evidence that it was better.

In particular, we looked at rollbacks, which are basically changes that need to be rolled back. It turns out there were fewer rollbacks in the anonymous case than in the regular code review case, so that’s a little bit of evidence. Then, what do engineers perceive as the biggest advantages and disadvantages? When people did this for a few weeks, they thought that because they weren’t looking at author identities, they did more thorough reviews, and they thought that their bias was reduced. Of course, the question that we don’t have the answer to here, if you’re paying close attention, is, did we really eliminate bias here or not?
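
As a hedged sketch of how one might compare rollback rates between the two conditions, here is a two-proportion z-test with made-up counts; the talk only reports that rollbacks were fewer in the anonymous condition, not these numbers.

```python
# Illustrative two-proportion z-test comparing rollback rates between anonymous and
# regular reviews. The counts are made up; the talk only says rollbacks were fewer
# in the anonymous condition.
from statsmodels.stats.proportion import proportions_ztest

rollbacks = [12, 25]             # rolled-back changes: [anonymous, regular]
reviews_observed = [1000, 1000]  # total reviews in each condition

z_stat, p_value = proportions_ztest(count=rollbacks, nobs=reviews_observed)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```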

Unfortunately, I don’t have an answer for you. The reason is because you need a whole lot of data and you need a lot of people. We had 400 people using it here over two weeks, but you probably need a whole organization the size of Google or one of its sub-organizations to be using it for months before you would know whether you were actually stamping out bias or not, because, in part, there are so few engineers from historically marginalized groups.

How Productivity Varies – Bias in Technical Documentation

The second way that engineering productivity affects groups differently is through documentation. Productivity factor number 16 on that original list was, “Knowledge flows adequately between key people in our project”. One way that you do that is through documentation. According to the Stack Overflow survey, documentation is highly valued but often overlooked. We were inspired by social science literature that looks at how seriously people take opinion articles written by people of different demographics. We thought maybe the same thing applies to technical documentation. Here’s a public Medium article about the OpenCV Python library, and it’s got somebody’s face on it right now, but we could just replace that face. We replaced it with faces of people from different groups.

In particular, here we have people of different ages. We’ve got women on the top. We’ve got men on the bottom. We’ve got a young woman here, a middle-aged woman, and an older woman here, and likewise for men. The question is, does this make any difference in terms of how people read the article? Do people’s biases also kick in here? To test this, we had a bunch of these articles, we systematically replaced the faces that were at the top of the articles, and we asked people a few questions about them. One of them was, how in-depth do you think the article will be, based on the topic in the first paragraph? We didn’t give people the whole article.

The reason we didn’t give people the whole article is, first, we had limited time in an experimental setting. We couldn’t have them reading 10 articles, so we could really just have them read the top part. Also, it tends to simulate what we typically do when we’re looking for answers to our questions, which is, we typically don’t read the whole thing. We maybe look at the top, maybe skim it a little bit, and then decide whether we’re going to go more in-depth. Our estimate of how in-depth the article is going to be dictates whether we’re going to take it seriously and whether we’re going to read through it more. Like I said, we systematically replaced faces.

This chart shows how in-depth people thought the article was versus how superficial. If we look at young men here, this is the rating people typically gave them. Ratings down here are more superficial. It turns out that if there’s a middle-aged face at the top of an article, people rate it a bit more superficially than with a young person’s face at the top. I’ve drawn these error bars on here, which just statistically tell us that there isn’t really a significant difference. Maybe the difference here, this drop, is really due to noise. We can’t really be sure. We say, not really sure if there’s a difference there. If we plot everybody else’s faces, there is one significant difference, and it’s the older men. It turns out that people treat older men’s articles as relatively superficial compared to younger men’s.
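
To illustrate the kind of comparison behind that chart, here is a small sketch that computes mean in-depth ratings with 95% confidence intervals for two hypothetical face groups; the actual study used its own experimental design and significance tests, and these numbers are invented.

```python
# Sketch of the comparison behind the chart: mean "in-depth" ratings with 95%
# confidence intervals for two face groups. The ratings are invented; the actual
# study used its own experimental design and significance tests.
import numpy as np
from scipy import stats

ratings_by_group = {
    "young man": [4, 5, 4, 3, 4, 5],
    "older man": [3, 2, 3, 3, 2, 4],
}

for group, ratings in ratings_by_group.items():
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)
    low, high = stats.t.interval(0.95, len(ratings) - 1, loc=mean, scale=sem)
    print(f"{group}: mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```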

Interestingly, we didn’t see any significant gender bias differences, although there’s an interesting one right here. This was really the only one that was statistically significant. This is a bit of a surprise to me because I expected gender bias to be a more significant effect than age bias. I at least believe both are pretty prevalent. This suggests that age is actually the stronger effect here. That’s pretty consistent with the code review slide that I showed you before, where folks who were 65-plus had three times the likelihood of getting pushback compared to someone who’s a new college grad. Age is actually a pretty strong signal that people use when evaluating technical work.

How Productivity Varies – Building Diverse Teams

Last but not least, demographic diversity itself has an effect on productivity. In particular, this paper by my colleague Bogdan Vasilescu found that on GitHub, both gender and tenure diversity were positive significant predictors of productivity. In particular, teams that had more gender diversity, which in our field typically just means more women, tended to be more productive in terms of how much code they wrote. You’ve probably seen articles like this in the management literature about how diverse teams outperform homogeneous teams, sometimes. I think building diverse teams is a goal that many of us have. The challenge, of course, is that many of the teams that we work on today aren’t very diverse. We’ve got quite a ways to go.

In the next study that I’m going to talk about, this last study, our insight was that, sure, most teams are not very diverse, but in a really big company like your Microsoft, like your Google, like your Meta, there’s going to be some teams that are doing really well. In fact, there are teams at Google, just looking at men and women, that have gender parity. Our research question was, what can we learn from those few teams that are really diverse? What can most of us who are in pretty homogeneous teams, what can we learn from the most diverse teams that are out there?

In particular, this is how we found the diverse teams that we interviewed. The prior slides I’ve talked about were entirely quantitative: we were looking at data at a large scale or doing an experiment, and we were looking at quantitative outcomes. This one is entirely qualitative. What we’re going to do is just go and talk to these teams and figure out how they got to where they are today. The way we selected the few diverse teams that we talked to was, we looked for teams between 5 and 15 people. If it was fewer than 5, it’s hard to tell whether a team is diverse or not. We looked for teams for which the manager had been in place for at least four years, which is actually pretty long, just to make sure these teams were somewhat stable. In terms of diversity, we were looking at teams that had at least four races or ethnicities represented. As I said earlier, we were just using U.S. guidelines for race and ethnicity, which means there are only five categories, and having four of them is quite a few.

Then, we also looked for gender diversity on the team. This is a little bit more technical, but we were looking for a Blau diversity index of about 0.4. That means a team didn’t have to have perfect gender diversity but needed to have a good mix. Because our data was only binary gender data, we were missing the perspective of non-binary folks. We made sure we sampled a few teams with non-binary folks on them as well, and trans folks too.
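
For reference, the Blau diversity index mentioned here is one minus the sum of squared group proportions: it is 0 when everyone belongs to one group and rises as the mix evens out, topping out at 0.5 for two groups, which is why roughly 0.4 indicates a good mix. A minimal sketch:

```python
# Blau diversity index: 1 minus the sum of squared group proportions.
from collections import Counter

def blau_index(members: list[str]) -> float:
    counts = Counter(members)
    total = len(members)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A 10-person team with 7 men and 3 women: 1 - (0.7**2 + 0.3**2) = 0.42
print(blau_index(["m"] * 7 + ["w"] * 3))
```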

In terms of inclusion, if we just take the criteria so far, we’ve got diverse teams, but that doesn’t mean things are going well on these teams. We wanted to have some measure of success, at least for the people on the teams. We made sure that the folks from underrepresented groups on these teams had been there for at least two years, and at least one of them had been promoted. Again, these are imperfect measures of what success means, but it means it’s a fairly long-lasting team. From the objective metrics, this was about as good as we could do in terms of success.

We went and we talked to these teams. We talked to about 20 different people. About 11 of them were engineers, and about 9 of them were managers. In talking to them, we came away with a bunch of themes about how these teams were successful in increasing their diversity. In particular, it divides into two different categories. One category is practices they used in recruiting and hiring, and the other is practices that they used in what we call developing technical allyship, which I’ll define. In recruiting and hiring, one of the practices they use is what we call self-empowering hiring managers.

To give you an example, Fred used to think that individual managers didn’t have much power to impact diversity. He changed his perspective, and when he became more proactive, he successfully grew representation on his team over the period of about five years. He recommends that managers start by reflecting on what incentives and biases led them to create homogeneous teams in the first place, and follow by creating a plan to improve diversity. The self-empowering is moving from a position of, this is HR’s responsibility to bring me a wide variety of candidates, to, this is something I’m going to need to do for myself if it’s something that I want for my team.

Growing the pool and pruning back assumptions. An engineering manager, Tim, had an open role to fill, and he asked himself, what are the real requirements for this role? He knew what we typically list as requirements for a software engineer, but he refined that list down to what he thought were only the core requirements needed for the job. That’s what he used for the posting. Rather than trying to fill the role as quickly as possible, he took the time to intentionally check his own assumptions and his own mental image of what an engineer is. Reflecting, he said that the person he intuitively thinks of as the best match for a team might not actually be the person who would be most successful. That’s what we mean by pruning back assumptions. Growing the pool is about using your network, and the networks of people you know, especially folks from historically marginalized groups, to grow the pool beyond just the people who immediately apply for the job.

One thing that was very timely when we ran this was, managers were really worried about losing their headcount when they had an open role. What they said is, often to deal with this, they’ll try to fill a role as quickly as possible. They said that that is actually pretty detrimental to their ability to make hires from folks from historically marginalized groups, because not everybody sees the posting at the same time. For example, when I posted something on LinkedIn maybe a month ago about our job post, it’s going to go out to my network immediately, but my network just statistically is mostly white men, and so it’s going to reach those people first. You need to wait longer for the message to reach people who are not like you. Most of the managers here were white and Asian men. Leveraging your network and the network of folks you know can be really helpful here.

Ongoing commitment to diverse candidates. Software engineer Claudia said that her manager, when she was interviewing for her current role, made it really easy to see how she would succeed on the team. The way he did that was that he brought up another engineer on the team who was also from an underrepresented group, and he explained how he was supporting her career growth. When Claudia heard that, she said, “This guy is serious. He’s actually helping this person grow. I can see how I could grow in this role as well”. Now, you might have a more challenging situation where you don’t have any folks from underrepresented groups on your team. What Claudia suggested in this case is that managers should acknowledge that there’s a problem in the first place and demonstrate that they’re actively trying to fix diversity on their teams. This commitment to diverse candidates is twofold. One part is saying it, saying that you’re committed to improving diversity on your team. The other is demonstrating it.

One of the things people talked about is that a lot of times they saw examples where people talked about it, but it wasn’t really clear that they were actually doing it. One engineer we talked to said that one strategy she uses when she’s changing teams within a company is to look at the potential hiring manager’s calendar. It’s usually guys. If a manager says, yes, diversity and inclusion are super important to me, but she doesn’t see any diversity and inclusion activities on his calendar, she knows he’s all talk. I thought that was a really interesting strategy for figuring out whether people are serious or not. It doesn’t work in all cases. At Microsoft, I’ve learned, for instance, it’s not traditional for people to open up their calendars for everyone to see, but at other companies it is. Obviously, that doesn’t work for external candidates, it just works for internal transfers.

Establishing guidelines and accountability. An example here from Fred. Fred says that for every single hire or transfer, the hiring manager has to present the pros and cons of each candidate to the VP. The hiring manager needs to present a standard number of candidates, say five on his team, to this VP. This establishes a typical process and accountability: as a hiring manager, I don’t just get to decide on my buddy and bring him in immediately. I have to have a slate of people, and I have to argue why we should actually hire the person I want to hire. Jenny, an engineer, reflected that this top-down guidance, the VP’s mandate that you have to present a slate of candidates, needs to be combined with a bottom-up attitude of just getting it done, the self-empowerment, if you want to increase representation on your team. I’ll say that the guidelines and accountability, again, this was in the context of Google, and Google has lots of processes in place for hiring.

Actually, the most diverse teams had additional processes on top of this, so this thing about going to your VP with the candidates was actually very specific to this group. You saw how we sampled the most diverse teams at Google. These people could have been randomly distributed throughout the company, but it turns out they were actually clustered in two particular organizations under a couple of different VPs. I think that’s precisely because the VPs were really setting the tone for what it meant to hire inclusively, what it meant to have an inclusive culture. Just to say that the top-down guidance is really important.

Onto technical allyship. Technical allyship is something that we defined as a more specific version of general allyship, where someone, typically from a majority group, bolsters confidence in the technical skills of someone from a historically marginalized group, or advocates for them when their technical experience is being undervalued.

For instance, one example that Gus, a manager, gave: Gus said, you need to give underrepresented engineers a chance to fail. He didn’t mean this in a negative way; he didn’t mean to set them up for failure. What he means is, you need to give them the chance to both succeed and fail. He said that one of the traps that he used to fall into was trying to protect folks from underrepresented groups and not necessarily giving them stretch goals. Rachel and I were talking, and she said startups are a really hard place to give people stretch assignments. It’s an interesting question whether this advice applies in a startup situation, but at least in big companies, you have a little bit more flexibility to give people stretch assignments.

What Gus said is, you might find skills there that you didn’t appreciate or see. He gave an example of a junior engineer who got the opportunity to demonstrate that she was exceptionally good at leading engineers who were much more senior than her. He didn’t see this coming, he didn’t expect it, but because he put her in a situation where she could spread her wings, he was impressed. I think you can see that this is just good management in general. It’s especially important for folks whose skills are often undermined by the culture that we live in. That’s something that managers can do to develop technical allyship.

Leadership can also help with technical allyship. Again, going back to Fred, who was an engineering director at the time and from an overrepresented group, he said that everyone needs to see themselves represented at different levels of management. It’s important, for instance for an IC coming onto a team, to see that you have a place to grow in the organization. He said that he found that the fastest path to growing diversity on engineering teams is building a leadership team that has strong representation.

Then, what can coworkers do to improve technical allyship? Aliyah gave an example where, during code review, she suggested a new idea on the code of a more senior man who happened to be a TL. What the TL did is he turned around and explicitly accepted the idea. It turned out to be a good one. He explicitly said, “This is really good. I want this”. He implemented it, and he also praised it as a great contribution. It may have seemed really minor to him at the time, but it helped her build confidence in contributing other pieces of knowledge, during code review and elsewhere, to people at higher technical levels.

Conclusion

In conclusion, enabling engineers from all backgrounds to be their most productive selves can be challenging. Sometimes the symptoms of people being productive or not productive are obvious, things like pushback, for instance, but sometimes they’re not. To make all engineers productive, and not just the average engineer, we need to act intentionally as we build engineering teams. Whether that’s considering things like anonymous code review to de-bias our code review process, or thinking about how author identity shows up in technical documentation, or what the rest of us can learn from exceptionally diverse teams.

Questions and Answers

Fox: I’m curious how accessibility featured in this. I know it can be a lot harder to tell just from a profile picture, but is it something that came up during your investigations, and how did it differ from race, ethnicity, and age?

Murphy-Hill: In the studies that we talked about here, we didn’t focus on accessibility much. I’ll talk about the first two studies where we talked about race, ethnicity, and gender. Part of the reason is that large companies at least don’t uniformly report disabilities in the same way they do these other categories. I would actually be very interested to see whether people with disabilities are treated differently during code review, for instance. Unfortunately, since big companies like this typically don’t collect that data, at least not in such a uniform way that they do with these other categories, it’s hard for us to say.

In particular, that’s even a symptom with this last one where we talked about diverse teams. I don’t think disability came up much, but I think that’s probably because of the way we walked into it, which is that we came in specifically looking for those dimensions, again, race, gender, ethnicity, and age. We came in looking at those dimensions because that was the data that we had access to, so those were the teams we talked to. I don’t know whether we even talked to teams that had folks with different abilities on them. I think that that’s probably a major shortcoming.

Constructively, what I would probably say is, companies are collecting more data like this. The race and gender data is not optional for U.S. companies; they have to report it to the government. Disability data they don’t, so they typically don’t put quite so much effort into collecting that data, but they will often ask you for it. Maybe once a year, they’ll say, could you fill in our self-ID data? If you’re like me, sometimes you’re a little skeptical. You’re like, I don’t really want my employer to know about my disability, and that’s fine. Maybe also think about what that might enable if you did provide it, and if you do trust the company with it. You could enable analyses like this where we could look at disability at scale and understand how inclusive the engineering process is of folks with disabilities. Beyond that, I don’t have any information.

Participant 1: In particular, something like blinded code review seems like something that could potentially be low effort to implement if the tooling supports it. What is the state of tools, GitHub or whatever, that support this type of blinded code review? I imagine even blinded interviews would be similar.

Murphy-Hill: I can point to a couple of things I know about. There is an anonymous code review plugin for GitHub. I don’t know whether it’s maintained anymore, but it’s out there in the Mozilla store, I think. Last time I tried it, it worked on a variety of platforms. Basically, you would just install it as a browser extension, and it would remove names. When we implemented our tool, we did something similar. One of the challenges there, though, is that people often get email notifications from the underlying platform, like GitHub, and those often deanonymize the review. It turns out that anonymizing Gmail, a very complex web product, using a Chrome extension is quite difficult. Unintentionally revealing people’s identity happens a lot there. Systems like GitHub don’t natively support things like anonymous code review, but you can get a little ways by implementing extensions to do it.

As for interviewing, one of the platforms I like for interviewing is Byteboard. Byteboard does asynchronous interviews where they give you a design doc, so the sketch of a design doc for a piece of software. There are some holes to fill in, like a senior dev is asking you about the decisions that were here, here, and here, and you need to respond. In some sense, it’s like a very realistic type of interview. Then, there’s a second part where you actually do some coding based on what you’re talking about in the design doc. This is really nice because it’s not at the whiteboard where obviously you can’t mask the person’s identity, but they’re doing very realistic work. As an evaluator, you don’t need to know anything about that person’s identity. You can just look at the work product. Byteboard is a product that I like that makes identity less salient. There are probably some others. Skepticism is good. We’re trying to do science here, so we should always question research results.

Participant 2: You had mentioned earlier on with the code review ones that the rollbacks dropped with the anonymous reviews. Did you have a chance to dig into the why?

Murphy-Hill: Yes, we didn’t have a why. It’s hard to tell, these are across the whole company. We look at a particular code review that got rolled back, for instance, and as an outsider, I look at it, and I’m like, “I don’t know what’s going on here. I don’t know why they rolled this back”. We didn’t look in any depth and couldn’t say why. What people said when they were doing the code reviews that are anonymous, they said they felt like they can really focus on the code and not the person who wrote it. To give you a very concrete example of that, I’ve been coming at this from a, like, what if the author is from a historically marginalized group, but actually you have the problem in the opposite direction, which is, what if the author is a super-senior person?

If you talk to some super-senior people at Google, well-known, famous people within the company, they will also tell you that they can’t get a fair review either, because what happens is people see, “This famous person wrote the code. They must know what they’re doing. I won’t give them as much pushback as I would otherwise”. It’s actually somewhat freeing not to see a locally famous developer’s face at the top, because you can just say, “It could be anybody. I’m just going to review this like I would anything else”. The theory that I would put forward, and there’s not enough data or depth to say for sure, is that it allows people to be freed of those expectations with respect to social identity, and they can just be more thorough that way. Perhaps that’s the reason why there were fewer rollbacks.

Participant 3: The more mature people in the organization who are writing code, I would wonder why they would get the pushback. Because obviously they would have spent so long in the company. They know the architectures and everything. Did we get to the bottom of what was the root cause? Why would a senior engineer who has spent 20 years in the company face three times the pushback?

Murphy-Hill: Senior engineers don’t necessarily have to have spent a long time in the company. For example, I am a middle-aged engineer, and I’m new to Microsoft, for instance. Secondly, I’ll also say that in the way we did the modeling here, we’re not just looking at average amounts of pushback these different groups get. We’re also accounting for how long they’ve been at the company and what their level is. Senior engineers tend to get less pushback than junior engineers. You’re right, the longer you’ve been at the company, the less pushback you tend to get.

Once you take those factors out, folks who are older still get more pushback. This doesn’t quite answer your question, because you’re asking, what’s the mechanism here? I think the mechanism is just that people have biases around age and tech. In particular, people think that the older you get, the less up-to-date you are on new technologies, the more you’re stuck in your ways, the more inflexible you are, for instance. We did a prior study that showed that’s actually not true.

When you look at Stack Overflow, for instance, where people ask and answer questions, some people actually list their age there. What we found is even on new technologies, older developers tend to answer questions better than younger folks. I think the stereotype doesn’t hold. I think that’s the stereotype that leads into these situations where older people are getting more pushback. That’s a theory I would put forward, but then it’s quantitative data, so you can’t be 100% sure.

Participant 4: You do control for all those, like the tenure and seniority. Does that have a contribution to the pull-down effect of those biases? If you’ve been at a company 20 years and you’re 60 years old, there’s no one else in that group, potentially. You’re leaving out some of these people who get no pushback, potentially. Is that part of the equation there?

Murphy-Hill: If you’ve been at the company longer, you get less pushback. If you’re older, you get more pushback. You’ve got two separate forces working in opposite directions. That’s true. To some extent, they’ll cancel each other out. If you’re older and you’ve been at the company for a long time, you won’t necessarily face three times as much pushback. There might be some cancellation between the two.

Participant 5: The insights you’re gleaning have multiple dimensions and are deeply nuanced. Is there any longitudinal study to back these insights?

Murphy-Hill: A longitudinal study to back the insights? One thing you’re saying is that a lot of these insights, the one about code review, for instance, involve really short-term decisions people are making: in a code review you’re not spending a ton of time. You’re spending at most maybe an hour or so typically doing a review. The decision you make on it, to push back or not, or to accept or reject, is a somewhat instantaneous decision. Beyond the length of the interaction, people have talked about having such bad code reviews that they’ve quit projects. Because there are so few cases, it’s hard to say whether that’s a widespread phenomenon or not. I point at that as one of the longer-term effects. Also, the diverse team study was a longer-term thing too. Although we interviewed people at a point in time, this was after these teams were really well established. I think that just adds to the richness here. We have some instantaneous quantitative effects and some longer-term qualitative effects too.

 
