Transcript
Hoffman: My talk is also a little bit about astronomy. This is the geocentric model of the solar system. For almost 2,000 years, people believed that the sun and the planets orbit around the Earth. They had pretty good reason to believe this. They saw in the morning that the sun would rise in the east, and at night it would set in the west. They thought, naturally, that the sun was circling the Earth. They also noticed that the stars didn’t appear to move. You would think that if the Earth were moving, your angle relative to the stars would change, and the position of the stars would be different. Furthermore, using this model, ancient astronomers were able to make remarkably accurate predictions about the positions of planets. They could predict the position of planets with up to around 90% accuracy. They could use these predictions to navigate, to sail, and to create calendars. This model actually worked really quite well.
By the Middle Ages, there were some issues. This is because the initial observations they had used to derive their equations stopped applying as well. They had to continuously revise the model. One real problem with the geocentric model is that sometimes planets appeared to move backwards. We understand why that is today: it’s because the Earth and the planets are orbiting at different rates. When the Earth passes by a planet, that planet appears to move backwards. If everything’s going in a circle around the Earth, there really should be no backwards motion. Ancient astronomers solved this basically by making the model more complex. What they did is they introduced something called an epicycle. An epicycle is a micro-orbit within a planet’s orbit: the planet is going around in a big circle, and it’s also tracing a smaller circle along the way. Essentially, by introducing arbitrary epicycles and other such constructs, which we might think of as hacks, they were able to fit this model to the observed data pretty well.
However, around 1500, Copernicus came along. He revived the heliocentric model of the solar system. This is a model in which the Earth and the planets orbit around the sun. We know this model to be mostly accurate. The Greeks had considered this, and they had decided it was clearly nuts, but Copernicus revived it. One of the reasons why he liked it is because he thought it was conceptually simpler. For instance, with this model, theoretically, you don’t need epicycles to explain backwards motion. Copernicus was hamstrung in that he thought that everything had to go in a perfect circle, to match his notions of Aristotelian harmony and beauty. Because of this, he had to add back epicycles.
Then Kepler came along about 100 years later. Kepler realized that planets actually orbit in ellipses. Once you allow the planets to move in ellipses, the heliocentric model not only has better predictive power than the geocentric model, for the first time, but is also far simpler. At this point, the heliocentric model became widely adopted by astronomers and scientists. That was also due, in part, to observations by Galileo.
What can we learn from this as software practitioners? First off, if you have a subpar model, a model you might have for many valid and good reasons, you can make that model work for a long time if you just add arbitrary complexity to it. You can just keep adding epicycles and things like that and adding complicated code on top of your already complicated architecture. You can make it work. A better architecture will solve the same problems as an inferior architecture in a simpler way. It will also let you solve additional problems that you probably could not have solved earlier.
Therefore, it pays to sometimes take a step back and to ask, is my foundational model, are my core assumptions still valid? Are they still serving me? Are they making my life easier, or are they making my life a lot harder? As software developers, we rightfully are a little risk-averse. We like to proceed incrementally for all kinds of excellent reasons. That’s usually what we should do. We should maybe add a little complexity here and a little complexity there instead of revising the core model. I’m not advising against that. But if you notice yourself solving the same issues over and over again, then it’s worth taking a step back and questioning the core assumptions you’re making, like Copernicus did.
Background
I’m Ian Hoffman. I’m a staff engineer at Slack. Previously, I worked at Chairish. I’m going to talk about a time when we at Slack revisited our core architecture and made some big changes to it for similar reasons to why the ancients started with this model and then why Copernicus revised it.
Slack Overview
First off, what is Slack? Slack is a communication app for businesses. You can use it if you’re not a business, but it is designed primarily for businesses. We have three first-party clients: a desktop/web client written with Electron, React, and Redux, and then iOS and Android apps. Our backend is a monolith written in Hack, a language derived from PHP; it’s essentially a strongly typed, just-in-time compiled version of PHP. Most of our data is stored in MySQL databases sharded using the Vitess sharding system. This is what Slack looked like circa 2015 or so.
Slack’s V1 Architecture (The Workspace Model)
I’m going to begin by talking about the evolution of Slack’s architecture in order to motivate the changes we made. Then I’ll describe what the changes were, why we made them, and how we went about making them. Finally, I’ll close with some takeaways. Slack began in 2013 with a pretty simple architecture that I like to call the Workspace Model, though it does not have an official name. In this model, a Slack workspace is equivalent to a Slack customer. A workspace contains users, channels, messages, apps, all the things you’re used to in Slack. Slack is a channel-based communication platform: you can enter a channel and send a message to other users who are in that channel. This is all contained within one workspace. Slack also has this concept of apps, which are third-party apps that developers can build and run in Slack. For instance, a bot that triages tickets from Jira or looks at issues in GitHub can run in Slack.
Importantly, in this model, each workspace is a closed system. Workspaces share nothing with each other. That means that if I, Ian Hoffman, a human being, have access to multiple Slack workspaces, Slack doesn’t know anything about that. These are separate logins. It’s not one account. This has a nice property, which is that the data for a single workspace can be put on a single database shard.
Basically, the server will route queries from a workspace to a shard. If you want to scale up, you just buy more databases and put more customers on them. We had this core assumption that the data for a single customer would fit on a single database shard. Maybe we have to buy a really big database, but we can still serve all of their traffic with one database. That’s because in the beginning, Slack was targeted at teams with maybe a hundred or maybe a thousand people. The chance of them producing so many messages and channels, just so much data, that we couldn’t handle it on an extra-large MySQL instance seemed remote.
First off, I’ll walk through an example of how this worked in a little more detail. Let’s imagine you have a third-party app that you’ve built as a developer. Let’s take the GitHub app as an example. This lets you do GitHub-y stuff in Slack. That means the Slack client has to find out about this app, so it’s going to make a REST API call to load information about the GitHub app. How this works is that the client makes an API call to the server, passing a token which is encrypted and authenticated, and tells the server, I want to load the GitHub app. The token has a user ID and a workspace ID in it. The server takes this workspace ID, which it knows how to map to a specific database shard, and it looks on that shard for the GitHub app. If it finds the app on that shard, then great, the client is allowed to use the app. Otherwise, the app is not installed for that workspace. It’s not allowed. That’s an error.
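As a rough sketch, that V1 routing might look something like the following TypeScript. Slack’s backend is actually Hack, and all the names here (workspaceToShard, shards, the token shape) are illustrative stand-ins for this writeup, not Slack’s real code:

```typescript
// Hypothetical sketch of V1 routing: every query is resolved against
// the single shard that owns the caller's workspace.

interface ApiToken {
  userId: string;
  workspaceId: string;
}

interface App {
  id: string;
  name: string;
}

// Stand-in for the real routing table: workspace ID -> shard name.
const workspaceToShard = new Map<string, string>([["T123", "shard-7"]]);

// Stand-in for the databases: shard name -> apps installed on that shard.
const shards = new Map<string, App[]>([
  ["shard-7", [{ id: "A456", name: "GitHub" }]],
]);

function loadApp(token: ApiToken, appId: string): App {
  // The workspace ID in the token fully determines which shard to query.
  const shard = workspaceToShard.get(token.workspaceId);
  if (shard === undefined) {
    throw new Error(`Unknown workspace ${token.workspaceId}`);
  }
  const app = shards.get(shard)?.find((a) => a.id === appId);
  if (app === undefined) {
    // Not on the workspace's shard means not installed: that's an error.
    throw new Error(`App ${appId} is not installed for this workspace`);
  }
  return app;
}

// The client presents its token and asks for the GitHub app.
console.log(loadApp({ userId: "U1", workspaceId: "T123" }, "A456"));
```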
This worked pretty well, but there were some problems. One I was already hinting at: what if the data for one customer doesn’t fit on one shard? This began to happen because Slack caught on much more than people originally expected. Soon enough, we had customers with 10,000 or 100,000 users sending millions of messages.
At that point we started to really knock over databases. Also, what if customers actually want multiple workspaces? There are all kinds of reasons why a large enterprise might want to partition their organization into multiple workspaces, because workspaces act as content boundaries. You can say, I want these employees to have access to this stuff and those employees to have access to that stuff, but I want to administer them as a unit. I want to handle billing in one place. I want to handle security in one place. I want to manage users in one place. That’s impossible to provide if nothing is shared between workspaces. We wanted to solve these problems, and we were also running into the scaling issues of our customers just getting too large and workspaces having too much traffic.
Slack’s V2 Architecture (Enterprise Grid)
To solve this, we made our model a little bit more complicated, just like the ancients introduced these epicycles. We introduced our Enterprise Grid architecture in 2017, which was our enterprise product. It’s interesting to note that actually the Workspace Model is still basically how Slack works for our smaller customers. For the really big fish, of which there are many, the Enterprise Grid model became a significant product that we sold a lot of. In Enterprise Grid, a customer can have many workspaces, all under the umbrella of their grid, of their enterprise. A user can belong to many of those workspaces, all within the enterprise. Are any of you aware of a company you work at using Enterprise Grid, or have you used this product? This is what I’m describing here.
Finally, data such as channels and apps can be shared across multiple workspaces. You can say, I want an announcements channel that is available in every workspace on my enterprise. This makes disseminating data throughout your Slack instance much easier than if you were trying to manage multiple totally isolated workspaces, as people were doing previously. This is how Enterprise Grid looked. The DF, JE, SM here are just placeholders from a design mockup, but imagine these are all workspaces under the Acme, Inc. enterprise. If you’re looking at DF, you’re only seeing data from DF, and the same goes for the others. You have to switch between these workspaces in order to see everything that you can access within the enterprise.
What are the architectural implications here? The way we made this work is to say, we have these workspaces, and we’re going to have one special, secret, invisible workspace that serves as the org, the parent of all these other workspaces. Just as the data for each workspace is on its own shard, the data for the parent org is on its own shard too. What is org-level data? It’s any data that is shared. Anything that should be available to the entire organization, or entire enterprise (we use these words interchangeably), lives on the org shard.
Basically, channels and apps that are shared with more than one workspace go on the org shard, and everything else goes on the workspace shards. Now that you can find things in two places, how do you successfully route queries? We did something very simple, which is that the backend now queries the current workspace shard and the org shard, always. We just always do two queries. Theoretically, this should decrease the load on any one workspace shard, because now, instead of having one gigantic workspace, customers will probably partition it into many more focused workspaces, spreading the data and the load across them. Here’s another beautiful architectural diagram. It’s not really on par with the models of the geocentric and heliocentric solar system from earlier. I’ll go through the same example again.
In this example, we’re going to load the GitHub app. Again, we pass up this authenticated token. It has the user. It has the workspace. Again, we query the workspace shard. This time, let’s say that this is an app installed at the org level, available to the entire org. We find that it’s not available for the workspace. What do we do? We load the org ID for that workspace, assuming the workspace is part of an org. We get back the org ID, and we route that org ID to a shard. We’re now looking for the GitHub app on the org shard. If we find it, again, we return it to the client, and we let the client use it. What this means is that any workspace under this org will end up querying the same org shard. This is a simple way of making that org-level app available to all the workspaces.
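Continuing the earlier sketch under the same made-up names, the Enterprise Grid version might add an org-shard lookup like this. This is again a hypothetical illustration; the talk describes always issuing both queries, while this sketch short-circuits for the app-lookup example:

```typescript
// Hypothetical sketch of V2 (Enterprise Grid) routing: check the
// workspace shard, then fall back to the org shard for shared data.

interface App {
  id: string;
  name: string;
}

const workspaceToShard = new Map<string, string>([["T123", "shard-7"]]);
const workspaceToOrg = new Map<string, string>([["T123", "E789"]]);
const orgToShard = new Map<string, string>([["E789", "shard-42"]]);

// Stand-in databases: shard name -> apps installed on that shard.
const shards = new Map<string, App[]>([
  ["shard-7", []], // the app is not installed at the workspace level...
  ["shard-42", [{ id: "A456", name: "GitHub" }]], // ...but at the org level
]);

function findAppOnShard(shard: string | undefined, appId: string): App | undefined {
  return shard ? shards.get(shard)?.find((a) => a.id === appId) : undefined;
}

function loadApp(workspaceId: string, appId: string): App {
  // 1. Query the workspace shard, exactly as in V1.
  const fromWorkspace = findAppOnShard(workspaceToShard.get(workspaceId), appId);
  if (fromWorkspace) return fromWorkspace;

  // 2. Resolve the workspace's parent org (if it has one) and query the
  //    org shard, which holds everything shared across workspaces.
  const orgId = workspaceToOrg.get(workspaceId);
  const fromOrg = findAppOnShard(orgId ? orgToShard.get(orgId) : undefined, appId);
  if (fromOrg) return fromOrg;

  throw new Error(`App ${appId} is not installed for this workspace or its org`);
}

// Every workspace under E789 resolves to the same org shard, so an
// org-level install is visible to all of them.
console.log(loadApp("T123", "A456"));
```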
This model works really well, and Enterprise Grid became a really successful thing for Slack, but there were some problems that I’ll now go through. First off, there were some UX issues. When we conceived of Enterprise Grid, the users really belonged to one workspace on average. We weren’t really building something to handle a situation in which one user was in several workspaces, all within the same Enterprise Grid, and had to switch between them to do their work. As Enterprise Grid matured and more companies started to use it, people ended up in more than one workspace within the grid.
Actually, I’ll ask: of the people here who use Enterprise Grid, do you know if you’re in more than one workspace? It’s quite common these days. That meant people had to switch between their workspaces to do their work. They would miss activity in workspaces they looked at less. We tried to fix this by introducing hacky things like a threads view and an unreads view, which actually aggregate org-wide data, but within the view of a single workspace. Here is the unreads view. You can imagine that if this help-customer-support channel comes from a different workspace than Acme Sites, clicking on it would bounce you into that other workspace, which would be jarring. At least it let you see everything in one place, so it represented a slight improvement over the prior approach.
There were also bugs. One really interesting bug here is that there were inconsistent views of org-level data. Imagine a channel, like an announcements channel, that’s shared across all the workspaces in an organization. If you’re an administrator in only one of those workspaces, you can modify the channel, so you can rename it when you’re looking at it from that particular workspace. When you switch to a different workspace, you now can’t. As a user, this makes no sense, and it’s hard to explain to people. That was one persistent class of issues. Also, I’ve described how, on Slack’s backend, we partition our data by workspace.
The data for a particular workspace is stored on a shard, on a database server for that workspace. We do the same on clients, actually. We take data from each workspace, and all that data is completely separate on the client; it’s not munged together in any way. We have separate client-side data repositories for each workspace, even though you might be looking at multiple workspaces on the same client, because you might be logged into several workspaces under your grid. What this means is that things can get out of sync, because if you fetch org-level data for one of those workspaces, or let’s say for a few of them, you’ve now loaded this org-level data into the datastores for each of those workspaces.
Now you change it in only one of those workspaces, the one you’re currently looking at, and that update, for whatever reason, might not make it to the other workspaces. Then you’re looking at a stale view when you switch to another workspace. We fixed a lot of these bugs, but they were inherent in the model: given multiple views into the same piece of data, there’s always a chance for things to get out of sync. We had to be really vigilant to prevent this class of bug from reemerging.
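As a toy illustration of that failure mode, with per-workspace client stores modeled as plain maps (invented for this writeup, not Slack’s client code):

```typescript
// Each workspace keeps its own client-side datastore, so org-level
// objects get duplicated into every store that loads them.

interface Channel {
  id: string;
  name: string;
}

// One datastore per workspace the user is logged into (illustrative).
const storeForT123 = new Map<string, Channel>();
const storeForT124 = new Map<string, Channel>();

// The same org-level channel is fetched in both workspaces, so each
// store ends up holding its own copy.
const announcements: Channel = { id: "C1", name: "announcements" };
storeForT123.set("C1", { ...announcements });
storeForT124.set("C1", { ...announcements });

// A rename lands only in the workspace currently being viewed...
storeForT123.get("C1")!.name = "announcements-2024";

// ...so switching to the other workspace shows a stale view.
console.log(storeForT123.get("C1")!.name); // "announcements-2024"
console.log(storeForT124.get("C1")!.name); // "announcements" (stale!)
```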
Also, this model is just not the most efficient. If you have an org-level piece of data, you’re going to load it again in every single workspace you look at. For example, DMs are org-level. That means every time you go to a workspace within the grid, we’re reloading your DMs. That’s just enormously wasteful. It also means a larger memory footprint. I was just talking about how we were duplicating org-level data into these workspace-partitioned data repositories on the client; that also causes memory bloat. We did build an org-level datastore that lets you store org-level data in one place, but this took time to adopt. Slack is an interactive system, so it’s not just the client querying the backend; the backend will push updates to the client.
The way the backend does this is it maintains WebSocket connections for each workspace. Some of our customers actually have thousands of workspaces. That means whenever you make an org-level change that should impact all the workspaces, it has to be fanned out across thousands of WebSocket connections. This is inefficient. Eventually, we built an org-level WebSocket to fix this, but again, that was not widely adopted and took time to adopt. We felt like we were running up against a class of issues: there were persistent UX problems, there were confusing bugs like the multiple-perspectives bug I talked about earlier, and there were inefficiencies with this model where we were doing a lot of redundant work.
Changing the Model
We decided that maybe we’d been going about this the wrong way. In 2022, we took a step back and we reconsidered some of our core assumptions here, like Copernicus did. We’re on a far lower plane than him, just to be clear. We asked the foundational question, which is, why should users view data from only one workspace at a time? What if you could see everything you needed in a single place? If you’re a user, why do you really care where a channel comes from as long as you know you have access to that channel? Wouldn’t this be a simpler experience? If we had this model, there wouldn’t be any context switching between workspaces because there are no workspaces. There wouldn’t be any missed activity, or at least no missed activity in different workspaces, because, again, there are no workspaces. You can’t have inconsistent views of org-level data because you’re only getting one view of your data.
There can’t be duplicate API calls, because, again, you’re only getting one view of your data. You’re not switching between workspaces. You’re not storing redundant data on clients, because you only need to store the data once. You don’t need to store it per workspace anymore. We felt like this nicely simplified a lot of the issues we were running into with Enterprise Grid. We felt like this was also a better foundation for the product to continue evolving. Though Slack had started off with this Workspace Model and had been highly workspace-centric, we had moved in a more org-level direction over the years.
A lot of our new features, like Canvas, which is our document editor, and Lists, which is like Google Sheets, and of course DMs, are org-level by default. These are things that are available across the whole org. It was a weird experience to switch between workspaces but always see the same canvases, the same lists, the same DMs. Giving you an org-level view matched the direction Slack had been evolving in anyway.
Slack’s V3 Architecture (Unified Grid)
With this, we introduced our v3 architecture, which I call Unified Grid. In Unified Grid, the user can see everything they can access within the enterprise in one view. However, access control is still determined by workspaces. The workspaces still act kind of like ACLs, limiting what you can see. Importantly, Unified Grid does not change what users can do. It doesn’t actually change anything about the permissions model. It just reorganizes things in a quite fundamental way. You’ll remember, this is what Enterprise Grid looked like, and this is what Unified Grid looks like. Here you can see that we’ve reclaimed this sidebar, which used to show one tile per workspace. We can now use the sidebar to show other useful things, like all your DMs, all of your activity across all workspaces, and your Later feature, which is like a to-do list or a set of reminders. Within this Acme Inc. channel list, you can imagine that we’re incorporating channels from multiple workspaces, all under Acme Inc.
How did we do this? In Unified Grid, the API token no longer determines the workspace shard. You’ll remember in all those prior architectural diagrams I showed, we were passing up the current workspace ID when we were making an API call to the backend. How do we select the workspace to put in that token when we’re in Unified Grid? There is no current workspace. We would have to pick arbitrarily. We don’t do that. Instead, we include the ID of the current org. You’ll remember the org is just a secret hidden workspace. We’re including that ID. Now the server needs to check both the org shard and also multiple workspace shards, because the data we’re looking for could be found on the org shard, as before, or it could be found on the shard for some workspace. We don’t know which one, because you’re looking at everything at once.
This sounds really non-performant, obviously, and we were, of course, a little bit concerned about this. It turns out that we can limit the workspaces that we check to just those that the current user is a member of. Most users are still in just a handful of workspaces, maybe three or four at most, so this check ends up being fairly efficient. There is a long tail of users who are in hundreds of workspaces, but it turns out that most of these users are system administrators. They’re not actually using Slack in all those workspaces. They just need to administer them, and that means they really need to get to the admin site for those workspaces, which requires them to be a member of the workspace. The way we handled this was to say, we’re basically going to consider 50 workspaces when we do this check-every-shard logic, and we’re going to let you edit this list. It’s confusing, but this affects a minuscule subset of users; we’re talking point something-something percent.
This strikes a good balance between handling these extreme outliers and allowing us to move forward with this architecture. This is an even more complicated diagram that I’ll go through. Same example as before: we want to load information about the GitHub app to display in Slack. We make an API call, but you’ll notice that this time we’re using the org ID. This E789 number is in the token instead of the team ID. That means we query the org shard first. We say, is this app installed at the level of the org? Let’s say that it’s not. Let’s go back to our first example, where the app is actually just installed for one workspace. We get a miss on the org. The next thing we do is load up all the workspaces for the user, which is a cache hit in memcache 99.99% of the time. Then, with that list, we loop over every workspace and query its shard. We’ll query at most 50 workspaces here, if that’s how many the user is in, but commonly we’ll query 3 or 4.
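Under the same illustrative names as the earlier sketches, the Unified Grid version of that flow might look like this (again a hypothetical sketch, not Slack’s actual code):

```typescript
// Hypothetical sketch of V3 (Unified Grid) routing: the token carries
// the org ID; on an org-shard miss, fan out to the shards of the
// (capped) list of workspaces the user belongs to.

interface App {
  id: string;
  name: string;
}

const MAX_WORKSPACES_CHECKED = 50; // the cap described in the talk

const orgToShard = new Map<string, string>([["E789", "shard-42"]]);
const workspaceToShard = new Map<string, string>([
  ["T123", "shard-7"],
  ["T124", "shard-9"],
]);

// Stand-in for the memcache-backed lookup of a user's workspaces.
const userToWorkspaces = new Map<string, string[]>([["U1", ["T123", "T124"]]]);

// Stand-in databases: shard name -> apps installed on that shard.
const shards = new Map<string, App[]>([
  ["shard-42", []], // not installed at the org level this time...
  ["shard-7", [{ id: "A456", name: "GitHub" }]], // ...just for one workspace
  ["shard-9", []],
]);

function findAppOnShard(shard: string | undefined, appId: string): App | undefined {
  return shard ? shards.get(shard)?.find((a) => a.id === appId) : undefined;
}

function loadApp(userId: string, orgId: string, appId: string): App {
  // 1. Query the org shard first, since the token now carries the org ID.
  const fromOrg = findAppOnShard(orgToShard.get(orgId), appId);
  if (fromOrg) return fromOrg;

  // 2. On a miss, loop over the user's workspaces, capped at 50;
  //    most users are only in three or four.
  const workspaces = (userToWorkspaces.get(userId) ?? []).slice(0, MAX_WORKSPACES_CHECKED);
  for (const workspaceId of workspaces) {
    const app = findAppOnShard(workspaceToShard.get(workspaceId), appId);
    if (app) return app;
  }

  throw new Error(`App ${appId} is not installed anywhere the user can see`);
}

console.log(loadApp("U1", "E789", "A456")); // found on T123's shard
```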
We strongly believe this offered a better user experience, and actually a better developer experience too, and just got rid of this concept of the workspace that was becoming increasingly vestigial, and moved things to be on the org level in a way that really matched the direction we wanted to go at Slack. It was of course a very large change. We were concerned that Unified Grid would be dead in the water from the start. To give you some stats, we had well over 500 API methods that depended on a workspace for routing.
What this means more concretely is that these were API methods, called by a first-party Slack client, that routed to a database table where the routing depended on the workspace in the API token; the client would be switching from using the workspace to the org in Unified Grid. You can imagine that any or all of those APIs could break if we stopped including the workspace in that token. We also had over 300 team settings which could differ at the workspace level, meaning the same setting could have different values for each workspace on the org. There was a question of, how do you rationalize that in one view? Each of these settings would need to be handled on a case-by-case basis, with product teams deciding what made sense for that setting.
Then, any backend change we made needed to be replicated on all three clients, so we were essentially doing 4x the work listed above. However, we didn’t want to take this all on at once; that would be crazy. We decided to begin with a prototype. We built a very simple prototype that could basically boot up, send a message, and show the messages in a channel. We began to use this prototype for our day-to-day work. We made it really easy to turn the prototype on and off with a single button. If you ran into an issue, you would just exit the prototype, keep doing your work, mark down that issue somewhere, and then, when you had time, go and fix it.
At a certain point, we invited peers to start using the prototype, and they began to give us feedback. A really useful piece of feedback we got here led to a focus mode. Some people missed having the per-workspace view, because they wanted to focus on only content from a particular workspace. Maybe they don’t care what’s going on in the social workspace today. This is a view where you can actually filter down your channel list to just channels from specific workspaces in the Slack client in Unified Grid. Eventually, as the prototype matured, we invited leadership to use it. At this point, leadership came on board, and Unified Grid became a top priority in January 2023.
At this point, the core team pivoted from prototyping into attempting to projectize what we were working on, creating resources and tools that other teams could use to help with this large migration. The first tool, which is very exciting for everyone, I’m sure, was a bunch of spreadsheets. We created spreadsheets listing every API method and permission check and workspace setting which might need to be audited as part of Unified Grid. As I mentioned earlier, we had data about which APIs were called using a workspace token and then used that token to route to a particular table. All of those APIs were, of course, fair game.
Then any permission check they ended up making needed to be looked at, too. Any setting they looked at had to be checked, and so on down the rabbit hole. Once we had these in place, we worked with project managers and subject matter experts to assign them to the various product teams. We also created a bunch of documentation to make it easier for these teams to do the migration. We had worked out these approaches during the prototyping phase; that was another invaluable part of prototyping, because we got a sense of how hard this migration was going to be. We realized that there were actually three primary tactics for migrating an API to be compatible with Unified Grid. This is a little bit of a sidebar, but several years ago, a Slack engineer named Mike Demmer came to QCon, and he spoke about our Vitess migration. He was also the architect of Unified Grid.
The Vitess migration was a change in which we moved away from this per-workspace, per-org sharding model to a more flexible sharding model. We’re using Vitess, which is essentially a routing layer for MySQL. We could re-shard tables along more sensible axes. For example, we re-sharded our messages table such that all the messages for a particular team, or a particular workspace, are no longer on that workspace’s shard. They’re now sharded by channel ID, so all the messages for the same channel are on the same database shard. This is a much more sensible sharding strategy for messages, because it’s unlikely that one channel has too many messages for a database shard, whereas you can easily imagine that one workspace has an incredible number of messages in it.
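To make the idea concrete, here’s a toy sketch of the two shard-key choices. The hash function and shard count are made up for illustration; Vitess computes keyspace IDs its own way:

```typescript
// Route messages by channel ID rather than by workspace ID, so no
// single workspace can overwhelm one shard.

const SHARD_COUNT = 8;

// Toy string hash, just for demonstration.
function hash(key: string): number {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

// Before: all of a workspace's messages land on one shard.
function shardByWorkspace(workspaceId: string): number {
  return hash(workspaceId) % SHARD_COUNT;
}

// After: a channel's messages stay together, but a big workspace's
// many channels spread out across many shards.
function shardByChannel(channelId: string): number {
  return hash(channelId) % SHARD_COUNT;
}

console.log(shardByWorkspace("T123")); // every message in T123 -> one shard
console.log(shardByChannel("C1"), shardByChannel("C2")); // spread out
```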
The good thing about this is that if a table had been re-sharded such that it no longer depended on the org or the workspace ID, then it didn’t have to change in Unified Grid, because we were already routing based on something that wasn’t going to change. We were changing the API token from containing the workspace ID to containing the org ID, and that doesn’t affect how those queries are routed. There’s another class of API which actually requires workspace context. At Slack, every channel is created within a specific workspace. We could have revisited that for Unified Grid, but we decided not to; that’s still a decent baseline. In the past, the workspace in which to create a channel would just be determined by the workspace you were currently looking at. If you’re in a workspace, that’s the workspace where you create the channel.
In Unified Grid, of course, there is no current workspace. We made this really simple: we just made the implicit decision explicit, by popping up literally a dropdown menu and having the user pick a workspace when they go to create a channel. Finally, if an API doesn’t fall into either of these buckets, so it’s still sharded by workspace ID or org ID and it doesn’t require this more explicit context, then we use the strategy I described earlier, where we check all of the user’s workspaces in that potentially expensive manner.
This was obviously a really big change, and with large changes, things can break. We wanted to make sure that we had good test coverage. Over the 10 years of Slack existing as a product, we have written thousands of integration tests, probably more. We didn’t want to rewrite all these tests, and we also didn’t want to lose the coverage they provided. What we did is we created a parallel test suite that runs all of these tests but automatically switches the workspace context to the org level. The APIs suddenly began to receive an org, and, of course, they all broke. This gave us a burndown list, and our product teams fixed the failures during the migration, which was very kind of them. By the time we launched, there were zero tests failing as a result of this. This allowed us to avoid rewriting our test suite and to still have pretty robust coverage.
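In spirit, that parallel suite worked something like this hypothetical harness (runSuite, TestContext, and the sample test are all invented for illustration):

```typescript
// Run every existing integration test twice: once with a workspace
// token, once with an org token, so org-context failures surface as a
// burndown list without rewriting the tests themselves.

interface TestContext {
  tokenContextId: string; // a workspace ID ("T...") or an org ID ("E...")
}

type IntegrationTest = (ctx: TestContext) => Promise<void>;

async function runSuite(
  tests: Map<string, IntegrationTest>,
  workspaceId: string,
  orgId: string
): Promise<string[]> {
  const failures: string[] = [];
  for (const [name, test] of tests) {
    // Original run: workspace context, as the tests were written.
    await test({ tokenContextId: workspaceId });
    // Parallel run: the same test with the context swapped to the org.
    try {
      await test({ tokenContextId: orgId });
    } catch {
      failures.push(name); // this goes on the burndown list
    }
  }
  return failures;
}

// A stand-in test that fails until its API accepts org tokens.
const tests = new Map<string, IntegrationTest>([
  [
    "loads the GitHub app",
    async (ctx) => {
      if (!ctx.tokenContextId.startsWith("T")) {
        throw new Error("API not yet migrated to accept org tokens");
      }
    },
  ],
]);

runSuite(tests, "T123", "E789").then((failures) => console.log(failures));
```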
Finally, we did some basic things like creating easy-to-use helpers wrapping up common logic. You know how I described earlier the bug where you could administer a channel only from a workspace where you were an admin, and if you switched to another workspace with access to that channel where you weren’t an admin, you couldn’t? What that meant was that in the old Slack client, in Enterprise Grid, you could simply click through your workspaces until you found one where you were an admin, and then you could administer the channel. Now we do this for you. We have a helper that says, can the user act as an admin for this channel? It takes the user, it takes the channel, and it intersects their workspaces. If the user is an admin in any of those workspaces, then the answer is yes.
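A sketch of what such a helper could look like (canActAsAdminForChannel and the data shapes here are illustrative, not Slack’s actual helper):

```typescript
// Intersect the workspaces the user belongs to with the workspaces the
// channel is shared into, and say yes if the user is an admin in any
// of them, so users no longer hunt for the "right" workspace.

interface User {
  id: string;
  workspaces: Set<string>; // workspaces the user belongs to
  adminOf: Set<string>; // workspaces where the user is an admin
}

interface Channel {
  id: string;
  sharedWith: Set<string>; // workspaces the channel is shared into
}

function canActAsAdminForChannel(user: User, channel: Channel): boolean {
  for (const workspaceId of user.workspaces) {
    // Only workspaces in the intersection count...
    if (!channel.sharedWith.has(workspaceId)) continue;
    // ...and one admin role in any of them is enough.
    if (user.adminOf.has(workspaceId)) return true;
  }
  return false;
}

const user: User = {
  id: "U1",
  workspaces: new Set(["T123", "T124"]),
  adminOf: new Set(["T124"]),
};
const channel: Channel = { id: "C1", sharedWith: new Set(["T123", "T124"]) };
console.log(canActAsAdminForChannel(user, channel)); // true, via T124
```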
With this, we got to do something very gratifying, which was watch a chart go to zero as product teams jumped in and began to burn down these APIs and permission checks and settings. We began our rollout in September 2023, and we finished in March 2024. We forced an upgrade of the last pre-Unified Grid mobile clients this October, so quite recently.
Takeaways
What did we learn from this whole process? Some of these takeaways will be more mind-altering than others, probably. First off, you should really centralize complexity. You might look at this and say, isn’t this a simpler model? This is our v1 architecture. Isn’t this a lot simpler than this? This seems like a step backwards. I think the counterargument is that we now handle a much broader range of use cases for our customers, and we’ve centralized complexity on the backend. Before, customers and clients and users had to think about things like, what workspace am I in? Now, they don’t anymore. While we’ve made things harder for the backend, we’ve made them simpler for clients.
In fact, in some ways, we’ve made things easier for the backend, too, because the server now explicitly has to handle the possibility that something is in an arbitrary workspace, whereas before, the current workspace was always implicit in every operation. You can view an action prior to Unified Grid as a function of the current user, the resource, and the implicit workspace, whereas in Unified Grid, we’ve made that explicit by saying the action is a function of the user and the resource, and that’s it. This could also be an example of explicit is better than implicit.
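Expressed as illustrative type signatures only (these names are invented for this writeup, not Slack’s code):

```typescript
// Before Unified Grid, every permission decision implicitly depended
// on "the current workspace"; after, it is an explicit function of
// just the user and the resource.

type Decision = "allow" | "deny";

// Before: the current workspace is a hidden third input to every action.
type CanActV2 = (user: string, resource: string, currentWorkspace: string) => Decision;

// After: the same decision, with no ambient workspace context to supply.
type CanActV3 = (user: string, resource: string) => Decision;
```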
In terms of efficiency, the fastest API call is one that doesn’t happen. This is pretty anecdotal, but this is the calls required to boot the Slack client for a user in January 2022, prior to Unified Grid, versus January 2024. In Unified Grid, API calls can take somewhat longer, because they need to do more things, but you’ll note that we make many fewer of them. This client.counts API is the API that paints highlights on the sidebar; it figures out which channels have unread things and that sort of thing. We previously made almost twice as many calls to it for this user within their enterprise. Then, we replaced our boot API, which was called client.boot, with an API called client.userboot. The name doesn’t really matter. We make four times fewer API calls in Unified Grid than we made previously. Even though each of those calls is a little bit heavier, this is a massive saving overall.
Also, you should really prototype. Prototyping is a great way to get feedback, to figure out if something is going to be feasible, and to work out the rough edges of the UX. To bring it back to our friend Copernicus, his initial model was not so great. It was maybe an improvement on the geocentric model, but it had all kinds of problems. By putting this theory out there, he allowed people like Galileo and Kepler to make significant progress and to eventually make this model become accepted. If you don’t put your big ideas out there, and a great way to put them out there is with a working prototype people can play with, then they’re never going to become reality.
Finally, take a step back and ask the big questions. For example, does the Earth actually orbit around the sun? Also, is our architecture serving us? I think as engineers, we, as I said in the beginning, can be a little averse to change. We like to make small incremental improvements. I think a way in which this manifests is that we often take the status quo that we’ve received as dogma. We say the product behaves like this, there must be a good reason why it behaves like this. We’re using this set of technologies, there must be a good reason why we’re using these technologies.
Often, things that made a ton of sense several years ago have changed for all kinds of valid reasons. When you’re considering how you might improve the architecture of your application to solve real issues you’re facing, you should feel empowered to question these sacred cows. Like Copernicus and Kepler, take a step back and say, what is the inherited wisdom we’re just following, and what can we change to make our lives easier? Then, how can we make that change responsibly?
Questions and Answers
Participant 1: After all the lessons you’ve learned from those three models, and the path that you followed to get to where you are right now, is there anything that you’ve learned from failure that you would change?
Hoffman: At the time that Enterprise Grid was built in 2017, it wasn’t like we were unaware that we could have built something like Unified Grid instead. Given what we knew about the way users used Slack, and given the pressure we were under, we felt like it wasn’t worth it at the time. I wouldn’t say that we should have gone back and done that differently. I do wonder, if we were to do it all over again, whether we would consider having a more first-class person entity in the way that Discord does. I also think there are some real advantages to having total separation between the data for different customers; given that Slack is so business focused, I don’t know if that ever would have flown. Maybe we would have been a little bolder and done this a little earlier.
Betts: Do you feel like you had to introduce those epicycles from Enterprise Grid before you realized it was so wrong?
Hoffman: At the time we did Enterprise Grid, it was the pragmatic thing to do. I think, in retrospect, it certainly did increase complexity. Also, at the time, at least from a user experience standpoint, the average user was really in just one workspace within their org. It was really hard to justify the investment that reducing that complexity would have taken at the time. Again, maybe we could have started down this path a few years before we did. We were doing incremental things, though; things like the Vitess migration made this so much easier to do than it would have been in 2017.
Participant 2: Do you do any sharding for the WebSockets connections to make sure that you efficiently push data to all of the relevant WebSockets connections?
Hoffman: I don’t know a ton about how our RTS system works under the hood; there’s a whole team that works only on that. In Unified Grid, we attempt to push most data to the org-level socket only. That means it just gets pushed to one place. We do check whether a recipient is online; the RTS server has an understanding of which connections are currently active, so we can avoid pushing to ones that aren’t. That’s for users; that’s how users work. Users can be offline or online, and users have their own sockets as well. For workspaces, I think they always get all pushes, because they’re always online in some capacity.
Participant 3: Did you face resistance from parts of the technical teams? Because I bet you did. There are always some engineers, maybe the older ones, who somehow protect the previous solution because they feel it’s theirs; it’s something that works, and it’s not worth changing. Did you face anything similar?
Hoffman: We didn’t face that much resistance from engineering teams within Slack. I think the prototype was a really big reason why we didn’t, because people began to use it, and it became something that people liked. One thing I left out of here, too, is that every few years, Slack does a total UI revamp; it’s our IA series. We’ve had information architecture 1, and 2, and 3, and now 4. We managed to combine forces with this UI revamp. We said, we’re changing the UI in this fundamental way; let’s use this as a chance to do Unified Grid as well. Once we had both design and product pushing for this, and then also a significant portion of engineering, it was pretty easy to get buy-in. There were certainly individual engineers who were like, I don’t want to work on migrating code for three months, but they were overall pretty accommodating about it.
Participant 4: Beyond the requirement from product, what about the financial side? Did the cost go up after changing to the more complex system?
Hoffman: I don’t know. I have not seen overall numbers for this. I think, on first principles, we would expect the cost to remain stable or go down, because we serve less traffic than we used to. It’s possible the cost went up somewhat.
Participant 5: I’m curious about the question of, is our architecture still serving us, and about making decisions between smaller steps and incremental changes versus big revamps. What information makes these conversations easier?
Hoffman: I think having lots of examples of ways the architecture has made things hard. All the similar bugs we had been fixing around inconsistencies, people seeing actions they could do in some channels and not others and not understanding why. There had been entire projects to consolidate API calls so that we weren’t redoing them for every workspace. All those projects failed because they were so complicated, and they didn’t change the overall model, which made them even harder to ship, because you were changing the underlying architecture without changing the user-visible architecture. I think at a certain point, it was like, wouldn’t it be easier to just fix all of this at once? At that point, we were able to get buy-in. In some ways, I think if we hadn’t been running into these issues, it would have been very hard to make a pitch for this.
Participant 5: So it’s more of the backward-looking data that you have, kept up to date.
Participant 6: In your presentation, you showed the older architectures and how you moved to the current one. Did you consider any other architectures? Because when you’re prototyping, you don’t know the outcome unless you have the production data and everything. Some prototypes work well when there is a smaller subset of data, but in production, with a large dataset, sometimes the architecture goes sideways. Did you consider any other architectures, and how did you make the decision?
Hoffman: We did. We considered an architecture where actually instead of just doing Unified Grid for the grid, we had user-level Unified Grid. You literally saw everything you could access in one place, whether or not it came from the current enterprise. If I was in a workspace for work and a workspace for my apartment building or whatever, that would all go in one place. A little bit more like the Discord model or something, where it’s one user getting access to everything. We decided that was counter to Slack’s position as an app that’s primarily focused on businesses.