Transcript
Dai: I have been at Roblox for almost 4 years now. When Jennifer first reached out to me, I didn’t have a thing to talk about; we basically just did some casual chatting. I told her what I’ve done at Roblox, the projects I’ve worked on. I shared two stories. She immediately caught that there was a thing there: a trend that I’m actually shifting left, from production to testing. Throughout this shift, I can see how my work contributes to increasing the productivity of engineers at Roblox. I also want to call out two metrics that I’ve paid particular attention to over my past 4 years: reliability and productivity. Those two things have been driving all those decisions and guiding our team whenever we face choices.
Background
I’ll start with some background. Then I’ll share two stories: first migration, and then second migration. Then, some of my learnings throughout the journey.
When I joined Roblox almost 4 years ago, I started on the telemetry team. I was really excited back then because Roblox was actually the smallest company I had ever worked for. Before Roblox, I was at Google and LinkedIn, which were much bigger than Roblox. It was a small company, and I got to know a lot of people. I was so excited that I could work with everybody, and everybody could know each other. Back then, the company was undergoing very fast user growth. We’re still on a very fast trajectory. Everything was new to me. It was a very exciting experience for me to begin with. There was a lot going on. On the telemetry team, I got to experience the whole setup, the size and scale of our architecture and infrastructure. We run everything in our own data centers. We have almost 2,000 microservices. As the telemetry team, we were managing billions of active time series.
Then, that’s not the fun part. There were a lot of on-calls. The first week that I joined Roblox, the very first weekend, I was actually out hiking with a couple of friends. Somehow that route had signal, and I got paged. I got dragged into a Zoom call with the VP and CTO both there, debugging some Player Count job issue. Then, the question that came to the telemetry team was: is this number really trustworthy? Can you tell me where this number came from? Can you tell me if it’s accurate, if it’s really telling the true story? It was a really stressful first weekend. I almost thought about quitting that first weekend. I’m glad I didn’t.
Then, I just hung in there. The worst part is that on every SEV call back then, people would start by questioning telemetry first. Every time there was a company-wide incident, we got paged as the telemetry team. I would never have thought about that back at Google. The metrics are just there; you’re trying to read the numbers, trying to figure out what’s going on yourself. Who would actually question telemetry being wrong?
That was actually very common at Roblox back then. Especially for those SEV0s and SEV1s, people would actually spend the first 10, sometimes even 20 minutes ruling out the possibility of telemetry being wrong. I don’t really blame them, because the telemetry system was not so reliable. We did produce wrong numbers from time to time. That made me realize that for telemetry, reliability is essential. Reliability should be the first metric for telemetry. Bad production reliability ends up causing very low engineering productivity, because all those engineers spend their time working with us to figure out whether it’s a telemetry issue and how to fix it. That was really time consuming. It ended up producing very long mean time to detection, as well as mean time to mitigation.
We were like, we have to stop. We need to fix this. Let’s take a closer look at the telemetry problem. The problems were basically everywhere. We took a look at the in-house telemetry tool. The engineers before us built a very nice in-house telemetry tool that lasted for years and served the company’s needs really well. This in-house tool was responsible for everything from metric collection to processing to storage to visualization. Everything was in-house. There are some clear disadvantages to this pipeline. The bottom chart is a high-level overview of how the pipeline works. If you pay close attention, at the very end there is a key-value store, a single key-value store. Which means if you want to get a metric, for example, QPS or latency across all your services per data center, and then you want to GROUP BY a different dimension, for example, let’s say per container group or something.
Then, you need to go through this whole process, create a new processing component here, to generate the new key-value entry. That’s a very long process for generating a single chart. We had to go through this whole thing to be able to draw one chart in our in-house visualization tool. It’s very inflexible and very slow. There were some other problems with this old pipeline. We had everything built in-house. Our quantile calculation was also built in-house, and we made a very common mistake in how quantiles should be calculated: you compute quantiles on each local machine, then you aggregate those quantiles across all machines. That’s a very typical mistake in calculating quantiles, and it produces aggregation results inconsistent with other standard tools.
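To make that mistake concrete, here is a small illustrative sketch in Python (not the actual Roblox pipeline): averaging per-machine p99 values gives a very different answer than computing the p99 over the combined raw data, especially when one machine has a slow tail.

```python
import random

def p99(samples):
    """Naive p99: value at the 99th percentile of the sorted samples."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

random.seed(42)

# Machine A is healthy; machine B has a 10% slow tail.
machine_a = [random.gauss(100, 10) for _ in range(10_000)]    # ~100 ms
machine_b = [random.gauss(100, 10) for _ in range(9_000)] + \
            [random.gauss(900, 50) for _ in range(1_000)]      # slow tail

# The "typical mistake": compute p99 per machine, then average the quantiles.
averaged_p99 = (p99(machine_a) + p99(machine_b)) / 2

# The correct global p99 over all raw samples.
true_p99 = p99(machine_a + machine_b)

print(f"averaged per-machine p99: {averaged_p99:.0f} ms")  # misleadingly low
print(f"true global p99:          {true_p99:.0f} ms")
```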
The worst part is this: if every system and every team makes the same mistake, you can probably still see the trend even though the number is wrong, because the error is consistent. Maybe the trend still tells you something. The worst part here is that some teams were calculating quantiles this way, while other teams were using the open-source, standard way to calculate quantiles. When we did side-by-side comparisons across services owned by different teams, we got very inconsistent results. Then, there were availability issues. As the telemetry team back then, our availability was less than 99.8%, which means we had at least four hours of outages every quarter. I don’t blame the people who questioned telemetry at the beginning, because we had so many outages.
The First Migration
With all those problems identified, it was clear that we needed a better telemetry solution. We came up with a very detailed plan. The plan contained three steps. Step one was to design, implement, and productionize a new solution. We evaluated buy versus build options. We ended up with Grafana Enterprise and VictoriaMetrics as our telemetry solution, and we productionized our infrastructure on top of that. Step two was the transition. It’s clear: when you do a migration, you dual-write for some time period, then you kill the old one and move to the new one. Throughout the transition process, you make sure your data is consistent. Because it’s telemetry, you also need to make sure the alerts and dashboards are taken care of. The very last step is basically to remove the old pipeline, and then we can claim victory. That was the plan. That was basically my first project after joining Roblox. We did an estimation and thought one quarter should be enough.
One month for development, one month for transition, one month for the kill, and then celebration. That was just me doing engineering estimation. Then, the reality. That was also very eye-opening for me because, from the telemetry side, we could see all the limitations, all the shortcomings of the old tool. The reality is that migrating basic tools is very hard. Like I said, it was an in-house solution that had existed for almost 10 years. There were a lot of customizations made to the in-house tool to make it really easy to use and really pleasant during incident calls.
For example, there were very customized annotations labeling every chart with the changes made to the service: deployments, configuration changes. Then there were very small things like a very cool Player Globe View. There were latency differences, and, like I mentioned earlier, quantile calculation was a big pain point, very inconsistent. The technical issues just take time. If we spend more time, we can bring those two tools closer to each other. Those are the easy problems. We can just focus on our day-to-day job and get them solved.
Then, I think the more difficult part, from my experience looking back, was actually our engineers: they were very used to and attached to the old tool. It had been there for almost 10 years. I can share some stories. We tried to include a link on every old chart that, if people clicked it, would redirect them to the equivalent chart in the new system. Everything would be very similar to the old one. Engineers still chose to stay with the old tool.
There was one time, I remember, where if they wanted to go back to the old tool, they had to append redirect=false to the URL. We tried to create that friction around using the old tool. But people would just add redirect=false manually every time, or even bookmark the old tool with redirect=false. It’s just that stickiness and attachment to the old tool. That was really eye-opening for me. We also realized that because people just loved the old tool, if we forced them to move, it would harm their productivity. The people whose productivity would actually be harmed are the people who have been there for years. Those people are valuable. They have valuable experience during, for example, incident calls and debugging complicated issues. We needed to take care of those people’s experience carefully.
What happened then? Instead of one quarter, it took roughly three quarters. There were three key things we invested in to make this migration smooth. The first one is reliability, meaning our new pipeline needed to be super reliable. Remember I said the old one had 99.8% availability? We managed to achieve 100% availability with the new tool for multiple quarters in a row. Then people started to trust the new tool more than the old one. Then, delightful. That’s my learning, and I now apply it to the migration projects I work on: the transition experience really needs to be delightful.
Try to ask for the minimum amount of work from your customers, even for internal tools. Ask as little as you can, and do as much as you can for them. Then, overall, if you can make it reliable and make the experience delightful, you will see a productivity improvement. Also, we usually bake our new thinking about how to improve productivity into the new tool, so once people get used to the new tool, we can also see a productivity boost.
I’ll give you a high-level overview of the architecture. This is not an architecture deep dive, so I’ll stay very high-level here. Remember I said there were problems and limitations with our storage; it wasn’t as scalable as we hoped it would be. We chose to use VictoriaMetrics. We shard our clusters by service. We have a query layer on top to help us when we need to re-shard. When a service gets too big, it probably needs its own shard. Then we move that service to its own shard and dual-write for a while, with the query layer covering it up, so people can’t tell the difference when we move their data around. On the right side is our Grafana setup. We actually set up our Grafana in AWS with multiple regions and multiple availability zones, so a single region failure wouldn’t cause any impact to our telemetry system. It worked pretty well, at least for us. That was the architecture.
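As a rough illustration of the query-layer idea, here is a minimal sketch with hypothetical shard and service names (not Roblox’s actual implementation or the VictoriaMetrics API): the layer routes each service to a shard and, during a re-shard, dual-writes to the old and new shards while reads stay on the old shard until cutover, so callers never notice the move.

```python
import hashlib

DEFAULT_SHARDS = ["vm-shard-0", "vm-shard-1", "vm-shard-2"]
DEDICATED_SHARDS = {"game-join-service": "vm-shard-gamejoin"}      # outgrew the shared shards
IN_MIGRATION = {"chat-service": ("vm-shard-1", "vm-shard-chat")}   # (old shard, new shard)

def shared_shard(service):
    """A stable hash pins a service to the same shared shard across restarts."""
    digest = int(hashlib.md5(service.encode()).hexdigest(), 16)
    return DEFAULT_SHARDS[digest % len(DEFAULT_SHARDS)]

def write_targets(service):
    """Shards that should receive writes for this service right now."""
    if service in IN_MIGRATION:
        return list(IN_MIGRATION[service])          # dual-write during migration
    if service in DEDICATED_SHARDS:
        return [DEDICATED_SHARDS[service]]
    return [shared_shard(service)]

def read_target(service):
    """The one shard queries go to; reads stay on the old shard until cutover."""
    if service in IN_MIGRATION:
        return IN_MIGRATION[service][0]
    return write_targets(service)[0]

print(write_targets("chat-service"))   # ['vm-shard-1', 'vm-shard-chat']
print(read_target("avatar-service"))   # one of the shared shards
```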
Now I’m going to share more about the transition. We made the transition internally in three major steps. First, we sent a lot of announcements to build awareness. Those announcements were usually top-down. We sent them over email and Slack, and also put up a banner on the old tool, just to get awareness. You wouldn’t believe how hard people try to avoid reading all the messages that reach them. Yes, announcements, getting the awareness. That’s the first step. Then we took the soft redirect approach, like that redirect=false trick; that was one of them. We put a link on every chart and every dashboard in the old tool.
Basically, we prompt them to try the new tool and take them to the new Grafana chart, which is basically an identical chart. People don’t really click that; I’ll share more. After the soft redirect period, which actually lasted several months, when we reached a certain adoption rate with the new tool, that’s when we started to enforce the hard redirect. If people really need to go back to the old tool, they can still opt out.
Otherwise, it’s a hard redirect. By the time we actually enabled the hard redirect, the daily active usage of the old tool was already fewer than 10 users. What did we do along the way? We did a lot of training. You wouldn’t believe that for internal tools you need to organize company-level sessions, plus callouts and ad hoc trainings. If a team requested a training session with us, we were very happy to host one with them. We also had recordings, but people just prefer in-person training. Also, we kept an eye on our customer support channel and usage dashboard, and we proactively connected with heavy users and their teams. We tried to understand: what is blocking you from using the new tool? For a period, every day we would pull the top 10 most active users of the old tool and reach out to them one by one.
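Here is a rough, framework-agnostic sketch of the soft/hard redirect behavior described above, with hypothetical URLs (not the actual Roblox code): during the soft phase the old tool only shows a link to the new chart, and during the hard phase every request is redirected unless the user explicitly opts out with redirect=false.

```python
from urllib.parse import parse_qs, urlparse

HARD_REDIRECT_ENABLED = True  # flipped on once adoption was high enough

def handle_old_tool_request(url, new_chart_url):
    params = parse_qs(urlparse(url).query)
    opted_out = params.get("redirect", ["true"])[0].lower() == "false"

    if HARD_REDIRECT_ENABLED and not opted_out:
        # Hard redirect: send the user straight to the equivalent Grafana chart.
        return {"status": 302, "location": new_chart_url}

    # Soft phase (or explicit opt-out): render the old chart, but keep a
    # banner linking to the identical chart in the new tool.
    return {"status": 200, "banner_link": new_chart_url}

# Example: a bookmarked opt-out URL keeps working.
print(handle_old_tool_request(
    "https://oldtool.example/chart/qps?redirect=false",
    "https://grafana.example/d/abc123/qps",
))
```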
At the end, we actually made it. We delivered a reliable telemetry system. We deprecated the old tool. On-call is still part of life, but it’s just much more peaceful. One more fun story to share: at Roblox, we have many incident calls. During those incident calls, a truly rewarding moment was when people started sending the new tool links, the Grafana chart links, instead of the old tool links in the Slack channels to communicate with each other. Because the old tool was still unreliable; sometimes it had issues.
If people reported that the old tool had issues, we didn’t even need to jump into the conversation. Our customers who were already used to the new tool would reply to them saying, try the new one, the new one is super cool. That was a really rewarding moment, thinking back now. On-calls are still part of life. The good thing is that people don’t really question telemetry when they get paged anymore, which is a good thing. There are still other improvements we can make to our telemetry system, for example, how to make root cause analysis easier. Those investments are still ongoing.
The Second Migration
For me, personally, the first migration was done. What’s next? I was on the telemetry team. I also spent some time trying to build an automated root cause analysis system with the team. Then I soon realized there were still incidents, and it doesn’t take that much time for us to analyze their root causes. We analyzed the root causes of those incidents and soon realized that 20% of them were caused by change rollouts. That was a lot of financial loss and also user engagement loss for us.
Also, just imagine how much time our engineers need to spend on debugging, triaging, and doing postmortems on those incidents. That’s a lot of engineering productivity loss, in my opinion. I talked to my boss and told him that, instead of improving telemetry, I probably wanted to get a better understanding of the incidents and see how I could help actually reduce the number of incidents. It was very obvious that validation of rollouts needed improvement. There are still a lot of manual steps involved today, but back then, change rollout at Roblox was 100% manual. When a change got rolled out, it would first be manually triggered by an engineer. This engineer was responsible for validating the change by looking at the charts and the alerts and paying attention to their secret runbook.
Then, if there was any issue, the manager would be responsible for rollback, if needed. We really trust our engineers throughout this process, but no matter how good engineers are, we’re all still human. There are still misses sometimes. Depending on the level of the engineer, junior engineers sometimes aren’t really experienced; they don’t know which chart to look at. There was no safeguard blocking production that protected them from making mistakes. It happens, basically, and it contributed to 20% of incidents. It was very obvious that improvements had to be made. We needed to add automated tests to our rollout pipeline and automate the rollout trigger, rollback, and all those validations. Everything should be automated. With so many manual steps involved in a change rollout, there was a lot of room for improvement.
Then, our question was where to begin. I didn’t really know. I felt like there were just so many things we could do: just get one thing done and let’s see the results. Our PM told me to do some customer interviews. I’m glad that I listened to him. I did a lot of customer interviews together with our PM in our group. Customer interviews for internal tools. We talked to a lot of groups, and they were all telling us the same story: tests are, in general, good, but our test environments are not stable. Roblox’s infrastructure is too complicated, and it’s hard to make the test environment stable. Teams today have their own validation steps. It’s really hard to generalize and automate all those validations, because we have the client apps, the platform side, the edge side, the game server side. It’s really hard to generalize those validations for all the teams. We decided our first step would be to automate the canary analysis part. Why?
First, canary analysis happens in production, and production is stable. We don’t need to deal with the test environment problem, at least for this step. Second, it would bring immediate value by improving our reliability. Our reliability was just not good; 20% of incidents were caused by changes. There was a lot of low-hanging fruit there. Let’s start with canary analysis and get the low-hanging fruit done. Third, thanks to my experience working on the telemetry team, we had created common alerts for services, because we knew the common metrics. Those common alerts could be reused for defining the default canary analysis rules. That sounded like a good start.
Again, we came up with a plan. I’m the type of person who likes making plans. Borrowing from my previous experience: essentially, this is still a migration project, migrating from the old deployment tool to the new one with some automation. Step one, we designed our new solution. Our new solution involved a new internal engineering portal for deploying services. Back then at Roblox, before this tool, we didn’t really have a central place where you could view a catalog of services. Everything was just thrown onto one single page. You couldn’t really tell how many services there were, which services were having changes, or who owned what. It was really a mystery back then. We also defined canary analysis by default. Some teams don’t really have canary instances; when they roll out to production, it goes to 100% with one single click.
We also defined a set of default rules that compare canary metrics with non-canary metrics. Those rules were basically based on the default alerts I mentioned previously. Finally, we also designed a new automated canary evaluation engine. Step two, our plan was to develop the new solution and then move to the transition. We were going to enable canary analysis for every service rollout. We were so friendly, we invested so much in customization, we thought people would like us. We also allowed them to choose whether they wanted auto rollback and roll forward. In our design, all of those were considered, because we thought this would make it a very easy and friendly project to roll out. We had all those things considered in our plan. Step three was basically deprecation: disable the old way of doing service rollouts. That was the plan.
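To illustrate what a default canary-versus-baseline rule might look like, here is a minimal sketch with hypothetical metric names and thresholds (not Roblox’s actual evaluation engine): each rule mirrors a common alert and compares an aggregated canary metric against the same metric on the non-canary instances, and a failure would trigger rollback.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str          # e.g. "error_rate", "p99_latency_ms"
    max_ratio: float     # canary may be at most this multiple of baseline

DEFAULT_RULES = [
    Rule("error_rate", max_ratio=1.5),
    Rule("p99_latency_ms", max_ratio=1.3),
]

def evaluate_canary(canary, baseline, rules=DEFAULT_RULES):
    """Return True if the canary passes every rule; False means roll back."""
    for rule in rules:
        base = max(baseline[rule.metric], 1e-9)   # avoid division by zero
        ratio = canary[rule.metric] / base
        if ratio > rule.max_ratio:
            print(f"FAIL {rule.metric}: canary/baseline = {ratio:.2f} "
                  f"(limit {rule.max_ratio})")
            return False
    return True

# Example run over a canary window's aggregated metrics.
passed = evaluate_canary(
    canary={"error_rate": 0.020, "p99_latency_ms": 450},
    baseline={"error_rate": 0.005, "p99_latency_ms": 400},
)
print("roll forward" if passed else "roll back")  # -> roll back (error rate 4x)
```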
As a spoiler, we ended up sticking to this plan, with some hiccups and pushbacks. Again, we got a lot of pushback. This time, the pushback was even stronger than in the telemetry migration project. There were productivity concerns. The screenshot here is a snippet of what the old tool looked like: just two text boxes and a button at the bottom. That’s it. That’s basically all you get for deploying your service. After you deploy, there will probably be a table showing you the progress. That’s it. Everybody goes to this one place. Everyone’s deployment is visible to everyone else. Just to recall, we have 2,000 services. Also, it was quite fascinating: people really stuck to this old tool and thought it gave them a productivity boost because it’s a single click.
With the new tool, we introduced a default 30-minute canary analysis validation phase. When we were debating how long the default duration should be, I told them I was used to a 45-minute canary validation phase. People were like, no, we can only do 10 minutes. We ended up with 30 minutes. That was a fairly arbitrary number we came up with. We got a lot of pushback, and in the end we actually reduced it a little bit to begin with. Then the third pushback, again, was just the new tool and new deployment workflow. People told us that it takes time to learn. We even had a UX designer for this internal tool, and we thought the whole operation, the workflow, was very streamlined, but no, engineers didn’t like it.
Just a high-level view of what happened. Three quarters again; yes, it’s just a magic number. We did it together in three quarters. In summary, again, those are the three pillars that I think played a key role in the success of this project. Reliable: because canary analysis is such an obvious low-hanging fruit, it ended up catching a lot of bad service rollouts to production. We had the metrics to track that. We tracked the services that failed at the canary phase and actually ended up getting rolled back, where that same image version never got rolled out again. We have a way to automatically track that number, and that number was really high. Delightful: the new tools are just so obviously better than the old one. Every service gets its own page. You can clearly see who owns what, which alerts are configured for those services, which alerts are firing, and what the recent changes and deployments are. It’s so easy to tell what’s going on with a service.
After a while, after the initial pushback, people started to like it. We paid very close attention to their feedback. We invested heavily to make sure this was a delightful experience for them. After everyone transitioned to this new tool, we clearly saw a boost in our productivity. In our user surveys, people reported their own productivity improvements or showed their appreciation for the new tools. We also measure how many bad service rollouts actually cause a production incident, and from the drop in incidents, that number was much lower than before.
This is just showing you the canary analysis UI. I didn’t manage to get approval for a full screenshot of what the new tool looks like; this is just the canary analysis part. You can see at the top is the deployment workflow. There is a canary deployment phase that happens automatically for every production rollout, and then there is a canary analysis phase in between. If people click it, they can see a clear analysis in the Grafana charts, the set of rules, the duration, and all the configs that were set for this run.
At the end, there is the production rollout with a clear progress report and easy navigation to the logs and charts, everything. We also allow customized rules and configs. Our engineers really like UIs; somehow they don’t really like command-line tools. We made everything available in the UI. They can just click and edit their custom rules and configs, and when they submit, we automatically create a PR for them. Everything is checked in on GitHub, and every change needs to be reviewed by their teammates.
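As an example of what a checked-in, per-service override might look like, here is a hypothetical sketch (the schema and field names are assumptions, not the real config format): the UI writes a file like this, opens a PR for the team to review, and the effective config is the team’s overrides merged over the defaults.

```python
import json

DEFAULT_CONFIG = {
    "canary_duration_minutes": 30,
    "auto_rollback": True,
    "rules": {
        "error_rate": {"max_ratio": 1.5},
        "p99_latency_ms": {"max_ratio": 1.3},
    },
}

# What a team might override for a latency-sensitive service.
service_overrides = {
    "canary_duration_minutes": 45,
    "rules": {
        "p99_latency_ms": {"max_ratio": 1.1},   # stricter latency budget
    },
}

def merge(defaults, overrides):
    """Overrides win at the top level; nested 'rules' are merged per metric."""
    merged = {**defaults, **overrides}
    merged["rules"] = {**defaults["rules"], **overrides.get("rules", {})}
    return merged

effective = merge(DEFAULT_CONFIG, service_overrides)
print(json.dumps(effective, indent=2))
```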
This is the adoption trend for this project, canary analysis. There are some fun things you can tell from this trend. You can see that at the beginning, the slope was very shallow. We were having a really hard time getting early adopters. We talked to people. We tried top-down and also bottom-up approaches. Talking to people: “Try our new tool. It’s really cool. Get your team on board”. People would say yes, but then their whole team would still be on the old tool. We also tried the top-down approach, talking to managers: “Your team owns those services. Those service rollouts are crucial to the success of the company. You have to use this new tool to ensure your service rollouts are reliable”. It didn’t work well either. What ended up working well was that we collaborated with our reliability team and with SREs.
Then, when there was another incident caused by a service rollout, we would jump into that incident call and tell them how canary analysis could have prevented the issue. Of course, that was after the mitigation was done. We jumped into those calls. We also joined the postmortem process and basically made onboarding onto automated canary analysis part of the action items. With that traction from the critical services and the incidents, we got those services on board; the good thing about incidents is that they get a lot of awareness, a lot of people pay attention to them. Then there are the people who, basically, became aware of the tool and came to us on their own. We also got our own team and our peer teams to try the tool.
From all those directions, we slowly started to increase adoption of our new tool. I think this big jump was actually because a reliability initiative announcement was sent out over email, basically saying, you should all be using automated canary analysis. Those are examples where we could actually have prevented incidents with automated canary analysis. Both bottom-up and top-down, that’s how we got our adoption. Over time, we started to see a lot of self-adoption.
Basically, this is just organic growth. We also changed our onboarding materials for new hires; basically, we changed that page to use this new tool. We paid a lot of attention to the details. New hires just like this new tool; they will never go back to the old one. As of today, it’s already used for 100% of service rollouts. In this case, I don’t think anyone is missing the old tool. With the telemetry use case, there are still people missing the old one. This one, no.
What’s next? Canary analysis was truly the low-hanging fruit, I think, because now we do need to deal with the environment problem. I do think creating a stable test environment, being able to deal with all those version conflicts, and running integration tests with real dependencies is a more complicated problem to solve, in my opinion. Canary analysis was a good place to start. What’s next? Next year, we are investing heavily in integration tests: helping our engineers write good tests, improving our integration testing environment with micro-environments that allow them to run integration tests with proper dependencies, and adding automated continuous delivery (CD) support.
Summary
Just a reflection on the two migration stories. These are just my personal key takeaways from my experiences. Make production tools highly reliable and available; that’s the basic need. Internal tools really need to be reliable and available. Some of them, like telemetry and the deployment tool, are probably the ones that require the highest availability numbers. Understand what is needed most and where to invest. Thinking back, I feel very glad that I listened to a lot of people’s opinions, like our PM’s advice, and that I listened to our customers.
All those voices really helped us make the decisions. Have a North Star with a clear path on how to get there. Remember those two plan slides? I feel like they were crucial, even though we didn’t really stick to the timelines. We were really off on the timeline side, but we stuck to the plan. Those plans were our North Star to guide us, even when there were pushbacks and slow adoption. In those moments, we needed to believe that what we were doing was really beneficial for the company and for the engineers, and stick to the plan.
The fourth one is: be considerate and don’t be interruptive. Even though these are just internal tools, we need to be really considerate of our internal engineers’ productivity. Roblox today has over 1,500 engineers. That’s a lot of engineers and a lot of productive time. Don’t be interruptive; just don’t force them to change, is my personal takeaway. Finally, a delightful experience is extremely important for internal tools. Internal tools are just like external products now, in my opinion. You need to make them really delightful and easy to use, almost like an external-facing product.
Questions and Answers
Participant 1: How are your rules tuned to handle different cardinality of metrics based on high traffic applications and non-high traffic applications?
Dai: How do I tune that? That’s still an area we’re investing in. Our current approach: one part is setting a limit at every service level. We have monitoring on which service is producing a larger number of unique time series in a very short time period. We’ll get alerted on that, and we’ll actually drop them: unless they fix their metrics, we’ll drop that particular metric from that service. That’s one thing. Also, we have other layers of throttling on the cluster side. We use VictoriaMetrics, and they also have tools for us to use.
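A minimal sketch of the per-service limit idea, with assumed names and thresholds (VictoriaMetrics also ships its own cardinality limits, which is the separate cluster-side layer mentioned above): count unique label sets per service and metric over a window, alert when a limit is exceeded, and drop that metric until the owners fix it.

```python
from collections import defaultdict

SERIES_LIMIT_PER_METRIC = 50_000
dropped = set()                                    # (service, metric) pairs being dropped
seen = defaultdict(set)                            # (service, metric) -> unique label sets

def ingest(service, metric, labels):
    """Return True if the sample is accepted, False if dropped."""
    key = (service, metric)
    if key in dropped:
        return False
    seen[key].add(tuple(sorted(labels.items())))   # a unique time-series identity
    if len(seen[key]) > SERIES_LIMIT_PER_METRIC:
        dropped.add(key)                           # alert the owners, then drop the metric
        print(f"ALERT: {service}/{metric} exceeded {SERIES_LIMIT_PER_METRIC} series")
        return False
    return True

ingest("chat-service", "request_latency_ms", {"route": "/send", "dc": "dc1"})
```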
Participant 1: Have you had false positives where you have a canary run that fails and you investigate and it’s actually a problem with the run itself and not a bug?
Dai: Yes, that happens all the time. We do track false positives and false negatives, to see whether a canary succeeded or failed and whether that version ended up getting rolled out or rolled back. We do check those numbers. Currently we are at about 89%, 90% accuracy; it’s not 100% all the time. To address that issue, it can’t just be our team. We have a very small team supporting this tool, at least at Roblox. We cannot afford to look at every single issue for every service. What we do, and I think it has worked pretty well, is train our users.
Instead of giving them the correct rule or taking a deeper look together with them, we basically send them the guide on how to tune their rules. They know their metrics better. They were the ones who were doing the manual validation, so they know how to set up the rules. Our general philosophy is basically: teach them how to use the tool and how to tune their rules. After that, it’s on them. Also, sometimes it’s better to be more aggressive with those tools.
Participant 2: You talked a lot about pushback from the engineers and the users of the tool. You didn’t talk at all about pushback from the executives or higher ups. When a project goes three times longer than you expected, usually, someone’s going to say, what’s going on here? How did you manage that?
Dai: There was a lot of pressure to manage. I got a lot of direct pings from a very high-level person asking me, why are we doing this project? Also, very high-level people talked to my boss’s boss’s boss, several layers above: can we stop this project? How I handled that was, basically, I did a lot of company-wide sessions, including office hours with executives and very high-level people, explaining to them using numbers rather than perception. The numbers tell them why we need to make this transition: the reliability gains, the cost gains, the engineering efficiency gains. Just try to convince them. I also listened really carefully to their feedback and tried to invest however we could to make their experience better. If they said an annotation on Grafana looked different from the old tool, we spent a lot of time trying to fix those annotations, trying to align the timelines to the second, just to bring the experiences closer. At the end of the day, we’re all human. We spent so much time trying to make them happy, and I think people showed appreciation and understood the importance of our work and the respect we gave to everybody during this migration.
Participant 3: Could you talk a bit about how you made sure that your new telemetry tool was actually giving you the right numbers? Did you have some process to compare the results with the old one?
Dai: Telemetry is basically counters, gauges, and quantiles. For the counters and gauges, I don’t think our old tool was doing anything wrong, and I think we had a good understanding of how the open-source world deals with those ways of measuring metrics and performance. The only thing that differed between the old tool and the open-source world was the quantiles. It’s not magic. You have the raw numbers, so you can do the calculations and get the real quantile distribution. You have the histogram-based calculation, and also the old tool’s quantile-over-quantile approach. You basically compare all three numbers. It wasn’t that hard to tell which one is closer to the real quantile distribution.
Participant 3: Was it a manual validation or did you automate it?
Dai: We had automated validation and also a manual one. The manual one was to prove that the old way was wrong. The automated one was to prove that the new tool and the old tool produced similar metric output, and that for the differences, which were basically the quantiles, the new tool is correct.
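A rough sketch of what such an automated consistency check could look like, under assumed names (not the actual validation code): query the same series from both pipelines and flag any data points whose relative difference exceeds a tolerance, with the quantile metrics handled separately since the old math was known to be wrong.

```python
def compare_series(old, new, tolerance=0.01):
    """Yield (index, old, new) for points where the relative difference exceeds tolerance."""
    for i, (a, b) in enumerate(zip(old, new)):
        denom = max(abs(a), abs(b), 1e-9)
        if abs(a - b) / denom > tolerance:
            yield i, a, b

old_qps = [120.0, 118.5, 121.0, 119.8]
new_qps = [120.0, 118.5, 130.0, 119.8]            # one bad point for illustration
for i, a, b in compare_series(old_qps, new_qps):
    print(f"mismatch at point {i}: old={a}, new={b}")
```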
Participant 4: You talked about build versus buy in the first migration, but I was interested if you ever revisited that as you were building, particularly the canary capability.
Dai: Yes, I revisited it. I think it’s a question without a perfect answer. It’s really case by case; sometimes it’s company by company. When you want something quick, buying seems very feasible. It can give you something really quickly and solve your current problem. As you get bigger, the costs get higher, and when you reach a certain number, you take a step back and rethink the solutions.