Transcript
Narayanan: My colleague Karthik and I are going to walk you through, as part of this session, how we solved for scaling the Uber Eats backend and infra architecture so that we can infinitely scale the number of merchants that are deliverable to any particular eater. I’m Janani Narayanan. I’m a senior staff engineer, and I’ve been working on search, ranking, and recommendations for the past few years.
Ramasamy: I’m Karthik Ramasamy. I’m a senior staff engineer. I work in the search platform team. Our team builds the search solutions that power various search use cases at Uber, including Uber Eats, location search in rides, and other things.
Narayanan: A fun Easter egg here. If you have your Uber Eats app open, you could go there into the search bar and then look up who built this, and you can see the list of all of the engineers who are part of the team.
nX Selection
Which of the following is not among the top 10 searches in Uber Eats? It was unexpected for me when I first saw this. It was actually Mexican. Apparently, not many people care about eating healthy when they order from outside.
The problem that we want to talk about as part of this is, how did we go about expanding selection by nX? Before we go into what nX means, I want to spend a little bit of time talking about what selection means. It means different things to different people. If I were to talk to the merchant onboarding team, the operations folks, they would say that onboarding as many merchants as possible is the success metric; that is considered good selection. If I were to speak to eaters, different kinds of eaters will have different answers to this. Someone could say, I care about getting my food within a particular ETA, so that is good selection. Or, if my favorite restaurant is on this platform, then this platform has good selection.
Some other folks might say that good selection means discovering new stores they wouldn’t normally have found. Instead of going outside the app, or relying on word of mouth and then coming back to the app, if the app itself can curate my preferences based on my order history and my search history (you know everything about me, so why don’t you give me something that I haven’t tried before, surprise me), that is considered good selection. What we are here to talk about is, setting the real-world aspects of Uber Eats aside, what technical challenge could we solve so that, as restaurants and stores get onboarded onto this platform, eaters have access to more selection, more restaurants available to them, when they look things up in any of the discovery surfaces.
To get a sense of how the business is growing, how the market is growing: just before the pandemic and through the course of the pandemic, the business has expanded and branched out into multiple different lines of business. Why this is important is because that is all part of the scale we were trying to solve for. It is not just about restaurants. Starting from the pandemic, we have grocery stores, then retail stores, people who are trying to deliver packages; all of these are part of the same infra, the same ranking and recommendation tech stack, which powers it under the hood. Why this matters is that, up until now, we have been talking about it in terms of restaurants and stores, but the indexing complexity comes in because, in the case of a restaurant, a single document would probably have 20 or 30 items, and that’s about it.
If we think about grocery as a line of business, there are going to be 100,000 SKUs for each and every store, and all of those items also need to be indexed. Onboarding a single grocery store is very different in terms of scale in comparison with onboarding a restaurant. Another example is, before we started working on this, people were able to order from restaurants which were 10 to 15 minutes away from them. Now, you could order from a restaurant sitting in San Francisco and have it delivered all the way to Berkeley.
Let’s say you want to order something from Home Depot, and the item that you’re looking for is not here but it is somewhere in Sacramento: you should be able to order it from Uber Eats and then get it delivered to you. That is the breadth of the lines of business that we wanted to unlock, and also the different challenges in terms of scale that each line of business poses for us. With that in place, in terms of selection, we are specifically focusing on the quantity of selection that is exposed to the eater when they go to any of these discovery surfaces. The personalization aspect of it is a completely different topic altogether.
Discovery Surfaces (Home Feed, Search, Suggestions, Ads)
What do we mean by discovery surfaces? Let’s start with terminology. There are four different discovery surfaces. Which of the surfaces do you think most of the orders come from? Home feed. Depending on whether it is e-commerce or an online streaming service, the surface is going to change; for the delivery business specifically, it is the home feed. In terms of the tech stack that we are looking at, there are different entry points to this, different use cases that we serve as part of the work that we did. If we take search, for example, there are different kinds of search: restaurant name search, dish search, cuisine search. The home feed has multiple different compartments. There are the carousels, which are all thematic, based on the user’s order history.
Then we have storefronts, which is a list of stores based on the location. At the top of your home feed, if you look at your Uber Eats app, there are shortcuts, which are either cuisine based or promotion based and whatnot. All of these entry points need to work in cohesion. In other words, it shouldn’t matter whether someone goes through suggestions, where someone searching for pasta is also shown pastrami based on the lookahead search, or searches for McD and we also want to show Burger King as part of the suggestions that come up. All of these different use cases need to be addressed as part of this. Because if I’m able to find a store or a restaurant through my home feed, but I’m not able to locate it through my suggestions at the same time, that is considered a poor customer experience. We needed to address all parts of the tech stack in one go, in one XP.
Overall Architecture – Infra Side to Application Layer
Given this, let’s take a look at the overall architecture from the infra side to the application layer. The leftmost piece is the infra layer. It is the corpus of all of our stores and indexes; how we index and how we ingest all goes into that. Then we have the retrieval layer, which is where the application layer, my team, and my org start. The retrieval layer focuses on optimizing for recall: the more stores that we can fetch, the more stores that we can send to the next set of rankers so they can figure out what is an appropriate restaurant to show at that time.
The first pass ranker is looking for precision and efficiency. What this means is that as the restaurants or stores are fetched, we want to start looking at how we do a lexical match, so the user’s query and the document are matched as closely as possible in terms of relevance. Then we have the hydration layer, where a lot of the business logic comes into the picture: does it have a promotion, does it have membership benefits involved, is there a BOGO, buy one get one, offer that we could present, and whatnot? ETD information, store images, all of those things come into the picture there. Then we have the second pass ranker, which optimizes for precision. This is where a lot of the business metrics get addressed. We look at conversion rate. We also look at, given the previous order history and all of the other things that I know from different surfaces of interaction with this eater, how do I build the personalization aspect of it so we will be able to give more relevant results to the eater.
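To make the flow concrete, here is a minimal Java sketch of that staged pipeline: retrieval optimizing for recall, a first pass ranker for lexical precision, hydration for business data, and a second pass ranker for personalization. The type and method names are hypothetical placeholders, not Uber's actual interfaces.

```java
// Illustrative sketch of the staged pipeline described above; all type and
// method names are hypothetical, not Uber's actual interfaces.
import java.util.List;

interface Retriever {               // optimizes for recall
    List<StoreDoc> retrieve(Query q, Location eater);
}
interface FirstPassRanker {         // lexical match, precision and efficiency
    List<StoreDoc> rank(Query q, List<StoreDoc> candidates);
}
interface Hydrator {                // promotions, membership, ETD, images
    List<HydratedStore> hydrate(List<StoreDoc> ranked);
}
interface SecondPassRanker {        // conversion rate, personalization
    List<HydratedStore> rank(EaterProfile eater, List<HydratedStore> stores);
}

record Query(String text) {}
record Location(double lat, double lng) {}
record StoreDoc(String storeId, double lexicalScore) {}
record HydratedStore(StoreDoc doc, boolean hasPromo, int etdMinutes) {}
record EaterProfile(String eaterId) {}

final class DiscoveryPipeline {
    private final Retriever retriever;
    private final FirstPassRanker fpr;
    private final Hydrator hydrator;
    private final SecondPassRanker spr;

    DiscoveryPipeline(Retriever r, FirstPassRanker f, Hydrator h, SecondPassRanker s) {
        this.retriever = r; this.fpr = f; this.hydrator = h; this.spr = s;
    }

    List<HydratedStore> serve(Query q, Location loc, EaterProfile eater) {
        List<StoreDoc> candidates = retriever.retrieve(q, loc);   // recall
        List<StoreDoc> topK = fpr.rank(q, candidates);            // precision + efficiency
        List<HydratedStore> hydrated = hydrator.hydrate(topK);    // business data
        return spr.rank(eater, hydrated);                         // final precision
    }
}
```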
Given this overall tech stack, I want to give a sense of its scale. Particularly, I would like to draw your attention to two of the line items here. One is the number of stores which are part of the corpus: millions of them.
The other one is the number of matched documents that we fetch out of the retrieval layer. When you look at it, the scale sounds small: there are only thousands of documents which are matched. What matched means is, when I look at the eater’s location, how many stores can deliver to that eater’s location? When we had these tens of thousands of them when we started, we said, if we wanted to make it nX or increase it more, all that we needed to do is fetch more. Let’s just increase our early termination count, let’s fetch more candidates, and then go from there. We did a very basic approach of fetching 200 more stores, a two-week XP, and it blew up in our faces: we saw that the P50 latency increased by 4X. In this example, we can see that the red dot is where the eater is, that is the eater’s location, and as it expands, which is the new red dots that we started adding, that is where the latency started increasing. This is a serious problem. Then we needed to look into where exactly the latency was coming from.
The root cause, as we started diving into it, had multiple aspects, and we needed to look at different parts of the tech stack to understand how design decisions made in, let’s say, ingestion impact the query layer, and how some of the mechanisms that we have in the query layer don’t really gel well with our sharding strategy. It was a whole new can of worms that we opened as we started looking into it. First, we started with benchmarking. If there’s a latency increase, especially in the retrieval layer, let’s figure out where exactly it is coming from.
In the search infra layer, we added a bunch of different metrics. Depending on whether we were looking at grocery or Eats, there was one step in particular which stood out: we match a particular document to the query, put the matched document into a bucket, then move on to the next document. Iterating to the next document that matches took anywhere between 2.5 milliseconds for grocery and 0.5 milliseconds for Eats, and that was unexplainable to us at the time. It is supposed to take nanoseconds, especially if you have an optimized index. So this became a problem area that we needed to start looking into.
The other area that I want to talk about is how we ingest the data, and the pieces will fall into place in the next few slides. For those of you who follow Uber’s engineering blogs, you would already be familiar with the fact that Uber does most of its geo-representation using H3. The H3 library is what we use to tessellate the world and to make sense of the different levels that we have in place. Depending on the resolution, the different levels optimize for different behaviors that we want for the eaters and the merchants.
Given this, we represent any merchant and its delivery area using hexagons, to say that merchant A can deliver to locations A, B, and C, using hexagons at those resolutions. How this gets indexed: if we take this example, where we have restaurants A, B, and C, and the hexagons are delivery areas which are numbered, the index will be a reverse lookup, where, going by the hexagons, we would say that in this hexagon, two or three different restaurants can deliver to me. Fairly straightforward so far.
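As a rough illustration of that reverse lookup, here is a minimal sketch in plain Java. It assumes H3 cells are represented as opaque long IDs; the class and method names are hypothetical.

```java
// Minimal sketch of the hexagon -> deliverable stores reverse lookup described
// above. H3 cells are represented as opaque long IDs; names are illustrative.
import java.util.*;

final class DeliveryIndex {
    // hexagon cell ID -> stores that can deliver into that hexagon
    private final Map<Long, Set<String>> cellToStores = new HashMap<>();

    // Ingestion: a store declares the set of hexagons it can deliver to.
    void indexStore(String storeId, Collection<Long> deliverableCells) {
        for (long cell : deliverableCells) {
            cellToStores.computeIfAbsent(cell, c -> new HashSet<>()).add(storeId);
        }
    }

    // Query: which stores can deliver to the eater's hexagon?
    Set<String> storesDeliveringTo(long eaterCell) {
        return cellToStores.getOrDefault(eaterCell, Set.of());
    }

    public static void main(String[] args) {
        DeliveryIndex index = new DeliveryIndex();
        index.indexStore("restaurantA", List.of(1L, 2L, 7L));
        index.indexStore("restaurantB", List.of(2L, 7L));
        index.indexStore("restaurantC", List.of(3L));
        // Prints restaurantA and restaurantB (order may vary).
        System.out.println(index.storesDeliveringTo(7L));
    }
}
```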
From here, now that we understand how the index layout looks, this is the second problem that we identified as part of selection expansion. At the time of ingestion, we had this concept of close by and far away, and that is the column that we used to ingest the data. At a store level, the upstream service made the decision: I’m going to give you a list of stores and the deliverable hexagons, and I’m going to categorize them as close by and far away. When we did that, if we look at this example, hexagon 7 considers both A and B as far away. If we look at the real data, B is actually much closer by in comparison with A, but the ingestion doesn’t have that information.
Naturally, the query stage also doesn’t have this information. Only the upstream service had this information, which we lost as part of this. Without that ETD information, we were treating A and B the same at ranking time, and that was another problem. In terms of search area, even though we say that we’ve only increased by 200 restaurants, going from, let’s say, 5 kilometers to 10 kilometers and so on means that we are increasing the area by the square of the radius. The search space grows quadratically: doubling the radius quadruples the area, even though we say that we’re only trying to deliver from 10 miles to 12 miles or 15 miles, and whatnot. This meant that we were processing a much larger number of candidates, which ties into why going from one document to the next was taking such a long time.
The next thing is the store distribution. If we draw simple concentric circles around the eater’s location, the Lat-Long, what we can see is that as we start expanding further into more geos, the distribution of stores in the innermost circle versus the outer circles is going to be somewhere around a 1:9 ratio. We will get many more faraway stores than close-by stores, and ranking them becomes harder.
Another thing to note is, if we find a restaurant which has a much higher conversion rate because that restaurant is more popular and whatnot, but it is in the second or third circle, then it is highly likely that in the second pass ranker, that store will get higher precedence because it has a higher conversion rate. In reality, people want to see more of their close-by stores, because a burger is a burger at some point. That was one of the problems that we saw as we started fetching more stores: good stores were trumping the close-by stores, and the ranking also needed to account for that.
Search Platform
Ramasamy: Next, we’ll share some insights about the search platform that powers Uber Eats search. Then we will talk about some optimizations that we did to improve the retrieval limit and also the latency. How much traffic does Uber Eats get per day? It’s tens of millions of requests per day. Uber has an in-house search platform that is built on Apache Lucene. We use a typical Lambda architecture for ingestion: we have batch ingestion through Spark, and then we have real-time ingestion through the streaming path.
One of the notable features that we support in real-time ingestion is priority-aware ingestion. Callers can prioritize requests, and the system will give precedence to the higher-priority requests to ensure a high degree of data freshness for them. At Uber, we use geosharding quite heavily. This is because most of our use cases are geospatial in nature. I’ll share some insights on some of the geosharding techniques that we use at Uber. Then, finally, we build custom index layouts and query operators that are tuned for the Uber Eats use cases, taking advantage of offline document ranking and early termination to speed up the queries.
Here’s the architecture overview of the search platform. There are three key components here. The first one is the batch indexing pipeline. The second component is the streaming or real-time updates path. The third component is the serving stack. We start with the batch ingestion. Usually, these are Spark jobs that take the data from the source of truth, convert it into search documents, partition them into shards, and then build Lucene indexes in Spark. The output of the Spark jobs is the Lucene indexes, which then get stored in the object store.
Then updates are constantly consumed through the streaming path. There is an ingestion service that consumes the updates from upstream, again converts them into search documents, then finds the shard each document maps to and writes to the respective shard. One thing to note here is that we use Kafka as the write-ahead log, which provides several benefits. One of them, which we talked about earlier, is implementing priority-aware ingestion; because we use Kafka as the write-ahead log, it enables us to implement such features. It also allows us to implement replication and other things using Kafka.
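Here is a hedged sketch of what priority-aware ingestion can look like on the consuming side: updates drained from the write-ahead log are applied in priority order, with ties broken by log sequence. The Kafka consumption itself is elided, and the names are illustrative rather than the platform's actual API.

```java
// Hedged sketch of priority-aware ingestion: higher-priority updates are applied
// to the index first. The Kafka write-ahead-log consumption is elided; names are
// illustrative, not the platform's actual API.
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

record UpdateEvent(String docId, int priority, long sequence) {}

final class PriorityIngestor implements Runnable {
    // Higher priority first; ties broken by WAL sequence to preserve ordering.
    private final PriorityBlockingQueue<UpdateEvent> queue = new PriorityBlockingQueue<>(
            1024,
            Comparator.comparingInt((UpdateEvent e) -> e.priority()).reversed()
                      .thenComparingLong(UpdateEvent::sequence));

    // Called by the consumer thread reading the write-ahead log.
    void submit(UpdateEvent event) {
        queue.put(event);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                UpdateEvent next = queue.take();   // blocks until an update is available
                applyToIndex(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void applyToIndex(UpdateEvent event) {
        // Convert to a search document and write to the shard (elided here).
        System.out.println("Indexing " + event.docId() + " at priority " + event.priority());
    }
}
```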
The searcher node, when it comes up, takes the index from the remote store, catches up on the updates from the streaming path via the write-ahead log, and then exposes query operators to run the search queries. There is another component here called the aggregator service. It is a stateless service. Its main responsibility is to take the request from upstream, find the shard the request maps to, send it to the respective searcher node, and execute the queries; it also aggregates the results and sends them back to the caller if there are query fanouts and things like that. That’s the high-level overview of the search platform that powers Uber Eats search.
Sharding Techniques
Next, I will talk about the sharding techniques that we use. As we said earlier, most of our queries are geospatial in nature: find me restaurants for given coordinates, or find me grocery stores for given coordinates. We use geosharding to make these queries more efficient. The main advantage of geosharding is that we can locate all the data for a given location in a single shard, so that queries execute in a single shard.
At scale, this is quite important, because if you fan out the request to multiple shards, then there is an overhead of overfetching and aggregating the results, which can be avoided by using geosharding. The other benefit is first pass ranking can be executed on the data nodes. The reason being that the data node has the full view of the results for a given query, and then you can push the first pass ranker down to the data node to make it efficient. The two geosharding techniques that we use are latitude sharding and hex sharding. I’ll talk about both of them in detail.
Latitude sharding works this way: you imagine the world as slices of latitude bands, and each band maps to a shard. The latitude ranges are computed offline; we use a Spark job to compute them. The way we compute them is a two-step process. First, we divide the map into several narrow stripes; you can imagine this on the order of thousands of stripes. Then we group the adjacent stripes to get roughly equal-sized shards. In the first step, we also get the count of documents that map to each narrow stripe.
Then we group the adjacent stripes such that you get roughly equal-sized shards, N being the number of shards here. There’s a special thing to note here: how we handle the documents that fall on the boundary of the shards, which is the green zone in this picture. Those are documents that fall on the boundary of two shards. What we do is index those documents in both of the neighboring shards. That way, a query can go to a single shard and get all the documents relevant for the given query. The boundary, or buffer degree, is calculated based on the search radius. We know that the queries are, at most, going to go out to a 50-mile or 100-mile radius.
Then we find the latitude degrees that map to that radius, and that’s the buffer zone. Any document that falls in the buffer zone is indexed in both shards. With latitude sharding, we get the benefit of cities from different time zones getting co-located in the same shard. In this example, you can see European cities and cities from America mixed in the same shard. Why is this important? This is because traffic at Uber follows the sun pattern, where activity is higher during the day and slows down during the night. This sharding naturally avoids clustering cities with the same busy hours in the same shard.
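A minimal sketch of the latitude-sharding idea, assuming per-stripe document counts have already been computed offline (for example by the Spark job mentioned above): adjacent stripes are grouped into roughly equal shards, and a document within the buffer degrees of a boundary is assigned to both neighbors. All names are illustrative.

```java
// Sketch of grouping narrow latitude stripes into roughly equal-sized shards,
// given per-stripe document counts computed offline. A document within
// bufferDegrees of a shard boundary is indexed into both neighbors.
// All names are illustrative.
import java.util.ArrayList;
import java.util.List;

record LatShard(double minLat, double maxLat) {}

final class LatitudeSharding {
    // stripeCounts[i] = number of docs in stripe i; stripes evenly cover [-90, 90].
    static List<LatShard> buildShards(long[] stripeCounts, int numShards) {
        long total = 0;
        for (long c : stripeCounts) total += c;
        long targetPerShard = (long) Math.ceil(total / (double) numShards);

        double stripeHeight = 180.0 / stripeCounts.length;
        List<LatShard> shards = new ArrayList<>();
        double shardStart = -90.0;
        long running = 0;
        for (int i = 0; i < stripeCounts.length; i++) {
            running += stripeCounts[i];
            boolean lastStripe = (i == stripeCounts.length - 1);
            if ((running >= targetPerShard && shards.size() < numShards - 1) || lastStripe) {
                double shardEnd = -90.0 + (i + 1) * stripeHeight;
                shards.add(new LatShard(shardStart, shardEnd));
                shardStart = shardEnd;
                running = 0;
            }
        }
        return shards;
    }

    // A document near a boundary is written to both neighboring shards.
    static List<Integer> shardsForDoc(double lat, List<LatShard> shards, double bufferDegrees) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < shards.size(); i++) {
            LatShard s = shards.get(i);
            if (lat >= s.minLat() - bufferDegrees && lat < s.maxLat() + bufferDegrees) {
                result.add(i);
            }
        }
        return result;
    }
}
```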
That helps us a lot in managing the capacity and so on. We also see some problems or challenges with latitude sharding. One of them is that the bands are too narrow at the center. That’s because the cities are denser in this space, and you reach a point in some use cases where it’s difficult to divide further, especially considering you have a buffer zone. Over time, the shards become uneven, and some shards, especially towards the center, are larger compared to the rest. This creates problems: your index builds take a longer time because you’re bound by the largest shard, and those shards also experience higher latencies.
The optimization for this problem is hex sharding. With hex sharding, we imagine the world as tiles of hexagons. As Janani said, at Uber we use the H3 library very extensively. The H3 library provides different resolutions of hexagons. The lowest resolution, which means larger hexagons, results in about 100 tiles for the whole world. The highest resolution results in trillions of tiles. Selecting the right resolution is key for using hex sharding. We use some observations and empirical data to decide the hex sizes. At Uber, we generally use hex resolution 2 or 3 for hex sharding.
Again, we use the same approach of offline jobs to compute the shard boundaries. Basically, we pick a resolution, we compute the number of docs that map to each hexagon at that resolution, and then group them into N roughly equal shards using bin-packing. We also handle the buffer zones similarly to latitude sharding. In hex sharding, you have to imagine the buffer zones in terms of hexagons as well. The key here is to choose hexagons that are smaller than the main-resolution hexes for the buffer zones. Then you index the documents that fall in the buffer zone in both hexes. In this case, the right-side shard shows the main blue area as the main hexagon, and outside are the buffer-zone hexagons that get indexed into it as well, to avoid cross-shard queries. That’s the detail on sharding and the architecture of the search platform.
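And a comparable sketch for hex sharding, assuming per-cell document counts are available offline: cells are assigned to N shards by greedy bin-packing on document count. The buffer-zone handling at a finer resolution is elided, and the names are illustrative.

```java
// Sketch of assigning H3 cells to N shards by greedy bin-packing on offline
// document counts: each cell goes to the currently lightest shard. Buffer-zone
// handling (indexing boundary docs into neighboring shards at a finer
// resolution) is elided. Names are illustrative.
import java.util.*;

final class HexSharding {
    /** cellDocCounts: H3 cell ID -> number of documents in that cell. */
    static Map<Long, Integer> assignCells(Map<Long, Long> cellDocCounts, int numShards) {
        // Sort cells by descending doc count so the largest cells are placed first.
        List<Map.Entry<Long, Long>> cells = new ArrayList<>(cellDocCounts.entrySet());
        cells.sort(Map.Entry.<Long, Long>comparingByValue().reversed());

        long[] shardLoad = new long[numShards];
        Map<Long, Integer> cellToShard = new HashMap<>();
        for (Map.Entry<Long, Long> cell : cells) {
            int lightest = 0;
            for (int s = 1; s < numShards; s++) {
                if (shardLoad[s] < shardLoad[lightest]) lightest = s;
            }
            cellToShard.put(cell.getKey(), lightest);
            shardLoad[lightest] += cell.getValue();
        }
        return cellToShard;
    }
}
```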
Solution
Next, we will talk about some specific optimizations we did for the Uber Eats use case, taking advantage of the query patterns and other data from the use case to improve the recall and also reduce the latency. The first thing that we did is build a data layout that can take advantage of the query patterns. I will share a couple of data layouts that we used, one for the Eats use case and another for the grocery use case, and walk through how those layouts helped us improve the latency. The second technique we’ll talk about is how we took the ETD information that Janani was talking about earlier and indexed it into the search index, and then how we divide the search space into non-overlapping ranges and execute them in parallel to improve the latency. Finally, we’ll talk about how we moved some of the computations that were happening at query time, such as the far away versus nearby computation that Janani was talking about earlier, into the ingestion path, and how that helped improve the recall and the latency.
Data Layout
I will go through the data layout. This is the data layout that we use for the Eats index. If you look at the Eats query pattern to begin with, you are basically looking for restaurants, or items within the restaurants, for a given city. We used this insight to organize the documents in the index.
In this case, we take the city and we co-locate all the restaurants for a given city first. You can see McDonald’s and Subway; they’re all the restaurants under the city, SF. We then order the items, or menus, under those restaurants in the same order. You go with this order: city, followed by all the restaurants in that city, and then the items for each of the restaurants in the same order as the stores. The benefit we get is faster iteration. A query comes from SF; you can skip over all the documents of other cities that may be in the shard, move the pointer right to SF, and then find all the stores. That makes the query faster. The other nice benefit comes when your data is denormalized; in our case, we sometimes denormalize all the store fields into the items as well.
In that case, you have a lot of common attributes across the docs. The item docs for a store will have similar store-level attributes adjacent to each other. This provides a better compression ratio, because Lucene uses delta encoding, and if you have very sequential doc IDs, your compression is better. Then, finally, we also order the documents by static rank, which helps us early terminate the queries once we reach the budget.
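Here is a minimal Lucene sketch of that kind of clustered layout, using index-time sorting to order documents by city and then by a static store rank. The field names and schema are illustrative assumptions, not the production index definition.

```java
// Minimal Lucene sketch of the clustered layout: documents are sorted at index
// time by city, then by an offline static store rank, so a query for one city
// can skip other cities and early-terminate on rank. Field names are
// illustrative, not the production schema.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

final class ClusteredEatsIndex {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Cluster by city first, then by static rank (lower = better).
        config.setIndexSort(new Sort(
                new SortField("citySort", SortField.Type.STRING),
                new SortField("staticRank", SortField.Type.LONG)));

        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            writer.addDocument(storeDoc("SF", "McDonald's", 1));
            writer.addDocument(storeDoc("SF", "Subway", 2));
            writer.addDocument(storeDoc("Oakland", "Burger King", 1));
        }
    }

    private static Document storeDoc(String city, String store, long staticRank) {
        Document doc = new Document();
        doc.add(new StringField("city", city, Field.Store.YES));
        doc.add(new StringField("store", store, Field.Store.YES));
        // Doc values back the index sort.
        doc.add(new SortedDocValuesField("citySort", new BytesRef(city)));
        doc.add(new NumericDocValuesField("staticRank", staticRank));
        return doc;
    }
}
```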
Next, I will share a slightly modified version of the index layout that we use for grocery, because of the nature of the grocery data. It’s pretty similar. Again, we first sort by city, then we take the stores and sort by store, with stores ranked by the offline conversion rate. The difference here is that we place all the items of a store next to each other; I will share why that is important. This is how the layout looks: city, then a store and the items of that store, then the second store and its items, then the third store and its items.
One benefit: let’s say you look up a specific posting list with the title chicken. Then you get store 1 and all the items with the title chicken for that store, then store 2 and all its matching items, then store 3. As Janani was saying earlier, grocery scale is very high compared to Eats. You can have hundreds or thousands of items in a single store that match the given title. When you’re executing a query, you don’t want to be collecting all the items from the same store. You can give a budget for each store, and once you reach that limit, you can skip over to the next store. This layout allows us to skip over stores pretty quickly while still collecting enough items from a given store. The other benefit, from the business point of view, is that it lets us get diverse results: your results are not coming from a single store, you also cover all the stores in the search space. That’s the main advantage of this layout.
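A small sketch of the per-store budgeting that this layout enables, written over a plain ordered list of hits rather than real Lucene posting lists; the names and budgets are illustrative.

```java
// Sketch of per-store budgeting over a store-clustered hit stream: because all
// matching items of a store are adjacent, once a store's budget is reached the
// loop can skip ahead to the next store. Plain Java over an ordered list of
// hits; names are illustrative.
import java.util.ArrayList;
import java.util.List;

record Hit(String storeId, String itemId) {}

final class PerStoreBudgetCollector {
    /** hits must be ordered store by store, as in the clustered grocery layout. */
    static List<Hit> collect(List<Hit> hits, int perStoreBudget, int totalBudget) {
        List<Hit> collected = new ArrayList<>();
        String currentStore = null;
        int currentStoreCount = 0;
        for (Hit hit : hits) {
            if (!hit.storeId().equals(currentStore)) {
                currentStore = hit.storeId();
                currentStoreCount = 0;
            }
            if (currentStoreCount < perStoreBudget) {
                collected.add(hit);
                currentStoreCount++;
                if (collected.size() >= totalBudget) break;   // early termination
            }
            // else: skip remaining items of this store (in a real index this would
            // be an advance past the store's doc-ID range rather than a linear scan).
        }
        return collected;
    }
}
```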
Next, here are some interesting numbers that we observed when we moved from an unsorted, unclustered layout to the layout clustered by location and store. Here’s the latency of a single query executed before and after clustering. This query returns about 4K docs. As you can see, the retrieval time before clustering is around 145 milliseconds, and the retrieval time after clustering is 60 milliseconds. It’s about 60% better after clustering the docs based on the query pattern. Down below, the graph shows, per doc ID, the time taken to get each hit in the retrieval loop.
As you can see, before sorting, the hits can take anywhere from 10 to 60 microseconds for a single doc; the latency here is in microseconds. After sorting, the latency is a lot better: each hit takes less than 5 microseconds. Here’s the overall improvement in latency that we observed when we rolled out this clustered index layout. You can see more than 50% improvement in P95, and we saw an equal improvement in P50 latencies as well. The other benefit is that the index size was reduced by 20%.
ETA Indexing
Narayanan: One of the aspects that we talked about as part of ingestion is that the metadata we get from the upstream services was not passed on to the rest of the stack, so we could not do meaningful optimizations on top of it. What this means is that if we take restaurants 1 and 2 in this example, as we index that restaurant 1 can deliver to hexagons 1, 2, 3, 4, we do not know, relative to H1, how far away H2 is, how far away H3 is, and so on. This is important metadata that we needed to pass to the rankers so the rankers can penalize the faraway stores in conjunction with the conversion rate that they have available. This information needed to be passed on from the upstream team altogether. We started off with this. Now that we had one more dimension on which we needed to index the data, we benchmarked a couple of different approaches for having both the hexagon and the ETD indexed and used in the retrieval layer.
What we finally ended up doing is that, after discussions with the product and science teams, we aligned on what ranges make sense in terms of our query pattern, and we said, let’s break them down into a few ranges that overall reflect how people query the ecosystem. We dissected it into multiple time ranges: 0 to 10 minutes, 10 to 20 minutes, 20 to 30, and so on. After we dissected it, we also said, from this eater’s location, let’s say hexagon 1, what are the restaurants which are available in range 1, range 2, range 3, and so on. We did that for every single hexagon available. For those of you who are following along and thinking, I smell something here, aren’t there other ways of doing this?
For example, in this case, there is a tradeoff that we make in terms of where we want the complexity to be: should it be in the storage layer or in the CPU? In this case specifically, if we take a particular restaurant A, that restaurant can be in multiple different hex-ETD ranges. Restaurant A could be 10 minutes from hexagon 1 and 30 minutes from hexagon 2, which means that we store it a couple of times, or multiple times in this case. That is a case where, at the cost of storage and ingestion-level offline computation, we get the benefit of making the query faster.
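Here is a minimal sketch of that tradeoff: each store is written once per (eater hexagon, ETD bucket) pair it falls into, so a range lookup at query time is a direct map access. The bucket boundaries and names are illustrative assumptions.

```java
// Sketch of indexing deliverability per (eater hexagon, ETD bucket): a store
// that is 10 minutes from one hexagon and 30 minutes from another is written
// under both keys, trading storage for cheap range lookups at query time.
// Names and bucket boundaries are illustrative.
import java.util.*;

final class EtdRangeIndex {
    // Buckets: 0-10, 10-20, 20-30, 30+ minutes.
    private static final int[] BUCKET_UPPER_BOUNDS = {10, 20, 30, Integer.MAX_VALUE};

    // (hexagon cell, bucket index) -> stores deliverable within that ETD range
    private final Map<Long, Map<Integer, Set<String>>> index = new HashMap<>();

    static int bucketFor(int etdMinutes) {
        for (int b = 0; b < BUCKET_UPPER_BOUNDS.length; b++) {
            if (etdMinutes < BUCKET_UPPER_BOUNDS[b]) return b;
        }
        return BUCKET_UPPER_BOUNDS.length - 1;
    }

    // Ingestion: store the store once per (hexagon, bucket) pair it maps to.
    void indexStore(String storeId, Map<Long, Integer> etdMinutesByCell) {
        etdMinutesByCell.forEach((cell, etd) ->
                index.computeIfAbsent(cell, c -> new HashMap<>())
                     .computeIfAbsent(bucketFor(etd), b -> new HashSet<>())
                     .add(storeId));
    }

    // Query: fetch one ETD bucket for the eater's hexagon.
    Set<String> storesInBucket(long eaterCell, int bucket) {
        return index.getOrDefault(eaterCell, Map.of())
                    .getOrDefault(bucket, Set.of());
    }
}
```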
Even for this, we tried a couple of different approaches, and we will soon have a blog post which talks more about the multiple alternative benchmarks that we did, where we go in depth into one of the other approaches we tried. We tried a BKD-tree approach to see whether we could do this as a log(n) operation, and also a couple of other approaches, such as: maintain only the hexagons as part of the KD-tree, but in the retrieval layer, make a gRPC call to get the ETD information and then sort it in memory. Will that work? We did a bunch of different things to get there.
Finally, this is what our query layer looks like. Like Karthik mentioned, there is a gRPC layer between delivery and the search platform. We added a new paradigm of range queries, and we now have multiple ranges that we can operate with. This enabled us to leverage the power of parallelization. To visualize this: if a request comes in, let’s say somewhere in the yellow circle, for that request there will be multiple queries sent from the application layer all the way to the storage layer.
One query would be for the yellow layer, which is the closest bucket, another query for the light green, another for the dark green, and so on. This is how we were able to get this nX in selection expansion at constant latency, regardless of which line of business we care about. It involved changes in every single part of the search and delivery ecosystem, multiple services, and multiple engineers to get it to the finish line. After we made all of these changes, we put it into production and we saw that the latency decreased by 50%, which we originally thought was not possible. The cherry on top is that we were also able to increase recall. Before this, we had a different algorithm that queried concentric circles of expanding radius, and in that algorithm we made a tradeoff between recall and latency. In this case, we were able to get more stores, so the rankers see more candidates to optimize over.
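A hedged sketch of the fan-out itself: one sub-query per non-overlapping ETD bucket, executed in parallel and merged, with the actual searcher call abstracted behind a function. The names are illustrative; the real system goes over gRPC to the searcher nodes.

```java
// Sketch of fanning out one request as several non-overlapping ETD range
// queries and merging the results, so the extra selection does not add latency.
// The searcher call is a stand-in; names are illustrative.
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;

final class RangeQueryFanout {
    // (eater hexagon, ETD bucket) -> matching store IDs; in the real system this
    // call goes to the searcher nodes over gRPC.
    private final BiFunction<Long, Integer, Set<String>> searcher;
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    RangeQueryFanout(BiFunction<Long, Integer, Set<String>> searcher) {
        this.searcher = searcher;
    }

    Set<String> search(long eaterCell, List<Integer> etdBuckets) {
        // One sub-query per non-overlapping ETD bucket, executed in parallel.
        List<CompletableFuture<Set<String>>> futures = etdBuckets.stream()
                .map(bucket -> CompletableFuture.supplyAsync(
                        () -> searcher.apply(eaterCell, bucket), executor))
                .toList();

        // Merge results; downstream rankers can still penalize faraway buckets
        // because the real system keeps track of which bucket each store came from.
        Set<String> merged = new TreeSet<>();
        for (CompletableFuture<Set<String>> future : futures) {
            merged.addAll(future.join());
        }
        return merged;
    }
}
```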
One of the other use cases that we haven’t spoken about so far, but which is also important in terms of customer experience, is non-deliverable stores. In Uber, at least on the restaurant side, there can be many cases where you look for a store and it is not deliverable, but it is available for pickup. The reason this exists is marketplace conditions: where the merchants are located, where we can send couriers, the time of day, and so on.
At some times of the day, we won’t be able to deliver from a restaurant, and the deliverability of a particular restaurant is also dynamic. Given this, we still want the eater to know that we do have this restaurant, but that for some other reason we won’t be able to deliver at this particular point in time. We wanted to support this. Even in this use case, we moved a bunch of complexity from the query layer into the ingestion layer. At the ingestion layer, we compute which of these hexagons are deliverable from the store and which of them are only discoverable. We took that discoverable-minus-deliverable set and stored it in the index, so at query time we can quite simply say, OK, it’s either in the deliverable set or in the discoverable set, and I can get it from there.
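A minimal sketch of that ingestion-time split, assuming the upstream service provides both the deliverable and the discoverable hexagon sets for a store; names are illustrative.

```java
// Sketch of the ingestion-time split: the deliverable hexagons and the
// pickup-only ("discoverable minus deliverable") hexagons are precomputed and
// stored, so the query layer only has to check one of two sets. Names are
// illustrative.
import java.util.HashSet;
import java.util.Set;

record StoreVisibility(Set<Long> deliverableCells, Set<Long> discoverableOnlyCells) {

    static StoreVisibility fromUpstream(Set<Long> deliverable, Set<Long> discoverable) {
        // discoverable minus deliverable = show the store, but pickup only
        Set<Long> discoverableOnly = new HashSet<>(discoverable);
        discoverableOnly.removeAll(deliverable);
        return new StoreVisibility(Set.copyOf(deliverable), Set.copyOf(discoverableOnly));
    }

    /** Query-time check: is the store deliverable, pickup-only, or hidden here? */
    String visibilityFor(long eaterCell) {
        if (deliverableCells.contains(eaterCell)) return "DELIVERABLE";
        if (discoverableOnlyCells.contains(eaterCell)) return "DISCOVERABLE_ONLY";
        return "NOT_SHOWN";
    }
}
```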
Key Takeaways
Overall, what we wanted to get out of this is that we started from first principles. When the latency shot up, we did a benchmark to understand where it was coming from, and started to narrow it down to the sharding strategy: I have a San Francisco document and I have a bunch of Japan documents, because Japan has the most concentrated restaurants possible, so if I were to take a San Francisco request and go through a bunch of Japan restaurants, that is obviously going to increase the latency. That is where the millisecond latency in get-next-doc came in. Index layout is one of the most overlooked pieces of software, something we don’t even look at; we needed to spend two to three years understanding the query pattern and then figuring out what we needed to do in our index layout so that it is optimized for the request pattern that we care about.
Then, the sharding strategy needed to be aligned with what we were trying to achieve. We even saw test stores that were part of the production index, which was adding to this latency. There were three times as many test stores as the stores we had originally, and we were processing all of them when we were trying to process a real-time request, so we needed to create a separate cluster for the test stores.
Apart from this, there were a few other expensive operations which used to happen in the query layer. We had some fallbacks available at the query layer: in any distributed system, there are always going to be timeouts, and there is always going to be some data which is not available from the upstream. When those things happen, we used to have a query-layer fallback that said, try to get it from this place, or if you don’t get it from this service, get it from this other service, or from a bunch of other places. We moved all of this fallback logic to the ingestion layer, so at the query layer we just know that I’m going to query and get the data that I need, and all of the corner cases are being handled.
Apart from the parallelization based on ETD, we also had a bunch of other parallelizations: this is a strong match in terms of the query, this is a fuzzy match, and this is an either/or match. Let’s say Burger and King would mean that I’m going to look for stores which have Burger, also look for stores which have King, and then do a match. We did all of these different things to leverage non-overlapping subqueries and get the power of parallelization.
Time to Resolution
How much time do you think was expected to be spent to solve this problem? We spent about two to three months to identify where the problem was, because there are multiple different teams: feed is a separate team, ads is a separate team, suggestions is a separate team, 1000 engineers together. We needed to instrument multiple different parts just to identify the problem, and that took two to three months. It took us four to six months to get to the first round of XP. Right now, I think as of this Q1 or so, we are available in the rest of the world too.
Questions and Answers
Participant 1: You did all this moving of stuff to the ingestion side, is there anything you put in there that you don’t need anymore, that you’d want to go back and take out?
Narayanan: This architecture that we have is also something which is evolving. From that perspective, there are some changes that we made in terms of live ingestion. I’ll give an example. We went in with the idea that many use cases need to be live ingested, and then we realized that there are some cases which don’t even need to be ingested at that time, which also helps us build the indexes faster; the SLAs become better. One thing that we decided to take out later is that when a store moves location, that location update used to be a live ingestion, which would go through Kafka and then get into the index.
Operations said, we need to get it in right after the store moves, and it has to be ingested within milliseconds of latency. When we started understanding more about the use case, there is a time period between when the store decides to move location, when ops gets to know that, and when tech starts moving it. They usually have made this decision two or three months in advance, and we have about a week or two to actually make the transition. With the priority-queue approach in ingestion that Karthik talked about, we decided that we don’t need this as a high priority, because it can go as part of the base index build, which is not going to use my real-time compute resources.
Participant 2: You mentioned the two months to identify the problem, and that it takes two weeks to solve it. What kind of observability platform do you use to measure these metrics? Do you use any OpenTelemetry to identify where those queries are slowing down, and which queries are performing?
Narayanan: The expectation when we started the problem was that we would land it in two weeks, not that it took us two weeks to solve it.
On OpenTelemetry: we have in-house telemetry services in place. In fact, one company was spun out of some of the engineers who worked on the team. We use M3; that is our metrics system, and that is what we integrated with, and Jaeger for tracing. When we started instrumenting, our search infrastructure wasn’t yet integrated with M3, so that was also something we needed to do along the way to get it instrumented and out the door. One reason we hadn’t done that earlier was the memory usage of the sidecar agent; we didn’t want that memory overhead in production. We spun off a separate cluster, which was identical in terms of hardware configuration and capacity, and that is where we did all of our benchmarks so that it didn’t impact production.
Participant 3: You said you use Lucene for indexing. Does that mean that for searching specifically, you have a separate cluster that is specifically used just for searching versus indexing, or is it the same cluster that serves both reads and writes?
Ramasamy: We use the same cluster for search and indexing at this time. If you look at the architecture, we have two components of ingestion: the batch ingestion and the real-time ingestion. What we do is move all of the heavy lifting of indexing to the batch side, and the live or real-time ingestion is kept lightweight. The searcher node is utilized mostly for queries; very little is used for indexing. That’s the current state. We are also working on the next-generation system, where we are going to fully separate the searcher and the indexer.
Participant 4: I would think that most people are querying for food when they’re at home or at work, and so subsequent queries are going to be pretty common. Do you do anything to reduce the search space, effectively caching the hex cells on subsequent calls? For example, if I’m at my house, on the first query you have to go out and do the work to determine what the boundaries are, but on subsequent queries the geography isn’t changing; the only thing that’s changing is the restaurants. Have you looked at that type of caching for subsequent queries?
Narayanan: We do have a cache in place, but not for the purpose that you’re describing. We don’t cache these requests. Throughout a session, we do have some information that we maintain in memory and can serve from there, but we haven’t done a distributed cache there. Many times, we also want to be able to dynamically look at store availability and item availability, which change very often, especially during peak times. People run out of things; restaurants run out of things. Because of that, we intentionally don’t do caching for that particular purpose.
Also, the delivery radius, the deliverability, expands and shrinks based on search, based on whether there are accidents in that area, whether rerouting happens, and whatnot. There is a lot of that in place. If there is an incident, someone could go change, rejigger those delivery zones too. We want that reflected in real time, because the last thing someone wants is to add everything to their cart and then see that the store is no longer available. That is the reason we don’t invest heavily in caching at retrieval time in that part, but we use it for a different part in the query layer.