Transcript
Bhattacharya: We’re going to talk about large-scale recommender and search systems. Initially, I’ll motivate why we need recommendation and search systems. Then I’ll motivate further by giving one example use case from Netflix each for recommendations and for search. Then I’ll identify some common components between these ranking systems, search and recommendation, and what usually ends up happening in a successful deployment of a large-scale recommendation or search system. Finally, I’ll wrap it up with some key takeaways.
Motivation
I think it’s no secret that most products, especially B2C products, have built-in search and recommendation systems. Whether it’s video streaming services such as Netflix, music streaming services such as Spotify, or e-commerce platforms such as Etsy and Amazon, they usually have some form of recommendations and some form of search. The catalogs in each of these products are ever-growing. The user base is growing. The complexity of these models, the architecture, and the overall systems keeps growing.
This whole talk is about trying to motivate, with examples, what it takes to build one of these large recommendation or search systems at scale in production. The reality of any B2C, business-to-consumer, product is that, depending on the product, there could be 100 million-plus users (Netflix has more than 280 million users) and 100 million-plus products. In general, ranking for this many users and this many products at an admissible latency is almost impossible. There are some tricks we use in industry to keep the relevance of the items we show to our users while being realistic about the time it takes to render the service.
Typically, any ranking system, whether it’s a recommendation system or a search system, we break down into two steps. One is candidate set selection, oftentimes also referred to as first pass ranking, wherein instead of those millions of items, you narrow it down to hundreds or thousands of items for the user. That’s called candidate set selection: you are selecting candidates that you can then rank. In this candidate set selection, we typically try to retain recall; we want to ensure it’s a high-recall system. Then, once these hundreds or thousands of items are selected, we have a more complex machine learning model that does second pass ranking. That leads to the final set of recommendations, or the results for a search query, that then gets shown to the user. Beyond this stratification of first pass and second pass ranker, there are many more things that need to be considered.
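As a minimal sketch, with placeholder scoring functions standing in for the real retrieval heuristics and the heavy ML model, the two-stage pattern looks something like this:

```python
def cheap_score(user_id: str, item: int) -> float:
    # Placeholder first-pass heuristic; in practice this could be lexical
    # matching for a query or a simple non-personalized popularity model.
    return (hash((user_id, item)) % 1000) / 1000.0

def expensive_model_score(user_id: str, item: int) -> float:
    # Placeholder for the heavy second-pass ML model's predicted likelihood.
    return (hash((item, user_id)) % 1000) / 1000.0

def candidate_selection(user_id: str, catalog: list[int], k: int = 1000) -> list[int]:
    # First pass: cheap, high-recall scoring narrows millions of items
    # down to hundreds or thousands of candidates.
    ranked = sorted(catalog, key=lambda i: cheap_score(user_id, i), reverse=True)
    return ranked[:k]

def second_pass_rank(user_id: str, candidates: list[int], n: int = 20) -> list[int]:
    # Second pass: a more complex model re-ranks the small candidate set.
    ranked = sorted(candidates, key=lambda i: expensive_model_score(user_id, i), reverse=True)
    return ranked[:n]
```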
Here is an overview of certain components that we think about and look into, irrespective of whether it’s a search system or a recommendation system. First is the first pass ranking that I showed before. The second one is the second pass ranking. For the first pass ranking, typically, depending on the latency requirement, one needs to decide whether we can have a machine learning model or whether we build rules and heuristics: for a query, you can have lexical components that retrieve candidates, versus for recommendation, you can have a simple non-personalized model that retrieves candidates. The second pass ranking is where a lot of heavy machine learning algorithms are usually deployed. There, again, there are many subcomponents: what should the data for training the model be? What are the features, architecture, objectives, rewards? I’m going to go into much more detail on some example second pass rankings.
Then there is a whole system of offline evaluation. What kind of metrics should we use? Should we use ranking metrics? Should we use human-in-the-loop, like human annotation for quality assessment, and so on? Then there is the aspect of biases: when we deploy a system, all our users are seeing the results from that system, so a selection bias creeps in. How we typically address that is by adding some explore data. How can we set up this explore data without hurting the performance of the model? Then there is the component of inference, where once the model is trained, we want to deploy it and do the inference within the acceptable latency and throughput: what is the compute cost, GPU, and so on? Ultimately, any experience in a B2C product, we are typically A/B testing as well.
Then in the A/B test, we need to think about the metrics. I wanted to show this slide first. If you take away one thing, take away this: these are the different things you need to think about and consider. During this talk, I’m going to focus on the second pass ranking, offline evaluation, and inference setup, with some examples. Just in case you were not able to see some of the details in the previous slide, here are the sub-bullet points: data, features, model architecture, evaluation metrics, explore data, all these things are crucial for any of these ranking systems.
Recommendation: A Netflix Use Case
Now let’s take a recommendation use case from Netflix. When a user comes to Netflix, usually there is a lot to choose from, and we often hear in our user research that it sometimes feels overwhelming: how do I find what I want to watch in the moment? Netflix oftentimes has so many recommendations on the homepage that it just feels overwhelming. One approach our members take is to go to search and type something, for example, show me stand-up comedy. The results from these search ranking systems are either personalized or just relevant to the query. I think 60% of our users are on TV, and typing a query there is still a very tedious task. Just to motivate the search use case: most of the discovery on Netflix still happens on the homepage, but 20% to 30% of discovery happens from search, second only to the homepage. There is a nice paper that talks about the challenges of search in the context of Netflix, linked here.
Usually, when a member comes to search, there are three different types of member intent. Either the member directly knows what they want, so they type Stranger Things and specifically want Stranger Things. Versus you know something, but you don’t know exactly what; that’s find intent. Then there is explore intent, where you type something as broad as, it’s a Saturday evening, show me comedy titles. Depending on these different intents, how the search ranking system responds differs.
Going back to the aspect of a member coming from the homepage to search and having to type a long query on a TV remote, which is very tedious: what if we can anticipate what the member is about to search and update the recommendations before the member even starts typing? That’s why I’m referring to this particular example as a recommendation use case, even though it happens after you click on the search page. Internally, we refer to it as pre-query, but in industry it is often also referred to as a no-query system. This is a personalized recommendation canvas that is also trying to capture search intent for a member. Let me motivate the purpose of this canvas a little bit more. On a Thursday evening, Miss M is trying to decide whether she goes to Netflix, HBO, Hulu, and so on.
Then she comes to Netflix because she heard that it has good Korean content. There is a model that understands this member’s long-term preference, but in the moment, she recently heard from her friend that Netflix has some really good Korean content, and her intent changed. What she did is browse some horror titles on the homepage, and then browse some Korean titles on the homepage. She still didn’t find a title she wants to start watching on the homepage, so she went to search. In this moment, if we’re able to understand this user’s long-term preference but also the short-term intent, that she’s looking for a Korean movie, then before she has to type Korean horror or something, we can just update the recommendations to show her a mix of Korean and Korean horror movies on the no-query, pre-query canvas. We can capture her intent without the struggle of typing.
If you imagine building a system like this, there are, of course, modeling considerations, but a large part of it is also software and infrastructural considerations, and that’s what I’m going to try to highlight. This is anticipatory search, because we want to anticipate before the user has to search, based on the in-session signals and browsing behavior from the member’s current session. Overall, pre-query recommendation needs an approach that not only learns from long-term preference but also utilizes short-term preference. We have seen in industry that leveraging browsing signals in the session helps the model capture the user’s short-term intent, while the research question is, how do we balance the short-term and long-term intent and not make the whole recommendation just Korean horror movies for this member?
There are some advantages to these kinds of in-session signals. One is, as you can imagine, freshness: if the model is aware of the user’s in-the-moment behavior, it will not go into a filter bubble of only showing a certain taste, so you can break out of that filter bubble. It can help inject diversity. Of course, it introduces novelty, because you’re not showing the same old long-term preference to the member. It makes titles easier to find, because you’re training or tuning the model to be attuned to the user’s short-term intent. It also helps with user and title cold-starting. Ultimately, as we call it at Netflix, it sparks member joy. We see in our real experience, this is a real production model, that it ultimately reduces abandoned sessions. In the machine learning literature, there is a ton of research on how to trade off user long-term interest against short-term interest.
In chronological order, from research done many years ago to more recent work: earlier we used Markov chain, Markovian, methods; then there are papers that try to use reinforcement learning; and more recently, there are a lot of transformer and sequence models that capture the user’s long-term preference history while also adding short-term intent as part of the sequence, balancing the tradeoff. I’m not going into details about these previous studies, but some common considerations if you want to explore this area are: what sequence length to consider, and how far back in the history we should go to capture user long-term and short-term interest. What is the internal ordering of actions? In the context of e-commerce, for example, purchase is the most valuable action, add-to-cart a little less, and a click might be much less informative than a purchase, and so on for the different types of actions.
What is the solution that we built? I’ll go into the model itself later, but first I wanted to show the infrastructure overview. A member comes to Netflix, and the client tells the server that the member is on the pre-query canvas: fetch the recommendations. In the meantime, as this JIT, just-in-time, server request call happens, we are also, in parallel, accessing every engagement the member did elsewhere on the product. There has to be a data source that can tell us that one second ago the member thumbed up a title and two seconds ago the member clicked on a title. That information needs to come just in time to be sent to the server. We also, of course, need to train future models, so we also set up logging.
Ultimately, this server then makes a real-time call, with the in-session signals as well as the historical information, to the online model, which is hosted somewhere. This online model was trained previously and has been hosted, but it’s capable of taking these real-time signals to make the prediction. Ultimately, this model returns a ranked list of results within a very acceptable latency; in this case, the latency is lower than 40 milliseconds. It then sends the results to the client. In the process, we are also saving the server engagement and client engagement into logging, so that offline model training can happen in the future. There is a paper in RecSys on this particular topic; if you’re interested, feel free to dig deeper. That is the overall architecture.
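Here is a minimal sketch of that just-in-time flow, assuming hypothetical helper functions and a generic `model.score` interface; it is meant only to show the shape of the request path, not Netflix’s actual service code:

```python
import asyncio

async def fetch_in_session_signals(profile_id: str) -> list[dict]:
    # Hypothetical low-latency store of the member's actions this session,
    # e.g. [{"action": "thumbs_up", "video_id": 123, "seconds_ago": 1}, ...]
    return []

async def fetch_long_term_features(profile_id: str) -> dict:
    # Hypothetical precomputed long-term preference features for the profile.
    return {}

async def rank_pre_query(profile_id: str, model) -> list[int]:
    # Fetch both signal sources concurrently to stay within the latency budget.
    session_signals, long_term = await asyncio.gather(
        fetch_in_session_signals(profile_id),
        fetch_long_term_features(profile_id),
    )
    # The online model scores candidates using both short- and long-term
    # context; the request and response are also logged for future training.
    return model.score(long_term=long_term, session=session_signals)
```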
Here are some considerations we had to think through when implementing a system like that. Actually, one of the key things for a system like that was this just-in-time server call. We really have to make the server call, or access the model, when the member is on that canvas, and return the result before the member even realizes it, because we want to take all the in-session signals, the browsing the member did on the product in that session, to the model. Otherwise we lose the context. In the Korean horror movie example, the member is seeing a Korean horror movie and immediately goes to search; if we’re not aware of the interactions the member had on the homepage, we will not really be able to capture the member’s intent, and the recommendations will not be relevant to the member’s short-term intent. The server call pattern is the most important thing we needed to figure out here.
More interestingly, different platforms, I don’t know if that’s the case at other companies, but in this particular case, different platforms had different server call patterns. How do you figure that out and work together with engineers and infra teams to change the service call pattern and make sure the model latency and the end-to-end latency are within acceptable bounds, so that the member doesn’t realize how much is happening within those few milliseconds? Of course, the throughput SLA becomes even more important depending on the platform, the region, and so on. Because we want to do absolutely real-time inference to capture the user’s in-session browsing signals, we had to remove caching: any kind of caching had to be removed, or the TTL had to be reduced a lot. These three components really differentiate work like this from more traditional recommendation, where you can prefetch the recommendations for a member and do offline computation.
The infrastructural and software constraints are much more lenient in a more traditional prefetching recommendation system, whereas this system has to be truly real-time. Then, of course, there are the regular things like logging: we need to make sure client-side and server-side logging is done correctly. Near real-time browsing signals have to be available through some data source, a Kafka stream or something similar, and we have to make sure those streams have very low latency, so the real-time browsing signals become available to the model at inference time without much delay. Ultimately, then comes the model: it needs to be able to handle these long-term and short-term preferences and predict relevant recommendations. There is a reason I ordered the list by priority like that: the first three components, server call, latency, and caching, are really more important than the model itself in this particular case.
What is the model? The model itself in this case is a multi-task learning, deep learning architecture, very similar to a traditional content-based recommendation model, where a bunch of different types of signals go into the model. It’s a few-layered deep learning model with some residual connections and some real sequence information about the user. There is profile context: where the user is from, country, language, and so on. Then there is video-related data as well, things like the tenure of the title, how new or old the title is, and so on. Then there is the synopsis and other information about the video itself. Then, more importantly, there is video-and-profile information, the engagement data. Those are really powerful signals: whether the member thumbed up a title in the past, or whether this is a re-watch versus a new discovery, a new title the member is discovering. In this particular work, we added browsing signals on top of that.
This is where the short-term member intent is captured: in real time, we know whether the member added a title to My List, thumbed up a title, or thumbed down a title. Negative signals are also super important. That immediately feeds into the model at inference time, letting the model trade off between short-term and long-term. We do have some architectural considerations here for trading off between short-term and long-term. Unfortunately, that’s something I cannot talk about; I’m just leaving you with the thought that it’s important to trade off between short-term and long-term in this model architecture. Overall, with this improvement of absolutely real-time inference, as well as the model architecture incorporating in-session browsing signals, offline we saw over 6% improvement, and it is currently 100% in production. For all Netflix users, when you go to the pre-query canvas, this is the model that shows you your recommendations.
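Since the actual architecture isn’t public, here is one minimal, assumed sketch of how a model might blend the two representations, using a learned gate; this illustrates the tradeoff idea only, not the production design:

```python
import torch
import torch.nn as nn

class IntentBlend(nn.Module):
    """Blend a long-term profile embedding with a short-term session embedding."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate learns, per request, how much weight to give the in-session
        # (short-term) signal versus the long-term preference signal.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, long_term: torch.Tensor, short_term: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([long_term, short_term], dim=-1))
        return g * short_term + (1 - g) * long_term
```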
Here is a real example, on a test user. This was the initial recommendation when the user session started. Then the member went and did a bunch of browsing on the homepage, browsing shows and movies with a woman in the lead role. When they came back to the no-query, pre-query page, their recommendations immediately, within that session, got updated to include shows like Emily in Paris, New Girl, and so on, which have a woman in the lead role. Then they went back to the category page or homepage again and browsed some shows related to cooking or baking. Ultimately, in the same session, when they went back to the search page, their recommendations immediately changed. You can see it’s a combination of all three; it’s not just flooding the whole thing with baking shows or something. This is where the tradeoff between short-term and long-term preference comes into play: you want to capture what the member is doing in the session, but you don’t want to overpower the whole recommendation with that short-term intent only.
Challenges and Other Considerations
What were some of the challenges and other considerations we took into account? Something I alluded to in the previous slide: filter bubbles and concentration effects are a big problem, and still an open question in recommendation and search. How do we make sure that when we understand a member’s need, we are not saying this is the only need you have, so that the whole page, the whole product, gets overwhelmed with that one taste profile or one kind of recommendation? Here, the short-term/long-term tradeoff is important, but also explore-exploit and reinforcement learning; these are the areas usually explored to break out of filter bubbles and avoid concentration effects. Because this is such a real-time system, as you would imagine, depending on the latency and the region where the model is served, sometimes there are increased timeouts, which lead to an increased error rate. What we don’t want is for a member to see an empty page.
There was a lot of infrastructural debugging we had to do to keep the error rate and timeouts in check, which included multi-GPU inference, but also thinking about how we deploy the model, plus additional considerations like feature computation: whether some features that don’t need to be real-time can be cached, and so on. Overall, we also want to be careful about not making the recommendations too dynamic. We do want to capture the user’s short-term intent and hence update the recommendations in real time, but we also don’t want to pull the floor out from under the member’s feet by changing the page every time the member comes to that part of the product. We want a tradeoff between how much we change and how much we keep constant. And because it’s such a real-time, dynamic system, it is more difficult to debug.
It also becomes more susceptible to network unreliability, which can ultimately cause a degraded user experience. Another important thing, depending on how you build these short-term, browsing signals: some of these signals are very sparse. How many times do you actually thumb up a show when you enjoy something on Netflix, or Hulu, or elsewhere? Signals like thumbs up, My List add, or thumbs down are usually very sparse. Typically, we need to do something in the model to generalize these otherwise very sparse signals, and make sure the model is not over-anchoring on one signal versus another. That was my recommendation use case.
Defining Ranking – How and When It Was Right
Participant 1: You mentioned ranking, and I’m assuming that after you computed and had a list of things, you had to rank them, because that page is limited, so you can only show so much. How did you go about defining that ranking? How did you know it was right, or when did you know it was right?
Bhattacharya: This is where that happens. When we train the model, that’s the example I shared, the deep learning model with short-term, long-term intent and so on. Then we have offline evaluation. With offline evaluation, we evaluate for ranking. Some metrics for ranking are NDCG and MRR. What ranking typically means is the model generates a likelihood of something; in this case, let’s say the likelihood of playing a title. Then we order by that likelihood in decreasing order and cut it off. Let’s say top 20: if you just want to show the member the top 20 titles, we rank the probability scores and take the top 20 probability scores for a given context. In this case, let’s imagine the context is just the profile ID.
Then we take that as the top-k, and we use some metric, for example NDCG or MRR, to evaluate how well the model is doing. There’s something called a golden test set here: usually we build a temporally independent dataset, temporally independent of the training data, to evaluate how well the model is doing. That’s the offline metric. Then we go to the A/B test, which tells us whether what we saw offline is what our members actually experience. The A/B test gives us the real test.
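For concreteness, here is a small, self-contained sketch of the two ranking metrics mentioned, NDCG and MRR, computed over a model’s top-k ordering; the function names are illustrative:

```python
import math

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """relevances: graded relevance of items, in the model's ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked_hits: list[bool]) -> float:
    """ranked_hits: True at positions the member actually engaged with."""
    for i, hit in enumerate(ranked_hits):
        if hit:
            return 1.0 / (i + 1)
    return 0.0

# Items ordered by predicted play probability, scored against the golden set:
print(ndcg_at_k([3.0, 0.0, 2.0, 1.0], k=4))  # ~0.93
print(mrr([False, True, False]))             # 0.5
```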
Balancing What Is Happening During Searches vs. Tagging with Metadata
Participant 2: Customers are changing the language they use when searching, and the metadata associated with all the content available inside Netflix is changing too, so it seems like there is constant change. Maybe some metadata was missed, because what was being pulled back in terms of recall and precision wasn’t actually matching what the customer’s language was trying to represent. How are you all trying to balance how things are tagged with metadata versus what is taking place during searches?
Bhattacharya: We usually try to incorporate some of that metadata in the model as features, so that the correspondence between the query, or the user’s engagement with other titles, and the metadata is used by the model to learn the connection. Usually the metadata is static, but the query is dynamic. When the query comes in, the titles whose metadata the model thinks are the right relevant results for that query get pulled as the top-k ranking. In general, there are also certain lexical filters and guardrails in the actual production system. There is some boosting or lexical matching that happens as well, to make sure the model does not surface something completely irrelevant to the query or the context.
Search Use Case: A Netflix Use Case
The next use case is a search use case, although it’s really a search and recommendation use case. We built a model called UniCoRn, Unified Contextual Recommender. I’ll get to why it’s called UniCoRn in a couple of slides. Similar to what I already motivated, many products, especially B2C products, have both search and recommendation use cases. In the context of Netflix, here is an example where we have traditional search, which is a query-to-video ranker. For example, if you type P-A-R-I, we want to make sure the model is able to show you Emily in Paris or other P-A-R-I-related titles.
Then there is the purely recommendations use case, which is what we saw in the previous slides; the example is no-query or pre-query. Then there are other kinds of recommendations, such as title-to-title, video-to-video recommendation. In the context of e-commerce the canvas looks like this, or in the context of Netflix it looks like this, where you click on a title, here Emily in Paris, and you see other titles similar to it. That’s a recommendation use case as well.
The overarching thesis for this work was: can we build a single model for both the search and recommendation tasks? Is it possible? Do we need different bespoke models, or can we build one model? The answer is yes, we can build one model; we don’t need different models, because search and recommendation are two sides of the same coin. They are ultimately ranking tasks: for a given context, we want the top-k results that are relevant to that context. What really changes is the context. Part of this example is how we went about identifying the differences between search and recommendation tasks, and how we built one model.
What are the differences between search and recommendation tasks, typically? The most important difference is the context itself. When we think about search, we think about the query: we type a query, we see results. The query is the context. Whereas when we think about recommendation, we usually think about the person; it’s usually personalized, so the profile ID is the context. Similarly, for more-like-this, or video-to-video, title-to-title recommendation, the context is the title itself: you are looking at Emily in Paris, and you want to see shows similar to Emily in Paris. The next big difference is the data itself, which is a manifestation of the product. These tasks live in different parts of the product, and the data collected from engagement differs.
For example, when you go to search, you type a query, you see the results, you engage with a result, you start watching it or purchasing it. Versus when you go on the homepage, you are seeing the recommendations, which is a laid-back experience, and then you engage with them. The data, how it’s logged, and what the user engagement is are different. Similarly, the third difference is the candidate set retrieval itself: for a query, you might want to make sure there is lexical relevance, whereas for personalization, a purely recommendation task, the candidate set itself, the first pass, could be different. Ultimately, as in the previous question, there is usually canvas-specific or product-specific business logic that puts guardrails on what is allowed in that part of the product. What we do is first identify these differences, and then set out to unify them.
Overall, the goal is to develop a single contextual recommender system that can serve all search and recommendation canvases. Not two different models or five different models, just one model that is aware of all these different contexts. What are the benefits? The first benefit is that these different tasks learn from each other. When you go on Netflix and type Stranger Things, the results you see, versus the recommendations you see on more-like-this when you click on Stranger Things: what we see from our members is that they don’t want different results for the same context on different parts of the canvas. Or do they? We want the model to learn this information. We want to leverage each of these tasks to benefit the others.
Then, innovation applied to one task can be immediately scaled to the other tasks. The most important benefit is that instead of five models, we now maintain one model. That’s much reduced tech debt and much lower maintenance cost. Overall, engineering cost goes down, and PagerDuty, on-call, becomes easier, because instead of debugging five models and their issues, you’re debugging one model. It’s an overall pretty big win-win.
How do we go about doing it? Essentially, we unify the differences. The first important difference was context. Instead of having a small context, training one model, and gathering data and features for that small context, we expand the context, and then do the same things, gathering data and features, for the whole context. Instead of just having the query or just the profile ID as context, we build a model that has this large context: query, country, entity (in the context of Netflix, a video ID), and a task type. The task type tells the model: this is a search task, this is a more-like-this task, this is a pre-query task, and so on. In a way, we are injecting this information into the data while providing all the information each task needs as one dataset. In this particular case, in the context of Netflix, entity refers to more than just videos; we also have out-of-catalog videos.
For example, we often get queries like Game of Thrones, and we have to tell our users we don’t have Game of Thrones. To tell our users that, we first need to identify what Game of Thrones is: it is an out-of-catalog entity. Similarly, people search for Tom Cruise; we need to understand what Tom Cruise is. It’s not a title, it’s a person. Similarly for genres, and so on. An example of context for a specific task: for search, the context is query, country, language, and the task is search. For title-to-title recommendation, the context is the source video ID (in our example, the ID of Emily in Paris), country, language, and the task is title-to-title recommendation. They’re different tasks. Then the data, which is what we have logged in different parts of the product, we merge all together while adding this context, this task type, as part of the data collection. Ultimately, we know which engagement comes from which task, which part of the product, but we let the model learn those tradeoffs.
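To make the unified context concrete, here is a sketch of what two training records might look like after the merge; the field names and IDs are illustrative, not Netflix’s actual schema:

```python
# A search engagement: the query is present, the source title is not.
search_example = {
    "task_type": "search",
    "query": "pari",
    "source_video_id": None,          # no source title on the search canvas
    "country": "US",
    "language": "en",
    "candidate_entity_id": 81037371,  # a video, game, or person entity
    "label": 1,                       # positive engagement, e.g. a play
}

# A more-like-this engagement: the source title is present, the query is not.
more_like_this_example = {
    "task_type": "title_to_title",
    "query": None,                    # no query on this canvas (imputed later)
    "source_video_id": 12345,         # the title being viewed
    "country": "FR",
    "language": "fr",
    "candidate_entity_id": 67890,
    "label": 0,
}
```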
Finally, the target itself: whether it’s a video ranker, an entity ranker, or, now that Netflix also has games, a game ranker, we unify that as well and make the model rank the same entities for all these different tasks.
Here’s the setup. We basically build a multi-task learning model, but multi-task via model sharing. Actually, I’m not sure if people here have built multi-task learning models. Typically, you would have different parts of the objective. An example would be: train a model to learn play, thumbs up, and click. There are three parts of the objective, and we are asking the model to learn all three objectives and the tradeoff between them. Whereas in our case, we did the multi-task through the data: we mix all the data, with the context and the task type tagged onto the data, and ask the model to learn the tradeoff between these different tasks from the data itself, without explicitly calling out the objectives. Similar to the previous recommendation use case I showed, here also there are different types of features, the big one being the entity features, which are basically the video features or the game features.
A big difference compared to a traditional recommendation or search system is that here we have context features, which are much larger: we have query-based features, profile features, video ID features, and task-specific features as well. Because the context is so broad, this information has to be expanded. Then we have context-and-entity features. Across all these types, a numeric feature is fed into the model one way, versus a categorical feature, for which we have corresponding embeddings in the model. Ultimately, the model is a similar architecture to the previous one, a large deep learning model with a lot of residual connections, some sequence features, and so on. The target, or objective, of this model is to learn the probability of positive engagement for a given profile, context, and title, because we are ultimately ranking the titles.
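As a minimal sketch, assuming a much-simplified architecture (the real UniCoRn model has residual connections and sequence features, per the RecSys paper), the feature handling and objective might look like this:

```python
import torch
import torch.nn as nn

class UnifiedRankerSketch(nn.Module):
    def __init__(self, n_tasks: int, n_countries: int, n_entities: int,
                 dim: int = 32, n_numeric: int = 16):
        super().__init__()
        # Categorical context features go through embeddings.
        self.task_emb = nn.Embedding(n_tasks, dim)
        self.country_emb = nn.Embedding(n_countries, dim)
        self.entity_emb = nn.Embedding(n_entities, dim)
        # Numeric features are concatenated in directly.
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim + n_numeric, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, task, country, entity, numeric):
        x = torch.cat([
            self.task_emb(task), self.country_emb(country),
            self.entity_emb(entity), numeric,
        ], dim=-1)
        # Probability of positive engagement for this context-entity pair.
        return torch.sigmoid(self.mlp(x)).squeeze(-1)
```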
Let’s take an example. When a user comes to Netflix and types a query, P-A-R-I, this same model takes that query context, creates all these features, and ultimately generates the likelihood of all the videos that are relevant to the query P-A-R-I. Then the same model, when used on the more-like-this canvas, when a user clicks on Emily in Paris, generates all these features for the context 12345 (let’s say that’s the ID of Emily in Paris) and generates the likelihood of all the titles in our catalog that are similar to Emily in Paris. That’s the power of unifying the whole model: even though these are different canvases in different parts of the product, we are using the same infra and the same ML model to make inferences and ultimately generate a ranked list for a given task and context.
How is this magic happening? Here are some hypotheses based on a lot of ablation studies we have done. I think the key benefit of an effort like this is that each task benefits from the others, each of the auxiliary tasks. In this case, search is one task benefiting from all these different recommendation tasks. This model replaced four different ML models; we were able to sunset and deprecate four different ML models and replace them with one. Clearly, there was benefit from one task to another. The task type as context was very important, and the features specific to these different tasks allowed the model to learn tradeoffs between them. Another key thing is how we handle these different contexts and their missingness. We took the approach of imputing the missing context.
For example, in the context of more-like-this, we don’t really have a query, but we can think of some heuristic and impute a query. Also, things like feature crossing, a specific ML architecture consideration, helped. With this unification, we were able to achieve either a lift or parity in performance on the different tasks. As a first step, we wanted to at least be able to replace the four different models without taking a hit in performance. Once we were able to do that, we brought in all sorts of innovation, which was immediately applicable to four different parts of the product rather than one. Here’s an example: we initially replaced the pre-query, search, and more-like-this canvases with this one model. Then we also brought personalization into it. This is the traditional UniCoRn model; we then took a user foundation model that was trained separately and merged it with the UniCoRn model.
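Here is one way that imputation idea might look in code; the specific heuristic (using the source title’s name as a pseudo-query) is an assumption for illustration, not necessarily the production heuristic:

```python
def impute_missing_context(example: dict, title_names: dict[int, str]) -> dict:
    """Fill in context fields that a given canvas does not naturally have."""
    if example["task_type"] == "title_to_title" and example["query"] is None:
        # More-like-this has no query; impute one from the source title.
        example["query"] = title_names.get(example["source_video_id"], "")
    if example["task_type"] == "search" and example["source_video_id"] is None:
        # Search has no source title; use a reserved "missing entity" ID.
        example["source_video_id"] = -1
    return example
```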
Then, immediately, we were able to bring personalization to pre-query, to search, and to more-like-this. In the previous world, where we had three different models for three different tasks, we would have had to bring these features to three different models. Instead of taking three quarters, we did it in one quarter. Again, there is a recent paper on this work at RecSys; feel free to take a look. Offline, we got lifts of 7% and 10% on search and recommendation tasks by combining these, which makes the point that these different tasks benefit from each other.
This is a redundant slide, showing, as before, that we are able to merge a personalization signal, a separate TF graph, into the UniCoRn model to bring personalization to all canvases. Here is an example from after we deployed UniCoRn in production and then deployed the personalized version of UniCoRn. I don’t usually watch kids’ shows on my profile, so I clicked on ‘s’ as a query. Before personalization, I was getting some kids’ shows here, Simon, Sahara. After the personalization model was pushed, all those kids’ shows disappeared, and these were very relevant personalized titles for me for the very broad query ‘s’. Go give it a try, because currently Netflix production search, more-like-this, and entity suggestion are powered by this specific model, UniCoRn.
Considerations
What were the considerations? In addition to the other infra considerations I shared in the previous use case, here, because we are merging search and recommendation, a very big consideration is how we handle the tradeoff between personalization and relevance. What does relevance mean here? Relevance to the context. If you type ‘s’ and on Netflix you see a lot of titles that start with ‘a’, I think you’ll find it a pretty bad experience. If you’re typing ‘s’, you would expect things with ‘s’; that’s lexical relevance. Similarly, if you’re typing the genre romance and you start seeing a lot of action movies in the results, that’s irrelevant. Even though it might be very personally relevant to you, in the context of the query it’s irrelevant. We want to make sure we trade off between personalization and relevance pretty well.
Then, because the query is real-time and all these engagements are real-time, we want to make sure our model is really good but doesn’t hurt latency. We don’t want our members to wait around for 5 minutes after typing a query. In fact, the latency requirement is very strict, something around 40 to 100 milliseconds at P50. Similarly, depending on the region where the Netflix app is opened, throughput becomes important. Handling missing context is important in this particular case because we expand the context quite a lot. Features specific to the context, and ultimately what kind of task-specific retrieval logic we have, become important. One thing to note: we only unified the second pass ranker, not the first pass ranker. The retrieval, the retrieval logic, remains different for the different tasks.
Some additional considerations. In general, when you’re building ranking systems, in addition to everything I showed, there are things like negative sampling. What should the negative sampling be? Should you use random negatives, or should you use impressions as negatives? Overall sample weighting: is one action more important than another? Then, a very important thing is the cost of productization. Even if it’s a winning experience during the A/B test, we might not be able to productize it because it’s too expensive; we might have trained it on more GPUs than the company can support. With multi-GPU training, and GPUs possibly used even for inference, the cost of productization becomes a very critical consideration. Then, ultimately, during online A/B testing: what metrics to look at, how we analyze and tell the story of what is really happening, and debugging what the members really like about an experience all become very important.
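For the two negative-sampling options mentioned, here is a minimal sketch contrasting them; `catalog` and `impressions` are illustrative stand-ins for the real data sources:

```python
import random

def random_negatives(catalog: list[int], positives: set[int], n: int) -> list[int]:
    # Sample negatives uniformly from the catalog, excluding engaged items.
    pool = [v for v in catalog if v not in positives]
    return random.sample(pool, min(n, len(pool)))

def impression_negatives(impressions: list[int], positives: set[int], n: int) -> list[int]:
    # Treat items that were shown but not engaged with as (harder) negatives.
    pool = [v for v in impressions if v not in positives]
    return pool[:n]
```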
Key Takeaways
Overall, it’s beneficial to identify common components among production ranking systems, because then we can unify the differences, reduce tech debt, improve efficiency, and have fewer on-call issues. A single model aware of diverse contexts can perform and improve both search and recommendation tasks. The key advantages of consolidating these tech stacks and models are that the different tasks can benefit from one another, reduced tech debt, and higher innovation velocity. Real-time in-session signals are important for capturing member short-term intent, while we also need to trade that off against long-term interest.
Overall, infrastructural considerations are just as important as the model itself. I know machine learning, modeling, is the cool or sexy part, but in a real production model, infrastructure becomes even more important; oftentimes I’ve seen it be the bottleneck rather than the model itself. Latency, throughput, and training and inference time efficiency are all super critical considerations when we build something real-time for members at production scale.
Questions and Answers
Participant 3: You mentioned a lot about how having one model can benefit the search and recommendation system. Besides the considerations around training and CPU, what were the real drawbacks of having to condense four different models into just one?
Bhattacharya: The first point here is the biggest drawback, or something to keep in mind. We spent a bunch of time trying to ensure that personalization is not overpowering relevance, because recommendation tasks typically over-anchor on personalization, whereas a search task is more relevance-oriented. How we merged them, in this particular context, same picture: the left-hand side brings in personalization and the right-hand side is the relevance server. How we merge these two is very important, because if we do some of these merges very early on, it could hurt relevance, and if we don’t merge with the right architectural considerations, it could fail to bring personalization to the relevant queries. Going back to this: the first one is the personalization-relevance tradeoff, which is a difficult thing to achieve, and you have to do a bunch of experimentation.
Then, in general, a bigger model helps, but bigger models come with higher latency. How do we handle that? We have a few tricks that we used to address latency, which I cannot share because we haven’t written them up publicly in the paper. Latency becomes a big consideration and can be one of the blockers to combining models.
Participant 4: In terms of unifying the models between search and recommender systems, the number of entities in the context of Netflix is limited: a number of genres, movie titles, and people. If it were something like X, the social media platform, where the entities would be unlimited in number, would the approach of unifying those models still scale for those kinds of applications?
Bhattacharya: I think that’s where the disclaimer comes in: this is a second pass ranker unification. Prior to Netflix, I was at Etsy, an e-commerce platform where the catalog size was much bigger than Netflix’s. We usually do first pass ranking and then second pass ranking, and this unification is of the second pass ranker. I believe this would scale to any other product application: as long as we have first pass ranking that retrieves the right set of candidates with high recall, the candidate set size for second pass ranking is much smaller. As for also unifying the first pass, the retrieval phase, there are actually a few papers now on generative retrieval, but this work did not focus on that.