Transcript
Moumita Bhattacharya: I’m a machine learning manager at Netflix, leading the foundation models team. This talk will cover some use cases from Netflix. First and foremost, I wanted to have this slide where hopefully, by the end of the talk, I will be able to convince you that in ML there is a Harry Potter, and that Harry Potter can do some magic. Let’s see if I manage to do that. Search and recommendation is the overarching topic of this talk.
As we all know, search and recommendation as an application of machine learning is omnipresent across products. Video streaming services like Netflix, Hulu, or Amazon, music streaming services such as Spotify and Pandora, and eCommerce platforms such as Etsy and Amazon all leverage machine learning and AI for search and recommendation use cases. The user base as well as the catalog is ever-growing. How many folks are aware of search and recommendation as applications of ML? In reality, for B2C, business-to-customer products, the catalog is really big. A big company like Netflix or Spotify has 100 million plus users; Netflix has 300 million plus users. The catalog, the set of items over which we have to score to show to a user, is usually more than 100 million.
For example, in an eCommerce context, it’s probably in the billions. It’s a very tough task to rank the whole catalog for each user and put it in front of you. Imagine when you join Netflix, you open Netflix on your TV, and it takes five minutes before anything shows up on the screen; that would be ridiculous. What do we typically do, given that the problem space to score is really large? We usually break it down into two stages. A user comes to Netflix or Spotify or Amazon, where there are millions of listings. There is usually a first stage, typically referred to as candidate set selection or retrieval, which reduces the number of candidates in the catalog to some hundreds of thousands of items. Then a more complex model, usually referred to as a second pass ranker, ranks them for precision, so that, as a user, you see something that is personalized and useful to you, and not the entire catalog. This is the usual two-stage ranking in any industry setup for search and recommendation tasks. During my talk, I will focus more on the second pass ranker and then generalize it to foundation models.
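To make the two-stage idea concrete, here is a minimal Python sketch of the flow. The function names and the cutoff size are hypothetical, invented for illustration, not Netflix's actual system.

```python
from typing import Callable

def rank_for_user(user_id: str,
                  catalog: list[str],
                  retrieve: Callable[[str, list[str]], list[str]],
                  score: Callable[[str, str], float],
                  k: int = 50) -> list[str]:
    """Two-stage ranking: cheap retrieval first, expensive scoring second."""
    # Stage 1: candidate set selection / retrieval narrows the huge catalog
    # down to a much smaller candidate set with a lightweight model or heuristic.
    candidates = retrieve(user_id, catalog)
    # Stage 2: the second pass ranker scores only the surviving candidates
    # with a heavier, more precise model.
    scored = [(item, score(user_id, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]
```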
Common Components for ML on Product
Usually, these are the common components of ML or AI for a product in the context of search and recommendation. As I mentioned, there is first pass ranking, which could be a very simple lightweight ML model or a heuristic. That’s about the last time you will hear about first pass ranking in this talk. Then we have second pass ranking, offline evaluation, the inference setup, and A/B tests and online evaluation before it becomes available to 100% of users. Second pass ranking has different stages that need to be handled.
First and foremost, where is the data? How do we get the data? What are the features? What is the model architecture? What is the objective? Should we optimize for click versus purchase, versus keeping you engaged for more time, versus just showing you something delightful for a few seconds? Objective and reward are where we really try to capture the business need and the user need. Before we can launch anything in front of real users, there is usually a very rigorous offline evaluation setup to understand whether the model is doing what it is supposed to do before we show it to the user. Those offline evaluations are guardrails.
Then, once a model is ready to be shown to a user, there are a lot of inference considerations, like latency. As I was saying, if a user has to wait five minutes to see a result, that will probably be a horrible experience. How do we optimize for latency? Those of you who work in the ML infra space know p50 and p90: what is the end-to-end time that the model takes to return results? Then throughput and compute cost. Of course, now with GPUs, cost is a big consideration. Finally, what are some user metrics that we leverage to assess whether the new model we showed to all the users is relevant or not? That’s where A/B test metrics, score metrics, and so on come in. This is just an overview of the different components. In my talk, I will primarily focus on three of them, which I already mentioned: the second pass ranker, offline evaluation, and inference. In inference, latency, throughput, and compute cost are very important.
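For readers unfamiliar with p50/p90: these are just percentiles over measured request latencies. A tiny sketch, with made-up latency samples:

```python
import numpy as np

# Hypothetical end-to-end latency samples (milliseconds) from model inference.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)

# p50 is the median experience; p90/p99 capture tail latency, which often
# drives SLA decisions.
p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"p50={p50:.1f}ms  p90={p90:.1f}ms  p99={p99:.1f}ms")
```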
A Netflix Ranking Use Case – Unified Contextual Recommender (UniCoRn)
Now let me share a specific ranking use case where, in the Netflix context, we proposed one model to serve both the search and recommendation use cases. Just for historical context: academically, search and recommendation have been approached by two different communities. There are conferences like RecSys that tackle recommendation tasks, and conferences like SIGIR that tackle search tasks. We basically said, ultimately, given the right context, this is the same task, which is ranking.
The question we asked was, can we build a single model for both search and recommendation tasks? The answer is yes. Just to repeat what a search task is in the context of Netflix: when you type, let’s say, P-A-R-I, which is text, we would expect Netflix to return titles like Emily in Paris, or Cooking with Paris with the actress Paris, and so on. A pure recommendation task is where we don’t have any such context, where we do not have a search term. Then a different kind of recommendation task is the video-to-video recommendation task: when you click Emily in Paris, what are the other titles that are similar to it? The premise of this part of the talk is that we can build one ML model to jointly serve both search and recommendation tasks.
First, let’s double-click on the differences between a search task and a recommendation task. The first is the context itself. For search, there’s always a query that you type, so a user intent is provided to the system: the query is in the input context for search. For recommendation, the context could be a video, or it could be nothing, basically just a profile ID or your user ID. Then, because they’re usually part of different parts of the product, there are different engagements: when you go to search, you usually engage with a different part of the product. For example, in Amazon, searching for something is different from browsing the Amazon homepage, which is recommendation.
As a result, there are also different candidates retrieved for search and for recommendation. People who work in industry know there are always business requests, based on which some last pass business rules are set up, and those are usually different as well. The goal of this work was to develop a single contextual recommender system, which we named UniCoRn, the Unified Contextual Ranker (or recommender), that can serve all of the search and recommendation tasks. What’s the benefit? Instead of having four different models, when you have just one model, you need fewer scientists and engineers to develop it. You can bring innovation to multiple places by innovating on one model, so that’s a huge benefit.
Then these different tasks benefit from each other. No one really loves taking care of tech debt, so lower maintenance cost as well as reduced tech debt are some huge benefits. We were successfully able to build this one model and replace four different models in Netflix production, and it is part of what powers Netflix search as well as some parts of the recommendations today.
How do we go about doing it? Remember the differences in context that I mentioned? We basically unify the context, or unify away the differences. First is unifying the context. Instead of having just a query or a profile ID in the context, we include query, country, language, and task type, whether it’s a search or recommendation task. Then we also combine the data: we take user engagement data from across the product and mix it, so it’s data-driven multitask learning. We also add context-specific features: when there’s a query, there are query-specific features; when it’s just a source title ID, as in Emily in Paris, more like this, we add those source-title-ID-related entity features, and so on. Here’s an example of the tasks and their contexts.
For a search task, the context looks something like this: query, country, language, and task equals search, whereas for title-title recommendation, it’s source video, country, language, and task equals title-title recommendation. This is data-driven, model-parameter-driven multitask learning, which allows the model to learn unique behavior for each of these different tasks while also benefiting from the other tasks. We also combine all the engagement from all parts of the product instead of training different models on different parts of the data. And the ultimate objective, whatever this model is ranking, is still just likelihood of play: when a user comes in, whichever part of the product they come to, Netflix wants them to find something they can play. Similarly, Amazon wants you to find something that is relevant to you.
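As a concrete illustration, here is roughly how the unified context might be represented. The field names and values are illustrative assumptions, not the production schema.

```python
# One context schema serves every task; the `task` field and the
# task-specific fields let the single model learn per-task behavior.
search_context = {
    "task": "search",
    "query": "pari",
    "country": "US",
    "language": "en",
    "source_video_id": None,   # not applicable for a search request
}

title_title_context = {
    "task": "title_title_recommendation",
    "query": None,             # may be imputed from the source title later
    "country": "US",
    "language": "en",
    "source_video_id": "emily_in_paris",
}
```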
Here’s an example of the model. It’s a fully connected deep neural network. I won’t go into a lot of detail on the architecture, but there are certain things to highlight. We have entity features, which describe the target, because we are ranking a set of videos for Netflix, or a set of listings for Amazon; those are features about the final output, the target. Then we have context features like query, source title ID (for example, Emily in Paris or Stranger Things), and profile: whatever we know from your behavior on the product, we take those contexts into account. Other context features are things like device type or time of day. Maybe it’s Saturday evening and you want to watch something with your friends or partner, versus Monday morning, when you want to take a break during work and you’re watching something else on Netflix. It really depends on the context.
Then there are context-entity features as well: for a given context and target, what are some features? Those are called cross features, and they are very important. We have categorical features across all these different categories. Real-valued features are used directly as numeric inputs, and for any categorical features, we have embeddings in the model. Then it’s a fully connected neural network with skip connections, or residual connections, which usually help the model not forget some of the input context. Ultimately, the model tries to optimize for the likelihood of positive engagement, which in the context of Netflix could be a play. This model then gets deployed in production, and this one model powers search, title-title recommendation, pure personalized recommendation, and so on. We refer to this model as UniCoRn because it is a unicorn, serving four different canvases with one model.
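Here is a compact PyTorch sketch of the kind of architecture described: embedding tables for categorical features, numeric features passed through, residual MLP blocks, and a sigmoid head for play likelihood. All sizes and feature names are assumptions; the production model is more involved.

```python
import torch
import torch.nn as nn

class UnifiedRanker(nn.Module):
    """Illustrative UniCoRn-style ranker; sizes and features are assumptions."""

    def __init__(self, vocab_sizes: dict[str, int], num_numeric: int,
                 emb_dim: int = 32, hidden: int = 256):
        super().__init__()
        # One embedding table per categorical feature (country, task type,
        # source title ID, target title ID, device type, ...).
        self.embeddings = nn.ModuleDict({
            name: nn.Embedding(size, emb_dim) for name, size in vocab_sizes.items()
        })
        in_dim = emb_dim * len(vocab_sizes) + num_numeric
        self.input_proj = nn.Linear(in_dim, hidden)
        self.block1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, categorical: dict[str, torch.Tensor],
                numeric: torch.Tensor) -> torch.Tensor:
        embs = [self.embeddings[name](ids) for name, ids in categorical.items()]
        x = torch.relu(self.input_proj(torch.cat(embs + [numeric], dim=-1)))
        x = x + self.block1(x)   # residual/skip connections help the network
        x = x + self.block2(x)   # retain the input context
        return torch.sigmoid(self.head(x))  # likelihood of positive engagement
```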
How is this learning happening? What is unique about a model like this? Each of these different tasks is actually benefiting from the others; for any given task, the rest act as auxiliary tasks. An example: when a user types "stranger" as a query, or has Stranger Things as a source title ID, the model learns that the user is really looking for something similar. Through the query, the intent is, show me Stranger Things; but when I click on Stranger Things as a title, for the rest of the recommendations I want to fetch titles that are similar to Stranger Things. The task type as context, and features specific to different tasks, let the model learn tradeoffs between the different tasks. When we train an individual model for just one purpose, we are really making that model very narrow about its intent, whereas when we train one model for different tasks, it is able to learn from all of them and no longer remains as myopic.
In some ways, I’m trying to motivate the next part of the presentation, which is the foundation model. That takes this even further, to something completely agnostic to any task; UniCoRn is still specific to search and recommendation. Another aspect I can share here is that imputing missing context is very helpful. For example, for some tasks, like the pure personalization task or the title-title recommendation task, you don’t have query terms, but imputing those queries, through a heuristic or some other model, was very helpful.
For title-title recommendation, when we don’t have a query, some of the things we’ve done include tokenizing the source title name and treating it as a query, or mapping an entity to a query based on some heuristic, and so on. Also, as ML practitioners know, feature crossing can be very helpful here. There is a model architecture called Deep & Cross Network, DCN-V2, which is what we are using, and it is very useful, as sketched below. With this unification, we were able to achieve either a lift or parity in performance across these different tasks. In one go, we were able to replace four different machine learning models in production with this one model.
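DCN-V2 is a published architecture, so its cross layer can be sketched faithfully: each layer computes x_{l+1} = x0 * (W x_l + b) + x_l, giving explicit feature crosses of increasing order. A minimal sketch:

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # Element-wise product with the original input creates an explicit
        # feature interaction one order higher than the previous layer.
        return x0 * self.linear(xl) + xl

# Stacking a few cross layers over the concatenated feature vector:
dim = 64
x0 = torch.randn(8, dim)                 # batch of input feature vectors
layers = [CrossLayerV2(dim) for _ in range(3)]
x = x0
for layer in layers:
    x = layer(x0, x)                     # crosses of increasing order
```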
System Considerations
Some system considerations. Prior to the UniCoRn model being built, we had a proliferation of ML models across the system. This is in the context of Netflix, but I’ve worked in other places and it’s true there too: you train a model to solve each bespoke problem. Each of these systems had to maintain all the different parts of the pipeline. For example, for email notifications, you needed a label preparation step, a featurization step, model training, and then you serve the model somewhere online, where model hosting and inference have to happen. Each of these also incurs its own cost. Similarly, for title-title recommendation or related items, the same steps have to be repeated with different labels, different features, and so on.
Similarly for search, for category exploration, you name it. Both the offline part and the online part used to be done independently. Each of them requires engineers and scientists to maintain it: offline pipeline, online pipeline, failures, and so on. With UniCoRn, we were able to replace that series of columns with one ML system, where the only remaining difference is in label data preparation, because that is still connected to the product. For notifications we have one label preparation, for title-title similarity we have another, and so on.
The same goes for search, category exploration, and pure personalization; that’s the only difference in the pipeline, and then everything becomes common. There is unified label preparation, unified feature generation, multitask model training, and then one way to host the model across different systems. Then, as the user comes to the product, a client makes a call to the service, the service makes a call to the ranker, and we get the results. It definitely simplifies both the offline pipeline and the online pipeline.
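Schematically, the consolidation might look something like this; the stage and function names are invented for illustration, not the actual pipeline:

```python
# Only label preparation stays per-surface; everything downstream is shared.
LABEL_PREP = {
    "notifications": "prepare_notification_labels",
    "title_title": "prepare_similarity_labels",
    "search": "prepare_search_labels",
    "category_exploration": "prepare_category_labels",
    "personalization": "prepare_personalization_labels",
}

SHARED_STAGES = [
    "unified_label_preparation",   # merges per-surface labels into one dataset
    "unified_feature_generation",
    "multitask_model_training",
    "model_hosting",               # one hosted model, called by every surface
]
```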
However, for online infrastructure, there are some additional considerations. For example, different parts of the product, different online systems, have different SLA requirements. To mitigate that, we host the model separately for some surfaces if there is a separate SLA, and we also have different knobs to optimize things like caching: in some canvases we can cache, while for other parts of the product we cannot.
Similarly, some canvases are extremely latency-sensitive, so throughput and SLA are very important; for other canvases, they are not. We really try to give those levers to online inference so it can continue serving the product as needed, while under the hood the model has been fully unified. These are some of the specific inference choices we made: deploy the model in a different system environment per use case.
We provide knobs to tune the characteristics of model inference, including latency, data freshness, and caching, and we expose a generic, use-case-agnostic API for the consuming systems. What this enabled is that now any product partner can come in with a search or recommendation use case, take the model endpoint, and just run with it, with different knobs for whether to enable caching and what kind of context to provide the model. It’s much more self-serve for product partners to use ML, which really increases innovation velocity, as you can imagine. To enable this flexibility, the API also accepts heterogeneous context input: a pure personalization use case might have just user and country, while a pure title-title recommendation might have just a source entity ID, without even a user, and so on. We have also enabled a separate candidate set for each of these tasks.
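Here is what such a heterogeneous, knob-driven request might look like; the field names and knobs are hypothetical, not Netflix's actual API:

```python
# A hypothetical request to a use-case-agnostic ranking endpoint.
request = {
    "use_case": "search",
    "context": {                # heterogeneous: each use case sends what it has
        "profile_id": "abc123",
        "query": "pari",
        "country": "FR",
    },
    "candidate_set": "search_candidates_v2",   # per-task candidate source
    "inference_options": {
        "caching": False,          # some canvases cannot be cached
        "max_latency_ms": 150,     # per-surface SLA
        "data_freshness": "near_realtime",
    },
}
```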
Foundation Model (FM)
That was UniCoRn, the model that unifies search and recommendation. Now let me go over some specific aspects of the foundation model, a user-specific foundation model that we built at Netflix, and then hopefully I can bring it together and explain why I’m talking about two different models in the same presentation. What is a foundation model, or a user foundation model? Inspired by the effectiveness of large language models, for example GPT and Llama, within Netflix we built a large model that can holistically learn member preferences, both long-term and short-term, and be task-agnostic.
The whole idea of these large models is that the model’s parameter count is so big, and the capability of the model is so large, that we don’t really need to tell it about specific tasks. It can understand and learn many more tasks than a few specific ones. UniCoRn was still task-specific, because it was for search and recommendation. Here we build a large model to just generically understand what members are doing on Netflix, and that’s applicable to other products too. Why even build a foundation model? Here are a few pointers. One is that this is one model that can learn users’ long-term preferences, short-term preferences, as well as long-tail entity representations. It also reduces maintenance cost, allowing us to operate with small teams. It is cost-efficient because we now need to train just one model; many of the more bespoke models can be deprecated. Innovation applied to one part of the product is immediately applicable to other parts of the product. Now we are going even beyond the specific search and recommendation tasks that UniCoRn was already able to replace.
How do we go about building this foundation model? Imagine a user comes to Netflix; in this case, maybe he has been on Netflix for four years. We know all the engagement they have had. First, let’s say the user came to Netflix and discovered Stranger Things, a horror-thriller show. Then the user binge-watched that title, then discovered another sci-fi title, and so on. This user engagement history is the input sequence, the entire history of the user’s engagement on the product. As a side note, Netflix does not use any other demographic information about the user, only what they have engaged with on the product. There is a similarity you might have noticed between an LLM, a large language model, and this foundation model in the context of Netflix.
The similarity is that the title, or entity, that we have, such as Stranger Things or Emily in Paris, is equivalent to a word in the context of natural language. We do a one-hot representation with a similar vocabulary size. It follows a power law, as in natural language, and it is sequential in nature, because we are treating the user’s engagement on the product as a sequence, like a document of text. Instead of sentences, we have interaction trajectories. The learning objective is also self-supervised, the same as for a large language model: given the history as context, predict the next word; in this case, predict the next video or next entity. Once we have this foundation model built, it will have fine-tuning capability like the other LLMs or video LLMs that we saw.
What are the differences? One could argue that it is actually more difficult to make this kind of foundation model work in a context like Netflix than a large language model, because our understanding of user preferences and behavior is more limited than our understanding of language, which rests on many centuries of study. What are the specific differences? Words and titles are fundamentally different. Language has more structure; we understand more about semantics, syntax, and grammar, whereas titles and user behavior are very unique: one user’s behavior differs from another’s.
Both are power-law distributed. For example, at the head of the distribution, language has stop words, which we want to remove from the corpus we are building; for titles, those "stop words" are the popular titles. A title can become so extremely popular that it almost loses any information content, because everybody is watching it. At the tail, language has the more nuanced subtleties; at the tail of the titles, there are those very niche, very unique tastes that might be preferred by just a few users and not most. We have similar problems of cold starting, popularity bias, and so on.
Using the user history, we then prepare this data, which is a big part of the work, to pre-train from scratch. We are not using a GPT or a Llama or something that has already been pre-trained, because this model is trying to understand user engagement history and build a foundation model that is unique to its context. In the context of Netflix, there is one pre-trained foundation model. We use the user history, and tokenization is important, similar to an LLM. There is a low signal-to-noise ratio and heterogeneity in the interactions. There are engagements like playing a title, giving a title a thumbs up, or adding it to My List, depending on the domain; in the context of eCommerce, you add to cart, then purchase, see email notifications, and so on.
An important consideration is also how we represent an interaction. One interaction can simultaneously be on a show, on a row, and on a genre type. Interaction context like time, duration, language, and device are all things we want to inject into this foundation model for it to be truly, deeply able to understand the product and the user’s preferences. Here’s an example of how we go about tokenizing this user interaction history. This user came in, played Stranger Things, then added Wednesday to My List, then played The Gray Man and gave it a thumbs up, then went back and played Stranger Things again and again, and then moved on to Squid Game.
In order to tokenize, if we just used the raw history, the context length would be enormous. In language, we likewise use tokenization mechanisms to better represent the data without blowing up the vocabulary size. Here, we do a roll-up: within a given window of time, if you have watched the same title repeatedly, it carries similar information content, so we roll it up. As you see, the run of three Stranger Things plays becomes one Stranger Things token, and similarly for Squid Game. This then becomes the input to the foundation model we are training.
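The roll-up can be sketched in a few lines; the six-hour window here is an invented parameter for illustration, not the real value:

```python
WINDOW_SECONDS = 6 * 3600  # assumed roll-up window

def roll_up(events: list[tuple[float, str]]) -> list[str]:
    """events: (timestamp, title) pairs sorted by time -> token sequence."""
    kept: list[tuple[float, str]] = []
    for ts, title in events:
        # Same title as the previous kept token, within the window:
        # it carries similar information content, so collapse it.
        if kept and kept[-1][1] == title and ts - kept[-1][0] <= WINDOW_SECONDS:
            continue
        kept.append((ts, title))
    return [title for _, title in kept]

history = [(0, "stranger_things"), (100, "wednesday"), (200, "gray_man"),
           (300, "stranger_things"), (400, "stranger_things"),
           (500, "stranger_things"), (600, "squid_game"), (700, "squid_game")]
print(roll_up(history))
# ['stranger_things', 'wednesday', 'gray_man', 'stranger_things', 'squid_game']
```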
What do we do? We ultimately train a transformer model, a four-stack transformer. This is our member foundation model, and it’s very similar to the architecture of GPT-4, except it is not trained on language; it’s trained on user preference history, which has all the characteristics of a natural language, but is about user product engagement. This transformer, trained on user history, is specifically a decoder-only model, which is why it’s similar to GPT-3 and GPT-4, with multiple objectives: what will be the next title the user engages with? And what is the intent? Does the user want to watch a movie or play a game, do they want horror versus comedy, so genre, and so on? The reason we train this multi-objective model is to make sure it has enough information to generalize its understanding of the user, instead of learning one specific thing, which is recommendation.
A little more detail about the model itself. It’s a hierarchical multitask learning model: as I showed, the user input sequence goes in, we create the input vector, and then we have a four-stack transformer, where part of the stack is understanding user intent, like whether you want to watch a movie versus play a game, or whether you’re interested in horror shows versus comedy shows, and so on. The next part is the recommendation task, which is the next-item prediction task. This is the foundation model. We were able to train it, and it understands user preferences well enough that individual bespoke models no longer need to reinvent the wheel. It’s much cheaper to train one foundation model than to have each of those bespoke models relearn these user patterns again and again.
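To make the shape of this concrete, here is a minimal decoder-style transformer with two heads, next-title and intent prediction, emulated in PyTorch with a causally masked encoder stack. Every size here is an assumption for illustration, not the production configuration.

```python
import torch
import torch.nn as nn

class MemberFoundationModel(nn.Module):
    """Illustrative decoder-only transformer over interaction tokens,
    with next-title and intent/genre heads. Sizes are assumptions."""

    def __init__(self, num_titles: int, num_intents: int,
                 d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.title_emb = nn.Embedding(num_titles, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.next_title_head = nn.Linear(d_model, num_titles)
        self.intent_head = nn.Linear(d_model, num_intents)

    def forward(self, token_ids: torch.Tensor):
        seq_len = token_ids.size(1)
        # Causal mask: each position attends only to earlier interactions,
        # which is what makes this "decoder-only" (GPT-style).
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(self.title_emb(token_ids), mask=mask)
        return self.next_title_head(h), self.intent_head(h)

# Training would combine both objectives, e.g.:
# loss = ce(next_title_logits, next_titles) + ce(intent_logits, intents)
```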
In all this, you might be wondering, where is the Harry Potter, and what is the magic? We just talked about ML. I would argue that the transformer, as in the foundation model, is the Harry Potter, and the personalization it brings is the magic. Do you remember the UniCoRn model that I described? Nowhere was UniCoRn personalized. After we trained this foundation model, the Harry Potter, we were able to bring it into UniCoRn, and in one go, with one effort, we personalized four different canvases directly by leveraging the user foundation model. It brought a huge lift offline, for both search and recommendation tasks, 7% and 10% increases respectively, and it was a pretty big lift online as well, though I’m not allowed to talk about online metrics.
As you can see, we saved a lot of effort, because otherwise, for each of these canvases, we would have had to think about how to personalize it and how to learn the tradeoff between context, relevance, and personalization. Because this foundation model is so powerful and capable of understanding the nuances of user preferences, just by bringing its information into the UniCoRn model, we were immediately able to personalize search, personalize pre-query (which is a purely personalized canvas), personalize title-title recommendation, and so on. That’s where the magic happened. Here’s an example. The first two rows are the results for the query S, on one of our test profiles. This test profile does not watch kids’ titles.
As you see, before the foundation model was injected, the first two rows are pure UniCoRn results, and there are some kids’ titles. Even though most of the recommendations are relevant, like Suits, Spy Game, Seinfeld, and so on, there are some kids’ titles. After we put the foundation model into UniCoRn, those kids’ titles were gone, and all the results became much more relevant. This is just one example, but it illustrates that we were able not only to bring in personalization with the foundation model, but, because the foundation model understands user preferences and nuances, to strike a good tradeoff between personalization and relevance.
What have we covered so far? We developed a Unified Contextual Recommender model, UniCoRn, to power Netflix search and recommendations. Then we developed a large transformer-based foundation model to holistically learn member preferences, grounded in Netflix content understanding. Then we leveraged the magic of the FM, the foundation model, in UniCoRn to personalize all the search and recommendation tasks it powers. It all looks great, but there were challenges. One more big point before going to the challenges: to truly leverage this magical power of the foundation model, we went even beyond UniCoRn and brought the foundation model into other models that required personalization. With a very short ramp-up time, we were able to personalize essentially any canvas, because this foundation model learns personalization very well.
Challenges
What were some of the challenges? Of course, to train a large foundation model, we need a lot of training optimization, efficient algorithms for training. There is a huge literature now on optimizing large foundation models, including fast attention, how to quantize the model, and so on. I’m just highlighting some of the challenges we had to address, without going into details. Substantial computing resources also had to be secured, which meant thinking about how to best utilize the GPUs we had, how to shard the data, how to optimize GPU training, and so on.
Then, after the model was trained, this is a large model, so how do we think about serving? Do we cache or not? Do we do some batch compute? How do we preserve the latency of downstream applications like UniCoRn that consume the foundation model, without blowing up the cost? The cost of productization was a big one: we had to go multi-GPU, both for training and inference, while figuring out how to optimize for cost, and we also needed robust evaluation, both offline and online. I’m not able to share all the details of these challenges, but I think they will be common to every company, every team trying to serve large models under really high production traffic. What are the other challenges? Personalization is good, but not in all cases; over-personalization is not good.
In the context of search, if you type Paris and you see a highly personalized result, but none of the titles has P-A-R-I in it, there is no lexical relevance, and those are not going to be good results. How do we trade off personalization and relevance? Instead of over-personalizing, how do we make sure we have the right balance? That’s a very important consideration. Then there are the general problems of recommender systems: concentration effects, filter bubbles, how do we think about those? The rewards and objectives are very important here. What are the online metrics? How do we evaluate whether the model we build is good or not, with core metrics and secondary metrics? And going back to the offline training world, how do we think about negative sampling, rewards, and so on? I just wanted to highlight some of the different dimensions we had to consider to make this whole system of UniCoRn and the foundation model work.
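As a toy illustration of the relevance-personalization tradeoff in search, one could gate and blend scores like this; the rule and the weight are invented for illustration, not Netflix's method:

```python
def blended_score(lexical_relevance: float,
                  personalization_score: float,
                  alpha: float = 0.7) -> float:
    """Blend relevance and personalization for a search result."""
    # Gate: if a title has no lexical match with the query at all, no amount
    # of personalization should surface it for a search like "paris".
    if lexical_relevance == 0.0:
        return 0.0
    return alpha * lexical_relevance + (1 - alpha) * personalization_score
```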
Key Takeaways
Here are my key takeaways. It is possible to build a large foundation model that can holistically capture member preferences, both long-term and short-term. A single unified model, UniCoRn, aware of diverse contexts, can perform and improve both search and recommendation tasks. The magic of personalization can easily be brought into UniCoRn via this foundation model, and even beyond UniCoRn, while optimally trading off relevance and personalization. Most importantly, several infrastructural and modeling considerations need to come together for systems like UniCoRn and the foundation model to work at the scale of 300 million plus users, as highlighted in some of the work and challenges I mentioned.
Resources
We have papers on both the UniCoRn model and the foundation model. Here are some links.
Luu: Netflix has members across the globe. The foundation model that you have, is it for all members?
Moumita Bhattacharya: Yes.
Questions and Answers
Participant 1: In the world of large language models, it is a normal technique to fine-tune the foundation model for concrete tasks. Do you use that? If not, why don’t you fine-tune for recommendation, fine-tune for search, and fine-tune for maybe some similar tasks?
Moumita Bhattacharya: In the world of LLMs, it’s very common to take the large language model and fine-tune it for search and recommendation in an application. Do we also fine-tune the foundation model for search and recommendation, and if not, why not? I didn’t go into the details of the integration of the foundation model into UniCoRn. I basically said, we have a foundation model, we merge it with the Unified Contextual Ranker, and then the recommendation is personalized. Fine-tuning comes into play there, in how we leverage the foundation model in our application model, UniCoRn in this case. There are approaches where we fine-tune it: take the TensorFlow graph or the PyTorch graph, combine it with the application model, and fine-tune jointly as part of the training. There are also approaches where, depending on the use case, we fine-tune the foundation model on specific data and then just take the embeddings. I didn’t really go into the technical details of how we combine them, but yes, very relevant question. Fine-tuning is definitely one of the most leveraged approaches.
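The two integration patterns mentioned here, frozen embeddings versus joint fine-tuning, can be sketched roughly as follows. The shapes and names are assumptions; in particular, it assumes the foundation model returns per-position hidden states.

```python
import torch
import torch.nn as nn

# Pattern 1: frozen foundation model; consume its user embedding as a feature.
def user_embedding(fm: nn.Module, history_tokens: torch.Tensor) -> torch.Tensor:
    """Assumes `fm` returns hidden states of shape (batch, seq, dim)."""
    fm.eval()
    with torch.no_grad():              # FM weights stay fixed
        hidden = fm(history_tokens)
    return hidden[:, -1, :]            # feed this vector into UniCoRn

# Pattern 2: joint fine-tuning; attach the FM graph to the application model
# and backpropagate through both during training.
class PersonalizedUniCoRn(nn.Module):
    def __init__(self, fm: nn.Module, ranker: nn.Module):
        super().__init__()
        self.fm = fm            # trainable: gradients flow into the FM
        self.ranker = ranker

    def forward(self, history_tokens, context_features):
        hidden = self.fm(history_tokens)
        user_vec = hidden[:, -1, :]
        return self.ranker(torch.cat([user_vec, context_features], dim=-1))
```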
Participant 2: One key difference between text and titles is that text is more static. There’s a dictionary, and we only add so few words a year, but you’re releasing so many titles a day. How often do you have to retrain this model? It feels like sometimes TV shows blow up overnight and everyone you know has watched them. Are you forcing recommendations on people?
Moumita Bhattacharya: There is a fundamental difference between text and titles: titles are dropped and released every day, every week, whereas text is a static vocabulary, so how do we handle cold starting, and how often do we train the foundation model to handle those fresh titles? This actually goes back to the first question. We don’t retrain or pre-train the foundation model on a daily basis, but we do fine-tune it for any new titles. As you can imagine, there are checkpoints of the model saved, and with any new incremental title, the weights can be updated to account for that fresh title.
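One common way to realize "update the weights for a fresh title from a checkpoint" is to grow the title embedding table and initialize the new rows from the mean of existing embeddings before fine-tuning on fresh engagement data. This is a hedged sketch of a generic cold-start heuristic, not the actual Netflix pipeline:

```python
import torch

def extend_title_embeddings(emb: torch.nn.Embedding,
                            num_new: int) -> torch.nn.Embedding:
    """Grow an embedding table to accommodate newly released titles."""
    old_weight = emb.weight.data
    new_emb = torch.nn.Embedding(old_weight.size(0) + num_new,
                                 old_weight.size(1))
    with torch.no_grad():
        new_emb.weight[: old_weight.size(0)] = old_weight  # keep learned rows
        # New rows start at the mean of existing embeddings (a common
        # cold-start heuristic) before incremental fine-tuning.
        new_emb.weight[old_weight.size(0):] = old_weight.mean(dim=0)
    return new_emb
```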
Participant 3: How do you account for things like diversity, as well as making sure recommendations are editorially appropriate? I also wanted to ask about your culture around running experiments and A/B tests, and how often you might do those.
Moumita Bhattacharya: How does Netflix, or this work, account for diversity and the editorial appropriateness of recommendations? The third part was about culture.
Diversity really comes from the data. There are different ways to build the data that we train the foundation model on. If you train it only on a certain part of the product, you can end up with closed-loop data, and it can just over-optimize on whatever was shown to the user first. What we do is open it up and train on engagement from anywhere on the product, whatever the user has engaged with across the product. There are other approaches that we leverage, without my being able to confirm which of them is in production. We also have explore-exploit: exploration, where you don’t show the results from the model, but with some likelihood you explore. That helps diversify the data.
Then, in the objective as well, there are ways to build the reward function and objective to ensure diversity. One common model objective is DPO, which is basically to introduce diversity into the machine learning model itself. On the editorial side, we have a pretty big setup for editorial reviews, and anything that shows up for a Netflix member directly has some checks and balances, but it’s a combination of ML-driven results and checks and balances from our content creators and editorial partners. It’s just impossible at this scale to have every result reviewed editorially. We do take very seriously the experts’ opinions about what ends up being shown to the member.
Participant 4: After you trained and built your foundation model, were there any emergent behaviors that you explored or found that were not intended by the training?
Moumita Bhattacharya: After the foundation model was trained, was there any emergent behavior found that was not expected? Nothing comes to mind. Even though it’s Netflix’s foundation model, the scale is still pretty small compared to LLMs, and the capability we’re trying to learn is specific to Netflix members. I can’t really comment on it because I don’t think we found anything. There is a blog post that covers the details of the foundation model we built, but there is no mention of emergent behavior; nothing I can comment on.
Participant 5: Do you also use the recommendation engine internally? From what I understand, Netflix is a big content producer as well, and you need to make a lot of decisions about what content to produce.
Moumita Bhattacharya: This is the member recommender system; it covers everything facing the customer when you come to netflix.com, either on mobile or in the app. I don’t work on the content side, so I can’t comment on that at all.
Participant 6: You just mentioned web and mobile. Is the model platform-agnostic? If so, how do you manage to provide the same suggestions for the same queries to users who access Netflix from different platforms: Xbox applications, web, mobile?
Moumita Bhattacharya: Is the model platform-agnostic? The context I mentioned in the model, talking about UniCoRn, includes device type as a context feature. The model learns the different behavior on different platforms, if there is any. That’s the whole point: we don’t train a different model for different platforms. It’s all learned within the model based on the user’s preferences and historical engagement.
Participant 7: Since Netflix has a lot of content, which is primarily visual, do you extract any attributes like video vectorization or video-based tagging, anything not related to the metadata or the names of the titles, but to the content itself?
Moumita Bhattacharya: Because Netflix has a lot of video and image data, do we leverage other modalities beyond text and user engagement history in the foundation model? We try all sorts of things. I won’t be able to disclose the specific features being leveraged, but we do go beyond the textual modality. As you would imagine, not everything we try gives a benefit; we experiment offline and see what lifts the offline metrics. We do leverage other modalities beyond just user history.