Transcript
Zi: I was very lucky to choose machine learning as my field 10 years ago. Since then, I've stayed in the industry as an applied scientist and machine learning engineer, more toward the modeling side. Currently, I am a senior machine learning engineer at Grammarly. Beyond work, I teach deep learning in a University of Toronto certificate program. This year I also co-founded the Toronto AI Practitioners Network. It's a local AI meetup group where we invite hundreds of practitioners to talk about AI together, exchange ideas, and so on. Have you worked on any machine learning projects before? Have you worked on machine learning projects that failed to go to production?
Failure Rates of Machine Learning Projects
Reflecting on my own journey, I've worked on machine learning projects across a couple of different domains: social media platforms, FinTech solutions, and, recently, productivity tools. Some of them succeeded in reaching production. Many of them didn't. Even though each project taught me something interesting and let me learn some fancy technologies, it's not the best feeling in the world when the project you pour your heart into doesn't end up generating the impact you hoped for. That motivated me to explore a crucial question: am I alone in feeling that way? How severe is the problem? It turns out a couple of surveys have been done on this topic, and while older studies reported failure rates as high as 85%, most of the recent data tells a fairly similar story.
The graphic I show here is from a study done by Rexer Analytics back in 2023. They surveyed more than 300 machine learning practitioners, and only 32% said their projects reached production. The exact number varies by industry and depends on where you work. A lot of the big tech companies have been adopting AI for quite a long time, so there's a lot of accumulated experience there. For traditional industries like banking, and for enterprises and startups, the journey is still ongoing, and they're still navigating their path toward more effective and seamless AI adoption. Before we dive into the concerns, I want to recognize that not all failures are inherently bad. Machine learning projects often carry very high uncertainty because the work is experimental.
Many times, we can't conclude whether something can be solved by machine learning before we really get our hands dirty, explore the data, and try some baseline models. If your team is able to conduct some preliminary studies, quickly identify that there aren't enough positive signals to justify investing further toward production, and decide to pivot or kill the project, that should actually be considered a success instead of a failure, and it should be celebrated. This principle, fail fast, has been adopted everywhere, and I think it's crucial for encouraging good innovation in a company too. For this talk, we'll be focusing on bad failures.
By bad failures, I mean projects that dragged on too long without a clear conclusion, models that never went to production even though their performance was pretty good, and, lastly, solutions that failed to get adopted even after the model was deployed to production. Here comes the golden question we want to answer: what leads to the failures of these machine learning projects?
Outline
We'll start with a quick overview of what a machine learning project lifecycle looks like and how it differs from ordinary software projects. Understanding this process will help us see where exactly these failures occur. Then we'll dive into the five common pitfalls, with examples and some of my thoughts on where we can reduce the chances of encountering those problems in the future. Lastly, we'll wrap up with a summary.
Overview of Machine Learning Project Lifecycle
Let's take a look at the machine learning project lifecycle. The diagram here shows its six steps. This is a high-level, simplified overview, so the devil is in the details we're not showing here. Usually, we start by identifying the business goal we want to optimize, and we need to frame a machine learning problem based on that business goal. Once we have the machine learning problem outlined, we need to get into the data, exploring it and processing it, so we can use the processed data to train different models.
Once the models are trained, hopefully we'll find a candidate model that performs well enough to deploy, and we then use the production pipeline to monitor how the model performs. The feedback we get from the monitoring process is used to refine the entire system. That's how the iteration happens. There are two points worth mentioning. One is the lengthy, multi-step process. A lot of handovers happen between steps and across teams, and this complexity naturally increases the risk. Second, machine learning projects are data-centric optimization problems, and these feedback signals, whether from the model, the data, or the monitoring process, are the essential components we want to make sure we leverage if we want a successful project.
Five Common Pitfalls and How to Improve – Tackling the Wrong Problem
Of the five common pitfalls we want to cover, the first one I'd like to discuss is one of the most critical: optimizing the wrong problem to begin with, and ending up wasting a lot of time and effort. Is that a common thing or not? Let's go back to the survey. The figure shows that when people were asked how often project objectives are clearly defined before project initiation, only 29% answered that they begin with a clear definition, and 26% said this rarely happens. This lack of initial clarity is definitely a common battle that machine learning engineers fight before really committing to a project. Iterating over business goals and starting a project with some ambiguity was actually pretty common in the past.
For machine learning projects, this has become a more severe problem. To understand why, we need to explore how business goals are turned into a machine learning solution. Of course, we don't want to miss the first step, which is identifying whether this is a machine learning problem at all. Many times, you may be able to find a rule-based solution that works pretty well, or you can just incorporate some operational changes into your system, rather than trying to build a machine learning solution from beginning to end.
Once we're sure this is a machine learning problem, we get to the big translation step, where a couple of things happen. To solve a machine learning problem, we need to identify the specific data we can extract signals from. That is some heavy lifting from the data engineering team. Second, we usually need to train a whole bunch of different models, with different architectures, tooling, and hyperparameters, to find which model works best. If we're talking about training more sophisticated models, this usually involves expensive infrastructure, including GPUs.
At the end of the day, what we're optimizing is a mathematically defined objective function, and that objective function is highly dependent on the type of business goal we're trying to solve. Previously, pivots or iterations on business goals may not have caused a lot of issues, but now, if we ever want to iterate on them, the data, the objective function, and even the entire machine learning pipeline might need to be updated as a result. That often means a significant amount of work gets tossed away. That's why, to begin with a good machine learning problem, we usually want to ask a lot of questions to make sure the business and the team are truly dedicated to solving that particular problem.
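To make the "business goal becomes an objective function" step concrete, here is a minimal sketch, not from the talk. It assumes a hypothetical retention goal ("will a new user return within 7 days?") framed as binary classification, where the mathematically defined objective the model actually minimizes is a log loss over labels derived from that business definition.

```python
# Illustrative only: a business goal (7-day retention, a hypothetical choice)
# becomes labels plus a concrete objective function the model optimizes.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Log loss over 'did the user return within 7 days' labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])          # labels come from the business definition
y_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.4])  # scores from whatever model we train
print(binary_cross_entropy(y_true, y_pred))
```

If the business goal later shifts, say from 7-day retention to 30-day spend, both the labels and this objective change, which is exactly why the data and pipeline work downstream may need to be redone.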
Here's an example I want to share with you of how we can increase our chance of picking a winning project to begin with. This is based on my past experience working at a FinTech company for over five years. The organization I worked with adopted a centralized AI resource model, meaning all the teams come to one team for AI support. This model is quite common these days, because the shortage of AI talent is still a real thing for many companies. We can't afford to have an AI team in every business line that has an AI use case. This gives us some unique advantages, but also some challenges, because every business line will come to us saying their project is the most important one we should be looking into and should prioritize, often using financial jargon that we may or may not understand.
Our team needed to develop a special skill: navigating through all this noise and identifying which project is worth investing in and has a higher chance of success in the end. Being on the centralized AI team gave me a unique perspective, because I could see how projects were selected and executed. Over the years, I was able to work on many different types of projects across different business lines. For equity researchers, we worked on a specialized news recommendation system, and we also worked on capital markets stock price prediction problems. From all of that, there's a clear winner.
The biggest success for the entire company, and also for myself, was this project. We built a self-explainable predictive model for the personal and commercial banking space of the bank. Due to confidentiality, I can't tell you the details of what the solution looks like, but looking back, I can see three key factors that led to the success of this project. The first is that the project was directly tied to a team inside P&CB. For people who work in banking, you might know this is one of the largest revenue-generating chapters of the bank, so there was a strong drive from the company's perspective to do something different here. The second factor is maturity. The model we were developing is part of an end-to-end system that has been sitting there serving the bank for more than a decade.
There's a lot of historical data in use, and a lot of monitoring and reporting already in place. To ship this model to production, we didn't need to build something end-to-end. We just needed to swap in the model, and that's it. The last factor is ML feasibility. The original model the bank was using is fairly simple, so with the more sophisticated technologies available these days, we were fairly confident that if we tried some new model architectures and some new features, we would likely find a better solution. These three factors, I think, are the key ones that contributed to the success of this project.
Reflecting on this, I don't think there are any fancier reasons why this was a successful project for us, and it wasn't pure luck, for sure. The best machine learning projects are the ones that hit the sweet spot. The three components I think are important are that the project is profitable, desirable to the business stakeholders, and feasible to solve from a technical perspective. By finding those opportunities, you'll be able to start your team in a winning direction. Finding those opportunities is not that straightforward. Often, you need to ask your team some really good questions to understand the situation.
For example: is the goal clearly defined, and can the potential profit justify the cost you're going to put into the machine learning project? What assumptions are you making, and are they even realistic? What types of risk might your model expose you to? Don't be annoyed with your team if they keep asking these questions before they commit to working on the project. The more of these questions you're able to answer, the more likely you are to identify the right problem for your team to solve. Of course, this doesn't mean we shouldn't take any risks. As you can see from the chart here, high-impact, low-risk projects are the low-hanging fruit to start your ML journey with, but high-impact, high-risk projects are also worth pursuing. The key is to be aware of the risk we're taking and build a balanced portfolio on top of that.
Only with a balanced portfolio will your machine learning team be able to take some wins and continue iterating on machine learning solutions. This also allows you to justify the cost the company is putting into AI infrastructure and all the AI talent it's hiring. We also need some high-risk projects in the portfolio, because we want a chance to build some game-changing solutions in the end, even if they don't produce wins in the short term.
To summarize, starting a machine learning project just because everybody else is doing it, or because it's technically feasible, is not enough. We want to take time to collaborate with different teams to make sure the project we choose is desirable, profitable, and feasible, so we start well and lay a good foundation for our team.
Challenges Arising from Data
The second problem I want to discuss is the challenge arising from data. This is probably one of the most common pitfalls and the one your team complains about the most. Where is the data? Can we handle this much data processing? Even if the company has already invested a lot in solving those problems, it doesn't mean there are no other hidden challenges that will hurt your project in the end. There's a famous saying in the machine learning world: garbage in, garbage out. Machine learning projects depend entirely on recognizing patterns in the data. If the data is flawed, then the conclusions you draw from the study most likely can't be trusted.
On the right is an illustrative example to give an intuitive understanding of why this data problem can be an issue. The example is: let's try to find an average point to represent the entire dataset. On top is the version without removing the outlier. As you can see, the average gets pulled in the direction of the outlier, which may not be a good representation of the general data. After removing the outlier, you get a better result. That's just a simple example of how crucial it is to do some data cleaning before you run any learning on top of it.
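As a minimal sketch of the slide's idea (the numbers are made up for illustration): one outlier drags the mean, and a simple robust rule, or a robust statistic like the median, gives a summary closer to the bulk of the data.

```python
# One outlier skews the mean; a median/MAD rule filters it out.
import numpy as np

points = np.array([1.1, 0.9, 1.0, 1.2, 0.8, 15.0])   # 15.0 is the outlier

print(points.mean())                        # ~3.33, pulled toward the outlier

med = np.median(points)
mad = np.median(np.abs(points - med))       # median absolute deviation
mask = np.abs(points - med) < 5 * mad       # illustrative threshold, needs domain judgment
print(points[mask].mean())                  # 1.0, representative of the bulk of the data
print(np.median(points))                    # robust alternative without removing anything
```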
Over the years, the machine learning community has developed a fairly standard structure for data pipelines, from data collection to data processing to feature engineering. There's also a list of common tasks people usually do for data preparation, from filtering out duplicates and outliers, to filling in missing data, to resampling the data to deal with imbalance issues. Every proper machine learning team now basically adopts standard procedures for going over their data. But this is merely scratching the surface, and it's far from sufficient. Why is that? If you're interested, there's a GitHub repository called Failed Machine Learning, where you'll find a long list of failed machine learning solutions that people have identified and that are publicly documented for everyone.
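Here is a hedged sketch of those standard preparation steps, deduplication, outlier filtering, missing values, and naive rebalancing, using pandas. The file and column names ("training_data.csv", "value", "label") are hypothetical, and the thresholds are illustrative defaults, not recommendations from the talk.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")                 # hypothetical dataset

df = df.drop_duplicates()                             # 1. remove exact duplicates

q_low, q_high = df["value"].quantile([0.01, 0.99])
df = df[df["value"].between(q_low, q_high)]           # 2. clip obvious outliers (1%/99% rule of thumb)

df["value"] = df["value"].fillna(df["value"].median())  # 3. fill missing numeric values

minority = df[df["label"] == 1]                       # 4. naive upsampling for class imbalance
majority = df[df["label"] == 0]
minority_up = minority.sample(len(majority), replace=True, random_state=0)
df_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
```

This is exactly the "standard procedure" the talk says is necessary but not sufficient; the harder problems are the ones specific to your own data.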
The majority of these cases fail for a reason related to data. Even though these solutions were developed by big tech companies or university researchers, who are supposed to be machine learning experts, they are not immune to mistakes involving data.
This is one of the examples. It's a study conducted by Princeton University back in 2022. They took a closer look at 22 peer-reviewed papers that contain critical pitfalls compromising their results, and those results had been adopted by more than 290 papers across 17 different fields. That means these kinds of problems are severe enough that we should look at them before they hurt something really important. A key issue in each of these failed papers is at least partially related to data leakage. In its simplest form, data leakage means that, for some reason, part of the test data ends up in our training data and gives the model a false sense of success.
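As a concrete illustration of this simplest form (a generic scikit-learn sketch, not taken from the paper or the talk): fitting preprocessing such as a scaler on the full dataset lets test-set statistics bleed into training; the safer pattern is to split first and fit only on the training portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Leaky: the scaler sees the whole dataset, so test statistics leak into training.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Safer: split first, then fit the scaler on the training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))   # evaluation untouched by training statistics
```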
In reality, the definition of data leakage is more complicated. As listed in the paper here, the Princeton researchers were able to categorize data leakage into eight different categories, from mixing training and test data, to sampling biases, and beyond. This variety of leakage types makes it hard to identify early on and prevent from the beginning. Another example we see on the slides is from the healthcare field. I don't know if you remember the early days of the pandemic, when a lot of AI was making news headlines claiming it could identify or diagnose the disease. This was proud work for many machine learning practitioners.
However, researchers from the UK took a hard look at the studies being published: among 606 models that claimed they could diagnose COVID, only 2 had results good enough to be trusted and moved to the testing stage, and that's even before any clinical usage. All the other models were completely useless. The main issue was, again, data related. Of the roughly 600 models, more than 400 were trained on inadequate samples of data, which introduced bias into the conclusions. That's why the results couldn't be trusted or carried further. It's another harsh but essential lesson about the importance of data quality and how it can hurt your machine learning projects.
Data preparation work usually feels like exploring an iceberg. What you see on the surface is just the beginning; many problems, especially the ones specific to your own data, are hidden beneath. However, there are a couple of key questions you can ask yourself to eliminate some of the risk. Are you collecting data that's representative of the real-world data your model will be applied to? Have you avoided the risk of leaking future data into your training data? These two points are what we covered in the examples we just looked at together. The challenges don't stop there. In large organizations, data silos are also a big problem. Teams may not be aware of all the features available for training a model, and this can lead them to the wrong conclusion that the problem can't be solved, when in fact there is data that's simply underexplored.
Finally, there's another big topic we want to cover, which is labeled data. Labeled data plays a critical role in machine learning. From the model learning perspective, we need labeled data to extract signals from and optimize the model on. From the evaluation perspective, we need labeled data to understand how well the model is performing and where the problems are, so we can iterate on fixing those issues. Even though there are third-party vendors providing a lot of AI data labeling solutions and trying to make this line of work more streamlined, it still requires a large amount of effort from your own team.
Usually, we need to collect and label a golden dataset to evaluate the quality of the annotations we get, and provide a really detailed guideline for the annotators to follow so they give us consistent labeling results. Even after all this prevention work, at the end of the day, you might find that the labeled data still lacks consensus and you can't use it for model training. If you have worked on data labeling before, I'm sure you understand a lot of the frustration behind the scenes. This challenge becomes more interesting with the rise of Model-as-a-Service and all the available pre-trained models. As the diagram on the left shows, with Model-as-a-Service we can actually skip model training entirely by just calling an external API, or use a pre-trained model that works pretty well on its own dataset.
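One hedged illustration of a consensus check (my example, not the talk's process): compare annotators against each other and against the golden set before buying more labels. Cohen's kappa is one common inter-annotator agreement measure; the label arrays here are made up.

```python
from sklearn.metrics import cohen_kappa_score, accuracy_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]
golden      = [1, 0, 1, 1, 0, 1, 0, 0]   # small set of labels your own team trusts

print("inter-annotator agreement (kappa):", cohen_kappa_score(annotator_a, annotator_b))
print("annotator B vs golden set:", accuracy_score(golden, annotator_b))
# Low agreement is a signal to tighten the guideline before scaling up labeling.
```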
This definitely allows us to deliver a solution faster than before. I remember attending a community discussion where everybody was working on some GenAI solution. The question was: how are you evaluating your LLM solutions? The first reaction from the room was that everybody went silent. What they eventually said was that, in the early stages, they're just doing human eyeballing, or they've started collecting a small set of examples to run some tests and simple statistics on, so they have a rough understanding of how well it performs. This is OK to begin with, because at the moment everybody is in fear-of-missing-out mode; we want to push some GenAI solution out the door. It becomes a big problem later on, because without labeled data to understand how well your model performs, you keep adding patches reactively.
A downstream use case, or a client, will complain about a specific case, so you add something to fix that problem without knowing the broader impact of that patch across the entire dataset. It also prevents you from collecting a good amount of data for your own model training, which could save you a lot of headaches in the future. At this conference, you'll see a couple of talks related to LLM evaluation, and you'll see why this becomes very painful with generative AI. In my opinion, maybe we can skip that step to begin with, but at the end of the day, we still need to invest a significant amount of effort into putting this evaluation pipeline in place.
To conclude this second pitfall: while it's impossible to fix everything perfectly from the beginning, it still can't be emphasized enough to take the time to explore and understand your own data, look for new features, clean up your dataset based on what you observe instead of just applying a standard process to it, and invest in collecting high-quality labels. In the end, machine learning success depends heavily on the data.
Struggle to Turn a Model to a Product
The third problem I want to highlight is the challenge of turning a machine learning model into a product in production. This transition isn't just about deploying the code; it requires a lot more context and work to make the solution come true. Of course, a significant amount of this challenge comes from MLOps, and many of us reference Google's iconic diagram to make that point. You can see how small a fraction of the whole system the machine learning code is. The majority of the code comes from the supporting infrastructure: resource management, serving systems, monitoring tools, logging, and so on. For those of us who have worked on machine learning projects, this probably hits home. Fortunately, the MLOps landscape has grown a lot, and there are many resources available for us to tap into.
Of course, for first-time adopters this sounds like heavy lifting, but as you gradually build your own foundations and your own pipelines, you're able to support multiple machine learning solutions and deploy them more seamlessly than before. To understand the gap between a machine learning model and a machine-learning-based solution beyond the MLOps effort, let's use Retrieval Augmented Generation, or RAG, as an example. In the era of GenAI, RAG has gained a lot of attention.
Essentially, what it does is extract context from your own database and ask your LLM to answer the question with awareness of that context. The best part is that, nowadays, this is all the work you need to get a demo running locally on your machine. You just need to leverage the OpenAI APIs, some LangChain libraries, and a vector database, and you'll be able to query your own database and ask a question.
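To show how small that local demo can be, here is a minimal sketch in the spirit of what the talk describes. Assumptions not from the talk: the OpenAI Python client (v1+) with an OPENAI_API_KEY set, illustrative model names, and an in-memory cosine-similarity lookup standing in for a real vector database (the talk mentions LangChain, which wraps these same steps).

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
    "Shipping to Canada typically takes 5-7 business days.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question, k=2):
    q_vec = embed([question])[0]
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])   # retrieve top-k documents
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    chat = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return chat.choices[0].message.content

print(answer("How long do I have to return an item?"))
```

That is roughly the whole demo, which is exactly why the gap to a production-grade system, described next, is so striking.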
When it comes to a fully functioning RAG system that supports production, for example for a customer service use case, it's an entirely different story. Here on the left is a GenAI system architecture that one of my favorite bloggers, Chip Huyen, summarized after reviewing many companies' GenAI solutions. For a production solution, there's a long list of aspects you need to ask yourself about and fill in the blanks for.
For quality control, how are you doing quality evaluation? Do you need advanced RAG techniques, or agentic RAG, to improve quality, or is naive RAG good enough? For customer trust, do you need to bundle your solution with an explainability model so your customer knows how the result was generated, instead of treating it entirely as a black box? For latency reduction, are you doing any caching? If you're hosting any large language models locally, are you doing any inference optimization? The list goes on: data privacy, how you're doing fairness and bias testing to make sure you're practicing responsible AI, and security, how you're handling the hallucination and jailbreak problems that everybody talks about in the GenAI world.
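To make one of those checklist items concrete, here is a hedged, minimal sketch of response caching for latency reduction. `call_llm` is a hypothetical stand-in for whatever model call your system makes; production systems usually use a shared cache (such as Redis) and more careful key normalization or semantic matching.

```python
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Placeholder for the real model call.
    return f"model answer for: {prompt}"

def cached_llm(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:                 # only pay model latency on a cache miss
        _cache[key] = call_llm(prompt)
    return _cache[key]

cached_llm("What is your refund policy?")    # miss: hits the model
cached_llm("what is your refund policy? ")   # hit: served from the cache
```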
In summary, a winning machine learning project team should be cross-functional, formed at the very beginning, and working closely together. They should actively align on requirements and resolve these issues step by step, instead of working in silos and hoping all the problems will be fixed later on.
Offline Success, Online Failure
Number four: offline success, online failure. This fourth common pitfall is probably the one that causes the most emotional waves within a team. You have an offline model that works pretty well, you ship it to production, and all of a sudden it's not working anymore. Why does this happen? To understand it better, I'd like to show you a diagram of offline training and online serving. On the left side, you can see what happens during the offline stage: we use historical data to train our model and evaluate it using offline evaluation metrics. Once the model is deployed to production, we use real-time data, run it through the end-to-end solution, and evaluate performance on online evaluation metrics. With this context in mind, you can probably see clearly where the three discrepancies come from.
First of all, the data we're using is different. One side uses historical data that often goes through sampling, cleaning, or augmentation that changes the entire structure of the dataset, while on the production side we're using real-time data for inference. The second discrepancy comes from the solution itself. The team is usually focused on developing a single model, rather than looking at the entire end-to-end solution we're using in reality. The last one is the evaluation we're doing. For offline training, we use offline evaluation metrics that are closely tied to the machine learning task, while at the online stage, we use online evaluation metrics that are more aligned with business metrics.
Let me share an example. This was my first production launch, many years ago. We were working on a photo recommender system that aimed to promote photographers' work to a creative photography community. One of the main worries our business had was new user churn: people come, register, publish one or two photos, and never come back. With that in mind, our data science team started doing some exploration to find out what the problem was. They found a high correlation between how many likes a new user gets within a short period after registering and that new user's retention rate. With that insight, the machine learning team was asked to promote new users' work as soon as possible so they could get the reactions they wanted and come back to our website.
At the time, most companies were still relying on relatively simple recommendation systems, often powered by a technology called collaborative filtering. The intuitive understanding is shown in the diagram. If user A and user B have liked a lot of the same items before, we consider them to have the same taste. We then find items liked by user A but not yet liked by user B, and recommend those to user B. This also explains why our new users don't get a lot of likes, which is the cold-start problem in recommendation systems. Because new items have no visibility or interactions yet, and the recommendations are based purely on a photo's previous reactions, it's really hard for new photos to get their first batch of reactions and get the ball rolling.
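Here is a toy illustration of the user-user collaborative filtering idea from the diagram (my sketch, not the production system): users who liked the same items count as similar, and we recommend items the most similar user liked that this user hasn't interacted with yet.

```python
import numpy as np

# rows = users, columns = photos, 1 = liked
likes = np.array([
    [1, 1, 1, 0, 0],   # user A
    [1, 1, 0, 0, 0],   # user B
    [0, 0, 0, 1, 1],   # user C
])

def recommend_for(user, k=1):
    sims = likes @ likes[user]              # similarity = overlap in liked items
    sims[user] = -1                         # ignore self
    neighbor = int(np.argmax(sims))         # most similar user
    candidates = (likes[neighbor] == 1) & (likes[user] == 0)
    return np.flatnonzero(candidates)[:k]

print(recommend_for(1))   # user B gets photo 2, liked by the similar user A
# New photos (all-zero columns) never surface this way: the cold-start problem.
```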
We decided to build an additional recommender engine, a content-based recommender system, to deal with the cold-start problem the collaborative filtering solution brings. What we did is train a classification model offline to predict how popular a photo would be, based purely on the content of the photo itself. That way, when a new image comes in, we can identify right away whether it will be popular, instead of waiting for the signals to come in. Once we had a model performing well on this classification task, we integrated it into the pipeline.
For user A, we extract that user's historical behavior and find photos similar to ones they liked before. These similar photos, which the user has never seen, are then filtered by the popularity prediction model we built. That way, we reduce the chance that low-quality photos are recommended to the user, while keeping the content relevant to what they liked before.
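A hedged sketch of that pipeline shape is below; the function names, threshold, and structure are hypothetical stand-ins, not the actual system: take content-similar candidates the user hasn't seen, then keep only those the offline popularity classifier scores above a cutoff.

```python
from typing import Callable

def recommend(user_liked_photos: list[str],
              find_similar: Callable[[list[str]], list[str]],
              seen: set[str],
              predict_popularity: Callable[[str], float],
              threshold: float = 0.7,
              k: int = 20) -> list[str]:
    # Candidate generation: photos similar in content to ones the user liked, not yet seen.
    candidates = [p for p in find_similar(user_liked_photos) if p not in seen]
    # Filtering: drop candidates the offline classifier predicts will be unpopular.
    kept = [p for p in candidates if predict_popularity(p) >= threshold]
    return kept[:k]
```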
Offline, what we optimized was the classification model: predicting the popularity of images and seeing what accuracy we could get. When the model was integrated into the production system for A/B testing, we got mixed signals. The number of likes new users received did go up, so in some ways we achieved our task, but we also received warnings from the main dashboard we use to monitor the health of the platform. We saw a drop in the average session length users were spending on our platform. It meant that, for some reason, our recommender system was creating a disruptive experience, and users didn't want to keep scrolling our website anymore. This was a tough situation. Because it was my first launch, I remember feeling very anxious as focus group feedback rolled in criticizing what we had built. It was a challenging moment.
We went through many rounds of iteration and eventually found a better way of incorporating this popularity prediction signal into the system, but that's a long story for another time. What I'd add is that recommender systems today have become way more complicated. There are many models involved in a single recommender system, and multiple steps we go through to make it all happen. Besides the differences we talked about between offline and online evaluation, and the need to monitor a set of business metrics instead of just the one primary business metric we're trying to optimize, there are additional complications.
Once the model is integrated into production, it becomes part of a much larger system. We're combining output from many different kinds of models, and because the models' outputs are not orthogonal to each other, an offline model that performs pretty well may see its impact diminish after this merging. Having seen so many ways offline success can turn into online failure, what we want to emphasize is not to get stuck over-optimizing for incremental gains in offline models. What we really want to do is allow A/B testing to happen as soon as possible, so we can test whether our offline optimization target is well aligned with the business goal we're really trying to optimize for.
Unseen Non-Technical Obstacles
The last point, and one that often gets overlooked, is unseen non-technical obstacles. Are non-technical obstacles really a big problem? This isn't just my opinion; it's also backed by data. Going back to the survey we looked at before, when people were asked a very straightforward question, what are the main obstacles to deploying models at your organization or your clients' organizations, the top two answers were not related to technology at all. The first is that decision makers or stakeholders are unwilling to provide the support the project needs. The second is a general lack of active planning. These are not technical roadblocks; these are organizational and communication blockers. Why is managing stakeholders tricky? Although there's a lot of enthusiasm for AI nowadays, especially from leadership, that doesn't guarantee the decision makers are able to make the right decisions.
The truth is that many stakeholders don't have an AI background, so they might be influenced by all the news about how AI is making a big impact and delivering big wins for companies, without awareness of the risks and failures it might bring. They might also be biased by their own experience managing traditional, non-AI software engineering projects in the past. This is where AI experts play a role. It's not just about building the model; it's also about making sure your stakeholders understand the right expectations to set for your AI projects.
Some key topics to explain might include: first, how AI learns, so they know why there's a heavy dependency on the data and why we should invest heavily in the data processing pipeline. Second, why machine learning projects are inherently uncertain, so they know the risk that the project might fail in the end. Third, the limitations of the models, so we don't launch a model to production without the business stakeholders understanding the potential reputational damage that uncontrolled output might bring to the company. Finally, the realistic cost of building and deploying models, so the cost can be justified by the profit.
Let's talk about how to manage and plan these projects. Three principles really stand out. First, scope out an MVP with a clear and simple optimization goal. This helps the team get focused and start doing their work, reducing the noise around them. The truth is that starting with a simple baseline model can actually put you in a better position than starting with a more complicated model. Second, once you have the MVP outlined, prioritize building an end-to-end solution so your team can do A/B testing and get production feedback in place as soon as possible.
Finally, use that feedback to iterate and adapt your project quickly. This might require you to redefine your project's optimization targets, or to add more data to make your model better. One of the most effective strategies I've seen in the past is outlined here: a separation between a project incubator and the product line. In the project incubator, we can do early-stage, high-risk idea testing. We use these as bets to figure out which models are actually worth investing in longer term and forming a full-stack team around. This approach allows your team to innovate while managing risk carefully. The key takeaway here is that managing machine learning projects is different from managing traditional software engineering projects. We need to adapt to these major challenges to make sure the team gets the support it needs on the non-technical side.
Summary
While there's no way to guarantee we'll avoid all the mistakes, there are some principles and best practices we can consider in practice. We want to choose a project that is feasible, desirable, and profitable. We want to be data-centric. We want to encourage early collaboration and active management of cross-functional teams. We want to build an end-to-end solution early for testing purposes. And we want to adapt our project management plan to the nature of machine learning projects.
This is a very incomplete list among many other reasons a machine learning project may fail, but I hope it can be a good starting point to kickstart the discussion. I'd like to leave you with one of my favorite quotes, from Charlie Munger: "Learn everything you possibly can from your own personal experience, minimizing what you learn vicariously from the good and bad experiences of others, living and dead".
Questions and Answers
Participant 1: You mentioned in your talk the business metrics that are used for evaluation. Could you maybe give some more examples of those business metrics? You mentioned retention and session length. What are some other common ones being used, in your experience?
Zi: There are two types of evaluation metrics I think we should pay attention to. One is specifically about the thing we're optimizing for. For example, at Grammarly, we're doing rewrite suggestions, and one thing we definitely care a lot about is the acceptance rate: how many of the suggestions we give get accepted by the user in the end? It reflects the quality of the suggestions we're giving. The other metrics are more about the overall health of the entire platform: how many people are buying Grammarly memberships, and how many times they're using Grammarly, for example. This is similar across other use cases. You want to make sure that whatever you're optimizing for is not hurting the main business your company is focused on.
Participant 2: You mentioned the importance of evaluation in the lifecycle. I'm very curious to hear how you're seeing the roles evolve. In traditional MLOps, evaluation was always owned by the machine learning team, and now the roles are blurring across ML engineering and software engineering, with PMs also being more involved in the lifecycle. Who do you think is responsible for evaluation, and more broadly, how are you seeing that evolve?
Zi: I think what has worked best is a combination. For example, unit tests. Machine learning people traditionally don't write a lot of unit tests, but nowadays they're actually pretty helpful. If you can collect a bunch of examples that are really valuable to your company, then when you're iterating on the model, you want to check that all those tests are still passing. There are also machine learning metrics we can use to evaluate the model. Some of them are automatic or statistics based; on our team, linguists are the ones working on those automatic metrics. Some are learned evaluation metrics that are more state of the art and were recently published. Those require paper reading, implementation, and experimentation before we push them into our main monitoring metrics, and that mostly requires machine learning people to get more involved. At the end of the day, we use a combination of all these tests to make sure things are going well.
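A hedged sketch of that "unit tests over valuable examples" idea (the `suggest_rewrite` function and the examples are hypothetical stand-ins, not Grammarly's actual system): a small golden set checked on every model or prompt iteration.

```python
import pytest

GOLDEN_EXAMPLES = [
    ("their going to the store", "they're"),   # expected substring in the suggestion
    ("i has a question", "have"),
]

def suggest_rewrite(text: str) -> str:
    # Placeholder for the real model call.
    return text.replace("their going", "they're going").replace("i has", "I have")

@pytest.mark.parametrize("text,expected", GOLDEN_EXAMPLES)
def test_golden_examples_still_pass(text, expected):
    # Run on every iteration; a failure flags a regression on examples
    # the business has decided it cannot afford to break.
    assert expected in suggest_rewrite(text)
```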