Transcript
Alborno: My name is Soledad Alborno. I'm a product manager at Google. I'm here to talk about what to pack for your GenAI adventure. We're going to talk about what skills you can reuse when building GenAI products, and what new skills and new tools we have to learn to build successful products. I'm an information systems engineer. I was an engineer for 10 years and then moved to product. I'm an advocate for equality and diversity. I've been building GenAI products for the last two years.
In fact, my first product in 2023 was Summarize. Summarize was the first GenAI product or feature for Google Assistant. We had a vision of building something that helped the user get the gist of long-form text. Have you ever received a link from a friend or someone you know, saying, you have to read this, and you open it and it's so long and you don't have time? That's the moment when you pull up Google Assistant, and now Gemini, and you ask, give me a summary of this. It's going to create a quick summary of the information you have on your screen. This is how it looks: you tap the Summarize button and it generates a summary that you can see, in three bullet points. We will talk about when I started this product and what I learned at the end.
My GenAI Creation Journey
When I started building Summarize, I thought, I'm a very seasoned product manager, 10 years of experience, I can do any kind of product, let's start. That was good. I had a lot of tools in my backpack. On the way, I learned that I had to interact with this big monster, the non-deterministic monster that is the LLM and GenAI; I call it here Moby-Dick. In the journey of building this product, I learned to build and work with datasets. I learned to create ratings and evals. I will share with you some tips on how we build those evals and datasets so that they make a successful product. I had to learn to deal with hallucinations, with very long latencies, and with how to make this thing faster, because when you ask for a summary, you don't want to wait 10 seconds to get the result. And trust and safety: a lot of things related to GenAI are related to trust and safety, and we had to work with that.
Traditional Product Management Tools are Still Useful
I’m going to start with a few questions. Have you ever used any kind of LLM? Have you used prompt engineering to build a product? Just prompt engineering: you set up some roles for the model, you made it work. Are you right now building a GenAI product in your role? Have you ever fine-tuned a model? Do you design evals or analyze rating results for a product? Do you believe your current tools and skills are useful to create GenAI products? Whatever you know is very useful to create AI products.
My role is to be a product manager. In order to be a product manager, I need to know my users. I’m the voice of the user in my products. I need to know technology because I build technical solutions for my users. I need to know business because whatever product I build needs to be good for my business and be aligned to the strategy. That’s my role.
In my role, the tools and skills that I use are the ones anyone who works at a startup or leads a project knows. The first one is business and market fit. I need to know who my competitors are, what's happening in this market, how to align to the business goal, how we make money with this product. That's still very relevant in the GenAI world. No changes there. I need to know who my target users are. Are they professional users? Are they students, are they doctors? What age group am I targeting? That's very important for generative AI products as well. You need to know your users because they will interact with the product. I need to know the users' pain points. What are the problems? What are the things that will help me understand how to build a solution for them, and prove that my solution helped them solve a real problem and brought real value to them? Then I need to work with my teams and everyone else to build a solution for those problems. To build a solution, I'm going to write the requirements.
These are still very relevant for any software product. There are small differences that we will talk about. The last two things are metrics and go-to-market. We need to know how to measure whether this product is successful. Go-to-market: how are we going to push this product into the market, what are our marketing strategies, and so on? Everything is relevant; all your skills, everything you have in your backpack so far, everything is useful for GenAI. The difference is in the solution and the requirements. There are very small differences there. This is where we need to help our engineering teams pick the right GenAI stack. They will start asking what type of model they need. Do I need a small model, a big model, a model on device? Does it need to be multimodal or text only? What is this model and how do we pick it?
Activity – Selecting the Right GenAI Stack
Next, we're going to do a little exercise that will help us understand the difference between a small model and a big model, and why we care. The first thing I want to tell you is that we care because small models are cheaper to run than bigger models. That's the first reason we care. Let's see what their quality is like. Who loves traveling here? Any traveler? Who can help me with this question? Lee, your task is super simple. Using only these words: adventure, new, lost, food, luggage, and beach, you have to answer three questions, with only those words. Why do you love traveling?
Lee: Got leaner luggage. Food, and beach, and adventure, and new.
Alborno: Describe a problem you had when traveling.
Lee: Luggage, lost, food, adventure.
Alborno: What’s your dream vacation?
Lee: Adventure.
Alborno: As you can see, it's a little hard to answer these very easy questions with a restricted vocabulary. This represents a model with few parameters. It's fast to run and it's cheaper, but there are only a few choices. The response is not that good. It doesn't feel human. What happens when we add a few more words to the same exercise? It's the same questions. Now we have extra words: adventure, new, lost, food, luggage, beach, in, paradise, the, we, our, airport, discover, relax, and sun. Why do you love traveling?
Lee: Adventure, discover, relax, currently paradise.
Alborno: Describe a problem you had when traveling.
Lee: Lost, new, the sun sometimes.
Alborno: What is your dream vacation?
Lee: Not in the airport. Beach, adventure, paradise, sun.
Alborno: We are getting a little better, getting longer responses. There's still some hallucination in the middle. Hallucination means that the model is inventing things that don't belong in the response, because it doesn't have more information than what we have here in the vocabulary. The last one is, using all the words in the British dictionary and at least three words, so we are introducing some prompting here: why do you love traveling?
Lee: I like the travel to be good. I like the adventure of going to new places and discovering the history and the culture of the people.
Alborno: Describe a problem you had when traveling.
Lee: The airport’s usually a free work zone. I lost money, being lost.
Alborno: What is your dream vacation?
Lee: Going somewhere new, that the people are friendly and inviting.
Alborno: You can see that when the model has more information, more parameters, it's easier for the answers to be better quality and to sound more human. You can borrow this exercise to explain this to anyone, any customer who doesn't know how this works. It works very well. I've done it a couple of times.
Requirements to Select the Right GenAI Stack
Let's go back to the requirements. How do we make sure we help the engineering team define the right model, or the right stack, to use in my product? There are many things we need to define. Four of the most important ones are the input and the output, but not in the general way. We need to think about the modalities. Is my system text only, or does it take text, images, voice, video, attachments, documents? What is the input of my process and my product? What is the output? Is it text? Is it images? Are we generating images? Are we updating images? What is the output that we need from the model? Then, accuracy versus creativity. Is this a product to help people create new things? Is it a GenAI product for canvas, or to create images for a social network, for TikTok, for Instagram? What is it? If it's creative, then it's ok, it's very easy: LLMs are very creative.
If we need more accuracy, like a GenAI product to help doctors diagnose illnesses, then we need accuracy, and that's a different topic. It's a different model, different techniques. We will use RAG, or fine-tune it in a different way. The last one is, how much domain-specific knowledge do we need? Is this just a chatbot that talks about whatever the user asks, or is it a customer support agent for a specific business? In that case, you need to upload all the domain-specific knowledge from that business. For instance, if it is making recommendations to buy a product, you need to upload all the information for that product.
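To make these stack-selection requirements concrete, here is a minimal sketch of how a team might capture them as a structure. The field names and the example product are illustrative assumptions, not something described in the talk.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: one way to write down the four stack-selection
# requirements discussed above. Names and values are assumptions, not from the talk.
@dataclass
class GenAIStackRequirements:
    input_modalities: list[str]           # e.g. ["text"] or ["text", "image"]
    output_modalities: list[str]          # e.g. ["text"]
    accuracy_over_creativity: bool        # True for diagnosis-style products
    domain_knowledge_sources: list[str] = field(default_factory=list)

# Hypothetical example: a medical-diagnosis assistant. Accuracy over creativity
# plus domain knowledge would push the team toward RAG or fine-tuning rather
# than a plain creative model.
diagnosis_assistant = GenAIStackRequirements(
    input_modalities=["text", "image"],
    output_modalities=["text"],
    accuracy_over_creativity=True,
    domain_knowledge_sources=["clinical_guidelines", "drug_database"],
)
```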
For Summarize, I had the following requirements. The user input was text. We started thinking we were going to use the text and the images from the article, but the responses were not improving and the model was slower, so we decided to start with text, and the responses were good enough for our product. For the output, we asked, what is the output? Here it was a little complicated, because while we all know what a summary is, we all have different definitions of one. I had to use user experience research to understand what a summary is for my users. Is it two paragraphs, one paragraph, two sentences, three bullet points? Is it 50 words or 300 words?
All of this matters when we decide what this product is going to do. We had to talk with a lot of people to understand, and at the end of the day, what we decided is that three bullet points was something everyone would agree is a good summary. We went with that definition for the output. We wanted no hallucinations. Not in the typical sense of, do not add flowers to an article that is talking about stocks. It was no hallucination in the sense of, do not add any information that is not in the original article, because LLMs were trained on this massive amount of information, they tend to fill in the gaps, and the summaries will have information from other places. We needed to make sure the summary accurately represented the article. For that, we had to create metrics and numbers.
Actually, automated metrics: a number that represents how well the summary represents the original article. We did that to improve the quality of the summaries too. Last, we didn't have any domain-specific knowledge. We just had the input from the article we were trying to summarize.
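As an illustration of what such an automated number could look like, here is a minimal sketch of a word-overlap score between the summary and the original article. It is a rough proxy for "information only from the article", not the metric the Summarize team actually built.

```python
import re

def summary_support_score(article: str, summary: str) -> float:
    """Rough illustrative metric: the fraction of content words in the summary
    that also appear in the original article. A low score hints that the
    summary may contain information from outside the article."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z']+", text.lower()))

    stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}
    summary_words = tokenize(summary) - stopwords
    if not summary_words:
        return 0.0
    article_words = tokenize(article)
    return len(summary_words & article_words) / len(summary_words)

# A summary that sticks to the article scores higher than one that adds facts.
article = "The airline lost the luggage at the airport and offered a refund."
print(summary_support_score(article, "The airline lost the luggage and offered a refund."))
print(summary_support_score(article, "The airline canceled all flights due to a strike."))
```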
The Data-Driven Development Lifecycle
Once we have all the requirements, we ask, how do we start developing this thing? The first thing was to create a good dataset. When I work with my product teams, we work with a small, a medium, and a large dataset. The small dataset is about 100 inputs that we can get from previous studies, from logs of our previous system, or wherever; sometimes the team writes the inputs. We create 100 prompts and we test. That's a small dataset. It's nice because with a small dataset you can have different prompts and different results from different models for the same prompt, and you can compare them with your eyes. You can see which is better and where each one falls short. We have a medium dataset, which I usually use with the raters and in the evals. That's about 300 examples.
A large dataset is about 3,000 examples or more, depending on the context and the product. That big dataset is more for training, validation, and fine-tuning. First, we create our datasets. Second, once I had the 100-example dataset, we went to prompt engineering: just take the foundation model, do some prompting, generate a summary for this, and get the results. Maybe do another one with, generate a summary in three bullet points. Generate a summary that is shorter than 250 words. We try different prompts so we can evaluate this thing.
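A minimal sketch of this prompt-variant experiment over the small dataset might look like the following. The `generate` callable is a placeholder for whatever LLM client is in use; it is not an API mentioned in the talk.

```python
# Try several prompt variants over a small (~100 example) dataset so the
# results can be eyeballed side by side.
PROMPT_VARIANTS = {
    "plain": "Generate a summary of the following article:\n\n{article}",
    "bullets": "Generate a summary in three bullet points:\n\n{article}",
    "short": "Generate a summary shorter than 250 words:\n\n{article}",
}

def run_prompt_experiments(articles, generate):
    """Return {variant_name: [summary, ...]}, one summary per article per variant."""
    results = {}
    for name, template in PROMPT_VARIANTS.items():
        results[name] = [generate(template.format(article=a)) for a in articles]
    return results

# Usage, with any callable that takes a prompt string and returns text:
# summaries = run_prompt_experiments(small_dataset, generate=my_llm_call)
```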
Then, in some of the cycles, we also have model customization. This is when we use RAG or fine-tuning to improve the model. Then there is the evaluation, which uses evaluation criteria. We send these examples to raters, people who will say whether this is good or not, or better or worse. From that, we find patterns. We use these patterns to add more data to the dataset, and go again. This is a cycle. We keep going until we feel that the quality is good and the result of the evaluation is good.
What comes next is, how do we make sure the evaluation is good? What is a good summary for me may be a bad summary for Alfredo or for someone else. That's why we have what we call evaluation criteria. Let's try it out. What you see on top is the abstract of this talk. It's on the QCon page, and it basically describes what I'm going to talk about in this talk. I asked LLMs to generate three summaries: 1, 2, and 3. The first summary was in a paragraph format. It actually talked about the presentation, about the essential tools and skills to develop GenAI products, about covering core product manager principles, considerations for building with LLMs, and so on. The second one is very informal. It's like, GenAI stuff, AI products, or whatever. Yes, it's a summary, but it's in a very different tone.
The third one is just three bullet points, very short. It doesn't have a lot of detail. Who here thinks that the first one is the best summary for this text? Who thinks the second one is the best summary? Who thinks the third one is the best? As you can see, we have different perceptions. This is subjective. That's why we need to create evaluation criteria. This is an example, not the one that we created.
In these evaluation criteria, we have four different things. We have format quality: is the response clear, is it like human language, does it have the right amount of words or the right format? Completeness: does the summary have all the key points from the original text? Conciseness: does it have only the necessary content, without being repetitive or too detailed? Accuracy: does the summary have information only from the original text? What we do with this, once we set up the evaluation criteria, is send the criteria with a sample of about 300 prompts or texts to raters and ask them, can you evaluate this? Usually they use tools like this, where they can see two or three summaries next to each other.
They have to rate, from 1 to 5, format quality: which of them has the best format? You can see that number 3 is a little more structured; it has more bullet points than number 2. Which of them is more complete? Number 1 has more details than number 3. Which of them is more accurate? You can see, for instance, that number 3 is maybe not that accurate, because it says LLM considerations and I don't know if that was part of the original text. Conciseness: number 3 was more concise. Usually we use two or three raters to rate each of the data points, and we can compute an average. This is a 300-example dataset, so that actually gives me an indication, because some texts are going to be tricky, but others are not. In general, we will have a sense of how good the summaries are based on these evaluation criteria.
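A minimal sketch of how those per-rater scores could be aggregated, assuming the four criteria and the 1-to-5 scale described above; the record schema is an assumption, not the actual rating tool.

```python
from collections import defaultdict
from statistics import mean

# The four criteria from the talk, rated 1 to 5 by two or three raters per example.
CRITERIA = ["format_quality", "completeness", "conciseness", "accuracy"]

def aggregate_ratings(ratings):
    """Average the scores per criterion, first within each example, then across
    the ~300 examples. Each item in `ratings` is one (example, rater) record,
    e.g. {"example_id": 17, "rater": "r2", "format_quality": 4,
    "completeness": 3, "conciseness": 5, "accuracy": 4} (schema is illustrative)."""
    per_example = defaultdict(lambda: defaultdict(list))
    for record in ratings:
        for criterion in CRITERIA:
            per_example[record["example_id"]][criterion].append(record[criterion])

    overall = {}
    for criterion in CRITERIA:
        example_means = [mean(scores[criterion]) for scores in per_example.values()]
        overall[criterion] = mean(example_means)
    return overall
```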
Risk Assessment for GenAI
The last point is the risk assessment. Once we have all our summaries and the full product, we need to think about three topics: trust and safety, legal, and privacy. This will also be part of the dataset generation later. For instance, can we summarize articles that talk about which mushrooms we can use to make a meal? If so, how do we check that the generated summary is actually safe and will not do any damage to any of our users? We have seen examples of this, where AI told users to eat poisonous mushrooms. We don't want that to happen. We really need to think about our output. Is it safe, always safe? What are the risk patterns? As in traditional software, where we were always thinking about the attack vectors, what are the risk patterns we have in this product? How do we add data to the dataset so we can prove that the model is reacting well? Then we have legal concerns.
One of the things I always care about when we talk about legal is, where is the data coming from, the data that we are using in our datasets? Do we have access to that data? Is it generated or not? Also, legal comes into play when we have things like, for instance, disclaimers. What are we going to tell our users about the use of LLMs? How do we tell them that we may have hallucinations? Privacy, which is, what do we do with the personal information we feed this machine? In some cases, there is no personal information, which is great. When we have personal information, how do we create datasets to prove that the personal information is not put at risk?
Conclusion
All of the tools we learned building products are useful for GenAI products. There are a few new skills and things that come into play, but these things will keep changing. New tools will be added. New processes will be created. In GenAI, the only constant is change.
Questions and Answers
Participant 1: You talked about evaluation criteria. Could they have been generated by AI as well?
Alborno: Yes. The evaluation criteria are not only the definition of which areas we want to evaluate; we also need to provide descriptions of those criteria to our raters, and of what it means to give one star versus three stars versus five stars. Usually, all of that goes into a description, because when you have a team of 20 raters, we need them to use the same criteria to evaluate.
Participant 2: You mentioned that the definition of a summary is different for different people. You gathered some data through user experience research. Google has those resources, but not every company does. Is there any product intuition you can share with us for approaching these problems when you don't have those resources?
Alborno: You start with product intuition, yes, that's very important. Then, I think team food and dogfood are very important, and having the instrumentation in team food and dogfood to measure success. Sometimes when we develop products, and this is not just GenAI products, products in general, we leave the metrics to the end. We didn't have time to implement the metrics, and we release anyway. Sometimes we do that, and then we miss gathering this information that is so important. Start there, with good metrics. Then try it in team food and dogfood, with small groups of users. Then test different things until you have something that works better, and then you go to production.
Participant 3: You mentioned at the beginning that for the same prompt you’re going to get different results. How do you then manage to make your development lifecycle reproducible? Do you have any tips for that?
Alborno: In my experience building GenAI products over the last two years, the best way I've seen to control the diversity of results was with fine-tuning and providing examples to the machine of what I want it to generate. In the beginning of Summarize, and this was the beginning of GenAI, it was very different from what it became just a few months later. In the beginning, we would say, generate a summary under 150 words, and the thing would generate a 400-word summary. We were like, "No way, we cannot ship something like that". We had to use a lot of examples first, so the machine understood what a 150-word summary is. If you use Gemini 1.5 now, for instance, it knows what 150 words is. It's better at following instructions. You may not need that fine-tuning.
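As a rough illustration of "providing examples to the machine", here is a sketch of assembling length-respecting fine-tuning pairs. The record format is an assumption; real fine-tuning APIs each expect their own schema.

```python
def build_finetuning_examples(pairs, max_words=150):
    """`pairs` is a list of (article, human_written_summary) tuples. Keep only
    examples that respect the length constraint, so the model learns the
    behavior we want rather than the 400-word failures."""
    examples = []
    for article, summary in pairs:
        if len(summary.split()) <= max_words:
            examples.append({
                "prompt": f"Generate a summary under {max_words} words:\n\n{article}",
                "completion": summary,
            })
    return examples
```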
Participant 3: For the evaluation study, did you use human raters, or did you also use GenAI?
Alborno: We used both, because we have the resources. You can hire a company to do the ratings, and we used that, because it's important to have human input. We also built automatic rating systems. That was our first version of those systems; now they are much more complex than that. What I'm doing with my friends as well is to ask an LLM, in a second instance, to rate the previous result. You feed it the rules and the evaluation criteria, and it's going to generate good ratings, too.
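A minimal sketch of that second-instance rating idea, where an LLM is asked to score a summary against the evaluation criteria. The `call_llm` function is a placeholder, not an API named in the talk.

```python
import json

# Ask a second LLM to score a summary against the four criteria from the talk.
JUDGE_PROMPT = """You are rating a summary of an article on a 1-5 scale for:
format_quality, completeness, conciseness, accuracy.
Accuracy means the summary contains information only from the original text.
Return JSON like {{"format_quality": 4, "completeness": 3, "conciseness": 5, "accuracy": 4}}.

ARTICLE:
{article}

SUMMARY:
{summary}
"""

def llm_judge(article: str, summary: str, call_llm) -> dict:
    """`call_llm` takes a prompt string and returns the model's text response."""
    response = call_llm(JUDGE_PROMPT.format(article=article, summary=summary))
    return json.loads(response)  # assumes valid JSON back; add retries in practice
```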
Participant 4: I'm curious how you protect against, or think about, bias in your evaluation criteria. Because even as I was thinking about summarizing, women, on average, use a lot more words than men, and people probably have different ideas of what they want a summary to be. I'm just curious how you approach that in general, and with the evaluation criteria being specific, how do you define what complete is?
Alborno: What we relied on in that sense is that, because this summary is a summary of the original text, the original text may have bias and then the summary will have bias. That's just by design, because what we tested is that the summary reflects what is in the original article. If you are reading an article whose content you don't agree with, and you know it has flaws, and you generate a summary, you should be able to recognize those flaws in the summary. In this case, for a summary, it's not a problem, because it's going to be based on the original text. You don't want to insert any new bias, and that you measure by comparing those two things. In other products, you may have bias, and that's a different matter.
Participant 5: This is a good description of product development, of how you launch in the first place. How do you then think about ongoing product management? The core tech stack continues to evolve. There's the knowledge cutoff. There are new capabilities. When do you go back and revisit?
Alborno: LLMs keep evolving. Every version is better than the previous one. I think what is important in my lifecycle is that the dataset, the evaluation criteria, and the process we use to make sure the results are good do not change. You can change the machine, you can fine-tune it, make it smaller, or bigger, or whatever. You still have the framework to test and produce the same quality, or to check whether the new model has better quality or not. It's very important to invest the time in: what are my requirements, what are my evaluation criteria, and what is the dataset I will use?
Participant 6: How do you deal with different languages potentially having different constraints on that? For example, we were discussing Spanish. You usually need more words to express the same thing. With 300 words in Spanish, you may be able to convey less content than with 300 words in English.
Alborno: First, you want your dataset to be diverse and representative of your user base. If your user base speaks multiple languages, then you have multiple languages in your dataset and your criteria. Second, we set the number based on the time it would take to read those words. Because Assistant is a system that is also voice-based, where you can ask, give me a summary, and listen to the summary, we decided that the summary should be readable in less than two minutes or something like that. You create a metric that's not based on the language, but on the amount of attention you need to process the output.
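A minimal sketch of a listening-time check along these lines; the words-per-minute figures are rough assumptions, not values from the talk.

```python
# Language-agnostic length check based on estimated listening time rather than
# raw word count. Spoken-rate figures are illustrative assumptions.
WORDS_PER_MINUTE = {"en": 150, "es": 160}

def listening_time_minutes(summary: str, language: str = "en") -> float:
    words = len(summary.split())
    return words / WORDS_PER_MINUTE.get(language, 150)

def fits_attention_budget(summary: str, language: str = "en", max_minutes: float = 2.0) -> bool:
    return listening_time_minutes(summary, language) <= max_minutes
```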
Participant 7: I was reading through your data lifecycle diagram, and it presents a system that has been running for a long time and probably has momentum to it. What happens if one of the components fails? For example, a key criterion fails, or you have raters and it turns out the rating company you use is not trustworthy [inaudible 00:35:24]. How do you stabilize the system if there's a sudden crash?
Alborno: Between the evaluation phase and the dataset augmentation phase, you need to do some analysis. Even if everything looks good, you still have your small dataset that you can visualize and check one by one. If there are big errors, you will find them, because there are so many raters and so much data in the dataset that you will be able to highlight those errors. Sometimes what we do is remove some of those outliers from the dataset, or pay special attention to them and increase the dataset using more examples like those. It's all about analyzing the results of the evaluation phase.