Transcript
Galazzo: The goal of this session, based on the title, is to tell you how to create your own large language models. To get the best takeaway, you must understand what the real goal is. My real goal is to give you tips and share the mistakes I made along the path, from real life. I don’t want to go deep into technical aspects. Sometimes I will tell you something technical, but the most important thing here is to get a broad view of these tools. I will share with you a lot of sources, links, and articles to read that helped me save a lot of time. That’s the real goal. We said that we have to create our own large language model, but why should I do that? We already have OpenAI, which needs no introduction. Why should I do it?
The first answer could be because I’m a nerd. We said that we come from real life, and as a CTO, I have to deal with cost and timing. People want your product done very well, quickly, and not expensive. That doesn’t match being a nerd who can burn a lot of time experimenting with everything he wants. To do that, we will use some sources from, of course, very large players like Microsoft, Hugging Face, GitHub. The most important thing is, do not believe that it’s easy. Sometimes it can be really painful, so it must be worth it.
Fine-Tuning an LLM
When should I create my own large language model? Basically, we believe that OpenAI is God: it never fails, it’s always perfect. That’s not true. Again, from real life, it happens that there are a lot of mistakes. Mistakes because it cannot know everything about your business. Then, if you need the answers that come from OpenAI, or from other services like Anthropic with Claude, to be error-free (truly error-free is impossible, but let’s call it that for now), you can consider training your own model. Let’s start with some suggestions.
My choice, because I am telling you my experience and my feedback, is to use Mistral. Mistral, because based on my tests, it was the best model to train. I also trained Llama3. It was great. How easy it is to train is, based on my experience, where Mistral wins the comparison today. This is a world that changes every day. Maybe next week, Llama3 will be better than Mistral. I can only tell you how things are today, unfortunately. People know that version 3 of Mistral has been released, but based on my tests and my experience, it is still not quite ready, and I still prefer version 2.
How can we train a large language model like Mistral, like Llama3, whatever you want, like Gemma, or Phi from Microsoft? The best starting point is to use a technique named LoRA. How does it work? First, we need to understand why we need this technique, because training a large language model is very expensive. We are talking about millions of dollars, and none of us has millions of dollars, nor the resources, nobody. Just as an example, to train a very small model that was released just for researchers, a toy, Microsoft Phi needed two weeks of training with 96 GPUs working all day. Consider that I pay more than €2,000 per month for one GPU. You can understand how expensive it can be to train a large language model. How can we save money? Using techniques like LoRA.
To simplify a lot, we have to imagine a large language model like a big matrix. It is not exactly that, but let’s imagine, to simplify, that it is like a big matrix. We said that training the whole matrix is too expensive. Then, we use a technique named LoRA. How does it work? It leverages a math property: basically, multiplying two vectors, you get a matrix. The basic idea is, instead of training the whole matrix, what happens if I train just the two small factors, A and B, multiply them, and add the resulting matrix as an overlay over the original matrix, the base model? You can understand that you have reduced the number of parameters you have to train by a lot. It means that I can train a model like Llama3 or Mistral, which is very huge, comparable to ChatGPT, in two days, three days if I really want to do my best. It saves a lot of money.
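To make that parameter saving concrete, here is a minimal numerical sketch of the idea (illustrative shapes only, not taken from the talk): instead of updating a full d×k weight matrix, LoRA trains a d×r factor B and an r×k factor A, and overlays their product on the frozen base weight.

```python
import numpy as np

d, k, r = 4096, 4096, 32           # illustrative layer size and LoRA rank

W = np.random.randn(d, k)          # frozen base weight, never updated
B = np.zeros((d, r))               # trained LoRA factor, usually initialized to zero
A = np.random.randn(r, k) * 0.01   # trained LoRA factor

alpha = 64                         # the alpha scaling factor, discussed below
W_effective = W + (alpha / r) * (B @ A)   # low-rank update overlaid on the base model

full_params = d * k                # 16,777,216 parameters to train without LoRA
lora_params = r * (d + k)          #    262,144 parameters with LoRA (~1.6% of the full matrix)
print(full_params, lora_params)
```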
To do that, I suggest using libraries from Hugging Face, such as Diffusers, and that’s the configuration that you need. Why am I here? To tell you about that slide. As a joke, I said that that slide cost my company €20,000. Just that one. Why? Because I spent a month understanding how to tune the parameters, how they work. It’s just one slide, it’s just one parameter, but to understand the best value, I spent a lot of time and a lot of money. The r-value is what you must understand when training with the LoRA technique. We said that r is the dimension of the factors that I want to train. If that dimension is just 1, each factor is a vector, of course, but if it’s bigger than 1, each factor is a matrix, and multiplying them, you still get a matrix.
The more parameters you train, of course, the better it is. That’s false, or at most partly true, because I realized that when the parameter is between 32 and 64, you get a good result, but if you increase that value too much, you run into overfitting problems. It’s not always true that the bigger it is, the better it is, in this case. Another very important parameter, one that I struggled with a lot, is the alpha value. You just need that to do a good job. What is it? To simplify a lot, you have to imagine that it’s a parameter that says how much emphasis to put on what I trained when adding it over the base model, because we said that the LoRA technique trains the two factors, multiplies them, and adds the result over the original model. It’s a kind of damping factor.
If it’s greater than 1, it means that you put much more emphasis on what you trained compared to the base model. If it’s less than 1, I guess you can understand. It’s easy. Again, do not exaggerate. It’s not true that the bigger it is, the better it is. I use a value that is a factor of 2: in this case, it’s 64 because the r-value is 32. Do not exaggerate. No more than a factor of 2. That’s my suggestion.
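As a concrete reference, the r and alpha values discussed above map directly onto the fields of a LoRA configuration. This is a minimal sketch using Hugging Face’s PEFT library (my assumption for the library; the checkpoint name and target modules are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # illustrative checkpoint

config = LoraConfig(
    r=32,                                 # rank of the trained A/B factors
    lora_alpha=64,                        # emphasis of the LoRA overlay, a factor of 2 over r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projection layers receive the overlay
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()        # shows how few parameters are actually trained
```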
We said that by training a LoRA, I can specialize a large language model for a specific task. Maybe for lawyers or for marketing, whatever you want. If it is just a matrix that I add over the base model, what if I could train more than one LoRA, each on a specific task, and swap the LoRAs in real time? It means that I can easily increase the power of my model, by a lot. You can do that, of course. It’s very complex. Don’t do it from scratch. I can tell you that this library works perfectly. Then, if you like this solution, use that library, in my opinion.
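If you want to try adapter swapping, here is a minimal sketch of the idea with Hugging Face’s PEFT library (my assumption; the library on the slide may be different, and the adapter repository names here are made up for illustration):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Attach one task-specific LoRA, then load a second one next to it
model = PeftModel.from_pretrained(base, "my-org/lora-summarization", adapter_name="summarization")
model.load_adapter("my-org/lora-translation", adapter_name="translation")

model.set_adapter("summarization")   # route requests through the summarization LoRA
# ... run inference ...
model.set_adapter("translation")     # swap to the translation LoRA without reloading the base model
```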
Merge Models (Franken Models)
LoRA is great, but I don’t like swapping models too much. Another technique that I like very much is merging models. I get better results merging already trained models than with LoRA. I’m not saying that LoRA is not good. LoRA is great. I use it a lot. This is just another technique. What is the idea? The idea is to take different models already trained by someone else, which you can download from repositories you can find everywhere, and merge them. I did it with images. The result was impressive. There are two tools. The most famous is mergekit, of course. There is another one that I have started to prefer over mergekit, named safetensors-merge-supermario. It works. Here, as you can see, just one line of Python command and you can merge different models, and you have the magic: no training, nothing.
Of course, what is the difference compared to LoRA? With LoRA, you have to own your own dataset. You have to spend three days, a week, I don’t know how much in GPU cost, but you get the most accurate result for your task. When you merge models from someone else, the cost is zero, but of course the models were trained by someone else, so you cannot complain that the result is not perfect for your task. For images, it’s the best solution. I saved a lot of money on image creation. For natural language processing, I suggest LoRA as the technique, because it’s much more accurate.
There are different techniques that you can use while merging different models. The easiest case is when you have models with the same architecture and the same configuration. That’s easy. Just two minutes of work. You can use these techniques, which differ a bit in how they merge the weights between the two models. This is the easiest case. Different is when we have the same architecture, so the same model, but a different initialization, because maybe someone else changed some parameters a little bit. Both models are Llama3 or Stable Diffusion.
It’s the same architecture, but maybe someone else changed something, some layers, and so they do not match perfectly. You need other techniques. In this case, here are the sources for the methods that you need to use to merge. The worst case is when we have totally different models, with different initializations. Here you have to do so much work that I do not suggest merging when you have totally different models. It’s too much work, and, in my opinion, it isn’t worth it. If you want to try it, you are free to do that.
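For the easiest case described above (same architecture, same configuration), here is a minimal sketch of what a merge amounts to conceptually: a weighted average of the two checkpoints’ weights. Real tools like mergekit offer smarter strategies; the file names here are illustrative.

```python
import torch
from safetensors.torch import load_file, save_file

# Two checkpoints of the same architecture and configuration (illustrative file names)
model_a = load_file("model_a.safetensors")
model_b = load_file("model_b.safetensors")

t = 0.5  # interpolation weight: 0.5 is a plain average of the two models
merged = {name: (1 - t) * model_a[name] + t * model_b[name] for name in model_a}

save_file(merged, "merged_model.safetensors")
```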
Moe (Mixture of Experts)
Another technique is Mixture of Experts. We said that we fine-tune a model because we want that model to have a task specialization, and LoRA works great. Even with LoRA, we said, ok, we can swap if we want, we can change the task on the fly: the LoRA that is specialized in telling stories, and the LoRA that is specialized in math or in writing code, let’s say. That’s great, but we lose something. Because, for example, say I want to translate something, and at the same time I want to do a summarization. If I use a LoRA that is specialized in summarization, everything will work, but it is not able to leverage the knowledge of the LoRA that is specialized in translation, because they do not talk to each other.
With this technique, we take different models, of course of the same architecture, that’s very important, you cannot mix models of different architectures, but with different initializations, and we create an array of experts by adding some layers and some gates, like a switch between layers. We merge all of them in order to allow the flow of the query to follow the path of the best model, the best route, the one that has the knowledge to solve that problem. Of course, I’m simplifying how it works, but that’s enough.
The good thing is that you do not activate all the weights. You have a model that can be 400 billion parameters, the sum of all the models together, so it’s very expensive to run, but with this technique, you use just a few, a portion, a branch of the model, saving cost. It works, but the bad side is that all the models stay in RAM, and so, despite consuming 10 times less than a model of the same size in terms of GPU and inference time, the amount of RAM that you need is still huge. You cannot have everything in life.
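Here is a minimal sketch of the routing idea (illustrative dimensions, not taken from the talk): a small gate scores the experts for each token, and only the top-scoring experts are run, so most of the weights stay idle even though they all sit in memory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=512, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)   # the "switch" that scores each expert
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)   # how well each expert fits each token
        weights, chosen = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # run only the chosen experts, not all of them
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 512)
print(moe(tokens).shape)   # torch.Size([10, 512])
```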
How do you create a Mixture of Experts? Again, it’s really complex, but with this tool, a few lines of code, and you can create your model in a few minutes. What you need is to carefully create that configuration file, where you say: that’s the base model, those are the specialized models, and those are the prompts, some examples that say, when talking about music, please prefer that branch; when talking about code, prefer that model. This tool blends everything together in a few minutes. I tried it, it works perfectly.
Multimodal Models
Now, multimodal models. You know that at the beginning, not right now, ChatGPT, for example, was able to process just text, but how? What is the concept behind converting text into an input for large language models? Because LLMs are math; it’s matrices and numbers. How do we convert text to numbers? Again, using vectors. Vectors are everywhere here. The idea is to have a dictionary, and let’s say, to simplify, each word is converted into a vector. It is not exactly true, because we actually talk about tokens. Tokens are small portions of a word, because in this way, by combining tokens, we can cover all the words of all languages, instead of having a dictionary for Italian, for English, for Hindi, whatever you want. Combining small portions of words, you can create all the words of all languages.
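A minimal sketch of that text-to-numbers step, using the Hugging Face Transformers library (the checkpoint name is illustrative): the text becomes token ids, and each id is looked up in an embedding table that returns a vector.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

ids = tokenizer("Tell me the story of Napoleon", return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))  # sub-word tokens, not whole words

vectors = model.get_input_embeddings()(ids)               # one vector per token
print(vectors.shape)                                      # (1, number_of_tokens, embedding_dimension)
```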
To simplify, we say that everything is converted into vectors. Then, these vectors are used as the input to my large language model. Again, it’s a simplification, so people who know very well how it works will forgive me, but we don’t have to focus on the details. Here is the idea: text has been converted into numbers using this technique. What if I wanted to use images, audio, video as input? They are different from text. I don’t have that dictionary, but we can use techniques like convolution for images, for example. Everything is translated into vectors again, images, audio, video, with the same representation, with the same format as the text. Having the same representation, I can use that input as if it were text. Doing that is very expensive. I’m not talking only about money, but yes, a little bit, because time is money. Really, you need a lot of knowledge.
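For images, a minimal sketch of that idea (the shapes are illustrative): features coming out of a vision encoder are projected into vectors of the same size as the text token embeddings, so the language model can consume them as if they were text.

```python
import torch
import torch.nn as nn

# Pretend output of a vision encoder: 196 patch features of size 1024 (illustrative shapes)
image_features = torch.randn(1, 196, 1024)

project = nn.Linear(1024, 4096)         # map into the LLM's token-embedding dimension
image_tokens = project(image_features)  # now shaped exactly like a sequence of text embeddings

text_tokens = torch.randn(1, 12, 4096)  # embeddings of the text part of the prompt
prompt = torch.cat([image_tokens, text_tokens], dim=1)  # one mixed sequence for the LLM
print(prompt.shape)                     # torch.Size([1, 208, 4096])
```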
Again, here there is a tool that I suggest to you that helps to save a lot of time, and therefore, again, money. It’s named multi-token. Maybe there are other tools; this is the one I suggest to you. It works in a way that is more or less similar to GPT-4o or other multimodal models. Thanks to this technique, what you end up with is a model that is able to accept more than one kind of input, just by adding an input like that. For images, you pass the URL. The syntax is the same as the one you are accustomed to using with OpenAI, so messages, system, user, assistant. Now, you just have to add the tag, images, and by running just a few commands, you have a multimodal model. Now we are starting to have a lot of tools: we have fine-tuned our model, we can merge models, and afterwards, we can make it multimodal.
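As an illustration only (the field name and structure are my assumption based on the description above, not the library’s documented API), the request ends up looking like a normal chat payload with an extra images tag:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": "What is shown in this picture?",
        "images": ["https://example.com/photo.jpg"],  # the extra tag carrying the image input
    },
]
```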
TTS – Voice Cloning
This is a slide that speaks for itself, if you want to play with voice cloning. I don’t want to talk about ethical issues. I pay my mortgage with AI. I’ve dealt with artificial intelligence for more than 20 years, and even before that. I know how it works. I don’t believe that it will destroy our lives, the machines taking over and so on. There was just one time that I was really scared: when I heard for the first time the cloned voice of a friend of mine. Pay attention. It’s really powerful, but it works far better in English and with female voices. If you are a man, sorry, you need much more training. It’s a very quick project. I suggest you try it.
Performances and Optimizations
Now let’s say that we have created our best model. It’s wonderful, but now we have to talk about performances and optimizations, because that’s a pain. It’s a matter of user expectations, because people expect your model to answer in one second. Otherwise, they complain that it’s so slow. That’s one point. On the other side, if you are able to optimize your model, it means less cost. Because, for example, as we will see, if I reduce the size of my model, I can use a different machine on the cloud that costs less. Again, from real life, do not believe that everything comes by magic. You need very expensive hardware, and if you can compress everything, you save a lot of money.
The first technique is pruning. Again, here are a couple of tools that I want to suggest to you. You can download them, you can try them; these are the ones I want to suggest. I know that there are other tools. These are just my picks. How does it work? You have to know that not all the weights of our model are always activated. There are a lot of weights in our big matrices that, we don’t know why, are often zero. The basic idea is, what if I try to figure out which weights are not activated, or are used less than others, and simply cut them? I can reduce my model, and the smaller it is, the faster it is, and the less power I need to run it. That is how pruning works. I provided a couple of tools. We also have something else that, in my opinion, is much more effective, named quantization. It allows you to save a lot of memory consumption and speed up your inference a lot.
How does it work? The idea: you know that the weights of the matrices are floating points. Floating points are 32 bits. It means 4 bytes in memory. I can decide to use, for example, 16 bits instead of 32 bits. It means half the memory consumption. There are also techniques that convert that number into an 8-bit integer. The saving in terms of memory consumption and speed is not linear but quadratic. You can understand that it is very effective, but do not exaggerate, because of course, cutting bits, in theory, means also cutting accuracy. Based on my tests, FP16 is amazing. When you go to 8-bit integers, you can see performance drop. It’s up to you. I’m not here to say that one is better than the other. Everybody has his own needs. I have to tell you, be careful, don’t cut too much, and don’t reduce too much. The best option is FP16, in my opinion.
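A minimal sketch of both ideas on a single weight tensor (the sizes and ratios are illustrative, not from the talk): magnitude pruning zeroes out the weights closest to zero, and quantization stores the values in fewer bits.

```python
import torch

weights = torch.randn(1024, 1024)   # one layer of the "big matrix", in FP32 (4 bytes per weight)

# Pruning: drop the 30% of weights with the smallest magnitude (illustrative ratio)
threshold = weights.abs().quantile(0.30)
pruned = torch.where(weights.abs() < threshold, torch.zeros_like(weights), weights)

# Quantization: keep the same values in half precision (2 bytes) or as 8-bit integers (1 byte)
fp16 = pruned.to(torch.float16)
scale = pruned.abs().max() / 127.0
int8 = torch.clamp((pruned / scale).round(), -127, 127).to(torch.int8)

print(weights.element_size(), fp16.element_size(), int8.element_size())   # 4, 2, 1 bytes per weight
```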
How does a large language model work? Here we’re talking about generative AI. Basically, it releases one token at a time. It is the token with the highest probability given what came before; more or less, that is how it works. To do that, you need a cache named the KV cache. You have to imagine it like another matrix that keeps in memory what has already been produced. For example, if you say, tell me the story of Napoleon, it starts with “Napoleon was born”, and you see the text beginning to be created. To predict the next token, you have to know what has already been created. Everything is stored in this big cache. Again, we are simplifying a lot. Here is a very good reading that I suggest to you; there is the link. Why is it important? Here, there is a simple calculation.
Simple, more or less, but you can understand. If you have a model with that number of layers, and this number of tokens that you want to produce, here is an example of how to calculate the amount of memory you need for the KV cache. What does it mean? It means that if you use the default values when you run inference, maybe the size of the KV cache is not enough, or it is too much, so you are wasting money. To speed up your model in terms of inference, you must pay attention to the KV cache. You can set some parameters using the Diffusers library, of course, but there are some tools that are really specialized in this kind of optimization.
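A minimal version of that calculation, assuming Mistral-7B-like dimensions for illustration (32 layers, 8 key/value heads, head dimension 128, FP16 values):

```python
# KV cache size = 2 (keys and values) x layers x kv_heads x head_dim x tokens x batch x bytes per value
layers, kv_heads, head_dim = 32, 8, 128   # assumed Mistral-7B-like dimensions
bytes_per_value = 2                       # FP16
batch, context_tokens = 1, 4096

kv_cache_bytes = 2 * layers * kv_heads * head_dim * context_tokens * batch * bytes_per_value
print(kv_cache_bytes / 1024**2, "MiB")    # 512 MiB for a single 4096-token sequence
```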
I do not suggest sitting there trying to figure out what the best configuration for your model is. Tools like llama.cpp, TensorRT, and vLLM already do this kind of configuration. If you want to do it yourself from scratch, don’t worry, they allow you to configure the values that you want. My suggestion is to use these tools. llama.cpp is my favorite, but it’s just an opinion.
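As one example of letting a serving tool handle this, here is a minimal vLLM sketch (the checkpoint name is illustrative, and the exact arguments may vary slightly between versions):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    dtype="float16",               # reduced-precision weights, as discussed above
    max_model_len=4096,            # bounds the KV cache the engine pre-allocates
    gpu_memory_utilization=0.90,   # how much GPU memory the engine may claim for weights and cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Tell me the story of Napoleon."], params)
print(outputs[0].outputs[0].text)
```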
Really, the last topic is about RAG techniques. How does it work? You know that large language models have general knowledge. Here are some resources, and here is an algorithm that I developed. How does it work? The idea is to use a smaller model to calculate the signature of the paragraphs to retrieve from the database, using a kind of concatenation. Instead of having a big model for the embedding calculation, you use a smaller one, make a concatenation, and that works even better than large language models.
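The concatenation trick itself isn’t reproduced here, but as a baseline, this is a minimal sketch of the standard retrieval step with a small embedding model (the model name and documents are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # a small model, much cheaper than an LLM

paragraphs = ["Napoleon was born in Corsica in 1769.",
              "The KV cache stores keys and values of already generated tokens."]
doc_vectors = embedder.encode(paragraphs, normalize_embeddings=True)

query_vector = embedder.encode("Where was Napoleon born?", normalize_embeddings=True)
scores = util.cos_sim(query_vector, doc_vectors)     # similarity between the query and each paragraph

best = int(scores.argmax())
print(paragraphs[best])   # the retrieved paragraph is then passed to the LLM as context
```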
Am I Really Doing AI?
Am I really doing AI? A lot of people say, but this is not really AI, you’re just merging models, training on a small dataset. It’s a topic that I don’t know if it makes sense to discuss. You have to solve problems in your company. You have to save money. That’s the most important thing. It’s not a matter of being a nerd. This is a topic to discuss, for sure, at another time.
Questions and Answers
Participant 1: I have a question regarding the reduction techniques you mentioned at the beginning. You gave us some ballpark numbers in terms of time reduction, but not in terms of whether you actually need fewer GPUs. Could you give us a ballpark number saying, with this LoRA technique, for most use cases, whatever that means, you can get 90% of the performance of a completely manually tuned model with 10% of the cost, or some number like that?
Galazzo: Yes, unfortunately I cannot answer, because it depends on your dataset. It’s not a matter of the training itself. For example, we wasted a couple of months on training before we figured out that there was a mistake in the dataset. It doesn’t depend so much on the technique, but on the data.
Participant 1: What ballpark is the cost reduction, generally? Is it like you can save 95% of the cost, or more like you can save 10% of the cost?
Galazzo: No, up to 90%.
Participant 1: Really large numbers.
Galazzo: Yes, it’s really powerful. This is the reason why I have to suggest being careful, because it works so well that you can lose generalization. It learns too much.
Participant 1: That’s the overfitting you mentioned.
Galazzo: It’s not quite overfitting. It’s a little bit different from overfitting. The model will still be able to perform, for example, the summarization, but it learns your specific task so well that sometimes it could be a problem. Not always, it depends on your task.
For example, if you trained a LoRA for translation, maybe you don’t see any differences, so you don’t lose much generalization. In my case, for example, I trained a LoRA to produce a specific JSON structure in the answers, because large language models are not able to be consistent. You say, please, I want JSON, because it must be processed by Python code or by other APIs. It never happens. Sometimes it replies with malformed JSON, sometimes with missing keys or different names. You have no idea how bad it can be. I trained it to have much more consistency in the JSON replies. It worked, but I realized that it got so good at producing JSON in the answers that it lost a bit of its reasoning capabilities, for example.
Participant 1: You turned it into a generalized JSON parser that couldn’t do anything else.
Galazzo: In that case, for my company, for that project, that was my goal. There was no need for that generalization. For me, it was good. Be careful, because it’s so powerful.
Losio: From what you’ve mentioned so far, I got the idea that if you’re OpenAI, you can lose billions a year. If you’re Microsoft, you can probably do the same. We discussed 90%, 80% savings, but what’s the minimum budget if I want to start my own project? How much money am I supposed to lose?
Galazzo: It’s not a sport for poor people. What I can tell you, for example: at home, just for playing, for research, like a playground, I have a computer that now, with all the accessories, costs more or less €10,000. I have two GPUs, liquid-cooled, the best, but it’s just a playground. You can do something with a good GPU and a computer for €3,000, maybe.
Participant 2: What about cloud?
Galazzo: About the cloud, I can say that an A100 on Azure, for our company, costs more than €2,000 per month, for just one GPU.
Participant 2: If you run it 24/7?
Galazzo: Yes, 24/7. No, not for training. For training, we use another machine that has 4x A100s with 80 gigabytes of memory. It costs €10,000 per month. I turn it on just when I have to do training, so two or three days of training, and then I turn it off right away, because it’s so expensive. And these are, really, GPUs for poor people. Not the T4, the T4 is just for a child. An A100 is a very good GPU, but it’s not the H100, which is even more expensive. For training, you need 10 times more power. Only when training do I turn on the very expensive machine.
For inference, you need a smaller GPU, so of course we can use the A100. You can save a lot of money if you reserve GPUs on the cloud, up to 70%, but you have to reserve for 3 years. That’s a very big problem. Why is it a big problem? Because this is a market that changes every day. A big issue as a CTO is what to do, because if I reserve an A100 for 3 years, maybe in six months it’s already old, and I cannot go back, so I wasted money. It’s not that easy to save money when we talk about GPUs. Yes, you see programs for startups, programs that say save up to 80%, but from real life it’s not that easy, for that reason. You need money.