Transcript
Olimpiu Pop: Hello, everybody. I’m Olimpiu Pop, an InfoQ editor, and I have Jade Abbott with me to discuss some of the most critical aspects of LLMs and AI. But without any further ado, Jade, can you please introduce yourself?
Jade Abbott: Yes. Hi, everyone. Thank you so much for having me on this podcast. My name’s Jade Abbott. I am the chief technology officer and co-founder of African AI startup Lelapa AI, where we do language technology for African languages.
Adapting your model for the infrastructure constraints makes it cost-effective, too [01:01]
Olimpiu Pop: Thank you. Your title for the presentation was exciting because I am drawn, like a moth to the flame, to these ecological and ethical aspects, but you had them all together. So, given that you’re an entrepreneur and a co-founder, what utopia are you aiming for? What’s your mission? What are you trying to solve with Lelapa AI?
Jade Abbott: We could take the small view of the problem we care about, and then there’s also the big, broader one. I say small, but it’s actually very large. For us, it’s to enable universal communication. Right now, I think, as a society, communication is our biggest tool, and I don’t think we’ve even mastered it when we’re speaking the same language, never mind when we’re speaking different languages. So to really expect the world to move on and collaborate and build towards a better future, we really need to be able to speak in each other’s languages and be able to communicate with everyone. And that’s the long-term dream of Lelapa itself: really building the tools and technologies to enable that.
But in order to do that, we have to solve a number of other problems, and those other problems are really helpful, not only for this particular issue that I’ve raised, but they’re helpful for the entire world. And so, here, we’re really looking at how we can utilise AI tooling to make it more efficient, whether it be data-efficient or compute-efficient, in order to still solve those problems. And the reason this is beneficial for everyone, and this is kind of touching on that, as an entrepreneur, as much as we want to be sustainable and ethical entrepreneurs, the other motivation is that it costs less. So, you’re making it more affordable, accessible, and growing your markets by making AI smaller, less dependent on large amounts of data, and more efficient.
Olimpiu Pop: That makes sense. And in the end, given your strong focus on the African continent, that means 1.5 billion people, if I remember correctly. It also means development is happening at several different stages at once. For instance, I saw an article a couple of years back reporting that drones were being used to deliver medicine, because there was no road infrastructure. So that’s creative thinking. However, as you mentioned, there are constraints. What are the constraints that you have to work around, the ones that shape the way you build things more than what meets the eye? Because we, in Western society, are accustomed to having infinite streaming and internet, but something tells me that you have a whole different set of constraints.
Jade Abbott: Yes. On the African continent, there is a significant deficit in certain infrastructure elements that are commonplace in other parts of the world. When you turn your light on, you don’t even think about the fact that it turns on and that there’s an entire network of electricity that supplies it. And for vast amounts of the African continent, reliable, consistent electricity isn’t even something you can have. So when we’re thinking about training AI models, that might take days or weeks, and then we have to serve them to people who may not have devices that can handle high battery loads and similar requirements. That’s just one of the significant issues you’re seeing.
On the other hand, it’s similar to your point about the leapfrogging component, where Africa can leapfrog certain technological advancements. We can move quite quickly with it, because we don’t have legacy systems holding us back. There has been a massive penetration of mobile devices in Africa. In some countries, it’s even greater than one, so each person has more than one mobile device. They may not have a laptop, they may never have access to a computer, but they do have high mobile penetration. And so you’ve kind of got these interesting contrasts where we don’t necessarily even have a good supply of water or a sustainable supply of internet yet, the internet infrastructure is still being built, but there’s clearly an appetite and need for these devices. And so, with that, you’re working in a very different space. You don’t have reliable internet connectivity.
What does that mean for serving an AI model that’s sitting on the cloud? Does it need to be on the edge? Does it need to be on the phone itself? If we don’t have large amounts of electricity or water availability, how are we building data centres? How are we getting access to GPUs that aren’t on the African continent? What does that mean for the data sovereignty that we have? Europe’s got really great protocols with GDPR, and that’s all good and well, because you’ve got a lot of local data centres, and a lot of those local data centres have accelerated compute, whereas this is only changing now on the African continent.
And I think the last one is this data aspect. Some large percentage of the internet, 98 percent, I think, or maybe even higher, is English. And not only is it English, it’s mostly English, Western, white, male. So if we’re building all these LLMs and using this as the basis for our data, we’re just not going to have enough data to build them for the languages that we have. And so we have this big data constraint, where people, historically, have been punished for publishing in their own language.
And this has happened across a number of countries. People were forced to take on the colonizers’ language, whether it be English, French, or Portuguese. When they spoke out of turn at school, they were punished, put in detention, expelled. If someone published in their own language, they were actually imprisoned. So there’s really not been an incentive to capture the text component of this and have it digitised. And so when we’re building these models, you don’t have all this data available. You need to figure out ways to be clever about the data that you actually collect.
Divide the initial “big” problem into multiple smaller ones to “conquer” it optimally [06:43]
Olimpiu Pop: So, first of all, it’s about the basic infrastructure that we take for granted in, let’s say, the Western world, where you have to be creative about how you do the basic things when you’re discussing models: how you train them, and how you keep them accessible over long stretches of time. And from my understanding, what you mentioned on the mobile side is being mobile-first, which doesn’t necessarily mean mobile internet; it might be the case that people have phones but don’t have internet, so that means text-to-speech and other approaches. This starts to become a very heterogeneous ecosystem, something that is an interesting engineering challenge, but also a headache from the operational point of view.
The other point you mentioned is the scarcity of data at hand for building those models. The project that comes to mind now is Project Gutenberg, which has transcribed, or put into digital form, a significant number of books from Western society. But, on the other hand, we are discussing a continent that didn’t have anything like that. A lot of things got lost, or they were only kept verbally. This is what I recall from your presentation as well: that many things are still oral, passed from one person to another through voice. Congratulations on taking on this challenge; it’s an impressive one, but how do you crack it? What are the lessons learned, so far, on tackling that?
Jade Abbott: I mean, I think the first lesson is you have to be cautious of the hype that bigger is always better. We’ve seen in a couple of proof points that having a bigger model and having more data doesn’t necessarily mean that you’re going to end up with significantly better results. And second to that, what are we solving for? A lot of AI is solving for improving AI for its own sake, seeing how far we can push it, rather than grounding it in a particular problem. And what we find is that when we ground it in a particular problem and say, “Cool, well, we want to solve for XYZ in education or ABCD”, the scope of your needs, of your data, and of the size of your model becomes much clearer. You don’t have to solve every problem. You don’t have to get infinite data, because that data isn’t available.
And so that kind of focus tunes the type of data you want to gather. On the data component, it’s really about focusing on what data you need. What data will solve your problem? Who are your users? What are they using it for? It’s less about the extraction of data; we think about data creation. How are we creating data? And, as I mentioned, this is quite an expensive task to undertake, both in terms of time and money. So how are we incentivising the creation of data that meets our requirements? You’re taking this huge, general problem and saying, “Okay, well, we’re not going to start with general AI; let’s start with something that solves the problem”. And that kind of grounding has been massive for us in doing that.
And so we focus on techniques that make things smaller and more efficient, that can get the most out of the little data that we do have and can create, but also on making sure the problems we’re solving are scoped and grounded in the real world, and tackling them as such, not as one big gigantic general problem. People often discuss AGI, where the ‘G’ refers to general intelligence. And I sometimes ask, “What is general? General for whom?” Often that does not include the majority world. And so we start by solving the actual problems, and one day that might build up to something that could be more, I’d say, general, as in more domain-inclusive, spanning multiple languages, et cetera.
Olimpiu Pop: You take the big problem, where my feeling is that the general ecosystem tries to solve the problem without knowing what we actually want, so they just throw something out up front: okay, we have this thing, but it doesn’t actually solve anything. And you apply divide and conquer, break it down into smaller pieces, and then optimize each one, not to the optimum, not to 100%, but to a good-enough percentile.
Jade Abbott: Whatever adds value. Yes.
Olimpiu Pop: Exactly. So that also means that you’ll have more or less a model zoo where you just have different models and then you use them on different problems, use a different model depending on what you have, and then that means cascading those, I don’t know, requests or something like that.
Jade Abbott: Yes, essentially we build up a lot of smaller models. And even now, when I look at everyone else, I think part of the “LLM” label is a branding thing. To say that there’s one model behind it is actually a lie, because I already know that there are multiple models. We’re just very explicit about it: we’ve got 10 of these models that we can switch between, that we can use as a mixture of experts, et cetera.
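To make the idea concrete, here is a minimal Python sketch of routing requests across several small, task-specific models behind a single endpoint. It is purely illustrative; the model names, tasks, and language codes are assumptions, not Lelapa AI’s actual setup.

```python
# Illustrative only: several small, task-specific models behind one endpoint.
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class ModelEntry:
    name: str                    # e.g. a model identifier (hypothetical)
    languages: Set[str]          # language codes the model was trained for
    task: str                    # e.g. "transcribe", "intent", "translate"
    run: Callable[[str], str]    # the model's inference function


class ModelRouter:
    def __init__(self, models: List[ModelEntry]):
        self.models = models

    def route(self, text: str, language: str, task: str) -> str:
        # Pick the first registered model that covers the language and task.
        # A production router could also weigh latency, size, or confidence.
        for m in self.models:
            if m.task == task and language in m.languages:
                return m.run(text)
        raise ValueError(f"no model registered for task={task}, language={language}")


# Hypothetical usage:
# router = ModelRouter([
#     ModelEntry("intent-zul-small", {"zul"}, "intent", zulu_intent.run),
#     ModelEntry("intent-multi", {"zul", "swa", "eng"}, "intent", multi_intent.run),
# ])
# router.route("Sawubona", language="zul", task="intent")
```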
Generating high quality data when none is available [11:47]
Olimpiu Pop: Yes, that’s quite interesting, because I think last year, in a similar discussion, there was an attempt to distinguish between the product itself and the model. Behind the scenes you have the model, but between the interface and the model there is a long way, a lot of parameters and everything else that determines how things work. So that’s quite interesting. You mentioned making data. What does that mean? Are you using synthetic data? I’m curious, given the scarcity of data you mentioned, and the fact that you cannot rely on typical things like, I don’t know, a vocabulary or something like that. Because if it’s verbal, it’s very easy for it to drift left and right, and it can probably vary from one village to another. How do you tackle that? Did I understand correctly that you’re talking about synthetic data, and if so, how do you manage to generate it?
Jade Abbott: So, once again, it comes back to focusing on a problem. We might say we want to focus on, I’ll take a boring use case, but it’s a clear one, call center transcription in Johannesburg. With that, we ask: okay, well, what do we understand about the problem? We know that there’s an agent, they’ve got scripts, they have to complete certain things, and we know that there are people who need to receive that. Now, that data actually does exist, recorded, probably not transcribed, in a lot of these organizations, but we can’t use it due to privacy reasons anyway. Even if that data were available, we can’t use it. And this is why our techniques are broadly applicable beyond just these resource-scarce languages: sometimes things are resource-scarce because there’s a privacy issue.
And so, in this case, what this means is that we hire teams of people who were call center agents, and we put in a system that mimics a call center, and then we have teams of people who are paid to phone in, and they have some sort of guidelines for a script. Often, the people running the calls are call center agents. So you’re creating synthetic data, but not by using some AI to do it, because fundamentally we don’t have the distributions to even sample from. What we want to do is create this base corpus of very high-quality data where we can make a lot of decisions. Because if we know that we need it to work in Johannesburg, that this is the age range of people calling in, that we need it to work on all genders equally, it means we can construct our data sets to reflect that.
It allows you to take a lot of the constraints and SLAs that you would have on your end model and move them further back into the data, to say, “Cool, well, now that we are creating data, let’s create it correctly”. Fundamentally, what that means is that each time problems come up in the model, we also analyze what the model might have done with a small piece of client data. Instead of using that data, because, once again, a lot of that data is highly protected, you might say, “Okay, what are the features of this data that we need to ensure are now reflected in the data that we’re creating?” Sometimes that means rerunning the creation process. Sometimes it means adding noise, simple adaptations like real-world background noise. And so it’s really this focus on curating and building up this data set over time as we get feedback from users, as well.
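A minimal sketch of those two ideas, assuming recordings are handled as NumPy arrays; the demographic labels, quotas, and SNR range are illustrative assumptions, not Lelapa AI’s actual collection plan.

```python
# Illustrative only: (1) plan balanced data creation up front, and
# (2) augment clean recordings with real-world noise afterwards.
import itertools
import random
import numpy as np

# 1. A creation "spec": every demographic cell gets an explicit quota, so the
#    corpus reflects the users it must serve by design, not by accident.
age_bands = ["18-29", "30-49", "50+"]
genders = ["female", "male"]
accents = ["johannesburg", "soweto", "pretoria"]   # hypothetical labels
quota_per_cell = 50                                # recordings per combination

collection_plan = [
    {"age": a, "gender": g, "accent": ac, "quota": quota_per_cell}
    for a, g, ac in itertools.product(age_bands, genders, accents)
]

# 2. Simple augmentation: mix a clean recording with background noise at a
#    random signal-to-noise ratio to mimic real call-center conditions.
def add_background_noise(clean: np.ndarray, noise: np.ndarray,
                         snr_db_range=(5.0, 20.0)) -> np.ndarray:
    snr_db = random.uniform(*snr_db_range)
    noise = np.resize(noise, clean.shape)           # loop/trim to match length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the clean/noise power ratio matches the target SNR.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```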
Olimpiu Pop: Okay, so what you’re doing is you’re getting inspired from real-world data, and then you take the information that’s needed on that as parameters in a whole, in a, let’s say, broad equation. Based on that, you generate your data. Okay? Because initially I was thinking AI, probably you’re doing some kind of anonymisation and all this kind of stuff, but actually you are taking the gist of it, you’re just extracting the proper information that you need, and based on that, you’re just creating what you need.
Jade Abbott: Yes.
Olimpiu Pop: Okay. Even though it’s a boring use case, as you said, most engineers out there are solving boring use cases, because that’s where the money comes from: from above, from people in positions of authority.
Jade Abbott: The money’s in the boring stuff, rather.
How to choose the best base model for your use case [15:49]
Olimpiu Pop: Exactly. So what techniques can people use? As you said, try to use synthetic data and focus on extracting the right information: look at the data set, see what the broad majority looks like, understand that, then generate data you can use based on it, and add real-world noise and other perturbations in stages. What else?
Jade Abbott: Some of it is about how you make things smaller. The one thing is: don’t try the biggest model first, because often it’s not even the best one, particularly if you have less data. So try the smaller models. And some of this works backwards: similarly to how you might model “Okay, this is the problem, these are the people we need it to work for”, you also model the environment you need it to run in. You say, “Okay, well, we need it to run with XYZ latency or on this type of hardware”, et cetera, et cetera, and then you back-solve from that: “What is the correct model that can actually run on that hardware?” Because you’re working within your constraints.
I think we have a tendency, being at the frontier of AI, to just use the biggest thing instead of saying let’s figure out what works within the constraints. Then, with that, there come techniques that can aid us, whether it be model distillation, whether it be quantization, whether it’s putting in some of the really interesting reinforcement learning loops that help improve these models over time. They allow us to keep our models a little bit smaller and consistently improving.
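As one concrete example of these shrinking techniques, here is a minimal sketch of post-training dynamic quantization in PyTorch; the stand-in network is an assumption, and whether the accuracy trade-off is acceptable is something the evaluation process discussed later would have to confirm.

```python
# Illustrative only: shrink a trained model with post-training dynamic
# quantization (int8 Linear weights), then compare on-disk size.
import os
import torch
import torch.nn as nn

# Stand-in network; in practice this would be the fine-tuned task model.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module, path: str = "tmp_weights.pt") -> float:
    # Serialize the state dict and measure its size on disk.
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```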
Olimpiu Pop: When you said look for the model, I was immediately thinking about Hugging Face. If I go there, how should I proceed? I suppose I should start from the use case I have in mind and organise things accordingly. What’s the ace up your sleeve? How would you do it?
Jade Abbott: I have many techniques. If you have the time and capacity, what you do is find a set of them, usually on Hugging Face, that somewhat meets your requirements. However, they’ll often be lacking in some way or another. They’ll be lacking in the performance aspect. So, how well is this model performing on your problem? That’s usually your first one. However, you will obviously fine-tune and train it on your own data that you’ve created. Or they’ll be lacking in a size component, so it might be too big, or they’ll be lacking in the latency components. And so, typically, I usually grab a few of them, and you just run all of them. And that works when you have the capacity to do that or you have a small amount of training comparatively, so maybe it’s something that runs in a couple of hours. Fine-tuning can be run in a couple of hours in many cases.
The way that I like to think about it actually now … so that was back in the day when the models were a lot smaller, actually, so you could do that. You could download five and run them. Now I prefer to think of it as this process of looking at the set of them, picking the most likely one that’ll work, running the test, looking at the results, doing an error analysis, which is kind of that qualitative component of understanding what it’s doing wrong, and figuring out if this is something that we can fix with more training, if this is something we can fix with some sort of technique, if this is something we can fix with more data or, fundamentally, if this model isn’t going to work. Then, we need to make a strategic choice about which of these levers to pull, given the resources we have. And then we use that to guide our next decision, whether it be creating more data, whether it be trying to do extra stuff on the model, or whether it be abandoning the model and moving to the next one.
And so we might run a few experiments in parallel, but typically, given that GPU resources are pretty scarce, we’re not usually doing too many. You’re generally picking the few that are most likely to be effective. And what we found is that there is no universal pattern. I suppose the main one is that if your distribution is far outside what the pre-trained model has seen, fine-tuning it quite often isn’t going to give you the best result. So with something trained in English, you can’t just fine-tune it to your language; very often that doesn’t work. What we have found, however, is that something closer to your language will fine-tune better, for example. So we have done studies where we trained more generalised base models, but they were Africa-centric, so they were trained with Swahili as their base language and had other languages on top of that.
We took a much larger English model and the Swahili model, which is much smaller, and we fine-tuned both of them for specific tasks, and we found the Swahili one performed so much better, just because it was trained from scratch using Africa-centric information. It’s a lot of techniques, but also a lot of human intelligence that has to go into deciding and being very strategic about what you’re doing. Those are some of the tricks, flows, and things we do. I think one of the big ones is nailing that evaluation component. Like, is this working?
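A minimal sketch of that candidate-selection and error-analysis loop in Python; the model names and the exact-match metric are illustrative assumptions, not a prescription.

```python
# Illustrative only: score a short list of candidate models on the same
# task-specific test set, keeping the raw errors for qualitative analysis.
from typing import Callable, Dict, List, Tuple

TestSet = List[Tuple[str, str]]           # (input, expected output) pairs

def evaluate(predict: Callable[[str], str], test_set: TestSet) -> Dict:
    rows = [(x, expected, predict(x)) for x, expected in test_set]
    errors = [r for r in rows if r[1] != r[2]]
    return {
        "accuracy": 1.0 - len(errors) / len(rows),
        "errors": errors,                 # feed these into error analysis
    }

def compare(candidates: Dict[str, Callable[[str], str]],
            test_set: TestSet) -> Dict[str, Dict]:
    return {name: evaluate(fn, test_set) for name, fn in candidates.items()}

# Hypothetical usage:
# results = compare(
#     {"afri-base-small": model_a.predict, "multilang-770m": model_b.predict},
#     test_set,
# )
# Inspect the errors of the strongest candidate, then decide which lever to
# pull: more data, more training, a different technique, or a different model.
```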
How to incorporate AI interaction in an organization’s feedback loop [20:38]
Olimpiu Pop: That means it’s a combination of trial and error, flair, and experience that will help you with everything.
Jade Abbott: Yes.
Olimpiu Pop: But what I was thinking, while listening to you, is how to do verification. Until not long ago, and “long ago” might mean just two years now, before all hell broke loose with everybody trying to fix everything with generative AI, using the most enormous hammer, I felt that ML engineers, the AI guys and ladies, wore white coats, in my opinion. Some very bright people were doing fascinating stuff, but there were only a handful of them.
Now it seems that we are somehow making a commodity out of the AI space, and other people who don’t have the training and expertise to do it have to do it. So what I’ve been preaching for the last, let’s say, couple of months to a year is to create some kind of safety net, like the continuous integration we had for more classical ways of building software. But in the case of machine learning, in the case of artificial intelligence, it’s not black or white; it has shades of gray. How do you recommend tackling that issue? How would you bring, broadly speaking, artificial intelligence into the continuous integration loop, the feedback loop, of a company’s build system?
Jade Abbott: Yes. My past is very much rooted in the early days of MLOps; I might actually have claimed to have invented the term ML engineering, and I’m still trying to prove that claim, from 15 years ago or whatever it was. But it was very early, and something very interesting happened to me: a client we were building, funnily enough, a language model for, but 15 years ago, so you can only imagine the state it was in, came to me and said, “Well, we have this bug on the model”. And here, with a bug, I’m thinking, “Oh, there’s an error that’s happening”, et cetera, et cetera. And he’s like, “No, the bug is that it’s not working on X and Y use case”. What was interesting was that the engineering part of my brain immediately said, “Ah, yes, a bug”. And the data scientist part of my brain was like, “Ah, I can’t test on one sample, I have to get a representative set of these samples”.
And so, for me, it was about encapsulating this idea of what a bug is in AI. Because, as engineers, you run tests. Right? You’ve got repeatable test suites, you run them. You run them every time you push something. You run them every time you deploy something. You get a view of it. And so we needed that for these ML-like bugs. What became interesting is that we take this report from the user or the client, map it out into a small test set, and encapsulate that as the bug. And the bug wouldn’t be solved the way an engineering bug is solved; you’d have a percentage solved. But, ideally, you build up this database over time. So you’ve got this extensive database of bugs, or you can call them interest points, or mini test sets, whatever you want, which are essentially designed to measure each of these individual problems, and you track them over time.
And so each candidate model, each model you want to deploy, and I’d say every model, though probably not in practice, you make sure you run that test suite against. And everyone goes, “Ah, but we’ve got the test set”. I’m like, “But the test set doesn’t give the business a lot of information. It doesn’t actually give anyone a lot of assurance, particularly if you’re approaching it from a risk point of view. You’re talking about a safety net; these are very similar concepts”. We actually need to see it in that light. And so, at the time, I went and built a lot of this infrastructure, and now have the regret that I didn’t turn it into a startup, and probably still no one has solved this issue. So we’ll probably release something open source to do it, at least as a framework, sometime soon. Because that’s the thing: how do you map … what is a machine learning bug? Everything else fits in from that, because then you can say whether it’s solved, whether it reopened, et cetera, et cetera. And the business can be involved.
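A minimal sketch of the “bug as a mini test set” idea in Python; the class and function names are assumptions, not the framework Jade mentions possibly releasing.

```python
# Illustrative only: every user-reported "bug" becomes a small named test set
# with a percentage solved, run against each candidate model like a CI suite.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MLBug:
    bug_id: str
    description: str
    cases: List[Tuple[str, str]]          # (input, expected output) pairs

def score_bug(bug: MLBug, predict: Callable[[str], str]) -> float:
    passed = sum(1 for x, expected in bug.cases if predict(x) == expected)
    return passed / len(bug.cases)        # "percentage solved", not pass/fail

def run_bug_suite(bugs: List[MLBug],
                  predict: Callable[[str], str]) -> Dict[str, float]:
    # Run before deploying any candidate model; track the scores over time so
    # the business can see each reported problem improve, regress, or reopen.
    return {bug.bug_id: score_bug(bug, predict) for bug in bugs}

# Hypothetical usage:
# bugs = [MLBug("BUG-42", "fails on isiZulu personal names",
#               cases=[("input one", "expected one"),
#                      ("input two", "expected two")])]
# report = run_bug_suite(bugs, candidate_model.predict)
```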
Olimpiu Pop: That’s a very nice way of looking at it. Thank you for sharing. I don’t know why, but it took me back to university days when we were discussing NP-complete problems, which don’t have an ideal solution in a feasible time frame, but which can be solved to a degree. That’s my analogy with what you mentioned: we don’t really need to get to the bottom of the problem, but we need to fix the aspects that matter most for a given customer, and that’s what you should aim for.
Jade Abbott: And you should monitor it.
Olimpiu Pop: Precisely, and then monitor it. You said that you’re on the technical side, but you are also the co-founder. So you have to think about the mundane things of a company. More often than not, you need to look at the impact and how to measure it. Because, obviously, when running a company, it’s important to see if you’re better than yesterday, and so on, even if you go back and forth. How do you measure your impact? I mean, you have a comprehensive and distributed problem. How do you measure the impact of the way your models work, so you can see the evolution of your implementation day to day?
Jade Abbott: Impact is always such a fun one, because I think if you speak to every development fund, they want to know what impact means. It’s interesting because we measure it in a kind of multi-dimensional way. Because you can measure it in terms of the people that have been impacted by your service. So you have clients, and your clients have users. How many unique conversations, for example, have you added value to?
But then you might also add on, for our open source models, the number of people who downloaded them. We don’t see what they’re used for, but we do know who downloaded them. And, similarly, for our papers, we have some stats on how they’re read. And so you can start to see this multi-dimensional view: one side is very much “what are the conversations we added value to?”, and the other side is “how do you measure general world impact?” How have we changed the narrative, so to speak? And that’s through our messaging, our publishing, our open source models, et cetera.
Olimpiu Pop: You see, impact was the proper word. Thank you, Jade, for providing a definition of impact. You’re looking at the multi-dimensional aspect of your work: writing and open-sourcing your papers, so your knowledge; the open source models and the people using them; but also, let’s call it, the commercial side of their usage. So I’m happy with impact; I’ll stay with that. Thank you. There is another curiosity, going back to where we started the discussion.
Federated learning and improving models on the fly [27:22]
So you mentioned that, at one point, the models that you’re using are very small, and then they might run on the mobile device itself. But you also mentioned that those mobiles might not be connected to the internet. So that begs the question. And even if they were, I would expect some data scarcity, so you wouldn’t want to send data back and forth. For me, that means that even if a model somehow, magically, gets improved on the device itself, it’s not connected to the whole ecosystem the way we’re used to with the internet. How do you take advantage of that? How do you make sure that, for those models that are refined on-device, you gather the information and get it back to the source, so you can use it in other places as well?
Jade Abbott: I don’t think we’ve reached a point where we’re actually doing it, but the framework is there, and the framework is federated learning. The idea is: cool, this phone is not linked to the internet most of the time, or some of the time, but when it is linked to the internet, are we able to propagate those updates in a privacy-respecting way to the base model? There’s been a lot of movement on federated learning. There hasn’t been enough in the NLP space, or the language space, or the LLM space, or whatever we want to call it these days, I can’t keep up, but those approaches are particularly helpful. Alternatively, we could say we’ll capture the data, and that’s your simplest way: cool, if you upload your data, you can incentivise people with monetary rewards and things like that. Or we can just build it as a federated model.
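For readers unfamiliar with the technique, here is a minimal sketch of the generic federated-averaging idea in Python with NumPy; it illustrates the concept only and is not Lelapa AI’s implementation.

```python
# Illustrative only: the generic federated-averaging idea. Devices fine-tune a
# local copy while offline; only model weights (never raw user data) are sent
# back and averaged into the base model when connectivity is available.
from typing import Dict, List
import numpy as np

Weights = Dict[str, np.ndarray]           # parameter name -> tensor

def local_update(base: Weights, local_grad: Weights, lr: float = 0.01) -> Weights:
    # Stand-in for on-device fine-tuning on the user's own data.
    return {k: base[k] - lr * local_grad[k] for k in base}

def federated_average(client_models: List[Weights],
                      client_sizes: List[int]) -> Weights:
    # Weight each client's model by how much data it trained on, then average.
    # The server only ever sees parameters, not the underlying examples.
    total = sum(client_sizes)
    keys = client_models[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_models, client_sizes))
        for k in keys
    }
```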
But this is still quite aspirational for us at this stage. Right now, we’re just getting the things to work, and getting them to work on the right platforms and in the right ways. The theory of how we would do it is something else on the list … I’ve got a long list of things, as a CTO, that I wish I could be doing, but instead I’m, I don’t know, a CTO crying over GPUs, as I like to say. And one of those is just sitting down and setting up some prototypes around the federated learning components for the tasks we care about.
Olimpiu Pop: Great. Thank you, Jade. When you mentioned enabling universal communication, you brought me back to my childhood days when I was watching Star Trek and imagining that, let’s call it, utopian universe. What’s also essential, from what you mentioned and from looking at what’s happening all around the world right now, is that we are all living under the same sun, and it’s nice to see innovation coming from places we wouldn’t have expected it. I think it serves as a good lesson for all of us to strive for the best. Thank you for your time.
Jade Abbott: Cool. Thank you so much.