World of Software
AI Inference and Business Competitive Advantage

News Room
Published 17 October 2025 · Last updated 17 October 2025, 7:51 PM

Artificial intelligence is driving IT spending worldwide. Despite market caution stemming from economic and geopolitical uncertainty, global IT spending is expected to rise on the back of investments in AI-related technologies. Europe is no exception: according to Gartner, the region is expected to increase its AI-related IT spending by 21%, offsetting the slowdown in other areas of the sector.

Since the launch of ChatGPT in November 2022, companies around the world have raced to adopt AI. Along the way, organizations have moved through several stages, from initial fascination to the pragmatic pursuit of competitive advantage. A first crucial dilemma has been choosing the right technological approach: whether to use large language models (LLMs), typically developed by tech giants and trained with billions of parameters, or to opt for small language models (SLMs), which can be trained on specialized information from the company's field of interest, and even on data exclusive to the organization itself.
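A rough way to feel the difference in scale is back-of-the-envelope arithmetic: just holding a model's weights in memory costs roughly its parameter count times the bytes per parameter. The sketch below uses illustrative sizes (a 70-billion-parameter model versus a 7-billion one), not figures for any specific product:

```python
def serving_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough GPU memory needed just to hold the model weights
    (fp16/bf16 = 2 bytes per parameter). KV cache and activations
    add more on top, so this is a lower bound."""
    return num_params * bytes_per_param / 1e9

# A large model vs. a small, domain-specific one (illustrative sizes):
print(serving_memory_gb(70e9))  # 140.0 GB of weights alone
print(serving_memory_gb(7e9))   # 14.0 GB of weights alone
```

The order-of-magnitude gap in weights alone is what drives the infrastructure-cost difference between the two families of models.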

This first decision centers on the AI training phase, in which the model learns. Training depends on the amount of knowledge fed to the model, the learning algorithms, and the parameters set so that it can answer the questions it is asked. Because they are trained with billions of parameters, LLMs require vast computing resources, as they must process enormous amounts of information to arrive at an answer. That breadth of data allows the model to answer a wide variety of questions; however, gaps in training can lead it to produce incorrect answers, known as hallucinations.

SLMs, by contrast, are trained on less data and therefore require fewer computing resources, both for training and for deployment, which makes them much lighter and faster. These characteristics reduce infrastructure costs and make the models easier to refine quickly.

As organizations have adopted these AI models, it has become clear that the true differential value of an AI system lies not only in the ability to train it on the data the organization cares about, but above all in having control over the inference phase: the process by which the trained model makes predictions when asked a question it has not seen before.

The life cycle of any AI model is marked by these two phases, training and inference, and together they determine its differential value. To make the difference clear, consider an analogy with medical professionals. Completing a medical degree, passing the residency exam (the MIR in Spain) and practicing the profession would be the equivalent of the AI training phase.

Once trained, a doctor can diagnose a patient from their symptoms, even a patient they have never seen before. To make that diagnosis, the doctor does not go back to medical school (training); instead, they draw on the countless patterns they have observed throughout their career. In AI, this process is known as inference: a response based on experience and knowledge. AI inference works in much the same way; the model does not think like a human, but computationally finds the best and most likely outcome.
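The doctor analogy can be made concrete with a toy sketch (not a medical tool, and far simpler than a language model): "training" stores observed symptom patterns once, and "inference" answers for a case never seen before without re-training, by matching it against the stored patterns:

```python
from collections import Counter

# "Training" output: stored patterns, built once (invented examples).
TRAINING_CASES = [  # (symptoms, diagnosis)
    ({"fever", "cough"}, "flu"),
    ({"fever", "cough", "fatigue"}, "flu"),
    ({"sneezing", "runny nose"}, "cold"),
    ({"runny nose", "cough"}, "cold"),
]

def infer(symptoms: set[str], k: int = 3) -> str:
    """Inference: no re-training, just match the new case against the
    patterns learned in training (k closest cases by symptom overlap)."""
    ranked = sorted(TRAINING_CASES,
                    key=lambda case: len(case[0] & symptoms),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# An unseen combination of symptoms still gets an answer:
print(infer({"fever", "fatigue", "cough"}))  # flu
```

The expensive part (collecting and encoding the cases) happens once; every call to `infer` is cheap and repeatable, which is exactly the training/inference split the article describes.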

The training phase is important, but the true value of AI lies in the inference phase, since inference is the analysis engine of AI: the process through which the model is actually executed. Depending on its design, efficiency and scalability, companies will be able to stand out from their competitors. Such is its relevance that, according to Gartner [1], "by 2028, as the market matures, more than 80% of data center workload accelerators will be deployed specifically for inference, rather than for use in training."

The inference phase is often constrained by the large volumes of data that must be processed, since heavy use can bring prohibitive costs and increased latency. Gartner has warned about the importance of understanding the complexity of AI costs: a lack of knowledge about how AI costs scale can lead to errors of 500% to 1,000% in cost calculations, which places AI spending among the main threats to this technology, alongside hallucinations and security vulnerabilities.

The training phase represents a capital cost: it requires a significant investment, but because it is not a recurring activity, it will not be a frequent expense for the company. The inference phase, by contrast, becomes an operational cost, that is, a consumption expense that depends on how heavily the company uses the AI.
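The capex-versus-opex point can be sketched with invented numbers (the training cost, per-token price and monthly volume below are hypothetical, not market figures): the one-off training bill is paid once, while inference spend accumulates with usage and never stops.

```python
# Illustrative numbers only: a one-off training cost (capex) vs. a
# recurring, usage-driven inference cost (opex).
TRAINING_COST = 500_000.0          # one-off, hypothetical
COST_PER_1K_TOKENS = 0.002         # hypothetical inference price
TOKENS_PER_MONTH = 2_000_000_000   # hypothetical steady usage

def inference_cost(months: int) -> float:
    """Cumulative inference spend after `months` of steady usage."""
    return months * TOKENS_PER_MONTH / 1000 * COST_PER_1K_TOKENS

for months in (1, 12, 60):
    print(months, inference_cost(months))
# At 4,000/month here, five years of steady usage costs 240,000 --
# and unlike training, the meter keeps running as adoption grows.
```

This is why an error in estimating inference volume compounds into the 500%-1,000% cost miscalculations Gartner warns about.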

Initiatives that address this problem are therefore important for bringing AI to more organizations by making scalable generative AI inference increasingly accessible. The University of California, Berkeley, for example, promoted the development of the open source vLLM project, an inference engine that has established itself as a de facto standard by providing technologies that deliver efficient, reliable and stable inference.

vLLM's features make it possible to run inference with a notable reduction in resources, facilitating large-scale production AI inference. It achieves this through its ability to optimize GPU memory, reduce the memory footprint required to run models, split processing tasks across multiple GPUs, generate text faster by pairing a smaller model that drafts tokens with a larger model that validates them (speculative decoding), and improve the efficiency of transformer models.

Building on vLLM is the open source llm-d project, which combines vLLM's inference engine with Kubernetes orchestration to integrate advanced inference functions directly into enterprise IT infrastructures. This unified platform lets IT teams scale and orchestrate the diverse demands of critical inference workloads across distributed hardware, while applying techniques that maximize efficiency and dramatically reduce the total cost of ownership (TCO) of high-performance AI accelerators.
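To make the Kubernetes angle concrete, the fragment below is an illustrative sketch using only core Kubernetes objects: a plain Deployment serving a model with vLLM's OpenAI-compatible server. It is not llm-d's actual resource model (llm-d layers its own scheduling and routing on top), and the model name is a hypothetical placeholder:

```yaml
# Illustrative only: a generic Kubernetes Deployment running vLLM's
# OpenAI-compatible server. llm-d adds its own resources and routing
# on top; this sketch uses only core Kubernetes objects.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2                     # scale horizontally with demand
  selector:
    matchLabels: {app: vllm-inference}
  template:
    metadata:
      labels: {app: vllm-inference}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "example-org/example-model"]  # hypothetical model
          resources:
            limits:
              nvidia.com/gpu: 1   # one accelerator per replica
```

Treating inference as an ordinary, replicated Kubernetes workload is what allows the same scaling, scheduling and cost controls used for the rest of the IT estate to apply to AI accelerators.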

Optimizing inference is not just a technical or cost issue; it is a key business strategy. Ultimately, the future of AI will not be defined by the models themselves, but by what is done with them through inference.

[1] "Forecast Analysis: AI Semiconductors Globally," Alan Priestley, Gartner, 2 August 2024, ID G00818912. GARTNER is a trademark and service mark of Gartner, Inc. or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

By Miguel Ángel Díaz, OpenShift Country Leader, Spain & Portugal, Red Hat
