Dec 17 (Reuters) – Alphabet’s Google is working on a new initiative to make its artificial intelligence chips better at running PyTorch, the world’s most widely used AI software framework, with the aim of weakening Nvidia’s long-standing dominance of the AI computing market, according to people familiar with the matter.
The effort is part of Google’s aggressive plan to make its Tensor Processing Units a viable alternative to Nvidia’s leading GPUs. TPU sales have become a crucial growth driver for Google’s cloud revenues as it looks to prove to investors that its AI investments are delivering returns.
But hardware alone is not enough to drive adoption. The new initiative, known internally as "TorchTPU," aims to remove a key barrier that has slowed the adoption of TPU chips by making them fully compatible with PyTorch and easier to work with for customers who have already built their technical infrastructure on that software, the sources said. Google is also considering open sourcing parts of its software to speed up adoption among customers, some people said.
Compared to previous efforts to support PyTorch on TPUs, Google has devoted more organizational focus, resources and strategic importance to TorchTPU as demand grows from companies that want to adopt the chips but see the software stack as a bottleneck, the sources said.
PyTorch, an open source project strongly supported by Meta Platforms, is one of the most widely used tools for developers building AI models. In Silicon Valley, very few developers write every line of code that chips from Nvidia, Advanced Micro Devices, or Google can actually run.
Instead, these developers rely on tools like PyTorch, a collection of pre-written code libraries and frameworks that automate common tasks in AI software development. PyTorch was originally released in 2016, and its history is closely tied to Nvidia's development of CUDA, the software that some Wall Street analysts consider the company's strongest shield against competitors.
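To illustrate the division of labor the sources describe, here is a minimal, hypothetical PyTorch training step (the model, shapes and data are stand-ins, not anything from the project): the developer writes against the framework's high-level building blocks, and PyTorch dispatches the work to whatever chip backend is available, most commonly Nvidia's CUDA.

# A minimal, hypothetical sketch: the framework supplies layers, gradients and
# optimizers, and maps the work onto the available accelerator backend.
import torch
import torch.nn as nn

# Pick a backend; on Nvidia hardware this resolves to CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One training step on dummy data; no chip-specific code is written by hand.
inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()   # PyTorch computes gradients automatically
optimizer.step()  # and updates the model's weights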
Nvidia engineers have spent years ensuring that software developed with PyTorch runs on the company's chips as quickly and efficiently as possible. Google, on the other hand, has long had its internal armies of software developers use a different code framework called Jax, and its TPU chips rely on a compiler called XLA to translate that code into efficient instructions for the hardware. Much of Google's proprietary AI software stack and performance optimization is built around Jax, widening the gap between how Google uses its chips and how customers want to use them.
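For contrast, a minimal, hypothetical sketch of the Jax style described above (again, the function and shapes are stand-ins): the developer writes ordinary numerical code, and jax.jit hands it to the XLA compiler, which generates optimized code for the target device, whether TPU, GPU or CPU.

# A minimal, hypothetical Jax example: jax.jit routes the computation through
# the XLA compiler, which Google's TPUs are tuned to run efficiently.
import jax
import jax.numpy as jnp

def predict(params, x):
    w, b = params
    return jnp.dot(x, w) + b

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

# Compile the gradient computation once through XLA, then reuse it.
grad_fn = jax.jit(jax.grad(loss))

params = (jnp.ones((128, 1)), jnp.zeros(1))
x = jnp.ones((32, 128))
y = jnp.zeros((32, 1))
grads = grad_fn(params, x, y)  # runs on TPU, GPU or CPU, whichever is present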
A Google Cloud spokesperson did not comment on the details of the project, but confirmed to Reuters that the move would give customers freedom of choice.
“We are seeing tremendous, accelerating demand for both our TPU and GPU infrastructure,” the spokesperson said. “Our focus is on providing the flexibility and scale that developers need, regardless of the hardware they choose to build on.”
TPUS FOR CUSTOMERS
Alphabet had long reserved the lion’s share of its own chips, or TPUs, for internal use. That changed in 2022, when Google’s cloud computing unit successfully lobbied for oversight of the group that sells TPUs. This move has dramatically increased Google Cloud’s allocation of TPUs, and as customer interest in AI has grown, Google has sought to capitalize on this by ramping up production and sales of TPUs to third-party customers.
But the mismatch between PyTorch, the framework used by most of the world's AI developers, and Jax, the framework for which Google's chips are currently most finely tuned, means that most developers can't easily adopt Google's chips and make them perform as well as Nvidia's without significant additional engineering work. Such work costs time and money in the fast-paced AI race.
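As an illustration only, that extra work today looks something like the following, using the existing open source PyTorch/XLA bridge (the torch_xla package), which is the current route for running PyTorch code on TPUs; the details of Google's internal TorchTPU effort have not been made public.

# A minimal, hypothetical sketch of today's PyTorch/XLA route to TPUs, not of
# Google's unannounced TorchTPU work: the developer must switch devices and
# manage the lazily built XLA graph explicitly.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # requires the torch_xla package

device = xm.xla_device()  # a TPU core, instead of the familiar "cuda" device

model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
xm.mark_step()  # flush the lazily built graph so XLA compiles and executes it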
If successful, Google’s “TorchTPU” initiative could significantly reduce switching costs for companies wanting alternatives to Nvidia’s GPUs. Nvidia’s dominance has been cemented not only by its hardware, but also by its CUDA software ecosystem, which is deeply embedded in PyTorch and has become the standard method by which companies train and deploy large AI models.
Enterprise customers have told Google that TPUs are more difficult to implement for AI workloads because developers have historically been required to move to Jax, a machine learning framework favored internally at Google, instead of PyTorch, which most AI developers already use, the sources said.
JOINT EFFORTS WITH META
To speed up development, Google is working closely with Meta, the creator and manager of PyTorch, according to the sources. The two tech giants have discussed deals to give Meta access to more TPUs, a move first reported by The Information.
Early offerings to Meta were structured as Google-managed services, in which customers installed Google's chips, which were designed to run Google's own software and models, with Google providing operational support. Meta has a strategic interest in working on software that makes it easier to run TPUs, in an effort to lower inference costs and diversify its AI infrastructure away from Nvidia's GPUs to gain bargaining power, the people said.
Meta declined to comment.
This year, Google started selling TPUs directly into customers’ data centers instead of restricting access to its own cloud. Amin Vahdat, a Google veteran, was named head of AI infrastructure this month, reporting directly to CEO Sundar Pichai.
Google needs that infrastructure both to run its own AI products, including the Gemini chatbot and AI-powered search, and to serve customers of Google Cloud, which sells access to TPUs to companies like Anthropic.
(Reporting by Krystal Hu, Kenrick Cai and Stephen Nellis in San Francisco; Editing by Kenneth Li and Matthew Lewis)