Uber created a unified platform for serving large language models (LLMs), both from external vendors and self-hosted, and opted to mirror the OpenAI API to ease internal adoption. GenAI Gateway provides a consistent and efficient interface and serves over 60 distinct LLM use cases across many areas.
The company was one of the early adopters of large language models (LLMs), with several teams working on incorporating AI-driven functionality into various domains, from process automation to customer support and content generation. However, these disparate integration efforts resulted in duplicated work and inconsistent approaches. In response to these challenges, Uber decided to centralize LLM serving in a single service: the GenAI Gateway.
Tse-Chi Wang and Roopansh Bansal, senior software engineers at Uber, explain the rationale for creating the gateway:
The GenAI Gateway is designed to simplify the integration process for teams looking to leverage LLMs in their projects. Its easy onboarding process reduces the effort required by teams, providing a clear and straightforward path to harness the power of LLMs. In addition, a standardized review process, managed by the Engineering Security team, reviews use cases against Uber’s data handling standard before use cases are granted access to the gateway.
The team opted to adopt the OpenAI API for the gateway due to its wide adoption and the availability of open-source libraries like LangChain and LlamaIndex that already speak it. Mirroring a well-known API streamlines the onboarding process and extends the gateway’s reach.
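The blog post does not include client code, but one practical consequence of OpenAI compatibility is that existing SDKs should work against the gateway largely unchanged. Below is a minimal sketch using the official openai Python package; the gateway URL, credential, and model name are hypothetical placeholders, not Uber’s actual values.

```python
# Hypothetical client: because the gateway mirrors the OpenAI API,
# the standard openai SDK can be pointed at it via base_url.
from openai import OpenAI

client = OpenAI(
    base_url="https://genai-gateway.example.internal/v1",  # hypothetical gateway endpoint
    api_key="internal-service-token",  # gateway-issued credential, not a real OpenAI key
)

response = client.chat.completions.create(
    model="gpt-4o",  # the gateway routes the request to the configured provider
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of a unified LLM gateway."},
    ],
)
print(response.choices[0].message.content)
```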
GenAI Gateway is a Go service that implements the serving layer, unifying access to external (OpenAI, Vertex AI) and internally hosted LLMs, and provides many generic capabilities, such as authentication and account management, caching, and observability/monitoring.
Architecture of GenAI Gateway (Source: Uber Engineering Blog)
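The post describes the serving layer only at the architecture level. The following sketch, written in Python rather than the Go used by the actual service, illustrates the general shape of such a layer: routing requests to provider backends behind a single API while layering in authentication, caching, and observability. All names and the prefix-based routing rule are assumptions for illustration.

```python
# Minimal sketch of a gateway serving layer; every name here is an
# assumption for illustration, not Uber's code.
import hashlib
import json
import time
from typing import Callable

def call_openai(req: dict) -> dict:       # stub for the external OpenAI backend
    return {"choices": [{"message": {"content": "(openai reply)"}}]}

def call_vertex_ai(req: dict) -> dict:    # stub for the Vertex AI backend
    return {"choices": [{"message": {"content": "(vertex reply)"}}]}

def call_self_hosted(req: dict) -> dict:  # stub for internally hosted models
    return {"choices": [{"message": {"content": "(internal reply)"}}]}

# Route by model-name prefix; callers see one OpenAI-style API
# regardless of which backend ultimately serves the request.
PROVIDERS: dict[str, Callable[[dict], dict]] = {
    "gpt-": call_openai,
    "gemini-": call_vertex_ai,
    "uber-": call_self_hosted,
}

CACHE: dict[str, dict] = {}  # naive exact-match response cache

def serve(request: dict, auth_token: str) -> dict:
    if not auth_token:  # stand-in for authentication / account management
        raise PermissionError("missing credentials")
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    if key in CACHE:    # caching layer
        return CACHE[key]
    start = time.monotonic()
    backend = next(
        (fn for prefix, fn in PROVIDERS.items() if request["model"].startswith(prefix)),
        None,
    )
    if backend is None:
        raise ValueError(f"unknown model {request['model']}")
    response = backend(request)
    # stand-in for observability/monitoring
    print(f"model={request['model']} latency={time.monotonic() - start:.3f}s")
    CACHE[key] = response
    return response

# Demo: routes to the OpenAI stub and caches the response.
print(serve({"model": "gpt-4o", "messages": []}, "token"))
```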
The gateway supports personally identifiable information (PII) redaction, which is both important and challenging in the context of LLMs. Uber wanted to ensure that PII was anonymized before forwarding requests to third-party vendors to avoid the risk of exposing sensitive data. On the other hand, redacting PII can strip requests of essential context and prevent LLMs from providing useful responses. Furthermore, redaction is problematic for LLM caching and retrieval-augmented generation (RAG). The team is looking to address these challenges by encouraging the use of Uber-hosted LLMs or by relying on security assurances from third-party vendors.
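As an illustration of the tradeoff the team describes, a simple redaction pass might replace detected PII with typed placeholders, preserving some context for the model while keeping raw values out of third-party requests. The patterns and placeholders below are assumptions; the post does not detail Uber’s anonymization pipeline.

```python
# Illustrative PII redaction pass (patterns and placeholders are assumptions).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace PII with typed placeholders before the request leaves the company."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

prompt = "Rider jane.doe@example.com called from +1 415-555-0100 about a refund."
print(redact(prompt))
# -> "Rider <EMAIL> called from <PHONE> about a refund."
```

Typed placeholders retain the kind of information that was removed, which softens the context-loss problem, but the response cache and RAG retrieval still operate on altered text, which is the difficulty the team notes.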
The authors included a case study covering the summarization of chats for customer support agents to improve their operational efficiency by reducing the time they spend addressing user queries. Using LLMs for this use case resulted in 97% of generated summaries being considered useful by agents and a six-second reduction in user query handling time. The solution currently generates around 20 million summaries per week, but the team plans to expand to more regions and contact types.
Integration of GenAI Gateway to Support Specific Use Case (Source: Uber Engineering Blog)
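To make the integration concrete, a summarization request flowing through an OpenAI-compatible gateway might look like the sketch below; the transcript, prompt wording, and model name are invented for illustration and are not taken from Uber’s implementation.

```python
# Hypothetical contact-summarization call through the gateway.
from openai import OpenAI

client = OpenAI(
    base_url="https://genai-gateway.example.internal/v1",  # hypothetical endpoint
    api_key="internal-service-token",
)

transcript = (
    "Agent: Hi, how can I help?\n"
    "User: I was charged twice for my last trip.\n"
    "Agent: I see the duplicate charge and have issued a refund."
)

summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Summarize this support chat in two sentences for the next agent."},
        {"role": "user", "content": transcript},
    ],
)
print(summary.choices[0].message.content)
```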
The team learned a great deal from developing and operating the GenAI Gateway and is planning to work on further enhancements, including intelligent LLM caching mechanisms, better fallback logic, hallucination detection, and safety and policy guardrails.
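The post does not describe the planned fallback design, but a common pattern is to try providers in priority order and return the first successful response, as in this hypothetical sketch:

```python
# One common fallback pattern (a sketch; Uber's planned design is not described).
from typing import Callable

def with_fallback(request: dict, backends: list[Callable[[dict], dict]]) -> dict:
    """Try each backend in priority order; return the first success."""
    last_error = None
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # provider outage, rate limit, timeout, ...
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```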