Table of Links
Abstract and 1. Introduction
2 Related works
3 Methodology and 3.1 Causal language model as a classification model
3.2 Functional token
3.3 Dataset collection
3.4 Model development and training
4 Experiments and 4.1 Android function calls
4.2 Extension to Vehicle, Yelp, and DoorDash function sets
4.3 Full and partial training datasets and 4.4 Full training and LoRA training
4.5 Parallel and nested function call and 4.6 Weighted loss function for special tokens
5 Discussion and future works and References
Appendix
A.1 Android function examples
A.2 Vehicle function examples
Abstract
Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflows. These models possess the crucial ability to call functions, which is essential in creating AI agents. Although large-scale language models perform well in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency, while reducing the context length by 95%. Compared to Llama-7B with a RAG-based function-calling mechanism, our method improves latency 35-fold. This method reduces latency to levels suitable for deployment across a variety of edge devices in production environments, meeting the performance requirements of real-world applications.
1 Introduction
Large language models have demonstrated impressive capabilities in function calling, significantly contributing to AI agents’ growing presence in the software industry [50, 3, 11, 7, 8]. Progress on AI agents is rapid, highlighted by AI assistant tools like MultiOn [9] and Adept AI [25], and AI consumer products like Rabbit R1 [26] and Humane AI Pin [5], which are gaining traction in the consumer sector. Research into AI agents has been robust, with developments in chain-of-thought reasoning [58, 16] and enhanced prompting techniques [51]. Moreover, the rise of multi-agent systems [53, 43, 39, 29] marks a novel trend in the industry, showcasing the use of language models to develop dependable software that empowers users [54, 38]. These innovations leverage the API-calling [56, 12, 49] and reasoning abilities [40, 36] of large, cloud-based language models to convert human natural language instructions into actionable commands. Despite the considerable progress in creating valuable AI agents, reliance on cloud models raises concerns about privacy, inference costs, and the need for Wi-Fi connectivity [57, 22].
The cost of using large language models like Google’s Gemini family models [45] and OpenAI’s GPT series models [33, 34, 2, 1] can be substantial; for example, an hour-long interaction with an AI bot might cost around 0.24 USD based on GPT-4 API pricing. When it comes to function calling, employing RAG-based [17, 27, 19, 15] or context-augmented [35] methods requires processing about 1000 tokens for each call, resulting in costs of approximately 0.01 USD per call. In practical applications, where hundreds of function calls may be made, the cumulative cost can be considerable. Additionally, the potential for privacy violations deters many from using GPT-4, amid concerns that sensitive information might be exposed.
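The cumulative-cost argument above can be sketched as a back-of-envelope calculation. The per-call figure is the approximate value quoted in the text (~1000 tokens per RAG-augmented call at roughly 0.01 USD), not an official price sheet:

```python
# Back-of-envelope cost estimate for RAG-based function calling with a
# cloud LLM. Both constants are approximations taken from the text.
TOKENS_PER_CALL = 1000        # context processed per RAG-augmented call
COST_PER_CALL_USD = 0.01      # approximate GPT-4 API cost per such call

def cumulative_cost(num_calls: int) -> float:
    """Total API cost in USD for a given number of function calls."""
    return num_calls * COST_PER_CALL_USD

# A workload issuing a few hundred calls already costs several dollars:
print(cumulative_cost(500))   # ~5 USD
```

At scale this recurring per-call cost is what motivates moving function calling onto the device.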
To mitigate costs and enhance privacy, there is a trend towards creating smaller models for deployment on edge devices like smartphones, cars, VR headsets, and personal computers [52, 6, 21, 18, 55, 13, 41]. However, models deployed on edge hardware suffer from high latency that keeps them far from production readiness, and the limited battery life of edge devices further complicates continuous interaction. Research shows that energy consumption reaches 0.1 J per token for models with 1 billion parameters [23]. Therefore, employing a 7B-parameter model for function calls with traditional retrieval-augmented methods (about 1000 tokens of context per call) would consume roughly 700 J per call, about 1.4% of a 50 kJ iPhone battery, limiting the device to around 71 function calls per charge.
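The battery-budget arithmetic above can be reproduced directly. This is a minimal sketch assuming, as the text does, that per-token energy scales linearly with parameter count at 0.1 J per token per billion parameters [23]:

```python
# Energy budget for on-device function calling, following the text's
# assumptions: ~0.1 J/token per billion parameters, ~1000-token context
# per retrieval-augmented call, ~50 kJ iPhone battery.
ENERGY_J_PER_TOKEN_PER_B = 0.1
BATTERY_J = 50_000

def energy_per_call(params_billion: float, tokens_per_call: int = 1000) -> float:
    """Energy in joules consumed by one function call."""
    return tokens_per_call * params_billion * ENERGY_J_PER_TOKEN_PER_B

e = energy_per_call(7)            # 7B model with a 1000-token RAG context
calls = int(BATTERY_J // e)       # calls per full charge
print(e, calls)                   # ~700 J per call, ~71 calls
```

Cutting the context by 95% changes this budget by roughly the same factor, which is the basis for the 37x figure claimed later.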
Smaller models often fall short in reasoning tasks and demand extensive tuning for effective function calling. To address these issues, we developed a method that improves both accuracy and latency for function calling with on-device 2B-parameter models, achieving state-of-the-art (SOTA) results. This approach tokenizes the names of core functions and fine-tunes the model with these functional tokens. Fine-tuning with additional special tokens allows the model to understand a software application’s capabilities, learning to map function descriptions to specific tokens. In the inference phase, the model uses functional tokens to achieve better function-calling performance than GPT-4. We present a 2B-parameter model fine-tuned from Gemma 2B [10], reducing context length by over 95% during inference. On an iPhone, this enables 37 times more function calls on the same battery and reduces latency by approximately 35 times per function call.
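The core idea, mapping each function name to a dedicated special token so the model emits a single token instead of a long name string, can be sketched in a few lines. The function names and the `<nexa_i>` token format below are illustrative assumptions, not the paper's actual vocabulary:

```python
# Minimal sketch of the functional-token mapping: each core function gets
# one special token appended to the vocabulary. At inference, decoding a
# single generated token resolves the target function. Names here are
# hypothetical examples.
FUNCTIONS = ["take_a_photo", "get_trending_news", "set_timer"]

func_to_token = {name: f"<nexa_{i}>" for i, name in enumerate(FUNCTIONS)}
token_to_func = {tok: name for name, tok in func_to_token.items()}

def decode_call(generated_token: str) -> str:
    """Resolve a generated functional token back to its function name."""
    return token_to_func.get(generated_token, "<unknown>")

print(func_to_token["set_timer"])   # <nexa_2>
print(decode_call("<nexa_0>"))      # take_a_photo
```

Because the model selects among a small set of dedicated tokens rather than retrieving and reading long function descriptions at inference time, both the context length and the decoding cost per call shrink.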
Authors:
(1) Wei Chen, Stanford University, with equal contribution and a corresponding author {weichen6}@stanford.edu;
(2) Zhiyuan Li, Stanford University and a corresponding author {zhiyuan8}@stanford.edu.