Databricks introduced Agent Bricks, a new product aimed at changing how enterprises develop domain-specific agents. The platform addresses agent development complexity by letting teams focus on defining their agent’s purpose and providing strategic guidance on quality through natural language feedback. “Agent Bricks handles the rest, automatically generating evaluation suites and auto-optimizing the quality,” the company stated. The automated workflow includes generating task-specific evaluations and LLM judges for quality assessment, creating synthetic data that resembles customer data to supplement agent learning, and searching across optimization techniques to refine agent performance.
Source: Databricks Agent Bricks
Agent Bricks operates through a four-step automated workflow that begins when users declare their task by selecting an objective, providing a high-level natural-language description of what they want the agent to accomplish, and connecting their data sources. The platform then moves to automatic evaluation, where Agent Bricks creates evaluation benchmarks specific to the task, which may involve synthetically generating new data or building custom LLM judges.
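To make the declarative model concrete, the sketch below shows what such a task declaration might look like as code. The class and field names are illustrative assumptions made for this article, not part of any Agent Bricks API; in practice the declaration happens through the platform’s interface.

```python
# Illustrative only: a hypothetical declarative task specification mirroring the
# "declare your task" step (objective, natural-language description, data sources).
# None of these names are part of the Agent Bricks API.
from dataclasses import dataclass, field

@dataclass
class AgentTaskSpec:
    objective: str                 # e.g. "information_extraction"
    description: str               # natural-language statement of what the agent should do
    data_sources: list[str] = field(default_factory=list)  # e.g. Unity Catalog tables or volumes

spec = AgentTaskSpec(
    objective="information_extraction",
    description=(
        "Read supplier PDFs and return structured fields such as product name, "
        "SKU, unit price, and delivery date."
    ),
    data_sources=["catalog.schema.supplier_documents"],
)

# From a declaration like this, the platform would derive task-specific benchmarks,
# for example by synthesizing representative documents and building LLM judges.
print(spec)
```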
The system proceeds to automatic optimization, where Agent Bricks intelligently searches through and combines various optimization techniques, such as prompt engineering, model fine-tuning, reward models, or test-time adaptive optimization (TAO), to achieve high quality. The final stage addresses cost and quality: Agent Bricks ensures agents are not only highly effective but also cost-effective, allowing users to choose between cost-optimized and quality-optimized models. “In many cases, the end solution is both higher quality and lower cost compared to other DIY approaches,” according to the company.
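The following sketch illustrates the shape of such a search: score candidate configurations against the generated benchmark, then pick a winner under either a quality-first or a cost-first policy. The names and numbers are hypothetical, intended only to show the trade-off Agent Bricks automates.

```python
# Hypothetical sketch of an optimizer choosing among candidate configurations
# (prompting, fine-tuning, TAO, ...) under a quality- or cost-optimized policy.
from dataclasses import dataclass

@dataclass
class Candidate:
    technique: str       # "prompt_engineering", "fine_tuning", "tao", ...
    quality: float       # score from the auto-generated evaluation suite (0-1)
    cost_per_1k: float   # estimated serving cost per 1k requests

def select(candidates: list[Candidate], mode: str = "quality") -> Candidate:
    if mode == "quality":
        return max(candidates, key=lambda c: c.quality)
    # Cost-optimized: cheapest candidate within a small tolerance of the best quality.
    best_quality = max(c.quality for c in candidates)
    eligible = [c for c in candidates if c.quality >= best_quality - 0.02]
    return min(eligible, key=lambda c: c.cost_per_1k)

candidates = [
    Candidate("prompt_engineering", quality=0.81, cost_per_1k=0.4),
    Candidate("fine_tuning",        quality=0.88, cost_per_1k=0.9),
    Candidate("tao",                quality=0.87, cost_per_1k=0.5),
]
print(select(candidates, mode="cost"))  # picks the cheaper near-best configuration
```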
Agent Bricks incorporates the latest research in agent learning, with Databricks highlighting one key innovation called Agent Learning from Human Feedback (ALHF). The company identified a quality challenge where steering agent behavior from feedback proves difficult because feedback often comes as simple thumbs up or thumbs down signals, making it unclear which components within an agent system require adjustment. Current approaches pack all instructions into one massive LLM prompt, which Databricks describes as brittle and unable to generalize to more complex agent systems. ALHF addresses this through two methods: receiving rich context from natural language guidance, and using algorithms that intelligently translate this guidance into technical optimizations such as refining retrieval algorithms, enhancing prompts, filtering vector databases, or modifying agentic patterns.
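As a rough illustration of the ALHF idea, the snippet below routes free-form feedback to the component it most plausibly concerns. A real system would use an LLM rather than keyword matching, and the mapping shown is an assumption made for this example, not Databricks’ algorithm.

```python
# Hypothetical illustration: natural-language feedback carries more signal than a
# thumbs up/down, so it can be mapped to a component-level adjustment. This keyword
# router is a stand-in for the LLM-driven translation ALHF describes.
def route_feedback(feedback: str) -> str:
    text = feedback.lower()
    if "wrong document" in text or "irrelevant source" in text:
        return "refine retrieval: adjust chunking or reranking"
    if "outdated" in text or "old version" in text:
        return "filter vector database: restrict to current documents"
    if "tone" in text or "format" in text:
        return "enhance prompt: add style and formatting guidance"
    return "modify agentic pattern: add a verification or clarification step"

print(route_feedback("The answer cited an outdated policy from 2021."))
```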
Databricks also introduced Test-time Adaptive Optimization (TAO), a new model tuning method that requires only unlabeled usage data, letting enterprises improve quality and reduce cost for AI using data they already have. The method leverages test-time compute and reinforcement learning to teach models to perform tasks better based on past input examples alone, scaling with an adjustable tuning compute budget rather than human labeling effort. “Even without labeled data, TAO can achieve better model quality than traditional fine-tuning, and it can bring inexpensive open source models like Llama to within the quality of costly proprietary models like GPT-4o and o3-mini,” the company stated.
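At a high level, the published description of TAO amounts to spending inference-time compute on unlabeled prompts and letting a scoring model pick the best responses to learn from. The sketch below shows that data-construction step with placeholder generation and reward functions; it is a simplification, not the actual training pipeline.

```python
# Simplified sketch of the TAO idea: for each unlabeled prompt, generate several
# candidate responses (test-time compute), score them with a reward model, and keep
# the best as tuning targets. No human labels are needed; quality scales with the
# compute budget n. The generate/score functions are placeholders.
import random

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    return [f"candidate response {i} to: {prompt}" for i in range(n)]  # placeholder sampler

def reward_model(prompt: str, response: str) -> float:
    return random.random()  # placeholder for a learned reward or LLM-judge score

def build_tuning_set(prompts: list[str], n: int = 4) -> list[tuple[str, str]]:
    pairs = []
    for p in prompts:
        candidates = generate_candidates(p, n)
        best = max(candidates, key=lambda r: reward_model(p, r))
        pairs.append((p, best))
    return pairs  # fed into fine-tuning / reinforcement learning downstream

print(build_tuning_set(["Summarize the Q3 supplier report."])[0])
```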
Databricks’ Mosaic AI Agent Evaluation helps developers evaluate the quality, cost, and latency of agentic AI applications, including RAG applications and chains. The tool identifies quality issues and determines the root cause of those issues across development, staging, and production phases of the MLOps lifecycle, with all evaluation metrics and data logged to MLflow Runs. Agent Evaluation maintains consistency between development and production environments, enabling teams to quickly iterate, evaluate, deploy, and monitor agentic applications. The main difference between environments lies in the availability of ground-truth labels, which allow Agent Evaluation to compute additional quality metrics during development.
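For teams already on Databricks, Agent Evaluation can be invoked through the familiar mlflow.evaluate entry point with model_type="databricks-agent". The minimal example below evaluates pre-computed responses against an expected answer; it assumes the databricks-agents package and a Databricks workspace, and the sample data is invented for illustration.

```python
# Minimal Agent Evaluation example in the documented mlflow.evaluate pattern.
# Requires the databricks-agents package and a Databricks workspace; the rows
# below are invented sample data.
import mlflow
import pandas as pd

eval_set = pd.DataFrame([
    {
        "request": "What is the torque spec for the X-200 drive shaft?",
        "response": "The X-200 drive shaft should be torqued to 45 Nm per SOP 12.3.",
        "expected_response": "45 Nm",
    }
])

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_set,
        model_type="databricks-agent",  # invokes Agent Evaluation's built-in LLM judges
    )
    # Metrics and per-row judge assessments are logged to the MLflow run.
    print(results.metrics)
```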
Agent Bricks addresses several customer use cases across key industries through four main agent types. The Information Extraction Agent turns documents such as emails, PDFs, and reports into structured fields like names, dates, and product details, allowing retail organizations to pull product details from supplier PDFs regardless of document complexity. The Knowledge Assistant Agent provides fast, accurate answers grounded in enterprise data, enabling manufacturing technicians to get instant, cited answers from SOPs and maintenance manuals. The Multi-Agent Supervisor enables building systems that coordinate agents across Genie spaces, other LLM agents, and tools such as MCP, allowing financial services organizations to orchestrate multiple agents for intent detection, document retrieval, and compliance checks. The Custom LLM Agent transforms text for industry-specific tasks, helping marketing teams generate content that respects organizational brand guidelines.
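To ground the Information Extraction use case, the sketch below pairs a target schema with a stub extractor: documents in, typed fields out. The schema and values are invented, and the function is a placeholder rather than an Agent Bricks interface.

```python
# Illustrative only: the documents-to-structured-fields pattern behind the
# Information Extraction Agent use case. Schema, values, and the extract() stub
# are invented for this example.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SupplierRecord:
    supplier_name: str
    product: str
    unit_price: Optional[float] = None
    delivery_date: Optional[str] = None

def extract(document_text: str) -> SupplierRecord:
    # A deployed extraction agent would populate these fields from the raw PDF text.
    return SupplierRecord(
        supplier_name="Acme Industrial",
        product="M8 hex bolts",
        unit_price=0.12,
        delivery_date="2025-07-01",
    )

print(extract("raw supplier PDF text ..."))
```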
Matei Zaharia, CTO at Databricks and CS professor at UC Berkeley, emphasized the collaborative nature of the development effort.
“This is a joint effort across our engineering and Databricks Mosaic Research teams, based on new tuning methods we developed like TAO and ALHF. I think this type of declarative development is the future of AI,” Zaharia said.
The platform represents a shift toward allowing domain experts to contribute directly to system improvement without requiring deep technical expertise in AI infrastructure, potentially changing how enterprises approach agent development workflows.
Readers interested in learning more about Agent Bricks implementation and multi-agent system development can access additional technical details through Databricks’ Data + AI Summit session on building multi-agent systems with structured and unstructured data. A video demonstration of the platform’s capabilities is available, providing visual examples of the automated optimization workflow and real-world application scenarios.