Importantly, Microsoft Fabric is a lakehouse platform, but not a data fabric platform as defined by Gartner.
Another disadvantage of direct connections is that the data models of operational systems are usually not optimized for analytical purposes and their use is often expensive. “Direct access to source systems is not efficient,” says Pore.
In addition, established procedures for managing access rights already exist for data lakehouses.
“A lakehouse physically unifies data, maintenance, security and governance in one place,” explains Pore. “This is particularly important for the introduction of AI. As a company-wide ‘single source of truth’, a lakehouse is the modern way to build a central data repository.”
From data lake to lakehouse
The consulting company Lemongrass started with a classic data lake around ten years ago and gradually developed it into a lakehouse around four years ago.
“The lakehouse concept wasn’t nearly as widespread back then,” recalls Kausik Chaudhuri, Chief Innovation Officer at Lemongrass.
The company therefore developed its own lakehouse functions based on its Amazon S3 data lake. Now that the lakehouse is increasingly supporting AI applications, the next modernization is already underway.
“We are currently working on a solution for incident and change management,” explains Chaudhuri.
The original data is in ServiceNow. If it were extracted directly from the lakehouse for use in an AI system, the costs would be too high. “That’s why we’re now thinking about setting up an MCP server that specifically queries this data,” he adds.
At the same time, Lemongrass is planning to switch from its self-developed lakehouse extensions to a standard solution.
“When we started, Lemongrass was primarily an advocate for AWS, which is why many of our tools were built on AWS,” explains Chaudhuri. “Now we are thinking about changing this approach because AI opens up significantly more possibilities.”
However, AWS itself now also offers comprehensive lakehouse functionalities. “The data is already there. So we don’t have to reinvent the wheel.”
In addition, AWS provides direct connections to Anthropic Claude and other AI models. Since these models also operate within the AWS infrastructure, there are no egress fees.
Lemongrass plans to begin the modernization in the third quarter of this year, initially with a proof of concept (PoC). However, particular care must be taken in deciding which data is passed from the lakehouse to AI models, and how much of it.
“We do not send customer data to an LLM,” emphasizes Chaudhuri. “And I don’t read 10,000 records and send them to Claude – that would cause token consumption to explode. We realized a few years ago that we could go bankrupt if we didn’t carefully control token consumption.”
For some use cases, the LLM no longer needs to see customer data at all once the solution has been implemented. For example, employees used to create customer status reports manually for their own use, a time-consuming process. An LLM could take on this task, but it would then have access to sensitive customer data. In addition, because generative AI models are non-deterministic, they produce slightly different results each time.
Another example is forms that customers are supposed to sign later. Here too, an LLM could generate a new form for each request, with the same drawbacks.
“So instead we asked Claude to write a program that processes these inputs and uses them to generate the report,” explains Chaudhuri. The actual report or form creation is then carried out using classic, deterministic software. This means customer data remains protected while reports can be generated quickly and cost-effectively.
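The approach Chaudhuri describes can be illustrated with a minimal sketch. The template, field names and function name below are hypothetical and not taken from Lemongrass. The point is only that once such a program exists, customer records are filled into a fixed template locally and never reach an LLM, and the same input always yields the same report.

```python
from string import Template

# Hypothetical report template. In the workflow described above, an LLM would
# help write this program once, without ever seeing real customer records.
STATUS_TEMPLATE = Template(
    "Status report for $customer\n"
    "Open incidents: $open_incidents\n"
    "Changes pending approval: $pending_changes\n"
)


def render_status_report(record: dict) -> str:
    """Fill the fixed template from one customer record, deterministically."""
    return STATUS_TEMPLATE.substitute(
        customer=record["customer"],
        open_incidents=record["open_incidents"],
        pending_changes=record["pending_changes"],
    )


if __name__ == "__main__":
    # Made-up example record, for illustration only.
    example = {"customer": "Example Corp", "open_incidents": 3, "pending_changes": 1}
    print(render_status_report(example))
```

Because no model is called at runtime, this step consumes no tokens.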
Other companies, by contrast, are already making intensive use of AI to get more out of their data.
According to a recent Databricks report based on data from 20,000 companies, the proportion of databases created by AI agents increased from 0.1 percent to 80 percent within two years. Today, AI agents already create 97 percent of all database branches.
Security and governance
One of the biggest challenges organizations face is figuring out how to handle security and other related issues when AI agents access data lakehouses.
In the past, data was primarily routed to dashboards where security and access controls were hard-coded. Or the data went to data analysts who worked within their own access rights. The first AI applications were predominantly based on Retrieval-Augmented Generation (RAG). In this pattern, classic, deterministic software extracts the required data and inserts it into the prompt of a large language model for a specific use case. Developers were able to define the security rules individually for each workflow.
With the advent of agentic AI and MCP (Model Context Protocol) servers, this model is fundamentally changing: AI agents can now decide for themselves what data they need and retrieve it on their own.
According to Genpact manager Arellano, companies must therefore develop new concepts to manage the identities of AI agents, control access to data, create audit trails and filter prompts and content.
“Agents need their own credentials,” he explains. For example, AI agents might never be allowed to access patient records. “Audit trails are just as important in order to be able to fully understand what the agent has done.”
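What agent-specific credentials combined with an audit trail might look like can be sketched in a few lines. This is an illustration under assumptions, not the API of any lakehouse or identity product: the class names, dataset names and log fields are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical credential issued to one AI agent, not to a human user."""
    agent_id: str
    allowed_datasets: frozenset


@dataclass
class AuditedGateway:
    """Checks every request against the agent's own permissions and logs it."""
    audit_log: list = field(default_factory=list)

    def request(self, agent: AgentIdentity, dataset: str) -> bool:
        allowed = dataset in agent.allowed_datasets
        # Record granted and denied attempts alike so the trail is complete.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent.agent_id,
            "dataset": dataset,
            "allowed": allowed,
        })
        return allowed


if __name__ == "__main__":
    gateway = AuditedGateway()
    billing_agent = AgentIdentity("billing-agent", frozenset({"invoices"}))
    print(gateway.request(billing_agent, "invoices"))
    print(gateway.request(billing_agent, "patient_records"))
    print(len(gateway.audit_log))
```

In this sketch a denied request is logged as well, so that the trail shows what an agent attempted and not only what it was given.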
Arellano says some lakehouse providers, including Databricks, offer this functionality. Additionally, companies can integrate tools from providers such as Okta, Palo Alto or Zscaler.
The new semantic frontier
The next stage of lakehouse development is the semantic layer. Gartner estimates that universal semantic layers will be essential infrastructure by 2030.
“Building a universal semantic layer is now a mandatory task for data and analytics leaders who lead or support AI projects,” explains Gartner. “This is the only way to improve accuracy, control costs, significantly reduce technical AI debt, align multi-agent systems and prevent costly inconsistencies before they spread.”
It is not enough to simply give an AI access to data. It also needs to understand what this data actually means for the company. The semantic layer represents exactly this business knowledge, which is usually not formalized in a structured database. For example, the terms “customer” and “order” may mean different things in different company systems.
“Historically, the semantic layer was desirable but not essential because data scientists knew which data sources they wanted to query,” explains Amit Kinha, board member of the FinOps Foundation and field CTO at DoiT International, a cloud consulting firm.
However, this no longer applies to AI agents. “Without a semantic layer, an agent may not know where to look for the data it needs,” says Kinha. Or even worse: “It executes incorrect joins or triggers processes that cause costs to skyrocket.”
Therefore, the semantic layer will be crucial in the future for effectively using data lakehouses for AI.
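In its simplest form, a semantic layer can be thought of as a lookup from business terms to one agreed definition and data source. The following is a minimal sketch under that assumption. The table names, keys and definitions are invented for illustration and do not come from any vendor's product.

```python
# Hypothetical semantic layer: each business term maps to one agreed meaning.
SEMANTIC_LAYER = {
    "customer": {
        "table": "crm.accounts",
        "key": "account_id",
        "definition": "An organization with at least one active contract.",
    },
    "order": {
        "table": "sales.orders",
        "key": "order_id",
        "definition": "A confirmed purchase, excluding quotes and drafts.",
    },
}


def resolve_term(term: str) -> dict:
    """Return the agreed source and meaning for a business term."""
    entry = SEMANTIC_LAYER.get(term.strip().lower())
    if entry is None:
        # Refusing is safer than letting an agent guess a table or a join.
        raise KeyError(f"No semantic definition for term: {term!r}")
    return entry


if __name__ == "__main__":
    print(resolve_term("Customer")["table"])
```

An agent that resolves terms this way before querying has a defined place to look, and an undefined term fails early instead of leading to an improvised and possibly expensive join.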
Learning system
The semantic layer can also become part of a learning and feedback process. Kevin Martelli, Consulting AI Solution Development Leader at EY Americas, describes the following example: Suppose a company requires payments over $500,000 to be approved by the CFO. An AI agent requests approval from an employee.
However, the employee realizes, “I’m supposed to approve this invoice, but I know that amounts over $500,000 require additional approval from the CFO.”
This information can then be stored permanently. “It can be used within the session and then stored back in the lakehouse as a process document or event log,” explains Martelli. As a result, agentic systems learn with each use.
“This is exactly why the system becomes more and more valuable over time – because you will never be able to model everything perfectly on the first day.”
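The feedback loop from Martelli's example can be sketched as follows. The schema and class names are hypothetical, and a plain in-memory list stands in for the process document or event log that would be written back to the lakehouse.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRule:
    """One piece of human feedback captured as a reusable rule (hypothetical schema)."""
    description: str
    threshold: float
    approver: str


@dataclass
class RuleLog:
    """Stands in for the process documents or event log kept in the lakehouse."""
    rules: list = field(default_factory=list)

    def record_feedback(self, description: str, threshold: float, approver: str) -> None:
        # Persist the correction so that later sessions can reuse it.
        self.rules.append(ApprovalRule(description, threshold, approver))

    def required_approvers(self, amount: float) -> list:
        # "Over" the threshold is read here as strictly greater than.
        return [rule.approver for rule in self.rules if amount > rule.threshold]


if __name__ == "__main__":
    log = RuleLog()
    log.record_feedback("Payments over $500,000 need CFO approval", 500_000, "CFO")
    print(log.required_approvers(750_000))
    print(log.required_approvers(120_000))
```

Once the employee's correction has been recorded, an agent consulting the log would route the next large payment to the CFO without having to be told again.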
However, the semantic layer is still in an early development phase, with different lakehouse providers pursuing different approaches.
“There is currently a lot of discussion in the industry about how data lakehouses and semantic layers are merging and where this layer should actually be located in the future,” explains Matt Arellano, SVP of data and AI at digital transformation consultancy Genpact. Some providers integrate semantic functions directly into their lakehouse platforms or purchase corresponding specialist companies. Other companies rely on specialized third-party providers instead.
“Customers are having a hard time with this,” says Arellano. “They’re all trying to figure out what combination of tools and processes is right for the long term.”
Steven Karan, vice president of AI transformation at Capgemini Australia and New Zealand, sees the lakehouse moving toward a central orchestration layer.
“Companies are now focusing less on traditional analyses and reports and more on AI-driven applications and agent-based systems,” he explains. “The most effective architectures I see today combine a lakehouse core with specialized serving layers.”
These include vector databases for AI, streaming platforms for real-time data, and operational databases for low-latency applications.
The lakehouse is no longer just for analytics, he adds. It is the foundation for company data and AI. “Today, its task is less to replace all other systems and more to connect them, control them centrally and monitor them. This is to ensure that companies can innovate faster without losing control of their data.” (mb)
This article is based on a post from CIO.com.
