Large language models seem like magic. These artificial intelligence models write poetry, draft legal arguments and debug code with a fluency that suggests true understanding.
But there is no magic, only patterns. An LLM learns language the same way a cryptographer cracks a code: by analyzing a massive volume of text and inferring the rules from the repetition of trillions of words. Grammar, context and even style are just statistical relationships.
Turn that powerful tool inward on a company's own data, however, and the patterns vanish, replaced by a digital Tower of Babel.
Here, data lives in disconnected spreadsheets, cryptic databases and dashboards built for a single purpose years ago. It’s a landscape of columns with names like CUST_ID_v2_final and metrics defined by unwritten rules known only to a handful of employees. The context that gave the internet’s data meaning is absent; each dataset is a language of one.
For AI to navigate this internal chaos, it needs a guide. It needs metadata. This layer of data that describes data has become the new strategic battleground for enterprise AI.
Why LLMs fail at business data
The structured, predictable patterns an LLM finds in human language don’t exist inside a company. The logic of one sales report tells the model nothing about the structure of a logistics database created by a different team a decade ago. Each system is an island.
The dilemma this creates is clear in a simple business question: “Show me our top 10 customers by profit in the Northeast last quarter.”
An LLM tasked with generating the right database query will fail. It doesn’t know that customer information lives in the active_clients table and not prospects_fy24. It can’t guess that region_id = 4 corresponds to the “Northeast,” or that the business rule for “profit” is to subtract the cost_amt column from the rev_amt column.
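To make the failure concrete, here is a minimal sketch of the query the model would have needed to produce. The names active_clients, prospects_fy24, region_id, rev_amt and cost_amt come from the example above; the orders table, the customer_name column, the join key and the date parameters are invented for illustration. None of these mappings can be recovered from the question alone.

```python
# The SQL an LLM would need to generate for "top 10 customers by
# profit in the Northeast last quarter." active_clients, region_id,
# rev_amt and cost_amt come from the example above; the orders table,
# customer_name and the date parameters are hypothetical.
CORRECT_QUERY = """
SELECT c.customer_name,
       SUM(o.rev_amt - o.cost_amt) AS profit  -- unwritten rule: profit = revenue minus cost
FROM active_clients AS c                      -- not prospects_fy24
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE c.region_id = 4                         -- 4 happens to mean "Northeast"
  AND o.order_date >= :quarter_start          -- "last quarter" must resolve to real dates
  AND o.order_date <  :quarter_end
GROUP BY c.customer_name
ORDER BY profit DESC
LIMIT 10;
"""
```

Every line of that query depends on knowledge that lives outside the data itself.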
Without a map that explicitly defines these terms, relationships and business rules, the LLM can only make a low-confidence guess, and it’s likely to guess wrong. Trust in the system erodes.
The mind trap
Companies have always solved for data chaos by relying on people. The answer to “What does this column mean?” or “Why did this metric spike last year?” was to find the right person and ask.
Efforts to digitize this knowledge have been going on for decades. Semantic layers, documentation wikis and even code comments have all tried to organize the chaos. But people are notoriously bad at maintaining such efforts.
The data changes, queries are modified and business logic evolves, while the documentation becomes a relic of how things used to work. This leaves companies reliant on a fragile, unscalable system. Departing workers take expertise with them. The original context for a dataset erodes over time. People become a bottleneck, throttling AI's potential.
For years, the goal was simple: Collect everything. Companies poured resources into building vast data lakes, believing that value was proportional to volume. We now know that was a mistake. A data lake without a map is just a swamp. It’s a stagnant reservoir of assets that are impossible to navigate and difficult to use.
The true value is never in the raw data itself. It's in the connective tissue: the metadata that gives data its structure, meaning and context.
Shifting left
AI needs knowledge to be systematically captured in a way that resists this decay. This has led to a shift in thinking: treating data as a product.
The idea is to “shift left,” moving the responsibility for metadata definition as close to the data’s source as possible. The team that creates the data also defines its meaning, quality and context. Data and metadata are bundled together as a single, durable product. This prevents the map from ever becoming outdated.
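As a minimal sketch of what bundling data and metadata might look like, consider a data product contract shipped by the producing team. The structure and every field name below are illustrative, not any particular standard.

```python
from dataclasses import dataclass, field

# A hypothetical "data product" contract: the team that creates the
# data ships its meaning alongside it. The structure is illustrative,
# not any particular specification.
@dataclass
class ColumnSpec:
    name: str
    description: str              # what the column actually means
    unit: str | None = None

@dataclass
class DataProduct:
    name: str
    owner: str                    # the producing team stays accountable
    columns: list[ColumnSpec] = field(default_factory=list)
    business_rules: dict[str, str] = field(default_factory=dict)

sales = DataProduct(
    name="active_clients",
    owner="sales-data-team@example.com",
    columns=[
        ColumnSpec("rev_amt", "Recognized revenue per order", unit="USD"),
        ColumnSpec("cost_amt", "Fully loaded cost per order", unit="USD"),
        ColumnSpec("region_id", "Sales region code; 4 = Northeast"),
    ],
    business_rules={"profit": "rev_amt - cost_amt"},
)
```

Because the definitions travel with the data, a change to the table forces a change to the contract in the same place, by the same team.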
Companies also need an active, intelligent layer that sits between their raw data and AI, a kind of corporate Rosetta Stone.
Its foundation is the data catalog, which acts as a storefront or registry for all of an organization’s data products. The catalog lists each product, where it lives, and who owns it.
Knowledge graphs and ontologies define the relationships. They formally map how a “customer” connects to an “order,” creating a web of machine-readable context.
Finally, the semantic layer acts as the menu for the business. It translates the complex data web into simple terms like “profit” or “region.” Humans and AI agents can both use these terms reliably.
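To sketch how these pieces fit together, assuming the same hypothetical schema as before, a semantic layer can be as little as a governed mapping from business terms to physical definitions. Nothing below is any particular product's API.

```python
# A toy semantic layer: business terms map to governed definitions, so
# neither humans nor AI agents need to know physical column names.
# All names and mappings are hypothetical.
SEMANTIC_LAYER = {
    "metrics": {
        "profit": "SUM(rev_amt - cost_amt)",   # the governed business rule
    },
    "dimensions": {
        "region": {
            "column": "region_id",
            "labels": {4: "Northeast", 7: "Southwest"},
        },
    },
}

def compile_query(metric: str, region_label: str) -> str:
    """Turn a business question into SQL using only semantic-layer terms."""
    expr = SEMANTIC_LAYER["metrics"][metric]
    region = SEMANTIC_LAYER["dimensions"]["region"]
    region_id = next(
        code for code, label in region["labels"].items() if label == region_label
    )
    return (
        f"SELECT customer_name, {expr} AS {metric} "
        f"FROM active_clients JOIN orders USING (customer_id) "
        f"WHERE {region['column']} = {region_id} "
        f"GROUP BY customer_name ORDER BY {metric} DESC LIMIT 10"
    )

print(compile_query("profit", "Northeast"))
```

An AI agent that asks for "profit" in the "Northeast" never touches region_id = 4; the layer, not the model, supplies that knowledge.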
Many vendors are now vying for control over this single source of truth. Owning the catalog and the semantic layer creates a powerful competitive moat, making their platforms the indispensable operating system for corporate intelligence.
LLMs are powerful explorers, ready to map this internal world. But they cannot discover new insights without a charter. The companies that win the next decade will be the ones that stop hoarding raw data and start treating it like a core asset: building, governing and shipping well-defined data products, each with its metadata built-in.
That is the true brain of the enterprise. It is the only key that will unlock the full promise of AI.
Sean Falconer is senior director of product, AI strategy, at Confluent Inc. He wrote this article for News. Falconer has been an academic, founder and Googler. His published works cover a wide range of topics from AI to quantum computing. He also hosts the popular engineering podcasts Software Engineering Daily and Software Huddle.
