From a purely structural point of view, nothing stands in the way of the spread of AI applications in industry. Hugging Face, which has become a "GitHub of the AI era," claims to host more than 2 million AI models available for download for virtually any purpose, most of them free and open source. In addition, there are almost a million data sets covering every conceivable purpose and language.
Admittedly, resources in IT departments and the associated know-how are still relatively scarce, but thanks to the variety of training offerings as well as developers' willingness to experiment and their ambition, the situation is improving quickly. Evidence of this is that almost every company is currently working on AI projects. Nevertheless, the majority of these projects never make it into production. The two main reasons: an inadequate data foundation and difficulties integrating AI functionality into existing processes.
The more pressing of the two problems is data. "AI-enabled data is an essential prerequisite for AI success because it all starts with the data foundation," as Gartner puts it. But the current situation leaves a lot to be desired. According to a Gartner report last year, 63 percent of companies did not have the right data management practices for AI. Without systematic cleaning and structuring of the data, however, even the most advanced models are prone to errors. According to the market researchers, around 60 percent of all AI projects will be abandoned this year due to the lack of an "AI-ready" data foundation.
Bridge between computing and cognitive understanding
The absence of an AI-ready data foundation is due to incomplete data integration, data management that falls short of AI requirements and, above all, a lack of access to a company's unstructured data. While structured, tabular data forms the backbone of classic analytics, the true potential of modern AI lies in unlocking unstructured content. An estimated 80 to 90 percent of all information generated in a company – from emails and PDF contracts to call notes and video recordings – exists in this "raw" format.
This content represents an immense wealth of knowledge, especially for Large Language Models (LLMs), because it contains the context, nuances and implicit experiential knowledge of each organization. Making it available in a format that AI can process is the prerequisite for bridging pure data processing and true cognitive understanding. This knowledge can never be fully compressed into rigid database rows: only by analyzing free text, audio data and video can an AI understand complex customer concerns or identify compliance risks in long contracts.
Shifting security budgets
The figures show that unlocking this context from unstructured content is a strategic necessity: according to Gartner, companies with successful AI initiatives invest up to four times more in their data foundation and analytics. The investment appears to pay off. Those who tame their flood of data transform "dead capital" into operational intelligence. The market researchers found that companies with the most sophisticated AI-enabled data and analytics capabilities achieve up to 65 percent better business results, particularly in terms of revenue growth and cost optimization.
Once the knowledge from the unstructured content is available, it not only benefits AI applications, but also the employees – and thus the entire company. Only when employees have easy access to information and insights from this content can they optimize business processes and use AI to develop innovative products.
Companies are therefore already investing heavily in data access and data security to give employees controlled, context-related access to relevant content. According to Gartner, by the end of this year, 75 percent of companies will shift their budgets from classic security strategies for structured data toward the protection and use of unstructured content. As companies increasingly rely on personalization, this content acts as the critical fuel for systems that can interpret human language and the world of work in all their complexity, says Gartner.
A multi-stage preparation is necessary
To prepare unstructured content for AI applications, the first step is to collect and clean the data – for example, by removing duplicates, irrelevant content, or formatting errors. For text, techniques such as tokenization and named entity recognition are used; audio is transcribed via speech recognition, while images and videos are processed with computer-vision methods (e.g., noise reduction, normalization).
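The cleaning step described above can be illustrated with a minimal sketch. The function names and sample documents are hypothetical, and the tokenizer is deliberately naive – a production pipeline would use a dedicated NLP library for tokenization and named entity recognition:

```python
import hashlib
import re

def clean_documents(docs):
    """Remove exact duplicates and normalize whitespace (minimal sketch)."""
    seen = set()
    cleaned = []
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc).strip()
        digest = hashlib.sha256(normalized.lower().encode("utf-8")).hexdigest()
        if not normalized or digest in seen:
            continue  # drop empty documents and exact duplicates
        seen.add(digest)
        cleaned.append(normalized)
    return cleaned

def tokenize(text):
    """Naive word tokenization; real pipelines would use spaCy or similar."""
    return re.findall(r"\w+", text.lower())

# Hypothetical sample documents; the second is a duplicate after normalization.
docs = [
    "The contract   ends on 2025-01-31.",
    "The contract ends on 2025-01-31.",
    "Invoice #4711 is overdue.",
]
cleaned = clean_documents(docs)
```

Deduplicating on a hash of the normalized text keeps memory usage low even for large corpora, since only digests – not full documents – need to be retained.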
In the next step, the data is structured and enriched, for example by adding metadata, classifying content, or extracting key information. From emails, for instance, one can extract not only the sender and date but also identify topics, sentiment, or calls to action. Without this preprocessing, AI models cannot recognize the patterns hidden in the data – and therefore cannot develop their full potential.
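As a sketch of this enrichment step, the following snippet parses a (hypothetical) email and attaches metadata: sender, date, a keyword-based topic label, and a crude call-to-action flag. The keyword map and heuristics are illustrative assumptions, not a production classifier:

```python
from email.parser import Parser

RAW_EMAIL = """From: anna.schmidt@example.com
Date: Mon, 3 Mar 2025 10:15:00 +0100
Subject: Contract renewal

Please confirm the renewal by Friday. This is urgent.
"""

# Hypothetical keyword map; a real system would use a trained classifier.
TOPIC_KEYWORDS = {
    "contracts": ["contract", "renewal", "clause"],
    "billing": ["invoice", "payment", "overdue"],
}

def enrich_email(raw):
    """Parse an email and attach structured metadata (minimal sketch)."""
    msg = Parser().parsestr(raw)
    body = msg.get_payload().lower()
    topics = [t for t, kws in TOPIC_KEYWORDS.items()
              if any(k in body for k in kws)]
    return {
        "sender": msg["From"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "topics": topics,
        # Crude heuristic: treat polite requests as calls to action.
        "call_to_action": "please" in body or "confirm" in body,
    }

record = enrich_email(RAW_EMAIL)
```

The resulting record is machine-readable: it can be indexed, filtered, or fed into downstream AI applications alongside the raw text.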
To make this content broadly yet compliantly available to both AI systems and employees, a unified platform is recommended that enables both technical processing and comprehensive governance – ideally including scalable storage, versioning, and access control. At the same time, companies must ensure that the data is contextualized, for example by linking it to existing knowledge bases or enriching it with external sources. Only in this way can unstructured content be turned into machine-readable, high-quality data sets that serve as the basis for training and operating AI applications.
Seamless data governance is essential
The use of unstructured content for AI applications goes hand in hand with seamless data governance – especially in the EU and German-speaking countries, where strict regulatory requirements such as the GDPR and the EU AI Act apply. These laws require not only the protection of personal data but also transparency, traceability, and fairness in the development and use of AI systems. Companies must ensure that their data foundation is free of bias, that the origin and processing of the data are documented, and that data subjects can exercise their rights, such as access or erasure. Without robust data governance, companies risk not only high fines but also a loss of trust among customers and partners, which jeopardizes the long-term acceptance and success of AI solutions.
In addition, the EU AI Act requires that companies assess and classify the risks of their AI applications, especially for high-risk systems such as those in critical infrastructure, human resources or healthcare. Not only the data itself, but also the AI’s decision-making processes must be fully documented and verifiable. Data governance must ensure that these requirements are met by establishing clear responsibilities, standardized processes and technical controls. In doing so, it creates the basis for operating AI applications not only in a legally secure manner, but also in an ethical and socially responsible manner.
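The traceability requirements above imply that every dataset should carry a documented lineage. The following is a minimal sketch of such a provenance record; the class, field names, and pipeline identifiers are illustrative assumptions, not a reference to any specific governance product:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal lineage record: where a dataset came from, whether it
    contains personal data, and which processing steps were applied."""
    source: str
    contains_personal_data: bool
    steps: list = field(default_factory=list)

    def log_step(self, action, actor):
        """Append an auditable processing step with a UTC timestamp."""
        self.steps.append({
            "action": action,
            "actor": actor,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

# Hypothetical usage: documenting how a CRM export was prepared.
rec = ProvenanceRecord(source="crm_exports/2025-03",
                       contains_personal_data=True)
rec.log_step("deduplicated", "pipeline/cleaning-v2")
rec.log_step("anonymized customer names", "pipeline/pii-filter")
```

Keeping such records alongside the data itself makes it possible to answer auditors' questions – which source, which transformations, which actor – without reconstructing the pipeline after the fact.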
