GenAI isn’t stealing your data in one dramatic burst. It leaks fragments—copied into prompts, screenshots, exports, and fine-tuning datasets that move between endpoints, SaaS apps, and cloud storage. Legacy DLP sees some hops. DSPM sees some resting places. Neither sees the whole story.
The only way to reliably track and stop AI-driven data exfiltration is to follow the data’s entire journey—its lineage—across endpoints, SaaS, and the cloud, then apply protection in real time. That’s the mindset behind Cyberhaven’s unified DSPM + DLP platform.
To see how this works in practice, join our live session and on-demand product launch event (details at the end of this post).
The New Data Breach Doesn’t Look Like a Breach
When people imagine an “AI incident,” they picture something cinematic: a rogue agent wiring the entire customer database into a model in one shot.
That’s almost never how it happens.
In the environments we see, AI‑related data loss looks more like this:
- A product manager pastes a few rows of roadmap data into a model for help writing a launch brief.
- A developer copies a code snippet with a proprietary algorithm into ChatGPT to debug a race condition.
- A finance analyst exports a slice of a board deck into a CSV to feed an internal LLM.
Each action in isolation seems harmless—“just a few lines,” “just a screenshot,” “just this one table.” But over weeks and months, those fragments accumulate across different tools, identities, and locations.
From an attacker’s point of view, you don’t need the entire truth in one place. Enough fragments, stitched together, are often just as valuable as the original.
Why AI Data Loss Is Almost Invisible to Traditional Tools
Most organizations are still protecting data with a mental model that assumes:
- Data lives in well‑defined systems (databases, file shares, document repositories).
- “Exfiltration” is a discrete event (a big upload, a large export, a massive email).
AI breaks both assumptions.
1. Data is now fragmented by default
We no longer share a file; we share pieces of it. That was already true with SaaS. AI multiplies it:
- A confidential slide becomes: two paragraphs in an email, three bullets in a Jira ticket, and a paragraph pasted into an AI prompt.
- A source code file becomes: a function pasted into a chat, a generated patch in Git, and a screenshot in a Slack thread.
By the time you notice something is wrong, the data has been chopped, transformed, translated, and blended into other content across dozens of systems. Our analysis of customer environments shows data moving continuously between the cloud and endpoints in ways that are impossible to understand if you only look at a single system or moment.
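To make fragment tracking concrete, here is a minimal sketch (not Cyberhaven's implementation) of one common technique, word-level shingling: hash every n-word window of a source document, then measure how many of a snippet's windows reappear elsewhere. A pasted prompt that reuses even one sentence scores high, even though the original file never moved.

```python
import hashlib

def shingles(text: str, n: int = 8) -> set[str]:
    """Hash every n-word window so derived fragments can be matched later."""
    words = text.lower().split()
    if len(words) <= n:
        return {hashlib.sha256(" ".join(words).encode()).hexdigest()}
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(len(words) - n + 1)
    }

def fragment_overlap(source: str, snippet: str, n: int = 8) -> float:
    """Fraction of the snippet's windows that also appear in the source."""
    snip = shingles(snippet, n)
    return len(shingles(source, n) & snip) / len(snip)

roadmap = "Q3 launch of Project Falcon targets the enterprise tier with usage pricing"
prompt = "help me draft a launch brief: Q3 launch of Project Falcon targets the enterprise tier"

# The prompt reuses most of one roadmap sentence, so it scores high
# even though the roadmap file itself never left the endpoint.
print(f"overlap: {fragment_overlap(roadmap, prompt, n=5):.0%}")
```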
2. Controls are still siloed by location
The security stack mirrors this fragmentation:
- DLP on endpoints and gateways focuses on data in motion.
- DSPM focuses on data at rest in SaaS and cloud.
- New AI security tools focus solely on prompts and responses within specific models.
Each one knows its domain well, but little about what happened before or after the event it observes. So you end up with:
- A DSPM alert that says: “This bucket contains sensitive data,” but not how it got there or who moved it.
- A DLP alert that says: “Someone pasted confidential text into a browser,” but not where the text originated or where it went next.
- An AI usage report that says: “These apps are talking to LLMs,” but not what underlying data they’re exposing.
Individually, these are partial truths. Together, without context, they become noise.
What We Learned by Betting the Company on Data Lineage
Long before “data lineage” became a slide on every security vendor’s pitch deck, we built a company around it.
Cyberhaven’s founding team came out of EPFL and the DARPA Cyber Grand Challenge, where we built technology to track how data flowed through systems at the instruction level, not just the file level. That research evolved into a security platform that could reconstruct the entire history of a sensitive object—where it was born, how it changed, who touched it, and where it tried to leave the organization.
We sometimes joke internally that we were “the original data lineage company” — we were shipping lineage‑based detection and response years before it was fashionable marketing language.
At the time, this approach solved problems like:
- Finding insider threats hidden in millions of “normal” file operations.
- Understanding complex IP leaks where content had been copied, compressed, encrypted, renamed, and moved across multiple systems.
We thought lineage was powerful then.
In the AI era, it’s non‑negotiable. Trying to secure AI data flows without lineage is like trying to ship full self‑driving without ever having driven around San Francisco collecting telemetry.
AI Made Lineage Mandatory, Not Optional
AI has accelerated two trends that were already underway:
- Data never sits still. It continuously moves between endpoints, SaaS, and the cloud.
- Security is moving from point products to platforms. Customers are tired of stitching together DSPM, DLP, insider risk, and a separate AI tool.
If you care about AI‑driven data exfiltration, you can’t afford to look only at:
- Static storage (DSPM alone), or
- Network egress (DLP alone), or
- AI prompts (AI tooling alone).
You need to understand how knowledge moves: how an idea in a design file becomes a bullet in a product document, a paragraph in a Slack thread, and a prompt to an external model.
That’s the whole reason we built Cyberhaven as a unified AI & data security platform that combines DSPM and DLP on top of a single data lineage foundation. It lets security teams see both:
- Where data lives (inventory, posture, misconfigurations), and
- How data moves (copy/paste, exports, uploads, AI prompts, emails, Git pushes, and more).
Once you have that complete picture, AI exfiltration stops being mysterious. It looks like any other sequence of events, just faster and more repetitive.
Principles for Actually Stopping AI-Driven Data Exfiltration
If I were starting a greenfield security program today, with AI in scope from day zero, here are the principles I’d insist on.
1. Unify data at rest and data in motion
You can’t secure what you only partially see.
- DSPM tells you where data sits, especially across SaaS and the cloud.
- DLP tells you how data is moving, especially at endpoints and egress points.
Together, with lineage, you get the full story: this model training dataset in object storage came from an export from this SaaS app, which originated in this internal HR system, and was enriched by this prompt flow to an external LLM.
That’s the level of context you need to decide whether to block, quarantine, or allow, especially when AI is involved.
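Under the hood, that “full story” is a graph walk: every export, upload, paste, or prompt becomes an edge between two locations, and reconstructing history means walking those edges backwards. Here is a minimal sketch of the chain described above, with invented event names and fields rather than any real schema:

```python
from dataclasses import dataclass

@dataclass
class FlowEvent:
    src: str      # where the content came from
    dst: str      # where it went
    action: str   # export, upload, paste, prompt, ...
    user: str

# Each observed movement becomes an edge in the lineage graph.
events = [
    FlowEvent("hr-system/comp.xlsx", "saas/hr-export.csv", "export", "analyst1"),
    FlowEvent("saas/hr-export.csv", "s3://training/comp-data.csv", "upload", "mleng2"),
    FlowEvent("s3://training/comp-data.csv", "chat.openai.com", "prompt", "mleng2"),
]

def trace_origin(location: str, events: list[FlowEvent]) -> list[FlowEvent]:
    """Walk edges backwards from a location to reconstruct its full history.
    (Simplified: assumes one inbound edge per location.)"""
    by_dst = {e.dst: e for e in events}
    chain = []
    while location in by_dst:
        chain.append(by_dst[location])
        location = by_dst[location].src
    return list(reversed(chain))  # oldest hop first

for hop in trace_origin("chat.openai.com", events):
    print(f"{hop.user}: {hop.action} {hop.src} -> {hop.dst}")
```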
2. Treat identity, behavior, and content as a single signal
Whenever I review a serious incident, there are three questions I want answered:
- What exactly was the data? (Regulated data, IP, source code, M&A docs?)
- Who was the human or service account behind the action? (Role, history, typical behavior.)
- How did this sequence of events differ from “normal” for that identity and that data?
Legacy tools usually answer only one of those in isolation:
- Content scanners know what, but not who.
- Identity systems know who, but not what they did with data.
- UEBA systems know anomalies, but not data sensitivity.
Lineage‑driven systems can correlate all three in real time, which is the only way to reliably find the handful of truly risky actions in the noise of millions of “normal” events.
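Here is a toy illustration of that correlation, with made‑up classes, weights, and thresholds; the point is that no single signal is damning on its own, but the combination is:

```python
def risk_score(evt: dict) -> float:
    """Combine what, who, and how-unusual into one score (illustrative weights)."""
    content = {"public": 0.0, "internal": 0.4, "source-code": 0.8, "regulated": 1.0}
    score = content.get(evt["data_class"], 0.5)       # what exactly was the data?
    if evt["identity"]["is_departing"]:               # who was behind the action?
        score += 0.3
    score += evt["behavior"]["anomaly"]               # how far from normal was it?
    return min(score, 1.0)

evt = {
    "data_class": "source-code",
    "identity": {"user": "dev42", "is_departing": True},
    "behavior": {"anomaly": 0.2},  # e.g., first-ever paste into an external LLM
}

# 0.8 (proprietary code) + 0.3 (departing insider) + 0.2 (novel destination) -> 1.0
print(f"risk: {risk_score(evt):.2f}")  # each signal alone would look benign
```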
3. Assume policies won’t keep up
Writing perfect AI policies is a losing game.
People will always find new tools, plugins, side channels, and workflows. If your protection depends on static rules that anticipate every vector, you’ll always be behind.
What works better in practice is:
- Broad, simple guardrails (“don’t move data with these characteristics to destinations in these classes”) combined with
- An AI‑assisted detection layer that uses lineage and semantic understanding to surface suspicious patterns you didn’t explicitly write a rule for.
We’re already seeing this with autonomous analysts that investigate lineage graphs and user behavior to propose or enforce controls without requiring a human to anticipate every scenario.
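In practice, broad guardrails are easiest to reason about when expressed as data over classes rather than one rule per tool. A minimal sketch, with hypothetical class names and actions:

```python
# Broad guardrail: data classes x destination classes, not one rule per tool.
GUARDRAILS = {
    ("regulated",   "external-ai"): "block",
    ("source-code", "external-ai"): "block",
    ("internal",    "external-ai"): "warn",       # coach the user in real time
    ("internal",    "personal-cloud"): "warn",
}

def enforce(data_class: str, dest_class: str) -> str:
    # Anything the table doesn't cover falls through to the detection layer,
    # which flags unusual lineage patterns no rule anticipated.
    return GUARDRAILS.get((data_class, dest_class), "allow")

# A brand-new AI plugin still classifies as an "external-ai" destination,
# so the guardrail holds without a plugin-specific rule ever being written.
print(enforce("source-code", "external-ai"))  # -> block
```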
4. Close the loop from insight to action
Seeing the problem isn’t enough. One of the biggest complaints we hear about stand-alone DSPM tools is that they generate lots of “insight” but no direct enforcement; teams are left opening tickets and chasing owners by hand. A lineage-first platform should instead:
- Prioritize where to scan and investigate based on live DLP telemetry (follow where sensitive data is actually moving).
- Offer one‑click remediation paths: revoke access, tighten sharing, quarantine misconfigured stores, or block risky exfiltration attempts in real time.
- Feed every enforcement decision back into the lineage and detection models so the system gets smarter over time.
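A compressed sketch of what that loop looks like in code; the class and function names here are invented for illustration, and real remediation would call SaaS and cloud provider APIs:

```python
class Detector:
    """Toy detection layer that learns from enforcement outcomes."""
    def __init__(self):
        self.labels: list[tuple[str, int]] = []

    def record_label(self, pattern: str, label: int) -> None:
        # Every enforcement decision becomes training signal for the next alert.
        self.labels.append((pattern, label))

def remediate(finding: dict) -> None:
    # Stand-in for one-click actions: revoke access, quarantine, block in real time.
    print(f"quarantine {finding['location']}, revoke access for {finding['user']}")

detector = Detector()

def close_the_loop(finding: dict, verdict: str) -> None:
    """One full cycle: detect -> remediate -> learn."""
    if verdict == "true-positive":
        remediate(finding)
        detector.record_label(finding["pattern"], 1)  # reinforce this pattern
    else:
        detector.record_label(finding["pattern"], 0)  # damp similar future alerts

close_the_loop(
    {"location": "s3://training/comp-data.csv", "user": "mleng2",
     "pattern": "export -> upload -> prompt"},
    verdict="true-positive",
)
```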
Without that tight loop, AI-driven leakage becomes another line item on an overcrowded risk register.
Why This Matters Now, Not “Someday”
There’s a reason AI has suddenly made data security a board‑level topic again.
- Employees are using AI tools faster than governance can keep up.
- New regulations and customer expectations are raising the stakes for data misuse.
- Attackers are experimenting with AI‑assisted reconnaissance and exfiltration.
At the same time, security teams are consolidating tools. They don’t want separate products for DLP, DSPM, insider risk, and AI security. They want one platform that can see and control data everywhere—at rest, in motion, and in use—with lineage as the connective tissue.
That’s the platform we’ve been building at Cyberhaven, starting with our early work on data lineage and evolving into a unified AI & data security platform that combines DLP, DSPM, insider risk, and AI security in a single system.
Want to See What This Looks Like in the Real World?
On February 3 at 11:00 AM PT, we’re hosting a live session where we’ll:
- Show the first public demo of our unified AI & data security platform and how it tracks data fragments across endpoints, SaaS, cloud, and AI tools in real time.
- Walk through how security teams get “X‑ray vision” into data usage, so they can isolate the risky handful of actions hidden in millions of normal events — and stop them before they turn into incidents.
- Share candid stories from security leaders on where legacy DLP and stand‑alone DSPM have failed them in the AI era, and how a lineage‑first approach changes the game.
- Talk about where we think DLP, insider risk, AI security, and DSPM are headed next — and why we believe the future belongs to platforms that were built on data lineage from day one, not retrofitted after the fact.
If you’re wrestling with AI adoption, shadow AI tools, or just a growing sense that your current stack is seeing only the surface of what’s happening to your data, I’d love for you to join us and ask hard questions.
Watch live
AI is already exfiltrating your data in fragments. The real question is whether you can see the story those fragments are telling, and whether you can act in time to change the ending.
:::tip
This story was published under HackerNoon’s Business Blogging Program.
:::
