Cloud-based content moderation is a privacy nightmare. Sending screenshots to a server just to check for safety? That’s a non-starter.
My hypothesis was simple: modern mobile hardware is powerful enough to support a “Guardian AI” that sees the screen and redacts sensitive info (nudity, violence, private text) in milliseconds—strictly on-device using a hybrid inference strategy.
I called it ScreenSafe.
But the journey from concept to reality revealed a chaotic ecosystem of immature tools and hostile operating system constraints.
Here is the architectural breakdown of how I built ScreenSafe, the “integration hell” I survived, and why local AI is still a frontier battleground.
Watch the full technical breakdown on YouTube.
1. The Stack: Why Cactus?
I needed three things: guaranteed privacy, zero latency (in theory), and offline capability.
I chose Cactus (specifically cactus-react-native). Unlike cloud APIs, Cactus acts as a high-performance C++ wrapper around low-level inference graphs. It utilizes the device’s NPU/GPU via a C++ core, exposed to React Native through JNI (Android) and Objective-C++ (iOS).
The goal: Capture the screen buffer @ 60FPS -> Pass to AI -> Draw redaction overlay. Zero network calls.
We implemented a Two-Stage Pipeline to solve hallucination issues:
- Stage 1 (Vision): Uses lfm2-vl-450m to generate a descriptive security analysis of the image.
- Stage 2 (Logic): Uses qwen3-0.6 to analyze that description and extract structured JSON data regarding PII (Credit Cards, SSNs, Faces). The hand-off is sketched below.
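A minimal sketch of that two-stage hand-off follows. The runVisionModel and runTextModel helpers are hypothetical stand-ins for the cactus-react-native completion calls (the real API differs); only the model names and the JSON shape come from the pipeline above.

```typescript
// Hypothetical wrappers standing in for the cactus-react-native completion calls.
declare function runVisionModel(model: string, imagePath: string, prompt: string): Promise<string>;
declare function runTextModel(model: string, prompt: string): Promise<string>;

interface PIIReport {
  hasPII: boolean;
  types: string[]; // e.g. ["credit_card", "ssn", "face"]
}

export async function analyzeFrame(imagePath: string): Promise<PIIReport> {
  // Stage 1 (Vision): lfm2-vl-450m describes the screen from a security angle.
  const description = await runVisionModel(
    'lfm2-vl-450m',
    imagePath,
    'Describe any sensitive content visible: documents, card numbers, faces, nudity.'
  );

  // Stage 2 (Logic): qwen3-0.6 turns that description into structured JSON.
  const raw = await runTextModel(
    'qwen3-0.6',
    `Given this description, return JSON {"hasPII": boolean, "types": string[]}:\n${description}`
  );

  return JSON.parse(raw) as PIIReport;
}
```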
2. The Build System Quagmire
The first barrier wasn’t algorithmic; it was logistical. Integrating a modern C++ library into React Native exposed the fragility of cross-platform build systems. My git history is a graveyard of “fix build” commits.
The Android NDK Deadlock
React Native relies on older NDK versions (often pinned to r21/r23) for Hermes. Cactus, running modern LLMs with complex tensor operations, requires modern C++ standards (C++20) and a newer NDK (r25+).
This created a Dependency Deadlock:
- Choose the old NDK: Cactus fails with syntax errors.
- Choose the new NDK: React Native fails with ABI incompatibilities.
I faced constant linker failures, specifically undefined symbols like std::__ndk1::__shared_weak_count. This is a hallmark of libc++ version mismatch. The linker was trying to merge object files compiled against different versions of the C++ standard library.
The Fix: A surgical intervention in local.properties and build.gradle to force specific side-by-side NDK installations, effectively bypassing the package manager’s safety checks. Open PR: github.com/cactus-compute/cactus-react-native/pull/13.
3. The iOS Memory Ceiling (The 120MB Wall)
Once the app built, I hit the laws of physics on iOS. The requirement was simple: Share an image from Photos -> Redact it in ScreenSafe. This requires a Share Extension.
However, iOS enforces a hard memory limit on extensions—often cited as low as 120MB. If you exceed this, the kernel’s Jetsam daemon sends a SIGKILL, and the app vanishes.
The Physics of LLMs
- Model Weights (Q4): ~1.2 GB
- React Native Overhead: ~40 MB
- Available RAM: 120 MB.
You cannot fit a 1.2GB model into a 120MB container.
The “Courier” Pattern
I had to re-architect. The Share Extension could not perform the inference; it could only serve as a courier.
- Data Staging: The Extension writes the image to an App Group (a shared file container).
- Signal: It flags hasPendingRedaction = true in UserDefaults.
- Deep Link: It executes screensafe://process, forcing the OS to switch to the main app.
The main app, running in the foreground, has access to the full 6GB+ of device RAM and handles the heavy lifting.
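On the React Native side, the main app's half of the courier looks roughly like this. AppGroupStorage and redactImage are hypothetical placeholders for the native App Group bridge and the inference entry point; only the screensafe://process deep link and the hasPendingRedaction flag come from the flow above.

```typescript
import { Linking } from 'react-native';

// Hypothetical native module exposing the shared App Group container
// and the UserDefaults flag written by the Share Extension.
interface AppGroupStorage {
  getFlag(key: string): Promise<boolean>;
  clearFlag(key: string): Promise<void>;
  readStagedImage(): Promise<string>; // returns a file path inside the App Group
}
declare const AppGroupStorage: AppGroupStorage;

// Hypothetical entry point for the heavy on-device inference.
declare function redactImage(path: string): Promise<void>;

// The extension fires screensafe://process; the main app wakes up with full RAM.
Linking.addEventListener('url', async ({ url }) => {
  if (!url.startsWith('screensafe://process')) return;

  if (await AppGroupStorage.getFlag('hasPendingRedaction')) {
    const imagePath = await AppGroupStorage.readStagedImage();
    await redactImage(imagePath); // runs in the foreground app, not the extension
    await AppGroupStorage.clearFlag('hasPendingRedaction');
  }
});
```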
4. Android IPC: The 1MB Limit
While iOS struggled with static memory, Android struggled with moving data. Android uses Binder for IPC. The Binder transaction buffer is strictly limited to 1MB per process.
A standard screenshot (1080×2400) is roughly 10MB uncompressed. When I tried to pass this bitmap via an Intent, the app crashed instantly with TransactionTooLargeException.
The Solution: Stop passing data; pass references.
- Write the bitmap to internal storage.
- Pass a content:// URI string (a few bytes in size) via the Intent.
- The receiving Activity streams the data from the URI. The JavaScript side of this hand-off is sketched below.
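A rough sketch of the reference-passing side, assuming react-native-fs for the file write; NativeShare and the FileProvider authority are hypothetical placeholders for the native Intent plumbing.

```typescript
import RNFS from 'react-native-fs'; // assumed dependency for file I/O

// Hypothetical native module that wraps a FileProvider and fires the Intent.
declare const NativeShare: {
  sendUri(contentUri: string): Promise<void>;
};

export async function handOffBitmap(base64Bitmap: string): Promise<void> {
  // 1. Write the ~10MB bitmap to internal storage instead of the Binder buffer.
  const path = `${RNFS.CachesDirectoryPath}/pending_capture.png`;
  await RNFS.writeFile(path, base64Bitmap, 'base64');

  // 2. Pass only a tiny content:// URI string through the Intent (well under 1MB).
  //    The authority is illustrative; it must match the app's FileProvider config.
  const contentUri = 'content://com.screensafe.fileprovider/cache/pending_capture.png';
  await NativeShare.sendUri(contentUri);

  // 3. The receiving Activity opens an InputStream on the URI and streams the bytes.
}
```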
5. The Reality of “Real-Time”
Synchronizing Two Brains (Vision vs. Text)
Multimodal means processing pixels and text. On a server, these run in parallel. On a phone, they fight for the same NPU.
We hit a classic race condition: the vision encoder was fast (it could flag an image almost instantly), but the text decoder was slow.
- Scenario: Vision says “Safe.” Text is still thinking.
- The Risk: Do we block the screen? If we wait, the UI stutters (latency). If we don’t, we risk showing a harmful caption.
I had to engineer a complex state machine to manage these async streams, ensuring we didn’t lock the JS thread while the C++ backend was crunching numbers:
- Dynamic Context Sizing: Implemented checkDeviceMemory to detect available RAM and dynamically set the model context window:
  - < 3 GB RAM → 256 tokens (Safe mode)
  - 3–6 GB RAM → 512 tokens (Standard mode)
  - > 6 GB RAM → 1024 tokens (High-performance mode)
- Timeout Protection: Added a 15-second timeout to the local text model inference. If it hangs (common on emulators), it gracefully fails instead of crashing the app, showing a “Limited analysis” warning. Both guards are sketched below.
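A condensed sketch of both guards; getAvailableMemoryGB is a hypothetical native binding, while the thresholds and the 15-second deadline are the values listed above.

```typescript
// Hypothetical native binding that reports available device RAM in GB.
declare function getAvailableMemoryGB(): Promise<number>;

// Dynamic context sizing: mirrors the RAM -> context-window table above.
export async function checkDeviceMemory(): Promise<number> {
  const gb = await getAvailableMemoryGB();
  if (gb < 3) return 256;   // Safe mode
  if (gb <= 6) return 512;  // Standard mode
  return 1024;              // High-performance mode
}

// Timeout protection: race the local inference against a 15-second deadline.
export async function withTimeout<T>(inference: Promise<T>, ms = 15_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('LOCAL_INFERENCE_TIMEOUT')), ms);
  });
  try {
    return await Promise.race([inference, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
// On timeout the UI shows the "Limited analysis" warning instead of crashing.
```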

PII Detection Logic
- Logic Update: We prioritized the presence of types. If the types array is not empty, hasPII is forced to true, overriding the LLM’s boolean flag.
- JSON Repair: The local LLM (qwen3-0.6) was returning conversational <think> blocks and sometimes malformed JSON, causing JSON.parse to fail. We enhanced the JSON cleaning regex to handle more edge cases from the model output (e.g., trailing commas, markdown blocks). See the sketch after this list.
- Cloud Fallback: We verified that the 15-second timeout correctly triggers the “Enable Cloud Mode” suggestion for users on low-end devices.
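The repair step looks roughly like this. The exact regexes in ScreenSafe differ, but the shape is the same: strip the <think> blocks and markdown fences, drop trailing commas, then let a non-empty types array override the model’s own boolean.

```typescript
interface PIIResult {
  hasPII: boolean;
  types: string[];
}

export function parsePIIResponse(raw: string): PIIResult {
  const cleaned = raw
    .replace(/<think>[\s\S]*?<\/think>/g, '') // drop conversational reasoning blocks
    .replace(/```(?:json)?/g, '')             // drop markdown code fences
    .replace(/,\s*([}\]])/g, '$1')            // drop trailing commas before } or ]
    .trim();

  // Grab the first {...} object in case the model wrapped it in prose.
  const match = cleaned.match(/\{[\s\S]*\}/);
  const parsed = JSON.parse(match ? match[0] : cleaned);

  const types: string[] = Array.isArray(parsed.types) ? parsed.types : [];

  return {
    // Presence of detected types wins over the model's own boolean flag.
    hasPII: types.length > 0 ? true : Boolean(parsed.hasPII),
    types,
  };
}
```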
Hybrid Cloud Inference Integration
We confirmed that the cloud API (https://dspy-proxy.onrender.com) is functional.
- /configure endpoint works.
- /register endpoint successfully registered the pii_detection signature.
- /predict endpoint returns valid JSON with PII analysis.
Furthermore, we added the logic to catch the timeouts and automatically retry the request (waking the server). If the cloud service remains unavailable, the app gracefully falls back to the local analysis result without crashing.
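A simplified sketch of that fallback path. The base URL and the /predict endpoint come from the section above; the retry count and request body shape are illustrative guesses.

```typescript
const CLOUD_BASE = 'https://dspy-proxy.onrender.com';

interface PIIResult {
  hasPII: boolean;
  types: string[];
}

// Retry a couple of times to give the sleeping server a chance to wake up.
async function callCloudPII(description: string, retries = 2): Promise<PIIResult> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(`${CLOUD_BASE}/predict`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ signature: 'pii_detection', inputs: { description } }),
      });
      if (res.ok) return (await res.json()) as PIIResult;
    } catch {
      // network error or timeout: fall through and retry
    }
  }
  throw new Error('CLOUD_UNAVAILABLE');
}

export async function analyzeWithFallback(
  description: string,
  localResult: PIIResult
): Promise<PIIResult> {
  try {
    return await callCloudPII(description);
  } catch {
    // Cloud stayed down: keep the on-device result instead of crashing.
    return localResult;
  }
}
```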
It Actually Works
We solved the crashes, but we couldn’t solve the latency. Despite the build breaks and the memory wars, we shipped it.
- Latency: 30ms – 100ms (Real-time).
- Privacy: 100% On-device.
- Cost: $0 (excluding the optional cloud fallback), with effectively infinite scalability since inference runs on the user’s device.
We proved that you can run high-fidelity AI on mobile if you’re willing to fight the memory limits and patch the build tools. But waiting on local inference can still feel like an eternity in mobile UX. This is the “frustration” of local AI: the gap between the instant feel of cloud APIs (which hide latency behind network states) and the heavy feel of a device physically heating up as it crunches tensors.
6. The “Antigravity” Companion
Debugging a neural net is like debugging a black box: you can’t step through its decision-making. I relied heavily on “Antigravity” to iterate on system prompts and fix hallucinations where the model thought a restaurant menu was “toxic text.”
I also used the dspy-proxy to help streamline some of these interactions and test model behaviors before deploying to the constrained mobile environment.

Conclusion
Building ScreenSafe proved that local, private AI is possible, but it requires you to be a systems architect, a kernel hacker, and a UI designer simultaneously. Until OS vendors treat “Model Inference” as a first-class citizen with dedicated memory pools, we will continue hacking build scripts and passing files through backdoors just to keep data safe.
Resources & Code
If you want to dig into the code or the proxy architecture I used to prototype the logic:
- ScreenSafe Repo: github.com/aryaminus/screensafe
- DSPy Proxy: github.com/aryaminus/dspy-proxy
- Watch the breakdown: YouTube Video
Liked this post? I’m building more things that break (and fixing them). Follow me on Twitter or check out my portfolio.
