Why Gemini 3.0 Is A Great Builder But Still Needs A Human In The Loop

I spent a few weeks building a Neuro-Symbolic Manufacturing Engine. I proved that AI can design drones that obey physics. I also proved that asking AI to pivot that code to robotics is a one-way ticket to circling the drain.

Over the last few weeks, I have been documenting my journey building OpenForge, an AI system capable of translating vague user intent into flight-proven hardware.

The goal was to test the reasoning capabilities of Google’s Gemini 3.0. I wanted to answer a specific question: Can an LLM move beyond writing Python scripts and actually engineer physical systems where tolerance, voltage, and compatibility matter?

The answer, it turns out, is a complicated “Yes, but…”

I am wrapping up this project today. Here is the post-mortem on what worked, what failed, and the critical difference between Generating code and Refactoring systems.

The Win: Drone_4 Works

First, the good news. The drone_4 branch of the repository is a success.

If you clone the repo and ask for a “Long Range Cinema Drone,” the system works from seed to simulation.

It understands intent: It knows that “Cinema” means smooth flight and “Long Range” means GPS and Crossfire protocols.
It obeys physics: The Compatibility Engine successfully rejects motor/battery combinations that would overheat or explode.
It simulates reality: The USD files generated for NVIDIA Isaac Sim actually fly.
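
To make the "obeys physics" point concrete, here is a minimal sketch of the kind of rule a compatibility check can encode. It is not the actual OpenForge Compatibility Engine: the Motor and Battery classes, their fields, and check_compatibility are hypothetical names invented for illustration, and the two rules (cell-count rating and a capacity-times-C-rating current limit) are a simplification of real electrical checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motor:
    # Hypothetical part record, not an OpenForge type.
    name: str
    max_cells: int         # highest LiPo cell count (e.g. 6 for "6S") the motor is rated for
    max_current_a: float   # peak current draw per motor, in amps


@dataclass(frozen=True)
class Battery:
    # Hypothetical part record, not an OpenForge type.
    name: str
    cells: int             # LiPo cells in series
    capacity_ah: float     # capacity in amp-hours
    c_rating: float        # continuous discharge rating

    @property
    def max_current_a(self) -> float:
        # Rule of thumb: continuous current = capacity (Ah) x C rating.
        return self.capacity_ah * self.c_rating


def check_compatibility(motor: Motor, battery: Battery, motor_count: int = 4) -> list[str]:
    """Return human-readable problems; an empty list means the pair passes."""
    problems: list[str] = []
    # Rule 1: the pack must not exceed the motor's rated cell count.
    if battery.cells > motor.max_cells:
        problems.append(
            f"{battery.name} is {battery.cells}S but {motor.name} is rated for {motor.max_cells}S"
        )
    # Rule 2: all motors at full throttle must not exceed what the pack can deliver.
    total_draw_a = motor.max_current_a * motor_count
    if total_draw_a > battery.max_current_a:
        problems.append(
            f"{motor_count} x {motor.name} can draw {total_draw_a:.0f} A "
            f"but {battery.name} supplies {battery.max_current_a:.0f} A"
        )
    return problems
```

The check is deterministic Python, so an overconfident model cannot talk its way past it: a pairing either returns an empty list or it is rejected with a reason.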

I will admit, I had to be pragmatic. In make_fleet.py, I “cheated” a little bit. I relied less on the LLM to dynamically invent the fleet logic and more on hard-coded Python orchestration. I had to remind myself that this was a test of Gemini 3.0’s reasoning, not a contest to see if I could avoid writing a single line of code.

As a proof of concept for Neuro-Symbolic AI—where the LLM handles the creative translation, and Python handles the laws of physics—OpenForge is a win.
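
The division of labor described above can be sketched as a propose-and-validate loop. This is an illustration under stated assumptions, not OpenForge's code: design_with_validation is a hypothetical helper, propose stands in for a call to the LLM, and validate stands in for the symbolic physics checks. No real model API is used here.

```python
from typing import Callable

# A candidate design is just a bag of named numeric parameters in this sketch.
Design = dict[str, float]


def design_with_validation(
    intent: str,
    propose: Callable[[str, list[str]], Design],  # stand-in for the LLM call (assumption)
    validate: Callable[[Design], list[str]],      # stand-in for the physics checks (assumption)
    max_attempts: int = 3,
) -> Design:
    """Ask for candidates until one passes validation or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        # The "neuro" half: turn vague intent (plus prior rejections) into a candidate.
        candidate = propose(intent, feedback)
        # The "symbolic" half: deterministic rules decide whether it is acceptable.
        feedback = validate(candidate)
        if not feedback:
            return candidate
    raise ValueError(f"no valid design after {max_attempts} attempts: {feedback}")
```

The point of the shape is that the model never gets the last word. Every candidate passes through code that either accepts it or hands back concrete reasons for the next attempt.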

The Failure: The Quadruped Pivot

The second half of the challenge was to take this working engine and pivot it. I wanted to turn the Drone Designer into a Robot Dog Designer (the Ranch Dog).

I fed Gemini 3.0 the entire codebase (88k tokens) and asked it to refactor. It confidently spit out new physics, new sourcing agents, and new kinematics solvers.

I am officially shelving the Quadruped branch.

It has become obvious that the way I started this pivot led me down a rabbit hole of circular troubleshooting. I found myself in a loop where fixing a torque calculation would break the inventory sourcing, and fixing the sourcing would break the simulation.

The Quad branch is effectively dead. If I want to build the Ranch Dog, I have to step back and build it from scratch, using the Drone engine merely as a reference model, not a base to overwrite.

The Lesson: The Flattening Effect

Why did the Drone engine succeed while the Quadruped refactor failed?

It comes down to a specific behavior I’ve observed in Gemini 3.0 (and other high-context models).

When you build from the ground up, you and the AI build the architecture step-by-step. You lay the foundation, then the framing, then the roof.

However, when you ask an LLM to pivot an existing application, it does not see the history of the code. It doesn’t see the battle scars.

The original Drone code was broken into distinct, linear steps.
There were specific error-handling gates and wait states derived from previous failures.

Gemini 3.0, in an attempt to be efficient, flattened the architecture. It lumped distinct logical steps into singular, monolithic processes. On the surface, the code looked cleaner and more Pythonic. But in reality, it had removed the structural load-bearing walls that kept the application stable.

It glossed over the nuance. It treated the existing structure as a matter of style, not a structural necessity.
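
Here is a rough sketch of what those load-bearing walls can look like in code. It is not taken from the OpenForge repository: run_pipeline, GateError, and the stage tuple layout are hypothetical, and the sketch assumes each stage is a plain function paired with a gate that checks its output.

```python
from typing import Callable

State = dict[str, object]
# Each stage is (name, step function, gate that must approve the step's output).
Stage = tuple[str, Callable[[State], State], Callable[[State], bool]]


class GateError(RuntimeError):
    """Raised when a stage's output fails its gate."""


def run_pipeline(stages: list[Stage], state: State) -> State:
    """Run stages in order, stopping at the first gate that fails."""
    for name, step, gate in stages:
        state = step(state)
        # The gate is the "battle scar": a check added because this exact
        # hand-off broke before. Merging stages into one function deletes it.
        if not gate(state):
            raise GateError(f"gate failed after stage '{name}'")
    return state
```

A flattened version would run the same steps inside one function and often produce the same result on a good day. The difference shows up on a bad day: with gates, a failure stops at the stage that caused it instead of surfacing three steps later in the simulation.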

The Paradox of Capability: Gemini 2.5 vs. 3.0

This project highlighted a counterintuitive reality: Gemini 2.5 was safer precisely because it was less capable. The code it confidently spit out was truncated pseudo-code.

In previous versions, the outputs showed you how you might go about building something. You then had to make your own plan for filling in the guts of the program. Sometimes the model could write an entire file. Sometimes you had to go function by function.

Gemini 2.5 forced me to be the Architect. I had to go program-by-program, mapping out exactly what I wanted. I had to hold the AI’s hand.
Gemini 3.0 has the speed and reasoning to do it all at once. It creates a believable illusion of a One-Shot Pivot.

Gemini 3.0 creates code that looks workable immediately but is structurally rotten inside. It skips the scaffolding phase.

Final Verdict

If you are looking to build a Generative Manufacturing Engine, or any complex system with LLMs, here are my final takeaways from the OpenForge experiment:

Greenfield is Easy, Brownfield is Hard: LLMs excel at building from scratch. They are terrible at renovating complex, existing architectures without massive human hand-holding.
Don’t Refactor with Prompts: If you want to change the purpose of an app, don’t ask the AI to “rewrite this for X.” Instead, map out the logic flow of the old app, and ask the AI to build a new app using that logic map.
Architecture is Still King: You cannot view a codebase as a fluid document that can be morphed by an LLM. You must respect the scaffolding.
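
One way to act on the second takeaway is to write the logic map down as data and render it into the prompt, so the scaffolding is stated explicitly instead of left for the model to infer. This is a minimal sketch: Step, its fields, render_logic_map, and the prompt wording are all hypothetical, not something OpenForge ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    # Hypothetical record describing one stage of the existing app.
    name: str
    purpose: str   # what the step does
    guard: str     # the failure this step or its check exists to prevent


def render_logic_map(steps: list[Step], new_domain: str) -> str:
    """Turn the old app's flow into explicit instructions for a fresh build."""
    lines = [
        f"Build a new application for: {new_domain}",
        "Preserve every step and guard below as a separate stage. Do not merge them.",
        "",
    ]
    for index, step in enumerate(steps, start=1):
        lines.append(f"{index}. {step.name}: {step.purpose}")
        lines.append(f"   Guard: {step.guard}")
    return "\n".join(lines)
```

Writing the guard column is the useful part. It forces the human to record why each wait state and error gate exists, which is exactly the history the model cannot see in the code.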

OpenForge proved that we can bridge the gap between vague user intent and physical engineering. We just can’t take the human out of the architecture chair just yet.

That said, Gemini 3.0 is a massive leap from 2.5. Part of what I am exploring here is how to get the best out of a brand-new tool.