In 2025, AI in software engineering has officially moved past the hype cycle. The 2025 Stack Overflow Developer Survey reports that 84% of respondents now use or intend to use AI in their development process, with 51% of professional developers relying on such tools every day.
It’s no longer a novelty, but it’s also not the magical “junior developer who never sleeps”. Instead, AI has become something more realistic and grounded: it helps engineers move faster, think more clearly, and focus their time where it matters. But to unlock this value, especially in fintech where data sensitivity is non-negotiable, companies must balance cloud-based experimentation with secure on-prem deployment, and learn how to measure the impact of AI effectively.
Over the last couple of years, engineering teams at one of the largest IT companies in Europe have run dozens of pilots and integrated AI into a wide range of development processes. Some things have worked remarkably well; others have disappointed. This article is based on those firsthand observations.
Where AI Delivers Real Value…
The most tangible productivity gains came from very pragmatic use cases. One of the strongest examples is code generation for simple tasks. AI tools have proved effective at creating service skeletons, CRUD operations, unit tests, SDK boilerplate, and routine migrations. For prototyping, where the bar for architectural perfection is lower, the acceleration is especially noticeable. Engineers no longer need to spend time wiring the same patterns again and again; the model handles the scaffolding, while humans focus on substance.
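To make the scaffolding point concrete, here is a minimal sketch of the kind of boilerplate these tools produce reliably. The Customer entity and in-memory repository are hypothetical examples, not taken from a real codebase:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Customer:
    id: int
    name: str
    email: str


class CustomerRepository:
    """In-memory CRUD repository: exactly the kind of wiring AI scaffolds well."""

    def __init__(self) -> None:
        self._items: Dict[int, Customer] = {}
        self._next_id = 1

    def create(self, name: str, email: str) -> Customer:
        customer = Customer(id=self._next_id, name=name, email=email)
        self._items[customer.id] = customer
        self._next_id += 1
        return customer

    def get(self, customer_id: int) -> Optional[Customer]:
        return self._items.get(customer_id)

    def update(self, customer_id: int, **fields) -> Optional[Customer]:
        customer = self._items.get(customer_id)
        if customer is not None:
            for key, value in fields.items():
                setattr(customer, key, value)
        return customer

    def delete(self, customer_id: int) -> bool:
        return self._items.pop(customer_id, None) is not None
```

The human contribution starts where the scaffold ends: choosing the storage model, the validation rules, and the failure behavior.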
Another category that consistently delivers value is summarization. Whether it’s pull requests, requests for comments, incident reports, extensive logs, or recordings of team meetings, AI distills hours of reading into digestible, structured insights. It has quickly become standard practice to skim the AI summary first and dive into details only when necessary.
A further strong win comes from what we call “conversational documentation” — chat-style interfaces layered over internal manuals, architectural guides, and even HR policies. One practical example is an internal AI-HRBP assistant that fields questions about benefits, processes, and corporate rules. It takes a load off human experts who previously handled those requests manually.
In code review, AI has also found its place — not as the final authority, but as a reliable first reviewer before a human takes over. On simple diffs, models can highlight missing checks, style inconsistencies, or forgotten edge cases, making the later manual review faster and more focused. The same applies to bug and vulnerability detection. Memory leaks, null dereferences, race conditions, and schema mismatches still surface quickly in AI-powered scans, giving engineers a clear starting point before they dive deeper.
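A contrived example of where this division of labor pays off. The payment handler below is hypothetical; the comments describe the kinds of findings a first-pass AI review typically surfaces on such a diff:

```python
import threading

_request_count = 0
_count_lock = threading.Lock()


def handle_payment(payment):
    """Toy handler used to illustrate typical first-pass review findings."""
    global _request_count
    # An unsynchronized `_request_count += 1` is the sort of race condition
    # an automated first pass flags immediately; the lock is the suggested fix.
    with _count_lock:
        _request_count += 1
    # A bare `payment.customer.account_id` is likewise flagged as a potential
    # null dereference (e.g. guest checkouts); the explicit check closes it.
    if payment.customer is None:
        raise ValueError("payment has no associated customer")
    return payment.customer.account_id
```

The human reviewer then spends their time on what the machine cannot judge: whether the handler belongs in this service at all.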
…And Where It Still Falls Short
As soon as complexity enters the picture, AI’s limitations become clear. Code generation for complex tasks, for example, consistently struggled to produce reliable results. Without very precise prompts, detailed input-output examples, and close human supervision, the generated solutions often included shallow architectural choices and blurred responsibility boundaries.
In practice, engineers frequently ended up spending more time reviewing and correcting this output than they would have spent implementing the feature themselves. What should have been acceleration turned into overhead.
Another limitation emerged with general-purpose agents. They were able to perform simple, well-defined tasks, such as scheduling a meeting in Google Calendar, but routinely failed in real corporate environments with customized infrastructure. An agent that could operate a public calendar but not an internal one offered little real value. The gap between theoretical capability and practical integration was, and still is, significant.
The Difference Context Makes
A recurring takeaway was that many AI tools become nearly useless when they’re cut off from context. For a model to make intelligent decisions, it must have access to:
- the actual codebase,
- configuration files,
- domain documentation,
- internal guidelines.
The real breakthroughs came when teams invested in RAG pipelines, high-quality embeddings, connectors to internal systems, and careful normalization of the documents. Once AI knew “how things work here,” quality rose dramatically.
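As an illustration of what “knowing how things work here” means in practice, here is a deliberately simplified retrieval sketch. The hashed bag-of-words embedding is a stand-in for whatever embedding model runs inside your perimeter, and the function names are ours, not from any specific framework:

```python
import hashlib
import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashed bag of words); swap in your in-house model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def build_index(documents: list[str]) -> np.ndarray:
    """Pre-compute normalized embeddings for code, configs, and guidelines."""
    vectors = np.stack([embed(doc) for doc in documents])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def retrieve(question: str, documents: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    """Return the k documents closest to the question by cosine similarity."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    scores = index @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Ground the model in internal context before it answers."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the internal context below.\n\n{context}\n\nQuestion: {question}"
```

Most of the engineering effort in real deployments goes into the unglamorous parts around this core: connectors, chunking, access control, and keeping the index fresh.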
Another discovery: the narrower the task, the better the result. When an engineer formulates a request with clear acceptance criteria, constraints, and examples, hallucinations drop and reliability increases. The opposite is also true — broad, fuzzy prompts almost guarantee superficial or incorrect output.
And finally, despite all progress, humans remain the final checkpoint. AI can assist, but cannot make production-level decisions independently. Human review and comprehensive automated tests continue to close the last mile.
Measuring the Impact: What Really Counts
Attempts to quantify AI’s impact often end up skewing incentives. Metrics like “lines of code written by AI” or “number of prompts per day” tell you nothing about engineering quality. Instead, we rely on a mix of quantitative indicators and qualitative feedback from teams — a combination that gives a clearer picture of where AI brings real value and where it adds overhead.
On the quantitative side, one of the strongest signals is the average time a pull request spends in review. With AI assisting in first-pass reviews and early issue detection, PRs began moving faster through the pipeline, resulting in tighter feedback loops and smoother workflows. Another helpful indicator is the number of issues opened early in the development cycle. While the increase may seem negative, it actually reflects earlier detection of bugs, inconsistencies, or architectural risks, which reduces surprises later on.
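The review-time signal is easy to compute from data most code-review systems already export. A minimal sketch, with made-up timestamps standing in for the real export:

```python
from datetime import datetime
from statistics import median

# Hypothetical export: (opened_at, first_approval_at) per pull request.
pull_requests = [
    (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 15, 30)),
    (datetime(2025, 3, 4, 11, 0), datetime(2025, 3, 5, 10, 0)),
    (datetime(2025, 3, 5, 14, 0), datetime(2025, 3, 5, 16, 45)),
]

hours_in_review = [
    (approved - opened).total_seconds() / 3600 for opened, approved in pull_requests
]
print(f"Median time in review: {median(hours_in_review):.1f}h")
```

Tracked per team and per week, the median (rather than the mean) keeps a handful of long-running PRs from masking the overall trend.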
To capture the human side of productivity, we also run regular team satisfaction surveys and ask questions such as “Do you feel more productive with AI tools?” The responses often provide context that raw numbers can’t: for instance, whether AI saves engineers time or simply shifts work into different parts of the workflow.
The same hybrid approach guides our decisions about long-term adoption. Every new capability starts as a simple pilot that tests the core hypothesis. We track engagement, monthly and daily active users, and outcome-related metrics. Only tools that show clear value move forward. This discipline helps distinguish genuinely useful capabilities from those that look impressive only on paper, and it naturally leads to the next question: how to deploy them safely, especially when sensitive internal context is involved.
The Case for On-Prem AI
In fintech, the space I work in most closely, the question isn’t just whether AI is useful; it’s whether you can deploy it without putting data at risk. And this is where on-prem AI earns its place: it gives companies full control over how information moves inside their systems. When your codebase, internal documentation, or operational logs cannot leave the perimeter, local deployments built on robust open-source models become the only realistic choice.
But here’s the nuance: for large enterprises, security isn’t the hardest part. They already operate under strict regulatory frameworks and mature internal practices. The real challenges show up among smaller engineering teams. Without deep experience in secure development, it’s easy to end up with misconfigured logs, exposed endpoints, or poorly isolated infrastructure — all of which have already caused very real breaches across the industry.
Another visible hurdle is cost. Modern models — from DeepSeek-class architectures to top-tier open-source LLMs — demand serious compute power. High-quality inference often requires clusters of eight A100 or H100 GPUs with 80GB each. That translates into $150,000 to $250,000 upfront, not counting ongoing operations, and it’s a number teams need to take into account from day one.
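The arithmetic behind that range is simple, and worth writing down explicitly when budgeting; the per-GPU prices below are assumptions that vary with vendor, interconnect, and support contracts:

```python
# Rough upfront estimate for the accelerators alone; prices are assumed, not quoted.
gpus_per_cluster = 8
price_per_gpu_usd = (19_000, 31_000)  # assumed range for 80GB-class A100/H100 cards

low = gpus_per_cluster * price_per_gpu_usd[0]   # 152,000
high = gpus_per_cluster * price_per_gpu_usd[1]  # 248,000
print(f"GPU hardware alone: ${low:,} to ${high:,}, before servers, networking, and operations")
```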
That said, cloud AI hasn’t disappeared from the equation. It’s still great for quick prototypes, early-stage experiments, and tasks involving public data. It helps teams move fast and save money — right up until sensitive context is involved.
A Glimpse Into the AI-Native Engineering Workflow
The role of AI in engineering teams is only going to expand. Over the next few years, more and more production code will be written by agents. Today they operate at the level of junior developers; soon they will reach mid-level proficiency, capable of handling routine tasks independently.
Teams will increasingly resemble a swarm of intelligent agents supervised by a handful of experienced engineers. The latter will focus on architecture, critical design decisions, and guiding the “hive,” while AI handles integration, refactoring, and maintenance.
Meanwhile, every internal service (wikis, issue trackers, CI/CD systems, CRMs) will gain machine-consumable interfaces, for example via the Model Context Protocol (MCP), and become a participant in the engineering workflow. This will create something close to a living ecosystem: AI agents that understand project history, collaborate across tools, and reason over long-term context.
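To make that less abstract, here is roughly what exposing an internal wiki to agents looks like. The tool name, fields, and handler are a sketch of the general MCP tool pattern (a name, a description, and a JSON-Schema input), not a drop-in server implementation:

```python
# Illustrative only: the shape of a tool an internal wiki might advertise to agents.
wiki_search_tool = {
    "name": "search_internal_wiki",
    "description": "Full-text search over architecture guides and runbooks.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What the agent is looking for."},
            "space": {"type": "string", "description": "Optional wiki space to narrow the search."},
        },
        "required": ["query"],
    },
}


def search_internal_wiki(query: str, space: str | None = None) -> list[dict]:
    """Placeholder handler; a real deployment would call the wiki's own API."""
    return [{"title": "Payments service runbook", "snippet": "How to roll back a failed release."}]
```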
But this raises a new challenge: where do we find the senior engineers capable of supervising these ecosystems? The answer will likely come from long, structured internships of a year or more, where juniors grow into specialists who can both write code and orchestrate AI-powered teammates.
What to Do (and Not to Do) When Deploying AI
For teams just beginning their AI journey — especially with on-prem deployments — a few principles consistently prove themselves in practice.
1. Start by identifying the exact processes you want to automate. Clear problem definition is what separates effective AI initiatives from scattered experiments. When teams know where AI fits in the workflow and what outcomes they expect, pilots become far easier to evaluate and scale.
What not to do: Don’t start by buying hardware or locking yourself into a specific model too early. Infrastructure is expensive. Choosing the wrong architecture upfront can turn into a costly mistake.
2. Invest not only in the model, but in the infrastructure around it. AI — especially when deployed on-prem — lives or dies by how quickly teams can iterate. Updating models and evaluating quality rely on solid tooling, automation, and observability. Speed comes from the ecosystem, not the algorithms alone.
What not to do: Don’t assume you can deploy a model once and forget about it. On-prem systems require continuous updates, monitoring, and operational hygiene. Ignoring this leads straight to technical debt.
3. Build with secure-development principles from day one. Think early about compliance, access boundaries, audit trails, and the certifications you may eventually need. Designing for security upfront is far easier than retrofitting it later.
What not to do: Don’t assume that “on-prem” automatically means “secure.” Misconfigured logs, exposed endpoints, and weak access controls remain common sources of breaches — especially for smaller teams. Models can still hallucinate regardless of where they run. On-prem reduces risk only when the surrounding infrastructure is engineered and maintained with discipline.
