World of Software
Copyright © All Rights Reserved. World of Software.

AI Can Now Do Expert-Level Work (Almost). 5 Surprising Findings from a Landmark ‘GDPval’ Study | HackerNoon

News Room
Last updated: 2025/09/30 at 12:48 PM
News Room Published 30 September 2025

Introduction: Moving Beyond the Hype to See What AI Can Really Do

The debate over AI’s impact on the job market is filled with speculation. But measuring its real-world effect the way we measured earlier technologies such as electricity, through historical adoption data, gives us only lagging indicators of a shift that’s already underway. What we’ve needed is a leading indicator: a way to see what AI is capable of right now.

A groundbreaking new benchmark from OpenAI, called GDPval, provides exactly that. Unlike typical academic tests, GDPval evaluates AI models on complex, real-world tasks sourced directly from industry professionals with an average of 14 years of experience. The results provide one of the clearest pictures yet of what today’s most advanced AI can, and can’t, do in a professional setting. Here are the five most surprising takeaways.

Takeaway 1: On Complex Professional Tasks, AI Is Approaching Human-Expert Quality

The study’s most significant finding is that the best AI models are beginning to perform at a level comparable to highly experienced industry experts, and this capability is improving roughly linearly over time. The tasks evaluated were not simple queries; they were complex projects requiring an average of 7 hours for a human professional to complete.

Against this high bar, the results were striking. On the GDPval benchmark, deliverables from the top-performing model, Claude Opus 4.1, were judged to be better than or as good as the human expert’s work in 47.6% of cases. In other words, counting wins and ties together, the best model matched or outperformed the human expert in just under half of the tasks. This suggests that AI’s ability to handle long-horizon, subjective knowledge work is far more advanced than many have assumed.
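To make the headline metric concrete, here is a minimal sketch of how a combined win-or-tie rate could be computed from a list of expert verdicts. The function name `win_or_tie_rate` and the verdict labels are assumptions chosen for illustration, and the sample verdicts are made up; they are not GDPval data.

```python
from collections import Counter


def win_or_tie_rate(verdicts):
    """Return the share of comparisons where the AI deliverable won or tied.

    `verdicts` is a list of strings: "win", "tie", or "loss", each describing
    how the AI deliverable fared against the human expert's deliverable.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Made-up verdicts for illustration only; these are not GDPval results.
example_verdicts = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie"]
print(f"win-or-tie rate: {win_or_tie_rate(example_verdicts):.1%}")
```

A rate reported this way treats a tie as equivalent to a win, which is why a figure such as 47.6% describes work that is at least as good as the expert’s, not strictly better.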

Takeaway 2: The “Best” AI Depends on the Job: A Battle of Accuracy vs. Aesthetics

The study evaluated several frontier models — including GPT-5, GPT-4o, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4 — and revealed that there is no single “best” AI for every job. Instead, different models demonstrate distinct strengths, making tool selection a critical factor for professional use. The two top models highlighted this trade-off clearly:

  • Claude Opus 4.1 was the best-performing model overall, with a particular strength in aesthetics. It excelled at tasks involving visual presentation, performing better on file types like .pdf, .xlsx, and .ppt where document formatting and professional slide layouts are key.
  • GPT-5 demonstrated a clear advantage in accuracy. It was superior at carefully following detailed instructions and performing correct calculations, making it a stronger choice for tasks requiring precision in pure text.

This distinction is crucial. It shows that effectively integrating AI into professional workflows isn’t just about using any AI, but about choosing the right tool for the specific demands of the task at hand.

Takeaway 3: AI’s Biggest Flaw Isn’t Hallucination — It’s Following Simple Directions

While much of the public conversation around AI failures focuses on “hallucinations,” the study found a more mundane but critical issue. The single most common reason that experts rejected an AI’s work was its simple failure to fully follow instructions.

This was a primary weakness for models like Claude, Grok, and Gemini. In contrast, GPT-5 had the fewest instruction-following issues, but its deliverables were most often rejected due to formatting errors. This is a surprising and important takeaway, as it shifts the focus from failures in complex reasoning to more fundamental challenges in compliance and attention to detail. Crucially, this finding directly explains why the “AI co-pilot” model requires such careful human oversight, as we’ll see next.

Takeaway 4: The “AI Co-pilot” Is Real, But Savings Require a Human in the Loop

The study’s analysis of speed and cost savings confirms the value of the “AI co-pilot” model, but with a critical caveat: human oversight is non-negotiable. A “naive” comparison can be misleading; for instance, the data for GPT-5 showed it could generate an initial deliverable 90 times faster than a human expert.

However, when researchers modeled a more realistic workflow of “try the AI, review the output, and fix it yourself if it’s wrong,” the gains shrank dramatically. In this scenario, the net speed improvement from using GPT-5 was just 1.12 times. This data, based only on OpenAI’s models, illustrates that realizing time and cost benefits is entirely dependent on having a human expert in the loop to review, validate, and correct the AI’s work.

Interestingly, the researchers note this calculation likely underestimates the true savings, as it over-penalizes the AI by assuming the human has to start from scratch after every failed attempt. Still, it suggests that AI’s immediate economic value lies in augmenting experts, not replacing them.
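The “try the AI, review the output, and fix it yourself if it’s wrong” workflow can be sketched as a simple expected-time calculation. This is a simplified model written for illustration, not the study’s exact formula: the function `net_speedup`, its parameters, and the review time and success rate used in the example are all assumptions. Only the 7-hour task length and the 90x generation speed come from the figures quoted above.

```python
def net_speedup(human_hours, ai_hours, review_hours, success_rate):
    """Estimate the speedup of a try-review-fix workflow over doing the task by hand.

    The expert always pays for the AI attempt and the review. If the AI output
    is rejected (probability 1 - success_rate), the expert redoes the task from
    scratch, which is the pessimistic assumption described in the text.
    """
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    expected_hours = ai_hours + review_hours + (1.0 - success_rate) * human_hours
    return human_hours / expected_hours


human_hours = 7.0             # average task length quoted in the article
ai_hours = human_hours / 90   # "90 times faster" initial generation
review_hours = 1.0            # hypothetical review time
success_rate = 0.4            # hypothetical share of AI outputs accepted

print(f"naive speedup: {human_hours / ai_hours:.0f}x")
print(f"net speedup:   {net_speedup(human_hours, ai_hours, review_hours, success_rate):.2f}x")
```

Even in this toy version, the review time and the cost of redoing rejected work dominate the result, which is why the net figure falls so far below the naive one. With a low enough success rate the workflow can even be slower than working without the AI.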

Takeaway 5: You Can Make AI Smarter Just by Asking It to Double-Check Its Work

One of the most practical findings was how easily AI performance can be improved through better prompting. Researchers gave GPT-5 a special prompt containing a detailed checklist, essentially asking it to double-check its own work for common errors. The results were significant:

  • It completely eliminated “black-square artifacts” that had previously appeared in over half of its generated PDFs.
  • It cut “egregious formatting errors” in PowerPoint files from 86% down to 64%.
  • Overall, it improved the model’s win rate against human experts by 5 percentage points.

The mechanism behind this improvement wasn’t magic, but engineering. The new prompt caused a sharp increase in the agent using its multi-modal capabilities to visually inspect its own deliverables, jumping from 15% to 97%. This shows that users can dramatically improve AI quality by guiding it to be more thorough and self-critical.
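As a rough illustration of the idea, the sketch below appends a self-check list to a task prompt. The study’s actual prompt is not reproduced here: the helper `with_self_check` and every checklist item are hypothetical, loosely modeled on the error types mentioned above.

```python
def with_self_check(task_prompt, checklist):
    """Append a numbered self-review checklist to a task prompt."""
    lines = [
        task_prompt.strip(),
        "",
        "Before submitting, verify each item below and fix any problem you find:",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(checklist, start=1)]
    return "\n".join(lines)


# Hypothetical checklist items; not the wording used in the study.
checklist = [
    "Render every output file to an image and inspect it visually.",
    "Confirm that PDFs contain no black squares or other rendering artifacts.",
    "Check slides for overlapping text, cut-off content, and inconsistent fonts.",
    "Re-read the instructions and confirm every requirement is met.",
]

prompt = with_self_check("Prepare a 10-slide quarterly sales review.", checklist)
print(prompt)
```

The point of the design is that the checklist asks for specific, checkable actions, such as visually inspecting the rendered file, instead of a generic request to “be careful.”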

Conclusion: The Dawn of the AI-Augmented Professional

The GDPval benchmark provides clear evidence that AI is rapidly evolving into a capable tool for serious, complex knowledge work. However, its application is nuanced. This study focused on self-contained, precisely specified tasks, not the interactive, ambiguous challenges that define much of professional life. The findings suggest we are not on the verge of mass replacement, but rather entering an era of human-AI collaboration. The true potential is unlocked by professionals who know how to choose the right model, provide clear instructions, and maintain rigorous expert oversight.

These models are already this capable; what happens to the world of work when they get just a little bit better?
