The tasks resemble those that lawyers, doctors, financial analysts, and management consultants solve for a living. One asks for a diagnosis of a six-year-old patient based on nine pieces of multimedia evidence; another asks for legal advice on a musician’s estate; a third calls for a valuation of a healthcare technology company.
Mercor, which claims to supply “expert data” to every top AI company, says it has spent more than $500,000 to develop 200 tasks that test whether AIs “can perform economically valuable knowledge work” across law, medicine, finance, and management consulting. The resulting AI Productivity Index (APEX), published Wednesday, lists among its co-authors a former global managing director of McKinsey, a former dean of Harvard Business School, and a Harvard Law School professor, who advised on the design and scope of the tasks in their respective domains, according to Mercor. APEX is “focused on going very deep,” says Brendan Foody, the company’s 22-year-old CEO. “How do we get very comprehensive about what it means to be a consultant or a banker or a doctor or lawyer?”
To create the tasks, Mercor contracted white-collar professionals whose former employers include top banks (Goldman Sachs, JPMorgan), consulting firms (McKinsey, Boston Consulting Group), law firms (Latham & Watkins), and hospitals (Mount Sinai). They average 7.25 years of professional experience, and their pay at Mercor is competitive with their previous, highly prestigious employers. Mercor’s website advertises an average rate of $81 per hour, reaching over $200 per hour, equivalent to an annual salary of about $400,000, for “senior domain experts,” who require at least four years’ professional experience to apply.
“It’s hard to imagine a better job from a pay perspective,” says Matt Seck, a former investment banking analyst at Bank of America, who is contracted by Mercor to write finance tasks similar to those included in the paper.
Benchmarks have long been used to assess AI capability, but directly quantifying AI models’ ability to do economically useful work remains rare, according to the paper’s authors. On Mercor’s benchmark, “getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to, and then they deliver it to the requirements of a partner, whoever would be grading the work of that person,” says Nitski.
The models aren’t there yet, but they are improving fast. OpenAI’s GPT-4o, released in May 2024, scored 35.9% on the benchmark. GPT-5, released just over a year later, achieved 64.2%, the top score on the benchmark. Getting 64.2% doesn’t mean that GPT-5 delivers 64.2% of the value of a human worker; work that doesn’t score 100% “might be effectively useless,” write the paper’s authors. GPT-5 got full marks on only two of the 200 tasks, one in law and one in investment banking, according to Mercor.
Even if a model hits 100% on Mercor’s benchmark, it would probably make a poor substitute for human professionals. The tasks in Mercor’s benchmark focus on “well scoped deliverables,” such as making diagnoses or building financial models, rather than more open-ended tasks which might admit multiple valid approaches. This requires that the task descriptions include numerous assumptions needed to ensure that the desired output is well specified. The AIs’ outputs are entirely text-based, meaning that the benchmark doesn’t test AIs’ ability to use a computer the way that a human worker would. (Mercor says that future versions of APEX will address these limitations.) And drafting the lengthy prompts needed for models to complete the tasks “would be more tedious than just doing it yourself,” says Seck.
Still, there are signs that AI models are catching up to humans. Another benchmark, published Thursday, Sept. 25, by OpenAI, showed that expert human evaluators preferred an AI’s work to human work 47.6% of the time on 220 tasks, including designing a sales brochure for a property and assessing images of a skin lesion. OpenAI also found that the performance of its models has increased substantially in a short space of time, more than doubling in their “win rate” against humans between June 2024 and Sept. 2025.
As model capability has grown, so has the complexity of the tasks they’re tested on and the human skill needed to create suitably challenging tasks. Earlier tests measured relatively abstract capabilities on reasoning puzzles and exam-style questions. Benchmarks before the 2022 release of ChatGPT often sourced data from crowdworker services, which paid workers a few dollars an hour. By 2023, Ph.D. students were being asked to create challenging multiple-choice questions in biology, physics, and chemistry. In September, xAI reportedly laid off 500 of its “general” data workers as part of an “expansion and prioritization” of the company’s “specialist” data workers. To be sure, low-paid data workers still contribute to the development of AI models, but the upper bound of skill and compensation needed to develop AI benchmarks is increasing rapidly.
Directly measuring the utility of AI models on economically valuable tasks is “very hard to pull off,” says Nitski. The success criteria in domains such as finance and consulting are harder to define than, for example, in software engineering. Even with perfect criteria in hand, marking an AI’s output at scale is harder than in software engineering, where automated tests can check whether a piece of code runs correctly. This explains, in part, why tests aiming to measure the real-world utility of AI models have existed for software engineering since at least 2023, but have lagged in white-collar domains. However, as AIs have improved, they have been used to solve the problem of grading complex tasks. The success criteria for Mercor’s tasks are written by human experts, but the marking is done by AIs, which Mercor says agreed with human graders 89% of the time, helping to scale the evaluations.
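To make the contrast concrete, here is a minimal, hypothetical sketch; it is not Mercor’s actual system, and the function names, rubric, and keyword check are invented for illustration. It shows why grading differs between the two settings: a coding task can be verified by a deterministic test, while an open-ended finance answer has to be scored against expert-written criteria, the role the paper assigns to AI graders.

```python
# Illustrative sketch only; Mercor has not published its grading code.
# Contrast: a deterministic software test vs. rubric-based scoring of an
# open-ended answer (a crude keyword check stands in for an AI grader).

def grade_code_submission(candidate_sort) -> bool:
    """Software engineering: an automated test returns a hard pass/fail."""
    return candidate_sort([3, 1, 2]) == [1, 2, 3]

def grade_open_ended_answer(answer: str, rubric: list[str]) -> float:
    """Finance-style task: fraction of expert-written criteria the answer meets."""
    met = sum(1 for criterion in rubric if criterion.lower() in answer.lower())
    return met / len(rubric)

if __name__ == "__main__":
    print(grade_code_submission(sorted))  # True: the code either passes or it doesn't
    rubric = ["discounted cash flow", "terminal value", "weighted average cost of capital"]
    answer = ("We value the company with a discounted cash flow model, "
              "discounting at the weighted average cost of capital.")
    print(grade_open_ended_answer(answer, rubric))  # ~0.67: partial credit, judgment needed
```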
Developing benchmarks isn’t just about knowing how good models are. In AI, as in business, “what gets measured gets done”: good tests often precipitate AI progress on the thing they test. “It’s ultimately the same data type for both evaluation and training,” says Foody. Evaluating performance in games such as Go is straightforward; AI was beating Go masters by 2016. Real-world coding tests have existed since 2023; two years later, the labor statistics for junior programmers look dubious.
“AI got its Ph.D.,” says Foody. “Now it’s starting to enter the job market.”