A new study has found that leading artificial intelligence models are still unable to reliably perform complex white-collar work, nearly two years after predictions that AI would replace many professional jobs.
The research, conducted by training data company Mercor, evaluated how well top AI systems handle real-world tasks drawn from law, consulting and investment banking. The assessment, introduced as the APEX Agents benchmark, shows that even the most advanced models struggle to complete sustained professional tasks accurately.
According to the findings, no AI model tested was able to answer more than a quarter of the questions correctly. In most cases, the systems either produced incorrect responses or failed to provide an answer. Mercor said the results highlight a significant gap between current AI capabilities and the demands of real professional work.
Mercor chief executive Brendan Foody said the main limitation was the models’ difficulty in reasoning across multiple domains at once. He explained that professional work often requires navigating different tools and sources of information, such as internal messaging platforms and document repositories, rather than relying on a single, clearly defined prompt.
The benchmark tasks were created by professionals on Mercor’s expert marketplace, who also determined the criteria for correct answers. Some scenarios required detailed legal and regulatory analysis, including assessments of compliance with European Union privacy laws, reflecting the level of complexity faced by practitioners in the field.
Among the models tested, Google’s Gemini 3 Flash recorded the highest one-shot accuracy rate at 24 percent, followed by GPT 5.2 at 23 percent. Other models, including Opus 4.5, Gemini 3 Pro and GPT 5, each scored about 18 percent.
The APEX Agents benchmark differs from previous evaluations by focusing on sustained, high-value tasks within a narrow set of professions rather than general knowledge across multiple fields. Mercor said this approach provides a clearer picture of whether AI systems are capable of replacing professional roles.
Despite the weak performance, Foody said AI systems have improved significantly over the past year and are likely to continue advancing as benchmarks like APEX Agents become publicly available. He noted that current models perform better than earlier versions but are still far from being able to replace lawyers, bankers or consultants.
The findings suggest that while AI continues to make rapid progress, its ability to automate white-collar knowledge work remains limited.
