Technology groups are rushing to redesign the way they test and evaluate artificial intelligence models as rapidly advancing technology outpaces current benchmarks.
OpenAI, Microsoft, Meta, and Anthropic have all recently announced plans to build AI agents that can autonomously perform tasks on behalf of humans. To do this effectively, the systems must be able to use reasoning and planning to carry out increasingly complex actions.
Companies conduct “evaluations” of AI models using teams of staff and external researchers. These are standardized tests, known as benchmarks, that assess a model’s capabilities and compare the performance of different groups’ systems or earlier versions.
However, recent advances in AI technology have enabled many modern models to achieve near 90% accuracy or better on existing tests, highlighting the need for new benchmarks.
“The pace of the industry is very fast. We’re starting to saturate our ability to measure some of these systems [as an industry], and it’s becoming increasingly difficult to evaluate them,” said Ahmad Al-Dahle, generative AI lead at Meta.
To address this issue, several technology groups, including Meta, OpenAI, and Microsoft, have created their own internal benchmarks and intelligence tests. But this has raised concerns within the industry about whether different companies’ technologies can be compared in the absence of public tests.
“Many of these benchmarks tell us how far we are from automating tasks and jobs, and unless they are made public, it is difficult for companies and society at large to know that,” said Dan Hendrycks, executive director of the Center for AI Safety and an adviser to Elon Musk’s xAI.
Current public benchmarks such as HellaSwag and MMLU use multiple-choice questions to assess common sense and general knowledge across a variety of topics. However, researchers argue that this approach is becoming obsolete and that models need more complex problems.
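For illustration, a multiple-choice benchmark of this kind can be scored with a short loop: present each question with lettered options, ask the model for a letter, and count matches. The sketch below is a minimal Python example; `query_model` is a hypothetical stand-in for whatever interface serves the model, and real evaluation harnesses typically compare answer log-probabilities rather than parsing free text.

```python
from typing import Callable

LETTERS = "ABCD"

def score_multiple_choice(questions: list[dict], query_model: Callable[[str], str]) -> float:
    """Return accuracy over items shaped like {'question', 'choices', 'answer'}."""
    correct = 0
    for item in questions:
        # Format the lettered options and ask the model to pick one.
        options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with a single letter (A, B, C or D)."
        )
        reply = query_model(prompt).strip().upper()
        if reply[:1] == item["answer"]:  # 'answer' holds the correct letter
            correct += 1
    return correct / len(questions)
```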
“We’re at a point where many human-written tests are no longer sufficient barometers of a model’s ability,” said Mark Chen, SVP of Research at OpenAI. “That creates new challenges for us as a research community.”
One of the public benchmarks, SWE-bench Verified, was updated in August to better evaluate autonomous systems based on feedback from companies including OpenAI.
The test uses real software problems taken from the developer platform GitHub: an AI agent is given a code repository and an engineering issue and asked to fix it. Completing the task requires reasoning.
By this metric, OpenAI’s latest model, o1 preview, solved 41.4 percent of the issues, while Anthropic’s Claude 3.5 Sonnet solved 49 percent.
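As an illustration of how such an agent evaluation can work, the sketch below clones a repository at a pinned commit, hands the agent the issue text, applies the patch it returns, and counts the task as solved only if the project’s own tests pass. The `run_agent` callable is a hypothetical stand-in for the system under test, and this is a simplified outline rather than the actual SWE-bench harness, which runs each task in an isolated environment with pinned dependencies.

```python
import subprocess
import tempfile

def evaluate_task(repo_url: str, commit: str, issue_text: str, run_agent) -> bool:
    """Return True if the agent's patch makes the repository's tests pass."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", commit], cwd=workdir, check=True)

        # The agent sees the issue text and the checked-out code, and is
        # assumed to return its fix as a unified diff (hypothetical interface).
        patch = run_agent(issue_text, workdir)
        subprocess.run(["git", "apply", "-"], cwd=workdir,
                       input=patch, text=True, check=True)

        # The task counts as solved only if the test suite passes after patching.
        result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=workdir)
        return result.returncode == 0
```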
“[With agent systems] it’s much more difficult because you have to connect those systems to a lot of additional tools,” said Jared Kaplan, Anthropic’s chief science officer.
“You basically have to create a whole sandbox environment for them to play in. It’s not as simple as just giving them a prompt, seeing what they’ve completed and evaluating it,” he added.
Another important factor in conducting more advanced tests is ensuring that benchmark questions are kept out of the public domain, so that models have not already seen them in their training data.
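One generic way to check for that kind of leakage, sketched below, is to flag benchmark questions that share long word n-grams with documents sampled from the training corpus. This is an illustrative technique only, not a description of any particular company’s internal process.

```python
def ngrams(text: str, n: int = 13) -> set:
    """All word n-grams in a piece of text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions: list[str], corpus_docs: list[str], n: int = 13) -> list[str]:
    """Return the benchmark questions that share an n-gram with the corpus sample."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [q for q in questions if ngrams(q, n) & corpus_grams]
```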
The ability to reason and plan is critical to unlocking the potential of AI agents that can perform tasks and correct themselves across multiple steps and applications.
“We’re discovering new ways to measure these systems, and of course one of them is reasoning, which is an important frontier,” said Ece Kamar, vice president and lab director of AI Frontiers at Microsoft Research.
As a result, Microsoft is working on its own internal benchmarks to assess whether AI models can reason as well as humans, incorporating questions that have not previously appeared in training.
Some, including researchers at Apple, have questioned whether current large language models are genuinely “reasoning” or merely “pattern matching” against the closest similar data seen in their training.
“In the narrower domains that companies care about, these models do reason logically,” said Ruchir Puri, chief scientist at IBM Research. “[The debate] is around the broader concept of human-level reasoning, which would almost put it in the context of artificial general intelligence. Are they really reasoning, or are they just parroting?”
OpenAI measures reasoning through assessments that primarily target math, STEM subjects, and coding tasks.
“Reasoning is a very grand term. Everyone has a different definition and their own interpretation… The line is very blurry, and we try not to get too caught up in the distinction itself; instead, we look at whether it’s driving utility, performance, or capability,” said OpenAI’s Chen.
The need for new benchmarks has also led to efforts by outside organizations.
In September, the start-up Scale AI and Hendrycks announced a project called “Humanity’s Last Exam,” which crowdsources complex questions from experts in various fields that require abstract reasoning to complete.
Another example is FrontierMath, a new benchmark written by expert mathematicians and released this week. On this test, the most advanced models can complete less than 2 percent of the problems.
But without clear agreement on how to measure these capabilities, experts warn, it can be difficult for companies to assess their competitors and for businesses and consumers to understand the market.
“There’s no clear way to say ‘this model is definitively better than that model,’ because once a certain measure becomes a target, it ceases to be a good measure” and models are trained to pass the benchmarks, said Meta’s Al-Dahle.
“This is something we are working on as an industry.”
Additional reporting by Hannah Murphy in San Francisco