OpenAI Announces Benchmarks for AI Life Sciences Research. Its Best Model Failed 63.9% of the Test
This week OpenAI announced LifeSciBench, a 750-task test to measure "whether AI systems can support realistic life science research tasks, not just answer biology questions." But while OpenAI's own GPT-Rosalind model led the rankings, Slashdot reader BrianFagioli notes that "it achieved a pass rate of just 36.1 percent, failing nearly two-thirds of benchmark tasks."
Nerds.xyz points out that means "the best-performing model failed nearly two-thirds of the benchmark's tasks."
Benchmark Performance Breakdown
The benchmark also revealed a familiar weakness. AI systems generally perform better when everything is presented as text. Once they are forced to work with supporting documents, figures, or complex datasets, performance drops noticeably. GPT-Rosalind's pass rate fell from 45.1 percent on text-only tasks to 28.1 percent on tasks involving artifacts or URLs.
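To put the size of that drop in perspective, here is a minimal Python sketch that works out the gap in percentage points and in relative terms, along with the headline failure rate. The three pass rates are the ones quoted in the article; the function and variable names are illustrative only and do not come from OpenAI's materials.

```python
# Pass rates quoted in the article, in percent.
OVERALL_PASS = 36.1
TEXT_ONLY_PASS = 45.1
ARTIFACT_PASS = 28.1  # tasks involving artifacts or URLs


def pass_rate_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


absolute_drop, relative_drop = pass_rate_drop(TEXT_ONLY_PASS, ARTIFACT_PASS)
fail_rate = 100.0 - OVERALL_PASS

print(f"Drop: {absolute_drop:.1f} points ({relative_drop:.1f}% relative)")
print(f"Overall failure rate: {fail_rate:.1f}%")
```

In other words, the 17-point gap means GPT-Rosalind passed a bit more than a third fewer tasks once artifacts or URLs were involved, and the 63.9% in the headline is simply 100 minus the 36.1% pass rate.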
Capabilities and Limitations
To be fair, the benchmark is not intended to suggest AI is useless in research. Quite the opposite. OpenAI found that models are becoming increasingly capable of scientific communication, evidence synthesis, and translating research findings into practical explanations. Those are valuable skills, particularly for researchers drowning in information.
But LifeSciBench serves as a useful reminder that today's AI systems are still far from autonomous scientists. They can help. They can assist. They can sometimes provide surprisingly useful insights. What they cannot reliably do, however, is replace the expertise, judgment, and skepticism that real scientific research requires.
Community Reactions
Open source? Safety issues? (Score:1)
Interestingly, the benchmark questions are not available to the public for "safety and licensing" reasons. I don't see how benchmark questions about science could be a public risk unless the questions are asking how to build atomic weapons or biological warfare agents, and the scoring matrix contains useful or secret information…
Stupid headline and stupid statistics (Score:5, Insightful)
A 36.1% pass rate would be worrying if this were a qualification test of things it needs to be able to do. It's not. This is a benchmark, and it SHOULD have a low pass rate. That's how you know if you're making improvements. We could quite easily create a different benchmark where it passes 99.9%. That wouldn't mean the device being tested is good. It would just mean we have a useless benchmark. I have no opinion on whether AI is good or bad for this use case. I just hate when statistics are used to mislead people.
Re: (Score:3)
I don't understand why they worded the headline like that. Who refers to scoring on an assessment by the percentage you got wrong? If their point is to say the models weren't very good, surely saying they scored 36.1% would have a better audience impact than using the opposite figure.
Re: (Score:2)
Unfortunately (for them) the inverse is also true. If we try to build a system like this to be definitive… Our system will ultimately be inadequate for the breadth of operations researchers will attempt to push through it. Doing last mile research integration is a fairly regular part of my day job… Most researchers don't really care to conform with an application framework, and those who do care to will know the frameworks they're working with well enough to make complex decisions like a developer would.
How does it compare to a human? (Score:5, Interesting)
For example, a new grad with a BS in Biology? Or a mid-career researcher? And with what time limits? Is the amount of work in this benchmark something that would take the human a day? A week? A month? I'd also like to know how quickly a new grad or mid-career researcher can identify which things the AI got right. For example, say it's asked to do a week's worth of work (40 hours) and gets 36% right, or about 14 hours' worth. If it takes the human 10 hours to figure out which parts are right, it's a win. If it takes the human 20 hours, it's not. And how well could the human figure out ahead of time which things the AI would get right? If the human only asks for that subset, the payoff is better. Say the human only asks the AI to do 20% of the tasks (8 hours of work), so grading takes 20% of the time (4 hours instead of 20). Now it's a win again. Without knowing these things, it's like saying "AI sucks at playing golf!" without saying whether it's having trouble with 400-yard drives or just getting the ball into the windmill before the ramp goes up.
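The back-of-the-envelope math in the comment above can be written out as a small Python sketch. The function name `net_hours_saved` is made up for illustration, the 40-hour week is inferred from "20% of the tasks (8 hours of work)", and the third scenario assumes the hand-picked subset comes back entirely correct, which is a simplification the comment only hints at.

```python
def net_hours_saved(task_hours: float, fraction_correct: float, review_hours: float) -> float:
    """Hours of correct work the AI produced minus hours a human spent checking it."""
    return task_hours * fraction_correct - review_hours


# Assumption: a "week's worth of work" is 40 hours (8 hours = 20% of the tasks).
WORK_WEEK_HOURS = 40.0

# Full week handed to the AI, 36% correct, with fast vs. slow human review.
fast_review = net_hours_saved(WORK_WEEK_HOURS, 0.36, review_hours=10)
slow_review = net_hours_saved(WORK_WEEK_HOURS, 0.36, review_hours=20)

# Only 20% of the tasks handed over, so review takes 20% of the 20 hours.
# Assumption: the human picked well and this subset is entirely correct.
selective = net_hours_saved(0.20 * WORK_WEEK_HOURS, 1.0, review_hours=0.20 * 20)

for label, hours in [("fast review", fast_review),
                     ("slow review", slow_review),
                     ("selective", selective)]:
    verdict = "win" if hours > 0 else "loss"
    print(f"{label}: {hours:+.1f} hours ({verdict})")
```

Under these assumptions the 10-hour review nets a few hours, the 20-hour review loses time, and asking only for the tasks the AI is likely to get right turns it back into a win. The result is sensitive to the review cost, which is exactly the number the benchmark does not report.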
Re: (Score:2)
...or if it thinks playing golf involves creative use of a golf cart, and correctly infers that just driving up and dropping the ball in the hole will incur a penalty… so it spends all your expensive premium tokens thinking up new ways to avoid detection… etc.
Sounds about right. (Score:2)
It feels like software engineering with AI could be described with similar numbers.
Models failing a new benchmark is good (Score:2)
People may say "63% fail rate is bad!" but new AI benchmarks are designed so that current models fail. When all models score in the 85-90% range on old benchmarks, the signal/noise ratio of those benchmarks is low. When that happens, you introduce a benchmark that challenges current models so you can observe future models getting better.