Do AI scientists actually do science? New benchmark probes the reasoning behind the results - featuring Dr. Kevin Maik Jablonka, Helmholtz AI Associate

A new preprint by researchers at Friedrich Schiller University Jena, the Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), and the Indian Institute of Technology Delhi challenges a widely held assumption in the AI for science community: that an AI agent capable of producing correct scientific results is, in some meaningful sense, doing science.

The paper, "AI scientists produce results without reasoning scientifically" (arXiv:2604.18805), introduces Corral, a benchmark designed to evaluate not just what LLM-based scientific agents produce, but how they produce it. The work was carried out by Helmholtz AI AssociateKevin Maik Jablonka (Friedrich Schiller University Jena / Helmholtz Institute for Polymers in Energy Applications Jena, HIPOLE Jena), together with co-corresponding author N. M. Anoop Krishnan (IIT Delhi) and co-authors Martiño Ríos-García, Nawaf Alampara, and Ali Asghar Aghajani (all FSU Jena), Chandan Gupta and Sajid Mannan (IIT Delhi, Department of Civil Engineering), and Indrajeet Mandal (IIT Delhi, School of Interdisciplinary Research).

A benchmark grounded in real scientific infrastructure

Corral spans eight experimental domains, ranging from workflow execution to hypothesis-driven inquiry, backed by real scientific infrastructure: a live atomic force microscope, LAMMPS molecular dynamics simulations, wet-lab chemistry and circuit-network simulators, manually curated NMR spectra, and a reaction-rule database for retrosynthesis. Across more than 25,000 agent runs involving three frontier models, the team combined systematic performance analysis with step-by-step epistemological annotation of every agent trace, which can also be explored here.

Two complementary findings emerged. First, on performance: the base language model is the primary determinant of agent success, with reasoning ability accounting for 41.4% of explained variance in a latent factor model, while the agent scaffold (the prompting, tool-routing, and orchestration layer where much community engineering effort is focused) accounts for just 1.5%. The bottleneck is in the model itself, not the wrapper around it.

The process matters, not just the outcome

The more fundamental finding concerns the quality of reasoning. Looking at the epistemological structure of agent behaviour across all configurations, the study finds that evidence gathered during a run goes unused in 68% of traces. Untested claims — hypotheses stated without any experiment designed to test them — appear in 53% of traces overall, rising to 63% in hypothesis-driven domains. Belief revision in response to contradictory evidence occurs in only 26% of runs. Convergent multi-test evidence — where multiple independent lines of inquiry point to the same conclusion — appears in just 7%. Critically, these patterns do not adapt to the demands of the task: agents apply the same reasoning mode whether they are executing a routine computational workflow or conducting open-ended hypothesis-driven inquiry.

The authors find that these breakdowns persist even when agents are given near-complete successful reasoning trajectories as context, and that the resulting unreliability compounds across repeated trials in epistemically demanding domains. Outcome-based evaluation cannot detect these failures, and scaffold engineering cannot repair them. Until scientific reasoning becomes an explicit training target, the knowledge produced by AI agents cannot be fully justified by the process that generated it. 

We asked Dr. Kevin Maik Jablonka, Helmholtz AI Associate, to reflect on the work and its implications.

Corral evaluates the reasoning process behind agent outputs rather than just their correctness. What was the practical challenge of building a benchmark that can do that at scale — across 25,000+ runs?

We spent almost two years on this work with a relatively large team. Much of that effort is hidden behind the paper: it required a good deal of software engineering, but also human verification and ingenuity in the design of the environments. The software is built in a very modular way; every environment has custom-built tools that run in their own software stack. Those environments sometimes contain many new pieces of code that we developed to faithfully mimic operation in a real scientific setting. For instance, the main developer of our qualitative inorganic analysis environment spent weeks just getting the color mixing right, so that we can give the agent feedback on what it would observe if it mixed some chemicals.

The lead authors also spent countless hours going through traces to verify that the different environments are comparable and bug-free, and to validate the analysis approach.

Your paper argues that current AI agents "execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning." What would it take, concretely, to change that — and is it primarily a training problem, an architecture problem, or something else?

I believe we need to be clear about what kind of scientific reasoning or process we expect from “AI scientists”. With the current way we train frontier models, paradigm-shifting (in the Kuhnian sense) discoveries are difficult: every step of training forces the model to make the training data more likely, whereas a paradigm-shifting discovery would require questioning the training data.

If we think about epistemic rigor, there is hope that we can improve it with better post-training (rewarding not only the outcome, but also the process). But it is also in the nature of things that, with this flexible way of building AI scientists, we won’t have guarantees. In many cases this might be fine – but in some cases, where we really want process guarantees, we must build different systems that will most likely have some symbolic and formally verifiable components.

As AI agents become more deeply embedded in scientific workflows, understanding how they reason will be essential. Corral offers a framework for asking that question more rigorously and for holding future systems to a higher epistemic standard.

The preprint is available at arxiv.org/abs/2604.18805.