GENIUS: an AI co-pilot for computational materials science

Agentic AI is one of the most talked-about ideas in science right now. The concept is simple but ambitious: AI systems that do not just answer questions, but plan, act, check their own work, and recover from mistakes. A new open-access paper from a team funded through the Helmholtz AI Project Call, based at KIT's Institute of Nanotechnology and Hereon Research Center, puts that idea to work in computational materials science.

"GENIUS: an agentic AI framework for autonomous design and execution of simulation protocols" by Mohammad Soleymanibrojeni, Roland Aydin, Diego Guedes-Sobrinho, Alexandre C. Dias, Maurício J. Piotrowski, Wolfgang Wenzel, and Celso Ricardo Caldeira Rêgo is published open access in Communications Materials (Nature Publishing Group, vol. 7, article 115, 2026, https://www.nature.com/articles/s43246-026-01167-0).

A surprisingly stubborn bottleneck

Atomistic simulations model how electrons behave inside a material, predict its properties, and guide its design. They have transformed modern materials science. Tools like Quantum ESPRESSO make it possible to run density functional theory (DFT) calculations on computer clusters worldwide. The problem is that setting up a calculation correctly requires deep specialist knowledge: the right input parameters, the right convergence settings, and the ability to diagnose and fix cryptic errors. It is a skill that takes years to develop, and it creates a real bottleneck for Integrated Computational Materials Engineering (ICME), where domain scientists who are not computation experts often need exactly these tools. This is the problem GENIUS was built to solve.

What GENIUS does

GENIUS is an agentic AI workflow that takes a plain-language description of what you want to simulate and turns it into a working, validated Quantum ESPRESSO input file, without human intervention.

It does this by combining three components:

  • A Quantum ESPRESSO knowledge graph: a structured, machine-readable encoding of expert simulation knowledge, going far beyond what any general-purpose language model has memorized.
  • A tiered hierarchy of large language models that interprets the researcher’s free-form prompt and generates simulation protocols, drawing on the knowledge graph for accuracy.
  • An automated finite-state machine for error recovery that monitors execution, detects failures, diagnoses the cause, and autonomously retries with corrections in a loop.

This is what makes GENIUS agentic: it does not just generate and hand off, it observes the outcome, reasons about what went wrong, and acts again until the protocol is valid or until the defined recovery limits are reached.
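
The observe, diagnose, and retry cycle described above can be pictured as a small state machine. The following Python sketch is illustrative only: the state names, the run_recovery_loop function, the max_attempts limit, and the toy generate, validate, and repair callables are assumptions made for this example and are not taken from the GENIUS code. The toy "input" is plain text, not a valid Quantum ESPRESSO file.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class State(Enum):
    GENERATE = auto()
    VALIDATE = auto()
    DIAGNOSE = auto()
    DONE = auto()
    GIVE_UP = auto()


def run_recovery_loop(
    generate: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    repair: Callable[[str, str], str],
    prompt: str,
    max_attempts: int = 5,
) -> Tuple[State, str, int]:
    """Cycle through generate/validate/repair until success or the attempt limit.

    `validate` returns None on success or an error message on failure.
    Returns the final state, the last candidate, and the number of validations.
    """
    state = State.GENERATE
    candidate = ""
    error: Optional[str] = None
    attempts = 0
    while state not in (State.DONE, State.GIVE_UP):
        if state is State.GENERATE:
            candidate = generate(prompt)
            state = State.VALIDATE
        elif state is State.VALIDATE:
            attempts += 1
            error = validate(candidate)
            if error is None:
                state = State.DONE
            elif attempts >= max_attempts:
                state = State.GIVE_UP  # recovery limit reached
            else:
                state = State.DIAGNOSE
        elif state is State.DIAGNOSE:
            candidate = repair(candidate, error or "")
            state = State.VALIDATE
    return state, candidate, attempts


# Toy stand-ins (hypothetical): the first draft lacks a cutoff, one repair adds it.
def toy_generate(prompt: str) -> str:
    return "calculation = 'scf'"


def toy_validate(text: str) -> Optional[str]:
    return None if "ecutwfc" in text else "missing ecutwfc"


def toy_repair(text: str, error: str) -> str:
    return text + "\necutwfc = 40.0"


final_state, final_input, n_attempts = run_recovery_loop(
    toy_generate, toy_validate, toy_repair, "SCF for bulk silicon"
)
print(final_state.name, n_attempts)
```

The explicit GIVE_UP state is the point of the design: the loop always terminates, either with a validated candidate or after a fixed number of attempts.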

How it performed

The team benchmarked GENIUS across 295 diverse prompts, a deliberately varied test set spanning different materials, simulation methods, and levels of prompt complexity. The results:

  • ~80% of prompts (236 of 295) produced input files that passed early execution validation;
  • 76.3% of those successful cases were recovered autonomously by the error-handling loop, while the rest succeeded without it;
  • Higher inference and computational efficiency than LLM-only baselines;
  • Hallucinations in protocol generation were virtually eliminated.

The attempt-wise success rate decays exponentially toward a 7% floor, so additional retries bring diminishing returns. The recovery loop is therefore bounded by defined limits rather than looping indefinitely on intractable cases.
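
One way to read "decays exponentially toward a floor" is as a curve of the form rate(n) = floor + (first_rate - floor) * exp(-decay * (n - 1)), where n is the attempt number. The Python sketch below evaluates such a curve. Only the 7% floor comes from the reported results; the functional form, the first-attempt rate, and the decay constant are placeholder assumptions chosen for illustration, not values from the paper.

```python
import math

FLOOR = 0.07  # floor reported in the text


def attempt_success_rate(attempt: int, first_rate: float, decay: float) -> float:
    """Success rate on a 1-based attempt under an assumed exponential decay to FLOOR."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return FLOOR + (first_rate - FLOOR) * math.exp(-decay * (attempt - 1))


# Placeholder parameters, NOT values from the paper.
rates = [attempt_success_rate(n, first_rate=0.5, decay=0.8) for n in range(1, 9)]
for n, rate in enumerate(rates, start=1):
    print(f"attempt {n}: {rate:.3f}")
```

Under any curve of this shape, each further retry adds less than the one before, which is one rationale for capping the number of recovery attempts.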

Where it goes next

The current version of GENIUS is built around Quantum ESPRESSO, but that is set to change. The goal is a general AI-assisted infrastructure for computational materials discovery, spanning DFT, molecular dynamics, and kinetic Monte Carlo, and built to work across academia and industry alike.

We spoke to the principal investigator of the project, Celso Ricardo Caldeira Rêgo, corresponding author and researcher at the Institute of Nanotechnology at KIT, about what GENIUS means for the future of computational materials science and where challenges still lie.

The benchmark results show GENIUS recovering 76.3% of failed cases autonomously. What does the remaining ~24% tell you? Are there patterns in the cases the system couldn’t fix, and what do they reveal about the limits of agentic AI in scientific simulation right now?

The first important clarification is that the 76.3% does not mean that 24% of all failed cases remained unresolved. In the GENIUS benchmark, we had 236 successful cases. Among these successful cases, 76.3% were recovered autonomously through the error-handling loop, while the remaining cases did not require the recovery loop and were completed without it.

The truly unsuccessful part corresponds to 59 prompts. These cases are scientifically very informative because they show where agentic AI still reaches its current limits. The failures were not mainly simple syntax errors; those are usually recoverable. They were more likely to be connected to cases requiring deeper domain reasoning, greater model capacity, or more complete structured knowledge. In the published benchmark, we deliberately used small and medium-sized models, and the smart knowledge graph was not yet fully curated. Therefore, the unresolved cases likely reflect limitations in the model's reasoning capacity and knowledge-graph completeness rather than a fundamental limitation of the agentic workflow concept.
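
The counts quoted in this answer can be cross-checked with a few lines of arithmetic. In the Python sketch below, the totals (295 prompts, 236 successful cases, 76.3%) come from the text; the absolute number of recovered cases is a derived estimate obtained by rounding, not a figure reported in the article.

```python
TOTAL_PROMPTS = 295
SUCCESSFUL = 236
RECOVERED_SHARE = 0.763  # share of successful cases that went through the recovery loop

unsuccessful = TOTAL_PROMPTS - SUCCESSFUL
success_rate = SUCCESSFUL / TOTAL_PROMPTS
recovered = round(SUCCESSFUL * RECOVERED_SHARE)  # derived estimate, not a reported count
direct = SUCCESSFUL - recovered

print(f"unsuccessful prompts: {unsuccessful}")
print(f"overall success rate: {success_rate:.1%}")
print(f"recovered via loop (approx.): {recovered}")
print(f"successful without recovery (approx.): {direct}")
```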

The key message is that agentic AI is already powerful when errors are explicit, executable, and linked to a repair strategy. However, scientific simulations also contain implicit expert knowledge: physical assumptions, convergence choices, code-specific conventions, and hidden dependencies among parameters. These are harder to recover automatically unless they are represented in the knowledge graph and tested by reliable validators.

More recently, we improved the automatic generation of the smart knowledge graph and strengthened the error-handling loop. We also started extending the framework to other codes, including VASP for DFT and LAMMPS for molecular dynamics. In the first benchmark of this improved approach, we achieved 100% accuracy, suggesting that many of the original limitations are addressable through improved knowledge structures, stronger validation, and more mature recovery mechanisms.

Scaling GENIUS beyond Quantum ESPRESSO to molecular dynamics and kinetic Monte Carlo sounds straightforward on paper. Is that so in practice? What does it actually take to extend an agentic framework to a new simulation code, and what is the hardest part of that transfer?

Conceptually, yes; practically, no. The general GENIUS strategy is transferable: combine a smart knowledge graph, an LLM-based agent, executable validation, and a self-consistent error-handling loop. However, each simulation code has its own input language, assumptions, error messages, convergence behavior, and scientific culture. Extending GENIUS to a new code is therefore not just a matter of changing the executable.

This is exactly what we are now doing for VASP at the DFT level and LAMMPS at the molecular-dynamics level. For each code, we need to build a code-specific layer: documentation must be converted into a structured smart knowledge graph; input syntax and parameter dependencies must be captured; typical runtime errors must be mapped to repair actions; and validators must check not only whether a job runs, but whether the setup is scientifically meaningful.
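
To make the idea of a code-specific layer more concrete, the Python sketch below bundles the ingredients listed above into one small structure: parameter dependencies, a mapping from error messages to repair actions, and a dependency check. Everything in it is hypothetical: the CodeLayer class, the parameter names (smearing_width, smearing_type, max_steps), and the error text are invented for illustration and do not correspond to VASP, LAMMPS, or the GENIUS implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, Any]
Repair = Callable[[Params], Params]


@dataclass
class CodeLayer:
    """Hypothetical per-code layer: dependency rules and error-to-repair mapping."""
    name: str
    # parameter -> parameters it requires (a tiny stand-in for a knowledge graph)
    requires: Dict[str, List[str]] = field(default_factory=dict)
    # substring of an error message -> repair action on the parameter set
    repairs: Dict[str, Repair] = field(default_factory=dict)

    def missing_dependencies(self, params: Params) -> List[str]:
        """List every declared dependency that the parameter set violates."""
        problems: List[str] = []
        for key, needed in self.requires.items():
            if key in params:
                problems += [f"{key} requires {dep}" for dep in needed if dep not in params]
        return problems

    def repair_for(self, error_message: str) -> Optional[Repair]:
        """Return the first repair whose pattern occurs in the error message."""
        for pattern, action in self.repairs.items():
            if pattern in error_message:
                return action
        return None


def raise_step_limit(params: Params) -> Params:
    """Toy repair: double the iteration limit without mutating the input."""
    updated = dict(params)
    updated["max_steps"] = int(updated.get("max_steps", 100)) * 2
    return updated


toy_layer = CodeLayer(
    name="toy-code",
    requires={"smearing_width": ["smearing_type"]},
    repairs={"not converged": raise_step_limit},
)

params = {"smearing_width": 0.01, "max_steps": 100}
issues = toy_layer.missing_dependencies(params)
action = toy_layer.repair_for("SCF not converged after 100 steps")
repaired = action(params) if action else params
print(issues)
print(repaired["max_steps"])
```

Keeping the rules as data rather than hard-coded logic means a new simulation code would need a new CodeLayer instance, not a new control loop. A check of this kind only covers syntax-level dependencies; judging whether a setup is scientifically meaningful would need validators well beyond this sketch.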

We are now also beginning to extend this strategy to the 4C (https://www.4c-multiphysics.org/) multiphysics code, which is particularly interesting because it moves GENIUS toward continuum-scale, coupled multiphysics simulations. 4C is a research code for multiphysics simulations, including solid mechanics, fluid mechanics, scalar transport, and chemical reactions.

The hardest part of this transfer is not software integration itself. The hardest part is translating expert, code-specific simulation knowledge into a structured, executable, and self-correcting representation. In other words, the bottleneck is scientific formalization: making implicit expert choices explicit enough for an AI agent to reason over, test, and repair them reliably.

Made possible by the Helmholtz AI Project Call

GENIUS is a direct result of the Helmholtz AI Project Call funding. The Project Call is the core mechanism through which Helmholtz AI supports ambitious, application-driven AI-for-science research across the Association. It enables interdisciplinary teams to tackle problems at the intersection of AI and scientific domains, where the impact is real and the technical challenges are significant.

The Helmholtz AI Project Call is held annually. For information on future rounds, visit our project call page.