Welcome to Helmholtz AI, Leo Schwinn!

We are delighted to welcome Leo Schwinn as a new Principal Investigator at Helmholtz AI. Leo joined Helmholtz Munich in May 2026, where he leads the Schwinn group working on secure machine learning. His research examines how machine learning models fail under adversarial pressure, from subtle input manipulations to safety vulnerabilities in large language models, and what it takes to build systems that are genuinely reliable. We sat down with Leo to learn more about his work, his path to Helmholtz AI, and what drives him forward.

Since when and in what role are you at Helmholtz AI, and what research goals are you pursuing?

I joined Helmholtz Munich in May 2026 as a Principal Investigator at Helmholtz AI, where I lead the Secure Machine Learning group (SEML). We work on the alignment, security, and privacy of machine learning systems, both by studying how these systems fail and by developing methods to make them more reliable.

Agentic AI is where these questions become most pressing. Models are starting to operate over long horizons with only minor or no human supervision. Here, the failure surface grows in ways that current evaluation methods do not capture well. Understanding how we can make agentic AI more secure and making that knowledge actionable are the directions I want to push the group in.

Beyond fundamental aspects of ML security, there are several collaborations I am excited to develop at Helmholtz Munich. I am interested in studying reliability in scientific ML more broadly. Evaluating protein generation methods, for example, often depends on metrics that cannot capture the full complexity of the domain. Optimizing for these metrics can produce models that score well but are flawed in ways the benchmark does not see. We plan to look into reward hacking in this setting. This concern grows substantially once AI agents and autoresearch enter the picture. If a benchmark can be gamed, an autonomous system that can iterate on it cheaply will eventually find the gap, whether or not it was designed to. The result looks like progress but is not. Part of what we want to understand is how serious this failure mode becomes in practice, and then work on metrics and evaluation scenarios that are harder to hack by design. There is also a related, more classical concern: agents themselves could be misaligned and produce results that look good but are not, which is something I want to study as well.

What brought you to Helmholtz AI?

Two things mainly. The first is the combination of fundamental ML research and concrete scientific applications under one roof. A lot of ML security work is done in isolation from the domains it eventually needs to serve, and Helmholtz Munich is one of the few places where I can sit close to people working on health, biology, and other natural sciences while still doing core methods research. That proximity makes it much easier to ask which security and reliability problems actually matter in practice.

The second is Helmholtz AI itself. The setup gives PIs room to build a group around a research vision, with the compute and infrastructure to follow through, and connects naturally to the broader Munich ML community through TUM and MCML. That mix of independence, scale, and good neighbours is rare.

Your recent papers have been quite critical of how we currently evaluate the robustness and safety of LLMs, arguing that our benchmarks and judges are less reliable than the field assumes. Can you explain the core problem?

The core difficulty is that LLM outputs are open-ended natural language, which makes it hard to decide automatically whether a response is actually harmful or merely looks harmful while being meaningless or even benign. To solve this, the field has converged on using another LLM as a judge to label whether an attack succeeded. Judges are trained and validated on a fixed distribution, typically one target model, one attack type, and one set of harmful behaviors, and they show high agreement with humans in that setting.

In practice, judges are then used in very different settings. Different target models generate in different styles, new attacks distort outputs in unexpected ways, and the harmful behaviors themselves vary in how ambiguous they are. We ran a large-scale human labeling study across these realistic configurations and found that the high human-judge agreement reported in prior work does not hold up. Judge performance often degrades to near random chance. The problem is so severe that some published rankings of which attack is "best" are misleading: the attack is not actually more effective, it is just better at exploiting judge weaknesses.

To give the community some actionable input on how to improve reliability in these evaluations, we used our human labels to release two datasets. ReliableBench is a curated set of behaviors that were consistently judgeable across configurations, so people can get more reliable robustness estimates. JudgeStressTest is the opposite: the answers on which all automated judges failed, with ground-truth human labels, so the community can measure progress on building better judges.

What would you like to achieve in your scientific field? What does success look like in five or ten years?

Adversarial robustness has been a research problem for over a decade, and we still cannot fully solve it even for very simple settings. For LLMs and agentic systems, I do not expect a clean technical solution on the timescale we need one. Capabilities are moving faster than robustness, and we will almost certainly have models deployed in roles where robustness matters before we know how to make them reliably robust. So in the next five to ten years, what I would like to see, and contribute to, are directions for dealing with this gap that go beyond pure machine learning, for example systems engineering approaches and other angles that are only starting to get attention. That means safety approaches that do not require solving adversarial robustness from scratch, and a more honest conversation about what we can and cannot reasonably promise.

A connected goal is the gap between AI regulation and the actual trajectory of the field. There is a growing mismatch between what regulators expect in terms of robustness and what the technical work can currently deliver. I would like to help bring these closer together over the next decade, on both sides.

The other direction I want to take the work in is beyond LLMs. A lot of what we learn at the frontier of LLM research, about reward hacking, autoresearch, and recursive self-improvement, transfers naturally to other domains where reliability matters, like AI for science. If we end up with agentic systems running their own experiments, the question of which metrics are hard to game becomes central.

What do you enjoy outside of research?

Most of my time outside of work goes to my two children and my wife, which is the best possible reason to be away from a laptop. When I do get a bit of space for myself, I try to go climbing, mostly indoors, and I play guitar, though honestly less and less often than I would like. I also enjoy reading, which these days has mostly turned into listening to audiobooks in the car or on the bike.

Find out more about Leo and the SEML group here →