Slo Science, Fast Progress: How Sebastian Lobentanzer is making biomedical AI accessible to everyone

© Andreas Heddergott / TUM

Sebastian Lobentanzer, Helmholtz AI Principal Investigator, is a pharmacologist turned research software engineer whose work sits at one of the most consequential intersections in modern science: making artificial intelligence genuinely useful and genuinely trustworthy for biomedical researchers. 

As a Principal Investigator at the Institute of Computational Biology at Helmholtz Munich and Head of Computational Biology at the German Center for Diabetes Research (DZD), he leads the Lobentanzer Lab, whose tools are already reshaping how scientists interact with complex biological data. His open-source frameworks (BioCypher, for automating biomedical knowledge representation, and BioChatter, for connecting large language models to real research workflows) have attracted funding from the DFG and Open Targets and built a growing community.

In an era where AI progress is measured in weeks, Sebastian is making the case for something more deliberate: systems that are explainable, robust, and built to last. What does that take?

The beginning of the story

You started as a pharmacologist, completed your PhD in pharmacology and toxicology, then pivoted toward research software engineering during your postdoc at Heidelberg University Hospital, and ended up building entirely new infrastructure for the field. At what point did you realize the tools you needed simply didn't exist yet, and that you'd have to build them yourself?

This is an excellent question, not just for my own history, but also because this situation (asking whether something suitable exists and, if not, building it) is such a common one in research. In my case, it happened very early in the postdoc. I had built a knowledge graph in my PhD to study microRNA regulation, but it was completely unusable by others. I was not a software engineer! Starting in Heidelberg, we had the plan to migrate an established protein signalling database from my PI’s prior work (Saez Lab) to a graph structure. I had prior experience, and it seemed useful to join forces. During the process, we talked to others with similar problems (how to change the representation of biological knowledge easily). We realised that this was a common problem that deserved a general solution; in the spirit of open source, we collaborated on a framework to enable any sort of knowledge representation transfer with relatively simple configuration. That framework later became BioCypher.

It was simply serendipitous that this simple configuration also proved quite practical in the early days of LLMs, when they had input windows of around 1,000 tokens. I realised that the configuration was a nice, compact way to “explain to the LLM” what the knowledge graph does, which improved retrieval considerably. That fit well with the spirit of prompt engineering at the time. Of course, in AI terms, this was ages ago (2023), and agentic solutions can now leverage more powerful and even more generic approaches to the retrieval problem.
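The idea can be sketched in a few lines. The schema entries and prompt wording below are hypothetical illustrations, not BioCypher's actual configuration format: a compact description of the graph's node and edge types is prepended to the user's question, so the model knows what exists in the graph before attempting retrieval.

```python
# Hypothetical, minimal illustration of using a compact schema
# description as LLM context for knowledge-graph retrieval.
schema = {
    "Protein": {"represented_as": "node", "properties": ["name", "sequence"]},
    "Disease": {"represented_as": "node", "properties": ["name"]},
    "ProteinToDiseaseAssociation": {
        "represented_as": "edge",
        "source": "Protein",
        "target": "Disease",
    },
}

def schema_prompt(schema):
    """Serialise the schema into a few lines of context for the model."""
    lines = ["The knowledge graph contains:"]
    for name, conf in schema.items():
        if conf["represented_as"] == "node":
            lines.append(f"- node type {name} with properties {conf['properties']}")
        else:
            lines.append(f"- edge type {name} from {conf['source']} to {conf['target']}")
    return "\n".join(lines)

# The combined prompt stays well under even a 1,000-token window.
prompt = schema_prompt(schema) + "\n\nQuestion: Which proteins are associated with diabetes?"
print(prompt)
```

Because the schema is already machine-readable configuration, this context comes for free; nothing graph-specific has to be hand-written for the model.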

Making AI accessible - really!

"Accessible AI" is a phrase that gets used a lot, but your lab takes it seriously as a design constraint. What does it actually mean in practice for a wet-lab biologist who has never written a line of code - and how does that shape the way you build things like BioChatter and BioCypher from the ground up?

There are many ways to answer this question, because accessibility is such a flexible term. But, as the fan of the history of science that I am, I will answer it from an epistemic perspective. What does it mean for a method, experiment, or question design to be accessible? Mainly, that the audience can comprehend it, and hopefully even interact with it. Simply using the high performance of modern LLMs to let someone use a computational tool or database that they couldn’t use before is no longer a large technical challenge. However, what they do with it - which questions they ask, and which conclusions they draw from what is returned - is anything but trivial.

Before LLMs, we had only one kind of intelligent (in a common-sense meaning of the word intelligence) participant in any conversation: the human. This meant that we never questioned whether it was a human asking the questions, which implied that the human understood the topic (by having studied the domain). Now, humans can ask questions for which they may not be able to verify the validity of any given answer. This verifiability problem is particularly relevant since LLMs have well-known biases that they learned from their training data, and are prone to hallucinating (telling you what you want to hear). That has profound implications for the scientific process, as we are preparing to hand much of it over to these agents. In that sense, accessibility for me means: how can we make these systems so transparent that they don’t impair our ability, as a society, to make the best decisions based on what science tells us?

When NOT to use AI

Your lab explicitly values knowing when not to use a large, complex model. In a field where bigger and more powerful is usually celebrated, that's a countercultural position. Can you give a real example where you deliberately reached for a simpler alternative and walk us through how you made that call?

There are many situations where this is relevant. Currently, we focus a lot on the topic of “agentic AI”; there is much publicity around the approach, so many people, including collaborators, know of it and approach us about it. We frequently get requests to consider a problem that has so far been solved by humans, like automatically determining the cut-off between two groups in a biomarker measurement. Because these workflows are often done manually by humans with domain knowledge, but often without a machine learning or statistics background, it is assumed that the task requires “human-level” intelligence, which people currently associate with frontier LLMs. However, it often turns out that a very parameter-efficient statistical method (like a naive Bayesian approach or a linear regression) is more appropriate, not just because of higher efficiency, but because it performs better, has fewer biases, and is (paradoxically) more robust than high-parameter models when generalising to new data points.

How to make the call is simple: if you have domain knowledge, talk to someone with statistics and machine learning experience about your problem. At Helmholtz, we have dedicated consultants for statistics and for AI. Train some models on your data (statistical models are very quick to train) and see how they perform. Only if a simple model does not satisfy your needs should you scale up to more parameters. Decisions should be made based on data.
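As a toy illustration of that advice (all measurements and group names below are made up), the two-group biomarker cut-off can often come from a parameter-efficient model: a one-feature Gaussian classifier with equal priors and shared variance reduces to placing the boundary at the midpoint of the two group means - no LLM required.

```python
from statistics import mean

# Hypothetical biomarker measurements for two labelled groups
healthy = [1.1, 1.4, 0.9, 1.2, 1.3]
diseased = [2.6, 2.9, 3.1, 2.4, 2.8]

def gaussian_cutoff(group_a, group_b):
    """Decision boundary of a one-feature Gaussian classifier with
    equal priors and shared variance: the midpoint of the two means."""
    return (mean(group_a) + mean(group_b)) / 2

cutoff = gaussian_cutoff(healthy, diseased)  # midpoint of 1.18 and 2.76

def classify(value, cutoff):
    """Assign a new measurement to a group using the learned cut-off."""
    return "diseased" if value >= cutoff else "healthy"

print(round(cutoff, 2))
print(classify(1.5, cutoff))
print(classify(2.7, cutoff))
```

The whole "model" has two fitted quantities (the group means), so it is trivial to inspect, fast to retrain, and its failure modes are easy to reason about - exactly the properties that get lost when the same decision is delegated to a high-parameter model.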

This social problem (lack of communication between domain and machine learning experts) is so rampant that we recently published two preprints on the matter. One is a review of agentic applications and how they can overpromise performance; the other proposes a pragmatic canvas to help the communication between users and developers (domain and machine learning experts) take place.

The Causality Problem 

Your work, published in Molecular Systems Biology, examines how current AI models handle (and often mishandle) causal reasoning in the life sciences. The concern is that models trained on correlational data can confidently reproduce associations that aren't actually causal, which in biomedicine can be genuinely dangerous. How serious is this problem in practice, and what does it look like when it goes wrong in a real research context?

It is a very serious scientific issue, with implications in the real world as well, though maybe not so much in medicine, which is (fortunately) highly regulated. Everybody who has talked to an LLM knows that they are incredibly biased, and the correlations that they take from their large training datasets often lead to hilarious mistakes that no human would ever make. Often, these are entertaining, but if we apply the models to any actually impactful decision, the ethics become very problematic. In hiring, for instance, it is quite obvious why we would not want a potentially biased LLM to make decisions, replicating the biases against social groups that it learned from its training set.

In medicine, we don’t have any approved LLM-based medical devices, as opposed to the more than 800 FDA-approved image-based AI systems, for good reason. But biases are present in computer vision systems too; they reflect the systemic biases of our real world, as they are trained on data from it. For instance, it was shown that a chest X-ray analysis algorithm is able to infer the social status of the patient from an image that should be used only for the diagnosis of disease. Corporate solutions for voice analysis are deployed in hiring interviews to warn the employer of potential mental health conditions of the applicant, violating privacy ethics. This illustrates the potential for misuse of solutions originally designed to improve diagnosis and health. Misuse, as well as honest mistakes, is often facilitated by the fact that the models learn correlative associations instead of causal or even mechanistic insights.

In the molecular context, which we focus on in the mentioned perspective, the problems are not as immediate. This is mainly due to the lower maturity of the algorithms. They are far from translation and approval in most cases, so they do not yet have the opportunity to negatively impact a patient’s trajectory. However, a large field, including global trillion-dollar companies, is tirelessly working on solutions, so it is not entirely unrealistic that we might see some of those products enter clinical translation.

As a pharmacologist, I have to bring up the other side: all our medicines, since we created the barrier of clinical studies in response to the Contergan (thalidomide) incident, have been based on statistical (that is, correlative) information. For most medications, particularly historically, we did not know what they causally (more specifically, mechanistically) did in the body. For many, we still don’t know (for instance, anything related to psychology and psychiatry). Randomised clinical trials ask for another sort of causality entirely: by randomising participants, they hope to eliminate confounding factors in the experimental setup, isolating the effect of the drug; is the drug more effective at treating the disease than placebo (or the gold standard)? Much more feasible than approving AI algorithms in a medical context is using AI tools to improve the process of drug discovery and development. Once that process has identified a candidate, it needs to go through the same clinical study process as all other medications. The causality is established after using AI, not inside the AI algorithm. The randomised trial acts as a protective barrier for the patient.

The five-year picture

You embrace "slo science": the idea that in a field moving this fast, being deliberate and building things that last is itself a radical act. But looking forward, if labs like yours succeed over the next five years, what does the daily working life of a biomedical researcher actually look like - and what's the one change you think will surprise people most?

If the philosophy takes hold in the right places and succeeds at multiple levels of society, I expect a transformative change to the way we approach research and technological development. The European Union is currently scrambling to find a strategy to deal with global imbalances in AI development and applications. The classical European framework is not able to deal with the rapid pace of developments and market introductions by global technology leaders, and has also not been fruitful for local developments that are competitive with the current “frontier AI.” Wanting to be deliberate and building things that last is more aligned with European values and the processes and potential we have as a society, but it also requires much more foresight and expertise than competing approaches such as venture capital, where you are lucky if one out of ten companies “makes it.” If Europe manages to map its values and identity onto a slow and deliberate application of our resources and potential instead of trying to replicate outside successes in a very different environment, I think we can make a lot of progress in five years. This will require some courage; hard decisions need to be made.

I have no doubt that the research landscape will remain highly dynamic and responsive to trends; I myself have leveraged serendipity (for instance, in using the suitability of BioCypher configurations for LLMs). Fast ideas, fast implementations, and fast papers are not a problem per se. What is more problematic is the immediate abandonment of many such developments once the main goal (a peer-reviewed publication) has been achieved. This is a systemic problem in academia and will only slowly go away, if at all. Five years will not make a dent in this particular issue. However, I hope that I can convince people to consider the responsibility that comes with their decision to pursue a particular research goal or solution. If you think a new tool is needed, and go to the lengths of implementing it, writing about it, and advertising it to all your colleagues, it is a huge waste of time to then not maintain it after publication.

In the field of AI and software, accessibility is closely related to sustainability. Write maintainable code, use good practice, collaborate and contribute instead of reinventing, document well, apply for sustainability funding, and generally treat your motivation to maintain a piece of software as a factor in the early planning and decision-making process (this is often directed more at a PI than a PhD candidate). In my editorial work, for instance in the Journal of Cheminformatics or the open source organisation pyOpenSci, we put strong emphasis on these qualities of submitted software. I think the field will further improve in this direction, but sometimes it is hard to be patient.

I am not sure what will surprise people most, as this is likely to be very individual. But maybe the old saying “The more things change, the more they stay the same” applies. We have a whole new world of tools at our disposal, but human intelligence is very adaptive and, after a latency period, I have no doubt that we will be able to leverage these new technologies in the same way that we are now familiar with the internet. For a biomedical researcher, this means they will be able to turn to helpful assistants for many things that are currently very time-consuming or even impossible for a single person to achieve. Ideally, this will respect the social component of research.

Put simply, we don’t work to replace the bioinformatician or statistician; we work to create autonomous systems that (reliably) deal with the boring parts of science so the scientists can focus on the work they like doing, including creativity and the communication with colleagues from other domains.

The Lobentanzer Lab is currently welcoming applications for the position “Agentic AI Research Engineer” - find out more about the listing here.