
Helmholtz AI Conference 2025: a World Café Wrap
The AI World Café, held on June 3, 2025, at the Helmholtz AI Conference, brought together a variety of researchers and professionals for a 90-minute session that provided a solid platform for open idea sharing and collaboration.
As in last year's edition, participants formed small groups of up to 10 and rotated through three 20-minute rounds of vibrant discussion, covering both predefined and spontaneous AI-related topics. We would like to share our insights with you: enjoy our World Café report below!
-
At HAICON25’s AI World Café, I hosted a discussion table on the theme “Beyond Large Language Models (LLMs)” to explore how the next wave of AI systems could move past today’s general-purpose LLMs and address their limitations. My introductory statement focused on two ideas: current LLM-based approaches face technical and sustainability bottlenecks, and richer forms of Agentic AI and integrations between generative and predictive AI could enable more powerful and sustainable future systems. The debate split into the following threads:
1. Agentic AI: Beyond Passive LLMs
We (participants) discussed the promise of moving from passive, vanilla LLM interactions, where users mostly prompt models, toward Agentic AI systems, where models possess enhanced domain abilities and, crucially, act as interactive partners. In this mode, models actively ask questions, seek missing information from humans, and iterate toward optimal solutions in collaboration with users. There was broad agreement that such approaches can unlock qualitatively new capabilities but require a profound rethinking of interface design, trust calibration, and multi-modal interaction.
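To make the interaction pattern concrete, here is a minimal, hypothetical sketch of such a loop; the callables ask_llm and ask_user are placeholders for a model endpoint and a human in the loop, and are not a reference to any specific framework discussed at the table.

```python
from typing import Callable

def agentic_session(task: str,
                    ask_llm: Callable[[str], str],
                    ask_user: Callable[[str], str] = input,
                    max_turns: int = 5) -> str:
    """Iterate with the user instead of answering a single prompt passively."""
    context = [f"Task: {task}"]
    for _ in range(max_turns):
        # The model first checks whether information is missing.
        question = ask_llm(
            "What information is still missing to solve the task well? "
            "Reply NONE if you can proceed.\n" + "\n".join(context)
        )
        if question.strip().upper() == "NONE":
            break
        # Instead of guessing, the agent asks the human for the missing piece.
        context.append(f"Q: {question}\nA: {ask_user(question)}")
    # Only then does it propose (and potentially iterate on) a solution.
    return ask_llm("Propose a solution.\n" + "\n".join(context))
```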
2. The Unsustainable Nature of Current LLM Scaling
The other central discussion thread focused on the unsustainability of current LLM scaling trends: increasing energy consumption, dependence on massive hardware infrastructures, scarcity of high-quality data, and diminishing returns in added model value. We (participants) see that today’s AI race in scaling LLMs may soon hit a wall, technically, economically, and ecologically. Some controversy arose here: while all agreed on the current limitations, opinions varied on whether future breakthroughs (e.g., more efficient training algorithms or new architectures) could fundamentally change this trajectory.
3. Future Integrations: Generative + Predictive AI
In one round, a fascinating discussion emerged around integrating generative AI with predictive AI, creating systems that can reason about and act upon physical processes. Participants agreed this would be a game-changer for robotics, automated laboratories, and scientific workflows. Such integrations would foster a new phase of AI where the boundary between basic and applied science becomes fluid, enabling closed-loop scientific discovery and accelerating innovation across domains.
Follow-up Potential
There was a clear interest in a more profound exploration of Agentic AI design patterns and generative + predictive AI integration. A potential follow-up could involve creating cross-disciplinary working groups that bring together experts from physics, chemistry, LLMs, robotics, predictive modeling, and human-computer interaction to co-design next-generation systems. Additionally, addressing sustainability concerns through new benchmarks, shared metrics, and best practices was highlighted as an urgent field-wide need.
-
Introduction
Agentic systems are a topic of great interest, but their scope and capabilities are not well-defined. This world café table had the purpose of evaluating the need for a Helmholtz interest group around these systems. Spoiler: this purpose was not fulfilled.
Key ideas
There are many different use cases where agentic systems could be useful, such as reconstructing an unknown database schema or environmental monitoring via a curated knowledge graph combined with a chatbot. However, due to the closed-source nature of many top-performing models, we rarely have quantitative metrics of these systems' performance.
Controversy
The main controversy, and what ultimately prevented approaching the original question, was an extensive discussion of what constitutes an agent. There were various opinions, including the sense of purpose, the constraints imposed on it, the company leveraging the technology, tool use, and traceability/explainability. Even the conceptual layer at which to define an agent (developer perspective, user perspective, ...?) was not clear. The classical definition of an agent seems to be undergoing a transition in light of current developments.
Conclusions
There is much momentum around agents currently, such that centralising some of the education and tooling around them could be useful for Helmholtz as a society. We will follow up with all interested in a Mattermost channel in the Helmholtz AI organisation. Interested Helmholtz members can find it here: https://mattermost.hzdr.de/helmholtz-ai/channels/agentic-ai-interest-group
-
Introduction and Opening
Each session began by emphasizing the critical importance of reproducibility in scientific research, with particular focus on AI-driven discoveries. Reproducing scientific results ensures that findings and claims are not the result of chance, but can be consistently obtained using the same methods and datasets.
To encourage active discussion, participants were invited to reflect on the following key questions:
- Why is it important to reproduce the results of a publication?
- How can transparency in code and data be ensured?
- What is the difference between Reproducibility and Replicability?
- What are participants' experiences with attempting to reproduce scientific results?
Discussion Highlights
A central takeaway from the discussions was a shared consensus that achieving reproducibility in Artificial Intelligence and Machine Learning research requires adherence to sound software engineering and research practices, including:
- Adhere to good software versioning principles to reduce uncontrolled sources of randomness.
- Implement code following established standards and conventions; this also tends to make it more efficient.
- Set the same seed to reproduce the same results (see the sketch after this list).
- Use version control systems and reproducibility tools.
- Provide example scripts and documented runs to guide reproduction.
- Store artifacts, including input data, outputs, logs, and model checkpoints.
- Apply a “one step at a time” approach when modifying ML parameters to assess their individual impact. Scientific progress is slow and incremental.
- Researchers also emphasized the importance of reproducing one’s own results.
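As a concrete illustration of the seed-fixing practice above, here is a minimal sketch assuming a NumPy/PyTorch workflow; the exact calls depend on the libraries actually in use.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness for a repeatable run."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU and CUDA RNGs
    torch.cuda.manual_seed_all(seed)  # all GPUs, if present
    # Deterministic kernels trade some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```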
The distinction between Reproducibility (the ability to obtain the same results using the same method and data) and Replicability (achieving similar outcomes using different datasets) was clarified. Additional discussion points included:
- In many AI domains, large volumes of synthetic and real data are available, but computational resources remain a significant bottleneck for reproducing experiments.
- Negative results need to be published to help the community understand the limitations of methods and identify the scenarios in which they do work.
- Reproducibility badges certify that a publication meets standards for transparency, open data/code, and verifiable results.
Finally, participants recognized that the reproducibility challenge exists beyond AI and in other scientific disciplines as well. Addressing these challenges in AI requires a commitment to transparency, trustworthiness, uncertainty quantification, and a step-by-step approach.
Communities like the German Reproducibility Network (DERN) foster reproducibility efforts.
-
Motivation
The DFG Code of Conduct 'Guidelines for Safeguarding Good Research Practice' (link) currently lacks specific guidance on the use of (generative) AI tools. However, in September 2023, the DFG introduced supplementary guidelines for working with generative models for text and image creation (link). These guidelines emphasise key principles, including Transparency and Disclosure, Maintaining Responsibility, Authorship, and Review Process Restrictions.
But what does this mean in practice? Where do we draw the line between acceptable and notifiable AI usage? Should researchers disclose the use of seemingly innocuous tools like autocorrection, or more advanced applications like AI-assisted literature searches, text polishing, automatic literature reviews, text generation, or coding co-pilots? These tools are increasingly being used to improve efficiency and precision, but the boundaries of transparency and accountability remain unclear.
Introduction
We started by looking into existing guidelines.
The DFG Guidelines on Generative AI in Research (link), released in September 2023, advocate, in a nutshell, for:
- Embracing AI with caution, i.e., acknowledging the transformative potential of generative AI in research but emphasising the need for responsible use;
- Transparency, in particular the disclosure of the use of generative models in research work, detailing the tools used, purposes, and extent;
- Authorship integrity, meaning on the one hand that only natural persons should be listed as authors and on the other hand that responsibility for content remains with the researchers, ensuring no infringement or misconduct;
- Permissible use of generative AI in funding proposals, taking into account the aforementioned points, but prohibition in peer reviews to maintain confidentiality.
The European Commission published Living Guidelines on the Responsible Use of Generative AI in Research (link) in March 2024 aligned with the European Code of Conduct for Research Integrity and the EU’s principles for trustworthy AI. They also closely correspond with the DFG’s guidelines, particularly regarding transparency and integrity in peer review processes. In addition, the guidelines emphasise the importance of privacy and data protection when using generative AI tools, urging researchers to ensure full compliance with EU data protection laws.
In September 2024, the Helmholtz Association published Recommendations on the Use of AI in Research and Administration (link), explicitly labelling it as a living document that recognises the rapid evolution of AI, to be updated regularly, reflecting new developments and insights in the field. Building on existing guidelines, these recommendations address crucial considerations such as copyright laws and intellectual property rights, which may be impacted by the use of generative AI due to the composition of training data. Additionally, they emphasise the importance of awareness regarding misinformation and biases in AI training data and outputs.
The Helmholtz Association also outlines Best Practices for the responsible use of AI, including:
- Maintaining transparency in AI system development and deployment
- Engaging in continuous education and training on AI technologies
- Fostering interdisciplinary collaboration to enhance AI applications
- Regularly reviewing and updating AI usage policies to align with technological advancements
Discussion
The discussion roundtable highlighted the blurred lines between generating text from a few keywords and polishing or refining writing, grammar, and wording. This ambiguity makes it challenging to establish clear guidelines, particularly in the absence of tools to detect undisclosed use of generative artificial intelligence (GAI). It is essential to distinguish between using GAI for answering scientific questions, such as through in-house developed tools, and the process of writing a scientific paper, including image generation, literature research, summarisation, and refinement.
Specific recommendations emerged from the discussion: Firstly, it was suggested that Blablador should be promoted more in the Helmholtz community. It provides an evaluation server for various open-source Large Language Models (LLMs) that does not store user data or chat history. Additionally, mandatory courses on best practices for AI use should be introduced for PhD students and researchers at all career levels.
As AI researchers, we have a responsibility to prioritise the development of models that not only provide diverse perspectives to promote critical thinking, but also models that are explainable and transparent regarding their decision-making processes and the data used for training, thereby mitigating biases and the dissemination of incorrect or inaccurate information.
Funders should incentivise responsible GAI use, for instance, by requiring shorter project proposals to reduce the need for using GAI to generate excessive content. Moreover, participants emphasised that the scientific context should be given more weight than language in peer-reviewed papers and project proposals, rendering GAI-based text polishing unnecessary and keeping the original scientific proposal in focus.
The discussion also touched upon the idea of introducing mandatory watermarks to indicate AI-generated images in papers and grant proposals, as well as appending a list of the AI tools used in a paper, including version specifications and a detailed description of the concrete application. Notably, some journals have already implemented such rules and tests for AI usage, indicating an implicit shift towards best practices in transparency for scientific papers. For example, Elsevier and Springer Nature mandate that authors disclose the use of GAI in their manuscripts, although Springer Nature does not require disclosure of AI-assisted text enhancement. Both publishers have stringent policies against employing GAI to create or alter images in submitted manuscripts, with exceptions for AI tools that are part of the paper’s scientific work and for transparent GAI workflows involving input data that can be attributed, checked, and verified (link). Conferences such as NeurIPS and ICML have introduced similar rules and guidelines.
In conclusion, the topic remains open, but there appears to be a consensus favouring transparency over strict rules or prohibitions, which may not be enforceable. Furthermore, as AI developers, it is crucial to work towards creating transparent and explainable AI, enabling the assessment of biases and misinformation. The existing rules and guidelines in journal and conference submission guidelines suggest that a subtle yet significant shift in scientific best practices is already underway, implicitly acknowledging the need for adapted standards in the era of GAI.
This text was improved by Llama3 405 made available through Blablador (link). The author assumes full responsibility for the content 😉
-
Premise
Foundation models (FMs) have been put forward as a possible one-size-fits-all solution in science as well, and several implementation projects are now underway across disciplines. However, geospatial FMs are arguably disappointing, and FMs for climate are more or less non-existent. We want to discuss with you the implementation pathways of large-scale FMs (beyond LLMs), their data situation, and the opportunities and limits of the tasks they might solve. Some of the questions we would like to discuss might include:
- Are scientific applications too diverse, too specific, and often out of training sample?
- Should FMs be seen as working out-of-the-box or will only clever fine-tuning make them shine?
- Can we expect scaling laws from LLMs to transfer to scientific FMs and why?
- Under which circumstances should we instead prefer models trained for a specific task?
Discussion
A considerable part of the discussion focused on the definition of the term FM. In particular, two definitions emerged:
- An FM is an AI model that can solve at least two different tasks;
- An FM learns abstract concepts beyond a single task.
In addition, it was remarked that FMs are often, but not always, multimodal. These different definitions hint at why this topic is currently hotly discussed. Some argued that the term FM is used as a buzzword and that researchers therefore unknowingly or intentionally stretch its definition. As FMs mature, the definition should become clearer.
Another part of the discussion focused on the question of to what degree FMs are preferable over more lightweight, task-specific models. This question was hard to tackle because such benchmarks often don’t exist, owing to domain scientists’ lack of time, interest, or skill to benchmark this rigorously. Some questioned the paradigm of “bigger is always better”. On the other hand, even if FMs are not substantially better than task-specific models, domain scientists might find a powerful, ready-to-use FM more appealing than training a model from scratch.
The third big topic in our discussion addressed the design and training of FMs. In particular, it was argued that the performance and generalisability of an FM depend on the design of the pretext tasks, or in other words, on how exactly tokens are masked during the initial learning phase. This is especially complex for multi-modal data, where extra care must be taken in how different data sources are weighted during training. Another complicated aspect is foreseeing the downstream tasks, or at least producing an FM that is able to adapt to evolving tasks and data.
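To illustrate what such a pretext-task design decision can look like in code, here is a minimal, hypothetical sketch of random token masking for a masked-reconstruction objective; the tensor shapes and the mask ratio are arbitrary assumptions for illustration, not recommendations from the discussion.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a fraction of tokens; the model must reconstruct them.

    tokens: (batch, num_tokens, dim) patch/token embeddings.
    Returns the visible tokens and the boolean mask of hidden positions.
    """
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    # A random permutation per sample decides which tokens stay visible.
    ids_shuffle = torch.rand(b, n).argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool)
    mask.scatter_(1, ids_keep, False)  # False = visible, True = masked
    return visible, mask

# During pretraining, an encoder sees only the visible tokens and a decoder is trained
# to reconstruct the masked ones; choosing the mask ratio and how modalities are mixed
# is exactly the pretext-task design question raised in the discussion.
```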
-
Motivation
As AI is picking up speed, concerns have emerged, and many are pondering where our human intelligence is headed. Are we on the path to losing our brain plasticity? Is our brain like a muscle that weakens if it is not trained? While AI is becoming a part of our everyday life, supporting us in tasks ranging from information retrieval to decision-making, critics argue it may be fostering dependency, diminishing critical thinking skills, and weakening our ability to process complex problems unaided. More and more research points in the direction of human behavior shifting from doing one's own research and critical thinking towards prompt engineering and the verification of information delivered by a Large Language Model (LLM).
The discussion round started with creating a scale based on who agreed and who disagreed with the presented question. More participants tended to agree with the statement and brought arguments to back it up. We talked about our own experiences as well as supporting research articles. One strong example and active discussion topic concerned our modern reality, in which young students overuse and potentially misunderstand the help that AI tools can bring to everyday life; the students might not even be aware of it. As a result of overuse, their problem-solving skills might not develop the way they otherwise could have.
Whereas more than half of the participants leaned towards agreeing with the original statement, no one was entirely confident that the rise of AI is leading towards the decay of human intelligence. On the other hand, a few firm no-voters with well-founded arguments brought the voting scale almost to an equilibrium.
With that, we delved into the idea of human fear of new and emerging technologies. History shows that at every step of our evolution, when a big change was on the rise, human minds went into fight-or-flight mode. We also agreed that there is a need for more domain-specific AI tools and that each and every one of us carries a responsibility for how we use and integrate AI in our everyday life.
The discussion was rounded off with the initial question: how do we define intelligence, be it human or artificial?
Follow-up Potential
It was a very active round, with participants interested in sharing their opinions on the matter or curious to hear others out. As philosophical discussions never reach a conclusion, a potential follow-up could concentrate on a domain-specific topic, for example: does using AI tools harm or enhance research? Or simply "The benefits of AI in Research".
-
Introduction
In machine learning, the ability to make reliable predictions is paramount. Yet, standard ML models and pipelines provide only point predictions without accounting for model confidence (or the lack thereof). Uncertainty in model outputs, especially when faced with out-of-distribution (OOD) data, is essential when deploying models in production. This session serves as an introduction to the concepts and techniques for quantifying uncertainty in machine learning models. We will explore the different sources of uncertainty and cover various methods for estimating these uncertainties effectively. By understanding and addressing uncertainty, particularly in the context of OOD data, practitioners can enhance the robustness of their models and foster greater confidence in model predictions.
Discussion
The group discussed current tools and techniques for uncertainty quantification (UQ) in machine learning, with a focus on neural networks. Key topics and takeaways included:
Tools and Packages
Several UQ-related packages were shared and briefly reviewed:
- torch_bayesian: Bayesian neural networks via variational inference, new package by one participant
- lightning-uq-box
- torch-uncertainty
Non-Bayesian Methods
Conformal prediction was introduced by one participant as a non-Bayesian approach to UQ. There was interest in a potential follow-up discussion focused on out-of-distribution detection using these methods.
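As an illustration of the conformal idea, here is a minimal sketch of split conformal prediction for regression; the model is assumed to be any fitted regressor with a scikit-learn-style predict method, and the example is not tied to any package mentioned above.

```python
import numpy as np

def conformal_interval(model, X_calib, y_calib, X_test, alpha=0.1):
    """Return (lower, upper) prediction intervals with ~(1 - alpha) marginal coverage."""
    # Nonconformity scores on a held-out calibration set: absolute residuals.
    scores = np.abs(y_calib - model.predict(X_calib))
    n = len(scores)
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    preds = model.predict(X_test)
    return preds - q, preds + q
```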
Types of Uncertainty
The group discussed epistemic and aleatoric uncertainty. In a poll on which type is more relevant to participants’ applications, epistemic uncertainty received more votes (7 to 5).
Use Cases
Participants shared domain-specific applications of UQ, providing context for the practical importance of different uncertainty types and approaches.
LLMs and Semantic Entropy
A short explainer was given on semantic entropy in large language models, referencing a recent article:
Semantic Entropy in LLMs (Nature, 2024)
Practical Considerations
The group discussed how to interpret and use uncertainty in practice. One key question was how to threshold uncertainty—deciding when a prediction is too uncertain to trust. A suggested method was to implement abstained prediction, where predictions are withheld if uncertainty exceeds a defined level.
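Below is a minimal sketch of what such abstained prediction could look like, assuming a classifier's softmax probabilities as input and predictive entropy as the uncertainty measure; the threshold is a placeholder to be tuned, e.g. on a validation set.

```python
import numpy as np

def predict_or_abstain(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """probs: (n_samples, n_classes) softmax outputs; returns class indices, -1 = abstain."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # predictive entropy
    preds = probs.argmax(axis=1)
    preds[entropy > threshold] = -1  # withhold predictions that are too uncertain
    return preds
```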
-
Solving PDEs using operator learning has recently emerged as a hot topic in AI research. With a solid mathematical foundation, it promises a unified and systematic framework for approximating infinite-dimensional operators by mapping input functions to output functions. Thus, it can be used to model a whole family of PDEs, as opposed to the single instance that usual neural networks approximate. However, there is a lot of scepticism around the topic with respect to data requirements, generalizability, etc.
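As a rough illustration of the paradigm (not of any specific model discussed at the table), here is a minimal DeepONet-style sketch in PyTorch: a branch network encodes the input function sampled at fixed sensor points, a trunk network encodes the query coordinate, and their inner product approximates the output function at that coordinate. All layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyDeepONet(nn.Module):
    def __init__(self, n_sensors: int = 100, width: int = 64, p: int = 32):
        super().__init__()
        # Branch net: input function u sampled at n_sensors locations.
        self.branch = nn.Sequential(
            nn.Linear(n_sensors, width), nn.Tanh(), nn.Linear(width, p)
        )
        # Trunk net: query coordinate y (here 1-D) where the output is evaluated.
        self.trunk = nn.Sequential(
            nn.Linear(1, width), nn.Tanh(), nn.Linear(width, p)
        )

    def forward(self, u: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # u: (batch, n_sensors), y: (batch, 1)  ->  G(u)(y): (batch, 1)
        return (self.branch(u) * self.trunk(y)).sum(dim=-1, keepdim=True)

# Trained on many (u, y, G(u)(y)) triples, one network represents the solution
# operator of a whole family of PDE instances rather than a single one.
```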
In this AI World Café, the paradigm of operator learning was first introduced to the participants, as many of them wanted to know what it is and only a few had previous experience or knowledge of it. Our discussions then covered the following main points:
- Operator learning could be useful for accelerating the solution of PDEs that can already be solved with known numerical methods at moderate cost. However, for certain complex PDEs (e.g., multi-physics or high-dimensional problems), where numerical solutions are too costly to obtain, it may not be very useful because an insufficient number of training samples can be generated.
- A common concern among the participants was the data-hungry nature of these strategies. One participant gave the example of modelling atmospheric turbulence in CFD using the Variable Input DeepONet (VIDON), one of the operator learning strategies; in that particular case, operator learning was not effective compared to the numerical method they were using.
- Instabilities during inference for long-time integration simulations were mentioned as an issue. Reduced basis methods were suggested as an alternative to operator learning, although they have limitations due to the linear nature of the strategy.
- Many of the participants came from an applications background, and the common consensus was that the huge benefits reported in the AI literature on operator learning do not show up in practice when participants try to apply it to their own use cases.
As a result, we agreed that operator learning could be useful when one has a lot of training data to start with, and even then its success depends heavily on the application. For complex (coupled, multi-physics, etc.) PDEs, one may not see the huge advantages commonly reported in the ML/AI literature for relatively simple PDEs.
Finally, the participants were interested in a follow-up discussion on this topic at future Helmholtz AI conferences, and hopefully the situation will improve over time.
-
Introduction
Artificial Intelligence is fundamentally reshaping the landscape of scientific inquiry, bringing about unprecedented opportunities alongside profound challenges to the very foundations of knowledge production. As AI tools scrutinize vast datasets and generate hypotheses at unparalleled scales, we grapple with opaque processes, the overwhelming deluge of research, and increasing hyper-specialization.
Inspired by the accompanying article, this round table discussion addressed the "epistemological upheaval" driven by AI's integration into science, with concepts such as:
- The redefined "Ends of Science": Not a cessation, but a transformation of methodology and understanding.
- Epistemic Overhangs & Underhangs: The gaps created when theories outpace verification or empirical findings lack causal explanations, and how AI might accelerate these.
- The Challenge of Opacity: How do we evaluate scientific findings from systems operating beyond human comprehension?
- The Promise of Mechanistic Interpretability: How can new tools help us "open and visualize" AI models to gain understandable explanations and bridge epistemic gaps?
- The Future of the Scientific Method: What new tools, methods, and norms are needed to leverage AI effectively and responsibly for planetary-scale research?
Discussion
There was an intense and constructive conversation that unfolded in many directions. The most controversial point was actually the title of the table, which people interpreted in multiple ways. At the beginning, we tried to understand whether science is actually stagnating. That brought up the questions of what science and the scientific process are, what stagnation is (and what it is that pushes us to accelerate), how it manifests itself and in which fields, and why people feel this way. Taking the perspective of epistemic overhangs and underhangs, we tried to work out how to close those gaps between the empirical and the theoretical. Furthermore, the existential question was raised whether this really is the end of science, and hence whether scientists will be left unemployed in light of the rise of AI, which could potentially be more productive than humans.
As for the ideas, a fragmented and incomplete list would be:
- It is important to differentiate applied and fundamental science, where the paces are different and the metrics of "success" are different. The former has a formalised purpose and goal, while the latter is more curiosity-driven and exploratory in its uncertainty.
- Communication, as well as language (between fragmented bubbles, between scientists, and in general), is key to getting more ideas flowing and to removing the hurdles of misunderstanding and lack of interaction.
- Research (especially fundamental research) is rooted in curiosity and creativity, and that is what makes it deeply human and therefore unique, as these are among our most fundamental inner motivations and drives. Are we really being creative within the current scientific system?
- Speaking of which, we also discussed the existing structural problems of academia.
I definitely see a follow-up. It felt like the conversation was cut short by time: some people stayed at the table for additional rounds just to continue the thread, and we also kept chatting after the World Café for some time. Attendees seemed curious and eager to talk about the topic, and the table created the space for them to channel their ideas at a very fundamental level. Unfortunately, I feel there is no space in academia to carry out such deep conversations, and the dynamics of the interactions at the table somewhat reflected this belief.