AquaINFRA
The overarching goal of the EU-funded AquaINFRA project is to design and implement a research data infrastructure to assist scientists from the aquatic realm in restoring healthy oceans, rivers, and lakes. The key components of the infrastructure are the Data Discovery and Access Service (DDAS) as the backend, the AquaINFRA Interaction Platform (AIP) as the central gateway to the project and search interface, and the Virtual Research Environment (VRE). In addition, several researchers in the project work on case studies to demonstrate the usefulness and benefits of FAIR and open data organized in such an infrastructure.
In this blog post, I describe the design rationale for using the Galaxy platform as the central (and currently only) component of the VRE. I will briefly summarize the needs defined in the AquaINFRA proposal, show how Galaxy can address these requirements, and outline challenges and limitations that come with the Galaxy platform.
Needs and Requirements
The AquaINFRA proposal, or Description of Action (DoA), leaves open which tools and software should be used, since it is difficult to predict new developments in the dynamic world of research software engineering. However, it promised to build the VRE on top of existing software rather than develop yet another tool from scratch, which would be resource intensive and result in duplicated effort. Other than that, the proposal listed a number of needs and requirements:
- Users should be able to access AquaINFRA resources, such as data and services.
- The VRE should offer reproducible tools and web services to process and analyze data.
- A canvas should help users to create, run, modify, and share workflows.
- Newly generated outputs (e.g., code, data) should be published in a reproducible format.
- Services and notebooks developed in the case studies should be usable and reproducible in the VRE.
- There should be a connection to the European Open Science Cloud (EOSC).
In addition to these six needs, there are some requirements that seem trivial, but are still essential in the context of an open science project:
- The VRE must be based on, and released as, open source software.
- Reused software must at least be maintained and ideally come from active projects.
- The VRE must be flexible regarding programming language and extensible with the analysis scripts and services developed in the case studies.
These needs and requirements are sensible, and a web application that meets them would certainly enrich the work of many researchers. So far, so romantic. However, such an ideal piece of software would also be very complex and come with a steep learning curve. It is not realistic to expect a web application of this scope to be easy to understand right away. Requirements such as flexibility increase the complexity further. People using the platform will need to acquire some skills to fully benefit from it, and those developing tools and workflows will need some technical expertise. All in all, time and effort are needed, and this is not only true for our project. For example, a researcher could use simple point-and-click software to calculate statistics, but this approach is opaque, not reproducible, and limited to the features offered. Flexible scripting languages such as R and Python offer full control and have become indispensable, but they are also complex and require technical skills (becoming proficient usually takes at least a year). For information systems like QGIS to be useful, the user must understand the system. Still, these tools are powerful and broadly accepted.
Galaxy
Alright, complex requirements likely result in complex software. Accepted! But is there actually software that can help here? There is, and it’s called Galaxy. Galaxy is an online platform that is free of charge and released under an open license. It allows users to analyze data, create workflows, and share such data and workflows with other Galaxy users and on Zenodo. Key benefits are its extensive set of training materials and its active user community. The geoscientific domain has recently made increasing contributions (see, for example, https://climate.usegalaxy.eu/ and https://earth-system.usegalaxy.eu/). Let’s see to what extent Galaxy can address the needs and requirements listed above.
Galaxy allows users to import data from external sources. See, for example, videos 2 and 4 in the AIP to understand how data is transferred from the AIP’s search interface to Galaxy. In addition, users can take advantage of plenty of existing tools to process and analyze data reproducibly, or contribute newly developed tools in different programming languages (e.g., R and Python) to Galaxy’s toolshed. In a canvas, users can drag and drop these tools and connect them into readily shareable and modifiable workflows. Such reproducible workflows (or any other digital asset in Galaxy) can then be published on Zenodo to receive a DOI. Galaxy is also released under an open license and is part of the EuroScienceGateway, an EOSC project to deliver “a robust, scalable, seamlessly integrated open infrastructure”. Finally, Galaxy is extensible, which allows us to develop a tool that integrates services such as OGC API Processes, which will be created based on the analysis scripts from the case studies. This tool is currently work in progress and will be explained in more detail in a separate article.
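To give a rough idea of what calling such a service involves: in the OGC API Processes standard (Part 1: Core), a process is typically executed by sending a POST request with a JSON document of inputs to the `/processes/{processId}/execution` endpoint. The following Python sketch only assembles such a request and does not send it. The server URL, the process identifier `river-statistics`, and the input names are hypothetical and are not taken from the AquaINFRA tool, which is still under development.

```python
import json
from urllib.parse import quote


def build_execute_request(base_url, process_id, inputs):
    """Assemble the URL and JSON body for an OGC API Processes execute call.

    Nothing is sent over the network; the caller decides how to POST it.
    """
    # The execution endpoint defined by the standard: /processes/{processId}/execution
    url = f"{base_url.rstrip('/')}/processes/{quote(process_id, safe='')}/execution"
    # The standard expects the process inputs under the "inputs" key.
    body = json.dumps({"inputs": inputs}, indent=2, sort_keys=True)
    return url, body


# Hypothetical server, process id, and input names, for illustration only.
url, body = build_execute_request(
    "https://example.org/ogcapi/",
    "river-statistics",
    {"dataset_url": "https://example.org/data/rivers.csv", "column": "discharge"},
)
print("POST", url)
print(body)
```

A Galaxy tool that wraps such a service would, under these assumptions, mainly translate the values a user enters in the tool form into the `inputs` document and hand the result back to the user's history.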
The Galaxy landing page (Fig. 1) has three sections. The tools on the left can be used to work with data and the middle section is used to look at the materials or create workflows. The right side shows the so-called history, which is the user’s workspace including imported datasets and scripts. Users can create workflows by connecting the tools on the left (Fig. 2) or using interactive tools such as RStudio, Jupyter notebooks, and QGIS. For more information on how to use Galaxy, see the training materials. Most of the tools in Galaxy follow the input – processing – output logic, which is important to consider when it comes to developing Galaxy tools and implementing OGC API Processes out of the analysis scripts developed in the case studies.
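The input – processing – output logic can be illustrated with a minimal Python sketch of an analysis script that keeps the three stages separate. This is not one of the case study scripts; the function names, the command line options, and the example column `discharge` are assumptions made for illustration. A script structured like this is comparatively easy to wrap, because every input and output is an explicit parameter.

```python
import argparse
import csv
import os
import statistics
import tempfile


def read_column(path, column):
    """Input: read one numeric column from a CSV file, skipping empty cells."""
    with open(path, newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle) if row[column] != ""]


def summarize(values):
    """Processing: compute a few summary statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def write_summary(path, summary):
    """Output: write the statistics to a two-column CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["statistic", "value"])
        for name, value in summary.items():
            writer.writerow([name, value])


def main(argv=None):
    parser = argparse.ArgumentParser(description="Summarize one column of a CSV file.")
    parser.add_argument("--input", required=True, help="path to the input CSV file")
    parser.add_argument("--column", required=True, help="name of the numeric column")
    parser.add_argument("--output", required=True, help="path for the summary CSV file")
    args = parser.parse_args(argv)
    write_summary(args.output, summarize(read_column(args.input, args.column)))


# Small demonstration with made-up values written to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "rivers.csv")
    target = os.path.join(workdir, "summary.csv")
    with open(source, "w", newline="") as handle:
        handle.write("station,discharge\nA,1.0\nB,2.0\nC,3.0\n")
    main(["--input", source, "--column", "discharge", "--output", target])
    with open(target, newline="") as handle:
        print(handle.read())
```

When used as a standalone script, `main()` would be called without arguments so that it reads the actual command line. A Galaxy tool wrapper or an OGC API Processes service could then supply the three parameters in the same way.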
There are three options to integrate Galaxy in a research data infrastructure: 1) It is possible to set up one’s own instance of Galaxy. While this approach gives providers full control over the platform, it requires administration effort and results in a platform that is less connected to other research domains. 2) The second option is to make use of one of the existing, running server instances, for example, in Australia or Europe. While this approach requires no administration effort, users are confronted with a huge number of tools from all research domains. 3) The third option is to create a subdomain in one of the existing server instances, allowing providers to restrict the tools to those that are relevant for the target audience. We decided on the third option to avoid administration overhead and created a subdomain that addresses the users in AquaINFRA. It is possible to switch to one’s own instance at any time if necessary, as all contributions to Galaxy’s toolshed can be imported to any other instance.
Limitations
Like any software, Galaxy comes with a number of limitations. Galaxy’s user interface is not very intuitive. The Galaxy developers are aware of this and are working on it. However, developing an intuitive user interface for such complex software is challenging and resource intensive. In terms of Brian Nosek’s strategy of culture change (Fig. 3), Galaxy has successfully developed an infrastructure that makes it possible to create reproducible workflows and is now trying to reach the next of five steps: make it easy. So, for now, we need to live with a steep learning curve. We are not aware of a comparable alternative. The Whole Tale would have been a great tool for reproducibility, but it is no longer maintained. Binder is similar to The Whole Tale and is still active, but it does not provide a workflow canvas (which is required by the DoA). QGIS has a graphical modeler but typically runs on a user’s local machine, which makes it cumbersome to work on projects collaboratively and to connect other software components such as the AIP.
It is also important to note that Galaxy will not be the central entry point for the AquaINFRA platform. This is the task of the AIP, which provides a user interface to search for resources from the aquatic realm. For example, users will find a workflow via the AIP and then get redirected to Galaxy where they can run it with one click.
Although Galaxy is the only component of the VRE for now, this does not mean that there won’t be any additional tools. If tools such as Binder contribute features that are needed in the project but not covered by Galaxy, we should integrate them as part of the VRE as well. Further software components should be added with care, as they also increase the complexity of the entire platform, might demand additional skills from users, and affect the overall architecture.
Outlook
We are currently in the process of integrating the first analysis scripts from the case studies into the Galaxy platform. Our goal is to demonstrate the first workflows during the next partner meeting in Barcelona in November.
References
- Nosek, B. (2019): Strategy for culture change. https://www.cos.io/blog/strategy-for-culture-change
- Konkol, M. (2024): The AquaINFRA platform: A research data infrastructure for marine and inland waters. https://doi.org/10.7490/f1000research.1119778.1
Funding
This project has received funding from the European Commission’s Horizon Europe Research and Innovation programme under grant agreement No 101094434. Project coordinator: Aalborg Universitet (AAU). The information and views of this website lie entirely with the authors. The European Commission is not responsible for any use that may be made of the information it contains.