Pioneering the Future of Data-Intensive Science
In the rapidly evolving realm of data-intensive science, computational workflows have become an essential tool for managing, analyzing, and sharing complex data across diverse computing environments. These workflows describe multi-step methods used for data collection, preparation, analytics, predictive modeling, and simulation, leading to new data products. As workflows become increasingly vital in generating data, the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, initially conceived for data management, have expanded to encompass computational workflows as digital objects in their own right. FAIR principles for workflows must address what makes workflows distinctive: they are composed of executable software steps, and they carry their own provenance and development history.
Computational workflows inherently contribute to FAIR data principles by processing data according to established metadata, creating metadata during data processing, and tracking and recording data provenance. These properties aid in data quality assessment and contribute to secondary data usage.
A widely adopted approach to achieving FAIR computational workflows is to package them in OCI-compatible containers, which can significantly improve their FAIRness across various computing environments, from edge devices to high-performance computing (HPC) systems. Singularity, a container platform initially designed for scientific and HPC environments, exemplifies this strategy.
Achieving FAIR Computational Workflows With Containers
FAIR computational workflows should be:
- Findable: Workflows can be discovered through container registries when their images are assigned globally unique, persistent identifiers and described with rich metadata.
- Accessible: Containers offer a consistent environment across diverse computing platforms, enabling workflows to be executed with minimal modifications, regardless of the underlying infrastructure.
- Interoperable: By adhering to open standards such as the OCI specifications, container technologies ensure that workflows can be executed across different container runtimes and computing platforms.
- Reusable: Containerization packages software and its dependencies in a portable and reproducible manner, making it easier to reuse and share workflows across different projects and research groups, as sketched in the definition file below.
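The definition file below is a minimal sketch rather than a published workflow: the base image, label values, and `my_workflow` module are placeholders. It illustrates how rich metadata and pinned dependencies can travel with the packaged software.

```
Bootstrap: docker
From: python:3.11-slim

# Hypothetical metadata: rich labels and a resolvable source URL aid findability
%labels
    Author   example-lab@example.org
    Version  1.0.0
    org.opencontainers.image.source    https://example.org/workflows/demo
    org.opencontainers.image.licenses  Apache-2.0

%post
    # Pin the workflow's dependencies inside the image for reproducibility
    pip install --no-cache-dir numpy==1.26.4 pandas==2.2.2

%runscript
    # Entry point invoked by `singularity run`
    exec python -m my_workflow "$@"
```

Building it with `singularity build workflow.sif workflow.def` yields a single image file whose labels can be listed with `singularity inspect workflow.sif`, so registries and users can see what the container provides.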
These principles aim to improve the overall efficiency and effectiveness of computational workflows, making them more valuable for researchers and organizations alike.
The Open Container Initiative and Singularity
The Open Container Initiative (OCI) is an open-source project that aims to establish common standards for container technologies. Founded by Docker, CoreOS, and other industry leaders in 2015, the OCI provides specifications for container image formats and runtime environments, ensuring that container technologies are interoperable and can be used across different platforms and environments.
Singularity is a container platform that allows users to create and run containers that package software and its dependencies in a portable and reproducible manner. It is known for its unique security model and design features tailored to HPC and scientific workloads that give it advantages in performance-intensive environments.
With the new OCI mode, Singularity fits seamlessly into an overall OCI workflow as an HPC-specific component, where containers are built as OCI images elsewhere. OCI compatibility is only the beginning: going forward, Sylabs will extend OCI mode with key security features of traditional Singularity containers, such as in-memory encryption and cryptography-based authentication and authorization of containerized workloads.
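As a rough sketch of what this looks like on the command line, assuming a SingularityCE release that includes OCI mode (the `--oci` flag) and a generic public image from Docker Hub:

```
# Run an OCI image from Docker Hub directly in Singularity's OCI mode
singularity run --oci docker://ubuntu:22.04

# Execute a single command from the image without an interactive shell
singularity exec --oci docker://ubuntu:22.04 cat /etc/os-release
```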
Advantages of Using OCI-Compatible Singularity Containers
OCI-compatible Singularity containers offer several advantages for FAIR computational workflows:
- Streamlined container tool and workflow requirements: The Singularity Enterprise 2.3 release incorporates a consolidated OCI registry, reducing the number of separate tools and registries a containerized workflow has to depend on.
- Reproducibility: Singularity containers aim to ensure that software runs consistently when launched, although specific aspects of the underlying infrastructure, such as the host kernel version or compatibility requirements for MPI and GPUs, can influence behavior.
- Enhanced performance: In OCI mode, Singularity converts OCI images to its HPC-native SIF format, which substantially improves performance in large-scale containerized workflows.
- Ease of use: Singularity containers can be easily pulled from resources like Docker Hub or the Singularity Container Services registry, making it simple for users to access and run containerized workflows.
- Integration with workflow management systems: Singularity can be integrated with popular workflow management systems such as Nextflow, Snakemake, and CWL; a minimal Nextflow configuration sketch follows this list.
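For instance, after pulling an image from Docker Hub (`singularity pull docker://ubuntu:22.04`), a Nextflow pipeline can be told to run every process in a Singularity container through its standard `singularity` configuration scope. The image below is a placeholder rather than a specific published container:

```
// nextflow.config: run all pipeline processes inside Singularity containers
singularity {
    enabled    = true
    autoMounts = true   // bind host paths referenced by the pipeline
}

process {
    // Placeholder image; pin the workflow's actual container here
    container = 'docker://ubuntu:22.04'
}
```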
Singularity Containers: Pioneering Interoperability from Edge to Cloud to HPC
Singularity containers can be employed to enhance the FAIRness of computational workflows across diverse computing environments, including edge devices, cloud platforms, and HPC systems.
At the edge, OCI containers with Sylabs enhancements help simplify the deployment of lightweight, portable applications across devices. Although aspects of the underlying infrastructure, such as the host kernel and hardware drivers, can affect portability, containers typically run with minimal modification.
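A sketch of that deployment pattern, assuming the definition file shown earlier and a hypothetical edge host named `edge-node`:

```
# Build one self-contained SIF image on a build machine
# (use --fakeroot if you cannot build as root)
singularity build app.sif app.def

# Copy the image to the edge device and run it there unchanged
scp app.sif user@edge-node:/opt/containers/
ssh user@edge-node singularity run /opt/containers/app.sif
```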
In the cloud, Singularity’s interoperability allows researchers and organizations to run containerized workflows, leveraging the flexibility and cost-effectiveness of cloud-based solutions while capitalizing on scalable resources and services.
In high-performance computing, Singularity containers can run performance-sensitive applications and libraries at scale. Singularity offers compatibility with high-performance hardware and networking resources, including support for MPI and GPU applications as well as InfiniBand networks, ensuring that researchers and organizations can exploit the computational power of HPC systems.
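The sketch below shows the usual pattern on an HPC system, assuming a container that already includes a CUDA application and an MPI-linked solver (both hypothetical), with the host MPI launching one container instance per rank:

```
# Expose the host's NVIDIA GPUs and driver libraries inside the container
singularity exec --nv app.sif python train.py

# Hybrid MPI model: the host mpirun starts one container per rank,
# and the MPI inside the container communicates over the host fabric
mpirun -n 64 singularity exec app.sif /opt/app/bin/mpi_solver input.dat
```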
Case Study: DataLad-based Framework for Reproducible Data Processing
A recent study published in Nature highlights the benefits of using Singularity containers in FAIR computational workflows. The researchers introduced a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework aims to minimize platform idiosyncrasies and performance-related complexities while capturing machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes and can be re-executed independently of the original computing infrastructure.
The researchers demonstrated the framework’s performance with two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available, the UK Biobank dataset).
In both cases, Singularity containers played a crucial role in encapsulating computational environments that could be shared and reused across different systems, improving the FAIRness of computational workflows. It is also worth noting that the researchers stored the UK Biobank (UKB) data in archives in a DataLad RIA store, which was then processed using the Singularity container. This approach helped to mitigate disk space and inode limitations on the two different systems used in the study.
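The general pattern behind that framework can be sketched with DataLad and its `datalad-container` extension, which registers a container in a dataset and records each containerized command as re-executable provenance. The dataset layout, image, and script below are illustrative, not the study's actual code:

```
# Create a DataLad dataset and register a Singularity/OCI image in it
datalad create my-analysis
cd my-analysis
datalad containers-add analysis-env --url docker://ubuntu:22.04

# Run a processing step inside the container; DataLad records the exact
# command, inputs, and outputs as machine-actionable provenance
datalad containers-run -n analysis-env \
    --input  data/raw.csv \
    --output results/summary.csv \
    "python code/summarise.py data/raw.csv results/summary.csv"
```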
The Impact of OCI-Compatible Singularity Containers on FAIR Workflows
Leveraging OCI-compatible Singularity containers can have a significant impact on the efficiency and effectiveness of FAIR computational workflows. By improving the findability, accessibility, interoperability, and reusability of these workflows, Singularity containers can help research labs, academic institutions, and enterprises:
- Save time and resources: By streamlining container tools and workflow requirements, researchers can spend less time managing software dependencies and more time focusing on their research.
- Enhance collaboration: By making workflows more accessible and reusable, researchers can more easily share their work with others, fostering collaboration and innovation.
- Increase reproducibility: By encapsulating software and its dependencies, Singularity containers can help ensure that computational workflows produce consistent results across different computing environments, increasing the reproducibility of research findings.
The promise of achieving FAIR principles for computational workflows is within reach through the use of open standards like OCI and container platforms like Singularity. By encapsulating entire software stacks and dependencies into portable and reproducible containers, researchers can significantly improve the findability, accessibility, interoperability, and reuse of their workflows across a diverse range of computing environments.
To fully realize the vision of FAIR computational workflows, the scientific community must continue collaborating to drive the adoption of container technologies and establish best practices. Workflow developers should leverage OCI-compliant containers to package their workflows for seamless sharing and execution. Infrastructure providers need to ensure support for running containers on their systems. Funding agencies should mandate the use of FAIR workflows in grant proposals to drive change.
Together, these concerted efforts can transform how science is conducted in the digital age. By making computational workflows more open, transparent, and reusable, OCI-compatible containers like Singularity will accelerate discovery and allow researchers to focus on pushing the boundaries of human knowledge. The time to embrace FAIR data principles for workflows is now.