AUTHOR: PAOLO DI TOMMASO
Deploy containers at scale with Nextflow
Containers are exceptionally useful in scientific workflows. They encapsulate software dependencies (e.g., the tools and libraries required by a data-analysis application) in one or more self-contained, ready-to-run, immutable container images that can be easily deployed on any platform supporting the container runtime.
For these reasons, containers have been rapidly adopted by the bioinformatics community, as the popularity of projects such as BioContainers shows. Because it better fits the security and operational requirements of HPC data centres, Singularity has emerged as the technology of choice for deploying these kinds of data-analysis applications at scale.
However, while running a single Singularity container instance is a relatively straightforward task, a typical genomic workflow may require dozens of different container images and spawn thousands of tasks, each of which runs in its own container instance.
Orchestrating the execution of such containerized workloads at scale, while proactively managing related problems such as resource optimization, per-task data I/O staging and error recovery, is anything but simple!
A workflow developer might be tempted to address these challenges by delegating them to a workload manager such as Slurm. Unfortunately, this results only in a partial solution: when workflow applications are tightly coupled with workload managers, they cannot be easily executed on another system or infrastructure, nor automatically tested with a continuous integration (CI) service. In other words, tight coupling between workflow applications and workload managers results in a loss of runtime mobility.
Nextflow is a workflow system designed to manage the orchestration and deployment of containerized workloads at scale, across clouds and clusters in a portable and reproducible manner. The main design principle consists of decoupling the application workflow logic from the underlying execution platform. Each task is defined in a self-contained manner and executed in its own containerized environment. Because execution specifics are detailed in a separate configuration file, the nuances of containerization solutions are made transparent to the developer – e.g., they are not required to provide any specific container engine command.
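To make the decoupling concrete, here is a minimal sketch of a workflow script. The process name COUNT_LINES, the params.input parameter, the file pattern and the shell command are all hypothetical, chosen only for illustration. Note that the task contains no container engine command.

```groovy
// main.nf: a minimal sketch, not taken from a real pipeline.
// COUNT_LINES, params.input and the 'data/*.txt' pattern are hypothetical.

params.input = 'data/*.txt'

process COUNT_LINES {
    input:
    path sample          // each input file is staged into the task's work directory

    output:
    stdout               // the command's standard output becomes the task output

    script:
    """
    wc -l < ${sample}
    """
}

workflow {
    // One task is spawned per file matching params.input
    COUNT_LINES(Channel.fromPath(params.input))
    COUNT_LINES.out.view()
}
```

Which image the task runs in, and whether Docker or Singularity launches it, would be set separately in the nextflow.config file, for example through the process.container setting together with docker.enabled or singularity.enabled.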
Nextflow provides out-of-the-box support for the most widely used containerization technologies including Singularity, as well as a large range of execution platforms such as Slurm, Grid Engine, LSF, Kubernetes, AWS Batch, etc.
This support enables the definition of platform-agnostic data analysis workflows that can easily be deployed in a portable and reproducible manner across heterogeneous execution platforms. For example, a researcher can rapidly prototype a workflow application on their own laptop, isolating the dependencies with a Docker container; then, they could deploy it at scale on the institution’s Slurm cluster using Singularity. Finally, they could share their workflow application with a colleague in a different organisation who uses a different batch scheduler, or even deploy it in the AWS cloud via the AWS Batch compute service.
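The laptop, cluster and cloud scenario could be expressed with configuration profiles along the following lines. This is a sketch under stated assumptions: the profile names, the container image, the queue name, the AWS region and the S3 bucket are placeholders, and a real deployment would need site-specific values.

```groovy
// nextflow.config: illustrative sketch; every name below is a placeholder.

// Hypothetical image holding the workflow's tools
process.container = 'my-org/my-tools:1.0'

profiles {
    // Prototyping on a laptop: tasks run locally inside Docker
    laptop {
        process.executor = 'local'
        docker.enabled   = true
    }

    // Institutional HPC cluster: tasks are submitted to Slurm and run with Singularity
    cluster {
        process.executor    = 'slurm'
        singularity.enabled = true
    }

    // AWS cloud: tasks are submitted to an AWS Batch queue
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'      // hypothetical queue name
        aws.region       = 'eu-west-1'           // hypothetical region
        workDir          = 's3://my-bucket/work' // hypothetical bucket
    }
}

// Select a profile at launch time, e.g.:
//   nextflow run main.nf -profile cluster
```

The workflow script itself stays unchanged across the three profiles; only the option passed on the command line differs.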
Nextflow is routinely used by many pharmaceutical companies and renowned public health institutions such as SciLifeLab. For example, at the Centre for Genomic Regulation (CRG) alone, Nextflow has been used to deploy data-intensive computational workflows since 2014; it has orchestrated the execution of over 12 million jobs, totalling 1.4 million CPU hours – with the majority of these jobs executing within Singularity containers.
Nextflow is a free and open source software solution for application workflows developed by the Centre for Genomic Regulation (CRG). Seqera Labs was recently incorporated as a spin-off from the CRG to provide enterprise-level support and professional services around the Nextflow platform, as well as to explore new, innovative products to power the next generation of big data analysis applications.