Tips and Tricks

Developing on the cloud incurs inherent runtime costs due to compute and storage used to execute workflows. Here are a few tips that can facilitate development.

  1. Leverage the cross-platform nature of these workflow languages. Both CWL and Nextflow can be run locally as well as on ICA. When possible, test locally before attempting to run in the cloud. For Nextflow, configuration files can be used to specify settings for either local or ICA execution. An example of advanced config usage is applying the scratch directive to a set of process names (or labels) so that they use the higher-performance local scratch storage attached to an instance instead of the shared network disk:

    process {
        withName: 'process1|process2|process3' { scratch = '/scratch/' }
        withName: 'process3' { stageInMode = 'copy' } // Copy input files to scratch instead of symlinking from the shared network disk
    }
  2. When testing on the cloud, it is often beneficial to create scripts that automate the deployment, launching, and monitoring process. This can be done either with the ICA CLI or with your own scripts integrating with the REST API.

  3. For scenarios in which instances are terminated prematurely without warning (for example, when using spot instances), you can implement retry logic like the following. Adding this snippet to 'nextflow.config' allows up to four retries per job (five attempts in total), with an increasing delay between tries.

    process {
        maxRetries = 4
        errorStrategy = { sleep(task.attempt * 60000 as long); return 'retry' } // Retry with a delay that grows with each attempt
    }

    Note: Adding this retry logic where it is not needed may introduce unnecessary delays.

  4. When hardening a Nextflow pipeline to handle resource shortages (for example, exit code 2147483647), an immediate retry will in most cases fail because the resources are not yet available. It is best practice to use a dynamic retry with an increasing backoff delay, giving the system time to provide the necessary resources.
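
    As a sketch, one way to implement this in 'nextflow.config' is an errorStrategy closure that sleeps with an exponentially increasing delay before retrying. The exit-code check and the delay values below are illustrative assumptions, not required values:

    process {
        maxRetries = 4
        errorStrategy = {
            if (task.exitStatus == 2147483647) { // resource shortage (see above)
                sleep(Math.pow(2, task.attempt - 1) * 60000 as long) // 1, 2, 4, 8 minute backoff
                return 'retry'
            }
            return 'finish'
        }
    }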

  5. When publishing your Nextflow pipeline, make sure you have defined an explicit container such as 'public.ecr.aws/lts/ubuntu:22.04' rather than relying on the default container 'ubuntu:latest'.
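
    For example, a specific image can be pinned globally in 'nextflow.config' (shown here with the image mentioned above; any explicitly versioned image works):

    process {
        container = 'public.ecr.aws/lts/ubuntu:22.04' // Pinned image instead of 'ubuntu:latest'
    }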

  6. To limit potential costs, there is a timeout of 96 hours: if the analysis does not complete within four days, it is moved to the 'Failed' state. The clock starts as soon as the input data begins downloading, which happens during the ICA 'Requested' step of the analysis, before it goes to 'In Progress'. When tasks run in parallel, the overlapping time is counted only once. As an example, assume the initial period before execution is picked up takes 10 minutes and consists of the request, queueing, and initializing. The data download then takes 20 minutes. Next, a task runs on a single node for 25 minutes, followed by 10 minutes of queue time. Finally, three tasks execute simultaneously, taking 25, 28, and 30 minutes, respectively. Upon completion, uploading the outputs takes one minute. The overall analysis time is then 20 + 25 + 10 + 30 (the longest of the three parallel tasks) + 1 = 86 minutes:

| Analysis task      | 96 hour limit    | Status in ICA             |
| ------------------ | ---------------- | ------------------------- |
| request            | 1m (not counted) | status requested          |
| queued             | 7m (not counted) | status queued             |
| initializing       | 2m (not counted) | status initializing       |
| input download     | 20m              | status preparing inputs   |
| single task        | 25m              | status in progress        |
| queue              | 10m              | status in progress        |
| parallel tasks     | 30m              | status in progress        |
| generating outputs | 1m               | status generating outputs |
| completed          | -                | status succeeded          |

If no resources are available or your project priority is low, the time before the download commences can be substantially longer.
