CWL: Scatter-gather Method


In bioinformatics and computational biology, the vast and growing amount of data necessitates methods and tools that can process and analyze data in parallel. This demand gave birth to the scatter-gather approach, an essential pattern for building pipelines that offers efficient data handling and parallel processing. In this tutorial, we will demonstrate how to create a CWL pipeline using the scatter-gather approach. For this purpose, we will use two widely known tools: fastp and multiqc. Given the functionalities of both fastp and multiqc, their combination in a scatter-gather pipeline is particularly useful: individual datasets can be scattered across resources for parallel preprocessing with fastp, and the outputs of these parallel tasks can then be gathered and fed into multiqc to generate a consolidated quality report. This workflow not only accelerates the preprocessing of large datasets but also offers an aggregated perspective on data quality, ensuring that subsequent analyses are built upon a robust foundation.

Creating the tools

First, we create the two tools: fastp and multiqc. For this, we need the corresponding Docker images and CWL tool definitions. Please refer to the corresponding part of our help sites to learn more about how to import a tool into ICA. In a nutshell, once the CWL tool definition is pasted into the editor, the other tabs for editing the tool are populated. To complete the tool, select the corresponding Docker image and provide a tool version (which can be any string).

For this demo, we will use the publicly available Docker images quay.io/biocontainers/fastp:0.20.0--hdbcaa40_0 for fastp and docker.io/ewels/multiqc:v1.15 for multiqc. The procedure for importing publicly available Docker images into ICA is described in a separate tutorial.

Furthermore, we will use the following CWL tool definitions:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
requirements:
- class: InlineJavascriptRequirement
label: fastp
doc: Modified from https://github.com/nigyta/bact_genome/blob/master/cwl/tool/fastp/fastp.cwl
inputs:
  fastq1:
    type: File
    inputBinding:
      prefix: -i
  fastq2:
    type:
    - File
    - 'null'
    inputBinding:
      prefix: -I
  threads:
    type:
    - int
    - 'null'
    default: 1
    inputBinding:
      prefix: --thread
  qualified_phred_quality:
    type:
    - int
    - 'null'
    default: 20
    inputBinding:
      prefix: --qualified_quality_phred
  unqualified_phred_quality:
    type:
    - int
    - 'null'
    default: 20
    inputBinding:
      prefix: --unqualified_percent_limit
  min_length_required:
    type:
    - int
    - 'null'
    default: 50
    inputBinding:
      prefix: --length_required
  force_polyg_tail_trimming:
    type:
    - boolean
    - 'null'
    inputBinding:
      prefix: --trim_poly_g
  disable_trim_poly_g:
    type:
    - boolean
    - 'null'
    default: true
    inputBinding:
      prefix: --disable_trim_poly_g
  base_correction:
    type:
    - boolean
    - 'null'
    default: true
    inputBinding:
      prefix: --correction
outputs:
  out_fastq1:
    type: File
    outputBinding:
      glob:
      - $(inputs.fastq1.nameroot).fastp.fastq
  out_fastq2:
    type:
    - File
    - 'null'
    outputBinding:
      glob:
      - $(inputs.fastq2.nameroot).fastp.fastq
  html_report:
    type: File
    outputBinding:
      glob:
      - fastp.html
  json_report:
    type: File
    outputBinding:
      glob:
      - fastp.json
arguments:
- prefix: -o
  valueFrom: $(inputs.fastq1.nameroot).fastp.fastq
- |
  ${
    if (inputs.fastq2){
      return '-O';
    } else {
      return '';
    }
  }
- |
  ${
    if (inputs.fastq2){
      return inputs.fastq2.nameroot + ".fastp.fastq";
    } else {
      return '';
    }
  }
baseCommand:
- fastp
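
To sanity-check the fastp tool outside of ICA, you can run it with cwltool and a small job file such as the sketch below. The file names, the fastp.cwl file name, and the local availability of cwltool and fastp (or a Docker requirement you attach yourself) are assumptions for illustration; note how the output names are derived from $(inputs.fastq1.nameroot).

# fastp_job.yml -- hypothetical paired-end inputs for a local test:
#   cwltool fastp.cwl fastp_job.yml
fastq1:
  class: File
  path: sampleA_R1_001.fastq.gz
fastq2:
  class: File
  path: sampleA_R2_001.fastq.gz
threads: 4
# With these inputs the tool builds a command line roughly equivalent to:
#   fastp -i sampleA_R1_001.fastq.gz -I sampleA_R2_001.fastq.gz --thread 4 ... \
#     -o sampleA_R1_001.fastq.fastp.fastq -O sampleA_R2_001.fastq.fastp.fastq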

and the following definition for multiqc:

#!/usr/bin/env cwl-runner

cwlVersion: cwl:v1.0
class: CommandLineTool
label: MultiQC
doc: MultiQC is a tool to create a single report with interactive plots for multiple
  bioinformatics analyses across many samples.
inputs:
  files:
    type:
    - type: array
      items: File
    - 'null'
    doc: Files containing the result of quality analysis.
    inputBinding:
      position: 2
  directories:
    type:
    - type: array
      items: Directory
    - 'null'
    doc: Directories containing the result of quality analysis.
    inputBinding:
      position: 3
  report_name:
    type: string
    doc: Name of output report, without path but with full file name (e.g. report.html).
    default: multiqc_report.html
    inputBinding:
      position: 1
      prefix: -n
outputs:
  report:
    type: File
    outputBinding:
      glob:
      - '*.html'
baseCommand:
- multiqc

Pipeline

Once the tools are created, we will create the pipeline itself using these two tools at Projects > your_project > Flow > Pipelines > CWL > Graphical:

  • On the Definition tab, go to the tool repository and drag and drop the two tools you just created onto the pipeline editor.

  • Connect the JSON output of fastp to the multiqc input by hovering over the middle of the round, blue connector of the output until the icon changes to a hand, and then dragging the connection to the first input of multiqc. You can use the magnification symbols to make it easier to connect these tools.

  • Above the diagram, drag and drop two input FASTQ files and an output HTML file onto the pipeline editor and connect the blue markers to match the diagram below.

Relevant aspects of the pipeline:

  • Both inputs are multivalue (as can be seen in the screenshot).

  • Ensure that the fastp step has scattering configured: it scatters on both inputs using the scatter method 'dotproduct'. This means that as many instances of this step will be executed as there are pairs of FASTQ files. To indicate that this step is executed multiple times, the icons of both inputs have double borders. A sketch of the equivalent CWL workflow is shown below.
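
For reference, the wiring created in the graphical editor corresponds roughly to the following CWL workflow. This is only a sketch for readers who prefer to see the scatter and gather in code; the tool file names fastp.cwl and multiqc.cwl are assumptions, and in ICA this definition is produced for you by the graphical editor.

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
requirements:
- class: ScatterFeatureRequirement
inputs:
  read1:
    type: File[]   # Read1 FASTQ files; must be matched with read2
  read2:
    type: File[]   # Read2 FASTQ files
outputs:
  report:
    type: File
    outputSource: multiqc/report
steps:
  fastp:
    run: fastp.cwl
    # scatter pairwise over both inputs: one fastp task per (read1, read2) pair
    scatter:
    - fastq1
    - fastq2
    scatterMethod: dotproduct
    in:
      fastq1: read1
      fastq2: read2
    out:
    - json_report
  multiqc:
    run: multiqc.cwl
    in:
      # gather: the scattered fastp step yields an array of JSON reports
      files: fastp/json_report
    out:
    - report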

Important remark

Both input arrays (Read1 and Read2) must be matched. Automatic sorting of the input arrays is not yet supported, so you have to take care of matching them yourself. There are two ways to achieve this (besides specifying the order manually in the GUI):

  • invoke the pipeline from the CLI and use Bash functionality to sort the arrays

  • add a tool to the pipeline which takes the array of all FASTQ files, splits it into R1 and R2 arrays based on the file name suffixes, and sorts them

We will describe the second option in more detail. The tool is based on the public Python Docker image docker.io/python:3.10 and has the following definition. The Python script spread_script.py is provided to the tool via a Dirent (an entry in the InitialWorkDirRequirement listing).

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
  - entry: "import argparse\nimport os\nimport json\n\n# Create argument parser\n\
      parser = argparse.ArgumentParser()\nparser.add_argument(\"-i\", \"--inputFiles\"\
      , type=str, required=True, help=\"Input files\")\n\n# Parse the arguments\n\
      args = parser.parse_args()\n\n# Split the inputFiles string into a list of file\
      \ paths\ninput_files = args.inputFiles.split(',')\n\n# Sort the input files\
      \ by the base filename\ninput_files = sorted(input_files, key=lambda x: os.path.basename(x))\n\
      \n\n# Separate the files into left and right arrays, preserving the order\n\
      left_files = [file for file in input_files if '_R1_' in os.path.basename(file)]\n\
      right_files = [file for file in input_files if '_R2_' in os.path.basename(file)]\n\
      \n# Print the left files for debugging\nprint(\"Left files:\", left_files)\n\
      \n# Print the left files for debugging\nprint(\"Right files:\", right_files)\n\
      \n# Ensure left and right files are matched\nassert len(left_files) == len(right_files),\
      \ \"Mismatch in number of left and right files\"\n\n    \n# Write the left files\
      \ to a JSON file\nwith open('left_files.json', 'w') as outfile:\n    left_files_objects\
      \ = [{\"class\": \"File\", \"path\": file} for file in left_files]\n    json.dump(left_files_objects,\
      \ outfile)\n\n# Write the right files to a JSON file\nwith open('right_files.json',\
      \ 'w') as outfile:\n    right_files_objects = [{\"class\": \"File\", \"path\"\
      : file} for file in right_files]\n    json.dump(right_files_objects, outfile)\n\
      \n"
    entryname: spread_script.py
    writable: false
label: spread_items
inputs:
  inputFiles:
    type:
      type: array
      items: File
    inputBinding:
      separate: false
      prefix: -i
      itemSeparator: ','
outputs:
  leftFiles:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - left_files.json
      loadContents: true
      outputEval: $(JSON.parse(self[0].contents))
  rightFiles:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - right_files.json
      loadContents: true
      outputEval: $(JSON.parse(self[0].contents))
baseCommand:
- python3
- spread_script.py
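
To see what this tool produces, you can again test it locally with cwltool (a sketch; spread_items.cwl and the file names below are assumptions). Given a mixed list of FASTQ files, leftFiles will contain the _R1_ files and rightFiles the _R2_ files, both sorted by base name, ready to be connected to the fastq1 and fastq2 scatter inputs of the fastp step.

# spread_items_job.yml -- hypothetical input file names for a local test:
#   cwltool spread_items.cwl spread_items_job.yml
inputFiles:
- {class: File, path: sampleB_R2_001.fastq.gz}
- {class: File, path: sampleA_R1_001.fastq.gz}
- {class: File, path: sampleA_R2_001.fastq.gz}
- {class: File, path: sampleB_R1_001.fastq.gz}
# Expected result:
#   leftFiles  = [sampleA_R1_001.fastq.gz, sampleB_R1_001.fastq.gz]
#   rightFiles = [sampleA_R2_001.fastq.gz, sampleB_R2_001.fastq.gz]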

Now this tool can be added to the pipeline before the fastp step.
