Data Integrity

You can verify the integrity of the data with the MD5 (Message Digest Algorithm 5) checksum. It is a widely used cryptographic hash function that generates a fixed-size, 128-bit hash value from any input data. This hash value is unique to the content of the data, meaning even a slight change in the data will result in a significantly different MD5 checksum.

For files smaller than 16 MB, you can directly retrieve the MD5 checksum using our API endpoints. Make an API GET call to the https://ica.illumina.com/ica/rest/api/projects/{projectId}/data/{dataId} endpoint specifying the data Id you want to check and the corresponding project ID. The response you receive will be in JSON format, containing various file metadata. Within the JSON response, look for the objectETag field. This value is the MD5 checksum for the file you have queried. You can compare this checksum with the one you compute locally ot ensure the file's integrity.

For larger files, the process is different due to computation limitations. In these cases, we recommend using a dedicated pipeline on our platform to explicitly calculate the MD5 checksum. Below you can find both a main.nf file and the corresponding XML for a possible Nextflow pipeline to calculate the MD5 checksum for FASTQ files.

nextflow.enable.dsl = 2


process md5sum {
    
    container "public.ecr.aws/lts/ubuntu:22.04"
    pod annotation: 'scheduler.illumina.com/presetSize', value: 'standard-small'
    
    input:
        file txt

    output:
        stdout emit: result
        path '*', emit: output

    publishDir "out", mode: 'symlink'

    script:
        txt_file_name = txt.getName()
        id = txt_file_name.takeWhile { it != '.'}

        """
        set -ex
        echo "File: $txt_file_name"
        echo "Sample: $id"
        md5sum ${txt} > ${id}_md5.txt
        """
    }

workflow {
    txt_ch = Channel.fromPath(params.in)
    txt_ch.view()
    md5sum(txt_ch).result.view()
}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pd:pipeline xmlns:pd="xsd://www.illumina.com/ica/cp/pipelinedefinition">
    <pd:dataInputs>
        <pd:dataInput code="in" format="FASTQ" type="FILE" required="true" multiValue="true">
            <pd:label>Input</pd:label>
            <pd:description>FASTQ files input</pd:description>
        </pd:dataInput>
    </pd:dataInputs>
    <pd:steps/>
</pd:pipeline>

Last updated