Data Integrity

You can verify the integrity of your data by comparing hashes. The hash is usually (with some exceptions) an MD5 (Message Digest Algorithm 5) checksum. MD5 is a widely used hash function that generates a fixed-size, 128-bit hash value from any input data. The hash value acts as a fingerprint of the content: even a slight change in the data results in a significantly different MD5 checksum. AWS S3 calculates this checksum when data is uploaded and stores it in the ETag (entity tag).
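To illustrate the fingerprint property described above, here is a minimal Python sketch using only the standard `hashlib` module. The two byte strings are made-up example inputs, not data from the platform.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Two hypothetical inputs that differ in a single character.
original = md5_hex(b"ACGTACGTACGT\n")
modified = md5_hex(b"ACGTACGTACGA\n")

# 128 bits = 16 bytes = 32 hexadecimal characters, whatever the input size.
print("digest length in bits:", len(original) * 4)
print("original:", original)
print("modified:", modified)
print("digests differ:", original != modified)
```

The digest length stays the same for any input, while the two digests share no obvious relationship even though the inputs differ by one character.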

For files smaller than 16 MB, you can retrieve the MD5 checksum directly through our API. Make a GET call to the https://ica.illumina.com/ica/rest/api/projects/{projectId}/data/{dataId} endpoint, specifying the ID of the data you want to check and the corresponding project ID. The response is in JSON format and contains various file metadata. In the JSON response, look for the objectETag field. This value is the MD5 checksum of the file you queried. Compare it with a checksum you compute locally to verify file integrity.
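The check described above could be scripted roughly as follows. This is a sketch, not an official client: the `X-API-Key` header, the `Accept` media type, and all function names are assumptions made for illustration, so adjust them to the authentication method you actually use. Because the exact position of objectETag inside the JSON is not specified here, the sketch searches the whole response for that key.

```python
import hashlib
import json
import urllib.request

BASE_URL = "https://ica.illumina.com/ica/rest/api"


def local_md5(path, chunk_size=1024 * 1024):
    """Stream a local file through MD5 so it never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_field(node, name):
    """Depth-first search for the first non-null value stored under `name`."""
    if isinstance(node, dict):
        if node.get(name) is not None:
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def fetch_etag(project_id, data_id, api_key):
    """GET the data record and return its objectETag (None if absent)."""
    url = f"{BASE_URL}/projects/{project_id}/data/{data_id}"
    headers = {
        "X-API-Key": api_key,  # assumption: API-key authentication
        "Accept": "application/vnd.illumina.v3+json",  # assumption: media type
    }
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return find_field(json.load(response), "objectETag")


def etag_matches(etag, md5_hex):
    """Compare an ETag with a local MD5, ignoring quotes and letter case."""
    return etag.strip('"').lower() == md5_hex.lower()


# Example use (placeholders, not real IDs):
#   etag = fetch_etag("<projectId>", "<dataId>", "<api key>")
#   print(etag_matches(etag, local_md5("/path/to/local/copy")))
```

Reading the file in chunks keeps memory use constant, and stripping the surrounding quotes guards against ETag values that are returned in quoted form.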

This ETag does not change and can be used as a file integrity check even when the file is archived, unarchived, or copied to another location. Changes to the metadata have no impact on the ETag.

For larger files, the process is different due to computation limitations. In these cases, we recommend using a dedicated pipeline on our platform to calculate the MD5 checksum explicitly. Below are a main.nf file and the corresponding XML for a possible Nextflow pipeline that calculates the MD5 checksum of FASTQ files.

nextflow.enable.dsl = 2


// Compute the MD5 checksum of each input file and write it to <sample>_md5.txt.
process md5sum {

    container "public.ecr.aws/lts/ubuntu:22.04"
    pod annotation: 'scheduler.illumina.com/presetSize', value: 'standard-small'
    // Directives are declared before the input and output blocks.
    publishDir "out", mode: 'symlink'

    input:
        path txt

    output:
        stdout emit: result
        path '*', emit: output

    script:
        // Sample ID: the file name up to the first '.'
        def txt_file_name = txt.getName()
        def id = txt_file_name.takeWhile { it != '.' }

        """
        set -ex
        echo "File: $txt_file_name"
        echo "Sample: $id"
        md5sum ${txt} > ${id}_md5.txt
        """
}

workflow {
    txt_ch = Channel.fromPath(params.in)
    txt_ch.view()
    md5sum(txt_ch).result.view()
}

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pd:pipeline xmlns:pd="xsd://www.illumina.com/ica/cp/pipelinedefinition">
    <pd:dataInputs>
        <pd:dataInput code="in" format="FASTQ" type="FILE" required="true" multiValue="true">
            <pd:label>Input</pd:label>
            <pd:description>FASTQ files input</pd:description>
        </pd:dataInput>
    </pd:dataInputs>
    <pd:steps/>
</pd:pipeline>
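Once the pipeline has produced its `<sample>_md5.txt` report, the checksum it contains can be compared with a locally computed one. The following Python sketch assumes the report holds a single line in the standard `md5sum` output format (checksum, separator, file name); the function names and the file names in the usage comment are hypothetical.

```python
import hashlib


def parse_md5sum_line(line):
    """Split one line of md5sum output into (checksum, file name)."""
    checksum, _, rest = line.rstrip("\n").partition(" ")
    # After the first space comes a mode marker: ' ' (text) or '*' (binary).
    return checksum.lower(), rest[1:]


def local_md5(path, chunk_size=1024 * 1024):
    """Stream a local file through MD5 so it never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(report_path, local_file_path):
    """Return True if the checksum in the report matches the local file."""
    with open(report_path) as report:
        expected, _name = parse_md5sum_line(report.readline())
    return expected == local_md5(local_file_path)


# Example use (hypothetical file names):
#   print(verify("Sample1_md5.txt", "Sample1.fastq.gz"))
```

On a system with GNU coreutils, `md5sum -c` on the report performs a similar check, provided the file is available under the name recorded in the report.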


