# Data Integrity

You can verify the integrity of your data by comparing hashes. The hash is usually ([with some exceptions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#checking-object-integrity-md5)) an MD5 (Message Digest Algorithm 5) checksum. MD5 is a common cryptographic hash function that generates a fixed-size, 128-bit hash value from any input data. This hash value acts as a fingerprint of the content: even a slight change in the data results in a significantly different MD5 checksum. AWS S3 calculates this checksum when data is uploaded and stores it in the ETag (entity tag).
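To compute a checksum locally for comparison, a minimal Python sketch using only the standard library could look like the following. The helper names `md5_hex` and `md5_of_file` are illustrative and not part of any ICA tooling.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as 32 hex characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so it does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    original = md5_hex(b"ACGT\n")
    altered = md5_hex(b"ACGA\n")   # one character changed
    print(len(original) * 4)       # 128 (each hex character encodes 4 bits)
    print(original != altered)     # True
```

On Linux, running `md5sum <file>` should produce the same hexadecimal digest as `md5_of_file`.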

For files smaller than 16 MB, you can retrieve the MD5 checksum directly using our [API](https://ica.illumina.com/ica/api/swagger/index.html) endpoints. Make a GET call to the `https://ica.illumina.com/ica/rest/api/projects/{projectId}/data/{dataId}` endpoint, specifying the ID of the data you want to check and the corresponding project ID. The response is in JSON format and contains various file metadata. Within the JSON response, look for the `objectETag` field. This value is the MD5 checksum of the file you queried. Compare it with the checksum you compute locally to verify file integrity.
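The lookup described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference client: the `X-API-Key` authentication header and the `Accept` media type are assumptions that should be checked against the Swagger page, and the helper names are hypothetical. Because the exact position of `objectETag` in the response is not specified here, the sketch searches the whole JSON document for that key.

```python
import json
import urllib.request

# Base URL taken from the endpoint given in the text.
BASE_URL = "https://ica.illumina.com/ica/rest/api"


def fetch_data_record(project_id: str, data_id: str, api_key: str) -> dict:
    """GET the metadata record for one data item and return the parsed JSON."""
    url = f"{BASE_URL}/projects/{project_id}/data/{data_id}"
    request = urllib.request.Request(
        url,
        headers={
            "X-API-Key": api_key,                          # assumption: API key auth
            "Accept": "application/vnd.illumina.v3+json",  # assumption: media type
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_field(node, name):
    """Depth-first search for the first non-null value stored under key `name`."""
    if isinstance(node, dict):
        if node.get(name) is not None:
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def etag_matches(etag: str, local_md5: str) -> bool:
    """Compare case-insensitively, ignoring quotes that may wrap the ETag."""
    return etag.strip().strip('"').lower() == local_md5.strip().lower()


# Usage (requires network access and valid credentials):
#   record = fetch_data_record("<projectId>", "<dataId>", "<apiKey>")
#   etag = find_field(record, "objectETag")
#   print(etag_matches(etag, "<locally computed MD5>"))
```

Stripping surrounding quotes is a defensive choice: S3 returns ETag header values wrapped in double quotes, and removing them is harmless if the API already returns the bare value.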

This ETag does not change and can be used as a file integrity check even when the file is archived, unarchived, or copied to another location. Changes to the metadata have no impact on the ETag.

For larger files, the process is different due to computation limitations. In these cases, we recommend using a dedicated pipeline on our platform to explicitly calculate the MD5 checksum. Below you can find both a `main.nf` file and the corresponding XML for a possible Nextflow pipeline that calculates the MD5 checksum for FASTQ files.

```groovy
nextflow.enable.dsl = 2

// Compute the MD5 checksum of each input file and write it to <sample>_md5.txt.
process md5sum {

    container "public.ecr.aws/lts/ubuntu:22.04"
    pod annotation: 'scheduler.illumina.com/presetSize', value: 'standard-small'
    // Directives are declared before the input and output blocks.
    publishDir "out", mode: 'symlink'

    input:
        path txt

    output:
        stdout emit: result
        path '*', emit: output

    script:
        // Sample ID: the file name up to the first '.'
        def txt_file_name = txt.getName()
        def id = txt_file_name.takeWhile { it != '.' }

        """
        set -ex
        echo "File: $txt_file_name"
        echo "Sample: $id"
        md5sum ${txt} > ${id}_md5.txt
        """
}

workflow {
    // params.in corresponds to the "in" data input declared in the XML.
    txt_ch = Channel.fromPath(params.in)
    txt_ch.view()
    md5sum(txt_ch).result.view()
}
```

{% code overflow="wrap" fullWidth="false" %}

```
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pd:pipeline xmlns:pd="xsd://www.illumina.com/ica/cp/pipelinedefinition">
    <pd:dataInputs>
        <pd:dataInput code="in" format="FASTQ" type="FILE" required="true" multiValue="true">
            <pd:label>Input</pd:label>
            <pd:description>FASTQ files input</pd:description>
        </pd:dataInput>
    </pd:dataInputs>
    <pd:steps/>
</pd:pipeline>
```

{% endcode %}
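The pipeline writes one `<sample>_md5.txt` file per input, where `<sample>` is the input file name up to its first `.`, and each file holds a single line of standard `md5sum` output. As a rough sketch of how that result could be checked against a locally computed checksum, the Python helpers below (hypothetical names, standard library only) derive the expected output file name and parse the line.

```python
def pipeline_output_name(input_name: str) -> str:
    """Mirror the pipeline's naming: text before the first '.' plus '_md5.txt'."""
    return input_name.split(".", 1)[0] + "_md5.txt"


def parse_md5sum_line(line: str) -> tuple:
    """Split one line of `md5sum` output into (checksum, file name).

    The checksum is followed by a space and a one-character mode marker:
    ' ' for text mode or '*' for binary mode, then the file name.
    """
    checksum, _, rest = line.rstrip("\n").partition(" ")
    return checksum.lower(), rest[1:]


def checksums_agree(md5sum_line: str, local_md5: str) -> bool:
    """True if the checksum reported by the pipeline equals the local one."""
    reported, _ = parse_md5sum_line(md5sum_line)
    return reported == local_md5.strip().lower()


if __name__ == "__main__":
    line = "5eb63bbbe01eeed093cb22bb8f5acdc3  sample1_R1.fastq.gz\n"
    print(pipeline_output_name("sample1_R1.fastq.gz"))  # sample1_R1_md5.txt
    print(parse_md5sum_line(line))
```

The checksum in the example line is only sample data (it is the MD5 of the string `hello world`), not the output of a real run.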

