Data

The Data section shows files and folders stored in the project.

File/Folder Naming

ICA supports UTF-8 characters in file and folder names for data. Users are nevertheless encouraged to follow the additional guidelines below. (For more information about file naming approaches that work across platforms, refer to the AWS S3 documentation.)

Characters generally considered "safe":

  • Alphanumeric characters

    • 0-9

    • a-z

    • A-Z

  • Special characters

    • Exclamation point !

    • Hyphen -

    • Underscore _

    • Period .

    • Asterisk *

    • Single quote '

    • Open parenthesis (

    • Closed parenthesis )

Length of file name (minus prefixes and delimiters) generally should be limited to 32 characters.
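
The guidelines above can be checked before uploading. The following is a minimal sketch, not an ICA feature: the function name check_name is hypothetical, and the check is assumed to apply to a single file or folder name without its path prefix.

```python
import string

# Characters the guidelines above list as generally "safe".
SAFE_CHARS = set(string.ascii_letters + string.digits + "!-_.*'()")
# Recommended maximum length, excluding prefixes and delimiters.
MAX_NAME_LENGTH = 32


def check_name(name):
    """Return a list of guideline warnings for one file or folder name."""
    warnings = []
    unsafe = sorted(set(name) - SAFE_CHARS)
    if unsafe:
        warnings.append(f"characters outside the safe set: {unsafe}")
    if len(name) > MAX_NAME_LENGTH:
        warnings.append(
            f"name has {len(name)} characters, recommended maximum is {MAX_NAME_LENGTH}"
        )
    return warnings


if __name__ == "__main__":
    for candidate in ["sample_01.fastq.gz", "my file.txt"]:
        print(candidate, check_name(candidate))
```

An empty list means the name follows both recommendations. Names that produce warnings are still accepted by ICA, since UTF-8 names are supported.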

Data Formats

See the list of supported Data Formats.

Data Privacy

Data privacy should be carefully considered when adding data in ICA, either through storage configurations (for example, AWS S3) or ICA data upload. Be aware that when adding data from cloud storage providers by creating a storage configuration, ICA will provide access to the data. In general, users should ensure the storage configuration source settings are correct and uploads do not include unintended data in order to avoid unintentional privacy breaches. More guidance can be found in the ICA Security and Compliance section.

Data Integrity

You can verify the integrity of the data with the MD5 (Message Digest Algorithm 5) checksum. MD5 is a widely used cryptographic hash function that generates a fixed-size, 128-bit hash value from any input data. For practical purposes, this hash value is unique to the content of the data: even a slight change in the data results in a significantly different MD5 checksum.

For files smaller than 16 MB, you can directly retrieve the MD5 checksum using our API endpoints. Make an API GET call to the https://ica.illumina.com/ica/rest/api/projects/{projectId}/data/{dataId} endpoint, specifying the data ID you want to check and the corresponding project ID. The response is in JSON format and contains various file metadata. Within the JSON response, look for the objectETag field. This value is the MD5 checksum for the file you have queried. You can compare this checksum with the one you compute locally to ensure the file's integrity.
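
The comparison described above could be sketched as follows. This is an illustration under assumptions, not an official client: the request headers are borrowed from the Python example in the Data Lifecycle Management section, the objectETag field is located by a recursive search because its exact position in the response is not stated here, and stripping surrounding double quotes from the value is a precaution rather than documented behavior. All function names are hypothetical.

```python
import hashlib
import json
import urllib.request

BASE_URL = "https://ica.illumina.com/ica/rest/api"


def local_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_field(node, name):
    """Recursively search parsed JSON for the first occurrence of a field."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def remote_etag(project_id, data_id, api_key):
    """Fetch the file metadata and return its objectETag (network call)."""
    request = urllib.request.Request(
        f"{BASE_URL}/projects/{project_id}/data/{data_id}",
        headers={"X-API-Key": api_key, "accept": "application/vnd.illumina.v3+json"},
    )
    with urllib.request.urlopen(request) as response:
        metadata = json.load(response)
    return find_field(metadata, "objectETag")


def matches(etag, md5_hex):
    """Compare case-insensitively, ignoring quotes around the ETag (assumption)."""
    return etag is not None and etag.strip('"').lower() == md5_hex.lower()
```

A typical check would be matches(remote_etag(project_id, data_id, api_key), local_md5("sample.fastq.gz")), where the IDs, the API key and the file name are placeholders.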

For larger files, the process is different due to computation limitations. In these cases, we recommend using a dedicated pipeline on our platform to explicitly calculate the MD5 checksum. Below you can find both a main.nf file and the corresponding XML for a possible Nextflow pipeline to calculate the MD5 checksum for FASTQ files.

nextflow.enable.dsl = 2


process md5sum {
    
    container "public.ecr.aws/lts/ubuntu:22.04"
    pod annotation: 'scheduler.illumina.com/presetSize', value: 'standard-small'
    
    input:
        path txt

    output:
        stdout emit: result
        path '*', emit: output

    publishDir "out", mode: 'symlink'

    script:
        txt_file_name = txt.getName()
        id = txt_file_name.takeWhile { it != '.'}

        """
        set -ex
        echo "File: $txt_file_name"
        echo "Sample: $id"
        md5sum ${txt} > ${id}_md5.txt
        """
    }

workflow {
    txt_ch = Channel.fromPath(params.in)
    txt_ch.view()
    md5sum(txt_ch).result.view()
}

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pd:pipeline xmlns:pd="xsd://www.illumina.com/ica/cp/pipelinedefinition">
    <pd:dataInputs>
        <pd:dataInput code="in" format="FASTQ" type="FILE" required="true" multiValue="true">
            <pd:label>Input</pd:label>
            <pd:description>FASTQ files input</pd:description>
        </pd:dataInput>
    </pd:dataInputs>
    <pd:steps/>
</pd:pipeline>

View Data

On the Projects > your_project > Data page, you can view file information or preview files.

To view file details:

  1. Select a file by clicking on the filename. This shows the file details.

To view file contents:

  1. Select a file by clicking on the filename. This shows the file details.

  2. Select View. This will preview the file.

You can also preview the file content by selecting the checkbox at the beginning of the line and then selecting View from the top menu.

Hyperlinking to Data

To hyperlink directly to data, use the following syntax:

https://<ServerURL>/ica/link/project/<ProjectID>/data/<FolderID> and https://<ServerURL>/ica/link/project/<ProjectID>/analysis/<AnalysisID>.

Variable | Location
ServerURL | See the browser address bar
ProjectID | At YourProject > Details > URN > urn:ilmn:ica:project:ProjectID#MyProject
FolderID | At YourProject > Data > folder > folder details > ID
AnalysisID | At YourProject > Flow > Analyses > YourAnalysis > ID

Normal permission checks still apply with these links. If you try to follow a link to data to which you do not have access, you will be returned to the main project screen or login screen, depending on your permissions.
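
As an illustration of the link syntax above, the links can be assembled from their parts. This is a minimal sketch with hypothetical function names. The URN parsing assumes the project URN has exactly the form urn:ilmn:ica:project:ProjectID#MyProject, and the IDs in the usage example are placeholders.

```python
URN_PREFIX = "urn:ilmn:ica:project:"


def project_id_from_urn(urn):
    """Extract ProjectID from a URN of the form urn:ilmn:ica:project:ProjectID#MyProject."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a project URN: {urn}")
    return urn[len(URN_PREFIX):].split("#", 1)[0]


def data_link(server_url, project_id, folder_id):
    """Build a direct link to a folder in a project."""
    return f"https://{server_url}/ica/link/project/{project_id}/data/{folder_id}"


def analysis_link(server_url, project_id, analysis_id):
    """Build a direct link to an analysis in a project."""
    return f"https://{server_url}/ica/link/project/{project_id}/analysis/{analysis_id}"


if __name__ == "__main__":
    project = project_id_from_urn("urn:ilmn:ica:project:1234#MyProject")
    print(data_link("ica.illumina.com", project, "folder-id"))
    print(analysis_link("ica.illumina.com", project, "analysis-id"))
```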

Upload Data

Uploading data to the platform makes it available for consumption by analysis workflows and tools. There are multiple methods to upload data.

Upload Data via UI

Uploads via the UI are limited to 5 TiB.

Use the following instructions to upload data manually via the drag-and-drop interface in the platform UI.

  1. Go to Projects > your_project > Data.

  2. To add data, use one of the following methods. Make sure the Illumina Connected Analytics tab is open in the browser while data uploads.

    • Drag a file from your system into the Choose a file or drag it here box.

    • Select the Choose a file or drag it here box, and then choose a file. Select Open to upload the file.

Your file or files are added to the Data page when upload completes.
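
Before starting a UI upload, you may want to check local files against the 5 TiB limit mentioned above. The sketch below is only an illustration: it assumes the limit applies to each individual file, which is not stated explicitly here, and the function names are hypothetical.

```python
import os

# 5 TiB expressed in bytes (1 TiB = 1024**4 bytes).
UI_UPLOAD_LIMIT_BYTES = 5 * 1024**4


def exceeds_ui_limit(size_bytes):
    """Return True if a size in bytes is above the UI upload limit."""
    return size_bytes > UI_UPLOAD_LIMIT_BYTES


def oversized_files(paths):
    """Return the local paths whose file size is above the UI upload limit."""
    return [path for path in paths if exceeds_ui_limit(os.path.getsize(path))]
```

Files reported by oversized_files would need another transfer method, such as the CLI.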

Upload Data via CLI

For instructions on uploading/downloading data via CLI, see CLI Data Transfer.

Copy Data

You can copy data from the same project to a different folder or from another project to which you have access.

  1. Go to the destination project for your data copy and proceed to Projects > your_project > Data > Manage > Copy Data From.

  2. Optionally, use the filters (Type, Name, Status, Format or additional filters) to narrow down the data, or search with the search box.

  3. Select the data (individual files or folders with data) you want to copy.

  4. Select any metadata which you want to keep with the copied data (user tags, technical system tags or instrument information).

  5. Select which action to take if the data already exists (overwrite existing data, don't copy or keep both the original and the new copy by appending a number to the copied data).

  6. Select Copy Data to copy the data to your project. You can see the progress in Projects > your_project > Activity > Batch Jobs.

The batch job can have one of the following statuses:

  • INITIALIZED

  • WAITING_FOR_RESOURCES

  • RUNNING

  • STOPPED - The batch job was stopped manually.

  • SUCCEEDED - All files and folders are copied.

  • PARTIALLY_SUCCEEDED - Some files and folders could be copied, but not all. Partially succeeded will typically occur when files were being modified or unavailable while the copy process was running.

  • FAILED - None of the files and folders could be copied.
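
If you monitor a batch job from a script, the statuses above can be split into in-progress and finished groups. The grouping below is an assumption based on the descriptions in the list, and get_status is a hypothetical callable that you would supply yourself (for example, a function that queries the batch job and returns its status string).

```python
import time

# Assumed grouping of the statuses listed above.
IN_PROGRESS = {"INITIALIZED", "WAITING_FOR_RESOURCES", "RUNNING"}
FINISHED = {"STOPPED", "SUCCEEDED", "PARTIALLY_SUCCEEDED", "FAILED"}


def wait_for_batch_job(get_status, poll_seconds=30, max_polls=120):
    """Call get_status() repeatedly until it returns a finished status."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINISHED:
            return status
        if status not in IN_PROGRESS:
            raise ValueError(f"unexpected batch job status: {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("batch job did not finish within the polling window")
```

A result of PARTIALLY_SUCCEEDED deserves a follow-up check, because some files or folders were not copied.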

Copy type | Result
Replace | Overwrites the existing data. When you copy a folder into an existing folder that already contains files, existing files are replaced when a file with the same name is copied, and new files are added. The remaining files in the target folder remain unchanged.
Don't copy | The original files are kept. If you selected a folder, files that do not yet exist in the destination folder are added to it. Files that already exist at the destination are not copied over and the originals are kept.
Keep both | Files have a number appended to them if they already exist. If you copy folders, the folders are merged, with new files added to the destination folder and original files kept. New files with the same name get copied over into the folder with a number appended.

There is a difference in copy type behavior between copying files and folders. The behavior is designed for files and it is best practice to not copy folders if there already is a folder with the same name in the destination location.
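
The three copy types can be illustrated with a small simulation on file names. This is a sketch of the behavior described above, not ICA code: the policy labels and function names are hypothetical, and the exact name ICA gives to a numbered copy is not specified here, so the _1 suffix is only a placeholder scheme.

```python
POLICIES = {"replace", "dont_copy", "keep_both"}


def numbered_name(name, existing):
    """Return a free name with a number appended (placeholder naming scheme)."""
    stem, dot, extension = name.partition(".")
    counter = 1
    while f"{stem}_{counter}{dot}{extension}" in existing:
        counter += 1
    return f"{stem}_{counter}{dot}{extension}"


def copy_files(destination, incoming, policy):
    """Return the destination contents after copying `incoming` into it.

    Both arguments map a file name to its content.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    result = dict(destination)
    for name, content in incoming.items():
        if name not in result:
            result[name] = content  # new files are added under every policy
        elif policy == "replace":
            result[name] = content  # the existing file is overwritten
        elif policy == "keep_both":
            result[numbered_name(name, result)] = content  # both are kept
        # "dont_copy": the original file is kept and nothing is copied
    return result


if __name__ == "__main__":
    destination = {"a.txt": "old", "b.txt": "untouched"}
    incoming = {"a.txt": "new", "c.txt": "added"}
    for policy in sorted(POLICIES):
        print(policy, copy_files(destination, incoming, policy))
```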

Notes on copying data:

  • Copying data comes with an additional storage cost as it will create a copy of the data.

  • You can copy over the same data multiple times.

  • Copying data from your own S3 storage requires additional configuration. See Connect AWS S3 Bucket and SSE-KMS Encryption.

  • On the command-line interface, the command to copy data is icav2 projectdata copy.

  • You cannot copy data into a linked folder.

Download Data

Some small files can be downloaded directly from within the UI. Files such as .txt and .csv files can be viewed by clicking on the filename in the project Data section. On the View tab, the file can be viewed directly (larger files may take some time to load), and the Download button lets you download the file directly from the UI.

Schedule for Download

You can trigger an asynchronous download via service connector using the Schedule for Download button with one or more files selected.

  1. Select a file or files to download.

  2. Select Schedule for Download.

  3. Select a connector, and then select Schedule for Download.

You can view the progress of the download or stop the download on the Activity page for the project.

Export Project Data Information

The data records contained in a project can be exported in CSV, JSON, and Excel format.

  1. Select one or more files to export.

  2. Select Export.

  3. Select the following export options:

    • To export only the selected files, select Selected rows as the Rows to export option. To export all files on the page, select Current page.

    • To export only the columns currently shown, select Visible columns as the Columns to export option.

  4. Select the export format.

Data Lifecycle Management

Uploaded files are automatically assigned to the standard storage tier. You can use files in the standard tier in your analysis.

To manually archive or delete files, do as follows:

  1. Select the checkbox next to the file or files to delete or archive.

  2. Select Manage, and then select one of the following options:

    • Archive — Move the file or files to long-term storage.

    • Unarchive — Return the file or files from long-term storage. Unarchiving can take up to 48 hours, regardless of file size. Unarchived files can be used in analysis.

    • Delete — Remove the file completely.

When attempting concurrent archiving or unarchiving of the same file, a message will inform you to wait for the currently running (un)archiving to finish first.

To archive or delete files programmatically, one can proceed as follows using ICA's API endpoints:

  1. GET the file's information.

  2. Modify the dates of the file to be deleted/archived.

  3. PUT the updated information back in ICA.

The Python snippet below exemplifies the approach: it sets (or updates if set already) the time to be archived for a specific file:

import requests
import json

from config import PROJECT_ID, DATA_ID, API_KEY

url_get="https://ica.illumina.com/ica/rest/api/projects/" + PROJECT_ID + "/data/" + DATA_ID

# set the API get headers
headers = {
            'X-API-Key': API_KEY,
            'accept': 'application/vnd.illumina.v3+json'
            }

# set the API put headers
headers_put = {
            'X-API-Key': API_KEY,
            'accept': 'application/vnd.illumina.v3+json',
            'Content-Type': 'application/vnd.illumina.v3+json'
            }

# Helper function to insert willBeArchivedAt after field named 'region'
def insert_after_region(details_dict, timestamp):
    new_dict = {}
    for k, v in details_dict.items():
        new_dict[k] = v
        if k == 'region':
            new_dict['willBeArchivedAt'] = timestamp
    # Set the timestamp unconditionally: this overrides any existing value
    # and also covers the case where no 'region' field is present
    new_dict['willBeArchivedAt'] = timestamp
    return new_dict

# 1. Make the GET request
response = requests.get(url_get, headers=headers)
response.raise_for_status()  # stop here if the GET request failed
response_data = response.json()

# 2. Modify the JSON data
timestamp = "2024-01-26T12:00:04Z"  # Replace with the provided timestamp
response_data['data']['details'] = insert_after_region(response_data['data']['details'], timestamp)

# 3. Make the PUT request
put_response = requests.put(url_get, data=json.dumps(response_data), headers=headers_put)
print(put_response.status_code)

To delete a file at a specific time, add or change the key 'willBeDeletedAt' in the same way using the API call. When run in a terminal, a successful run prints the status code 200. In the ICA UI, you can check the details of the file to see the updated values for ‘Time To Be Archived’ (willBeArchivedAt) or ‘Time To Be Deleted’ (willBeDeletedAt), as shown in the screenshot.
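
The same modification step can be written so that it covers both keys. The following is a minimal sketch with hypothetical function names. It assumes the response has the data > details structure used in the snippet above, and that the position of the key inside details does not matter to the API, which has not been verified here. The timestamp format follows the example value used above.

```python
import copy
from datetime import datetime, timedelta, timezone

LIFECYCLE_KEYS = {"willBeArchivedAt", "willBeDeletedAt"}


def to_timestamp(moment):
    """Format a timezone-aware datetime as UTC, for example 2024-01-26T12:00:04Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def with_lifecycle_date(payload, key, moment):
    """Return a copy of the GET response with the given lifecycle date set."""
    if key not in LIFECYCLE_KEYS:
        raise ValueError(f"unsupported key: {key}")
    updated = copy.deepcopy(payload)  # leave the original response untouched
    updated["data"]["details"][key] = to_timestamp(moment)
    return updated


if __name__ == "__main__":
    response_data = {"data": {"details": {"region": {}}}}  # placeholder response
    in_thirty_days = datetime.now(timezone.utc) + timedelta(days=30)
    print(with_lifecycle_date(response_data, "willBeDeletedAt", in_thirty_days))
```

The returned dictionary would then be sent back with the PUT request in the same way as in the snippet above.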

Secondary Data

Linking a folder links only the data within the folder at the time the link is created. Data added within the source folder after the link is created will not be automatically linked to the destination project.

You can perform analysis on data from other projects by linking data from that project.

  1. Select Projects > your_project > Data > Manage, and then select Link Data.

  2. To view data by project, select Add filter, and then select Owning Project. If you only know which project the data is linked to, you can filter on the linked project instead.

  3. Select the checkbox next to the file or files to add.

  4. Select Select Data.

Tip: if you have selected multiple owning projects, you can add the owning project column to see which project owns the data.

  1. Hover over the column names to reveal the cogwheel.

  2. Select Add new column.

  3. Choose Owning Project (or Linked Projects).

Your files are added to the Data page. To view the linked data file, select Add filter, and then select Links.

Linking Folders

To unlink the data, select the folder containing the files or the individual files themselves (limited to 100 at a time) and select Manage > Unlink Data. As when linking a folder, you can monitor the unlinking progress at Projects > your_project > Activity > Batch Jobs.
