Bundles are curated data sets which combine assets such as pipelines, tools, and Base query templates. This is where you will find packaged assets such as Illumina-provided pipelines and sample data. You can create, share and use bundles in projects of your own tenant as well as projects in other tenants.
There is a combined limit of 30,000 projects and bundles per tenant.
The following ICA assets can be included in bundles:
Data (link / unlink)
Samples (link / unlink)
Reference Data (add / delete)
Pipelines (link/unlink)
Tools and Tool images (link/unlink)
Base tables (read-only) (link/unlink)
The main Bundles screen has two tabs: My Bundles and Entitled Bundles. The My Bundles tab shows all the bundles that you are a member of. This tab is where most of your interactions with bundles occur. The Entitled Bundles tab shows the bundles that have been specially created by Illumina or other organizations and shared with you to use in your projects. See Access and Use an Entitled Bundle.
Some bundles come with additional restrictions such as disabling bench access or internet access when running pipelines to protect the data contained in them. When you link these bundles, the restrictions will be enforced on your project. Unlinking the bundle will not remove the restrictions.
You cannot link bundles which come with additional restrictions to externally managed projects.
As of ICA v.2.29, the content in bundles is linked in such a way that any updates to a bundle are automatically propagated to the projects which have that bundle linked.
If you have created bundle links in ICA versions prior to ICA v2.29 and want to switch them over to links with dynamic updates, you need to unlink and relink them.
From the main navigation page, select Projects > your_project > Project Settings > Details.
Click the Edit button at the top of the Details page.
Click the + button, under Linked bundles.
Click on the desired bundle, then click the +Link Bundles button.
Click Save.
The assets included in the bundle will now be available in the respective pages within the Project (e.g. Data and Pipelines pages). Any updates to the assets will be automatically available in the destination project.
To unlink a bundle from a project,
Select Projects > your_project > Project Settings > Details.
Click the Edit button at the top of the Details page.
Click the (-) button, next to the linked bundle you wish to remove.
Bundles and projects have to be in the same region in order to be linked. Otherwise, the error "The bundle is in a different region than the project so it's not eligible for linking" will be displayed.
You can only link bundles to a project if that project belongs to a tenant that has access to the bundle. Your access to a bundle does not carry over when you are invited to projects of other tenants.
You cannot unlink bundles which were linked by external applications.
To create a new bundle and configure its settings, do as follows.
From the main navigation, select Projects > your_project > Bundles.
Select + Create.
Enter a unique name for the bundle.
From the Region drop-down list, select where the assets for this bundle should be stored.
[Optional] Configure the following settings.
Categories—Select an existing category or enter a new one.
Status—Set the status of the bundle. When the status of a bundle changes, it cannot be reverted to a draft or released state.
Draft—The bundle can be edited.
Released—The bundle is released. Technically, you can still edit bundle information and add assets to the bundle, but should refrain from doing so.
Deprecated—The bundle is no longer intended for use. By default, deprecated bundles are hidden on the main Bundles screen (unless non-deprecated versions of the bundle exist). Select "Show deprecated bundles" to show all deprecated bundles. Bundles can not be recovered from deprecated status.
Short Description—Enter a description for the bundle.
Metadata Model—Select a metadata model to apply to the bundle.
Enter a release version for the bundle and optionally enter a description for the version.
[Optional] Links can be added with a display name (max 100 chars) and URL (max 2048 chars).
Homepage
License
Links
Publications
[Optional] Enter any information you would like to distribute with the bundle in the Documentation section.
Select Save.
There is no option to delete bundles; they must be deprecated instead.
To make changes to a bundle:
From the main navigation, select Bundles.
Select a bundle.
Select Edit.
Modify the bundle information and documentation as needed.
Select Save.
When the changes are saved, they also become available in all projects that have this bundle linked.
To add assets to a bundle:
Select a bundle.
On the left-hand side, select the type of asset under Flow (such as pipeline or tool) you want to add to the bundle.
Depending on the asset type, select add or link to bundle.
Select the assets and confirm.
Assets must meet the following requirements before they can be added to a bundle:
For Samples and Data, the project the asset belongs to must have data sharing enabled.
The region of the project containing the asset must match the region of the bundle.
You must have permission to access the project containing the asset.
Pipelines and tools need to be in released status.
Samples must be available in a Complete state.
When you link folders to a bundle, a warning is displayed indicating that, depending on the size of the folder, linking may take considerable time. The linking process will run in the background and the progress can be monitored on the Bundles > your_bundle > Activity > Batch Jobs screen. To see more details and the progress, double-click the batch job and then double-click the individual item. This will show how many individual files are already linked.
You can not add the same asset twice to a bundle. Once added, the asset will no longer appear in the selection list.
Which batch jobs are visible as activity depends on the user role.
When creating a new bundle version, you can only add assets to the bundle. You cannot remove existing assets from a bundle when creating a new version. If you need to remove assets from a bundle, it is recommended that you create a new bundle. All users who currently have access to a bundle will automatically have access to the new version as well.
From the main navigation, select Bundles.
Select a bundle.
Select + Create new Version.
Make updates as needed and update the version number.
Select Save.
When you create a new version of a bundle, it will replace the old version in your list. To see the old version, open your new bundle and look at Bundles > your_bundle > Details > Versioning. There you can open the previous version which is contained in your new version.
Assets such as data which were added in a previous version of your bundle will be marked in green, while new content will be black.
To add Terms of Use to a Bundle, do as follows:
From the main navigation, select Bundles > your_bundle > Bundle Settings > Legal.
Select + Create New Version.
Use the WYSIWYG editor to define Terms of Use for the selected bundle.
Click Save.
[Optional] Require acceptance by clicking the checkbox next to Acceptance required.
Acceptance required will prompt a user to accept the Terms of Use before being able to use a bundle or add the bundle to a project.
To edit the Terms of Use, repeat Steps 1-3 and use a unique version name. If you select acceptance required, you can choose to keep the acceptance status as is or require users to reaccept the terms of use. When reacceptance is required, users need to reaccept the terms in order to continue using this bundle in their pipelines. This is indicated when they want to enter projects which use this bundle.
If you want to collaborate with other people on creating a bundle and managing the assets in the bundle, you can add users to your bundle and set their permissions. You use this to create a bundle together, not to use the bundle in your projects.
From the main navigation, select Bundles > your_bundle > Bundle Settings > Team.
To invite a user to collaborate on the bundle, do as follows.
To add a user from your tenant, select Someone of your tenant and select a user from the drop-down list.
To add a user by their email address, select By email and enter their email address.
To add all the users of an entire workgroup, select Add workgroup and select a workgroup from the drop-down list.
Select the Bundle Role drop-down list and choose a role for the user or workgroup. This role defines the ability of the user or workgroup to view or edit bundle settings.
Viewer: view content without editing rights.
Contributor: view bundle content and link/unlink assets.
Administrator: full edit rights of content and configuration.
Repeat as needed to add more users.
Users are not officially added to the bundle until they accept the invitation.
To change the permissions role for a user, select the Bundle Role drop-down list for the user and select a new role.
To revoke bundle permissions from a user, select the trash icon for the user.
Select Save Changes.
Once you have finalized your bundle and added all assets and legal requirements, you can share your bundle with other tenants to use it in their projects.
Your bundle must be in released status to prevent it from being updated while it is shared.
Go to Bundles > your_bundle > Edit > Details > Bundle status and set it to Released.
Save the change.
Once the bundle is released, you can share it. Invitations are sent to an individual email address, however access is granted and extended to all users and all workgroups inside that tenant.
Go to Bundles > your_bundle > Bundle Settings > Share.
Click Invite and enter the email address of the person you want to share the bundle with. They will receive an email from which they can accept or reject the invitation to use the bundle. The invitation will show the bundle name, description and owner. The link in the invite can only be used once.
Do not create duplicate entries. You can only use one user/tenant combination per bundle.
You can follow up on the status of the invitation on the Bundles > your_bundle > Bundle Settings > Share page.
If they reject the bundle, the rejection date will be shown. To re-invite that person again later on, select their email address in the list and choose Remove. You can then create a new invitation. If you do not remove the old entry before sending a new invitation, they will be unable to accept and get an error message stating that the user and bundle combination must be unique. They can also not re-use an invitation once it has been accepted or declined.
If they accept the bundle, the acceptance date will be shown. They will in turn see the bundle under Bundles > Entitled bundles. To remove access, select their email address in the list and choose Remove.
Entitled bundles are bundles created by Illumina or third parties for you to use in your projects. Entitled bundles can already be part of your tenant when it is part of your subscription. You can see your entitled bundles at Bundles > Entitled Bundles.
To use your shared entitled bundle, add the bundle to your project via Project Linking. Content shared via entitled bundles is read-only, so you cannot add or modify the contents of an entitled bundle. If you lose access to an entitled bundle previously shared with you, the bundle is unlinked and you will no longer be able to access its contents.
The and documentation pages match navigation within ICA. We also offer supporting documentation for popular topics like , , and .
For more content on topics like , , , and other resources, view the section.
New users may reference the Illumina Connected Software Registration Guide for detailed guidance on setting up an account and registering a subscription.
The platform requires a provisioned tenant in the Illumina account management system with access to the Illumina Connected Analytics (ICA) application. Once a tenant has been provisioned, a tenant administrator will be assigned. The tenant administrator has permission to manage account access, including adding users, creating workgroups, and adding additional tenant administrators.
Each tenant is assigned a domain name used to log in to the platform. The domain name is used in the login URL to navigate to the appropriate login page in a web browser. The login URL is https://<domain>.login.illumina.com, where <domain> is substituted with the domain name assigned to the tenant.
New user accounts can be created for a tenant by navigating to the domain login URL and following the links on the page to set up a new account with a valid email address. Once the account has been added to the domain, the tenant administrator may assign registered users to workgroups with permission to use the ICA application. Registered users may also be made workgroup administrators by tenant administrators or existing workgroup administrators.
For more details on identity and access management, please see the Illumina Connected Software help site.
For security reasons, it is best practice to not use accounts with administrator level access to generate API keys and instead create a specific CLI user with basic permission. This will minimize the possible impact of compromised keys.
For long-lived credentials to the API, an API Key can be generated from the account console and used with the API and command-line interface. Each user is limited to 10 API Keys. API Keys are managed through the product dashboard after logging in through the domain login URL by navigating to the profile drop down and selecting "Manage API Keys".
Click the button to generate a new API Key. Provide a name for the API Key. Then choose to either include all workgroups or select the workgroups to be included. Selected workgroups will be accessible with the API Key.
Click to generate the API Key. The API Key is then presented (hidden), with a button to reveal the key for copying and a link to download it as a file to be stored securely. Once the window is closed, the key contents will no longer be accessible through the domain login page, so be sure to store the key securely if it is needed for future reference.
After generating an API key, save the key somewhere secure to be referenced when using the command-line interface or APIs.
The web application provides a visual user interface (UI) for navigating resources in the platform, managing projects, and extended features beyond the API. To access the web application, navigate to the Illumina Connected Analytics portal.
On the left, you have the navigation bar which will auto-collapse on smaller screens. When collapsed, use the ≡ symbol to expand it.
The central part of the display is the item on which you are performing your actions.
At the top right, you have icons to refresh the screen for information, status updates, and access to the online help.
The command-line interface offers a developer-oriented experience for interacting with the APIs to manage resources and launch analysis workflows. Find instructions for using the command-line interface including download links for your operating system in the CLI documentation.
The HTTP-based application programming interfaces (APIs) are listed in the API Reference section of the documentation. The reference documentation provides the ability to call APIs from the browser page and shows detailed information about the API schemas. HTTP client tooling such as Postman or cURL can be used to make direct calls to the API outside of the browser.
When accessing the API using the API Reference page or through REST client tools, the Authorization header must be provided with the value set to Bearer <token>, where <token> is replaced with a valid JSON Web Token (JWT). For generating a JWT, see JSON Web Token (JWT).
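For example, a minimal sketch of an authorized request with curl; the base URL and endpoint path shown here are assumptions to verify against the API Reference:

```bash
# Illustrative only: list projects using a JWT in the Authorization header.
# Base URL and endpoint path are assumptions; check the API Reference for exact routes.
curl -H "Authorization: Bearer $ICA_JWT" \
  "https://ica.illumina.com/ica/rest/api/projects"
```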
The object data models for resources that are created in the platform include a unique id field for identifying the resource. These fixed machine-readable IDs are used for accessing and modifying the resource through the API or CLI, even if the resource name changes.
Accessing the platform APIs requires authorizing calls using JSON Web Tokens (JWT). A JWT is a standardized trusted claim containing authentication context. This is a primary security mechanism to protect against unauthorized cross-account data access.
A JWT is generated by providing user credentials (API Key or username/password) to the token creation endpoint. Token creation can be performed using the API directly or the CLI.
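As a sketch of the API route, assuming a token-creation endpoint that accepts an API Key header (the exact path and header name are assumptions to verify in the API Reference):

```bash
# Illustrative only: exchange an API Key for a JWT.
# Endpoint path and header name are assumptions; verify them in the API Reference.
curl -X POST -H "X-API-Key: $ICA_API_KEY" \
  "https://ica.illumina.com/ica/rest/api/tokens"
```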
When looking at the main ICA navigation, you will see the following structure:
Projects are your primary work locations which contain your data and tools to execute your analyses. Projects can be considered as a binder for your work and information. You can have data contained within a project, or you can choose to make it shareable between projects.
Reference Data are reference genome sets which you use to help look for deviations and to compare your data against.
Bundles are packages of assets such as sample data, pipelines, tools and templates which you can use as a curated data set. Bundles can be provided both by Illumina and other providers, and you can even create your own bundles. You will find the Illumina-provided pipelines in bundles.
Audit/Event Logs are used for audit purposes and issue resolving.
System Settings contain general information such as the location of storage space, Docker images, and tool repositories.
Projects are the main dividers in ICA. They provide an access-controlled boundary for organizing and sharing resources created in the platform. The Projects view is used to manage projects within the current tenant.
Note that there is a combined limit of 30,000 projects and bundles per tenant.
To create a new project, click the Projects > + Create Project button.
Required fields include:
Name
1-255 characters
Must begin with a letter
Characters are limited to alphanumerics, hyphens, underscores, and spaces
Analysis Priority (Low/Medium (default)/High): Analyses are balanced per tenant, with high-priority analyses started first and the system progressing to the next lower priority once all higher-priority analyses are running. Balance your priorities so that lower-priority projects do not remain waiting for resources indefinitely.
Project Owner: The owner (and usually contact person) of the project. The project owner has the same rights as a project administrator but cannot be removed from a project without first assigning another project owner. Reassignment can be done by the current project owner, the tenant administrator, or a project administrator of the current project at Projects > your_project > Project Settings > Team > Edit.
Project Location: Select your project location. The options available are based on the entitlement(s) associated with your purchased subscription.
Storage Bundle: Auto-selected based on the chosen Project Location.
Click the Save button to finish creating the project. The project will be visible from the Projects view.
During project creation, select the I want to manage my own storage checkbox to use a Storage Configuration as the data provider for the project.
With a storage configuration set, a project will have a 2-way sync with the external cloud storage provider: any data added directly to the external storage will be sync'ed into the ICA project data, and any data added to the project will be sync'ed into the external cloud storage.
Several tools are available to assist you with keeping an overview of your projects. These filters work in both list and tile view and persist across sessions.
Searching is a case-insensitive wildcard filter: any project which contains the entered characters will be shown. Use * as a wildcard in searches. Be aware that operators without search terms are blocked and will result in the error "Unexpected error occurred when searching for projects". You can use brackets and the AND, OR, and NOT operators, provided that you do not start the search with them (Monkey AND Banana is allowed; AND Aardvark by itself is invalid syntax).
Filter by Workgroup: Projects in ICA can be accessible to different workgroups. This drop-down list allows you to filter projects for specific workgroups. To reset the filter so it displays projects from all your workgroups, use the x on the right, which appears when a workgroup is selected.
Hidden projects: You can hide projects (Projects > your_project > Details > Hide) which you no longer use. Hiding will delete data in Base and Bench and is thus irreversible.
You can still see hidden projects if you select this option and delete the data they contain at Projects > your_project > Data to save on storage costs.
If you are using your own S3 bucket, your S3 storage will be unlinked from the project, but the data will remain in your S3 storage. Your S3 storage can then be used for other projects.
Favorites: By clicking the star next to the project name in the tile view, you set a project as a favorite. You can have multiple favorites and use the Favorites checkbox to show only those favorites. This prevents having too many projects visible.
Tile view shows a grid of projects. This view is best suited if you only have a few projects or have filtered them out by creating favourites. A single click will open the project.
List view shows a list of projects. This view allows you to add additional filters on name, description, location, user role, tenant, size and analyses. A double-click is required to open the project.
Illumina software applications which do their own data management on ICA (such as BSSH) store their resources and data in a project in much the same way as manually created projects in ICA. ICA considers these to be externally-managed projects, and a number of restrictions apply to which actions are allowed on them. For example, you cannot delete or move externally-managed data. This prevents inconsistencies when these applications want to access their own project data.
When you create a folder with a name which already exists as an externally-managed folder, your project will contain that folder twice: once ICA-managed and once externally-managed, as S3 does not require unique folder names.
You can keep track of which files are externally controlled and which are ICA-managed by means of the “managed by” column, visible in the data list view of externally-managed projects at Projects > your_project > Data.
Projects are indicated as externally-managed in the projects overview screen by a project card with a light grey accent and a lock symbol followed by "managed by app".
To access the APIs using the command-line interface (CLI), an API Key may be provided as credentials when logging in. API Keys operate similar to a user name and password and should be kept secure and rotated on a regular basis (preferably yearly). When keys are compromised or no longer in use, they must be revoked. This is done through the domain login URL by navigating to the profile drop down and selecting "Manage API Keys", followed by selecting the key and using the trash icon next to it.
On the project creation screen, add information to create a project. See page for information about each field.
Refer to the documentation for details on creating a storage configuration.
Hiding projects is not possible for projects.
If you are missing projects, especially those created by other users, the workgroup filter might still be active. Clear the filter with the x to the right. You can verify the list of projects to which you have access with the icav2 projects list command.
What you can do is add and data such as to externally managed projects. Separation of data is ensured by only allowing additional files at the root level or in dedicated subfolders which you can create in your projects. Data which you have added can be moved and deleted again.
You can add to externally managed projects, provided those bundles do not come with additional restrictions for the project.
You can start workspaces in externally-managed projects. The resulting data will be stored in the externally-managed project.
Tertiary modules such as are not supported for externally-managed projects.
Externally-managed projects protect their notification subscriptions to ensure no user can delete them. It is possible to add your own subscriptions to externally-managed projects, see for more information.
For a better understanding of how all components of ICA work, try the .
The event log shows an overview of system events with options to search and filter. For every entry, it lists the following:
Event date and time
Category (error, warn or info)
Code
Description
Tenant
Up to 200,000 results will be returned. If your desired records are outside the range of the returned records, please refine the filters or use the search function at the top right.
Export is restricted to the number of entries shown per page. You can use the selector at the bottom to set this to up to 1000 entries per page.
You can use your own S3 bucket with Illumina Connected Analytics (ICA) for data storage. This section describes how to configure your AWS account to allow ICA to connect to an S3 bucket.
These instructions utilize the AWS CLI. Follow the AWS CLI documentation for instructions to download and install.
The AWS S3 bucket must exist in the same AWS region as the ICA project. Refer to the table below for a mapping of ICA project regions to AWS regions:
Australia: ap-southeast-2
Canada: ca-central-1
Germany: eu-central-1
India: ap-south-1
Indonesia: ap-southeast-3
Israel: il-central-1
Japan: ap-northeast-1
Singapore: ap-southeast-1
South Korea*: ap-northeast-2
UK: eu-west-2
United Arab Emirates: me-central-1
United States: us-east-1
(*) BSSH is not currently deployed on the South Korea instance, resulting in limited functionality in this region with regard to sequencer integration.
You can use unversioned, versioned, and suspended buckets as your own S3 storage. If you connect buckets with object versioning, the data in ICA will be automatically synced with the data in the object store. When an object is deleted without specifying a particular version, a delete marker is created in the object store to indicate that the object has been deleted. ICA will reflect the object state by deleting the record from the database. No further action on your side is needed to sync.
You can enable SSE using an Amazon S3-managed key (SSE-S3). Instructions for using KMS-managed (SSE-KMS) keys are found here.
Because of how Amazon S3 handles folders and does not send events for S3 folders, the following restrictions must be taken into account for ICA project data stored in S3.
When creating an empty folder in S3, it will not be visible in ICA.
When moving folders in S3, the original, but empty, folder will remain visible in ICA and must be manually deleted there.
When deleting a folder and its contents in S3, the empty folder will remain visible in ICA and must be manually deleted there.
Projects cannot be created with ./ as prefix since S3 does not allow uploading files with this key prefix.
When configuring a new project in ICA to use a preconfigured S3 bucket, create a folder on your S3 bucket in the AWS console. This folder will be connected to ICA as a prefix.
Failure to create a folder will result in the root folder of your S3 bucket being assigned which will block your S3 bucket from being used for other ICA projects with the error "Conflict while updating file/folder. Please try again later."
For Bring Your Own Storage buckets, all unversioned, versioned, and suspended buckets are supported. If you connect buckets with object versioning, the data in ICA will be automatically synced with the data in the object store.
For Bring Your Own Storage buckets with versioning enabled, when an object is deleted without specifying a particular version, a delete marker is created in the object store to indicate that the object has been deleted. ICA will reflect the object state by deleting the record from the database. No further action on your side is needed to sync.
ICA requires cross-origin resource sharing (CORS) permissions to write to the S3 bucket for uploads via the browser. Refer to the Configuring cross-origin resource sharing (CORS) (expand the "Using the S3 console" section) documentation for instructions on enabling CORS via the AWS Management Console. Use the following configuration during the process:
In the cross-origin resource sharing (CORS) section, enter the following content.
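The exact CORS rules to paste are supplied with this step in the product documentation; the sketch below only illustrates the general shape of an S3 CORS configuration, and the allowed origin shown is an assumption to replace with the origin(s) Illumina specifies:

```json
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "DELETE", "HEAD"],
        "AllowedOrigins": ["https://ica.illumina.com"],
        "ExposeHeaders": ["ETag"]
    }
]
```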
ICA requires specific permissions to access data in an AWS S3 bucket. These permissions are contained in an AWS IAM Policy.
Refer to the Creating policies on the JSON tab documentation for instructions on creating an AWS IAM Policy via the AWS Management Console. Use the following configuration during the process:
On Unversioned buckets, paste the JSON policy document below. Note the example below provides access to all object prefixes in the bucket.
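A representative sketch of such a policy is shown below; the action list is an assumption meant to illustrate the bucket-level and object-level statements, and the authoritative JSON is the policy document Illumina provides for this step:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IcaBucketLevelAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
        },
        {
            "Sid": "IcaObjectLevelAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_FOLDER_NAME/*"
        }
    ]
}
```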
Replace YOUR_BUCKET_NAME with the name of the S3 bucket you created for ICA. Replace YOUR_FOLDER_NAME with the name of the folder in your S3 bucket.
On Versioned OR Suspended buckets, paste the JSON policy document below. Note the example below provides access to all object prefixes in the bucket.
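Again, the authoritative policy is the document provided for this step; as a sketch, a versioned or suspended bucket would extend the policy above with version-aware actions along these lines:

```json
{
    "Sid": "IcaObjectVersionAccess",
    "Effect": "Allow",
    "Action": [
        "s3:GetObjectVersion",
        "s3:DeleteObjectVersion",
        "s3:ListBucketVersions"
    ],
    "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_FOLDER_NAME/*"
    ]
}
```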
Replace YOUR_BUCKET_NAME with the name of the S3 bucket you created for ICA. Replace YOUR_FOLDER_NAME with the name of the folder in your S3 bucket.
(Optional) Set policy name to "illumina-ica-admin-policy"
To create the IAM Policy via the AWS CLI, create a local file named illumina-ica-admin-policy.json containing the policy content above and run the following command. Be sure the path to the policy document (--policy-document) leads to the path where you saved the file:
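A minimal sketch of that command, assuming the policy file was saved in the current working directory:

```bash
# Create the IAM policy from the local JSON document; adjust the file path if needed.
aws iam create-policy \
    --policy-name illumina-ica-admin-policy \
    --policy-document file://illumina-ica-admin-policy.json
```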
An AWS IAM User is needed to create an Access Key for ICA to connect to the AWS S3 Bucket. The policy will be attached to the IAM user to grant the user the necessary permissions.
Refer to the Creating IAM users (console) documentation for instructions on creating an AWS IAM User via the AWS Management Console. Use the following configuration during the process:
(optional) Set user name to "illumina_ica_admin"
Select the Programmatic access option for the type of access
Select Attach existing policies directly when setting the permissions, and choose the policy created in Create AWS IAM Policy
(Optional) Retrieve the Access Key ID and Secret Access Key by choosing to Download .csv
To create the IAM user and attach the policy via the AWS CLI, enter the following commands (AWS IAM users are global resources and do not require a region to be specified). This creates an IAM user illumina_ica_admin, retrieves your AWS account number, and then attaches the policy to the user.
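A sketch of that sequence, assuming the policy name illumina-ica-admin-policy used earlier:

```bash
# Create the IAM user, look up the AWS account ID, then attach the policy to the user.
aws iam create-user --user-name illumina_ica_admin
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws iam attach-user-policy \
    --user-name illumina_ica_admin \
    --policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/illumina-ica-admin-policy"
```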
If the Access Key information was retrieved during the IAM user creation, skip this step.
Refer to the Managing access keys (console) AWS documentation for instructions on creating an AWS Access Key via the AWS Console. See the "To create, modify, or delete another IAM user's access keys (console)" sub-section.
Use the command below to create the Access Key for the illumina_ica_admin IAM user. Note the SecretAccessKey is sensitive and should be stored securely. The access key is only displayed when this command is executed and cannot be recovered. A new access key must be created if it is lost.
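A sketch of that command:

```bash
# Create an access key for the IAM user; the SecretAccessKey in the output must be stored securely.
aws iam create-access-key --user-name illumina_ica_admin
```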
The AccessKeyId and SecretAccessKey values will be provided to ICA in the next step.
Connecting your S3 bucket to ICA does not require any additional bucket policies.
However, if a bucket policy is required for use cases beyond ICA, you need to ensure that the bucket policy supports the essential permissions needed by ICA without inadvertently restricting its functionality.
Here is one such example:
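The policy below is a minimal sketch of the pattern described underneath it: a broad deny with an exception for the ICA IAM user and its STS federated-user sessions. Treat it as a starting point rather than the authoritative policy, and note that in practice you will likely also want to exempt your own administrative principals from the deny:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllExceptIcaPrincipals",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::YOUR_BUCKET_NAME",
                "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            ],
            "Condition": {
                "ArnNotLike": {
                    "aws:PrincipalArn": [
                        "arn:aws:iam::YOUR_ACCOUNT_ID:user/YOUR_IAM_USER",
                        "arn:aws:sts::YOUR_ACCOUNT_ID:federated-user/*"
                    ]
                }
            }
        }
    ]
}
```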
Be sure to replace the following fields:
YOUR_BUCKET_NAME: Replace this field with the name of the S3 bucket you created for ICA.
YOUR_ACCOUNT_ID: Replace this field with your account ID number.
YOUR_IAM_USER: Replace this field with the name of your IAM user created for ICA.
In this example, a restriction on the bucket policy disallows all access to the bucket, with an exception for the IAM user that ICA uses to connect to the S3 bucket. The exception allows ICA to continue performing the S3 actions necessary for ICA functionality.
Additionally, the exception rule is applied to the STS federated user session principal associated with ICA. Since ICA leverages the AWS STS to provide temporary credentials that allow users to perform actions on the S3 bucket, it is crucial to include these STS federated user session principals in your policy's whitelist. Failing to do so could result in 403 Forbidden errors when users attempt to interact with the bucket's objects using the provided temporary credentials.
To connect your S3 account to ICA, you need to add a storage credential in ICA containing the Access Key ID and Secret Access Key created in the previous step. From the ICA home screen, navigate to System Settings > Credentials and click the Create button to create a new storage credential.
Provide a name for the storage credentials, ensure the type is set to "AWS user" and provide the Access Key ID and Secret Access Key.
With the secret credentials created, a storage configuration can be created using the secret credential. Refer to the instructions to Create a Storage Configuration for details.
ICA uses AssumeRole to copy and move objects from a bucket in an AWS account to another bucket in another AWS account. To allow cross account access to a bucket, the following policy statements must be added in the bucket policy:
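A sketch of such a statement is shown below; the action list is an assumption chosen to cover read, write, and listing, and the authoritative statements are those provided with this step:

```json
{
    "Sid": "IcaCrossAccountAccess",
    "Effect": "Allow",
    "Principal": {
        "AWS": "ASSUME_ROLE_ARN"
    },
    "Action": [
        "s3:GetObject",
        "s3:GetObjectTagging",
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:ListBucket",
        "s3:GetBucketLocation"
    ],
    "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    ]
}
```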
Be sure to replace the following fields:
ASSUME_ROLE_ARN: Replace this field with the ARN of the cross account role you want to give permission to. Refer to the table below to determine which region-specific Role ARN should be used.
YOUR_BUCKET_NAME: Replace this field with the name of the S3 bucket you created for ICA.
The ARN of the cross account role you want to give permission to is specified in the Principal. Refer to the table below to determine which region-specific Role ARN should be used.
Australia (AU): arn:aws:iam::079623148045:role/ica_aps2_crossacct
Canada (CA): arn:aws:iam::079623148045:role/ica_cac1_crossacct
Germany (EU): arn:aws:iam::079623148045:role/ica_euc1_crossacct
India (IN): arn:aws:iam::079623148045:role/ica_aps3_crossacct
Indonesia (ID): arn:aws:iam::079623148045:role/ica_aps4_crossacct
Israel (IL): arn:aws:iam::079623148045:role/ica_ilc1_crossacct
Japan (JP): arn:aws:iam::079623148045:role/ica_apn1_crossacct
Singapore (SG): arn:aws:iam::079623148045:role/ica_aps1_crossacct
South Korea (KR): arn:aws:iam::079623148045:role/ica_apn2_crossacct
UK (GB): arn:aws:iam::079623148045:role/ica_euw2_crossacct
United Arab Emirates (AE): arn:aws:iam::079623148045:role/ica_mec1_crossacct
United States (US): arn:aws:iam::079623148045:role/ica_use1_crossacct
The following are common issues encountered when connecting an AWS S3 bucket through a storage configuration:
Access Forbidden: "Access forbidden: {message}". This mostly occurs because of a lack of permissions. Fix: review the IAM policy, bucket policy, and ACLs for the required permissions.
Conflict: "System topic is not in a valid state"
Conflict: "Found conflicting storage container notifications with overlapping prefixes"
Conflict: "Found conflicting storage container notifications for {prefix}{eventTypeMsg}"
Conflict: "Found conflicting storage container notifications with overlapping prefixes{prefixMsg}{eventTypeMsg}"
Customer Container Notification Exists: "Volume Configuration cannot be provisioned: storage container is already set up for customer's own notification"
Invalid Access Key ID: "Failed to update bucket policy: The AWS Access Key Id you provided does not exist in our records." Fix: check the status of the AWS Access Key ID in the console. If it is not active, activate it; if it is missing, create it.
Invalid Parameter: "Missing credentials for storage container"
Invalid Parameter: "Missing bucket name for storage container"
Invalid Parameter: "The storage container name has invalid characters"
Invalid Parameter: "Storage Container '{storageContainer}' does not exist"
Invalid Parameter: "Invalid parameters for volume configuration: {message}"
Invalid Storage Container Location: "Storage container must be located in the {region} region"
Invalid Storage Container Location: "Storage container must be located in one of the following regions: {regions}"
Missing Configuration: "Missing queue name for storage container notification"
Missing Configuration: "Missing system topic name for storage container notification"
Missing Configuration: "Missing lambda ARN for storage container notification"
Missing Configuration: "Missing subscription name for storage container notification"
Missing Storage Account Settings: "The storage account '{storageAccountName}' needs HNS (Hierarchical Namespace) enabled."
Missing Storage Container Settings: "Missing settings for storage container"
This error occurs when an existing bucket notification's event information overlaps with the notifications ICA is trying to add. Amazon S3 event notifications only allow overlapping events with non-overlapping prefixes. Depending on the conflicts in the notifications, the error can be presented in any of the following forms:
Volume Configuration cannot be provisioned: storage container is already set up for customer's own notification
Invalid parameters for volume configuration: found conflicting storage container notifications with overlapping prefixes
Failed to update bucket policy: Configurations overlap. Configurations on the same bucket cannot share a common event type
To fix the issue:
In the Amazon S3 Console, review your current S3 bucket's notification configuration and look for prefixes that overlap with your Storage Configuration's key prefix.
Delete the existing notification that overlaps with your Storage Configuration's key prefix
ICA will perform a series of steps in the background to re-verify the connection to your bucket.
This error can occur when recreating a recently deleted storage configuration. To fix the issue, you have to delete the bucket notifications:
In the Amazon S3 Console select the bucket for which you need to delete the notifications from the list.
Choose Properties.
Navigate to the Event Notifications section, select the check boxes for the event notifications named gds:objectcreated, gds:objectremoved, and gds:objectrestore, and click Delete.
Wait 15 minutes for the storage to become available in ICA.
If you do not want to wait 15 minutes, you can delete the current storage configuration, delete the bucket notifications in the bucket and create a new storage configuration.
Illumina Connected Analytics allows you to create and assign metadata to capture additional information about samples.
Each tenant has one root metadata model that is accessible to all projects in the tenant. This allows an organization to collect the same piece of information for every sample in every project in the tenant, such as an ID number. Within this root model, you can configure multiple metadata submodels, even at different levels.
Illumina recommends that you limit the number of fields or field groups you add to the root model. Any misconfigured items in the root model will carry over into all other metadata models in the tenant. Once a root model is published, the fields and groups that are defined within it cannot be deleted. You should first consider creating submodels before adding anything to the root model. When configuring a project, you have the option to assign one published metadata model to all samples in the project. This metadata model can be the root model, a submodel of the root model, or a submodel of a submodel; it can be any published metadata model in the tenant. When a metadata model is selected for a project, all fields configured for the metadata model, and all fields in any parent models, are applied to the samples in the project.
❗️ Illumina recommends that you limit the number of fields or field groups you add to the root model. You should first consider creating submodels before adding anything to the root model.
The following terminology is used within this page:
Metadata fields = Metadata fields are linked to a sample in the context of a project. They can be of various types and can contain single or multiple values.
Metadata groups = When several fields belong together (for example, they all relate to quality metrics), you can create a group so that users know these fields belong together.
Root model = The model that is linked to the tenant. Every metadata model that you link to a project will also contain the fields and groups specified in this model, as it is the parent model for all other models. This is a subcategory of a project metadata model.
Child/Sub model = Any metadata model that is not the root model. Child models inherit all fields and groups from their parent models. This is a subcategory of a project metadata model.
Pipeline model = A model that is linked to a specific pipeline and not a project.
Metadata in the context of ICA will always give information about a sample. It can be provided by the user, the pipeline and via the API. There are 2 general categories of metadata models: Project Metadata Model and Pipeline Metadata Model. Both models are built from metadata fields and groups. The project metadata model is specific per tenant, while the pipeline metadata model is linked to a pipeline and can be shared across tenants. These models are defined by users.
Each sample can have multiple metadata models. Whenever you link a project metadata model to your project, you will see its groups and fields present on each sample. The root model from that tenant will also be present, as every metadata model inherits the groups and fields specified in its parent metadata model(s). When a pipeline that contains a metadata model is executed with a sample, those groups and fields will also be present for each analysis that comes out of the pipeline execution.
The following field types are used within ICA:
Text: Free text
Keyword: Automatically complete value based on already used values
Numeric: Only numbers
Boolean: True or false, cannot be multiple value
Date: e.g. 23/02/2022
Date time: e.g. 23/02/2022 11:43:53, saved in UTC
Enumeration: select value out of drop-down list
The following properties can be selected for groups & fields:
Required: Pipeline can’t be started with this sample until the required group/field is filled in
Sensitive: Values of this group/field are only visible to project users of the own tenant. When a sample is shared across tenants, these fields won't be visible
Filled by pipeline: Fields that need to be filled by pipeline should be part of the same group. This group will automatically be multiple value and values will be available after pipeline execution. This property is only available for groups
Multiple value: This group/field can consist out of multiple (grouped) values
❗️ Fields cannot be both required and filled by pipeline
The project metadata model has metadata linked to a specific project. Values are known upfront, general information is required for each sample of a specific project, and it may include general mandatory company information.
The pipeline metadata model has metadata linked to a specific pipeline. Values are populated during the pipeline execution, and it requires an output file with the name 'metadata.response.json'.
❗️ Field groups should be used when configuring metadata fields that are filled by a pipeline. These fields should be part of the same field group and be configured with the Multiple Value setting enabled
Newly created and updated metadata models are not available for use within the tenant until the metadata model is published. When a metadata model is published, fields and field groups cannot be deleted, but the names and descriptions for fields and field groups can be edited. A model can be published after verifying all parent models are published first.
If a published metadata model is no longer needed, you can retire the model (except the root model).
First, check if the model contains any submodels. A model cannot be retired if it contains any published submodels.
When you are certain you want to retire a model and all submodels are retired, click on the three dots in the top right of the model window, and then select Retire Metadata Model.
To add metadata to your samples, you first need to assign a metadata model to your project.
Go to Projects > your_project > Project Settings > Details.
Select Edit.
From the Metadata Model drop-down list, select the metadata model you want to use for the project.
Select Save. All fields configured for the metadata model, and all fields in any parent models are applied to the samples in the project.
To manually add metadata to samples in your project, do as follows.
A precondition is that you have a metadata model assigned to your project.
Go to Projects > your_project > Samples > your_sample.
Double-click your sample to open the sample details.
Enter all metadata information as it applies to the selected sample. All required metadata fields must be populated or the pipeline cannot start.
Select Save
To fill metadata by pipeline executions, a pipeline model must be created.
In the Illumina Connected Analytics main navigation, go to Projects > your_project > Flow > Pipelines > your_pipeline.
Double-click on your pipeline to open the pipeline details.
Create/Edit your model under Metadata Model tab. Field groups should be used when configuring metadata fields that are filled by a pipeline. These fields should be part of the same field group and be configured with the Multiple Value setting enabled.
In order for your pipeline to fill the metadata model, an output file with the name metadata.response.json must be generated. After adding your group fields to the pipeline model, click on Generate example JSON to view the required format for your pipeline.
❗️ The field names cannot have . in them; e.g. for the metric name Q30 bases (excl. dup & clipped bases), the . after excl must be removed.
Populating metadata models of samples allows having a sample-centric view of all the metadata. It is also possible to synchronize that data into your project's Base warehouse.
In the Illumina Connected Analytics main navigation, select Projects.
In your project menu select Schedule.
Select 'Add new', and then click on the Metadata Schedule option.
Type a name for your schedule, optionally add a description, and select whether you would like the metadata source to be the current project or the entire tenant. It is also possible to select whether ICA references should be anonymized and whether sensitive metadata fields should be included. As a reminder, values of sensitive metadata fields are not visible to users outside of the project.
Select Save.
Navigate to Tables under BASE menu in your project.
Two new table schemas should be added with your current metadata models.
In order to create a Tool or Bench image, a Docker image is required to run the application in a containerized environment. Illumina Connected Analytics supports both public Docker images and private Docker images uploaded to ICA.
Navigate to System Settings > Docker Repository.
Click Create > External image to add a new external image.
Add your full image URL in the Url field, e.g. docker.io/alpine:latest or registry.hub.docker.com/library/alpine:latest. Docker Name and Version will auto-populate. (Tip: do not add http:// or https:// in your URL)
Note: Do not use :latest when the repository has rate limiting enabled as this interferes with caching and incurs additional data transfer.
(Optional) Complete the Description field.
Click Save.
The newly added image will appear in your Docker Repository list.
Verification of the URL is performed during execution of a pipeline which depends on the Docker image, not during configuration.
External images are accessed from the external source whenever required and not stored in ICA. Therefore, it is important not to move or delete the external source. There is no status displayed on external Docker repositories in the overview as ICA cannot guarantee their availability. The use of :stable instead of :latest is recommended.
In order to use private images in your tool, you must first upload them as a TAR file.
Navigate to Projects > your_project.
Select your uploaded TAR file and, in the top menu, click Manage > Change Format.
Navigate to System Settings > Docker Repository (outside of your project).
Click on Create > Image.
Click on the magnifying glass to find your uploaded TAR image file.
Select the appropriate region and if needed, filter on project from the drop-down menus to find your file.
Select that file.
The newly added image should appear in your Docker Repository list. Verify it is marked as Available under the Status column to ensure it is ready to be used in your tool or pipeline.
Navigate to System Settings > Docker Repository.
Either
Select the required image(s) and go to Manage > Add Region.
OR double-click on a required image, check the box matching the region you want to add, and select update.
In both cases, allow a few minutes for the image to become available in the new region (the status becomes available in table view).
To remove regions, go to Manage > Remove Region or unselect the regions from the Docker image detail view.
You can download your created Docker images at System Settings > Docker Images > your_Docker_image > Manage > Download.
In order to be able to download Docker images, the following requirements must be met:
The Docker image can not be from an entitled bundle.
Only self-created Docker images can be downloaded.
The Docker image must be an internal image and in status Available.
You can only select a single Docker image at a time for download.
Docker image size should be kept as small as practically possible. To this end, it is best practice to compress the image. After compressing and uploading the image, select your uploaded file and click Manage > Change Format in the top menu to change it to Docker format so ICA can recognize the file.
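As a hedged sketch with standard Docker tooling (the image name my-tool:1.0 is illustrative), a compressed TAR can be produced like this before uploading it to the project Data tab:

```bash
# Export a local Docker image to a gzip-compressed TAR for upload to ICA.
docker save my-tool:1.0 | gzip > my-tool_1.0.tar.gz

# Uncompressed alternative:
docker save -o my-tool_1.0.tar my-tool:1.0
```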
The Activity view shows the status and history of long-running activities including Data Transfers, Base Jobs, Base Activity, Bench Activity and Batch Jobs.
The Data Transfers tab shows the status of data uploads and downloads. You can sort, search and filter on various criteria and export the information. Show ongoing transfers (top right) allows you to filter out the completed and failed transfers to focus on current activity.
The Base Jobs tab gives an overview of all the actions related to a table or a query that have run or are running (e.g., Copy table, export table, Select * from table, etc.)
The jobs are shown with their:
Creation time: When did the job start
Description: The query or the performed action with some extra information
Type: Which action was taken
Status: Failed or succeeded
Duration: How long the job took
Billed bytes: The used bytes that need to be paid for
Failed jobs provide information on why the job failed. Details are accessed by double-clicking the failed job. Jobs in progress can be aborted here.
The Base Activity tab gives an overview of previous results (e.g., Executed query, Succeeded Exporting table, Created table, etc.) Collecting this information can take considerable time. For performance reasons, only the activity of the last month (rolling window) with a limit of 1000 records is shown and available for download as Excel or JSON. To get the data for the last year without limit on the number of records, use the export as file function. No activity data is retained for more than one year.
The activities are shown with:
Start Time: The moment the action was started
Query: The SQL expression.
Status: Failed or succeeded
Duration: How long the job took
User: The user that requested the action
Size: For SELECT queries, the size of the query results is shown. Queries resulting in less than 100Kb of data will be shown with a size of <100K
The Bench Activity tab shows the actions taken on Bench Workspaces in the project.
The activities are shown with:
Workspace: Workspace where the activity took place
Date: Date and time of the activity
User: User who performed the activity
Action: Which activity was performed
The Batch Jobs tab allows users to monitor progress of Batch Jobs in the project. It lists Data Downloads, Sample Creation (double-click entries for details) and Data Linking (double-click entries for details). The (ongoing) Batch Job details are updated each time they are (re)opened, or when the refresh button is selected at the bottom of the details screen. Batch jobs which have a final state such as Failed or Succeeded are removed from the activity list after 7 days.
Which batch jobs are visible depends on the user role.
A storage configuration provides ICA with information to connect to an external cloud storage provider, such as AWS S3. The storage configuration validates that the information provided is correct, and then continuously monitors the integration.
Refer to the following pages for instructions to setup supported external cloud storage providers:
The storage configuration requires credentials to connect to your storage. AWS uses these security credentials to authenticate and authorize your requests. On the System Settings > Credentials > Create screen, you can enter these credentials. Long-term access keys consist of an access key ID and a secret access key used together as a set.
Fill out the following fields:
Type—The type of access credentials. This will usually be AWS user.
Name—Provide a name to easily identify your access key.
Access key ID—The access key you created.
Secret access key—Your related secret access key.
In the ICA main navigation, select System Settings > Storage > Create.
Configure the following settings for the storage configuration.
Type—Use the default value, e.g., AWS_S3. Do not change.
Region—Select the region where the bucket is located.
Configuration name—You will use this name when creating volumes that reside in the bucket. The name must be between 3 and 63 characters long.
Description—Here you can provide a description for yourself or other users to identify this storage configuration.
Bucket name—Enter the name of your S3 bucket.
Key prefix [Optional]—You can provide a key prefix to allow only files inside the prefix to be accessible. The key prefix must end with "/".
If a key prefix is specified, your projects will only have access to that folder and subfolders. For example, using the key prefix folder-1/ ensures that only the data from the folder-1 directory in your S3 bucket is synced with your ICA project. Using prefixes and distinct folders for each ICA project is the recommended configuration as it allows you to use the same S3 bucket for different projects.
Using no key prefix results in syncing all data in your S3 bucket (starting from root level) with your ICA project. Your project will have access to your entire S3 bucket, which prevents that S3 bucket from being used for other ICA projects. Although possible, this configuration is not recommended.
Secret—Select the credentials to associate with this storage configuration. These were created on the Credentials tab.
Server Side Encryption [Optional]—If needed, you can enter the algorithm and key name for server-side encryption processes.
Select Save.
With the action Set as default for region, you select which storage will be used as default storage in a region for new projects of your tenant. Only one storage can be default at a time for a region, so selecting a new storage as default will unselect the previous default. If you do not want to have a default, you can select the default storage and the action will become Unset as default for region.
The System Settings > Credentials > Share action is used to make the storage available to everyone in your tenant. By default, storage is private per user so that you have complete control over the contents. Once you decide you want to share the storage, simply select it and use the Share action. Do take into account that once shared, you can not unshare the storage. Once your storage is used in a project, it can also no longer be deleted.
Filenames beginning with / are not allowed, so be careful when entering full path names. Otherwise the file will end up on S3 but not be visible in ICA. If this happens, access your S3 storage directly and copy the data to where it was intended. If you are using an Illumina-managed S3 storage, submit a support request to delete the erroneous data.
Every 4 hours, ICA will verify the storage configuration and credentials to ensure availability. When an error is detected, ICA will attempt to reconnect once every 15 minutes. After 200 consecutive failed connection attempts (50 hours), ICA will stop trying to connect.
When you update your credentials, the storage configuration is automatically validated. In addition, you can manually trigger revalidation when ICA has stopped trying to connect by selecting the storage and then clicking Validate under System Settings > Storage > Manage.
Illumina® Connected Analytics is a cloud-based software platform intended to be used to manage, analyze, and interpret large volumes of multi-omics data in a secure, scalable, and flexible environment. The versatility of the system allows the platform to be used for a broad range of applications. When using the applications provided on the platform for diagnostic purposes, it is the responsibility of the user to determine regulatory requirements and to validate for intended use, as appropriate.
The platform is hosted in regions listed below.
The platform hosts a suite of RESTful HTTP-based application programming interfaces (APIs) to perform operations on data and analysis resources. A web application user-interface is hosted alongside the API to deliver an interactive visualization of the resources and enables additional functionality beyond automated analysis and data transfer. Storage and compute costs are presented via usage information in the account console, and a variety of compute resource options are specifiable for applications to fine tune efficiency.
Use the search bar on the top right to navigate through the help docs and find specific topics of interest.
If you have any questions, contact Illumina Technical Support by phone or email:
Illumina Technical Support | techsupport@illumina.com | 1-800-809-4566
For customers outside the United States, Illumina regional Technical Support contact information can be found at www.illumina.com/company/contact-us.html.
To see the current ICA version you are logged in to, click your username found on the top right of the screen and then select About.
To view a list of the products to which you have access, select the 9 dots symbol at the top right of ICA. This will list your products. If you have multiple regional applications for the same product, the region of each is shown between brackets.
The More Tools category presents the following options:
My Illumina Dashboard to monitor instruments, streamline purchases and keep track of upcoming activities.
Link to the Support Center for additional information and help.
Link to the order management from where you can keep track of your current and past orders.
Upload your private image as a TAR file, either by dragging and dropping the file in the Data tab, using the CLI or a Connector. For more information, please refer to the project data documentation.
Select DOCKER from the drop-down menu and Save.
Select the appropriate region, fill in the Docker Name and Version, indicate whether it is a tool or a bench image, and click Save.
You need a connector with a download rule to download the Docker image.
Transfers with a yellow background indicate that rules have been modified in ways that prevent planned files from being uploaded. Please verify your service connectors to resolve this.
For more information, refer to the documentation.
ICA performs a series of steps in the background to verify the connection to your bucket. This can take several minutes. You may need to manually refresh the list to verify that the bucket was successfully configured. Once the storage configuration setup is complete, the configuration can be used when creating new projects.
Refer to the troubleshooting guide for more information.
ICA supports the following storage classes. Please see the AWS documentation for more information on each:
If you are using S3 Intelligent-Tiering, which allows S3 to automatically move files into different cost-effective storage tiers, please do NOT include the Archive and Deep Archive Access tiers, as these are not supported by ICA yet. Instead, you can use lifecycle rules to automatically move files to Archive after 90 days and Deep Archive after 180 days. Lifecycle rules are supported for user-managed buckets.
The user documentation provides material for learning the basics of interacting with the platform including examples and tutorials. Start with the documentation to learn more.
In the release notes section of the documentation, posts are made for new versions of deployments of the core platform components.
Project Creator: All batch jobs
Project Collaborator (same tenant): All batch jobs
Project Collaborator (different tenant): Only batch jobs of own tenant
S3 Standard: Available
S3 Intelligent-Tiering: Available
S3 Express One Zone: Available
S3 Standard-IA: Available
S3 One Zone-IA: Available
S3 Glacier Instant Retrieval: Available
S3 Glacier Flexible Retrieval: Archived
S3 Glacier Deep Archive: Archived
Reduced redundancy (not recommended): Available
Australia (AU)
Canada (CA)
Germany (EU)
India (IN)
Indonesia (ID)
Japan (JP)
Singapore (SG)
South Korea (KR)
United Kingdom (GB)
United Arab Emirates (AE)
United States (US)
You can use samples to group information related to a sample, including input files, output files, and analyses.
You can search for samples (excluding their metadata) with the Search button at the top right.
To add a new sample, do as follows.
Select Projects > your_project > Samples.
To add a new sample, select + Create, and then enter a unique name and description for the sample.
To include files related to the sample, select + Add data to sample.
Your sample is added to the Samples page. To view information on the sample, select the sample, and then select Open Details.
You can add additional files to a sample after creating the sample. Any files that are not currently included in a sample are listed on the Unlinked Files tab.
To add an unlinked file to a sample, do as follows.
Go to Projects > your_project > Samples > Unlinked files tab.
Select a file or files, and then select one of the following options:
Create sample — Create a new sample that includes the selected files.
Link to sample — Select an existing sample in the project to link the file to.
Alternatively, you can add unlinked files from the sample details.
Go to Projects > your_project > Samples.
Select your sample to open the details.
The last section of the details is files, where you select + Add data to sample.
If the data is not in your project, select Choose a file, which will upload the data to your project. This does not automatically add it to your sample; you will still have to select that newly uploaded data and then select Add data to sample.
Data can only be linked to a single sample, so once you have linked data to a sample, it will no longer appear in the list of data to choose from.
To remove files from samples,
Go to Projects > your_project > Samples > your_sample > Details.
Go to the files section and open the file details of the file you want to remove.
Select Remove data from sample.
Save your changes.
A Sample can be linked to a project from a separate project to make it available in read-only capacity.
Navigate to the Samples view in the Project
Click the Link button
Select the Sample(s) to link to the project
Click the Link button
Data linked to Samples is not automatically linked to the project. The data must be linked separately from the Data view. Samples also must be available in a complete state in order to be linked.
If you want to remove a sample, select it and use the delete option from the top navigation row. You will be presented a choice of how to handle the data in the sample.
Unlink all data without deleting it.
Delete input data and unlink other data.
Delete all data.
You can verify the integrity of the data with the MD5 (Message Digest Algorithm 5) checksum. It is a widely used cryptographic hash function that generates a fixed-size, 128-bit hash value from any input data. This hash value is unique to the content of the data, meaning even a slight change in the data will result in a significantly different MD5 checksum.
For files smaller than 16 MB, you can directly retrieve the MD5 checksum using our API endpoints. Make an API GET call to the https://ica.illumina.com/ica/rest/api/projects/{projectId}/data/{dataId} endpoint, specifying the data ID you want to check and the corresponding project ID. The response you receive will be in JSON format, containing various file metadata. Within the JSON response, look for the objectETag field. This value is the MD5 checksum for the file you have queried. You can compare this checksum with the one you compute locally to ensure the file's integrity.
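As an illustration only, here is a small Groovy sketch (Groovy being the language Nextflow is built on) of this comparison. The X-API-Key header, the placeholder project and data IDs, the local file name, and the exact nesting of objectETag in the response are assumptions to adapt to your environment.

```groovy
// Hedged sketch: compare the MD5 reported by the ICA API (objectETag) with a local MD5.
// Placeholder IDs, the local file name, and the X-API-Key header are assumptions.
import groovy.json.JsonSlurper
import java.security.MessageDigest

def projectId = 'your-project-id'            // hypothetical placeholder
def dataId    = 'fil.your-data-id'           // hypothetical placeholder
def apiKey    = System.getenv('ICA_API_KEY') // personal API key

// Retrieve the file metadata from the ICA REST API
def conn = new URL("https://ica.illumina.com/ica/rest/api/projects/${projectId}/data/${dataId}")
        .openConnection()
conn.setRequestProperty('X-API-Key', apiKey)
def response = new JsonSlurper().parse(conn.inputStream)

// The checksum is reported in the objectETag field of the file details;
// the exact nesting below is an assumption, so fall back to a shallower lookup if needed.
def remoteMd5 = (response?.data?.details?.objectETag ?: response?.details?.objectETag)
        ?.replaceAll('"', '')

// Compute the MD5 of the local copy of the file
def localMd5 = MessageDigest.getInstance('MD5')
        .digest(new File('local_copy.fastq').bytes)
        .encodeHex()
        .toString()

println(remoteMd5 == localMd5 ? 'Checksums match' : 'Checksum mismatch')
```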
For larger files, the process is different due to computation limitations. In these cases, we recommend using a dedicated pipeline on our platform to explicitly calculate the MD5 checksum. Below you can find both a main.nf file and the corresponding XML for a possible Nextflow pipeline to calculate the MD5 checksum for FASTQ files.
A Tool is the definition of a containerized application with defined inputs, outputs, and execution environment details including compute resources required, environment variables, command line arguments, and more.
Tools define the inputs, parameters, and outputs for the analysis. Tools are available for use in graphical CWL pipelines by any project in the account.
Select System Settings > Tool Repository > + Create.
Configure tool settings in the tool properties tabs. See Tool Properties.
Select Save.
The following sections describe the tool properties that can be configured in each tab.
Refer to the CWL CommandLineTool Specification for further explanation about many of the properties described below. Not all features described in the specification are supported.
Name
The name of the tool.
Categories
One or more tags to categorize the tool. Select from existing tags or type a new tag name in the field.
Icon
The icon for the tool.
Description
Free text description for information purposes.
Status
The release status of the tool.
Docker image
The registered Docker image for the tool.
Regions
The regions supported by the linked Docker image.
Tool version
The version of the tool specified by the end user. Could be any string.
Release version
The version number of the tool.
Family
A group of tools or tool versions.
Version comment
A description of changes in the updated version.
Links
External reference links.
Tool Status
The release status of the tool. Can be one of Draft, Release Candidate, Released, or Deprecated.
Draft
Fully editable draft.
Release Candidate
The tool is ready for release. Editing is locked but the tool can be cloned to create a new version.
Released
The tool is released. Editing is locked, but the tool can be cloned to create a new version.
Deprecated
The tool is no longer intended for use in pipelines, but there are no restrictions placed on the tool. That is, it can still be added to new pipelines and will continue to work in existing pipelines. It is merely an indication to the user that the tool should no longer be used.
The Documentation tab provides options for configuring the HTML description for the tool. The description appears in the Tool Repository but is excluded from exported CWL definitions.
The General Tool tab provides options to configure the basic command line.
ID
CWL identifier field
CWL version
The CWL version in use. This field cannot be changed.
Base command
Components of the command. Each argument must be added in a separate line.
Standard in
The name of the file that captures Standard In (STDIN) stream information.
Standard out
The name of the file that captures Standard Out (STDOUT) stream information.
Standard error
The name of the file that captures Standard Error (STDERR) stream information.
Requirements
The requirements for triggering an error message.
Hints
The requirements for triggering a warning message.
The Hints/Requirements include CWL features to indicate capabilities expected in the Tool's execution environment.
Inline Javascript
The Tool contains a property with a JavaScript expression to resolve its value.
Initial workdir
The workdir can be any of the following types:
String or Expression — A string or JavaScript expression, eg, $(inputs.InputFASTA)
File or Dir — A map of one or more files or directories, in the following format: {type: array, items: [File, Directory]}
Dirent — A script in the working directory. The Entry name field specifies the file name.
Scatter feature — Indicates that the workflow platform must support the scatter and scatterMethod fields.
The Tool Arguments tab provides options to configure base command parameters that do not require user input.
Tool arguments may be one of two types:
String or Expression — A literal string or JavaScript expression, eg --format=bam.
Binding — An argument constructed from the binding of an input parameter.
The following table describes the argument input fields.
Value
The literal string to be added to the base command.
String or expression
Position
The position of the argument in the final command line. If the position is not specified, the default value is set to 0 and the arguments appear in the order they were added.
Binding
Prefix
The string prefix.
Binding
Item separator
The separator that is used between array values.
Binding
Value from
The source string or JavaScript expression.
Binding
Separate
The setting to require the Prefix and Value from fields to be added as separate or combined arguments. True indicates the fields must be added as separate arguments. False indicates the fields must be added as a single concatenated argument.
Binding
Shell quote
The setting to quote the Value from field on the command line. True indicates the value field appears in the command line. False indicates the value field is entered manually.
Binding
Example
Prefix
--output-filename
Value from
$(inputs.inputSAM.nameroot).bam
Input file
/tmp/storage/SRR45678_sorted.sam
Output file
SRR45678_sorted.bam
The Tool Inputs tab provides options to define the input files and directories for the tool. The following table describes the input and binding fields. Selecting multi value enables type binding options for adding prefixes to the input.
ID
The file ID.
Label
A short description of the input.
Description
A long description of the input.
Type
The input type, which can be either a file or a directory.
Input options
Checkboxes to add the following options. Optional indicates the input is optional. Multi value indicates there is more than one input file or directory. Streamable indicates the file is read or written sequentially without seeking.
Secondary files
The required secondary files or directories.
Format
The input file format.
Position
The position of the argument in the final command line. If the position is not specified, the default value is set to 0 and the arguments appear in the order they were added.
Prefix
The string prefix.
Item separator
The separator that is used between array values.
Value from
The source string or JavaScript expression.
Load contents
When selected, the first 64 KiB of the file is read into the contents field so it can be used in expressions.
Separate
The setting to require the Prefix and Value from fields to be added as separate or combined arguments. True indicates the fields must be added as separate arguments. False indicates the fields must be added as a single concatenated argument.
Shell quote
The setting to quote the Value from field on the command line. True indicates the value field appears in the command line. False indicates the value field is entered manually.
The Tool Settings tab provides options to define parameters that can be set at the time of execution. The following table describes the input and binding fields. Selecting multi value enables type binding options for adding prefixes to the input.
ID
The file ID.
Label
A short description of the input.
Description
A long description of the input.
Default Value
The default value to use if the tool setting is not available.
Type
The input type, which can be Boolean, Int, Long, Float, Double or String.
Input options
Checkboxes to add the following options. Optional indicates the input is optional. Multi value indicates there can be more than one value for the input.
Position
The position of the argument in the final command line. If the position is not specified, the default value is set to 0 and the arguments appear in the order they were added.
Prefix
The string prefix.
Item separator
The separator that is used between array values.
Value from
The source string or JavaScript expression.
Separate
The setting to require the Prefix and Value from fields to be added as separate or combined arguments. True indicates the fields must be added as separate arguments. False indicates the fields must be added as a single concatenated argument.
Shell quote
The setting to quote the Value from field on the command line. True indicates the value field appears in the command line. False indicates the value field is entered manually.
The Tool Outputs tab provides options to define the parameters of output files.
The following table describes the input and binding fields. Selecting multi value enables type binding options for adding prefixes to the input.
ID
The file ID.
Label
A short description of the input.
Description
A long description of the input.
Type
The input type, which can be either a file or a directory.
Output options
Checkboxes to add the following options. Optional indicates the input is optional. Multi value indicates there is more than one input file or directory. Streamable indicates the file is read or written sequentially without seeking.
Secondary files
The required secondary files or directories.
Format
The input file format.
Globs
The pattern for searching file names.
Load contents
Automatically loads the file contents. The system extracts up to the first 64 KiB of text from the file and populates the contents field with it.
Output eval
Evaluate an expression to generate the output value.
The Tool CWL tab displays the complete CWL code constructed from the values entered in the other tabs. The CWL code automatically updates when changes are made in the tool definition tabs, and any changes to the CWL code are reflected in the tool definition tabs.
❗️ Modifying data within the CWL editor can result in invalid code.
From the System Settings > Tool Repository page, select a tool.
Select Edit.
From the System Settings > Tool Repository page, select a tool.
Select the Information tab.
From the Status drop-down menu, select a status.
Select Save.
In addition to the interactive Tool builder, the platform GUI also supports working directly with the raw definition when developing a new Tool. This provides the ability to write the Tool definition manually or bring an existing Tool's definition to the platform.
A simple example CWL Tool definition is provided below.
When creating a new Tool, navigate to System Settings > Tool Repository > your_tool > Tool CWL tab to show the raw CWL definition. Here a CWL CommandLineTool definition may be pasted into the editor. After pasting into the editor, the definition is parsed and the other tabs for visually editing the Tool will populate according to the definition contents.
General Tool - includes your base command and various optional configurations.
The base command is required for your tool to run, e.g. python /path/to/script.py, such that python and /path/to/script.py are added as separate lines.
Inline Javascript requirement - must be enabled if you are using Javascript anywhere in your tool definition.
Initial workdir requirement - Dirent Type
Your tool must point to a script that executes your analysis. That script can either be provided in your Docker image or defined using a Dirent. Defining a script via Dirent allows you to dynamically modify your script without updating your Docker image. To define your Dirent script, enter the script name under Entry name (e.g. runner.sh) and the script content under Entry. Then, point your base command to that custom script, e.g. bash runner.sh.
❗ What's the difference between Settings and Arguments?
Settings are exposed at the pipeline level with the ability to get modified at launch, while Arguments are intended to be immutable and hidden from users launching the pipeline.
How do you reference your tool inputs and settings throughout the tool definition?
You can either reference your inputs using their position or ID.
Settings can be referenced using their defined IDs, e.g. $(inputs.InputSetting)
File/Directory inputs can be referenced using their defined IDs, followed by the desired field, e.g. $(inputs.InputFile.path). For additional information please refer to the File CWL documentation.
All inputs can also be referenced using their position, e.g. bash script.sh $1 $2
This section describes how to connect an AWS S3 Bucket with SSE-KMS Encryption enabled. General instructions for configuring your AWS account to allow ICA to connect to an S3 bucket are found on this page.
Follow the AWS instructions for how to create S3 bucket with SSE-KMS key.
S3-SSE-KMS must be in the same region as your ICA v2.0 project. See the ICA S3 bucket documentation for more information.
In the "Default encryption" section, enable Server-side encryption and choose AWS Key Management Service key (SSE-KMS)
. Then select Choose your AWS KMS key
.
If you do not have an existing customer managed key, click Create a KMS key
and follow these steps from AWS.
Once the bucket is set, create a folder with encryption enabled in the bucket that will be linked in the ICA storage configuration. This folder will be connected to ICA as a prefix. Although it is technically possible to use the root folder, this is not recommended as it will cause the S3 bucket to no longer be available for other projects.
Follow the general instructions for connecting an S3 bucket to ICA.
In the step "Create AWS IAM policy":
Add permission to use the KMS key by adding kms:Decrypt, kms:Encrypt, and kms:GenerateDataKey.
Add the KMS key ARN arn:aws:kms:xxx to the first "Resource".
On Unversioned buckets, the permissions will match the following:
On Versioned OR Suspended buckets, the permissions will match the following:
At the end of the policy setting, there should be 3 permissions listed in the "Summary".
Follow the general instructions for how to create a storage configuration in ICA.
In step 3 of the process above, continue with the [Optional] Server Side Encryption field to enter the algorithm and key name for server-side encryption processes.
On "Algorithm", input aws:kms
On "Key Name", input the ARN KMS key: arn:aws:kms:xxx
Although "Key prefix" is optional, it is highly recommended to use this and not use the root folder of your S3 bucket. "Key prefix" refers to the folder name in the bucket which you created.
In addition to following the instructions to Enable Cross Account Copy, the KMS policy must include the following statement for an AWS S3 bucket with SSE-KMS encryption (refer to the Role ARN table from the linked page for the ASSUME_ROLE_ARN value):
The Data section gives you access to the files and folders stored in the project as well as those linked to the project. Here, you can perform searches and data management operations such as moving, copying, deleting and (un)archiving.
The length of the file name (minus prefixes and delimiters) is ideally limited to 32 characters.
To prevent cost issues, you can not perform actions such as copying and moving data which would write data to the workspace when the project billing mode is set to tenant and the owning tenant of the folder is not the current user's tenant.
On the Projects > your_project > Data page, you can view file information and preview files.
To view file details, click on the filename.
Run input tags identify the last 100 pipelines which used this file as input.
Connector tags indicate if the file was added via browser upload or connector.
To view file contents, select the checkbox at the beginning of the line and then select View from the top menu. Alternatively, you can first click on the filename to see the details and then click View to preview the file.
To see the ongoing actions (copying from, copying to, moving from, moving to) on data in the data overview (Projects > your_project > Data), add the ongoing actions column from the column list. This contains a list of ongoing actions sorted by when they were created. You can also consult the data detail view for ongoing actions by clicking on the data in the overview. When clicking on an ongoing action itself, the data job details of the most recent created data job are shown.
For folders, the list of ongoing actions is displayed on top left of the folder details. When clicking the list, the data job details are shown of the most recent created data job of all actions.
When Secondary Data is added to a data record, those secondary data records are mounted in the same parent folder path as the primary data file when the primary data file is provided as an input to a pipeline. Secondary data is intended to work with the CWL secondaryFiles feature. This is commonly used with genomic data such as BAM files with companion BAM index files (refer to https://www.ncbi.nlm.nih.gov/tools/gbench/tutorial6/ for an example).
To hyperlink to data, use the following syntax:
Normal permission checks still apply with these links. If you try to follow a link to data to which you do not have access, you will be returned to the main project screen or login screen, depending on your permissions.
Uploading data to the platform makes it available for consumption by analysis workflows and tools.
To upload data manually via the drag-and-drop interface in the platform UI, go to Projects > your_project > Data and either
Drag a file from your system into the Choose a file or drag it here box.
Select the Choose a file or drag it here box, and then choose a file. Select Open to upload the file.
Your files are added to the Data page with status partial during upload and become available when upload completes.
Do not close the ICA tab in your browser while data uploads.
You can copy data from the same project to a different folder or from another project to which you have access.
In order to copy data, the following rights must be assigned to the person copying the data:
The following restrictions apply when copying data:
Data in the "Partial" or "Archived" state will be skipped during a copy job.
To use data copy:
Go to the destination project for your data copy and proceed to Projects > your_project > Data > Manage > Copy From.
Optionally, use the filters (Type, Name, Status, Format or additional filters) to filter out the data or search with the search box.
Select the data (individual files or folders with data) you want to copy.
Select any meta data which you want to keep with the copied data (user tags, technical system tags or instrument information).
Select which action to take if the data already exists (overwrite existing data, don't copy, or keep both the original and the new copy by appending a number to the copied data).
Select Copy Data to copy the data to your project. You can see the progress in Projects > your_project > Activity > Batch Jobs and, if your browser permits it, a pop-up message will be displayed when the copy process completes.
The outcome can be one of the following:
INITIALIZED
WAITING_FOR_RESOURCES
RUNNING
STOPPED - When choosing to stop the batch job.
SUCCEEDED - All files and folders are copied.
PARTIALLY_SUCCEEDED - Some files and folders could be copied, but not all. Partially succeeded will typically occur when files were being modified or unavailable while the copy process was running.
FAILED - None of the files and folders could be copied.
To see the ongoing actions on data in the data overview (Projects > your_project > Data), you can add the ongoing actions column from the column list with the three column symbol at the top right, next to the filter funnel. You can also consult the data detail view for ongoing actions by clicking on the data in the overview.
There is a difference in copy type behavior between copying files and folders. The behavior is designed for files and it is best practice to not copy folders if there already is a folder with the same name in the destination location.
Notes on copying data
Copying data comes with an additional storage cost as it will create a copy of the data.
You can copy over the same data multiple times.
On the command-line interface, the command to copy data is icav2 projectdata copy.
You can move data both within a project and between different projects to which you have access. If you allow notifications from your browser, a pop-up will appear when the move is completed.
Move From is used when you are in the destination location.
Move To is used when you are in the source location. Before moving the data, pre-checks are performed to verify that the data can be moved and no currently running operations are being performed on the folder. Conflicting jobs and missing permissions will be reported. Once the move has started, no other operation should be performed on the data being moved to avoid potential data loss or duplication. Adding or (un)archiving files during the move may result in duplicate folders and files with different identifiers. If this happens, you will need to manually delete the duplicate files and move the files which were skipped during the initial move.
When you move data from one location to another, you should not change the source data while the Move job is in progress. This will result in jobs getting aborted. Please expand the "Troubleshooting" section below for information on how to fix this if it occurs.
There are a number of rights and restrictions related to data move as this will delete the data in the source location.
Move jobs will fail if any data being moved is in the "Partial" or "Archived" state.
Move Data From is used when you are in the destination location.
Navigate to Projects > your_project > Data > your_destination_location > Manage > Move From.
Select the files and folders which you want to move.
Select the Move button. Moving large amounts of data can take considerable time. You can monitor the progress at Projects > your_project > Activity > Batch Jobs.
Move Data To is used when you are in the source location. You will need to select the data you want to move from the current location and the destination to move it to.
Navigate to Projects > your_project > Data > your_source_location.
Select the files and folders which you want to move.
Select Projects > your_project > Data > your_source_location > Manage > Move To.
Select your target project and location.
Select the Move button. Moving large amounts of data can take considerable time. You can monitor the progress at Projects > your_project > Activity > Batch Jobs.
INITIALIZED
WAITING_FOR_RESOURCES
RUNNING
STOPPED - When choosing to stop the batch job.
SUCCEEDED - All files and folders are moved.
PARTIALLY_SUCCEEDED - Some files and folders could be moved, but not all. Partially succeeded will typically occur when files were being modified or unavailable while the move process was running.
FAILED - None of the files and folders could be moved.
To see the ongoing actions on data in the data overview (Projects > your_project > Data), you can add the ongoing actions column from the column list with the three column symbol at the top right, next to the filter funnel. You can also consult the data detail view for ongoing actions by clicking on the data in the overview.
Restrictions:
A total maximum of 1000 items can be moved in one operation. An item can be either a file or a folder. Folders with subfolders and subfiles still count as one item.
You can not move files and folders to a destination where one or more files or folders with the same name already exists.
You can not move data and folders to linked data.
You can not move a folder to itself.
You can not move data which is in the process of being moved.
You can not move data across regions.
You can not move data from externally-managed projects.
You can not move linked data.
You can not move externally managed data.
You can only move data when it has status available.
To move data across projects, it must be owned by the user's tenant.
If you do not select a target folder for Move Data To, the root folder of the target project is used.
If you are only able to select your source project as the target data project, this may indicate that data sharing (Projects > your_project > Project Settings > Details > Data Sharing) is not enabled for your project or that you do not have upload rights in other projects.
Single files can be downloaded directly from within the UI.
Select the checkbox next to the file which you want to download, followed by Download > Download file.
Files for which ICA can display the contents can be viewed by clicking on the filename, followed by the View tab. Select the download action on the view tab to download the file. Note that larger files may take some time to load.
You can trigger an asynchronous download via service connector using the Schedule for Download button with one or more files selected.
Select a file or files to download.
Select Download > Download files or folders using a service connector. This will display a list of all available connectors.
Select a connector, and then select Schedule for Download. If you do not find the connector you need or you do not have a connector, you can click the Don't have a connector yet? option to create a new connector. You must then install this new connector and return to the file selection in step 1 to use it.
You can view the progress of the download or stop the download on the Activity page for the project.
The data records contained in a project can be exported in CSV, JSON, and Excel formats.
Select one or more files to export.
Select Export.
Select the following export options:
To export only the selected file, select the Selected rows as the Rows to export option. To export all files on the page, select Current page.
To export only the columns present for the file, select the Visible columns as the Columns to export option.
Select the export format.
To manually archive or delete files, do as follows:
Select the checkbox next to the file or files to delete or archive.
Select Manage, and then select one of the following options:
Archive — Move the file or files to long-term storage (event code ICA_DATA_110).
Unarchive — Return the file or files from long-term storage. Unarchiving can take up to 48 hours, regardless of file size. Unarchived files can be used in analysis (event code ICA_DATA_114).
Delete — Remove the file completely (event code ICA_DATA_106).
When attempting concurrent archiving or unarchiving of the same file, a message will inform you to wait for the currently running (un)archiving to finish first.
To archive or delete files programmatically, you can use ICA's API endpoints:
Modify the dates of the file to be deleted/archived.
Linking a folder creates a dynamic read-only view of the source data. You can use this to get access to data without running the risk of modifying the source material and to share data between projects. In addition, linking ensures changes to the source data are immediately visible and no additional storage is required.
You can recognise linked data by the green color and see the owning project as part of the details.
Since this is read-only access, you cannot perform actions on linked data that need write access. Actions like (un)archiving, linking, creating, deleting, adding or moving data and folders, and copying data into the linked data are not possible.
Linking data is only possible from the root folder of your destination project. The action is disabled in project subfolders.
Linking a parent folder after linking a file or subfolder will unlink the file or subfolder and link the parent folder. So root\linked_subfolder will become root\linked_parentfolder\linked_subfolder.
Initial linking can take considerable time when there is a large amount of source data. However, once the initial link is made, updates to the source data will be instantaneous.
You can perform analysis on data from other projects by linking data from that project.
Select Projects > your_project > Data > Manage, and then select Link.
To view data by project, select the funnel symbol, and then select Owning Project. If you only know which project the data is linked to, you can choose to filter on linked projects.
Select the checkbox next to the file or files to add.
Select Select Data.
Your files are added to the Data page. To view the linked data file, select Add filter, and then select Links.
If you link a folder instead of individual files, a warning is displayed indicating that, depending on the size of the folder, linking may take considerable time. The linking process will run in the background and the progress can be monitored on the Projects > your_project > activity > Batch Jobs screen.
To see more details, double-click the batch job.
To see how many individual files are already linked, double-click the item.
To unlink the data, go to the root level of your project and select the linked folder, or, if you have linked individual files separately, select those linked files (limited to 100 at a time) and select Manage > Unlink. As with linking a folder, the progress of unlinking can be monitored at Projects > your_project > Activity > Batch Jobs.
The GUI considers non-indexed folders as a single object. You can access the contents of a non-indexed folder:
as Analysis input/output
in Bench
via the API
To use a reference set from within a project, you have first to add it. From the project's page select Flow > Reference Data > Manage > +Add to project. Then select a reference set to add to your project. You can select the entire reference set, or click the arrow next to it to expand it. After expanding, scroll to the right, to see the individual reference files in the set. You can select individual reference files to add to your project, by checking the boxes next to them.
Note: Reference sets are only supported in Graphical CWL pipelines.
Navigate to Reference Data (outside of Project context).
Select the data set(s) you wish to add to another region and select Actions > Copy to another project.
Select a project located in the region where you want to add your reference data.
You can check in which region(s) Reference data is present by double-clicking on individual files in the Reference set and viewing Copy Details on the Data details tab.
Allow a few minutes for new copies to become available before use.
Note: You only need one copy of each reference data set per region. Adding Reference Data sets to additional projects set in the same region does not result in extra copies, but creates links instead. This is done from inside the project at Projects > <your_project> > Flow > Reference Data > Manage > Add to project.
To create a pipeline with reference data, use the CWL graphical mode (important restriction: as of now, you cannot use reference data for pipelines created in advanced mode). Use the reference data icon instead of the regular input icon. On the right-hand side, use the Reference files submenu to specify the name, the format, and the filters. You can specify the options for an end user to choose from and a default selection. You can select more than one file, but only one at a time (so repeat the process to select multiple reference files). If you only select one reference file, that file will be the only one users can use with your pipeline. In the screenshot, reference data with two options is presented.
If your pipeline was built to give users the option of choosing among multiple input reference files, they will see the option to select among the reference files you configured, under Settings.
After clicking the magnifying glass icon the user can select from provided options.
Pipelines defined using the "Code" mode require either an XML-based or JSON-based input form to define the fields shown on the launch view in the user interface (UI). The XML-based input form is defined in the "XML Configuration" tab of the pipeline editing view.
The input form XML must adhere to the input form schema.
During the creation of a Nextflow pipeline the user is given an empty form to fill out.
The input files are specified within a single DataInputs node. An individual input is then specified in a separate DataInput node. A DataInput node contains following attributes:
code: a unique id. Required.
format: specifying the format of the input: FASTA, TXT, JSON, UNKNOWN, etc. Multiple entries are possible: example below. Required.
type: is it a FILE or a DIRECTORY? Multiple entries are not allowed. Required.
required: is this input required for the execution of a pipeline? Required.
multiValue: are multiple files as an input allowed? Required.
dataFilter: TBD. Optional.
Additionally, DataInput has two elements: label for labelling the input and description for a free text description of the input.
An example of a single file input which can be in a TXT, CSV, or FASTA format.
To use a folder as an input the following form is required:
For multiple files, set the attribute multiValue to true. The variable will then be treated as a list ([]), so adapt your pipeline when changing from single value to multiValue, as shown in the sketch below.
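As a hedged illustration, the sketch below shows one way a workflow can cope with a parameter that may arrive as a single value or as a list once multiValue is enabled; the parameter name params.reads is hypothetical.

```nextflow
// Hedged sketch: params.reads is a hypothetical multiValue file input.
// Wrap single values in a list so the same code works before and after enabling multiValue.
params.reads = []

workflow {
    def reads = params.reads instanceof List ? params.reads : [params.reads]
    Channel.fromPath(reads).view { "Input file: $it" }
}
```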
Settings (as opposed to files) are specified within the steps node. Settings represent any non-file input to the workflow, including but not limited to, strings, booleans, integers, etc. The following hierarchy of nodes must be followed: steps > step > tool > parameter. The parameter node must contain following attributes:
code: unique id. This is the parameter name that is passed to the workflow
minValues: how many values (at least) should be specified for this setting. If this setting is required, minValues should be set to 1.
maxValues: how many values (at most) should be specified for this setting
classification: is this setting specified by the user?
In the code below a string setting with the identifier inp1 is specified.
Examples of the following types of settings are shown in the subsequent sections. Within each type, the value tag can be used to denote a default value in the UI, or can be left blank to have no default. Note that setting a default value has no impact on analyses launched via the API.
For an integer setting the following schema with an element integerType is to be used. To define an allowed range use the attributes minimumValue and maximumValue.
Option types can be used to designate options from a drop-down list in the UI. The selected option will be passed to the workflow as a string. This currently has no impact when launching from the API, however.
Option types can also be used to specify a boolean, for example
For a string setting, the following schema with an element stringType is to be used.
For a boolean setting, booleanType can be used.
One known limitation of the schema presented above is the inability to specify a parameter that can have multiple types, e.g. File or String. One way to implement this requirement is to define two optional parameters: one for the File input and a second for the String input. At the moment, the ICA UI doesn't validate whether at least one of these parameters is populated; this check can be done within the pipeline itself.
Below you can find both a main.nf and an XML configuration of a generic pipeline with two optional inputs, which can be used as a template to address similar issues. If the file parameter is set, it will be used. If the str parameter is set but file is not, the str parameter will be used. If neither is set, the pipeline aborts with an informative error message.
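For orientation, the main.nf logic described above could look roughly like the following sketch. It is illustrative only and not the full template referenced here; the process is a placeholder.

```nextflow
// Illustrative sketch of the "either file or str" pattern described above.
params.file = null
params.str  = null

process ECHO_INPUT {
    input:
    val x

    output:
    stdout

    script:
    """
    echo "Using input: ${x}"
    """
}

workflow {
    if (params.file) {
        ch_in = Channel.fromPath(params.file)  // file parameter takes precedence
    } else if (params.str) {
        ch_in = Channel.of(params.str)         // fall back to the string parameter
    } else {
        error "Provide either the 'file' or the 'str' parameter."
    }
    ECHO_INPUT(ch_in) | view
}
```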
In order to run Nextflow pipelines, the following process-level attributes within the Nextflow definition must be considered.
(*) Pipelines will still run once 20.10.0 is deprecated, but you will no longer be able to choose it when creating new pipelines.
You can select the Nextflow version while building a pipeline as follows:
For each compute type, you can choose between the scheduler.illumina.com/lifecycle: standard (default - AWS on-demand) or scheduler.illumina.com/lifecycle: economy (AWS spot instance) tiers.
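Assuming the lifecycle tier is requested through a pod annotation in the same way as the compute preset described further below (this is an assumption; verify it against your own pipelines), a process could ask for the economy tier as in this sketch:

```nextflow
// Hedged sketch: request the economy (AWS spot) tier for this process via a pod annotation.
process COUNT_LINES {
    pod annotation: 'scheduler.illumina.com/lifecycle', value: 'economy'

    input:
    path infile

    output:
    stdout

    script:
    """
    wc -l ${infile}
    """
}

workflow {
    COUNT_LINES(Channel.fromPath('*.txt')) | view
}
```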
Syntax highlighting is determined by the file type, but you can select alternative syntax highlighting with the drop-down selection list.
If no Docker image is specified, Ubuntu will be used as default.
The following configuration settings will be ignored if provided as they are overridden by the system:
To specify a compute type for a CWL CommandLineTool, use the ResourceRequirement with a custom namespace.
For example, take the following ResourceRequirements:
This would result in a best-fit standard-large ICA Compute Type request for the tool.
If the specified requirements can not be met by any of the presets, the task will be rejected and failed.
FPGA requirements can not be set by means of CWL ResourceRequirements.
The Machine Profile Resource in the graphical editor will override whatever is set for requirements in the ResourceRequirement.
If no Docker image is specified, Ubuntu will be used as default. Both : and / can be used as separator.
In ICA you can provide the "override" recipes as a part of the input JSON. The following example uses CWL overrides to change the environment variable requirement at load time.
Pay close attention to uppercase and lowercase characters when creating pipelines.
Select Projects > your_project > Flow > Pipelines. From the Pipelines view, click the +Create > Nextflow > JSON based button to start creating a Nextflow pipeline.
In the Details tab, add values for the required Code (unique pipeline name) and Description fields. Nextflow Version and Storage size defaults to preassigned values.
First, we present the individual processes. Select Nextflow files > + Create and label the file split.nf. Copy and paste the following definition.
Next, select +Create and name the file sort.nf. Copy and paste the following definition.
Select +Create again and label the file merge.nf. Copy and paste the following definition.
Edit the main.nf file by navigating to the Nextflow files > main.nf tab and copying and pasting the following definition.
Here, the operators flatten and collect are used to transform the emitting channels. The flatten operator transforms a channel in such a way that every item of type Collection or Array is flattened so that each single entry is emitted separately by the resulting channel. The collect operator collects all the items emitted by a channel into a List and returns the resulting object as a sole emission.
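As a standalone illustration (not one of the tutorial files), this toy workflow shows how flatten and collect reshape a channel:

```nextflow
// Toy example: flatten splits list emissions into single items,
// collect gathers everything back into one list emission.
workflow {
    Channel.of( ['a.txt', 'b.txt'], ['c.txt'] )
        .flatten()      // emits: a.txt, b.txt, c.txt
        .collect()      // emits: [a.txt, b.txt, c.txt]
        .view()
}
```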
On the Inputform files tab, edit the inputForm.json to allow selection of a file.
Click the Simulate button (at the bottom of the text editor) to preview the launch form fields.
The onSubmit.js and onRender.js can remain with their default scripts and are just shown here for reference.
Click the Save button to save the changes.
ICA supports UTF-8 characters in file and folder names for data. Please follow the guidelines detailed below. (For more information about recommended approaches to file naming that are applicable across platforms, please refer to external file naming guidelines.)
Folders cannot be renamed after they have been created. To rename a folder, you will need to create a new folder with the desired name, move the contents from the original folder into the new one, and then delete the original folder. Please see the section on moving data for more information.
See the list of supported data formats.
Data privacy should be carefully considered when adding data in ICA, either through storage configurations (ie, AWS S3) or ICA data upload. Be aware that when adding data from cloud storage providers by creating a storage configuration, ICA will provide access to the data. Ensure the storage configuration source settings are correct and ensure uploads do not include unintended data in order to avoid unintentional privacy breaches. More guidance can be found in the security and privacy documentation.
Uploads via the UI are limited to 5 TB and no more than 100 concurrent files at a time, but for practical and performance reasons, it is recommended to use the CLI or a connector when uploading large amounts of data.
For instructions on uploading/downloading data via the CLI, see the CLI documentation.
Copying data from your own S3 storage requires additional configuration. See the storage configuration documentation.
This partial move may cause data at the destination to become unsynchronized between the object store (S3) and ICA. To resolve this, users can create a folder session on the parent folder of the destination directory via the API (create the folder session and then complete it). Ensure the Move job is already aborted before submitting the folder session create and complete requests, and wait for the session status to complete.
Note: You can create a new folder to move data to by filling in the "New folder name (optional)" field. This does NOT rename an existing folder. To rename an existing folder, please see the folder renaming guidance above.
Retrieve the file's information.
Send the updated information back to ICA.
Non-indexed folders are designed for optimal performance in situations where no file actions are needed. They serve as fast storage in situations like temporary analysis file storage where you don't need access or searches via the GUI to individual files or subfolders within the folder. Think of a non-indexed folder as a data container. You can access the container, which contains all the data, but you can not access the individual data files within the container from the GUI. As non-indexed folders contain data, they count towards your total project storage.
ICA supports running pipelines defined using Nextflow. See the tutorial for an example.
To specify a compute type for a Nextflow process, use the pod directive within each process. Set the annotation to scheduler.illumina.com/presetSize and the value to the desired compute type. A list of available compute types can be found in the compute types table. The default compute type, when this directive is not specified, is standard-small (2 CPUs and 8 GB of memory).
Often, there is a need to select the compute size for a process dynamically based on user input and other factors. The Kubernetes executor used on ICA does not use the cpu and memory directives, so instead, you can dynamically set the pod directive, as mentioned above. For example:
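A minimal sketch is shown below; the process and the boolean parameter params.large_run are hypothetical, and the preset names are the ones mentioned elsewhere on this page.

```nextflow
// Hedged sketch: pick the ICA preset size at run time from a pipeline parameter.
params.large_run = false

process SAY_HELLO {
    // The annotation value can depend on params or other dynamic values.
    pod annotation: 'scheduler.illumina.com/presetSize',
        value: params.large_run ? 'standard-large' : 'standard-small'

    output:
    stdout

    script:
    """
    echo "requested preset: ${params.large_run ? 'standard-large' : 'standard-small'}"
    """
}

workflow {
    SAY_HELLO() | view
}
```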
Additionally, it can also be specified in the Nextflow configuration file (nextflow.config). Example configuration file:
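A hedged sketch of such a nextflow.config, assuming a hypothetical process label himem marks the processes that need a larger preset:

```nextflow
// nextflow.config sketch: apply preset sizes via process selectors.
process {
    // default preset for all processes
    pod = [annotation: 'scheduler.illumina.com/presetSize', value: 'standard-small']

    // larger preset for processes labelled 'himem' (hypothetical label)
    withLabel: 'himem' {
        pod = [annotation: 'scheduler.illumina.com/presetSize', value: 'standard-large']
    }
}
```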
Inputs are specified via the XML-based or JSON-based input form. The specified code in the XML will correspond to the field in the params object that is available in the workflow. Refer to the tutorial for an example.
Outputs for Nextflow pipelines are uploaded from the out directory in the attached shared filesystem. The publishDir directive can be used to symlink (recommended), copy or move data to the correct folder. Data will be uploaded to the ICA project after the pipeline execution completes.
Use "" instead of "copy" in the publishDir
directive. Symlinking creates a link to the original file rather than copying it, which doesn’t consume additional disk space. This can prevent the issue of silent file upload failures due to disk space limitations.
Use Nextflow 22.04.0 or later and enable the "failOnError" publishDir option. This option ensures that the workflow will fail and provide an error message if there's an issue with publishing files, rather than completing silently without all expected outputs.
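Putting both recommendations together, a publishing process might look like the following sketch; the process and file names are illustrative.

```nextflow
// Sketch: publish to the ICA "out" directory via symlinks and fail if publishing breaks.
process WRITE_REPORT {
    publishDir 'out', mode: 'symlink', failOnError: true

    output:
    path 'report.txt'

    script:
    """
    echo "analysis finished" > report.txt
    """
}

workflow {
    WRITE_REPORT()
}
```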
During execution, the Nextflow pipeline runner determines the environment settings based on values passed via the command line or via a configuration file (see the Nextflow configuration documentation). When creating a Nextflow pipeline, use the nextflow.config tab in the UI (or API) to specify a Nextflow configuration file to be used when launching the pipeline.
ICA supports running pipelines defined using the Common Workflow Language (CWL).
Refer to the compute types table for available compute types and sizes.
The ICA Compute Type will be determined automatically based on coresMin/coresMax (CPU) and ramMin/ramMax (Memory) values using a "best fit" strategy to meet the minimum specified requirements (refer to the compute types table).
ICA supports overriding workflow requirements at load time using the Command Line Interface (CLI) with JSON input. Please refer to the CWL documentation for more details on the CWL overrides feature.
Let's create a pipeline with a JSON input form.
To add filters, select the funnel/filter symbol at the top right, next to the search field.
To change which columns are displayed, select the three columns symbol and select which columns should be shown.
You can keep track of which files are externally controlled and which are ICA-managed by means of the “managed by” column.
ServerURL: see the browser address bar
projectID: at YourProject > Details > URN > urn:ilmn:ica:project:ProjectID#MyProject
FolderID: at YourProject > Data > folder > folder details > ID
AnalysisID: at YourProject > Flow > Analyses > YourAnalysis > ID
Within a project
Contributor rights
Upload and Download rights
Contributor rights
Upload and Download rights
Between different projects
Download rights
Viewer rights
Upload rights
Contributor rights
Within a project
No linked data
No partial data
No archived data
No Linked data
Between different projects
Data sharing enabled
No partial data
No archived data
Within the same region
No linked data
Within the same region
Replace
Overwrites the existing data. Folders will copy their data in an existing folder with existing files. Existing files will be replaced when a file with the same name is copied and new files will be added. The remaining files in the target folder will remain unchanged.
Don't copy
The original files are kept. If you selected a folder, files that do not yet exist in the destination folder are added to it. Files that already exist at the destination are not copied over and the originals are kept.
Keep both
Files have a number appended to them if they already exist. If you copy folders, the folders are merged, with new files added to the destination folder and original files kept. New files with the same name get copied over into the folder with a number appended.
Within a project
Contributor rights
Contributor rights
Between different projects
Download rights
Contributor rights
Upload rights
Viewer rights
Within a project
No linked data
No partial data
No archived data
No Linked data
Between different projects
Data sharing enabled
Data owned by user's tenant
No linked data
No partial data
No archived data
No externally managed projects
Within the same region
No linked data
Within same region
Creation
Yes
You can create non-indexed folders at Projects > your_project > Data > Manage > Create non-indexed folder. or with the /api/projects/{projectId}/data:createNonIndexedFolder
endpoint
Deletion
Yes
You can delete non-indexed folders by selecting them at Projects > your_project > Data > select the folder > Manage > Delete.
or with the /api/projects/{projectId}/data/{dataId}:delete
endpoint
Uploading Data
API Bench Analysis
Use non-indexed folders as normal folders for Analysis runs and bench. Different methods are available with the API such as creating temporary credentials to upload data to S3 or using /api/projects/{projectId}/data:createFileWithUploadUrl
Downloading Data
Yes
Use non-indexed folders as normal folders for Analysis runs and bench. Use temporary credentials to list and download data with the API.
Analysis Input/Output
Yes
Non-indexed files can be used as input for an analysis and the non-indexed folder can be used as output location. You will not be able to view the contents of the input and output in the analysis details screen.
Bench
Yes
Non-indexed folders can be used in Bench and the output from Bench can be written to non-indexed folders. Non-indexed folders are accessible across Bench workspaces within a project.
Viewing
No
The folder is a single object; you cannot view its contents.
Linking
No
You cannot see non-indexed folder contents.
Copying
No
Prohibited to prevent storage issues.
Moving
No
Prohibited to prevent storage issues.
Managing tags
No
You cannot see non-indexed folder contents.
Managing format
No
You cannot see non-indexed folder contents.
Use as Reference Data
No
You cannot see non-indexed folder contents.
Nextflow version
20.10.0 (deprecated *), 22.04.3, 24.10.2 (Experimental)
Executor
Kubernetes
GUI
Select the Nextflow version at Projects > your_project > Flow > Pipelines > your_pipeline > Details tab.
API
Select the Nextflow version by setting it in the optional field "pipelineLanguageVersionId".
When not set, a default Nextflow version will be used for the pipeline.
Pipelines defined using the "Code" mode require an XML or JSON-based input form to define the fields shown on the launch view in the user interface (UI).
To create a JSON-based Nextflow (or CWL) pipeline, go to Projects > your_project > Flow > Pipelines > +Create > Nextflow (or CWL) > JSON-based.
Three files, located on the inputform files tab, work together for evaluating and presenting JSON-based input.
inputForm.json contains the actual input form which is rendered when starting the pipeline run.
onRender.js is triggered when a value is changed.
onSubmit.js is triggered when starting a pipeline via the GUI or API.
Use + Create to add additional files and Simulate to test your inputForms.
Script execution supports cross-field validation of values, hiding fields, making them required, and similar behavior based on value changes.
The JSON schema allows you to define the input parameters. See the inputForm.json page for syntax details.
textbox
Corresponds to stringType in xml.
checkbox
A checkbox that supports the option of being required, so it can serve as an active consent feature. Corresponds to booleanType in xml.
radio
A radio button group to select one from a list of choices. The values to choose from must be unique.
select
A dropdown selection to select one from a list of choices. This can be used for both single-level lists and tree-based lists.
number
The value is of Number type in javascript and Double type in java. (corresponds to doubleType in xml).
integer
Corresponds to java Integer.
data
Data such as files.
section
For splitting up fields, to give structure. Rendered as subtitles. No values are to be assigned to these fields.
text
To display informational messages. No values are to be assigned to these fields.
fieldgroup
Can contain parameters or other groups. Allows repeating sets of parameters, for instance when a father|mother|child choice needs to be linked to each file input. If you want the same set of elements to appear multiple times in your form, combine them into a fieldgroup.
These attributes can be used to configure all parameter types.
label
The display label for this parameter. Optional but recommended, id will be used if missing.
minValues
The minimum number of values that must be present. Default when not set is 0. Set to >=1 to make the field required.
maxValues
The maximum number of values that may be present. Default when not set is 1.
minMaxValuesMessage
The error message displayed when minValues or maxValues is not adhered to. When not set, a default message is generated.
helpText
A helper text about the parameter. Will be displayed in smaller font with the parameter.
placeHolderText
An optional short hint (a word or short phrase) to aid the user when the field has no value.
value
The value of the parameter. Can be considered the default value.
minLength
Only applied on type="textbox". Value is a positive integer.
maxLength
Only applied on type="textbox". Value is a positive integer.
min
Minimal allowed value for 'integer' and 'number' type.
for 'integer' type fields the minimal and maximal values are -100000000000000000 and 100000000000000000.
for 'number' type fields the max precision is 15 significant digits and the exponent needs to be between -300 and +300.
max
Maximal allowed value for 'integer' and 'number' type.
for 'integer' type fields the minimal and maximal values are -100000000000000000 and 100000000000000000.
for 'number' type fields the max precision is 15 significant digits and the exponent needs to be between -300 and +300.
choices
A list of choices, each with a "value", "text" (the label), "selected" (only one true supported), and "disabled". "parent" can be used to build hierarchical choice trees. "availableWhen" can be used for conditional presence of the choice based on values of other fields. Parent and value must be unique; you cannot use the same value for both.
fields
The list of sub fields for type fieldgroup.
dataFilter
For defining the filtering when type is 'data'. nameFilter, dataFormat and dataType are additional properties.
regex
The regex pattern the value must adhere to. Only applied on type="textbox".
regexErrorMessage
The optional error message when the value does not adhere to the "regex". A default message will be used if this parameter is not present. It is highly recommended to set this as the default message will show the regex which is typically very technical.
hidden
Makes this parameter hidden. Can be made visible later in onRender.js or can be used to set hardcoded values of which the user should be aware.
disabled
Shows the parameter but makes editing it impossible. The value can still be altered by onRender.js.
emptyValuesAllowed
When maxValues is 1 or not set and emptyValuesAllowed is true, the values may contain null entries. Default is false.
updateRenderOnChange
When true, the onRender javascript function is triggered each time the user changes the value of this field. Default is false.
Streamable inputs
Adding "streamable":true
to an input field of type "data" makes it a streamable input.
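To make the attributes above concrete, here is a minimal sketch of what an inputForm.json could look like. The field ids, the top-level wrapper, and the dataFilter value are illustrative assumptions; consult the inputForm.json page for the authoritative syntax.

```json
{
  "fields": [
    {
      "id": "sample_id",
      "type": "textbox",
      "label": "Sample ID",
      "minValues": 1,
      "regex": "^[A-Za-z0-9_-]+$",
      "regexErrorMessage": "Only letters, numbers, '-' and '_' are allowed."
    },
    {
      "id": "reads",
      "type": "data",
      "label": "FASTQ files",
      "minValues": 1,
      "maxValues": 2,
      "streamable": true,
      "dataFilter": { "nameFilter": "*.fastq.gz" },
      "updateRenderOnChange": true
    },
    {
      "id": "consent",
      "type": "checkbox",
      "label": "I confirm this data may be processed",
      "minValues": 1
    }
  ]
}
```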
The onSubmit.js JavaScript function receives an input object which holds information about the chosen values of the input form, the pipeline, and the pipeline execution request parameters. This function is triggered not only when submitting a new pipeline execution request in the user interface, but also when submitting one through the REST API.
settings
The value of the setting fields. Corresponds to settingValues in the onRender.js. This is a map with field id as key and an array of field values as value. For convenience, values of single-value fields are present as the individual value and not as an array of length 1. In case of fieldGroups, the value can be multiple levels of arrays.
settingValues
To maximize the opportunity for reusing code between onRender and onSubmit, the 'settings' are also exposed as settingValues, like in the onRender input.
pipeline
Info about the pipeline: code, tenant, version, and description are all available in the pipeline object as string.
analysis
Info about this run: userReference, userName, and userTenant are all available in the analysis object as string.
storageSize
The storage size as chosen by the user. This will initially be null. StorageSize is an object containing an 'id' and 'name' property.
storageSizeOptions
The list of storage sizes available to the user when creating an analysis. Is a list of StorageSize objects containing an 'id' and 'name' property.
settings
The value of the setting fields. This allows modifying the values, applying defaults, or deriving values from the pipeline or analysis objects in the input. When settings are not present in the onSubmit return value object, they are assumed to be unmodified.
validationErrors
A list of AnalysisError messages representing validation errors. Submitting a pipeline execution request is not possible while there are still validation errors.
fieldId / FieldId
The field which has an erroneous value. When not present, a general error/warning is displayed. To display an error on the storage size, use the storageSize fieldId.
index / Index
The 0-starting index of the value which is incorrect. Use this when a particular value of a multivalue field is not correct. When not present, the entire field is marked as erroneous. The value can also be an array of indexes for use with fieldgroups. For instance, when the 3rd field of the 2nd instance of a fieldgroup is erroneous, a value of [ 1 , 2 ] is used.
message / Message
The error/warning message to display.
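Purely as an illustration, an onSubmit.js along these lines could validate a value and report an error. The function name, the field id 'sample_id', and anything not described above are assumptions, not the exact ICA contract.

```javascript
// Hypothetical onSubmit.js sketch; "sample_id" is an illustrative field id.
function onSubmit(input) {
    const errors = [];

    // Single-value fields arrive as the individual value, not an array of length 1.
    const sampleId = input.settings['sample_id'];
    if (typeof sampleId === 'string' && sampleId.trim().length === 0) {
        errors.push({ fieldId: 'sample_id', message: 'Sample ID must not be blank.' });
    }

    // Settings not returned are assumed to be unmodified, so only errors are reported here.
    return { validationErrors: errors };
}
```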
Receives an input object which contains information about the current state of the input form, the chosen values, and the field value change that triggered the onRender call. It also contains pipeline information. Changed objects are present in the onRender return value object; any object not present is considered unmodified. Changing the storage size in the start analysis screen triggers an onRender execution with storageSize as the changed field.
context
"Initial"/"FieldChanged"/"Edited".
Initial is the value when first displaying the form when a user opens the start run screen.
The value is FieldChanged when a field with 'updateRenderOnChange'=true is changed by the user.
Edited (not yet supported in ICA) is used when a form is displayed again later; this is intended for draft runs or when editing the form during reruns.
changedFieldId
The id of the field that changed and which triggered this onRender call. context will be FieldChanged. When the storage size is changed, the fieldId will be storageSize.
analysisSettings
The input form json as saved in the pipeline. This is the original json, without changes.
currentAnalysisSettings
The current input form json as rendered to the user. This can contain already applied changes from earlier onRender passes. Null in the first call, when context is Initial.
settingValues
The current value of all settings fields. This is a map with field id as key and an array of field values as value for multivalue fields. For convenience, values of single-value fields are present as the individual value and not as an array of length 1. In case of fieldGroups, the value can be multiple levels of arrays.
pipeline
Information about the pipeline: code, tenant, version, and description are all available in the pipeline object as string.
analysis
Information about this run: userReference, userName, and userTenant are all available in the analysis object as string.
storageSize
The storage size as chosen by the user. This will initially be null. StorageSize is an object containing an 'id' and 'name' property.
storageSizeOptions
The list of storage sizes available to the user when creating an analysis. Is a list of StorageSize objects containing an 'id' and 'name' property.
analysisSettings
The input form json with potentially applied changes. The discovered changes will be applied in the UI.
settingValues
The current, potentially altered map of all setting values. These will be updated in the UI.
validationErrors
A list of RenderMessages representing validation errors. Submitting a pipeline execution request is not possible while there are still validation errors.
validationWarnings
A list of RenderMessages representing validation warnings. A user may choose to ignore these validation warnings and start the pipeline execution request.
storageSize
The suitable value for storageSize. Must be one of the options of input.storageSizeOptions. When absent or null, it is ignored.
Validation errors and validation warnings can use 'storageSize' as fieldId to make an error appear on the storage size field. 'storageSize' is also the value of changedFieldId when the user alters the chosen storage size.
This is the object used for representing validation errors and warnings. The attributes can be used with first letter lowercase (consistent with the input object attributes) or uppercase.
fieldId / FieldId
The field which has an erroneous value. When not present, a general error/warning is displayed. To display an error on the storage size, use the storageSize fieldId.
index / Index
The 0-starting index of the value which is incorrect. Use this when a particular value of a multivalue field is not correct. When not present, the entire field is marked as erroneous. The value can also be an array of indexes for use with fieldgroups. For instance, when the 3rd field of the 2nd instance of a fieldgroup is erroneous, a value of [ 1 , 2 ] is used.
message / Message
The error/warning message to display.
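Purely as a sketch, an onRender.js could react to a changed field and propose a storage size. The function name, the field id 'reads', and the storage size name 'Large' are assumptions made for illustration.

```javascript
// Hypothetical onRender.js sketch; "reads" and the option name "Large" are illustrative.
function onRender(input) {
    const result = {};

    if (input.context === 'FieldChanged' && input.changedFieldId === 'reads') {
        const reads = input.settingValues['reads'];
        const warnings = [];

        if (Array.isArray(reads) && reads.length > 1) {
            warnings.push({ fieldId: 'reads', message: 'Multiple files selected; consider a larger storage size.' });

            // storageSize must be one of input.storageSizeOptions ({ id, name } objects).
            const large = input.storageSizeOptions.find(option => option.name === 'Large');
            if (large) {
                result.storageSize = large;
            }
        }
        result.validationWarnings = warnings;
    }

    // Objects not present in the return value are considered unmodified.
    return result;
}
```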
An Analysis is the execution of a pipeline.
You can start an analysis from both the dedicated analysis screen or from the actual pipeline.
Navigate to Projects > Your_Project > Flow > Analyses.
Select Start.
Select a single Pipeline.
Configure the analysis settings.
Select Start Analysis.
Refresh to see the analysis status. See lifecycle for more information on statuses.
If for some reason, you want to end the analysis before it can complete, select Projects > Your_Project > Flow > Analyses > Manage > Abort. Refresh to see the status update.
Navigate to Projects > <Your_Project> > Flow > Pipelines
Select the pipeline you want to run or open the pipeline details of the pipeline which you want to run.
Select Start Analysis.
Configure analysis settings.
Select Start Analysis.
View the analysis status on the Analyses page. See lifecycle for more information on statuses.
If for some reason, you want to end the analysis before it can complete, select Manage > Abort on the Analyses page.
You can abort a running analysis from either the analysis overview (Projects > your_project > Flow > Analyses > your_analysis > Manage > Abort) or from the analysis details (Projects > your_project > Flow > Analyses > your_analysis > Details tab > Abort).
Once an analysis has been executed, you can rerun it with the same settings or choose to modify the parameters when rerunning. Modifying the parameters is possible on a per-analysis basis. When selecting multiple analyses at once, they will be executed with the original parameters. Draft pipelines are subject to updates and thus can result in a different outcome when rerunning. ICA will display a warning message to inform you of this when you try to rerun an analysis based on a draft pipeline.
When rerunning an analysis, the user reference will be the original user reference (up to 231 characters), followed by _rerun_yyyy-MM-dd_HHmmss.
When there is an XML configuration change on a pipeline for which you want to rerun an analysis, ICA will display a warning and not fill out the parameters, as it cannot guarantee their validity for the new XML.
Some restrictions apply when trying to rerun an analysis.
Analyses using external data
Allowed
-
Analyses using mount paths on input data
Allowed
-
Analyses using user-provided input json
Allowed
-
Analyses using advanced output mappings
-
-
Analyses with draft pipeline
Warn
Warn
Analyses with XML configuration change
Warn
Warn
To rerun one or more analyses with the same settings:
Navigate to Projects > Your_Project > Flow > Analyses.
In the overview screen, select one or more analyses.
Select Manage > Rerun. The analyses will now be executed with the same parameters as their original run.
To rerun a single analysis with modified parameters:
Navigate to Projects > Your_Project > Flow > Analyses.
In the overview screen, open the details of the analysis you want to rerun by clicking on the analysis user reference.
Select Rerun (at the top right).
Update the parameters you want to change.
Select Start Analysis. The analysis will now be executed with the updated parameters.
Requested
The request to start the Analysis is being processed
No
Queued
Analysis has been queued
No
Initializing
Initializing environment and performing validations for Analysis
No
Preparing Inputs
Downloading inputs for Analysis
No
In Progress
Analysis execution is in progress
No
Generating outputs
Transferring the Analysis results
No
Aborting
Analysis has been requested to be aborted
No
Aborted
Analysis has been aborted
Yes
Failed
Analysis has finished with error
Yes
Succeeded
Analysis has finished with success
Yes
When an analysis is started, the availability of resources may impact the start time of the pipeline or specific steps after execution has started. Analyses are subject to delay when the system is under high load and the availability of resources is limited.
During analysis start, ICA runs a verification on the input files to see if they are available. When it encounters files that have not completed their upload or transfer, it will report "Data found for parameter [parameter_name], but status is Partial instead of Available". Wait for the file to be available and restart the analysis.
During the execution of an analysis, logs are produced for each process involved in the analysis lifecycle. In the analysis details view, the Steps tab is used to view the steps in near real time as they are produced by the running processes. A grid layout is used for analyses with more than 50 steps and a tiled view for analyses with 50 steps or less, though you can also use the grid layout for those by means of the tile/grid button at the top right of the analysis log tab.
There are system processes involved in the lifecycle of all analyses (i.e., downloading inputs, uploading outputs, etc.) and there are processes which are pipeline-specific, such as processes which execute the pipeline steps. The table below describes the system processes. You can choose to display or hide these system processes with the Show technical steps option.
Setup Environment
Validate analysis execution environment is prepared
Run Monitor
Monitor resource usage for billing and reporting
Prepare Input Data
Download and mount input data to the shared file system
Pipeline Runner
Parent process to execute the pipeline definition
Finalize Output Data
Upload Output Data
Additional log entries will show for the processes which execute the steps defined in the pipeline.
Each process shows as a distinct entry in the steps view with a Queue Date, Start Date, and End Date.
Queue Date
The time when the process is submitted to the process scheduler for execution
Start Date
The time when the process has started execution
End Date
The time when the process has stopped execution
The time between the Start Date and the End Date is used to calculate the duration. The time of the duration is used to calculate the usage-based cost for the analysis. Because this is an active calculation, sorting on this field is not supported.
Each log entry in the Steps view contains a checkbox to view the stdout and stderr log files for the process. Clicking a checkbox adds the log as a tab to the log viewer where the log text is displayed and made available for download.
To see the price of an analysis in iCredits, look at Projects > your_project > Flow > Analyses > your_analysis > Details tab. The pricing section will show you the entitlement bundle, storage detail and price in iCredits.
In the analysis output folder, the ica_logs subfolder will contain the stdout and stderr files.
If you delete these files, no log information will be available on the analysis details > Steps tab.
Logs can also be streamed using websocket client tooling. The API to retrieve analysis step details returns websocket URLs for each step to stream the logs from stdout/stderr during the step's execution. Upon completion, the websocket URL is no longer available.
Currently, this feature is only available when launching analyses via the API.
Currently, only FOLDER type output mappings are supported
By default, analysis outputs are directed to a new folder within the project where the analysis is launched. Analysis output mappings may be specified to redirect outputs to user-specified locations consisting of project and path. An output mapping consists of:
the source path on the local disk of the analysis execution environment, relative to the working directory
the data type, either FILE or FOLDER
the target project ID to direct outputs to; analysis launcher must have contributor access to the project
the target path relative to the root of the project data to write the outputs
If the output directory already exists, any existing contents with the same filenames as those output from the pipeline will be overwritten by the new analysis
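Put together, a single output mapping could conceptually look like the sketch below. The JSON field names are illustrative only and do not represent the exact API schema.

```json
{
  "sourcePath": "out/results",
  "type": "FOLDER",
  "targetProjectId": "<target-project-id>",
  "targetPath": "/analyses/my-run/results/"
}
```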
You can jump from the Analysis Details output section to the individual files and folders by opening the detail view (projects > your_project > Flow > Analyses > your_analysis > Details tab > Output files section > your_output_file) and selecting open in data.
You can add and remove tags from your analyses.
Navigate to Projects > Your_Project > Flow > Analyses.
Select the analyses whose tags you want to change.
Select Manage > Manage tags.
Edit the user tags, reference data tags (if applicable) and technical tags.
Select Save to confirm the changes.
Both system tags and custom tags exist. User tags are custom tags which you set to help identify and process information, while technical tags are set by the system for processing. Run-in and run-out tags are set on data to identify which analyses use the data. Connector tags determine data entry methods, and reference data tags identify where data is used as reference data.
If you want to share a link to an analysis, you can copy and paste the URL from your browser when you have the analysis open. The syntax of the analysis link will be <hostURL>/ica/link/project/<projectUUID>/analysis/<analysisUUID>. Likewise, workflow sessions will use the syntax <hostURL>/ica/link/project/<projectUUID>/workflowSession/<workflowsessionUUID>. To prevent third parties from accessing data via the link when it is shared or forwarded, ICA will verify the access rights of every user when they open the link.
Input for analysis is limited to a total of 50,000 files (including multiple copies of the same file). You can have up to 50 concurrent analyses running per tenant. Additional analyses will be queued and scheduled when currently running analyses complete and free up positions.
When your analysis fails, open the analysis details view (Projects > your_project > Flow > Analyses > your_analysis) and select display failed steps. This gives you the steps view filtered on those steps that had non-zero exit codes. If there is only one failed step which has log files, the stderr of that step will be displayed.
Exit code 55 indicates analysis failure due to an external event such as spot termination or node draining. Retry the analysis.
Data Catalogues provide views on data from Illumina hardware and processes (Instruments, Cloud software, Informatics software and Assays) so that this data can be distributed to different applications. This data consists of read-only tables to prevent updates by the applications accessing it. Access to data catalogues is included with professional and enterprise subscriptions.
Project-level views
ICA_PIPELINE_ANALYSES_VIEW (Lists project-specific ICA pipeline analysis data)
ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (project-specific quality control metrics)
Tenant-level views
ICA_PIPELINE_ANALYSES_VIEW (Lists ICA pipeline analysis data)
CLARITY_SEQUENCINGRUN_VIEW_tenant (sequencing run data coming from the lab workflow software)
CLARITY_SAMPLE_VIEW_tenant (sample data coming from the lab workflow software)
CLARITY_LIBRARY_VIEW_tenant (library data coming from the lab workflow software)
CLARITY_EVENT_VIEW_tenant (event data coming from the lab workflow software)
ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (quality control metrics)
DRAGEN metrics will only have content when DRAGEN pipelines have been executed.
Analysis views will only have content when analyses have been executed.
Views containing Clarity data will only have content if you have a Clarity LIMS instance with minimum version 6.0 and the Product Analytics service installed and configured. Please see the Clarity LIMS documentation for more information.
Members of a project who have both Base contributor and project contributor or administrator rights, and who belong to the same tenant as the project, can add views from a Catalogue. Members with the same rights who do not belong to the same tenant can only remove catalogue views from a project. Therefore, if you are invited to collaborate on a project but belong to a different tenant, you can remove catalogue views, but you cannot add them again.
To add Catalogue data,
Go to Projects > your_project > Base > Tables.
Select Add table > Import from Catalogue.
A list of available views will be displayed. (Note that views which are already part of your project are not listed)
Select the table you want to add and choose +Select.
Catalogue data will have View as type, the same as tables which are linked from other projects.
To delete Catalogue data,
Go to Projects > your_project > Base > Tables.
Select the table you want to delete and choose Delete.
A warning will be presented to confirm your choice. Once deleted, you can add the Catalogue data again if needed.
View: The name of the Catalogue table.
Description: An explanation of which data is contained in the view.
Category: The identification of the source system which provided the data.
Tenant/project. Appended to the view name as _tenant or _project. Determines if the data is visible for all projects within the same tenant or only within the project. Only the tenant administrator can see the non-project views.
In the Projects > your_project > Base > Tables view, double-click the Catalogue table to see the details. For an overview of the available actions and details, see Tables.
In this section, we provide examples of querying selected views from the Base UI, starting with ICA_PIPELINE_ANALYSES_VIEW (project view). This table includes the following columns: TENANT_UUID, TENANT_ID, TENANT_NAME, PROJECT_UUID, PROJECT_ID, PROJECT_NAME, USER_UUID, USER_NAME, and PIPELINE_ANALYSIS_DATA. While the first eight columns contain straightforward data types (each holding a single value), the PIPELINE_ANALYSIS_DATA column is of type VARIANT, which can store multiple values in a nested structure. In SQL queries, this column returns data as a JSON object. To filter specific entries within this complex data structure, a combination of JSON functions and conditional logic in SQL queries is essential.
Since Snowflake offers robust JSON processing capabilities, the FLATTEN function can be utilized to expand JSON arrays within the PIPELINE_ANALYSIS_DATA column, allowing for the filtering of entries based on specific criteria. It's important to note that each entry in the JSON array becomes a separate row once flattened. Snowflake aligns fields outside of this FLATTEN operation accordingly, i.e. the record USER_ID in the SQL query below is "recycled".
The following query extracts
USER_NAME directly from the ICA_PIPELINE_ANALYSES_VIEW_project table.
PIPELINE_ANALYSIS_DATA:reference and PIPELINE_ANALYSIS_DATA:price. These are direct accesses into the JSON object stored in the PIPELINE_ANALYSIS_DATA column. They extract specific values from the JSON object.
Entries from the array 'steps' in the JSON object. The query uses LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) to expand the steps array within the PIPELINE_ANALYSIS_DATA JSON object into individual rows. For each of these rows, it selects various elements (like bpeResourceLifeCycle, bpeResourcePresetSize, etc.) from the JSON.
Furthermore, the query filters the rows based on the status being 'FAILED' and the stepId not containing the word 'Workflow': it allows the user to find steps which failed.
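A sketch of such a query is shown below. The JSON paths used inside PIPELINE_ANALYSIS_DATA beyond those named above, and the casts, are assumptions made for illustration.

```sql
-- Illustrative sketch: list failed, non-workflow steps per analysis.
SELECT
    USER_NAME,
    PIPELINE_ANALYSIS_DATA:reference::string  AS analysis_reference,
    PIPELINE_ANALYSIS_DATA:price              AS price,
    step.value:stepId::string                 AS step_id,
    step.value:bpeResourcePresetSize::string  AS preset_size,
    step.value:bpeResourceLifeCycle::string   AS resource_lifecycle
FROM ICA_PIPELINE_ANALYSES_VIEW_project,
     LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) AS step
WHERE step.value:status::string = 'FAILED'
  AND NOT CONTAINS(step.value:stepId::string, 'Workflow');
```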
Now let's have a look at the DRAGEN_METRICS_VIEW_project view. Each DRAGEN pipeline on ICA creates multiple metrics files, e.g. SAMPLE.mapping_metrics.csv, SAMPLE.wgs_coverage_metrics.csv, etc. for the DRAGEN WGS Germline pipeline. Each of these files is represented by a row in the DRAGEN_METRICS_VIEW_project table with columns ANALYSIS_ID, ANALYSIS_UUID, PIPELINE_ID, PIPELINE_UUID, PIPELINE_NAME, TENANT_ID, TENANT_UUID, TENANT_NAME, PROJECT_ID, PROJECT_UUID, PROJECT_NAME, FOLDER, FILE_NAME, METADATA, and ANALYSIS_DATA. The ANALYSIS_DATA column contains the content of the file FILE_NAME as an array of JSON objects. Similarly to the previous query, we will use the FLATTEN command. The following query extracts
Sample name from the file names.
Two metrics 'Aligned bases in genome' and 'Aligned bases' for each sample and the corresponding values.
The query looks for files SAMPLE.wgs_coverage_metrics.csv only and sorts based on the sample name:
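A sketch of this query, assuming each JSON object in ANALYSIS_DATA exposes a metric "name" and "value" key (these key names are assumptions):

```sql
-- Illustrative sketch: extract two coverage metrics per sample.
SELECT
    REPLACE(FILE_NAME, '.wgs_coverage_metrics.csv', '') AS sample_name,
    metric.value:name::string                           AS metric_name,
    metric.value:value                                  AS metric_value
FROM DRAGEN_METRICS_VIEW_project,
     LATERAL FLATTEN(input => ANALYSIS_DATA) AS metric
WHERE FILE_NAME LIKE '%.wgs_coverage_metrics.csv'
  AND metric.value:name::string IN ('Aligned bases', 'Aligned bases in genome')
ORDER BY sample_name;
```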
Lastly, you can combine these views (or rather intermediate results derived from these views) using the WITH and JOIN commands. The SQL snippet below demonstrates how to join two intermediate results referred to as 'flattened_dragen_scrna' and 'pipeline_table'. The query:
Selects two metrics ('Invalid barcode read' and 'Passing cells') associated with single-cell RNA analysis from records where the FILE_NAME ends with 'scRNA.metrics.csv', and then stores these metrics in a temporary table named 'flattened_dragen_scrna'.
Retrieves metadata related to all scRNA analyses by filtering on the pipeline ID from the 'ICA_PIPELINE_ANALYSES_VIEW_project' view and stores this information in another temporary table named 'pipeline_table'.
Joins the two temporary tables using the JOIN operator, specifying the join condition with the ON operator.
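The shape of such a query could resemble the sketch below. The metric key names, the analysis id and pipeline id paths inside PIPELINE_ANALYSIS_DATA, and the join key are all assumptions made for illustration.

```sql
-- Illustrative sketch: join scRNA metrics with pipeline analysis metadata.
WITH flattened_dragen_scrna AS (
    SELECT
        ANALYSIS_UUID,
        metric.value:name::string AS metric_name,
        metric.value:value        AS metric_value
    FROM DRAGEN_METRICS_VIEW_project,
         LATERAL FLATTEN(input => ANALYSIS_DATA) AS metric
    WHERE FILE_NAME LIKE '%scRNA.metrics.csv'
      AND metric.value:name::string IN ('Invalid barcode read', 'Passing cells')
),
pipeline_table AS (
    SELECT
        USER_NAME,
        PIPELINE_ANALYSIS_DATA:id::string        AS analysis_uuid,
        PIPELINE_ANALYSIS_DATA:reference::string AS analysis_reference
    FROM ICA_PIPELINE_ANALYSES_VIEW_project
    WHERE PIPELINE_ANALYSIS_DATA:pipelineId::string = '<your_scrna_pipeline_id>'
)
SELECT p.USER_NAME, p.analysis_reference, f.metric_name, f.metric_value
FROM flattened_dragen_scrna f
JOIN pipeline_table p ON f.ANALYSIS_UUID = p.analysis_uuid;
```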
You can use ICA_PIPELINE_ANALYSES_VIEW to obtain the costs of individual steps of an analysis. Using the following SQL snippet you can retrieve the costs of individual steps for every analysis run in the past week.
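A sketch of such a snippet, assuming each step entry carries "price" and "startDate" keys (both key names are assumptions):

```sql
-- Illustrative sketch: per-step cost for analyses started in the last 7 days.
SELECT
    USER_NAME,
    PIPELINE_ANALYSIS_DATA:reference::string AS analysis_reference,
    step.value:stepId::string                AS step_id,
    step.value:price                         AS step_price
FROM ICA_PIPELINE_ANALYSES_VIEW_project,
     LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) AS step
WHERE TO_TIMESTAMP(step.value:startDate::string) >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY analysis_reference, step_id;
```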
Data Catalogue views cannot be shared as part of a Bundle.
Data size is not shown for views because views are a subset of data.
By removing Base from a project, the Data Catalogue will also be removed from that project.
As tenant-level Catalogue views can contain sensitive data, it is best to save this (filtered) data to a new table and share that table instead of sharing the entire view as part of a project. To do so, add your view to a separate project and run a query on the data at Projects > your_project > Base > Query > New Query. When the query completes, you can export the result as a new table. This ensures no new data will be added on subsequent runs.
Developing on the cloud incurs inherent runtime costs due to compute and storage used to execute workflows. Here are a few tips that can facilitate development.
Leverage the cross-platform nature of these workflow languages. Both CWL and Nextflow can be run locally in addition to on ICA. When possible, testing should be performed locally before attempting to run in the cloud. For Nextflow, configuration files can be utilized to specify settings to be used either locally or on ICA. An example of advanced usage of a config is applying the scratch directive to a set of process names (or labels) so that they use the higher-performance local scratch storage attached to an instance instead of the shared network disk, as sketched below.
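A minimal nextflow.config sketch of this idea; the label name is an assumption and should be adapted to your pipeline.

```groovy
// Illustrative only: route processes carrying a given label to local scratch storage.
process {
    withLabel: 'high_io' {
        scratch = true   // use node-local scratch instead of the shared work directory
    }
}
```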
When trying to test on the cloud, it's oftentimes beneficial to create scripts to automate the deployment and launching / monitoring process. This can be performed either using the ICA CLI or by creating your own scripts integrating with the REST API.
For scenarios in which instances are terminated prematurely (for example, while using spot instances) without warning, you can implement scripts like the following to retry the job a certain number of times. Adding the following script to 'nextflow.config' enables five retries for each job, with increasing delays between each try.
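A minimal sketch of such a retry block, using an exponential backoff implemented with a sleep inside the errorStrategy closure (delay values are illustrative):

```groovy
// Illustrative only: retry each failed task up to 5 times with an increasing delay.
process {
    errorStrategy = { sleep(Math.pow(2, task.attempt) * 30000 as long); return 'retry' }
    maxRetries    = 5
}
```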
Note: Adding the retry script where it is not needed might introduce additional delays.
When hardening a Nextflow pipeline to handle resource shortages (for example exit code 2147483647), an immediate retry will in most circumstances fail because the resources have not yet been made available. It is best practice to use a dynamic retry with backoff, which has an increasing delay between attempts, allowing the system time to provide the necessary resources.
When publishing your Nextflow pipeline, make sure you have defined a container such as 'public.ecr.aws/lts/ubuntu:22.04' and are not using the default container 'ubuntu:latest'.
To limit potential costs, there is a timeout of 96 hours: if the analysis does not complete within four days, it will go to a 'Failed' state. This time begins to count as soon as the input data is being downloaded. This takes place during the ICA 'Requested' step of the analysis, before going to 'In Progress'. In case parallel tasks are executed, running time is counted once. As an example, let's assume the initial period before being picked up for execution is 10 minutes and consists of the request, queueing and initializing. Then, the data download takes 20 minutes. Next, a task runs on a single node for 25 minutes, followed by 10 minutes of queue time. Finally, three tasks execute simultaneously, each of them taking 25, 28, and 30 minutes, respectively. Upon completion, this is followed by uploading the outputs for one minute. The overall analysis time is then 20 + 25 + 10 + 30 (as the longest task out of three) + 1 = 86 minutes:
Analysis task
request
queued
initializing
input download
single task
queue
parallel tasks
generating outputs
completed
96 hour limit
1m (not counted)
7m (not counted)
2m (not counted)
20m
25m
10m
30m
1m
-
Status in ICA
status requested
status queued
status initializing
status preparing inputs
status in progress
status in progress
status in progress
status generating outputs
status succeeded
If there are no available resources or your project priority is low, the time before download commences will be substantially longer.
By default, Nextflow will not generate the trace report. If you want to enable generating the report, add the section below to your userNextflow.config file.
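A minimal sketch of such a section, using Nextflow's standard trace scope (the file name and field list are illustrative):

```groovy
// Illustrative only: enable the Nextflow trace report.
trace {
    enabled = true
    file    = 'trace.txt'
    fields  = 'task_id,name,status,exit,realtime,%cpu,rss'
}
```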
All tables created within Base are gathered on the Projects > your_project > Base > Tables page. New tables can be created and existing tables can be updated or deleted here.
To create a new table, click Add table > New table on the Tables page. Tables can be created from scratch or from a template that was previously saved.
If you make a mistake in the order of columns when creating your table, then as long as you have not saved your table, you can switch to Edit as text to change the column order. The text editor can swap or move columns whereas the built-in editor can only delete columns or add columns to the end of the sequence. When editing in text mode, it is best practice to copy the content of the text editor to a notepad before you make changes because a corrupted syntax will result in the text being wiped or reverted when switching between text and non-text mode.
Once a table is saved it is no longer possible to edit the schema, only new fields can be added. The workaround is switching to text mode, copying the schema of the table to which you want to make modifications and paste it into a new empty table where the necessary changes can be made before saving.
Once created, do not try to modify your table column layout via the Query module as even though you can execute ALTER TABLE commands, the definitions and syntax of the table will go out of sync resulting in processing issues.
Be careful when naming tables when you want to use them in bundles. Table names have to be unique per bundle, so no two tables with the same name can be part of the same bundle.
To create a table from scratch, complete the fields listed below and click the Save button. Once saved, a job will be created to create the table. To view table creation progress, navigate to the Activity page.
The table name is a required field and must be unique. The first character of the table must be a letter followed by letters, numbers or underscores. The description is optional.
Including or excluding references can be done by checking or un-checking the Include reference checkbox. These reference fields are not shown on the table creation page, but are added to the schema definition, which is visible after creating the table (Projects > your_project > Base > Tables > your_table > Schema definition). By including references, additional columns will be added to the schema (see next paragraph) which can contain references to the data on the platform:
data_reference: reference to the data element in the Illumina platform from which the record originates
data_name: original name of the data element in the Illumina platform from which the record originates
sample_reference: reference to the sample in the Illumina platform from which the record originates
sample_name: name of the sample in the Illumina platform from which the record originates
pipeline_reference: reference to the pipeline in the Illumina platform from which the record originates
pipeline_name: name of the pipeline in the Illumina platform from which the record originates
execution_reference: reference to the pipeline execution in the Illumina platform from which the record originates
account_reference: reference to the account in the Illumina platform from which the record originates
account_name: name of the account in the Illumina platform from which the record originates
In an empty table, you can create a schema by adding a field for each column of the table and defining it. The + Add field button is located to the right of the schema. At any time during the creation process, it is possible to switch to the edit as text mode and back. The text mode shows the JSON code, whereas the original view shows the fields in a table.
Each field requires:
a name – this has to be unique (*1)
a type
String – collection of characters
Bytes – raw binary data
Integer – whole numbers
Float – fractional numbers (*2)
Numeric – any number (*3)
Boolean – only options are “true” or “false”
Timestamp - Stores number of (milli)seconds passed since the Unix epoch
Date - Stores date in the format YYYY-MM-DD
Time - Stores time in the format HH:MI:SS
Datetime - Stores date and time information in the format YYYY-MM-DD HH:MI:SS
Record – has a child field
Variant - can store a value of any other type, including OBJECT and ARRAY
a mode
Required - Mandatory field
Nullable - Field is allowed to have no value
Repeated - Multiple values are allowed in this field (will be recognized as array in Snowflake)
(*1) Do not use reserved Snowflake keywords such as left, right, sample, select, table,... (https://docs.snowflake.com/en/sql-reference/reserved-keywords) for your schema name as this will lead to SQL compilation errors.
(*2) Float values will be exported differently depending on the output format. For example JSON will use scientific notation so verify that your consecutive processing methods support this.
(*3) Defining the precision when creating tables with SQL is not supported as this will result in rounding issues.
Users can create their own template by making a table which is turned into a template at Projects > your_project > Base > Tables > your_table > Save as template.
If a template is created and available/active, it is possible to create a new table based on this template. The table information and references follow the rules of the empty table but in this case the schema will be pre-filled. It is possible to still edit the schema that is based on the template.
The status of a table can be found at Projects > your_project > Base > Tables. The possible statuses are:
Available: Ready to be used, both with or without data
Pending: The system is still processing the table, there is probably a process running to fill the table with data
Deleted: The table is deleted functionally; it still exists and can be shown in the list again by clicking the “Show deleted tables” button
Additional Considerations
Tables created empty or from a template become available the fastest.
When copying a table with data, it can remain in a Pending state for a longer period of time.
Clicking on the page's refresh button will update the list.
For any available table, the following details are shown:
Table information: Name, description, number of records and data size
Schema definition: An overview of the table schema, also available in text. Fields can be added to the schema but not deleted. For deleting fields: copy the schema as text and paste in a new empty table where the schema is still editable.
Preview: A preview of the table for the 50 first rows (when data is uploaded into the table)
Source Data: the files that are currently uploaded into the table. You can see the Load Status of the files which can be Prepare Started, Prepare Succeeded or Prepare Failed and finally Load Succeeded or Load Failed. You can change the order and visible columns by hovering over the column headers and clicking on the cog symbol.
From within the details of a table it is possible to perform the following actions related to the table:
Copy: Create a copy from this table in the same or a different project. In order to copy to another project, data sharing of the original project should be enabled in the details of this project. The user also has to have access to both original and target project.
Export as file: Export this table as a CSV or JSON file. The exported file can be found in a project where the user has the access to download it.
Save as template: Save the schema or an edited form of it as a template.
Add data: Load additional data into the table manually. This can be done by selecting data files previously uploaded to the project, or by dragging and dropping files directly into the popup window for adding data to the table. It’s also possible to load data into a table manually or automatically via a pre-configured job. This can be done on the Schedule page.
Delete: Delete the table.
To manually add data to your table, go to Projects > your_project > Base > Tables > your_table > +Add Data
The data selection screen will show options to define the structure and location of your source data:
Write preference: Define if data can be written to the table only when the table is empty, if the data should be appended to the table, or if the table should be overwritten.
Data format (required): Select the format of the data which you want to import: CSV (comma-separated), TSV (tab-separated) or JSON (JavaScript Object Notation).
Delimiter: Which delimiter is used in the delimiter separated file. If the required delimiter is not comma, tab or pipe, select custom and define the custom delimiter.
Custom delimiter: If a custom delimiter is used in the source data, it must be defined here.
Header rows to skip: The number of consecutive header rows (at the top of the table) to skip.
References: Choose which references must be added to the table.
Most of the advanced options are legacy functions and should not be used. The only exceptions are
Encoding: Select if the encoding is UTF-8 (any Unicode character) or ISO-8859-1 (first 256 Unicode characters).
Ignore unknown values: This applies to CSV-formatted files. You can use this function to handle optional fields without separators, provided that the missing fields are located at the end of the row. Otherwise, the parser can not detect the missing separator and will shift fields to the left, resulting in errors.
If headers are used: The columns that have matching fields are loaded, those that have no matching fields are loaded with NULL and remaining fields are discarded.
If no headers are used: The fields are loaded in order of occurrence and trailing missing fields are loaded with NULL, trailing additional fields are discarded.
At the bottom of the select data screen, you can select the data you manually want to upload. You can select local files, drop files via the browser or choose files from your project.
To see the status of your data import, go to Projects > your_project > Activity > Base Jobs where you will see a job of type Prepare Data which will have succeeded or failed. If it has failed, you can see the error message and details by double-clicking the base job. You can then take corrective actions if the input mismatched with the table design and try to run the import again (with a new copy of the file as each input file can only be used once)
If you need to cancel the import, you can do so while it is scheduled by navigating to the Base Jobs inventory and selecting the job followed by Abort.
To see which data has been used to populate your table, go to Projects > your_project > Base > Tables > your_table > Source Data. This lists all the source data files, even those that failed to be imported. You cannot use these files again for another import; this prevents double entries.
Base Table schema definitions do not include an array type, but arrays can be ingested using either the Repeated mode for arrays containing a single type (i.e., String), or the Variant type.
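As a sketch only, a schema field using the Repeated mode could look like the snippet below when viewed in the table's text editor; the field name and the exact JSON casing are assumptions.

```json
{ "name": "gene_symbols", "type": "String", "mode": "Repeated" }
```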
Every Base user has one Snowflake username: ICA_U_<id>
For each user/project-bundle combination a role is created: ICA_UR_<id>_<name project/bundle>__<id>
This role receives the viewer or contributor role of the project/bundle, depending on the user's permissions in ICA.
Every project or bundle has a dedicated Snowflake database.
For each database, 2 roles are created:
<project/bundle name>_<id>_VIEWER
<project/bundle name>_<id>_CONTRIBUTOR
The VIEWER role receives:
REFERENCE and SELECT rights on the tables/views within the project's PUBLIC schema.
Grants on the viewer roles of the bundles linked to the project.
The CONTRIBUTOR role receives the following rights on current and future objects in the project's or bundle's database in the PUBLIC schema:
ownership
select, insert, update, delete, truncate and references on tables/views/materialized views
usage on sequences/functions/procedures/file formats
write, read and usage on stages
select on streams
monitor and operate on tasks
It also receives grant on the viewer role of the project.
For each project (not bundle!) two warehouses are created, whose size can be changed in ICA at Projects > your_project > Project Settings > Details.
<projectname>_<id>_QUERY
<projectname>_<id>_LOAD
A Pipeline is a series of Tools with connected inputs and outputs configured to execute in a specific order.
Pipelines are created and stored within projects.
Navigate to Projects > your_project > Flow > Pipelines > +Create.
Select CWL Graphical, CWL code (XML / JSON) or Nextflow (XML / JSON) to create a new Pipeline.
Configure pipeline settings in the pipeline property tabs.
When creating a graphical CWL pipeline, drag connectors to link tools to input and output files in the canvas. Required tool inputs are indicated by a yellow connector.
Select Save.
Pipelines use the tool definition that was current when the pipeline was last saved. Tool changes do not automatically propagate to the pipeline. To update the pipeline with the latest tool changes, edit the pipeline definition by removing the tool and re-adding it to the pipeline.
Individual Pipeline files are limited to 20 Megabytes. If you need to add more than this, split your content over multiple files.
You can edit pipelines while they are in Draft or Release Candidate status. Once released, pipelines can no longer be edited.
The following sections describe the tool properties that can be configured in each tab of the pipeline editor.
Depending on how you design the pipeline, the displayed tabs differ between the graphical and code definitions. For CWL you have a choice on how to define the pipeline, Nextflow is always defined in code mode.
Any additional source files related to your pipeline will be displayed here in alphabetical order.
See the following pages for language-specific details for defining pipelines:
The details tab provides options for configuring basic information about the pipeline.
Code
The name of the pipeline.
Nextflow Version
User selectable Nextflow version available only for Nextflow pipelines
Categories
One or more tags to categorize the pipeline. Select from existing tags or type a new tag name in the field.
Description
A short description of the pipeline.
Proprietary
Hide the pipeline scripts and details from users who do not belong to the tenant who owns the pipeline. This also prevents cloning the pipeline.
Status
The release status of the pipeline.
Storage size
User selectable storage size for running the pipeline. This must be large enough to run the pipeline, but setting it too large incurs unnecessary costs.
Family
A group of pipeline versions. To specify a family, select Change, and then select a pipeline or pipeline family. To change the order of the pipeline, select Up or Down. The first pipeline listed is the default and the remainder of the pipelines are listed as Other versions. The current pipeline appears in the list as this pipeline.
Version comment
A description of changes in the updated version.
Links
External reference links. (max 100 chars as name and 2048 chars as link)
The following information becomes visible when viewing the pipeline details.
ID
Unique Identifier of the pipeline.
URN
Identification of the pipeline in Uniform Resource Name (URN) format.
The clone action is shown at the top-right of the pipeline details. Cloning a pipeline allows you to make modifications without impacting the original pipeline. When cloning a pipeline, you become the owner of the cloned pipeline.
When you clone a Nextflow pipeline, a verification of the configured Nextflow version is done to ensure no deprecated versions are used.
The Documentation tab provides options for configuring the HTML description for the tool. The description appears in the tool repository but is excluded from exported CWL definitions. If no documentation has been provided, this tab will be empty.
When using graphical mode for the pipeline definition, the Definition tab provides options for configuring the pipeline using a visualization panel and a list of component menus.
Machine profiles
Compute types available to use with Tools in the pipeline.
Shared settings
Settings for pipelines used in more than one tool.
Reference files
Descriptions of reference files used in the pipeline.
Input files
Descriptions of input files used in the pipeline.
Output files
Descriptions of output files used in the pipeline.
Tool
Details about the tool selected in the visualization panel.
Tool repository
A list of tools available to be used in the pipeline.
In graphical mode, you can drag and drop inputs into the visualization panel to connect them to the tools. Make sure to connect the input icons to the tool before editing the input details in the component menu. Required tool inputs are indicated by a yellow connector.
This page is used to specify all relevant information about the pipeline parameters.
The Analysis Report tab provides options for configuring pipeline execution reports. The report is composed of widgets added to the tab.
The pipeline analysis report appears in the pipeline execution results. The report is configured from widgets added to the Analysis Report tab in the pipeline editor.
[Optional] Import widgets from another pipeline.
Select Import from other pipeline.
Select the pipeline that contains the report you want to copy.
Select an import option: Replace current report or Append to current report.
Select Import.
From the Analysis Report tab, select Add widget, and then select a widget type.
Configure widget details.
Title
Add and format title text.
Analysis details
Add heading text and select the analysis metadata details to display.
Free text
Add formatted free text. The widget includes options for placeholder variables that display the corresponding project values.
Inline viewer
Add options to view the content of an analysis output file.
Analysis comments
Add comments that can be edited after an analysis has been performed.
Input details
Add heading text and select the input details to display. The widget includes an option to group details by input name.
Project details
Add heading text and select the project details to display.
Page break
Add a page break widget where page breaks should appear between report sections.
Select Save.
[[BB_PROJECT_NAME]]
The project name.
[[BB_PROJECT_OWNER]]
The project owner.
[[BB_PROJECT_DESCRIPTION]]
The project short description.
[[BB_PROJECT_INFORMATION]]
The project information.
[[BB_PROJECT_LOCATION]]
The project location.
[[BB_PROJECT_BILLING_MODE]]
The project billing mode.
[[BB_PROJECT_DATA_SHARING]]
The project data sharing settings.
[[BB_REFERENCE]]
The analysis reference.
[[BB_USERREFERENCE]]
The user analysis reference.
[[BB_PIPELINE]]
The name of the pipeline.
[[BB_USER_OPTIONS]]
The analysis user options.
[[BB_TECH_OPTIONS]]
The analysis technical options. Technical options include the TECH suffix and are not visible to end users.
[[BB_ALL_OPTIONS]]
All analysis options. Technical options include the TECH suffix and are not visible to end users.
[[BB_SAMPLE]]
The sample.
[[BB_REQUEST_DATE]]
The analysis request date.
[[BB_START_DATE]]
The analysis start date.
[[BB_DURATION]]
The analysis duration.
[[BB_REQUESTOR]]
The user requesting analysis execution.
[[BB_RUNSTATUS]]
The status of the analysis.
[[BB_ENTITLEMENTDETAIL]]
The used entitlement detail.
[[BB_METADATA:path]]
The value or list of values of a metadata field or multi-value fields.
See Metadata Models
The Nextflow project main script.
The Nextflow configuration settings.
The Common Workflow Language main script.
Multiple files can be added to make pipelines more modular and manageable.
Syntax highlighting is determined by the file type, but you can select alternative syntax highlighting with the drop-down selection list. The following formats are supported:
DIFF (.diff)
GROOVY (.groovy .nf)
JAVASCRIPT (.js .javascript)
JSON (.json)
SH (.sh)
SQL (.sql)
TXT (.txt)
XML (.xml)
YAML (.yaml .cwl)
For each process defined by the workflow, ICA will launch a compute node to execute the process.
For each compute type, the standard (default - AWS on-demand) or economy (AWS spot instance) tier can be selected.
When selecting an fpga instance type for running analyses on ICA, it is recommended to use the medium size. While the large size offers slight performance benefits, these do not proportionately justify the associated cost increase for most use cases.
When no type is specified, the default type of compute node is standard-small.
By default, compute nodes have no scratch space. This is an advanced setting and should only be used when absolutely necessary as it will incur additional costs and may offer only limited performance benefits because it is not local to the compute node.
For simplicity and better integration, consider using the shared storage available at /ces. It is what is provided in the Small/Medium/Large+ compute types. This shared storage is used when writing files with relative paths.
Daemon sets and system processes consume approximately 1 CPU and 2 GB Memory from the base values shown in the table. Consumption will vary based on the activity of the pod.
Compute Type
CPUs
Mem (GB)
Nextflow (pod.value)
CWL (type, size)
standard-small
2
8
standard-small
standard, small
standard-medium
4
16
standard-medium
standard, medium
standard-large
8
32
standard-large
standard, large
standard-xlarge
16
64
standard-xlarge
standard, xlarge
standard-2xlarge
32
128
standard-2xlarge
standard, 2xlarge
hicpu-small
16
32
hicpu-small
hicpu, small
hicpu-medium
36
72
hicpu-medium
hicpu, medium
hicpu-large
72
144
hicpu-large
hicpu, large
himem-small
8
64
himem-small
himem, small
himem-medium
16
128
himem-medium
himem, medium
himem-large
48
384
himem-large
himem, large
himem-xlarge (*1)
96
768
himem-xlarge
himem, xlarge
hiio-small
2
16
hiio-small
hiio, small
hiio-medium
4
32
hiio-medium
hiio, medium
fpga-small (*2)
8
122
fpga-small
fpga, small
fpga-medium
16
244
fpga-medium
fpga, medium
fpga-large (*3)
64
976
fpga-large
fpga, large
transfer-small (*4)
4
10
transfer-small
transfer, small
transfer-medium (*4)
8
15
transfer-medium
transfer, medium
transfer-large (*4)
16
30
transfer-large
transfer, large
(*1) The compute type himem-xlarge has low availability.
(*2) The compute type fpga-small is not available. Use 'fpga-medium' instead.
(*3) The compute type fpga-large is only available in the US (use1) region. This compute type is not recommended as it suffers from low availability and offers little performance benefit over fpga-medium at significant additional cost.
(*4) The transfer size is selected automatically based on the selected storage size and is used during the upload and download system tasks.
Use the following instructions to start a new analysis for a single pipeline.
Select a project.
From the project menu, select Flow > Pipelines.
Select the pipeline or pipeline details of the pipeline you want to run.
Select Start Analysis.
Configure analysis settings. See Analysis Properties.
Select Start Analysis.
View the analysis status on the Analyses page.
Requested—The analysis is scheduled to begin.
In Progress—The analysis is in progress.
Succeeded—The analysis is complete.
Failed and Failed Final—The analysis has failed or was aborted.
To end an analysis, select Abort.
To perform a completed analysis again, select Re-run.
The following sections describe the analysis properties that can be configured in each tab.
The Analysis tab provides options for configuring basic information about the analysis.
User Reference: The unique analysis name.
User tags: One or more tags used to filter the analysis list. Select from existing tags or type a new tag name in the field.
Entitlement Bundle: Select a subscription to charge the analysis to.
Input Files: Select the input files to use in the analysis (max. 50,000).
Settings: Provide the input settings.
You can abort a running analysis from either the analysis overview (Projects > your_project > Flow > Analyses > your_analysis > Manage > Abort) or from the analysis details (Projects > your_project > Flow > Analyses > your_analysis > Details tab > Abort).
You can view analysis results on the Analyses page or in the output_folder on the Data page.
Select a project, and then select the Flow > Analyses page.
Select an analysis.
On the Details tab, select the square symbol right of the output files.
From the output files view, expand the list and select an output file.
If you want to add or remove any user or technical tags, you can do so from the data details view.
If you want to download the file, select Schedule download.
To preview the file, select the View tab.
Return to Flow > Analyses > your_analysis.
View additional analysis result information on the following tabs:
Details - View information on the pipeline configuration.
Steps - stderr and stdout information
Timeline Report - Nextflow process execution timeline.
Execution Report - Nextflow analysis report. Showing the run times, commands, resource usage and tasks for Nextflow analyses.
Queries can be used for data mining. On the Projects > your_project > Base > Query page:
New queries can be created and executed
Already executed queries can be found in the query history
Saved queries and query templates are listed under the saved queries tab.
All available tables and their details are listed on the New Query tab.
Note that Metadata tables are created by syncing with the Base module. This synchronization is configured on the Details page within the project.
Queries are executed using SQL (for example Select * From table_name). When there is a syntax issue with the query, the error will be displayed on the query screen when trying to run it. The query can be immediately executed or saved for future use.
Do not use queries such as ALTER TABLE to modify your table structure as it will go out of sync with the table definition and will result in processing errors.
When you have duplicate column names in your query, put the columns explicitly in the select clause and use column aliases for columns with the same name.
Case sensitive column names (such as the VARIANTS table) must be surrounded by double quotes. For example, select * from MY_TABLE where "PROJECT_NAME" = 'MyProject'.
The syntax for ICA case-sensitive subfields is without quotes, for example select * from MY_TABLE where ica:Tenant = 'MyTenant'. As these are case sensitive, the upper- and lowercasing must be respected.
For more information on queries, please also see the snowflake documentation: https://docs.snowflake.com/en/user-guide/
Some tables contain columns with an array of values instead of a single value.
As of ICA version 2.27, there is a change in the use of capitals for ICA array fields. In previous versions, the data name within the array would start with a capital letter. As of 2.27, lowercase is used. For example, ICA:Data_reference has become ICA:data_reference.
You can use the GET_IGNORE_CASE option to adapt existing queries when you have both data in the old syntax and new data in the lowercase syntax. The syntax is GET_IGNORE_CASE(Table_Name.Column_Name,'Array_field')
For example:
select ICA:Data_reference as MY_DATA_REFERENCE from TestTable
becomes:
select GET_IGNORE_CASE(TESTTABLE.ICA,'Data_reference') as MY_DATA_REFERENCE from TestTable
You can also modify the data to have consistent capital usage by executing the query update YOUR_TABLE_NAME set ica = object_delete(object_insert(ica, 'data_name', ica:Data_name), 'Data_name')
and repeating this process for all field names (Data_name, Data_reference, Execution_reference, Pipeline_name, Pipeline_reference, Sample_name, Sample_reference, Tenant_name and Tenant_reference).
Suppose you have a table called YOUR_TABLE_NAME consisting of three fields. The first is a name, the second is a code and the third field is an array of data called ArrayField:

| NameField | CodeField | ArrayField |
| --- | --- | --- |
| Name A | Code A | { "userEmail": "email_A@server.com", "bundleName": null, "boolean": false } |
| Name B | Code B | { "userEmail": "email_B@server.com", "bundleName": "thisbundle", "boolean": true } |
You can use the name field and code field to do queries by running Select * from YOUR_TABLE_NAME where NameField = 'Name A'.
If you want to show specific data, such as the email and bundle name from the array, this becomes Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where NameField = 'Name A'.
If you want to use data in the array as your selection criteria, the expression becomes Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:boolean = true.
If your criterion is text in the array, use single quotes (') to delimit the text. For example: Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:userEmail = 'email_A@server.com'.
You can also use the LIKE operator with the % wildcard if you do not know the exact content.
Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:userEmail LIKE '%A@server%'
If the query is valid for execution, the result will be shown as a table underneath the input box. From within the result page of the query, it is possible to save the result in two ways:
Download: As an Excel or JSON file to your computer.
Export: As a new table, as a view, or as a file to the project in CSV (tab, pipe, or a custom delimiter are also allowed) or JSON format. When exporting in JSON format, the result will be saved in a text file that contains a JSON object for each entry, similar to when exporting a table. The exported file can be located on the Data page under the folder named base_export_<user_supplied_name>_<auto generated unique id>.
Navigate to Projects > your_project > Base > Query.
Enter the query to execute using SQL.
Select »Run Query.
Optionally, select Save Query to add the query to your saved queries list.
If the query takes more than 30 seconds without returning a result, a message will be displayed to inform you the query is still in progress and the status can be consulted on Projects > your_project > Activity > Base Jobs. Once this Query is successfully completed, the results can be found in Projects > your_project > Base > Query > Query History tab.
The query history lists all queries that were executed. Historical queries are shown with their date, executing user, returned rows and duration of the run.
Navigate to Projects > your_project > Base > Query.
Select the Query History tab.
Select a query.
Perform one of the following actions:
Open Query—Open the query in the New Query tab. You can then select Run Query to execute the query again.
Save Query—Save the query to the saved queries list.
View Results—Download the results from a query or export results to a new table, view, or file in the project. Results are available for 24 hours after the query is executed. To view results after 24 hours, you need to execute the query again.
All queries saved within the project are listed under the Saved Queries tab together with the query templates.
The saved queries can be:
Opened: This will open the query in the “New query” tab.
Saved as template: The saved query becomes a query template.
Deleted: The query is removed from the list and cannot be opened again.
The query templates can be:
Opened: This will open the query again in the “New query” tab.
Deleted: The query is removed from the list and cannot be opened again.
It is possible to edit the saved queries and templates by double-clicking on each query or template. Specifically for Query Templates, the data classification can be edited to be:
Account: The query template will be available for everyone within the account
User: The query template will be available for the user who created it
If you have saved a query, you can run the query again by selecting it from the list of saved queries.
Navigate to Projects > your_project > Base > Query.
Select the Saved Queries tab.
Select a query.
Select Open Query to open the query in the New Query tab from where it can be edited if needed and run by selecting Run Query.
Shared databases are displayed under the list of Tables as Shared Database for project <project name>.
For ICA Cohorts Customers, shared databases are available in a project Base instance. For more information on specific Cohorts shared database tables that are viewable, See Cohorts Base.
The JupyterLab docker image contains the following environment variables:
ICA_URL: set to the ICA server URL https://ica.illumina.com/ica
ICA_PROJECT: (OBSOLETE) set to the current ICA project ID
ICA_PROJECT_UUID: set to the current ICA project UUID
ICA_SNOWFLAKE_ACCOUNT: set to the ICA Snowflake (Base) Account ID
ICA_SNOWFLAKE_DATABASE: set to the ICA Snowflake (Base) Database ID
ICA_PROJECT_TENANT_NAME: set to the tenant name of the owning tenant of the project where the workspace is created
ICA_STARTING_USER_TENANT_NAME: set to the tenant name of the tenant of the user who last started the workspace
ICA_COHORTS_URL: set to the URL of the Cohorts web application used to support the Cohorts view
Note: To export data from your workspace to your local machine, it is best practice to move the data in your workspace to the /data/project/ folder so that it becomes available in your project under projects > your_project > Data.
The ICA Python library API documentation can be found in folder /etc/ica/data/ica_v2_api_docs within the JupyterLab docker image.
The following steps are needed to get your bench image running in ICA.
You need to have Docker installed in order to build your images.
For your Docker bench image to work in ICA, it must run on Linux x86 architecture, use the correct user ID, and include the initialization script in the Dockerfile.
Bench-console provides an example of how to build a minimal image compatible with ICA Bench that runs an SSH daemon.
Bench-web provides an example of how to build a minimal image compatible with ICA Bench that runs a web daemon.
Bench-rstudio provides an example of how to build a minimal image compatible with ICA Bench that runs RStudio Open Source.
These examples come with information on the available parameters.
This script copies the ica_start.sh file, which takes care of the initialization and termination of your workspace, to the location in your project from where it can be started by ICA when you request to start your workspace.
The user settings must be set up so that bench runs with UID 1000.
To do a clean shutdown, you can capture the SIGTERM signal, which is transmitted 30 seconds before the workspace is terminated.
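As an illustration, a minimal sketch of how a start script could trap this signal; the SSH daemon used as the long-running child process and the cleanup steps are assumptions rather than part of the ICA-provided examples:

```bash
#!/bin/bash
# Hypothetical graceful-shutdown handler for a Bench start script.
# The child process (an SSH daemon) and the cleanup steps are illustrative.

cleanup() {
    echo "SIGTERM received, shutting down..."
    kill -TERM "$child_pid" 2>/dev/null   # forward the signal to the child process
    wait "$child_pid"                     # wait for it to exit cleanly
    exit 0
}
trap cleanup TERM

# Start the main long-running process in the background.
/usr/sbin/sshd -D &
child_pid=$!

# Keep this script alive as the main workspace process.
wait "$child_pid"
```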
Once you have Docker installed and completed the configuration of your Docker files, you can build your bench image.
Open the command prompt on your machine.
Navigate to the root folder of your Docker files.
Build the image with docker build -f Dockerfile -t mybenchimage:0.0.1 . Once the image has been built, save it as a Docker tar file with the command docker save mybenchimage:0.0.1 | bzip2 > ../mybenchimage-0.0.1.tar.bz2
The resulting tar file will appear next to the root folder of your docker files.
If you want to build on a mac with Apple Silicon, then the build command is docker buildx build --platform linux/amd64 -f Dockerfile -t mybenchimage:0.0.1 .
Open ICA and log in.
Go to Projects > your_project > Data and upload the Docker image tar file.
Select the uploaded image file and perform Manage > Change Format.
From the format list, select DOCKER and save the change.
Go to System Settings > Docker Repository > Create > Image.
Select the uploaded docker image and fill out the other details.
Name: The name by which your docker image will be seen in the list
Version: A version number to keep track of which version you have uploaded. In our example this was 0.0.1
Description: Provide a description explaining what your docker image does or is suited for.
Type: The type of this image is Bench. The Tool type is reserved for tool images.
Cluster compatible: [For future use, not currently supported] Indicates whether this docker image is suited for cluster computing.
Access: This setting must match the available access options of your Docker image. You can choose web access (HTTP), console access (SSH), or both. What is selected here becomes available on the + New Workspace screen. Enabling an option here which your Docker image does not support will result in access denied errors when trying to run the workspace.
Regions: If your tenant has access to multiple regions, you can select to which regions to replicate the docker image.
Once the settings are entered, select Save. The creation of the Docker image typically takes between 5 and 30 minutes. The status of your docker image will be partial during creation and available once completed.
Navigate to Projects > your_project > Bench > Workspaces.
Create a new workspace with + Create Workspace or edit an existing workspace.
Save your changes.
Select Start Workspace
Wait for the workspace to be started and you can access it either via console or the GUI.
Once your bench image has been started, you can access it via console, web or both, depending on your configuration.
Web access (HTTP) is done from either Projects > your_project > Bench > Workspaces > your_Workspace > Access tab or from the link provided in your running workspace at Projects > your_project > Bench > Workspaces > your_Workspace > Details tab > Access section.
Console access (SSH) is performed from your command prompt by going to the path provided in your running workspace at Projects > your_project > Bench > Workspaces > your_Workspace > Details tab > Access section.
The bench image will be instantiated as a container which is forcibly started as a user with UID 1000 and GID 100.
You cannot elevate your permissions in a running workspace.
Do not run containers as root as this is bad security practice.
Only the following folders are writeable:
/data
/tmp
All other folders are mounted as read-only.
For inbound access, the following ports on the container are publicly exposed, depending on the selection made at startup.
Web: TCP/8888
Console: TCP/2222
For outbound access, a workspace can be started in two modes:
Public: Access to public IPs is allowed using the TCP protocol.
Restricted: Access to a list of URLs is allowed.
At runtime, the following Bench-specific environment variables are made available to the workspace instantiated from the Bench image.
Following files and folders will be provided to the workspace and made accessible for reading at runtime.
At runtime, ICA-related software will automatically be made available at /data/.software in read-only mode.
New versions of ICA software will be made available after a restart of your workspace.
When a bench workspace is instantiated from your selected bench image, the following script is invoked: /usr/local/bin/ica_start.sh
This script needs to be available and executable otherwise your workspace will not boot.
This script can be used to invoke other scripts.
If you get the error "docker buildx build" requires exactly 1 argument when trying to build your docker image, a possible cause is a missing final . (dot) at the end of the command.
When you stop the workspace when users are still actively using it, they will receive a message showing a Server Connection Error.
ICA provides a tool called Bench for interactive data analysis. This is a sandboxed workspace which runs a docker image with access to the data and pipelines within a project. This workspace runs on the Amazon S3 system and comes with associated processing and provisioning costs. It is therefore best practice not to keep your Bench instances running indefinitely, but to stop them when not in use.
Having access to Bench depends on the following conditions:
Bench needs to be included in your ICA subscription.
The project owner needs to enable Bench for their project.
Individual users of that project need to be given access to Bench.
After creating a project, go to Projects > your_project > Bench > Workspaces page and click the Enable button. If you do not see this option, then either your tenant subscription does not include Bench or you belong to a tenant different from the one where the project was created. Users from other tenants cannot enable the Bench module, but can create workspaces. Once enabled, every user who has the correct permissions has access to the Bench module in that project.
Once Bench has been enabled for your project, the combination of roles and teams settings determines if a user can access Bench.
Tenant administrators and project owners are always able to access Bench and perform all actions.
The teams settings page at Projects > your_project > Project Settings > Team determines the role for the user/workgroup.
No Access means you have no access to the Bench workspace for that project.
Contributor gives you the right to start and stop the Bench workspace and to access the workspace contents, but not to create or edit the workspace.
Administrator gives you the right to create, edit, delete, start and stop the Bench workspace, and to access the actual workspace contents. In addition, the administrator can also build new derived Bench images and tools.
Finally, a verification is done of your user rights against the required workspace permissions. You will only have access when your user rights meet or exceed the required workspace permissions. The possible required Workspace permissions include:
Upload / Download rights (Download rights are mandatory for technical reasons)
Project Level (No Access / Data Provider / Viewer / Contributor)
Flow (No Access / Viewer / Contributor)
Base (No Access / Viewer / Contributor)
On the Schedule page at Projects > your_project > Base > Schedule, it’s possible to create a job for importing different types of data you have access to into an existing table.
When creating or editing a schedule, Automatic import is performed when the Active box is checked. The job will run at 10 minute intervals. In addition, for both active and inactive schedules, a manual import is performed when selecting the schedule and clicking the »run button.
There are different types of schedules that can be set up:
Files
Metadata
Administrative data.
This type will load the content of specific files from this project into a table. When adding or editing this schedule you can define the following parameters:
Name (required): The name of the scheduled job
Description: Extra information about the schedule
File name pattern (required): Define in this field part of or the full file name, or the tag, that the files you want to import contain. For example, if you want to import files named sample1_reads.txt, sample2_reads.txt, …, you can fill in _reads.txt in this field to have all files that contain _reads.txt imported to the table.
Generated by Pipelines: Only files generated by these selected pipelines are taken into account. When left clear, files from all pipelines are used.
Target Base Table (required): The table to which the information needs to be added. A drop-down list with all created tables is shown. This means the table needs to be created before the schedule can be created.
Write preference (required): Define how the data is handled, i.e. whether existing data can be overwritten.
Data format (required): Select the data format of the files (CSV, TSV, JSON)
Delimiter (required): Indicates which delimiter is used in the delimiter-separated file. If the delimiter is not present in the list, it can be indicated as custom.
Active: The job will run automatically if checked
Custom delimiter: the custom delimiter that is used in the file. You can only enter a delimiter here if custom delimiter is selected.
Header rows to skip: The number of consecutive header rows (at the top of the table) to skip.
References: Choose which references must be added to the table
Advanced Options
Encoding (required): Select the encoding of the file.
Null Marker: Specifies a string that represents a null value in a CSV/TSV file.
Quote: The value (single character) that is used to quote data sections in a CSV/TSV file. When this character is encountered at the beginning and end of a field, it will be removed. For example, entering " as quote will remove the quotes from "bunny" and only store the word bunny itself.
Ignore unknown values: This applies to CSV-formatted files. You can use this function to handle optional fields without separators, provided that the missing fields are located at the end of the row. Otherwise, the parser can not detect the missing separator and will shift fields to the left, resulting in errors.
If headers are used: The columns that have matching fields are loaded, those that have no matching fields are loaded with NULL and remaining fields are discarded.
If no headers are used: The fields are loaded in order of occurrence and trailing missing fields are loaded with NULL, trailing additional fields are discarded.
This type will create two new tables: BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL and ICA_PROJECT_SAMPLE_META_DATA. The job will load metadata (added to the samples) into ICA_PROJECT_SAMPLE_META_DATA. The process gathers the metadata from the samples via the data linked to the project and the metadata from the analyses in this project. Furthermore, the scheduler will add provenance data to BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL. This process gathers the execution details of all the analyses in the project: the pipeline name and status, the user reference, the input files (with identifiers), and the settings selected at runtime. This enables you to track the lineage of your data and to identify any potential sources of errors or biases. So, for example, the following query counts how many times each of the pipelines was executed and sorts the result accordingly.
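This is a sketch only; it assumes the pipeline name is stored in a column called PIPELINE_NAME, so check the actual column names of your BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL table before running it:

```sql
select PIPELINE_NAME, count(*) as EXECUTION_COUNT
from BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL
group by PIPELINE_NAME
order by EXECUTION_COUNT desc;
```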
To obtain a similar table for the failed runs, you can execute the following SQL query.
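Again a sketch, additionally assuming the analysis status is stored in a STATUS column with the value 'Failed' for failed runs; adjust the column name and value to match your table:

```sql
select PIPELINE_NAME, count(*) as EXECUTION_COUNT
from BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL
where STATUS = 'Failed'
group by PIPELINE_NAME
order by EXECUTION_COUNT desc;
```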
When adding or editing this schedule you can define the following parameters:
Name (required): the name of this scheduled job
Description: Extra information about the schedule
Anonymize references: when selected, the references will not be added
Include sensitive metadata fields: In the metadata fields configuration, fields can be set to sensitive. When checked, those fields will also be added.
Active: the job will run automatically if ticked
Source (Tenant Administrators Only):
Project (default): All administrative data from this project will be added
Account: All administrative data from every project in the account will be added. When a tenant admin creates the tenant-wide table with administrative data in a project and invites other users to this project, these users will see this table as well.
This type will automatically create a table and load administrative data into this table. A usage overview of all executions is considered administrative data.
When adding or editing this schedule the following parameters can be defined:
Name (required): The name of this scheduled job
Description: Extra information about the schedule
Anonymize references: When checked, any platform references will not be added
Include sensitive metadata fields: In the metadata fields configuration, fields can be set to sensitive. When checked, those fields will also be added.
Active: The job will run automatically if checked
Source (Tenant Administrators Only):
Project (default): All administrative data from this project will be added
Account: All administrative data from every project in the account will be added. When a tenant admin creates the tenant-wide table with administrative data in a project and invites other users to this project, these users will see this table as well.
Schedules can be deleted. Once deleted, they will no longer run, and they will not be shown in the list of schedules.
When clicking the Run button, or Save & Run when editing, the schedule will start the job of importing the configured data in the correct tables. This way the schedule can be run manually. The result of the job can be seen in the tables.
The main concept in Bench is the Workspace. A workspace is an instance of a Docker image that runs the framework which is defined in the image (for example JupyterLab, R Studio). In this workspace, you can write and run code and graphically represent data. You can use API calls to access data, analyses, Base tables and queries in the platform. Via the command line, R-packages, tools, libraries, IGV browsers, widgets, etc. can be installed.
You can create multiple workspaces within a project and each workspace runs on an individual node and is available in different resource sizes. Each node has local storage capacity, where files and results can be temporarily stored and exported from to be permanently stored in a Project. The size of the storage capacity can range from 1GB – 16TB.
For each workspace, the status is indicated by colour: red means stopped, orange means starting, and green means running.
If this is the first time you are using a workspace in a Project, click Enable to create new Bench Workspaces. In order to use Bench, you first need to have a workspace. This workspace determines which docker image will be used with which node and storage size.
Click Projects > Your_Project > Bench > Workspaces > + Create Workspace
Complete the following fields:
Name: (required) must be a unique name.
Docker image: (required) The list of docker images includes base images from ICA and images uploaded to the docker repository for that domain.
Storage size (GB): (required) Represents the size of the storage available on the workspace. A storage from 10GB to 64TB can be provided.
Description: A place to provide additional information about the workspace.
Web allows you to interact with the workspace via a browser.
Console provides a terminal to interact with the workspace.
Internet Access: (required) Type of access to the internet which should be provided for this workspace
Open: Internet access is allowed
Restricted: Creates a workspace with no internet access. Access to the ICA Project Data is still available in this mode.
Whitelisted URLs: Specify URLs and paths that are allowed in a restricted workspace. Separate URLs with a new line. Only domains and subdomains in the specified URL will be allowed.
URLs must comply with the following:
URLs can be between 1 and 263 characters, including dots (.).
URLs can begin with a leading dot (.).
Domain and sub-domains:
Can include alphanumeric characters (letters A-Z and digits 0-9). Case insensitive.
Can contain hyphens (-) and underscores (_), but not as a first or last character.
Length between 1 and 63 characters.
A dot (.) must be placed after a domain or sub-domain.
Note that if you use a trailing slash like in the path ftp.example.net/folder/ then you will not be able to access the path ftp.example.net/folder without the trailing slash included.
Regex for URL: [(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=-]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)
Accepted Example URLs:
example.com www.example.com https://www.example.com subdomain.example.com subdomain.example.com/folder subdomain.example.com/folder/subfolder sub-domain.example.com sub_domain.example.com example.co.uk subdomain.example.co.uk sub-domain.example.co.uk
Example data science specific whitelist compatible with restricted Bench workspaces. Note there are two required URLs to allow for Python pip installs:
pypi.org files.pythonhosted.org repo.anaconda.com conda.anaconda.org github.com cran.r-project.org bioconductor.org www.npmjs.com mvnrepository.com
Access limited to workspace owner. When this field is selected, only the workspace owner can access the workspace. Everything created in that workspace will belong to the workspace owner.
Download/Upload allowed
Project/Flow/Base access
Click “Save”
The workspace can be edited afterwards when it is stopped, on the Details tab within the workspace. The changes will be applied when the workspace is restarted.
Bench administrators are able to create, edit and delete workspaces and start and stop workspaces. If their permissions match or exceed those of the workspace, they can also access the workspace contents.
Contributors are able to start and stop workspaces and if their permissions match or exceed those of the workspace, they can also access the workspace contents.
For security reasons, the Tenant administrator and Project owner can always access the workspace.
If one of your permissions is not high enough as a Bench contributor, you will see the following message: "You are not allowed to use this workspace as your user permissions are not sufficient compared to the permissions of this workspace".
The permissions that a Bench workspace can receive are the following:
Upload rights
Download rights (required)
Project (No Access - Data Provider - Viewer - Contributor)
Flow (No Access - Viewer - Contributor)
Base (No Access - Viewer - Contributor)
Based on these permissions, you will be able to upload or download data to your ICA project (upload and download rights) and will be allowed to take actions in the Project, Flow and Base modules related to the granted permission.
If you encounter issues when uploading/downloading data in a workspace, the security settings for that workspace may be set to not allow uploads and downloads. This can result in RequestError: send request failed and read: connection reset by peer. This is by design in restricted workspaces and thus limits data access to your project via /data/project to prevent the extraction of large amounts of (proprietary) data.
Workspaces which were created before this functionality existed can be upgraded by enabling these workspace permissions. If the workspaces are not upgraded, they will continue working as before.
To delete a workspace, go to Projects > your_project > Bench > Workspaces > your_workspace and click “Delete”. Note that the delete option is only available when the workspace is stopped.
The workspace will not be accessible anymore, nor will it be shown in the list of workspaces. The content of it will be deleted so if there is any information that should be kept, you can either put it in a docker image which you can use to start from next time, or export it using the API.
The workspace is not always accessible. It needs to be started before it can be used. From the moment a workspace is Running, a node with a specific capacity is assigned to this workspace. From that moment on, you can start working in your workspace.
As long as the workspace is running, the resources provided for this workspace will be charged.
To start the workspace, follow the next steps:
Go to Projects > your_project > Bench > Workspaces > your_workspace > Details
Click on Start Workspace button
On the top of the details tab, the status changes to “Starting”. When you click on the >_Access tab, the message “The workspace is starting” appears.
Wait until the status is “Running” and the “Access” tab can be opened. This can take some time because the necessary resources have to be provisioned.
You can refresh the workspace status by selecting the round refresh symbol at the top right.
If you want to open a running workspace in a new tab, then select the link at Projects > your_project > Bench > Workspaces > Details tab > Access. You can also copy the link with the copy symbol in front of the link.
When you exit a workspace, you can choose to stop the workspace or keep it running. Keeping the workspace running means that it will continue to use resources and incur associated costs. To stop the workspace, select stop in the displayed dialog. You can also stop a workspace by opening it and selecting stop at the top right. If you choose to keep it running, the workspace will be stopped if it is not accessed for more than 7 days to avoid unnecessary costs.
Stopping the workspace will stop the notebook, but will not delete local data. Content will no longer be accessible and no actions can be performed until it is restarted. Any work that has been saved will stay stored.
Storage will continue to be charged until the workspace is deleted. Administrators have a delete option for the workspace in the exit screen.
The project/tenant administrator can enter and stop workspaces for their project/tenant even if they did not start those workspaces at Projects > your_project > Bench > Workspaces > your_workspace > Details. Be careful not to stop workspaces that are processing data. For security reasons, a log entry is added when a project/tenant administrator enters and exits a workspace.
You can see who is using a workspace in the workspace list view.
Once the Workspace is running, the default applications are loaded. These are defined by the start script of the docker image.
The docker images provided by Illumina will load JupyterLab by default. It also contains Tutorial notebooks that can help you get started. Opening a new terminal can be done via the Launcher, + button above the folder structure.
To ensure that packages (and other objects, including data) are permanently installed on a Bench image, a new Bench image needs to be created, using the BUILD option in Bench. A new image can only be derived from an existing one. The build process uses the DOCKERFILE method, where an existing image is the starting point for the new Docker Image (The FROM directive), and any new or updated packages are additive (they are added as new layers to the existing Docker file).
NOTE: The Dockerfile commands are all run as ROOT, so it is possible to delete or interfere with an image in such a way that the image is no longer running correctly. The image does not have access to any underlying parts of the platform so will not be able to harm the platform, but inoperable Bench images will have to be deleted or corrected.
In order to create a derived image, open up the image that you would like to use as the basis and select the Build tab.
Name: By default, this is the same name as the original image and it is recommended to change the name.
Version: Required field which can be any value.
Description: The description for your docker image (for example, indicating which apps it contains).
Code: The Docker file commands must be provided in this section.
The first 4 lines of the Docker file must NOT be edited. It is not possible to start a docker file with a different FROM directive. The main docker file commands are RUN and COPY. More information on them is available in the official Docker documentation.
Once all information is present, click the Build button. Note that the build process can take a while. Once building has completed, the docker image will be available on the Data page within the Project. If the build has failed, the log will be displayed here and the log file will be in the Data list.
From within the workspace it is possible to create a docker image and a tool from it at the same time.
Click the Manage > Create CWL Tool button in the top right corner of the workspace.
Give the tool a name.
Replace the description of the tool to describe what it does.
Add a version number for the tool.
Click the Image tab.
Here the image that accompanies the tool will be created.
Change the name for the image.
Change the version.
Replace the description to describe what the image does.
Below the line where it says “#Add your commands below.” write the code necessary for running this docker image.
Click the Save button in the upper, right-hand corner to start the build process.
The building can take a while. When it has completed, the tool will be available in the Tool Repository.
To export data from your workspace to your local machine, it is best practice to move the data in your workspace to the /data/project/ folder so that it becomes available in your project under projects > your_project > Data. Although this storage is slow, it offers read and write access and access to the content from within ICA.
Every workspace you start has a read-only /data/.software/ folder which contains the icav2 command-line interface (and readme file).
The last tab of the workspace is the activity tab. On this tab all actions performed in the workspace are shown, for example the creation of the workspace, starting or stopping of the workspace, etc. The activities are shown with their date, the user that performed the action and the description of the action. This page can be used to check how long the workspace has run.
In the general Activity page of the project, there is also a Bench activity tab. This shows all activities performed in all workspaces within the project, even when the workspace has been deleted. The Activity tab in the workspace only shows the action performed in that workspace. The information shown is the same as per workspace, except that here the workspace in which the action is performed is listed as well.
If you want to query data from a table shared from another tenant (indicated in green), select the table to see the unique name. In the example below, the query will be select * from demo_alpha_8298.public.TestFiles
Bench workspaces require setting a docker image to use as the image for the workspace. Illumina Connected Analytics (ICA) provides a default docker image with JupyterLab installed.
JupyterLab supports notebook documents (.ipynb). Notebook documents consist of a sequence of cells which may contain executable code, markdown, headers, and raw text.
Included in the default JupyterLab docker image is a Python library with APIs to perform actions in ICA, such as adding data, launching pipelines, and operating on Base tables.
See the tutorial notebooks included in the default image for examples on using the ICA Python library.
Bench images are Docker containers tailored to run in ICA with the necessary permissions, configuration and resources. For more information on Docker images, please refer to the official Docker documentation.
For easy reference, you can find examples of preconfigured Bench images on the Illumina website, which you can copy to your local machine and edit to suit your needs.
The following scripts must be part of your Docker bench image. Please refer to the examples from the Illumina website for more details.
Execute docker build -f Dockerfile -t mybenchimage:0.0.1 . with mybenchimage being the name you want to give to your image and 0.0.1 replaced with the version number which you want your bench image to have. For more information on this command, see the official Docker documentation.
For small Docker images, upload the docker image file which you generated in the previous step. For large Docker images use the to better performance and reliability to import the Docker image.
Fill in the bench workspace details according to the workspace creation instructions.
To execute Bench CLI commands, your workspace needs a way to run them, such as an SSH daemon integrated into your web access image or into your console access image. There is no need to download the workspace command-line interface; you can run it from within the workspace.
This script is the main process in your running workspace and must not run to completion, as that will stop the workspace and trigger a restart.
When you stop a workspace, a TERM signal is sent to the main process in your bench workspace. You can trap this signal to handle the stop gracefully and shut down child processes of the main process. The workspace will be forcibly shut down after 30 seconds if your main process has not stopped within that period.
Resource model: (required) Size of the machine on which the workspace will run and whether or not the machine should contain a Graphics Processing Unit (GPU). See the compute types table for available sizes.
Access: The options here are determined by the access options of the selected Docker image. The options you select will become available on the details tab of the Workspace when it is running.
Workspace Permissions: Your workspace will operate with these permissions. For security reasons, users will need permissions matching what you set here to run the workspace, regardless of their role.
The project team role determines whether someone is an administrator or contributor, while the dedicated workspace permissions indicate what the workspace itself can and cannot do within your project. For this reason, users need to meet or exceed the required permissions to enter this workspace and use it.
Click the General Tool tab. This tab and the following tabs will look familiar from Flow. Enter the information required for the tool in each of the tabs. For more detailed instructions, see the corresponding section in the Flow documentation.
For fast read-only access, link folders with workspace-ctl data create-mount --mode read-only.
For fast read/write access, link non-indexed folders, which are visible but whose contents are not accessible from ICA. Use workspace-ctl data create-mount --mode read-write to do so. You cannot have fast read/write access to indexed folders, as the indexing mechanism on those would degrade performance.
Draft: Fully editable draft.
Release Candidate: The pipeline is ready for release. Editing is locked, but the pipeline can be cloned (top right in the details view) to create a new version.
Released: The pipeline is released. To release a pipeline, all tools of that pipeline must also be in released status. Editing a released pipeline is not possible, but the pipeline can be cloned (top right in the details view) to create a new editable version.
CWL Graphical: Details, Documentation, Definition, Analysis Report, Metadata Model
CWL Code: Details, Documentation, Inputform Files (JSON) or XML Configuration (XML), CWL Files, Metadata Model
Nextflow Code: Details, Documentation, Inputform Files (JSON) or XML Configuration (XML), Nextflow Files, Metadata Model
| Environment variable | Description | Example |
| --- | --- | --- |
| ICA_WORKSPACE | The unique identifier related to the started workspace. This value is bound to a workspace and will never change. | 32781195 |
| ICA_CONSOLE_ENABLED | Whether Console access is enabled for this running workspace. | true, false |
| ICA_WEB_ENABLED | Whether Web access is enabled for this running workspace. | true, false |
| ICA_SERVICE_ACCOUNT_USER_API_KEY | An API key that allows interaction with ICA using the ICA CLI and is bound to the permissions defined at startup of the workspace. | |
| ICA_BENCH_URL | The host part of the public URL which provides access to the running workspace. | use1-bench.platform.illumina.com |
| ICA_PROJECT_UUID | The unique identifier related to the ICA project in which the workspace was started. | |
| ICA_URL | The ICA Endpoint URL. | |
| HTTP_PROXY, HTTPS_PROXY | The proxy endpoint in case the workspace was started in restricted mode. | |
| HOME | The home folder. | /data |
| Path | Description |
| --- | --- |
| /etc/workspace-auth | Contains the SSH RSA public/private keypair which is required to run the workspace SSHD. |
| /data | This folder contains all data specific to your workspace. Data in this folder is not persisted in your project and will be removed at deletion of the workspace. |
| /data/project | This folder contains all your project data. |
| /data/.software | This folder contains ICA-related software. |
| Role | Create/Edit | Delete | Start/Stop | Access workspace contents |
| --- | --- | --- | --- | --- |
| Contributor | - | - | X | when permissions match those of the workspace |
| Administrator | X | X | X | when permissions match those of the workspace |
The following is a list of available Bench CLI commands and their options.
Please refer to the examples from the Illumina website for more details.
workspace-ctl compute get-cluster-details
workspace-ctl compute get-logs
workspace-ctl compute get-pools
workspace-ctl compute scale-pool
workspace-ctl data create-mount
workspace-ctl data delete-mount
workspace-ctl data get-mounts
workspace-ctl help completion
workspace-ctl help compute
workspace-ctl help compute get-cluster-details
workspace-ctl help compute get-logs
workspace-ctl help compute get-pools
workspace-ctl help compute scale-pool
workspace-ctl help data
workspace-ctl help data create-mount
workspace-ctl help data delete-mount
workspace-ctl help data get-mounts
workspace-ctl help help
workspace-ctl help software
workspace-ctl help software get-server-metadata
workspace-ctl help software get-software-settings
workspace-ctl help workspace
workspace-ctl help workspace get-cluster-settings
workspace-ctl help workspace get-connection-details
workspace-ctl help workspace get-workspace-settings
workspace-ctl software get-server-metadata
workspace-ctl software get-software-settings
workspace-ctl workspace get-cluster-settings
workspace-ctl workspace get-connection-details
workspace-ctl workspace get-workspace-settings
Bench has the ability to handle containers inside a running workspace. This allows you to install and package software more easily as a container image and provides capabilities to pull and run containers inside a workspace.
Bench offers a container runtime as a service in your running workspace. This allows you to do standardized container operations such as pulling in images from public and private registries, build containers at runtime from a Dockerfile, run containers and eventually publish your container to a registry of choice to be used in different ICA products such as ICA Flow.
The Container Service is accessible from your Bench workspace environment by default.
The container service uses the workspace disk to store any container images you pulled in or created.
To interact with the Container Service, a container remote client CLI is exposed automatically in the /data/.local/bin folder. The Bench workspace environment is preconfigured to automatically detect where the Container Service is made available using environment variables. These environment variables are automatically injected into your environment and are not determined by the Bench Workspace Image.
Use either the docker or podman CLI to interact with the Container Service. Both are interchangeable and support all the commonly known standardized operations.
To run a container, the first step is to either build a container from a source container or pull in a container image from a registry.
A public image registry does not require any form of authentication to pull the container layers.
The following command line example shows how to pull in a commonly known image.
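For example (the image name and tag are illustrative; any public image can be used):

```bash
# Pull a public image from Docker Hub through the Container Service
docker pull ubuntu:22.04
```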
The Container Service uses Dockerhub by default to pull images from if no registry hostname is defined in the container image URI.
To pull images from a private registry, the Container Service needs to authenticate to the Private Registry.
The following command line example shows how to instruct the Container Service to log in to the private registry.hub.docker.com registry.
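A sketch using the standard docker login syntax; <your-username> is a placeholder for your own registry account:

```bash
# Log in to the private registry; you will be prompted for a password
docker login registry.hub.docker.com -u <your-username>
```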
Depending on your authorisations in the private registry you will be able to pull and push images. These authorisations are managed outside of the scope of ICA.
Depending on the Registry setup you can publish Container Images with or without authentication. If Authentication is required, follow the login procedure described in Private Registry
The following command line example shows how to publish a locally available Container Image to a private registry in Dockerhub.
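A sketch using standard docker commands; the image name, version, and <your-namespace> are placeholders for your own repository details:

```bash
# Tag the local image with the registry and repository name, then push it
docker tag myimage:1.0 registry.hub.docker.com/<your-namespace>/myimage:1.0
docker push registry.hub.docker.com/<your-namespace>/myimage:1.0
```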
The following example shows how to save a locally available Container Image as a compressed tar archive.
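A sketch, with an illustrative image name; bzip2 is used here for compression, matching the earlier build instructions:

```bash
# Save the image as a compressed tar archive on the workspace disk
docker save myimage:1.0 | bzip2 > myimage-1.0.tar.bz2
```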
This lets you upload the container image into the Private ICA Docker Registry.
The following example shows how to list all locally available Container Images
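For example:

```bash
# List all locally available container images and their sizes
docker images
```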
Container Images require storage capacity on the Bench Workspace disk. The capacity is shown when listing the locally available container images. The container Images are persisted on disk and remain available whenever a workspace stops and restarts.
The following example shows how to clean up a locally available Container Image
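For example (image name and tag are illustrative):

```bash
# Remove a local image by name and tag to free up disk capacity
docker rmi myimage:1.0
```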
When a Container Image has multiple tags, all the tags need to be removed individually to free up disk capacity.
A Container Image can be instantiated in a Container running inside a Bench Workspace.
By default, the workspace disk (/data) will be made available inside the running Container. This lets you access data from the workspace environment.
When running a Container, the default user defined in the Container Image manifest will be used and mapped to the uid and the gid of the user in the running Bench Workspace (uid:1000, gid: 100). This will ensure files created inside the running container on the workspace disk will have the same file ownership permissions.
The following command line example shows how to run an instance of a locally available Container Image as a normal user.
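A sketch with an illustrative image name; the command simply lists the workspace disk to show that /data is available inside the container:

```bash
# Run a container as the default (non-root) user and list the workspace disk
docker run --rm -it myimage:1.0 ls /data
```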
Running a Container as root user maps the uid and gid inside the running Container to the running non-root user in the Bench Workspace. This lets you act as user with uid 0 and gid 0 inside the context of the container.
By enabling this functionality, you can install system level packages inside the context of the Container. This can be leveraged to run tools that require additional system level packages at runtime.
The following command line example shows how to run an instance of a locally available Container as root user and install system-level packages.
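A sketch assuming a Debian/Ubuntu-based image; --user 0:0 is one standard way to request root inside the container, and the package chosen is purely illustrative:

```bash
# Run the container as root (uid 0, gid 0) and install a system package
docker run --rm -it --user 0:0 ubuntu:22.04 \
  bash -c "apt-get update && apt-get install -y curl"
```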
When no specific mapping is defined using the --userns flag, the user in the running Container will be mapped to an undefined uid and gid based on an offset of id 100000. Files created on your workspace disk from the running Container will also use this uid and gid to define the ownership of the file.
Building a Container
To build a Container Image, you need to describe the instructions in a Dockerfile.
This next example builds a local Container Image and tags it as myimage:1.0. The Dockerfile used in this example is shown in the sketch below.
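A minimal sketch of such a Dockerfile, written into a small dedicated build context; the base image, the installed package, and the folder name are assumptions for illustration:

```bash
# Create a minimal build context with a simple Dockerfile
mkdir -p /data/demo-build
cat > /data/demo-build/Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*
CMD ["bash"]
EOF
```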
The following command line example will build the actual Container Image
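Continuing the sketch above, the build is run against the small dedicated build context rather than /data itself:

```bash
# Build the image from the minimal build context and tag it as myimage:1.0
docker build -t myimage:1.0 /data/demo-build
```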
When defining the build context location, keep in mind that using the HOME folder (/data) will index all files available in /data, which can be a lot and will slow down the build process. Hence the recommendation to use a minimal build context whenever possible.
The GWAS and PheWAS tabs in ICA Cohorts allow you to visualize precomputed analysis results for phenotypes/diseases and genes, respectively. Note that these do not reflect the subjects that are part of the cohort that you created.
ICA Cohorts currently hosts GWAS and PheWAS analysis results for approximately 150 quantitative phenotypes (such as "LDL direct" and "sitting height") and about 700 diseases.
Navigate to the GWAS tab and start looking for phenotypes and diseases in the search box. Cohorts will suggest the best matches against any partial input ("cancer") you provide. After selecting a phenotype/disease, Cohorts will render a Manhattan plot, by default collapsed to gene level and organized by their respective position in each chromosome.
Circles in the Manhattan plot indicate binary traits: potential associations between genes and diseases. Triangles indicate quantitative phenotypes with a regression Beta different from zero, and point up or down to depict positive or negative correlation, respectively.
Hovering over a circle/triangle will display the following information:
gene symbol
variant group (see below)
P-value, both raw and FDR-corrected
number of carriers of variants of the given type
number of carriers of variants of any type
regression Beta
For gene-level results, Cohorts distinguishes five different classes of variants: protein truncating; deleterious; missense; missense with a high ILMN PrimateAI score (indicating likely damaging variants); and synonymous variants. You can limit results to any of these classes, or select All to display all results together.
Deleterious variants (del): the union of all protein-truncating variants (PTVs, defined below), pathogenic missense variants with a PrimateAI score greater than a gene-specific threshold, and variants with a SpliceAI score greater than 0.2.
Protein-truncating variants (ptv): variant consequences matching any of stop_gained, stop_lost, frameshift_variant, splice_donor_variant, splice_acceptor_variant, start_lost, transcript_ablation, transcript_truncation, exon_loss_variant, gene_fusion, or bidirectional_gene_fusion.
missense_all: all missense variants regardless of their pathogenicity.
missense, PrimateAI optimized (missense_pAI_optimized): only pathogenic missense variants with a PrimateAI score greater than a gene-specific threshold.
missenses and PTVs (missenses_and_ptvs_all): the union of all PTVs, SpliceAI > 0.2 variants and all missense variants regardless of their pathogenicity scores.
all synonymous variants (syn).
To zoom in to a particular chromosome, click the chromosome name underneath the plot, or select the chromosome from the drop-down box, which defaults to Whole genome.
To browse PheWAS analysis results by gene, navigate to the PheWAS tab and enter a gene of interest into the search box. The resulting Manhattan plot will show phenotypes and diseases organized into a number of categories, such as "Diseases of the nervous system" and "Neoplasms". Click on the name of a category, shown underneath the plot, to display only those phenotypes/diseases, or select a category from the drop-down, which defaults to All.
A future release of ICA Cohorts will allow you to run your own customized GWAS analysis inside ICA Bench and then upload variant- or gene-level results for visualization in the ICA Cohorts graphical user interface.
ICA Cohorts lets you create a research cohort of subjects and associated samples based on the following criteria:
Project:
Include subjects that are part of any ICA Project that you own or that is shared with you.
Subject:
Demographics such as age, sex, ancestry.
Biometrics such as body height, body mass index.
Family and patient medical history.
Sample:
Sample type such as FFPE.
Tissue type.
Sequencing technology: Whole genome DNA-sequencing, RNAseq, single-cell RNAseq, etc.
Disease:
Phenotypes and diseases from standardized ontologies.
Drug:
Drugs from standardized ontologies along with specific typing, stop reasons, drug administration routes, and time points.
Molecular attributes:
Samples with a somatic mutation in one or multiple, specified genes.
Samples with a germline variant of a specific type in one or multiple, specified genes.
Samples over- or under-expressed in one or multiple, specified genes.
Samples with a copy number gain or loss involving one or multiple, specified genes.
ICA Cohorts currently uses six standard medical ontologies to 1) annotate each subject during ingestion and then to 2) search for subjects: HPO for phenotypes, MeSH, SNOMED-CT, ICD9-CM, ICD10-CM, and OMIM for diseases. By default, any 'type-ahead' search will find matches from all six; and you can limit the search to only the one(s) you prefer. When searching for subjects using names or codes from one of these ontologies, ICA Cohorts will automatically match your query against all the other ontologies, therefore returning subjects that have been ingested using a corresponding entry from another ontology.
In the 'Disease' tab, you can search for subjects diagnosed with one or multiple diseases, as well as phenotypes, in two ways:
Start typing the English name of a disease/phenotype and pick from the suggested matches. Continue typing if your disease/phenotype of interest is not listed initially.
Use the mouse to select the term or navigate to the term in the dropdown using the arrow buttons.
If applicable, the concept hierarchy is shown, with ancestors and immediate children visible.
For diagnostic hierarchies, concept children count and descendant count for each disease name is displayed.
Descendant Count: Displays next to each disease name in the tree hierarchy (e.g., "Disease (10)").
Leaf Nodes: No children count shown for leaf nodes.
Missing Counts: Children count is hidden if unavailable.
Show Term Count: A checkbox below "Age of Onset" that is checked by default. Unchecking it hides the descendant count.
Select a checkbox to include the diagnostic term along with all of its children and descendants.
Expand the categories and select or deselect specific disease concepts.
Paste one or multiple diagnostic codes separated by a pipe (‘|’).
In the 'Drug' tab, you can search for subjects who have a specific medication record:
Start typing the concept name for the drug and pick from suggested matches. Continue typing if the drug is not listed initially.
Paste one or multiple drug concept codes. ICA Cohorts currently use RXNorm as a standard ontology during ingestion. If multiple concepts are in your instance of ICA Cohorts, they will be listed under 'Concept Ontology.'
'Drug Type' is a static list of qualifiers that denote the specific administration of the drug, for example, where the drug was dispensed.
'Stop Reason' is a static list of attributes describing the reason a drug was stopped, if available in the ingested data.
'Drug Route' is a static list of attributes that describe the physical route of administration of the drug, for example, Intravenous Route (IV).
In the ‘Measurements’ tab, you can search for vital signs and laboratory test data using LOINC concept codes.
Start typing the English name of the LOINC term, for example, ‘Body height’. A dropdown will appear with matching terms. Use the mouse or down arrows to select the term.
Upon selecting a term, the term will be available for use in a query.
Terms can be added to your query criteria.
For each term, you can set a value `Greater than or equal`, `Equals`, `Less than or equal`, `In range`, or `Any value`.
`Any value` will find any record where there is an entry for the measurement, regardless of whether a value is present.
Click `Apply` to add your criteria to the query.
Click `Update Now` to update the running count of the Cohort.
Include/Exclude
As attributes are added to the 'Selected Condition' on the right-navigation panel, you can choose to include or exclude the criteria selected.
Select a criterion from 'Subject', 'Disease', and/or 'Molecular' attributes by filling in the appropriate checkbox on the respective attribute selection pages.
When selected, the attribute will appear in the right-navigation panel.
You can use the 'Include' / 'Exclude' dropdown next to the selected attribute to decide if you want to include or exclude subjects and samples matching the attribute.
Note: the semantics of 'Include' are such that a subject needs to match only one of the 'included' attributes in any given category to be included in the cohort (a category being disease, sex, body height, etc.). For example, if you specify multiple diseases as inclusion criteria, subjects only need to be diagnosed with one of them. Using 'Exclude', you can exclude any subject who matches one or more exclusion criteria; subjects do not have to match all exclusion criteria in the same category to be excluded from the cohort.
Note: This feature is not available on the 'Project' level selections as there is no overlap between subjects in datasets.
Note: Exclusion criteria do not account for NULL values. For example, if the Super-population 'Europeans' is excluded, subjects that lack this data point will still be included in your cohort.
Once you select Create Cohort, the above data are organized in tabs such as Project, Subject, Disease, and Molecular. Each tab then contains the aforementioned sections, among others, to help you identify cases and/or controls for further analysis. Navigate through these tabs, or search for an attribute by name to jump directly to that tab and section, and select the attributes and values that describe your subjects and samples of interest. Assign a name to the cohort you created, and click Apply to save the cohort.
After creating a Cohort, select the Duplicate icon. A copy of the Cohort definition will be created and tagged with "_copy".
Deleting a Cohort Definition can be accomplished by clicking the Delete Cohort icon.
This action cannot be undone.
After creating a Cohort, users can set a Cohort bookmark as Shared. Sharing a Cohort makes it available to other users with access to the Project. Cohorts created in a Project are otherwise only accessible to the user who created them; other users in the project cannot see the cohort unless this sharing functionality is used.
Create a Cohort using the directions above.
To make the Cohort available to other users in your Project, click the Share icon.
The Share icon will be filled in black and the Shared Status will change from Private to Shared.
Other users with access to Cohorts in the Project can now apply the Cohort bookmark to their data in the project.
To unshare the Cohort, click the Share icon.
The icon will turn from black to white, and other users within the project will no longer have access to this cohort definition.
A Shared Cohort can be Archived.
Select a Shared Cohort with a black Shared Cohort icon.
Click the Archive Cohort icon.
You will be asked to confirm this selection.
Upon archiving the Cohort definition, the Cohort will no longer be seen by other users in the Project.
The archived Cohort definition can be unarchived by clicking the Unarchive Cohort icon.
When the Cohort definition is unarchived, it will be visible to all users in the Project.
You can link cohorts data sets to a bundle as follows:
Create or edit a bundle at Bundles from the main navigation.
Navigate to Bundles > your_bundle > Cohorts > Data Sets.
Select Link Data Set to Bundle.
Select the data set which you want to link and +Select.
After a brief time, the cohorts data set will be linked to your bundle and ICA_BASE_100 will be logged.
If you cannot find the cohorts data set which you want to link, verify that:
Your data set is part of a project (Projects > your_project > Cohorts > Data Sets)
The project is set to Data Sharing (Projects > your_project > Project Settings > Details)
You can unlink cohorts data sets from bundles as follows:
Edit the desired bundle at Bundles from the main navigation.
Navigate to Bundles > your_bundle > Cohorts > Data Sets.
Select the cohorts data set which you want to unlink.
Select Unlink Data Set from Bundle.
After a brief time, the cohorts data set will be unlinked from your bundle and ICA_BASE_101 will be logged.
Bench Workspaces use a FUSE driver to mount project data directly into the workspace file system. Both read and write are supported, with some write limitations enforced by the underlying AWS S3 storage.
As a user, you can perform the following actions from Bench (provided your user permissions match the workspace permissions) or through the CLI:
Copy project data
Delete project data
Mount project data (CLI only)
Unmount project data (CLI only)
When you have a running workspace, you will find a file system in Bench under the project folder along with the basic and advanced tutorials. When opening that folder, you will see all the data that resides in your project.
WARNING: This is a fully mounted version of the project data. Changes in the workspace to project data cannot be undone.
The FUSE driver allows the user to easily copy data from /data/project to the local workspace and vice versa. There is a file size limit of 500 GB per file for the FUSE driver.
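As a minimal sketch (the paths below are illustrative, not taken from the original documentation), copying between the mounted project data and the local workspace is an ordinary file copy:

```bash
# Copy a file from the mounted project data into the local workspace (hypothetical paths)
cp /data/project/run1/sample1.fastq.gz ~/scratch/

# Copy results back into the project (written sequentially; see the write limitations below)
cp ~/scratch/report.html /data/project/results/
```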
The FUSE driver also allows you to delete data from your project. This differs from earlier Bench behavior, where you worked on a local copy and the original file remained in your project.
WARNING: Deleting project data through a Bench workspace via the FUSE driver will permanently delete the data in the Project. This action cannot be undone.
Using the FUSE driver through the CLI is not supported for Windows users. Linux users can use the CLI without any further actions; Mac users need to install the kernel extension from macFUSE.
macOS creates hidden metadata files beginning with ._ , which are copied over and exposed during CLI copies to your project data. These can be safely deleted from your project.
Mounting and unmounting data must be done through the CLI. In Bench this happens automatically, so no further action is needed there.
WARNING: Do NOT use the cp -f command to copy or move data to a mounted location. This will result in data loss, as data in the destination location will be deleted.
❗️ Once a file is written, it cannot be changed! You will not be able to update it in the project location because of the restrictions mentioned above.
Trying to update files or to save your notebook in the project folder will typically result in an error such as File Save Error for fusedrivererror.ipynb Invalid response: 500 Internal Server Error.
Some examples of other actions or commands that will not work because of the above mentioned limitations:
Saving a Jupyter notebook or R script to the /project location
Adding or removing a file from an existing zip file
Redirecting with append to an existing file, e.g. echo "This will not work" >> myTextFile.txt
Renaming a file, due to the existing association between ICA and AWS
Moving files or folders
Using vi or another editor
A file can be written only sequentially. This restriction comes from the library the FUSE driver uses to store data in AWS; that library supports only sequential writing, and random writes are currently not supported. The FUSE driver will detect random writes and the write will fail with an IO error return code. Zip will not work since zip writes a table of contents at the end of the file; use gzip instead.
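For example (illustrative paths; a sketch of the behavior described above, not an exhaustive test):

```bash
# Works: gzip streams its output sequentially, so writing straight into the mount is fine
gzip -c ~/scratch/results.txt > /data/project/results/results.txt.gz

# Fails with an IO error: zip seeks back to write its central directory at the end of the file
zip /data/project/results/results.zip ~/scratch/results.txt
```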
Listing data (ls -l) reads data from the platform. The actual data comes from AWS and there can be a short delay between the writing of the data and the listing being up to date. As a result, a file that was just written may temporarily appear as a zero-length file, and a file that was just deleted may still appear in the file list. This is a tradeoff: the FUSE driver caches some information for a limited time, and during that time the information may appear out of date. Note that besides the FUSE driver, the library used to implement the raw FUSE protocol and the OS kernel itself may also do caching.
To use a specific file in a jupyter notebook, you will need to use '/data/project/filename'.
This functionality won't work for old workspaces unless you enable the permissions for that old workspace.
This tutorial shows you how to
monitor the execution
Start Bench workspace
For this tutorial, the instance size depends on the flow you import, and whether you use a Bench cluster:
If using a cluster, choose standard-small or standard-medium for the workspace master node
Otherwise, choose at least standard-large as nf-core pipelines often need more than 4 cores to run.
Select the single user workspace permissions (aka "Access limited to workspace owner"), which allows us to deploy pipelines
Specify at least 100GB of disk space
Optional: After choosing the image, enable a cluster with at least one standard-large instance type
Start the workspace, then (if applicable) start the cluster
If conda and/or nextflow are not installed, pipeline-dev will offer to install them.
The Nextflow files are pulled into the nextflow-src subdirectory.
A larger example that still runs quickly is nf-core/sarek
All nf-core pipelines conveniently define a "test" profile that specifies a set of validation inputs for the pipeline.
The following command runs this test profile. If a Bench cluster is active, it runs on your Bench cluster, otherwise it runs on the main workspace instance.
The pipeline-dev tool uses "nextflow run ..." to run the pipeline. The full nextflow command is printed on stdout and can be copy-pasted and adjusted if you need additional options.
When a pipeline is running locally (i.e. not on a Bench cluster), you can monitor the task execution from another terminal with docker ps
When a pipeline is running on your Bench cluster, a few commands help to monitor the tasks and cluster. In another terminal, you can use:
qstat to see pending and running tasks
tail /data/logs/sge-scaler.log.<latest available workspace reboot time> to check if the cluster is scaling up or down (it currently takes 3 to 5 minutes to get a new node)
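For example (a sketch; the autoscaler log suffix depends on your workspace's latest reboot time):

```bash
# Watch the local Docker containers when running without a cluster
docker ps

# On a Bench cluster: list pending/running tasks and follow the autoscaler log
qstat
tail -f /data/logs/sge-scaler.log.*
```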
The output of the pipeline is in the outdir directory.
Nextflow work files are under the work directory.
Log files are .nextflow.log* and output.log.
After generating a few ICA-specific files (JSON input specs for Flow launch UI + list of inputs for next step's validation launch), the tool identifies which previous versions of the same pipeline have already been deployed (in ICA Flow, pipeline versioning is done by including the version number in the pipeline name, so that's what is checked here). It then asks if you want to update the latest version or create a new one.
Choose "3" and enter a name of your choice to avoid conflicts with other users following this same tutorial.
At the end, the URL of the pipeline is displayed. If you are using a terminal that supports it, Ctrl+click or middle-click can open this URL in your browser.
This launches an analysis in ICA Flow, using the same inputs as the nf-core pipeline's "test" profile.
Some of the input files will have been copied to your ICA project to allow the launch to take place. They are stored in the folder bench-pipeline-dev/temp-data.
This tutorial shows you how to
import an existing ICA Flow pipeline with a supporting validation analysis
monitor the execution
Iterative development: modify pipeline code and validate in Bench
Modify nextflow code
Modify Docker image contents (Dockerfile or Interactive method)
Make sure you have access in ICA Flow to:
the pipeline you want to work with
an analysis exercising this pipeline, preferably with a short execution time, to use as validation test
For this tutorial, the instance size depends on the flow you import, and whether you use a Bench cluster:
When using a cluster, choose standard-small or standard-medium for the workspace master node
Otherwise, choose at least standard-large if you re-import a pipeline that originally came from nf-core, as they typically need 4 or more CPUs to run.
Select the "single user workspace" permissions (aka "Access limited to workspace owner "), which allows us to deploy pipelines
Specify at least 100GB of disk space
Optional: After choosing the image, enable a cluster with at least one standard-large instance type.
Start the workspace, then (if applicable) also start the cluster
The starting point is the analysis id that is used as pipeline validation test (the pipeline id is obtained from the analysis metadata).
If no --analysis-id is provided, the tool lists all the successful analyses in the current project and lets the developer pick one.
If conda and/or nextflow are not installed, pipeline-dev will offer to install them.
The Nextflow files are pulled into the nextflow-src subdirectory.
The analysis inputs are converted into a "test" profile for nextflow, stored (among other items) in nextflow-bench.config.
The following command runs this test profile. If a Bench cluster is active, it runs on your Bench cluster, otherwise it runs on the main workspace instance:
The pipeline-dev tool uses "nextflow run ..." to run the pipeline. The full nextflow command is printed on stdout and can be copy-pasted and adjusted if you need additional options.
When a pipeline is running on your Bench cluster, a few commands help to monitor the tasks and cluster. In another terminal, you can use:
qstat to see pending and running tasks
tail /data/logs/sge-scaler.log.<latest available workspace reboot time> to check if the cluster is scaling up or down (it currently takes 3 to 5 minutes to get a new node)
The output of the pipeline is in the outdir directory.
Nextflow work files are under the work directory.
Log files are .nextflow.log* and output.log.
Nextflow files (located in the nextflow-src directory) are easy to modify.
Depending on your environment (ssh access / docker image with JupyterLab or VNC, with or without Visual Studio Code), various source code editors can be used.
After modifying the source code, you can run a validation iteration with the same command as before:
Modifying the Docker image is the next step.
Nextflow (and ICA) allow the Docker images to be specified at different places:
in config files such as nextflow-src/nextflow.config
in nextflow code files; grep container may help locate the correct files:
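For example (a simple search from the project base directory; adjust the path if your layout differs):

```bash
# List every place a Docker image is referenced in the Nextflow sources
grep -rn "container" nextflow-src/
```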
Use case: Update some of the software (mimalloc) by compiling a new version
With the appropriate permissions, you can then "docker login" and "docker push" the new image.
Fun fact: VScode with the "Dev Containers" extension lets you edit the files inside your running container:
Beware that this extension creates a lot of temp files in /tmp and in $HOME/.vscode-server. Don't include them in your image...
Update the nextflow code and/or configs to use the new image
Validate your changes in Bench:
After generating a few ICA-specific files (JSON input specs for Flow launch UI + list of inputs for next step's validation launch), the tool identifies which previous versions of the same pipeline have already been deployed (in ICA Flow, pipeline versioning is done by including the version number in the pipeline name, so that's what is checked here).
It then asks if we want to update the latest version or create a new one.
At the end, the URL of the pipeline is displayed. If you are using a terminal that supports it, Ctrl+click or middle-click can open this URL in your browser.
This launches an analysis in ICA Flow, using the same inputs as the pipeline's "test" profile.
Some of the input files will have been copied to your ICA project to allow the launch to take place. They are stored in the folder /data/project/bench-pipeline-dev/temp-data.
The Pipeline Development Kit in Bench makes it easy to create Nextflow pipelines for ICA Flow. This kit consists of a number of development tools which are installed in /data/.software (regardless of which Bench image is selected) and provides the following features:
Import to Bench
From public nf-core pipelines
From existing ICA Flow Nextflow pipelines
Run in Bench
Modify and re-run in Bench, providing fast development iterations
Deploy to Flow
Launch validation in Flow
Recommended workspace size: Nf-core Nextflow pipelines typically require 4 or more cores to run.
The pipeline development tools require
Conda, which is automatically installed by “pipeline-dev” if conda-miniconda.installer.ica-userspace.sh is present in the image.
Nextflow (version 24.10.2 is automatically installed using conda, or you can use other versions)
jq, curl (which should be made available in the image)
JupyterLab version 1.2.2 (or higher)
Pipeline development tools work best when the following items are defined:
Nextflow profiles:
test profile, specifying inputs appropriate for a validation run
docker profile, instructing NextFlow to use Docker
nextflow_schema.json, as described here. This is useful for the launch UI generation. The nf-core CLI tool (installable via pip install nf-core) offers extensive help to create and maintain this schema.
ICA Flow adds one additional constraint: the output directory out is the only one automatically copied to the Project data when an ICA Flow Analysis completes. The --outdir parameter recommended by nf-core should therefore be set to --outdir=out when running as a Flow pipeline.
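For example, a sketch of a Bench validation run that follows this convention (profile names follow the nf-core conventions described above):

```bash
# Run the pipeline's validation inputs with Docker, writing results where ICA Flow expects them
nextflow run nextflow-src -profile test,docker --outdir=out
```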
These are installed in /data/.software (which should be in your $PATH); the pipeline-dev script is the front-end to the other pipeline-dev-* tools.
Pipeline-dev fulfils a number of roles:
Checks that the environment contains the required tools (conda, nextflow, etc) and offers to install them if needed.
Checks that the fast data mounts are present (/data/mounts/project etc.) – it is useful to check regularly, as they get unmounted when a workspace is stopped and restarted.
Redirects stdout and stderr to .pipeline-dev.log, with the history of log files kept as .pipeline-dev.log.<log date>.
Launches the appropriate sub-tool.
Prints out errors with backtrace, to help report issues.
A pipeline-dev project relies on the following directory structure, which is auto-generated when using the pipeline-dev import* tools.
If you start a project manually, you must follow the same directory structure.
Project base directory
  nextflow-src: Platform-agnostic Nextflow code, for example the github contents of an nf-core pipeline, or your usual nextflow source code.
    main.nf
    nextflow.config
    nextflow_schema.json
  pipeline-dev.project-info: contains project name, description, etc.
  nextflow-bench.config (automatically generated when needed): contains definitions for Bench.
  ica-flow-config: Directory of files used when deploying the pipeline to Flow.
    inputForm.json (if not present, gets generated from nextflow-src/nextflow_schema.json): input form as defined in ICA Flow.
    onrender.js (generated at the same time as inputForm.json): javascript code to go with the input form.
    launchPayload_inputFormValues.json (if not present, gets generated from the test profile): used by “pipeline-dev launch-validation-in-flow”.
The above-mentioned project structure must be generated manually. The nf-core CLI tools can assist in generating the nextflow_schema.json. Tutorial Pipeline from Scratch goes into more details about this use case.
A directory with the same name as the nextflow/nf-core pipeline is created, and the Nextflow files are pulled into the nextflow-src subdirectory.
Tutorial Nf Core Pipelines goes into more details about this use case.
A directory called imported-flow-analysis is created.
Pipeline assets are downloaded into the nextflow-src sub-directory.
Analysis input specs are downloaded as an ica_....json file.
They are converted into a Nextflow test profile, stored in nextflow-bench.config.
Tutorial Updating an Existing Flow Pipeline goes into more details about this use case.
Currently only pipelines with publicly available Docker images are supported. Pipelines with ICA-stored images are not yet supported.
Optional parameters --local / --sge can be added to force the execution on the local workspace node, or on the workspace cluster (when available). Otherwise, the presence of a cluster is automatically detected and used.
The script then launches nextflow. The full nextflow command line is printed and launched.
In case of errors, full logs are saved as .pipeline-dev.log
Currently, not all corner cases are covered by command line options. Please start from the nextflow command printed by the tool and extend it based on your specific needs.
Nextflow can run processes with and without Docker images. In the context of pipeline development, the pipeline-dev tools assume Docker images are used, in particular during execution with the nextflow --profile docker option.
In Nextflow, Docker images can be specified at the process level. This is done with the container "<image_name:version>" directive, which can be specified
in nextflow config files (preferred method when following the nf-core best practices)
or at the start of each process definition.
Each process can use a different docker image.
It is highly recommended to always specify an image. If no Docker image is specified, Nextflow will report this. In ICA, a basic image will be used but with no guarantee that the necessary tools are available.
Resources such as the number of CPUs and memory can be specified as described here. See Containers or our tutorials for details about the Nextflow-Docker syntax.
Bench can push/pull/create/modify Docker images, as described in Containers.
This command does the following:
Generate the JSON file describing the ICA Flow user interface.
If ica-flow-config/inputForm.json doesn’t exist: generate it from nextflow-src/nextflow_schema.json.
Generate the JSON file containing the validation launch inputs.
If ica-flow-config/launchPayload_inputFormValues.json doesn’t exist: generate it from the nextflow --profile test inputs.
If local files are used as validation inputs or as default input values:
copy them to /data/project/pipeline-dev-files/temp.
get their ICA file ids.
use these file ids in the launch specifications.
If remote files are used as validation inputs or as default input values of an input of type “file” (and not “string”): do the same as above.
Identify the pipeline name to use for this new pipeline deployment:
If a deployment has already occurred in this project, or if the project was imported from an existing Flow pipeline, start from this pipeline name. Otherwise start from the project name.
Identify which already-deployed pipelines have the same base name, with or without suffixes that could be some versioning (_v<number>, _<number>, _<date>).
Ask the user if they prefer to update the current version of the pipeline, create a new version, or enter a new name of their choice – or use the --create/--update parameters when specified, for scripting without user interaction.
A new ICA Flow pipeline gets created (except in the case of a pipeline update).
The current Nextflow version in Bench is used to select the best Nextflow version to be used in Flow.
The nextflow-src directory is uploaded file by file as pipeline assets.
Output Example:
The pipeline name, id and URL are printed out, and if your environment allows, Ctrl+Click/Option+Click/Right click can open the URL in a browser.
Opening the URL of the pipeline and clicking on Start Analysis shows the generated user interface:
The ica_... file generated in the previous step is submitted to ICA Flow to start an analysis with the same validation inputs as “nextflow --profile test”.
Output Example:
The analysis name, id and URL are printed out, and if your environment allows, Ctrl+Click/Option+Click/Right click can open the URL in a browser.
This tutorial shows you how to start a new pipeline from scratch
Start Bench workspace
For this tutorial, any instance size will work, even the smallest standard-small.
Select the single user workspace permissions (aka "Access limited to workspace owner"), which allows us to deploy pipelines.
A small amount of disk space (10GB) will be enough.
We are going to wrap the "gzip" linux compression tool with inputs:
1 file
compression level: integer between 1 and 9
We intentionally do not include sanity checks, to keep this scenario simple.
Here is an example of Nextflow code that wraps the gzip command and publishes the final output in the “out” directory:
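A minimal sketch of what such a wrapper might contain, written here as a shell heredoc (the parameter names params.file and params.compression_level are illustrative assumptions; the actual tutorial code may differ):

```bash
mkdir -p nextflow-src
cat > nextflow-src/main.nf <<'EOF'
nextflow.enable.dsl = 2

// Illustrative parameter names (assumptions, not official tutorial values)
params.file              = null   // file to compress
params.compression_level = 6      // integer between 1 and 9

process GZIP {
    // "out" is the directory ICA Flow copies back to Project data
    publishDir 'out', mode: 'copy'

    input:
    path input_file

    output:
    path "${input_file}.gz"

    script:
    """
    gzip -c -${params.compression_level} ${input_file} > ${input_file}.gz
    """
}

workflow {
    GZIP(Channel.fromPath(params.file))
}
EOF
```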
Save this file as nextflow-src/main.nf, and check that it works:
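One way to check it (hypothetical test file and values):

```bash
# Create a small test input and run the wrapper locally
echo "hello pipeline-dev" > /tmp/example.txt
nextflow run nextflow-src/main.nf --file /tmp/example.txt --compression_level 9
ls out/
```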
We now need to:
Use Docker
Follow some nf-core best practices to make our source+test compatible with the pipeline-dev tools
In NextFlow, Docker images can be specified at the process level
Each process may use a different docker image
It is highly recommended to always specify an image. If no Docker image is specified, Nextflow will report this. In ICA, a basic image will be used but with no guarantee that the necessary tools are available.
Specifying the Docker image is done with the container '<image_name:version>' directive, which can be specified
at the start of each process definition
or in nextflow config files (preferred when following nf-core guidelines)
For example, create nextflow-src/nextflow.config:
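A minimal sketch of such a config (the ubuntu:22.04 image is just an illustration; any image that provides gzip works):

```bash
cat > nextflow-src/nextflow.config <<'EOF'
// Default container for every process (can also be set per process)
process.container = 'ubuntu:22.04'
EOF
```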
We can now run with nextflow's -with-docker option:
Following some nf-core best practices to make our source+test compatible with the pipeline-dev tools:
Here is an example of a “test” profile that can be added to nextflow-src/nextflow.config to define some input values appropriate for a validation run:
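A sketch of what this could look like, reusing the illustrative parameter names from the main.nf sketch above:

```bash
cat >> nextflow-src/nextflow.config <<'EOF'
profiles {
    test {
        params.file              = '/tmp/example.txt'  // hypothetical validation input
        params.compression_level = 9
    }
}
EOF
```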
With this profile defined, we can now run the same test as before with this command:
A “docker” profile is also present in all nf-core pipelines. Our pipeline-dev tools will make use of it, so let’s define it:
We can now run the same test as before with this command:
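For example (a sketch, assuming the test and docker profiles described above are defined in nextflow-src/nextflow.config):

```bash
nextflow run nextflow-src -profile test,docker
```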
We also have enough structure in place to start using the pipeline-dev command:
In order to deploy our pipeline to ICA, we need to generate the user interface input form.
This is done by using nf-core's recommended nextflow_schema.json.
For our simple example, we generate a minimal one by hand (using one of the nf-core pipelines as an example):
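A sketch of what a minimal, hand-written schema might look like (the property names mirror the illustrative parameters used earlier; the exact fields your pipeline and the pipeline-dev tools need may differ):

```bash
cat > nextflow-src/nextflow_schema.json <<'EOF'
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "gzip-wrapper pipeline parameters",
  "description": "Minimal hand-written schema for the gzip wrapper example",
  "type": "object",
  "properties": {
    "file": {
      "type": "string",
      "format": "file-path",
      "description": "Input file to compress"
    },
    "compression_level": {
      "type": "integer",
      "minimum": 1,
      "maximum": 9,
      "default": 6,
      "description": "gzip compression level"
    }
  }
}
EOF
```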
In the next step, this gets converted to the ica-flow-config/inputForm.json file.
Note: For large pipelines, as described on the nf-core website:
Manually building JSONSchema documents is not trivial and can be very error prone. Instead, the nf-core pipelines schema build command collects your pipeline parameters and gives interactive prompts about any missing or unexpected params. If no existing schema is found it will create one for you.
We recommend looking into "nf-core pipelines schema build -d nextflow-src/", which comes with a web builder to add descriptions etc.
We just need to create a final file, which we had skipped until now: our project description file, which can be created via the command pipeline-dev project-info --init:
We can now run:
After generating the ICA-Flow-specific files in the ica-flow-config directory (JSON input specs for the Flow launch UI + list of inputs for the next step's validation launch), the tool identifies which previous versions of the same pipeline have already been deployed (in ICA Flow, pipeline versioning is done by including the version number in the pipeline name).
It then asks if we want to update the latest version or create a new one.
Choose "3" and enter a name of your choice to avoid conflicts with all the others users following this same tutorial.
At the end, the URL of the pipeline is displayed. If you are using a terminal that supports it, Ctrl+click or middle-click can open this URL in your browser.
This launches an analysis in ICA Flow, using the same inputs as the pipeline's "test" profile.
Some of the input files will have been copied to your ICA project in order for the analysis launch to work. They are stored in the folder /data/project/bench-pipeline-dev/temp-data.
ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.
To import a new data set, select Import Jobs from the left navigation tab underneath Cohorts, and click the Import Files button. The Import Files button is also available under the Data Sets left navigation item.
The Data Sets menu item is used to view imported data sets and information. The Import Jobs menu item is used to check the status of data set imports.
Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.
Choose a data type among
Germline variants
Somatic mutations
RNAseq
GWAS
Choose a new study name by selecting the radio button Create new study and entering a Study Name.
To add new data to an existing Study, select the radio button Select from list of studies and select an existing Study Name from the dropdown.
To add data to existing records or add new records, set Job Type to Append. Append does not wipe out any data ingested previously and can be used to ingest the molecular data in an incremental manner.
To replace data, set Job Type to Replace. If you are ingesting data again, use the Replace job type.
Enter an optional Study description.
Select the metadata model (default: Cohorts; alternatively, select OMOP version 5.4 if your data is formatted that way.)
Select the genome build your molecular data is aligned to (default: GRCh38/hg38)
For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.
Click Next.
Navigate to VCFs located in the Project Data.
Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.
As an alternative to selecting individual files, you can also opt to select a folder instead. Toggle the radio button on Step 2 from "Select files" to "Select folder".
This option is currently only available for germline variant ingestion: any combination of small variants, structural variation, and/or copy number variants.
ICA Cohorts will scan the selected folder and all sub-folders for any VCF files or JSON files and try to match them against the Sample ID column in the metadata TSV file (Step 3).
Files not matching sample IDs will be ignored; allowed file extensions for VCF files after the sample ID are: *.vcf.gz, *.hard-filtered.vcf.gz, *.cnv.vcf.gz, and *.sv.vcf.gz .
Files not matching sample IDs will be ignored; allowed file extensions for JSON files after the sample ID are: *.json, *.json.gz, *.json.bgz, and *.json.gzip.
Click Next.
Navigate to the metadata (phenotype) data tsv in the project Data.
Select the TSV file or files for ingestion.
Click Finish.
Search spinner behavior in the import jobs table:
Search for a term and press Enter.
The search spinner will appear while the results are loading.
Once the results are displayed in the table, the spinner will disappear.
All VCF types, specifically from DRAGEN, can be ingested using the Germline variants selection. Cohorts will distinguish the variant types that it is ingesting. If Cohorts cannot determine the variant file type, it will default to ingest small variants.
As an alternative to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.
The maximum number of files in a single manual ingestion batch is 1000.
Alternatively, users can choose a single folder and ICA Cohorts will identify all ingestible files within that folder and its sub-folders. In this scenario, Cohorts will select molecular data files matching the samples listed in the metadata sheet, which is provided in the next step of the import process.
Users have the option to ingest either VCF files or Nirvana JSON files for any given batch, regardless of the chosen ingestion method.
The sample identifiers used in the VCF columns need to match the sample identifiers used in subject/sample metadata files; accordingly, if you are starting from JSON files containing variant- and gene-level annotations provided by ILMN Nirvana, the samples listed in the header need to match the metadata files.
ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:
##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)
##contig=<ID=chr1,length=248956422> --- for hg38/GRCh38
##DRAGENCommandLine= ... --ht-reference
ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process [see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.
ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. The file naming convention needs to match .quant.genes.sf for genes and .quant.sf for transcript-level TPM (transcripts per million).
Note: If annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.
PERSON (mandatory),
CONCEPT (mandatory if any of the following is provided),
CONDITION_OCCURRENCE (optional),
DRUG_EXPOSURE (optional), and
PROCEDURE_OCCURRENCE (optional.)
Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.
Note that Cohorts requires that all such files do not deviate from the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.
[1] VcfMapper: https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py
[2] crossMap: https://crossmap.sourceforge.net/
[3] liftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver
In ICA Cohorts, metadata describe any subjects and samples imported into the system in terms of attributes, including:
subject:
demographics such as age, sex, ancestry;
phenotypes and diseases;
biometrics such as body height, body mass index, etc.;
pathological classification, tumor stages, etc.;
family and patient medical history;
sample:
sample type such as FFPE,
tissue type,
sequencing technology: whole genome DNA-sequencing, RNAseq, single-cell RNAseq, among others.
A metadata sheet will need to contain at least these four columns per row:
Subject ID - identifier referring to individuals; use the column header "SubjectID".
Sample ID - identifier for a sample. Sample IDs need to match the corresponding column header in VCF/GVCFs; each subject can have multiple samples, these need to be specified in individual rows for the same SubjectID; use the column header "SampleID".
Biological sex - can be "Female (XX)", "Female"; "Male (XY)", "Male"; "X (Turner's)"; "XXY (Klinefelter)"; "XYY"; "XXXY" or "Not provided". Use the column header "DM_Sex" (demographics).
Sequencing technology - can be "Whole genome sequencing", "Whole exome sequencing", "Targeted sequencing panels", or "RNA-seq"; use the column header "TC" (technology).
ICA Cohorts is a cohort analysis tool integrated with Illumina Connected Analytics (ICA). ICA Cohorts combines subject- and sample-level metadata, such as phenotypes, diseases, demographics, and biometrics, with molecular data stored in ICA to perform tertiary analyses on selected subsets of individuals.
Intuitive UI for selecting subjects and samples to analyze and compare: deep phenotypical and clinical metadata, molecular features including germline, somatic, gene expression.
Comprehensive, harmonized data model exposed to ICA Base and ICA Bench users for custom analyses.
Run analyses in ICA Base and ICA Bench and upload final results back into Cohorts for visualization.
Out-of-the-box statistical analyses including genetic burden tests, GWAS/PheWAS.
Rich public data sets covering key disease areas to enrich private data analysis.
Easy-to-use visualizations for gene prioritization and genetic variation inspection.
As an alternative to VCFs, ICA Cohorts accepts the JSON output of Nirvana for hg38/GRCh38-aligned data for small germline variants and somatic mutations, copy number variations, and other structural variants.
Please also see the online documentation for more information on output file formats.
ICA Cohorts currently supports upload of SNV-level GWAS results produced by Regenie and saved as CSV files.
As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the OMOP CDM 5.4 standard. Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:
[4] Chain files:
You can use these attributes while creating a cohort to define the cases and/or controls that you want to include.
During data import, you will be asked to upload a metadata sheet as a tab-delimited (TSV) file. An example sheet is available for download on the Import files page in the ICA Cohorts UI.
A description of all attributes and data types currently supported by ICA Cohorts can be found here:
You can download an example of a metadata sheet, which contains some samples from The Cancer Genome Atlas (TCGA) and their publicly available clinical attributes, here:
A list of concepts and diagnoses that cover all public data subjects to easily navigate the new concept code browser for diagnosis can be found here:
This video is an overview of Illumina Connected Analytics. It walks through a Multi-Omics Cancer workflow that can be found here:
ICA Cohorts contains a variety of freely available data sets covering different disease areas and sequencing technologies. For a list of currently available data sets, see:
Project name: The ICA project for your cohort analysis (cannot be changed).
Study name: Create or select a study. Each study represents a subset of data within the project.
Description: Short description of the data set (optional).
Job type: Append appends values to any existing values (if a field supports only a single value, the value is replaced). Replace overwrites existing values with the values in the uploaded file.
Subject metadata files: Subject metadata file(s) in tab-delimited format. For Append and Replace job types, the following fields are required and cannot be changed: Sample identifier, Sample display name, Subject identifier, Subject display name, Sex.
From the Cohorts menu in the left-hand navigation, select a cohort created in Create Cohort to begin a cohort analysis.
The query details can be accessed by clicking the triangle next to Show Query Details. The query details display the selections used to create the cohort. The selections can be edited by clicking the pencil icon in the top right.
Charts will be open by default. If not, click Show Charts.
Use the gear icon in the top-right to change viewable chart settings.
There are four charts available to view summary counts of attributes within a cohort as histogram plots.
Click Hide Charts to hide the histograms.
Display time-stamped events and observations for a single subject on a timeline. The timeline view is only available for subjects that have time-series data.
The following attributes are displayed in the timeline view:
• Diagnosed and Self-Reported Diseases: start and end dates; progression vs. remission
• Medication and Other Treatments: prescribed and self-medicated; start date, end date, and dosage at every time point
The timeline utilizes age (at diagnosis, at event, at measurement) as the x-axis and attribute name as the y-axis. If the birthdate is not recorded for a subject, the user can now switch to Date to visualize data.
In the default view, the timeline shows the first five disease data and the first five drug/medication data in the plot. Users can choose different attributes or change the order of existing attributes by clicking on the “select attribute” button.
The x-axis shows the person’s age in years, with data points initially displayed between ages 0 to 100. Users can zoom in by selecting the desired range to visualize data points within the selected age range.
Each event is represented by a dot in the corresponding track. Events in the same track can be connected by lines to indicate the start and end period of an event.
Measurement Section: A summary of measurements (without values) is displayed under the section titled "Measurements and Laboratory Values Available." Users can click a link to access the Timeline View for detailed results.
Drug Section: The "Drug Name" section lists drug names without repeating the header "Drug Name" for each entry.
By default, the Subjects tab is displayed.
The Subjects tab, with a list of all subjects matching your criteria, is displayed below Charts with a link to each Subject by ID and other high-level information. By clicking a subject ID, you will be brought to the data collected at the Subject level.
Search for a specific subject by typing the Subject ID into the Search Subjects text box.
Get all details available on a subject by clicking the hyperlinked Subject ID in the Subject list.
To exclude specific subjects from subsequent analysis, such as marker frequencies or gene-level aggregated views, you can uncheck the box at the beginning of each row in the subject list. You will then be prompted to save any exclusion(s).
You can export the list of subjects either to your ICA Project's data folder or to your local disk as a TSV file for subsequent use. Any export will omit subjects that you excluded after you saved those changes. For more information, see the bottom of this page.
Specific subjects can be removed from a Cohort.
Select the Subjects tab.
Subjects in the Cohort are checked by default.
To remove specific subjects from the Cohort, uncheck the checkbox next to those subjects.
Checkbox selections are maintained while browsing through the pages of the subject list.
Click Save Cohort to save the subjects you would like to exclude.
The excluded subjects will no longer be counted in any analysis visualizations.
The excluded subjects will be saved for the Cohort.
To add the subjects back to the Cohort, select the checkboxes again and click Save Cohort.
For each individual cohort, display a table of all observed SVs that overlap with a given gene.
Click the Marker Frequency tab, then click the Gene Expression tab.
Down-regulated genes are displayed in blue and up-regulated genes are displayed in red.
A frequency in the Cohort is displayed, and the Matching number/Total is also displayed in the chart.
Genes can be searched by using the Search Genes text box.
You are brought to the Gene tab under the Gene Summary sub-tab.
Select a Gene by typing the gene name into the Search Genes text box.
A Gene Summary will be displayed that lists information and links to public resources about the selected gene.
A cytogenetic map will be displayed based on the selected gene, and a vertical orange bar represents the gene location on the chromosome.
Click the Variants tab and Show legend and filters if it does not open by default.
Below the interactive legend, you see a set of analysis tracks: Needle Plot, Primate AI, Pathogenic variants, and Exons.
The Needle Plot allows toggling the plot by gnomAD frequency and Sample Count. Select Sample Count in the Plot by legend above the plot. You can also filter the plot to only show variants above/below a certain cut-off for gnomAD frequency (in percent) or absolute sample count.
The Needle Plot allows filtering by PrimateAI Score.
Set a lower (>=) or upper (<=) threshold for the PrimateAI Score to filter variants.
Enter the threshold value in the text box located below the gnomadFreq/SampleCount input box.
If no threshold value is entered, no filter will be applied.
The filter affects both the plot and the table when the “Display only variants shown in the plot above” toggle is enabled.
Filter preferences persist across gene views for a seamless experience.
Click on a variant's needle pin to view details about the variant from public resources and counts of variants in the selected cohort by disease category. If you want to view all subjects that carry the given variant, click on the sample count link, which will take you to the list of subjects (see above).
Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in on the gene domain to better separate observations.
The Pathogenic Variant track shows pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry when hovering over the purple triangles.
Below the needle plot is a full listing of the variants displayed in the needle plot visualization.
The "Display only variants shown in the plot above" toggle (enabled by default) syncs the table with the Needle Plot. When the toggle is on, the table displays only the variants shown in the Needle Plot, applying all active filters (e.g., variant type, somatic/germline, sample count). When the toggle is off, all reported variants are displayed in the table and table-based filters can be used.
Export to CSV: When the views are synchronized (toggle on), the filtered list of variants can be exported to a CSV file for further analysis.
The Phenotypes tab shows a stacked horizontal bar chart which displays the molecular breakdown (disease type vs. gene) and subject count for the selected gene.
Note on "Stop Lost" Consequence Variants:
The
stop_lost
consequence is mapped asFrameshift, Stop lost
in the tooltip.The l
Stop gained|lost
value includes both stop gain and stop loss variants.When the Stop gained filter is applied, Stop lost variants will not appear in the plot or table if the "Display only variants shown in the plot above" toggle is enabled
The Gene Expression tab shows known gene expression data from tissue types in GTEx.
The Genetic Burden Test is only available for de novo variants.
For every correlation, subjects contained in each count can be viewed by selecting the count on the bubble or the count on the X-axis and Y-axis.
Click the Correlation tab.
In X-axis category, select Clinical.
In X-axis Attribute, select a clinical attribute.
In Y-axis category, select Clinical.
In Y-axis Attribute, select another clinical attribute.
You will be shown a bubble plot comparing the first clinical attribute on the x-axis to the second attribute on the y-axis.
The size of the bubbles corresponds to the number of subjects falling into those categories.
To see a breakdown of Somatic Mutations vs. RNA Expression levels perform the following steps:
Note this comparison is for a Cancer case.
Click the Correlation tab.
In X-axis category, select Somatic.
In X-axis Attribute, select a gene.
In Y-axis category, select RNA expression.
In Y-axis Attribute, type a gene and leave Reference Type set to NORMAL.
Click Continuous to see violin plots of the compared variables.
Note this comparison is for a Cancer case.
Click the Correlation tab.
In X-axis category, select Somatic.
In X-axis Attribute, type a gene name.
In Y-axis category, select Clinical.
In Y-axis Attribute, select a clinical attribute.
Click the Molecular Breakdown tab.
In Enter a clinical attribute, select a clinical attribute.
In Enter a gene, select a gene by typing a gene name.
You are shown a stacked bar chart with the selected clinical attribute values on the Y-axis.
For each attribute value, the bar represents the % of Subjects with RNA Expression, Somatic Mutation, and Multiple Alterations.
Note: for each of the aforementioned bubble plots, you can view the list of subjects by following the link under each subject count associated with an individual bubble or axis label. This will take you to the list of subjects view, see above.
If there is Copy Number Variant data in the cohort:
Click the CNV tab.
A graph will show the CNV Sample Percentage on the Y-axis and Chromosomes on the X-axis.
Any value above zero is a copy number gain, and any value below zero is a copy number loss.
Click Chromosome: to select a specific chromosome position.
ICA allows for integrated analysis in a computation workspace. You can export your cohort definitions and, in combination with molecular data in your ICA Project Data, perform, for example, a GWAS analysis.
Confirm the VCF data for your analysis is in ICA Project Data.
From within your ICA Project, Start a Bench Workspace -- See Bench for more details.
Navigate back to ICA Cohorts.
Create a Cohort of subjects of interest using Create a Cohort.
From the Subjects tab, click Export subjects... at the top-right of the subject list. The file can be downloaded to the Browser or to ICA Project Data.
We suggest using export ...to Data Folder for immediate access to this data in Bench or other areas of ICA.
Create another cohort if needed for your research and repeat the last three steps.
Navigate to the Bench workspace created in the second step.
After the workspace has started up, click Access.
Find the /Project/ folder in the Workspace file navigation.
This folder will contain your cohort files created along with any pipeline output data needed for your workspace analysis.
This walk-through is meant to represent a typical workflow when building and studying a cohort of rare genetic disorder cases.
Create a new Project to track your study:
Log in to ICA.
Navigate to Projects.
Create a new project using the New Project button.
Give your project a name and click Save.
Navigate to the ICA Cohorts module by clicking COHORTS in the left navigation panel, then choose Cohorts.
Click the Create Cohort button.
Enter a name for your cohort, like Rare Disease + 1kGP, at the top, left of the pencil icon.
From the Public Data Sets list select:
DRAGEN-1kGP
All Rare genetic disease cohorts
Notice that a cohort can also be created based on Technology, Disease Type, and Tissue.
Under Selected Conditions in the right panel, click Apply.
A new page opens with your cohort in a top-level tab.
Expand Query Details to see the study makeup of your cohort.
A set of 4 Charts will be open by default. If they are not, click Show Charts.
Use the gear icon in the top-right of the Charts pane to change chart settings.
The bottom section is demarcated by 8 tabs (Subjects, Marker Frequency, Genes, GWAS, PheWAS, Correlation, Molecular Breakdown, CNV).
The Subjects tab displays a list of exportable Subject IDs and attributes.
Clicking on a Subject ID link pops up a Subject details page.
A recent GWAS publication identified 10 risk genes for intellectual disability (ID) and autism. Our task is to evaluate them in ICA Cohorts: TTN, PKHD1, ANKRD11, ARID1B, ASXL3, SCN2A, FHL1, KMT2A, DDX3X, SYNGAP1.
First let’s Hide charts for more visual space.
Click the Genes tab, where you need to query a gene to see and interact with results.
Type SCN2A into the Gene search field and select it from the autocomplete dropdown options.
The Gene Summary tab now lists information and links to public resources about SCN2A.
Click on the Variants tab to see an interactive Legend and analysis tracks.
The Needle Plot displays gnomAD Allele Frequency for variants in your cohort.
Note that some are in SCN2A conserved protein domains.
In the Legend, switch the Plot by option to Sample Count in your cohort.
In the Legend, uncheck all Variant Types except Stop gained. Now you should see 7 variants.
Hover over pin heads to see pop-up information about particular variants.
The Primate AI track displays Scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered "likely pathogenic", as cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".
The Pathogenic variants track displays markers from ClinVar, color-coded by variant type. Hover over them to see pop-ups with more information.
The Exons track shows mRNA exon boundaries with click and zoom functionality at the ends.
Below the Needle Plot and analysis tracks is a list of "Variants observed in the selected cohort".
The Export Gene Variants table icon is above the legend on the right side.
Now let's click on the Gene Expression tab to see a bar chart of 50 normal tissues from GTEx in transcripts per million (TPM). SCN2A is highly expressed in certain brain tissues, indicating specificity to where good markers for intellectual disability and autism could be expected.
As a final exercise in discovering good markers, click on the tab for Genetic Burden Test. The table here associates Phenotypes with Mutations Observed in each Study selected for our cohort, alongside Mutations Expected, to derive p-values. Given all considerations above, SCN2A is a good marker for intellectual disability (p < 1.465 x 10^-22) and autism (p < 5.290 x 10^-9).
Continue to check the other genes of interest in step 1.
ICA Cohorts comes front-loaded with a variety of publicly accessible data sets, covering multiple disease areas and also including healthy individuals.
This walk-through is intended to represent a typical workflow when building and studying a cohort of oncology cases.
Click the Create Cohort button.
Select the following studies to add to your cohort:
TCGA – BRCA – Breast Invasive Carcinoma
TCGA – Ovarian Serous Cystadenocarcinoma
Add a Cohort Name = TCGA Breast and Ovarian_1472
Click on Apply.
Expand Show query details to see the study makeup of your cohort.
Charts will be open by default. If not, click Show charts.
Use the gear icon in the top-right to change viewable chart settings.
Tip: Disease Type, Histological Diagnosis, Technology, and Overall Survival have interesting data about this cohort.
The Subject tab with the list of all Subjects is displayed below Charts, with a link to each Subject by ID and other high-level information, like Data Types measured and reported. By clicking a subject ID, you will be brought to the data collected at the Subject level.
Search for subject TCGA-E2-A14Y and view the data about this Subject.
Click the TCGA-E2-A14Y Subject ID link to view clinical data for this Subject that was imported via the metadata.tsv file on ingest.
Note: the Subject is a 35 year old Female with vital status and other phenotypes that feed up into the Subject attribute selection criteria when creating or editing cohorts.
Click X to close the Subject details.
Click Hide charts to increase the interactive landscape.
Click the Marker Frequency tab, then click the Somatic Mutation tab.
Review the gene list and mutation frequencies.
Note that PIK3CA has a high rate of mutation in the Cohort (ranked 2nd with 33% mutation frequency in 326 of the 987 Subjects that have Somatic Mutation data in this cohort).
Do Subjects with PIK3CA mutations have changes in PIK3CA RNA Expression?
Click the Gene Expression
tab, search for PIK3CA
PIK3CA RNA is down-regulated in 27% of the subjects relative to normal samples.
Switch from normal
to disease
Reference where the Subject’s denominator is the median of all disease samples in your cohort.
Note the count of matching vs. total subjects that have PIK3CA up-regulated RNA, which may indicate a distinctive sub-phenotype.
Click directly on PIK3CA
gene link in the Gene Expression
table.
You are brought to the Gene
tab under the Gene Summary
sub-tab that lists information and links to public resources about PIK3CA.
Click the Variants
tab and Show legend and filters
if it does not open by default.
Below the interactive legend you see a set of analysis tracks: Needle Plot, Primate AI, Pathogenic variants, and Exons.
The Needle Plot allows toggling the plot by gnomAD frequency
and Sample Count
. Select Sample Count
in the Plot by
legend above the plot.
There are 87 mutations distributed across the 1068-amino-acid sequence, listed below the analysis tracks. These can be exported into a table via the icon.
We know that missense variants can severely disrupt translated protein activity. Deselect all Variant Types
except for Missense
from the Show Variant Type
legend above the needle plot.
Many mutations are in the functional domains of the protein as seen by the colored boxes and labels on the x-axis of the Needle Plot.
Hover over the variant with the highest sample count in the yellow PI3Ka
protein domain.
The pop-up shows variant details for the 64 Subjects observed with it: 63 in the Breast Cancer study and 1 in the Ovarian Cancer Study.
Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in to the PI3Ka
domain to better separate observations.
There are three different missense mutations at this locus, changing the wildtype glutamic acid at different frequencies to lysine (64), glycine (6), or alanine (2).
The Pathogenic Variant
track shows 7 ClinVar entries for mutations stacked at this locus affecting amino acid 545. Pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry are shown by hovering over the purple triangles.
Note the Primate AI
track and high Primate AI score.
Primate AI
track displays scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered likely pathogenic because the cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".
Click the Expression
tab and notice that normal breast and normal ovarian tissue have relatively high PIK3CA RNA expression in GTEx RNAseq tissue data, although the gene is ubiquitously expressed.
You can compare up to four previously created individual cohorts, to view differences in variants and mutations, RNA expression, copy number variation, and distribution of clinical attributes. Once comparisons are created, they are saved in the Comparisons
left-navigation tab of the Cohorts module.
Select Cohorts
from the left-navigation panel.
Select 2 to 4 cohorts already created. If you have not created any cohorts, see the Create a Cohort documentation.
Click Compare Cohorts
in the right-navigation panel.
Note you are now in the Comparisons
left-navigation tab of the Cohorts module.
In the Charts
Section, if the COHORTS
item is not displayed, click the gear icon in the top right and add Cohorts
as the first attribute and click Save
.
The COHORTS
item in the charts panel will provide a count of subjects in each cohort and act as a legend for color representation throughout comparison screens.
For each clinical attribute category, a bar chart is displayed. Use the gear icon to select attributes to display in the charts panel.
You can share a comparison with other team members in the same ICA Project. Please refer to the "Sharing a Cohort" section in "Create a Cohort" for details on sharing, unsharing, deleting, and archiving, which apply analogously to comparisons.
Select the Attributes
tab
Attribute categories are listed and can be expanded using the down-arrows next to the category names. The categories available are based on cohorts selected. Categories and attributes are part of the ICA Cohorts metadata template that map to each Subject.
For example, use the drop-down arrow next to Vital status
to view sub-categories and frequencies across selected cohorts.
Select the Genes
tab
Search for a gene of interest using its HUGO/HGNC gene symbol
As additional filter options, you can view only those variants that occur in every cohort, that are unique to one cohort, or that have been observed in at least two cohorts; or you can view any variant.
Select the Survival Summary
tab.
Attribute categories are listed and can be expanded using the down-arrows next to the category names.
Select the drop-down arrow for Therapeutic interventions
.
In each subcategory there is a sum of the subject counts across the selected cohorts.
For each cohort, designated by a color, there is a Subject count
and Median survival (years)
column.
Type Malignancy
in the Search Box and an auto-complete dropdown suggests three different attributes.
Select Synchronous malignancy
and the results are automatically opened and highlighted in orange.
Click Survival Comparison
tab.
A Kaplan-Meier Curve is rendered based on each cohort.
The P-Value displayed at the top of the Survival Comparison indicates whether there is a statistically significant difference between the survival probabilities over time of any pair of cohorts (CI=0.95).
When comparing two cohorts, the P-Value is shown above the two survival curves. For three or four cohorts, P-Values are shown as a pair-wise heatmap, comparing each cohort to every other cohort.
Select the Marker Frequency
tab.
Select either Gene expression
(default), Somatic mutation
, or Copy number variation
For gene expression (up- versus down-regulated) and for copy number variation (gain versus loss), Cohorts will display a list of all genes with bidirectional bar charts.
For somatic mutations, the bar charts are unidirectional and indicate the percentage of samples with a mutation in each gene per cohort.
Bars are color-coded by cohort; see the accompanying legend.
Each row shows P-value(s) resulting from pairwise comparison of all cohorts. In the case of comparing two cohorts, the numerical P-value will be displayed in the table. In the case of comparing three or more cohorts, the pairwise P-values are shown as a triangular heatmap, with details available as a tooltip.
Select the Correlation
tab.
Similar to the single-cohort view (Cohort Analysis | Correlation
), choose two clinical attributes and/or genes to compare.
Depending on the available data types for the two selections (categorical and/or continuous), Cohorts will display a bubble plot, violin plot, or scatter plot.
Projects may be shared by modifying the project's Team. Team members can be added using one of the following entities:
User within the current tenant
E-mail address
Workgroup within the current tenant
Select the corresponding option under Add more team members.
Each entity added to the project team will have an assigned role with regards to specific categories of functionality in the application. These categories are:
Project
Flow
Base
Bench
While the categories will determine most of what a user can do or see, explicit upload and download rights need to be granted for users. This is done by selecting the appropriate upload and download icons.
Upload and download rights are independent of the assigned role. A user with only viewer rights will still be able to perform uploads and downloads if their upload and download rights are not disabled. Likewise, an administrator can only perform uploads and downloads if their upload and download rights are enabled.
The sections below describe the roles for each category and the allowed actions.
If a user qualifies for multiple entities added to the project team (i.e., added as an individual user and also a member of an added workgroup), the highest level of access provided by any of those roles is granted.
1kGP-DRAGEN
3202 WGS: 2504 original samples plus 698 relateds
Presumed healthy
DDD
4293 (3664 affected), de novos only
Developmental disorders
EPI4K
356, de novos only
Epilepsy
ASD Cohorts
6786 (4266 affected), de novos only
Autism Spectrum disorder
De Ligt et al.
100, de novos only
Intellectual disability
Homsy et al.
1213, de novos only
Congenital heart disease (HP:0030680)
Lelieveld et al.
820, de novos only
Intellectual disability
Rauch et al.
51, de novos only
Intellectual disability
Rare Genomes Project
315 WES (112 pedigrees)
Various
https://raregenomes.org/
TCGA
ca. 4200 WES, ca. 4000 RNAseq
12 tumor types
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
GEO
RNAseq
Auto-immune disorders, incl. asthma, arthritis, SLE, MS, Crohn's disease, Psoriasis, Sjögren's Syndrome
For GEO/GSE study identifiers, please refer to the in-product list of studies
RNAseq
Kidney diseases
For GEO/GSE study identifiers, please refer to the in-product list of studies
RNAseq
Central nervous system diseases
For GEO/GSE study identifiers, please refer to the in-product list of studies
RNAseq
Parkinson's disease
For GEO/GSE study identifiers, please refer to the in-product list of studies
Variants and mutations will be displayed as one needle plot for each cohort that is part of the comparison (see in this online help for more details)
Note: Inviting users by email will not automatically send out the email invites when you save the changes, because you might want to hold off on sending the actual invite until you have completed your project configuration. To send out the email invite, select the letter icon on the right.
Create a Connector
x
x
x
x
View project resources
x
x
x
Link/Unlink data to a project
x
x
Subscribe to notifications
x
x
View Activity
x
x
Create samples
x
x
Delete/archive data
x
Manage notification channels
x
Manage project team
x
View analyses results
x
x
Create analyses
x
Create pipelines and tools
x
Edit pipelines and tools
x
Add docker image
x
View table records
x
x
Click on links in table
x
x
Create queries
x
x
Run queries
x
x
Export query
x
x
Save query
x
x
Export tables
x
x
Create tables
x
Load files into a table
x
Execute a notebook
x
x
Start/Stop Workspace
x
x
Create/Delete/Modify workspaces
x
Install additional tools, packages, libraries, …
x
Build a new Bench docker image
x
Create a tool for pipeline-execution
x
ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See Base for more information on enabling this feature in your ICA Project.
After ingesting data into your project, phenotypic and molecular data are available to view in Base. See Cohorts Import for instructions on importing data sets into Cohorts.
Post ingestion, data will be represented in Base.
Select BASE
from the ICA left-navigation and click Query
.
Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project \<your project name\>
.
Cohorts tables will be displayed.
To preview the table and fields click each view listed.
Clicking any of these views then selecting PREVIEW
on the right-hand side will show you a preview of the data in the tables.
Note: If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.
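As a minimal sketch of how these views can be queried, assuming the shared database for your project is expanded in the query tree (fully qualify the view name as it appears there if needed; the column names come from the PHENOTYPE table described below):

SELECT SUBJECTID, STUDY, SEX, AGE
FROM PHENOTYPE
WHERE STUDY = 'TCGA'
LIMIT 10;

This returns a small slice of the harmonized phenotype data and can be saved or exported like any other Base query.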
Note: The PHENOTYPE table includes a harmonized set of attributes collected across all data ingestions and is not representative of all data ingested for the Subject or Sample. Sample information is also stored in this table, if applicable, and drives the annotation process when molecular data is included in the ingestion.
Field Name
Type
Description
SAMPLE_BARCODE
STRING
Sample Identifier
SUBJECTID
STRING
Identifier for Subject entity
STUDY
STRING
Study designation
AGE
NUMERIC
Age in years
SEX
STRING
Sex field to drive annotation
POPULATION
STRING
Population Designation for 1000 Genomes Project
SUPERPOPULATION
STRING
Superpopulation Designation from 1000 Genomes Project
RACE
STRING
Race according to NIH standard
CONDITION_ONTOLOGIES
VARIANT
Diagnosis Ontology Source
CONDITION_IDS
VARIANT
Diagnosis Concept Ids
CONDITIONS
VARIANT
Diagnosis Names
HARMONIZED_CONDITIONS
VARIANT
Diagnosis High-level concept to drive UI
LIBRARYTYPE
STRING
Sequencing technology
ANALYTE
STRING
Substance sequenced
TISSUE
STRING
Tissue source
TUMOR_OR_NORMAL
STRING
Tumor designation for somatic
GENOMEBUILD
STRING
Genome Build to drive annotations - hg38 only
SAMPLE_BARCODE_VCF
STRING
Sample ID from VCF
AFFECTED_STATUS
NUMERIC
Affected, Unaffected, or Unknown for Family Based Analysis
FAMILY_RELATIONSHIP
STRING
Relationship designation for Family Based Analysis
This table will be available for all projects with ingested molecular data
Field Name
Type
Description
SAMPLE_BARCODE
STRING
Original sample barcode used in VCF column
STUDY
STRING
Study designation
GENOMEBUILD
STRING
Only hg38 is supported
CHROMOSOME
STRING
Chromosome without 'chr' prefix
CHROMOSOMEID
NUMERIC
Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt
DBSNP
STRING
dbSNP Identifiers
VARIANT_KEY
STRING
Variant ID in the form "1:12345678:12345678:C"
NIRVANA_VID
STRING
Broad Institute VID: "1-12345678-A-C"
VARIANT_TYPE
STRING
Description of Variant Type (e.g. SNV, Deletion, Insertion)
VARIANT_CALL
NUMERIC
1=germline, 2=somatic
DENOVO
BOOLEAN
true / false
GENOTYPE
STRING
"G|T"
READ_DEPTH
NUMERIC
Sequencing read depth
ALLELE_COUNT
NUMERIC
Counts of each alternate allele for each site across all samples
ALLELE_DEPTH
STRING
Unfiltered count of reads that support a given allele for an individual sample
FILTERS
STRING
Filter field from VCF. If all filters pass, field is PASS
ZYGOSITY
NUMERIC
0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt
GENEMODEL
NUMERIC
1=Ensembl, 2=RefSeq
GENE_HGNC
STRING
HUGO/HGNC gene symbol
GENE_ID
STRING
Ensembl gene ID ("ENSG00001234")
GID
NUMERIC
NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID
TRANSCRIPT_ID
STRING
Ensembl ENST or RefSeq NM_
CANONICAL
STRING
Transcript designated 'canonical' by source
CONSEQUENCE
STRING
missense, stop gained, intronic, etc.
HGVSC
STRING
The HGVS coding sequence name
HGVSP
STRING
The HGVS protein sequence name
This table will only be available for data sets with ingested Somatic molecular data.
Field Name
Type
Description
SAMPLE_BARCODE
STRING
Original sample barcode, used in VCF column
SUBJECTID
STRING
Identifier for Subject entity
STUDY
STRING
Study designation
GENOMEBUILD
STRING
Only hg38 is supported
CHROMOSOME
STRING
Chromosome without 'chr' prefix
DBSNP
NUMERIC
dbSNP Identifiers
VARIANT_KEY
STRING
Variant ID in the form "1:12345678:12345678:C"
MUTATION_TYPE
NUMERIC
Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant
VARIANT_CALL
NUMERIC
1=germline, 2=somatic
GENOTYPE
STRING
"G|T"
REF_ALLELE
STRING
Reference allele
ALLELE1
STRING
First allele call in the tumor sample
ALLELE2
STRING
Second allele call in the tumor sample
GENEMODEL
NUMERIC
1=Ensembl, 2=RefSeq
GENE_HGNC
STRING
HUGO/HGNC gene symbol
GENE_ID
STRING
Ensembl gene ID ("ENSG00001234")
TRANSCRIPT_ID
STRING
Ensembl ENST or RefSeq NM_
CANONICAL
BOOLEAN
Transcript designated 'canonical' by source
CONSEQUENCE
STRING
missense, stop gained, intronic, etc.
HGVSP
STRING
HGVS nomenclature for AA change: p.Pro72Ala
This table will only be available for data sets with ingested CNV molecular data.
Field Name
Type
Description
SAMPLE_BARCODE
STRING
Sample barcode used in the original VCF
GENOMEBUILD
STRING
Genome build, always 'hg38'
NIRVANA_VID
STRING
Variant ID of the form 'chr-pos-ref-alt'
CHRID
STRING
Chromosome without 'chr' prefix
CID
NUMERIC
Numerical representation of the chromosome, X=23, Y=24, Mt=25
GENE_ID
STRING
NCBI or Ensembl gene identifier
GID
NUMERIC
Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix
START_POS
NUMERIC
First affected position on the chromosome
STOP_POS
NUMERIC
Last affected position on the chromosome
VARIANT_TYPE
NUMERIC
1 = copy number gain, -1 = copy number loss
COPY_NUMBER
NUMERIC
Observed copy number
COPY_NUMBER_CHANGE
NUMERIC
Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline
SEGMENT_VALUE
NUMERIC
Average FC for the identified chromosomal segment
PROBE_COUNT
NUMERIC
Probes confirming the CNV (arrays only)
REFERENCE
NUMERIC
Baseline taken from normal samples (1) or averaged disease tissue (2)
GENE_HGNC
STRING
HUGO/HGNC gene symbol
This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.
Field Name
Type
Description
SAMPLE_BARCODE
STRING
Sample barcode used in the original VCF
GENOMEBUILD
STRING
Genome build, always 'hg38'
NIRVANA_VID
STRING
Variant ID of the form 'chr-pos-ref-alt'
CHRID
STRING
Chromosome without 'chr' prefix
CID
NUMERIC
Numerical representation of the chromosome, X=23, Y=24, Mt=25
BEGIN
NUMERIC
First affected position on the chromosome
END
NUMERIC
Last affected position on the chromosome
BAND
STRING
Chromosomal band
QUALITY
NUMERIC
Quality from the original VCF
FILTERS
ARRAY
Filters from the original VCF
VARIANT_TYPE
STRING
Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2")
VARIANT_TYPE_ID
NUMERIC
21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2")
CIPOS
ARRAY
Confidence interval around first position
CIEND
ARRAY
Confidence interval around last position
SVLENGTH
NUMERIC
Overall size of the structural variant
BONDCHR
STRING
For translocations, the other affected chromosome
BONDCID
NUMERIC
For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25
BONDPOS
STRING
For translocations, positions on the other affected chromosome
BONDORDER
NUMERIC
3 or 5: Whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' end of the other chromosome fragment
GENOTYPE
STRING
Called genotype from the VCF
GENOTYPE_QUALITY
NUMERIC
Genotype call quality
READCOUNTSSPLIT
ARRAY
Read counts
READCOUNTSPAIRED
ARRAY
Read counts, paired end
REGULATORYREGIONID
STRING
Ensembl ID for the affected regulatory region
REGULATORYREGIONTYPE
STRING
Type of the regulatory region
CONSEQUENCE
ARRAY
Variant consequence according to SequenceOntology
TRANSCRIPTID
STRING
Ensembl or RefSeq transcript identifier
TRANSCRIPTBIOTYPE
STRING
Biotype of the transcript
INTRONS
STRING
Count of impacted introns out of the total number of introns, specified as "M/N"
GENEID
STRING
Ensembl or RefSeq gene identifier
GENEHGNC
STRING
HUGO/HGNC gene symbol
ISCANONICAL
BOOLEAN
Is the transcript ID the canonical one according to Ensembl?
PROTEINID
STRING
RefSeq or Ensembl protein ID
SOURCEID
NUMERICAL
Gene model: 1=Ensembl, 2=RefSeq
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for gene quantification results:
Field Name
Type
Description
GENOMEBUILD
STRING
Genome build, always 'hg38'
STUDY_NAME
STRING
Study designation
SAMPLE_BARCODE
STRING
Sample barcode used in the original VCF
LABEL
STRING
Group label specified during import: Case or Control, Tumor or Normal, etc.
GENE_ID
STRING
Ensembl or RefSeq gene identifier
GID
NUMERIC
Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix
GENE_HGNC
STRING
HUGO/HGNC gene symbol
SOURCE
STRING
Gene model: 1=Ensembl, 2=RefSeq
TPM
NUMERICAL
Transcripts per million
LENGTH
NUMERICAL
The length of the gene in base pairs.
EFFECTIVE_LENGTH
NUMERICAL
The length as accessible to RNA-seq, accounting for insert-size and edge effects.
NUM_READS
NUMERICAL
The estimated number of reads from the gene. The values are not normalized.
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for differential gene expression results:
Field Name
Type
Description
GENOMEBUILD
STRING
Genome build, always 'hg38'
STUDY_NAME
STRING
Study designation
SAMPLE_BARCODE
STRING
Sample barcode used in the original VCF
CASE_LABEL
STRING
Study designation
GENE_ID
STRING
Ensembl or RefSeq gene identifier
GID
NUMERIC
Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix
GENE_HGNC
STRING
HUGO/HGNC gene symbol
SOURCE
STRING
Gene model: 1=Ensembl, 2=RefSeq
BASEMEAN
NUMERICAL
FC
NUMERICAL
Fold-change
LFC
NUMERICAL
Log of the fold-change
LFCSE
NUMERICAL
Standard error for log fold-change
PVALUE
NUMERICAL
P-value
CONTROL_SAMPLECOUNT
NUMERICAL
Number of samples used as control
CONTROL_LABEL
NUMERICAL
Label used for controls
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
The project details page contains the properties of the project, such as the location, owner, storage and linked bundles. This is also the place where you add assets in the form of linked bundles.
The project details are configured during project creation and may be updated by the project owner, entities with the project Administrator role, and tenant administrators.
Click the Edit button at the top of the Details page.
Click the + button, under LINKED BUNDLES.
Click on the desired bundle, then click the Link button.
Click Save.
If your linked bundle contained a pipeline, then it will appear in Projects > your_project > Flow > Pipelines.
Name
Name of the project unique within the tenant. Alphanumerics, underscores, dashes, and spaces are permitted.
Short Description
Short description of the project
Project Owner
Owner of the project (has Administrator access to the project)
Storage Configuration
Storage configuration to use for data stored in the project
User Tags
User tags on the project
Technical Tags
Technical tags on the project
Metadata Model
Metadata model assigned to the project
Project Location
Project region where data is stored and pipelines are executed. Options are derived from the Entitlement(s) assigned to user account, based on the purchased subscription
Storage Bundle
Storage bundle assigned to the project. Derived from the selected Project Location based on the Entitlement in the purchased subscription
Billing Mode
Billing mode assigned to the project
Data sharing
Enables data and samples in the project to be linked to other projects
A project's billing mode determines the strategy for how costs are charged to billable accounts.
Project
All incurred costs will be charged to the tenant of the project owner
Tenant
Incurred costs will be charged to the tenant of the user owning the project resource (i.e., data, analysis). The only exceptions are base tables and queries, as well as bench compute and storage costs, which are always billed to the project owner.
For example, with billing mode set to Tenant, if tenant A has created a project resource and uses it in their project, then tenant A will pay for the resource data, compute costs and storage costs of any output they generate within the project. When they share the project with tenant B, then tenant B will pay the compute and storage for the data which they generate in that project. Put simply, in billing mode tenant, the person who generates data pays for the processing and storage of that data, regardless of who owns the actual project.
If the project billing mode is updated after the project has been created, the updated billing mode will only be applied to resources generated after the change.
If you are using your own S3 storage, then the billing mode impacts where collaborator data is stored.
Project billing will result in using your S3 storage for the data.
Tenant billing will result in collaborator data being stored in Illumina-managed storage instead of your own S3 storage.
Tenant billing, when your collaborators also have their own S3 storage and have it set as default, will result in their data being stored in their S3 storage.
Use the Create OAuth access token
button to generate an OAuth access token which is valid for 12 hours after generation. This token can be used by Snowflake and Tableau to access the data in your Base databases and tables for this Project.
See SnowSQL for more information.
The platform provides Connectors to facilitate automation for operations on data (ie, upload, download, linking).
The ICA CLI is a useful tool for uploading, downloading and viewing information about data stored within ICA projects. If not already authenticated, please see the Authentication section of the CLI help pages. Once the CLI has been authenticated with your account, use the command below to list all projects:
icav2 projects list
The first column of the output (table format, which is default) will show the ID
. This is the project ID and will be used in the examples below.
To upload a file called Sample-1_S1_L001_R1_001.fastq.gz
to the project, copy the project id and use the command syntax below:
icav2 projectdata upload Sample-1_S1_L001_R1_001.fastq.gz --project-id <project-id>
To verify the file has uploaded, run the following to get a list of all files stored within the specified project:
icav2 projectdata list --project-id <project-id>
This will show a file ID starting with fil.
which can then be used to get more information about the file and its attributes:
icav2 projectdata get <file-id> --project-id <project-id>
It is necessary to use --project-id
in the above example if not entered into a specific project context. In order to enter a project context use the command below.
icav2 projects enter <project-name or project-id>
This will infer the project id, so that it does not need to be entered into each command.
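For example (the project name is hypothetical), after entering a project you can omit the --project-id flag:

icav2 projects enter my-project
icav2 projectdata list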
Note: filenames beginning with / are not allowed, so be careful when entering full path names, as those will result in the file being stored on S3 but not being visible in ICA. Likewise, folders containing a / in their individual folder name and folders named '.' are not supported.
The ICA CLI can also be used to download files via command line. This can be especially helpful if the download destination is a remote server or HPC cluster that you are logged into from a local machine. To download into the current directory, run the following from the command line terminal:
icav2 projectdata download <file-id> ./
The above assumes you have entered into a project context. If this is not the case, either enter the project that contains the desired data, or be sure to supply the --project-id
option in the command.
To fetch temporary AWS credentials for given project data, use the command icav2 projectdata temporarycredentials [path or data Id] [flags]
. If the path is provided, the project id from the flag --project-id is used. If the --project-id flag is not present, then the project id is taken from the context. The returned AWS credentials for file or folder upload expire after 36 hours.
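As an illustration (the folder path is hypothetical), the following fetches temporary credentials for a folder in a specific project:

icav2 projectdata temporarycredentials /fastq/ --project-id <project-id>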
For information on options such as using the ICA API and AWS CLI to transfer data, visit the Data Transfer Options tutorial.
Flow provides tooling for building and running secondary analysis pipelines. The platform supports analysis workflows constructed using Common Workflow Language (CWL) and Nextflow. Each step of an analysis pipeline executes a containerized application using inputs passed into the pipeline or output from previous steps.
You can configure the following components in Illumina Connected Analytics Flow:
Reference Data — Reference Data for Graphical CWL flows. See Reference Data.
Pipelines — One or more tools configured to process input data and generate output files. See Pipelines.
Analyses — Launched instance of a pipeline with selected input data. See Analyses.
The CLI supports outputs in table, JSON, and YAML formats. The format is set using the output-format
configuration setting through a command line option, environment variable, or configuration file.
Dates are output as UTC times when using JSON/YAML output format and local times when using table format.
To set the output format, use the following setting:
--output-format <string>
json
- Outputs in JSON format
yaml
- Outputs in YAML format
table
- Outputs in tabular format
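For example, the listing command used earlier can be rendered as JSON instead of the default table output:

icav2 projects list --output-format json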
The platform GUI provides the Project Connector utility which allows data to be linked automatically between projects. This creates a one-way dynamic link for files and samples from source to destination, meaning that additions and deletions of data in the source project also affect the destination project. This differs from copying or moving which create editable copies of the data. In the destination project, you can delete data which has been moved or copied and unlink data which has been linked.
move
x
x
x
x
x
copy
x
x
x
x
manual link
x
x
x
project connector
x
x
x
Select the source project (project that will own the data to be linked) from the Projects page (Projects > your_source_project).
Select Project Settings > Details.
Select Edit
Under Data Sharing ensure the value is set to Yes
Select Save
Select the destination project (the project to which data from the source project will be linked) from the Projects page (Projects > your_destination_project).
From the projects menu, select Project Settings > Connectivity > Project Connector
Select + Create and complete the necessary fields.
Check the box next to Active to ensure the connector will be active.
Name (required) — Provide a unique name for the connector.
Type (required) — Select the data type that will be linked (either File or Sample)
Source Project - Select the source project whose data will be linked.
Filter Expression (optional) — Enter an expression to restrict which files will be linked via the connector (see Filter Expression Examples below)
Tags (optional) — Add tags to restrict what data will be linked via the connector. Any data in the source project with matching tags will be linked to the destination project.
The examples below will link Files based on the Format field.
Only Files with Format of FASTQ will be linked:
[?($.details.format.code == 'FASTQ')]
Only Files with Format of VCF will be linked:
[?($.details.format.code == 'VCF')]
The examples below will restrict linked Files based on filenames.
Exact match to 'Sample-1_S1_L001_R1_001.fastq.gz':
[?($.details.name == 'Sample-1_S1_L001_R1_001.fastq.gz')]
Ends with '.fastq.gz':
[?($.details.name =~ /.*\.fastq.gz/)]
Starts with 'Sample-':
[?($.details.name =~ /Sample-.*/)]
Contains '_R1_':
[?($.details.name =~ /.*_R1_.*/)]
The examples below will link Samples based on User Tags and Sample name, respectively.
Only Samples with the User Tag 'WGS-Project-1'
[?('WGS-Project-1' in $.tags.userTags)]
Only Samples with the name 'BSSH_Sample_1':
[?($.name == 'BSSH_Sample_1')]
Notifications (Projects > your_project > Project Settings > Notifications ) are events to which you can subscribe. When they are triggered, they deliver a message to an external target system such as emails, Amazon SQS or SNS systems or HTTP post requests. The following table describes available system events to subscribe to:
Analysis failure
ICA_EXEC_001
Emitted when an analysis fails
Analysis
Analysis success
ICA_EXEC_002
Emitted when an analysis succeeds
Analysis
Analysis aborted
ICA_EXEC_027
Emitted when an analysis is aborted either by the system or the user
Analysis
Analysis status change
ICA_EXEC_028
Emitted when a state transition on an analysis occurs
Analysis
Base Job failure
ICA_BASE_001
Emitted when a Base job fails
BaseJob
Base Job success
ICA_BASE_002
Emitted when a Base job succeeds
BaseJob
Data transfer success
ICA_DATA_002
Emitted when a data transfer is marked as Succeeded
DataTransfer
Data transfer stalled
ICA_DATA_025
Emitted when data transfer hasn't progressed in the past 2 minutes
DataTransfer
Data <action>
ICA_DATA_100
Subscribing to this serves as a wildcard for all project data status changes and covers those changes that have no separate code. This does not include DataTransfer events or changes that trigger no data status changes such as adding tags to data.
ProjectData
Data linked to project
ICA_DATA_104
Emitted when a file is linked to a project
ProjectData
Data can not be created in non-indexed folder
ICA_DATA_105
Emitted when attempting to create data in a non-indexed folder
ProjectData
Data deleted
ICA_DATA_106
Emitted when data is deleted
ProjectData
Data created
ICA_DATA_107
Emitted when data is created
ProjectData
Data uploaded
ICA_DATA_108
Emitted when data is uploaded
ProjectData
Data updated
ICA_DATA_109
Emitted when data is updated
ProjectData
Data archived
ICA_DATA_110
Emitted when data is archived
ProjectData
Data unarchived
ICA_DATA_114
Emitted when data is unarchived
ProjectData
Job status changed
ICA_JOB_001
Emitted when a job changes status (INITIALIZED, WAITING_FOR_RESOURCES, RUNNING, STOPPED, SUCCEEDED, PARTIALLY_SUCCEEDED, FAILED)
JobId
Sample completed
ICA_SMP_002
Emitted when a sample is marked as completed
ProjectSample
Sample linked to a project
ICA_SMP_003
Emitted when a sample is linked to a project
ProjectSample
Workflow session start
ICA_WFS_001
Emitted when workflow is started
WorkflowSession
Workflow session failure
ICA_WFS_002
Emitted when workflow fails
WorkflowSession
Workflow session success
ICA_WFS_003
Emitted when workflow succeeds
WorkflowSession
Workflow session aborted
ICA_WFS_004
Emitted when workflow is aborted
WorkflowSession
When you subscribe to overlapping event codes such as ICA_EXEC_002 (analysis success) and ICA_EXEC_028 (analysis status change) you will get both notifications when analysis success occurs.
When integrating with external systems, it is advised not to rely solely on ICA notifications, but to also add a polling system to check the status of long-running tasks, for example verifying the status of long-running (>24h) analyses at a 12-hour interval.
Event notifications can be delivered to the following delivery targets:
E-mail delivery
E-mail Address
Sqs
AWS SQS Queue
AWS SQS Queue URL
Sns
AWS SNS Topic
AWS SNS Topic ARN
Http
Webhook (POST request)
URL
In order to allow the platform to deliver events to Amazon SQS or SNS delivery targets, a cross-account policy needs to be added to the target Amazon service.
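A minimal sketch of such a cross-account policy (illustrative only; the placeholders are the variables described below):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<platform_aws_account>:root" },
      "Action": "<action>",
      "Resource": "<arn>"
    }
  ]
}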
Substitute the variables in the example above according to the table below.
platform_aws_account
The platform AWS account ID: 079623148045
action
For SNS use SNS:Publish
. For SQS, use SQS:SendMessage
arn
The Amazon Resource Name (ARN) of the target SNS topic or SQS queue
See examples for setting policies in Amazon SQS and Amazon SNS
To create a subscription to deliver events to an Amazon SNS topic, one can use either GUI or API endpoints.
To create a subscription via GUI, select Projects > your_project > Project Settings > Notifications > +Create > ICA event. Select an event from the dropdown menu, insert optional filter, select the channel type (SNS), and then insert the ARN from the target SNS topic and the AWS region.
To create a subscription via API, use the endpoint /api/notificationChannel to create a channel and then /api/projects/{projectId}/notificationSubscriptions to create a notification subscription.
To create a subscription to deliver events to an Amazon SQS queue, you can use either GUI or API endpoints.
To create a subscription via the GUI, select Projects > your_project > Project Settings > Notifications > +Create > ICA event. Next, select an event from the dropdown menu, choose SQS as the way to receive the notifications, enter your SQS URL, and if applicable for that event, choose a payload version. Not all payload versions are applicable for all events and targets, so the system will filter the options out for you. Finally, you can enter a filter expression to filter which events are relevant for you. Only those events matching the expression will be received.
To create a subscription via API, use the endpoint /api/notificationChannel to create a channel and then /api/projects/{projectId}/notificationSubscriptions to create a notification subscription.
Messages delivered to AWS SQS contain the following event body attributes:
correlationId
GUID used to identify the event
timestamp
Date when the event was sent
eventCode
Event code of the event
description
Description of the event
payload
Event payload
The following example is a Data Updated event payload sent to an AWS SQS delivery target (condensed for readability):
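A condensed, hypothetical sketch based on the event body attributes listed above; the payload fields shown are illustrative, as the actual payload follows the full ProjectData model:

{
  "correlationId": "11111111-2222-3333-4444-555555555555",
  "timestamp": "2024-01-01T12:00:00Z",
  "eventCode": "ICA_DATA_109",
  "description": "Data updated",
  "payload": {
    "id": "fil.0123456789abcdef",
    "details": {
      "name": "Sample-1_S1_L001_R1_001.fastq.gz",
      "owningProjectName": "my_project_name"
    }
  }
}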
Notification subscriptions will trigger for all events matching the configured event type. A filter may be configured on a subscription to limit the matching strategy to only those event payloads which match the filter.
The filter expressions leverage the JsonPath library for describing the matching pattern to be applied to event payloads. The filter must be in the format [?(<expression>)]
.
The Analysis Success
event delivers a JSON event payload matching the Analysis
data model (as output from the API to retrieve a project analysis).
The below examples demonstrate various filters operating on the above event payload:
Filter on a pipeline, with a code that starts with ‘Copy’. You’ll need a regex expression for this:
[?($.pipeline.code =~ /Copy.*/)]
Filter on status (note that the Analysis success
event is only emitted when the analysis is successful):
[?($.status == 'SUCCEEDED')]
Both payload versions V3 and V4 guarantee the presence of the final state (SUCCEEDED, FAILED, FAILED_FINAL, ABORTED), but the intermediate states depend on the flow (so not every intermediate state is guaranteed):
V3 can have REQUESTED - IN_PROGRESS - SUCCEEDED
V4 can have the status REQUESTED - QUEUED - INITIALIZING - PREPARING_INPUTS - IN_PROGRESS - GENERATING_OUTPUTS - SUCCEEDED
Filter on a pipeline having the technical tag "Demo":
[?('Demo' in $.pipeline.pipelineTags.technicalTags)]
Combination of multiple expressions using &&
. It's best practice to surround each individual expression with parentheses:
[?(($.pipeline.code =~ /Copy.*/) && $.status == 'SUCCEEDED')]
Examples for other events
Filtering ICA_DATA_104 on owning project name. The top level keys on which you can filter are under the payload key, so payload is not included in this filter expression.
[?($.details.owningProjectName == 'my_project_name')]
Custom events enable triggering notification subscriptions using event types beyond the system-defined event types. When creating a custom subscription, a custom event code may be specified to use within the project. Events may then be sent to the specified event code using a POST API with the request body specifying the event payload.
Custom events can be defined using the API. In order to create a custom event for your project please follow the steps below:
Create a new custom event POST {ICA_URL}/ica/rest/api/projects/{projectId}/customEvents
a. Your custom event code must be 1-20 characters long, e.g. 'ICA_CUSTOM_123'.
b. That event code will be used to reference that custom event type.
Create a new notification channel POST {ICA_URL}/ica/rest/api/notificationChannels
a. If there is already a notification channel created with the desired configuration within the same project, it is also possible to get the existing channel ID using the call GET {ICA_URL}/ica/rest/api/notificationChannels
.
Create a notification subscription POST {ICA_URL}/ica/rest/api/projects/{projectId}/customNotificationSubscriptions
.
a. Use the event code created in step 1.
b. Use the channel ID from step 2.
To create a subscription via the GUI, select Projects > your_project > Project Settings > Notifications > +Create > Custom event.
Once the steps above have been completed successfully, the call from the first step POST {ICA_URL}/ica/rest/api/projects/{projectId}/customEvents
could be reused with the same event code to continue sending events through the same channel and subscription.
Following is a sample Python function used inside an ICA pipeline to post custom events for each failed metric:
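The sketch below illustrates the idea only, assuming the endpoint from step 1, an API key passed via the X-API-Key header, and a simple list of metric dictionaries; the request body field names and the metric structure are assumptions, not the exact original function:

import requests

def post_failed_metric_events(ica_url, project_id, api_key, event_code, failed_metrics):
    # Post one custom event per failed metric to the project's custom events endpoint.
    # ica_url        - base URL of your ICA instance (the {ICA_URL} placeholder above)
    # project_id     - the ICA project the events are sent to
    # api_key        - an Illumina API Key (sent here via the X-API-Key header; assumption)
    # event_code     - the custom event code created in step 1, e.g. 'ICA_CUSTOM_123'
    # failed_metrics - list of dicts describing each failed metric (illustrative structure)
    url = f"{ica_url}/ica/rest/api/projects/{project_id}/customEvents"
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    for metric in failed_metrics:
        body = {
            # 'code' and 'payload' are assumed request body fields matching the
            # event code / payload model described above
            "code": event_code,
            "payload": {
                "metricName": metric.get("name"),
                "value": metric.get("value"),
                "threshold": metric.get("threshold"),
                "status": "FAILED",
            },
        }
        response = requests.post(url, json=body, headers=headers)
        response.raise_for_status()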
Base is a genomics data aggregation and knowledge management solution suite. It is a secure and scalable integrated genomics data analysis solution which provides Information management and knowledge mining. Users are able to analyze, aggregate and query data for new insights that can inform and improve diagnostic assay development, clinical trials, patient testing and patient care. For this, all clinically relevant data generated from routine clinical testing needs to be extracted and clinical questions need to be asked across all data and information sources. As a large data store, Base provides a secure and compliant environment to accumulate data, allowing for efficient exploration of the aggregated data. This data consists of test results, patient data, metadata, reference data, consent and QC data.
Base can be used by different user personas supporting different use cases:
Clinical and Academic Researchers:
Big data storage solution housing all aggregated sample test outcomes
Analyze information by way of a convenient query formalism
Look for signals in combined phenotypic and genotypic data
Analyze QC patterns over large cohorts of patients
Securely share (sub)sets of data with other scientists
Generate reports and analyze trends in a straightforward and simple manner.
Bioinformaticians:
Access, consult, audit, and query all relevant data and QC information for tests run
All accumulated data and accessible pipelines can be used to investigate and improve bioinformatics for clinical analysis
Metadata is captured via automatic pipeline version tracking: information on the individual tools and/or reference files used during processing for each sample analyzed, the duration of the pipeline, the execution path of the different analytical steps, and, in case of failure, the exit codes can all be warehoused.
Product Developers and Service Providers:
Better understand the efficiency of kits and tests
Analyze usage, understand QC data trends, improve products
Store and aggregate business intelligence data such as lab identification, consumption patterns and frequency, as well as allow renderings of test result outcome trends and much more.
Data Warehouse Creation: Build a relational database for your Project in which desired data sets can be selected and aggregated. Typical data sets include pipeline output metrics and other suitable data files generated by the ICA platform which can be complemented by additional public (or privately built) databases.
Report and Export: Once created, a data warehouse can be mined using standard database query instructions. All Base data is stored in a structured and easily accessible way. An interface allows for the selection of specific datasets and conditional reporting. All queries can be stored, shared, and re-used in the future. This type of standard functionality supports most expected basic mining operations, such as variant frequency aggregation. All result sets can be downloaded or exported in various standard data formats for integration in other reporting or analytical applications.
Detect Signals and Patterns: extensive and detailed selection of subsets of patients or samples adhering to any imaginable set of conditions is possible. Users can, for example, group and list subjects based on a combination of (several) specific genetic variants in combination with patient characteristics such as therapeutic (outcome) information. The built-in integration with public datasets allows users to retrieve all relevant publications, or clinically significant information for a single individual or a group of samples with a specific variant. Virtually any possible combination of stored sample and patient information allow for detecting signals and patterns by a simple single query on the big data set.
Profile/Cluster patients: use and re-analyze patient cohort information based on specific sample or individual characteristics. For instance, users might want to run the next agile iteration of a clinical trial with only the patients that respond. Through integrated and structured consent information allowing for time-boxed use, combined with the capability to group subjects with a simple query, patients can be stratified and combined to export all relevant individuals with their genotypic and phenotypic information for further research.
Share your data: Data sharing is subject to strict ethical and regulatory requirements. Base provides built-in functionality to securely share (sub)sets of your aggregated data with third parties. All data access can be monitored and audited, in this way Base data can be shared with people in and outside of an organization in a compliant and controlled fashion.
Base is a module that can be found in a project. It is shown in the menu bar of the project.
To access Base:
On the domain level, Base needs to be included in the subscription
On the project level, the project owner needs to enable Base
On the user level, the project administrator needs to enable workgroups to access the Base pages
Access to activate the Base module is controlled by the subscription chosen when registering the account (full and premium subscriptions give access to Base). This all happens automatically after the first user logs into the system for that account, so from the moment the account is up and running, the Base module is ready to be enabled.
When a user has created a project, they can go to the Base pages and click the Enable button. From that moment on, every user who has the proper permissions has access to the Base module in that project.
Only the project owner can enable Illumina Connected Analytics Base. Make sure that your subscription for the domain includes Base.
Navigate to Projects > your_project > Base > Tables / Query / Schedule.
Select Enable
Access to the projects and all modules located within the project is provided via the Team page within the project.
Authenticate using icav2 config set
command. The CLI will prompt for an x-api-key
value. Input the API Key generated from the product dashboard here. See the example below (replace EXAMPLE_API_KEY
with the actual API Key).
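An illustrative session is shown below; other settings prompted by the command are elided, and the exact prompt wording may differ between CLI versions:

icav2 config set
...
x-api-key : EXAMPLE_API_KEY
...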
The CLI will save the API Key to the config file as an encrypted value.
If you want to overwrite existing environment values, use the command icav2 config set
.
To remove an existing configuration/session file, use the command icav2 config reset
.
Check the server and confirm you are authenticated using icav2 config get
If during these steps or in the future you need to reset the authentication, you can do so using the command: icav2 config reset
ICA provides a Service Connector, which is a small program that runs on your local machine to sync data between the platform's cloud-hosted data store and your local computer or server. The Service Connector securely uploads data or downloads results using TLS 1.2. In order to do this, the Connector makes 2 connections:
A control connection, which the Connector uses to get configuration information from the platform, and to update the platform about its activities
A connection towards the storage node, used to transfer the actual data between your local or network storage and your cloud-based ICA storage.
This Connector runs in the background, and configuration is done in the Illumina Connected Analytics (ICA) platform, where you can add upload and download rules to meet the requirements of the current project and any new projects you may create.
The Service Connector looks at any new files and checks their size. As long as the file size is changing, it knows data is still being added to the file and it is not ready for transfer. Only when the file size is stable and does not change anymore will it consider the file to be complete and initiate transfer. Despite this, it is still best practice to not connect the Service Connector to active folders which are used as streaming output for other processes as this can result in incomplete files being transferred when the active processes have extended compute periods in which the file size remains unchanged.
The service connector will handle integrity checking during file transfer, which requires the calculation of hashes on the data. In addition, transmission speed depends on the available data transfer bandwidth and connection stability. For these reasons, uploading large amounts of data can take considerable time. This can in turn result in temporarily seeing empty folders at the destination location, since these are created at the beginning of the transfer process.
Select Projects > your_project > Project Settings > Connectivity > Service Connectors.
Select + Create.
Fill out the fields in the New Connector configuration page.
Name - Enter the name of the connector.
Status - This is automatically updated with the actual status, you do not need to enter anything here.
Debug Information Accessible by Illumina (optional) - Illumina support can request connector debugging information to help diagnose issues. For security reasons, support can only collect this data if the option Debug Information Accessible by Illumina is active. You can choose to either proactively enable this when encountering issues to speed up diagnosis or to only activate it when support requests access. You can at any time revoke access again by deselecting the option.
Description (optional) - Enter any additional information you want to show for this connector.
Mode (required) - Specify if the connector can upload data, download data, both or neither.
Operating system (required) - Select your server or computer operating system.
Select Save and download the connector (top right). An initialization key will be displayed in the platform now. Copy this value as it will be needed during installation.
Launch the installer after the download completes and follow the on-screen prompts to complete the installation, including entering the initialization key copied in the previous step. Do not install the connector in the upload folder as this will result in the connector attempting to upload itself and the associated log files.
Run the downloaded .exe file. During the installation, the installer will ask for the initialization key. Fill out the initialization key you see in the platform.
The installer will create an Illumina Service Connector, register it as a Windows service, and start the service. That means, if you wait for about 60 seconds and then refresh the screen in the Platform using the refresh button in the top right corner of the page, the connector should display as connected.
You can only install 1 connector on Windows. If for some reason, you need to install a new one, first uninstall the old one. You only need to do this when there is a problem with your existing connector. Upgrading a connector is also possible. To do this, you don’t need to uninstall the old one first.
Double click the downloaded .dmg file. Double click Illumina Service Connector in the window that opens to start the installer. Run through the installer, and fill out the initialization key when asked for it.
To start the connector once installed or after a reboot, open the app. You can find the app on the location where you installed it. The connector icon will appear in your dock when the app is running.
In the platform on the Connectivity page, you can check whether your local connector has been connected with the platform. This can take 60 seconds after you started your connector locally, and you may need to refresh the Connectivity page using the refresh button in the top right corner of the page to see the latest status of your connector.
The connector app needs to be closed to shut down your computer. You can do this from within your dock.
Installations require Java 11 or later. You can check this with 'java -version' from a command line terminal. With Java installed, you can run the installer from the command line using the command bash illumina_unix_develop.sh
.
Depending on whether you have an X server running or not, it will display a UI or follow a command line installation procedure. You can force a command line installation by adding a -c flag: bash illumina_unix_develop.sh -c
.
The connector can be started by running ./illuminaserviceconnector start
from the directory in which the connector was installed.
In the upload and download rules, you define different properties when setting up a connector. A connector can be used by multiple projects and a connector can have multiple upload and download rules. Configuration can be changed anytime. Changes to the configuration will be applied approximately 60 seconds after changes are made in ICA if the connector is already connected. If the connector is not already started when configuration changes are made in ICA, it will take about 60 seconds after the connector is started for the configuration changes to be propagated to the connector. The following are the different properties you can configure when setting up a connector. After adding a rule and installing the connector, you can use the Active checkbox to disable rules.
Below is an example of a new connector setup with an Upload Rule to find all files ending with .tar
or .tar.gz
located within the local folder C:\Users\username\data\docker-images
.
An upload rule tells the connector which folder on your local disk it needs to watch for new files to upload. The connector contacts the platform every minute to pick up changes to upload rules. To configure upload rules for different projects, first switch into the desired project and select Connectivity. Choose the connector from the list and select Click to add a new upload rule and define the rule. The project field will be automatically filled with the project you are currently within.
When you schedule downloads in the platform, you can choose which connector needs to download the files. That connector needs some way to know how and where it needs to download your files. That’s what a download rule is for. The connector contacts the platform every minute to pick up changes to download rules. The following are the different download rule settings.
When you set up your connector for the first time, and your sample files are located on a shared drive, it’s best to create a folder on your local disk, put one of the sample files in there, and do the connector setup with that folder. When this works, try to configure the shared drive.
Transfer to and from a shared drive may be quite slow. That means it can take up to 30 minutes after you configured a shared drive before uploads start. This is due to the integrity check the connector does for each file before it starts uploading. The connector can upload from or download to a shared drive, but there are a few conditions:
The drive needs to be mounted locally. X:\illuminaupload
or /Volumes/shareddrive/illuminaupload
will work, \\shareddrive\illuminaupload
or smb://shareddrive/illuminaupload
will not.
The user running the connector must have access to the shared drive without a password being requested.
The user who runs the Illumina Service Connector process on the Linux machine needs to have read, write and execute permissions on the installation folder.
Illumina might release new versions of the Service Connector, with improvements and/or bug fixes. You can easily download a new version of the Connector with the Download button on the Connectivity screen in the platform. After you downloaded the new installer, run it and choose the option ‘Yes, update the existing installation’.
To uninstall the connector, perform one of the following:
Windows and Linux: Run the uninstaller located in the directory the connector was installed.
Mac: Move the Illumina Service Connector to your Trash folder.
The Connector has a log file containing technical information about what’s happening. When something doesn’t work, it often contains clues to why it doesn’t work. Interpreting this log file is not always easy, but it can help the support team to give a fast answer on what is wrong, so it is suggested to attach it to your email when you have upload or download problems. You can find this log file at the following location:
<Installation Directory>\logs\BSC.out
Default: C:\Program Files (x86)\illumina\logs\BSC.out
/<Installation Directory>/Illumina Service Connector.app/Contents/java/app/logs/BSC.out
Default: /Applications/Illumina Service Connector.app/Contents/java/app/logs/BSC.out
/<Installation Directory>/logs/BSC.out
Default: /usr/local/illumina
After the file is downloaded, place the CLI in a folder that is included in your $PATH environment variable list of paths, typically /usr/local/bin. Open the Terminal application, navigate to the directory where the downloaded CLI file is located (usually your Downloads folder), and run the following command to copy the CLI file to the appropriate folder. If you do not have write access to your /usr/local/bin folder, then you may use sudo prior to the cp command. For example:
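A minimal sketch, assuming the downloaded binary is named icav2 and your terminal is in the folder that contains it:
sudo cp icav2 /usr/local/bin/icav2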
If you do not have sudo access on your system, contact your administrator for installation. Alternately, you may place the file in an alternate location and update your $PATH to include this location (see the documentation for your shell to determine how to update this environment variable).
You will also need to make the file executable so that the CLI can run:
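For example (sudo may again be required, depending on your permissions):
sudo chmod +x /usr/local/bin/icav2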
You will likely want to place the CLI in a folder that is included in your $PATH environment variable list of paths. In Windows, you typically want to save your applications in the C:\Program Files folder. If you do not have write access to that folder, then open a CMD window in administrative mode (hold down the SHIFT key as you right-click on the CMD application and select "Run as administrator"). Type in the following commands (assuming you have saved ica.exe in your current directory):
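A minimal sketch of those commands, assuming the executable is named ica.exe as stated above:
mkdir "C:\Program Files\Illumina"
copy ica.exe "C:\Program Files\Illumina"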
Then you make sure that the C:\Program Files\Illumina directory is included in your %path% list of paths. Please do a web search for how to add a path to your %path% system environment variable for your particular version of Windows.
Upload allowed
Download allowed
No upload allowed
No Download allowed
The status and history of Base activities and jobs are shown on the page.
The ICA CLI uses an Illumina API Key to authenticate. An Illumina API Key can be acquired through the product dashboard after logging into a domain. See for instructions to create an Illumina API Key.
Add any upload or download rules. See below.
Download links for the CLI can be found at the .
Name: Name of the upload rule.
Active: Set to true to have this rule be active. This allows you to deactivate rules without deleting them.
Local folder: The folder path on the local machine where files to be uploaded are stored.
File pattern: Files with filenames that match the string/pattern will be uploaded.
Location: The location the data will be uploaded to.
Project: The project the data will be uploaded to.
Description: Additional information about the upload rule.
Assign Format: Select which data format tag the uploaded files will receive. This is used for various things like filtering.
Data owner: The owner of the data after upload.
Name: Name of the download rule.
Active: Set to true to have this rule be active. This allows you to deactivate rules without deleting them.
Order of execution: If using multiple download rules, set the order in which the rules are performed.
Target Local folder: The folder path on the local machine where the files will be downloaded to.
Description: Additional information about the download rule.
Format: The format the files must comply with in order to be scheduled for download.
Project: The projects the rule applies to.
Windows: Service connector doesn't connect
First, try restarting your computer. If that doesn't help, open the Services application (click the Windows icon and type services). There should be a service called Illumina Service Connector.
• If it doesn't have status Running, try starting it (right mouse click -> Start).
• If it has status Running and still does not connect, you might have a corporate proxy. Proxy configuration is currently not supported for the connector.
• If you do not have a corporate proxy and your connector still doesn't connect, contact Illumina Technical Support and include your connector BSC.out log files.
OS X: Service connector doesn't connect
Check whether the Connector is running. If it is, there should be an Illumina icon in your Dock.
• If the icon does not appear, log out and log back in. An Illumina Service Connector icon should appear in your Dock.
• If it still doesn't, try starting the Connector manually from the Launchpad menu.
• If it is running and still does not connect, you might have a corporate proxy. Proxy configuration is currently not supported for the connector.
• If you do not have a corporate proxy and your connector still doesn't connect, contact Illumina Technical Support and include your connector BSC.out log files.
Linux: Service connector doesn't connect
Check whether the connector process is running with:
ps aux
Linux: Can't define Java version for connector
The connector requires Java version 8 or 11. If you run the installer and get the following error: "Please define INSTALL4J_JAVA_HOME to point to a suitable JVM.":
• When downloading the correct Java version from Oracle, there are two variables that can be defined for the script (INSTALL4J_JAVA_HOME_OVERRIDE and INSTALL4J_JAVA_PREFIX), but not INSTALL4J_JAVA_HOME, which is the variable printed in the error message above. Instead, export the override variable in your environment before running the installation script.
• Note that the Java home should not point to the java executable, but to the jre folder. For example:
export INSTALL4J_JAVA_HOME_OVERRIDE=/usr/lib/jvm/java-1.8.0-openjdk-amd64
sh illumina_unix_1_13_2_0_35.sh
Linux: Corrupted installation script
If you get the following error message: "gzip: sfx_archive.tar.gz: not in gzip format. I am sorry, but the installer file seems to be corrupted. If you downloaded that file please try it again. If you transfer that file with ftp please make sure that you are using binary mode.":
• This indicates the installation script file is corrupted. Editing the shell script will corrupt it. Re-download the installation script from ICA.
Linux: Unsupported version error in log file
If the log file contains the error "Unsupported major.minor version 52.0", an unsupported version of Java is present. The connector requires Java version 8 or 11.
Linux: Manage the connector via the CLI
• Connector installation issues:
It may be necessary to first make the connector installation script executable with:
chmod +x illumina_unix_develop.sh
Once it has been made executable, run the installation script with:
bash illumina_unix_develop.sh
It may be necessary to run with sudo depending on user permissions on the system:
sudo bash illumina_unix_develop.sh
If installing on a headless system, use the -c flag to do everything from the command line:
bash illumina_unix_develop.sh -c
• Start the connector with logging directly to the terminal (stdout), in case the log file is not present (likely due to the absence of Java version 8 or 11). From within the installation directory run:
./illuminaserviceconnector run
• Check status of connector. From within the install location run:
./illuminaserviceconnector status
• Stop the connector with:
./illuminaserviceconnector stop
• Restart the connector with:
./illuminaserviceconnector restart
Connector gets connected, but uploads won’t start
Create a new empty folder on your local disk, put a small file in it, and configure this folder as the upload folder.
• If it works and your sample files are on a shared drive, have a look at the Shared Drives section.
• If it works and your sample files are on your local disk, there are a few possibilities: a) There is an error in how the upload folder name is configured in the platform. b) For big files, or on slow disks, the connector needs quite some time to start the transfer because it needs to calculate a hash to make sure there are no transfer errors. Wait up to 30 minutes without changing anything in your Connector configuration.
• If this doesn't work, you might have a corporate proxy. Proxy configuration is currently not supported for the connector.
Upload from shared drive does not work
Follow the guidelines in the Shared Drives section. Inspect the connector BSC.log file for any error messages indicating the directory was not found.
• If there is such a message, there are two options: a) An issue with the folder name, such as special characters and spaces. As a best practice, use only alphanumeric characters, underscores, dashes and periods. b) A permissions issue. In this case, ensure the user running the connector has read and write access to the network share without a password being requested.
• If there are no messages indicating the directory cannot be found, it may be necessary to wait some time until the integrity checks have completed. This check can take quite long on slow disks and slow networks.
Data Transfers are slow
Many factors can affect the speed:
• Distance from the upload location to the storage location
• Quality of the internet connection (hardlines are preferred over WiFi)
• Restrictions on uploads and downloads imposed by the company or the provider
These factors can change every time the customer switches location (e.g. working from home).
The upload or download progress % goes down instead of up.
This is normal behavior. Instead of one continuous transmission, data is split into blocks so that whenever transmission issues occur, not all data has to be retried. This does result in dropping back to a lower % of transmission completed when retrying.
The ICA CLI accepts configuration settings from multiple places, such as environment variables, configuration file, or passed in as command line arguments. When configuration settings are retrieved, the following precedence is used to determine which setting to apply:
Command line options - Passed in with the command such as --access-token
Environment variables - Stored in system environment variables such as ICAV2_ACCESS_TOKEN
Default config file - Stored by default in ~/.icav2/config.yaml on macOS/Linux and C:\Users\USERNAME\.icav2\.config on Windows
The following global flags are available in the CLI interface:
Environment variables provide another way to specify configuration settings. Variable names align with the command line options with the following modifications:
Upper cased
Prefix ICAV2_
All dashes replaced by underscore
For example, the corresponding environment variable name for the --access-token flag is ICAV2_ACCESS_TOKEN.
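For example, on macOS/Linux the --access-token option can be supplied through its environment variable for the current shell session:
export ICAV2_ACCESS_TOKEN="<your_access_token>"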
The environment variable ICAV2_ICA_NO_RETRY_RATE_LIMITING allows you to disable the retry mechanism. When it is set to 1, no retries are performed. For any other value, HTTP code 429 will result in 4 retry attempts:
after 500 milliseconds
after 2 seconds
after 10 seconds
after 30 seconds
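For example, to disable retries for the current shell session:
export ICAV2_ICA_NO_RETRY_RATE_LIMITING=1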
Upon launching icav2 for the first time, the configuration yaml file is created and the default config settings are set. Enter an alternative server URL or press enter to leave it as the default. Then enter your API Key and press enter.
After installing the CLI, open a terminal window and enter the icav2 command. This will initialize a default configuration file in the home directory at .icav2/config.yaml.
To reset the configuration, use ./icav2 config reset
Resetting the configuration removes the configuration from the host device and cannot be undone. The configuration needs to be recreated.
Configuration settings are stored in the default configuration file:
The file ~/.icav2/.session.ica.yaml on macOS/Linux and C:\Users\USERNAME\.icav2\.session.ica on Windows will contain the access-token and project-id. These are output files and should not be edited, as they are automatically updated.
This variable is used to set the API Key.
Command line options - Passed as --x-api-key <your_api_key> or -k <your_api_key>
Environment variables - Stored in system as ICAV2_X_API_KEY
Default config file - Use icav2 config set to update ~/.icav2/config.yaml (macOS/Linux) or C:\Users\USERNAME\.icav2\.config (Windows)
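For example, to make the API Key available for the current shell session on macOS/Linux:
export ICAV2_X_API_KEY="<your_api_key>"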
The build number, together with the used libraries and licenses are provided in the accompanying readme file.
This command generates custom completion functions for the icav2 tool. These functions facilitate the generation of context-aware suggestions based on the user's input and specific directives provided by the icav2 tool. For example, for the Zsh shell the completion function _icav2() is generated. It can provide suggestions for available commands, flags, and arguments depending on the context, making it easier for the user to interact with the tool without having to constantly refer to documentation.
To enable this custom completion function, you would typically include it in your Zsh configuration (e.g., in .zshrc or a separate completion script) and then use the compdef command to associate the function with the icav2 command:
This way, when the user types icav2 followed by a space and presses the TAB key, Zsh will call the _icav2 function to provide context-aware suggestions based on the user's input and the icav2 tool's directives.
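A hedged setup sketch for Zsh, assuming the generator is invoked as icav2 completion zsh (check icav2 --help for the exact subcommand name on your version):
icav2 completion zsh > ~/.icav2_completion.zsh
echo 'source ~/.icav2_completion.zsh' >> ~/.zshrc
echo 'compdef _icav2 icav2' >> ~/.zshrc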
Example 1
Example 2
Here is an example of how to download all BAM files from a project (we are using some jq features to remove '.bam.bai' and '.bam.md5sum' files).
Tip: If you want to look up a file id from the GUI, go to that file and open the details view. The file id can be found on the top left side and will begin with fil.
It is best practice to always surround your path with quotes if you want to use the * wildcard. Otherwise, you may run into situations where the command results in "accepts at most 1 arg(s), received x" as it returns folders with the same name, but different amounts of subfolders.
If you want to look up a file id from the GUI, go to that file and open the details view. The file id can be found on the top left side and will begin with fil.
Example to list files in the folder SOURCE
Example to list only subfolders in the folder SOURCE
Example for uploading multiple files
In this example, all the fastq.gz files from source will be uploaded to target using the xargs utility.
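A hedged sketch of such a command; the icav2 projectdata upload arguments shown here (local file followed by a target folder path) are assumptions and may need adjusting for your project:
find source -name "*.fastq.gz" | xargs -I {} icav2 projectdata upload {} /target/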
Example for uploading multiple files using a CSV file
In this example we upload multiple BAM files specified with their corresponding paths in the file bam_files.csv. The files will be renamed. We are using screen in detached mode (this creates a new session without attaching to it):
icav2 projectpipelines create cwl
icav2 projectpipelines create cwljson
icav2 projectpipelines create nextflow
icav2 projectpipelines create nextflowjson
icav2 projectpipelines start cwl
icav2 projectpipelines start cwljson
Field definition
A field can only have values (--field) and a data field can only have datavalues (--field-data). To create multiple fields or data fields, you have to repeat the flag.
For example
matches
The following example with --field and --field-data
matches
Group definition
A group will only have values (--group) and a data group can only have datavalues (--group-data). Add flags multiple times for multiple groups and fields in the group.
icav2 projectpipelines start nextflow
icav2 projectpipelines start nextflowjson
Field definition
A field can only have values (--field) and a data field can only have datavalues (--field-data). To create multiple fields or data fields, you have to repeat the flag.
For example
matches
The following example with --field and --field-data
matches
Group definition
A group will only have values (--group) and a data group can only have datavalues (--group-data). Add flags multiple times for multiple groups and fields in the group.
In this tutorial, we will create a pipeline which will split a TSV file into chunks, sort them, and merge them together.
Select Projects > your_project > Flow > Pipelines. From the Pipelines view, click the +Create pipeline > Nextflow > XML based button to start creating a Nextflow pipeline.
In the Details tab, add values for the required Code (unique pipeline name) and Description fields. Nextflow Version and Storage size default to preassigned values.
First, we present the individual processes. Select +Nextflow files > + Create file and label the file split.nf. Copy and paste the following definition.
Next, select +Create file and name the file sort.nf. Copy and paste the following definition.
Select +Create file again and label the file merge.nf. Copy and paste the following definition.
Add the corresponding main.nf file by navigating to the Nextflow files > main.nf tab and copying and pasting the following definition.
Here, the operators flatten and collect are used to transform the emitted channels. The flatten operator transforms a channel in such a way that every item of type Collection or Array is flattened so that each single entry is emitted separately by the resulting channel. The collect operator collects all the items emitted by a channel into a List and returns the resulting object as a sole emission.
Finally, copy and paste the following XML configuration into the XML Configuration tab.
Click the Generate button (at the bottom of the text editor) to preview the launch form fields.
Click the Save button to save the changes.
Go to the Pipelines page from the left navigation pane. Select the pipeline you just created and click Start New Analysis.
Fill in the required fields indicated by red "*" sign and click on Start Analysis button. You can monitor the run from the Analyses page. Once the Status changes to Succeeded, you can click on the run to access the results page.
Select Projects > your_project > Flow > Analyses, and open the Logs tab. From the log files, it is clear that in the first step, the input file is split into multiple chunks, then these chunks are sorted and merged.
Find the links to CLI builds in the Releases section below.
Checksums are provided alongside each downloadable CLI binary to verify file integrity. The checksums are generated using the SHA256 algorithm. To use the checksums:
Download the CLI binary for your OS
Download the corresponding checksum using the links in the table
Calculate the SHA256 checksum of the downloaded CLI binary
Diff the calculated SHA256 checksum with the downloaded checksum. If the checksums match, the integrity is confirmed.
There are a variety of open source tools for calculating the SHA256 checksum. See the below tables for examples.
For CLI v2.2.0:
For CLI v2.3.0+:
In this tutorial, we will show how to create and launch a pipeline using the Nextflow language in ICA.
After creating the project, select the project from the Projects view to enter the project. Within the project, navigate to the Flow > Pipelines view in the left navigation pane. From the Pipelines view, click +Create Pipeline > Nextflow > XML based to start creating the Nextflow pipeline.
In the Nextflow pipeline creation view, the Information tab is used to add information about the pipeline. Add values for the required Code (unique pipeline name) and Description fields.
Add the container directive to each process with the latest ubuntu image. If no Docker image is specified, public.ecr.aws/lts/ubuntu:22.04_stable is used as default.
Add the publishDir directive with value 'out' to the reverse process.
Modify the reverse process to write the output to a file test.txt instead of stdout.
The description of the pipeline from the linked Nextflow docs:
This example shows a pipeline that is made of two processes. The first process receives a FASTA formatted file and splits it into file chunks whose names start with the prefix seq_.
The process that follows, receives these files and it simply reverses their content by using the rev command line tool.
Syntax example:
Navigate to the Nextflow files > main.nf tab to add the definition to the pipeline. Since this is a single file pipeline, we won't need to add any additional definition files. Paste the following definition into the text editor:
Next we'll create the input form used when launching the pipeline. This is done through the XML Configuration tab. Since the pipeline takes in a single FASTA file as input, the XML-based input form will include a single file input.
Paste the below XML input form into the XML CONFIGURATION text editor. Click the Generate button to preview the launch form fields.
With the definition added and the input form defined, the pipeline is complete.
On the Documentation tab, you can fill out additional information about your pipeline. This information will be presented under the Documentation tab whenever a user starts a new analysis on the pipeline.
Click the Save button at the top right. The pipeline will now be visible from the Pipelines view within the project.
To upload the FASTA file to the project, first navigate to the Data section in the left navigation pane. In the Data view, drag and drop the FASTA file from your local machine into the indicated section in the browser. Once the file upload completes, the file record will show in the Data explorer. Ensure that the format of the file is set to "FASTA".
Now that the input data is uploaded, we can proceed to launch the pipeline. Navigate to the Analyses view and click the button to Start Analysis. Next, select your pipeline from the list. Alternatively you can start your pipeline from Projects > your_project > Flow > Pipelines > Start new analysis.
In the Launch Pipeline view, the input form fields are presented along with some required information to create the analysis.
Enter a User Reference (identifier) for the analysis. This will be used to identify the analysis record after launching.
Set the Entitlement Bundle (there will typically only be a single option).
In the Input Files section, select the FASTA file for the single input file. (chr1_GL383518v1_alt.fa)
Set the Storage size to small. This will attach a 1.2TB shared file system to the environment used to run the pipeline.
With the required information set, click the button to Start Analysis.
After launching the pipeline, navigate to the Analyses view in the left navigation pane.
The analysis record will be visible from the Analyses view. The Status will transition through the analysis states as the pipeline progresses. It may take some time (depending on resource availability) for the environment to initialize and the analysis to move to the In Progress status.
Click the analysis record to enter the analysis details view.
Once the pipeline succeeds, the analysis record will show the "Succeeded" status. Do note that this may take considerable time if it is your first analysis because of the required resource management. (in our example, the analysis took 28 minutes)
From the analysis details view, the logs produced by each process within the nextflow pipeline are accessible via the Logs tab.
Analysis outputs are written to an output directory in the project with the naming convention {Analysis User Reference}-{Pipeline Code}-{GUID} (1).
Inside of the analysis output directory are the files output by the analysis processes written to the 'out' directory. In this tutorial, the file test.txt (2) is written to by the reverse process. Navigating into the analysis output directory, clicking into the test.txt file details, and opening the VIEW tab (3) shows the output file contents.
The "Download" button (4) can be used to download the data to the local machine.
After a project has been created, a DRAGEN bundle must be linked to a project to obtain access to a DRAGEN docker image. Enter the project by clicking on it, and click Edit in the Project Details page. From here, you can link a DRAGEN Demo Tool bundle into the project. The bundle that is selected here will determine the DRAGEN version that you have access to. For this tutorial, you can link DRAGEN Demo Bundle 3.9.5. Once the bundle has been linked to your project, you can now access the docker image and version by navigating back to the All Projects page, clicking on Docker Repository, and double clicking on the docker image dragen-ica-4.0.3. The URL of this docker image will be used later in the container directive for your DRAGEN process defined in Nextflow.
Select Projects > your_project > Flow > Pipelines. From the Pipelines view, click +Create Pipeline > Nextflow > XML based to start creating a Nextflow pipeline.
In the Nextflow pipeline creation view, the Details tab is used to add information about the pipeline. Add values for the required Code (pipeline name) and Description fields. Nextflow Version and Storage size default to preassigned values. For the customized DRAGEN pipeline, Nextflow Version must be changed to 22.04.3.
Next, add the Nextflow pipeline definition by navigating to the Nextflow files > MAIN.NF tab. You will see a text editor. Copy and paste the following definition into the text editor. Modify the container directive by replacing the current URL with the URL found in the docker image dragen-ica-4.0.3.
This pipeline takes two FASTQ files, one reference file and one sample_id parameter as input.
Paste the following XML input form into the XML CONFIGURATION text editor.
Click the Generate button (at the bottom of the text editor) to preview the launch form fields.
Click the Save button to save the changes.
The dataInputs section specifies file inputs, which will be mounted when the workflow executes. Parameters defined under the steps section refer to string and other input types.
Each of the dataInputs and parameters can be accessed in the Nextflow within the workflow's params object named according to the code defined in the XML (e.g. params.sample_id).
If you have no test data available, you need to link the Dragen Demo Bundle to your project at Projects > your_project > Project Settings > Details > Linked Bundles.
Go to the pipelines page from the left navigation pane. Select the pipeline you just created and click Start New Analysis.
Fill in the required fields indicated by red "*" sign and click on Start Analysis button.
You can monitor the run from the analysis page.
Once the Status changes to Succeeded, you can click on the run to access the results page.
This approach is applicable in situations where your main.nf file contains all your pipeline logic, and it illustrates what the liftover process would look like.
Select Projects > your_project > Flow > Pipelines. From the Pipelines view, click the +Create pipeline > Nextflow > XML based button to start creating a Nextflow pipeline.
In the Details tab, add values for the required Code (unique pipeline name) and Description fields. Nextflow Version and Storage size default to preassigned values.
In the XML configuration, the input files and settings are specified. For this particular pipeline, you need to specify the transcriptome and the reads directory. Navigate to the XML Configuration tab and paste the following:
Click the Generate button (at the bottom of the text editor) to preview the launch form fields.
Click the Save button to save the changes.
Go to the Pipelines page from the left navigation pane. Select the pipeline you just created and click Start New Analysis.
Fill in the required fields indicated by red "*" sign and click on Start Analysis button. You can monitor the run from the Analyses page. Once the Status changes to Succeeded, you can click on the run to access the results page.
This is not an official Illumina product, but is intended to make your Nextflow experience in ICA more fruitful.
Some additional repos that can help with your ICA experience can be found below:
What these scripts do:
Parses configuration files and the Nextflow scripts (main.nf, workflows, subworkflows, modules) of a pipeline and updates the pipeline configuration with pod directives that tell ICA which compute instance to run each process on
Strips out parameters that ICA utilizes for workflow orchestration
Migrates the manifest closure to the conf/base.ica.config file
Ensures that docker is enabled
Adds workflow.onError (main.nf, workflows, subworkflows, modules) to aid troubleshooting
Modifies the processes that reference scripts and tools in the bin/ directory of a pipeline's projectDir, so that when ICA orchestrates your Nextflow pipeline, it can find and properly execute your pipeline process
Additional edits to ensure your pipeline runs more smoothly on ICA
Nextflow workflows on ICA are orchestrated by kubernetes and require a parameters XML file containing data inputs (i.e. files + folders) and other string-based options for all configurable parameters to properly be passed from ICA to your Nextflow workflows
Nextflow processes will need to contain a reference to a container, that is, a Docker image that will run that specific process
Nextflow processes will need a pod annotation specified for ICA to know what instance type to run the process on.
The scripts mentioned below can be run in a docker image keng404/nextflow-to-icav2-config:0.0.3
This has:
nf-core installed
All Rscripts in this repo with relevant R libraries installed
The ICA CLI installed, to allow for pipeline creation and CLI templates to request pipeline runs after the pipeline is created in ICA
You'll likely need to run the image with a docker command like this for you to be able to run git commands within the container:
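A minimal sketch (the mount path and interactive entrypoint are assumptions):
docker run -it --rm -v "$(pwd):$(pwd)" -w "$(pwd)" keng404/nextflow-to-icav2-config:0.0.3 /bin/bash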
where pwd is your $HOME directory
If you have a specific pipeline from Github, you can skip this statement below.
You'll first need to download the python module from nf-core via a pip install nf-core command. Then you can use nf-core list --json to return a JSON metadata file containing current pipelines in the nf-core repository.
You can choose which pipelines to git clone, but as a convenience, the wrapper nf-core.conversion_wrapper.R will perform a git pull, parse nextflow_schema.json files and generate parameter XML files, and then read configuration and Nextflow scripts and make some initial modifications for ICA development. Lastly, these pipelines are created in an ICA project of your choosing, so you will need to generate and download an API key from the ICA domain of your choosing.
The Project view should be the default view after logging into your private domain (https://my_domain.login.illumina.com) and clicking on your ICA 'card' ( This will redirect you to https://illumina.ica.com/ica).
GIT_HUB_URL can be specified to grab pipeline code from github. If you intend to liftover anything in the master branch, your GIT_HUB_URL might look like https://github.com/keng404/my_pipeline. If there is a specific release tag you intend to use, you can use the convention https://github.com/keng404/my_pipeline:my_tag.
Alternatively, if you have a local copy/version of a Nextflow pipeline you'd like to convert and use in ICA, you can use the --pipeline-dirs argument to specify this.
In summary, you will need the following prerequisites, either to run the wrapper referenced above or to carry out individual steps below.
git clone the nf-core pipelines of interest
Install the python module nf-core and create a JSON file using the command line nf-core list --json > {PIPELINE_JSON_FILE}
What nf-core.conversion_wrapper.R does for each Nextflow pipeline:
A Nextflow schema JSON is generated by nf-core's python library nf-core. nf-core can be installed via a pip install nf-core command.
Update the nextflow.config and a base config file so that they are compatible with ICA:
This script will update your configuration files so that they integrate better with ICA. The flag --is-simple-config will create a base config file from a template. This flag will also be active if no arguments are supplied to --base-config-files.
This step adds some updates to your module scripts to allow for easier troubleshooting (i.e. copy work directory back to ICA if an analysis fails). It also allows for ICA's orchestration of your Nextflow pipeline to properly handle any script/binary in the bin/ directory of your pipeline $projectDir.
You may have to edit your {PARAMETERS_XML} file if these edits are unnecessary.
Currently ICA supports Nextflow versions nextflow/nextflow:22.04.3 and nextflow/nextflow:20.10.0 (with 20.10.0 to be deprecated soon).
nf-core.create_ica_pipeline.R
Add the flag --developer-mode to the command line above if you have custom groovy libraries or module files referenced in your workflow. When this flag is specified, the script will upload these files and directories to ICA and update the parameters XML file to allow you to specify directories under the parameter project_dir and files under input_files. This will ensure that these files and directories are placed in the $workflow.launchDir when the pipeline is invoked.
As a convenience, you can also get a templated CLI command to help run a pipeline (i.e. submit a pipeline request) in ICA via the following:
There will be a corresponding JSON file (i.e. a file with the file extension *ICAv2_CLI_template.json) that saves these values, which you can modify and configure to build out templates or launch the specific pipeline run you desire. You can specify the name of this JSON file with the parameter --output-json.
Once you modify this file, you can use --template-json and specify this file to create the CLI command you can use to launch your pipeline.
If you have a previously successful analysis with your pipeline, you may find this approach more useful.
Where possible, these scripts search for config files that refer to a test (i.e. test.config, test_full.config, test*config) and create a boolean parameter params.ica_smoke_test that can be toggled on/off as a sanity check that the pipeline works as intended. By default, this parameter is set to false. When set to true, these test config files are loaded in your main nextflow.config.
In this tutorial, we will demonstrate how to create and launch a Nextflow pipeline using the ICA command line interface (CLI).
The 'main.nf' file defines the workflow that orchestrates various RNASeq analysis processes.
The script uses the following tools:
Salmon: Software tool for quantification of transcript abundance from RNA-seq data.
FastQC: QC tool for sequencing data
MultiQC: Tool to aggregate and summarize QC reports
docker pull nextflow/rnaseq-nf
Create a tarball of the image to upload to ICA.
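For example, using the standard docker save command (the tarball name is illustrative):
docker save nextflow/rnaseq-nf -o rnaseq-nf.tar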
The following commands can be used to upload the tarball to your project.
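A hedged sketch, assuming you are already in the project context (icav2 enter <PROJECT NAME or ID>) and that the upload command takes a local path followed by an optional remote folder:
# /docker-images/ is an illustrative target folder in the project
icav2 projectdata upload rnaseq-nf.tar /docker-images/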
Add the image to the ICA Docker repository
The uploaded image can be added to the ICA docker repository from the ICA Graphical User Interface (GUI).
Change the format for the image tarball to DOCKER:
Navigate to Projects > <your_project> > Data
Check the checkbox for the uploaded tarball
Click on the "Manage" dropdown
Click on "Change format". In the new popup window, select the "DOCKER" format and click Save.
To add this image to the ICA Docker repository, first click on "All Projects" to go back to the home page.
From the ICA home page, click on the "Docker Repository" page under "System Settings"
Click the "+ New" button to open the "New Docker Image" window.
In the new window, click on the "Select a file with DOCKER format"
This will open a new window that lets you select the above tarball.
Select the region (US, EU, CA) your project is in.
Select your project. You can start typing the name in the textbox to filter it.
The bottom pane will show the "Data" section of the selected project. If you have the docker image in subfolders, browse the folders to locate the file. Once found, click on the checkbox corresponding to the image and press "Select".
You will be taken back to the "New Docker image" window. The "Data" and "Name" fields will have been populated based on the imported image. You can edit the "Name" field to rename it. For this tutorial, we will change the name to "rnaseq". Select the region, and give it a version number, and description. Click on "Save".
If you have the images hosted in other repositories, you can add them as external image by clicking the "+ New external image" button and completing the form as shown in the example below.
After creating a new docker image, you can double click on the image to get the container URL for the nextflow configuration file.
Create a configuration file called "nextflow.config" in the same directory as the main.nf file above. Use the URL copied above to add the process.container line in the config file.
An empty form looks as follows:
The input files are specified within a single dataInputs node, with each individual input file specified in a separate dataInput node. Settings (as opposed to files) are specified within the steps node. Settings represent any non-file input to the workflow, including, but not limited to, strings, booleans, and integers.
For this tutorial, there are no settings parameters, but multiple file inputs are required. The parameters.xml file looks as follows:
Use the following commands to create the pipeline with the above workflow in your project.
If not already in the project context, enter it by using the following command:
icav2 enter <PROJECT NAME or ID>
Create the pipeline using icav2 projectpipelines create nextflow
Example:
If you prefer to organize the processes in different folders/files, you can use the --other parameter to upload the different processes as additional files. Example:
Example command to run the pipeline from CLI:
You can get the pipeline id under "ID" column by running the following command:
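For example (a minimal sketch, assuming a list subcommand is available and you are in the project context):
icav2 projectpipelines list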
You can get the file ids under "ID" column by running the following commands:
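For example (again a minimal sketch; filter or page through the output as needed):
icav2 projectdata list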
Additional Resources:
Using this command all the files starting with VariantCaller- will be downloaded (prerequisite: a tool is installed on the machine):
For more information on how to use pagination, please refer to
Please see the for all content related to Cloud Analysis Auto-Launch:
Nextflow offers support for the scatter-gather pattern natively. The initial uses this pattern by splitting the FASTA file into chunks to channel records in the task splitSequences, then by processing these chunks in the task reverse.
Note: To access release history of CLI versions prior to v2.0.0, please see the ICA v1 documentation .
This tutorial references the example in the Nextflow documentation.
The first step in creating a pipeline is to create a project. For instructions on creating a project, see the page. In this tutorial, we'll use a project called "Getting Started".
Next we'll add the Nextflow pipeline definition. The pipeline we're creating is a modified version of the example from the Nextflow documentation. Modifications to the pipeline definition from the nextflow documentation include:
Resources: For each process, you can use the and to set the . ICA will then determine the best matching compute type based on those settings. Suppose you set memory '10240 GB' and cpus 6, then ICA will determine you need the standard-large ICA Compute Type.
Before we launch the pipeline, we'll need to upload a FASTA file to use as input. In this tutorial, we'll use a public FASTA file from the . Download the file and unzip to decompress the FASTA file.
In this tutorial, we will demonstrate how to create and launch a simple DRAGEN pipeline using the Nextflow language in ICA GUI. More information about Nextflow on ICA can be found . For this example, we will implement the alignment and variant calling example from this for Paired-End FASTQ Inputs.
As of DRAGEN version 4.3.13, creating DRAGEN pipelines is no longer possible because of proprietary code.
The first step in creating a pipeline is to select a project for the pipeline to reside in. If the project doesn't exist, create a project. For instructions on creating a project, see the page. In this tutorial, we'll use a project called Getting Started.
To specify a compute type for a Nextflow process, use the directive within each process.
Outputs for Nextflow pipelines are uploaded from the out directory in the attached shared filesystem. The directive specifies the output folder for a given process. Only data moved to the out folder using the publishDir directive will be uploaded to the ICA project after the pipeline finishes executing.
Refer to the for details on ICA specific attributes within the Nextflow definition.
Next, create the input form used for the pipeline. This is done through the XML CONFIGURATION tab. More information on the specifications for the input form can be found in page.
In this tutorial, we will be using the example RNASeq pipeline to demonstrate the process of lifting a simple Nextflow pipeline over to ICA.
Copy and paste the into the Nextflow files > main.nf tab. The following comparison highlights the differences between the original file and the version for deployment in ICA. The main difference is the explicit specification of containers and pods within processes. Additionally, some channels' specification are modified, and a debugging message is added. When copying and pasting, be sure to remove the text highlighted in red (marked with -) and add the text highlighted in green (marked with +).
This is an to help develop Nextflow pipelines that will run successfully on ICA. There are some syntax bugs that may get introduced in your Nextflow code. One suggestion is to run the steps as described below and then open these files in VisualStudio Code with the Nextflow plugin installed. You may also need to run smoke tests on your code to identify syntax errors you might not catch upon first glance.
Some examples of Nextflow pipelines that have been lifted over with this repo can be found .
Some additional examples of ICA-ported Nextflow pipelines are .
Relaunch pipeline analysis and
Monitor your analysis run in ICA and troubleshoot
Wrap a WDL-based workflow in a
Wrap a Nextflow-based workflow in a
This will allow you to test your main.nf script. If you have a Nextflow pipeline that is more nf-core like (i.e. where you may have several subworkflow and module files), this may be more appropriate. Any and all comments are welcome.
Generates parameter XML file based on nextflow_schema.json, nextflow.config, conf/
Take a look at to understand a bit more of what's done with the XML, as you may want to make further edits to this file for better usability
A table of instance types and the associated CPU + Memory specs can be found under a table named Compute Types
These scripts have been made to be compatible with workflows, so you may find the concepts from the documentation here a better starting point.
Next, you'll need an API key file for ICA that can be generated using the instructions .
Finally, you'll need to create a project in ICA. You can do this via the CLI and API, but you should be able to follow these to create a project via the ICA GUI.
Install ICA CLI by following these .
A table of all CLI releases for mac, linux, and windows can be found .
Relaunch pipeline analysis and .
Please refer to for installing ICA CLI. To authenticate, please follow the steps in the page.
In this tutorial, we will create in ICA. The workflow includes four processes: index creation, quantification, FastQC, and MultiQC. We will also upload a Docker container to the ICA Docker repository for use within the workflow.
We need a Docker container consisting of these tools. You can refer to the section in the help page to build your own docker image with the required tools. For the sake of this tutorial, we will use the container from the
With in your computer, download the image required for this project using the following command.
You can add a pod directive within a process or in the config file to specify a compute type. The following is an example of a configuration file with the 'standard-small' compute type for all processes. Please refer to the page for a list of available compute types.
The parameters file defines the workflow input parameters. Refer to the for detailed information for creating correctly formatted parameters files.
You can refer to page to explore options to automate this process.
Refer to for details on running the pipeline from CLI.
Please refer to command help (icav2 [command] --help) to determine available flags to filter the output of the above commands if necessary. You can also refer to page for available flags for the icav2 commands.
For more help on uploading data to ICA, please refer to the page.
Windows: CertUtil -hashfile icav2.exe SHA256
Linux: sha256sum icav2
Mac: shasum -a 256 icav2
Windows: CertUtil -hashfile ica-windows-amd64.zip SHA256
Linux: sha256sum ica-linux-amd64.zip
Mac: shasum -a 256 ica-darwin-amd64.zip
2.34.0
2.33.0
2.32.2
2.31.0
2.30.0
2.29.0
2.28.0
2.27.0
2.26.0
2.25.0
2.24.0
2.23.0
2.22.0
2.21.0
2.19.0
2.18.0
2.17.0
2.16.0
2.15.0
2.12.0
2.10.0
2.9.0
2.8.0
2.4.0
2.3.0
2.2.0
2.1.0
2.0.0
In bioinformatics and computational biology, the vast and growing amount of data necessitates methods and tools that can process and analyze data in parallel. This demand gave birth to the scatter-gather approach, an essential pattern in creating pipelines that offers efficient data handling and parallel processing capabilities. In this tutorial, we will demonstrate how to create a CWL pipeline utilizing the scatter-gather approach. To this purpose, we will use two widely known tools: fastp and multiqc. Given the functionalities of both fastp and multiqc, their combination in a scatter-gather pipeline is incredibly useful. Individual datasets can be scattered across resources for parallel preprocessing with fastp. Subsequently, the outputs from each of these parallel tasks can be gathered and fed into multiqc, generating a consolidated quality report. This workflow not only accelerates the preprocessing of large datasets but also offers an aggregated perspective on data quality, ensuring that subsequent analyses are built upon a robust foundation.
First, we create the two tools: fastp and multiqc. For this, we need the corresponding Docker images and CWL tool definitions. Please look up this part of our help sites to learn more about how to import a tool into ICA. In a nutshell, once the CWL tool definition is pasted into the editor, the other tabs for editing the tool will be populated. To complete the tool, the user needs to select the corresponding Docker image and provide a tool version (which can be any string).
For this demo, we will use the publicly available Docker images: quay.io/biocontainers/fastp:0.20.0--hdbcaa40_0 for fastp and docker.io/ewels/multiqc:v1.15 for multiqc. In this tutorial one can find how to import publicly available Docker images into ICA.
Furthermore, we will use the following CWL tool definitions:
and
Once the tools are created, we will create the pipeline itself using these two tools at Projects > your_project > Flow > Pipelines > CWL > Graphical:
On the Definition tab, go to the tool repository and drag and drop the two tools which you just created on the pipeline editor.
Connect the JSON output of fastp to multiqc input by hovering over the middle of the round, blue connector of the output until the icon changes to a hand and then drag the connection to the first input of multiqc. You can use the magnification symbols to make it easier to connect these tools.
Above the diagram, drag and drop two input FASTQ files and an output HTML file on to the pipeline editor and connect the blue markers to match the diagram below.
Relevant aspects of the pipeline:
Both inputs are multivalue (as can be seen on the screenshot)
Ensure that the step fastp has scattering configured: it scatters on both inputs using the scatter method 'dotproduct'. This means that as many instances of this step will be executed as there are pairs of FASTQ files. To indicate that this step is executed multiple times, the icons of both inputs have doubled borders.
Both input arrays (Read1 and Read2) must be matched. Automatic sorting of input arrays is not currently supported, so you have to take care of matching the input arrays yourself. There are two ways to achieve this (besides manual specification in the GUI):
invoke this pipeline in CLI using Bash functionality to sort the arrays
add a tool to the pipeline which will take in an array of all FASTQ files, spread them by R1 and R2 suffixes, and sort them.
We will describe the second way in more detail. The tool will be based on the public python Docker image docker.io/python:3.10 and have the following definition. In this tool we are providing the Python script spread_script.py via the Dirent feature.
Now this tool can be added to the pipeline before the fastp step.
This tutorial aims to guide you through the process of creating CWL tools and pipelines from the very beginning. By following the steps and techniques presented here, you will gain the necessary knowledge and skills to develop your own pipelines or transition existing ones to ICA.
The foundation for every tool in ICA is a Docker image (externally published or created by the user). Here we present how to create your own Docker image for the popular tool (FASTQC).
Copy the contents displayed below to a text editor and save it as a Dockerfile. Make sure you use an editor which does not add formatting to the file.
Open a terminal window, place this file in a dedicated folder and navigate to this folder location. Then use the following command:
docker build --file fastqc-0.11.9.Dockerfile --tag fastqc-0.11.9:1 .
Check the image has been successfully built:
docker images
Check that the container is functional:
docker run --rm -i -t --entrypoint /bin/bash fastqc-0.11.9:1
Once inside the container, check that the fastqc command is responsive and prints the expected help message. Remember to exit the container.
Save a tar of the previously built image locally:
docker save fastqc-0.11.9:1 -o fastqc-0.11.9:1.tar.gz
Upload your docker image .tar to an ICA project (browser upload, Connector, or CLI). Important: In Data tab, select the uploaded .tar file, then click “Manage --> Change Format”, select 'DOCKER' and Save.
Now step outside of the Project and go to Docker Repository, Select New and click on the Search Icon. You can filter on Project names and locations, select your docker file (use the checkbox on the left) and Press Select.
While outside of any Project, go to Tool Repository and select New Tool. Fill the mandatory fields (Name and Version) and click on the Search Icon to look for a Docker image to link to the tool. You must double-click on the image row to confirm the selection. Tool creation in ICA adheres to the CWL standard.
There are two ways you can create a (CWL) tool on top of a docker image in the ICA UI:
1: Navigate to the Tool CWL tab and use the text editor to create the tool definition in CWL syntax.
2: Use the other tabs to independently define inputs, outputs, arguments, settings, etc.
In this tutorial we will present the first option, using the CWL file: paste the following content into the Tool CWL tab.
Please, observe the following: since the user needs to specify the output folder for FASTQC application (-o prefix), we are using the $(runtime.outdir) runtime parameter to point to the designated output directory.
While inside a Project, navigate to Pipelines and click on cwl and then Graphical.
Fill the mandatory fields (Code = pipeline name and free text Description) and click on the Definition tab to open the Graphical Editor.
Expand the Tool Repository menu (lower right) and drag your FastQC tool into the Editor field (center).
Now drag one Input and one Output file icon (on top) into the Editor field as well. Both may be given a Name (editable fields on the right when icon is selected) and need a Format attribute. Set the Input Format to fastq and Output Format to html. Connect both Input and Output files to the matching nodes on the tool itself (mouse over the node, then hold-click and drag to connect).
Press Save, you just created your first FastQC pipeline on ICA!
First make sure you have at least one Fastq file uploaded and/or linked to your Project. You may use Fastq files available in the Bundle.
Navigate to Pipelines and select the pipeline you just created, then press Start New Run.
Fill the mandatory field (User Reference = pipeline execution name) and click on the Select button to open the File Selection dialog box. Select any of the Fastq files available to you (use the checkbox on the left and press Select on the lower right).
Press Start Run on the top right, the platform is now orchestrating the workflow execution.
Navigate to Runs and observe that the pipeline execution is now listed and will first appear to be in “Requested” Status. After a few minutes the Status should change to “In Progress” and then to “Succeeded”.
Once this Run is succeeded click on the row (a single click is enough) to enter Result view. You should see the FastQC HTML output file listed on the right. Click on the file to open Data Details view. Since it is an HTML file Format there is a View tab that allows visualizing the HTML within the browser.
In this tutorial, we will demonstrate how to create and launch a DRAGEN pipeline using the CWL language.
In ICA, CWL pipelines are built using tools developed in CWL. For this tutorial, we will use the "DRAGEN Demo Tool" included with DRAGEN Demo Bundle 3.9.5.
1.) Start by selecting a project at the Projects inventory.
2.) In the details page, select Edit.
3.) In the edit mode of the details page, click the + button in the LINKED BUNDLES section.
4.) In the Add Bundle to Project window: Select the dragen demo tool bundle from the list. Once you have selected the bundle, the Link Bundles button becomes available. Select it to continue.
Tip: You can select multiple bundles using Ctrl + Left mouse button or Shift + Left mouse button.
5.) In the project details page, the selected bundle will appear under the LINKED BUNDLES section. If you need to remove a bundle, click on the - button. Click Save to save the project with linked bundles.
1.) From the project details page, select Pipelines > CWL
2.) You will be given options to create pipelines using a graphical interface or code. For this tutorial, we will select Graphical.
3.) Once you have selected the Graphical option, you will see a page with multiple tabs. The first tab is the Information page where you enter pipeline information. You can find the details for different fields in the tab in the GitBook. The following three fields are required for the INFORMATION page.
Code: Provide pipeline name here.
Description: Provide pipeline description here.
Storage size: Select the storage size from the drop-down menu.
4.) The Documentation tab provides options for configuring the HTML description for the tool. The description appears in the tool repository but is excluded from exported CWL definitions.
5.) The Definition tab is used to define the pipeline. When using graphical mode for the pipeline definition, the Definition tab provides options for configuring the pipeline using a visualization panel (A) and a list of component menus (B). You can find details on each section in the component menu here
6.) To build a pipeline, start by selecting Machine PROFILE from the component menu section on the right. All fields are required and are pre-filled with default values. Change them as needed.
The profile Name field will be updated based on the selected Resource. You can change it as needed.
Color assigns the selected color to the tool in the design view to easily identify the machine profile when more than one tool is used in the pipeline.
Resource lets you choose from various compute resources available. In this case, we are building a DRAGEN pipeline and we will need to select a resource with FPGA in it. Choose from FPGA resources (FPGA Medium/Large) based on your needs.
7.) Once you have selected the Machine Profile for the tool, find your tool from the Tool Repository at the bottom section of the component menu on the right. In this case, we are using the DRAGEN Demo Tool. Drag and drop the tool from the Tool Repository section to the visualization panel.
8.) The dropped tool will show the machine profile color, number of outputs and inputs, and warning to indicate missing parameters, mandatory values, and connections. Selecting the tool in the visualization panel activates the tool (Dragen Demo Tool) component menu. On the component menu section, you will find the details of the tool under Tool - DRAGEN Demo Tool. This section lists the inputs, outputs, additional parameters, and the machine profile required for the tool. In this case, the DRAGEN Demo Tool requires three inputs (FASTQ read 1, FASTQ read 2, and a Reference genome). The tool has two outputs (a VCF file and an output directory). The tool also has a mandatory parameter (Output File Prefix). Enter the value for the input parameter (Output File Prefix) in the text box.
9.) The top right corner of the visualization panel has icons to zoom in and out in the visualization panel followed by three icons: ref, in, and out. Based on the type of input/output needed, drag and drop the icons into the visualization area. In this case, we need three inputs (read 1, read 2, and Reference hash table.) and two outputs (VCF file and output directory). Start by dragging and dropping the first input (a). Connect the input to the tool by clicking on the blue dot at the bottom of the input icon and dragging it to the blue dot representing the first input on the tool (b). Select the input icon to activate the input component menu. The input section for the first input lets you enter the Name, Format, and other relevant information based on tool requirements. In this case, for the first input, enter the following information:
Name: FASTQ read 1
Format: FASTQ
Comments: any optional comments
10.) Repeat the step for other inputs. Note that the Reference hash table is treated as the input for the tool rather than Reference files. So, use the input icon instead of the reference icon.
11.) Repeat the process for two outputs by dragging and connecting them to the tool. Note that when connecting output to the tool, you will need to click on the blue dot at the bottom of the tool and drag it to the output.
12.) Select the tool and enter additional parameters. In this case, the tool requires Output File Prefix. Enter demo_ in the text box.
13.) Click the Save button to save the pipeline. Once saved, you can run it like any other pipeline from the Pipelines page under Flow in the left navigation.
You can access the databases and tables within the Base module using the snowSQL command-line interface. This is useful for external collaborators who do not have access to ICA core functionalities. This tutorial describes how to obtain the token and use it to access the Base module. It does not cover how to install and configure snowSQL.
Once the Base module has been enabled within a project, the following details are shown in Projects > your_project > Project Settings > Details.
After clicking the Create OAuth access token button, the pop-up authenticator is displayed.
After clicking the Generate snowSQL command button, the pop-up authenticator presents the snowSQL command.
Copy the snowSQL command and run it in the console to log in.
You can also get the OAuth access token via API by providing <PROJECT ID> and <YOUR KEY>.
API Call:
Response
Template snowSQL:
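If you prefer to script the login, the same values can be dropped into a small Python wrapper. This is only a sketch: every value below is a placeholder to be replaced with the details shown on the Base details page and the token from the pop-up, and it assumes snowSQL is already installed and on your PATH.

    import subprocess

    # Placeholders: copy the real values from Projects > your_project >
    # Project Settings > Details and from the Create OAuth access token pop-up.
    account = "<snowflake_account>"
    user = "<base_user>"
    database = "<project_database>"
    warehouse = "<warehouse>"
    role = "<role>"
    token = "<oauth_access_token>"

    # Standard snowSQL options: -a account, -u user, -d database, -w warehouse,
    # -r role, plus OAuth authentication with the ICA-generated token.
    subprocess.run([
        "snowsql",
        "-a", account,
        "-u", user,
        "-d", database,
        "-w", warehouse,
        "-r", role,
        "--authenticator", "oauth",
        "--token", token,
    ])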
Now you can perform a variety of tasks such as:
Querying Data: execute SQL queries against tables, views, and other database objects to retrieve data from the Snowflake data warehouse.
Creating and Managing Database Objects: create tables, views, stored procedures, functions, and other database objects in Snowflake. You can also modify and delete these objects as needed.
Loading Data: load data into Snowflake from various sources such as local files, AWS S3, Azure Blob Storage, or Google Cloud Storage.
Overall, the snowSQL CLI provides a powerful and flexible interface to work with Snowflake, allowing external users to manage the data warehouse and perform a variety of tasks efficiently and effectively without access to the ICA core.
Show all tables in the database:
Create a new table:
List records in a table:
Load data from a file: To load data from a file, start by creating a staging area in the internal storage using the following command:
You can then upload the local file to the internal storage using the following command:
You can check if the file was uploaded properly using the LIST command:
Finally, load the data using the COPY INTO command. The command assumes data.tsv is a tab-delimited file. You can easily modify the command to import a JSON file by setting TYPE=JSON.
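The same staging workflow can also be scripted with the Snowflake Python connector, which later sections of this documentation use as well. The sketch below is illustrative only: the connection values are placeholders taken from the Base details page, and my_stage, my_table, and data.tsv are example names.

    import snowflake.connector

    # Connection values are placeholders; use the account details and OAuth token
    # from the Base details page, as in the snowSQL example above.
    conn = snowflake.connector.connect(
        account="<snowflake_account>",
        user="<base_user>",
        authenticator="oauth",
        token="<oauth_access_token>",
        database="<project_database>",
        warehouse="<warehouse>",
        role="<role>",
    )
    cur = conn.cursor()

    # Create an internal stage and upload the local file
    # (PUT compresses it to data.tsv.gz by default).
    cur.execute("CREATE STAGE IF NOT EXISTS my_stage")
    cur.execute("PUT file:///path/to/data.tsv @my_stage")
    cur.execute("LIST @my_stage")
    print(cur.fetchall())

    # Load the staged, tab-delimited file into an existing table.
    cur.execute(
        "COPY INTO my_table FROM @my_stage/data.tsv.gz "
        "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '\\t')"
    )
    conn.close()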
Load data from a string: If you have data as a JSON string, you can import it into the tables using the following commands.
Load data into specific columns: If you only want to load sample_name into the table, remove "count" from the column list and the value list, as shown below:
List the views of the database to which you are connected. Shared database and catalogue views are created within the project database, so they will be listed. However, this does not show views which are granted via another database or role, or from bundles.
Show grants, both grants directly on tables and views and grants to roles which in turn have grants on tables and views.
Base is a genomics data aggregation and knowledge management solution suite. It is a secure and scalable integrated genomics data analysis solution which provides information management and knowledge mining. Refer to the Base documentation for more details.
This tutorial provides an example for exercising the basic operations used with Base, including how to create a table, load the table with data, and query the table.
An ICA project with access to Base
If you don't already have a project, please follow the instructions in the Project documentation to create a project.
File to import
A tab delimited gene expression file (sampleX.final.count.tsv). Example format:
Tables are components of databases that store data in a 2-dimensional format of columns and rows. Each row represents a new data record in the table; each column represents a field in the record. On ICA, you can use Base to create custom tables to fit your data. A schema definition defines the fields in a table. On ICA you can create a schema definition from scratch, or from a template. In this activity, you will create a table for RNAseq count data, by creating a schema definition from scratch.
Go to the Projects > your_project > Base > Tables and enable Base by clicking on the Enable button.
Select Add Table > New Table.
Create your table
To create your table from scratch, select Empty Table from the Create table from dropdown.
Name your table FeatureCounts
Uncheck the box next to Include reference, to exclude reference data from your table.
Check the box next to Edit as text. This will reveal a text box that can be used to create your schema.
Copy the schema text below and paste it into the text box to create your schema.
Click the Save button
Upload sampleX.final.count.tsv file with the final count.
Select Data tab (1) from the left menu.
Click on the grey box (2) to choose the file to upload or drag and drop the sampleX.final.count.tsv into the grey box
Refresh the screen (3)
The uploaded file (4) will appear on the data page after successful upload.
Data can be loaded into tables manually or automatically. To load data automatically, you can set up a schedule. The schedule specifies which files' data should be automatically loaded into a table when those files are uploaded to ICA or created by an analysis on ICA. Active schedules check for new files every 24 hours.
In this exercise, you will create a schedule to automatically load RNA transcript counts from .final.count.tsv files into the table you created above.
Go to Projects > your_project > Base > Schedule and click the + Add New button.
Select the option to load the contents from files into a table.
Create your schedule.
Name your schedule LoadFeatureCounts
Choose Project as the source of data for your table.
To specify that data from .final.count.tsv files should be loaded into your table, enter .final.count.tsv in the Search for a part of a specific ‘Original Name’ or Tag text box.
Specify your table as the one to load data into, by selecting your table (FeatureCounts) from the dropdown under Target Base Table.
Under Write preference, select Append to table. New data will be appended to your table, rather than overwriting existing data in your table.
The .final.count.tsv files that will be loaded into your table are TSV/tab-delimited and do not contain a header row. For the Data format, Delimiter, and Header rows to skip fields, use these values:
Data format: TSV
Delimiter: Tab
Header rows to skip: 0
Click the Save button
Highlight your schedule. Click the Run button to run your schedule now.
It will take a short time to prepare and load data into your table.
Check the status of your job on your Projects > your_project > Activity page.
Click the BASE JOBS tab to view the status of scheduled Base jobs.
Click BASE ACTIVITY to view Base activity.
Check the data in the table.
Go back to your Projects > your_project > Base > Tables page.
Double-click your table to view its details.
You will land on the SCHEMA DEFINITION page.
Click the PREVIEW tab to view the records that were loaded into your table.
Click the DATA tab, to view a list of the files whose data has been loaded into your table.
To request data or information from a Base table, you can run an SQL query. You can create and run new queries or saved queries.
In this activity, we will create and run a new SQL query to find out how many records (RNA transcripts) in your table have counts greater than 100.
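For example, a minimal query for this activity (assuming the schema you created defines a numeric count column; adjust the column name to match your schema) is SELECT COUNT(*) FROM FeatureCounts WHERE count > 100.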
Go to your Projects > your_project > Base > Query page.
Paste the above query into the NEW QUERY text box
Click the Run Query button to run your query
View your query results.
Save your query for future use by clicking the Save Query button. You will be asked to name the query before clicking the Create button.
Find the table you want to export on the Tables page under BASE. Go to the table details page by double-clicking the table you want to export.
Click on the Export As File icon and complete the required fields
Name: Name of the exported file
Data Format: A table can be exported in CSV or JSON format. The exported files can be compressed using GZIP, BZ2, DEFLATE or RAW_DEFLATE.
CSV Format: In addition to Comma, the file can be Tab, Pipe or Custom character delimited.
JSON Format: Selecting JSON format exports the table as a text file containing a JSON object for each entry in the table. This is the standard Snowflake behavior.
Export to single/multiple files: This option allows the export of a table as a single (large) file or multiple (smaller) files. If "Export to multiple files" is selected, a user can provide "Maximum file size (in bytes)" for exported files. The default value is 16,000,000 bytes but can be increased to accommodate larger files. The maximum file size supported is 5 GB.
Prerequisite - Launch a CWL or Nextflow pipeline to completion using the ICA GUI with the intended set of parameters.
Configure and authenticate the ICA command-line interface (CLI).
Obtain a list of your projects with their associated IDs:
Use the ID of the project from the previous step to enter the project context:
Find the pipeline you want to start from the CLI by obtaining a list of pipelines associated with your project:
Find the ID associated with your pipeline of interest.
To find the input files parameter, you can use a previously launched analysis with the projectanalyses input command.
Find the previous analyses launched along with their associated IDs:
List the analyses inputs by using the ID found in the previous step:
This will return the Input File Codes, as well as the file names and data IDs of the associated data used to previously launch the pipeline
Currently, this step for CWL requires the use of the ICA API to access the configuration settings of a project analysis that ran successfully. It is optional for Nextflow since the XML configuration file can be accessed in the ICA GUI.
Click the previous GUI run, and select the pipeline that was run. On the pipeline page, select the XML Configuration Tab to view the configuration settings.
In the "steps" section of the XML file, you will find various steps labeled with
and subsequent labels of parameters with a similar structure
The code value should be used to build the corresponding command-line parameters, e.g.
--parameters enable_map_align:true
Generate JWT Token from API Key or Basic login credentials
Instructions on how to get an API Key https://illumina.gitbook.io/ica/account-management/am-iam#api-keys
If your user has access to multiple domains, you will need to add a "?tenant=($domain)" to the request
Response to this request will provide a JWT token {"token":($token)}, use the value of the token in further requests
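As a minimal sketch (assuming the public ICA host used elsewhere in this documentation and an API key generated as described above), the token request could look like this in Python; the tenant parameter is only needed in the multi-domain case mentioned above.

    import requests

    API_KEY = "<your_generated_API_key>"
    response = requests.post(
        "https://ica.illumina.com/ica/rest/api/tokens",
        headers={"X-API-Key": API_KEY},
        # Only needed if your user has access to multiple domains:
        # params={"tenant": "<your_domain>"},
    )
    jwt_token = response.json()["token"]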
Use the API endpoint /api/projects/{projectId}/analyses/{analysisId}/configurations to find the configuration listing all of the required and optional parameters. The response JSON to this API will have the configuration items listed.
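A rough Python sketch of that call, using the JWT obtained above (the Bearer authorization header and the items key in the response are assumptions; check the API Reference for the exact response shape):

    import requests

    jwt_token = "<JWT from the previous step>"
    project_id = "<projectId>"
    analysis_id = "<analysisId>"

    response = requests.get(
        f"https://ica.illumina.com/ica/rest/api/projects/{project_id}"
        f"/analyses/{analysis_id}/configurations",
        headers={"Authorization": f"Bearer {jwt_token}"},
    )
    # Print each configuration item so the codes can be mapped to --parameters flags.
    for item in response.json().get("items", []):
        print(item)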
Structure of the final command
icav2 projectpipelines start cwl $(pipelineID) --user-reference <user-reference>, plus input options
Input Options - For the CLI, the entire input can be broken down into individual command-line arguments.
To launch the same analysis as in the GUI, use the same file IDs and parameters. If using new data, you can use the CLI command icav2 projectdata list to find new file IDs to launch a new instance of the pipeline.
Required information in Input - Input Data and Parameters
This option requires the use of --type input STRUCTURED, along with --input and --parameters.
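Putting these pieces together, a full invocation could look something like the sketch below. This is an illustration only: the IDs are placeholders, the input codes (read1, read2) are hypothetical, the inputCode:fileId format for --input is an assumption, and the exact flag spellings (for example, whether the structured input type is passed as --type input or --input-type) should be verified against icav2 projectpipelines start cwl --help.

    icav2 projectpipelines start cwl <pipelineID> \
        --user-reference my-cli-rerun \
        --type input STRUCTURED \
        --input read1:<fileID_1> \
        --input read2:<fileID_2> \
        --parameters enable_map_align:true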
Successful Response
Unsuccessful Response: Pipeline ID not formatted correctly
Check that the pipeline ID is correct based on icav2 projectpipelines list
File ID not found
Check that the file ID is correct based on icav2 projectdata list
Parameter not found
When using Nextflow to start runs, the input-type parameter is not used, but --project-id is required.
Structure of the final command: icav2 projectpipelines start nextflow $(pipelineID) --user-reference <user-reference>, plus input options
Response status can be used to determine if the pipeline was submitted successfully
Status options: REQUESTED, SUCCEEDED, FAILED, ABORTED
This tutorial demonstrates how to use the ICA Python library packaged with the JupyterLab image for Bench Workspaces.
The tutorial will show how authentication to the ICA API works and how to search, upload, download and delete data from a project into a Bench Workspace. The python code snippets are written for compatibility with a Jupyter Notebook.
Navigate to Bench > Workspaces and click Enable to enable workspaces. Select +New Workspace to create a new workspace. Fill in the required details and select JupyterLab for the Docker image. Click Save and Start to open the workspace. The following snippets of code can be pasted into the workspace you've created.
This snippet defines the required python modules for this tutorial:
This snippet shows how to authenticate using the following methods:
ICA Username & Password
ICA API Token
These snippets show how to manage data in a project. Operations shown are:
Create a Project Data API client instance
List all data in a project
Create a data element in a project
Upload a file to a data element in a project
Download a data element from a project
Search for matching data elements in a project
Delete matching data elements in a project
These snippets show how to get a connection to a base database and run an example query. Operations shown are:
Create a python jdbc connection
Create a table
Insert data into a table
Query the table
Delete the table
This snippet defines the required Python modules for this tutorial:
The platform provides Connectors to facilitate automation for operations on data (i.e., upload, download, linking). The connectors are helpful when you want to sync data between ICA and your local computer or link data between projects in ICA.
The ICA CLI upload/download proves beneficial when handling large files/folders, especially in situations where you're operating on a remote server by connecting from your local computer. You can use icav2 projects enter <project-name/id> to set the project context for the CLI to use for the commands when relevant. If the project context is not set, you can supply the additional parameter --project-id <project-id> to specify the project for the command.
Note: Because of how S3 manages storage, it doesn't have a concept of folders in the traditional sense. So, if you provide the "folder" ID of an empty "folder", you will not see anything downloaded.
In the example above, we're generating a partial file named 'tempFile.txt' within a project identified by the project ID '41d3643a-5fd2-4ae3-b7cf-b89b892228be', situated inside a folder with the folder ID 'fol.579eda846f1b4f6e2d1e08db91408069'. You can access project, file, or folder IDs either by logging into the ICA web interface or through the use of the ICA CLI.
The response will look like this:
Retrieve the data/file ID from the response (for instance: fil.b13c782a67e24d364e0f08db9f537987) and employ the following format for the Post request - /api/projects/{projectId}/data/{dataId}:createUploadUrl:
The response will look like this:
Use the URL from the response to upload a file (tempFile.txt) as follows:
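The three calls can also be scripted. The sketch below is an assumption-heavy illustration: the create-data endpoint, its body field names, and the response field names (data.id, url) are guesses based on the description above and should be checked against the API Reference; only the :createUploadUrl path is taken verbatim from this section.

    import requests

    ICA_HOST = "https://ica.illumina.com/ica/rest"
    headers = {"X-API-Key": "<your_generated_API_key>"}
    project_id = "41d3643a-5fd2-4ae3-b7cf-b89b892228be"
    folder_id = "fol.579eda846f1b4f6e2d1e08db91408069"

    # 1. Create the partial file record (endpoint and body field names are assumptions).
    create = requests.post(
        f"{ICA_HOST}/api/projects/{project_id}/data",
        headers=headers,
        json={"name": "tempFile.txt", "folderId": folder_id, "dataType": "FILE"},
    )
    data_id = create.json()["data"]["id"]  # e.g. fil.b13c782a67e24d364e0f08db9f537987

    # 2. Request an upload URL for that data ID (endpoint as documented above).
    upload = requests.post(
        f"{ICA_HOST}/api/projects/{project_id}/data/{data_id}:createUploadUrl",
        headers=headers,
    )
    upload_url = upload.json()["url"]  # field name assumed

    # 3. Upload the file contents to the returned pre-signed URL.
    with open("tempFile.txt", "rb") as fh:
        requests.put(upload_url, data=fh)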
If you are trying to upload data to the /cli-upload/ folder, you can get the temporary credentials to access the folder using icav2 projectdata temporarycredentials /cli-upload/. It will produce output with an accessKey, secretKey, and sessionToken that you will need to configure the AWS CLI to access this folder.
Copy the awsTempCredentials.accessKey, awsTempCredentials.secretKey and awsTempCredentials.sessionToken to build the credentials file ~/.aws/credentials. It should look something like this:
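For example, using the standard AWS CLI credentials file layout (the values are placeholders for the three fields returned by the temporarycredentials command):

    [default]
    aws_access_key_id = <awsTempCredentials.accessKey>
    aws_secret_access_key = <awsTempCredentials.secretKey>
    aws_session_token = <awsTempCredentials.sessionToken>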
The temporary credentials expire in 36 hours. If they expire before the copy is complete, you can generate new credentials and use the aws s3 sync command to resume the transfer from where it left off.
Following are a few AWS commands to demonstrate their use. The remote path in the commands below is constructed from the output of the temporarycredentials command in this format: s3://<awsTempCredentials.bucket>/<awsTempCredentials.objectPrefix>
You can also write scripts to monitor the progress of your copy operation and regenerate and refresh the temporary credentials before they expire.
You may also use Rclone for data transfer if you prefer. The steps to generate temporary credentials are the same as above. You can run rclone config to set the keys and tokens to configure rclone with the temporary credentials. You will need to select the advanced edit option when asked to enter the session key. After completing the config, your configuration file (~/.config/rclone/rclone.conf) should look like this:
For other operating systems, refer to OS specific documentation for FUSE driver installation.
Identify the project id by running the following command:
Provide the project id under "ID" column above to the mount command to mount the project data for the project.
Check the content of the mount.
WARNING: Do NOT use the cp -f command to copy or move data to a mounted location. This will result in data loss, as data in the destination location will be deleted.
You can unmount the project data using the 'unmount' command.
See the ICA documentation for details about the JupyterLab Docker image provided by Illumina.
The Snowflake Python API documentation can be found on the Snowflake documentation site.
Another option to upload data to ICA is via the API. This option is helpful when data needs to be transferred via automated scripts. You can use the following two endpoints to upload a file to ICA.
Post - with the following body, which will create a partial file at the desired location and return a dataId for the file to be uploaded. {projectId} is the project id for the destination project. You can find the projectId in your project's details page (Project > Details > URN > urn:ilmn:ica:project:projectId#MyProject).
Post - where dataId is the dataId from the response of the previous call. This call will generate the URL that you can use to upload the file.
Create data in the project by making the API call below. If you don't already have an API Key, refer to the instructions in the API Keys documentation for guidance on generating one.
ICA allows you to directly upload/download data from ICA using the AWS CLI. It is especially helpful when dealing with an unstable internet connection to upload or download a large amount of data. If the transfer gets interrupted midway, you can employ the sync command to resume the transfer from the point it was stopped.
To connect to ICA storage, you must first download and install the AWS CLI on your local system. You will need temporary credentials for the AWS CLI to access ICA storage. You can generate temporary credentials through the ICA CLI, which can be used to authenticate the AWS CLI against ICA. The temporary credentials can be obtained using the icav2 projectdata temporarycredentials command.
icav2 allows project data to be mounted on a local system. This feature is currently available on Linux and Mac systems only. Although not supported, users have successfully used Windows Subsystem for Linux (WSL) on Windows to use the icav2 projectdata mount command. Please refer to the Microsoft documentation for installing WSL.
icav2 (>=2.3.0) and a FUSE driver installed on the local system.
For Mac, refer to the OS-specific FUSE driver documentation (e.g., macFUSE).
A project created on ICA v2 with data in it. If you don't already have a project, please follow the instructions to create a project.
icav2 utilizes the FUSE driver to mount project data, providing both read and write capabilities. However, there are some limitations on the write capabilities that are enforced by the underlying AWS S3 storage. For more information, please refer to the AWS S3 documentation.
You can access the databases and tables within the Base module using Python from your local machine. Once retrieved as, for example, a pandas object, the data can be processed further. In this tutorial, we describe how to create a Python script which retrieves the data and visualizes it using the Dash framework. The script contains the following parts:
Importing dependencies and variables.
Function to fetch the data from Base table.
Creating and running the Dash app.
This part of the code imports the dependencies which have to be installed on your machine (possibly with pip). Furthermore, it imports the variables API_KEY and PROJECT_ID from the file named config.
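A minimal sketch of that file (assuming it is saved as config.py next to the script so that from config import API_KEY, PROJECT_ID works; both values are placeholders):

    # config.py -- keep this file out of version control
    API_KEY = "<your_generated_API_key>"
    PROJECT_ID = "<your_project_id>"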
We will be creating a function called fetch_data to obtain the data from Base table. It can be broken into several logically separated parts:
Retrieving the token to access the Base table, together with other variables, using the API.
Establishing the connection using the token.
SQL query itself. In this particular example, we are extracting values from two tables Demo_Ingesting_Metrics and BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL. The table Demo_Ingesting_Metrics contains various metrics from DRAGEN analyses (e.g. the number of bases with quality at least 30 Q30_BASES) and metadata in the column ica which needs to be flattened to access the value Execution_reference. Both tables are joined on this Execution_reference value.
Fetching the data using the connection and the SQL query.
Here is the corresponding snippet:
Once the data is fetched, it is visualized in an app. In this particular example, a scatter plot is presented with END_DATE as the x axis and the metric chosen by the user from the dropdown as the y axis.
Now we can create a single Python script called dashboard.py by concatenating the snippets and running it. The dashboard will be accessible in the browser on your machine.
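As a rough, self-contained sketch of the app portion only (a placeholder DataFrame stands in for the output of fetch_data, and the Q30_BASES column name follows the table description above; everything else is illustrative):

    import pandas as pd
    import plotly.express as px
    from dash import Dash, dcc, html, Input, Output

    # Placeholder for the DataFrame returned by fetch_data(); in the real script
    # this would come from the Snowflake query described above.
    df = pd.DataFrame({
        "END_DATE": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "Q30_BASES": [1.2e9, 1.4e9],
    })
    metrics = [c for c in df.columns if c != "END_DATE"]

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Dropdown(id="metric", options=metrics, value=metrics[0]),
        dcc.Graph(id="scatter"),
    ])

    @app.callback(Output("scatter", "figure"), Input("metric", "value"))
    def update_figure(metric):
        # Scatter plot with END_DATE on the x axis and the chosen metric on the y axis.
        return px.scatter(df, x="END_DATE", y=metric)

    if __name__ == "__main__":
        app.run(debug=True)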
Any operation from the ICA graphical user interface can also be performed with the API.
The following are some basic examples on how to use the API. These examples are based on using Python as programming language. For other languages, please see their native documentation on API usage.
An installed copy of Python. (https://www.python.org/)
The package installer for python (pip) (https://pip.pypa.io/)
The Python requests library installed (pip install requests)
One of the easiest authentication methods is by means of API keys. To generate an API key, refer to the Get Started section. This key is then used in your Python code to authenticate the API calls. It is best practice to regularly update your API keys.
API keys are valid for a single user, so any information you request is for that user to which the key belongs. For this reason, it is best practice to create a dedicated API user so you can manage the access rights for the API by managing those user rights.
There is a dedicated API Reference where you can enter your API key and try out the different API commands and get an overview of the available parameters.
The examples on the API Reference page use curl (Client URL) while Python uses the requests library. There are a number of online tools to automatically convert from curl to Python.
To get the curl command,
Look up the endpoint you want to use on the API reference page.
Select Try it out.
Enter the necessary parameters.
Select Execute.
Copy the resulting curl command.
Never paste your API authentication key into online tools when performing curl conversion as this poses a significant security risk.
In the most basic form, the curl command
curl my.curlcommand.com
becomes
You will see the following options in the curl commands on the API Reference page:
-H means header.
-X means the string is passed "as is" without interpretation.
A curl command using these options becomes the following in Python:
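For example, a hypothetical curl command using the placeholder address from above could be converted as follows:

    import requests

    # curl -X GET "https://my.curlcommand.com" -H "X-API-Key: <your_generated_API_key>"
    # becomes:
    response = requests.get(
        "https://my.curlcommand.com",
        headers={"X-API-Key": "<your_generated_API_key>"},
    )
    print(response.status_code)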
This is a straightforward request without parameters which can be used to verify your connection.
The API call is
response = requests.get('https://ica.illumina.com/ica/rest/api/eventcodes', headers={'X-API-Key': '<your_generated_API_key>'})
In this example, the API key is written directly into the API call, which means you must update all API calls when the key changes. A better practice is to define the headers with the API key once, in a variable, so it is easier to maintain. The full code then becomes:
The list of event codes is returned as a single line, which makes it difficult to read, so let's pretty-print the result.
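Putting both points together, a sketch of the full snippet with the headers defined once and the JSON response pretty-printed:

    import json
    import requests

    headers = {"X-API-Key": "<your_generated_API_key>"}

    response = requests.get(
        "https://ica.illumina.com/ica/rest/api/eventcodes",
        headers=headers,
    )
    # Pretty-print the JSON body so it is easier to read.
    print(json.dumps(response.json(), indent=4))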
Now that we are able to retrieve information with the API, we can use it for a more practical request like retrieving a list of projects. This API request can also take parameters.
First, we pass the request without parameters to retrieve all projects.
The easiest way to pass a parameter is by appending it to the API request. The following API request will list the projects with a filter on CAT as user tag.
response = requests.get('https://ica.illumina.com/ica/rest/api/projects?userTags=CAT', headers=headers)
If you only want entries that have both the tags CAT and WOLF, you would append them like this:
response = requests.get('https://ica.illumina.com/ica/rest/api/projects?userTags=CAT&userTags=WOLF', headers=headers)
To copy data, you need to know:
Your generated API key.
The dataId of the files and folders which you want to copy (their syntax is fil.hexadecimal_identifier and fol.hexadecimal_identifier). You can select a file or folder in the GUI to see its Id (Projects > your_project > Data > your_file > Data details > Id) or you can use the /api/projects/{projectId}/data endpoint.
The destination project to which you want to copy the data.
The destination folder within the destination project to which you want to copy the data (fol.hexadecimal_identifier).
What to do when the destination files or folders already exist (OVERWRITE, SKIP or RENAME).
The full code will then be as follows:
Now that we have done individual API requests, we can combine them and use the output of one request as input for the next request. When you want to run a pipeline, you need a number of input parameters. In order to obtain these parameters, you need to make a number of API calls first and use the returned results as part of your request to run the pipeline. In the examples below, we will build up the requests one by one so you can run them individually first to see how they work. These examples only follow the happy path to keep them as simple as possible. If you program them for a full project, remember to add error handling. You can also use the GUI to get all the parameters or write them down after performing the individual API calls in this section. Then, you can build your final API call with those values fixed.
This block must be added at the beginning of your code
Previously, we already requested a list of all projects; now we add a search parameter to look for a project called MyProject. (Replace MyProject with the name of the project you want to look for.)
Now that we have found our project by name, we need to get the unique project id, which we will use in the combined requests. To get the id, we add the following line to the end of the code above.
Syntax ['items'][0]['id'] means we look for the items list, 0 means we take the first entry (as we presume our filter was accurate enough to only return the correct result and we don't have duplicate project names) and id means we take the data from the id field. Similarly, you can build other expressions to give you the data you want to see, such as ['items'][0]['urn'] to get the urn or ['items'][0]['tags']['userTags'] to get the list of user tags.
Once we have the identifier we need, we add it to a variable which we will call Project_Identifier in our examples.
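A sketch of these steps (the name of the search query parameter is an assumption; check the GET /api/projects entry on the API Reference page for the exact parameter name):

    import requests

    headers = {"X-API-Key": "<your_generated_API_key>"}

    response = requests.get(
        "https://ica.illumina.com/ica/rest/api/projects",
        headers=headers,
        params={"search": "MyProject"},  # parameter name assumed
    )
    My_API_Data = response.json()
    Project_Identifier = My_API_Data["items"][0]["id"]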
Once we have the identifier of our project, we can fill it out in the request to list the pipelines which are part of our project.
This will give us all the available pipelines for that project. As we will only want to run a single pipeline, we can search for our pipeline, which in this example will be the basic_pipeline. Unfortunately, this API call has no direct search parameter, so when we get the list of pipelines, we will look for the id and store that in a variable which we will call Pipeline_Identifier in our examples as follows:
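Continuing from the previous snippet, a sketch of that lookup (the endpoint path, the wrapping of each item, and the code field used for matching are assumptions based on the description above; adjust to the API Reference as needed):

    response = requests.get(
        f"https://ica.illumina.com/ica/rest/api/projects/{Project_Identifier}/pipelines",
        headers=headers,
    )
    Pipeline_Identifier = None
    for item in response.json()["items"]:
        pipeline = item.get("pipeline", item)  # response wrapping may differ
        if pipeline.get("code") == "basic_pipeline":
            Pipeline_Identifier = pipeline["id"]
            break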
Once we know the project identifier and the pipeline identifier, we can create an API request to retrieve the list of input parameters which are needed for the pipeline. We will consider a simple pipeline which only needs a file as input. If your pipeline has more input parameters, you will need to set those as well.
Here we will look for the id of the extra small storage size. This is done with the 0 in the My_API_Data['items'][0]['id']
Now we will look for a file "testExample" which we want to use as input and store the file id.
Finally, we can run the analysis with parameters filled out.
Cohorts
Fixed an issue where users could not search the hierarchical disease concepts because of incorrect URL in the UI configuration.
General
The current tab (e.g. analysis details, analysis steps, pipeline details, pipeline XML config, ...) is now saved in the URL, making the back button bring the user back to the tab they were in
Users are now able to go from the analysis data details to their location in the project data view
Toggling the availability of the "Acceptance list" tab in the legal view by a tenant admin used to be possible in the "Restrictions of monitoring" tab when editing the bundle. It has been moved to the legal tab
Data Management
New data formats available:
TRANSCRIPT: *.quant.sf, *.quant.sf.gz
GENE: *.quant.genes.sf, *.quant.genes.sf.gz
JSON.gz are now recognized as JSON format
New endpoint to create files POST /api/projects/{projectId}/data:createFile
Endpoint POST /api/projects/{projectId}/data has been deprecated
The endpoint GET /api/projects/{projectId}/data/{dataId}/children now has more filters for more granular filtering
Users are now able to filter based on the owning project ID for the endpoint GET /api/projects/{projectId}/data
The links section in Bundle details and pipeline details now has proper URL validation and both fields are now required when adding links. In the case of editing an older links section of a bundle/pipeline, the user won't be able to save until the section is corrected
Flow
The cost of a single analysis is now exposed on its details page
Users can now abort analysis while being in the analysis detail view
'.command.*' files from Nextflow WorkDir are now copied into ICA logs
Base
Expanded the lifespan of Base OAuth token to 12h
Bench
Removed display of the current user using a bench workspace
Experimental Features
Streamable inputs for JSON-based input forms. Adding "streamable":true to an input field of type "data" makes it a streamable input.
General
Fixed an issue which would overgenerate event calls when an analysis would run into diskfull alert
Improved API error handling so that being unable to reach ICA storage will now result in error code 500 instead of error code 400
Added a full name field for users in various grids (Bench activity, Bundle share, ...) to replace the separate first and last name fields
In the event log, the event ICA_EXEC_028 is now shown as INFO; it was previously displayed as ERROR, which was not correct
Data Management
Fixed an issue which would result in a null-pointer error when trying to open the details of a folder which was in the process of being deleted
Fixed an issue with bundle invites, now making it clear that you can only re-invite someone to that bundle if the previously rejected invites are removed
Various improvements to hardening data linking
Fixed an issue where the folder copy job would throw an Access Denied error when copying a file with an underscore (_) in its path
Fixed an issue that would produce a false error notification when changing the format from the data details window
Fixed an issue where an out of order event for Folder Deleting and Deleted would occur in rare scenarios
Fixed an issue regarding path too long error for Folder copy/Move operations for Managed bucket src and destination
Flow
Improved API file handling for post-processing when downloading the results from a successful analysis, which could previously result in the analysis being incorrectly reported as failed
Fixed an issue which resulted in a null-pointer error when starting an XML based CWL pipeline with an input.json
Fixed an issue which caused user references with slashes to prevent errors in failed runs from being displayed
Fixed an issue where the value 0 was not accepted in pipeline's inputForm.json for fields of type number
Fixed an issue where users could not retrieve pipeline_runner.0 logs file while a pipeline is running
List fields in filter options are now saved if closing and reopening the filter panel
Fixed an issue where the start time of an analysis's step would be intermittently reported wrongly
Fixed an issue where retrieving outputs of analysis through API was not consistent between analysis with advanced output-mapping or without
Improvements to the handling of large file uploads to prevent token expiry from blocking uploads
Base
Fixed an issue where a shared database would not be visible in project Base; this was fixed in the newer Snowflake version 9.3
Bench
Removed the rollback failed operations function on docker images as it had little to no benefit for end-users and frequently caused confusion
Fixed issue where users without proper permissions could create a workspace
Cohorts
Fixed issues where users doing large scale inputs of data received timeouts from the ICA API for file retrieval
Fixed issue with large OMOP data sets causing out of memory issues on input
Fixed issue where the 'Search Attributes' box in the 'Create Cohort' was not scrolling after typing a partial string.
Fixed issue with line-up of the exon values under exon track.
Fixed issue where subject attribute search box overlapped with other items when web browser zoom used.
Fixed issue where single subject view displayed concept codes and now shows concept names for diseases, drugs, procedures, and measurements.
Flow
Added retries for analysis process infrastructure provisioning to mitigate intermittent (~1%) CWL analysis failures. This impacts analysis steps failing with error "OCI runtime create failed" in logs.
Features and Enhancements
General
The End User License Agreement has been updated
New API endpoints for Docker Images management:
GET /api/dockerImages
GET /api/dockerImages/{imageId}
POST /api/dockerImages:createExternal
POST /api/dockerImages:createInternal
POST /api/dockerImages/{imageId}:addRegions
POST /api/dockerImages/{imageId}:removeRegions
Split up the CWL endpoint (POST /api/projects/{projectId}/analysis:cwl) into two:
CWL analysis with a JSON input (POST /api/projects/{projectId}/analysis:cwlWithJsonInput)
CWL analysis with a structured input (POST /api/projects/{projectId}/analysis:cwlWithStructuredInput)
Data Management
Next to using the Teams page to invite other tenants to use your assets, a dedicated bundle-sharing feature is now available. This allows you to share assets while also shielding sensitive information from other users, such as who has access to these assets
Improved visibility of ongoing move and copy data actions in the UI
Users can now add/remove bundles in an externally managed project. It will not be possible to link a restricted Bundle to a project containing read-only, externally managed data
Flow
JSON based input form now has a built-in check to make sure a tree does not have any cyclical dependencies
Added commands for creation and start of CWL JSON pipelines in the CLI
Users can now input external data into JSON based input forms from the API
Bench
Bench workspaces can be used in externally managed projects
Cohorts
Users can now filter needles by customizable PrimateAI Score thresholds, affecting both plot and table variants, with persistence across gene views
The 'Single Subject View' now displays a summary of measurements (without values), with a link to the 'Timeline View' for detailed results under the section 'Measurements and Laboratory Values Available'
Fixed Issues
General
Fixed an issue which caused authentication failures when using a direct link
Actions which are not allowed on externally-managed projects are now greyed-out instead of presenting an error when attempting to use them
Improved handling of regions for Docker images so that at least one region must remain. Previously, removing all regions would result in deleting the Docker image
Improved filtering out Docker images which are not relevant to the current user
Tertiary modules are no longer visible in externally-managed projects as they had no functional purpose there
Fixed an issue where adding public domain users to multiple collaborative workgroups would result in inconsistent instrument integration results
Added verification on the filter expressions of notification subscriptions
Fixed an issue where generating a cURL command with empty field values on the Swagger page resulted in invalid commands
Added information in the API swagger page that the GET /api/projects/{projectId}/data endpoint can not retrieve the list of files from a linked folder. To get this list, use parentfolderid instead of parentfolderpath
For consistency reasons, UUID has been renamed to ID in the GUI
The bundle administrator will now see all data present in the bundle, including all versions with older versions in a different color
Data Management
Removed deprecated cloud connector from Activity/Data transfers option
Removed the erroneous 'Import' option from the direction filter which was present in Activity/Data transfers
Fixed an issue where entering multiple Download rules for a Service connector would result in not setting the correct unique sequence numbers
Improved the error message when erroneously trying to link data from an externally-managed subject to a sample. This is not allowed because data can only be linked to a single sample
Fixed an issue where filtering on file formats was not correctly applied when selecting files and folders for downloads with the service connector
Improved the download connector to fix Chrome compatibility issues
Fixed an issue where it was possible to update linked files if you had access to both the original file and the linked file
Fixed an issue where samples from externally-managed projects were not correctly linked to analyses
Flow
Fixed a JSON validation error when attempting to have more than one default value for a field configured as single value which would result in index out of bounds error
Fixed an issue where numerical values with a scientific exponent would not be correctly accepted
Improved the API error validation for usage of duplicate group id fields
Improved error handling when starting analysis via API with an incorrect DATA-ID in the request body
Improved handling of incorrect field types for JSON-based input forms
Improved error handling when trying to use externally-managed data as reference data
Removed the superfluous "save as" button from the create pipelines screen
Fixed an issue where refreshing the analysis page would result in an error when more than 1 log file was opened
Upon clicking "start run" to launch a pipeline, ICA now redirects to the "Runs" view
Fixed an issue where the minimum and maximum numbers of high values were incorrectly rounded for JSON input forms
Fixed an issue where the user could pass a value with the "values" parameter instead of "dataValues" for the data field type
Fixed an issue which caused the "dataValues" parameter to be valid for the textbox field type instead of "values"
Improved timeout handling for autolaunch workflow
Fixed auto-launched TSO500 pipelines using the StartsFromFastq=false setting to direct analysis outputs to the ilmn-analyses folder within the ICA project data
Added JSON validation to ensure only a single radio button can be selected at the same time as part of a radio-type list
Removed the Simulate button from the View mode pipeline detail screen
The proprietary option can now be set via the CLI on create pipeline commands
Added a validation to prevent pipeline input forms from having the same value for 'value' and 'parent'
Bench
Fixed an issue which caused bench workspaces to have incorrect last modified timestamps that were over 2000 years ago. They now will use the correct last updated timestamp
Adding or removing regions to bench images is now possible
Improvements to how workspaces handle deletion
Cohorts
Fixed issue where the error message for invalid disease IDs did not disappear after selecting the correct ontology, and filter chips were incorrectly created as 'UNDEFINED'
Fixed issue where the search functionality in the ingestion file picker was not working correctly in production, causing a long delay and no files to display after entering a filename or folder name
Fixed issue where the Clinvar significance track was not resetting properly, causing the resized track and pointer to not return to the original position, with triangle data points displaying empty whitespace
Fixed issue where the 'PARTIAL' status for HPO filter chips was incorrectly removed when multiple chips were selected
Fixed issue where the pagination on the Variant List Needle Plot incorrectly displayed 741 items across 75 pages, causing a discrepancy with the actual number of displayed variants
The 'Search Attributes' box in the 'Create Cohort' page now properly scrolls and filters results when typing substrings, displaying matching results as the user types
Fixed issue where the search spinner continued loading after the search results were displayed in the Import Jobs table
Fixed issue where the 'stop_lost' consequence in Needleplot is corrected to 'Frameshift, Stop lost,' and the legend updated to 'Stop gained|lost.' The 'Stop gained' filter now excludes 'Stop lost' variants when the 'Display only variants shown in the plot above' toggle is on
Fixed issue where intermittent 500 error codes occurred on USE1 Prod when running Needleplot/VariantList queries with the full AGD/DAC dataset (e.g., LAMA1 gene query)
Features and Enhancements
Flow
Analysis logs (task stdout/stderr files) are now written to a folder named ‘ica_logs’ within the analysis output folder
Default scratch disk size attached to analysis steps reduced from 2TB to 0B to improve cost and performance of analyses. Pipelines created before ICA v2.21.0 will not be impacted
Notifications
Notifications can now be updated and deleted in externally managed Projects
API
Clarified on the Swagger page which sorting options apply to which paging strategy (cursor-based versus offset-based). Changed the default sorting behavior so that:
When no paging strategy is specified and no sort is requested, then cursor-based paging is default
When no paging strategy is specified and sort is requested, then offset-based paging is default
Cohorts
Procedure Search Box: Users can now access additional UI functionalities for Procedures
Users can now access Procedure codes from OMOP
Improved handling of drug codes across all reports, excluding Survival comparison
Ingestion
Users now have enhanced job warning log and API status improvements
Users now require download permissions to facilitate the data ingestion process
Fetch Molecular Files: Improved import – Users can now input a directory path and select sample files individually
Variant Type Summary: Users can now access a new variants tab that summarizes Variant type statistics per gene
Added sorting and filtering capabilities to report tables, such as variants observed in genes
Users can now view sample barcodes, replacing internal auto-increment sample IDs in the Structural Variants table within the Genes tab
“Search subjects” functionality improved with flexible filtering logic that now supports partial matches against a single input string
Fixed Issues
Data Management
Fixed an issue with data copy via the CLI where the file was being copied to a subfolder of the intended location instead of the specified folder
Resolved an issue where browser upload hangs intermittently when creating data
Fixed an issue where the delete popup does not always disappear when deleting data
Fixed an issue where GetFolder API call returns 404 error if the Create and Get operations are performed 100ms apart
Fixed an issue where file copy would fail if the file was located at the root level of User’s S3 storage bucket
Fixed an issue causing data linked from externally managed projects to be incorrectly excluded from the list project data API response
Fixed an issue where User cannot use data URNs to identify the destination folder when interacting with copy data API endpoints
Bundles: Fixed an issue where clicking the back button before saving a new bundle leads to inconsistencies
Flow
Fixed an issue where pipeline documentation is not scrollable when launching pipeline
Fixed an issue with logfiles of a task not being available for streaming while the task is still running
Fixed an issue where using the 're-run' button from the analysis page reverts the storage size selection to default
Fixed an inconsistency where the following two endpoints would show different analysis statuses:
GET /api/projects/{projectId}/analyses
GET /api/projects/{projectId}/analyses/{analysisId}
Improved performance issues with UI loading data records when selecting inputs for analysis
Fixed a caching issue which resulted in delays when running pipelines
Fixed an issue where back button for analysis or pipeline details does not always direct Users back to analysis or pipelines view, respectively
Fixed an issue where system performance is degraded when large batches (e.g., 1,000) of data are added as input to Analyses via the graphical UI. It is recommended to start Analyses with large numbers of input files via API
Base
Fixed an issue where enabling Base from a Base view other than Base Tables returned a warning message
Fixed an issue where Base access was not enabled when a bundle with tables is added to a project without Base (Base is automatically enabled so users can see the bundle's tables). However, access to the bundle's tables is revoked upon the deletion of Base, and was not granted again once Base was re-enabled
Fixed an issue where a Base job to load data into a table never finished because the file was deleted after the job started and before it finished. Now the job will end up in a Failed state
Cohorts
Fixed an issue where needle plot filtered out data points reappear when zooming in the exon when a filter is in place
Fixed an issue where users from a different tenant who accept a project share may encounter a failure at the final step of the data ingestion process
Fixed an issue where users can encounter intermittent errors when browsing and typing for a gene
Fixed an issue where the UI hangs on large genes and returns a 502 error
Fixed Issues
Data Management
Fixed an issue where multiple folder copy jobs with the same destination may get stuck In Progress
Fixed an intermittent issue where tags on the target folder for a batch data update call are not set, but are set for all child data
Flow
Fixed an issue causing intermittent pipeline failures due to an infrastructure error
Features and Enhancements
General
Navigation: If multiple regions are enabled for the same tenant, the region will be indicated in the waffle menu
Logging: Data transfers of BaseSpace Sequence Hub projects with data stored in ICA will be traced in ICA logs
Cohorts
Disease Search Box: Added support for specifying subjects by age of onset of disease(s)
Drug Search Box: Added a new query builder box for Drugs
Ingestion: Support for Drug, drug route, etc. attached to subjects
Cohorts building: Users can build cohorts by specifying drugs, drug route, etc.
Ingestion
Combine different variant types during ingestion (small variants, cnv, sv)
Cohorts supports Illumina Pisces variant caller for hg19 VCFs
Fixed Issues
General
Fixed an issue where the graphical UI hangs with a spinning wheel when saving or executing a command
Fixed an issue where rich text editor for Documentation tab on Pipelines, Tools, Projects and Bundles does not populate with correct styles in edit mode
Data Management
Fixed an issue where multiple clicks on create data in Project API endpoint resulted in multiple requests
Fixed an issue where the secondary data selection screen could not be resized
A spinning wheel icon with ‘copying’ status is displayed at the folder level in the target Project when a folder is being copied. This applies to the actual folder itself and not for folders higher up in the hierarchy
Fixed an issue where API to retrieve a project data update batch is failing with 500 error when either the Technical or the User tags are updated during the batch update request
Fixed an issue where linking jobs fail to complete if other linking jobs are running
Improved performance for data transfer to support BaseSpace Sequence Hub Run transfers
Fixed an issue causing some folder copy jobs to remain in "Partially Succeeded" status despite being completed successfully
Bundles: Fixed an issue where the URL and Region where a Docker image is available were not displayed for a Docker image Tool shared via an entitled Bundle
Fixed an issue where the folder copy job was getting stuck copying large amounts of big files
Fixed an issue where the folder counts were not matching expected counts after Data linking
Fixed an issue where delete data popup would occasionally not disappear after deleting data.
Fixed an issue with data copy where referencing data from another region would not result in immediate failure
Fixed issue where uploading a folder using the CLI was not working
Fixed an issue where a Docker image shared via an entitled Bundle can be added to another region
Workflows
Fixed an issue where workflow does not fail if BCL Convert fails for a BCL Convert-only run
Flow
Improved performance when batches of data up to 1000 are added as input to an Analysis
Nextflow engine will return exit code 55 if the pipeline runner task is preempted
Fixed an issue where log files cannot be opened for any steps in an analysis while the analysis is in progress
Fixed an issue with concurrent updates on analysis
Fixed an issue where unknown data inputs in the XML of an analysis are not being ignored
The warning, close, and machine profile icons for Tools can now be seen in the graphical CWL pipeline editor
Fixed an issue where user cannot expand analysis output folder if user permissions change after starting analysis. Now, if a user has the correct permissions to start an analysis, that analysis should be able to finish correctly no matter the permissions at the time it succeeds
Base
Fixed an issue where switching back from a template to Empty Table did not clear the fields
Data linked from an externally managed project can be added to Base Tables
Fixed an issue in the graphical UI where schema definition does not scroll correctly when many columns are defined
Features and Enhancements
Data Management/API
Added a new endpoint available to change project owner
POST /api/projects/{projectId}:changeOwner with body { "newOwnerId": "<userId>" }
Added a new endpoint to copy data from one project to another:
/api/projects/{projectId}/projectDataCopyBatch
Data Management/CLI
Added the ability to copy files and folders between projects in the UI and CLI. This includes support for copying data from projects with ICA-managed storage (default) to projects with S3-configured storage.
Flow/API
When starting an analysis via the API, you can specify the input files based on HTTP(s). When your analysis is done, you will see the URL corresponding to the inputs in the UI, but you will not be able to start an analysis from the UI using this URL
Added two new endpoints for workflow sessions:
Get /api/projects/{projectId}/workflowSessions
Get /api/projects/{projectId}/workflowSessions/{workflowSessionId}/inputs
Added a new endpoint to retrieve configurations from a workflow session
Flow/CLI
Duplicate analyses submitted via the CLI will be avoided
Flow
Removed the ability to start analyses from data and sample views in the UI where a single input is selected to start analyses in bulk
Flow/Autolaunch
ICA Workflow Sessions and Orchestrated Analyses (launched by the workflow session) now save outputs in an organized folder structure: /ilmn-analysis/<name_used_to_create_sequencer_run_output_folder>
Base
The Base module has a new feature called ‘Data Catalogue’. This allows you to add usage data from your tenant/project if that data is available for you.
Data Catalogue views will be available and can be used in Base to query on
You will be able to preview and query Data Catalogue views through Base Tables and Query screens
The Data Catalogue will always be up to date with the available views for your tenant/project
Data Catalogue views cannot be shared through a Bundle
Data Catalogue views will also be available to team members that were added after the view was added
Data Catalogue views can be removed from the Base tables and corresponding project
By removing Base from a project, the Data Catalogue will also be removed from that project
Cohorts: Disease Search box
Cohorts now includes a disease search box to search for disease concepts. This replaces the disease concept tree explorer
Disease search box located under a Disease tab in main Query builder
Search box allows for a copy/paste action of codes to be processed as separate query elements. Currently, the feature is limited to a complete valid list
Each disease entered into the search box is displayed as a separate query item and can be set to include or exclude.
Diseases in search box can be used with boolean logic in cohort creation
Search box allows for an auto-complete of diagnosis concepts and identifiers
The disease filter is included in the cohort query summary on cohort page
Fixed Issues
Data Management
Data copy between ICA-managed projects and S3 storage configured projects is supported
Fixed an issue where storage configurations matching ICA-managed buckets would cause volume records to get associated with the wrong storage configuration in the system
API
The endpoint GET /api/projects/{ProjectID}/samples/{SampleID} now correctly returns all of the project's own samples and linked samples
Improved handling of bulk update via API when concurrent deletion of file has occurred
CLI
Fixed an issue where projectdata update tags would not update the tags
Fixed an issue to support adding the server-url as a parameter instead of having the config set
Flow
Fixed an issue where a failure to send a notification resulted in a failed workflow
Fixed an issue where one workflow session may override another when both are executed at the same time
Base
Fixed an issue where query download in JSON format returns an error
Added a message in the UI when a query takes longer than 30 seconds to inform the user that the query is ongoing and can be monitored in the Activity view
Added a section describing the Data Catalogue functionality
Bench
Fixed an issue where resizing the workspace to current size would prevent users from resizing for the next 6 hours
Cohorts
Fixed an issue where Gene Expression table does not display with TCGA data or for tenants with a hyphen (e.g., ‘genome-group’)
Fixed an issue where user had no way to delete a cohort comparison from a deleted cohort
Fixed an issue in the UI where multi-cohort needle plot tracks are overlapping
Fixed an issue causing failures during the annotation step with ‘CNV’ data type when selecting ‘GB=hg19’ and ‘CNV data’ for liftover; also observed with ‘SM data’ and ‘hg38’ without liftover (in APS1 and CAC1 regions) due to a ‘404 Not Found’ error.
Fixed Issue
Fixed an issue uploading folders via the CLI
Fixed Issue
Fixed an issue causing CWL pipelines using Docker images that do not contain bash shell executable to fail.
Fixed Issue
Fixed an issue leading to intermittent system instability.
Fixed Issue
Cohorts
Fixed an issue where the GTEx plot was not available for tenants with a hyphen (e.g. ilmn-demo).
Features and Enhancements
General
Versioning: The ICA version can now be found under your user when you select "About"
Versioning/API: It is possible to retrieve system information about ICA, such as the current version, through GET /api/systeminfo
Logging: When an action is initiated by another application, such as BaseSpace Sequence Hub, it will be traced as well in the ICA logs
Data Management
New API endpoints are available for:
Creation of a data update in bulk: POST /api/projects/{projectId}/dataUpdateBatch
A list of data updates for a certain project: GET /api/projects/{projectId}/dataUpdateBatch/{batchId}
A list of items from the batch update: GET /api/projects/{projectId}/dataUpdateBatch/{batchId}/items
A specific item from the batch update: GET /api/projects/{projectId}/dataUpdateBatch/{batchId}/items/{itemId}
Note: Batch updates include tags, format, date to be archived and date to be deleted
Data Management/API
The sequencing run information can be retrieved through its Id by using the API endpoint GET/api/sequencingRuns/{sequencingRunId}
Flow:
Auto launch now supports BCL Convert v3.10.9 pipeline and both TruSight Oncology 500 v2 pipelines (from FASTQs)
Removed "fpga-small" from available compute types. Pipelines using "fpga-small" will use the "fpga-medium"-equivalent compute specifications instead
Analyses launched/tracked by BaseSpace Sequence Hub contain relevant BaseSpace information in analysis details view
Flow/API
getPipelineParameters API returns parameter type in response
Added endpoints to retrieve and update a project pipeline definition
New API endpoint available to request the analyses in which a sample is being used
When leaving activationCodeDetailId empty when starting an analysis, the best match activation code will be used
Flow/API/CLI
Include "mountPaths" field in response for API and CLI command to retrieve analysis inputs
API
Two new API endpoints added to accept Terms and Conditions on a bundle:
GET /api/bundles/{bundleId}/termsOfUse/userAcceptance/currentUser Returns the time at which you, the current user, accepted the Terms & Conditions.
POST /api/bundles/{bundleId}/termsOfUse:accept
Add temporary credentials duration to API documentation
Notifications
List of events to which you can subscribe contains new ICA notification containing analyses updates
Bench
A new Bench permission is being introduced: Administrator. This permission allows users to manage existing workspaces and create new workspaces
The Bench Administrator role allows you to create new Bench workspaces with any permissions, even if you as a Bench Administrator do not have those permissions. In that case, you can create and modify the workspace, but you cannot enter it. Modifying is only possible when the workspace is stopped
As a Bench Contributor you are no longer allowed to delete a Bench Workspace; this requires the Bench Administrator role.
Cohorts
Users can now ingest raw DRAGEN bulk RNAseq results for genes and transcripts (TPM), with the option to precompute differential expression during ingestion
Added support for running multiple DEseq2 analyses in the ingestion workflow through bulk processing based on sample size and specific requirements
In multiple needle plot view, individual needle plots can now be collapsed and expanded
Pop-outs for needle plot variants now contain additional links to external resources, such as UCSC
For a given cohort, display a distribution of raw expression values (TPM per gene) for selected attributes
Using Cohorts maintains the session between core ICA and the Cohorts iFrame to prevent unwanted timeouts
Cohorts displays structural variants that include or overlap with a gene of interest
Fixed Issues
General
Collaboration: Fixed an issue where a user is presented with a blank screen when responding to a project invitation
Data Management/API
Improved error handling for API endpoint: DELETE /api/bundles/{bundleId}/samples/{sampleId}
Fixed an issue where the API endpoint GET /api/samples erroneously returned a 500
API endpoint GET/api/projects/{projectId}/analyses now returns the correct list when filtering on UserTags whereas it previously returned too many
Improved retry mechanism for API endpoint to create folderuploadsession
Data Management/CLI
When an upload of a folder/file is done through the CLI, it returns the information and ID of the folder/file
Data Management
CreatorId is now present on all data, including subfolders
Improved external linking to data inside ICA using deep linking
Improved error handling when creating folders with invalid characters.
Fixed an inconsistency for URN formats on output files from Analyses. This fix will apply only for analyses that are completed starting from ICAv2.18.0
Improved resilience in situations of concurrent linking and unlinking of files and folders from projects
It is only possible to delete a storage configuration if all projects that are using this storage configuration have been hidden and are not active projects anymore
Improved accuracy of the displayed project data size. Prior cost calculations were accurate, but the project data size visualization included technical background data
Fixed an issue where there is a discrepancy in number of configurations between Storage->Configurations and Configurations-> Genomics.Byob.Storage Configuration view
Flow/API
Improved error handling when invalid project-id is used in API endpoint GET /api/projects/{projectId}/pipelines
Fixed an issue where, when an Analysis completed with the error "incomplete folder session", the outputs of the Analysis were not always completely listed in the data listing APIs
Updated ICA Swagger Project > createProject to correctly state that the analysis priority must be in uppercase
Flow
When a spot instance is configured, but revoked by AWS, the pipeline will fail and exit code 55 is returned
Fix to return meaningful error message when instrument run ID is missing from Run Completion event during an auto launched analysis
Improved parallel processing of the same analysis multiple times
Base
Improved error handling when creating queries which use two or more fields with the same name. The error message now reads "Query contains duplicate column names. Please use column alias in the query"
Fixed an issue where queries on tables with many entries fail with NullPointerException
Bench
Clarified that changes to Bench workspace size only take effect after a restart
Cohorts
Fixed issue where counts of subjects are hidden behind attribute names
Fixed issue where the state of checked files are not retained when selecting molecular files that are in multiple nested folders
Fixed issue where projects that contain files from linked bundles cause a time out, resulting in users not being able to select files for ingestion
Fixed an issue where the 'Import Jobs' page loaded within the Data Sets frame, depending on where the import was initiated
Fixed an issue in the Correlation plot where x-axis counts were hidden under attribute names
Fixed an issue where users were previously incorrectly signed out of their active sessions
Fixed Issues
Fixed an issue causing analyses requesting FPGA compute resources to experience long wait times (>24h) or not be scheduled
Features and Enhancements
Data Management
Performance improvements for data link and unlink operations – Larger and more complex folders can now be linked in the graphical UI, and progress can be monitored with a new visual indication under Activity > Batch Jobs
Notifications
Notifications are now available for batch job changes
Flow
Increased the allowed Docker image size from 10GB to >20GB
CWL: Added support for JavaScript expressions in “ResourceRequirements” fields (i.e., type, size, tier, etc.) in CWL Pipeline definitions
Flow/API
Added support for using Pipeline APIs to query Pipelines included in Entitled Bundles (i.e., to retrieve input parameters)
Added support for providing S3 URLs as Pipeline data inputs when launching via the API (using storage credentials)
Added support for specifying multi-value input parameters in a Pipeline launch command
Bench
Project and Tenant Administrators are now allowed to stop running Workspaces
Cohorts
Enhanced ingestion workflow to ingest RNAseq raw data from DRAGEN output into backend Snowflake database
Added support for running multiple DEseq2 analyses in the ingestion workflow through bulk processing based on sample size and specific requirements
Multi-Cohort Marker Frequency - Added Multi-Cohort Marker Frequency tab allowing users to compare expression data across up to four Cohorts at the gene level
Multi-Cohort Marker Frequency includes a pairwise p-value heat map
Multi-Cohort Marker Frequency - Includes frequencies for Somatic and Copy Number Variants
Tab added for a multi-cohort marker frequency analysis in cohort comparisons
Multi-Cohort Needle Plot - Added new tab in the Comparison view with vertically aligned needle plots per cohort for a specified gene, allowing collapsible and expandable individual needle plots
Additional filter logic added to multi-cohort needle plot
Improved DRAGEN data type determination during ingestion allowing for multiple variant type ingestion
Enhanced list of observed variants with grouped phenotypes and individual counts, including a column for total sample count; tooltips/pop-outs provide extended information
Updates to needle plot link outs
Improved the Comparison feature by optimizing API calls to handle subjects with multiple attributes, ensuring successful loading of the page and enabling API invocation only when the user selects or expands a section
Removed unused columns (genotype, mrna_feature_id, allele1, allele2, ref_allele, start_pos, stop_pos, snp_id) from annotated_somatic_mutations table in backend database
Refactored shared functionality for picking consequence type to reduce code duplication in PheWAS-plot and GWAS-plot components
Invalid comparisons on the Comparisons page are now grayed out and disabled. This improvement prevents the selection of invalid options
Automatic retry of import jobs when there are failures accessing data from ICA API
Fixed Issues
General
Navigation: Removed breadcrumb indication in the graphical UI
Data Management
The content of hidden Projects can now be displayed
Fixed the TimeModified timestamp on files
Bundles: Resolved issues when linking a large number of files within a folder to a Bundle
Flow
Single values are now passed as a list when starting an Analysis
Pipelines will succeed if the input and output formats specified on the pipeline level match at the Tool level
Fixed an issue causing Analysis failures due to intermittent AWS S3 network errors when downloading input data
CWL: Improved performance on output processing after a CWL Pipeline Analysis completes
Flow/UI: Mount path details for Analysis input files are now visible
Flow/UI: Improved usability when starting an Analysis by filtering entitlement options based on selected inputs and available entitlements
Flow/API
List of Analyses can now be retrieved via the API based on filters for UserReference and UserTags
Base
Fixed an issue where the Scheduler continues to retry uploading files which cannot be loaded
Bench
Resolved an issue when attempting to access Workspaces with multi-factor authentication (MFA) enabled at the Tenant-level
API
Improved error messaging for POST /api/projects/{projectId}/data/{dataId}:scheduleDownload
Cohorts
Fixed issue where the Correlation bubble plot intermittently did not show for any project
Fixed issue where importing Germline/hg19 test file did not load variants for a specific gene in the Needle plot due to missing entries in the Snowflake table
Fixed a bug causing an HTTP 400 error while loading the Cohort for the second time due to the UI passing "undefined" as variantGroup, which failed to convert to the VariantGroup Enum type
Fixed issue where the scale (y-axis) of the needle plot changed even if the sample count / gnomAD frequency value was not accepted
Fixed an issue where no data was generated in the Base Tables after a successful import job in Canada - Central Region (CAC1)
Fixed issue where long chart axis labels overlap with tick marks on graph
Features and Enhancements
General
Navigation: Updated URLs for Correlation Engine and Emedgene in the waffle menu
Authentication: Using POST /api/tokens:refresh for refreshing the JWT is not possible if it has been created using an API-key.
Authentication: Improved error handling when there is an issue reaching the authentication server
Authentication: Improved usability of "Create OAuth access token" screen
Data Management
You can now select 'CYTOBAND' as format after file upload
Added support for selecting the root folder (of the S3 bucket) for Projects with user-managed storage
Added support for creating an AWS Storage Configuration with an S3 bucket with Versioning enabled
Auto-launch
Added technical tags for upstream BaseSpace Run information to auto-launched analyses
Added support for multiple versions of BCL Convert for auto-launched analyses
Flow
Added support for '/' as separator in CWL ResourceRequirements when specifying Compute Type
Flow/API
The API to retrieve analysis steps now includes exit code for completed steps
Bench
Bench Workspaces (Open or Restricted) always allow access to Project Data within the Workspace
Restricted Bench workspaces have limited internet access through whitelisted URLs that are checked before entry
Bench Workspaces can be created as Open or Restricted. Restricted workspaces do not have access to the internet except for user-entered whitelisted URLs
Fixed Issues
Data Management
Upload for file names including spaces is now consistent for connector and browser upload. We still advise against using spaces in file names in general
Fixed search functionality in Activity > Data Transfers screen
Improved performance on opening samples
Fixed an issue where reference data in download tab initiates an unexpected download
Fixed intermittent issue where the Storage configuration within a Project can go into Error status and can block users from creating records such as folders and files
Service Connector: Improved error message for DELETE/api/connectors/{connectorId}/downloadRules/{downloadRuleId}
Data Management/API
Improved error handling for API endpoints: DELETE /api/projects/{projectId}/bundles/{bundleId} and POST /api/projects/{projectId}/bundles/{bundleId}
Improved error handling for POST/api/projects/{projectId}/base:ConnectionDetails
Bundles
Fixed an issue where the Table view in Bundles is not available when linking to a new Bundle version
Fixed an issue where linking/unlinking a Bundle with Base Tables could result in errors
Bundles/API
Improved error handling for DELETE/api/bundles/{bundleId}/tools/{toolId} and POST/api/bundles/{bundleId}/tools/{toolId}
Improved error message for POST/api/bundles/{bundleId}/samples/{sampleId}
Notifications/API
Custom subscriptions with empty filter expressions will not fail when retrieving them via the API
Improved error handling for POST/api/projects/{projectId}/notificationSubscriptions
Improved notification for Pipeline success events
Flow
When the input for a pipeline is too large, ICA will fail the Analysis and will not retry
Fixed issue where analysis list does not search-filter by ID correctly
Improved error handling when issues occur with provisioning resources
When retry succeeds in a Nextflow pipeline, exit code is now '0' instead of '143'
Flow/API
Fixed an issue causing API error when attempting to launch an Analysis with 50,000 input files
Improved pipeline error code for GET/api/projects/{projectId}/pipelines/{pipelineId} when already unlinked pipeline Id is used for API call
Fixed an issue where Analyses could not be retrieved via API when the Pipeline contained reference data and originated from a different tenant
Fixed filtering analyses on analysisId. Filtering happens via exact match, so part of the Id won't work
Bench/CLI
Fixed issue where the latest CLI version was not available in Bench workspace images
Cohorts
Fixed an issue where CNV data converted from hg19 to hg38 do not show up in Base table views
Fixed an issue accounting for multiple methods of referring to the alternate allele in a deletion from Nirvana data
Fixed intermittent issue where GWAS ingestions not working after Base enabled in a project.
Fixed Issue
Fixed an issue causing incorrect empty storage configuration dropdown during Project creation when using the “I want to manage my own storage” option for users with access to a single region
Features and Enhancements
General
General availability of sequencer integration for Illumina sequencing systems and analysis auto launch
General usability improvements in the graphical interface, including improved navigation structure and ability to switch between applications via the waffle menu in the header
Storage Bundle field will be auto-filled based on the Project location that is being chosen if multiple regions are available
Event Log entries will be paged in the UI and will contain a maximum of 1,000 entries. Exports are limited to the maximum number of entries displayed on the page.
Read-only temporary credentials will be returned when you are not allowed to modify the contents of a file
The ICA UI will only allow selection of storage bundles belonging to ICA during Project creation, and the API will only return storage bundles for ICA
Notifications
Creating Project notifications for BaseSpace externally managed projects is now supported
Flow
Allow attached storage for Pipeline steps to be set to 0 to disable provisioning attached storage and improve performance
Cohorts
GRCh37/hg19-aligned molecular data will get converted to GRCh38/hg38 coordinates to facilitate cross-project analyses and incorporating publicly available data sets.
API
Project list API now contains a parameter to filter on (a) specific workgroup(s)
Two new API endpoints are added to retrieve regular parameters from a pipeline within or outside of a Project context
Fixed Issues
General
Optimized price calculations resulting in less overhead and logging
Improved error handling:
during Project creation
of own storage Project creation failures.
to indicate connection issue with credential
for graphical CWL draft Pipelines being updated during an Analysis
Improved error messaging in cases where the AWS path contains (a) special character(s)
Fixed an issue causing errors when navigating via deep link to the Analysis Details view
Data Management
Fixed an issue causing data records to remain incorrectly in Unarchiving status when an unarchive operation is requested in the US and Germany regions
API
Fixed returning list of unlinked data in a sample that was linked before in GET/api/projects/{projectId}/data
Fixed error for getSampleCreationBatch when using status filter
CLI
Unarchive of folders is supported when archive or unarchive actions are not in progress for the folder
Improved error message to indicate connection issue with credentials
Flow
Fixed an issue causing incorrect naming of Analysis tasks generated from CWL Expression Tools
Fixed an issue when cloning Pipelines linked from Entitled Bundles to preserve the original Tenant as the Owning Tenant of the cloned Pipeline instead of the cloning user’s Tenant
Fixed an issue causing outputs from CWL Pipelines to not show in the Analysis Details despite being uploaded to the Project Data Analysis output folder when an output folder is empty
When a Contributor starts an Analysis, but is removed afterwards, the Analysis still runs as expected
Fixed an issue where Analyses fail when Nextflow is run a second time
Fixed an issue causing API error when attempting to launch an Analysis with up to 50,000 input files
Fixed an issue causing degraded performance in APIs to retrieve Analysis steps in Pipelines with many steps
Fixed an issue causing Analysis failure during output upload with error “use of closed network connection”
Fixed an issue causing the disk capacity alert log to not show when an Analysis fails due to disk capacity, and added an error message
Fixed an issue preventing cross-tenant users from being able to open a shared CWL pipeline
Base
Improved target Table selection for schedulers to be limited to your own Tables
Bench
Fixed an issue causing Workspaces to hang in the Starting or Stopping statuses
Cohorts
Now handles large VCFs/gVCFs correctly by splitting them into smaller files for subsequent annotation by Nirvana
Features and Enhancements
General
Added a limit to Event Log and Audit UI screens to show 10,000 records
API
Parent output folder can be specified in URN format when launching a Workflow session via the API
Flow
Reduced Analysis delays when system is experiencing heavy load
Improved formatting of Pipeline error text shown in Analysis Details view
Users can now start Analyses from the Analysis Overview screen
Superfluous “Namespace check-0” step was removed to reduce Analysis failures
Number of input files for an Analysis is limited to 50,000
Auto launched Workflow sessions will fail if duplicate sample IDs are detected under Analysis Settings in the Sample Sheet
Base
Activity screen now contains the size of the query
Cohorts
Detect and Lift Genome Build: Cohorts documentation provides set-up instructions to convert hg19/GRCh37 VCFs to GRCh38 before import into Cohorts.
Attribute Queries: Improved the user experience choosing a range of values for numerical attributes when defining a cohort
Export Cohort to ICA Project Data: Improved the user experience exporting list of subjects that match cohort definition criteria to their ICA project for further analysis
Ingest Structural Variants into database
The Cohorts ingestion pipeline supports structural variant VCFs and will deposit all such variants into an ICA Base table if Base is enabled for the given project
Structural variants can be ingested and viewed in base tables
Needle Plot Enhancements
Users can input a numerical value in the Needle Plot legend to display variants with a specific gnomAD frequency percentage or sample count
The needle plot combines variants that are observed among subjects in the current project as well as shared and public projects into a single needle, using an additional shape to indicate these occurrences
Needle Plot legend color changes for variant severity: pathogenic color coding now matches the color coding in the visualization, proteins and variants are differentiated by hue, and other color coding changes were made.
Needle plot tool tips that display additional information on variants and mutations are now larger and modal
The needle plot now allows filtering by gnomAD allele frequency and sample count in the selected cohort. Variants include links to view a list of all subjects carrying that variant and to export that list.
Remove Samples Individually from Cohorts
Exclude individual subjects from a cohort and save the refined list
The subjects view allows users to exclude individual subjects from subsequent analyses and plots and to save these changes. Subject exclusions are reset when editing a cohort
Subject Selection in Analysis Visualization: Users can follow the link for subject counts in the needle plot to view a list of subjects carrying the selected variant or mutation.
UI/UX: Start and End time points are available as a date or age with a condition attribute in the subject data summary screen.
Fixed Issues
General
Improved resilience against misconfiguration of the team page when there is an issue with Workgroup availability
Removed ‘IGV (beta)’ button from ‘View’ drop down when selecting Project Data in UI
Data Management
Improved handling of multi-file upload when system is experiencing heavy loads
Fixed an issue to allow upload of zero-byte files via the UI
Fixed issue where other Bundles would not be visible after editing and saving your Bundle
API:
Improved error handling for API endpoint: POST /api/projects/{projectId}/analysisCreationBatch
Improved performance of API endpoint: getbestmatchingfornextflow
Flow
Fixed an issue causing Analysis output mapping to incorrectly use source path as target path
Fixed an issue where the UI may display incorrect or invalid parameters for DRAGEN workflows which do not accurately show the true parameters passed. Settings can be confirmed by looking at the DRAGEN analysis log files.
Base
“Allow jagged rows” setting in the Scheduler has been replaced with “Ignore unknown values” to handle files containing records with more fields than there are Table columns
Improved Base Activity view loading time
Fixed an error message when using the API to load data into a Base Table that has been deleted
Bench
Fixed an issue resulting in incorrect Bench compute pricing calculations
Fixed an issue preventing building Docker images from Workspaces in UK, Australia, and India regions
Fixed an issue where /tmp path is not writeable in a Workspace
Cohorts
Fixed issue where the bubble plot sometimes failed to display results even though the corresponding scatter plot showed data correctly.
The order of messages and warnings for ingestion jobs was not consistent between the UI and an error report sent out via e-mail.
The UI now displays any open cohort view tabs using shortened (“…”) names where appropriate
Issue fixed where ingestions with multiple errors caused the ingestion queue to halt.
The needle plot sometimes showed only one source for a given variant as opposed to all projects in which the variant had been observed.
Issue fixed with unhandled genotype index format in annotation file to base database table conversion
Status updates via e-mail sometimes contained individual error messages or warnings without a text.
Fixed issue where items show in needle plot with incorrect numbering on the y-axis.
Fixed performance issue with subject count.
Fixed an issue where widget bar-chart counts were intermittently cut off above four digits.
Fixed slowness when switching between tabs in query builder
Fixed Issue
Fixed issue with BaseSpace Free Trial and Professional users storing data in ICA
Fixed Issue
Fixed an issue resulting in analysis failures caused by a Kubernetes 404 timeout error
Features and Enhancements
General
Each tenant supports a maximum of 30,000 Projects
.MAF files are now recognized as .TSV files instead of UNKNOWN
Added VCF.IDX as a recognized file format
General scalability optimizations and performance improvements
API
POST /api/projects/{projectId}/data:createDownloadUrls now supports a list of paths (in addition to a list of IDs)
Fixed Issues
General
Fixed an issue preventing the ‘Owning Project’ column from being used outside of a Project
Fixed an issue allowing the region of a Project to be changed. Changing the region of a resource is not supported
Strengthened data separation and improved resilience against cross-Project metadata contamination
Bundles
After creating a new Bundle the user will be taken to the Bundle Overview page
Data Management
Fixed an issue which prevented changing the format of a file back to UNKNOWN
Fixed an issue causing inaccurate upload progress to be displayed for UI uploads. The Service Connector or CLI are recommended for large file uploads.
Fixed an issue showing an incorrect status for data linking batch jobs when data is deleted during the linking job
Service Connector: Fixed an issue allowing download of a Service Connector when no operating system is set
Service Connector: Cleaned up information available on Service Connectors by removing empty address information fields
API
Fixed date formatting for GET /api/eventLog (yyyy-MM-dd’T’HH:mm:ss.SSS’Z’)
Fixed an issue where the GET users API was not case sensitive on email address
Fixed an issue causing the metadata model to be returned twice in POST /api/projects/{projectId}/samples:search
Fixed the listProjects API 500 response when using the pageoffset query parameter
The searchProjectSamples API returns Sample metadata for Samples shared via a Bundle
Fixed an issue causing createProjectDataDownloadUrls API 400 and 502 errors when server is under load
Flow
Fixed analysis failures caused by kubernetes 404 timeout error
Fixed an issue where Workflows would prematurely report completion of an Analysis
Improved Pipeline retry logic to reduce startup delays
Fixed an issue where Nextflow pipelines were created with empty files (Nextflow config is allowed to be empty)
Removed the 1,000 input file limitation when starting an Analysis
Improved the performance of status update messages for pipelines with many parallel steps
Fixed an issue with overlapping fields on the Analysis Details screen
Deactivated the Abort button for Succeeded analyses
Base
Fixed an issue where Pipeline metadata was not captured in the metadata Table generated by the metadata schedule
Error logging and notification enhancements
Bench
Fixed an issue where Workspaces could be started twice
Fixed an issue where the system checkpoint folder was incorrectly created in Project data when opening a file in a Workspace
Features and Enhancements
Analysis system infrastructure updates
Features and Enhancements
Added ability to refresh Batch Jobs updates without needing to leave the Details screen.
Projects will receive a job queuing priority which can be adjusted by an Administrator.
The text "Only showing the first 100 projects. Use the search criteria to find your projects or switch to Table view." when performing queries is now displayed both on the top and bottom of the page for more clarity.
API: Added a new endpoint to retrieve download URLs for data: POST/api/projects/{projectId}/data:createDownloadUrls
API: Added support for paging of the Project Data/getProjectDataChildren endpoint to handle large amounts of data.
API: Added a new endpoint to deprecate a bundle (POST /api/bundles/{bundleId}:deprecate)
API: If the API client provides the request header "Accept-Encoding: gzip", the API applies GZIP compression to the JSON response. This makes the response significantly smaller, which improves download time and results in faster end-to-end API calls. When compression is applied, the API also provides the header "Content-Encoding: gzip" in the response, indicating that compression was effectively applied.
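For example, a client can opt in to compression by sending the header explicitly. The sketch below uses the Python requests library against the project list endpoint; it is illustrative only, and the base URL and API-key header are assumptions to verify against the API reference.

```python
import requests

# requests decompresses gzip responses transparently; the Accept-Encoding
# header below makes the opt-in explicit. Base URL and API key are placeholders.
resp = requests.get(
    "https://ica.illumina.com/ica/rest/api/projects",
    headers={"X-API-Key": "<your-api-key>", "Accept-Encoding": "gzip"},
)
print(resp.headers.get("Content-Encoding"))  # "gzip" when compression was applied
print(len(resp.content), "bytes after decompression")
```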
Flow: Optimized Analysis storage billing, resulting in reduced pipeline charges.
Flow: Internal details of a (non-graphical) pipeline marked ‘Proprietary’ will not be shared with users from a different tenant.
Flow: A new grid layout is used to display Logs for Analyses with more than 50 steps. The classic view is retained for analyses with 50 steps or less, though you can choose to also use the grid layout by means of a grid button on the top right on the Analysis Log tab.
CLI: Command to launch a CWL and Nextflow Pipeline now contains the mount path as a parameter.
CLI: Version command now contains the build number.
CLI: Added support for providing the nextflow.config file when creating a Nextflow pipeline via CLI.
API: HTML documentation for a Pipeline can now be returned with the following requests:
GET /api/pipelines/{pipelineId}/documentation/HTML
GET /api/projects/{projectId}/pipelines/{pipelineId}/documentation/HTML
API: Added a new endpoint for creating and starting multiple analyses in batch: POST /api/projects/{projectId}/analysisCreationBatch
Flow: Linking to individual Analyses and Workflow sessions is now supported by /ica/link/project//analysis/ and /ica/link/project//workflowSession/
Cohorts: Users can now export subject lists to the ICA Project Data as a file.
Cohorts: Users can query their ingested data through ICA Base. For users who already have ingested private data into ICA Cohorts, another ingestion will need to happen prior to seeing available database shares. Customers can contact support to have previously ingested data sets available in Base.
Cohorts: Correlation bubble plot counts now link to a subject/sample list.
Fixed Issues
Tooltip in the Project Team page provides information about the status of an invite
‘Resend invite’ button in the Project Team page will become available only when the invite is expired instead of from the moment the invite is sent out
Folders, subfolders and files all contain information about which user created the data
Files and folders with UTF-8 characters are not supported. Please see the documentation on how to recover in case you have already used them.
Improved performance for creating or hiding a Project in a tenant with many Projects
Service Connector: Updated information in the Service Connector screen to reflect the name change from "Type of Files" to the more accurate "Assign Format"
Service Connector: Folders within a Bundle can be downloaded via the Service Connector
Service Connector: Upload rules can only be modified in the Project where they apply
Service Connector: A message describes when a file is skipped during upload because it already exists in the Project
Service Connector: Fixed an issue where opening the Connectivity tab occasionally results in a null pointer error
Service Connector: Fixed an issue causing excessive logging when downloading files with long file paths
Service Connector: Fixed an issue where the Service Connector log may contain spurious errors which do not impact data transfers
Existing storage configurations are displayed and accessible via API and UI
Newly added storage configurations no longer remain in ‘Initializing’ state
Fixed error when creating a storage configuration with more than 63 characters
Clicking on a Data folder in flat mode will now open the details of the folder
Only Tools in Released state can be added to a Bundle
Fixed issue preventing new Bundle versions to be created from Restricted Bundles
Deprecated Bundles are displayed upon request in card and table view
Bundles view limited to 100 Bundles
API: Fixed the API spec for ProjectDataTransfer.getDataTransfers
API: Fixed an issue with the projectData getChildren endpoint which returned incorrect object and pagination
API: Fixed an issue where multiple clicks on Create sample batch API endpoint resulted in multiple requests
API: POST /api/projects/{projectId}/data/{dataId}:scheduleDownload can now also perform folder downloads
API: Improved information on the Swagger page for GET /api/pipelines, GET/api/projects/{projectId}/pipelines, and GET/api/projects/{projectId}/pipelines/{pipelineId}
API: Fixed an issue where, when a user provides the same input multiple times to a multi-value input on an analysis run, that input was only passed to the pipeline once instead of multiple times: POST /api/projects/{projectId}/analysis:nextflow
CLI: Copying files in the CLI from a local directory on MacOS to your Project can result in both the desired file and the metadata file (beginning with ‘./’) being uploaded. The metadata file can safely be deleted from the Project
CLI: Hardened protection against accidental file overwriting
CLI: Improved handling for FUSE when connection to ICA is lost
CLI: icav2 projectdata mount --list shows updated list of mounted Projects
CLI: Paging improvements made for project list, projectanalyses list, and projectsdata list
CLI: When there is no config or session file the user will not be asked to create one for icav2 config reset and icav2 config get
CLI: Fixed an issue where Bundle data could not be seen through FUSE in Bench
CLI: Fixed an error message when missing config file upon entering the Project context
CLI: The unmount is possible without a path and will work via the stored Project ID or with a directory path resulting in an unmount of that path
CLI: Fixed an error when creating a Pipeline using URN for Project identifier
CLI: Attempting to delete a file from an externally-managed project returns an error indicating this is not allowed
CLI: Fix to delete session file when config file is not detected
CLI: Paging option added to projectsamples list data
CLI: Fixed “Error finding children for data” error in CLI when downloading a folder
CLI: projectdata list now returns the correct page-size results
Flow: Fixed handling of special characters in CWL pipeline file names
Flow: Fixed an issue where task names exceeding 25 characters cause analysis failure in CWL pipelines
Flow: Fixed an issue which prevented requests for economy tier compute
Flow: Fixed an issue limiting CWL workflow concurrency to two running tasks
Flow: Fixed an issue where analysis file inputs specified in the input.json with ‘location’ set to an external URL caused CWL pipelines to fail
Flow: Fixed an issue resulting in out of sync Pipeline statuses
Flow: Improved Nextflow engine resiliency, including occurrences where Nextflow pipelines fail with ‘pod 404 not found’ error
Flow: Fixed an issue with intermittent system runtime failures incorrectly causing analysis failures
Flow: Fixed an issue where links to Analysis Details returned errors
Flow: Enabled scrolling for Pipeline documentation
Flow: Improved performance for handling analyses with large numbers of inputs
Flow: Improved handling of hanging Analyses
Flow: Improved error messages for failed Pipelines
Flow: Added documentation on how to use XML configuration files for CWL Pipelines
Flow: Duplicate values for multi-value parameters are no longer automatically removed
Flow: Correct exit code 0 is shown for successful Pipeline steps
Base: Fixed an issue so that only users with correct permissions are allowed to retrieve a list of Base tables
Base: Fixed an issue with metadata scheduler resulting in a null pointer
Base: An empty Table description will not return an error when requesting to list all Tables in a Project
Base: Jobs failed with an error containing 'has locked table' are not shown on the Base Job activity list. They can be displayed by selecting the 'Show transient failures' checkbox at Projects > Activity > Base Jobs.
Base: Users can see Schedulers and their results for the entire tenant if created by a tenant administrator in their project, but not create, edit or run them
Base: Fixed an issue preventing data format change in a schedule
Base: Fixed an issue preventing exporting data to Excel format
Bench: Improved handling to prevent multiple users in a single running Workspace
Bench: Fixed an issue causing Workspaces to be stuck in "Starting" state
Bench: Fixed an issue where usage did not show up in the CSV-based usage report
Bench: Fixed an issue where Bundle data could not be seen via the Fuse driver
Bench: Users can now consistently exit Workspaces with a single click on the ‘Back’ button.
Bench: After leaving a Workspace by clicking on the ‘Back’ button, the Workspace will remain in a ‘Running’ state and become available for a new user to access
Bench: Workspaces in a ‘Stuck’ state can be manually changed to ‘Error’ state, allowing users to restart or delete them
Cohorts: Fixed issue where file system cleanup was not occurring after delete.
Cohorts: Fixed sign in and authentication issues in APN1 region.
Cohorts: Fixed issue where the gene filter was missing when editing a cohort, removing the edited filter, and cancelling. The filter was preserved and should not have been.
Cohorts: Fixed issue where users see an application tile in the Illumina application dashboard selection screen called "Cohort Analysis Module".
Cohorts: Correlation: Fixed an issue where data type selections only displayed partially when loading the search result
Cohorts: Fixed an issue where users will see an application tile called “Cohort Analysis Module” on the Connected Platform home page screen if the Cohorts module is added to the domain. Users should not enter ICA Cohorts through this page; they should enter through ICA.
This page contains the release notes for the current release. See the subpages for historic release notes.
General
A new Experimental Nextflow version has been made available (v24.10.2). Users will no longer be able to create new pipelines with Nextflow v20.10.0. In early 2026 ICA will no longer support Nextflow v20.10.0
Added an API endpoint to retrieve analysis usage details, exposing the analysis price. The UI now differentiates between errors and ongoing price calculations, displaying 'Price is being calculated' for pending requests instead of a generic error message
Made the project owner field read-only in the project details view and added a button in the Teams view to edit the project owner via a separate dialog
Autolaunch and BCLConvert now support dots and spaces in project names
Data Management
Users are now able to create non-indexed folders. These are special folders which cannot be accessed from the UI and which have certain actions blocked (such as moving or copying those folders)
Enhanced visibility for data transfers by clearly marking those that do not match any download rule as 'ignored' in the UI. This helps users quickly identify transfers that won't start, preventing confusion and improving troubleshooting.
In bundles it is now possible to open the details for docker and tool images by clicking on the name in the overview
User managed storage configurations now allow for the copying of tags when copying/moving/archiving files and folders
Bench
For fast read/write access in Bench, you can now link non-indexed folders with the CLI command workspace-ctl data create-mount --mode read-write
Bench can now be started in a single-user mode allowing only one user to work in the workspace. All assets generated in bench (e.g. pipelines) are owned by the Bench user instead of a service account
UI Changes made to Workspace configuration and splash pages
General
Fixed an issue where labels for success, failure, and other item counts were missing in the Batch job details panel
Improved error message in the API when creating a new project with user managed storage
Fixed an issue where the Save button remained enabled when clicking on the Documentation tab in the Tool Repository
Fixed duplicate project detection to handle the new 400 error response format, ensuring consistency with other unique constraint violations
Changes made to advanced scheduler options that are not applicable any longer
Removed erroneous Link/Unlink buttons in Tool and Docker Images of shared bundles
Fixed issue where project with large number of analyses loads slowly
Data Management
There will be a change in one of the upcoming releases where users are no longer able to edit connectors of other users through the API. This will be made consistent with the UI.
Fixed an issue where the sample list did not automatically refresh after deleting samples using the 'Delete input data and unlink other data' or 'Delete all data' options
Made display color of Bundle-related data more consistent
Removed on-click behavior of an added Cohorts dataset in a bundle that caused a yellow-bar warning
Fixed an issue preventing project creation with user managed storage when specifying a bucket name and prefix in the Storage Config without a subfolder
Fixed an issue where managing tags on data could result in a TransactionalException error, causing long load times and failed saves
When a project data download CLI command returned an error for a file, the command returned status 0, while it should have returned 1. This has now been fixed
Brought API in line with UI for detection of duplicate folder path already existing outside of your project
Fixed local version detection affecting automatic service connector upgrades
Flow
Improved error messaging when developing pipeline JSON based input forms
Fixed an issue where in some cases Nextflow logs were too big but were still copied into the notification, which caused the notification to fail. The log behind the 'Show more' button is now truncated to a size which is accepted by SQS
Updated API behavior for JSON-based CWL and Nextflow pipelines to prevent unintended rounding of 'number' fields with values greater than 15 digits. Added a warning to advise users to pass such values as strings to maintain precision
Each analysis step attempt is now recorded as a separate entry, ensuring accurate billing and providing end users access to stdout/stderr logs for every retry
Fixed an issue when retries or duplicate step names caused improper entity_id identification
Clicking 'Open in Data' from analysis details now correctly redirects users to the file's parent folder in the project data view instead of the root
Refreshing the pipeline/workflow detail view now correctly updates the UI to reflect the latest version, ensuring any changes are displayed
Base
Fixed an issue where filtering columns on a number in Base activity produced an error
Bench
Improved protection against concurrent status changes when stopping workspaces
Added refresh button to Bench workspaces
Improved error handling when special characters are added to the storage size of bench workspaces
Made the behavior when running and stopping workspaces more consistent
Fixed an issue where the UI did not refresh automatically during long workspace initialization times, causing the workspace status to remain outdated until manually refreshed
This tutorial demonstrates major functions of the ICA platform, beginning with setting up a project with instrument run data to prepare to use pipelines, and concluding with viewing pipeline outputs in preparation for eventually ingesting outputs into available modules.
In the following example, we start from an existing ICA project with demultiplexed instrument run data (fastq), use the DRAGEN Germline pipeline, and view output data.
This tutorial assumes you already have an existing project in ICA. To create a new project, please see instructions in the Projects page.
Additionally, you will need the DRAGEN Demo Bundle linked to your existing ICA project. The DRAGEN Demo Bundle is an entitled bundle provided by Illumina with all standard ICA subscriptions and includes DRAGEN pipelines, references, and demo data.
For general steps on creating and linking bundles to your project, see the Bundles page. This tutorial explores the DRAGEN Germline Published Pipeline, so we will need to link the DRAGEN Demo Bundle to our existing project.
Steps:
Go to your project's Details page
Click the Edit button
Click the + button, under LINKED BUNDLES
Click on the DRAGEN Demo Bundle, then click the Link Bundles button
There may be multiple versions of the DRAGEN Demo Bundle. This tutorial details steps for DRAGEN Demo Bundle 3.9.5; steps for versions after 3.9.5 should be similar.
Click the Save button
DRAGEN Demo Bundle assets should now be available in your project's Data and Pipelines pages.
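As an alternative to the UI steps above, a bundle can also be linked to a project through the API. The sketch below is a minimal example in Python; the base URL, the X-API-Key header, and the placeholder IDs are assumptions to verify against the ICA API reference (the POST /api/projects/{projectId}/bundles/{bundleId} endpoint is referenced elsewhere in this documentation).

```python
import requests

# Placeholders -- replace with your own values.
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = "<your-api-key>"
PROJECT_ID = "<project-uuid>"
BUNDLE_ID = "<dragen-demo-bundle-uuid>"

# Link the bundle to the project; an empty request body is assumed to be sufficient.
resp = requests.post(
    f"{ICA_BASE}/api/projects/{PROJECT_ID}/bundles/{BUNDLE_ID}",
    headers={"X-API-Key": API_KEY},
)
resp.raise_for_status()
print("Bundle linked:", resp.status_code)
```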
After setting up the project in ICA and linking a bundle, we can run various pipelines.
This example demonstrates how to run the DRAGEN Germline Published Pipeline (version 3.9.5) in your ICA project using the demo data from the linked DRAGEN Demo Bundle.
The required pipeline input assets for this tutorial include:
Under the Data page:
Illumina DRAGEN Germline Demo Data folder
Illumina DRAGEN Enrichment Demo Data folder
Illumina References folder
Under the Pipelines page:
DRAGEN Germline
From the Pipelines page, select DRAGEN Germline 3.9.5, and then click Start Analysis. Initial set-up details require a User Reference (a pipeline run name meaningful to the user) and an Entitlement Bundle from the drop-down menu under Pricing.
Running the DRAGEN Germline pipeline uses the following inputs which are to be added in the Input Files section:
FASTQ files
Select the FASTQ files in the Illumina DRAGEN Enrichment Demo Data folder and select Add.
Reference:
Select a reference genome from the Illumina References folder (do not select a methyl-converted reference genome for this tutorial)
E.g., hg38_altaware_nohla-cnv-anchored.v8.tar (suggested if enabling CNV analysis)
The DRAGEN Germline Settings to be selected are:
Enable germline small variant calling: Set to true
Enable SV (structural variant) calling: Set to true
If true, Enable map align output must also be set to true
Enable repeat genotyping: Set to true
Enable map align: Set to true
When using FASTQ files as input, as in this example, set this to true as the default.
When using BAM files as input, set to true to realign reads in input BAMs; set to false to keep alignments in input BAM files.
Enable CNV calling: Set to true
Enabling Copy Number Variant calling requires one of the following:
Enable CNV self normalization is set to true
A panel of normals (PON) is provided in the Input Files
Output format: Set to CRAM
Other available options for alignments output are BAM and SAM format.
Enable CNV self-normalization: Set to true
Required if Enable CNV calling is set to true and no panel of normals (PON) is provided in the Input Files.
Enable duplicate marking: Set to true
Emit Ref Confidence: Set to GVCF to enable banded gVCF generation for this example
To enable base pair resolution in the gVCF, set to BP_RESOLUTION
Additional DRAGEN args: Leave Empty
Users can provide additional DRAGEN arguments here (see the DRAGEN user guide for examples), but we will leave this blank for this example run.
Sample sex: Leave blank
Users may specify the sex of the sample here if known, but we will omit this setting for this example run.
Enable HLA: Set to true to enable HLA typing
Enable map align output: Set to true
The format for alignment output was selected previously in the "Output format setting" above
Resources
Use the default resources settings:
Storage size: Set to small
FPGA Medium Tier: Set to Standard
FPGA Medium Resources: Set to FPGA Medium
Once all parameters have been set, click Start analysis
You can monitor the status of analysis pipeline runs from the Flow > Analysis page in your project. See Analysis Lifecycle for more details.
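If you prefer to monitor runs programmatically rather than in the UI, a minimal sketch using the project analyses endpoint is shown below. The base URL, the X-API-Key header, and the userReference filter parameter are assumptions to verify against your tenant's API reference.

```python
import requests

# Placeholders -- replace with your own values.
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = "<your-api-key>"
PROJECT_ID = "<project-uuid>"

# List analyses in the project and print their status; the userReference
# filter name is an assumption -- check the API reference for the exact spelling.
resp = requests.get(
    f"{ICA_BASE}/api/projects/{PROJECT_ID}/analyses",
    headers={"X-API-Key": API_KEY},
    params={"userReference": "my-dragen-germline-run"},
)
resp.raise_for_status()
for analysis in resp.json().get("items", []):
    print(analysis.get("userReference"), analysis.get("status"))
```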
Click on the run to view more information about it. The various tabs under a given run provide additional context regarding the status of the completed run.
If you encounter a failed run, you can find more information in the Projects > your_project > Flow > Analyses > your_analysis > Details tab and on the execution report tab.
Analysis run logs can be found on the Steps tab. Use the sliders next to Stderr and Stdout for more details. Check the box next to "Show technical steps" to view additional log files.
DRAGEN analysis output folders are found on the project's Data page, along with all other data loaded to the project (such as assets from a linked entitled bundle). Analysis outputs will be grouped into folders, so users can click through the directory structure to explore outputs.
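If you want to retrieve outputs from a script, one option is the download-URL endpoint mentioned in the release notes above. The sketch below is a minimal example; the request body field name and the response shape are assumptions to confirm in the API reference.

```python
import requests

# Placeholders -- replace with your own values.
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = "<your-api-key>"
PROJECT_ID = "<project-uuid>"
OUTPUT_FILE_IDS = ["<fil.xxxxxxxx>"]  # IDs of the output files to download

# Request presigned download URLs for the selected output files.
# The body field name "dataIds" is an assumption -- verify it in Swagger.
resp = requests.post(
    f"{ICA_BASE}/api/projects/{PROJECT_ID}/data:createDownloadUrls",
    headers={"X-API-Key": API_KEY},
    json={"dataIds": OUTPUT_FILE_IDS},
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item.get("url"))
```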
DRAGEN Support Site: https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform.html
ICA Pricing: https://help.ica.illumina.com/reference/r-pricing
There are several ways to connect pipelines in ICAv2. One of them is to use the Simple Notification Service (SNS) and a Lambda function deployed on AWS. Once the initial pipeline is completed, SNS triggers the Lambda function. The Lambda function extracts information from the event parameter to create an API call that starts the subsequent pipeline.
Notifications are used to subscribe to events in the platform and trigger the delivery of a message to an external delivery target. You can read more here. Important: In order to allow the platform to deliver events to Amazon SQS or SNS delivery targets, a cross-account policy needs to be added to the target Amazon service.
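A minimal sketch of applying such a cross-account policy with boto3 is shown below. The topic ARN and the ICA platform principal are placeholders; replace them with your topic's ARN and the principal documented by Illumina for your region.

```python
import json
import boto3

# Placeholders -- replace with your topic ARN and the ICA platform principal
# documented by Illumina for your region.
arn = "arn:aws:sns:us-east-1:123456789012:ica-analysis-events"
ica_principal = "arn:aws:iam::000000000000:root"  # placeholder principal

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowIcaPublish",
            "Effect": "Allow",
            "Principal": {"AWS": ica_principal},
            "Action": "SNS:Publish",
            "Resource": arn,
        }
    ],
}

# Attach the access policy to the topic so the platform can deliver events to it.
boto3.client("sns").set_topic_attributes(
    TopicArn=arn, AttributeName="Policy", AttributeValue=json.dumps(policy)
)
```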
Here, arn is the Amazon Resource Name (ARN) of the target SNS topic. Once the SNS topic is created in AWS, you can create a New ICA Subscription in Projects > your_project > Project Settings > Notifications > New ICA Subscription. The following screenshot displays the settings of a subscription for Analysis success of a pipeline with a name starting with Hello.
On this site there is a list of all available API endpoints for ICA. To use it, obtain the API-Key from the Illumina ICA portal.
To start a Nextflow pipeline using the API, use the endpoint /api/projects/{projectId}/analysis:nextflow. Provide the projectId and the request body in JSON format containing userReference, pipelineId, analysisInput, etc. Two parameters, activationCodeDetailId and analysisStorageId, have to be retrieved using the API endpoint /api/activationCodes:findBestMatchingForNextflow from the Entitlement Detail section in Swagger. For example:
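The exact request shape is documented in Swagger; the following is a minimal sketch in Python, assuming the endpoint accepts a POST with the project and pipeline identifiers and that the API key is passed in the X-API-Key header.

```python
import requests

# Placeholders -- replace with your own values.
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = "<your-api-key>"
PROJECT_ID = "<project-uuid>"
PIPELINE_ID = "<pipeline-uuid>"

HEADERS = {"X-API-Key": API_KEY}

# Ask ICA for the best matching activation code and analysis storage options
# for the pipeline we want to launch. Body field names are assumptions -- check Swagger.
resp = requests.post(
    f"{ICA_BASE}/api/activationCodes:findBestMatchingForNextflow",
    headers=HEADERS,
    json={"projectId": PROJECT_ID, "pipelineId": PIPELINE_ID},
)
resp.raise_for_status()
print(resp.json())
```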
Output of the API call:
In this particular case, the activationCodeDetailId is "6375eb43-e865-4d7c-a9e2-2c153c998a5c" and analysisStorageId is "6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0" (for resource type "Small").
Once you have all these parameters, you can start the pipeline using the API.
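A minimal sketch of the launch call is shown below. The analysisInput structure and the parameter code are assumptions that depend on your pipeline's input form, so verify them against the pipeline definition and Swagger; the activation code and storage IDs shown are the ones from the example output above.

```python
import requests

# Placeholders -- replace with your own values.
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = "<your-api-key>"
PROJECT_ID = "<project-uuid>"
PIPELINE_ID = "<pipeline-uuid>"
HEADERS = {"X-API-Key": API_KEY}

# Start the Nextflow pipeline once the activation code and storage IDs are known.
body = {
    "userReference": "my-analysis-run",  # a run name meaningful to you
    "pipelineId": PIPELINE_ID,
    "activationCodeDetailId": "6375eb43-e865-4d7c-a9e2-2c153c998a5c",
    "analysisStorageId": "6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0",
    "analysisInput": {
        # "parameterCode" must match an input code defined by the pipeline.
        "inputs": [{"parameterCode": "in", "dataIds": ["<fil.xxxxxxxx>"]}]
    },
}

resp = requests.post(
    f"{ICA_BASE}/api/projects/{PROJECT_ID}/analysis:nextflow",
    headers=HEADERS,
    json=body,
)
resp.raise_for_status()
print(resp.json().get("id"), resp.json().get("status"))
```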
Next, create a new Lambda function in the AWS Management Console. Choose Author from scratch and select Python 3.7 (includes the requests library) as the runtime. In the Function code section, write the code for the Lambda function that will use different Python modules and execute API calls to the existing online application. Add the SNS topic created above as a trigger.
Here is an example of Python code that checks whether a file named 'test.txt' exists in the output of the successful pipeline. If the file exists, a new API call is made to invoke the second pipeline with this 'test.txt' as an input.
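A minimal sketch of such a Lambda handler is shown below, assuming the standard SNS-to-Lambda event shape. The ICA event field names, the filename filter parameter, the input parameter code, and the response shapes are assumptions to confirm against a real event payload and the API reference; the requests library may need to be packaged with the function or supplied via a Lambda layer.

```python
import json
import os
import requests

# Configuration via environment variables (placeholders).
ICA_BASE = "https://ica.illumina.com/ica/rest"
API_KEY = os.environ["ICA_API_KEY"]
SECOND_PIPELINE_ID = os.environ["SECOND_PIPELINE_ID"]
ACTIVATION_CODE_DETAIL_ID = os.environ["ACTIVATION_CODE_DETAIL_ID"]
ANALYSIS_STORAGE_ID = os.environ["ANALYSIS_STORAGE_ID"]

HEADERS = {"X-API-Key": API_KEY}


def lambda_handler(event, context):
    # The ICA notification arrives as the SNS message body.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Field name below is an assumption -- inspect a real event to confirm it.
    project_id = message["projectId"]

    # Look for 'test.txt' among the project data; the filename filter
    # parameter is an assumption -- verify it in the API reference.
    search = requests.get(
        f"{ICA_BASE}/api/projects/{project_id}/data",
        headers=HEADERS,
        params={"filename": "test.txt"},
    )
    search.raise_for_status()
    items = search.json().get("items", [])
    if not items:
        return {"started": False, "reason": "test.txt not found"}

    # Start the second pipeline with test.txt as its input.
    body = {
        "userReference": "triggered-by-lambda",
        "pipelineId": SECOND_PIPELINE_ID,
        "activationCodeDetailId": ACTIVATION_CODE_DETAIL_ID,
        "analysisStorageId": ANALYSIS_STORAGE_ID,
        "analysisInput": {
            "inputs": [{"parameterCode": "in", "dataIds": [items[0]["data"]["id"]]}]
        },
    }
    start = requests.post(
        f"{ICA_BASE}/api/projects/{project_id}/analysis:nextflow",
        headers=HEADERS,
        json=body,
    )
    start.raise_for_status()
    return {"started": True, "analysisId": start.json().get("id")}
```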
Fixed Issues
When creating a new cohort, the disease filter’s tree hierarchy was not showing up, meaning it was not possible to add a disease filter to the cohort definition. This has been resolved.
Fixed Issues
Flow
Fixed an issue which caused service degradation where analysis steps were not properly updated until analyses were finished, and log files only became available after completion.
Features and Enhancements
General
General usability improvements for the project overview screen
The timing for when jobs are deleted has been updated so that:
SUCCEEDED remains 7 days
FAILED and PARTIALLY_SUCCEEDED are increased to 90 days
Data Management
Data can now be uploaded into the BaseSpace-managed project
Flow
Analyses can now be started from the pipeline details screen
The analysis details now contain two additional tabs displaying timeline and execution reports for Nextflow analyses to aid in troubleshooting errors
Introduced a start command for starting a Nextflow pipeline with a JSON-based input form
Added new API endpoints to create a new CWL pipeline and start an analysis from a CWL pipeline with JSON-based input forms:
POST/api/projects/{projectId}/pipelines:createCwlJsonPipeline
POST/api/projects/{projectId}/analysis:cwlJson
Pipelines with JSON-based input forms can now pre-determine and validate storage sizes
Added support for tree structures in dropdown boxes on JSON-based input forms to simplify searching for specific values
Introduced a new filtering option on the analyses grid to enable filtering for values which differ from, or do not equal (!=), a given value (such as exit codes in the pipeline steps in the analysis details screen)
The analysis output folder format will now be user reference-analysis id
Cohorts
The side panel now displays the Boolean logic used for a query with ‘AND’, ‘OR’ notations
The needle plot visualization now drives the content of the variant list table below it. By default, the list displays variants in the visualization and can be toggled to display all variants with subsequent filtering
For diagnostic hierarchies, concept children count and descendant count for each disease name is displayed
The measurement/lab value can be removed when creating query criteria
Fixed Issues
General
Notification channels are no longer created at the tenant level and are no longer visible to members of external tenants working on the same project
Data Management
Fixed an issue where move jobs fail when the destination is set to the user’s S3 bucket where the root of the bucket mapped to ICA as storage configuration and volume
Fixed a data synchronization issue when restoring an already restored object from a project configured with S3 storage
Flow
Corrected the status of deleted Docker images from incorrect ‘Available’ to ‘Deleted’
The reference for an analysis has changed to userReference-UUID, where the UUID matches the ID from the analysis. (The previous format was userReference-pipelineCode-UUID.)
Pipeline files are limited to a file size of 20 Megabytes
Bench
Fixed an issue which caused ‘ICA_PROJECT_UUID not found in workspaceSettings.EnvironmentVariables’ when creating a new Workspace
Cohorts
Fixed an issue where the system displays ALL/partial filter chips when the top level tree node is selected in a hierarchical search
Fixed an issue where the system displays 400 bad request error despite valid input of metadata files during import jobs
Fixed an issue where the system displays inconsistent hierarchical disease filter results
Fixed an issue where the system changes the layout when displaying the p-value column
Fixed an issue where the system disables the next button when there is no study available in the dropdown menu
Fixed an issue where studies could not be selected when a project has one study to ingest data into
Fixed Issues
Mitigated an issue causing intermittent system authentication request failures. Impact includes analysis failures with "createFolderSessionForbidden" error
Features and Enhancements
General
The projectdata upload CLI command will from now on give you the credentials to access the data
Data Management
Introduced a limit of 100,000 entries on the number of data elements that can be put in POST /api/projects/{projectId}/dataUpdateBatch
Flow
Users can now access json-based pipeline input forms for both Nextflow and CWL pipelines. API access is not yet available for CWL pipelines
Added GPU compute types (gpu-small, gpu-medium) for use in workflows
Users can now sort analyses by request date instead of start date, which was not always available
The analysis details page has been upgraded with the following features:
The progress bar which could be found on the analyses overview page will now also appear in the details page
A maximum of 5 rows of output are shown for each output parameter, but the output can be displayed in a large popup to have a better overview
Orchestrated analyses are shown in a separate tab
Cohorts
Users can now use the Measurement concept API to create cohorts based on lab measurement data and harmonize their values to perform downstream analysis
Users can now access the Hierarchical concept search API to view the phenotype ontologies
Fixed Issues
General
The mail option is now automatically filtered out for those events that do not support it
Fixed an issue where there was no email sent after rerunning a workflow session
Fixed an issue which caused authentication failures when using a direct link
Made file and folder selection more consistent
Fixed an issue with the CLI where using the “projectsamples get” command to retrieve a sample shared via an entitled bundle in another tenant failed
Fixed filtering so you can only see subscriptions and channels from your own tenant
Improved GUI handling for smaller display sizes
Fixed the workflow session user reference and output folder naming to use BaseSpace Experiment Name when available
Data Management
The unlink action is now greyed out if the selection contains data that is not linked
Fixed an issue where, when deleting folders, the parent folder was deleted first, giving the impression that the parent folder was deleted but not its subfolders and files
Fixed an issue where the connector downloads only downloaded the main folder, not the folder contents
For consistency, it is no longer possible to link to folders or files from within subfolders. Previously, linking was possible, but the files and folders were always linked to the top level instead of the subfolder from which the linking was done
Updated error handling for dataUpdateBatch API endpoint
Moving small files (<8 MB) will not trigger a "moving" event, only a completion event. Out-of-order events caused issues, and moving small files happens fast enough that only the completion of the move needs to be reported, not the intermediate moving status
Improved error handling when encountering issues during cancellation of data copy/move
Improved error message when trying to unlink data from a project via the API when this data is native to that project and not linked
Fixed an issue where an analysis could proceed to download input data when any of the inputs were in a status other than AVAILABLE, including records within folder data inputs
Flow
Redesigned UI component to prevent issues with Analysis summary display
Fixed an issue where the field content was not set to empty when the field input forms have changed between the original analysis and a rerun
Replaced retry exhaustion message, "External Service failed to process after max retries 503 Unique Reference ID: 1234" with a more useful message to end users that advises them to contact Illumina support: "Attempt to launch ICA workflow session failed after max retries. Please contact Illumina Tech Support for assistance. Unique Reference ID: 1234". This does not replace more specific error messages that provide corrective advice to the user, such as "projectId cannot be null"
For efficiency reasons, pipeline files are limited to a file size of 100 Megabytes
Bench
Fixed an issue which caused .bash_profile to no longer source .bashrc
Fixed the status of deleted docker images which previously were displayed as available
After creating a tool, the Information tab and Create Tool are now no longer accessible to prevent erroneous selection
Cohorts
Fixed a layout issue where buttons moved up when the user selected an option
Fixed an issue where the user was not able to view the PheWas plot when multiple cohorts are open and the same gene is searched
Fixed an issue where the user was not able to view the GWAS plot when multiple cohorts are open and the user switched back and forth between cohorts
Fixed an issue where users were not able to see the cytogenetic map in the gene summary page for genes associated with the chromosome
Fixed Issues
General
Fixed an issue where various Data Transfer Ownership API calls were failing with a 'countryView' constraint violation error
Features and Enhancements
General
Dynamically linked folders and files now have their own icon type, which is a folder/file symbol with a link symbol consisting of three connected circles
Data Management
With the move from static to dynamic data linking, unlinking data is now only possible from the project top level to prevent inconsistencies
The user can now manually create a dynamic link to a folder
The icav2 project data mount command now supports the “--allow-other” option to allow other users access to the data
The user can now set a time to be archived or deleted for files and folders with the “timeToBeArchived” and “timeToBeDeleted” parameters on the “POST /api/projects/{projectId}/dataUpdateBatch” endpoint
Added 4 new API endpoints which combine the create and retrieve upload information
Flow
The default Nextflow version is now 22.04.03, from 20.10.0
The user can now specify the Nextflow version when deploying a pipeline via the CLI with the “--nextflow-version” flag
Bench
The user now has the option to choose either a tool image or a Bench image when adding new Docker images
It is now possible to open contents of a Bench workspace in a new tab from the Bench details tab > access section
Fixed Issues
General
Improved handling of API calls with an invalid or expired JWT or API token
Data Management
Renamed the "New storage credential" button to "Create storage credential"
Removed the "Edit storage credential" button. The user can now edit the column directly in the open dialog when clicking on the name
Performance improvements to scheduled data download
Fixed an issue where data records were shown more than once when updating the tags
The data details were erroneously labeled with "size in bytes" while the size was in a variable unit
Fixed an issue where trying to download files could result in the error "Href must not be null" when the file was not available
Fixed an issue where existing data catalog views would return an empty screen caused by a mismatch in role naming
Flow
Fixed an issue that caused opening a pipeline in the read-only view to incorrectly detect there were unsaved changes to the pipeline
Fixed an issue where different pipeline bundles with resource models of the same name would result in duplicate listings of these resources
Improved error handling when encountering output folder creation failure, which previously could result in analysis being stuck in REQUESTED status
By default Nextflow will no longer generate the trace report. If you want to enable generating the report, add the section below to your userNextflow.config file:
trace.enabled = true
trace.file = '.ica/user/trace-report.txt'
trace.fields = 'task_id,hash,native_id,process,tag,name,status,exit,module,container,cpus,time,disk,memory,attempt,submit,start,complete,duration,realtime,queue,%cpu,%mem,rss,vmem,peak_rss,peak_vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,vol_ctxt,inv_ctxt,env,workdir,script,scratch,error_action'
Fixed the issue where users not allowed to run or rerun workflows could start them from the API or BaseSpace SequenceHub. Now, users that cannot start workflows cannot rerun them.
Cohorts
Fixed an issue with selecting the lower needles when multiple needles overlap at the same location in the needle plot
Fixed an issue where the user would not be able to view the cytogenetic map in the gene summary page for genes associated with the chromosome
Fixed issue where the user would not be able to view the PheWas plot when multiple cohorts are open and same gene is searched
Features and Enhancements
Data Management
Improved performance of data linking jobs
Fixed Issues
General
Fixed an issue causing slow API responses and 500 errors
Features and Enhancements
General
The CLI readme file will now additionally contain the CLI build number
Data Management
Fixed an issue where there was a discrepancy between the Run Input tags shown to the user and what was stored on the data
Added a 25,000-item limit to the v3 endpoint for batch data linking. Using the v4 endpoint, which does not have this limitation, is recommended
Flow
Analyses and workflow sessions can now be resubmitted, and parameters can be updated upon resubmission
Changed the default image used for CWL pipeline processes with undefined image from docker.io/bash:5 to public.ecr.aws/docker/library/bash:5
Updated the choice of default nextflow docker image which is used when no docker image is defined. It is now public.ecr.aws/lts/ubuntu:22.04_stable
The analysis logs in the analysis details page can be refreshed
The user is now able to write a pipeline which executes files located in the /data/bin directory of the runtime environment
Pipeline files are now shown in a tree structure for easier overview
Cohorts
The updated GWAS UK Biobank database gives users access to more phenotype information
Users can now incrementally ingest their molecular data for germline, CNV, structural variants, and somatic mutation data
Fixed Issues
General
Added an "All" option to the workgroup selection box in the projects view to reset the filter, which previously required you to delete all characters from the filter
Fixed an issue where updating two base permissions at the same time would sometimes not execute correctly
Fixed an issue where creating grid filters could result in a nullpointer error
Fixed an issue where 'Copy to Clipboard' button did not work anymore
After searching for a folder in the search box and going into that folder, the search box is now cleared
Improved the project permissions API to correctly handle empty values
Previously, when attempting to save and send a message from the Websolutions section without a unique subject, the system would report an error and still send the message. Now the non-unique message subject error is reported and no message is sent
Fixed an issue where linking samples in the sample screen would result in receiving the same "sample(s) linked" message twice
Improved error handling for CLI FUSE driver
Hardened log streaming for ongoing runs to better handle network issues which previously would result in missing log streaming
Added retries for "connection reset by peer" network-related errors during analysis upload tasks
Fixed an issue where inviting a user to collaborate on a project containing base would result in the error "entity not managed" if that user did not have base enabled in any project or if base was not enabled in the project tenant
Data Management
Fixed an issue where data could be moved to a restricted location called /analyses/ and no longer be visible after the move. Please contact Illumina Support with your data move job information to recover your data if you have encountered this issue
Fixed an issue where sorting on data format did not work correctly
Copying empty folders no longer results in a partially copied status
ICA now performs an automatic refresh after unlinking or deleting a sample
Improved handling of file path collisions when handling linked projects during data copy / move
Fixed an issue where, even though uploading a file in a linked folder is not permitted, this would erroneously present a success message without copying the file
Analysis events which are too large for SQS (256 KB) are now truncated to the first 1000 characters when using SQS
Improved error handling when trying to upload files which no longer exist
Fixed system degradation under load by introducing a rate limit of 25 spawned tasks per minute for a given analysis
The createUploadUrl endpoint can now be used to upload a file directly from the region where it is located. The user can create both new files and overwrite files in status "partial"
Improved the project data list command with wildcard support; for example (see also the CLI sketch after these examples):
/ or /* will return the contents of the root
/folder/ will return the folder
/folder/* will return the contents of the folder
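Expressed as CLI calls, a minimal sketch of the behaviour above (assuming the path is passed directly to the projectdata list command; verify the exact syntax with icav2 projectdata list --help):
icav2 projectdata list /            # contents of the root (equivalent to /*)
icav2 projectdata list /folder/     # returns the folder itself
icav2 projectdata list "/folder/*"  # returns the contents of the folder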
To optimize performance, a limit has been set to prevent concurrent uploading of more than 100 files
Fixed an issue where folder syncing functionality would sometimes result in “Unhandled exception in scheduling worker”
Flow
Fixed an issue where writing a pipeline which executes files in the /data/bin folder wasn't functioning properly with larger storage classes
Nextflow pipelines no longer require pipeline inputs when starting them via the CLI
Improved error handling when using an unsupported data format in the XML input forms during pipeline creation
Fixed the issue where it was not possible to add links in the detail page for pipelines and bundles
Sorting is no longer supported on duration and average duration columns for analysis grids
In situations where the user would previously get the error "zero choices with weight >= 1" after the first attempt, additional retries will execute to prevent this from occurring
Cohorts
Fixed an issue resulting in a blank error when a cohort with hundreds of diagnostic concepts was created
Features and Enhancements
Flow
Improved analysis queue times during periods of limited compute resource availability
Features and Enhancements
General
New notification to the user when a copy job finishes running
Updated the "GET analysis storage" API endpoint to account for the billing mode of the project. If the billing mode of the project is set to tenant, then the analysis storage of the user's tenant will be returned. If the billing mode of the project is set to project, then the analysis storage of the project's owner tenant will be returned
A ReadMe file containing used libraries and licenses is now available for ICA CLI
Data Management
New DataFormats YAML (.yaml, .cwl), JAVASCRIPT (.js, .javascript), GROOVY (.groovy, .nf), DIFF (.diff), SH (.sh), SQL (.sql) to determine the used syntax highlighting when displaying/editing new pipeline files in the UI
ICAv2 CLI supports moving data both between and within projects
Added an alert to notify users when data sharing or data move is disabled for the project
A new version of the Event Log endpoint has been developed to support paging, retrieval of previous events, and resolution of inconsistencies in date formats. This new endpoint introduces the EventLogListV4 data transfer object
The user is now able to select a single file and download it from the browser directly. This does not apply for folders and multiple files selected at once
Users can subscribe to notifications when data is unarchived
The BaseSpace Sequencing Run Experiment name will now be added to the technical tags when a workflow session is launched
Flow
Fastqs with the .ora extension are now supported when staging these for secondary analysis, either as a list of fastqs or as fastq_list_s3.csv files
Before, users had to click on the pipeline on the pipeline overview screen to start a new analysis. Now, you will enter the pipeline in edit mode when you click on the pipeline name. If you want to select a pipeline to start an analysis, you need to check the checkbox
Fixed Issues
General
Removed the refresh button from the workspace detail view as it was superfluous
Fixed an issue where searching for certain characters in the search field of the Projects or Data overviews screen would result in an indecipherable error
Improved security handling around tenant admin-level users in the context of data move
Data Management
Fixed a bug where copying a folder from another previously copied folder resulted in corrupted files
Fixed an issue where creating a new bundle would result in an error if a project with the same name already exists
Data move between projects from different tenants is now supported
Fixed an issue where not selecting files before using the copy or move commands would result in EmptyDataId errors
For the CLI, improved notifications when files cannot be downloaded correctly
Fixed an issue where scheduled downloads of linked data would fail without warning
Corrected an issue where the tenant billing mode would be erroneously set to Illumina after a data copy
Fixed an issue where BatchCopy on linked data did not work
Flow
Resolved an issue to ensure that when a user creates a pipeline using a docker image shared from an entitled bundle, their analyses utilizing that pipeline can pull the docker image without errors
Removed superfluous options from the analysis status filter
Awaiting input
Pending request
Awaiting previous unit
Fixed an issue where writing a pipeline which executes files in the /data/bin folder wasn't functioning properly with larger storage classes
Fixed an issue where many-step analyses are getting stuck in "In Progress" status
Fixed an issue where the wrapper scripts when running a CommandLineTool in CWL would return a warning
Fixed the issue which caused the "Save as" option not to work when saving pipelines
Base
Fixed an issue where the ICA reference fields in the schema definition had the wrong casing. As a result of this update you might end up with 2 different versions of the reference data (one with keys written with an uppercase letter at the start, another one with keys written entirely in lowercase letters). To fix this:
Update your queries and use the Snowflake function GET_IGNORE_CASE (e.g., select GET_IGNORE_CASE(to_object(ica), 'data_name') from testtableref)
Update the 'old' field names to the new ones (e.g., update testtableref_orig set ica = object_delete(object_insert(ica, 'data_name', ica:Data_name), 'Data_name'))
Fixed an issue where using an expression to filter the "Base Job Success" event is not working
Fixed Issues
Flow
Resolved an issue to ensure that when a user creates a pipeline using a docker image shared from an entitled bundle, their analyses utilizing that pipeline can pull the docker image without errors.
Features and Enhancements
General
The left side navigation bar will collapse by default for screens smaller than 800 pixels. The user can expand it by hovering over it
The browser URL may be copied to share analyses, pipelines, samples, tools, workspaces and data in various contexts (project, bundle)
Data Management
Users are now able to move data within and across projects:
The user can:
Move available data
Move up to 1000 files and/or folders in 1 move operation
Retain links to entities (sample, sequencing run, etc.) and other meta-data (tags, app-info) when moving
Move data within a project if the user is a contributor
Move data across projects if (1) in the source project the user has download rights, has at least contributor rights, and data sharing is enabled, and (2) the user has upload rights and at least viewer rights in the target project
Move data across projects with different types of storage configurations (user-defined or default ICA-managed storage)
Select and move data to the folder they are currently in through the graphical UI
Select and move data in a destination project and/or folder through the API
The user cannot:
Move linked data. Only the source data can be moved
Move data to linked data. Can only move data to the source data location
Move data to a folder that is in the process of being moved
Move data which is already in the first level of the destination folder
Move data to a destination folder which would create a naming conflict such as a file name duplicate
Move data across regions
New Event Log entries are provided when a user links (ICA_BASE_100) or unlinks (ICA_BASE_101) a Cohorts data set to a bundle
Added support for the following data formats: ora, adat, zarr, tiff and wsi
Flow
New compute types (Transfer Small, Transfer Medium, Transfer Large) are supported and can be used in upload and download tasks to significantly reduce overall analysis runtimes (and overall cost)
API: All the endpoints containing pipeline information now contain the status from the pipeline(s) as well
Bench
External Docker images will no longer display a status as they consistently showed 'Available,' even when the URL is not functional
Cohorts
Performance improvements to needle plot by refactoring its API endpoint to return only sample IDs
Users now click a cancel button that returns them to the landing page
Users can now perform time series analysis for a single patient view
Refresh of PrimateAI data now drives data in variant tables
Users can now access the structural variant tab in the Marker frequency section
Fixed Issues
General
Fixed an issue where, when a user is added to or removed from a workgroup, they could be stuck on an infinite redirect loop when logging in
Fixed syncing discrepancy issues about deleted files in user-managed storage projects with Lifecycle rules & Versioning
Data Access & Management
Sorting API responses for the endpoint GET /api/jobs is possible on the following criteria: timeCreated, timeStarted and timeFinished
Improved the error message when trying to link a bundle which is in a different region than the project
More documentation has been added to the GET /eventLog regarding the order of rows to fetch
Fixed an issue where the API call - POST api/projects/{projectId}/permissions would return an error when DATA_PROVIDER was set for roleProject
Fixed an issue stemming from attempts to copy files from the same source to the same destination, which incorrectly updated file statuses to Partial
CLI: Fixed an issue where the environment variable ICAV2_X_API_KEY did not work
Flow
The analysis is no longer started from the API if error 400 ('Content-Type' and 'Accept' do not match) occurs
Base
Fixed an issue where the Base schedule would not run automatically in some cases when files are present in the schedule
Bench
Improved error handling when trying to create a tool with insufficient permissions
Fixed an issue where the user was unable to download a Docker image with an ad hoc subscription
The "version":"string" field is now included in the API response GET /api/referenceSets. If no version is specified, the field is set to "Not Specified"
Fixed an issue where, under some conditions, fetching a job by id would throw an error if the job was in pending status
Features and Enhancements
Data Management
The GUI now has a limit of 100 characters for the name and 2048 characters for the URL for links in pipelines and bundles
Added a link to create a new connector if needed when scheduling a data download
Improved the data view with additional filtering in the side panel
Flow
New CLI environment variable ICA_NO_RETRY_RATE_LIMITING allows users to disable the retry mechanism (see the sketch after this list). When it is set to "1", no retries are performed. For any other value, HTTP code 429 will result in 4 retry attempts after 0.5, 2, 10, and 30 seconds
Code-based pipelines will alphabetically order additional files next to the main.nf or workflow.cwl file
When the Compute Type is unspecified, it will be determined automatically based on CPU and Memory values using a "best fit" strategy to meet the minimum specified requirements
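A minimal shell sketch of disabling the retry mechanism (the variable name and the value "1" are taken from the note above; the follow-up command is just an example):
export ICA_NO_RETRY_RATE_LIMITING=1   # disable retries on HTTP 429 for this shell session
icav2 projects list                   # this and later CLI calls will now fail immediately on rate limiting instead of retrying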
Bench
Paths can be whitelisted as allowed URLs in restricted settings
Fixed Issues
General
Fixed an issue where the online help button does not work upon clicking on it
Data Access & Management
Improved automatic resource cleanup when hiding a project
Fixed an issue with the service connector where leading blanks in the path of an upload/download rule would result in errors. It is no longer possible to define rules with leading or trailing blanks
Fixed an issue where a folder copy job fails if the source folder doesn't have metadata set
Linking data to sample has been made consistent between API and GUI
Improved resource handling when uploading large amounts of files via the GUI
Fixed an issue where the API endpoint to retrieve input parameters for a project pipeline linked to a bundle would fail when the user is not entitled on the bundle
Fixed an issue where deleting and adding a bundle to a project in one action does not work
Flow
The event sending protocol was rewritten to limit prematurely exhausting event retries and potentially leaving workflows stuck when experiencing high server loads or outages
Fixed an issue where specifying the minimum number of CPUs using coresMin in a CWL workflow would always result in the allocation of a standard-small instance, regardless of the coresMin value specified
Fixed an issue in the API endpoint to create a Nextflow analysis where tags were incorrectly marked as mandatory inputs in the request body
Fixed an issue with intermittent failures following completion of a workflow session
Base
Improved syntax highlighting in Base queries by making the different colors more distinguishable
Bench
Fixed an issue where the Bench workspace disk size cannot be adjusted when the workspace is stopped. Now, the adjusted size is reflected when the workspace is resumed
Fixed an issue where regions were not populating correctly for Docker images
Fixed an issue where API keys do not get cleaned up after failed workspace starts, leading to unusable workspaces once the API key limit is reached
Features and Enhancements
Cohorts
Users can now query variant lists with a large number of associated phenotypes
Users can now perform multiple concurrent data import jobs
Fixed Issues
Cohorts
Fixed an issue with displaying shared views when refreshing a Bundle’s shared database in Base
Fixed Issues
Fixed an issue where autolaunch is broken for any users utilizing run and samplesheet inputs stored in BSSH and operating in a personal context, rather than a workgroup.
Features and Enhancements
Data Management
Data (files and folders) may be copied from one folder to another within the same Project
The empty ‘URN’ field in the Project details at Project creation is now removed
The ‘Linked Bundles’ area in the Project details at Project creation is now removed as you are only allowed to link Bundles after Project creation
The card or grid view selected will become the default view when navigating back to the Projects or Bundles views
Added new API endpoints to retrieve and accept the Terms & Conditions of an entitled bundle (a curl sketch follows the list):
/api/entitledbundles/{entitledBundleId}/termsOfUse
/api/entitledbundles/{entitledBundleId}/termsOfUse/userAcceptance/currentUser
/api/entitledbundles/{entitledBundleId}/termsOfUse:accept
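A hedged curl sketch of calling these endpoints (the base URL https://ica.illumina.com/ica/rest and the X-API-Key authentication header are assumptions; substitute the values used in your environment):
# Retrieve the current terms of use for an entitled bundle (assumed base URL and auth header)
curl -H "X-API-Key: $ICA_API_KEY" "https://ica.illumina.com/ica/rest/api/entitledbundles/<entitledBundleId>/termsOfUse"
# Accept the current terms of use on behalf of the calling user
curl -X POST -H "X-API-Key: $ICA_API_KEY" "https://ica.illumina.com/ica/rest/api/entitledbundles/<entitledBundleId>/termsOfUse:accept"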
Flow
Added a new API endpoint to retrieve orchestrated analyses of a workflow session
GET /api/projects/{ProjectID}/workflowSessions/{WorkflowSessionID}/analyses
Code-based pipelines will alphabetically order additional files next to the main.nf or workflow.cwl file
Bench
New JupyterLab - 1.0.19 image published for Bench using the Ubuntu 22.04 base image
Resources have been expanded to include more options for compute families when configuring a workspace. See ICA help documentation for more details
Cohorts
Sample count for an individual cohort may be viewed in the variants table
Filter the variants list table through the filter setting in the needle plot
Execute concurrent jobs from a single tenant
Improved the display of error and warning messages for import jobs
Structural variant tab may be accessed from the Marker frequency section
Fixed Issues
Data Access & Management
Bundles now reflect the correct status when they are released instead of the draft status
Double clicking a file opens the data details popup only once instead of multiple times
Improved performance to prevent timeouts during list generation which resulted in Error 500
The counter is now accurately updated when selecting refresh in the Projects view
Fixed an issue where running two or more file copy jobs at the same time to copy files from the same project to the same destination folder resulted in one job succeeding and the others failing
Fixed an issue resulting in an error in a sample when linking nested files with the same name
Added a new column to the Source Data tab of the Table view which indicates the upload status of the source data
Removed the unused ‘storage-bundle’ field from the Data details window
Fixed an issue where the Project menu does not update when navigating into a Project in Chrome browsers
(CLI) Fixed an issue where deleting a file/folder via path would result in an error on Windows CLI
Base
Improved schedule handling to prevent an issue where some files were not correctly picked up by the scheduler in exceptional circumstances
Fixed an issue where an incorrect owning tenant is set on a schedule when running it before saving
The number of returned results which is displayed on the scheduler when trying to load files now reflects the total number of files instead of the maximum number of files which could be displayed per page
Fixed an issue where Null Pointer Exception is observed when deleting Base within a Project
Bench
Fixed an issue where users were unable to delete their own Bench image(s) from the docker repository
Cohorts
Fixed an issue where the value in the tumor_or_normal field, in the phenotype table in database, would not set properly for germline and somatic mutation data
Fixed an issue where large genes with subjects containing large sets of diagnostic concepts caused a 503 error
Fixed Issues
Fixed an issue where automated analysis after sequencing run in non-US regions may fail for certain analysis configurations
Features and Enhancements
Data Management
The --exclude-source-path flag has been added to the ‘project data download’ command so that subfolders can be downloaded to the target path without including the parent path
The system automatically re-validates storage credentials updated in the graphical UI
Added a new API endpoint to validate storage configurations after credentials are changed: /api/storageConfigurations/{storageConfigurationId}:validate
Notifications
Added support for multi-version notification event payloads corresponding to versioned API response models
Flow
(API) Improved the analysis-dto by adding a new POST search endpoint as a replacement for the search analysis GET endpoint. The GET endpoint will keep working but we advise using the new POST endpoint.
Improved analysis statuses to reflect the actual status more accurately
Parallelized analysis input data downloads and output data uploads to reduce overall analysis time
No scratch size is allocated if tmpdirMin is not specified
Cohorts
Performance improvements of the ingestion pipeline
Performance improvements to subject list retrieval
Increased the character limit of ingestion log messages to the user
Fixed Issues
Data Access & Management
Fixed an issue where the target user cannot see analysis outputs after a successful transfer of analysis ownership in BaseSpace Sequence Hub
Updated the API Swagger documentation to include paging information for: /projects/{projectId}/samples/{sampleId}/data
Fixed an issue resulting in errors when creating a new bundle version
Fixed an issue where the GET API call with the ‘Sort’ parameter returns an error when multiple values are separated by commas followed by a space
Fixed an issue where adding the --eligible-link flag to the ‘projectdata list’ API endpoint caused other flags to not work correctly
Added cursor-based pagination for the ‘projectdata list’ API endpoint
Fixed an issue with the entitled bundles cards view where the region is cut off when the Status is not present
Fixed an issue where bundle filtering on categories did not work as expected
Fixed an issue where file copy across tenants did not work as expected
Added a cross-account permission check so that file copy jobs fail when the cross-account set up is missing instead of being retried indefinitely
Fixed an issue where ‘Get Projects’ API endpoint returns an error when too many projects are in the tenant
Fixed an issue where the UpdateProject API call (PUT /api/projects/{projectId}) returns an error when technical tags are removed from the request
Fixed an issue where users need to confirm they want to cancel an action multiple times when clicking the back button in the graphical UI
Fixed an issue where clicking into a new version of a bundle from the details view does not open the new version, and instead directs to the bundle card view
Flow
Fixed an issue where the analysis logs are returned in the analysis screen “outputs” section and included in the getAnalysisOutputs API response. The log output is no longer considered as part of the analysis outputs
Analysis history screen has been removed
Fixed an issue resulting in inability to retrieve pipeline files via the API when the pipeline is shared cross-tenant
Fixed an issue where the API endpoint to retrieve files for a project pipeline would not return all files for pipelines created via CLI or API
Fixed an issue where the API does not check the proprietary flag of a pipeline before retrieving or downloading the pipeline files
Base
The ‘Download’ button is available to download Base activity data locally (and replaces the non-functional ‘Export’ button for restricted bundles)
Fixed an issue resulting in missing ICA reference fields in table records if the file was loaded into the table with no metadata
Improved consistency of the references included in the scheduler
Bench
Users are now logged out from a terminal window opened in a workspace after a period of inactivity
Fixed an issue where permissions could not be enabled after a workspace has been created
Fixed an issue where a Contributor could not start/stop a workspace
Cohorts
Fixed an issue where large genes with subjects with large sets of diagnostic concepts cause a 503 error
Fixed an issue where the value in tumor_or_normal field in the phenotype table in the database is not set properly for germline and somatic mutation data
Resolved a discrepancy between the number of samples reported when hovering over the needle plot and the variant list
Features and Enhancements
General
Data Management
Users are now able to revalidate storage configurations in an Error state
Improved existing endpoints and added new endpoints to link and unlink data to a bundle or a project in batch:
POST /api/projects/{projectId}/dataUnlinkingBatch
GET /api/projects/{projectId}/dataUnlinkingBatch/{batchId}
GET /api/projects/{projectId}/dataUnlinkingBatch/{batchId}/items
GET /api/projects/{projectId}/dataUnlinkingBatch/{batchId}/items/{itemId}
Flow
Analyses started via the API can now leverage data stored in BaseSpace Sequence Hub as input
ICA now supports auto-launching analysis pipelines upon sequencing run completion with run data stored in BaseSpace Sequence Hub (instead of ICA)
Updated the API for creating pipelines to include "proprietary" setting, which hides pipeline scripts and details from users who do not belong to the tenant which owns the pipeline and prevents pipeline cloning.
Cohorts
Added support for partial matches against a single input string to the “Search subjects” flexible filtering logic
Users can now view an overview page for a gene when they search for it or click on a gene in the marker frequency charts
ICA Cohorts includes access to both pathogenic and benign variants, which are plotted in the “Pathogenic variants” track underneath the needle plot
Ingestion: UI notifications and/or errors will be displayed in the event of partially completed ingestions
Users can share cohort comparisons with any other users with access to the same project
Fixed Issues
General
Improved the project card view in the UI
Fixed an issue with user administration where changing the permissions of multiple users at the same time would result in users receiving Invalid OAuth access token messages
Data Access & Management
Improved the error message when downloading project data if the storage configuration is not ready for use
Fixed an issue causing Folder Copy jobs to time out and restart, resulting in delays in copy operations
Fixed an issue where only the Docker image of the first restricted bundle that was added could be selected
Improved the performance of folder linking with "api/projects/{ProjectID}/dataLinkingBatch"
The URL for links for "post/api/bundles" endpoint can be up to 2048 characters long
Improved the error response when using offset-based paging on API responses which contain too much data and require cursor-based paging
Fixed an issue resulting in failures downloading data from CLI using a path
The correct error message is displayed if the user does not have a valid subscription when creating a new project
Fixed an issue where changing ownership of a project does not change previous owner access for Base tables
Flow
Input parameters of pipelines are now displayed in the "label (code)" format unless there is no label available or the label equals the code, in which case only the code is shown
Fixed an issue where multiple folders were created upon starting new analyses
Fixed an issue preventing analyses from using inputs with BaseSpace v1pre3 APIs
Fixed an issue causing analyses with a specified output path to incorrectly return an error stating that the data does not exist
The following endpoint "/api/projects/{projectId}/workflowSessions/{workflowSessionId}/inputs" now supports using external data as input
Any value other than "economy" or "standard" for submitted analysis jobs will default to "standard"
The parameter to pass an activationcode is now optional for start-analysis API endpoints
Base
Improved the display of errors in the activity jobs screen if a Meta Data schedule fails
If an error occurs when processing metadata, a failed job entry will be added in the Base Activity screen
Fixed an issue where records ingested via schedules from the same file could be duplicated
Fixed an issue where exporting the view shared via bundle would show an error 'Could not find data with ID (fol. ....)'
Resolved a NullPointerException error when clicking on Format and Status filters in the details screen of a Schedule in the Results tab
Fixed an issue where a schedule download would fail when performed by a different user than the initial user
Bench
Fixed an issue when trying to query a Base table with a high limit within a workspace
Fixed an apt-get error when building images due to an outdated repository
Fixed an issue where a stopped workspace would display "Workspace paused" instead of "Workspace stopped"
Fixed an issue where large files (e.g., 150GB+) could not be downloaded to a fuse-driver location from a Workspace, and set the new limit to 500GB
Cohorts
Fixed an issue where split Nirvana JSON files are not recognized during ingestion
Fixed an issue causing the UI to hang on large genes and return a 502 error
Fixed an issue where OMOP files are not correctly converted to CAM data model, preventing OMOP data ingestions
Fixed an issue where large OMOP drug ingestions led to memory issues, preventing further drug data ingestion
Fixed an issue where users from a different tenant accessing a shared project could not ingest data
Click the refresh button in the upper right corner of the ICA environment page to update the status.
System notifications (both regional and global), which were previously available elsewhere, are now also shown in the ICA UI when an important ICA message needs to be communicated
POST Creates a file in this project, and retrieves temporary credentials for it
POST Creates a file in this project, and retrieves an upload url for it
POST Creates a folder in this project, and retrieves temporary credentials for it
POST Creates a folder in this project, and creates a trackable folder upload session
Users can now access the system via or
In this tutorial, we will demonstrate how to create and launch a CWL pipeline using the ICA command line interface (CLI).
Please refer to these instructions for installing ICA CLI.
In this project, we will create two simple tools and build a workflow that we can run on ICA using the CLI. The first tool (tool-fqTOfa.cwl) will convert a FASTQ file to a FASTA file. The second tool (tool-countLines.cwl) will count the number of lines in an input FASTA file. The workflow (workflow.cwl) will combine the two tools to convert an input FASTQ file to a FASTA file and count the number of lines in the resulting FASTA file.
Following are the two CWL tools and the workflow script we will use in the project. If you are new to CWL, please refer to the CWL user guide for a better understanding of CWL code. You will also need cwltool installed to create these tools and workflows. You can find installation instructions on the CWL GitHub page.
[!IMPORTANT] Please note that we don't specify the Docker image used in either tool. In that case, the default behaviour is to use the public.ecr.aws/docker/library/bash:5 image. This image contains basic functionality (sufficient to execute the wc and awk commands).
In case you want to use a different public image, you can specify it using the requirements tag in the CWL file. For example, if you want to use 'ubuntu:latest', you need to add a DockerRequirement with the dockerPull key set to that image.
In case you want to use a Docker image from the ICA Docker repository, you would need the link to AWS ECR from the ICA GUI. Double-click on the image name in the Docker repository and copy the URL to the clipboard. Add the URL to the dockerPull key.
To add a custom or public docker image to the ICA repository, please refer to the Docker Repository.
Before you can use ICA CLI, you will need to authenticate using the Illumina API key. Please follow these instructions to authenticate.
You can create a project or use an existing project for creating a new pipeline. You can create a new project using the "icav2 projects create" command.
If you do not provide the "--region" flag, the value defaults to the existing region when there is only one region available. When there is more than one region available, a selection must be made from the available regions at the command prompt. The region input can be determined by calling the "icav2 regions list" command first.
You can select the project to work on by entering the project using the "icav2 projects enter" command. Thus, you won't need to specify the project as an argument.
You can also use the "icav2 projects list" command to determine the names and ids of the project you have access to.
"projectpipelines" is the root command to perform actions on pipelines in a project. "create" command creates a pipeline in the current project.
The parameter file specifies the input for the workflow with additional parameter settings for each step in the workflow. In this tutorial, the input is a FASTQ file shown inside the <dataInput> tag in the parameter file. There aren't any specific settings for the workflow steps, so the parameter file below has an empty <steps> tag. Create a parameter file (parameters.xml) with the following content using a text editor.
The following command creates a pipeline called "cli-tutorial" using the workflow "workflow.cwl", the tools "tool-fqTOfa.cwl" and "tool-countLines.cwl", and the parameter file "parameters.xml" with small storage size.
Once the pipeline is created, you can view it using the "list" command.
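A hedged sketch of the create and list commands (the subcommand structure and flag names below are illustrative assumptions, not confirmed options; check icav2 projectpipelines create --help for the flags supported by your CLI version):
# Flag names are assumptions: --workflow, --tool, --parameter and --storage-size are placeholders for the real options
icav2 projectpipelines create cwl cli-tutorial --workflow workflow.cwl --tool tool-fqTOfa.cwl --tool tool-countLines.cwl --parameter parameters.xml --storage-size small
icav2 projectpipelines list   # verify that the cli-tutorial pipeline now appears in the project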
Upload data to the project using the "icav2 projectdata upload" command. Please refer to the Data page for advanced data upload features. For this test, we will use a small FASTQ file test.fastq containing the following reads.
The "icav2 projectdata upload" command lets you upload data to ica.
The "list" command lets you view the uploaded file. Note the ID of the file you want to use with the pipeline.
The "icav2 projectpipelines start" command initiates the pipeline run. The following command runs the pipeline. Note the id for exploring the analysis later.
Note: If for some reason your "create" command fails and needs to rerun, you might get an error (ConstraintViolationException). If so, try your command with a different name.
You can check the status of the run using the "icav2 projectanalyses get" command.
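A hedged sketch of starting the pipeline and checking its status (the --input and --user-reference options are assumptions; check icav2 projectpipelines start --help, and replace the placeholders with the IDs noted above):
# Only the commands themselves are taken from this tutorial; the option names are illustrative
icav2 projectpipelines start cwl cli-tutorial --input <input-code>:<file-id> --user-reference cli-tutorial-run
icav2 projectanalyses get <analysis-id>   # check the status using the analysis ID returned by the start command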
The pipelines can be run using JSON input type as well. The following is an example of running pipelines using JSON input type. Note that JSON input works only with file-based CWL pipelines (built using code, not a graphical editor in ICA).
The runtime.ram and runtime.cpu values are by default evaluated using the compute environment running the host CWL runner. CommandLineTool steps within a CWL Workflow run on different compute environments than the host CWL runner, so the evaluated runtime.ram and runtime.cpu values inside a CommandLineTool will not match the runtime environment the tool is actually running in. The evaluation of runtime.ram and runtime.cpu can be overridden by specifying coresMin and ramMin in the ResourceRequirement for the CommandLineTool.