Data Catalogue

Data Catalogues provide views on data from Illumina hardware and processes (Instruments, Cloud software, Informatics software and Assays) so that this data can be distributed to different applications. This data consists of read-only tables to prevent updates by the applications accessing it. Access to data catalogues is included with professional and enterprise subscriptions.

Available views

Project-level views

ICA_PIPELINE_ANALYSES_VIEW (Lists project-specific ICA pipeline analysis data)
ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (project-specific quality control metrics)

Tenant-level views

ICA_PIPELINE_ANALYSES_VIEW (Lists ICA pipeline analysis data)
CLARITY_SEQUENCINGRUN_VIEW_tenant (sequencing run data coming from the lab workflow software)
CLARITY_SAMPLE_VIEW_tenant (sample data coming from the lab workflow software)
CLARITY_LIBRARY_VIEW_tenant (library data coming from the lab workflow software)
CLARITY_EVENT_VIEW_tenant (event data coming from the lab workflow software)
ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (quality control metrics)

Preconditions for view content

DRAGEN metrics will only have content when DRAGEN pipelines have been executed.
Analysis views will only have content when analyses have been executed.
Views containing Clarity data will only have content if you have a Clarity LIMS instance with minimum version 6.0 and the Product Analytics service installed and configured. Please see the Clarity LIMS documentation for more information.

Who can add or remove Catalogue data (views) to a project?

Members of a project, who have both base contributor and project contributor or administrator rights and who belong to the same tenant as the project can add views from a Catalogue. Members of a project with the same rights who do not belong to the same tenant can remove the catalogue views from a project. Therefore, if you are invited to collaborate on a project, but belong to a different tenant, you can remove catalogue views, but cannot add them again.

Adding Catalogue data (views) to your project

To add Catalogue data,

Go to Projects > your_project > Base > Tables.
Select Add table > Import from Catalogue.
A list of available views will be displayed. (Note that views which are already part of your project are not listed)
Select the table you want to add and choose +Select

Catalogue data will have View as type, the same as tables which are linked from other projects.

Removing Catalogue data (views) from your project

To delete Catalogue data,

go to Projects > your_project > Base > Tables.
Select the table you want to delete and choose Delete.
A warning will be presented to confirm your choice. Once deleted, you can add the Catalogue data again if needed.

Catalogue table details (Catalogue Table Selection Screen)

View: The name of the Catalogue table.
Description: An explanation of which data is contained in the view.
Category: The identification of the source system which provided the data.
Tenant/project. Appended to the view name as _tenant or _project. Determines if the data is visible for all projects within the same tenant or only within the project. Only the tenant administrator can see the non-project views.

Catalogue table details (Table Schema Definition)

In the Projects > your_project > Base > Tables view, double-click the Catalogue table to see the details. For an overview of the available actions and details, see Tables.

Querying views

In this section, we provide examples of querying selected views from the Base UI, starting with ICA_PIPELINE_ANALYSES_VIEW (project view). This table includes the following columns: TENANT_UUID, TENANT_ID, TENANT_NAME, PROJECT_UUID, PROJECT_ID, PROJECT_NAME, USER_UUID, USER_NAME, and PIPELINE_ANALYSIS_DATA. While the first eight columns contain straightforward data types (each holding a single value), the PIPELINE_ANALYSIS_DATA column is of type VARIANT, which can store multiple values in a nested structure. In SQL queries, this column returns data as a JSON object. To filter specific entries within this complex data structure, a combination of JSON functions and conditional logic in SQL queries is essential.

Since Snowflake offers robust JSON processing capabilities, the FLATTEN function can be utilized to expand JSON arrays within the PIPELINE_ANALYSIS_DATA column, allowing for the filtering of entries based on specific criteria. It's important to note that each entry in the JSON array becomes a separate row once flattened. Snowflake aligns fields outside of this FLATTEN operation accordingly, i.e. the record USER_ID in the SQL query below is "recycled".

The following query extracts

USER_NAME directly from the ICA_PIPELINE_ANALYSES_VIEW_project table.
PIPELINE_ANALYSIS_DATA:reference and PIPELINE_ANALYSIS_DATA:price. These are direct accesses into the JSON object stored in the PIPELINE_ANALYSIS_DATA column. They extract specific values from the JSON object.
Entries from the array 'steps' in the JSON object. The query uses LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) to expand the steps array within the PIPELINE_ANALYSIS_DATA JSON object into individual rows. For each of these rows, it selects various elements (like bpeResourceLifeCycle, bpeResourcePresetSize, etc.) from the JSON.

Furthermore, the query filters the rows based on the status being 'FAILED' and the stepId not containing the word 'Workflow': it allows the user to find steps which failed.

SELECT
    USER_NAME as user_name,
    PIPELINE_ANALYSIS_DATA:reference as reference,
    PIPELINE_ANALYSIS_DATA:price as price,
    PIPELINE_ANALYSIS_DATA:totalDurationInSeconds as duration,
    f.value:bpeResourceLifeCycle::STRING as bpeResourceLifeCycle,
    f.value:bpeResourcePresetSize::STRING as bpeResourcePresetSize,
    f.value:bpeResourceType::STRING as bpeResourceType,
    f.value:completionTime::TIMESTAMP as completionTime,
    f.value:durationInSeconds::INT as durationInSeconds,
    f.value:price::FLOAT as price,
    f.value:pricePerSecond::FLOAT as pricePerSecond,
    f.value:startTime::TIMESTAMP as startTime,
    f.value:status::STRING as status,
    f.value:stepId::STRING as stepId
FROM
    ICA_PIPELINE_ANALYSES_VIEW_project,
    LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) f
WHERE
    f.value:status::STRING = 'FAILED'
    AND f.value:stepId::STRING NOT LIKE '%Workflow%';

Now let's have a look at DRAGEN_METRICS_VIEW_project view. Each DRAGEN pipeline on ICA creates multiple metrics files, e.g. SAMPLE.mapping_metrics.csv, SAMPLE.wgs_coverage_metrics.csv, etc for DRAGEN WGS Germline pipeline. Each of these files is represented by a row in DRAGEN_METRICS_VIEW_project table with columns ANALYSIS_ID, ANALYSIS_UUID, PIPELINE_ID, PIPELINE_UUID, PIPELINE_NAME, TENANT_ID, TENANT_UUID, TENANT_NAME, PROJECT_ID, PROJECT_UUID, PROJECT_NAME, FOLDER, FILE_NAME, METADATA, and ANALYSIS_DATA. ANALYSIS_DATA column contains the content of the file FILE_NAME as an array of JSON objects. Similarly to the previous query we will use FLATTEN command. The following query extracts

Sample name from the file names.
Two metrics 'Aligned bases in genome' and 'Aligned bases' for each sample and the corresponding values.

The query looks for files SAMPLE.wgs_coverage_metrics.csv only and sorts based on the sample name:

SELECT DISTINCT
    SPLIT_PART(FILE_NAME, '.wgs_coverage_metrics.csv', 1) as sample_name,
    f.value:column_2::STRING as metric,
    f.value:column_3::FLOAT as value
FROM
    DRAGEN_METRICS_VIEW_project,
    LATERAL FLATTEN(input => ANALYSIS_DATA) f
WHERE
    FILE_NAME LIKE '%wgs_coverage_metrics.csv'
    AND (
        f.value:column_2::STRING = 'Aligned bases in genome'
        OR f.value:column_2::STRING = 'Aligned bases'
    )
ORDER BY
    sample_name;

Lastly, you can combine these views (or rather intermediate results derived from these views) using the WITH and JOIN commands. The SQL snippet below demonstrates how to join two intermediate results referred to as 'flattened_dragen_scrna' and 'pipeline_table'. The query:

Selects two metrics ('Invalid barcode read' and 'Passing cells') associated with single-cell RNA analysis from records where the FILE_NAME ends with 'scRNA.metrics.csv', and then stores these metrics in a temporary table named 'flattened_dragen_scrna'.
Retrieves metadata related to all scRNA analyses by filtering on the pipeline ID from the 'ICA_PIPELINE_ANALYSES_VIEW_project' view and stores this information in another temporary table named 'pipeline_table'.
Joins the two temporary tables using the JOIN operator, specifying the join condition with the ON operator.

WITH flattened_dragen_scrna AS (   
SELECT DISTINCT
    SPLIT_PART(FILE_NAME, '.scRNA.metrics.csv', 1) as sample_name,
    ANALYSIS_UUID, 
    f.value:column_2::STRING as metric,
    f.value:column_3::FLOAT as value
FROM
    DRAGEN_METRICS_VIEW_project,
    LATERAL FLATTEN(input => ANALYSIS_DATA) f
WHERE
    FILE_NAME LIKE '%scRNA.metrics.csv'
    AND (
        f.value:column_2::STRING = 'Invalid barcode read'
        OR f.value:column_2::STRING = 'Passing cells'
    )
),
pipeline_table AS (
SELECT
    PIPELINE_ANALYSIS_DATA:reference::STRING as reference,
    PIPELINE_ANALYSIS_DATA:id::STRING as analysis_id,
    PIPELINE_ANALYSIS_DATA:status::STRING as status,
    PIPELINE_ANALYSIS_DATA:pipelineId::STRING as pipeline_id,
    PIPELINE_ANALYSIS_DATA:requestTime::TIMESTAMP as start_time
FROM
    ICA_PIPELINE_ANALYSES_VIEW_project
WHERE
    PIPELINE_ANALYSIS_DATA:pipelineId = 'c9c9a2cc-3a14-4d32-b39a-1570c39ebc30'
    )
SELECT * FROM flattened_dragen_scrna JOIN pipeline_table 
ON
     flattened_dragen_scrna.ANALYSIS_UUID = pipeline_table.analysis_id;

An example how to obtain the costs incurred by the individual steps of an analysis

You can use ICA_PIPELINE_ANALYSES_VIEW to obtained the costs of individual steps of an analysis. Using the following SQL snippet you can retrieve the costs of individual steps for every analyses run in the past week.

SELECT
    USER_NAME as user_name,
    PROJECT_NAME as project,
    SUBSTRING(PIPELINE_ANALYSIS_DATA:reference, 1, 30) as reference,
    PIPELINE_ANALYSIS_DATA:status as status,
    ROUND(PIPELINE_ANALYSIS_DATA:computePrice,2) as price,
    PIPELINE_ANALYSIS_DATA:totalDurationInSeconds as duration,
    PIPELINE_ANALYSIS_DATA:startTime::TIMESTAMP as startAnalysis,
    f.value:bpeResourceLifeCycle::STRING as bpeResourceLifeCycle,
    f.value:bpeResourcePresetSize::STRING as bpeResourcePresetSize,
    f.value:bpeResourceType::STRING as bpeResourceType,
    f.value:durationInSeconds::INT as durationInSeconds,
    f.value:price::FLOAT as priceStep,
    f.value:status::STRING as status,
    f.value:stepId::STRING as stepId
FROM
    ICA_PIPELINE_ANALYSES_VIEW_project,
    LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) f
WHERE
   PIPELINE_ANALYSIS_DATA:startTime > CURRENT_TIMESTAMP() - INTERVAL '1 WEEK'
ORDER BY
   priceStep DESC;

Limitations

Data Catalogue views cannot be shared as part of a Bundle.
Data size is not shown for views because views are a subset of data.
By removing Base from a project, the Data Catalogue will also be removed from that project.

Best Practices

As tenant-level Catalogue views can contain sensitive data, it is best to save this (filtered) data to a new table and share that table instead of sharing the entire view as part of a project. To do so, add your view to a separate project and run a query on the data at Projects > your_project > Base > Query > New Query. When the query completes, you can export the result as a new table. This ensures no new data will be added on consequent runs.

PreviousTables NextQuery

Last updated 4 months ago

Was this helpful?