Import New Samples
ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.
To import a new data set, select Import Jobs from the left navigation tab underneath Cohorts, and click the Import Files button. The Import Files button is also available under the Data Sets left navigation item.
The Data Sets menu item is used to view imported data sets and information. The Import Jobs menu item is used to check the status of data set imports.
Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.
- 1. Choose a data type among:
  - Germline variants
  - Somatic mutations
- 2. Choose a new study name by selecting the radio button: Create new study and entering a
- 3. To add new data to an existing Study, select the radio button: Select from list of studies and select an existing Study Name from the dropdown.
- 4. To add data to existing records or add new records, select
- 5. To replace data, select Replace. If you are ingesting data again, use the Replace job type.
- 6. Enter an optional
- 7. Select the metadata model (default: Cohorts; alternatively, select OMOP version 5.4 if your data is formatted that way).
- 8. Select the genome build your molecular data is aligned to (default: GRCh38/hg38).
- 9. For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.
- 11. Navigate to VCFs located in the Project Data.
- 12. Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.
- 14. Navigate to the metadata (phenotype) TSV file in the Project Data.
- 15. Select the TSV file or files for ingestion.
All VCF types, specifically those produced by DRAGEN, can be ingested using the Germline variants selection. Cohorts detects the variant type of each file it ingests. If Cohorts cannot determine the variant file type, it defaults to ingesting small variants.
As an alternative to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.
The sample identifiers used in the VCF columns need to match the sample identifiers used in subject/sample metadata files. Likewise, if you are starting from JSON files containing variant- and gene-level annotations provided by Illumina Nirvana, the samples listed in the header need to match the metadata files.
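The matching requirement above can be checked before starting an import. The following is a minimal sketch, not part of ICA Cohorts: it assumes an uncompressed VCF and a tab-delimited metadata file whose sample identifier column is named `sample_id`. That column name and all function names are placeholders chosen for illustration; substitute the column your metadata file actually uses.

```python
import csv
from typing import Iterable, List, Set


def vcf_sample_ids(vcf_lines: Iterable[str]) -> List[str]:
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM through FORMAT);
            # every column after that is a sample name.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


def metadata_sample_ids(tsv_lines: Iterable[str],
                        column: str = "sample_id") -> Set[str]:
    """Collect identifiers from one column of a tab-delimited file.

    The default column name is a placeholder, not a Cohorts field name.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return {row[column] for row in reader}


def unmatched_samples(vcf_lines: Iterable[str],
                      tsv_lines: Iterable[str],
                      column: str = "sample_id") -> List[str]:
    """Return VCF sample names that are missing from the metadata column."""
    known = metadata_sample_ids(tsv_lines, column)
    return [s for s in vcf_sample_ids(vcf_lines) if s not in known]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. An empty result means every sample in the VCF has a metadata record.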
ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:
- ##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)
- ##contig=<ID=chr1,length=248956422> --- for hg38/GRCh38
- ##DRAGENCommandLine= ... --ht-reference
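To see which of these header rows a given VCF carries, a small pre-flight check can help. The sketch below is an illustration only and does not reproduce how Cohorts itself inspects headers. It looks for hg38/GRCh38 evidence as described in the list above: the `##reference` test is a crude substring match on "38", and the `##DRAGENCommandLine` test only checks that `--ht-reference` is present. The function name and the returned labels are hypothetical.

```python
import re
from typing import Iterable, List

# chr1 length for hg38/GRCh38, taken from the ##contig example above.
HG38_CHR1_LENGTH = 248956422


def build_header_rows(vcf_lines: Iterable[str]) -> List[str]:
    """Return labels for the build-identifying header rows that were found."""
    found = []
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        if line.startswith("##reference=") and "38" in line:
            # Crude check: the numerical value anywhere in the row.
            found.append("reference")
        elif line.startswith("##contig="):
            # Match chromosome 1 only, with or without the "chr" prefix.
            match = re.search(r"ID=(?:chr)?1,.*?length=(\d+)", line)
            if match and int(match.group(1)) == HG38_CHR1_LENGTH:
                found.append("contig")
        elif (line.startswith("##DRAGENCommandLine=")
              and "--ht-reference" in line):
            found.append("dragen")
    return found
```

An empty list suggests the file has none of the three rows in a form this sketch recognizes, which is worth checking by hand before ingestion.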
ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process [see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.
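Because each import job is limited to one genome build, a mixed collection of files has to be split before ingestion. The sketch below shows one way to group files into per-build batches. It assumes you have already determined the build of each file by some other means; the function name and the build labels are placeholders.

```python
from collections import defaultdict
from typing import Dict, List


def batches_by_build(file_builds: Dict[str, str]) -> Dict[str, List[str]]:
    """Group file names by genome build label, one batch per build."""
    batches = defaultdict(list)
    for path, build in file_builds.items():
        batches[build].append(path)
    return dict(batches)
```

Each resulting list would then be submitted as its own import job with the matching genome build selected.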
As an alternative to VCFs, ICA Cohorts accepts the JSON output of Illumina Nirvana for hg38/GRCh38-aligned data for small germline variants and somatic mutations, copy number variations, and other structural variants.
ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. File names need to end in .quant.genes.sf for gene-level TPM and in .quant.sf for transcript-level TPM (transcripts per million).
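The naming convention can be checked with a simple suffix test. This is a minimal sketch with a hypothetical function name; it only inspects the file name and does not validate file contents.

```python
from typing import Optional


def quant_level(filename: str) -> Optional[str]:
    """Classify a quantification file by its suffix, or return None."""
    if filename.endswith(".quant.genes.sf"):
        return "gene"
    if filename.endswith(".quant.sf"):
        return "transcript"
    return None  # does not follow the expected naming convention
```

For example, `quant_level("sampleA.quant.genes.sf")` returns `"gene"`, and a file name with neither suffix returns `None`.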
Note: If you are annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole-genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.
As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the OMOP common data model 5.4. Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:
- PERSON (mandatory),
- CONCEPT (mandatory if any of the following is provided),
- CONDITION_OCCURRENCE (optional),
- DRUG_EXPOSURE (optional), and
- PROCEDURE_OCCURRENCE (optional).
Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.
Note that Cohorts requires all such files to conform to the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.
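The table requirements listed above can be expressed as a short check to run before selecting files. The sketch below is an illustration under stated assumptions: it takes a list of table names (how you derive them from your file names is up to you), and the function name and messages are hypothetical. It does not validate column layout against the OMOP CDM 5.4 specification.

```python
from typing import Iterable, List

OPTIONAL_TABLES = {"CONDITION_OCCURRENCE", "DRUG_EXPOSURE", "PROCEDURE_OCCURRENCE"}
SUPPORTED_TABLES = OPTIONAL_TABLES | {"PERSON", "CONCEPT"}


def omop_table_problems(tables: Iterable[str]) -> List[str]:
    """Return a list of problems with the provided set of OMOP table names."""
    present = {name.upper() for name in tables}
    problems = []
    if "PERSON" not in present:
        problems.append("PERSON is mandatory")
    # CONCEPT becomes mandatory as soon as any optional table is provided.
    if present & OPTIONAL_TABLES and "CONCEPT" not in present:
        problems.append("CONCEPT is required when an optional table is provided")
    for name in sorted(present - SUPPORTED_TABLES):
        problems.append(f"{name} is not currently ingested")
    return problems
```

An empty list means the combination of tables matches the requirements described above.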
1. VcfMapper: https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py
2. crossMap: https://crossmap.sourceforge.net/
3. liftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver