Import New Samples
ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.
To import a new data set, select Import Jobs from the left navigation tab underneath Cohorts, and click the Import Files button. The Import Files button is also available under the Data Sets left navigation item.
The Data Set menu item is used to view imported data sets and information. The Import Jobs menu item is used to check the status of data set imports.
Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.
Choose one of the following data types:
Germline variants
Somatic mutations
RNAseq
GWAS
Choose a new study name by selecting the radio button Create new study and entering a Study Name.
To add new data to an existing Study, select the radio button Select from list of studies and select an existing Study Name from the dropdown.
To add data to existing records or add new records, select Job Type, Append. Append does not wipe out any data ingested previously and can be used to ingest the molecular data in an incremental manner.
To replace data, select Job Type, Replace. If you are ingesting data again, use the Replace job type.
Enter an optional Study description.
Select the metadata model (default: Cohorts; alternatively, select OMOP version 5.4 if your data is formatted that way).
Select the genome build your molecular data is aligned to (default: GRCh38/hg38).
For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.
Click Next.
Navigate to VCFs located in the Project Data.
Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.
As an alternative to selecting individual files, you can select a folder instead. Toggle the radio button on Step 2 from "Select files" to "Select folder".
This option is currently only available for germline variant ingestion: any combination of small variants, structural variation, and/or copy number variants.
ICA Cohorts will scan the selected folder and all sub-folders for any VCF files or JSON files and try to match them against the Sample ID column in the metadata TSV file (Step 3).
Files not matching sample IDs will be ignored. Allowed file extensions for VCF files after the sample ID are: *.vcf.gz, *.hard-filtered.vcf.gz, *.cnv.vcf.gz, and *.sv.vcf.gz.
Allowed file extensions for JSON files after the sample ID are: *.json, *.json.gz, *.json.bgz, and *.json.gzip.
Click Next.
Navigate to the metadata (phenotype) data TSV in the Project Data.
Select the TSV file or files for ingestion.
Click Finish.
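The folder option in Step 2 pairs files with sample IDs by file name. The following is a minimal sketch of that idea, not the matching logic ICA Cohorts itself uses: it assumes a file matches when its name is exactly the sample ID followed by one of the allowed extensions, and the function name `match_files_to_samples` is hypothetical.

```python
from pathlib import PurePosixPath

# Extensions allowed after the sample ID for the folder option.
VCF_SUFFIXES = (".vcf.gz", ".hard-filtered.vcf.gz", ".cnv.vcf.gz", ".sv.vcf.gz")
JSON_SUFFIXES = (".json", ".json.gz", ".json.bgz", ".json.gzip")


def match_files_to_samples(paths, sample_ids, suffixes=VCF_SUFFIXES):
    """Group file paths by sample ID; return (matched, ignored).

    Assumption: a file belongs to a sample when its base name equals
    '<sample ID><allowed suffix>'. The real matching rules may differ.
    """
    allowed = {sid + suffix: sid for sid in sample_ids for suffix in suffixes}
    matched = {sid: [] for sid in sample_ids}
    ignored = []
    for path in paths:
        name = PurePosixPath(path).name  # sub-folders do not affect matching
        sid = allowed.get(name)
        if sid is None:
            ignored.append(path)
        else:
            matched[sid].append(path)
    return matched, ignored
```

Running a check like this on a file listing before starting an import can show which samples in the metadata TSV would end up without molecular data.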
All VCF types, specifically from DRAGEN, can be ingested using the Germline variants selection. Cohorts will distinguish the variant types that it is ingesting. If Cohorts cannot determine the variant file type, it will default to ingesting small variants.
As an alternative to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.
The maximum number of files in a single manual ingestion batch is capped at 1000.
Alternatively, users can choose a single folder, and ICA Cohorts will identify all ingestible files within that folder and its sub-folders. In this scenario, Cohorts will select the molecular data files that match the samples listed in the metadata sheet, which is provided in the next step of the import process.
Users have the option to ingest either VCF files or Nirvana JSON files for any given batch, regardless of the chosen ingestion method.
The sample identifiers used in the VCF columns need to match the sample identifiers used in subject/sample metadata files; accordingly, if you are starting from JSON files containing variant- and gene-level annotations provided by Illumina Nirvana, the samples listed in the header need to match the metadata files.
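A mismatch between VCF sample columns and metadata sample identifiers can be caught before import. The sketch below reads the sample IDs from a VCF's `#CHROM` header line and reports any that are absent from a metadata TSV. The metadata column name `Sample ID` is an assumption about the exact header spelling, and the function names are illustrative only.

```python
import csv
import gzip


def vcf_sample_ids(vcf_path):
    """Return sample IDs from the #CHROM header line of a VCF (plain or gzipped)."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed (CHROM through FORMAT);
                # every column after that is a sample ID.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached data records without seeing a #CHROM line
    return []


def metadata_sample_ids(tsv_path, column="Sample ID"):
    """Collect sample IDs from a tab-delimited metadata file.

    Assumption: the sample identifier column is headed 'Sample ID'.
    """
    with open(tsv_path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle, delimiter="\t")}


def unmatched_samples(vcf_path, tsv_path, column="Sample ID"):
    """List VCF sample IDs that do not appear in the metadata file."""
    known = metadata_sample_ids(tsv_path, column)
    return [sid for sid in vcf_sample_ids(vcf_path) if sid not in known]
```

An empty list from `unmatched_samples` means every sample column in the VCF has a corresponding metadata row.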
Variant file formats
ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:
##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)
##contig=<ID=chr1,length=248956422> --- for hg38/GRCh38
##DRAGENCommandLine= ... --ht-reference
ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process [see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.
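Because each import job is limited to one genome build, it can help to sort VCFs by build beforehand. The sketch below applies the three header rules listed above. It is a heuristic illustration, not the detection logic Cohorts uses. The hg19/GRCh37 chr1 length (249250621) is a known value added here, and the assumption that the `--ht-reference` argument is a path naming the build is not stated in the text above.

```python
import re

# chr1 length 248956422 identifies hg38/GRCh38 (given above);
# 249250621 is the chr1 length in hg19/GRCh37 (added for this sketch).
CHR1_LENGTHS = {"248956422": "GRCh38", "249250621": "GRCh37"}


def build_from_text(text):
    """Guess the build from a reference path or file name."""
    prefixed = re.search(r"(?:hg|grch)(38|37|19)(?!\d)", text, re.IGNORECASE)
    if prefixed:
        return "GRCh38" if prefixed.group(1) == "38" else "GRCh37"
    # A bare numerical value is sufficient for hg38/GRCh38.
    if re.search(r"(?<!\d)38(?!\d)", text):
        return "GRCh38"
    return None


def detect_genome_build(header_lines):
    """Return 'GRCh38', 'GRCh37', or None from VCF '##' header lines."""
    for line in header_lines:
        line = line.strip()
        build = None
        if line.startswith("##reference="):
            build = build_from_text(line[len("##reference="):])
        elif line.startswith("##contig="):
            found = re.search(r"ID=(?:chr)?1,length=(\d+)", line)
            build = CHR1_LENGTHS.get(found.group(1)) if found else None
        elif line.startswith("##DRAGENCommandLine="):
            # Assumption: the option value is a path that names the build.
            found = re.search(r"--ht-reference[=\s]+(\S+)", line)
            build = build_from_text(found.group(1)) if found else None
        if build:
            return build
    return None
```

Files for which `detect_genome_build` returns `None` lack a usable header row and would need one added before ingestion.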
As an alternative to VCFs, ICA Cohorts accepts the JSON output of Illumina Nirvana for hg38/GRCh38-aligned data for small germline variants and somatic mutations, copy number variations, and other structural variants.
RNAseq file format
ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. File names need to match .quant.genes.sf for gene-level TPM and .quant.sf for transcript-level TPM (transcripts per million).
Please also see the online documentation for the Illumina DRAGEN RNA Pipeline for more information on output file formats.
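The naming convention above can be checked with a short helper before selecting files. This is a sketch only: it assumes "match" means the file name ends with the given suffix, and `classify_quant_file` is a hypothetical name.

```python
def classify_quant_file(filename):
    """Return 'gene', 'transcript', or None based on the file name suffix.

    Assumption: the naming convention is a suffix match on the file name.
    """
    if filename.endswith(".quant.genes.sf"):
        return "gene"
    if filename.endswith(".quant.sf"):
        return "transcript"
    return None
```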
GWAS file format
ICA Cohorts currently supports upload of SNV-level GWAS results produced by Regenie and saved as CSV files.
Metadata and File Types
Note: If annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.
As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the OMOP common data model 5.4. Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:
PERSON (mandatory),
CONCEPT (mandatory if any of the following is provided),
CONDITION_OCCURRENCE (optional),
DRUG_EXPOSURE (optional), and
PROCEDURE_OCCURRENCE (optional).
Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.
Note that Cohorts requires that all such files do not deviate from the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.
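The table requirements listed above can be expressed as a small pre-flight check. The sketch below takes the OMOP table names you plan to provide and reports problems with the set. It only covers which tables are present, not whether their contents conform to OMOP CDM 5.4, and the function name is hypothetical.

```python
MANDATORY = {"PERSON"}
OPTIONAL = {"CONDITION_OCCURRENCE", "DRUG_EXPOSURE", "PROCEDURE_OCCURRENCE"}
SUPPORTED = MANDATORY | OPTIONAL | {"CONCEPT"}


def check_omop_tables(provided):
    """Return a list of problems with a set of OMOP 5.4 table names."""
    tables = {name.upper() for name in provided}
    problems = []
    for name in sorted(MANDATORY - tables):
        problems.append(f"missing mandatory table: {name}")
    # CONCEPT becomes mandatory as soon as any optional table is provided.
    if tables & OPTIONAL and "CONCEPT" not in tables:
        problems.append("CONCEPT is required when an optional table is provided")
    for name in sorted(tables - SUPPORTED):
        problems.append(f"table not currently ingested: {name}")
    return problems
```

For example, `check_omop_tables(["PERSON", "DRUG_EXPOSURE"])` reports that CONCEPT is missing, while `check_omop_tables(["PERSON"])` returns an empty list.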
References
[1] VcfMapper: https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py
[2] crossMap: https://crossmap.sourceforge.net/
[3] liftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver
[4] Chain files: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/