Somatic CNV calling
Somatic variant calling is implemented differently for exome and whole genome data.
The whole genome data “branch” is currently under review, as GRCh38 support in Control-FREEC
(the main workhorse for WGS CNV calling) is not complete.
CNV calling in WGS data can also be done using cnvkit
, but its pipeline implementation is also incomplete.
The following documentation is restricted to the tools currently implemented to process exome data: cnvkit
, purecn
& sequenza
.
Output & performance
The 3 methods generally broadly agree on the log ratio of coverage between tumor & normal samples.
However, the segmentation and the number of copies assigned to a segment can be quite different between the algorithms.
Output files
There is no widely used standard to report copy number alterations.
In absence of a better solution, all CNV tools implemented in somatic pipeline output the segmentation table loosely following the DNAcopy format.`
The copy number call may or may not be present, and the chromosome number is replaced by its name.
The segmentation output is in file output/<mapper>.<cnv caller>.<lib name>/out/<mapper>.<cnv caller>.<lib name>_dnacopy.seg
.
Genome support
Both purecn
& sequenza
have better support for GRCh37
than for GRCh38
.
In both cases, it stems from the fact that the segmentation packages used by the methods haven not been updated for GRCh38
.
Both purecn
& sequenza
provide remedial solutions, however they are not available from bioconda
.
For that reason, the pipeline implementation relies on Docker containers for purecn
and on github forks of R packages for sequenza
.
The pipeline currently cannot handle male & female samples differently, even though the sex has a strong influence on the results.
It is therefore useful to avoid calling CNVs on sex chromosomes.
This can be easily achieved for sequenza
, using the ignore_chroms
option.
purecn
will handle the sample sex internally, but cnvkit
requires the user to remove the sex chromosomes from the access file.
Segmentation and CNV calls
cnvkit
generally produces more segments than purecn
, and shorter ones.
However, with the default settings, cnvkit
is very conservative with the calls.
Most segments are called diploid, even when the coverage log ratio suggests otherwise.
For best results, it is advisable to experiment with different segmentation algorithms and different sensitivity thresholds.
Purity & ploidy
purecn
attempts to model tumor purity and ploidy, and the results are found in <mapper>.purecn.<lib name>/out/<mapper>.purecn.<lib name>.csv
.
sequenza
can also offer these estimates, but they are currently only stored in R
objects in the work
directory, at <mapper>.sequenza.<lib name>/report/<lib name>_sequenza_cp_table.RData
.
Getting results for PureCN
The pipeline implementation of PureCN
requires a panel of normals (which in turns requires building a panel of normals for mutect2
).
Moreover, it also needs vcf
files containing both germline & somatic variants.
For these files, mutect2
needs to be run with different parameters than those used for somatic variant calling.
Therefore, the somatic_variant_calling
step must be run twice, with different arguments, in different locations.
For the second mutect2
run executed in the somatic_variant_calling_for_purecn
directory, the configuration would look like:
somatic_mutation_calling:
tools: [mutect2]
mutect2:
extra_arguments: [ "--genotype-germline-sites true", "--genotype-pon-sites true" ] # These arguments must be added
panel_of_normals: <absolute path to the panel of normal output>
germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
common_variants: <path to small_exac_common_3.vcf.gz>
window_length: 300000000 # For exome data, it is sufficient to split the genome by chromosomes
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2
somatic_targeted_seq_cnv_calling:
tools: [purecn]
purecn:
genome_name: "hg19" # This must match the names given while building the panel of normals
enrichment_kit_name: "exome" # This must match the names given while building the panel of normals
path_somatic_variants: ../somatic_variant_calling_for_purecn
somatic_variant_caller: mutect2
path_panels_of_normals: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.panel_of_normals.rds>
path_mapping_bias: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.mapping_bias.rds>
path_intervals: <absolute path to panel_of_normals/output/purecn/out/exome_hg19.list>
path_container: <absolute path to panel_of_normals/work/containers/out/purecn.simg>
From the panel_of_normals
directory, purecn
requires 3 types of files:
the
panel_of_normals
itself, and themapping_bias
objects are taken from<mapper>.purecn/out
. This is because they might change with different mapping tools.the
intervals
taken frompurecn/out
, as the definition of intervals depend only on the genome & the exome kit, but not on the mapping tool.the
container
taken fromwork/containers/out
, to ensure that thePureCN
version used to compute copy number variants is identical to that used to compute the panel of normals.