Somatic Variant Calling
Implementation of the somatic_variant_calling step
The somatic_variant_calling step takes as input the results of the ngs_mapping step
(aligned reads in BAM format) and performs somatic variant calling. The results are variant files
with somatic variants (bgzip-ed and indexed VCF files).
Usually, the somatic variant calling step is followed by the somatic_variant_annotation step.
Step Input
The somatic variant calling step relies on Snakemake sub workflows to obtain the results of the
ngs_mapping step.
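The location of the ngs_mapping step is given by the path_ngs_mapping configuration option.
A minimal sketch, assuming the default project layout where both steps live side by side:

step_config:
  somatic_variant_calling:
    path_ngs_mapping: ../ngs_mapping  # directory containing the ngs_mapping step output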
Step Output
For each tumor DNA NGS library with name lib_name and key lib_pk, each read mapper mapper
that the library has been aligned with, and each variant caller var_caller, the pipeline step
will create a directory output/{mapper}.{var_caller}.{lib_name}-{lib_pk}/out
with symlinks of the following names to the resulting VCF, TBI, and MD5 files:
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.md5
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi.md5
Two VCF files are produced:
{mapper}.{var_caller}.{lib_name}.vcf.gz, which contains only the variants that have passed all
filters or that were protected, and
{mapper}.{var_caller}.{lib_name}.full.vcf.gz, which contains all variants, with the reason for
rejection in the FILTER column.
For example, it might look as follows for the example from above:
output/
+-- bwa.mutect2.P001-N1-DNA1-WES1-4
| `-- out
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.md5
| `-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi.md5
[...]
Generally, the callers are set up to return many variants so as not to miss clinically important ones. The output is therefore likely to contain low-quality variants and, for some callers, variants flagged as non-somatic.
Default Configuration
The default configuration is as follows.
step_config:
somatic_variant_calling:
# List of tools
#tools: [] # Options: 'mutect', 'mutect2', 'scalpel', 'strelka2', 'gatk_hc_joint', 'gatk_ug_joint', 'bcftools_joint', 'platypus_joint', 'varscan_joint'
#
# Path to ngs_mapping
#path_ngs_mapping: ../ngs_mapping
#
# Patterns of contig names to ignore
#ignore_chroms: # Examples: NC_007605, hs37d5, chrEBV, *_decoy, HLA-*, GL000220.*
# - NC_007605
# - hs37d5
# - chrEBV
# - '*_decoy'
# - HLA-*
# - GL000220.*
#
# Configuration for joint calling with samtools+bcftools.
#bcftools_joint:
# max_depth: 4000
# max_indel_depth: 4000
# window_length: 10000000
# num_threads: 16
#
# Configuration for joint calling with Platypus.
#platypus_joint:
#
# # whether or not to split complex and MNV variants
# split_complex_mnvs: true
# num_threads: 16
#
# Configuration for MuTect
#mutect:
#
# # number of cores to use locally
# num_cores: 2
#
# # split input into windows of this size, each triggers a job
# window_length: 3500000
#
# # number of windows to process in parallel
# num_jobs: 500
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
#
# Configuration for MuTect 2
#mutect2:
#
# # number of cores to use locally
# num_cores: 2
# window_length: 50000000
#
# # number of windows to process in parallel
# num_jobs: 500
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
#
# # Set path to panel of normals vcf if required
# panel_of_normals: ''
#
# # Germline variants resource (same as panel of normals)
# germline_resource: ''
#
# # Common germline variants for contamination estimation
# common_variants: ''
#
# # List additional Mutect2 arguments.
# # Each additional argument must be of the form:
# # "--<argument name> <argument value>"
# # For example, to filter reads prior to calling & to add annotations to the output vcf:
# # - "--read-filter CigarContainsNoNOperator"
# # - "--annotation AssemblyComplexity BaseQuality"
# extra_arguments: [] # Examples: --read-filter CigarContainsNoNOperator, --annotation AssemblyComplexity BaseQuality
#
# Configuration for Scalpel
#scalpel:
# path_target_regions: # REQUIRED
#
# Configuration for Strelka2
#strelka2:
#
# # For exomes: include a bgzipped bed file with tabix index. That also triggers the --exome flag
# path_target_regions:
#
# Configuration for GatkHcJoint
#gatk_hc_joint:
#
# # number of cores to use locally
# num_cores: 2
# window_length: 50000000
#
# # number of windows to process in parallel
# num_jobs: 500
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
# allow_seq_dict_incompatibility: false
# annotations:
# - BaseQualityRankSumTest
# - FisherStrand
# - GCContent
# - HaplotypeScore
# - HomopolymerRun
# - MappingQualityRankSumTest
# - MappingQualityZero
# - QualByDepth
# - ReadPosRankSumTest
# - RMSMappingQuality
# - DepthPerAlleleBySample
# - Coverage
# - ClippingRankSumTest
# - DepthPerSampleHC
#
# Configuration for GatkUgJoint
#gatk_ug_joint:
#
# # number of cores to use locally
# num_cores: 2
# window_length: 50000000
#
# # number of windows to process in parallel
# num_jobs: 500
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
# downsample_to_coverage: 250.0
# allow_seq_dict_incompatibility: false
# annotations:
# - BaseQualityRankSumTest
# - FisherStrand
# - GCContent
# - HaplotypeScore
# - HomopolymerRun
# - MappingQualityRankSumTest
# - MappingQualityZero
# - QualByDepth
# - ReadPosRankSumTest
# - RMSMappingQuality
# - DepthPerAlleleBySample
# - Coverage
# - ClippingRankSumTest
# - DepthPerSampleHC
#
# Configuration for VarscanJoint
#varscan_joint:
# max_depth: 4000
# max_indel_depth: 4000
# min_bq: 13
# no_baq: true
#
# # number of cores to use locally
# num_cores: 2
# window_length: 5000000
#
# # number of windows to process in parallel
# num_jobs: 500
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
# min_coverage: 8.0
# min_reads2: 2
# min_avg_qual: 15.0
# min_var_freq: 0.01
# min_freq_for_hom: 0.75
# p_value: 0.99
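As an illustration, a minimal project configuration enabling the recommended caller could look
as follows. This is only a sketch: the mutect2 resource options are shown at their (empty)
defaults and should be pointed to site-specific VCF files where available.

step_config:
  somatic_variant_calling:
    tools: ['mutect2']
    path_ngs_mapping: ../ngs_mapping
    mutect2:
      panel_of_normals: ''    # set to a panel of normals VCF if required
      germline_resource: ''   # germline variants resource
      common_variants: ''     # common germline variants for contamination estimation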
Available Somatic Variant Callers
The following somatic variant callers are currently available:
mutect2 is the recommended caller.
mutect & scalpel are deprecated.
The joint variant callers bcftools, platypus, gatk & varscan are unsupported.
Efficient usage of mutect2
The recommended caller mutect2 is implemented using a parallelisation mechanism to reduce execution time.
During parallelisation, the genome is split into small chunks, and parallel jobs perform the somatic variant calling in their region only.
All results are then assembled to generate the final output files.
There is at least one chunk per contig defined in the reference genome, with the chunk size
capped by the window_length configuration option.
So large chromosomes can be split into several chunks, while smaller contigs are left in a single chunk.
For example, with the default window_length of 50000000, GRCh38 chromosome 1 (roughly 249 Mb) is covered by five chunks.
Even for large chunk sizes, this parallelisation can create hundreds of jobs when the reference genome contains many contigs
(unplaced or unlocalized contigs, viral sequences, decoys, HLA alleles, …).
Somatic variant calling is generally meaningless for many of these small contigs.
It is possible to configure the somatic variant calling step to skip contigs that are irrelevant for downstream analysis, for example:
mutect2:
ignore_chroms: ['*_random', 'chrUn_*', '*_decoy', "EBV", "HPV*", "HBV", "HCV-*", "HIV-*", "HTLV-1", "CMV", "KSHV", "MCV", "SV40"] # GRCh38.d1.vd1
window_length: 300000000 # OK for WES, too large for WGS
keep_tmpdir: onerror # Facilitates debugging
...
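The default configuration shown above also exposes an ignore_chroms list at the step level,
which serves a similar purpose. A sketch, reusing the example patterns from the default
configuration:

somatic_variant_calling:
  # Step-level contig exclusion (example patterns taken from the default configuration)
  ignore_chroms: ['NC_007605', 'hs37d5', 'chrEBV', '*_decoy', 'HLA-*', 'GL000220.*']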
Reports
Currently, no reports are generated.