Somatic Variant Calling
Implementation of the somatic_variant_calling step
The somatic_variant_calling step takes as the input the results of the ngs_mapping step
(aligned reads in BAM format) and performs somatic variant calling. The result are variant files
with somatic variants (bgzip-ed and indexed VCF files).
Usually, the somatic variant calling step is followed by the somatic_variant_annotation step.
Step Input
The somatic variant calling step uses Snakemake sub workflows for using the result of the
ngs_mapping step.
Step Output
For each tumor DNA NGS library with name lib_name/key lib_pk and each read mapper
mapper that the library has been aligned with, and the variant caller var_caller, the
pipeline step will create a directory output/{mapper}.{var_caller}.{lib_name}-{lib_pk}/out
with symlinks of the following names to the resulting VCF, TBI, and MD5 files.
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.md5{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi.md5
Two vcf files are produced:
{mapper}.{var_caller}.{lib_name}.vcf.gzwhich contains only the variants that have passed all filters, or that were protected, and{mapper}.{var_caller}.{lib_name}.full.vcf.gzwhich contains all variants, with the reason for rejection in theFILTERcolumn.
For example, it might look as follows for the example from above:
output/
+-- bwa.mutect2.P001-N1-DNA1-WES1-4
| `-- out
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi
| |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.md5
| `-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi.md5
[...]
Generally, the callers are set up to return many variants to avoid missing clinically important ones. They are likely to contain low-quality variants and for some callers variants flagged as being non-somatic.
Default Configuration
The default configuration is as follows.
step_config:
somatic_variant_calling:
# List of tools
#tools: [] # Options: 'mutect2'
#
# Path to ngs_mapping
#path_ngs_mapping: ../ngs_mapping
#
# Patterns of contig names to ignore
#ignore_chroms: [] # Examples: NC_007605, hs37d5, chrEBV, *_decoy, HLA-*, GL000220.*
#
# Configuration for MuTect 2
#mutect2:
#
# # List additional GetPileupSummaries arguments.
# # Each additional argument must be of the form:
# # "--<argument name> <argument value>"
# extra_arguments: [] # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
#
# # Options for the java run-time compiler
# java_options: '' # Examples: -Xms4000m -Xmx8000m
#
# # number of cores to use locally
# num_cores: 2
#
# # number of windows to process in parallel
# num_jobs: 24
#
# # use Snakemake profile for parallel processing
# use_profile: true
#
# # number of times to re-launch jobs in case of failure
# restart_times: 5
#
# # throttling of job creation
# max_jobs_per_second: 2
#
# # throttling of status checks
# max_status_checks_per_second: 10
#
# # truncation to first N tokens (0 for none)
# debug_trunc_tokens: 0
#
# # keep temporary directory, {always, never, onerror}
# keep_tmpdir: never # Options: 'always', 'never', 'onerror'
#
# # memory multiplier
# job_mult_memory: 1.0
#
# # running time multiplier
# job_mult_time: 1.0
#
# # memory multiplier for merging
# merge_mult_memory: 1.0
#
# # running time multiplier for merging
# merge_mult_time: 1.0
#
# # Set path to panel of normals vcf if required
# panel_of_normals: ''
#
# # Germline variants resource (same as panel of normals)
# germline_resource: ''
#
# # Estimation of contamination using GetPileupSummaries & CalculateContamination
# contamination: # REQUIRED
#
# # List additional GetPileupSummaries arguments.
# # Each additional argument must be of the form:
# # "--<argument name> <argument value>"
# extra_arguments: [] # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
#
# # Options for the java run-time compiler
# java_options: '' # Examples: -Xms4000m -Xmx8000m
# enabled: false
#
# # Common germline variants for contamination estimation
# common_variants: ''
#
# # Parameters for GetPileupSummaries used on the tumor bam (& normalk if present)
# pileup:
#
# # List additional GetPileupSummaries arguments.
# # Each additional argument must be of the form:
# # "--<argument name> <argument value>"
# extra_arguments: [] # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
#
# # Options for the java run-time compiler
# java_options: '' # Examples: -Xms4000m -Xmx8000m
#
# # Additional arguments for filtration
# filtration:
#
# # List additional GetPileupSummaries arguments.
# # Each additional argument must be of the form:
# # "--<argument name> <argument value>"
# extra_arguments: [] # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
#
# # Options for the java run-time compiler
# java_options: '' # Examples: -Xms4000m -Xmx8000m
#
# # Padding around intervals for scatter/gather
# padding: 5000
#
# # Whether to call variants in paired, tumor_only, or automatic mode.
# tumor_normal_mode: automatic # Options: 'automatic', 'paired', 'tumor_only'
Available Somatic Variant Callers
The following somatic variant callers are currently available
mutect2is the recommended callermutect&scalpelare deprecatedthe joint variant callers
bcftools,platypus,gatk&varscanare unsupported.
Efficient usage of mutect2
The recommended caller mutect2 is implemented using a parallelisation mechanism to reduce execution time.
During parallelisation, the genome is split into small chunks, and parallel jobs perform the somatic variant calling in their region only.
All results are then assembled to generate the final output files.
There is at least one chunk by contig defined in the reference genome, with the chunk size upper limit given by the window_length configuration option.
So large chromosomes can be split into several chunks, while smaller contigs are left in one chunk.
Even for large chunk size, this parallelisation can create hundreds of jobs when the reference genome contains many contigs
(unplaced or unlocalized contigs, viral sequences, decoys, HLA alleles, …).
Somatic variant calling is generally meaningless for many of these small contigs.
It is possible to configure the somatic variant calling to avoid the contigs irrelevant for downstream analysis, for example:
mutect2:
ignore_chroms: ['*_random', 'chrUn_*', '*_decoy', "EBV", "HPV*", "HBV", "HCV-*", "HIV-*", "HTLV-1", "CMV", "KSHV", "MCV", "SV40"] # GRCh38.d1.vd1
window_length: 300000000 # OK for WES, too large for WGS
keep_tmpdir: onerror # Facilitates debugging
...
Reports
Currently, no reports are generated.