Somatic Variant Calling
Implementation of the somatic_variant_calling
step
The somatic_variant_calling
step takes as the input the results of the ngs_mapping
step
(aligned reads in BAM format) and performs somatic variant calling. The result are variant files
with somatic variants (bgzip-ed and indexed VCF files).
Usually, the somatic variant calling step is followed by the somatic_variant_annotation
step.
Step Input
The somatic variant calling step uses Snakemake sub workflows for using the result of the
ngs_mapping
step.
Step Output
For each tumor DNA NGS library with name lib_name
/key lib_pk
and each read mapper
mapper
that the library has been aligned with, and the variant caller var_caller
, the
pipeline step will create a directory output/{mapper}.{var_caller}.{lib_name}-{lib_pk}/out
with symlinks of the following names to the resulting VCF, TBI, and MD5 files.
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.md5
{mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi.md5
For example, it might look as follows for the example from above:
output/
+-- bwa.mutect.P001-N1-DNA1-WES1-4
| `-- out
| |-- bwa.mutect.P001-N1-DNA1-WES1-4.vcf.gz
| |-- bwa.mutect.P001-N1-DNA1-WES1-4.vcf.gz.tbi
| |-- bwa.mutect.P001-N1-DNA1-WES1-4.vcf.gz.md5
| `-- bwa.mutect.P001-N1-DNA1-WES1-4.vcf.gz.tbi.md5
[...]
Generally, these files will be unfiltered, i.e., contain low-quality variants and also variants flagged as being non-somatic.
Global Configuration
If the somatic variant caller MuTect is used, then the global settings
static_data_config/dbsnp
andstatic_data_config/cosmic
must be given as MuTect uses this in its algorithm.static_data_config/reference/path
must be set appropriately
Default Configuration
The default configuration is as follows.
# Default configuration somatic_variant_calling
step_config:
somatic_variant_calling:
tools: ['mutect', 'scalpel'] # REQUIRED, examples: 'mutect' and 'scalpel'.
path_ngs_mapping: ../ngs_mapping # REQUIRED
ignore_chroms: # patterns of chromosome names to ignore
- NC_007605 # herpes virus
- hs37d5 # GRCh37 decoy
- chrEBV # Eppstein-Barr Virus
- '*_decoy' # decoy contig
- 'HLA-*' # HLA genes
- 'GL000220.*' # Contig with problematic, repetitive DNA in GRCh37
# Configuration for joint calling with samtools+bcftools.
bcftools_joint:
max_depth: 4000
max_indel_depth: 4000
window_length: 10000000
num_threads: 16
# Configuration for joint calling with Platypus.
platypus_joint:
split_complex_mnvs: true # whether or not to split complex and MNV variants
num_threads: 16
# VCF annotation databases are given as mapping from name to
# {'file': '/path.vcf.gz',
# 'info_tag': 'VCF_TAG',
# 'description': 'VCF header description'}
# Configuration for MuTect
mutect:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 3500000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 2 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# Configuration for MuTect 2
mutect2:
panel_of_normals: '' # Set path to panel of normals vcf if required
germline_resource: '' # Germline variants resource (same as panel of normals)
common_variants: '' # Common germline variants for contamination estimation
extra_arguments: [] # List additional Mutect2 arguments
# Each additional argument xust be in the form:
# "--<argument name> <argument value>"
# For example, to filter reads prior to calling & to
# add annotations to the output vcf:
# - "--read-filter CigarContainsNoNOperator"
# - "--annotation AssemblyComplexity BaseQuality"
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 2 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# Configuration for Scalpel
scalpel:
path_target_regions: REQUIRED # REQUIRED
# Configuration for strelka2
strelka2:
path_target_regions: "" # For exomes: include a bgzipped bed file with tabix index. That also triggers the --exome flag
gatk_hc_joint:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 10 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# GATK HC--specific configuration
allow_seq_dict_incompatibility: false
annotations:
- BaseQualityRankSumTest
- FisherStrand
- GCContent
- HaplotypeScore
- HomopolymerRun
- MappingQualityRankSumTest
- MappingQualityZero
- QualByDepth
- ReadPosRankSumTest
- RMSMappingQuality
- DepthPerAlleleBySample
- Coverage
- ClippingRankSumTest
- DepthPerSampleHC
gatk_ug_joint:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 10 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# GATK UG--specific configuration
downsample_to_coverage: 250
allow_seq_dict_incompatibility: false
annotations:
- BaseQualityRankSumTest
- FisherStrand
- GCContent
- HaplotypeScore
- HomopolymerRun
- MappingQualityRankSumTest
- MappingQualityZero
- QualByDepth
- ReadPosRankSumTest
- RMSMappingQuality
- DepthPerAlleleBySample
- Coverage
- ClippingRankSumTest
- DepthPerSampleHC
varscan_joint:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 5000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 2 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
# Configuration for samtools mpileup
max_depth: 4000
max_indel_depth: 4000
min_bq: 13
no_baq: True
# Configuration for Varscan
min_coverage: 8
min_reads2: 2
min_avg_qual: 15
min_var_freq: 0.01
min_freq_for_hom: 0.75
p_value: 99e-02
Available Somatic Variant Callers
The following somatic variant callers are currently available
"mutect"
"scalpel"
Reports
Currently, no reports are generated.