Somatic Variant Calling

Implementation of the somatic_variant_calling step

The somatic_variant_calling step takes as input the results of the ngs_mapping step (aligned reads in BAM format) and performs somatic variant calling. The results are variant files with somatic variants (bgzip-compressed and indexed VCF files).

Usually, the somatic variant calling step is followed by the somatic_variant_annotation step.

Step Input

The somatic variant calling step uses Snakemake sub-workflows to consume the results of the ngs_mapping step.

Step Output

For each tumor DNA NGS library with name lib_name/key lib_pk, each read mapper mapper that the library has been aligned with, and each variant caller var_caller, the pipeline step will create a directory output/{mapper}.{var_caller}.{lib_name}-{lib_pk}/out with symlinks of the following names to the resulting VCF, TBI, and MD5 files:

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.md5

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi.md5
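The naming scheme above can be sketched with a small helper. Note that make_output_paths is a hypothetical function for illustration only, not part of the pipeline's API; the pipeline derives these paths internally.

```python
# Sketch of the output naming scheme; `make_output_paths` is a
# hypothetical helper, not provided by the pipeline.
def make_output_paths(mapper, var_caller, lib_name, lib_pk):
    """Return the four symlink paths created for one tumor library."""
    token = f"{mapper}.{var_caller}.{lib_name}-{lib_pk}"
    base = f"output/{token}/out/{token}.vcf.gz"
    return [base, base + ".tbi", base + ".md5", base + ".tbi.md5"]

# Matches the example tree below:
paths = make_output_paths("bwa", "mutect2", "P001-N1-DNA1-WES1", "4")
```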

Two VCF files are produced:

  • {mapper}.{var_caller}.{lib_name}.vcf.gz which contains only the variants that have passed all filters, or that were protected, and

  • {mapper}.{var_caller}.{lib_name}.full.vcf.gz which contains all variants, with the reason for rejection in the FILTER column.

For example, the output directory might look as follows for the example above:

output/
+-- bwa.mutect2.P001-N1-DNA1-WES1-4
|   `-- out
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.md5
|       `-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi.md5
[...]

Generally, the callers are set up to return many variants, to avoid missing clinically important ones. The output is therefore likely to contain low-quality variants and, for some callers, variants flagged as non-somatic.
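The distinction between the filtered and the full output file can be sketched with a stdlib-only snippet that keeps header lines and records whose FILTER column equals PASS (in practice, post-processing would more typically use bcftools view -f PASS; the function below is illustrative only):

```python
import gzip  # gzip can read bgzip-compressed VCF files

def pass_records(lines):
    """Yield VCF header lines and records with FILTER == 'PASS'.

    `lines` is an iterable of decoded VCF lines; FILTER is the
    7th tab-separated column of a record.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
        elif line.rstrip("\n").split("\t")[6] == "PASS":
            yield line

# Usage with a bgzip-ed VCF:
# with gzip.open("sample.full.vcf.gz", "rt") as fh:
#     kept = list(pass_records(fh))
```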

Default Configuration

The default configuration is as follows.

step_config:
  somatic_variant_calling:

    # List of tools
    #tools: []                                         # Options: 'mutect', 'mutect2', 'scalpel', 'strelka2', 'gatk_hc_joint', 'gatk_ug_joint', 'bcftools_joint', 'platypus_joint', 'varscan_joint'
    #
    # Path to ngs_mapping
    #path_ngs_mapping: ../ngs_mapping
    #
    # Patterns of contig names to ignore
    #ignore_chroms:                                    # Examples: NC_007605, hs37d5, chrEBV, *_decoy, HLA-*, GL000220.*
    #  - NC_007605
    #  - hs37d5
    #  - chrEBV
    #  - '*_decoy'
    #  - HLA-*
    #  - GL000220.*
    #
    # Configuration for joint calling with samtools+bcftools.
    #bcftools_joint:
    #  max_depth: 4000
    #  max_indel_depth: 4000
    #  window_length: 10000000
    #  num_threads: 16
    #
    # Configuration for joint calling with Platypus.
    #platypus_joint:
    #
    #  # whether or not to split complex and MNV variants
    #  split_complex_mnvs: true
    #  num_threads: 16
    #
    # Configuration for MuTect
    #mutect:
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #
    #  # split input into windows of this size, each triggers a job
    #  window_length: 3500000
    #
    #  # number of windows to process in parallel
    #  num_jobs: 500
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #
    # Configuration for MuTect 2
    #mutect2:
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #  window_length: 50000000
    #
    #  # number of windows to process in parallel
    #  num_jobs: 500
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #
    #  # Set path to panel of normals vcf if required
    #  panel_of_normals: ''
    #
    #  # Germline variants resource (same as panel of normals)
    #  germline_resource: ''
    #
    #  # Common germline variants for contamination estimation
    #  common_variants: ''
    #
    #  # List additional Mutect2 arguments.
    #  # Each additional argument must be of the form:
    #  # "--<argument name> <argument value>"
    #  # For example, to filter reads prior to calling & to add annotations to the output vcf:
    #  #   - "--read-filter CigarContainsNoNOperator"
    #  #   - "--annotation AssemblyComplexity BaseQuality"
    #  extra_arguments: []                             # Examples: --read-filter CigarContainsNoNOperator, --annotation AssemblyComplexity BaseQuality
    #
    # Configuration for Scalpel
    #scalpel:
    #  path_target_regions:                            # REQUIRED
    #
    # Configuration for Strelka2
    #strelka2:
    #
    #  # For exomes: include a bgzipped bed file with tabix index. That also triggers the --exome flag
    #  path_target_regions:
    #
    # Configuration for GatkHcJoint
    #gatk_hc_joint:
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #  window_length: 50000000
    #
    #  # number of windows to process in parallel
    #  num_jobs: 500
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #  allow_seq_dict_incompatibility: false
    #  annotations:
    #    - BaseQualityRankSumTest
    #    - FisherStrand
    #    - GCContent
    #    - HaplotypeScore
    #    - HomopolymerRun
    #    - MappingQualityRankSumTest
    #    - MappingQualityZero
    #    - QualByDepth
    #    - ReadPosRankSumTest
    #    - RMSMappingQuality
    #    - DepthPerAlleleBySample
    #    - Coverage
    #    - ClippingRankSumTest
    #    - DepthPerSampleHC
    #
    # Configuration for GatkUgJoint
    #gatk_ug_joint:
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #  window_length: 50000000
    #
    #  # number of windows to process in parallel
    #  num_jobs: 500
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #  downsample_to_coverage: 250.0
    #  allow_seq_dict_incompatibility: false
    #  annotations:
    #    - BaseQualityRankSumTest
    #    - FisherStrand
    #    - GCContent
    #    - HaplotypeScore
    #    - HomopolymerRun
    #    - MappingQualityRankSumTest
    #    - MappingQualityZero
    #    - QualByDepth
    #    - ReadPosRankSumTest
    #    - RMSMappingQuality
    #    - DepthPerAlleleBySample
    #    - Coverage
    #    - ClippingRankSumTest
    #    - DepthPerSampleHC
    #
    # Configuration for VarscanJoint
    #varscan_joint:
    #  max_depth: 4000
    #  max_indel_depth: 4000
    #  min_bq: 13
    #  no_baq: true
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #  window_length: 5000000
    #
    #  # number of windows to process in parallel
    #  num_jobs: 500
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #  min_coverage: 8.0
    #  min_reads2: 2
    #  min_avg_qual: 15.0
    #  min_var_freq: 0.01
    #  min_freq_for_hom: 0.75
    #  p_value: 0.99
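As a minimal activated configuration, one might uncomment and set only the required keys; the values below are illustrative, not recommendations:

```yaml
step_config:
  somatic_variant_calling:
    tools: ['mutect2']
    path_ngs_mapping: ../ngs_mapping
```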

Available Somatic Variant Callers

The following somatic variant callers are currently available:

  • mutect2 is the recommended caller.

  • mutect & scalpel are deprecated.

  • The joint variant callers bcftools, platypus, gatk & varscan are unsupported.

Efficient usage of mutect2

The recommended caller mutect2 is implemented with a parallelisation mechanism to reduce execution time: the genome is split into small chunks, and parallel jobs perform somatic variant calling each in its own region only. All results are then assembled to generate the final output files.

There is at least one chunk per contig defined in the reference genome, with the chunk size capped by the window_length configuration option. Large chromosomes can therefore be split into several chunks, while smaller contigs are left in a single chunk. Even with a large chunk size, this parallelisation can create hundreds of jobs when the reference genome contains many contigs (unplaced or unlocalized contigs, viral sequences, decoys, HLA alleles, …). Somatic variant calling is generally meaningless for many of these small contigs. It is possible to configure the somatic variant calling to skip contigs irrelevant for downstream analysis, for example:

mutect2:
  ignore_chroms: ['*_random', 'chrUn_*', '*_decoy', "EBV", "HPV*", "HBV", "HCV-*", "HIV-*", "HTLV-1", "CMV", "KSHV", "MCV", "SV40"] # GRCh38.d1.vd1
  window_length: 300000000   # OK for WES, too large for WGS
  keep_tmpdir: onerror       # Facilitates debugging
  ...
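The chunking described above can be sketched as follows; this is a simplified illustration of windowing per contig, not the pipeline's actual implementation:

```python
def make_chunks(contig_lengths, window_length):
    """Split each contig into windows of at most `window_length` bp.

    Returns (contig, start, end) triples with 0-based half-open
    coordinates; every contig yields at least one chunk.
    """
    chunks = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, window_length):
            chunks.append((contig, start, min(start + window_length, length)))
    return chunks

# A large chromosome is split into several chunks, while a small
# contig stays in a single chunk.
chunks = make_chunks({"chr1": 248_956_422, "chrM": 16_569}, 100_000_000)
```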

Reports

Currently, no reports are generated.