Somatic Variant Calling

Implementation of the somatic_variant_calling step

The somatic_variant_calling step takes as the input the results of the ngs_mapping step (aligned reads in BAM format) and performs somatic variant calling. The result are variant files with somatic variants (bgzip-ed and indexed VCF files).

Usually, the somatic variant calling step is followed by the somatic_variant_annotation step.

Step Input

The somatic variant calling step uses Snakemake sub workflows for using the result of the ngs_mapping step.

Step Output

For each tumor DNA NGS library with name lib_name/key lib_pk and each read mapper mapper that the library has been aligned with, and the variant caller var_caller, the pipeline step will create a directory output/{mapper}.{var_caller}.{lib_name}-{lib_pk}/out with symlinks of the following names to the resulting VCF, TBI, and MD5 files.

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.md5

  • {mapper}.{var_caller}.{lib_name}-{lib_pk}.vcf.gz.tbi.md5

Two vcf files are produced:

  • {mapper}.{var_caller}.{lib_name}.vcf.gz which contains only the variants that have passed all filters, or that were protected, and

  • {mapper}.{var_caller}.{lib_name}.full.vcf.gz which contains all variants, with the reason for rejection in the FILTER column.

For example, it might look as follows for the example from above:

output/
+-- bwa.mutect2.P001-N1-DNA1-WES1-4
|   `-- out
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi
|       |-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.md5
|       `-- bwa.mutect2.P001-N1-DNA1-WES1-4.vcf.gz.tbi.md5
[...]

Generally, the callers are set up to return many variants to avoid missing clinically important ones. They are likely to contain low-quality variants and for some callers variants flagged as being non-somatic.

Default Configuration

The default configuration is as follows.

step_config:
  somatic_variant_calling:

    # List of tools
    #tools: []                                         # Options: 'mutect2'
    #
    # Path to ngs_mapping
    #path_ngs_mapping: ../ngs_mapping
    #
    # Patterns of contig names to ignore
    #ignore_chroms: []                                 # Examples: NC_007605, hs37d5, chrEBV, *_decoy, HLA-*, GL000220.*
    #
    # Configuration for MuTect 2
    #mutect2:
    #
    #  # List additional GetPileupSummaries arguments.
    #  # Each additional argument must be of the form:
    #  # "--<argument name> <argument value>"
    #  extra_arguments: []                             # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #  # Options for the java run-time compiler
    #  java_options: ''                                # Examples: -Xms4000m -Xmx8000m
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #
    #  # number of windows to process in parallel
    #  num_jobs: 24
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #
    #  # Set path to panel of normals vcf if required
    #  panel_of_normals: ''
    #
    #  # Germline variants resource (same as panel of normals)
    #  germline_resource: ''
    #
    #  # Estimation of contamination using GetPileupSummaries & CalculateContamination
    #  contamination:                                  # REQUIRED
    #
    #    # List additional GetPileupSummaries arguments.
    #    # Each additional argument must be of the form:
    #    # "--<argument name> <argument value>"
    #    extra_arguments: []                           # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #    # Options for the java run-time compiler
    #    java_options: ''                              # Examples: -Xms4000m -Xmx8000m
    #    enabled: false
    #
    #    # Common germline variants for contamination estimation
    #    common_variants: ''
    #
    #    # Parameters for GetPileupSummaries used on the tumor bam (& normalk if present)
    #    pileup:
    #
    #      # List additional GetPileupSummaries arguments.
    #      # Each additional argument must be of the form:
    #      # "--<argument name> <argument value>"
    #      extra_arguments: []                         # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #      # Options for the java run-time compiler
    #      java_options: ''                            # Examples: -Xms4000m -Xmx8000m
    #
    #  # Additional arguments for filtration
    #  filtration:
    #
    #    # List additional GetPileupSummaries arguments.
    #    # Each additional argument must be of the form:
    #    # "--<argument name> <argument value>"
    #    extra_arguments: []                           # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #    # Options for the java run-time compiler
    #    java_options: ''                              # Examples: -Xms4000m -Xmx8000m
    #
    #  # Padding around intervals for scatter/gather
    #  padding: 5000
    #
    #  # Whether to call variants in paired, tumor_only, or automatic mode.
    #  tumor_normal_mode: automatic                    # Options: 'automatic', 'paired', 'tumor_only'

Available Somatic Variant Callers

The following somatic variant callers are currently available

  • mutect2 is the recommended caller

  • mutect & scalpel are deprecated

  • the joint variant callers bcftools, platypus, gatk & varscan are unsupported.

Efficient usage of mutect2

The recommended caller mutect2 is implemented using a parallelisation mechanism to reduce execution time. During parallelisation, the genome is split into small chunks, and parallel jobs perform the somatic variant calling in their region only. All results are then assembled to generate the final output files.

There is at least one chunk by contig defined in the reference genome, with the chunk size upper limit given by the window_length configuration option. So large chromosomes can be split into several chunks, while smaller contigs are left in one chunk. Even for large chunk size, this parallelisation can create hundreds of jobs when the reference genome contains many contigs (unplaced or unlocalized contigs, viral sequences, decoys, HLA alleles, …). Somatic variant calling is generally meaningless for many of these small contigs. It is possible to configure the somatic variant calling to avoid the contigs irrelevant for downstream analysis, for example:

mutect2:
  ignore_chroms: ['*_random', 'chrUn_*', '*_decoy', "EBV", "HPV*", "HBV", "HCV-*", "HIV-*", "HTLV-1", "CMV", "KSHV", "MCV", "SV40"] # GRCh38.d1.vd1
  window_length: 300000000   # OK for WES, too large for WGS
  keep_tmpdir: onerror       # Facilitates debugging
  ...

Reports

Currently, no reports are generated.