Creation of panel of normals for somatic SNV & CNV calling

Implementation of the panel_of_normals step

The panel_of_normals step takes as input the results of the ngs_mapping step (aligned reads in BAM format) and creates background information for somatic variant calling and/or somatic copy number calling. This background information is summarized as a panel of normals.

Usually, the panel_of_normals step is required by somatic variant calling or somatic copy number calling tools.

Step Input

The panel_of_normals step uses Snakemake sub workflows to access the results of the ngs_mapping step.

By default, all normal DNA samples in the ngs_mapping step are used to create the panel of normals. However, the user can select a subset of those using the path_normals_list configuration option (which can differ between tools). In this case, the libraries listed in the file will be used, even if they are not flagged as corresponding to normal samples.
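
The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function name, the shape of the libraries mapping, and the assumption that the list file holds one library name per line are all hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Optional


def select_panel_libraries(
    libraries: Dict[str, bool],
    path_normals_list: Optional[str] = None,
) -> List[str]:
    """Return the names of the libraries that go into the panel.

    `libraries` maps a library name to True when it is flagged as normal
    (hypothetical input shape). The list file is assumed to contain one
    library name per line.
    """
    if not path_normals_list:
        # Default (null or empty option): every library flagged as normal.
        return sorted(name for name, is_normal in libraries.items() if is_normal)
    lines = Path(path_normals_list).read_text().splitlines()
    # Listed libraries are taken as-is, even when not flagged as normal.
    return [line.strip() for line in lines if line.strip()]
```

Treating both null and the empty string as "no list" mirrors the two defaults that appear in the configuration (null for mutect2, "" for cnvkit).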

Step Output

For each panel of normals tool, the step outputs one set of files describing the panel. For example, the mutect2 panel of normals generates {mapper}.mutect2.pon.vcf.gz and associated files (md5 sums, indices).

The list of normals that have been used, as well as the individual files (for example, the VCF file for each normal), are kept in the work directory. This makes it possible to augment the panel with new files when they become available.

Notes on the cnvkit workflow

cnvkit is a set of tools originally designed to call somatic copy number alterations from exome data. Its design is modular, which enables its use for whole genome and amplicon data.

cnvkit provides a tool (batch) that encapsulates common practice workflows, which depend on the type of data and on the availability of optional inputs. The current implementation recapitulates the common practice, while still dispatching computations to multiple cluster nodes.

For exome and whole genome data, the cnvkit documentation recommends the creation of a panel of normals (called reference). The actual workflow to generate this reference is slightly different between exome and whole genome data, and also changes depending on whether an accessibility file is provided by the user or not.

For this reason, the cnvkit command that generates the accessibility file is implemented as a separate tool, access. A user who wants to create the accessibility file with cnvkit must first run the access tool. Only once the file exists can it be used to generate the panel of normals: the user then modifies the configuration file, adding cnvkit to the list of tools and setting the access parameter to the output of the access tool.
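
The configuration change for the second run can be pictured with a short Python sketch that works on a dictionary mirroring the step_config section. The helper name enable_cnvkit and the existence check are hypothetical. Only the keys (tools, cnvkit, access) and the default output path come from this documentation.

```python
import copy
from pathlib import Path

# Output path of the access tool, as quoted in the default configuration.
ACCESS_OUTPUT = "output/cnvkit.access/out/cnvkit.access.bed"


def enable_cnvkit(step_config: dict, access_bed: str = ACCESS_OUTPUT) -> dict:
    """Return a copy of `step_config` prepared for the cnvkit panel run.

    Hypothetical helper: it refuses to proceed until the accessibility
    file exists, i.e. until the access tool has been run.
    """
    if not Path(access_bed).is_file():
        raise FileNotFoundError(f"run the access tool first: {access_bed} is missing")
    updated = copy.deepcopy(step_config)
    pon = updated["panel_of_normals"]
    if "cnvkit" not in pon["tools"]:
        pon["tools"].append("cnvkit")  # add cnvkit to the list of tools
    # Point the access parameter at the output of the access tool.
    pon.setdefault("cnvkit", {})["access"] = access_bed
    return updated
```

Working on a copy keeps the configuration of the first run (access only) available for comparison.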

Default Configuration

The default configuration is as follows.

# Default configuration panel_of_normals
step_config:
  panel_of_normals:
    tools: ['mutect2']  # REQUIRED - available: 'mutect2', 'cnvkit', 'access'
    path_ngs_mapping: ../ngs_mapping  # REQUIRED
    # Configuration for mutect2
    mutect2:
      path_normals_list: null    # Optional file listing libraries to include in panel
      germline_resource: REQUIRED
      # Java options
      java_options: ' -Xmx16g '
      # Parallelization configuration
      num_cores: 2               # number of cores to use locally
      window_length: 100000000   # split input into windows of this size, each triggers a job
      num_jobs: 500              # number of windows to process in parallel
      use_profile: true          # use Snakemake profile for parallel processing
      restart_times: 5           # number of times to re-launch jobs in case of failure
      max_jobs_per_second: 2     # throttling of job creation
      max_status_checks_per_second: 10 # throttling of status checks
      debug_trunc_tokens: 0      # truncation to first N tokens (0 for none)
      keep_tmpdir: never         # keep temporary directory, {always, never, onerror}
      job_mult_memory: 1         # memory multiplier
      job_mult_time: 1           # running time multiplier
      merge_mult_memory: 1       # memory multiplier for merging
      merge_mult_time: 1         # running time multiplier for merging
      ignore_chroms:             # patterns of chromosome names to ignore
      - NC_007605    # herpes virus
      - hs37d5       # GRCh37 decoy
      - chrEBV       # Epstein-Barr Virus
      - '*_decoy'    # decoy contig
      - 'HLA-*'      # HLA genes
      - 'GL000220.*' # Contig with problematic, repetitive DNA in GRCh37
    cnvkit:
      path_normals_list: ""       # Optional file listing libraries to include in panel
      path_target_regions: ""     # Bed files of targeted regions (Missing when creating a panel of normals for WGS data)
      access: ""                  # Access bed file (output/cnvkit.access/out/cnvkit.access.bed when create_cnvkit_acces was run)
      annotate: ""                # [target] Optional targets annotations
      target_avg_size: 0          # [target] Average size of split target bins (0: use default value)
      bp_per_bin: 50000           # [autobin] Expected base per bin
      split: True                 # [target] Split large intervals into smaller ones
      antitarget_avg_size: 0      # [antitarget] Average size of antitarget bins (0: use default value)
      min_size: 0                 # [antitarget] Min size of antitarget bins (0: use default value)
      min_mapq: 0                 # [coverage] Minimum mapping quality score to count a read for coverage depth
      count: False                # [coverage] Alternative counting algorithm
      min_cluster_size: 0         # [reference] Minimum cluster size to keep in reference profiles. 0 for no clustering
      gender: ""                  # [reference] Specify the chromosomal sex of all given samples as male or female. Guess when missing
      male_reference: False       # [reference & sex] Create male reference
      gc_correction: True         # [reference] Use GC correction
      edge_correction: True       # [reference] Use edge correction
      rmask_correction: True      # [reference] Use rmask correction
      drop_low_coverage: False    # [metrics] Drop very-low-coverage bins before calculations
    access:                       # Creates access file for cnvkit, based on genomic sequence & excluded regions (optionally)
      exclude: []                 # [access] Bed file of regions to exclude (mappability, blacklisted, ...)
      min_gap_size: 0             # [access] Minimum gap size between accessible sequence regions (0: use default value)
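
To make the window_length and ignore_chroms options of the mutect2 section more concrete, here is a minimal Python sketch of how contigs could be filtered and cut into windows. It assumes that the ignore patterns are shell-style globs matched against the whole contig name and that windows are half-open, 0-based intervals. Both are assumptions, and the function names are hypothetical rather than taken from the pipeline.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterator, List, Tuple


def is_ignored(contig: str, patterns: List[str]) -> bool:
    """True when the contig name matches any glob pattern (assumed semantics)."""
    return any(fnmatchcase(contig, pattern) for pattern in patterns)


def iter_windows(
    contig_lengths: Dict[str, int],
    window_length: int,
    ignore_patterns: List[str],
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows; each one would trigger a job."""
    for contig, length in contig_lengths.items():
        if is_ignored(contig, ignore_patterns):
            continue  # skip decoys, viral sequences, HLA contigs, ...
        for start in range(0, length, window_length):
            # The last window of a contig is shortened to the contig end.
            yield contig, start, min(start + window_length, length)
```

With the default window_length of 100000000, a contig of about 249 million bases is covered by three windows, the last one being shorter than the others.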

Panel of normals generation for tools

  • Panel of normals for the mutect2 somatic variant caller

  • Panel of normals for the cnvkit somatic Copy Number Alterations caller

access is used to create a genome accessibility file that can be used for cnvkit panel of normals creation. Its output (output/cnvkit.access/out/cnvkit.access.bed) is optional, but its presence affects the way the target and antitarget regions are computed in whole genome mode.

In a nutshell, for exome data, the accessibility file is only used to create antitarget regions. For genome data, it is used by the autobin tool to compute the average target size used during target regions creation. If it is present, the target size is computed in amplicon mode; if it is absent, an accessibility file is created with default settings and used by autobin in whole genome mode.

This follows the internal batch code of cnvkit.
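
The decision described above can be written down as a small Python function. This sketch only transcribes the text of this section. It is not taken from the cnvkit or pipeline code, the function name and the returned keys are hypothetical, and it assumes that an empty path_target_regions means whole genome data, as the comment in the default configuration suggests.

```python
from typing import Dict, Optional


def plan_access_usage(path_target_regions: str, access: str) -> Dict[str, Optional[object]]:
    """Describe how the accessibility file is used for a given configuration."""
    if path_target_regions:
        # Exome data: the file, when given, only shapes the antitarget regions.
        return {
            "data": "exome",
            "access_used_for": "antitarget" if access else None,
            "autobin_mode": None,
            "create_default_access": False,
        }
    # Whole genome data: autobin derives the average target size.
    return {
        "data": "wgs",
        "access_used_for": "autobin",
        # User-provided file: amplicon mode. Otherwise a default
        # accessibility file is created and autobin runs in wgs mode.
        "autobin_mode": "amplicon" if access else "wgs",
        "create_default_access": not access,
    }
```

For example, a whole genome configuration with both options left empty yields autobin_mode "wgs" and create_default_access True.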

Reports

Report tables can be found in the output/{mapper}.cnvkit/report directory. Two tables are produced, grouping results for all normal samples together:

  • metrics.txt: coverage metrics over target and antitarget regions.

  • sex.txt: prediction of the donor’s sex based on the coverage of chromosome X and Y target and antitarget regions.

The cnvkit authors recommend checking these reports to ensure that all data is suitable for panel of normals creation.