Creation of panel of normals for somatic SNV & CNV calling

Implementation of the panel_of_normals step

The panel_of_normals step takes as the input the results of the ngs_mapping step (aligned reads in BAM format) and creates background information for somatic variant calling. and/or somatic copy number calling. This background information is summarized as a panel of normals.

Usually, the panel_of_normals step is required by somatic variant calling or somatic copy number calling tools.

Step Input

The somatic variant calling step uses Snakemake sub workflows for using the result of the ngs_mapping step.

By default, all normal DNA samples in the ngs_mapping step are using to create the panel of normals. However, the user can select of subset of those using the path_normals_list configuration option (which can be different for the different tools). In this case, the libraries listed in the file will be used, even if they are not flagged as corresponding to normal samples.

Step Output

For each panel of normals tool, the step outputs one set of files describing the panel. For example, the mutect2 panel of normal generates {mapper}.mutect2.pon.vcf.gz and associated files (md5 sums indices).

The normals that have been used, as well as the individual files (for example vcf files for each normal) are kept in the work directory. This enables the augmentation of the panel by new files when they become available.

Warning

Panel of normals are powerful tools to reduce systematic bias in the analysis of sequencing data. However, they should be built using data generated as similarily as possible. In particular, a panel of normals should only contain data collected with the same exome enrichment kit. It is also essential to use such a panel on tumor samples collected in the same way.

Notes on the `cnvkit` workflow

cnvkit is a set of tools originally designed to call somatic copy number alterations from exome data. Its design is modular, which enables its use for whole genome and amplicon data.

Provided that sufficient normal samples are available, the cnvkit documentation recommends the creation of a panel of normal (called reference) for exome and whole genome data.

Note

cnvkit provides a tool to encapsulate common practice workflows (batch), depending on the type of data, and on the availability of optional inputs. The actual workflow to generate this reference is slightly different between exome and whole genome data. The current implementation recapitulates the common practice, while still dispaching computations on multiple cluster nodes.

Access file

cnvkit can use a bed file describing the accessible regions for coverage computations. The cnvkit distribution provides it for the GRCh37 human genome release, but incompletely only for GRCh38. Therefore, a tentative access tool has been added, to generate this bed file when the user knows which locii should be excluded from coverage. Its output (output/cnvkit.access/out/cnvkit.access.bed) is optional, but its presence impacts of the way the target and antitarget regions are computed in whole genome mode.

Note

In a nutshell, for exome data, the accessibility file is only used to create antitarget regions. For genome data, it is used by the autobin tool to compute the average target size used during target regions creation. If it is present, the target size is computed in amplicon mode, and when it is absent, an accessibility file is created with default settings, which value is used by autobin is whole genome mode.

To generate the access file from a bed file containing regions to exclude from further coverage computations, the user must proceed in two steps:

First, she needs to run the access tool to create the desired access file

panel_of_normals:
    tools: [access]
    access:
        exclude: <absolute path to excluded locii bed file>

This will create output/cnvkit.access/out/cnvkit.access.bed from the genomic sequence & excluded regions.

Panel of normal creation

If the user wants to create her own access file, then the panel of normal can only be created after the access tool has been run. If she decides that the access file provided in the cnvkit distribution is suitable (no excluded region), then she can skip the access tool step and directly creates her panel of normals.

In both cases, the configuration might read:

panel_of_normals:
    tools: [cnvkit]                                               # , access]
    path_access: <absolute path to access file>                   # Even when created by the ``access`` tool.
    path_target: <absolute path to baits>                         # Keep empty for WGS data
    path_normals_list: <absolute path to list of normal samples>  # Keep empty to use all available normals

Note that there is no provision (yet) to automatically create separate panel of normals for males & females. If the number of samples collected in the same fashion is large enough, it is nevertheless the way to achieve best results.

Reports

Report tables can be found in the output/{mapper}.cnvkit/report directory. Two tables are produced, grouping results for all normal samples together:

metrics.txt: coverage metrics over target and antitarget regions.
sex.txt: prediction of the donor’s gender based on the coverage of chromosome X & Y target and antitarget regions.

The cnvkit authors recommend to check these reports to ensure that all data is suitable for panel of normal creation.

Notes `purecn`

In the current implementation, the purecn panel of normals is required when calling somatic copy numbers in the somatic_targeted_seq_cnv_calling step. In turn, the purecn panel of normals requires the availability of a mutect2 panel of normals. This is because mutect2 is used as somatic variant caller, rather than the older mutect which is the PureCN default.

The PureCN docker container is used, rather than conda environments, because of the complexity of PureCN R packages requirements (including github-only changes to older packages).

Default Configuration

The default configuration is as follows.

step_config:
  panel_of_normals:
    #tools:                                            # Options: 'mutect2', 'cnvkit', 'purecn', 'access'
    #  - mutect2
    #path_ngs_mapping: ../ngs_mapping
    #
    # Patterns of contig names to ignore
    #ignore_chroms: []                                 # Examples: ['NC_007605', 'hs37d5', 'chrEBV', '*_decoy', 'HLA-*', 'GL000220.*'], ['chrEBV', 'HPV*', 'CMV', 'HBV', 'HCV-*', 'HIV-*', 'KSHV', 'HTLV-1', 'MCV', '*_decoy', 'chrUn_GL00220*', 'SV40']
    #mutect2:
    #
    #  # List additional GetPileupSummaries arguments.
    #  # Each additional argument must be of the form:
    #  # "--<argument name> <argument value>"
    #  extra_arguments: []                             # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #  # Options for the java run-time compiler
    #  java_options: ''                                # Examples: -Xms4000m -Xmx8000m
    #
    #  # number of cores to use locally
    #  num_cores: 2
    #
    #  # number of windows to process in parallel
    #  num_jobs: 24
    #
    #  # use Snakemake profile for parallel processing
    #  use_profile: true
    #
    #  # number of times to re-launch jobs in case of failure
    #  restart_times: 5
    #
    #  # throttling of job creation
    #  max_jobs_per_second: 2
    #
    #  # throttling of status checks
    #  max_status_checks_per_second: 10
    #
    #  # truncation to first N tokens (0 for none)
    #  debug_trunc_tokens: 0
    #
    #  # keep temporary directory, {always, never, onerror}
    #  keep_tmpdir: never                              # Options: 'always', 'never', 'onerror'
    #
    #  # memory multiplier
    #  job_mult_memory: 1.0
    #
    #  # running time multiplier
    #  job_mult_time: 1.0
    #
    #  # memory multiplier for merging
    #  merge_mult_memory: 1.0
    #
    #  # running time multiplier for merging
    #  merge_mult_time: 1.0
    #
    #  # Optional file listing libraries to include in panel
    #  path_normals_list: ''
    #
    #  # Germline variants resource (same as panel of normals)
    #  germline_resource:                              # REQUIRED
    #
    #  # Padding around intervals for scatter/gather
    #  padding: 5000
    #
    #  # Parameters for the creation of the (transient?) genomics database.
    #  # The 2nd part of panel of normals creation (CreateSomaticPanelOfNormals) is not under user control
    #  genomicsdb:
    #
    #    # List additional GetPileupSummaries arguments.
    #    # Each additional argument must be of the form:
    #    # "--<argument name> <argument value>"
    #    extra_arguments: []                           # Examples: --max-depth-per-sample 1000, --maximum-population-allele-frequency 0.2
    #
    #    # GenomicsDBImport needs lots of memory.
    #    # (mind the caveats in https://gatk.broadinstitute.org/hc/en-us/articles/35967560824859-GenomicsDBImport)
    #    java_options: -Xms16g -Xmx32g
    #cnvkit:
    #
    #  # Path to target regions
    #  path_target:                                    # REQUIRED; Examples: ../panel_of_normals/output/cnvkit.target/out/cnvkit.target.bed
    #
    #  # Path to accessible regions
    #  path_access:                                    # Examples: /data/cephfs-1/work/groups/cubi/projects/biotools/cnvkit/access-10kb.hg38.bed
    #
    #  # Path to accessible regions
    #  path_annotation:                                # Examples: /data/cephfs-1/work/projects/cubit/20.05/static_data/annotation/GENCODE/33/GRCh38/gencode.v33.annotation.gtf
    #  coverage:
    #
    #    # [coverage] Mininum mapping quality score to count a read for coverage depth
    #    min_mapq: 0
    #
    #    # [coverage] Alternative counting algorithm
    #    count: false
    #  target:
    #
    #    # Split large tiled intervals into smaller, consecutive targets.
    #    split: false
    #
    #    # Average size of split target bins (results are approximate). Default 266.6666667 is not allowed (ints only).
    #    avg_size:
    #  antitarget:
    #
    #    # Minimum size of antitarget bins (smaller regions are dropped). Default: 1/16 avg size when not set
    #    min_size:
    #
    #    # Average size of split target bins (results are approximate). Default 150000.
    #    avg_size: 150000
    #  reference:
    #
    #    # [fix] Use GC correction
    #    gc_correction: true
    #
    #    # [fix] Use edge correction
    #    edge_correction: true
    #
    #    # [fix] Use rmask correction
    #    rmask_correction: true
    #
    #    # Minimum cluster size to keep in reference profiles. When 0, don't use cluster, default value = 4
    #    min_cluster_size: 0
    #  genemetrics:
    #
    #    # [genemetrics] Min number of covered probes to consider a gene
    #    min_probes: 3
    #
    #    # [genemetrics] Min abs log2 change to consider a gene
    #    threshold: 0.2
    #
    #    # [genemetrics] Significance cutoff
    #    alpha: 0.05
    #
    #    # [genemetrics] Number of bootstraps
    #    bootstrap: 100
    #  segmetrics:
    #
    #    # [segmetrics] Significance cutoff
    #    alpha: 0.05
    #
    #    # [segmetrics] Number of bootstraps
    #    bootstrap: 100
    #
    #    # [segmetrics] Smooth bootstrap results
    #    smooth_bootstrap: false
    #  breaks:
    #
    #    # [breaks] Min number of covered probes for a break inside the gene
    #    min_probes: 1
    #
    #  # [genemetrics] Drop very low coverage bins
    #  drop_low_coverage: false
    #
    #  # [call, diagram] Specify the chromosomal sex of all given samples as male or female.
    #  # Guess when missing
    #  gender: ''                                      # Options: 'male', 'female', '' (guess)
    #
    #  # [call, diagram] Create male reference
    #  male_reference: false
    #
    #  # Desired average number of sequencing read bases mapped to each bin.
    #  bp_per_bin: 100000.0
    #access:
    #
    #  # [access] Bed file of regions to exclude (mappability, blacklisted, ...)
    #  exclude: []
    #
    #  # [access] Minimum gap size between accessible sequence regions (0: use default value)
    #  min_gap_size: 0
    #purecn:
    #
    #  # Optional file listing libraries to include in panel
    #  path_normals_list: ''
    #
    #  # Bed files of enrichment kit sequences (MergedProbes for Agilent SureSelect),
    #  # recommended by PureCN author
    #  path_bait_regions:                              # REQUIRED
    #
    #  # Mutect2 genomicsDB created during panel_of_normals
    #  path_genomicsDB:                                # REQUIRED
    #  genome_name: unknown                            # Options: 'hg18', 'hg19', 'hg38', 'mm9', 'mm10', 'rn4', 'rn5', 'rn6', 'canFam3'
    #
    #  # For filename only...
    #  enrichment_kit_name: unknown
    #
    #  # GRCh38:
    #  # /fast/work/groups/cubi/projects/biotools/static_data/app_support/PureCN/hg38/mappability.bw
    #  mappability: ''
    #
    #  # Nothing for GRCh38
    #  reptiming: ''
    #  seed: 1234567

Panel of normals generation for tools

Panel of normal for mutect2 somatic variant caller
Panel of normal for cvnkit somatic Copy Number Alterations caller