Creation of panel of normals for somatic SNV & CNV calling
Implementation of the panel_of_normals step
The panel_of_normals step takes as input the results of the ngs_mapping step
(aligned reads in BAM format) and creates background information for somatic variant calling
and/or somatic copy number calling. This background information is summarized as a
panel of normals.
Usually, the panel_of_normals step is required by somatic variant calling or
somatic copy number calling tools.
Step Input
The panel_of_normals step uses Snakemake sub workflows to access the results of the ngs_mapping step.
By default, all normal DNA samples in the ngs_mapping step are used to create the panel of normals.
However, the user can select a subset of those using the path_normals_list configuration option
(which can differ between tools).
In this case, the libraries listed in the file are used, even if they are not flagged as normal samples.
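As an illustration, restricting the mutect2 panel to a chosen subset of libraries could look like the fragment below. It is a minimal sketch: the path of the list file is hypothetical, and germline_resource keeps the placeholder from the default configuration.

```yaml
step_config:
  panel_of_normals:
    tools: ['mutect2']
    path_ngs_mapping: ../ngs_mapping
    mutect2:
      # Hypothetical path: a file listing the libraries to include in the panel.
      path_normals_list: config/mutect2_normals.txt
      germline_resource: REQUIRED  # placeholder, must be set by the user
```

Because the option is set per tool, a different list could be given under the cnvkit section.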
Step Output
For each panel of normals tool, the step outputs one set of files describing the panel.
For example, the mutect2 panel of normals generates {mapper}.mutect2.pon.vcf.gz
and associated files (MD5 sums, indices).
The list of normals that were used, as well as the individual files (for example, the
VCF files for each normal), are kept in the work directory. This makes it possible to
augment the panel with new files when they become available.
Notes on the cnvkit workflow
cnvkit is a set of tools originally designed to call somatic copy number alterations from exome data.
Its design is modular, which enables its use for whole genome and amplicon data.
cnvkit provides a tool (batch) that encapsulates common practice workflows, depending on the type of data and on the availability of optional inputs.
The current implementation recapitulates this common practice, while still dispatching computations to multiple cluster nodes.
For exome and whole genome data, the cnvkit documentation recommends the creation of a panel of normals (called reference).
The actual workflow to generate this reference differs slightly between exome and whole genome data,
and also changes depending on whether an accessibility file is provided by the user.
Therefore, the cnvkit tool that generates such an accessibility file is implemented as a separate tool.
If a user wants to create this accessibility file with cnvkit tools, then she must first run the access tool.
Only after it has been created can she use it to generate the panel of normals.
For that, she will need to modify the configuration file, adding cnvkit to the list of tools, and setting the access parameter to the output of the access tool.
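The two-stage procedure might be configured roughly as follows. This is a sketch under assumptions: it presumes that access is accepted as an entry of the tools list, and that the accessibility file is found at the path quoted in the default configuration comments, relative to the step directory.

```yaml
# Stage 1 (assumed): run the step with only the access tool enabled.
# step_config:
#   panel_of_normals:
#     tools: ['access']
#
# Stage 2: once the accessibility file exists, add cnvkit and point to it.
step_config:
  panel_of_normals:
    tools: ['access', 'cnvkit']
    path_ngs_mapping: ../ngs_mapping
    cnvkit:
      # Output of the access tool from stage 1 (path assumed relative to the step directory).
      access: output/cnvkit.access/out/cnvkit.access.bed
```

For exome data, path_target_regions would also need to point to the BED file of targeted regions; for whole genome data it is left empty.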
Default Configuration
The default configuration is as follows.
# Default configuration somatic_variant_calling
step_config:
  panel_of_normals:
    tools: ['mutect2']  # REQUIRED - available: 'mutect2'
    path_ngs_mapping: ../ngs_mapping  # REQUIRED
    # Configuration for mutect2
    mutect2:
      path_normals_list: null  # Optional file listing libraries to include in panel
      germline_resource: REQUIRED
      # Java options
      java_options: ' -Xmx16g '
      # Parallelization configuration
      num_cores: 2  # number of cores to use locally
      window_length: 100000000  # split input into windows of this size, each triggers a job
      num_jobs: 500  # number of windows to process in parallel
      use_profile: true  # use Snakemake profile for parallel processing
      restart_times: 5  # number of times to re-launch jobs in case of failure
      max_jobs_per_second: 2  # throttling of job creation
      max_status_checks_per_second: 10  # throttling of status checks
      debug_trunc_tokens: 0  # truncation to first N tokens (0 for none)
      keep_tmpdir: never  # keep temporary directory, {always, never, onerror}
      job_mult_memory: 1  # memory multiplier
      job_mult_time: 1  # running time multiplier
      merge_mult_memory: 1  # memory multiplier for merging
      merge_mult_time: 1  # running time multiplier for merging
      ignore_chroms:  # patterns of chromosome names to ignore
      - NC_007605  # herpes virus
      - hs37d5  # GRCh37 decoy
      - chrEBV  # Epstein-Barr Virus
      - '*_decoy'  # decoy contig
      - 'HLA-*'  # HLA genes
      - 'GL000220.*'  # Contig with problematic, repetitive DNA in GRCh37
    cnvkit:
      path_normals_list: ""  # Optional file listing libraries to include in panel
      path_target_regions: ""  # Bed files of targeted regions (Missing when creating a panel of normals for WGS data)
      access: ""  # Access bed file (output/cnvkit.access/out/cnvkit.access.bed when create_cnvkit_acces was run)
      annotate: ""  # [target] Optional targets annotations
      target_avg_size: 0  # [target] Average size of split target bins (0: use default value)
      bp_per_bin: 50000  # [autobin] Expected base per bin
      split: True  # [target] Split large intervals into smaller ones
      antitarget_avg_size: 0  # [antitarget] Average size of antitarget bins (0: use default value)
      min_size: 0  # [antitarget] Min size of antitarget bins (0: use default value)
      min_mapq: 0  # [coverage] Minimum mapping quality score to count a read for coverage depth
      count: False  # [coverage] Alternative counting algorithm
      min_cluster_size: 0  # [reference] Minimum cluster size to keep in reference profiles. 0 for no clustering
      gender: ""  # [reference] Specify the chromosomal sex of all given samples as male or female. Guess when missing
      male_reference: False  # [reference & sex] Create male reference
      gc_correction: True  # [reference] Use GC correction
      edge_correction: True  # [reference] Use edge correction
      rmask_correction: True  # [reference] Use rmask correction
      drop_low_coverage: False  # [metrics] Drop very-low-coverage bins before calculations
    access:  # Creates access file for cnvkit, based on genomic sequence & excluded regions (optionally)
      exclude: []  # [access] Bed file of regions to exclude (mappability, blacklisted, ...)
      min_gap_size: 0  # [access] Minimum gap size between accessible sequence regions (0: use default value)
Panel of normals generation for tools
Panel of normals for mutect2 somatic variant caller
Panel of normals for cnvkit somatic Copy Number Alterations caller
access is used to create a genome accessibility file that can be used for cnvkit panel of normals creation.
Its output (output/cnvkit.access/out/cnvkit.access.bed) is optional, but its presence affects the way the target and antitarget regions are computed in whole genome mode.
In a nutshell, for exome data, the accessibility file is only used to create antitarget regions.
For genome data, it is used by the autobin tool to compute the average target size used during target regions creation.
If it is present, the target size is computed in amplicon mode; when it is absent,
an accessibility file is created with default settings, which is then used by autobin in whole genome mode.
This follows the internal batch code of cnvkit.
Reports
Report tables can be found in the output/{mapper}.cnvkit/report directory.
Two tables are produced, grouping results for all normal samples together:
metrics.txt: coverage metrics over target and antitarget regions.
sex.txt: prediction of the donor’s gender based on the coverage of chromosome X & Y target and antitarget regions.
The cnvkit authors recommend checking these reports to ensure that all data is suitable for panel of normals creation.