Panel of normals

The purpose of a panel of normals is to use normal samples to identify technical biases present in the data. These biases can then be corrected (to a certain extent) when processing the tumor samples. It is therefore very important to make sure that all normal samples have been collected, processed & sequenced in the same way. Downstream processing of tumor samples must also ensure that sample collection, processing & sequencing match those of the panel.

Finally, a panel requires a certain number of normal samples to be effective. Below that number, the particularities of any individual normal sample included in the panel will be treated as bias and corrected for later in processing.

Panels of normals can be generated for two applications: somatic variant calling (mutect2) & somatic copy number calling on exome data (cnvkit & PureCN).

Somatic variant calling

The panel of normals for mutect2 is straightforward to generate: only a germline resource is needed, to locate known germline variants (SNVs & indels). This germline resource is provided in the GATK best practice bundle. On the cluster, it can be found in /fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK/; look for VCF files with af-only-gnomad in their name.

It is recommended to have at least 40 normals in the panel, but it is not uncommon to build a panel with fewer samples. When the number of samples is too small, the panel of normals that GATK provides in the same best practice bundle can be used instead. It can also be found on the cluster, in the same location as above.

In case samples should be omitted from the panel of normals, the user can create a file with the names of the libraries to be included in the panel (one line per library). This file can be provided using the path_normals_list configuration option.
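
As a minimal sketch of this option, the configuration below restricts the panel to the libraries named in a list file. The file name normals_for_panel.txt and the library name placeholders are hypothetical, and whether a relative path is accepted is an assumption; an absolute path is likely the safer choice.

```yaml
panel_of_normals:
  tools: [mutect2]
  mutect2:
    germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
    # Hypothetical list file, with one library name per line, for example:
    #   <name of normal library 1>
    #   <name of normal library 2>
    # Libraries that are not listed in the file are left out of the panel.
    path_normals_list: <path to normals_for_panel.txt>
```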

Finally, it is recommended to restrict the generation of the panel to the sequences of interest. This is done by excluding unwanted sequences in the following way:

panel_of_normals:
  tools: [mutect2]
  mutect2:
    germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
    path_normals_list: ""          # Use all samples when empty, otherwise use only samples in the list file.
    window_length: 300000000       # For exome data, it is sufficient to split the genome by chromosomes
    ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
  ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2

Copy numbers (exome data)

Note

The current implementations of the panel of normals for cnvkit & PureCN rely on files that are generated by the pipeline but not properly maintained by it. This is suboptimal, complicates considerably the

`cnvkit`

cnvkit requires a file describing regions unsuitable for coverage analysis (repeats, difficult to map, extreme GC content, …). cnvkit provides such a file for GRCh37 but not for GRCh38. However, its access tool can create one from mappability regions in bed format.

This access tool must be run before creating the cnvkit panel of normals. In other words, the cnvkit panel of normals is created in two steps:

First, the access file is generated:

panel_of_normals:
  tools: [access]
  access:
    exclude:
    - <path to mappability file>

This creates an access file in output/cnvkit.access/out/cnvkit.access.bed.

Then, the panel can be created:

panel_of_normals:
  tools: [access, cnvkit]
  access:
    exclude:
    - <path to mappability file>
  cnvkit:
    access: <absolute path to access file>
    path_target_regions: <path to exome baits file>
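
To make the link between the two steps explicit, the sketch below fills the access option with the file produced by the first step. It assumes that the access file is still at the default location given above. The prefix <absolute path to pipeline directory> is a placeholder for wherever the output directory lives in a given project.

```yaml
panel_of_normals:
  tools: [access, cnvkit]
  access:
    exclude:
    - <path to mappability file>
  cnvkit:
    # Assumption: the access file created in the first step has not been moved.
    # Replace the placeholder prefix with the directory that contains output/.
    access: <absolute path to pipeline directory>/output/cnvkit.access/out/cnvkit.access.bed
    path_target_regions: <path to exome baits file>
```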

`PureCN`

PureCN requires the mutect2 panel of normals in order to create its own. As with cnvkit, creating the PureCN panel of normals involves two steps:

First, the mutect2 panel of normals must be created; second, it is used to create PureCN’s panel:

panel_of_normals:
  tools: [mutect2, purecn]
  mutect2:
    ...
  purecn:
    path_genomeDB: <absolute path to output/<mapper>.mutect2/out/<mapper>.mutect2.genomicsDB.tar.gz>
    path_bait_regions: <path to exome baits file>
    genome_name: hg19               # Must be either "hg19" or "hg38"
    enrichment_kit_name: "exome"    # Used to name output files and for CNV processing