.. _panel_of_normals:

----------------
Panel of normals
----------------

The purpose of the panel of normals is to use normal samples to identify technical biases present in the data.
These biases can then be corrected (to a certain extent) when processing the tumor samples.
It is therefore **very** important to make sure that all normal samples have been collected, processed & sequenced in the same way.
Downstream processing of tumor samples must also ensure that the sample collection, processing & sequencing match those of the panel.

Finally, the panels require a certain number of normal samples to be effective.
Below that number, the particulars of any normal sample included in the panel will be considered as bias, and corrected for later in processing.

Panels of normals can be generated for multiple applications: somatic variant calling (``mutect2``) & somatic copy number calling (exome data).

Somatic variant calling
=======================

The panel of normals for ``mutect2`` is straightforward to generate: only a germline resource is needed, to locate known germline variants (SNVs & indels).
This germline resource is provided in the GATK best practice bundle.
On the cluster, it can be found in ``/fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK/``; search for ``vcf`` files with ``af-only-gnomad`` in their name.

It is recommended to have at least 40 normals in the panel, but it is not uncommon to build a panel with fewer samples.
When the number of samples is too small, GATK provides a panel of normals in the same best practice bundle.
It can also be found on the cluster, in the same location as above.

In case samples should be omitted from the panel of normals, the user can create a file with the names of the libraries to be included in the panel (one line per library).
This file can be provided using the ``path_normals_list`` configuration option.
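
As an illustration of the list file described above, a minimal sketch is given here. The file name ``normals.txt`` and the library names are hypothetical, and the other ``mutect2`` options are omitted for brevity:

.. code-block:: yaml

    # Hypothetical content of normals.txt, one library name per line:
    #   P001-N1-DNA1-WES1
    #   P002-N1-DNA1-WES1
    panel_of_normals:
        tools: [mutect2]
        mutect2:
            path_normals_list: normals.txt  # Hypothetical path to the list file

Leaving ``path_normals_list`` empty uses all normal samples.
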
Finally, it is recommended to restrict the generation of the panel to the sequences of interest.
This is done by excluding unwanted sequences in the following way:

.. code-block:: yaml

    panel_of_normals:
        tools: [mutect2]
        mutect2:
            germline_resource:
            path_normals_list: ""  # Use all samples when empty, otherwise use only samples in the list file.
            window_length: 300000000  # For exome data, it is sufficient to split the genome by chromosomes
            ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*']  # For hs37d5
        ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*']  # Must be repeated at the level above mutect2

Copy numbers (exome data)
=========================

.. note::

    The current implementations of panel of normals for ``cnvkit`` & ``PureCN`` rely on files generated by the pipeline, but that are **not** properly maintained by the pipeline.
    This is suboptimal, complicates considerably the

``cnvkit``
----------

``cnvkit`` requires a file describing regions unsuitable for coverage analysis (repeats, difficult to map, extreme GC content, ...).
It provides such a file for ``GRCh37`` but not for ``GRCh38``; however, the ``access`` tool can create one from mappability regions in ``bed`` format.

This ``access`` tool **must** be run before creating the ``cnvkit`` panel of normals.
In other words, the ``cnvkit`` panel of normals is created in two steps.

First, the access file is generated:

.. code-block:: yaml

    panel_of_normals:
        tools: [access]
        access:
            exclude:
                -

This creates an access file in ``output/cnvkit.access/out/cnvkit.access.bed``.

Then, the panel can be created:

.. code-block:: yaml

    panel_of_normals:
        tools: [access, cnvkit]
        access:
            exclude:
                -
        cnvkit:
            access:
            path_target_regions:

``PureCN``
----------

``PureCN`` requires the ``mutect2`` panel of normals to be able to create one for itself.
Like ``cnvkit``, the ``PureCN`` panel of normals creation involves two steps: first, the ``mutect2`` panel of normals must be created, and second, it can be used to create ``PureCN``'s panel:

.. code-block:: yaml

    panel_of_normals:
        tools: [mutect2, purecn]
        mutect2:
            ...
        purecn:
            path_genomeDB: .mutect2/out/.mutect2.genomicsDB.tar.gz>
            path_bait_regions:
            genome_name: hg19  # Must be either "hg19" or "hg38"
            enrichment_kit_name: "exome"  # Used to name output files and for CNV processing