Somatic Targeted Seq. CNV Calling

Implementation of the somatic_target_seq_cnv_calling step

This step allows for the detection of CNV events for cancer samples from targeted sequenced (e.g., exomes or large panels). The wrapped tools start from the aligned reads (thus off ngs_mapping) and generate CNV calls for somatic variants.

The wrapped tools implement different strategies. Some work “reference free” and just use the somatic BAM files for their input, some work in “matched cancer normal mode” and need the cancer and normal BAM files, others again need both normal and cancer BAM files, and additionally a set of non-cancer BAM files for their background.

Step Input

Gene somatic CNV calling for targeted sequencing starts off the aligned reads, i.e., ngs_mapping.

Step Output

There is no widely used standard to report copy number alterations. In absence of a better solution, all CNV tools implemented in somatic pipeline output the segmentation table loosely following the DNAcopy format.` The copy number call may or may not be present, and the chromosome number is replaced by its name. The segmentation output is in file output/<mapper>.<cnv caller>.<lib name>/out/<mapper>.<cnv caller>.<lib name>_dnacopy.seg.

output/
+-- bwa.cnvkit.P001-N1-DNA1-WES1
|   |-- out
|   |   |-- bwa.cnvkitP001-N1-DNA1-WES1_dnacopy.seg
        [...]

Note that tool cnvetti doesn’t follow the snappy convention above: the tool name is followed by an underscore & the action, where the action is one of coverage, segment and postprocess. For example, the output directory would contain a directory named bwa.cnvetti_coverage.P002-T1-DNA1-WES1.

Note

Tool-Specific Output

Each tool produces its own set of outputs, generally not in standard format. Some of these files are linked from work to output, but not necessarily all of them. Some tools (for example cnvkit) also produces a report, with tables and figures.

Default Configuration

The default configuration is as follows.

step_config:
  somatic_targeted_seq_cnv_calling:
    #tools:                                              # Options: 'cnvkit', 'sequenza', 'copywriter', 'cnvetti_on_target', 'cnvetti_off_target', 'purecn'
    #  - cnvkit
    #path_ngs_mapping: ../ngs_mapping
    #cnvkit:
    #
    #  # Path to target regions
    #  path_target:                                      # REQUIRED; Examples: ../panel_of_normals/output/cnvkit.target/out/cnvkit.target.bed
    #
    #  # Path to antitarget regions
    #  path_antitarget:                                  # REQUIRED; Examples: ../panel_of_normals/output/cnvkit.antitarget/out/cnvkit.antitarget.bed
    #
    #  # Path to panel of normals (reference)
    #  path_panel_of_normals:                            # REQUIRED; Examples: ../panel_of_normals/output/{mapper}.cnvkit.create_panel/out/{mapper}.cnvkit.panel_of_normals.cnn
    #
    #  # Generate plots (very slow)
    #  plot: true
    #
    #  # [coverage] Mininum mapping quality score to count a read for coverage depth
    #  min_mapq: 0
    #
    #  # [coverage] Alternative counting algorithm
    #  count: false
    #
    #  # [fix] Use GC correction
    #  gc_correction: true
    #
    #  # [fix] Use edge correction
    #  edge_correction: true
    #
    #  # [fix] Use rmask correction
    #  rmask_correction: true
    #
    #  # [segment] One of cbs, flasso, haar, hmm, hmm-tumor, hmm-germline, none
    #  segmentation_method: cbs                          # Options: 'cbs', 'flasso', 'haar', 'hmm', 'hmm-tumor', 'hmm-germline', 'none'
    #
    #  # [segment] Significance threshold (hmm methods: smoothing window size)
    #  segmentation_threshold: 1e-06
    #
    #  # [segment, call, genemetrics] Drop very low coverage bins
    #  drop_low_coverage: false
    #
    #  # [segment] Drop outlier bins (0 for no outlier filtering)
    #  drop_outliers: 10
    #
    #  # [segment] Additional smoothing of CBS segmentation (WARNING- not the default value)
    #  smooth_cbs: true
    #
    #  # [call] Either one of mean, median, mode, biweight, or a constant log2 ratio value.
    #  center:
    #
    #  # [call] One of ampdel, cn, ci, sem (merging segments flagged with the specified filter),
    #  # "" for no filtering
    #  filter: ampdel
    #
    #  # [call] One of threshold, clonal, none
    #  calling_method: threshold                         # Options: 'threshold', 'clonal', '' (none)
    #
    #  # [call] Thresholds for calling integer copy number
    #  call_thresholds: -1.1,-0.25,0.2,0.7
    #
    #  # [call] Ploidy of sample cells
    #  ploidy: 2
    #
    #  # [call] Estimated tumor cell fraction (0 for discarding tumor cell purity)
    #  purity: 0.0
    #
    #  # [call, diagram] Specify the chromosomal sex of all given samples as male or female.
    #  # Guess when missing
    #  gender: ''                                        # Options: 'male', 'female', '' (guess)
    #
    #  # [call, diagram] Create male reference
    #  male_reference: false
    #
    #  # [diagram] Copy number change threshold to label genes
    #  diagram_threshold: 0.5
    #
    #  # [diagram] Min number of covered probes to label genes
    #  diagram_min_probes: 3
    #
    #  # [diagram] Shift X & Y chromosomes according to sample sex
    #  shift_xy: true
    #
    #  # [breaks] Min number of covered probes for a break inside the gene
    #  breaks_min_probes: 1
    #
    #  # [genemetrics] Min number of covered probes to consider a gene
    #  genemetrics_min_probes: 3
    #
    #  # [genemetrics] Min abs log2 change to consider a gene
    #  genemetrics_threshold: 0.2
    #
    #  # [genemetrics] Significance cutoff
    #  genemetrics_alpha: 0.05
    #
    #  # [genemetrics] Number of bootstraps
    #  genemetrics_bootstrap: 100
    #
    #  # [segmetrics] Significance cutoff
    #  segmetrics_alpha: 0.05
    #
    #  # [segmetrics] Number of bootstraps
    #  segmetrics_bootstrap: 100
    #
    #  # [segmetrics] Smooth bootstrap results
    #  smooth_bootstrap: false
    #sequenza:
    #  length: 50
    #
    #  # Must be hg38 for GRCh38. See copynumber for complete list (augmented with hg38)
    #  assembly: hg19
    #
    #  # Extra arguments for sequenza bam2seqz
    #  extra_args: {}
    #
    #  # patterns of chromosome names to ignore
    #  ignore_chroms:
    #    - X
    #    - Y
    #    - MT
    #    - NC_007605. hs37d5
    #    - chrEBV
    #    - '*_decoy'
    #    - HLA-*
    #    - GL000220.*
    #
    #  # Valid arguments: see ?sequenza::sequenza.extract in R
    #  extra_args_extract:
    #    gamma: 60
    #    kmin: 50
    #
    #  # Valid arguments: see ?sequenza::sequenza.fit in R
    #  extra_args_fit:
    #    N_ratio_filter: 10
    #    N_BAF_filter: 1
    #    segment_filter: 3000000
    #    mufreq_treshold: 0.1
    #    ratio_priority: false
    #    ploidy:
    #      - 1.0
    #      - 1.1
    #      - 1.2
    #      - 1.3
    #      - 1.4
    #      - 1.5
    #      - 1.6
    #      - 1.7
    #      - 1.8
    #      - 1.9
    #      - 2.0
    #      - 2.1
    #      - 2.2
    #      - 2.3
    #      - 2.4
    #      - 2.5
    #      - 2.6
    #      - 2.7
    #      - 2.8
    #      - 2.9
    #      - 3.0
    #      - 3.1
    #      - 3.2
    #      - 3.3
    #      - 3.4
    #      - 3.5
    #      - 3.6
    #      - 3.7
    #      - 3.8
    #      - 3.9
    #      - 4.0
    #      - 4.1
    #      - 4.2
    #      - 4.3
    #      - 4.4
    #      - 4.5
    #      - 4.6
    #      - 4.7
    #      - 4.8
    #      - 4.9
    #      - 5.0
    #      - 5.1
    #      - 5.2
    #      - 5.3
    #      - 5.4
    #      - 5.5
    #copywriter:
    #
    #  # Path to target regions
    #  path_target_regions:                              # REQUIRED
    #  bin_size: 20000
    #
    #  # Path to civic annotation
    #  plot_genes:                                       # REQUIRED
    #
    #  # Could be hg38 (consider setting prefix to 'chr' when using GRCh38.v1)
    #  genome: hg19
    #  features: EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75
    #  prefix: ''
    #  nThread: 8
    #purecn:
    #
    #  # Must be one from hg18, hg19, hg38, mm9, mm10, rn4, rn5, rn6, canFam3
    #  genome_name: unknown                              # Options: 'hg18', 'hg19', 'hg38', 'mm9', 'mm10', 'rn4', 'rn5', 'rn6', 'canFam3'
    #
    #  # For filename only...
    #  enrichment_kit_name: unknown
    #
    #  # GRCh38:
    #  #  /fast/work/groups/cubi/projects/biotools/static_data/app_support/PureCN/hg38/mappability.bw
    #  mappability: ''
    #
    #  # Nothing for GRCh38
    #  reptiming: ''
    #  seed: 1234567
    #
    #  # Recommended extra arguments for PureCN, extra_commands: {} to clear them all
    #  extra_commands:
    #    model: betabin
    #    fun-segmentation: PSCBS
    #    post-optimize: ''
    #
    #  # A PureCN panel of normals is required,
    #  # with the container, the intervals & the PON rds file
    #  path_container:                                   # REQUIRED; Examples: ../panel_of_normals/work/containers/out/purecn.simg
    #  path_intervals:                                   # REQUIRED; Examples: ../panel_of_normals/output/purecn/out/<enrichement_kit_name>_<genome_name>.list
    #
    #  # Path to the PureCN panel of normal
    #  path_panel_of_normals:                            # REQUIRED; Examples: ../panel_of_normals/output/bwa.purecn/out/bwa.purecn.panel_of_normals.rds
    #
    #  # Path to the PureCN mapping bias file
    #  path_mapping_bias:                                # REQUIRED; Examples: ../panel_of_normals/output/bwa.purecn/out/bwa.purecn.mapping_bias.rds
    #
    #  # IMPORTANT NOTE:
    #  # Mutect2 must be called with "--genotype-germline-sites true --genotype-pon-sites true
    #  somatic_variant_caller: mutect2
    #  path_somatic_variants:                            # REQUIRED; Examples: ../somatic_variant_calling_for_purecn
    #cnvetti_on_target:
    #  path_target_regions:                              # REQUIRED
    #cnvetti_off_target:
    #  path_target_regions:                              # REQUIRED
    #  window_length: 20000

Available Somatic Targeted CNV Caller

  • cnvkit

  • sequenza

  • purecn. Note that purecn requires a panel of normals and a second set of variants called by mutect2, that includes germline ones.

  • copywriter (deprecated, the R package was removed with Bioconductor release 3.18)

  • cnvetti_on_target & cnvetti_off_target upsupported