Mapping and friends

To reliably identify somatic variants in low-purity or degraded samples, or when the focus is on sub-clonality, the quality of the mapping can become a limiting factor. To make the best use of the data, it is sometimes necessary to consider adapter trimming & base quality recalibration.

Adapter trimming is currently implemented as a separate pipeline step, while the handling of UMIs (or Molecular BarCodes, MBCs) and Base Quality Score Re-calibration (BQSR) are both included in the ngs_mapping step. More precisely, UMI handling & BQSR are sub-steps of the somatic mapping tool. This “tool” is a placeholder for the collection of operations required to perform mapping & BQSR in the presence of UMIs.

Note

Ideally, adapter trimming would also be part of this meta-tool, and eventually all the operations carried out by the somatic tool should be performed by the ngs_mapping step as a whole. This is currently incompatible with the design of the ngs_mapping step, for efficiency reasons.

Adapter trimming

The ngs_mapping step provides a fast tool to trim adapters on-the-fly. However, there are cases where this fast but limited tool is not sufficient. For these cases, the adapter_trimming step provides 2 more advanced tools (bbduk & fastp).

It is important to note that, unlike other pipeline steps, adapter_trimming produces fastq files. This means that any subsequent pipeline step relying on fastq files for its input should use the adapter_trimming output, rather than the original files found in the standard way.

The configuration option path_link_in, available for the ngs_mapping, hla_typing, ngs_data_qc & somatic_gene_fusion_calling steps, must be used when these steps are run after adapter_trimming.

The configuration snippet would then be similar to:

step_config:
  adapter_trimming:
    tools: [bbduk]

  ngs_mapping:
    path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk
    ...

path_link_in replaces the search_path entries from the data_sets sections. The pattern matching remains unchanged, as the adapter_trimming step does not rename any of the fastq files.
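
Because path_link_in is available for several steps, the same trimmed output can feed all of them. The following is only a sketch: it assumes that every listed step should read the bbduk output, reuses the placeholder path from the snippet above, and omits all other settings of each step.

step_config:
  adapter_trimming:
    tools: [bbduk]

  # Assumption: all fastq-consuming steps below read the same trimmed files
  ngs_mapping:
    path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk
  ngs_data_qc:
    path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk
  hla_typing:
    path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk
  somatic_gene_fusion_calling:
    path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk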

Barcodes & UMIs

The handling of barcodes is generally done in 3 distinct operations:

  1. The barcodes (typically stored as the first few bases of the read) are clipped from the read sequence and added to the read name, or as comments to the description on the sequence identifier line of the read record.

  2. Mapping is carried out as normal, but depending on the downstream tool, the MBC sequences & their qualities must be added as tags in the output bam file.

  3. Aligned reads and the MBC sequences are used for de-duplication. This operation sometimes results in reads longer than the actual sequenced reads.

The mapping tool can be selected from the mapping tools available for DNA. However, the pipeline has not been tested with long-read mappers, only with mappers for short reads (Illumina), namely bwa & bwa-mem2. The arguments for the mapper are taken from the corresponding section in the configuration file.

Note

Currently, only the AGeNT software from Agilent is implemented. Eventually, this commercial, non-free software should be replaced by open-source alternatives, for example UMI-tools.

An example of the configuration required for the somatic tool would be:

ngs_mapping:
  tools:
    dna: [mbcs]
  somatic:
    mapping_tool: bwa_mem2
    barcode_tool: agent         # Only agent is currently implemented
    use_barcodes: true
    recalibrate: true
  bwa_mem2:
    path_index: <path_to_bwa-mem2 indices>
    trim_adapter: false
    mark_duplicates: true
    split_as_secondary: true
    extra_args: ["-C"]          # Use ["-C"] when UMI/MBC are present, and processed with AGeNT, otherwise [""]
  agent:
    prepare:
      path: <path to AGeNT trimmer software>
      lib_prep_type: v2         # Check AGeNT documentation, must be one of "halo", "hs", "xt", "v2", "qxt"
      extra_args:
      - "-polyG 8"              # Check AGeNT documentation, trimming polyG tails
      - "-minFractionRead 50"   # Check AGeNT documentation, ignore heavily trimmed reads
    mark_duplicates:
      path: <path to AGeNT creak software>
      path_baits: <path to baits>
      consensus_mode: HYBRID    # Check AGeNT documentation, must be one of "SINGLE", "DUPLEX", "HYBRID"
      input_filter_args:        # Check AGeNT documentation, input read filters on mapping & base qualities
      - "-mm 13"
      - "-mr 13"
      - "-mq 30"
      consensus_filter_args: [] # Check AGeNT documentation, filtering over consensus
      extra_args: []            # Check AGeNT documentation, extra arguments
  bqsr:
    common_variants: <path to common germline variants>  # For example small_exac_common_3.vcf from the GATK bucket

A summary of the AGeNT documentation is available on the cluster (/fast/work/groups/cubi/projects/biotools/AGeNT/AGeNT ReadMe.pdf).

Note

The meta-tool is named somatic in the main description, but the mapper prefix in file names is mbcs. Both choices are poor, and will eventually be changed.

Base Quality Score Re-calibration

BQSR should generally be applied. It is only implemented as part of the somatic/mbcs tool. However, it can be applied even in the absence of UMIs or barcodes: simply set the configuration option use_barcodes to false.
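
As an illustration of BQSR without barcodes, the example configuration above could be reduced along the following lines. This is an untested sketch: it reuses only keys shown in that example, sets extra_args to [""] as its comment suggests when no UMI/MBC is present, and assumes that the agent section can be left out when use_barcodes is false, which should be checked against the pipeline version in use.

ngs_mapping:
  tools:
    dna: [mbcs]
  somatic:
    mapping_tool: bwa_mem2
    barcode_tool: agent         # Only agent is currently implemented
    use_barcodes: false         # No UMI/MBC handling, BQSR only
    recalibrate: true
  bwa_mem2:
    path_index: <path_to_bwa-mem2 indices>
    trim_adapter: false
    mark_duplicates: true
    split_as_secondary: true
    extra_args: [""]            # No "-C", as no UMI/MBC comments are present
  # Assumption: the agent section is not needed when use_barcodes is false
  bqsr:
    common_variants: <path to common germline variants>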