Mapping and friends
When identifying somatic variants in low-purity or degraded samples, or when the focus is on sub-clonality, the quality of the mapping can become a limiting factor. To make the best use of the data, it is sometimes necessary to consider adapter trimming & base quality score recalibration.
Adapter trimming is currently implemented as a separate pipeline step, while the handling of UMIs (or Molecular BarCodes, MBCs) and Base Quality Score Re-calibration (BQSR) are both included in the ngs_mapping step.
More precisely, UMI handling & BQSR are sub-steps of the somatic mapping tool.
This “tool” is a placeholder for the collection of operations required to perform mapping & BQSR in the presence of UMIs.
Note
Adapter trimming should eventually be included in this meta-tool, and ultimately all the operations carried out by the somatic tool should be performed by the step as a whole.
For efficiency reasons, this is currently incompatible with the design of the ngs_mapping step.
Adapter trimming
The ngs_mapping step provides a fast tool to trim adapters on-the-fly.
However, there are cases where this fast but limited tool is not sufficient.
The adapter_trimming step provides 2 more advanced tools (bbduk & fastp) for this task.
It is important to note that, unlike other pipeline steps, adapter_trimming produces fastq files.
This means that any subsequent pipeline step relying on fastq files for its input should use the adapter_trimming output, not the files found in the standard way.
To achieve this, the configuration option path_link_in, available for the ngs_mapping, hla_typing, ngs_data_qc & somatic_gene_fusion_calling steps, must be set when these steps run after adapter_trimming.
The configuration snippet would then be similar to:
step_config:
    adapter_trimming:
        tools: [bbduk]
    ngs_mapping:
        path_link_in: <Absolute path to project folder>/adapter_trimming/output/bbduk
        ...
path_link_in replaces the search_path entries from the data_sets sections.
The pattern matching remains unchanged, as the adapter_trimming step does not rename any of the fastq files.
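The same pattern could be applied when several downstream steps read the trimmed files. The sketch below assumes that fastp writes its output to a directory named after the tool, by analogy with the bbduk example above. That directory name is an assumption and should be checked against the actual adapter_trimming output before use; all other options of each step are omitted.

```yaml
step_config:
    adapter_trimming:
        tools: [fastp]
    ngs_mapping:
        # Assumed output location, by analogy with the bbduk example
        path_link_in: <Absolute path to project folder>/adapter_trimming/output/fastp
    ngs_data_qc:
        # Same assumed location: every step reading fastq files needs its own path_link_in
        path_link_in: <Absolute path to project folder>/adapter_trimming/output/fastp
```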
Barcodes & UMIs
The handling of barcodes is generally done in 3 distinct operations:
1. The barcodes (typically stored as the first few bases of the read) are clipped from the read sequence and added to the read name, or as comments to the description on the sequence identifier line of the read record.
2. Mapping is carried out as normal, but depending on the downstream tool, the MBC sequences & their qualities must be added as tags in the output bam file.
3. Aligned reads and the MBC sequences are used for de-duplication. This operation sometimes results in reads that are longer than the original reads.
The mapping tool can be selected from the mapping tools available for DNA.
However, the pipeline has not been tested with long-read mappers, only with mappers for short reads (Illumina), i.e. bwa & bwa-mem2.
The arguments for the mapper are taken from the corresponding section in the configuration file.
Note
Currently, only the AGeNT software from Agilent is implemented. Eventually, this commercial, non-free software should be replaced by open-source alternatives, for example UMI-tools.
An example of the configuration required for the somatic tool would be:
ngs_mapping:
    tools:
        dna: [mbcs]
    somatic:
        mapping_tool: bwa_mem2
        barcode_tool: agent  # Only agent is currently implemented
        use_barcodes: true
        recalibrate: true
    bwa_mem2:
        path_index: <path_to_bwa-mem2 indices>
        trim_adapter: false
        mark_duplicates: true
        split_as_secondary: true
        extra_args: ["-C"]  # Use ["-C"] when UMI/MBC are present, and processed with AGeNT, otherwise [""]
    agent:
        prepare:
            path: <path to AGeNT trimmer software>
            lib_prep_type: v2  # Check AGeNT documentation, must be one of "halo", "hs", "xt", "v2", "qxt"
            extra_args:
                - "-polyG 8"  # Check AGeNT documentation, trimming polyG tails
                - "-minFractionRead 50"  # Check AGeNT documentation, ignore heavily trimmed reads
        mark_duplicates:
            path: <path to AGeNT creak software>
            path_baits: <path to baits>
            consensus_mode: HYBRID  # Check AGeNT documentation, must be one of "SINGLE", "DUPLEX", "HYBRID"
            input_filter_args:  # Check AGeNT documentation, input read filters on mapping & base qualities
                - "-mm 13"
                - "-mr 13"
                - "-mq 30"
            consensus_filter_args: []  # Check AGeNT documentation, filtering over consensus
            extra_args: []  # Check AGeNT documentation, extra arguments
    bqsr:
        common_variants: <path to common germline variants>  # For example small_exac_common_3.vcf from the GATK bucket
A summary of the AGeNT documentation is available on the cluster (/fast/work/groups/cubi/projects/biotools/AGeNT/AGeNT ReadMe.pdf).
Note
The meta-tool is named somatic in the main description, but the mapper prefix in file names is mbcs.
Both choices are poor and will eventually be changed.
Base Quality Score Re-calibration
BQSR should generally be applied.
It is only implemented as part of the somatic/mbcs tool.
However, it can be applied even in the absence of UMIs or barcodes: just set the configuration option use_barcodes to false.
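A minimal sketch of such a configuration, derived from the example in the barcodes section, might look as follows. Only use_barcodes and extra_args differ from that example, the latter following its comment ([""] when no UMI/MBC is present). The agent section is left out on the assumption that it is not read when use_barcodes is false; keep it if the configuration validation requires it.

```yaml
ngs_mapping:
    tools:
        dna: [mbcs]
    somatic:
        mapping_tool: bwa_mem2
        barcode_tool: agent  # Kept as in the original example
        use_barcodes: false  # No UMI/MBC handling, BQSR only
        recalibrate: true
    bwa_mem2:
        path_index: <path_to_bwa-mem2 indices>
        trim_adapter: false
        mark_duplicates: true
        split_as_secondary: true
        extra_args: [""]  # No "-C", since there are no barcode comments to propagate
    bqsr:
        common_variants: <path to common germline variants>
```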