NGS Mapping
Implementation of the ngs_mapping
step
The ngs_mapping step allows for the alignment of NGS data with standard read mappers, such as BWA for DNA data and STAR for RNA data. Also, it provides functionality to compute post-alignment statistics, such as the coverage of target (e.g., exome or panel) regions.
There is a distinction made between “normal” DNA reads (short reads from Illumina) and “long” DNA reads, such as PacBio/Oxford Nanopore. Again, the NGS mapping step will perform alignment of all NGS libraries.
The precise actions that are performed in the alignment are defined by the wrappers (e.g., the
bwa
or star
) wrappers. Generally, this includes converting into BAM format, sorting
by coordinate, an indexing using a BAI file. For short reads, this can include marking of
duplicates using Samblaster and depends on the actual configuration (see below for the default
configuration).
An additional somatic
meta-tool for DNA mapping data with UMIs or molecular barcodes (MBCs),
and that can also optionally perform Base Quality Score Re-calibration (BQSR).
Its usage & implementation is detailed below, as its design is different from the other tools.
Properties
overall stability
stable
applicable to
germline and somatic read alignment
generally applicable to
short and long read DNA and RNA sequencing
Step Input
For each library defined in all sample sheets, the instances of this step will search for the input
files according to the configuration. The found read files will be linked into
work/input_links/{library_name}
(status quo, not a output path, thus path not guaranteed
to be stable between minor versions).
This is different to the other steps that use the output of previous steps for their input.
Data Set Configuration
Consider the following data set definition from the main configuration file.
data_sets:
first_batch:
file: 01_first_batch.json
search_patterns:
# Note that currently only "left" and "right" key known
- {'left': '*/L???/*_R1.fastq.gz', 'right': '*/L???/*_R2.fastq.gz'}
search_paths: ['../input/01_first_batch']
Here, the data set first_batch
is defined. The sample sheet file is named
01_first_batch.json
and looked for in the relative path to the configuration file. The input
search will be start in the (one, but could be more than one) path ../input/01_first_batch
(relative to the directory containing the configuration file). The sample sheet provides a
folderName
extraInfo
entry for each NGS library. This folder name is searched for (e.g.,
P001-N1-DNA1-WES
). Once such a folder is found, the patterns in the values of the dict
search_patterns
are used for locating the paths of the actual files.
Currently, the only supported keys in the search_patterns
dict are "left"
and "right""
(the lattern can be omitted when only searching for single-end reads).
Consider the following example:
../input/
`-- 01_first_batch
|-- P001-N1-DNA1-WES1
| `-- 42KF5AAXX
| `-- L001
| |-- P001-N1-DNA1-WES1_R1.fastq.gz
| |-- P001-N1-DNA1-WES1_R1.fastq.gz.md5
| |-- P001-N1-DNA1-WES1_R2.fastq.gz
| `-- P001-N1-DNA1-WES1_R2.fastq.gz.md5
[...]
Here, the folder 01_first_batch
will be searched for a directory named P001-N1-DNA1-WES
.
After finding, the relative paths 42KF5AAXX/L001/P001-N1-DNA1-WES1_R1.fastq.gz
and
42KF5AAXX/L001/P001-N1-DNA1-WES1_R2.fastq.gz
will be found and used for the left/right parts of
a paired read set.
Mapping after adapter trimming
When the data is trimmed prior to mapping, the step must take its input from the output of the adapter_trimming
step,
rather than from the raw fastq
files.
When not empty, the configuration option path_link_in
overrides the search paths defined in the data_sets
section.
In the following example, the input fastq files are taken from the output of the adapter_trimming
step, after the
latter has been run with the bbduk
adapter trimming tool:
ngs_mapping:
path_link_in: <absolute path to adapter_trimming/output/bbduk>
tools:
dna: [bwa]
[...]
Note that only the search paths are overriden, the search patterns defined in the data_sets
section are still used
to find the input files. The adapter trimming tools do not rename fastq files.
Mixing Single-End and Paired-End Reads
By default, it is checked that for each search_pattern
, the same number of matching files
has to be found, otherwise directories are ignored. The reason is to reduce the number of
possible errors when linking in files. You can change this behaviour by specifying
mixed_se_pe: True
in the data set information. Then, it will be allowed to have the matches
for the right
entry to be empty. However, you will need to consistently have either SE or
PE data for each library; it is allowed to mix SE and PE libraries within one project but not
to have PE and SE data for one library.
Exome enrichment kit
The configuration allows for the mapping of cohorts sequenced on different exome enrichment kits. However, each sample must be assigned to the correct kit for the target coverage report to be accurate. The configuration and the sample sheet must prepared in the following way:
The sample sheet must contain the optional column libraryKit
, defined at the library level.
The column is filled with the name of the kit for each sample. These names are defined by the user
(choose wisely!), __default__
is a reserved name. The sample sheet would look like:
[Custom Fields]
key annotatedEntity docs type minimum maximum unit choices pattern
extractionType bioSample extraction type string 0 0 0 0 0
libraryKit ngsLibrary exome enrichment kit string 0 0 0 0 0
[Data]
patientName sampleName isTumor extractionType libraryType folderName libraryKit
case001 N1 N DNA WES case001-N1-DNA1-WES1 V5_noUTR
case001 T1 Y DNA WES case001-T1-DNA1-WES1 V5_noUTR
case002 N1 N DNA WES case002-N1-DNA1-WES1 V5_UTR
case002 T1 Y DNA WES case002-T1-DNA1-WES1 V5_UTR
case003 N1 N DNA WES case003-N1-DNA1-WES1 __default__
[...]
The library kit name is mapped to a kit via regular expression pattern recognition: the configuration
stores a list of enrichment kit classes, each with a name, a regex pattern & the path to the bed file describing
the kit locii. The __default__
kit name matches the __default__
class, without requiring a pattern.
The configuration would be:
ngs_mapping:
target_coverage_report:
path_target_interval_list_mapping:
- name: "Agilent SureSelect V5"
pattern: "V5_noUTR"
path: <absolute path to the Agilent V5 bed file>
- name: Agilent SureSelect V5 + UTRs"
pattern: "V5_UTR"
path: <absolute path to the Agilent V5 + UTRs bed file>
- name: __default__
path: <absolute path to another exome panel bed file>
Note that it is not possible to name the Agilent V5 library kit simply “V5” in the sample sheet, because “V5” would also match “V5_UTR”.
This pattern matching mechanism is usually not required for exome enrichment kits, but it may be very convenient with panels, which can have several revisions.
Step Output
For each NGS library with name library_name
and each read mapper mapper
that the library
has been aligned with, the pipeline step will create a directory
output/{mapper}.{library_name}/out
with symlinks of the following names to the resulting
sorted BAM files with corresponding BAI and MD5 files.
{mapper}.{library_name}.bam
{mapper}.{library_name}.bam.bai
{mapper}.{library_name}.bam.md5
{mapper}.{library_name}.bam.bai.md5
In addition, several tools are used to automatically generate reports based on the BAM and BAI files. See the Reports section below for more details
The BAM files are only postprocessed if configured so.
Note
In contrast to other pipeline steps, the NGS mapping step will also generate the BAM files for the background data sets as there are currently problems with Snakemake sub workflows and input functions.
Global Configuration
static_data_config/reference/path
must be set appropriately
Default Configuration
The default configuration is as follows.
step_config:
ngs_mapping:
# Aligners to use for the different NGS library types
tools: # REQUIRED
# Required if DNA analysis; otherwise, leave empty.
# dna: [] # Options: 'bwa', 'bwa_mem2'
#
# Required if RNA analysis; otherwise, leave empty.
# rna: [] # Options: 'star'
#
# Required if long-read mapper used; otherwise, leave empty.
# dna_long: [] # Options: 'minimap2'
#
# OPTIONAL Override data set configuration search paths for FASTQ files
#path_link_in: ''
#
# Thresholds for targeted sequencing coverage QC.
#target_coverage_report:
# path_target_interval_list_mapping: []
#
# Depth of coverage collection, mainly useful for genomes.
#bam_collect_doc:
# enabled: false
# window_length: 1000
#
# Compute fingerprints with ngs-chew
#ngs_chew_fingerprint:
# enabled: true
#
# Configuration for BWA
#bwa:
#
# # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
# path_index: # REQUIRED
# num_threads_align: 16
# num_threads_trimming: 8
# num_threads_bam_view: 4
# num_threads_bam_sort: 4
# memory_bam_sort: 4G
# trim_adapters: false
# mask_duplicates: true
#
# # -M flag
# split_as_secondary: false
#
# # [ "-C" ] when molecular barcodes are processed with AGeNT in the somatic mode
# extra_args: []
#
# Configuration for BWA-MEM2
#bwa_mem2:
#
# # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
# path_index: # REQUIRED
# num_threads_align: 16
# num_threads_trimming: 8
# num_threads_bam_view: 4
# num_threads_bam_sort: 4
# memory_bam_sort: 4G
# trim_adapters: false
# mask_duplicates: true
#
# # -M flag
# split_as_secondary: false
#
# # [ "-C" ] when molecular barcodes are processed with AGeNT in the somatic mode
# extra_args: []
#
# Configuration for somatic ngs_calling
# (separate read groups, molecular barcodes & base quality recalibration)
#somatic:
#
# # Either bwa of bwa_mem2. The indices & other parameters are taken from mapper config
# mapping_tool: # REQUIRED; Options: 'bwa', 'bwa_mem2'
#
# # Only agent currently implemented
# barcode_tool: agent # Options: 'agent'
# use_barcodes: false
# recalibrate: true
#bqsr:
#
# # Common germline variants (see /fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK)
# common_variants: # REQUIRED
#agent:
# prepare: # REQUIRED
# path: # REQUIRED
#
# # One of "halo" (HaloPlex), "hs" (HaloPlexHS), "xt" (SureSelect XT, XT2, XT HS), "v2" (SureSelect XT HS2) & "qxt" (SureSelect QXT)
# lib_prep_type: # Options: 'halo' (HALO_PLEX), 'hs' (HALO_PLEX_HS), 'xt' (SURE_SELECT), 'v2' (SURE_SELECT_HS2), 'qxt' (SURE_SELECT_QXT)
#
# # Consider "-polyG 8" for NovaSeq data & "-minFractionRead 50" for 100 cycles data
# extra_args: []
# mark_duplicates: # REQUIRED
# path: # REQUIRED
# path_baits: # REQUIRED
#
# # One of "SINGLE", "HYBRID", "DUPLEX"
# consensus_mode: # Options: 'SINGLE', 'HYBRID', 'DUPLEX'
#
# # Consider -mm 13 (min base qual) -mr 13 (min barcode base qual) -mq 30 (min map qual)
# input_filter_args: []
# consensus_filter_args: []
#
# # Consider -d 1 (max nb barcode mismatch)
# extra_args: []
#
# Configuration for STAR
#star:
#
# # Required if listed in ngs_mapping.tools.rna; otherwise, can be removed.
# path_index: # REQUIRED
# num_threads_align: 16
# num_threads_trimming: 8
# num_threads_bam_view: 4
# num_threads_bam_sort: 4
# memory_bam_sort: 4G
# genome_load: NoSharedMemory
# raw_star_options: ''
# align_intron_max: 1000000
# align_intron_min: 20
# align_mates_gap_max: 1000000
# align_sjdb_overhang_min: 1
# align_sj_overhang_min: 8
# out_filter_mismatch_n_max: 999
# out_filter_mismatch_n_over_l_max: 0.04
# out_filter_multimap_n_max: 20
# out_filter_type: BySJout
#
# # or for cufflinks: RemoveNoncanonical
# out_filter_intron_motifs: ''
#
# # or for cufflinks: intronMotif
# out_sam_strand_field: ''
#
# # true to output transcript coordinate bam for RSEM
# transcriptome: false
# trim_adapters: false
# mask_duplicates: false
# include_unmapped: true
#strandedness:
#
# # Location of usually highly expressed genes. Known protein coding genes is a good choice
# path_exon_bed: # REQUIRED
#
# # -1: unknown value, use infer_, 0: unstranded, 1: forward, 2: reverse (from featurecounts)
# strand: -1 # Options: -1 (UNKNOWN), 0 (INFER), 1 (FORWARD), 2 (REVERSE)
#
# # Minimum proportion of reads mapped to forward/reverse direction to call the protocol
# threshold: 0.85
#minimap2:
# mapping_threads: 16
#mbcs:
# mapping_tool: # REQUIRED; Options: 'bwa', 'bwa_mem2'
# use_barcodes: # REQUIRED
# recalibrate: # REQUIRED
Available Read Mappers
The following read mappers are available for the alignment of DNA-seq and RNA-seq reads.
- (short/Illumina) DNA
"bwa"
"bwa_mem2"
"somatic"
"external"
- (short/Illumina) RNA-seq
"star"
"external"
- (long/PacBio/Nanopore) DNA
"minimap2"
"external"
Notes on STAR mapper configuration
Recent versions of STAR offer the possibility to output gene counts and alignments of reads on the transcritpome, rather than on the genome.
In both cases, this requires that STAR is aware of the genes, transcripts, exon & introns features. These can be provided either during the indexing stage, or with recent versions, during mapping.
The configuration provides the possibility to pass to STAR the location of a gtf file describing the features. This removes the need to include gene models into the generation of indices, so that the user can select the gene models (either from ENSEMBL or GENCODE, for example).
The computation of gene counts relies on the features defined in the static_data_config section of the configuration file. The other steps relying of feature annotations should use this too.
STAR outputs the counts for unstranded, forward and reverse strand protocols. When the user doesn’t supply the protocol code (0 for unstranded, 1 for forward & 2 for reverse), the step runs infer_experiment (from the rseqc library) to infer the protocol strandedness. In both cases, the step generate a copy of the STAR output containing only the relevant column included. This final version of the gene counts is found in output/star.{library_name}/out/star.{library_name}.GeneCounts.tab. Note that for infer_experiment to work efficiently, a 12-columns bed file must be provided (6-columns appears to work too), with locii of well-defined, ubiquitously expressed genes. The intervals must be non-overlapping & sorted.
If the configuration option transcriptome is set to true, the step will output a bam file of reads mapped to the transcriptome (output/stat.{library_name}/out/star.{library_name}.toTranscriptome.bam). As with gene counts, STAR will rely on the static data configuration to generate the mappings. Note that the mappings to the transcriptome will not be indexes using samtools index, because the absence of the positional mappings.
Notes on the somatic
meta-tool
The somatic
mapping tool should be used in presence of Molecular BarCodes (MBCs), or
when Base Quality Score Re-calibration (BQSR) must be carried out.
The mapping itself is done by one of the short DNA mappers (bwa
or bwa-mem2
).
The MBCs handling is optional (via option use_barcodes
), as is the BQSR (option recalibrate
).
The current implementation only implements the proprietary AGeNT software for MBCs.
Because it isn’t open source, and not in bioconda
, it must be installed separately from the pipeline.
Eventually, AGeNT
should be replaced by open source software, such as UMI-tools.
Another shortcominig of the current implementation is that multiple exome kits is not supported. Only one kit per cohort is allowed.
ngs_mapping:
tools:
dna: [mbcs]
somatic:
mapping_tool: bwa_mem2
barcode_tool: agent # Only agent is currently implemented
use_barcodes: true
recalibrate: true
bwa_mem2:
path_index: <path_to_bwa-mem2 indices>
[...] # Select bwa-mem2 options
extra_args: ["-C"] # Use ["-C"] when UMI/MBC are present, and processed with AGeNT, otherwise [""]
agent:
prepare:
path: <path to AGeNT trimmer software>
[...] # Check AGeNT documentation for the "trimmer" program
mark_duplicates:
path: <path to AGeNT creak software>
[...] # Check AGeNT documentation for the "trimmer" program
bqsr:
common_variants: <path to common germline variants> # For example small_exac_common_3.vcf from the GATK bucket
Note
This meta-tool is currently implemented by delegating execution of the different sub-steps (trimming barcodes, mapping, consensus merging, bqsr)
to a Snakefile. This is different from the other tools, with shell scripts wrappers.
This experimental implementation was chosen to avoid over-complex shell scripts, as the bwa
& bwa-mem2
wrappers already perform multiple operations.
Splitting the meta-tool into different sub-steps was also considered, but as sub-steps cannot share the same temporary directory,
that solution was not efficient enough.
Finally, the choice of names for the meta-tool (somatic
) and for the file prefix (mbcs
) are both unfortunate, as might be changed in the future.
Reports
Currently, the following reports are generated based on the BAM and BAI file output by this step.
- General Alignment Statistics (.txt)
The tools
samtools bamstats
,samtools flagstats
andsamtools idxstats
are always called by default, and are linked out into theoutput/{mapper}.{library_name}/report/bam_qc
directory. The file names for these reports (and their MD5s) use the following naming convention:{mapper}.{library_name}.bamstats.txt
{mapper}.{library_name}.flagstats.txt
{mapper}.{library_name}.idxstats.txt
{mapper}.{library_name}.bamstats.txt.md5
{mapper}.{library_name}.flagstats.txt.md5
{mapper}.{library_name}.idxstats.txt.md5
For example, it will look as follows for the example bam files shown above:
output/
+-- bwa.P001-N1-DNA1-WES1
| |-- out
| | |-- bwa.P001-N1-DNA1-WES1.bam
| | |-- bwa.P001-N1-DNA1-WES1.bam.bai
| | |-- bwa.P001-N1-DNA1-WES1.bam.bai.md5
| | `-- bwa.P001-N1-DNA1-WES1.bam.md5
| `-- report
| `-- bam_qc
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt.md5
| |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt
| |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt.md5
| |-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt
| `-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt.md5
[...]
- Target Coverage Report (.alfred.json.gz)
If
ngs_mapping/path_target_regions
is set to a BED file with the target regions (either capture regions of capture kits in the case of targeted sequencing or exons for WES/WGS sequencing) a target coverage report is generated and linked out into theoutput/{mapper}.{library_name}/report/alfred_qc
directory. The file names for these reports (and their MD5s) use the following naming convention:{mapper}.{library_name}.alfred.json.gz
{mapper}.{library_name}.alfred.json.gz.md5
For example, it will look as follows for the example bam files shown above:
output/
+-- bwa.P001-N1-DNA1-WES1
| `-- report
| |-- bam_qc
| | [...]
| `-- alfred_qc
| |-- bwa.P001-N1-DNA1-WES1.alfred.json.gz
| `-- bwa.P001-N1-DNA1-WES1.alfred.json.gz.md5
[...]
(TODO: add file name rules and example)
work/
+-- bwa.P001-N1-DNA1-WES1
| `-- report
| `-- bam_qc
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.d
| | |-- acgt-cycles.gp
| | |-- acgt-cycles.png
| | |-- coverage.gp
| | |-- coverage.png
| | |-- gc-content.gp
| | |-- gc-content.png
| | |-- gc-depth.gp
| | |-- gc-depth.png
| | |-- indel-cycles.gp
| | |-- indel-cycles.png
| | |-- indel-dist.gp
| | |-- indel-dist.png
| | |-- index.html
| | |-- insert-size.gp
| | |-- insert-size.png
| | |-- quals2.gp
| | |-- quals2.png
| | |-- quals3.gp
| | |-- quals3.png
| | |-- quals.gp
| | |-- quals-hm.gp
| | |-- quals-hm.png
| | `-- quals.png
| [...]
[...]
- Fingerprinting (.npz)
When the
ngs_chew_fingerprint
is enabled, afingerprint
report folder is created by the ngs_chew tool, which contains a fingerprint of the sample based on carefully selected germline variants. This fingerprint allows comparison between samples to check against sample swaps. For assignment of a sample to a patient, it is a more sensitive tool than HLA typing, because in stem-cell treatments, HLA types are matched between donor & recepient, and because the fingerprints can detect contamination in samples.