NGS Mapping

Implementation of the ngs_mapping step

The ngs_mapping step allows for the alignment of NGS data with standard read mappers, such as BWA for DNA data and STAR for RNA data. Also, it provides functionality to compute post-alignment statistics, such as the coverage of target (e.g., exome or panel) regions.

There is a distinction made between “normal” DNA reads (short reads from Illumina) and “long” DNA reads, such as PacBio/Oxford Nanopore. Again, the NGS mapping step will perform alignment of all NGS libraries.

The precise actions that are performed in the alignment are defined by the wrappers (e.g., the bwa or star) wrappers. Generally, this includes converting into BAM format, sorting by coordinate, an indexing using a BAI file. For short reads, this can include marking of duplicates using Samblaster and depends on the actual configuration (see below for the default configuration).

An additional somatic meta-tool for DNA mapping data with UMIs or molecular barcodes (MBCs), and that can also optionally perform Base Quality Score Re-calibration (BQSR). Its usage & implementation is detailed below, as its design is different from the other tools.

Properties

overall stability

stable

applicable to

germline and somatic read alignment

generally applicable to

short and long read DNA and RNA sequencing

Step Input

For each library defined in all sample sheets, the instances of this step will search for the input files according to the configuration. The found read files will be linked into work/input_links/{library_name} (status quo, not a output path, thus path not guaranteed to be stable between minor versions).

This is different to the other steps that use the output of previous steps for their input.

Data Set Configuration

Consider the following data set definition from the main configuration file.

data_sets:
  first_batch:
    file: 01_first_batch.json
    search_patterns:
      # Note that currently only "left" and "right" key known
      - {'left': '*/L???/*_R1.fastq.gz', 'right': '*/L???/*_R2.fastq.gz'}
    search_paths: ['../input/01_first_batch']

Here, the data set first_batch is defined. The sample sheet file is named 01_first_batch.json and looked for in the relative path to the configuration file. The input search will be start in the (one, but could be more than one) path ../input/01_first_batch (relative to the directory containing the configuration file). The sample sheet provides a folderName extraInfo entry for each NGS library. This folder name is searched for (e.g., P001-N1-DNA1-WES). Once such a folder is found, the patterns in the values of the dict search_patterns are used for locating the paths of the actual files.

Currently, the only supported keys in the search_patterns dict are "left" and "right"" (the lattern can be omitted when only searching for single-end reads).

Consider the following example:

../input/
`-- 01_first_batch
    |-- P001-N1-DNA1-WES1
    |   `-- 42KF5AAXX
    |       `-- L001
    |           |-- P001-N1-DNA1-WES1_R1.fastq.gz
    |           |-- P001-N1-DNA1-WES1_R1.fastq.gz.md5
    |           |-- P001-N1-DNA1-WES1_R2.fastq.gz
    |           `-- P001-N1-DNA1-WES1_R2.fastq.gz.md5
    [...]

Here, the folder 01_first_batch will be searched for a directory named P001-N1-DNA1-WES. After finding, the relative paths 42KF5AAXX/L001/P001-N1-DNA1-WES1_R1.fastq.gz and 42KF5AAXX/L001/P001-N1-DNA1-WES1_R2.fastq.gz will be found and used for the left/right parts of a paired read set.

Mapping after adapter trimming

When the data is trimmed prior to mapping, the step must take its input from the output of the adapter_trimming step, rather than from the raw fastq files. When not empty, the configuration option path_link_in overrides the search paths defined in the data_sets section. In the following example, the input fastq files are taken from the output of the adapter_trimming step, after the latter has been run with the bbduk adapter trimming tool:

ngs_mapping:
  path_link_in: <absolute path to adapter_trimming/output/bbduk>
  tools:
    dna: [bwa]
  [...]

Note that only the search paths are overriden, the search patterns defined in the data_sets section are still used to find the input files. The adapter trimming tools do not rename fastq files.

Mixing Single-End and Paired-End Reads

By default, it is checked that for each search_pattern, the same number of matching files has to be found, otherwise directories are ignored. The reason is to reduce the number of possible errors when linking in files. You can change this behaviour by specifying mixed_se_pe: True in the data set information. Then, it will be allowed to have the matches for the right entry to be empty. However, you will need to consistently have either SE or PE data for each library; it is allowed to mix SE and PE libraries within one project but not to have PE and SE data for one library.

Exome enrichment kit

The configuration allows for the mapping of cohorts sequenced on different exome enrichment kits. However, each sample must be assigned to the correct kit for the target coverage report to be accurate. The configuration and the sample sheet must prepared in the following way:

The sample sheet must contain the optional column libraryKit, defined at the library level. The column is filled with the name of the kit for each sample. These names are defined by the user (choose wisely!), __default__ is a reserved name. The sample sheet would look like:

[Custom Fields]
key     annotatedEntity docs    type    minimum maximum unit    choices pattern
extractionType  bioSample       extraction type string  0       0       0       0       0
libraryKit      ngsLibrary      exome enrichment kit    string  0       0       0       0       0

[Data]
patientName     sampleName      isTumor extractionType  libraryType     folderName      libraryKit
case001 N1      N       DNA     WES     case001-N1-DNA1-WES1    V5_noUTR
case001 T1      Y       DNA     WES     case001-T1-DNA1-WES1    V5_noUTR
case002 N1      N       DNA     WES     case002-N1-DNA1-WES1    V5_UTR
case002 T1      Y       DNA     WES     case002-T1-DNA1-WES1    V5_UTR
case003 N1      N       DNA     WES     case003-N1-DNA1-WES1    __default__
[...]

The library kit name is mapped to a kit via regular expression pattern recognition: the configuration stores a list of enrichment kit classes, each with a name, a regex pattern & the path to the bed file describing the kit locii. The __default__ kit name matches the __default__ class, without requiring a pattern.

The configuration would be:

ngs_mapping:
  target_coverage_report:
    path_target_interval_list_mapping:
    - name: "Agilent SureSelect V5"
      pattern: "V5_noUTR"
      path: <absolute path to the Agilent V5 bed file>
    - name: Agilent SureSelect V5 + UTRs"
      pattern: "V5_UTR"
      path: <absolute path to the Agilent V5 + UTRs bed file>
    - name: __default__
      path: <absolute path to another exome panel bed file>

Note that it is not possible to name the Agilent V5 library kit simply “V5” in the sample sheet, because “V5” would also match “V5_UTR”.

This pattern matching mechanism is usually not required for exome enrichment kits, but it may be very convenient with panels, which can have several revisions.

Step Output

For each NGS library with name library_name and each read mapper mapper that the library has been aligned with, the pipeline step will create a directory output/{mapper}.{library_name}/out with symlinks of the following names to the resulting sorted BAM files with corresponding BAI and MD5 files.

  • {mapper}.{library_name}.bam

  • {mapper}.{library_name}.bam.bai

  • {mapper}.{library_name}.bam.md5

  • {mapper}.{library_name}.bam.bai.md5

In addition, several tools are used to automatically generate reports based on the BAM and BAI files. See the Reports section below for more details

The BAM files are only postprocessed if configured so.

Note

In contrast to other pipeline steps, the NGS mapping step will also generate the BAM files for the background data sets as there are currently problems with Snakemake sub workflows and input functions.

Global Configuration

  • static_data_config/reference/path must be set appropriately

Default Configuration

The default configuration is as follows.

step_config:
  ngs_mapping:

    # Aligners to use for the different NGS library types
    tools:                                            # REQUIRED

      # Required if DNA analysis; otherwise, leave empty.
    #  dna: []                                         # Options: 'bwa', 'bwa_mem2'
    #
      # Required if RNA analysis; otherwise, leave empty.
    #  rna: []                                         # Options: 'star'
    #
      # Required if long-read mapper used; otherwise, leave empty.
    #  dna_long: []                                    # Options: 'minimap2'
    #
    # OPTIONAL Override data set configuration search paths for FASTQ files
    #path_link_in: ''
    #
    # Thresholds for targeted sequencing coverage QC.
    #target_coverage_report:
    #  path_target_interval_list_mapping: []
    #
    # Depth of coverage collection, mainly useful for genomes.
    #bam_collect_doc:
    #  enabled: false
    #  window_length: 1000
    #
    # Compute fingerprints with ngs-chew
    #ngs_chew_fingerprint:
    #  enabled: true
    #
    # Configuration for BWA
    #bwa:
    #
    #  # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
    #  path_index:                                     # REQUIRED
    #  num_threads_align: 16
    #  num_threads_trimming: 8
    #  num_threads_bam_view: 4
    #  num_threads_bam_sort: 4
    #  memory_bam_sort: 4G
    #  trim_adapters: false
    #  mask_duplicates: true
    #
    #  # -M flag
    #  split_as_secondary: false
    #
    #  # [ "-C" ] when molecular barcodes are processed with AGeNT in the somatic mode
    #  extra_args: []
    #
    # Configuration for BWA-MEM2
    #bwa_mem2:
    #
    #  # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
    #  path_index:                                     # REQUIRED
    #  num_threads_align: 16
    #  num_threads_trimming: 8
    #  num_threads_bam_view: 4
    #  num_threads_bam_sort: 4
    #  memory_bam_sort: 4G
    #  trim_adapters: false
    #  mask_duplicates: true
    #
    #  # -M flag
    #  split_as_secondary: false
    #
    #  # [ "-C" ] when molecular barcodes are processed with AGeNT in the somatic mode
    #  extra_args: []
    #
    # Configuration for somatic ngs_calling
    # (separate read groups, molecular barcodes & base quality recalibration)
    #somatic:
    #
    #  # Either bwa of bwa_mem2. The indices & other parameters are taken from mapper config
    #  mapping_tool:                                   # REQUIRED; Options: 'bwa', 'bwa_mem2'
    #
    #  # Only agent currently implemented
    #  barcode_tool: agent                             # Options: 'agent'
    #  use_barcodes: false
    #  recalibrate: true
    #bqsr:
    #
    #  # Common germline variants (see /fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK)
    #  common_variants:                                # REQUIRED
    #agent:
    #  prepare:                                        # REQUIRED
    #    path:                                         # REQUIRED
    #
    #    # One of "halo" (HaloPlex), "hs" (HaloPlexHS), "xt" (SureSelect XT, XT2, XT HS), "v2" (SureSelect XT HS2) & "qxt" (SureSelect QXT)
    #    lib_prep_type:                                # Options: 'halo' (HALO_PLEX), 'hs' (HALO_PLEX_HS), 'xt' (SURE_SELECT), 'v2' (SURE_SELECT_HS2), 'qxt' (SURE_SELECT_QXT)
    #
    #    # Consider "-polyG 8" for NovaSeq data & "-minFractionRead 50" for 100 cycles data
    #    extra_args: []
    #  mark_duplicates:                                # REQUIRED
    #    path:                                         # REQUIRED
    #    path_baits:                                   # REQUIRED
    #
    #    # One of "SINGLE", "HYBRID", "DUPLEX"
    #    consensus_mode:                               # Options: 'SINGLE', 'HYBRID', 'DUPLEX'
    #
    #    # Consider -mm 13 (min base qual) -mr 13 (min barcode base qual) -mq 30 (min map qual)
    #    input_filter_args: []
    #    consensus_filter_args: []
    #
    #    # Consider -d 1 (max nb barcode mismatch)
    #    extra_args: []
    #
    # Configuration for STAR
    #star:
    #
    #  # Required if listed in ngs_mapping.tools.rna; otherwise, can be removed.
    #  path_index:                                     # REQUIRED
    #  num_threads_align: 16
    #  num_threads_trimming: 8
    #  num_threads_bam_view: 4
    #  num_threads_bam_sort: 4
    #  memory_bam_sort: 4G
    #  genome_load: NoSharedMemory
    #  raw_star_options: ''
    #  align_intron_max: 1000000
    #  align_intron_min: 20
    #  align_mates_gap_max: 1000000
    #  align_sjdb_overhang_min: 1
    #  align_sj_overhang_min: 8
    #  out_filter_mismatch_n_max: 999
    #  out_filter_mismatch_n_over_l_max: 0.04
    #  out_filter_multimap_n_max: 20
    #  out_filter_type: BySJout
    #
    #  # or for cufflinks: RemoveNoncanonical
    #  out_filter_intron_motifs: ''
    #
    #  # or for cufflinks: intronMotif
    #  out_sam_strand_field: ''
    #
    #  # true to output transcript coordinate bam for RSEM
    #  transcriptome: false
    #  trim_adapters: false
    #  mask_duplicates: false
    #  include_unmapped: true
    #strandedness:
    #
    #  # Location of usually highly expressed genes. Known protein coding genes is a good choice
    #  path_exon_bed:                                  # REQUIRED
    #
    #  # -1: unknown value, use infer_, 0: unstranded, 1: forward, 2: reverse (from featurecounts)
    #  strand: -1                                      # Options: -1 (UNKNOWN), 0 (INFER), 1 (FORWARD), 2 (REVERSE)
    #
    #  # Minimum proportion of reads mapped to forward/reverse direction to call the protocol
    #  threshold: 0.85
    #minimap2:
    #  mapping_threads: 16
    #mbcs:
    #  mapping_tool:                                   # REQUIRED; Options: 'bwa', 'bwa_mem2'
    #  use_barcodes:                                   # REQUIRED
    #  recalibrate:                                    # REQUIRED

Available Read Mappers

The following read mappers are available for the alignment of DNA-seq and RNA-seq reads.

  • (short/Illumina) DNA
    • "bwa"

    • "bwa_mem2"

    • "somatic"

    • "external"

  • (short/Illumina) RNA-seq
    • "star"

    • "external"

  • (long/PacBio/Nanopore) DNA
    • "minimap2"

    • "external"

Notes on STAR mapper configuration

Recent versions of STAR offer the possibility to output gene counts and alignments of reads on the transcritpome, rather than on the genome.

In both cases, this requires that STAR is aware of the genes, transcripts, exon & introns features. These can be provided either during the indexing stage, or with recent versions, during mapping.

The configuration provides the possibility to pass to STAR the location of a gtf file describing the features. This removes the need to include gene models into the generation of indices, so that the user can select the gene models (either from ENSEMBL or GENCODE, for example).

The computation of gene counts relies on the features defined in the static_data_config section of the configuration file. The other steps relying of feature annotations should use this too.

STAR outputs the counts for unstranded, forward and reverse strand protocols. When the user doesn’t supply the protocol code (0 for unstranded, 1 for forward & 2 for reverse), the step runs infer_experiment (from the rseqc library) to infer the protocol strandedness. In both cases, the step generate a copy of the STAR output containing only the relevant column included. This final version of the gene counts is found in output/star.{library_name}/out/star.{library_name}.GeneCounts.tab. Note that for infer_experiment to work efficiently, a 12-columns bed file must be provided (6-columns appears to work too), with locii of well-defined, ubiquitously expressed genes. The intervals must be non-overlapping & sorted.

If the configuration option transcriptome is set to true, the step will output a bam file of reads mapped to the transcriptome (output/stat.{library_name}/out/star.{library_name}.toTranscriptome.bam). As with gene counts, STAR will rely on the static data configuration to generate the mappings. Note that the mappings to the transcriptome will not be indexes using samtools index, because the absence of the positional mappings.

Notes on the somatic meta-tool

The somatic mapping tool should be used in presence of Molecular BarCodes (MBCs), or when Base Quality Score Re-calibration (BQSR) must be carried out. The mapping itself is done by one of the short DNA mappers (bwa or bwa-mem2). The MBCs handling is optional (via option use_barcodes), as is the BQSR (option recalibrate).

The current implementation only implements the proprietary AGeNT software for MBCs. Because it isn’t open source, and not in bioconda, it must be installed separately from the pipeline. Eventually, AGeNT should be replaced by open source software, such as UMI-tools.

Another shortcominig of the current implementation is that multiple exome kits is not supported. Only one kit per cohort is allowed.

ngs_mapping:
  tools:
    dna: [mbcs]
  somatic:
    mapping_tool: bwa_mem2
    barcode_tool: agent         # Only agent is currently implemented
    use_barcodes: true
    recalibrate: true
  bwa_mem2:
    path_index: <path_to_bwa-mem2 indices>
    [...]                       # Select bwa-mem2 options
    extra_args: ["-C"]          # Use ["-C"] when UMI/MBC are present, and processed with AGeNT, otherwise [""]
  agent:
    prepare:
      path: <path to AGeNT trimmer software>
      [...]                     # Check AGeNT documentation for the "trimmer" program
    mark_duplicates:
      path: <path to AGeNT creak software>
      [...]                     # Check AGeNT documentation for the "trimmer" program
  bqsr:
    common_variants: <path to common germline variants>  # For example small_exac_common_3.vcf from the GATK bucket

Note

This meta-tool is currently implemented by delegating execution of the different sub-steps (trimming barcodes, mapping, consensus merging, bqsr) to a Snakefile. This is different from the other tools, with shell scripts wrappers. This experimental implementation was chosen to avoid over-complex shell scripts, as the bwa & bwa-mem2 wrappers already perform multiple operations. Splitting the meta-tool into different sub-steps was also considered, but as sub-steps cannot share the same temporary directory, that solution was not efficient enough.

Finally, the choice of names for the meta-tool (somatic) and for the file prefix (mbcs) are both unfortunate, as might be changed in the future.

Reports

Currently, the following reports are generated based on the BAM and BAI file output by this step.

General Alignment Statistics (.txt)

The tools samtools bamstats, samtools flagstats and samtools idxstats are always called by default, and are linked out into the output/{mapper}.{library_name}/report/bam_qc directory. The file names for these reports (and their MD5s) use the following naming convention:

  • {mapper}.{library_name}.bamstats.txt

  • {mapper}.{library_name}.flagstats.txt

  • {mapper}.{library_name}.idxstats.txt

  • {mapper}.{library_name}.bamstats.txt.md5

  • {mapper}.{library_name}.flagstats.txt.md5

  • {mapper}.{library_name}.idxstats.txt.md5

For example, it will look as follows for the example bam files shown above:

output/
+-- bwa.P001-N1-DNA1-WES1
|   |-- out
|   |   |-- bwa.P001-N1-DNA1-WES1.bam
|   |   |-- bwa.P001-N1-DNA1-WES1.bam.bai
|   |   |-- bwa.P001-N1-DNA1-WES1.bam.bai.md5
|   |   `-- bwa.P001-N1-DNA1-WES1.bam.md5
|   `-- report
|       `-- bam_qc
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt.md5
|           |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt
|           |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt.md5
|           |-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt
|           `-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt.md5
[...]
Target Coverage Report (.alfred.json.gz)

If ngs_mapping/path_target_regions is set to a BED file with the target regions (either capture regions of capture kits in the case of targeted sequencing or exons for WES/WGS sequencing) a target coverage report is generated and linked out into the output/{mapper}.{library_name}/report/alfred_qc directory. The file names for these reports (and their MD5s) use the following naming convention:

  • {mapper}.{library_name}.alfred.json.gz

  • {mapper}.{library_name}.alfred.json.gz.md5

For example, it will look as follows for the example bam files shown above:

  output/
  +-- bwa.P001-N1-DNA1-WES1
  |   `-- report
  |       |-- bam_qc
  |       |   [...]
  |       `-- alfred_qc
  |           |-- bwa.P001-N1-DNA1-WES1.alfred.json.gz
  |           `-- bwa.P001-N1-DNA1-WES1.alfred.json.gz.md5
  [...]


(TODO: add file name rules and example)
work/
+-- bwa.P001-N1-DNA1-WES1
|   `-- report
|       `-- bam_qc
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.d
|           |   |-- acgt-cycles.gp
|           |   |-- acgt-cycles.png
|           |   |-- coverage.gp
|           |   |-- coverage.png
|           |   |-- gc-content.gp
|           |   |-- gc-content.png
|           |   |-- gc-depth.gp
|           |   |-- gc-depth.png
|           |   |-- indel-cycles.gp
|           |   |-- indel-cycles.png
|           |   |-- indel-dist.gp
|           |   |-- indel-dist.png
|           |   |-- index.html
|           |   |-- insert-size.gp
|           |   |-- insert-size.png
|           |   |-- quals2.gp
|           |   |-- quals2.png
|           |   |-- quals3.gp
|           |   |-- quals3.png
|           |   |-- quals.gp
|           |   |-- quals-hm.gp
|           |   |-- quals-hm.png
|           |   `-- quals.png
|           [...]
[...]
Fingerprinting (.npz)

When the ngs_chew_fingerprint is enabled, a fingerprint report folder is created by the ngs_chew tool, which contains a fingerprint of the sample based on carefully selected germline variants. This fingerprint allows comparison between samples to check against sample swaps. For assignment of a sample to a patient, it is a more sensitive tool than HLA typing, because in stem-cell treatments, HLA types are matched between donor & recepient, and because the fingerprints can detect contamination in samples.