NGS Mapping

Implementation of the ngs_mapping step

The ngs_mapping step allows for the alignment of NGS data with standard read mappers, such as BWA for DNA data and STAR for RNA data. Also, it provides functionality to compute post-alignment statistics, such as the coverage of target (e.g., exome or panel) regions.

There is a distinction made between “normal” DNA reads (short reads from Illumina) and “long” DNA reads, such as PacBio/Oxford Nanopore. Again, the NGS mapping step will perform alignment of all NGS libraries.

The precise actions that are performed in the alignment are defined by the wrappers (e.g., the bwa or star) wrappers. Generally, this includes converting into BAM format, sorting by coordinate, an indexing using a BAI file. For short reads, this can include marking of duplicates using Samblaster and depends on the actual configuration (see below for the default configuration).

Properties

overall stability

stable

applicable to

germline and somatic read alignment

generally applicable to

short and long read DNA and RNA sequencing

Step Input

For each library defined in all sample sheets, the instances of this step will search for the input files according to the configuration. The found read files will be linked into work/input_links/{library_name} (status quo, not a output path, thus path not guaranteed to be stable between minor versions).

This is different to the other steps that use the output of previous steps for their input.

Data Set Configuration

Consider the following data set definition from the main configuration file.

data_sets:
  first_batch:
    file: 01_first_batch.json
    search_patterns:
      # Note that currently only "left" and "right" key known
      - {'left': '*/L???/*_R1.fastq.gz', 'right': '*/L???/*_R2.fastq.gz'}
    search_paths: ['../input/01_first_batch']

Here, the data set first_batch is defined. The sample sheet file is named 01_first_batch.json and looked for in the relative path to the configuration file. The input search will be start in the (one, but could be more than one) path ../input/01_first_batch (relative to the directory containing the configuration file). The sample sheet provides a folderName extraInfo entry for each NGS library. This folder name is searched for (e.g., P001-N1-DNA1-WES). Once such a folder is found, the patterns in the values of the dict search_patterns are used for locating the paths of the actual files.

Currently, the only supported keys in the search_patterns dict are "left" and "right"" (the lattern can be omitted when only searching for single-end reads).

Consider the following example:

../input/
`-- 01_first_batch
    |-- P001-N1-DNA1-WES1
    |   `-- 42KF5AAXX
    |       `-- L001
    |           |-- P001-N1-DNA1-WES1_R1.fastq.gz
    |           |-- P001-N1-DNA1-WES1_R1.fastq.gz.md5
    |           |-- P001-N1-DNA1-WES1_R2.fastq.gz
    |           `-- P001-N1-DNA1-WES1_R2.fastq.gz.md5
    [...]

Here, the folder 01_first_batch will be searched for a directory named P001-N1-DNA1-WES. After finding, the relative paths 42KF5AAXX/L001/P001-N1-DNA1-WES1_R1.fastq.gz and 42KF5AAXX/L001/P001-N1-DNA1-WES1_R2.fastq.gz will be found and used for the left/right parts of a paired read set.

Mixing Single-End and Paired-End Reads

By default, it is checked that for each search_pattern, the same number of matching files has to be found, otherwise directories are ignored. The reason is to reduce the number of possible errors when linking in files. You can change this behaviour by specifying mixed_se_pe: True in the data set information. Then, it will be allowed to have the matches for the right entry to be empty. However, you will need to consistently have either SE or PE data for each library; it is allowed to mix SE and PE libraries within one project but not to have PE and SE data for one library.

Step Output

For each NGS library with name library_name and each read mapper mapper that the library has been aligned with, the pipeline step will create a directory output/{mapper}.{library_name}/out with symlinks of the following names to the resulting sorted BAM files with corresponding BAI and MD5 files.

{mapper}.{library_name}.bam
{mapper}.{library_name}.bam.bai
{mapper}.{library_name}.bam.md5
{mapper}.{library_name}.bam.bai.md5

In addition, several tools are used to automatically generate reports based on the BAM and BAI files. See the Reports section below for more details

The BAM files are only postprocessed if configured so.

Note

In contrast to other pipeline steps, the NGS mapping step will also generate the BAM files for the background data sets as there are currently problems with Snakemake sub workflows and input functions.

Global Configuration

static_data_config/reference/path must be set appropriately

Default Configuration

The default configuration is as follows.

step_config:
  ngs_mapping:
    # Aligners to use for the different NGS library types
    tools:
      dna: []      # Required if DNA analysis; otherwise, leave empty. Example: 'bwa'.
      rna: []      # Required if RNA analysis; otherwise, leave empty. Example: 'star'.
      dna_long: [] # Required if long-read mapper used; otherwise, leave empty. Example: 'minimap2'.
    path_link_in: ""   # OPTIONAL Override data set configuration search paths for FASTQ files
    # Thresholds for targeted sequencing coverage QC.  Enabled by specifying
    # the path_arget_regions setting above
    target_coverage_report:
      # Mapping from enrichment kit to target region BED file, for either computing per--target
      # region coverage or selecting targeted exons.
      #
      # The following will match both the stock IDT library kit and the ones
      # with spike-ins seen fromr Yale genomics.  The path above would be
      # mapped to the name "default".
      # - name: IDT_xGen_V1_0
      #   pattern: "xGen Exome Research Panel V1\\.0*"
      #   path: "path/to/targets.bed"
      path_target_interval_list_mapping: []
      # Maximal/minimal/warning coverage
      max_coverage: 200
      min_cov_warning: 20  # >= 20x for WARNING
      min_cov_ok: 50  # >= 50x for OK
      detailed_reporting: false  # per-exon details (cannot go into multiqc)
    # Depth of coverage collection, mainly useful for genomes.
    bam_collect_doc:
      enabled: false
      window_length: 1000
    # Compute fingerprints with ngs-chew
    ngs_chew_fingerprint:
      enabled: true
    # Configuration for BWA
    bwa:
      path_index: REQUIRED # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
      num_threads_align: 16
      num_threads_trimming: 8
      num_threads_bam_view: 4
      num_threads_bam_sort: 4
      memory_bam_sort: 4G
      trim_adapters: false
      mask_duplicates: true
      split_as_secondary: false  # -M flag
    # Configuration for BWA-MEM2
    bwa_mem2:
      path_index: REQUIRED # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
      bwa_mode: auto  # in ['auto', 'bwa-aln', 'bwa-mem']
      num_threads_align: 16
      num_threads_trimming: 8
      num_threads_bam_view: 4
      num_threads_bam_sort: 4
      memory_bam_sort: 4G
      trim_adapters: false
      mask_duplicates: true
      split_as_secondary: true  # -M flag
    # Configuration for STAR
    star:
      path_index: REQUIRED # Required if listed in ngs_mapping.tools.rna; otherwise, can be removed.
      path_features: ""    # Required for computing gene counts
      num_threads_align: 16
      num_threads_trimming: 8
      num_threads_bam_view: 4
      num_threads_bam_sort: 4
      memory_bam_sort: 4G
      genome_load: NoSharedMemory
      raw_star_options: ''
      align_intron_max: 1000000                # ENCODE option
      align_intron_min: 20                     # ENCODE option
      align_mates_gap_max: 1000000             # ENCODE option
      align_sjdb_overhang_min: 1               # ENCODE option
      align_sj_overhang_min: 8                 # ENCODE option
      out_filter_mismatch_n_max: 999           # ENCODE option
      out_filter_mismatch_n_over_l_max: 0.04   # ENCODE option
      out_filter_multimap_n_max: 20            # ENCODE option
      out_filter_type: BySJout                 # ENCODE option
      out_filter_intron_motifs: None    # or for cufflinks: RemoveNoncanonical
      out_sam_strand_field: None        # or for cufflinks: intronMotif
      transcriptome: false              # true to output transcript coordinate bam for RSEM
      trim_adapters: false
      mask_duplicates: false
      include_unmapped: true
    strandedness:
      path_exon_bed: REQUIRED   # Location of usually highly expressed genes. Known protein coding genes is a good choice
      strand: -1                # -1: unknown value, use infer_, 0: unstranded, 1: forward, 2: reverse (from featurecounts)
      threshold: 0.85           # Minimum proportion of reads mapped to forward/reverse direction to call the protocol
    # Configuration for Minimap2
    minimap2:
      mapping_threads: 16

Available Read Mappers

The following read mappers are available for the alignment of DNA-seq and RNA-seq reads.

(short/Illumina) DNA
- "bwa"
- "bwa_mem2"
- "external"
(short/Illumina) RNA-seq
- "star"
- "external"
(long/PacBio/Nanopore) DNA
- "minimap2"
- "external"

Notes on STAR mapper configuration

Recent versions of STAR offer the possibility to output gene counts and alignments of reads on the transcritpome, rather than on the genome.

In both cases, this requires that STAR is aware of the genes, transcripts, exon & introns features. These can be provided either during the indexing stage, or with recent versions, during mapping.

The configuration provides the possibility to pass to STAR the location of a gtf file describing the features. This removes the need to include gene models into the generation of indices, so that the user can select the gene models (either from ENSEMBL or GENCODE, for example).

When the configuration option path_features is set, the step will output a table of expression counts for all genes, in output/star.{library_name}/out/star.{library_name}.GeneCounts.tab.

If the configuration option transcriptome is set to true, the step will output a bam file of reads mapped to the transcriptome (output/stat.{library_name}/out/star.{library_name}.toTranscriptome.bam). STAR will rely on the path_features configuration option, or on the gene models embedded in the indices to generate the mappings. If both are absent, the step will fail. Note that the mappings to the transcriptome will not be indexes using samtools index, because the absence of the positional mappings.

Reports

Currently, the following reports are generated based on the BAM and BAI file output by this step.

General Alignment Statistics (.txt)

The tools samtools bamstats, samtools flagstats and samtools idxstats are always called by default, and are linked out into the output/{mapper}.{library_name}/report/bam_qc directory. The file names for these reports (and their MD5s) use the following naming convention:

{mapper}.{library_name}.bamstats.txt
{mapper}.{library_name}.flagstats.txt
{mapper}.{library_name}.idxstats.txt
{mapper}.{library_name}.bamstats.txt.md5
{mapper}.{library_name}.flagstats.txt.md5
{mapper}.{library_name}.idxstats.txt.md5

For example, it will look as follows for the example bam files shown above:

output/
+-- bwa.P001-N1-DNA1-WES1
|   |-- out
|   |   |-- bwa.P001-N1-DNA1-WES1.bam
|   |   |-- bwa.P001-N1-DNA1-WES1.bam.bai
|   |   |-- bwa.P001-N1-DNA1-WES1.bam.bai.md5
|   |   `-- bwa.P001-N1-DNA1-WES1.bam.md5
|   `-- report
|       `-- bam_qc
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt.md5
|           |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt
|           |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt.md5
|           |-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt
|           `-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt.md5
[...]

Target Coverage Report (.txt)

If ngs_mapping/path_target_regions is set to a BED file with the target regions (either capture regions of capture kits in the case of targeted sequencing or exons for WES/WGS sequencing) a target coverage report is generated and linked out into the output/{mapper}.{library_name}/report/cov_qc directory. The file names for these reports (and their MD5s) use the following naming convention:

{mapper}.{library_name}.txt
{mapper}.{library_name}.txt.md5

For example, it will look as follows for the example bam files shown above:

output/
+-- bwa.P001-N1-DNA1-WES1
|   `-- report
|       |-- bam_qc
|       |   [...]
|       `-- cov_qc
|           |-- bwa.P001-N1-DNA1-WES1.txt
|           `-- bwa.P001-N1-DNA1-WES1.txt.md5
[...]

Genome-wide Coverage Count (.bed.gz)

If ngs_mapping/compute_coverage_bed to be set to true a report is generated that gives the depth at each base of the genome. (note: currently this report only appears in work/ and is not yet linked out into the output/ directory).

(TODO: add file name rules and example)

work/
+-- bwa.P001-N1-DNA1-WES1
|   `-- report
|       `-- bam_qc
|           |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.d
|           |   |-- acgt-cycles.gp
|           |   |-- acgt-cycles.png
|           |   |-- coverage.gp
|           |   |-- coverage.png
|           |   |-- gc-content.gp
|           |   |-- gc-content.png
|           |   |-- gc-depth.gp
|           |   |-- gc-depth.png
|           |   |-- indel-cycles.gp
|           |   |-- indel-cycles.png
|           |   |-- indel-dist.gp
|           |   |-- indel-dist.png
|           |   |-- index.html
|           |   |-- insert-size.gp
|           |   |-- insert-size.png
|           |   |-- quals2.gp
|           |   |-- quals2.png
|           |   |-- quals3.gp
|           |   |-- quals3.png
|           |   |-- quals.gp
|           |   |-- quals-hm.gp
|           |   |-- quals-hm.png
|           |   `-- quals.png
|           [...]
[...]