NGS Mapping
Implementation of the ngs_mapping
step
The ngs_mapping step allows for the alignment of NGS data with standard read mappers, such as BWA for DNA data and STAR for RNA data. Also, it provides functionality to compute post-alignment statistics, such as the coverage of target (e.g., exome or panel) regions.
There is a distinction made between “normal” DNA reads (short reads from Illumina) and “long” DNA reads, such as PacBio/Oxford Nanopore. Again, the NGS mapping step will perform alignment of all NGS libraries.
The precise actions that are performed in the alignment are defined by the wrappers (e.g., the
bwa
or star
) wrappers. Generally, this includes converting into BAM format, sorting
by coordinate, an indexing using a BAI file. For short reads, this can include marking of
duplicates using Samblaster and depends on the actual configuration (see below for the default
configuration).
Properties
overall stability
stable
applicable to
germline and somatic read alignment
generally applicable to
short and long read DNA and RNA sequencing
Step Input
For each library defined in all sample sheets, the instances of this step will search for the input
files according to the configuration. The found read files will be linked into
work/input_links/{library_name}
(status quo, not a output path, thus path not guaranteed
to be stable between minor versions).
This is different to the other steps that use the output of previous steps for their input.
Data Set Configuration
Consider the following data set definition from the main configuration file.
data_sets:
first_batch:
file: 01_first_batch.json
search_patterns:
# Note that currently only "left" and "right" key known
- {'left': '*/L???/*_R1.fastq.gz', 'right': '*/L???/*_R2.fastq.gz'}
search_paths: ['../input/01_first_batch']
Here, the data set first_batch
is defined. The sample sheet file is named
01_first_batch.json
and looked for in the relative path to the configuration file. The input
search will be start in the (one, but could be more than one) path ../input/01_first_batch
(relative to the directory containing the configuration file). The sample sheet provides a
folderName
extraInfo
entry for each NGS library. This folder name is searched for (e.g.,
P001-N1-DNA1-WES
). Once such a folder is found, the patterns in the values of the dict
search_patterns
are used for locating the paths of the actual files.
Currently, the only supported keys in the search_patterns
dict are "left"
and "right""
(the lattern can be omitted when only searching for single-end reads).
Consider the following example:
../input/
`-- 01_first_batch
|-- P001-N1-DNA1-WES1
| `-- 42KF5AAXX
| `-- L001
| |-- P001-N1-DNA1-WES1_R1.fastq.gz
| |-- P001-N1-DNA1-WES1_R1.fastq.gz.md5
| |-- P001-N1-DNA1-WES1_R2.fastq.gz
| `-- P001-N1-DNA1-WES1_R2.fastq.gz.md5
[...]
Here, the folder 01_first_batch
will be searched for a directory named P001-N1-DNA1-WES
.
After finding, the relative paths 42KF5AAXX/L001/P001-N1-DNA1-WES1_R1.fastq.gz
and
42KF5AAXX/L001/P001-N1-DNA1-WES1_R2.fastq.gz
will be found and used for the left/right parts of
a paired read set.
Mixing Single-End and Paired-End Reads
By default, it is checked that for each search_pattern
, the same number of matching files
has to be found, otherwise directories are ignored. The reason is to reduce the number of
possible errors when linking in files. You can change this behaviour by specifying
mixed_se_pe: True
in the data set information. Then, it will be allowed to have the matches
for the right
entry to be empty. However, you will need to consistently have either SE or
PE data for each library; it is allowed to mix SE and PE libraries within one project but not
to have PE and SE data for one library.
Step Output
For each NGS library with name library_name
and each read mapper mapper
that the library
has been aligned with, the pipeline step will create a directory
output/{mapper}.{library_name}/out
with symlinks of the following names to the resulting
sorted BAM files with corresponding BAI and MD5 files.
{mapper}.{library_name}.bam
{mapper}.{library_name}.bam.bai
{mapper}.{library_name}.bam.md5
{mapper}.{library_name}.bam.bai.md5
In addition, several tools are used to automatically generate reports based on the BAM and BAI files. See the Reports section below for more details
The BAM files are only postprocessed if configured so.
Note
In contrast to other pipeline steps, the NGS mapping step will also generate the BAM files for the background data sets as there are currently problems with Snakemake sub workflows and input functions.
Global Configuration
static_data_config/reference/path
must be set appropriately
Default Configuration
The default configuration is as follows.
step_config:
ngs_mapping:
# Aligners to use for the different NGS library types
tools:
dna: [] # Required if DNA analysis; otherwise, leave empty. Example: 'bwa'.
rna: [] # Required if RNA analysis; otherwise, leave empty. Example: 'star'.
dna_long: [] # Required if long-read mapper used; otherwise, leave empty. Example: 'minimap2'.
path_link_in: "" # OPTIONAL Override data set configuration search paths for FASTQ files
# Thresholds for targeted sequencing coverage QC. Enabled by specifying
# the path_arget_regions setting above
target_coverage_report:
# Mapping from enrichment kit to target region BED file, for either computing per--target
# region coverage or selecting targeted exons.
#
# The following will match both the stock IDT library kit and the ones
# with spike-ins seen fromr Yale genomics. The path above would be
# mapped to the name "default".
# - name: IDT_xGen_V1_0
# pattern: "xGen Exome Research Panel V1\\.0*"
# path: "path/to/targets.bed"
path_target_interval_list_mapping: []
# Maximal/minimal/warning coverage
max_coverage: 200
min_cov_warning: 20 # >= 20x for WARNING
min_cov_ok: 50 # >= 50x for OK
detailed_reporting: false # per-exon details (cannot go into multiqc)
# Depth of coverage collection, mainly useful for genomes.
bam_collect_doc:
enabled: false
window_length: 1000
# Compute fingerprints with ngs-chew
ngs_chew_fingerprint:
enabled: true
# Configuration for BWA
bwa:
path_index: REQUIRED # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
num_threads_align: 16
num_threads_trimming: 8
num_threads_bam_view: 4
num_threads_bam_sort: 4
memory_bam_sort: 4G
trim_adapters: false
mask_duplicates: true
split_as_secondary: false # -M flag
# Configuration for BWA-MEM2
bwa_mem2:
path_index: REQUIRED # Required if listed in ngs_mapping.tools.dna; otherwise, can be removed.
bwa_mode: auto # in ['auto', 'bwa-aln', 'bwa-mem']
num_threads_align: 16
num_threads_trimming: 8
num_threads_bam_view: 4
num_threads_bam_sort: 4
memory_bam_sort: 4G
trim_adapters: false
mask_duplicates: true
split_as_secondary: true # -M flag
# Configuration for STAR
star:
path_index: REQUIRED # Required if listed in ngs_mapping.tools.rna; otherwise, can be removed.
path_features: "" # Required for computing gene counts
num_threads_align: 16
num_threads_trimming: 8
num_threads_bam_view: 4
num_threads_bam_sort: 4
memory_bam_sort: 4G
genome_load: NoSharedMemory
raw_star_options: ''
align_intron_max: 1000000 # ENCODE option
align_intron_min: 20 # ENCODE option
align_mates_gap_max: 1000000 # ENCODE option
align_sjdb_overhang_min: 1 # ENCODE option
align_sj_overhang_min: 8 # ENCODE option
out_filter_mismatch_n_max: 999 # ENCODE option
out_filter_mismatch_n_over_l_max: 0.04 # ENCODE option
out_filter_multimap_n_max: 20 # ENCODE option
out_filter_type: BySJout # ENCODE option
out_filter_intron_motifs: None # or for cufflinks: RemoveNoncanonical
out_sam_strand_field: None # or for cufflinks: intronMotif
transcriptome: false # true to output transcript coordinate bam for RSEM
trim_adapters: false
mask_duplicates: false
include_unmapped: true
strandedness:
path_exon_bed: REQUIRED # Location of usually highly expressed genes. Known protein coding genes is a good choice
strand: -1 # -1: unknown value, use infer_, 0: unstranded, 1: forward, 2: reverse (from featurecounts)
threshold: 0.85 # Minimum proportion of reads mapped to forward/reverse direction to call the protocol
# Configuration for Minimap2
minimap2:
mapping_threads: 16
Available Read Mappers
The following read mappers are available for the alignment of DNA-seq and RNA-seq reads.
- (short/Illumina) DNA
"bwa"
"bwa_mem2"
"external"
- (short/Illumina) RNA-seq
"star"
"external"
- (long/PacBio/Nanopore) DNA
"minimap2"
"external"
Notes on STAR mapper configuration
Recent versions of STAR offer the possibility to output gene counts and alignments of reads on the transcritpome, rather than on the genome.
In both cases, this requires that STAR is aware of the genes, transcripts, exon & introns features. These can be provided either during the indexing stage, or with recent versions, during mapping.
The configuration provides the possibility to pass to STAR the location of a gtf file describing the features. This removes the need to include gene models into the generation of indices, so that the user can select the gene models (either from ENSEMBL or GENCODE, for example).
When the configuration option path_features is set, the step will output a table of expression counts for all genes, in output/star.{library_name}/out/star.{library_name}.GeneCounts.tab.
If the configuration option transcriptome is set to true, the step will output a bam file of reads mapped to the transcriptome (output/stat.{library_name}/out/star.{library_name}.toTranscriptome.bam). STAR will rely on the path_features configuration option, or on the gene models embedded in the indices to generate the mappings. If both are absent, the step will fail. Note that the mappings to the transcriptome will not be indexes using samtools index, because the absence of the positional mappings.
Reports
Currently, the following reports are generated based on the BAM and BAI file output by this step.
- General Alignment Statistics (.txt)
The tools
samtools bamstats
,samtools flagstats
andsamtools idxstats
are always called by default, and are linked out into theoutput/{mapper}.{library_name}/report/bam_qc
directory. The file names for these reports (and their MD5s) use the following naming convention:{mapper}.{library_name}.bamstats.txt
{mapper}.{library_name}.flagstats.txt
{mapper}.{library_name}.idxstats.txt
{mapper}.{library_name}.bamstats.txt.md5
{mapper}.{library_name}.flagstats.txt.md5
{mapper}.{library_name}.idxstats.txt.md5
For example, it will look as follows for the example bam files shown above:
output/
+-- bwa.P001-N1-DNA1-WES1
| |-- out
| | |-- bwa.P001-N1-DNA1-WES1.bam
| | |-- bwa.P001-N1-DNA1-WES1.bam.bai
| | |-- bwa.P001-N1-DNA1-WES1.bam.bai.md5
| | `-- bwa.P001-N1-DNA1-WES1.bam.md5
| `-- report
| `-- bam_qc
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.txt.md5
| |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt
| |-- bwa.P001-N1-DNA1-WES1.bam.flagstats.txt.md5
| |-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt
| `-- bwa.P001-N1-DNA1-WES1.bam.idxstats.txt.md5
[...]
- Target Coverage Report (.txt)
If
ngs_mapping/path_target_regions
is set to a BED file with the target regions (either capture regions of capture kits in the case of targeted sequencing or exons for WES/WGS sequencing) a target coverage report is generated and linked out into theoutput/{mapper}.{library_name}/report/cov_qc
directory. The file names for these reports (and their MD5s) use the following naming convention:{mapper}.{library_name}.txt
{mapper}.{library_name}.txt.md5
For example, it will look as follows for the example bam files shown above:
output/
+-- bwa.P001-N1-DNA1-WES1
| `-- report
| |-- bam_qc
| | [...]
| `-- cov_qc
| |-- bwa.P001-N1-DNA1-WES1.txt
| `-- bwa.P001-N1-DNA1-WES1.txt.md5
[...]
- Genome-wide Coverage Count (.bed.gz)
If
ngs_mapping/compute_coverage_bed
to be set totrue
a report is generated that gives the depth at each base of the genome. (note: currently this report only appears inwork/
and is not yet linked out into theoutput/
directory).(TODO: add file name rules and example)
work/
+-- bwa.P001-N1-DNA1-WES1
| `-- report
| `-- bam_qc
| |-- bwa.P001-N1-DNA1-WES1.bam.bamstats.d
| | |-- acgt-cycles.gp
| | |-- acgt-cycles.png
| | |-- coverage.gp
| | |-- coverage.png
| | |-- gc-content.gp
| | |-- gc-content.png
| | |-- gc-depth.gp
| | |-- gc-depth.png
| | |-- indel-cycles.gp
| | |-- indel-cycles.png
| | |-- indel-dist.gp
| | |-- indel-dist.png
| | |-- index.html
| | |-- insert-size.gp
| | |-- insert-size.png
| | |-- quals2.gp
| | |-- quals2.png
| | |-- quals3.gp
| | |-- quals3.png
| | |-- quals.gp
| | |-- quals-hm.gp
| | |-- quals-hm.png
| | `-- quals.png
| [...]
[...]