Somatic Variant Filtration
Implementation of the somatic_variant_filtration
step
The current implementation supports two filtration schemas:
the legacy schema, now deprecated, which always runs the DKFZBiasFilter & EBFilter and produces files for all combinations of available filters;
the new schema, which focuses on flexibility, allows any combination of filters, and returns a single filtered file for each sample.
The new schema is used when the configuration option filter_list
is not empty.
This document describes only the new schema.
Step Input
The step requires vcf
files from either the somatic_variant_calling
or somatic_variant_annotation
steps.
In the former case, the configuration option has_annotation
must be set to False
.
In both cases, it will use the regular output vcf
file, not *.full.vcf.gz
.
Step Output
For each tumor DNA NGS library with name lib_name
and each read mapper
mapper
that the library has been aligned with, and the variant caller var_caller
, the
pipeline step will create a directory output/{mapper}.{var_caller}.{annotator}.{lib_name}/out
with symlinks of the following names to the resulting VCF, TBI, and MD5 files.
Two vcf
files are produced:
{mapper}.{var_caller}.{annotator}.{lib_name}.vcf.gz
which contains only the variants that have passed all filters, or that were protected, and
{mapper}.{var_caller}.{annotator}.{lib_name}.full.vcf.gz
which contains all variants, with the reason for rejection in the
FILTER
column.
When the somatic_variant_annotation
step has been omitted, and the filtration is done directly from the output of the somatic_variant_calling
step,
then the output files are stored in the output/{mapper}.{var_caller}.{lib_name}/out
directory, under the names {mapper}.{var_caller}.{lib_name}.vcf.gz
and
{mapper}.{var_caller}.{lib_name}.full.vcf.gz
.
For example, the output directory might look as follows:
output/
+-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1
|   `-- out
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.vcf.gz
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.vcf.gz.tbi
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.vcf.gz.md5
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.vcf.gz.tbi.md5
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.full.vcf.gz
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.full.vcf.gz.tbi
|       |-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.full.vcf.gz.md5
|       `-- bwa.mutect2.vep.filtered.P001-T1-DNA1-WES1.full.vcf.gz.tbi.md5
[...]
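The naming templates described above can be sketched as a small helper. This is only an illustration of the templates as written in the text: the function name output_paths is hypothetical and not part of the pipeline, and the helper does not try to reproduce any additional name component that may appear in an actual output listing.

```python
def output_paths(mapper, var_caller, lib_name, annotator=None):
    """Build the expected output paths from the documented name templates.

    Hypothetical helper for illustration only; it is not pipeline code.
    When annotator is None, the names follow the template used when the
    somatic_variant_annotation step has been omitted.
    """
    parts = [mapper, var_caller]
    if annotator is not None:
        parts.append(annotator)
    parts.append(lib_name)
    name = ".".join(parts)
    out_dir = f"output/{name}/out"

    paths = []
    # One filtered VCF and one "full" VCF, each with index and checksums.
    for vcf_ext in (".vcf.gz", ".full.vcf.gz"):
        for suffix in ("", ".tbi", ".md5", ".tbi.md5"):
            paths.append(f"{out_dir}/{name}{vcf_ext}{suffix}")
    return paths


if __name__ == "__main__":
    for path in output_paths("bwa", "mutect2", "P001-T1-DNA1-WES1", annotator="vep"):
        print(path)
```

Calling the helper without annotator gives the shorter names used when filtration is run directly on the output of the somatic_variant_calling step.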
Default Configuration
The default configuration is as follows.
step_config:
  somatic_variant_filtration:
    #path_somatic_variant: ../somatic_variant # Examples: ../somatic_variant_annotation, ../somatic_variant_calling
    #
    # Needed for dkfz & ebfilter
    #path_ngs_mapping: ../ngs_mapping
    #
    # Default: use those defined in ngs_mapping step
    #tools_ngs_mapping:
    #
    # Default: use those defined in somatic_variant_calling step
    #tools_somatic_variant_calling:
    #
    # Default: use those defined in somatic_variant_annotation step
    #tools_somatic_variant_annotation:
    #has_annotation: true
    #filtration_schema: list # Options: 'list', 'sets'
    #filter_sets:
    #  no_filter:
    #  dkfz_only:
    #  dkfz_and_ebfilter:
    #    ebfilter_threshold: 2.4
    #  dkfz_and_ebfilter_and_oxog:
    #    vaf_threshold: 0.08
    #    coverage_threshold: 5.0
    #  dkfz_and_oxog:
    #    vaf_threshold: 0.08
    #    coverage_threshold: 5.0
    #exon_lists: {}
    #eb_filter:
    #  shuffle_seed: 1
    #  panel_of_normals_size: 25
    #  min_mapq: 20.0
    #  min_baseq: 15.0
    #
    # Available filters
    #  dkfz: {} # Not parametrisable
    #  ebfilter:
    #    ebfilter_threshold: 2.4
    #    shuffle_seed: 1
    #    panel_of_normals_size: 25
    #    min_mapq: 20
    #    min_baseq: 15
    #  bcftools:
    #    include: "" # Expression to be used in bcftools view --include
    #    exclude: "" # Expression to be used in bcftools view --exclude
    #  regions:
    #    path_bed: REQUIRED # Bed file of regions to be considered (variants outside are filtered out)
    #  protected:
    #    path_bed: REQUIRED # Bed file of regions that should not be filtered out at all.
    #filter_list: []
Filters
The following filters are implemented:
dkfz
: uses orientation biases to remove sequencing & PCR artifacts. The current implementation doesn’t allow any parametrisation of this filter. This filter will add bSeq or bPcr to the FILTER column of rejected variants.
ebfilter
: Bayesian statistical model to score variants. Variants with a score lower than ebfilter_threshold are rejected. The scoring algorithm can be parameterised from the configuration. This filter will add ebfilter_<n> to the FILTER column of rejected variants.
bcftools
: flexible filter based on bcftools expressions. The expression can be designed to include or exclude variants. This filter will add bcftools_<n> to the FILTER column of rejected variants.
regions
: filter to exclude variants outside of user-defined regions. Typically used to reject variants outside of coding regions. This filter will add regions_<n> to the FILTER column of rejected variants.
protected
: anti-filter that prevents variants in protected regions from being filtered out by the other filters. This filter “whitelists” variants in specific regions. This is valuable to protect known drivers against being filtered out, even if there is little experimental support for them. This filter will add PROTECTED to the FILTER column of the variants it protects.
In the above description, <n>
is the sequence number of the filter in the filter list.
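As the rejection reasons are recorded in the FILTER column of the full VCF file, they can be tallied to see which filter removes most variants. The following is a minimal sketch, not part of the pipeline: the function name count_filter_labels is hypothetical, the sequence numbers in the example labels (ebfilter_2, bcftools_3) are made up for illustration, and the parsing only assumes the standard VCF layout (tab-separated columns, FILTER in the seventh column, labels separated by semicolons).

```python
from collections import Counter


def count_filter_labels(vcf_lines):
    """Count FILTER labels over an iterable of VCF text lines.

    Hypothetical helper for illustration only. Header lines (starting
    with '#') and blank lines are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column; several labels are joined by ';'.
        for label in fields[6].split(";"):
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Made-up records; the label numbers are illustrative assumptions.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tC\tT\t.\tPASS\t.",
        "1\t200\t.\tG\tA\t.\tbSeq;ebfilter_2\t.",
        "2\t300\t.\tA\tG\t.\tPROTECTED\t.",
        "2\t400\t.\tC\tT\t.\tbcftools_3\t.",
    ]
    for label, n in sorted(count_filter_labels(example).items()):
        print(label, n)
    # For a real file, the lines could be read with:
    #   import gzip
    #   with gzip.open(path_to_full_vcf, "rt") as handle:
    #       counts = count_filter_labels(handle)
```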
Each filter is optional, and a filter can be used several times. For example, the
bcftools
filter can be used a second time, with a different expression, to specifically reject potential FFPE artifacts. The filter list would then be:
filter_list:
  - dkfz: {}
  - ebfilter:
      ebfilter_threshold: 2.4
  - bcftools:
      exclude: "AD[1:0]+AD[1:1]<50 | AD[1:1]<5 | AD[1:1]/(AD[1:0]+AD[1:1])<0.05"
  - bcftools:
      exclude: "((REF='C' & ALT='T') | (REF='G' & ALT='A')) & AD[1:1]/(AD[1:0]+AD[1:1])<0.10"
  - protected:
      path_bed: hotspots_locii.bed
This list of filters would apply the DKFZBiasFilter and the EBFilter, and reject all variants with a depth lower than 50, with fewer than 5 reads supporting the alternative allele, or with a variant allele fraction (VAF) below 5%. It would also reject all C-to-T and G-to-A variants with a VAF lower than 10%, because they might be FFPE artifacts. All variants overlapping hotspot loci would be protected against filtration.
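To make the thresholds concrete, the arithmetic of the two bcftools expressions can be mirrored in plain Python. This sketch is not how the pipeline evaluates the expressions (bcftools does that); the function names are hypothetical, ref_reads and alt_reads stand for AD[1:0] and AD[1:1], and it is assumed that the sample at index 1 in the VCF is the tumor sample. How bcftools treats a zero depth in the division is not modelled here.

```python
def rejected_by_support(ref_reads, alt_reads):
    """Mirror of: AD[1:0]+AD[1:1]<50 | AD[1:1]<5 | AD[1:1]/(AD[1:0]+AD[1:1])<0.05"""
    depth = ref_reads + alt_reads
    # 'or' short-circuits, so the division is never reached when depth is 0.
    return depth < 50 or alt_reads < 5 or alt_reads / depth < 0.05


def rejected_as_ffpe(ref, alt, ref_reads, alt_reads):
    """Mirror of: ((REF='C' & ALT='T') | (REF='G' & ALT='A')) & VAF<0.10"""
    depth = ref_reads + alt_reads
    is_ffpe_change = (ref, alt) in {("C", "T"), ("G", "A")}
    # Assumption of this sketch: a variant with zero depth is not flagged here.
    return is_ffpe_change and depth > 0 and alt_reads / depth < 0.10


if __name__ == "__main__":
    # Depth 100, 4 alternative reads: rejected (fewer than 5 supporting reads).
    print(rejected_by_support(96, 4))
    # Depth 100, 8 alternative reads: kept by the first expression (VAF 8%) ...
    print(rejected_by_support(92, 8))
    # ... but rejected by the second one if it is a C-to-T change (VAF below 10%).
    print(rejected_as_ffpe("C", "T", 92, 8))
    # The same read counts for an A-to-G change are not affected.
    print(rejected_as_ffpe("A", "G", 92, 8))
```

A variant rejected by either expression would still be kept in the filtered file if it overlaps a region listed in the protected BED file.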
Note that the parallelisation of ebfilter
has been removed, even though this operation can be slow when there are many variants (from WGS data for example).