Somatic Variant Filtration
Implementation of the somatic_variant_filtration step
Default Configuration
The default configuration is as follows.
# Default configuration variant_annotation
step_config:
  somatic_variant_filtration:
    path_somatic_variant_annotation: ../somatic_variant_annotation
    path_ngs_mapping: ../ngs_mapping
    tools_ngs_mapping: null
    tools_somatic_variant_calling: null
    filter_sets:
      # no_filter: no_filters    # implicit, always defined
      dkfz_only: ''  # empty
      dkfz_and_ebfilter:
        ebfilter_threshold: 2.4
      dkfz_and_ebfilter_and_oxog:
        vaf_threshold: 0.08
        coverage_threshold: 5
      dkfz_and_oxog:
        vaf_threshold: 0.08
        coverage_threshold: 5
    exon_lists: {}
    # genome_wide: null         # implicit, always defined
    # ensembl74: path/to/ensembl47.bed
    ignore_chroms:            # patterns of chromosome names to ignore
    - NC_007605    # herpes virus
    - hs37d5       # GRCh37 decoy
    - chrEBV       # Epstein-Barr Virus
    - '*_decoy'    # decoy contig
    - 'HLA-*'      # HLA genes
    - 'GL000220.*' # Contig with problematic, repetitive DNA in GRCh37
    eb_filter:
      shuffle_seed: 1
      panel_of_normals_size: 25
      min_mapq: 20
      min_baseq: 15
      # Parallelization configuration
      window_length: 10000000   # split input into windows of this size, each triggers a job
      num_jobs: 500             # number of windows to process in parallel
      use_profile: true         # use Snakemake profile for parallel processing
      restart_times: 5          # number of times to re-launch jobs in case of failure
      max_jobs_per_second: 2    # throttling of job creation
      max_status_checks_per_second: 10   # throttling of status checks
      debug_trunc_tokens: 0     # truncation to first N tokens (0 for none)
      keep_tmpdir: never        # keep temporary directory, {always, never, onerror}
      job_mult_memory: 1        # memory multiplier
      job_mult_time: 1          # running time multiplier
      merge_mult_memory: 1      # memory multiplier for merging
      merge_mult_time: 1        # running time multiplier for merging
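To illustrate how window_length and ignore_chroms from the default configuration above could interact, here is a minimal sketch. It is not the pipeline's actual code. It assumes that the ignore_chroms patterns are shell-style globs (matched here with Python's fnmatch module) and that contig lengths are available as a plain dictionary. The function names is_ignored and iter_windows and the toy contig lengths are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the default configuration.
IGNORE_CHROMS = ["NC_007605", "hs37d5", "chrEBV", "*_decoy", "HLA-*", "GL000220.*"]


def is_ignored(contig, patterns=IGNORE_CHROMS):
    """Return True if the contig name matches any ignore pattern.

    Assumption: patterns are shell-style globs; the pipeline may match differently.
    """
    return any(fnmatchcase(contig, pattern) for pattern in patterns)


def iter_windows(contig_lengths, window_length=10_000_000, patterns=IGNORE_CHROMS):
    """Yield (contig, start, end) windows (0-based, half-open) for kept contigs."""
    for contig, length in contig_lengths.items():
        if is_ignored(contig, patterns):
            continue  # ignored contigs produce no windows and hence no jobs
        for start in range(0, length, window_length):
            # The last window of a contig may be shorter than window_length.
            yield contig, start, min(start + window_length, length)


# Toy contig lengths (hypothetical values, for illustration only).
contigs = {"1": 25_000_000, "hs37d5": 35_000_000, "GL000220.1": 160_000}
windows = list(iter_windows(contigs))
```

With these toy lengths, contig 1 is split into three windows, while hs37d5 and GL000220.1 are skipped because they match an ignore pattern.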
Important
Because the EB Filter step is so time-consuming, the data going into it can be heavily prefiltered (e.g., using Jannovar with the offExome flag).
TODO: document filter, for now see the eb_filter wrapper!
Concept
All variants are annotated with the dkfz-bias-filter so that sequencing and PCR artifacts can be removed. The set of variants annotated with EBFilter is variable: only variants that have the PASS flag set are annotated, because we assume that only those will be kept.
We borrowed the general workflow from variant_filtration, i.e. working with pre-defined filter sets and exon/region lists.
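The combination of filter sets and exon/region lists can be pictured as a simple cross product. The sketch below is only an illustration under stated assumptions: it takes the implicit entries no_filter and genome_wide from the comments in the default configuration, and the helper name enumerate_combinations, the exon list my_exons, and the tuple layout are hypothetical rather than part of the pipeline's API.

```python
from itertools import product


def enumerate_combinations(config):
    """Return all (filter_set, exon_list) name pairs, including implicit ones."""
    # Implicit entries, per the comments in the default configuration.
    filter_sets = ["no_filter"] + list(config.get("filter_sets") or {})
    exon_lists = ["genome_wide"] + list(config.get("exon_lists") or {})
    return list(product(filter_sets, exon_lists))


# Reduced version of the default configuration, with one hypothetical exon list.
config = {
    "filter_sets": {
        "dkfz_only": "",
        "dkfz_and_ebfilter": {"ebfilter_threshold": 2.4},
    },
    "exon_lists": {"my_exons": "path/to/my_exons.bed"},  # hypothetical entry
}

combinations = enumerate_combinations(config)
for filter_set, exon_list in combinations:
    print(filter_set, exon_list)
```

Each configured filter set is paired with each configured exon list, and the implicit entries guarantee that an unfiltered, genome-wide combination always exists.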
Workflow
Do the filtering genome-wide (this file always needs to be there):
    dkfz-ebfilter-filterset1-genomewide
Optionally, subset to the regions defined in a BED file, which returns:
    dkfz-ebfilter-filterset1-regions1
    and so on, for filterset1 to n
filterset1: filter on the bPcr and bSeq flags from dkfz-bias-filter
filterset2: additionally filter variants with EBscore < x, where x is configurable
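The two filter sets described above can be sketched as a small predicate. This is a hedged illustration, not the wrapper's implementation: the variant representation (a dict with a "filters" list and an optional "eb_score" value) and the function name passes_filter_set are hypothetical, and treating a missing EB score as a failure under filterset2 is an assumption. The default threshold of 2.4 is taken from ebfilter_threshold in the default configuration.

```python
BIAS_FLAGS = {"bPcr", "bSeq"}  # flags set by dkfz-bias-filter


def passes_filter_set(variant, filter_set, ebfilter_threshold=2.4):
    """Decide whether a variant is kept under the given filter set.

    Assumption: `variant` is a dict with a "filters" list (FILTER values)
    and an optional "eb_score" float; this layout is hypothetical.
    """
    # filterset1: drop variants flagged as PCR or sequencing bias.
    if BIAS_FLAGS & set(variant["filters"]):
        return False
    if filter_set == "filterset1":
        return True
    if filter_set == "filterset2":
        # filterset2: additionally drop variants with EBscore < threshold.
        # Assumption: variants without an EB score are not kept.
        eb_score = variant.get("eb_score")
        return eb_score is not None and eb_score >= ebfilter_threshold
    raise ValueError(f"unknown filter set: {filter_set}")


# Toy variants (hypothetical values, for illustration only).
variants = [
    {"id": "v1", "filters": ["PASS"], "eb_score": 5.1},
    {"id": "v2", "filters": ["bPcr"], "eb_score": 6.0},
    {"id": "v3", "filters": ["PASS"], "eb_score": 1.2},
]
kept_set1 = [v["id"] for v in variants if passes_filter_set(v, "filterset1")]
kept_set2 = [v["id"] for v in variants if passes_filter_set(v, "filterset2")]
```

Here v2 is removed by both filter sets because of its bPcr flag, and v3 is removed only by filterset2 because its EB score lies below the threshold.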