Somatic Variant Filtration

Implementation of the somatic_variant_filtration step

Default Configuration

The default configuration is as follows.

# Default configuration somatic_variant_filtration
step_config:
  somatic_variant_filtration:
    path_somatic_variant_annotation: ../somatic_variant_annotation
    path_ngs_mapping: ../ngs_mapping
    tools_ngs_mapping: null
    tools_somatic_variant_calling: null
    filter_sets:
    # no_filter: no_filters    # implicit, always defined
      dkfz_only: ''  # empty
      dkfz_and_ebfilter:
        ebfilter_threshold: 2.4
      dkfz_and_ebfilter_and_oxog:
        vaf_threshold: 0.08
        coverage_threshold: 5
      dkfz_and_oxog:
        vaf_threshold: 0.08
        coverage_threshold: 5
    exon_lists: {}
    # genome_wide: null         # implicit, always defined
    # ensembl74: path/to/ensembl74.bed
    ignore_chroms:            # patterns of chromosome names to ignore
    - NC_007605    # herpes virus
    - hs37d5       # GRCh37 decoy
    - chrEBV       # Epstein-Barr Virus
    - '*_decoy'    # decoy contig
    - 'HLA-*'      # HLA genes
    - 'GL000220.*' # Contig with problematic, repetitive DNA in GRCh37
    eb_filter:
      shuffle_seed: 1
      panel_of_normals_size: 25
      min_mapq: 20
      min_baseq: 15
      # Parallelization configuration
      window_length: 10000000   # split input into windows of this size, each triggers a job
      num_jobs: 500             # number of windows to process in parallel
      use_profile: true         # use Snakemake profile for parallel processing
      restart_times: 5          # number of times to re-launch jobs in case of failure
      max_jobs_per_second: 2    # throttling of job creation
      max_status_checks_per_second: 10   # throttling of status checks
      debug_trunc_tokens: 0     # truncation to first N tokens (0 for none)
      keep_tmpdir: never        # keep temporary directory, {always, never, onerror}
      job_mult_memory: 1        # memory multiplier
      job_mult_time: 1          # running time multiplier
      merge_mult_memory: 1      # memory multiplier for merging
      merge_mult_time: 1        # running time multiplier for merging
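
To illustrate how ignore_chroms and window_length could interact, here is a minimal sketch. It assumes that the ignore patterns are shell-style globs (which the entries '*_decoy' and 'HLA-*' suggest) and that windows are half-open intervals. The function names is_ignored and windows are hypothetical and are not taken from the pipeline's code.

```python
from fnmatch import fnmatchcase

# Patterns copied from the default configuration above.
IGNORE_CHROMS = ["NC_007605", "hs37d5", "chrEBV", "*_decoy", "HLA-*", "GL000220.*"]


def is_ignored(contig, patterns=IGNORE_CHROMS):
    """Return True if the contig name matches any pattern (glob matching is an assumption)."""
    return any(fnmatchcase(contig, pattern) for pattern in patterns)


def windows(contig_lengths, window_length=10_000_000, patterns=IGNORE_CHROMS):
    """Yield (contig, start, end) half-open windows over all non-ignored contigs."""
    for contig, length in contig_lengths.items():
        if is_ignored(contig, patterns):
            continue  # skip decoys, viral contigs, etc.
        for start in range(0, length, window_length):
            yield contig, start, min(start + window_length, length)
```

With the default window_length, a 25 Mbp contig would yield three windows under these assumptions, and each window would correspond to one job.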

Important

Because the EB Filter step is so time-consuming, the data going into it should be heavily prefiltered (e.g., using Jannovar with the offExome flag).

TODO: document filter, for now see the eb_filter wrapper!

Concept

All variants are annotated with the dkfz-bias-filter to remove sequencing and PCR artifacts. The set of variants annotated with EBFilter is variable: only variants that have the PASS flag set are annotated, because we assume that only those will be kept.

We borrowed the general workflow from variant_filtration, i.e. working with pre-defined filter sets and exon/region lists.

Workflow

1. Do the filtering genome-wide (this file always needs to be there):
   - dkfz-ebfilter-filterset1-genomewide
2. Optionally, subset to the regions defined in a BED file, which returns:
   - dkfz-ebfilter-filterset1-regions1

And so on for filterset1 to filtersetN.
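
The combination of filter sets and region lists can be sketched as a simple cross product. This assumes that the implicit no_filter and genome_wide entries from the default configuration are always included. The function output_labels and the "filter.region" label format are hypothetical and only illustrate the idea; they do not reproduce the pipeline's actual file names.

```python
from itertools import product


def output_labels(filter_sets, exon_lists):
    """Return one label per (filter set, region list) combination.

    "no_filter" and "genome_wide" are prepended because the default
    configuration describes them as implicit and always defined.
    """
    filter_names = ["no_filter", *filter_sets]
    region_names = ["genome_wide", *exon_lists]
    return [f"{flt}.{region}" for flt, region in product(filter_names, region_names)]
```

For example, one filter set and one exon list would give four labels, with the genome-wide result listed before each region subset.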

filterset1: filter out variants that carry the bPcr or bSeq flags from dkfz-bias-filter

filterset2: additionally filter out variants with EBscore < x, where x is configurable (compare ebfilter_threshold in the default configuration)
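
The two filter sets described above can be sketched as predicates on a variant record. The Variant class and its field names are hypothetical stand-ins for a parsed VCF record. Treating a missing EB score as a failure is an assumption made for this sketch, not documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Optional

# Flags set by dkfz-bias-filter, as named in the text above.
BIAS_FLAGS = {"bPcr", "bSeq"}


@dataclass
class Variant:
    """Hypothetical minimal variant record."""
    filters: set = field(default_factory=set)  # FILTER values, e.g. {"bPcr"}
    eb_score: Optional[float] = None           # EBFilter score, if annotated


def passes_filterset1(variant):
    """Keep variants that carry neither the bPcr nor the bSeq flag."""
    return not (variant.filters & BIAS_FLAGS)


def passes_filterset2(variant, ebfilter_threshold=2.4):
    """Apply filterset1, then drop variants with an EB score below the threshold."""
    if not passes_filterset1(variant):
        return False
    if variant.eb_score is None:
        return False  # assumption: unannotated variants are not kept
    return variant.eb_score >= ebfilter_threshold
```

The default of 2.4 mirrors ebfilter_threshold in the default configuration shown earlier.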