Germline Variant Phasing

Implementation of the germline variant_phasing step

This step takes the result of the variant_annotation step and phases the variants using GATK tools. Note that the GATK tools implementing this step have some known issues:

  • The PhaseByTransmission tool changes the genotype of some variants, which is problematic when trying to phase de novo variants.

  • Read-backed phasing is also not 100% reliable at the moment.

Thus, this pipeline step makes the functionality of the tools available, but it is not as fully integrated as it could be because it is unclear how useful phasing is for clinical studies. Also, so far only the GATK variant caller results can be phased.

Also note that this step generates one output file for each child in a pedigree where both parents have been sequenced.
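
The selection of children described above can be sketched as follows. This is only an illustration and not the pipeline's actual implementation: the Member type, its fields, and the assumption that the child itself must also be sequenced are hypothetical, loosely modeled on PED files where "0" marks an unknown parent.

```python
from typing import Dict, List, NamedTuple


class Member(NamedTuple):
    """One pedigree member, loosely following PED columns (hypothetical type)."""

    name: str
    father: str  # "0" means unknown/absent, as in PED files
    mother: str  # "0" means unknown/absent, as in PED files
    sequenced: bool  # assumption: True if an NGS library exists for this member


def children_with_both_parents(members: List[Member]) -> List[str]:
    """Return names of sequenced children whose father and mother are both sequenced."""
    by_name: Dict[str, Member] = {m.name: m for m in members}
    result = []
    for m in members:
        father = by_name.get(m.father)  # None if the parent is not in the pedigree
        mother = by_name.get(m.mother)
        if m.sequenced and father and mother and father.sequenced and mother.sequenced:
            result.append(m.name)
    return result


# Toy pedigree: two children with both parents sequenced, one member without.
pedigree = [
    Member("father", "0", "0", True),
    Member("mother", "0", "0", True),
    Member("child1", "father", "mother", True),
    Member("child2", "father", "mother", True),
    Member("cousin", "uncle", "0", True),
]
print(children_with_both_parents(pedigree))
```

Under these assumptions, one output file would be expected for each name in the returned list.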

Step Input

The variant phasing step uses the output of the following CUBI pipeline steps:

  • ngs_mapping

  • variant_annotation

Step Output

For each input VCF file (i.e., for each mapper and pedigree), a directory output/{mapper}.{caller}.{phaser}.{index_ngs_library}/out will be created with the following output files.

The {phaser} placeholder can take the values gatk_phase_by_transmission, gatk_read_backed_phasing, and gatk_phased_both (for the latter, phasing by transmission is performed first, followed by read-backed phasing).
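
As a rough illustration of how the placeholders combine, the sketch below expands the directory pattern quoted above for every combination of values. The helper name expected_output_dirs and the example values bwa, gatk_hc, and child1-N1-DNA1-WGS1 are assumptions made for this example; the values in a real project depend on its configuration and sample sheet.

```python
from itertools import product
from typing import Iterable, List

# Directory pattern as given in this section.
OUTPUT_DIR_PATTERN = "output/{mapper}.{caller}.{phaser}.{index_ngs_library}/out"


def expected_output_dirs(
    mappers: Iterable[str],
    callers: Iterable[str],
    phasers: Iterable[str],
    index_libraries: Iterable[str],
) -> List[str]:
    """Expand the pattern for every combination of placeholder values."""
    return [
        OUTPUT_DIR_PATTERN.format(
            mapper=mapper, caller=caller, phaser=phaser, index_ngs_library=library
        )
        for mapper, caller, phaser, library in product(
            mappers, callers, phasers, index_libraries
        )
    ]


# Hypothetical mapper, caller, and library names, for illustration only.
dirs = expected_output_dirs(
    mappers=["bwa"],
    callers=["gatk_hc"],
    phasers=["gatk_phase_by_transmission", "gatk_read_backed_phasing"],
    index_libraries=["child1-N1-DNA1-WGS1"],
)
for path in dirs:
    print(path)
```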

Global Configuration

  • static_data_config/reference/path must be set appropriately

Default Configuration

The default configuration is as follows.

# Default configuration variant_phasing
step_config:
  variant_phasing:
    path_ngs_mapping: ../ngs_mapping
    path_variant_annotation: ../variant_annotation
    tools_ngs_mapping: []       # expected tools for ngs mapping
    tools_variant_calling: []   # expected tools for variant calling
    phasings:
    - gatk_phasing_both
    ignore_chroms:            # patterns of chromosome names to ignore
    - NC_007605               # herpes virus
    - hs37d5                  # GRCh37 decoy
    - chrEBV                  # Epstein-Barr Virus
    - '*_decoy'               # decoy contig
    - 'HLA-*'                 # HLA genes
    gatk_read_backed_phasing:
      phase_quality_threshold: 20.0  # quality threshold for phasing
      window_length: 5000000    # split input into windows of this size, each triggers a job
      num_jobs: 1000            # number of windows to process in parallel
      use_profile: true         # use Snakemake profile for parallel processing
      restart_times: 0          # number of times to re-launch jobs in case of failure
      max_jobs_per_second: 10   # throttling of job creation
      max_status_checks_per_second: 10   # throttling of status checks
      debug_trunc_tokens: 0     # truncation to first N tokens (0 for none)
      keep_tmpdir: never        # keep temporary directory, {always, never, onerror}
      job_mult_memory: 1        # memory multiplier
      job_mult_time: 1          # running time multiplier
      merge_mult_memory: 1      # memory multiplier for merging
      merge_mult_time: 1        # running time multiplier for merging
    gatk_phase_by_transmission:
      de_novo_prior: 1e-8       # default, use 1e-6 when interested in phasing de novos
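
To make the interplay of ignore_chroms and window_length more concrete, here is a minimal sketch of how contigs might be filtered and then split into windows. It assumes that the patterns are shell-style globs (matched here with Python's fnmatch module) and that windows are 0-based, half-open intervals; neither is confirmed by this section. The function names and the contig lengths are made up for the example.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterator, List, Tuple

# Values copied from the default configuration above.
IGNORE_CHROMS = ["NC_007605", "hs37d5", "chrEBV", "*_decoy", "HLA-*"]
WINDOW_LENGTH = 5_000_000


def is_ignored(contig: str, patterns: List[str] = IGNORE_CHROMS) -> bool:
    """Return True if the contig name matches any pattern (assumed glob syntax)."""
    return any(fnmatchcase(contig, pattern) for pattern in patterns)


def windows(
    contig_lengths: Dict[str, int],
    window_length: int = WINDOW_LENGTH,
    patterns: List[str] = IGNORE_CHROMS,
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows, skipping ignored contigs.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    for contig, length in contig_lengths.items():
        if is_ignored(contig, patterns):
            continue
        for start in range(0, length, window_length):
            yield contig, start, min(start + window_length, length)


# Toy contig lengths, not taken from any real reference.
contigs = {
    "chr21": 12_000_000,
    "chrEBV": 170_000,
    "chrUn_example_decoy": 25_000,
    "HLA-A*01:01": 3_500,
}
result = list(windows(contigs))
for window in result:
    print(window)
```

In this sketch each yielded window would correspond to one job, so chr21 at the toy length of 12 Mbp produces three windows while the other three contigs are skipped.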

Reports

Currently, no reports are generated.