Germline Variant Calling

Implementation of the variant_calling step

The variant_calling` step takes the output of the ``ngs_mapping step and performs small variant calling on the read alignments. The output are variant calls in VCF (and optionally gVCF) files and quality control statistics on these data.

Properties

overall stability

stable

applicable to

germline variant calling

generally applicable to

short read variant calling

Step Input

BAM files from the ngs_mapping step.

Step Output

Creates one output directory for each read mapper (from ngs_mapping), each variant caller, and each pedigree from the germline sample sheet.

Primary Output

  • output/{mapper}.{caller}.{index_library}/out/{mapper}.{caller}.{index_library}.vcf.gz

Additional Output

The callers implementing a gVCF workflow (currently only gatk4_hc_gvcf) also create one output gVCF file for the pedigree.

  • output/{mapper}.{caller}.{index_library}/out/{mapper}.{caller}.{index_library}.g.vcf.gz

Further, each VCF and gVCF file gets an appropriate TBI index file {vcf_file}.tbi and each output is gets an appropriate MD5 checksum file {file}.md5.

Global Configuration

  • If GATK HaplotypeCaller or GATK UnifiedGenotyper are activated then static_data_config/dbsnp/path must be properly configured

  • static_data_config/reference/path must be set appropriately

Default Configuration

The default configuration is as follows.

# Default configuration variant_calling
step_config:
  variant_calling:
    # Common configuration
    path_ngs_mapping: ../ngs_mapping  # REQUIRED

    # Report generation
    baf_file_generation:
      enabled: true
      min_dp: 10  # minimal DP of variant, must be >=1
    bcftools_stats:
      enabled: true
    jannovar_stats:
      enabled: true
      path_ser: REQUIRED  # REQUIRED
    bcftools_roh:
      enabled: true
      path_targets: null  # REQUIRED; optional
      path_af_file: null  # REQUIRED
      ignore_homref: false
      skip_indels: false
      rec_rate: 1e-8

    # Variant calling tools and their configuration
    #
    # Common configuration
    tools: ['gatk4_hc_gvcf']  # REQUIRED
    ignore_chroms:
    - '^NC_007605$' # herpes virus
    - '^hs37d5$'    # GRCh37 decoy
    - '^chrEBV$'    # Eppstein-Barr Virus
    - '_decoy$'     # decoy contig
    - '^HLA-'       # HLA genes

    # Variant caller specific configuration
    bcftools_call:
      max_depth: 250
      max_indel_depth: 250
      window_length: 10000000
      num_threads: 16
    gatk3_hc:
      num_threads: 16
      window_length: 10000000
      allow_seq_dict_incompatibility: false
    gatk3_ug:
      num_threads: 16
      window_length: 10000000
      allow_seq_dict_incompatibility: false
      downsample_to_coverage: 250
    gatk4_hc_joint:
      window_length: 10000000
      num_threads: 16
      allow_seq_dict_incompatibility: false
    gatk4_hc_gvcf:
      window_length: 10000000
      num_threads: 16
      allow_seq_dict_incompatibility: false

Variant Callers

The following germline variant callers are currently available.

gatk4_hc_gvcf

Variant calling with GATK v4 HaplotypeCaller using the gVCF workflow consisting of variant discovery with HaplotypeCaller, merging of the gVCF files withing each pedigree with CombineGVCFs and genotyping with GenotypeGVCFs.

This is the mainly used variant caller and the only one enabled by default.

The reason is this being the main advertised run mode by the GATK team and this workflow enables physical phasing information in the output VCF files.

gatk4_hc_joint

Variant calling with the GATK v4 HaplotypeCaller using joint calling with direct VCF generation.

This variant caller is provided as a fallback to explore problems with de novo variant calls that may have been introduced by the gVCF workflow.

Disabled by default.

gatk3_hc

Joint calling with GATK v3 HaplotypeCaller.

This caller is provided for historical reasons as earlier versions of SNAPPY pipeline were based on this workflow.

Disabled by default.

gatk3_ug

Joint calling with GATK v3 UnifiedGenotyper.

This caller is provided for historical reasons and to provide a vote in creating consensus sets of variant calls.

bcftools_call

Variant calling with bcftools mpileup | bcftools call.

This caller is provided for establishing baseline variant calls in benchmark situations. BCFtools allows for fast and efficient variant calling at the cost of some sensitivity and specificity.

Disabled by default.

Reports

jannovar_stats

Create statistics on variants using jannovar statsistics for each pedigree.

report/jannovar_stats/{mapper}.{caller}.{index_library}.{donor_library}.txt

bcftools_stats

Create statistics on variants using bcftools stats for each donor in each pedigree for each mapper and caller.

report/bcftools_stats/{mapper}.{caller}.{index_library}.{donor_library}.txt

baf_file_generation

Create one UCSC BigWig file for each individual in each pedigree for each mapper and caller with B-allele fraction. These files can be used for to visually confirm structural variants or runs of homozygosity.

report/baf/{mapper}.{caller}.{index_library}.{donor_library}.bw

roh_calling

Perform run-of-homozygosity calling with bcftools roh.

Log Files

For each variant caller and report generator, the following log files are created into the log directory.

{file}.conda_info.txt

Output of conda info of the executing conda environment.

{file}.conda_list.txt

Output of conda list of the executing conda environment with list of the full package list and exact versions.

{file}.log

Log output of the execution.

{file}.wrapper.py

The actual Snakemake wrapper file with all input / output / parameter values.

Implementation Notes

  • All variant callers are parallelized using GNU parallel on genome-wide windows generated by GATK v4 PreprocessIntervals.

  • Each output file has an accompanying MD5 sum.

Example Output

Given a pedigree with index index and two more donors mother and father, the following files would be created into output/ (each VCF file has a .tbi file and overall each file has a .md5 file). In this case, the read mapper is bwa and the variant caller is gatk4_hc_gvcf.

` bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.conda_info.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.conda_list.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.environment.yaml bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.log bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.wrapper.py bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/out/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.g.vcf.gz bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/out/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.vcf.gz bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf.bw bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf.bw bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf.bw bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.txt bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/jannovar_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.txt `