Germline Variant Calling
Implementation of the variant_calling
step
The variant_calling` step takes the output of the ``ngs_mapping
step and performs small
variant calling on the read alignments. The output are variant calls in VCF (and optionally gVCF)
files and quality control statistics on these data.
Properties
overall stability
stable
applicable to
germline variant calling
generally applicable to
short read variant calling
Step Input
BAM files from the ngs_mapping
step.
Step Output
Creates one output directory for each read mapper (from ngs_mapping
), each variant caller, and
each pedigree from the germline sample sheet.
Primary Output
output/{mapper}.{caller}.{index_library}/out/{mapper}.{caller}.{index_library}.vcf.gz
Additional Output
The callers implementing a gVCF workflow (currently only gatk4_hc_gvcf
) also create one output
gVCF file for the pedigree.
output/{mapper}.{caller}.{index_library}/out/{mapper}.{caller}.{index_library}.g.vcf.gz
Further, each VCF and gVCF file gets an appropriate TBI index file {vcf_file}.tbi
and each output
is gets an appropriate MD5 checksum file {file}.md5
.
Global Configuration
If GATK HaplotypeCaller or GATK UnifiedGenotyper are activated then
static_data_config/dbsnp/path
must be properly configuredstatic_data_config/reference/path
must be set appropriately
Default Configuration
The default configuration is as follows.
step_config:
variant_calling:
#path_ngs_mapping: ../ngs_mapping
#tools: # Options: 'bcftools_call', 'gatk3_hc', 'gatk3_ug', 'gatk4_hc_joint', 'gatk4_hc_gvcf'
# - gatk4_hc_gvcf
#ignore_chroms:
# - ^NC_007605$
# - ^hs37d5$
# - ^chrEBV$
# - _decoy$
# - ^HLA-
#baf_file_generation:
# enabled: false
#
# # minimal DP of variant, must be >=1
# min_dp: 10
#bcftools_stats:
# enabled: false
#jannovar_stats:
# enabled: false
# path_ser: # REQUIRED
#bcftools_roh:
# enabled: false
# path_targets:
# path_af_file: # REQUIRED
# ignore_homref: false
# skip_indels: false
# rec_rate: 1e-08
#bcftools_call:
# max_depth: 250
# max_indel_depth: 250
# window_length: 10000000
# num_threads: 16
#gatk3_hc:
# num_threads: 16
# window_length: 10000000
# allow_seq_dict_incompatibility: false
#gatk3_ug:
# num_threads: 16
# window_length: 10000000
# allow_seq_dict_incompatibility: false
# downsample_to_coverage: 250
#gatk4_hc_joint:
# window_length: 10000000
# num_threads: 16
# allow_seq_dict_incompatibility: false
#gatk4_hc_gvcf:
# window_length: 10000000
# num_threads: 16
# allow_seq_dict_incompatibility: false
Variant Callers
The following germline variant callers are currently available.
gatk4_hc_gvcf
Variant calling with GATK v4 HaplotypeCaller using the gVCF workflow consisting of variant discovery with
HaplotypeCaller
, merging of the gVCF files withing each pedigree withCombineGVCFs
and genotyping withGenotypeGVCFs
.This is the mainly used variant caller and the only one enabled by default.
The reason is this being the main advertised run mode by the GATK team and this workflow enables physical phasing information in the output VCF files.
gatk4_hc_joint
Variant calling with the GATK v4 HaplotypeCaller using joint calling with direct VCF generation.
This variant caller is provided as a fallback to explore problems with de novo variant calls that may have been introduced by the gVCF workflow.
Disabled by default.
gatk3_hc
Joint calling with GATK v3 HaplotypeCaller.
This caller is provided for historical reasons as earlier versions of SNAPPY pipeline were based on this workflow.
Disabled by default.
gatk3_ug
Joint calling with GATK v3 UnifiedGenotyper.
This caller is provided for historical reasons and to provide a vote in creating consensus sets of variant calls.
bcftools_call
Variant calling with
bcftools mpileup | bcftools call
.This caller is provided for establishing baseline variant calls in benchmark situations. BCFtools allows for fast and efficient variant calling at the cost of some sensitivity and specificity.
Disabled by default.
Reports
jannovar_stats
Create statistics on variants using
jannovar statsistics
for each pedigree.report/jannovar_stats/{mapper}.{caller}.{index_library}.{donor_library}.txt
bcftools_stats
Create statistics on variants using
bcftools stats
for each donor in each pedigree for each mapper and caller.report/bcftools_stats/{mapper}.{caller}.{index_library}.{donor_library}.txt
baf_file_generation
Create one UCSC BigWig file for each individual in each pedigree for each mapper and caller with B-allele fraction. These files can be used for to visually confirm structural variants or runs of homozygosity.
report/baf/{mapper}.{caller}.{index_library}.{donor_library}.bw
roh_calling
Perform run-of-homozygosity calling with
bcftools roh
.
Log Files
For each variant caller and report generator, the following log files are created into the
log
directory.
{file}.conda_info.txt
Output of
conda info
of the executing conda environment.
{file}.conda_list.txt
Output of
conda list
of the executing conda environment with list of the full package list and exact versions.
{file}.log
Log output of the execution.
{file}.wrapper.py
The actual Snakemake wrapper file with all input / output / parameter values.
Implementation Notes
All variant callers are parallelized using GNU parallel on genome-wide windows generated by GATK v4
PreprocessIntervals
.Each output file has an accompanying MD5 sum.
Example Output
Given a pedigree with index index
and two more donors mother
and father
, the following
files would be created into output/
(each VCF file has a .tbi
file and overall each file has
a .md5
file). In this case, the read mapper is bwa
and the variant caller is gatk4_hc_gvcf
.
`
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf_file_generation_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.bcftools_stats_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf_file_generation_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.bcftools_stats_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf_file_generation_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.bcftools_stats_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.gatk4_hc_gvcf_genotype.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.conda_info.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.conda_list.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.environment.yaml
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.log
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/log/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.jannovar_stats_run.wrapper.py
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/out/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.g.vcf.gz
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/out/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.vcf.gz
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.baf.bw
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.baf.bw
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/baf/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.baf.bw
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.index-N1-DNA1-WES1.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.father-N1-DNA1-WES1.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/bcftools_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.mother-N1-DNA1-WES1.txt
bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1/report/jannovar_stats/bwa.gatk4_hc_gvcf.index-N1-DNA1-WES1.txt
`