Somatic Variant Calling Dissection
Note
Before reading this chapter, you should have
have knowledge from the user’s perspective of CUBI pipeline (start a Usage)
read chapter Developer’s Introduction.
After reading this chapter, you should
understand the
BaseStep
andBaseStepPart
classes and how to subclass them and override the different functionsunderstand how the objects of these classes are tied into the
Snakefile
of each pipeline stepunderstand how to create and use Snakemake wrappers for tools
Pipeline Step File Structure
Generally, all pipeline steps go into a sub module of snappy_pipeline.workflows
(thus, a sub directory).
In this chapter, we look at the somatic_variant_calling
pipeline step.
This step has the following structure on the file system:
somatic_variant_calling/
|-- __init__.py
`-- Snakefile
As you can see, it is a Python module (as it contains an __init__.py
file) that also contains non-Python files (here Snakefile
).
The directory could also contain more files.
This could be any small static data file that the module could require.
Further, we could decide to factorize out the rules for a tool that requires many small rules (such as the tool cnvkit in somatic_targeted_seq_cnv_calling
).
The code for generating input and output file lists etc. is in the __init__.py
file and the module is available as snappy_pipeline.workflow.somatic_variant_calling
.
The Snakefile
is used for creating the Snakemake workflow.
When executing cubi-snake
for a somatic_variant_calling
step instance, you will note that the Snakemake command line displayed at the top will use the --snakefile
argument and put the value to the Snakefile
inside the somatic_variant_calling
directory at the argument’s value.
Thus, cubi-snake
is no real “magic” but simply a shortcut to the snakemake
executable.
The Snakefile
We will first consider the Snakefile
.
Necessary Imports
At the top, it starts with a line specifying UTF-8 coding and a Python docstring giving a short synopsis.
It does some Python imports
for making the expand_ref()
function and the SomaticVariantCallingWorkflow
class available.
# -*- coding: utf-8 -*-
"""CUBI Pipeline somatic_variant_calling step Snakefile"""
import os
from snappy_pipeline import expand_ref
from snappy_pipeline.workflows.somatic_variant_calling import SomaticVariantCallingWorkflow
__author__ = "Manuel Holtgrewe <manuel.holtgrewe@bihealth.de>"
Configuration
Then follows the loading of the configuration.
The Snakemake configfile:
statement loads the file config.yaml
from the current working directory.
When executing the pipeline step with cubi-snake
, this is either the directory the command is called in or the value of the --directory
argument if given.
In the last line of this chunk, the JSON pointers in the configuration are expanded, i.e., the "$ref"
values are interpreted.
This is used for implementing “overriding behaviour”, i.e., including and extending (OOP-like) the project’s main configuration file with the per–step instance one.
This way, you can set the pipeline_step/name
to somatic_variant_calling
in the somatic variant calling pipeline instance directory and to ngs_mapping
in the ngs mapping pipeline step instance directory, for example.
configfile: "config.yaml"
# Expand "$ref" JSON pointers in configuration (also works for YAML)
config, lookup_paths, config_paths = expand_ref("config.yaml", config)
Local Rules / Rule all
In the next chunk, the rules that are to be executed locally and not generate any cluster jobs are defined.
Then, the all
rule is defined to obtain the list of files to generate by default using the get_result_files()
method of the SomaticVariantCallingWorkflow()
class.
localrules:
# Linking files from work/ to output/ should be done locally
somatic_variant_calling_link_out_run,
rule all:
House-Keeping Rules
Next follow the “house-keeping” rules that do not perform any real work.
In this case, the somatic_variant_calling_link_out_run
rule performs the linking from the work/
directory into the output/
directory.
Note that the rule names are generated by concatenating the step name (here somatic_variant_calling
), the part of the pipeline step (here link_out
), followed by the action to be performed (here run
as there is no other action for link_out
).
wf.get_result_files(),
# House-Keeping ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Generic linking out ---------------------------------------------------------
rule somatic_variant_calling_link_out_run:
Rule for MuTect
Next comes the rule for running mutect.
Note that both for input:
and output:
, dict
values will be passed.
These should be unpacked (similar to Python **kwargs
unpacking).
As we are using an input function (i.e., a function object that accepts a wildcards argument), we have to use the recently introduced Snakemake unpack
directive.
This allows for lazy unpacking after the input function has been called with the wildcards argument.
For the output files, no wildcards are required as only strings with placeholders are returned.
Thus, the dict with key/value pairs of the named output files is to be unpacked directly using two asterisks (**
).
wf.get_input_files("link_out", "run"),
output:
wf.get_output_files("link_out", "run"),
run:
shell(wf.get_shell_cmd("link_out", "run", wildcards))
# Somatic Variant Calling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Run MuTect ------------------------------------------------------------------
Rule for Scalpel
The rule for Scalpel looks similar. However, here the name of the normal library is required (as it is not part of the wildcards in contrast to the somatic library).
The way this is implemented here is to introduce a parameter called normal_library_name
.
The attribute get_normal_lib_name
of the scalpel BaseStepPart
sub class object is passed in here (which is actually a function).
On execution of the rule, the function will be called with the wildcards object as the parameter.
It will then lookup the name of the normal library for the matched tumor NGS library and return it.
This value is then available as params.normal_library_name
.
Also note that for scalpel, the call is not generated directly by the BaseStepPart
sub class, but a Snakemake wrapper is used instead.
This wrapper is located in the directory ../wrappers/scalpel/run
, relative to the somatic_variant_calling
Snakefile.
The method wrapper_path()
builds the correct path relative to the wrappers
directory in the snappy_pipeline
directory and its return value is passed to the wrapper:
section.
rule somatic_variant_calling_mutect_run:
input:
unpack(wf.get_input_files("mutect", "run")),
output:
**wf.get_output_files("mutect", "run"),
threads: wf.get_resource("mutect", "run", "threads")
resources:
time=wf.get_resource("mutect", "run", "time"),
The Module
Module Documentation
The file starts with the module-level documentation. Only the first four lines are shown below. This module-level documentation is also included into the user documentation, e.g., the one for the somatic variant calling module is included at Somatic Variant Calling.
# -*- coding: utf-8 -*-
"""Implementation of the ``somatic_variant_calling`` step
The ``somatic_variant_calling`` step takes as the input the results of the ``ngs_mapping`` step
Imports
Then follow the necessary imports for the module. Note that the classes of the steps that are used for the input are also imported (this will be important below).
from itertools import chain
import os
import sys
from biomedsheets.shortcuts import CancerCaseSheet, CancerCaseSheetOptions, is_not_background
from snakemake.io import expand
from snappy_pipeline.utils import dictify, listify
from snappy_pipeline.workflows.abstract import (
BaseStep,
BaseStepPart,
LinkOutStepPart,
ResourceUsage,
)
Constants
The imports are followed by constant definitions.
In the case of the somatic variant calling methods, the different tools generate a common set of core files, here VCF files with TBI indices and MD5 files for both. Thus, it makes sense to store the extensions (and names for named input and output file lists) in module-level constants. Note the use of tuples over lists for marking this datas explicitely as immutable.
Further, the DEFAULT_CONFIG
constant is defined with default configuration in YAML format.
This is also displayed in the user configuration so the users know where configuration settings are available.
Required configuration without any defaults should be set to null
(or []
/{}
for empty lists/dicts) and marked with a # REQUIRED
comment.
The different values are to be documented with YAML comments.
This default configuration will be loaded when initializing the BaseStep
sub class object and then overridden with the project- and pipeline step instance–wide configuration.
__author__ = "Manuel Holtgrewe <manuel.holtgrewe@bih-charite.de>"
#: Extensions of files to create as main payload
EXT_VALUES = (".vcf.gz", ".vcf.gz.tbi", ".vcf.gz.md5", ".vcf.gz.tbi.md5")
#: Names of the files to create for the extension
EXT_NAMES = ("vcf", "vcf_tbi", "vcf_md5", "vcf_tbi_md5")
EXT_MATCHED = {
"mutect": {
"vcf": ".vcf.gz",
"vcf_md5": ".vcf.gz.md5",
"vcf_tbi": ".vcf.gz.tbi",
"vcf_tbi_md5": ".vcf.gz.tbi.md5",
"full_vcf": ".full.vcf.gz",
"full_vcf_md5": ".full.vcf.gz.md5",
"full_vcf_tbi": ".full.vcf.gz.tbi",
"full_vcf_tbi_md5": ".full.vcf.gz.tbi.md5",
"txt": ".txt",
"txt_md5": ".txt.md5",
"wig": ".wig",
"wig_md5": ".wig.md5",
},
"scalpel": {
"vcf": ".vcf.gz",
"vcf_md5": ".vcf.gz.md5",
"vcf_tbi": ".vcf.gz.tbi",
The BaseStep Sub Class
Let’s jump towards the end of the file.
Here is the BaseStep
sub class SomaticVariantCallingWorkflow
.
Each sub class has to configure the name
and sheet_shortcut_class
class members.
They will be used for identifying the step name and the BioMed Sheet sheet shortcut class.
The somatic variant calling step sets these to "somatic_variant_calling"
and the CancerCaseSheet
class.
The static method default_config_yaml()
must be overridden in each BaseStep
sub class.
Each of these functions will have the same content but it is important for the scope of accessing DEFAULT_CONFIG
in the current module.
allow_seq_dict_incompatibility: false
annotations:
- BaseQualityRankSumTest
- FisherStrand
- GCContent
- HaplotypeScore
- HomopolymerRun
- MappingQualityRankSumTest
- MappingQualityZero
- QualByDepth
- ReadPosRankSumTest
- RMSMappingQuality
The constructor of the class calls the super class’ constructor with the arguments from the Snakefile
.
It is very important to note that it also gets an iterable (here a one-element tuple) of the BaseStep
sub classes that provide input for this step (here only NGSMappingWorkflow
imported at the top).
This information is required for making the default configuration of these steps available.
The constructor then proceeds to register the sub step classes that are used to implement the actual behaviour of the pipeline step.
Here, it is for running MuTect, Scalpel, and linking out the somatic variant call VCF files from work/
into output/
.
- Coverage
- ClippingRankSumTest
- DepthPerSampleHC
gatk_ug_joint:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
The method get_result_files()
returns a list of result files of this pipeline step.
For this, it uses the _yield_result_files_()
helper method and generates path in the output/
folder.
Snakemake knows that the link out rule will create these files but need corresponding files in work/
for this.
Through this mechanism, the individual tools’ rules will be triggered.
Note the use of cubi.utils.listify()
decorator that converts a generator (as created by using yield
in the function) to a function returning a list with the yielded objects in the ordere that they are yielded.
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# GATK UG--specific configuration
downsample_to_coverage: 250
allow_seq_dict_incompatibility: false
annotations:
- BaseQualityRankSumTest
- FisherStrand
- GCContent
- HaplotypeScore
- HomopolymerRun
- MappingQualityRankSumTest
- MappingQualityZero
- QualByDepth
- ReadPosRankSumTest
- RMSMappingQuality
- DepthPerAlleleBySample
- Coverage
- ClippingRankSumTest
- DepthPerSampleHC
Finally, the check_config()
implementation ensures that the path to the NGS mapping step is configured for the somatic variant calling step.
window_length: 5000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 2 # throttling of job creation
Module-Level BaseStepPart Sub Class
Now, we move up towards the top of the file again.
The SomaticVariantCallingStepPart
class is the base class for the somatic variant calling implementations.
The constructor builds a template string for generating result/output paths.
It then builds the member cancer_ngs_library_to_sample_pair
with a mapping from tumor DNA NGS library name to the BioMed Sheets CancerSamplePair
object that contains information about both the tumor and normal sample.
Note the use of OrderedDict
to keep the order from the sample sheet definition.
"full_vcf_tbi": ".full.vcf.gz.tbi",
"full_vcf_tbi_md5": ".full.vcf.gz.tbi.md5",
"tar": ".tar.gz",
"tar_md5": ".tar.gz.md5",
},
"mutect2": {
"vcf": ".vcf.gz",
"vcf_md5": ".vcf.gz.md5",
"vcf_tbi": ".vcf.gz.tbi",
"vcf_tbi_md5": ".vcf.gz.tbi.md5",
"full_vcf": ".full.vcf.gz",
"full_vcf_md5": ".full.vcf.gz.md5",
"full_vcf_tbi": ".full.vcf.gz.tbi",
"full_vcf_tbi_md5": ".full.vcf.gz.tbi.md5",
},
}
The implementation of get_input_files()
returns an input function that given the wildcards returns a dict with paths to the normal and tumor libraries’ aligned BAM file from the sub workflow ngs_mapping
.
Note that the input function returns the actual path without any wildcards.
SOMATIC_VARIANT_CALLERS_MATCHED = ("mutect", "mutect2", "scalpel")
#: Available somatic variant callers that just call all samples from one donor together.
SOMATIC_VARIANT_CALLERS_JOINT = (
"bcftools_joint",
"platypus_joint",
"gatk_hc_joint",
"gatk_ug_joint",
"varscan_joint",
)
#: Available somatic variant callers
SOMATIC_VARIANT_CALLERS = tuple(
chain(SOMATIC_VARIANT_CALLERS_MATCHED, SOMATIC_VARIANT_CALLERS_JOINT)
)
#: Available somatic variant callers assuming matched samples.
SOMATIC_VARIANT_CALLERS_MATCHED = ("mutect", "mutect2", "scalpel", "strelka2")
#: Available somatic variant callers that just call all samples from one donor together.
The method get_normal_lib_name()
returns the name of the matched normal NGS library for the given tumor NGS library name.
"bcftools_joint",
"platypus_joint",
"gatk_hc_joint",
"gatk_ug_joint",
The implementation of get_output_files()
returns a dict with named output files for Snakemake.
Note that this function returns named output files with wildcard placeholders.
)
#: Default configuration for the somatic_variant_calling schema
DEFAULT_CONFIG = r"""
# Default configuration somatic_variant_calling
step_config:
somatic_variant_calling:
Finally, the method get_log_file()
returns the path to the log file to create by Snakemake.
This is available as {log}
in shell commands and as {snakemake.log}
in Snakemake wrappers.
path_ngs_mapping: ../ngs_mapping # REQUIRED
ignore_chroms: # patterns of chromosome names to ignore
- NC_007605 # herpes virus
MuTect BaseStepPart Sub Class
First, the class defines the class attribute name
and sets it to 'mutect'
.
This is used by the super class and also by BaseStep
in the places where the name of the implementation is needed.
The check_config()
implementation ensures that the necessary MuTect-specific configuration has been set.
- '*_decoy' # decoy contig
- 'HLA-*' # HLA genes
- 'GL000220.*' # Contig with problematic, repetitive DNA in GRCh37
# Configuration for joint calling with samtools+bcftools.
bcftools_joint:
max_depth: 4000
max_indel_depth: 4000
window_length: 10000000
num_threads: 16
# Configuration for joint calling with Platypus.
platypus_joint:
split_complex_mnvs: true # whether or not to split complex and MNV variants
num_threads: 16
# VCF annotation databases are given as mapping from name to
# {'file': '/path.vcf.gz',
# 'info_tag': 'VCF_TAG',
# 'description': 'VCF header description'}
The function get_shell_cmd()
generates the shell command to the CUBI wrapper (not Snakemake wrapper ;) to the MuTect tool.
Here, the parallel CUBI wrapper for MuTect is used with appropriate configuration.
Note the direct use of configuration and that no complex string operations are required for building the call to MuTect.
mutect:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 3500000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 2 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# Configuration for MuTect 2
mutect2:
panel_of_normals: '' # Set path to panel of normals vcf if required
germline_resource: '' # Germline variants resource (same as panel of normals)
common_variants: '' # Common germline variants for contamination estimation
extra_arguments: [] # List additional Mutect2 arguments
# Each additional argument xust be in the form:
# "--<argument name> <argument value>"
# For example, to filter reads prior to calling & to
# add annotations to the output vcf:
# - "--read-filter CigarContainsNoNOperator"
Finally, the method update_cluster_config()
takes the Snakemake cluster
configuration and updates it appropriately.
The three settings cluster.h_vmem
, cluster.h_rt
, and cluster.pe
are used in the generated Snakemake command line generated by the cubi-snake
tool appropriately for SGE.
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
Scalpel BaseStepPart Sub Class
The integration for Scalpel is even simpler as Snakemake wrappers can be used and there is no need for the get_shell_cmd()
function.
The class sets the class attribute name
to 'scalpel'
.
Then, the check_config()
implementation ensure the presence of the required configuration.
debug_trunc_tokens: 0 # truncation to first N tokens (0 for none)
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# Configuration for Scalpel
scalpel:
path_target_regions: REQUIRED # REQUIRED
# Configuration for strelka2
strelka2:
Output file generation is similarly easy as for the MuTect part. However, the Scalpel working directory is tar-gzed in case the users wants to have access to the intermediate results and query the built MuTect database for variants with different coverages.
gatk_hc_joint:
# Parallelization configuration
num_cores: 2 # number of cores to use locally
window_length: 50000000 # split input into windows of this size, each triggers a job
num_jobs: 500 # number of windows to process in parallel
use_profile: true # use Snakemake profile for parallel processing
restart_times: 5 # number of times to re-launch jobs in case of failure
max_jobs_per_second: 10 # throttling of job creation
max_status_checks_per_second: 10 # throttling of status checks
The update_cluster_config()
method’s implementation is also very simple.
keep_tmpdir: never # keep temporary directory, {always, never, onerror}
job_mult_memory: 1 # memory multiplier
job_mult_time: 1 # running time multiplier
merge_mult_memory: 1 # memory multiplier for merging
merge_mult_time: 1 # running time multiplier for merging
# GATK HC--specific configuration