.. _dev_somatic_variant_calling: ================================== Somatic Variant Calling Dissection ================================== .. note:: Before reading this chapter, you should have - have knowledge from the user's perspective of CUBI pipeline (start a :ref:`usage`) - read chapter :ref:`dev_intro`. After reading this chapter, you should - understand the :py:class:`~snappy_pipeline.workflows.abstract.BaseStep` and :py:class:`~snappy_pipeline.workflows.abstract.BaseStepPart` classes and how to subclass them and override the different functions - understand how the objects of these classes are tied into the ``Snakefile`` of each pipeline step - understand how to create and use Snakemake wrappers for tools ---------------------------- Pipeline Step File Structure ---------------------------- Generally, all pipeline steps go into a sub module of ``snappy_pipeline.workflows`` (thus, a sub directory). In this chapter, we look at the ``somatic_variant_calling`` pipeline step. This step has the following structure on the file system: :: somatic_variant_calling/ |-- __init__.py `-- Snakefile As you can see, it is a Python module (as it contains an ``__init__.py`` file) that also contains non-Python files (here ``Snakefile``). The directory could also contain more files. This could be any small static data file that the module could require. Further, we could decide to factorize out the rules for a tool that requires many small rules (such as the tool cnvkit in ``somatic_targeted_seq_cnv_calling``). The code for generating input and output file lists etc. is in the ``__init__.py`` file and the module is available as ``snappy_pipeline.workflow.somatic_variant_calling``. The ``Snakefile`` is used for creating the Snakemake workflow. When executing ``cubi-snake`` for a ``somatic_variant_calling`` step instance, you will note that the Snakemake command line displayed at the top will use the ``--snakefile`` argument and put the value to the ``Snakefile`` inside the ``somatic_variant_calling`` directory at the argument's value. Thus, ``cubi-snake`` is no real "magic" but simply a shortcut to the ``snakemake`` executable. ------------- The Snakefile ------------- We will first consider the ``Snakefile``. Necessary Imports ================= At the top, it starts with a line specifying UTF-8 coding and a Python docstring giving a short synopsis. It does some Python ``imports`` for making the :py:func:`~snappy_pipeline.base.expand_ref` function and the :py:class:`~snappy_pipeline.workflows.somatic_variant_calling.SomaticVariantCallingWorkflow` class available. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 1-10 Configuration ============= Then follows the loading of the configuration. The Snakemake ``configfile:`` statement loads the file ``config.yaml`` from the current working directory. When executing the pipeline step with ``cubi-snake``, this is either the directory the command is called in or the value of the ``--directory`` argument if given. In the last line of this chunk, the JSON pointers in the configuration are expanded, i.e., the ``"$ref"`` values are interpreted. This is used for implementing "overriding behaviour", i.e., including and extending (OOP-like) the project's main configuration file with the per--step instance one. This way, you can set the ``pipeline_step/name`` to ``somatic_variant_calling`` in the somatic variant calling pipeline instance directory and to ``ngs_mapping`` in the ngs mapping pipeline step instance directory, for example. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 13-18 Local Rules / Rule ``all`` ========================== In the next chunk, the rules that are to be executed locally and not generate any cluster jobs are defined. Then, the ``all`` rule is defined to obtain the list of files to generate by default using the :py:meth:`~snappy_pipeline.workflows.somatic_variant_calling.SomaticVariantCallingWorkflow.get_result_files` method of the :py:meth:`~snappy_pipeline.workflows.somatic_variant_calling.SomaticVariantCallingWorkflow` class. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 25-32 House-Keeping Rules =================== Next follow the "house-keeping" rules that do not perform any real work. In this case, the ``somatic_variant_calling_link_out_run`` rule performs the linking from the ``work/`` directory into the ``output/`` directory. Note that the rule names are generated by concatenating the step name (here ``somatic_variant_calling``), the part of the pipeline step (here ``link_out``), followed by the action to be performed (here ``run`` as there is no other action for ``link_out``). .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 34-42 Rule for MuTect =============== Next comes the rule for running mutect. Note that both for ``input:`` and ``output:``, ``dict`` values will be passed. These should be unpacked (similar to Python ``**kwargs`` unpacking). As we are using an **input function** (i.e., a function object that accepts a wildcards argument), we have to use the recently introduced Snakemake ``unpack`` directive. This allows for lazy unpacking after the input function has been called with the wildcards argument. For the output files, no wildcards are required as only strings with placeholders are returned. Thus, the dict with key/value pairs of the named output files is to be unpacked directly using two asterisks (``**``). .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 44-53 Rule for Scalpel ================ The rule for Scalpel looks similar. However, here the name of the normal library is required (as it is not part of the wildcards in contrast to the somatic library). The way this is implemented here is to introduce a parameter called ``normal_library_name``. The attribute ``get_normal_lib_name`` of the scalpel :py:class:`~snappy_pipeline.workflows.abstract.BaseStepPart` sub class object is passed in here (which is actually a function). On execution of the rule, the function will be called with the wildcards object as the parameter. It will then lookup the name of the normal library for the matched tumor NGS library and return it. This value is then available as ``params.normal_library_name``. Also note that for scalpel, the call is not generated directly by the :py:class:`~snappy_pipeline.workflows.abstract.BaseStepPart` sub class, but a `Snakemake wrapper `_ is used instead. This wrapper is located in the directory ``../wrappers/scalpel/run``, relative to the ``somatic_variant_calling`` Snakefile. The method :py:meth:`~snappy_pipeline.workflows.somatic_variant_calling.SomaticVariantCallingWorkflow.wrapper_path` builds the correct path relative to the ``wrappers`` directory in the ``snappy_pipeline`` directory and its return value is passed to the ``wrapper:`` section. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/Snakefile :language: python :lines: 55-63 ---------- The Module ---------- .. NOTE: the lines here have to be updated accordingly to changes in the somatic variant calling module Module Documentation ==================== The file starts with the module-level documentation. Only the first four lines are shown below. This module-level documentation is also included into the user documentation, e.g., the one for the somatic variant calling module is included at :ref:`step_somatic_variant_calling`. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 1-4 Imports ======= Then follow the necessary imports for the module. Note that the classes of the steps that are used for the input are also imported (this will be important below). .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 81-94 Constants ========= The imports are followed by constant definitions. In the case of the somatic variant calling methods, the different tools generate a common set of core files, here VCF files with TBI indices and MD5 files for both. Thus, it makes sense to store the extensions (and names for named input and output file lists) in module-level constants. Note the use of tuples over lists for marking this datas explicitely as immutable. Further, the ``DEFAULT_CONFIG`` constant is defined with default configuration in YAML format. This is also displayed in the user configuration so the users know where configuration settings are available. Required configuration without any defaults should be set to ``null`` (or ``[]``/``{}`` for empty lists/dicts) and marked with a ``# REQUIRED`` comment. The different values are to be documented with YAML comments. This default configuration will be loaded when initializing the :py:class:`~snappy_pipeline.workflows.abstract.BaseStep` sub class object and then overridden with the project- and pipeline step instance--wide configuration. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 96-123 The BaseStep Sub Class ====================== Let's jump towards the end of the file. Here is the :py:class:`~snappy_pipeline.workflows.abstract.BaseStep` sub class :py:class:`~snappy_pipeline.workflows.somatic_variant_calling.SomaticVariantCallingWorkflow`. Each sub class has to configure the ``name`` and ``sheet_shortcut_class`` class members. They will be used for identifying the step name and the BioMed Sheet sheet shortcut class. The somatic variant calling step sets these to ``"somatic_variant_calling"`` and the ``CancerCaseSheet`` class. The static method ``default_config_yaml()`` must be overridden in each :py:class:`~snappy_pipeline.workflows.abstract.BaseStep` sub class. Each of these functions will have the same content but it is important for the scope of accessing ``DEFAULT_CONFIG`` in the current module. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 265-276 The constructor of the class calls the super class' constructor with the arguments from the ``Snakefile``. It is very important to note that it also gets an iterable (here a one-element tuple) of the :py:class:`~snappy_pipeline.workflows.abstract.BaseStep` sub classes that provide input for this step (here only ``NGSMappingWorkflow`` imported at the top). This information is required for making the default configuration of these steps available. The constructor then proceeds to register the sub step classes that are used to implement the actual behaviour of the pipeline step. Here, it is for running MuTect, Scalpel, and linking out the somatic variant call VCF files from ``work/`` into ``output/``. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 278-287 The method ``get_result_files()`` returns a list of result files of this pipeline step. For this, it uses the ``_yield_result_files_()`` helper method and generates path in the ``output/`` folder. Snakemake knows that the link out rule will create these files but need corresponding files in ``work/`` for this. Through this mechanism, the individual tools' rules will be triggered. Note the use of :py:func:`cubi.utils.listify` decorator that converts a generator (as created by using ``yield`` in the function) to a function returning a list with the yielded objects in the ordere that they are yielded. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 290-313 Finally, the ``check_config()`` implementation ensures that the path to the NGS mapping step is configured for the somatic variant calling step. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 317-321 Module-Level BaseStepPart Sub Class =================================== Now, we move up towards the top of the file again. The ``SomaticVariantCallingStepPart`` class is the base class for the somatic variant calling implementations. The constructor builds a template string for generating result/output paths. It then builds the member ``cancer_ngs_library_to_sample_pair`` with a mapping from tumor DNA NGS library name to the BioMed Sheets ``CancerSamplePair`` object that contains information about both the tumor and normal sample. Note the use of ``OrderedDict`` to keep the order from the sample sheet definition. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 127-143 The implementation of ``get_input_files()`` returns an input function that given the wildcards returns a dict with paths to the normal and tumor libraries' aligned BAM file from the sub workflow ``ngs_mapping``. Note that the input function returns the actual path without any wildcards. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 145-164 The method ``get_normal_lib_name()`` returns the name of the matched normal NGS library for the given tumor NGS library name. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 166-169 The implementation of ``get_output_files()`` returns a dict with named output files for Snakemake. Note that this function returns named output files with wildcard placeholders. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 171-177 Finally, the method ``get_log_file()`` returns the path to the log file to create by Snakemake. This is available as ``{log}`` in shell commands and as ``{snakemake.log}`` in Snakemake wrappers. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 179-181 MuTect BaseStepPart Sub Class ============================= First, the class defines the class attribute ``name`` and sets it to ``'mutect'``. This is used by the super class and also by ``BaseStep`` in the places where the name of the implementation is needed. The ``check_config()`` implementation ensures that the necessary MuTect-specific configuration has been set. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 184-200 The function ``get_shell_cmd()`` generates the shell command to the CUBI wrapper (not Snakemake wrapper ;) to the MuTect tool. Here, the parallel CUBI wrapper for MuTect is used with appropriate configuration. Note the direct use of configuration and that no complex string operations are required for building the call to MuTect. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 202-227 Finally, the method ``update_cluster_config()`` takes the Snakemake ``cluster`` configuration and updates it appropriately. The three settings ``cluster.h_vmem``, ``cluster.h_rt``, and ``cluster.pe`` are used in the generated Snakemake command line generated by the ``cubi-snake`` tool appropriately for SGE. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 229-234 Scalpel BaseStepPart Sub Class ============================== The integration for Scalpel is even simpler as Snakemake wrappers can be used and there is no need for the ``get_shell_cmd()`` function. The class sets the class attribute ``name`` to ``'scalpel'``. Then, the ``check_config()`` implementation ensure the presence of the required configuration. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 237-247 Output file generation is similarly easy as for the MuTect part. However, the Scalpel working directory is tar-gzed in case the users wants to have access to the intermediate results and query the built MuTect database for variants with different coverages. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 249-257 The ``update_cluster_config()`` method's implementation is also very simple. .. literalinclude:: ../snappy_pipeline/workflows/somatic_variant_calling/__init__.py :language: python :lines: 259-264