Generic Pipeline Step Description

Generally, each pipeline step takes some input, processes it in a work directory, and then creates an output directory with the pipeline step’s result. Each pipeline step is implemented as a Snakemake workflow and a step instance corresponds to a Snakemake working directory on the file system.

File System Layout

The overall layout for a pipeline step instance is as follows:

working_dir_name/
+-- [input/]
+-- work/
+-- output/
`-- config.yaml
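
The layout above can be checked programmatically. The following is a minimal sketch, not part of the pipeline itself: the helper name check_step_instance_layout is hypothetical, and it assumes the step instance has already been run so that work/ and output/ exist.

```python
from pathlib import Path

# work/ and output/ are expected after a run; input/ is optional.
REQUIRED_DIRS = ("work", "output")
CONFIG_NAME = "config.yaml"


def check_step_instance_layout(step_dir):
    """Return a list of layout problems for a step instance directory.

    Hypothetical helper for illustration; an empty list means the
    directory matches the layout described above.
    """
    step_dir = Path(step_dir)
    problems = []
    for name in REQUIRED_DIRS:
        if not (step_dir / name).is_dir():
            problems.append(f"missing directory: {name}/")
    if not (step_dir / CONFIG_NAME).is_file():
        problems.append(f"missing file: {CONFIG_NAME}")
    return problems
```

Because input/ is optional, the sketch deliberately does not report its absence.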

Directory input/

An optional input directory. This directory is only created if files that are not generated by another workflow have to be linked in. For example, the ngs_mapping pipeline step links the input FASTQ files (variable data) into the input/ directory.

Note that static data (such as the reference, read mapper indices, and annotation; everything that can be statically configured) is not linked into the input/ directory. Also, a step such as variant_calling does not need an input/ directory at all, as it only works on the read alignments generated by the ngs_mapping step.

Directory work/

This is the working directory that contains all results of the pipeline, including logs as well as intermediary and final results. Intermediary results should be marked by the Snakemake temp() directive but there is no guarantee that temporary files are removed after the pipeline step finishes. Also note that you as a user have to consider the directory structure and file names in work/ as unstable.

In short: in work/, the pipeline step authors can do whatever they want, including changing it between minor versions.

Directory output/

This is the “public” output directory. It contains a stable directory structure with stable names. The output/ directory contains no files but rather symlinks into the work/ directory.

By convention, the directories and file names should mirror the ones in work/ (and thus form a subset) for simplicity. However, in order to keep semantic versioning, this convention might be broken to keep paths in the output/ directory stable when something in work/ changes.
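
The relationship between work/ and output/ can be sketched as follows. This is only an illustration under stated assumptions: the function name publish is hypothetical, and using relative symlink targets is an assumption made here so that the step directory stays relocatable. The text above only states that output/ contains symlinks into work/.

```python
import os
from pathlib import Path


def publish(step_dir, rel_paths):
    """Symlink selected files from work/ into output/, mirroring their paths.

    Hypothetical helper: `rel_paths` are paths relative to work/ that
    should become part of the stable, public output/ tree.
    """
    step_dir = Path(step_dir)
    for rel in rel_paths:
        target = step_dir / "work" / rel
        link = step_dir / "output" / rel
        link.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: link relative to the link's own directory.
        rel_target = os.path.relpath(target, start=link.parent)
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        link.symlink_to(rel_target)
```

If a path in work/ changes in a later version, only the link target would need to change; the name under output/ can stay the same.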

Step Instance Configuration config.yaml

Each step instance must have a configuration file config.yaml. The file contains a YAML- or JSON-formatted dictionary structure and typically looks as follows:

pipeline_step:
  name: ngs_mapping
  version: 1

$ref: 'file://../.snappy/config.yaml'

Consider the second part first. Here, JSON Reference notation (the $ref key) is used to load the referenced file (../.snappy/config.yaml in the example above) at the root of the YAML file. This file contains the basic configuration for all pipeline step instances in a project. The configuration file config.yaml in the pipeline step instance directory can then override settings as needed. When config.yaml is loaded, the settings of the including file and the included file are merged, with the settings of the including file overriding those of the included file.
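
The override behaviour can be illustrated with a small sketch. It is not the actual loader: merge_config is a hypothetical name, the recursive handling of nested dictionaries is an assumption, and the settings key with its threads and mapper entries is invented for the example.

```python
def merge_config(included, including):
    """Merge two configuration dicts; values from `including` win.

    Assumption: nested dictionaries are merged key by key, while any
    other value in `including` replaces the one from `included`.
    """
    merged = dict(included)
    for key, value in including.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical project-wide configuration (the included file).
project = {"settings": {"threads": 4, "mapper": "bwa"}}
# Hypothetical step instance configuration (the including file).
instance = {
    "pipeline_step": {"name": "ngs_mapping", "version": 1},
    "settings": {"threads": 16},
}

merged = merge_config(project, instance)
print(merged["settings"])  # {'threads': 16, 'mapper': 'bwa'}
```

Only threads is overridden by the step instance; mapper is inherited from the project-wide file.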

Now consider the first part. It declares that the pipeline step to be executed is named ngs_mapping and that the configuration was written for version 1 of that step. The version allows the pipeline step to check for incompatibilities between its implementation version and the version used when the step instance configuration was written.
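
Such a check could look roughly like the sketch below. The names StepConfigMismatch and check_pipeline_step are hypothetical, and treating only an exact version match as compatible is an assumption; the actual compatibility rule is not specified here.

```python
class StepConfigMismatch(Exception):
    """Raised when a configuration does not fit the step implementation."""


def check_pipeline_step(config, expected_name, implemented_version):
    """Validate the pipeline_step block of a loaded configuration dict.

    Assumption: a configuration is compatible only if its version is
    equal to the version implemented by the pipeline step.
    """
    step = config.get("pipeline_step", {})
    name = step.get("name")
    version = step.get("version")
    if name != expected_name:
        raise StepConfigMismatch(
            f"configured for step {name!r}, but running {expected_name!r}"
        )
    if version != implemented_version:
        raise StepConfigMismatch(
            f"configuration written for version {version!r}, "
            f"but step implements version {implemented_version!r}"
        )
```

For the example configuration above, check_pipeline_step(config, "ngs_mapping", 1) would return without raising.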

Background Data Sets

These data sets are available for use as background data. The provided data can be sparser than for the primary data sets (e.g., only the NGS libraries for normal samples in an otherwise matched cancer/normal study).

Running cubi-snake in a directory does not automatically generate these files. Rather, they are only generated when they are used by a pipeline step such as somatic_targeted_cnv_calling.