Somatic Variant Annotation

Implementation of the somatic_variant_annotation step

The somatic_variant_annotation step takes as input the results of the somatic_variant_calling step (bgzip-ed and indexed VCF files) and annotates the somatic variants. The results are annotated versions of the somatic variant VCF files (again bgzip-ed and indexed).

Step Input

The somatic variant annotation step uses Snakemake sub workflows to access the results of the somatic_variant_calling step.

The main assumption is that each VCF file contains the two matched normal and tumor samples.
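
The two-sample assumption above can be checked before annotation. The following is a minimal sketch, not part of the pipeline itself: the function names are hypothetical, and it only inspects the #CHROM header line, relying on the fact that bgzip output can be read with Python's standard gzip module.

```python
import gzip


def vcf_sample_names(path):
    """Return the sample names from the #CHROM header line of a bgzip-ed VCF."""
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # The first nine columns (CHROM .. FORMAT) are fixed; samples follow.
                return fields[9:]
            if not line.startswith("#"):
                # Reached the first record without seeing a #CHROM line.
                break
    raise ValueError(f"no #CHROM header line found in {path}")


def check_matched_pair(path):
    """Raise if the VCF does not contain exactly two samples (hypothetical helper)."""
    samples = vcf_sample_names(path)
    if len(samples) != 2:
        raise ValueError(
            f"expected 2 samples (normal and tumor) in {path}, found {len(samples)}"
        )
    return samples
```

Calling check_matched_pair on each input file raises an error early if a VCF holds fewer or more than two samples. It does not verify which sample is the normal and which is the tumor.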

Step Output

Users can annotate all genes and transcripts overlapping the variant locus, or select one representative gene and transcript for annotation. In the latter case, the output VCF file contains only one annotation per variant; in the former, a single variant may carry more than 100 annotations.

The order of the criteria used to choose the representative annotation is user-configurable. The default order is:

  1. biotype: protein-coding genes come first; the order among other gene biotypes is unclear

  2. mane: the MANE transcript is selected before other transcripts

  3. appris: the APPRIS principal isoform is selected before alternates

  4. tsl: Transcript Support Level values in increasing order

  5. ccds: Transcripts with CCDS ids are selected before those without

  6. canonical: ENSEMBL canonical transcripts are selected before the others

  7. rank: VEP internal ranking is used

  8. length: longer transcripts are preferred to shorter ones

This order is intended to suit cBioPortal export, as well-defined transcripts from protein-coding genes are selected when possible. However, it is recommended to check the full annotation for variants in or near disease-relevant genes.
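
The default order amounts to a lexicographic comparison: a later criterion only matters when all earlier ones tie. The sketch below illustrates that idea on hypothetical annotation records. The field names, the treatment of a missing TSL as worst, and the reading of "rank" as lower-is-better are assumptions made for illustration; the actual selection is done by the annotator.

```python
def pick_representative(annotations):
    """Pick one annotation per variant using the default criteria order.

    Each annotation is a dict with hypothetical keys; smaller key tuples win.
    """

    def sort_key(ann):
        return (
            0 if ann["biotype"] == "protein_coding" else 1,  # 1. biotype
            0 if ann["mane"] else 1,                         # 2. mane
            0 if ann["appris_principal"] else 1,             # 3. appris
            # 4. tsl: increasing order; a missing value sorts last (assumption)
            ann["tsl"] if ann["tsl"] is not None else float("inf"),
            0 if ann["ccds"] else 1,                         # 5. ccds
            0 if ann["canonical"] else 1,                    # 6. canonical
            ann["rank"],                                     # 7. rank (assumed lower is better)
            -ann["length"],                                  # 8. length: longer first
        )

    return min(annotations, key=sort_key)


# Hypothetical annotations for a single variant.
annotations = [
    {"transcript": "tx_A", "biotype": "protein_coding", "mane": False,
     "appris_principal": True, "tsl": 1, "ccds": True, "canonical": False,
     "rank": 2, "length": 2000},
    {"transcript": "tx_B", "biotype": "protein_coding", "mane": True,
     "appris_principal": True, "tsl": 2, "ccds": True, "canonical": True,
     "rank": 3, "length": 1800},
    {"transcript": "tx_C", "biotype": "lncRNA", "mane": False,
     "appris_principal": False, "tsl": None, "ccds": False, "canonical": False,
     "rank": 1, "length": 5000},
]

print(pick_representative(annotations)["transcript"])  # prints "tx_B"
```

Here tx_B wins even though tx_A has the better TSL, because the MANE criterion is evaluated first. Reordering the tuple entries corresponds to changing the criteria order.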

All annotators generate a VCF file with one annotation per variant, and some annotators (only ENSEMBL’s Variant Effect Predictor in the current implementation) can also produce a second output containing all annotations. The single-annotation VCF file is named <mapper>.<caller>.<annotator>.vcf.gz and the full-annotation output is named <mapper>.<caller>.<annotator>.full.vcf.gz.
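
As an illustration of the naming scheme, the following sketch builds both file names from the three components. The function name and the example values ("bwa", "mutect2", "vep") are hypothetical and chosen only to show the pattern.

```python
def annotation_output_names(mapper, caller, annotator, has_full=False):
    """Build output file names following <mapper>.<caller>.<annotator>[.full].vcf.gz."""
    prefix = f"{mapper}.{caller}.{annotator}"
    names = {"single": f"{prefix}.vcf.gz"}
    if has_full:
        # Only annotators that support a full-annotation output produce this file.
        names["full"] = f"{prefix}.full.vcf.gz"
    return names


print(annotation_output_names("bwa", "mutect2", "vep", has_full=True))
# prints {'single': 'bwa.mutect2.vep.vcf.gz', 'full': 'bwa.mutect2.vep.full.vcf.gz'}
```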

Global Configuration

TODO

Default Configuration

The default configuration is as follows.

# Default configuration variant_annotation
step_config:
  variant_annotation:
    path_variant_calling: ../variant_calling
    tools:
      - vep
    vep:
      # We will always run VEP in cache mode.  You have to provide the directory to the
      # cache to use (VEP would be ``~/.vep``).
      cache_dir: null # OPTIONAL
      # The cache version to use.  gnomAD v2 used 85, gnomAD v3.1 uses 101.
      cache_version: "85"
      # The assembly to use.  gnomAD v2 used "GRCh37", gnomAD v3.1 uses "GRCh38".
      assembly: "GRCh37"
      # The flag selecting the transcripts.  One of "gencode_basic", "refseq", and "merged".
      tx_flag: "gencode_basic"
      # Number of threads to use with forking, set to 0 to disable forking.
      num_threads: 16
      # Additional flags.
      more_flags: "--af_gnomade --af_gnomadg"
      # The --buffer_size parameter
      buffer_size: 100000
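
To make the role of each setting concrete, the sketch below shows one plausible way the vep settings could translate into VEP command-line arguments. The mapping is an assumption for illustration and the function name is hypothetical; the wrapper actually used by the pipeline may build the command differently. Only standard VEP options are used (--cache, --dir_cache, --cache_version, --assembly, --gencode_basic, --refseq, --merged, --fork, --buffer_size).

```python
import shlex


def vep_args(cfg):
    """Translate the `vep` settings into a VEP argument list (assumed mapping)."""
    args = ["vep", "--cache"]  # the configuration states VEP always runs in cache mode
    if cfg.get("cache_dir"):
        # Without --dir_cache, VEP falls back to its own default cache location.
        args += ["--dir_cache", cfg["cache_dir"]]
    args += ["--cache_version", str(cfg["cache_version"])]
    args += ["--assembly", cfg["assembly"]]

    tx_flag = cfg["tx_flag"]
    if tx_flag not in {"gencode_basic", "refseq", "merged"}:
        raise ValueError(f"unsupported tx_flag: {tx_flag!r}")
    args.append(f"--{tx_flag}")

    if cfg["num_threads"] > 0:
        # num_threads: 0 disables forking, so --fork is omitted in that case.
        args += ["--fork", str(cfg["num_threads"])]
    args += ["--buffer_size", str(cfg["buffer_size"])]
    # more_flags is passed through verbatim, split like a shell would split it.
    args += shlex.split(cfg.get("more_flags") or "")
    return args


# The default values shown in the configuration above.
defaults = {
    "cache_dir": None,
    "cache_version": "85",
    "assembly": "GRCh37",
    "tx_flag": "gencode_basic",
    "num_threads": 16,
    "more_flags": "--af_gnomade --af_gnomadg",
    "buffer_size": 100000,
}
print(" ".join(vep_args(defaults)))
```

Input and output options are deliberately left out, since the configuration does not describe them.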

Reports

Currently, no reports are generated.