Somatic Variant Annotation
Implementation of the somatic_variant_annotation step
The somatic_variant_annotation step takes as input the results of the somatic_variant_calling step (bgzip-ed and indexed VCF files) and annotates the somatic variants. The results are annotated versions of the somatic variant VCF files (again bgzip-ed and indexed).
Step Input
The somatic variant annotation step uses Snakemake sub workflows to access the results of the somatic_variant_calling step.
The main assumption is that each VCF file contains the two samples of a matched pair, normal and tumor.
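The two-sample assumption can be checked before annotation. The following is a minimal sketch, not part of the pipeline: the function names sample_names and check_matched_pair are hypothetical, and it relies only on the VCF convention that sample columns follow the FORMAT column in the #CHROM header line.

```python
import gzip


def sample_names(vcf_text_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_text_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
            return columns[9:]
        if not line.startswith("#"):
            break  # records started without a #CHROM line
    raise ValueError("no #CHROM header line found")


def check_matched_pair(path):
    """Raise if the bgzip-ed VCF at `path` does not hold exactly two samples."""
    # bgzip output is gzip-compatible, so the standard library can read it.
    with gzip.open(path, "rt") as handle:
        names = sample_names(handle)
    if len(names) != 2:
        raise ValueError(
            f"expected 2 samples (normal, tumor), found {len(names)}: {names}"
        )
    return names
```

The check only counts samples; it does not verify which column is the normal and which is the tumor.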
Step Output
Users can annotate all genes and transcripts overlapping the variant locus, or they can select one representative gene and transcript for annotation. In the latter case, the output VCF file contains only one annotation per variant; in the former case, there may be more than 100 annotations for a single variant.
The order of the features that drive the choice of the representative annotation is under user control. The default order is:
biotype: protein-coding genes come first; the order for other gene types is unclear
mane: the MANE transcript is selected before other transcripts
appris: the APPRIS principal isoform is selected before alternates
tsl: Transcript Support Level values in increasing order
ccds: transcripts with CCDS ids are selected before those without
canonical: ENSEMBL canonical transcripts are selected before the others
rank: VEP internal ranking is used
length: longer transcripts are preferred to shorter ones
This order is intended to suit cBioPortal export, because well-defined transcripts from protein-coding genes are selected when possible. However, it is recommended to check the full annotation for variants in or near disease-relevant genes.
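The ordered criteria act as successive tie-breakers: a later criterion matters only when all earlier ones are tied. The sketch below illustrates that idea with a lexicographic sort key. It is not VEP's implementation; the record fields (biotype, mane, appris_principal, tsl, ccds, canonical, rank, length) and the function pick_representative are hypothetical names chosen for the example.

```python
# Each function maps an annotation record to a sort key; smaller sorts first.
# The record fields are hypothetical and only illustrate the idea.
CRITERIA = {
    "biotype": lambda a: 0 if a.get("biotype") == "protein_coding" else 1,
    "mane": lambda a: 0 if a.get("mane") else 1,
    "appris": lambda a: 0 if a.get("appris_principal") else 1,
    "tsl": lambda a: a.get("tsl", 99),  # missing TSL sorts last
    "ccds": lambda a: 0 if a.get("ccds") else 1,
    "canonical": lambda a: 0 if a.get("canonical") else 1,
    "rank": lambda a: a.get("rank", 0),
    "length": lambda a: -a.get("length", 0),  # longer transcripts first
}

DEFAULT_ORDER = [
    "biotype", "mane", "appris", "tsl", "ccds", "canonical", "rank", "length",
]


def pick_representative(annotations, order=DEFAULT_ORDER):
    """Return the annotation that wins the lexicographic comparison."""
    return min(
        annotations,
        key=lambda a: tuple(CRITERIA[name](a) for name in order),
    )


annotations = [
    {"id": "tx_a", "biotype": "lncRNA", "length": 5000},
    {"id": "tx_b", "biotype": "protein_coding", "mane": False, "length": 3000},
    {"id": "tx_c", "biotype": "protein_coding", "mane": True, "length": 1200},
]
print(pick_representative(annotations)["id"])  # tx_c
```

With the default order, tx_c wins because it is protein coding and flagged as MANE, even though it is the shortest transcript. Passing order=["length"] would select tx_a instead, which shows how strongly the configured order shapes the result.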
All annotators generate a VCF file with one annotation per variant, and some annotators (only ENSEMBL's Variant Effect Predictor in the current implementation) can also produce a second output containing all annotations.
The single-annotation VCF file is named <mapper>.<caller>.<annotator>.vcf.gz and the full annotation output is named <mapper>.<caller>.<annotator>.full.vcf.gz.
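The naming pattern can be written as a small helper. This is a sketch for illustration only: the function annotation_output_names is hypothetical, and the mapper and caller names in the usage line ("bwa", "mutect2") are placeholder values, not a statement about which tools the pipeline supports.

```python
def annotation_output_names(mapper, caller, annotator, full=False):
    """Build the output file names for one mapper/caller/annotator combination.

    The single-annotation file is always listed; the full annotation
    file is added only when `full` is true.
    """
    base = f"{mapper}.{caller}.{annotator}"
    names = [f"{base}.vcf.gz"]
    if full:
        names.append(f"{base}.full.vcf.gz")
    return names


# Placeholder tool names, used only to show the pattern.
print(annotation_output_names("bwa", "mutect2", "vep", full=True))
```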
Global Configuration
TODO
Default Configuration
The default configuration is as follows.
step_config:
  variant_annotation:
    # Path to variant calling
    #path_variant_calling: ../variant_calling  # Examples: ../variant_calling
    #tools:  # Options: 'vep'
    #  - vep
    #vep:
    #
    #  # Defaults to $HOME/.vep Not a good idea on the cluster
    #  cache_dir: ''
    #  species: homo_sapiens
    #
    #  # The assembly to use. gnomAD v2 used "GRCh37", gnomAD v3.1 uses "GRCh38".
    #  assembly: GRCh37
    #
    #  # The cache version to use. gnomAD v2 used 85, gnomAD v3.1 uses 101.
    #  cache_version: '85'
    #
    #  # The flag selecting the transcripts. One of "gencode_basic", "refseq", and "merged".
    #  tx_flag: gencode_basic  # Options: 'gencode_basic', 'refseq', 'merged'
    #  pick_order:
    #    - biotype
    #    - mane
    #    - appris
    #    - tsl
    #    - ccds
    #    - canonical
    #    - rank
    #    - length
    #  num_threads: 16
    #  buffer_size: 100000
    #  output_options:
    #    - everything
    #  more_flags: --af_gnomade --af_gnomadg
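To make the role of the vep settings more concrete, the sketch below shows one plausible way they could be turned into VEP command-line arguments. This mapping is an assumption: the pipeline's actual wrapper may differ, and the function vep_arguments and the DEFAULTS dictionary are hypothetical names. It produces only the option list, not a complete VEP invocation (input, output and cache mode options are left out).

```python
import shlex

# Mirror of the default "vep" settings listed above.
DEFAULTS = {
    "cache_dir": "",
    "species": "homo_sapiens",
    "assembly": "GRCh37",
    "cache_version": "85",
    "tx_flag": "gencode_basic",
    "pick_order": [
        "biotype", "mane", "appris", "tsl", "ccds", "canonical", "rank", "length",
    ],
    "num_threads": 16,
    "buffer_size": 100000,
    "output_options": ["everything"],
    "more_flags": "--af_gnomade --af_gnomadg",
}


def vep_arguments(cfg):
    """Translate the settings into a list of VEP options (assumed mapping)."""
    args = [
        "--species", cfg["species"],
        "--assembly", cfg["assembly"],
        "--cache_version", str(cfg["cache_version"]),
        f"--{cfg['tx_flag']}",  # transcript set: gencode_basic, refseq or merged
        "--pick_order", ",".join(cfg["pick_order"]),
        "--fork", str(cfg["num_threads"]),
        "--buffer_size", str(cfg["buffer_size"]),
    ]
    if cfg.get("cache_dir"):
        # Only set the cache directory when one is configured.
        args += ["--dir_cache", cfg["cache_dir"]]
    args += [f"--{option}" for option in cfg.get("output_options", [])]
    # Extra flags are passed through unchanged.
    args += shlex.split(cfg.get("more_flags", ""))
    return args


print(" ".join(vep_arguments(DEFAULTS)))
```

Because cache_dir is empty in the defaults, no cache directory option is emitted and VEP would fall back to $HOME/.vep, which the configuration comment warns against on a cluster.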
Reports
Currently, no reports are generated.