Tutorial 5: Annotating VCFs with a Pre-Trained Model

This tutorial shows how to take a MuTopia model that has already been trained and use it to assign signature components to every mutation in your own VCF file. You do not need to have completed Tutorials 1–4 (building a G-Tensor, training a model) — we release a set of ready-to-use pre-trained models, and this walkthrough is designed to be run against them directly.

By the end you will have:

  • Downloaded a pre-trained model for a tumor type of interest

  • Run topo-model setup to pre-compute locus-level predictions

  • Run mutopia-sbs annotate-vcf against an example liver-cancer VCF

  • Produced an output VCF with per-mutation component posteriors and per-sample component fractions in its header

How the assignment works

Assignment happens in two phases:

  1. Refitting (E-step) — MuTopia iteratively estimates the per-component mutation counts \(N_k\) for the sample, taking the pre-trained model’s local (locus-level) predictions as prior information. This is similar to standard NMF refitting, but constrained by the topographic prior learned during training.

  2. Per-mutation posteriors — Given \(N_k\), the probability that each individual mutation was generated by component k is computed as \(\phi_{ki} \propto N_k \cdot p(\text{mutation}_i \mid k)\). The log posteriors are written to the output VCF as INFO fields.
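The two phases above can be sketched as a small EM-style loop. This is a toy illustration with made-up names and random stand-in data, not MuTopia's actual implementation: p[k, i] stands in for the model's per-mutation likelihoods, and the topographic prior is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not MuTopia's real data structures):
# p[k, i] = p(mutation_i | component k), one normalized row per component.
K, M = 4, 1000                            # components, mutations
p = rng.dirichlet(np.ones(M), size=K)     # each row sums to 1

N = np.full(K, M / K)                     # initial per-component counts N_k

for _ in range(50):                       # refit: alternate the two phases
    # Phase 2: per-mutation posteriors phi[k, i] ∝ N_k * p(mutation_i | k),
    # normalized over components for each mutation
    phi = N[:, None] * p
    phi /= phi.sum(axis=0, keepdims=True)
    # Phase 1: update the counts, N_k = sum_i phi_{ki}
    N = phi.sum(axis=1)
```

After convergence, each column of phi sums to 1 and N still sums to the total number of mutations; the annotator writes log(phi) to the output VCF.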

Prerequisites

The following tools must be on your PATH:

  • bcftools

  • bedtools

  • mutopia (the MuTopia package, installed into the active conda environment)

You do not need the full tutorial_data.tar.gz bundle for this tutorial. Every file you need can be downloaded individually from the MuTopia Zenodo record — see the next section.

Step 1: Download a pre-trained model

MuTopia ships pre-trained models for several tumor types on the Zenodo record. For each tumor type <tumor_type> there are three files at the record root:

  • <tumor_type>.nc — the reference G-Tensor the model was trained on (provides locus coordinates, genomic features, and the trinucleotide context for every 10 kb window in hg38).

  • <tumor_type>.nc.regions.bed — the sidecar BED file that lists the genomic coordinates of each locus in the G-Tensor. MuTopia expects it to live next to the .nc file.

  • <tumor_type>.model.pkl — the trained topographic model itself.

Pick one that matches your samples. For this walkthrough we use the liver model, which pairs naturally with the example VCF in Step 3.

TUMOR=Liver-HCC
ZENODO=https://zenodo.org/records/18803136/files

wget -O ${TUMOR}.nc             ${ZENODO}/${TUMOR}.nc
wget -O ${TUMOR}.nc.regions.bed ${ZENODO}/${TUMOR}.nc.regions.bed
wget -O ${TUMOR}.model.pkl      ${ZENODO}/${TUMOR}.model.pkl

Note

Every G-Tensor has a sidecar <name>.nc.regions.bed file that records the genomic coordinates of each locus. MuTopia resolves it by path relative to the .nc file, so keep both files next to each other.

Note

Every pre-trained model is fit against hg38. If your VCF is in GRCh37/hg19, lift it over first (e.g. with CrossMap or picard LiftoverVcf).

Step 2: Initialize the dataset (one-time setup)

Before annotating any sample, run topo-model setup once to pre-compute the model’s locus-level predictions (local variables and normalizers) and bake them into a copy of the G-Tensor. The resulting *_setup.nc file can be reused across many VCF samples, and annotation is much faster against it than against the raw dataset.

topo-model setup \
    Liver-HCC.model.pkl \
    Liver-HCC.nc \
    Liver-HCC_setup.nc \
    -@ 4

Step 3: Annotate a VCF

To make this tutorial fully reproducible without requiring you to bring your own sample, we also host a single example VCF:

  • CHC197.sample.hg38.vcf.gz — whole-genome somatic SNVs from a LICA-FR liver cancer sample (PCAWG donor CHC197), lifted to hg38. Roughly 12k SNVs, ~260 KB.

You will also need an hg38 reference FASTA. MuTopia does not ship one — point -fa at whichever copy you already have (e.g. the one you use with bcftools/samtools). The FASTA must be indexed (hg38.fa.fai next to it); run samtools faidx hg38.fa once if it isn’t.

Download the example VCF and run the annotator:

wget -O CHC197.sample.hg38.vcf.gz \
    https://github.com/sigscape/MuTopia/releases/download/v1.0.5/CHC197.sample.hg38.vcf.gz

mutopia-sbs annotate-vcf \
    Liver-HCC.model.pkl \
    Liver-HCC_setup.nc \
    CHC197.sample.hg38.vcf.gz \
    -fa /path/to/hg38.fa \
    -w VAF \
    --no-pass-only \
    --no-cluster \
    -o CHC197_annotated.vcf

A few notes on the flags:

  • -fa supplies the hg38 reference FASTA so MuTopia can resolve the trinucleotide context of each mutation.

  • -w VAF tells MuTopia to read the variant allele fraction from the VAF INFO field and use it as a per-mutation weight. (The example VCF already carries VAF=... on every record; for your own data, pick whichever INFO tag your caller emits.)

  • --no-pass-only keeps every record in the VCF, not just those with FILTER=PASS. The example VCF has an empty FILTER column on every row, so without this flag nothing would make it through.

  • --no-cluster skips MuTopia’s kataegis clustering pre-pass. You can drop it if you also supply a background mutation-rate bedgraph via -m:

mutopia-sbs annotate-vcf \
    Liver-HCC.model.pkl \
    Liver-HCC_setup.nc \
    CHC197.sample.hg38.vcf.gz \
    -fa /path/to/hg38.fa \
    -w VAF \
    --no-pass-only \
    -m mutation-rate.hg38.bedgraph.gz \
    -o CHC197_annotated.vcf

While running, progress messages are printed to stderr:

Loading model ...
Dataset has 279694 regions, 279694 after coord subsetting.
Ingesting mutations ...
Ingested 12309 mutations.
Built sparse sample, 12309 nonzero entries.
Loading dataset ...
Whitelist applied: 279694/279694 regions have nonzero exposure.
Setting up dataset (this may take a while) ...
Computing per-mutation posteriors ...
Estimated component fractions:
  SBS1    0.3421
  SBS5    0.2876
  SBS40   0.1503
  SBS22   0.0984
  ...
Annotating VCF ...

Step 4: Reading the output

The estimated per-component mutation counts \(N_k\) (from which the component fractions printed to stderr are derived) are written to the VCF header:

##mutopia_Nk=<Description="Estimated per-component mutation counts",SBS1=4210.5,SBS5=3540.7,...>

Each variant record receives one INFO field per component:

chr1  3122496  .  G  A  .  PASS  logp_SBS1=-0.412;logp_SBS5=-1.893;logp_SBS40=-3.114;...

The logp_<component> values are log posterior probabilities normalized over all components, so the un-logged posteriors sum to 1 (equivalently, their log-sum-exp is 0). A value close to 0 means the mutation is very likely assigned to that component; a strongly negative value means it is unlikely.
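As a quick sanity check, the logp_ values on a record can be un-logged back into posteriors. The numbers below are copied from the illustrative record above; since only three of the components are shown there, these three alone sum to less than 1 (the full set would sum to exactly 1):

```python
import math

# logp_<component> values from the example record above (truncated set)
logp = {"SBS1": -0.412, "SBS5": -1.893, "SBS40": -3.114}

# Un-log into posterior probabilities
post = {k: math.exp(v) for k, v in logp.items()}

# The least-negative log posterior is the most probable component
best = max(post, key=post.get)
print(best, round(post[best], 3))
```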

To retrieve the most probable component for each mutation:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' CHC197_annotated.vcf \
  | python3 - << 'EOF'
import sys

for line in sys.stdin:
    chrom, pos, ref, alt, info = line.rstrip().split('\t')
    # Parse INFO into a key -> value dict; split on the first '=' only,
    # so values that themselves contain '=' do not break the parse
    fields = dict(f.split('=', 1) for f in info.split(';') if '=' in f)
    # Keep only the per-component log posteriors
    logps = {k: float(v) for k, v in fields.items() if k.startswith('logp_')}
    # The highest (least negative) log posterior wins
    best = max(logps, key=logps.get).replace('logp_', '')
    print(chrom, pos, ref, alt, best)
EOF

Using a whitelist for panel and exome data

When mutations come from targeted sequencing (exome capture, gene panel), the model must know which genomic regions were actually assayed. Without this information, the model assumes the entire genome was observed, which biases signature assignments because the model expects many more background mutations than were detected.

Provide a BED file of assayed regions with --white-list:

mutopia-sbs annotate-vcf \
    Liver-HCC.model.pkl \
    Liver-HCC_setup.nc \
    sample.vcf.gz \
    -fa /path/to/hg38.fa \
    -w VAF \
    --no-pass-only \
    --no-cluster \
    --white-list capture_targets.bed \
    -o sample_annotated.vcf

The whitelist adjusts two things:

  1. Exposure correction — Per-region exposure terms are scaled by the fraction of each 10 kb genomic window covered by the whitelist intervals, so the model does not penalize the absence of mutations in uncaptured territory.

  2. Mutation filtering — Mutations outside the whitelisted regions are excluded from the analysis, discarding off-target artefacts.
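The exposure correction in point 1 boils down to an interval-overlap computation per window. A minimal sketch with hypothetical coordinates (not MuTopia's implementation; covered_fraction is an illustrative helper):

```python
def covered_fraction(window, intervals):
    """Fraction of a (start, end) window covered by sorted,
    non-overlapping (start, end) whitelist intervals."""
    w_start, w_end = window
    covered = 0
    for s, e in intervals:
        # Length of the overlap between this interval and the window
        covered += max(0, min(e, w_end) - max(s, w_start))
    return covered / (w_end - w_start)

# A 10 kb window containing two capture targets (500 bp + 1000 bp)
window = (3_120_000, 3_130_000)
targets = [(3_121_000, 3_121_500), (3_125_000, 3_126_000)]
frac = covered_fraction(window, targets)
print(frac)  # 0.15
```

A window with frac = 0.15 has its exposure term scaled to 15% of its whole-genome value, so the model expects correspondingly fewer mutations there.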

Note

The whitelist BED file should use the same chromosome naming convention as your VCF and reference genome (chr1 vs 1). Use the -chr flag to add a prefix if needed.

Multi-sample VCFs

If your VCF contains calls for multiple samples, specify which one to annotate:

mutopia-sbs annotate-vcf \
    Liver-HCC.model.pkl \
    Liver-HCC_setup.nc \
    cohort.vcf.gz \
    -fa /path/to/hg38.fa \
    -w VAF \
    --no-pass-only \
    --no-cluster \
    -name SAMPLE_ID \
    -o SAMPLE_ID_annotated.vcf

Additional options

-fa, --fasta FILE               Reference genome FASTA (required)
-w, --weight-tag TAG            INFO tag to use as per-mutation weight (e.g. "VAF")
-m, --mutation-rate-file FILE   Background mutation rate bedgraph (for clustering)
--no-cluster                    Skip kataegis mutation clustering
--no-pass-only                  Include non-PASS variants
-name, --sample-name NAME       Sample name in multi-sample VCF
-chr, --chr-prefix PREFIX       Chromosome prefix to add (e.g. "chr")
-@, --threads N                 Parallel threads (default: 1)
--skip-sort                     Skip VCF sorting (VCF must be lexicographically sorted)
-o, --output FILE               Output VCF path (default: stdout)

Run mutopia-sbs annotate-vcf --help for the full option reference.