Tutorial 5: Annotating VCFs with a Pre-Trained Model¶
This tutorial shows how to take a MuTopia model that has already been trained and use it to assign signature components to every mutation in your own VCF file. You do not need to have completed Tutorials 1–4 (building a G-Tensor, training a model) — we release a set of ready-to-use pre-trained models, and this walkthrough is designed to be run against them directly.
By the end you will have:
- Downloaded a pre-trained model for a tumor type of interest
- Run topo-model setup to pre-compute locus-level predictions
- Run mutopia-sbs annotate-vcf against an example liver-cancer VCF
- An output VCF with per-mutation component posteriors and per-sample component fractions in the header
How the assignment works¶
Assignment happens in two phases:
Refitting (E-step) — MuTopia iteratively estimates the per-component mutation counts \(N_k\) for the sample, taking the pre-trained model’s local (locus-level) predictions as prior information. This is similar to standard NMF refitting, but constrained by the topographic prior learned during training.
Per-mutation posteriors — Given \(N_k\), the probability that each individual mutation was generated by component k is computed as \(\phi_{ki} \propto N_k \cdot p(\text{mutation}_i \mid k)\). The log posteriors are written to the output VCF as INFO fields.
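The two phases amount to a small fixed-point (EM-style) loop: responsibilities are computed from the current counts, then the counts are re-estimated from the responsibilities. Below is a minimal, illustrative sketch in plain Python — it is not MuTopia's implementation (in particular it omits the topographic prior constraint), and the likelihood values are invented:

```python
def refit_and_assign(lik, n_iter=50):
    """EM-style refit (illustrative only, not MuTopia's actual code).
    lik[i][k] = p(mutation_i | component k).
    Returns per-component counts N_k and per-mutation posteriors phi."""
    n_mut, n_comp = len(lik), len(lik[0])
    Nk = [n_mut / n_comp] * n_comp              # uniform initialization
    for _ in range(n_iter):
        # E-step: phi_ki proportional to N_k * p(mutation_i | k), normalized over k
        phi = []
        for row in lik:
            w = [Nk[k] * row[k] for k in range(n_comp)]
            s = sum(w)
            phi.append([x / s for x in w])
        # M-step: expected mutation count per component
        Nk = [sum(p[k] for p in phi) for k in range(n_comp)]
    return Nk, phi

# toy example: 4 mutations, 2 components (likelihoods made up)
lik = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]
Nk, phi = refit_and_assign(lik)
```

Each row of `phi` sums to 1 (it is a posterior over components for that mutation), and `Nk` always sums to the total number of mutations.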
Prerequisites¶
The following tools must be on your PATH:
- bcftools
- bedtools
- mutopia (the MuTopia package, installed into the active conda environment)
You do not need the full tutorial_data.tar.gz bundle for this tutorial.
Every file you need can be downloaded individually from the MuTopia Zenodo
record — see the next section.
Step 1: Download a pre-trained model¶
MuTopia ships pre-trained models for several tumor types on the
Zenodo record. For each tumor type
<tumor_type> there are three files at the record root:
- <tumor_type>.nc — the reference G-Tensor the model was trained on (provides locus coordinates, genomic features, and the trinucleotide context for every 10 kb window in hg38).
- <tumor_type>.nc.regions.bed — the sidecar BED file that lists the genomic coordinates of each locus in the G-Tensor. MuTopia expects it to live next to the .nc file.
- <tumor_type>.model.pkl — the trained topographic model itself.
Pick one that matches your samples. For this walkthrough we use the liver model, which pairs naturally with the example VCF in Step 3.
TUMOR=Liver-HCC
ZENODO=https://zenodo.org/records/18803136/files
wget -O ${TUMOR}.nc ${ZENODO}/${TUMOR}.nc
wget -O ${TUMOR}.nc.regions.bed ${ZENODO}/${TUMOR}.nc.regions.bed
wget -O ${TUMOR}.model.pkl ${ZENODO}/${TUMOR}.model.pkl
Note
Every G-Tensor has a sidecar <name>.nc.regions.bed file that records
the genomic coordinates of each locus. MuTopia resolves it by path
relative to the .nc file, so keep both files next to each other.
Note
Every pre-trained model is fit against hg38. If your VCF is in
GRCh37/hg19, lift it over first (e.g. with CrossMap or
picard LiftoverVcf).
Step 2: Initialize the dataset (one-time setup)¶
Before annotating any sample, run topo-model setup once to pre-compute the
model’s locus-level predictions (local variables and normalizers) and bake
them into a copy of the G-Tensor. The resulting *_setup.nc file can be
reused across many VCF samples, and annotation is much faster against it than
against the raw dataset.
topo-model setup \
Liver-HCC.model.pkl \
Liver-HCC.nc \
Liver-HCC_setup.nc \
-@ 4
Step 3: Annotate a VCF¶
To make this tutorial fully reproducible without requiring you to bring your own sample, we also host a single example VCF:
CHC197.sample.hg38.vcf.gz — whole-genome somatic SNVs from a LICA-FR liver cancer sample (PCAWG donor CHC197), lifted to hg38. Roughly 12k PASS SNVs, ~260 KB.
You will also need an hg38 reference FASTA. MuTopia does not ship one — point
-fa at whichever copy you already have (e.g. the one you use with
bcftools/samtools). The FASTA must be indexed (hg38.fa.fai next to
it); run samtools faidx hg38.fa once if it isn’t.
Download the example VCF and run the annotator:
wget -O CHC197.sample.hg38.vcf.gz \
https://github.com/sigscape/MuTopia/releases/download/v1.0.5/CHC197.sample.hg38.vcf.gz
mutopia-sbs annotate-vcf \
Liver-HCC.model.pkl \
Liver-HCC_setup.nc \
CHC197.sample.hg38.vcf.gz \
-fa /path/to/hg38.fa \
-w VAF \
--no-pass-only \
--no-cluster \
-o CHC197_annotated.vcf
A few notes on the flags:
- -fa supplies the hg38 reference FASTA so MuTopia can resolve the trinucleotide context of each mutation.
- -w VAF tells MuTopia to read the variant allele fraction from the VAF INFO field and use it as a per-mutation weight. (The example VCF already carries VAF=... on every record; for your own data, pick whichever INFO tag your caller emits.)
- --no-pass-only keeps every record in the VCF, not just those with FILTER=PASS. The example VCF has an empty FILTER column on every row, so without this flag nothing would make it through.
- --no-cluster skips MuTopia's kataegis clustering pre-pass. You can drop it if you also supply a background mutation-rate bedgraph via -m:
mutopia-sbs annotate-vcf \
Liver-HCC.model.pkl \
Liver-HCC_setup.nc \
CHC197.sample.hg38.vcf.gz \
-fa /path/to/hg38.fa \
-w VAF \
--no-pass-only \
-m mutation-rate.hg38.bedgraph.gz \
-o CHC197_annotated.vcf
While running, progress messages are printed to stderr:
Loading model ...
Dataset has 279694 regions, 279694 after coord subsetting.
Ingesting mutations ...
Ingested 12309 mutations.
Built sparse sample, 12309 nonzero entries.
Loading dataset ...
Whitelist applied: 279694/279694 regions have nonzero exposure.
Setting up dataset (this may take a while) ...
Computing per-mutation posteriors ...
Estimated component fractions:
SBS1 0.3421
SBS5 0.2876
SBS40 0.1503
SBS22 0.0984
...
Annotating VCF ...
Step 4: Reading the output¶
The estimated component fractions are written to the VCF header:
##mutopia_Nk=<Description="Estimated per-component mutation counts",SBS1=4210.5,SBS5=3540.7,...>
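If you want fractions rather than raw counts, the header line can be parsed directly. A minimal sketch, assuming the key=value layout shown in the example above (the string here is a shortened stand-in, not real output):

```python
# Parse a mutopia_Nk header line into per-component fractions.
# Header layout assumed from the example above; illustrative only.
line = '##mutopia_Nk=<Description="Estimated per-component mutation counts",SBS1=4210.5,SBS5=3540.7>'

body = line.split("=<", 1)[1].rstrip(">")
counts = {}
for part in body.split(","):
    key, _, value = part.partition("=")
    if not key.startswith("SBS"):
        continue                      # skip the Description field
    counts[key] = float(value)

total = sum(counts.values())
fractions = {k: v / total for k, v in counts.items()}
```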
Each variant record receives one INFO field per component:
chr1 3122496 . G A . PASS logp_SBS1=-0.412;logp_SBS5=-1.893;logp_SBS40=-3.114;...
The logp_<component> values are log posterior probabilities normalized
over all components: exponentiating them yields probabilities that sum
to 1 (equivalently, their log-sum-exp is 0). A value close to 0 means the
mutation is very likely assigned to that component; a strongly negative
value means it is unlikely.
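For a single record you can recover ordinary probabilities by exponentiating. A quick illustrative check using the three values from the example record above (a real record carries one logp_ tag per component, so these three alone will not sum to 1):

```python
import math

# log posteriors copied from the example record (subset of components only)
logp = {"SBS1": -0.412, "SBS5": -1.893, "SBS40": -3.114}

# exponentiate to get plain posterior probabilities
probs = {k: math.exp(v) for k, v in logp.items()}

# the most probable component has the highest (least negative) logp
best = max(logp, key=logp.get)
```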
To retrieve the most probable component for each mutation:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' CHC197_annotated.vcf \
| python3 -c '
import sys
for line in sys.stdin:
    chrom, pos, ref, alt, info = line.rstrip().split("\t")
    # split INFO into key=value pairs (split on the first "=" only)
    fields = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
    logps = {k: float(v) for k, v in fields.items() if k.startswith("logp_")}
    best = max(logps, key=logps.get).replace("logp_", "", 1)
    print(chrom, pos, ref, alt, best)
'
Using a whitelist for panel and exome data¶
When mutations come from targeted sequencing (exome capture, gene panel), the model must know which genomic regions were actually assayed. Without this information, the model assumes the entire genome was observed, which biases signature assignments because the model expects many more background mutations than were detected.
Provide a BED file of assayed regions with --white-list:
mutopia-sbs annotate-vcf \
Liver-HCC.model.pkl \
Liver-HCC_setup.nc \
sample.vcf.gz \
-fa /path/to/hg38.fa \
-w VAF \
--no-pass-only \
--no-cluster \
--white-list capture_targets.bed \
-o sample_annotated.vcf
The whitelist adjusts two things:
Exposure correction — Per-region exposure terms are scaled by the fraction of each 10 kb genomic window covered by the whitelist intervals, so the model does not penalize the absence of mutations in uncaptured territory.
Mutation filtering — Mutations outside the whitelisted regions are excluded from the analysis, discarding off-target artefacts.
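The exposure correction above boils down to an overlap fraction per window. A minimal sketch of that calculation (illustrative only; the coordinates are invented, and MuTopia computes this internally from the sidecar BED):

```python
def covered_fraction(window, intervals):
    """Fraction of a half-open window [start, end) covered by BED-style
    intervals (assumed merged and non-overlapping). Illustrative only."""
    wstart, wend = window
    covered = 0
    for start, end in intervals:
        # length of the intersection of [start, end) with the window
        covered += max(0, min(end, wend) - max(start, wstart))
    return covered / (wend - wstart)

# a 10 kb window partially covered by two capture targets
frac = covered_fraction((3_120_000, 3_130_000),
                        [(3_118_000, 3_121_500), (3_125_000, 3_126_000)])
# 1500 bp + 1000 bp covered out of 10000 bp -> 0.25
```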
Note
The whitelist BED file should use the same chromosome naming convention as
your VCF and reference genome (chr1 vs 1). Use the -chr flag to
add or remove a prefix if needed.
Multi-sample VCFs¶
If your VCF contains calls for multiple samples, specify which one to annotate:
mutopia-sbs annotate-vcf \
Liver-HCC.model.pkl \
Liver-HCC_setup.nc \
cohort.vcf.gz \
-fa /path/to/hg38.fa \
-w VAF \
--no-pass-only \
--no-cluster \
-name SAMPLE_ID \
-o SAMPLE_ID_annotated.vcf
Additional options¶
-fa, --fasta FILE Reference genome FASTA (required)
-w, --weight-tag TAG INFO tag to use as per-mutation weight (e.g. "VAF")
-m, --mutation-rate-file FILE Background mutation rate bedgraph (for clustering)
--no-cluster Skip kataegis mutation clustering
--no-pass-only Include non-PASS variants
-name, --sample-name NAME Sample name in multi-sample VCF
-chr, --chr-prefix PREFIX Chromosome prefix to add (e.g. "chr")
-@, --threads N Parallel threads (default: 1)
--skip-sort Skip VCF sorting (VCF must be lexicographically sorted)
-o, --output FILE Output VCF path (default: stdout)
Run mutopia-sbs annotate-vcf --help for the full option reference.