Tutorial 5: Annotating VCFs with a Pre-Trained Model
====================================================

This tutorial shows how to take a MuTopia model that has *already been
trained* and use it to assign signature components to every mutation in your
own VCF file. You do **not** need to have completed Tutorials 1–4 (building a
G-Tensor, training a model) — we release a set of ready-to-use pre-trained
models, and this walkthrough is designed to be run against them directly.

By the end you will have:

- Downloaded a pre-trained model for a tumor type of interest
- Run ``topo-model setup`` to pre-compute locus-level predictions
- Run ``mutopia-sbs annotate-vcf`` against an example liver-cancer VCF
- An output VCF with per-mutation component posteriors and per-sample
  component fractions in the header

How the assignment works
------------------------

Assignment happens in two phases:

1. **Refitting (E-step)** — MuTopia iteratively estimates the per-component
   mutation counts :math:`N_k` for the sample, taking the pre-trained model's
   local (locus-level) predictions as prior information. This is similar to
   standard NMF refitting, but constrained by the topographic prior learned
   during training.

2. **Per-mutation posteriors** — Given :math:`N_k`, the probability that each
   individual mutation was generated by component *k* is computed as
   :math:`\phi_{ki} \propto N_k \cdot p(\text{mutation}_i \mid k)`. The log
   posteriors are written to the output VCF as ``INFO`` fields.

Prerequisites
-------------

The following tools must be on your ``PATH``:

- ``bcftools``
- ``bedtools``
- ``mutopia`` (the MuTopia package, installed into the active conda
  environment)

You do **not** need the full ``tutorial_data.tar.gz`` bundle for this
tutorial. Every file you need can be downloaded individually from the MuTopia
Zenodo record — see the next section.
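Before moving on, the two-phase assignment described above can be sketched
numerically. The snippet below is a toy illustration of the arithmetic only:
the likelihoods are synthetic, the flat starting point stands in for
MuTopia's topographic prior, and none of this is MuTopia's actual API.

.. code-block:: python

   import math

   def refit_and_assign(lik, n_iters=100):
       """Toy version of the two phases: refit per-component counts N_k,
       then per-mutation posteriors phi_ki ~ N_k * p(mutation_i | k).

       lik[k][i] = p(mutation_i | component k); synthetic values only.
       """
       K, M = len(lik), len(lik[0])
       N = [M / K] * K                    # flat starting counts (toy prior)
       post = []
       for _ in range(n_iters):
           # phi_ki proportional to N_k * p(mutation_i | k), normalized over k
           post = []
           for i in range(M):
               w = [N[k] * lik[k][i] for k in range(K)]
               z = sum(w)
               post.append([x / z for x in w])
           # refit: expected number of mutations assigned to each component
           N = [sum(post[i][k] for i in range(M)) for k in range(K)]
       # log posteriors, the quantity written to the logp_* INFO fields
       logp = [[math.log(post[i][k]) for k in range(K)] for i in range(M)]
       return N, logp

Because each mutation's posteriors are normalized over components,
exponentiating the log posteriors for any one mutation recovers a
distribution that sums to 1, which is how the ``INFO`` fields described in
Step 4 should be read.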
Step 1: Download a pre-trained model
------------------------------------

MuTopia ships pre-trained models for several tumor types on the
`Zenodo record <https://zenodo.org/records/18803136>`_. For each tumor type
``<TUMOR>`` there are three files at the record root:

- ``<TUMOR>.nc`` — the reference G-Tensor the model was trained on (provides
  locus coordinates, genomic features, and the trinucleotide context for
  every 10 kb window in hg38).
- ``<TUMOR>.nc.regions.bed`` — the sidecar BED file that lists the genomic
  coordinates of each locus in the G-Tensor. MuTopia expects it to live next
  to the ``.nc`` file.
- ``<TUMOR>.model.pkl`` — the trained topographic model itself.

Pick one that matches your samples. For this walkthrough we use the liver
model, which pairs naturally with the example VCF in Step 3.

.. code-block:: bash

   TUMOR=Liver-HCC
   ZENODO=https://zenodo.org/records/18803136/files

   wget -O ${TUMOR}.nc ${ZENODO}/${TUMOR}.nc
   wget -O ${TUMOR}.nc.regions.bed ${ZENODO}/${TUMOR}.nc.regions.bed
   wget -O ${TUMOR}.model.pkl ${ZENODO}/${TUMOR}.model.pkl

.. note::

   Every G-Tensor has a sidecar ``.nc.regions.bed`` file that records the
   genomic coordinates of each locus. MuTopia resolves it by path relative
   to the ``.nc`` file, so keep both files next to each other.

.. note::

   Every pre-trained model is fit against **hg38**. If your VCF is in
   GRCh37/hg19, lift it over first (e.g. with ``CrossMap`` or
   ``picard LiftoverVcf``).

Step 2: Initialize the dataset (one-time setup)
-----------------------------------------------

Before annotating any sample, run ``topo-model setup`` once to pre-compute
the model's locus-level predictions (local variables and normalizers) and
bake them into a copy of the G-Tensor. The resulting ``*_setup.nc`` file can
be reused across many VCF samples, and annotation is much faster against it
than against the raw dataset.
.. code-block:: bash

   topo-model setup \
       Liver-HCC.model.pkl \
       Liver-HCC.nc \
       Liver-HCC_setup.nc \
       -@ 4

Step 3: Annotate a VCF
----------------------

To make this tutorial fully reproducible without requiring you to bring your
own sample, we also host a single example VCF:

- ``CHC197.sample.hg38.vcf.gz`` — whole-genome somatic SNVs from a LICA-FR
  liver cancer sample (PCAWG donor CHC197), lifted to hg38. Roughly 12k PASS
  SNVs, ~260 KB.

You will also need an hg38 reference FASTA. MuTopia does not ship one — point
``-fa`` at whichever copy you already have (e.g. the one you use with
``bcftools``/``samtools``). The FASTA must be indexed (``hg38.fa.fai`` next
to it); run ``samtools faidx hg38.fa`` once if it isn't.

Download the example VCF and run the annotator:

.. code-block:: bash

   wget -O CHC197.sample.hg38.vcf.gz \
       https://github.com/sigscape/MuTopia/releases/download/v1.0.5/CHC197.sample.hg38.vcf.gz

   mutopia-sbs annotate-vcf \
       Liver-HCC.model.pkl \
       Liver-HCC_setup.nc \
       CHC197.sample.hg38.vcf.gz \
       -fa /path/to/hg38.fa \
       -w VAF \
       --no-pass-only \
       --no-cluster \
       -o CHC197_annotated.vcf

A few notes on the flags:

- ``-fa`` supplies the hg38 reference FASTA so MuTopia can resolve the
  trinucleotide context of each mutation.
- ``-w VAF`` tells MuTopia to read the variant allele fraction from the
  ``VAF`` ``INFO`` field and use it as a per-mutation weight. (The example
  VCF already carries ``VAF=...`` on every record; for your own data, pick
  whichever INFO tag your caller emits.)
- ``--no-pass-only`` keeps every record in the VCF, not just those with
  ``FILTER=PASS``. The example VCF has an empty ``FILTER`` column on every
  row, so without this flag nothing would make it through.
- ``--no-cluster`` skips MuTopia's kataegis clustering pre-pass. You can
  drop it if you also supply a background mutation-rate bedgraph via ``-m``:
.. code-block:: bash

   mutopia-sbs annotate-vcf \
       Liver-HCC.model.pkl \
       Liver-HCC_setup.nc \
       CHC197.sample.hg38.vcf.gz \
       -fa /path/to/hg38.fa \
       -w VAF \
       --no-pass-only \
       -m mutation-rate.hg38.bedgraph.gz \
       -o CHC197_annotated.vcf

While running, progress messages are printed to stderr:

.. code-block:: text

   Loading model ...
   Dataset has 279694 regions, 279694 after coord subsetting.
   Ingesting mutations ...
   Ingested 12309 mutations.
   Built sparse sample, 12309 nonzero entries.
   Loading dataset ...
   Whitelist applied: 279694/279694 regions have nonzero exposure.
   Setting up dataset (this may take a while) ...
   Computing per-mutation posteriors ...
   Estimated component fractions:
       SBS1    0.3421
       SBS5    0.2876
       SBS40   0.1503
       SBS22   0.0984
       ...
   Annotating VCF ...

Step 4: Reading the output
--------------------------

The estimated component fractions are written to the VCF header:

.. code-block:: text

   ##mutopia_Nk=

Each variant record receives one ``INFO`` field per component:

.. code-block:: text

   chr1    3122496 .       G       A       .       PASS    logp_SBS1=-0.412;logp_SBS5=-1.893;logp_SBS40=-3.114;...

The ``logp_`` values are log posterior probabilities normalized over all
components, so the un-logged posteriors for each mutation sum to 1
(equivalently, the log-sum-exp of its ``logp_`` values is 0). A value close
to 0 means the mutation is very likely assigned to that component; a
strongly negative value means it is unlikely.

To retrieve the most probable component for each mutation:
.. code-block:: bash

   bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' CHC197_annotated.vcf \
       | python3 -c '
   import sys
   for line in sys.stdin:
       chrom, pos, ref, alt, info = line.rstrip().split("\t")
       fields = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
       logps = {k: float(v) for k, v in fields.items() if k.startswith("logp_")}
       best = max(logps, key=logps.get).replace("logp_", "")
       print(chrom, pos, ref, alt, best)
   '

Using a whitelist for panel and exome data
------------------------------------------

When mutations come from targeted sequencing (exome capture, gene panel),
the model must know which genomic regions were actually assayed. Without
this information, the model assumes the entire genome was observed, which
biases signature assignments because the model expects many more background
mutations than were detected.

Provide a BED file of assayed regions with ``--white-list``:

.. code-block:: bash

   mutopia-sbs annotate-vcf \
       Liver-HCC.model.pkl \
       Liver-HCC_setup.nc \
       sample.vcf.gz \
       -fa /path/to/hg38.fa \
       -w VAF \
       --no-pass-only \
       --no-cluster \
       --white-list capture_targets.bed \
       -o sample_annotated.vcf

The whitelist adjusts two things:

1. **Exposure correction** — Per-region exposure terms are scaled by the
   fraction of each 10 kb genomic window covered by the whitelist intervals,
   so the model does not penalize the absence of mutations in uncaptured
   territory.

2. **Mutation filtering** — Mutations outside the whitelisted regions are
   excluded from the analysis, discarding off-target artefacts.

.. note::

   The whitelist BED file should use the same chromosome naming convention
   as your VCF and reference genome (``chr1`` vs ``1``). Use the ``-chr``
   flag to add or remove a prefix if needed.

Multi-sample VCFs
-----------------

If your VCF contains calls for multiple samples, specify which one to
annotate:
.. code-block:: bash

   mutopia-sbs annotate-vcf \
       Liver-HCC.model.pkl \
       Liver-HCC_setup.nc \
       cohort.vcf.gz \
       -fa /path/to/hg38.fa \
       -w VAF \
       --no-pass-only \
       --no-cluster \
       -name SAMPLE_ID \
       -o SAMPLE_ID_annotated.vcf

Additional options
------------------

.. code-block:: text

   -fa, --fasta FILE              Reference genome FASTA (required)
   -w, --weight-tag TAG           INFO tag to use as per-mutation weight (e.g. "VAF")
   -m, --mutation-rate-file FILE  Background mutation rate bedgraph (for clustering)
   --no-cluster                   Skip kataegis mutation clustering
   --no-pass-only                 Include non-PASS variants
   -name, --sample-name NAME      Sample name in multi-sample VCF
   -chr, --chr-prefix PREFIX      Chromosome prefix to add (e.g. "chr")
   -@, --threads N                Parallel threads (default: 1)
   --skip-sort                    Skip VCF sorting (VCF must be lexicographically sorted)
   -o, --output FILE              Output VCF path (default: stdout)

Run ``mutopia-sbs annotate-vcf --help`` for the full option reference.
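As a closing aside, the exposure correction that ``--white-list`` applies
(scaling each 10 kb window by its covered fraction, as described in the
panel/exome section) can be sketched in a few lines. The windows and capture
intervals below are made up for illustration; this is not MuTopia's
implementation.

.. code-block:: python

   def covered_fraction(window_start, window_end, intervals):
       """Fraction of the half-open window [start, end) covered by
       BED-style intervals, assumed merged/non-overlapping.
       Illustrative only, not MuTopia internals."""
       covered = 0
       for start, end in intervals:
           lo = max(window_start, start)
           hi = min(window_end, end)
           if hi > lo:
               covered += hi - lo
       return covered / (window_end - window_start)

   # Hypothetical capture targets and two 10 kb windows
   targets = [(2_000, 3_000), (9_500, 12_000)]
   windows = [(0, 10_000), (10_000, 20_000)]
   fractions = [covered_fraction(s, e, targets) for s, e in windows]

Each window's exposure term would then be multiplied by its entry in
``fractions``, so fully uncaptured windows contribute no expected mutations.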