Tutorial 1: Building G-Tensors

This tutorial demonstrates how to build a genomic tensor (G-Tensor) from scratch using MuTopia.

Overview

A G-Tensor is a multi-dimensional data structure that integrates various genomic features and annotations with mutational topography data. This tutorial will walk you through:

  1. Data Preparation: Downloading and preprocessing genomic annotations

  2. Configuration: Setting up a YAML configuration template

  3. G-Tensor Construction: Building the tensor object and ingesting data

Prerequisites

Before starting this tutorial, ensure you have:

  • MuTopia package installed

  • Pre-compiled data downloaded to the tutorial_data directory

  • The hg38 reference genome downloaded or symlinked to tutorial_data/hg38.fa.gz

1. Download and preprocess feature annotations

The first step in building a G-Tensor is to gather genomic annotations that will serve as features for our model. We’ll download:

  • Gene expression data from ENCODE for Lung cells

  • ATAC-seq peaks representing chromatin accessibility

  • Histone modification data (H3K27me3) for epigenetic state

A real modeling attempt would usually require many more features, but these will suffice for a demo.

1.1 Download gene expression data

First, we’ll download RNA-seq quantification data from ENCODE and convert it into a format suitable for our G-Tensor. The download_expression function downloads a TSV file and extracts gene coordinates and expression values; we use transcripts per million (called pme_TPM by ENCODE).

[9]:
%%bash
function download_expression {
  local url="${1}"

  ( set -o pipefail
    curl -fsSL "$url" \
    | awk -F '\t' 'NR>1 { print $1 "\t" $10 }' \
    | gtensor utils make-expression-bedfile
  )
}

mkdir -p data
download_expression "https://www.encodeproject.org/files/ENCFF769IPE/@@download/ENCFF769IPE.tsv" > data/LungExpression.bed
head -n3 data/LungExpression.bed
INFO: Mutopia:Downloading GTF file: MANE.GRCh38.v1.3.ensembl_genomic.gtf ...
INFO: Mutopia:Creating annotation file: MANE.GRCh38.annotation.bed ...
INFO: Mutopia:Writing quantitation file: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'> ...
chr1    65418   71584   ENSG00000186092 0.03    +
chr1    450739  451677  ENSG00000284733 0.06    -
chr1    685715  686653  ENSG00000284662 0.06    -
[10]:
%%bash
function download_atac {
  local url="${1}"

  ( set -o pipefail
    curl -fsSL "$url" \
    | gzip -dc \
    | sort -k1,1 -k2,2n \
    | awk -v OFS="\t" '{print $1,$2,$3,"Peak"}'
  )
}

download_atac "https://www.encodeproject.org/files/ENCFF818EYD/@@download/ENCFF818EYD.bed.gz" > "data/LungATAC.bed"
head -n3 "data/LungATAC.bed"
chr1    42005   42399   Peak
chr1    778532  778983  Peak
chr1    817239  817623  Peak

2. Specify a configuration template

Now we’ll create a YAML configuration template that defines how our G-Tensor should be structured. This template uses Jinja2 templating to make it flexible and reusable.
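
Jinja2 replaces {{placeholder}} markers at render time. Here is a minimal illustration using the jinja2 package directly (fill_jinja_template presumably wraps a render call along these lines):

```python
from jinja2 import Template

# Each {{...}} placeholder is substituted when render() is called.
snippet = Template("name: {{name}}\nregion_size: {{region_size}}")
rendered = snippet.render(name="demo", region_size=10_000)
print(rendered)
```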

The configuration specifies:

  • Data type: SBS

  • Reference files: Genome sizes, blacklisted regions, and FASTA sequence

  • Features: Different types of genomic annotations with their normalization methods

  • Region size: Size of genomic windows (10kb in this example)
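
Conceptually, region_size tiles each chromosome into fixed windows. A simplified sketch (the real pipeline additionally cuts out blacklisted regions and certain feature boundaries, which is why the window-size report later in this tutorial shows windows smaller than 10 kb):

```python
def make_windows(chrom_length: int, region_size: int = 10_000):
    """Tile a chromosome into fixed-size windows; the last window may be shorter."""
    return [
        (start, min(start + region_size, chrom_length))
        for start in range(0, chrom_length, region_size)
    ]

# A 25 kb chromosome yields two full windows and one 5 kb remainder.
print(make_windows(25_000))
```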

There are many different normalization strategies for the features, depending on the input data format:

  • [Bed]

    • categorical: A discrete region of the genome which is modeled using the nonlinear GBT estimators

    • mesoscale: A discrete region of the genome which is allowed to influence the signature distributions

    • strand: A discrete region of the genome with feature values of either “+” or “-”; marks the strand orientation of a genomic element.

  • [Bed, Bedgraph, BigWig]

    • log1p_cpm: Log1p-counts per million

    • standardize: z-score normalization

    • quantile: uniform quantile normalization

    • power: variance-stabilizing power transformation

    • robust: robust z-score, which trims outliers

Finally, the gex normalization is a little special: it must be applied when you’re working with expression-style data, since it normalizes the data per gene rather than across loci. If you don’t specify gex for expression data, the values won’t be comparable across cell types.
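
To make the numeric strategies concrete, here is what log1p_cpm and standardize compute (the standard formulas, written out for illustration; this is not MuTopia’s implementation):

```python
import math
import statistics

def log1p_cpm(counts):
    """Scale counts to counts-per-million, then apply log1p."""
    total = sum(counts)
    return [math.log1p(c / total * 1e6) for c in counts]

def standardize(values):
    """Classic z-score: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

print(standardize([1.0, 2.0, 3.0]))  # symmetric around 0
```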

[11]:
import mutopia.analysis as mu
from mutopia.utils import fill_jinja_template

template="""
name: {{name}}
dtype: SBS
chromsizes: {{chromsizes}}
blacklist: {{blacklist}}
fasta: {{fasta}}
region_size: {{region_size}}

features:
  GCContent:
    normalization: standardize
    sources:
      - {{ files["GCContent"] }}

  Transcription:
    normalization: gex
    column: 5
    sources:
      - {{ files["Transcription"] }}

  GeneStrand:
    normalization: strand
    column: 6
    sources:
      - {{ files["Transcription"] }}

  ATAC_accessible:
    normalization: categorical
    sources:
      - {{ files["ATAC_accessible"] }}

  H3K27ac:
    normalization: log1p_cpm
    sources:
      - {{ files["H3K27me3"] }}
"""

To configure a feature, the basic format is:

features:
    <feature_name: str>:
        normalization: <str>
        column: <Optional[int], default=4>
        null: <Optional[int]>
        group: <Optional[str], default="all">
        classes: <Optional[List[str]]>
        sources:
            - <path or URL>

Where:

  • normalization is one of the strategies outlined above.

  • column applies to bed files and specifies which column to extract the data from.

  • null indicates which value corresponds to no data in a bed or bedGraph file.

  • classes specifies a class ordering for the categories found in a bed file.

  • group can be used to set exclusive interaction groups between features; by default, no grouping is done.

  • sources is a list of files or URLs.
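
Putting this together, a categorical feature that uses the optional keys might look like the following (the feature name, class labels, group, and path are all hypothetical):

features:
    ChromState:
        normalization: categorical
        column: 4
        classes: ["None", "Weak", "Strong"]
        group: chromatin
        sources:
            - data/ChromState.bed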

3. Fill the template using annotations

Now we’ll populate our template with actual file paths and configuration values. The fill_jinja_template function takes our template and replaces the placeholders with real values. I like to define a configuration for each cell type so I can dynamically generate config files for different scenarios.

Note: If you don’t have to preprocess your data in any way (for example, it’s already available as a bigWig, ready to ingest), you can simply provide a URL! MuTopia will download it for you.

[12]:
from mutopia.utils import fill_jinja_template

config = fill_jinja_template(
    template,
    name="example_gtensor",
    chromsizes="tutorial_data/hg38.mainchroms.sizes",
    fasta="tutorial_data/hg38.fa",
    blacklist="tutorial_data/hg38-blacklist.v2.sorted.bed.gz",
    region_size=10_000,
    files={
        "GCContent": "tutorial_data/at_gc_ratio.hg38.bedgraph.gz",
        "Transcription": "data/LungExpression.bed",
        "ATAC_accessible": "data/LungATAC.bed",
        "H3K27me3" : "https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig",
    }
)

print(config)

name: example_gtensor
dtype: SBS
chromsizes: tutorial_data/hg38.mainchroms.sizes
blacklist: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
fasta: tutorial_data/hg38.fa
region_size: 10000

features:
  GCContent:
    normalization: standardize
    sources:
      - tutorial_data/at_gc_ratio.hg38.bedgraph.gz

  Transcription:
    normalization: gex
    column: 5
    sources:
      - data/LungExpression.bed

  GeneStrand:
    normalization: strand
    column: 6
    sources:
      - data/LungExpression.bed

  ATAC_accessible:
    normalization: categorical
    sources:
      - data/LungATAC.bed

  H3K27ac:
    normalization: log1p_cpm
    sources:
      - https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig
Save the configuration to a YAML file that we’ll use in the next step:

[13]:
with open("example_gtensor.yaml", "w") as f:
    f.write(config)

4. Build the G-Tensor object and ingest the annotations

With our configuration ready, we can now build the G-Tensor using the gtensor compose command. This step:

  1. Creates the tensor structure: Sets up the multi-dimensional array

  2. Downloads remote files: Fetches any BigWig or other remote annotations

  3. Ingests features: Ingests features using the correct paths depending on the filetype and normalization strategy

The -w 1 flag specifies using 1 worker process; for larger datasets, you can increase this number to parallelize the work. Under the hood, this command runs a Luigi pipeline, which can resume from any failures.

[14]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
  warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings
INFO: Mutopia:Creating regions with cutouts for: GeneStrand, ATAC_accessible, Transcription
INFO: Mutopia:Creating genomic regions ...
INFO: Mutopia:Using chromosomes:
        chr1, chr2, chr3, chr4,
        chr5, chr6, chr7, chr8,
        chr9, chr11, chr10, chr12,
        chr13, chr14, chr15, chr16,
        chr17, chr18, chr20, chr19,
        chr22, chr21
INFO: Mutopia:Making initial coarse-grained regions ...
INFO: Mutopia:Building regions ...
INFO: Mutopia:Wrote 25000 windows ...
INFO: Mutopia:Wrote 50000 windows ...
INFO: Mutopia:Wrote 75000 windows ...
INFO: Mutopia:Wrote 100000 windows ...
INFO: Mutopia:Wrote 125000 windows ...
INFO: Mutopia:Wrote 150000 windows ...
INFO: Mutopia:Wrote 175000 windows ...
INFO: Mutopia:Wrote 200000 windows ...
INFO: Mutopia:Wrote 225000 windows ...
INFO: Mutopia:Wrote 250000 windows ...
INFO: Mutopia:Wrote 275000 windows ...
INFO: Mutopia:Wrote 300000 windows ...
INFO: Mutopia:Wrote 325000 windows ...
INFO: Mutopia:Wrote 350000 windows ...
Window size report
----------------------
Num windows   | 353685
Smallest      | 101
Largest       | 10000
Quantile=0.1  | 420
Quantile=0.25 | 5174
Quantile=0.5  | 10000
Quantile=0.75 | 10000
Quantile=0.9  | 10000
INFO: Mutopia:Calculating context frequencies ...
Aggregating trinucleotide content: 353685it [08:47, 670.02it/s]
INFO: Mutopia:Formatting G-Tensor ...
INFO: Mutopia:Writing G-Tensor to example_gtensor.nc, and accompanying regions file to example_gtensor.nc.regions.bed ...
Make sure not to leave the regions file behind if you move the G-Tensor!.
INFO: Mutopia:Ingesting data from tutorial_data/at_gc_ratio.hg38.bedgraph.gz ...
INFO: Mutopia:First values of feature GCContent: 0.35, 0.31, 0.31, 0.31, 0.23
First non-null values: 0.35, 0.31, 0.31, 0.31, 0.23
INFO: Mutopia:Added feature: GCContent
INFO: Mutopia:Ingesting data from downloads/ENCFF440HPT.bigWig ...
INFO: Mutopia:First values of feature H3K27ac: 0.07, 0.08, 0.07, 0.14, 0.05
First non-null values: 0.07, 0.08, 0.07, 0.14, 0.05
INFO: Mutopia:Added feature: H3K27ac
INFO: Mutopia:Ingesting data from data/LungExpression.bed ...
INFO: Mutopia:First values of feature Transcription: nan, nan, nan, nan, nan
First non-null values: 2.87, 2.87, 2.87, 2.87, 2.87
INFO: Mutopia:Added feature: Transcription
INFO: Mutopia:Ingesting data from data/LungATAC.bed ...
INFO: Mutopia:Found classes: "None", "Peak"
INFO: Mutopia:Added feature: ATAC_accessible

Note: Make sure the command above runs to completion; you can check by executing it again and verifying that it does not try to launch additional tasks.

[15]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
  warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings

5. Ingest samples

Now we’ll add some samples to our G-Tensor. The gtensor sample add command processes VCF/BCF files and extracts all of the relevant information from each mutation. Before you start, you need to organize some metadata:

  1. For each mutation in each VCF, you must add an INFO column for the mutation’s apparent VAF (variant allele frequency). This is simply calculated as (reads supporting the variant) / (total reads). We would calculate this internally, but given the heterogeneity in how different groups store this information, we leave it to the user.

  2. The background mutation rate: this establishes the baseline mutation rate used to discover mutation clusters. It need not be tailored to the tissue; the one we provide is usually fine.

  3. (Optional) The purity of each sample (if working with tumors). Use it if you have it on hand, but it’s not necessary.

Add a sample with gtensor sample add, making sure to give each sample a unique ID, and to set the “sample-weight” to the reciprocal of purity: \(\text{sample weight} = \frac{1}{\text{purity}}\)

vcf_path="path/to/your/vcf-or-bcf"
reciprocal_purity=1.25  # e.g. for a sample with 80% purity

gtensor sample add \
    -id "example.001" \
    --sample-weight "$reciprocal_purity" \
    --mutation-rate-file "tutorial_data/mutation-rate.hg38.bedgraph.gz" \
    --pass-only \
    example_gtensor.nc \
    "$vcf_path"
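
As a sanity check, the two quantities above, the apparent VAF and the sample weight, are just (illustrative helpers, not part of the MuTopia API):

```python
def apparent_vaf(variant_reads: int, total_reads: int) -> float:
    """Apparent VAF = reads supporting the variant / total reads at the site."""
    return variant_reads / total_reads

def sample_weight(purity: float) -> float:
    """Sample weight = 1 / purity."""
    return 1.0 / purity

print(apparent_vaf(30, 100))  # 0.3
print(sample_weight(0.8))     # 1.25, matching reciprocal_purity above
```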

Let’s examine the structure and contents of our newly created G-Tensor:

[17]:
!gtensor info example_gtensor.nc
Num features: 5
Num samples: 0
Dataset name: example_gtensor
Dataset dimensions:
        configuration: 2
        context: 96
        locus: 353685
        clustered: 2
Dataset attributes:
        dtype: SBS
        regions_file: example_gtensor.nc.regions.bed
        genome_file: tutorial_data/hg38.mainchroms.sizes
        fasta_file: tutorial_data/hg38.fa
        blacklist_file: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
        region_size: 10000
        coordinates: mutation

You can check out the features you ingested with:

[18]:
!gtensor feature ls example_gtensor.nc
Feature name    | normalization | group
----------------+---------------+------
GCContent       | standardize   | all
H3K27ac         | log1p_cpm     | all
Transcription   | gex           | all
ATAC_accessible | categorical   | all
GeneStrand      | strand        | all

6. Manipulating G-Tensors

The G-Tensor CLI contains many useful functions for editing and munging G-Tensors from the terminal. The CLI structure is hierarchical and (hopefully) self-documenting.
For samples, offsets (more on these later), and features, you can view information, add, remove, and edit:
[19]:
!gtensor feature --help
Usage: gtensor feature [OPTIONS] COMMAND [ARGS]...

  Manage features in G-Tensor datasets.

  Features represent genomic annotations or experimental data values
  associated with each genomic region in the G-Tensor.

Options:
  --help  Show this message and exit.

Commands:
  add   Add features to a G-Tensor
  edit  Edit feature attributes
  ls    List features in a G-Tensor
  rm    Remove features from a G-Tensor

You can also slice and subset G-Tensors using the slice commands:

[20]:
!gtensor slice --help
Usage: gtensor slice [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  regions  Extract specific genomic regions from a G-Tensor dataset.
  samples  Extract specific samples from a G-Tensor dataset into a new file.

Check out all of the available functionality using:

[21]:
!gtensor --help
Usage: gtensor [OPTIONS] COMMAND [ARGS]...

  Manage G-Tensor datasets for genomic data analysis.

  G-Tensors are structured genomic datasets that organize samples, features,
  and observations across genomic loci for machine learning applications.

Options:
  --help  Show this message and exit.

Commands:
  compose   Run the G-Tensor construction pipeline
  convert   Convert a G-Tensor to a different modality
  create    Create a new G-Tensor
  feature   Manage features in a G-Tensor
  info      Display information about a G-Tensor
  offsets   Manage genomic locus exposure offsets
  sample    Manage samples in a G-Tensor
  set-attr  Set attributes on a G-Tensor
  slice     Slice a G-Tensor by samples or regions
  split     Split a G-Tensor into training and test sets
  utils     Utility functions for G-Tensors