Tutorial 1: Building G-Tensors

This tutorial demonstrates how to build a genomic tensor (G-Tensor) from scratch using MuTopia.

Overview

A G-Tensor is a multi-dimensional data structure that integrates various genomic features and annotations with mutational topography data. This tutorial will walk you through:

  1. Data Preparation: Downloading and preprocessing genomic annotations

  2. Configuration: Setting up a YAML configuration template

  3. G-Tensor Construction: Building the tensor object and ingesting data

Prerequisites

Before starting this tutorial, ensure you have:

  • MuTopia package installed

  • Pre-compiled data downloaded to the tutorial_data directory

  • The hg38 reference genome downloaded or symlinked to tutorial_data/hg38.fa.gz

1. Download and preprocess feature annotations

The first step in building a G-Tensor is to gather genomic annotations that will serve as features for our model. We’ll download:

  • Gene expression data from ENCODE for Lung cells

  • ATAC-seq peaks representing chromatin accessibility

  • Histone modification data (H3K27me3) for epigenetic state

A real modeling attempt would usually require many more features, but these will suffice for a demo.

1.1 Download gene expression data

First, we’ll download RNA-seq quantification data from ENCODE and convert it into a format suitable for our G-Tensor. The download_expression function downloads a TSV file and extracts gene coordinates and expression values; we use transcripts per million (called pme_TPM by ENCODE).

[9]:
%%bash
function download_expression {
  local url="${1}"

  ( set -o pipefail
    curl -fsSL "$url" \
    | awk -F '\t' 'NR>1 { print $1 "\t" $10 }' \
    | gtensor utils make-expression-bedfile
  )
}

mkdir -p data
download_expression "https://www.encodeproject.org/files/ENCFF769IPE/@@download/ENCFF769IPE.tsv" > data/LungExpression.bed
head -n3 data/LungExpression.bed
INFO: Mutopia:Downloading GTF file: MANE.GRCh38.v1.3.ensembl_genomic.gtf ...
INFO: Mutopia:Creating annotation file: MANE.GRCh38.annotation.bed ...
INFO: Mutopia:Writing quantitation file: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'> ...
chr1    65418   71584   ENSG00000186092 0.03    +
chr1    450739  451677  ENSG00000284733 0.06    -
chr1    685715  686653  ENSG00000284662 0.06    -
[10]:
%%bash
function download_atac {
  local url="${1}"

  ( set -o pipefail
    curl -fsSL "$url" \
    | gzip -dc \
    | sort -k1,1 -k2,2n \
    | awk -v OFS="\t" '{print $1,$2,$3,"Peak"}'
  )
}

download_atac "https://www.encodeproject.org/files/ENCFF818EYD/@@download/ENCFF818EYD.bed.gz" > "data/LungATAC.bed"
head -n3 "data/LungATAC.bed"
chr1    42005   42399   Peak
chr1    778532  778983  Peak
chr1    817239  817623  Peak

2. Specify a configuration template

Now we’ll create a YAML configuration template that defines how our G-Tensor should be structured. This template uses Jinja2 templating to make it flexible and reusable.
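
Jinja2 replaces {{placeholder}} markers at render time. Here is a minimal illustration using the jinja2 package directly (fill_jinja_template presumably wraps a render call along these lines):

```python
from jinja2 import Template

# Each {{...}} placeholder is substituted when render() is called.
snippet = Template("name: {{name}}\nregion_size: {{region_size}}")
rendered = snippet.render(name="demo", region_size=10_000)
print(rendered)
```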

The configuration specifies:

  • Data type: SBS

  • Reference files: Genome sizes, blacklisted regions, and FASTA sequence

  • Features: Different types of genomic annotations with their normalization methods

  • Region size: Size of genomic windows (10kb in this example)
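
Conceptually, region_size tiles each chromosome into fixed windows. A simplified sketch (the real pipeline additionally cuts out blacklisted regions and certain feature boundaries, which is why the window-size report later in this tutorial shows windows smaller than 10 kb):

```python
def make_windows(chrom_length: int, region_size: int = 10_000):
    """Tile a chromosome into fixed-size windows; the last window may be shorter."""
    return [
        (start, min(start + region_size, chrom_length))
        for start in range(0, chrom_length, region_size)
    ]

# A 25 kb chromosome yields two full windows and one 5 kb remainder.
print(make_windows(25_000))
```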

There are many different normalization strategies for the features, depending on the input data format:

  • [Bed]

    • categorical: A discrete region of the genome which is modeled using the nonlinear GBT estimators

    • mesoscale: A discrete region of the genome which is allowed to influence the signature distributions

    • strand: A discrete region of the genome with feature values of either “+” or “-”; marks the strand orientation of a genomic element.

  • [Bed, Bedgraph, BigWig]

    • log1p_cpm: Log1p-counts per million

    • standardize: z-score normalization

    • quantile: uniform quantile normalization

    • power: variance-stabilizing power transformation

    • robust: robust z-score, which trims outliers

Finally, the gex normalization is a little special: it must be applied when you’re working with expression-style data, since it normalizes the data per gene rather than across loci. If you don’t specify gex for expression data, the values won’t be comparable across cell types.
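
To make the numeric strategies concrete, here is what log1p_cpm and standardize compute (the standard formulas, written out for illustration; this is not MuTopia’s implementation):

```python
import math
import statistics

def log1p_cpm(counts):
    """Scale counts to counts-per-million, then apply log1p."""
    total = sum(counts)
    return [math.log1p(c / total * 1e6) for c in counts]

def standardize(values):
    """Classic z-score: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

print(standardize([1.0, 2.0, 3.0]))  # symmetric around 0
```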

[11]:
import mutopia.analysis as mu
from mutopia.utils import fill_jinja_template

template="""
name: {{name}}
dtype: SBS
chromsizes: {{chromsizes}}
blacklist: {{blacklist}}
fasta: {{fasta}}
region_size: {{region_size}}

features:
  GCContent:
    normalization: standardize
    sources:
      - {{ files["GCContent"] }}

  Transcription:
    normalization: gex
    column: 5
    sources:
      - {{ files["Transcription"] }}

  GeneStrand:
    normalization: strand
    column: 6
    sources:
      - {{ files["Transcription"] }}

  ATAC_accessible:
    normalization: categorical
    sources:
      - {{ files["ATAC_accessible"] }}

  H3K27ac:
    normalization: log1p_cpm
    sources:
      - {{ files["H3K27me3"] }}
"""

To configure a feature, the basic format is:

features:
    <feature_name: str>:
        normalization: <str>
        column: <Optional[int], default=4>
        null: <Optional[int]>
        group: <Optional[str], default="all">
        classes: <Optional[List[str]]>
        sources:
            - <path or URL>

Where:

  • normalization is one of the strategies outlined above.

  • column applies to bed files and specifies which column to extract the data from.

  • null indicates which value corresponds to no data in a bed or bedGraph file.

  • classes specifies a class ordering for the categories found in a bed file.

  • group can be used to set exclusive interaction groups between features; by default, no grouping is done.

  • sources is a list of files or URLs.
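
Putting this together, a categorical feature that uses the optional keys might look like the following (the feature name, class labels, group, and path are all hypothetical):

features:
    ChromState:
        normalization: categorical
        column: 4
        classes: ["None", "Weak", "Strong"]
        group: chromatin
        sources:
            - data/ChromState.bed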

3. Fill the template using annotations

Now we’ll populate our template with actual file paths and configuration values. The fill_jinja_template function takes our template and replaces the placeholders with real values. I like to define a configuration for each cell type so I can dynamically generate config files for different scenarios.

Note: If you don’t have to preprocess your data in any way (for example, it’s already available as a bigWig, ready to ingest), you can simply provide a URL! MuTopia will download it for you.

[12]:
from mutopia.utils import fill_jinja_template

config = fill_jinja_template(
    template,
    name="example_gtensor",
    chromsizes="tutorial_data/hg38.mainchroms.sizes",
    fasta="tutorial_data/hg38.fa",
    blacklist="tutorial_data/hg38-blacklist.v2.sorted.bed.gz",
    region_size=10_000,
    files={
        "GCContent": "tutorial_data/at_gc_ratio.hg38.bedgraph.gz",
        "Transcription": "data/LungExpression.bed",
        "ATAC_accessible": "data/LungATAC.bed",
        "H3K27me3" : "https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig",
    }
)

print(config)

name: example_gtensor
dtype: SBS
chromsizes: tutorial_data/hg38.mainchroms.sizes
blacklist: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
fasta: tutorial_data/hg38.fa
region_size: 10000

features:
  GCContent:
    normalization: standardize
    sources:
      - tutorial_data/at_gc_ratio.hg38.bedgraph.gz

  Transcription:
    normalization: gex
    column: 5
    sources:
      - data/LungExpression.bed

  GeneStrand:
    normalization: strand
    column: 6
    sources:
      - data/LungExpression.bed

  ATAC_accessible:
    normalization: categorical
    sources:
      - data/LungATAC.bed

  H3K27ac:
    normalization: log1p_cpm
    sources:
      - https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig
Save the configuration to a YAML file that we’ll use in the next step:

[13]:
with open("example_gtensor.yaml", "w") as f:
    f.write(config)

4. Build the G-Tensor object and ingest the annotations

With our configuration ready, we can now build the G-Tensor using the gtensor compose command. This step:

  1. Creates the tensor structure: Sets up the multi-dimensional array

  2. Downloads remote files: Fetches any BigWig or other remote annotations

  3. Ingests features: Ingests features using the correct paths depending on the filetype and normalization strategy

The -w 1 flag specifies using 1 worker process; for larger datasets, you can increase this number to parallelize the work. Under the hood, this command runs a Luigi pipeline, which can resume from any failures.

[14]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
  warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings
INFO: Mutopia:Creating regions with cutouts for: GeneStrand, ATAC_accessible, Transcription
INFO: Mutopia:Creating genomic regions ...
INFO: Mutopia:Using chromosomes:
        chr1, chr2, chr3, chr4,
        chr5, chr6, chr7, chr8,
        chr9, chr11, chr10, chr12,
        chr13, chr14, chr15, chr16,
        chr17, chr18, chr20, chr19,
        chr22, chr21
INFO: Mutopia:Making initial coarse-grained regions ...
INFO: Mutopia:Building regions ...
INFO: Mutopia:Wrote 25000 windows ...
INFO: Mutopia:Wrote 50000 windows ...
INFO: Mutopia:Wrote 75000 windows ...
INFO: Mutopia:Wrote 100000 windows ...
INFO: Mutopia:Wrote 125000 windows ...
INFO: Mutopia:Wrote 150000 windows ...
INFO: Mutopia:Wrote 175000 windows ...
INFO: Mutopia:Wrote 200000 windows ...
INFO: Mutopia:Wrote 225000 windows ...
INFO: Mutopia:Wrote 250000 windows ...
INFO: Mutopia:Wrote 275000 windows ...
INFO: Mutopia:Wrote 300000 windows ...
INFO: Mutopia:Wrote 325000 windows ...
INFO: Mutopia:Wrote 350000 windows ...
Window size report
----------------------
Num windows   | 353685
Smallest      | 101
Largest       | 10000
Quantile=0.1  | 420
Quantile=0.25 | 5174
Quantile=0.5  | 10000
Quantile=0.75 | 10000
Quantile=0.9  | 10000
INFO: Mutopia:Calculating context frequencies ...
Aggregating trinucleotide content: 353685it [08:47, 670.02it/s]
INFO: Mutopia:Formatting G-Tensor ...
INFO: Mutopia:Writing G-Tensor to example_gtensor.nc, and accompanying regions file to example_gtensor.nc.regions.bed ...
Make sure not to leave the regions file behind if you move the G-Tensor!.
INFO: Mutopia:Ingesting data from tutorial_data/at_gc_ratio.hg38.bedgraph.gz ...
INFO: Mutopia:First values of feature GCContent: 0.35, 0.31, 0.31, 0.31, 0.23
First non-null values: 0.35, 0.31, 0.31, 0.31, 0.23
INFO: Mutopia:Added feature: GCContent
INFO: Mutopia:Ingesting data from downloads/ENCFF440HPT.bigWig ...
INFO: Mutopia:First values of feature H3K27ac: 0.07, 0.08, 0.07, 0.14, 0.05
First non-null values: 0.07, 0.08, 0.07, 0.14, 0.05
INFO: Mutopia:Added feature: H3K27ac
INFO: Mutopia:Ingesting data from data/LungExpression.bed ...
INFO: Mutopia:First values of feature Transcription: nan, nan, nan, nan, nan
First non-null values: 2.87, 2.87, 2.87, 2.87, 2.87
INFO: Mutopia:Added feature: Transcription
INFO: Mutopia:Ingesting data from data/LungATAC.bed ...
INFO: Mutopia:Found classes: "None", "Peak"
INFO: Mutopia:Added feature: ATAC_accessible

Note: Make sure the command above runs to completion; you can check by executing it again and verifying that it does not try to launch additional tasks.

[15]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
  warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings

5. Ingest samples

Now we’ll add some samples to our G-Tensor. The gtensor sample add command processes VCF/BCF files and extracts all of the relevant information from each mutation. Before you start, you need to organize some metadata:

  1. For each mutation in each VCF, you must add an INFO column for the mutation’s apparent VAF (variant allele frequency). This is simply calculated as (reads supporting the variant) / (total reads). We would calculate this internally, but given the heterogeneity in how different groups store this information, we leave it to the user.

  2. The background mutation rate: this establishes the baseline mutation rate used to discover mutation clusters. It need not be tailored to the tissue; the one we provide is usually fine.

  3. (Optional) The purity of each sample (if working with tumors). Use it if you have it on hand, but it’s not necessary.

Add a sample with gtensor sample add, making sure to give each sample a unique ID, and to set the “sample-weight” to the reciprocal of purity: \(\text{sample weight} = \frac{1}{\text{purity}}\)

vcf_path="path/to/your/vcf-or-bcf"
reciprocal_purity=1.25  # e.g. for a sample with 80% purity

gtensor sample add \
    -id "example.001" \
    --sample-weight "$reciprocal_purity" \
    --mutation-rate-file "tutorial_data/mutation-rate.hg38.bedgraph.gz" \
    --pass-only \
    example_gtensor.nc \
    "$vcf_path"
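
As a sanity check, the two quantities above, the apparent VAF and the sample weight, are just (illustrative helpers, not part of the MuTopia API):

```python
def apparent_vaf(variant_reads: int, total_reads: int) -> float:
    """Apparent VAF = reads supporting the variant / total reads at the site."""
    return variant_reads / total_reads

def sample_weight(purity: float) -> float:
    """Sample weight = 1 / purity."""
    return 1.0 / purity

print(apparent_vaf(30, 100))  # 0.3
print(sample_weight(0.8))     # 1.25, matching reciprocal_purity above
```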

Let’s examine the structure and contents of our newly created G-Tensor:

[17]:
!gtensor info example_gtensor.nc
Num features: 5
Num samples: 0
Dataset name: example_gtensor
Dataset dimensions:
        configuration: 2
        context: 96
        locus: 353685
        clustered: 2
Dataset attributes:
        dtype: SBS
        regions_file: example_gtensor.nc.regions.bed
        genome_file: tutorial_data/hg38.mainchroms.sizes
        fasta_file: tutorial_data/hg38.fa
        blacklist_file: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
        region_size: 10000
        coordinates: mutation

You can check out the features you ingested with:

[18]:
!gtensor feature ls example_gtensor.nc
Feature name    | normalization | group
----------------+---------------+------
GCContent       | standardize   | all
H3K27ac         | log1p_cpm     | all
Transcription   | gex           | all
ATAC_accessible | categorical   | all
GeneStrand      | strand        | all

6. Manipulating G-Tensors

The G-Tensor CLI contains many useful functions for editing and munging G-Tensors from the terminal. The CLI structure is hierarchical and (hopefully) self-documenting.
For samples, offsets (more on these later), and features, you can view information, add, remove, and edit:
[19]:
!gtensor feature --help
Usage: gtensor feature [OPTIONS] COMMAND [ARGS]...

  Manage features in G-Tensor datasets.

  Features represent genomic annotations or experimental data values
  associated with each genomic region in the G-Tensor.

Options:
  --help  Show this message and exit.

Commands:
  add   Add features to a G-Tensor
  edit  Edit feature attributes
  ls    List features in a G-Tensor
  rm    Remove features from a G-Tensor

You can also slice and subset G-Tensors using the slice commands:

[20]:
!gtensor slice --help
Usage: gtensor slice [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  regions  Extract specific genomic regions from a G-Tensor dataset.
  samples  Extract specific samples from a G-Tensor dataset into a new file.

Check out all of the available functionality using:

[21]:
!gtensor --help
Usage: gtensor [OPTIONS] COMMAND [ARGS]...

  Manage G-Tensor datasets for genomic data analysis.

  G-Tensors are structured genomic datasets that organize samples, features,
  and observations across genomic loci for machine learning applications.

Options:
  --help  Show this message and exit.

Commands:
  compose   Run the G-Tensor construction pipeline
  convert   Convert a G-Tensor to a different modality
  create    Create a new G-Tensor
  feature   Manage features in a G-Tensor
  info      Display information about a G-Tensor
  offsets   Manage genomic locus exposure offsets
  sample    Manage samples in a G-Tensor
  set-attr  Set attributes on a G-Tensor
  slice     Slice a G-Tensor by samples or regions
  split     Split a G-Tensor into training and test sets
  utils     Utility functions for G-Tensors