Tutorial 1: Building G-Tensors¶
This tutorial demonstrates how to build a genomic tensor (G-Tensor) from scratch using MuTopia.
Overview¶
A G-Tensor is a multi-dimensional data structure that integrates various genomic features and annotations with mutational topography data. This tutorial will walk you through:
Data Preparation: Downloading and preprocessing genomic annotations
Configuration: Setting up a YAML configuration template
G-Tensor Construction: Building the tensor object and ingesting data
Prerequisites¶
Before starting this tutorial, ensure you have:
MuTopia package installed
The pre-compiled data downloaded to the tutorial_data directory
The hg38 reference genome downloaded or symlinked to tutorial_data/hg38.fa.gz
1. Download and preprocess feature annotations¶
The first step in building a G-Tensor is to gather genomic annotations that will serve as features for our model. We’ll download:
Gene expression data from ENCODE for Lung cells
ATAC-seq peaks representing chromatin accessibility
Histone modification data (H3K27me3) for epigenetic state
Usually, a real modeling attempt would require many more features, but these will suffice for a demo.
1.1 Download gene expression data¶
First, we’ll download RNA-seq quantification data from ENCODE and convert it into a format suitable for our G-Tensor. The download_expression function downloads the TSV file and extracts gene coordinates and expression values; we use transcripts per million (called pme_TPM by ENCODE).
[9]:
%%bash
function download_expression {
local url="${1}"
( set -o pipefail
curl -fsSL "$url" \
| awk -F '\t' 'NR>1 { print $1 "\t" $10 }' \
| gtensor utils make-expression-bedfile
)
}
mkdir -p data
download_expression "https://www.encodeproject.org/files/ENCFF769IPE/@@download/ENCFF769IPE.tsv" > data/LungExpression.bed
head -n3 data/LungExpression.bed
INFO: Mutopia:Downloading GTF file: MANE.GRCh38.v1.3.ensembl_genomic.gtf ...
INFO: Mutopia:Creating annotation file: MANE.GRCh38.annotation.bed ...
INFO: Mutopia:Writing quantitation file: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'> ...
chr1 65418 71584 ENSG00000186092 0.03 +
chr1 450739 451677 ENSG00000284733 0.06 -
chr1 685715 686653 ENSG00000284662 0.06 -
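The resulting file is a six-column BED: chrom, start, end, gene ID, pme_TPM, and strand. As a quick sanity check it can be loaded with nothing but the standard library; a sketch (ExpressionRecord and read_expression_bed are illustrative names, not part of MuTopia):

```python
import csv
from typing import NamedTuple

# Column layout taken from the `head` output above:
# chrom, start, end, gene ID, pme_TPM, strand
class ExpressionRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    gene_id: str
    tpm: float
    strand: str

def read_expression_bed(path):
    """Parse the six-column expression BED produced above."""
    with open(path) as fh:
        return [
            ExpressionRecord(r[0], int(r[1]), int(r[2]), r[3], float(r[4]), r[5])
            for r in csv.reader(fh, delimiter="\t")
        ]
```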
[10]:
%%bash
function download_atac {
local url="$1"
wget -qO - "$url" \
| gzip -dc \
| sort -k1,1 -k2,2n \
| awk -v OFS="\t" '{print $1,$2,$3,"Peak"}'
}
download_atac "https://www.encodeproject.org/files/ENCFF818EYD/@@download/ENCFF818EYD.bed.gz" > "data/LungATAC.bed"
head -n3 "data/LungATAC.bed"
chr1 42005 42399 Peak
chr1 778532 778983 Peak
chr1 817239 817623 Peak
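The shell pipeline above just sorts the intervals and keeps chrom/start/end plus a constant "Peak" label. For clarity, the same transformation sketched in Python on already-decompressed BED lines:

```python
def make_peak_bed(bed_lines):
    """Sort BED intervals by (chrom, start) and keep chrom/start/end
    plus a constant 'Peak' label, mirroring the sort+awk steps above."""
    intervals = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    # `sort -k1,1 -k2,2n`: chrom lexicographically, start numerically
    intervals.sort(key=lambda iv: (iv[0], iv[1]))
    return [f"{c}\t{s}\t{e}\tPeak" for c, s, e in intervals]
```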
2. Specify a configuration template¶
Now we’ll create a YAML configuration template that defines how our G-Tensor should be structured. This template uses Jinja2 templating to make it flexible and reusable.
The configuration specifies:
Data type: SBS
Reference files: Genome sizes, blacklisted regions, and FASTA sequence
Features: Different types of genomic annotations with their normalization methods
Region size: Size of genomic windows (10kb in this example)
There are many different normalization strategies for the features, depending on the input data format:
[Bed]
categorical: A discrete region of the genome which is modeled using the nonlinear GBT estimators
mesoscale: A discrete region of the genome which is allowed to influence the signature distributions
strand: A discrete region of the genome with feature values of either “+” or “-”, marking the strand orientation of a genomic element
[Bed, Bedgraph, BigWig]
log1p_cpm: Log1p-counts per million
standardize: z-score normalization
quantile: uniform quantile normalization
power: variance-stabilizing power transformation
robust: robust z-score, which trims outliers
Finally, the gex normalization is a bit special. It must be applied when you’re working with expression-style data, since it normalizes the data per gene rather than across loci. If you don’t specify gex for expression data, the values won’t be comparable across cell types.
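To see why per-gene normalization matters, consider a toy sketch (illustrative only, not MuTopia’s implementation): z-scoring each gene across cell types puts a lowly- and a highly-expressed gene on the same scale, so values become comparable between cell types.

```python
from statistics import mean, stdev

def normalize_per_gene(tpm_by_celltype):
    """Toy per-gene z-score: for each gene, standardize its expression
    across cell types. `tpm_by_celltype` maps celltype -> {gene: tpm};
    all cell types are assumed to share the same gene set."""
    genes = next(iter(tpm_by_celltype.values())).keys()
    normalized = {ct: {} for ct in tpm_by_celltype}
    for gene in genes:
        values = [tpm_by_celltype[ct][gene] for ct in tpm_by_celltype]
        mu, sigma = mean(values), stdev(values)
        for ct in tpm_by_celltype:
            x = tpm_by_celltype[ct][gene]
            normalized[ct][gene] = (x - mu) / sigma if sigma else 0.0
    return normalized
```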
[11]:
import mutopia.analysis as mu
from mutopia.utils import fill_jinja_template
template="""
name: {{name}}
dtype: SBS
chromsizes: {{chromsizes}}
blacklist: {{blacklist}}
fasta: {{fasta}}
region_size: {{region_size}}
features:
GCContent:
normalization: standardize
sources:
- {{ files["GCContent"] }}
Transcription:
normalization: gex
column: 5
sources:
- {{ files["Transcription"] }}
GeneStrand:
normalization: strand
column: 6
sources:
- {{ files["Transcription"] }}
ATAC_accessible:
normalization: categorical
sources:
- {{ files["ATAC_accessible"] }}
H3K27ac:
normalization: log1p_cpm
sources:
- {{ files["H3K27me3"] }}
"""
To configure a feature, the basic format is:
features:
<feature_name: str>:
normalization: <str>
column: <Optional[int], default=4>
null: <Optional[int]>
group: <Optional[str], default="all">
classes: <Optional[List[str]]>
sources:
- <path or URL>
Where:
normalization is outlined above.
column is applicable for bed files and specifies which column to extract the data from.
null indicates which value corresponds to no data in a bed or bedGraph file.
classes specifies a class ordering for the categories found in a bed file.
group can be used to set exclusive interaction groups between features. The default behavior is no grouping.
sources is a list of files/URLs.
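The schema above is simple enough to sanity-check before composing. A small validator sketch (illustrative only, not a MuTopia API; the allowed normalizations are the ones listed earlier):

```python
# Normalizations listed in the previous section; bed-only ones first.
VALID_NORMALIZATIONS = {
    "categorical", "mesoscale", "strand",
    "log1p_cpm", "standardize", "quantile", "power", "robust",
    "gex",
}

def validate_feature(name, spec):
    """Check one entry of the `features` mapping against the format above.
    Returns a list of problems (empty means the entry looks valid)."""
    problems = []
    if spec.get("normalization") not in VALID_NORMALIZATIONS:
        problems.append(f"{name}: unknown normalization {spec.get('normalization')!r}")
    sources = spec.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append(f"{name}: 'sources' must be a non-empty list")
    if "column" in spec and not isinstance(spec["column"], int):
        problems.append(f"{name}: 'column' must be an integer")
    return problems
```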
3. Fill the template using annotations¶
Now we’ll populate our template with actual file paths and configuration values. The fill_jinja_template function takes our template and replaces the placeholders with real values. I like to define configuration files for each celltype so I can dynamically generate config files for different scenarios.
Note: If you don’t have to preprocess your data in any way (like it’s available as a bigWig, ready to ingest), you can simply provide a URL! MuTopia will download it for you.
[12]:
from mutopia.utils import fill_jinja_template
config = fill_jinja_template(
template,
name="example_gtensor",
chromsizes="tutorial_data/hg38.mainchroms.sizes",
fasta="tutorial_data/hg38.fa",
blacklist="tutorial_data/hg38-blacklist.v2.sorted.bed.gz",
region_size=10_000,
files={
"GCContent": "tutorial_data/at_gc_ratio.hg38.bedgraph.gz",
"Transcription": "data/LungExpression.bed",
"ATAC_accessible": "data/LungATAC.bed",
"H3K27me3" : "https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig",
}
)
print(config)
name: example_gtensor
dtype: SBS
chromsizes: tutorial_data/hg38.mainchroms.sizes
blacklist: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
fasta: tutorial_data/hg38.fa
region_size: 10000
features:
GCContent:
normalization: standardize
sources:
- tutorial_data/at_gc_ratio.hg38.bedgraph.gz
Transcription:
normalization: gex
column: 5
sources:
- data/LungExpression.bed
GeneStrand:
normalization: strand
column: 6
sources:
- data/LungExpression.bed
ATAC_accessible:
normalization: categorical
sources:
- data/LungATAC.bed
H3K27ac:
normalization: log1p_cpm
sources:
- https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig
Save the configuration to a YAML file that we’ll use in the next step:
[13]:
with open("example_gtensor.yaml", "w") as f:
    f.write(config)
4. Build the G-Tensor object and ingest the annotations¶
With our configuration ready, we can now build the G-Tensor using the gtensor compose command. This step:
Creates the tensor structure: Sets up the multi-dimensional array
Downloads remote files: Fetches any BigWig or other remote annotations
Ingests features: Ingests features using the correct paths depending on the filetype and normalization strategy
The -w 1 flag specifies using 1 worker process. For larger datasets, you can increase this number to parallelize the work. This command runs a luigi command in the background which can resume the pipeline upon any failures.
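This resume-on-rerun behaviour follows the common file-based checkpointing pattern: each step declares an output file, and a step whose output already exists is skipped. A minimal stdlib sketch of the pattern (the general idea, not luigi’s actual implementation):

```python
from pathlib import Path

def run_step(output, build):
    """Run `build` only if `output` does not already exist, so a rerun
    of the pipeline resumes after the last completed step."""
    output = Path(output)
    if output.exists():
        return False              # step already done: skip it
    tmp = Path(str(output) + ".tmp")
    build(tmp)                    # write to a temp path first ...
    tmp.rename(output)            # ... then atomically mark the step done
    return True
```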
[14]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings
INFO: Mutopia:Creating regions with cutouts for: GeneStrand, ATAC_accessible, Transcription
INFO: Mutopia:Creating genomic regions ...
INFO: Mutopia:Using chromosomes:
chr1, chr2, chr3, chr4,
chr5, chr6, chr7, chr8,
chr9, chr11, chr10, chr12,
chr13, chr14, chr15, chr16,
chr17, chr18, chr20, chr19,
chr22, chr21
INFO: Mutopia:Making initial coarse-grained regions ...
INFO: Mutopia:Building regions ...
INFO: Mutopia:Wrote 25000 windows ...
INFO: Mutopia:Wrote 50000 windows ...
INFO: Mutopia:Wrote 75000 windows ...
INFO: Mutopia:Wrote 100000 windows ...
INFO: Mutopia:Wrote 125000 windows ...
INFO: Mutopia:Wrote 150000 windows ...
INFO: Mutopia:Wrote 175000 windows ...
INFO: Mutopia:Wrote 200000 windows ...
INFO: Mutopia:Wrote 225000 windows ...
INFO: Mutopia:Wrote 250000 windows ...
INFO: Mutopia:Wrote 275000 windows ...
INFO: Mutopia:Wrote 300000 windows ...
INFO: Mutopia:Wrote 325000 windows ...
INFO: Mutopia:Wrote 350000 windows ...
Window size report
----------------------
Num windows | 353685
Smallest | 101
Largest | 10000
Quantile=0.1 | 420
Quantile=0.25 | 5174
Quantile=0.5 | 10000
Quantile=0.75 | 10000
Quantile=0.9 | 10000
INFO: Mutopia:Calculating context frequencies ...
Aggregating trinucleotide content: 353685it [08:47, 670.02it/s]
INFO: Mutopia:Formatting G-Tensor ...
INFO: Mutopia:Writing G-Tensor to example_gtensor.nc, and accompanying regions file to example_gtensor.nc.regions.bed ...
Make sure not to leave the regions file behind if you move the G-Tensor!.
INFO: Mutopia:Ingesting data from tutorial_data/at_gc_ratio.hg38.bedgraph.gz ...
INFO: Mutopia:First values of feature GCContent: 0.35, 0.31, 0.31, 0.31, 0.23
First non-null values: 0.35, 0.31, 0.31, 0.31, 0.23
INFO: Mutopia:Added feature: GCContent
INFO: Mutopia:Ingesting data from downloads/ENCFF440HPT.bigWig ...
INFO: Mutopia:First values of feature H3K27ac: 0.07, 0.08, 0.07, 0.14, 0.05
First non-null values: 0.07, 0.08, 0.07, 0.14, 0.05
INFO: Mutopia:Added feature: H3K27ac
INFO: Mutopia:Ingesting data from data/LungExpression.bed ...
INFO: Mutopia:First values of feature Transcription: nan, nan, nan, nan, nan
First non-null values: 2.87, 2.87, 2.87, 2.87, 2.87
INFO: Mutopia:Added feature: Transcription
INFO: Mutopia:Ingesting data from data/LungATAC.bed ...
INFO: Mutopia:Found classes: "None", "Peak"
INFO: Mutopia:Added feature: ATAC_accessible
Note: Make sure the command above runs to completion; you can check by executing it again and confirming that it does not try to launch additional tasks.
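For intuition, the window counts in the report come from tiling each chromosome into region_size bins; the sub-10 kb windows arise at chromosome ends and around the cutouts mentioned in the log. A toy tiling sketch (illustrative, not MuTopia’s region builder):

```python
def tile_chromosome(chrom, length, region_size=10_000):
    """Tile one chromosome into fixed-size windows; the last window is
    truncated at the chromosome end, which is one source of the
    sub-10 kb windows in the size report above."""
    return [
        (chrom, start, min(start + region_size, length))
        for start in range(0, length, region_size)
    ]
```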
[15]:
!gtensor compose example_gtensor.yaml -w 1
/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
warnings.warn(
INFO:luigi-interface:Loaded []
INFO:luigi:logging configured by default settings
5. Ingest samples¶
Now we’ll add some samples to our G-Tensor. The gtensor sample add command processes VCF/BCF files and extracts all of the relevant information from each mutation. Before you start, you need to organize some metadata:
For each mutation in each VCF, you must add an INFO column for the mutation’s apparent VAF. This is simply calculated as (reads supporting variant) / (total reads). We would calculate this internally, but due to the heterogeneity in how different groups store this information, we leave it to the user.
The background mutation rate: this is used to determine a baseline mutation rate for discovering mutation clusters. It need not be tailored to the tissue; I would just use the one we provide.
(Optional) The purity of each sample (if working with tumors). If you have this on hand you should use it, but it’s not necessary.
Add a sample with gtensor sample add, making sure to give each sample a unique ID, and to set the “sample-weight” to the reciprocal of purity: \(\text{sample weight} = \frac{1}{\text{purity}}\)
vcf_path="path/to/your/vcf-or-bcf"
reciprocal_purity=1.25 # e.g. for a sample with 80% purity
gtensor sample add \
-id "example.001" \
--sample-weight $reciprocal_purity \
--mutation-rate-file "tutorial_data/mutation-rate.hg38.bedgraph.gz" \
--pass-only \
example_gtensor.nc \
$vcf_path
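The reciprocal-purity weight can be computed with a tiny helper (hypothetical, not part of MuTopia), defaulting to 1.0 when purity is unknown, since the tutorial treats purity as optional:

```python
def sample_weight(purity=None):
    """Return 1/purity as the sample weight; default to 1.0 when
    purity is unknown (purity is optional in this tutorial)."""
    if purity is None:
        return 1.0
    if not 0.0 < purity <= 1.0:
        raise ValueError(f"purity must be in (0, 1], got {purity}")
    return 1.0 / purity
```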
Let’s examine the structure and contents of our newly created G-Tensor:
[17]:
!gtensor info example_gtensor.nc
Num features: 5
Num samples: 0
Dataset name: example_gtensor
Dataset dimensions:
configuration: 2
context: 96
locus: 353685
clustered: 2
Dataset attributes:
dtype: SBS
regions_file: example_gtensor.nc.regions.bed
genome_file: tutorial_data/hg38.mainchroms.sizes
fasta_file: tutorial_data/hg38.fa
blacklist_file: tutorial_data/hg38-blacklist.v2.sorted.bed.gz
region_size: 10000
coordinates: mutation
You can check out the features you ingested with:
[18]:
!gtensor feature ls example_gtensor.nc
Feature name | normalization | group
----------------+---------------+------
GCContent | standardize | all
H3K27ac | log1p_cpm | all
Transcription | gex | all
ATAC_accessible | categorical | all
GeneStrand | strand | all
6. Manipulating G-Tensors¶
[19]:
!gtensor feature --help
Usage: gtensor feature [OPTIONS] COMMAND [ARGS]...
Manage features in G-Tensor datasets.
Features represent genomic annotations or experimental data values
associated with each genomic region in the G-Tensor.
Options:
--help Show this message and exit.
Commands:
add Add features to a G-Tensor
edit Edit feature attributes
ls List features in a G-Tensor
rm Remove features from a G-Tensor
You can also slice and subset G-Tensors using the slice commands:
[20]:
!gtensor slice --help
Usage: gtensor slice [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
regions Extract specific genomic regions from a G-Tensor dataset.
samples Extract specific samples from a G-Tensor dataset into a new file.
Check out all the functionalities using:
[21]:
!gtensor --help
Usage: gtensor [OPTIONS] COMMAND [ARGS]...
Manage G-Tensor datasets for genomic data analysis.
G-Tensors are structured genomic datasets that organize samples, features,
and observations across genomic loci for machine learning applications.
Options:
--help Show this message and exit.
Commands:
compose Run the G-Tensor construction pipeline
convert Convert a G-Tensor to a different modality
create Create a new G-Tensor
feature Manage features in a G-Tensor
info Display information about a G-Tensor
offsets Manage genomic locus exposure offsets
sample Manage samples in a G-Tensor
set-attr Set attributes on a G-Tensor
slice Slice a G-Tensor by samples or regions
split Split a G-Tensor into training and test sets
utils Utility functions for G-Tensors