{ "cells": [ { "cell_type": "markdown", "id": "bc42e37d", "metadata": {}, "source": [ "# Tutorial 1: Building G-Tensors\n", "\n", "This tutorial demonstrates how to build a genomic tensor (G-Tensor) from scratch using MuTopia.\n", "\n", "## Overview\n", "\n", "A G-Tensor is a multi-dimensional data structure that integrates various genomic features and annotations with mutational topography data. This tutorial will walk you through:\n", "\n", "1. **Data Preparation**: Downloading and preprocessing genomic annotations\n", "2. **Configuration**: Setting up a YAML configuration template\n", "3. **G-Tensor Construction**: Building the tensor object and ingesting data\n", "\n", "## Prerequisites\n", "\n", "Before starting this tutorial, ensure you have:\n", "- MuTopia package installed\n", "- Download the pre-compiled data to the `tutorial_data` directory\n", "- Download or symlink the hg38 reference genome to `tutorial_data/hg38.fa.gz`\n", "\n", "## 1. Download and preprocess feature annotations\n", "\n", "The first step in building a G-Tensor is to gather genomic annotations that will serve as features for our model. We'll download:\n", "\n", "- **Gene expression data** from ENCODE for Lung cells\n", "- **ATAC-seq peaks** representing chromatin accessibility\n", "- **Histone modification data** (H3K27me3) for epigenetic state\n", "\n", "A real attempt at modeling would require many more features, usually, but this will suffice for a demo.\n", "\n", "### 1.1 Download gene expression data\n", "\n", "First, we'll download RNA-seq quantification data from ENCODE and convert it into a format suitable for our G-Tensor. The `download_expression` function downloads TSV files and extracts gene coordinates and expression values - we use transcripts per million (called pme_TPM by ENCODE)." 
] }, { "cell_type": "code", "execution_count": 9, "id": "856c0218", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO: Mutopia:Downloading GTF file: MANE.GRCh38.v1.3.ensembl_genomic.gtf ...\n", "INFO: Mutopia:Creating annotation file: MANE.GRCh38.annotation.bed ...\n", "INFO: Mutopia:Writing quantitation file: <_io.TextIOWrapper name='' mode='w' encoding='utf-8'> ...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "chr1\t65418\t71584\tENSG00000186092\t0.03\t+\n", "chr1\t450739\t451677\tENSG00000284733\t0.06\t-\n", "chr1\t685715\t686653\tENSG00000284662\t0.06\t-\n" ] } ], "source": [ "%%bash\n", "function download_expression {\n", " local url=\"${1}\"\n", "\n", " ( set -o pipefail\n", " curl -fsSL \"$url\" \\\n", " | awk -F '\\t' 'NR>1 { print $1 \"\\t\" $10 }' \\\n", " | gtensor utils make-expression-bedfile\n", " )\n", "}\n", "\n", "mkdir -p data\n", "download_expression \"https://www.encodeproject.org/files/ENCFF769IPE/@@download/ENCFF769IPE.tsv\" > data/LungExpression.bed\n", "head -n3 data/LungExpression.bed" ] }, { "cell_type": "code", "execution_count": 10, "id": "f589ed75", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr1\t42005\t42399\tPeak\n", "chr1\t778532\t778983\tPeak\n", "chr1\t817239\t817623\tPeak\n" ] } ], "source": [ "%%bash\n", "function download_atac {\n", " url=$1\n", " wget $url -O - 2> /dev/null \\\n", " | gzip -dc \\\n", " | sort -k1,1 -k2,2n \\\n", " | awk -v OFS=\"\\t\" '{print $1,$2,$3,\"Peak\"}'\n", "}\n", "\n", "download_atac \"https://www.encodeproject.org/files/ENCFF818EYD/@@download/ENCFF818EYD.bed.gz\" > \"data/LungAtac.bed\"\n", "head -n3 \"data/LungAtac.bed\"" ] }, { "cell_type": "markdown", "id": "a55a2b88", "metadata": {}, "source": "## 2. Specify a configuration template\n\nNow we'll create a YAML configuration template that defines how our G-Tensor should be structured. 
This template uses Jinja2 templating to make it flexible and reusable.\n\nThe configuration specifies:\n- **Data type**: SBS\n- **Reference files**: Genome sizes, blacklisted regions, and FASTA sequence\n- **Features**: Different types of genomic annotations with their normalization methods\n- **Region size**: Size of genomic windows (10 kb in this example)\n\nThere are many different normalization strategies for the features, depending on the input data format:\n- [Bed]\n - categorical: A discrete region of the genome that is modeled using the nonlinear GBT estimators\n - mesoscale: A discrete region of the genome that is allowed to influence the signature distributions\n - strand: A discrete region of the genome whose feature values are either \"+\" or \"-\", marking the strand orientation of a genomic element\n- [Bed, Bedgraph, BigWig]\n - log1p_cpm: log1p-transformed counts per million\n - standardize: z-score normalization\n - quantile: uniform quantile normalization\n - power: variance-stabilizing power transformation\n - robust: robust z-score, which trims outliers\n\nFinally, the `gex` normalization is a little bit special. It must be applied when you're working with expression-style data since it normalizes the data per gene rather than across loci. 
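To make that concrete, here is a toy sketch (not MuTopia's actual `gex` implementation) of per-gene normalization: z-scoring along the gene axis puts a lowly and a highly expressed gene on the same scale across cell types.

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = cell types (made-up values)
expr = np.array([
    [1.0, 2.0, 3.0],        # lowly expressed gene
    [100.0, 200.0, 300.0],  # highly expressed gene
])

# Per-gene z-score (axis=1): each gene is centered and scaled across
# cell types, so the 100x difference in absolute expression drops out
per_gene = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

print(per_gene)  # both rows become identical z-score profiles
```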
If you don't specify `gex` for expression data, the values won't be comparable across cell types.\n" }, { "cell_type": "code", "execution_count": 11, "id": "e5423dbe", "metadata": {}, "outputs": [], "source": [ "import mutopia.analysis as mu\n", "from mutopia.utils import fill_jinja_template\n", "\n", "template=\"\"\"\n", "name: {{name}}\n", "dtype: SBS\n", "chromsizes: {{chromsizes}}\n", "blacklist: {{blacklist}}\n", "fasta: {{fasta}}\n", "region_size: {{region_size}}\n", "\n", "features:\n", " GCContent:\n", " normalization: standardize\n", " sources:\n", " - {{ files[\"GCContent\"] }}\n", "\n", " Transcription:\n", " normalization: gex\n", " column: 5\n", " sources: \n", " - {{ files[\"Transcription\"] }}\n", "\n", " GeneStrand:\n", " normalization: strand\n", " column: 6\n", " sources:\n", " - {{ files[\"Transcription\"] }}\n", "\n", " ATAC_accessible:\n", " normalization: categorical\n", " sources:\n", " - {{ files[\"ATAC_accessible\"] }}\n", "\n", " H3K27ac:\n", " normalization: log1p_cpm\n", " sources:\n", " - {{ files[\"H3K27me3\"] }}\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "c888ebdd", "metadata": {}, "source": "To configure a feature, the basic format is:\n```\nfeatures:\n  <feature_name>:\n    normalization: <normalization>\n    column: <column_number>\n    null: <null_value>\n    group: <group_name>\n    classes: <class_ordering>\n    sources:\n      - <file_or_url>\n```\nWhere:\n* `normalization` is one of the strategies outlined above.\n* `column` applies to bed files and specifies which column to extract the data from.\n* `null` indicates which value corresponds to no data in a bed or bedGraph file.\n* `classes` specifies a class ordering for the categories found in a bed file.\n* `group` sets exclusive interaction groups between features; by default, no grouping is applied.\n* `sources` is a list of files/URLs.\n\n## 3. Fill the template using annotations\n\nNow we'll populate our template with actual file paths and configuration values. 
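If you're curious what the filling step amounts to, it's ordinary Jinja2 rendering. A standalone sketch with plain `jinja2` (assuming `fill_jinja_template` behaves like `Template.render`; the mini-template here is illustrative, not the full config):

```python
from jinja2 import Template

# Illustrative mini-template mirroring the structure of the config above
mini_template = """name: {{name}}
region_size: {{region_size}}
sources:
  - {{ files["Transcription"] }}"""

rendered = Template(mini_template).render(
    name="example_gtensor",
    region_size=10_000,
    files={"Transcription": "data/LungExpression.bed"},
)
print(rendered)
```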
The `fill_jinja_template` function takes our template and replaces the placeholders with real values.\nI like to keep a single template and fill it with per-cell-type values, so I can dynamically generate config files for different scenarios.\n\n**Note**: If you don't have to preprocess your data in any way (e.g., it's already available as a bigWig, ready to ingest), you can simply provide a URL! MuTopia will download it for you." }, { "cell_type": "code", "execution_count": 12, "id": "50dfc0b1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "name: example_gtensor\n", "dtype: SBS\n", "chromsizes: tutorial_data/hg38.mainchroms.sizes\n", "blacklist: tutorial_data/hg38-blacklist.v2.sorted.bed.gz\n", "fasta: tutorial_data/hg38.fa\n", "region_size: 10000\n", "\n", "features:\n", " GCContent:\n", " normalization: standardize\n", " sources:\n", " - tutorial_data/at_gc_ratio.hg38.bedgraph.gz\n", "\n", " Transcription:\n", " normalization: gex\n", " column: 5\n", " sources: \n", " - data/LungExpression.bed\n", "\n", " GeneStrand:\n", " normalization: strand\n", " column: 6\n", " sources:\n", " - data/LungExpression.bed\n", "\n", " ATAC_accessible:\n", " normalization: categorical\n", " sources:\n", " - data/LungATAC.bed\n", "\n", " H3K27ac:\n", " normalization: log1p_cpm\n", " sources:\n", " - https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig\n" ] } ], "source": [ "from mutopia.utils import fill_jinja_template\n", "\n", "config = fill_jinja_template(\n", " template,\n", " name=\"example_gtensor\",\n", " chromsizes=\"tutorial_data/hg38.mainchroms.sizes\",\n", " fasta=\"tutorial_data/hg38.fa\",\n", " blacklist=\"tutorial_data/hg38-blacklist.v2.sorted.bed.gz\",\n", " region_size=10_000,\n", " files={\n", " \"GCContent\": \"tutorial_data/at_gc_ratio.hg38.bedgraph.gz\",\n", " \"Transcription\": \"data/LungExpression.bed\",\n", " \"ATAC_accessible\": \"data/LungATAC.bed\",\n", " \"H3K27me3\" : 
\"https://www.encodeproject.org/files/ENCFF440HPT/@@download/ENCFF440HPT.bigWig\",\n", " }\n", ")\n", "\n", "print(config)" ] }, { "cell_type": "code", "execution_count": 13, "id": "8ee088b6", "metadata": {}, "outputs": [], "source": [ "with open(\"example_gtensor.yaml\", \"w\") as f:\n", " f.write(config)" ] }, { "cell_type": "markdown", "id": "7496b7c3", "metadata": {}, "source": [ "Save the configuration to a YAML file that we'll use in the next step:\n", "\n", "## 4. Build the G-Tensor object and ingest the annotations\n", "\n", "With our configuration ready, we can now build the G-Tensor using the `gtensor compose` command. This step:\n", "\n", "1. **Creates the tensor structure**: Sets up the multi-dimensional array\n", "2. **Downloads remote files**: Fetches any BigWig or other remote annotations\n", "3. **Ingests features**: Ingests features using the correct paths depending on the filetype and normalization strategy\n", "\n", "The `-w 1` flag specifies using 1 worker process. For larger datasets, you can increase this number to parallelize the work.\n", "This command runs a luigi command in the background which can resume the pipeline upon any failures.\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "2727fa4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).\n", " warnings.warn(\n", "INFO:luigi-interface:Loaded []\n", "INFO:luigi:logging configured by default settings\n", "INFO: Mutopia:Creating regions with cutouts for: GeneStrand, ATAC_accessible, Transcription\n", "INFO: Mutopia:Creating genomic regions ...\n", "INFO: Mutopia:Using chromosomes: \n", "\tchr1, chr2, chr3, chr4,\n", "\tchr5, chr6, chr7, chr8,\n", "\tchr9, chr11, chr10, chr12,\n", "\tchr13, chr14, chr15, chr16,\n", "\tchr17, 
chr18, chr20, chr19,\n", "\tchr22, chr21\n", "INFO: Mutopia:Making initial coarse-grained regions ...\n", "INFO: Mutopia:Building regions ...\n", "INFO: Mutopia:Wrote 25000 windows ...\n", "INFO: Mutopia:Wrote 50000 windows ...\n", "INFO: Mutopia:Wrote 75000 windows ...\n", "INFO: Mutopia:Wrote 100000 windows ...\n", "INFO: Mutopia:Wrote 125000 windows ...\n", "INFO: Mutopia:Wrote 150000 windows ...\n", "INFO: Mutopia:Wrote 175000 windows ...\n", "INFO: Mutopia:Wrote 200000 windows ...\n", "INFO: Mutopia:Wrote 225000 windows ...\n", "INFO: Mutopia:Wrote 250000 windows ...\n", "INFO: Mutopia:Wrote 275000 windows ...\n", "INFO: Mutopia:Wrote 300000 windows ...\n", "INFO: Mutopia:Wrote 325000 windows ...\n", "INFO: Mutopia:Wrote 350000 windows ...\n", "Window size report\n", "----------------------\n", "Num windows | 353685\n", "Smallest | 101\n", "Largest | 10000 \n", "Quantile=0.1 | 420\n", "Quantile=0.25 | 5174\n", "Quantile=0.5 | 10000\n", "Quantile=0.75 | 10000\n", "Quantile=0.9 | 10000\n", "INFO: Mutopia:Calculating context frequencies ...\n", "Aggregating trinucleotide content: 353685it [08:47, 670.02it/s] \n", "INFO: Mutopia:Formatting G-Tensor ...\n", "INFO: Mutopia:Writing G-Tensor to example_gtensor.nc, and accompanying regions file to example_gtensor.nc.regions.bed ...\n", "Make sure not to leave the regions file behind if you move the G-Tensor!.\n", "INFO: Mutopia:Ingesting data from tutorial_data/at_gc_ratio.hg38.bedgraph.gz ...\n", "INFO: Mutopia:First values of feature GCContent: 0.35, 0.31, 0.31, 0.31, 0.23\n", "First non-null values: 0.35, 0.31, 0.31, 0.31, 0.23\n", "INFO: Mutopia:Added feature: GCContent\n", "INFO: Mutopia:Ingesting data from downloads/ENCFF440HPT.bigWig ...\n", "INFO: Mutopia:First values of feature H3K27ac: 0.07, 0.08, 0.07, 0.14, 0.05\n", "First non-null values: 0.07, 0.08, 0.07, 0.14, 0.05\n", "INFO: Mutopia:Added feature: H3K27ac\n", "INFO: Mutopia:Ingesting data from data/LungExpression.bed ...\n", "INFO: Mutopia:First values 
of feature Transcription: nan, nan, nan, nan, nan\n", "First non-null values: 2.87, 2.87, 2.87, 2.87, 2.87\n", "INFO: Mutopia:Added feature: Transcription\n", "INFO: Mutopia:Ingesting data from data/LungATAC.bed ...\n", "INFO: Mutopia:Found classes: \"None\", \"Peak\"\n", "INFO: Mutopia:Added feature: ATAC_accessible\n" ] } ], "source": [ "!gtensor compose example_gtensor.yaml -w 1" ] }, { "cell_type": "markdown", "id": "6b5c12bd", "metadata": {}, "source": [ "**Note:** Make sure the command above runs to completion; you can check by executing it again and confirming that it does not try to launch additional tasks." ] }, { "cell_type": "code", "execution_count": 15, "id": "c27e456d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/allen/miniconda3/envs/my-mutopia/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).\n", " warnings.warn(\n", "INFO:luigi-interface:Loaded []\n", "INFO:luigi:logging configured by default settings\n" ] } ], "source": [ "!gtensor compose example_gtensor.yaml -w 1" ] }, { "cell_type": "markdown", "id": "bfa84071", "metadata": {}, "source": [ "## 5. Ingest samples\n", "\n", "Now we'll add some samples to our G-Tensor. The `gtensor sample add` command processes VCF/BCF files and extracts all of the relevant information from each mutation. Before you start, you need to organize some metadata:\n", "1. For each mutation in each VCF, you must add an `INFO` field for the mutation's apparent VAF. This is simply calculated as: (reads supporting variant)/(total reads). We would calculate this internally, but due to the heterogeneity in how different groups store this information, we leave it to the user.\n", "2. The background mutation rate: this is used as a baseline rate for discovering mutation clusters. 
It need not be tailored to the tissue; the one we provide works fine.\n", "3. (Optional) The purity of each sample (if working with tumors). If you have this on hand you should use it, but it's not necessary.\n", "\n", "Add a sample with `gtensor sample add`, making sure to give each sample a unique ID, and to set the \"sample-weight\" to the reciprocal of purity:\n", "$\\text{sample weight} = \\frac{1}{\\text{purity}}$\n", "\n", "```bash\n", "\n", "vcf_path=\"path/to/your/vcf-or-bcf\"\n", "reciprocal_purity=1.25 # e.g. for a sample with 80% purity\n", "\n", "gtensor sample add \\\n", " -id \"example.001\" \\\n", " --sample-weight \"$reciprocal_purity\" \\\n", " --mutation-rate-file \"tutorial_data/mutation-rate.hg38.bedgraph.gz\" \\\n", " --pass-only \\\n", " example_gtensor.nc \\\n", " \"$vcf_path\"\n", "```" ] }, { "cell_type": "markdown", "id": "fad07aba", "metadata": {}, "source": [ "Let's examine the structure and contents of our newly created G-Tensor:\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "1fe13af4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num features: 5\n", "Num samples: 0\n", "Dataset name: example_gtensor\n", "Dataset dimensions:\n", "\tconfiguration: 2\n", "\tcontext: 96\n", "\tlocus: 353685\n", "\tclustered: 2\n", "Dataset attributes:\n", "\tdtype: SBS\n", "\tregions_file: example_gtensor.nc.regions.bed\n", "\tgenome_file: tutorial_data/hg38.mainchroms.sizes\n", "\tfasta_file: tutorial_data/hg38.fa\n", "\tblacklist_file: tutorial_data/hg38-blacklist.v2.sorted.bed.gz\n", "\tregion_size: 10000\n", "\tcoordinates: mutation\n" ] } ], "source": [ "!gtensor info example_gtensor.nc" ] }, { "cell_type": "markdown", "id": "58ed2168", "metadata": {}, "source": [ "You can check out the features you ingested with:" ] }, { "cell_type": "code", "execution_count": 18, "id": "d4771c68", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature name | normalization | 
group\n", "----------------+---------------+------\n", "GCContent | standardize | all \n", "H3K27ac | log1p_cpm | all \n", "Transcription | gex | all \n", "ATAC_accessible | categorical | all \n", "GeneStrand | strand | all \n" ] } ], "source": [ "!gtensor feature ls example_gtensor.nc" ] }, { "cell_type": "markdown", "id": "c5f2a90a", "metadata": {}, "source": "## 6. Manipulating G-Tensors\n\nThe G-Tensor CLI contains many useful functionalities for editing and munging G-Tensors from the terminal. The CLI structure is hierarchical and (hopefully) self-documenting. \n\\\nFor samples, offsets - more on this later, and features, you can view information, add, remove, and edit:" }, { "cell_type": "code", "execution_count": 19, "id": "6962bc99", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: gtensor feature [OPTIONS] COMMAND [ARGS]...\n", "\n", " Manage features in G-Tensor datasets.\n", "\n", " Features represent genomic annotations or experimental data values\n", " associated with each genomic region in the G-Tensor.\n", "\n", "Options:\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " add Add features to a G-Tensor\n", " edit Edit feature attributes\n", " ls List features in a G-Tensor\n", " rm Remove features from a G-Tensor\n" ] } ], "source": [ "!gtensor feature --help" ] }, { "cell_type": "markdown", "id": "6fe8bdb0", "metadata": {}, "source": [ "You can also slice and subset G-Tensors using the `slice` commands:" ] }, { "cell_type": "code", "execution_count": 20, "id": "bd6508c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: gtensor slice [OPTIONS] COMMAND [ARGS]...\n", "\n", "Options:\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " regions Extract specific genomic regions from a G-Tensor dataset.\n", " samples Extract specific samples from a G-Tensor dataset into a new file.\n" ] } ], "source": [ "!gtensor slice --help" ] }, { 
"cell_type": "markdown", "id": "558c5388", "metadata": {}, "source": [ "**Check out all the functionalities using:**" ] }, { "cell_type": "code", "execution_count": 21, "id": "b87e77d1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: gtensor [OPTIONS] COMMAND [ARGS]...\n", "\n", " Manage G-Tensor datasets for genomic data analysis.\n", "\n", " G-Tensors are structured genomic datasets that organize samples, features,\n", " and observations across genomic loci for machine learning applications.\n", "\n", "Options:\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " compose Run the G-Tensor construction pipeline\n", " convert Convert a G-Tensor to a different modality\n", " create Create a new G-Tensor\n", " feature Manage features in a G-Tensor\n", " info Display information about a G-Tensor\n", " offsets Manage genomic locus exposure offsets\n", " sample Manage samples in a G-Tensor\n", " set-attr Set attributes on a G-Tensor\n", " slice Slice a G-Tensor by samples or regions\n", " split Split a G-Tensor into training and test sets\n", " utils Utility functions for G-Tensors\n" ] } ], "source": [ "!gtensor --help" ] } ], "metadata": { "kernelspec": { "display_name": "my-mutopia", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }