GTensor API

The GTensor module supports genomic tensor analysis. It provides functionality for creating, manipulating, and analyzing genomic tensors, including loading datasets, applying transformations, and generating explanations for model components.

GTensors are hierarchical, multi-dimensional arrays designed to represent complex genomic data structures. They can be sliced along multiple dimensions and support lazy loading for memory efficiency.

Use the GTensor CLI tool to build and interact with GTensor datasets from the command line; the Python API is mostly intended for analysis and visualization.

Below is an example of a GTensor as displayed in a Jupyter notebook:

<xarray.Dataset> Size: 1GB
Dimensions:                         (configuration: 2, locus: 55220,
                                     context: 96, component: 18,
                                     shap_features: 19, shap_component: 18,
                                     shap_locus: 2000, genome_state: 5,
                                     feature: 19, sample: 736,
                                     mesoscale_state: 9)
Coordinates:
  * configuration                   (configuration) <U12 96B 'C/T-centered' '...
  * locus                           (locus) int64 442kB 2 3 4 ... 65746 65747
  * context                         (context) <U7 3kB 'A[C>A]A' ... 'T[T>C]T'
  * component                       (component) <U3 216B 'M0' 'M1' ... 'M17'
  * shap_features                   (shap_features) <U15 1kB 'ATACAccessible'...
  * shap_component                  (shap_component) <U3 216B 'M0' ... 'M17'
  * shap_locus                      (shap_locus) int64 16kB 24008 ... 22805
  * genome_state                    (genome_state) <U30 600B 'Baseline' ... '...
  * feature                         (feature) <U15 1kB 'ATACAccessible' ... '...
  * sample                          (sample) <U19 56kB 'BR001.purple' ... 'TC...
Dimensions without coordinates: mesoscale_state
Data variables: (12/48)
    Spectra/interactions            (component, genome_state, context) float64 69kB ...
    Spectra/shared_effects          (component, genome_state) float64 720B 0....
    Spectra/spectra                 (component, context, genome_state) float64 69kB ...
    Regions/chrom                   (locus) <U4 884kB 'chr1' 'chr1' ... 'chr1'
    Regions/start                   (locus) int64 442kB 819220 ... 248905601
    Regions/end                     (locus) int64 442kB 820420 ... 248906235
    ...                              ...
    State/mesoscale_idx             (configuration, locus) int64 884kB 4 4 ... 4
    State/log_context_distribution  (component, context, mesoscale_state) float64 124kB ...
    State/locus_features            (locus, feature) float32 4MB 0.0 ... -0.1123
    State/log_locus_distribution    (component, locus) float64 8MB -0.6555 .....
    ploidy                          (sample, locus) float32 0B <COO: nnz=0, fill_value=0.0>
    X                               (sample, configuration, context, locus) float32 329MB <GCXS: nnz=329103, fill_value=0.0>
Attributes:
    name:            breast
    dtype:           sbs
    genome_file:     /n/data1/hms/dbmi/park/ctDNA_loci_project/locusregressio...
    fasta_file:      /n/data1/hms/dbmi/park/SOFTWARE/REFERENCE/hg38/cgap_matc...
    blacklist_file:  /n/data1/hms/dbmi/park/ctDNA_loci_project/locusregressio...
    region_size:     10000
    filename:        retry2.nc
    regions_file:    ENFORM-Breast.nc.regions.bed

Sparse array summary for X:

Format	gcxs
Data Type	float32
Shape	(736, 2, 96, 55220)
nnz	329103
Density	4.2175126691849206e-05
Read-only	True
Size	313.8M
Storage ratio	0.01
Compressed Axes	(0, 3)

Sparse array summary for ploidy:

Format	coo
Data Type	float32
Shape	(736, 55220)
nnz	0
Density	0.0
Read-only	True
Size	0
Storage ratio	0.00
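
The Density entry reported for the gcxs array (shape (736, 2, 96, 55220), nnz 329103) follows from its shape and nnz. A quick check in plain Python, using only those reported numbers; the 4-byte item size is implied by float32, and reading the storage ratio as sparse size divided by dense size is an assumption:

```python
from math import prod

# Values copied from the gcxs summary shown earlier in this section.
shape = (736, 2, 96, 55220)
nnz = 329103

n_cells = prod(shape)        # number of cells in the equivalent dense tensor
density = nnz / n_cells      # fraction of cells that are stored explicitly
dense_bytes = n_cells * 4    # float32 uses 4 bytes per value

print(f"cells:      {n_cells:,}")
print(f"density:    {density:.4e}")
print(f"dense size: {dense_bytes / 1e9:.1f} GB")
```

A dense copy would need roughly 31 GB, compared with the reported 313.8M for the sparse representation, which is consistent with a storage ratio of about 0.01.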

GTensor API Reference

mutopia.gtensor.gtensor.GTensor(modality, *, name, chrom, start, end, context_frequencies, exposures=None, dtype=None)

Create a GTensor dataset for genomic tensor analysis.

This function constructs an xarray Dataset with the standardized structure required for genomic tensor operations, including region coordinates, context frequencies, and metadata.

Parameters:

modality (object) – Modality object containing coordinate information and mode configuration
name (str) – Name identifier for the dataset
chrom (List[str]) – List of chromosome names for each genomic region
start (List[int]) – List of start positions for each genomic region
end (List[int]) – List of end positions for each genomic region
context_frequencies (xr.DataArray) – Array containing context frequency data for each region
exposures (Union[None, NDArray[np.number]], optional) – Exposure values for each region. If None, defaults to ones
dtype (optional) – Data type for the dataset. If None, uses modality.MODE_ID

Returns:

Structured dataset with regions, coordinates, and metadata

Return type:

xr.Dataset

mutopia.gtensor.gtensor.annot_empirical_marginal(dataset, key='empirical_marginal')

Calculate and add empirical marginal mutation rates to a dataset.

This method computes empirical marginal mutation rates by aggregating observed mutations across all samples in the dataset and normalizing by context frequencies and region lengths.

Parameters:

dataset (GTensorDataset) – Dataset containing mutation data to analyze
key (str, default="empirical_marginal") – Base name for the mutation rate variables to be added to the dataset. Creates two variables: {key} and {key}_locus

Returns:

The input dataset with empirical marginal rates added as new variables:
- {key}: Marginal mutation rates normalized by context frequencies
- {key}_locus: Per-locus marginal rates normalized by region length

Return type:

GTensorDataset
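
A minimal sketch of the normalization described above, in plain Python on toy data. The exact formula used by annot_empirical_marginal is not stated in this reference, so the two ratios below are an assumption about what normalizing by context frequencies and by region length means; all variable names are hypothetical.

```python
# Toy data: counts[sample][context] is a list of mutation counts per locus.
counts = {
    "s1": {"A[C>A]A": [1, 0, 2], "A[C>G]A": [0, 1, 0]},
    "s2": {"A[C>A]A": [0, 1, 1], "A[C>G]A": [1, 0, 0]},
}
# context_freq[context] lists how often the context occurs in each region.
context_freq = {"A[C>A]A": [10, 20, 20], "A[C>G]A": [5, 5, 10]}
region_length = [100, 200, 100]

contexts = list(context_freq)

# Assumed per-context rate: mutations summed over samples and loci,
# divided by the total number of occurrences of that context.
marginal = {
    c: sum(sum(counts[s][c]) for s in counts) / sum(context_freq[c])
    for c in contexts
}

# Assumed per-locus rate: mutations summed over samples and contexts,
# divided by the length of the region.
marginal_locus = [
    sum(counts[s][c][i] for s in counts for c in contexts) / length
    for i, length in enumerate(region_length)
]

print(marginal)
print(marginal_locus)
```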

mutopia.gtensor.gtensor.apply_to_samples(data, func, bar=True)

Apply a function to each sample in a dataset with parallel processing.

This function applies a given function to each sample (region) in the dataset, handling the parallelization and aggregation of results. It’s designed for operations that need to process each genomic region independently.

Parameters:

data (GTensorDataset) – Input dataset or data loader containing samples to process
func (callable) – Function to apply to each sample. Should accept a dataset slice and return a result that can be concatenated
bar (bool, default=True) – Whether to display a progress bar during processing

Returns:

Dataset containing the concatenated results from all sample applications

Return type:

GTensorDataset
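
The map-then-collect pattern described above can be sketched with the standard library. This is not the library's implementation: the thread pool, the dict-based toy data, and the helper name apply_to_each are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_to_each(samples, func, max_workers=4):
    """Apply func to every sample and collect the results in input order."""
    names = list(samples)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as its inputs.
        results = list(pool.map(lambda name: func(samples[name]), names))
    return dict(zip(names, results))


# Toy per-sample mutation counts (hypothetical data).
toy_samples = {"s1": [1, 0, 2], "s2": [0, 0, 5], "s3": [3, 3, 3]}
totals = apply_to_each(toy_samples, sum)
print(totals)
```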

mutopia.gtensor.gtensor.dims_except_for(dims, *keepdims)

mutopia.gtensor.gtensor.eager_load(dataset)

Load a dataset eagerly with samples but without state information.

This is a convenience function that loads a dataset with sample data but excludes state information for faster access patterns.

Parameters:

dataset (str or path-like) – Path or identifier for the dataset to load

Returns:

Eager dataset interface with samples loaded into memory

Return type:

GTensorDataset

mutopia.gtensor.gtensor.eager_train_test_load(dataset, *test_chroms)

Load a dataset and perform eager train/test split by chromosomes.

This convenience function combines eager loading with train/test splitting, loading all data into memory for fast access.

Parameters:

dataset (str or path-like) – Path or identifier for the dataset to load
*test_chroms (str) – Chromosome names to reserve for the test set

Returns:

Training and testing dataset interfaces

Return type:

tuple[CorpusInterface, CorpusInterface]

mutopia.gtensor.gtensor.equal_size_quantiles(dataset, var_name, n_bins=10, key=None)

Create equal-size quantile bins for a variable in the dataset.

This function bins the values of a specified variable into quantiles of equal cumulative region length, which is useful for creating balanced genomic bins.

Parameters:

dataset (GTensorDataset) – Dataset containing the variable to bin
var_name (str) – Name of the variable to create quantile bins for
n_bins (int, default=10) – Number of quantile bins to create
key (str, optional) – Custom name for the output bin variable. If None, generates name as ‘{var_name_base}_qbins_{n_bins}’ where var_name_base is the last part of var_name after splitting on ‘/’

Returns:

The input dataset with quantile bins added as a new variable

Return type:

GTensorDataset
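
To illustrate what "quantiles of equal cumulative region length" can mean, here is a small plain-Python sketch. It is one plausible way to form such bins, not necessarily the rule the library uses: the midpoint-based assignment and both helper names are assumptions. The default_key helper mirrors the default naming rule given in the parameter list, applied to a hypothetical variable name.

```python
def equal_length_bins(values, lengths, n_bins=10):
    """Bin loci by value so that bins cover roughly equal total region length."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(lengths)
    bins = [0] * len(values)
    covered = 0  # total length of loci with smaller values
    for i in order:
        # Place the locus by the midpoint of its span on the cumulative axis.
        midpoint = covered + lengths[i] / 2
        bins[i] = min(int(midpoint / total * n_bins), n_bins - 1)
        covered += lengths[i]
    return bins


def default_key(var_name, n_bins):
    """Default output name: last part of var_name after splitting on '/'."""
    return f"{var_name.split('/')[-1]}_qbins_{n_bins}"


values = [0.9, 0.1, 0.5, 0.3]
lengths = [100, 300, 100, 100]
bins = equal_length_bins(values, lengths, n_bins=2)
print(bins)
print(default_key("Features/ATACAccessible", 10))
```

In this toy case the single long locus fills the first bin on its own, while the three short loci share the second, so each bin covers 300 bases even though the locus counts differ.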

mutopia.gtensor.gtensor.excel_report(self, dataset, output, normalization='global')

Generate a comprehensive Excel report with model results.

This method creates an Excel file containing signature data, sample contributions, and SHAP values (if available) across multiple worksheets.

Parameters:

dataset (GTensorDataset) – Dataset containing the model results to export
output (str) – Output file path for the Excel report

Raises:

ImportError – If openpyxl is not installed for Excel writing support

Notes

The Excel file will contain the following sheets:
- Signature_{name}: Normalized signature data for each component
- Sample_contributions: Component contributions per sample (if available)
- SHAP_transformed_features: SHAP feature data (if available)
- SHAP_original_features: Original feature data for SHAP (if available)
- SHAP_values_{component}: SHAP values for each component (if available)

Requires openpyxl to be installed: pip install openpyxl

mutopia.gtensor.gtensor.fetch_component(dataset, component_name)

Retrieve the mutational spectrum for a specific component.

This function extracts the signature spectrum (mutational profile) for a specified component from the dataset. The spectrum describes the relative frequency of different mutation types for this component.

Parameters:

dataset (GTensorDataset) – Dataset containing component spectra
component_name (Union[str, int]) – Name or index of the component to retrieve

Returns:

DataArray containing the component’s mutational spectrum with appropriate dimensions and coordinates

Return type:

xr.DataArray

Raises:

ValueError – If the specified component is not found in the dataset

mutopia.gtensor.gtensor.fetch_features(dataset, *feature_names, source=None)

Extract numerical features from the dataset’s “Features” section.

Parameters:

dataset (GTensorDataset) – Dataset containing feature variables under the “Features” group.
*feature_names (str) – Glob patterns or basenames of features to select. When empty, all numeric features are returned.
source (str, optional) – Restrict selection to features within this source directory. When None, features from all sources are considered.

Returns:

A DataArray with dims (“feature”, “locus”) and coords “locus”, “feature” (full paths), “feature_name” (basenames), and “source”.

Return type:

xarray.DataArray

Notes

All selected features must share a compatible numeric dtype.
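
The selection rules above (glob patterns matched against basenames, an optional source filter, and everything returned when no pattern is given) can be sketched with fnmatch. The path layout "Features/<source>/<name>", the helper name select_features, and every feature name other than ATACAccessible are assumptions for illustration only.

```python
from fnmatch import fnmatchcase


def select_features(paths, *patterns, source=None):
    """Return the feature paths whose basename matches any of the patterns."""
    selected = []
    for path in paths:
        parts = path.split("/")
        # Assumed layout: "Features/<source>/<name>" or "Features/<name>".
        path_source = parts[1] if len(parts) == 3 else None
        if source is not None and path_source != source:
            continue
        name = parts[-1]
        # With no patterns, every feature is kept.
        if not patterns or any(fnmatchcase(name, p) for p in patterns):
            selected.append(path)
    return selected


paths = [
    "Features/breast/ATACAccessible",
    "Features/breast/H3K27ac",
    "Features/liver/ATACAccessible",
    "Features/GCContent",
]
print(select_features(paths, "ATAC*"))
print(select_features(paths, "ATAC*", source="liver"))
```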

mutopia.gtensor.gtensor.fetch_interactions(dataset, component_name)

Retrieve interaction effects for a specific component.

This function extracts the interaction matrix for a specified component, showing how the mutational spectrum varies across different genomic contexts (e.g., strand orientation, replication timing, gene regions).

Parameters:

dataset (GTensorDataset) – Dataset containing component interaction data
component_name (Union[str, int]) – Name or index of the component to retrieve

Returns:

DataArray containing the component’s interaction effects with appropriate dimensions and coordinates

Return type:

xr.DataArray

Raises:

ValueError – If the specified component is not found in the dataset

mutopia.gtensor.gtensor.fetch_shared_effects(dataset, component_name)

Retrieve shared effects for a specific component.

This function extracts the shared effects matrix for a specified component, representing effects that are common across different contexts or conditions. Shared effects capture baseline mutational patterns that don’t vary with genomic features.

Parameters:

dataset (GTensorDataset) – Dataset containing component shared effects data
component_name (Union[str, int]) – Name or index of the component to retrieve

Returns:

DataArray containing the component’s shared effects with appropriate dimensions and coordinates

Return type:

xr.DataArray

Raises:

ValueError – If the specified component is not found in the dataset

mutopia.gtensor.gtensor.fetch_source(dataset, source)

Extract and restructure data for a specific source from a multi-source dataset.

This function filters and reorganizes a dataset to contain only data relevant to a specified source, while maintaining shared features and state variables that are common across all sources.

Parameters:

dataset (GTensorDataset) – The input dataset containing data from multiple sources, organized with hierarchical variable names (e.g., “Features/source/variable”, “State/source/variable”).
source (str) – The name of the source to extract data for. Must be present in the dataset.

Returns:

A new dataset containing:
- Source-specific features and state variables (with paths flattened)
- Shared features and state variables (common to all sources)
- Other data variables from the original dataset
- Updated name attribute reflecting the source
- Source dimension removed if present

Return type:

GTensorDataset

Raises:

ValueError – If the specified source is not found in the dataset.

Notes

The function performs the following transformations:
1. Validates that the source exists in the dataset
2. Separates source-specific and shared variables from Features and State groups
3. Creates a rename mapping to flatten source-specific variable paths
4. Combines source-specific, shared, and other variables into a new dataset
5. Updates dataset attributes and coordinates while removing source dimension
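
Steps 1 to 3 can be sketched on variable names alone. This is an illustration, not the library's code: it assumes source-specific variables look like "Features/<source>/<variable>" (as in the parameter description) and that flattening means dropping the source segment. The helper name and the example variable names are hypothetical.

```python
def source_rename_map(var_names, source, groups=("Features", "State")):
    """Return (rename, kept): flattened names for one source, plus shared/other names."""

    def split(name):
        parts = name.split("/")
        # Source-specific variables have exactly three path segments.
        return parts, parts[0] in groups and len(parts) == 3

    sources = {split(n)[0][1] for n in var_names if split(n)[1]}
    if source not in sources:
        raise ValueError(f"source {source!r} not found; available: {sorted(sources)}")

    rename, kept = {}, []
    for name in var_names:
        parts, source_specific = split(name)
        if not source_specific:
            kept.append(name)  # shared or unrelated variable, kept as-is
        elif parts[1] == source:
            rename[name] = f"{parts[0]}/{parts[2]}"  # drop the source segment
        # Variables that belong to other sources are left out.
    return rename, kept


names = [
    "Features/breast/ATACAccessible",
    "Features/liver/ATACAccessible",
    "Features/GCContent",
    "State/breast/locus_features",
    "Regions/start",
]
rename, kept = source_rename_map(names, "breast")
print(rename)
print(kept)
```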

mutopia.gtensor.gtensor.get_explanation(dataset, component)

Generate SHAP explanations for a specific model component.

This function extracts and formats SHAP values for interpretability analysis, creating an explanation object that can be used with SHAP visualization tools.

Parameters:

dataset (GTensorDataset) – Dataset containing SHAP values and feature information
component (str) – Name of the model component to explain

Returns:

SHAP explanation object with values, features, and display data

Return type:

shap.Explanation

Raises:

ImportError – If SHAP library is not installed
ValueError – If the specified component doesn’t have SHAP values in the dataset

mutopia.gtensor.gtensor.get_shap_summary(data, source=None)

Generate a summary of SHAP values for model components.

This function computes summary statistics for SHAP values across all components, including effect sizes (97th percentile of absolute SHAP values) and correlations between SHAP values and feature values. This provides a high-level view of which features have the strongest associations with each component.

Parameters:

data (GTensorDataset) – Dataset containing SHAP values and feature information
source (str, optional) – Source identifier to analyze. Required if the dataset is a mixture dataset with multiple sources.

Returns:

DataFrame with columns:
- component: Component name
- feature: Feature name
- effect_size: 97th percentile of absolute SHAP values
- correlation: Pearson correlation between SHAP values and feature values

Return type:

pd.DataFrame

Raises:

ValueError – If the dataset is a mixture dataset and no source is specified
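
The two summary statistics can be written out for a single (component, feature) pair in plain Python. The linear-interpolation percentile is an assumption (the reference does not say which percentile method is used), and the toy values are made up.

```python
from math import sqrt


def percentile(values, q):
    """q-th percentile using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy SHAP values and feature values for one component and one feature.
shap_values = [-0.2, -0.1, 0.0, 0.1, 0.2]
feature_values = [1.0, 2.0, 3.0, 4.0, 5.0]

effect_size = percentile([abs(v) for v in shap_values], 97)
correlation = pearson(shap_values, feature_values)
print(effect_size, correlation)
```

Here the effect size says how large the feature's contribution gets, while the sign of the correlation says whether higher feature values push the component up or down.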

mutopia.gtensor.gtensor.infer_source_celltypes(dataset)

Infer source cell types from feature names and assign to dataset coordinates.

This function examines the feature names in the dataset to identify unique source cell types based on directory structure. It then assigns these inferred cell types to the ‘source’ coordinate of the dataset.

Parameters:

dataset (GTensorDataset) – Input dataset containing features with potential source information

Returns:

Dataset with ‘source’ coordinate added, reflecting inferred cell types

Return type:

GTensorDataset

Raises:

ValueError – If no features are found in the dataset to infer sources from

mutopia.gtensor.gtensor.is_mixture_dataset(dataset)

Check if a dataset contains data from multiple sources.

This function determines whether the dataset is a mixture dataset by checking if it contains more than one source. Mixture datasets have source-specific features and require special handling for analysis.

Parameters:

dataset (GTensorDataset) – Input dataset to check

Returns:

True if the dataset contains multiple sources, False otherwise

Return type:

bool

mutopia.gtensor.gtensor.lazy_load(dataset)

Load a dataset lazily without samples or state information.

This is a convenience function that loads a dataset with minimal memory footprint by excluding sample data and state information.

Parameters:

dataset (str or path-like) – Path or identifier for the dataset to load

Returns:

Lazy dataset interface that loads data on demand

Return type:

GTensorDataset

mutopia.gtensor.gtensor.lazy_train_test_load(dataset, *test_chroms)

Load a dataset and perform lazy train/test split by chromosomes.

This convenience function combines lazy loading with train/test splitting, providing memory-efficient access to training and testing data.

Parameters:

dataset (str or path-like) – Path or identifier for the dataset to load
*test_chroms (str) – Chromosome names to reserve for the test set

Returns:

Training and testing dataset slicers

Return type:

tuple[LazySlicer, LazySlicer]

mutopia.gtensor.gtensor.list_components(dataset)

List all component names in the dataset.

This function extracts and returns the names of all model components (mutational signatures or processes) present in the dataset.

Parameters:

dataset (GTensorDataset) – Input dataset containing model components

Returns:

List of component names

Return type:

List[str]

Raises:

ValueError – If the dataset does not contain a ‘component’ coordinate

mutopia.gtensor.gtensor.list_sources(dataset)

List all source identifiers in the dataset.

This function extracts and returns the names of all sources present in the dataset. Sources typically represent different cell types, tissues, or experimental conditions.

Parameters:

dataset (GTensorDataset) – Input dataset containing source information

Returns:

List of source names. Returns an empty list if the dataset has no ‘source’ coordinate defined.

Return type:

List[str]

mutopia.gtensor.gtensor.load_dataset(dataset, with_samples=True, with_state=True)

Load a dataset from disk with configurable loading options.

This function loads a dataset from disk storage. The loading behavior can be customized based on whether samples and state information should be included.

Parameters:

dataset (str or path-like) – Path or identifier for the dataset to load
with_samples (bool, default=True) – Whether to load sample data along with the dataset structure
with_state (bool, default=True) – Whether to load state information (model parameters, etc.)

Returns:

Loaded dataset interface. Returns LazySampleLoader if with_samples=False, otherwise returns CorpusInterface

Return type:

GTensorDataset

mutopia.gtensor.gtensor.make_mixture_dataset(**datasets)

Create a mixed dataset by combining multiple source datasets.

This function merges multiple datasets, renaming their features and state variables to include source identifiers, enabling comparative analysis across different data sources.

Parameters:

**datasets (GTensorDataset) – Named datasets to combine. Keys become source identifiers.

Returns:

Combined dataset with source-specific feature namespaces

Return type:

GTensorDataset
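
The renaming step can be pictured with plain dicts standing in for datasets. This sketch assumes the combined names follow the "Features/<source>/<variable>" pattern described for fetch_source; keeping the first copy of every other variable is an additional assumption, and the helper name is hypothetical.

```python
def namespace_variables(datasets, groups=("Features", "State")):
    """Merge per-source variable dicts, inserting the source name into grouped paths."""
    merged = {}
    for source, variables in datasets.items():
        for name, value in variables.items():
            group, _, rest = name.partition("/")
            if group in groups:
                # "Features/x" from source "breast" becomes "Features/breast/x".
                merged[f"{group}/{source}/{rest}"] = value
            else:
                # Assumption: non-namespaced variables are kept once (first wins).
                merged.setdefault(name, value)
    return merged


# Keyword-style input: the keys become the source identifiers.
mixture = namespace_variables({
    "breast": {"Features/ATACAccessible": [0.1, 0.2], "Regions/start": [0, 1000]},
    "liver": {"Features/ATACAccessible": [0.3, 0.4], "Regions/start": [0, 1000]},
})
print(sorted(mixture))
```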

mutopia.gtensor.gtensor.match_dims(X, **dim_sizes)

mutopia.gtensor.gtensor.mutate_method(func)

Decorator function to modify a dataset in place for class methods.

This decorator allows running mutations on a dataset without disrupting the interface chains, specifically for methods that take ‘self’ as the first parameter.

Parameters:

func (callable) – Method that takes self and dataset as first two arguments and returns a modified dataset

Returns:

Wrapped method that can be used with dataset.mutate()

Return type:

callable

mutopia.gtensor.gtensor.num_sources(dataset)

Get the number of distinct sources in a dataset.

This function counts the number of unique sources present in the dataset’s ‘source’ coordinate, which is useful for determining if the dataset contains data from multiple cell types or conditions.

Parameters:

dataset (GTensorDataset) – Input dataset to query for sources

Returns:

Number of distinct sources in the dataset. Returns 0 if no sources are defined.

Return type:

int

mutopia.gtensor.gtensor.rename_components(dataset, names)

Rename the components of the model and update the dataset coordinates accordingly.

Parameters:

dataset (GTensorDataset) – The dataset containing model components to be renamed.
names (List[str]) – New names for the components. Must have the same length as the number of components in the model.

Returns:

The dataset with updated component names in coordinates.

Return type:

GTensorDataset

Raises:

ValueError – If the number of provided names doesn’t match the number of components.
KeyError – If some components in the dataset’s “shap_component” coordinate don’t match the model components.

Notes

This method also updates the internal _component_names attribute of the model.
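
The two error conditions listed under Raises can be sketched on plain lists of names. The helper below is hypothetical and only mirrors the documented checks; the new component names are illustrative labels.

```python
def renamed_coordinates(component_names, new_names, shap_components=()):
    """Return the new component names and the renamed shap_component entries."""
    if len(new_names) != len(component_names):
        raise ValueError(
            f"expected {len(component_names)} names, got {len(new_names)}"
        )
    mapping = dict(zip(component_names, new_names))
    # A shap_component entry that is not a known component raises KeyError.
    return list(new_names), [mapping[c] for c in shap_components]


components, shap_components = renamed_coordinates(
    ["M0", "M1", "M2"],
    ["Aging", "APOBEC", "M2"],
    shap_components=["M1", "M0"],
)
print(components, shap_components)
```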

mutopia.gtensor.gtensor.slice_regions(dataset, *regions, lazy=False)

Extract genomic regions that overlap with specified intervals.

This function filters the dataset to include only regions that overlap with any of the specified genomic intervals. Intervals can be specified in multiple formats: “chr:start-end”, “chr” (entire chromosome), or a comma-separated list of such specifications.

Parameters:

dataset (GTensorDataset) – Input dataset containing genomic regions
*regions (str) – One or more region specifications, each in one of these formats:
- “chr:start-end” (e.g., “chr1:1000-2000”)
- “chr” (entire chromosome, e.g., “chr1”)
- A comma-separated list of any of the above
lazy (bool, default=False) – Whether to return a lazy slicer instead of materializing the data

Returns:

Filtered dataset containing only overlapping regions

Return type:

GTensorDataset

Raises:

ValueError – If no regions match the specified query intervals
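
A plain-Python sketch of the region grammar and the overlap test described above. Treating intervals as half-open (start inclusive, end exclusive) is an assumption, as are the helper names; the library's own parsing may differ.

```python
def parse_region(spec):
    """Parse "chr:start-end" or "chr" into (chrom, start, end); end=None means whole chromosome."""
    chrom, sep, span = spec.partition(":")
    if not sep:
        return chrom, 0, None
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def overlapping_loci(chroms, starts, ends, *regions):
    """Return indices of loci that overlap any of the queried intervals."""
    queries = [parse_region(s.strip()) for r in regions for s in r.split(",")]
    hits = []
    for i, (chrom, start, end) in enumerate(zip(chroms, starts, ends)):
        for q_chrom, q_start, q_end in queries:
            # Half-open overlap test (assumption): [start, end) vs [q_start, q_end).
            if chrom == q_chrom and (q_end is None or (start < q_end and end > q_start)):
                hits.append(i)
                break
    if not hits:
        raise ValueError(f"no regions match {regions!r}")
    return hits


chroms = ["chr1", "chr1", "chr2"]
starts = [0, 1000, 0]
ends = [1000, 2000, 1000]
print(overlapping_loci(chroms, starts, ends, "chr1:500-1500"))
print(overlapping_loci(chroms, starts, ends, "chr1:1000-2000,chr2"))
```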

mutopia.gtensor.gtensor.slice_samples(dataset, samples)

Extract a subset of samples from the dataset.

This function filters the dataset to include only the specified samples, enabling focused analysis on particular samples of interest while maintaining all other dataset dimensions and attributes.

Parameters:

dataset (GTensorDataset) – Input dataset containing multiple samples
samples (List[str]) – List of sample names to extract from the dataset. Sample names must exist in the dataset’s sample coordinate.

Returns:

Filtered dataset containing only the specified samples wrapped in a SampleSlice interface

Return type:

GTensorDataset

Raises:

KeyError – If any of the specified samples are not found in the dataset

Notes

If an empty list is provided, the original dataset is returned unchanged. The function uses the mutate pattern to maintain interface chain compatibility.

mutopia.gtensor.gtensor.train_test_split(dataset, *test_chroms, lazy=False)

Split a dataset into training and testing sets based on chromosomes.

This function splits the dataset by chromosomes, with specified chromosomes reserved for testing and the remainder used for training. The split can be performed eagerly (loading all data) or lazily (for memory efficiency).

Parameters:

dataset (GTensorDataset) – Input dataset to split
*test_chroms (Union[str, List[str]]) – Chromosome names to reserve for the test set. Can be provided as multiple string arguments or lists of strings
lazy (bool, default=False) – Whether to perform lazy splitting. If True, returns LazySlicer objects that don’t load data until accessed

Returns:

Training and testing dataset interfaces

Return type:

tuple[GTensorDataset, GTensorDataset]

Raises:

ValueError – If no test chromosomes are provided or none of the specified chromosomes are found in the dataset
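
The chromosome-based split and its two error cases can be sketched on a plain list of per-locus chromosome names. The helper name and the index-based return value are assumptions; the real function returns dataset interfaces rather than index lists.

```python
def split_by_chromosome(chroms, *test_chroms):
    """Return (train_idx, test_idx) with the given chromosomes held out for testing."""
    wanted = set()
    for item in test_chroms:
        # Accept both plain strings and lists of strings.
        wanted.update([item] if isinstance(item, str) else item)
    if not wanted:
        raise ValueError("no test chromosomes provided")

    test_idx = [i for i, c in enumerate(chroms) if c in wanted]
    if not test_idx:
        raise ValueError(f"none of {sorted(wanted)} found in the dataset")
    train_idx = [i for i, c in enumerate(chroms) if c not in wanted]
    return train_idx, test_idx


locus_chroms = ["chr1", "chr1", "chr2", "chr3", "chr3"]
train_idx, test_idx = split_by_chromosome(locus_chroms, "chr2", ["chr3"])
print(train_idx, test_idx)
```

Holding out whole chromosomes rather than random loci keeps neighboring regions on the same side of the split.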

mutopia.gtensor.gtensor.unstack_regions(dataset)

Unstack regions from a compressed format to full coordinate arrays.

This function expands region data from a compact representation to full coordinate arrays, using external region file information to reconstruct chromosome, start, and end coordinates.

Parameters:

dataset (GTensorDataset) – Dataset with stacked region representation

Returns:

Dataset with unstacked region coordinates

Return type:

GTensorDataset