TopographyModel API

class mutopia.model.base.TopographyModel[source]

Bases: ABC, BaseEstimator

The TopographyModel is the heart of the mutopia modeling framework. It decomposes genome topography data into discrete components, capturing each component's genomic distribution and spectrum simultaneously. MuTopia models are memory- and compute-efficient, and can be accelerated using multithreading.

The models largely follow the scikit-learn estimator API. After fitting a model, you can use it to annotate your data with the annot_data method.

Parameters:
  • num_components (int, default=15) – Number of components in the signature model.

  • init_components (list, default=[]) – List of initial components to use.

  • fix_components (list, default=[]) – List of components to keep fixed during optimization.

  • seed (int, default=0) – Random seed for reproducibility.

  • context_reg (float, default=0.0001) – Regularization parameter for context model.

  • context_conditioning (float, default=1e-9) – Conditioning parameter for context model.

  • conditioning_alpha (float, default=1e-9) – Alpha parameter for conditioning.

  • pi_prior (float, default=1.0) – Prior parameter for pi in the locals model.

  • tau_prior (float, default=1.0) – Prior parameter for tau in the locals model.

  • locus_model_type (str, default="gbt") – Type of locus model; Gradient Boosted Trees ("gbt") by default.

  • tree_learning_rate (float, default=0.15) – Learning rate for tree-based models.

  • max_depth (int, default=5) – Maximum depth of trees in the locus model.

  • max_trees_per_iter (int, default=25) – Maximum number of trees per iteration.

  • max_leaf_nodes (int, default=31) – Maximum number of leaf nodes in each tree.

  • min_samples_leaf (int, default=30) – Minimum number of samples required at a leaf node.

  • max_features (float, default=1.0) – Fraction of features to consider when looking for best split.

  • n_iter_no_change (int, default=1) – Number of iterations with no improvement to wait before early stopping.

  • use_groups (bool, default=True) – Whether to use groups in the model.

  • add_corpus_intercepts (bool, default=False) – Whether to add corpus-specific intercepts.

  • convolution_width (int, default=0) – Width of convolution window.

  • l2_regularization (float, default=1) – L2 regularization strength.

  • max_iter (int, default=25) – Maximum number of iterations for the optimization.

  • init_variance_theta (float, default=0.03) – Initial variance for theta parameters.

  • init_variance_context (float, default=0.1) – Initial variance for context parameters.

  • empirical_bayes (bool, default=True) – Whether to use empirical Bayes for parameter estimation.

  • begin_prior_updates (int, default=50) – Iteration to begin prior updates.

  • stop_condition (int, default=50) – Stopping condition for optimization.

  • num_epochs (int, default=2000) – Number of epochs for training.

  • locus_subsample (float or None, default=None) – Fraction of loci to subsample in each iteration.

  • batch_subsample (float or None, default=None) – Fraction of batches to subsample in each iteration.

  • threads (int, default=1) – Number of threads for parallel execution.

  • kappa (float, default=0.5) – Kappa parameter for optimization.

  • tau (float, default=1.0) – Tau parameter for optimization.

  • callback (callable or None, default=None) – Callback function to be called during optimization.

  • eval_every (int, default=10) – Evaluate model every N iterations.

  • verbose (int, default=0) – Verbosity level (0: quiet, >0: increasingly verbose).

  • time_limit (float or None, default=None) – Time limit for optimization in seconds.

  • test_chroms (tuple, default=("chr1",)) – Chromosomes to use for testing.

Examples

>>> import mutopia as mu
>>> data = mu.gt.load_dataset("example_data.nc")
>>> # Create and fit a model with subsampling and 15 components
>>> model = data.modality().TopographyModel(locus_subsample=0.125, num_components=15)
>>> model.fit(data)

property alpha_
annot_SHAP_values(dataset, *components, threads=1, scan=False, n_samples=2000, seed=42, key='SHAP_values', source=None)[source]

Calculate and add SHAP values to explain component predictions.

This method uses SHAP (SHapley Additive exPlanations) to compute feature importance values for understanding how genomic features contribute to component predictions.

Parameters:
  • dataset (GTensorDataset) – Dataset to analyze and annotate with SHAP values

  • *components (int or str) – Component indices or names to calculate SHAP values for. If none provided, calculates for all components.

  • threads (int, default=1) – Number of parallel threads to use for computation

  • scan (bool, default=False) – If True, calculates SHAP values for all loci. If False, subsamples loci.

  • n_samples (int, default=2000) – Number of loci to subsample when scan=False

  • seed (int, default=42) – Random seed for reproducible subsampling

  • key (str, default="SHAP_values") – Name of the variable to store SHAP values in the dataset

Returns:

The input dataset with SHAP values added as a new variable with the specified key name and dimensions (‘shap_component’, ‘locus’ or ‘shap_locus’, ‘shap_features’)

Return type:

GTensorDataset

Raises:

ImportError – If the SHAP library is not installed

Notes

This method requires the SHAP library to be installed. Install with: pip install shap
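
Examples

An illustrative sketch based only on the documented signature; model is assumed to be a fitted TopographyModel and data a GTensorDataset:

>>> data = model.annot_SHAP_values(data, threads=4, n_samples=500)
>>> data = model.annot_SHAP_values(data, 0, 1, scan=True)  # components 0 and 1, all loci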

annot_component_distributions(dataset, threads=1, key='component_distributions')[source]

Calculate and add component distributions to a dataset.

This method computes the probability distributions for each component across genomic contexts and adds them as new variables to the dataset.

Parameters:
  • dataset (GTensorDataset) – Dataset to analyze and annotate with component distributions

  • threads (int, default=1) – Number of parallel threads to use for computation

  • key (str, default="component_distributions") – Name of the variable to store distributions in the dataset. Per-locus distributions will be stored with the name “{key}_locus”

Returns:

The input dataset with the calculated component distributions added as new variables:

  • key – Full distributions with dimensions ('source', 'component', …)

  • {key}_locus – Per-locus distributions normalized by region length

Return type:

GTensorDataset

Notes

If the dataset does not have corpus state initialized, this method will automatically set it up using setup_corpus().
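
Examples

An illustrative sketch; model is assumed to be a fitted TopographyModel and data a GTensorDataset:

>>> data = model.annot_component_distributions(data, threads=4)  # adds "component_distributions" and "component_distributions_locus"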

annot_components(dataset, normalization='global')[source]
annot_contributions(dataset, threads=1, key='contributions', locus_subsample=None, **svi_kw)[source]

Calculate and add component contributions to a dataset.

This method computes the contributions of each component to the dataset using the locals model and adds them as a new variable to the dataset.

Parameters:
  • dataset (GTensorDataset) – Dataset to analyze and annotate with contributions

  • threads (int, default=1) – Number of parallel threads to use for computation

  • key (str, default="contributions") – Name of the variable to store contributions in the dataset

Returns:

The input dataset with the calculated contributions added as a new variable with the specified key name and dimensions (‘sample’, ‘component’)

Return type:

GTensorDataset

Notes

If the dataset does not have corpus state initialized, this method will automatically set it up using setup_corpus().
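
Examples

An illustrative sketch; model is assumed to be a fitted TopographyModel and data a GTensorDataset:

>>> data = model.annot_contributions(data, threads=4)  # adds "contributions" with dimensions ('sample', 'component')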

annot_data(dataset, subset_region=None, threads=1, source=None, calc_shap=True)[source]

Annotate a dataset with comprehensive model analysis information.

This method applies a series of annotation functions to enrich the dataset with various types of model-derived insights including component analysis, contribution calculations, SHAP values, component distributions, and marginal predictions.

Parameters:
  • dataset (GTensorDataset) – The input dataset to be annotated with model analysis information.

  • threads (int, default=1) – Number of threads to use for parallel processing in annotation functions that support multithreading.

Returns:

The annotated dataset containing the original data plus all computed annotations from the applied annotation functions.

Note

The annotation functions are applied sequentially in the following order:

  1. Component annotations
  2. Contribution annotations
  3. SHAP value annotations
  4. Component distribution annotations
  5. Marginal prediction annotations
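
Examples

An illustrative sketch using the documented signature; the effect of calc_shap is inferred from its name:

>>> data = model.annot_data(data, threads=4)
>>> data = model.annot_data(data, threads=4, calc_shap=False)  # presumably skips the SHAP step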

annot_marginal_prediction(dataset, threads=1, key='predicted_marginal')[source]

Calculate and add marginal predictions to a dataset.

This method computes marginal mutation rate predictions by marginalizing over component distributions weighted by their contributions.

Parameters:
  • dataset (GTensorDataset) – Dataset to analyze and annotate with marginal predictions

  • threads (int, default=1) – Number of parallel threads to use for computation

  • key (str, default="predicted_marginal") – Name of the variable to store marginal predictions in the dataset. Per-locus marginal predictions will be stored with the name “{key}_locus”

Returns:

The input dataset with marginal predictions added as new variables:

  • key – Marginal mutation rate predictions

  • {key}_locus – Per-locus marginal predictions normalized by region length

Return type:

GTensorDataset

Notes

This method requires ‘component_distributions’ and ‘contributions’ to be present in the dataset. If they are missing, they will be calculated automatically.
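
Examples

An illustrative sketch; model is assumed to be a fitted TopographyModel and data a GTensorDataset:

>>> data = model.annot_marginal_prediction(data, threads=4)  # adds "predicted_marginal" and "predicted_marginal_locus"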

property component_names
fit(train_datasets, test_datasets=None)[source]

Fit the model to the provided training datasets.

This method fits the model using a combination of local and factor models. If test datasets are not provided, it automatically splits the training data into train and test partitions.

Parameters:
  • train_datasets (GTensorDataset or sequence of GTensorDataset) – One or more datasets to use for training the model.

  • test_datasets (GTensorDataset or sequence of GTensorDataset, optional) – Datasets to use for testing the model. If None, a portion of the training datasets will be used for testing.

Returns:

The fitted estimator.

Return type:

TopographyModel

Notes

This method sets the following attributes:

  • modality_ – The modality of the training datasets

  • factor_model_ – The fitted factor model

  • locals_model_ – The fitted locals model

  • test_scores_ – Performance metrics on test datasets
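
Examples

An illustrative sketch; train_data and test_data are assumed to be GTensorDatasets:

>>> model.fit(train_data, test_datasets=test_data)
>>> model.test_scores_  # performance metrics on the test datasets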

init_model(train_datasets)[source]
property modality
property n_components
needs_setup(dataset)[source]

Check if the dataset needs to be set up with corpus state.

This method checks if the dataset has the necessary corpus state initialized for modeling. If not, it indicates that the dataset needs to be set up.

Parameters:

dataset (GTensorDataset) – The dataset to check for corpus state initialization.

Returns:

True if the dataset needs to be set up (i.e., corpus state is not initialized), False otherwise.

Return type:

bool
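
Examples

An illustrative sketch pairing needs_setup with setup_corpus:

>>> if model.needs_setup(data):
...     data = model.setup_corpus(data, threads=4)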

sample_params(trial, extensive=0)[source]
save(path)[source]

Save the model to a file.

Parameters:

path (str) – The file path where the model should be saved.

Examples

>>> model.save("my_model.pkl")
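
The serialization format is not documented here, but the ".pkl" extension suggests standard pickling, in which case a saved model could presumably be reloaded with:

>>> import pickle
>>> with open("my_model.pkl", "rb") as f:
...     model = pickle.load(f)
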
set_fit_request(*, test_datasets='$UNCHANGED$', train_datasets='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • test_datasets (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for test_datasets parameter in fit.

  • train_datasets (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for train_datasets parameter in fit.

Returns:

self – The updated object.

Return type:

object
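
Examples

An illustrative sketch; this call only matters when the estimator is used inside a meta-estimator with metadata routing enabled:

>>> import sklearn
>>> sklearn.set_config(enable_metadata_routing=True)
>>> model.set_fit_request(test_datasets=True)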

setup_corpus(dataset, threads=1)[source]

Set up the corpus dataset with initial state and update normalization factors.

This method initializes the dataset state using the factor and locals models, updates the state from scratch, and then applies normalizers to all expanded datasets.

Parameters:
  • dataset (GTensorDataset) – The dataset to be set up for modeling

  • threads (int, default=1) – Number of threads to use for parallel processing

Returns:

The initialized and normalized dataset

Return type:

GTensorDataset