API Documentation¶

class sodirac.utils.RBFWeight(alpha: float | None = None)[source]¶

Bases: object

set_alpha(X: numpy.ndarray, n_max: int | None = None, dm: numpy.ndarray | None = None) → None[source]¶

Set the alpha parameter of a Gaussian RBF kernel as the median distance between points in an array of observations.

Parameters¶

X (np.ndarray): [N, P] matrix of observations and features.
n_max (int): maximum number of observations to use for median distance computation.
dm (np.ndarray, optional): [N, N] distance matrix for setting the RBF kernel parameter. Speeds computation if pre-computed.

Returns¶

None. Sets self.alpha.

References¶

A Kernel Two-Sample Test. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola. JMLR, 13(Mar):723−773, 2012. http://jmlr.csail.mit.edu/papers/v13/gretton12a.html
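
The median heuristic described for set_alpha can be sketched in a few lines of NumPy. This is a minimal illustration, not the sodirac implementation: the helper name median_distance, the Euclidean metric, the seed argument, and the choice to take the median over distinct pairs only are all assumptions.

```python
import numpy as np


def median_distance(X, n_max=None, dm=None, seed=0):
    """Median pairwise distance between rows of X (hypothetical helper)."""
    if dm is None:
        # Optionally subsample observations to bound the O(N^2) cost.
        if n_max is not None and X.shape[0] > n_max:
            rng = np.random.default_rng(seed)
            X = X[rng.choice(X.shape[0], size=n_max, replace=False)]
        # Dense Euclidean distance matrix (assumed metric).
        diff = X[:, None, :] - X[None, :, :]
        dm = np.sqrt((diff ** 2).sum(axis=-1))
    # Use each distinct pair once and skip the zero diagonal.
    upper = np.triu_indices(dm.shape[0], k=1)
    return float(np.median(dm[upper]))


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
alpha = median_distance(X)  # pairwise distances are 5, 10 and 5
```

Passing a pre-computed dm skips the distance computation entirely, which is the speed-up the dm parameter refers to.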

sodirac.utils.adata_to_cluster_expression(adata, cluster_label, scale=True, add_density=True)[source]¶

Convert an AnnData to a new AnnData with cluster expressions. Clusters are based on cluster_label in adata.obs. The returned AnnData has one observation per cluster, with the cluster-level expression equal to the average expression for that cluster. All annotations in adata.obs except cluster_label are discarded in the returned AnnData.

Args:
    adata (AnnData): single cell data.
    cluster_label (String): field in adata.obs used for aggregating values.
    scale (bool): Optional. Whether to weight the input single cells by the number of cells in each cluster. Default is True.
    add_density (bool): Optional. If True, the normalized number of cells in each cluster is added to the returned AnnData as obs.cluster_density. Default is True.

Returns:
    AnnData: aggregated single cell data.
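
The aggregation described above (one averaged row per cluster, plus a normalized cell count) can be sketched on plain arrays. This is an illustration under assumptions, not the library code: cluster_mean_expression is a hypothetical helper, it works on a dense matrix rather than an AnnData, and it ignores the scale option.

```python
import numpy as np


def cluster_mean_expression(X, labels):
    """Average the rows of X within each cluster (hypothetical helper)."""
    clusters, inverse = np.unique(labels, return_inverse=True)
    sums = np.zeros((len(clusters), X.shape[1]))
    # Accumulate each cell's expression into its cluster's row.
    np.add.at(sums, inverse, X)
    counts = np.bincount(inverse, minlength=len(clusters))
    means = sums / counts[:, None]
    # Fraction of all cells that fall in each cluster.
    density = counts / counts.sum()
    return clusters, means, density


X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
labels = np.array(["a", "a", "b"])
clusters, means, density = cluster_mean_expression(X, labels)
```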

sodirac.utils.append_categorical_to_data(X: numpy.ndarray | scipy.sparse.csr.csr_matrix, categorical: numpy.ndarray)[source]¶

Convert categorical to a one-hot vector and append this vector to each sample in X.

Parameters¶

X (np.ndarray, sparse.csr.csr_matrix): [Cells, Features]
categorical (np.ndarray): [Cells,]

Returns¶

Xa (np.ndarray): [Cells, Features + N_Categories]
categories (np.ndarray): [N_Categories,] str category descriptors.
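
A minimal sketch of the one-hot append for the dense case follows. The helper name append_one_hot is hypothetical, sparse input is not handled, and the column order (sorted category names) is an assumption rather than documented behavior.

```python
import numpy as np


def append_one_hot(X, categorical):
    """Append a one-hot encoding of `categorical` to each row of dense X."""
    categories, codes = np.unique(categorical, return_inverse=True)
    one_hot = np.zeros((X.shape[0], len(categories)), dtype=X.dtype)
    # Set a single 1 per row in the column of that row's category.
    one_hot[np.arange(X.shape[0]), codes] = 1
    return np.hstack([X, one_hot]), categories


X = np.ones((3, 2))
categorical = np.array(["b", "a", "b"])
Xa, categories = append_one_hot(X, categorical)
```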

sodirac.utils.argmax_pred_class(grouping: numpy.ndarray, prediction: numpy.ndarray)[source]¶

Assign class to elements in groups based on the most common predicted class for that group.

Parameters¶

grouping (np.ndarray): [N,] partition values defining groups to be classified.
prediction (np.ndarray): [N,] predicted values for each element in grouping.

Returns¶

assigned_classes (np.ndarray): [N,] class labels based on the most common class assigned to elements in the group partition.

Examples¶

>>> grouping = np.array([0,0,0,1,1,1,2,2,2,2])
>>> prediction = np.array(['A','A','A','B','A','B','C','A','B','C'])
>>> argmax_pred_class(grouping, prediction)
np.array(['A','A','A','B','B','B','C','C','C','C'])

Notes¶

scNym classifiers do not incorporate neighborhood information. This simple heuristic leverages cluster information obtained by an orthogonal method and assigns all cells in a given cluster the majority class label within that cluster.
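
The majority-vote heuristic in the note above can be sketched as follows. This is an illustrative re-implementation with a hypothetical name, not the packaged function; how ties are broken (here, the first class in sorted order) is an assumption.

```python
import numpy as np


def majority_class_per_group(grouping, prediction):
    """Give every element the most common predicted class in its group."""
    assigned = np.empty_like(prediction)
    for group in np.unique(grouping):
        mask = grouping == group
        classes, counts = np.unique(prediction[mask], return_counts=True)
        # argmax returns the first maximum, so ties go to the
        # alphabetically first class (an assumption of this sketch).
        assigned[mask] = classes[np.argmax(counts)]
    return assigned


grouping = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
prediction = np.array(["A", "A", "A", "B", "A", "B", "C", "A", "B", "C"])
assigned = majority_class_per_group(grouping, prediction)
```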

sodirac.utils.build_classification_matrix(X: numpy.ndarray | scipy.sparse.csr.csr_matrix, model_genes: numpy.ndarray, sample_genes: numpy.ndarray, gene_batch_size: int = 512) → numpy.ndarray | scipy.sparse.csr.csr_matrix[source]¶

Build a matrix for classification using only genes that overlap between the current sample and the pre-trained model.

Parameters¶

X (np.ndarray, sparse.csr_matrix): [Cells, Genes] count matrix.
model_genes (np.ndarray): gene identifiers in the order expected by the model.
sample_genes (np.ndarray): gene identifiers for the current sample.
gene_batch_size (int): number of genes to copy between arrays per batch. Controls a speed vs. memory trade-off.

Returns¶

N (np.ndarray, sparse.csr_matrix): [Cells, len(model_genes)] count matrix. Values where a model gene was not present in the sample are left as zeros. type(N) will match type(X).
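
The gene alignment can be sketched for a dense matrix as below. The helper name align_to_model_genes is hypothetical, and this version copies all shared genes in one step rather than in batches of gene_batch_size, so it shows the result and not the memory behavior of the real function.

```python
import numpy as np


def align_to_model_genes(X, model_genes, sample_genes):
    """Reorder the columns of X to model_genes, leaving missing genes as zeros."""
    N = np.zeros((X.shape[0], len(model_genes)), dtype=X.dtype)
    sample_pos = {gene: i for i, gene in enumerate(sample_genes)}
    # Column indices in the model order and the matching sample columns.
    model_idx = [j for j, gene in enumerate(model_genes) if gene in sample_pos]
    sample_idx = [sample_pos[model_genes[j]] for j in model_idx]
    N[:, model_idx] = X[:, sample_idx]
    return N


X = np.array([[1, 2, 3], [4, 5, 6]])
sample_genes = np.array(["g1", "g2", "g3"])
model_genes = np.array(["g3", "gX", "g1"])
N = align_to_model_genes(X, model_genes, sample_genes)
```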

sodirac.utils.compute_entropy_of_mixing(X: numpy.ndarray, y: numpy.ndarray, n_neighbors: int, n_iters: int | None = None, **kwargs) → numpy.ndarray[source]¶

Compute the entropy of mixing among groups from a feature matrix.

Parameters¶

X (np.ndarray): [N, P] feature matrix.
y (np.ndarray): [N,] group labels.
n_neighbors (int): number of nearest neighbors to draw for each iteration of the entropy computation.
n_iters (int): number of iterations to perform. If n_iters is None, uses every point.

Returns¶

entropy_of_mixing (np.ndarray): [n_iters,] entropy values for each iteration.

Notes¶

The entropy of batch mixing is computed by sampling n_per_sample cells from a local neighborhood in the nearest neighbor graph and constructing a probability vector based on their group membership. The entropy of this probability vector is computed as a metric of intermixing between groups.

If groups are more mixed, the probability vector will have higher entropy, and vice-versa.
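
The computation described in these notes can be sketched for a single neighborhood. Both helper names are hypothetical, the brute-force Euclidean neighbor search stands in for whatever nearest neighbor graph the library builds, and the natural logarithm is an assumed choice of base.

```python
import numpy as np


def label_entropy(labels):
    """Shannon entropy (natural log) of the label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def neighborhood_entropy(X, y, index, n_neighbors):
    """Entropy of group labels among the nearest neighbors of one point."""
    distances = np.linalg.norm(X - X[index], axis=1)
    # The query point itself is at distance 0 and is counted here.
    neighbors = np.argsort(distances)[:n_neighbors]
    return label_entropy(y[neighbors])


X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1]])
y = np.array(["a", "b", "a", "b", "b"])
mixed = neighborhood_entropy(X, y, index=0, n_neighbors=2)     # labels a, b
unmixed = neighborhood_entropy(X, y, index=3, n_neighbors=2)   # labels b, b
```

A neighborhood split evenly between two groups reaches the two-group maximum of log(2), while a neighborhood drawn from a single group gives zero.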

sodirac.utils.get_adata_asarray(adata: anndata.AnnData) → numpy.ndarray | scipy.sparse.csr.csr_matrix[source]¶

Get the gene expression matrix .X of an AnnData object as an array rather than a view.

Parameters¶

adata (anndata.AnnData): [Cells, Genes] AnnData experiment.

Returns¶

X (np.ndarray, sparse.csr.csr_matrix): [Cells, Genes] .X attribute as an array in memory.

Notes¶

Returned X will match the type of adata.X view.

sodirac.utils.get_multi_edge_index(pos, regions, graph_methods='knn', n_neighbors=None, n_radius=None)[source]¶

sodirac.utils.get_single_edge_index(pos, graph_methods='knn', n_neighbors=None, n_radius=None)[source]¶

sodirac.utils.knn_smooth_pred_class(X: numpy.ndarray, pred_class: numpy.ndarray, grouping: numpy.ndarray | None = None, k: int = 15) → numpy.ndarray[source]¶

Smooths class predictions by taking the modal class from each cell’s nearest neighbors.

Parameters¶

X (np.ndarray): [N, Features] embedding space for calculation of nearest neighbors.
pred_class (np.ndarray): [N,] array of unique class labels.
grouping (np.ndarray): [N,] unique grouping labels, e.g. cluster assignments. If provided, only nearest neighbors within the same cluster are considered.
k (int): number of nearest neighbors to use for smoothing.

Returns¶

smooth_pred_class (np.ndarray): [N,] unique class labels, smoothed by kNN.

Examples¶

>>> smooth_pred_class = knn_smooth_pred_class(
...     X = X,
...     pred_class = raw_predicted_classes,
...     grouping = louvain_cluster_groups,
...     k = 15,)

Notes¶

scNym classifiers do not incorporate neighborhood information. By using a simple kNN smoothing heuristic, we can leverage neighborhood information to improve classification performance, smoothing out cells that have an outlier prediction relative to their local neighborhood.
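
The smoothing heuristic can be sketched with a brute-force neighbor search. This is an illustration, not the library code: knn_modal_class is a hypothetical name, the grouping restriction is omitted, and counting each cell among its own k neighbors is an assumption.

```python
import numpy as np


def knn_modal_class(X, pred_class, k=3):
    """Replace each prediction with the modal class of its k nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    dm = np.sqrt((diff ** 2).sum(axis=-1))
    # Each row lists the k closest points; the point itself comes first.
    nn_idx = np.argsort(dm, axis=1)[:, :k]
    smoothed = np.empty_like(pred_class)
    for i, neighbors in enumerate(nn_idx):
        classes, counts = np.unique(pred_class[neighbors], return_counts=True)
        smoothed[i] = classes[np.argmax(counts)]
    return smoothed


X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
pred_class = np.array(["A", "B", "A", "B", "B", "A"])
smoothed = knn_modal_class(X, pred_class, k=3)
```

In this toy input the two outlier predictions (the "B" in the first cluster and the "A" in the second) are overruled by their neighbors.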

sodirac.utils.knn_smooth_pred_class_prob(X: numpy.ndarray, pred_probs: numpy.ndarray, names: numpy.ndarray, grouping: numpy.ndarray | None = None, k: Callable | int = 15, dm: numpy.ndarray | None = None, **kwargs) → numpy.ndarray[source]¶

Smooths class predictions by taking the modal class from each cell’s nearest neighbors.

Parameters¶

X (np.ndarray): [N, Features] embedding space for calculation of nearest neighbors.
pred_probs (np.ndarray): [N, C] array of class prediction probabilities.
names (np.ndarray): [C,] names of predicted classes in pred_probs.
grouping (np.ndarray): [N,] unique grouping labels, e.g. cluster assignments. If provided, only nearest neighbors within the same cluster are considered.
k (int): number of nearest neighbors to use for smoothing.
dm (np.ndarray, optional): [N, N] distance matrix for setting the RBF kernel parameter. Speeds computation if pre-computed.

Returns¶

smooth_pred_class (np.ndarray): [N,] unique class labels, smoothed by kNN.

Examples¶

>>> smooth_pred_class = knn_smooth_pred_class_prob(
...     X = X,
...     pred_probs = predicted_class_probs,
...     names = class_names,
...     grouping = louvain_cluster_groups,
...     k = 15,)

sodirac.utils.lsi(adata: anndata.AnnData, n_comps: int = 20, use_highly_variable: bool | None = None, **kwargs) → None[source]¶

LSI analysis (following the Seurat v3 approach)

Parameters¶

adata: Input dataset
n_comps: Number of dimensions to use
use_highly_variable: Whether to use highly variable features only, stored in adata.var['highly_variable']. By default uses them if they have been determined beforehand.
**kwargs: Additional keyword arguments are passed to sklearn.utils.extmath.randomized_svd()

Returns¶

None. The input AnnData object is modified in place, with LSI results stored in adata.obsm["X_lsi"].

sodirac.utils.make_one_hot(labels: torch.LongTensor, C=2) → torch.FloatTensor[source]¶

Converts an integer label torch.autograd.Variable to a one-hot Variable.

Parameters¶

labels (torch.LongTensor or torch.cuda.LongTensor): [N, 1], where N is batch size. Each value is an integer representing correct classification.
C (int): number of classes in labels.

Returns¶

target (torch.FloatTensor or torch.cuda.FloatTensor): [N, C], where C is class number. One-hot encoded.

sodirac.utils.mclust_R(adata, num_cluster, modelNames='EEE', used_obsm='emb_pca', random_seed=2020, key_added='mclust')[source]¶

Clustering using the mclust algorithm. The parameters are the same as those in the R package mclust.

sodirac.utils.pp_adatas(adata_sc, adata_sp, genes=None, gene_to_lowercase=True)[source]¶

Pre-process AnnDatas so that they can be mapped. Specifically:

- Remove genes whose entries are all zero.
- Find the intersection between adata_sc, adata_sp and the given marker gene list, and save the intersected markers in both AnnDatas.
- Calculate density priors and save them with adata_sp.

Args:
    adata_sc (AnnData): single cell data.
    adata_sp (AnnData): spatial expression data.
    genes (List): Optional. List of genes to use. If None, all genes are used.

Returns:
    Updates adata_sc by creating the uns fields training_genes and overlap_genes. Updates adata_sp by creating the uns fields training_genes and overlap_genes and the obs fields rna_count_based_density and uniform_density.
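
The pre-processing steps can be sketched on dense arrays. Everything here is an assumption made for illustration: the helper names are hypothetical, the inputs are plain NumPy arrays instead of AnnData objects, and the two density formulas (per-spot share of total counts, and equal weight per spot) are inferred from the field names rather than taken from the source.

```python
import numpy as np


def overlap_genes(X_sc, genes_sc, X_sp, genes_sp):
    """Genes with at least one nonzero entry in both datasets."""
    expressed_sc = genes_sc[X_sc.sum(axis=0) > 0]
    expressed_sp = genes_sp[X_sp.sum(axis=0) > 0]
    return np.intersect1d(expressed_sc, expressed_sp)


def density_priors(X_sp):
    """Per-spot density priors (assumed definitions, see lead-in)."""
    counts = X_sp.sum(axis=1)
    rna_count_based = counts / counts.sum()
    uniform = np.full(X_sp.shape[0], 1.0 / X_sp.shape[0])
    return rna_count_based, uniform


X_sc = np.array([[1, 0, 2], [3, 0, 0]])
genes_sc = np.array(["A", "B", "C"])
X_sp = np.array([[0, 1, 1], [0, 2, 3]])
genes_sp = np.array(["A", "C", "D"])

shared = overlap_genes(X_sc, genes_sc, X_sp, genes_sp)
rna_density, uniform_density = density_priors(X_sp)
```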

sodirac.utils.tfidf(X)[source]¶

TF-IDF normalization (following the Seurat v3 approach)

Parameters¶

X: Input matrix

Returns¶

X_tfidf: TF-IDF normalized matrix
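
A dense-matrix sketch of a TF-IDF normalization is given below. It is an assumption about the general form, not a copy of the library function: tfidf_dense is a hypothetical name, term frequency is taken as each row's share of its total, inverse document frequency as the number of rows divided by each column's total, and the input is assumed to have no all-zero rows or columns. Some formulations also apply a scale factor and a log transform afterwards; that step is left out here.

```python
import numpy as np


def tfidf_dense(X):
    """TF-IDF for a dense [Cells, Features] count matrix (illustrative)."""
    X = np.asarray(X, dtype=float)
    # Inverse document frequency: number of cells over per-feature totals.
    idf = X.shape[0] / X.sum(axis=0)
    # Term frequency: each cell's counts as a fraction of its total.
    tf = X / X.sum(axis=1, keepdims=True)
    return tf * idf


X = np.array([[1, 1], [2, 0]])
X_tfidf = tfidf_dense(X)
```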