scalign package

class scalign.reference.reference(path, key_atlas_var='.ensembl', use_parametric_if_available=True, use_expression_if_available=False, use_gpu_if_available=True)

Reference atlas

This is the core exported class of scalign. It loads the reference atlas from a directory. At present, this package does not contain methods for building a reference dump automatically; these will be added in later versions.

path

The directory of the reference atlas. It should always contain a metadata.h5ad file and a scvi directory, plus at least one of embedder.pkl and a parametric directory. These two store the non-parametric and parametric UMAP embedders respectively. The parametric UMAP embedder requires keras >= 3.1 and tensorflow >= 2.0 as additional dependencies, and can run much faster if you have configured valid GPUs. The non-parametric UMAP embedder serves as a fallback and runs faster than the parametric model when no GPU is available.

Type:

str
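The layout described for path can be checked before loading. The following is a minimal sketch using only the standard library; the helper name describe_reference_dir is hypothetical and not part of scalign, and it only tests for the presence of the files and directories named above, not their contents.

```python
from pathlib import Path


def describe_reference_dir(path):
    """Report which documented atlas components exist under `path`.

    Hypothetical helper for illustration; it is not provided by scalign.
    """
    root = Path(path)
    found = {
        "metadata": (root / "metadata.h5ad").is_file(),
        "scvi": (root / "scvi").is_dir(),
        "non_parametric": (root / "embedder.pkl").is_file(),
        "parametric": (root / "parametric").is_dir(),
    }
    # Required: metadata.h5ad and scvi, plus at least one of the two embedders.
    found["usable"] = (
        found["metadata"]
        and found["scvi"]
        and (found["non_parametric"] or found["parametric"])
    )
    return found
```

A directory holding only embedder.pkl next to metadata.h5ad and scvi would report usable as True and parametric as False, which corresponds to the non-parametric fallback.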

key_atlas_var

The gene metadata column in the atlas that is matched against the query set. Pick an identifier that your query set also contains. The default is .ensembl, a column of ENSEMBL IDs; in that case, set query(key_var = '...') to the column of your query set that holds the corresponding ENSEMBL IDs.

Type:

str

use_parametric_if_available

If set to True, the parametric model is used when tensorflow is installed. Set to False to force the aligner to use the non-parametric one.

Type:

bool

use_expression_if_available

If the expression data of the atlas is available, try to load it into the model. A reference atlas can be built with gene expression quantification by supplying a log-normalized matrix. This enables more analysis and visualization capabilities in the atlas mapper. However, if the atlas is relatively large, loading may take much longer and require more disk space (as well as working memory). A lite distribution of the atlas does not need the expression data, and the mapping program takes about 5 GB of memory on average for an atlas of 1,250,000 cells (1.25 M cells). This is already considered a large atlas, yet it can be analyzed on a single laptop computer. The expression matrix of an atlas of that size, however, may reach at least ~150 GB. A full distribution containing such data should take about 160 GB of disk space and nearly 200 GB of memory to load successfully. Check the configuration of your machine before turning this switch on; otherwise the program will crash.

Type:

bool
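To get a feel for the figure of roughly 150 gigabytes quoted above, the arithmetic below estimates the size of a dense matrix. This is only a back-of-the-envelope sketch: the gene count of 30,000 and the 4-byte value size are assumptions chosen for illustration, and the source does not state how the expression matrix is actually stored.

```python
def dense_matrix_gb(n_cells, n_genes, bytes_per_value=4):
    """Size of a dense cells-by-genes matrix in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_cells * n_genes * bytes_per_value / 1e9


# Assumed values: 30,000 genes stored as 4-byte floats (not stated in the documentation).
print(dense_matrix_gb(1_250_000, 30_000))  # prints 150.0
```

Under these assumptions the estimate lands in the same range as the documented size, which is why the expression data is left out of a lite distribution.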

property converter

A dictionary that converts values of the variable metadata column specified by key_atlas_var into the corresponding atlas variable keys.
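A dictionary of this kind can be used to see which query identifiers have a counterpart in the atlas. The sketch below is illustrative only: translate_ids is a hypothetical helper, and the atlas keys "g0" and "g1" are made up, since the actual key format of the atlas is not described here.

```python
def translate_ids(query_ids, converter):
    """Split query identifiers into (translated, unmatched) using a converter dict."""
    translated = {q: converter[q] for q in query_ids if q in converter}
    unmatched = [q for q in query_ids if q not in converter]
    return translated, unmatched


# Hypothetical converter: ENSEMBL ID -> atlas variable key (the keys are invented).
converter = {"ENSG00000141510": "g0", "ENSG00000012048": "g1"}
translated, unmatched = translate_ids(
    ["ENSG00000141510", "ENSG00000000000"], converter
)
print(translated)  # {'ENSG00000141510': 'g0'}
print(unmatched)   # ['ENSG00000000000']
```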

density(query, stratification='query', atlas_ptsize=2, atlas_embedding=None, atlas_color_mode='categorical', key_atlas_var='.name', atlas_gene=None, atlas_hue=None, atlas_hue_order=None, atlas_default_color='#e0e0e0', atlas_alpha=1.0, atlas_palette='hls', atlas_rasterize=True, atlas_annotate=True, atlas_annotate_style='index', atlas_annotate_foreground='black', atlas_annotate_stroke='white', atlas_legend=True, key_query_embeddings='umap', query_plot=True, query_ptsize=8, query_hue=None, query_hue_order=None, query_default_color='black', query_alpha=0.5, query_palette='hls', query_rasterize=True, query_annotate=True, query_annotate_style='index', query_annotate_foreground='black', query_annotate_stroke='white', query_legend=True, contour_plot=True, contour_fill=False, contour_hue=None, contour_hue_order=None, contour_linewidth=0.8, contour_default_color='black', contour_palette='hls', contour_alpha=1, contour_levels=10, contour_bw=0.5, legend_col=1, add_outline=False, outline_color='black', width=5, height=5, dpi=100, elegant=False, title='Embeddings', save=None)

Plot mapping density

This function is a helper for plotting alignment density. The plot is either shown in the interactive console or saved to disk.

Parameters:
  • query (anndata.AnnData) – The mapped query set. reference.query() must be run on it beforehand, since this function requires the data to contain .uns['.align'] and .obsm['umap'].

  • stratification (Literal['query', 'atlas'] = 'query') – The plot colors only one of the two layers: either a categorical metadata column from the atlas or one from the query set. The legend is shown automatically for the selected layer.

  • add_outline (bool = False) – Whether to add an outline to the atlas embedding region. This can emphasize the atlas boundary.

  • outline_color (str = 'black') – A named matplotlib color (or hex code) for the outline.

  • query_plot (bool = True) – Whether to plot the scatter points from the query dataset. Note that this does not affect the plotting of query labels or query legends if they are set to be plotted.

  • contour_plot (bool = True) – Whether to plot the isoheight contours.

  • legend_col (int = 1) – Number of columns used to display legend markers. Set this to a suitable number for aesthetics when the groupings have many possible values.

  • atlas_color_mode (Literal['categorical', 'expression'] = 'categorical') – How to color the atlas. If set to categorical, atlas_hue must be set to the name of a categorical metadata column. If set to expression, the expression levels of a specified gene (given by atlas_gene) are plotted on the base UMAP. This requires an expression matrix to be loaded into the atlas when creating it (by supplying the use_expression_if_available argument).

  • atlas_gene (str = None) – The gene to plot. Must be a valid name present in .variables[key_atlas_var].

  • ptsize (float = (atlas: 2, query: 8)) – The point size of the atlas basis plot and the query scatter. Typically the query data points should be plotted larger than the atlas, since the atlas contains more cells.

  • hue (str = None) – The categorical variable used to group the atlas or the query. Note that only the layer selected by stratification will be colored, since coloring both layers would clutter the graph. This variable must exist within the obs slot of the corresponding anndata. If set to None, the data points are plotted in the single color specified by atlas_default_color or query_default_color.

  • hue_order (list[str] = None) – Specify the order of the hue variable. This is useful in combination with a manually specified palette to determine the exact colors used.

  • default_color (str = ('#e0e0e0', 'black')) – A named matplotlib color (or hex code) for the atlas scatter and the query scatter if not colored by category. If atlas_hue or query_hue is not None, the value of this parameter will be ignored, and the coloring of the graph is then specified by atlas_palette and query_palette.

  • alpha (float = (1, 0.5)) – The transparency of data points.

  • palette (str | list[str] = 'hls') – The color palette. Either a string naming a palette (following the syntax of seaborn color palette names), or a list of color strings specifying exact colors (and their order). If the number of colors does not match the number of categorical values, matplotlib's automatic palette cycling rule is applied.

  • rasterize (bool = True) – Whether to rasterize the scatter plot. We strongly recommend leaving these values set to True, because a large atlas will bloat the graphic object, resulting in very large vector files and slow performance.

  • annotate (bool = True) – If a hue is specified, whether to mark the categories onto the map.

  • annotate_style (Literal['index', 'label'] = 'index') – The markers of categories on map. index will mark a circled index according to the legend marker, and label will mark the category text.

  • annotate_foreground (str = 'black') – A named matplotlib color (or hex code). Foreground color to the annotated text.

  • annotate_stroke (str = 'white') – A named matplotlib color (or hex code). Stroke color to the annotated text.

  • legend (bool = True) – Whether to show the categorical legend.

  • contour_fill (bool = False) – Whether to fill the isoheight contours with a color gradient. If this is set to True, the value of contour_linewidth will be ignored.

  • contour_linewidth (float = 0.8) – The line width of the non-filled isoheight contours.

  • contour_levels (int | list[float] = 10) – The levels of the contours. If a single integer is provided, the whole range is split evenly to match the number of levels (e.g. setting it to 5 has the same effect as [0.2, 0.4, 0.6, 0.8]). Alternatively, specify a list of levels to set the contours manually.

  • contour_bw (float = 0.5) – The larger the parameter is, the smoother the contours will be.

  • width (int = 5) – Width of figure

  • height (int = 5) – Height of figure

  • dpi (int = 100) – DPI. If saving to vector graphics (e.g. PDF, SVG etc.), note that some parts of the graphics are rasterized by default to reduce object size. The resolution of such rasterized objects is still affected by DPI.

  • elegant (bool = False) – Show no boundary.

  • title (str = 'Embeddings') – Title of the plot, or None to hide the title.

  • save (str = None) – If set to None, the plot will be displayed using matplotlib.pyplot.show(). Otherwise, set the parameter to a valid file name to save the image to disk.

Returns:

If save is None, returns the plotted figure as a matplotlib object. If save is set, writes the image to disk and returns None.

Return type:

None | Figure
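The integer form of contour_levels described in the parameter list can be illustrated with a small sketch. The helper name expand_contour_levels is hypothetical, and the even spacing over the open interval (0, 1) is inferred from the documented example (5 is equivalent to [0.2, 0.4, 0.6, 0.8]); the library's internal handling may differ.

```python
def expand_contour_levels(levels):
    """Turn an integer level count into evenly spaced levels strictly between 0 and 1.

    Lists are returned unchanged (as a new list). Hypothetical helper for illustration.
    """
    if isinstance(levels, int):
        if levels < 2:
            raise ValueError("an integer level count must be at least 2")
        # n levels split the range into n equal parts, giving n - 1 interior cut points.
        return [i / levels for i in range(1, levels)]
    return list(levels)


print(expand_contour_levels(5))  # [0.2, 0.4, 0.6, 0.8]
```

With the default of 10, this reading yields nine interior levels from 0.1 to 0.9.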

property epoch

Number of epochs each sample went through during training.

property expression

Get the expression matrix in log normalized counts.

network_summary()

Print the network summary for a parametric model. This function prints an error message if the reference was loaded with a non-parametric model.

property observations

Read-only observation metadata of the atlas.

query(input, batch_key=None, key_var=None, key_query_latent='scvi', key_query_embeddings='umap', scvi_epoch_reduction=3, retrain=False, landmark_reduction=60, landmark_loss_weight=0.01, n_jobs=1, n_epochs=10)

Query the reference atlas with a dataset

Parameters:
  • input (anndata.AnnData) –

    The query dataset to be aligned. The variable identifier will be mapped to the reference atlas by the specified variable metadata column (in reference(key_atlas_var = ...)). This column in the atlas metadata of genes will match the query dataset’s metadata column specified by key_var. If key_var is not specified, the query dataset’s variable names will be used as identifier.

    The query dataset must have unique variable names and observation names. Otherwise the program will raise an error. You can use index.is_unique to check this.

  • batch_key (str) – The observation metadata key specifying sample batches. This is used to correct batch effects with the scVI model. If not specified, the program generates an obs column named batch and assigns all samples to the same batch. Note that if you already have an observation metadata column named batch, it will be overwritten.

  • key_var (str) – The variable metadata key specifying the gene names. This should match the key selected in the atlas (by default, a list of ENSEMBL IDs). If not specified, the program uses the variable names. Make sure that the contents of this column are unique. After the alignment, the variable names are transformed to match those of the atlas. The original variable names are stored in the .index slot. Keep a copy of them if you need them afterwards.

  • key_query_latent (str) – The obsm key to store the scVI latent space. If there is already a key with the same name, the calculation of scVI components is skipped, and the data inside the slot is used directly as the scVI latent space.

  • key_query_embeddings (str) –

    The obsm key to store the UMAP embeddings. These embeddings will mostly share the same structure as the reference atlas, since the exact UMAP model is used to transform the latent space. If retrain is set to False, the UMAP model serves only as a prediction model that transforms between dimensions without training on the new data. This is rather fast, but may introduce errors in the predicted embeddings (since the model has not seen the data at all during its training). The non-parametric model does not support retraining and can only be used as a prediction model.

    Parametric UMAP models can be retrained with new data. This helps the new data points integrate better into the atlas and yields a more accurate alignment. However, the atlas embedding is somewhat affected by the new points. Although landmark points are used to help preserve the original structure, there may be small differences between the new atlas and the original one.

    If there is already a key with the same name, UMAP embedding calculation will be skipped.

  • scvi_epoch_reduction (int) – Since the scVI model has already been trained, only a few epochs are needed to adapt it to the new data. The number of epochs may therefore be lower than what scVI would normally use. This saves a lot of time when running on CPU machines without reducing performance too much. By default, the reduction ratio is set to 3.

  • retrain (bool) – Whether to retrain the model. This is only supported for parametric model.

  • landmark_reduction (int) – The fraction of atlas points to select randomly as landmarks. The trainer selects 1 out of N points from the original atlas to help keep the overall space from changing dramatically. The smaller the reduction ratio, the more samples from the original atlas are used in retraining. The default is 60.

  • landmark_loss_weight (float) – The weight of the landmark loss. By default 0.01.

  • n_jobs (int) – Number of threads to use when running UMAP embedding.
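The "1 out of N" selection described for landmark_reduction can be sketched with the standard library. This is an illustration under stated assumptions, not the library's implementation: the helper name sample_landmarks and its seed argument are hypothetical, and uniform sampling without replacement is assumed because the documentation only says the points are chosen randomly.

```python
import random


def sample_landmarks(n_atlas_cells, landmark_reduction=60, seed=0):
    """Pick roughly 1 out of `landmark_reduction` atlas cell indices at random.

    Hypothetical helper; assumes uniform sampling without replacement.
    """
    n_landmarks = max(1, n_atlas_cells // landmark_reduction)
    rng = random.Random(seed)
    # Sample distinct indices, then sort them for stable downstream indexing.
    return sorted(rng.sample(range(n_atlas_cells), n_landmarks))


print(len(sample_landmarks(6000, landmark_reduction=60)))  # 100
print(len(sample_landmarks(6000, landmark_reduction=30)))  # 200
```

Halving the ratio doubles the number of landmarks, which matches the statement that a smaller ratio uses more of the original atlas during retraining.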

Returns:

The modified anndata object, with the following slots set:

  • .obs: batch

  • .var: index, var_names

  • .obsm: key_query_latent, key_query_embeddings

  • .uns: .align

These changes are made in place; however, the modified object is still returned for convenience.

Return type:

anndata.AnnData
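A typical workflow combines the constructor, query() and density(). The sketch below uses only parameters that appear in the signatures in this reference; the wrapper name map_and_plot is hypothetical, the import path is inferred from the qualified class name scalign.reference.reference, and coloring by the batch column is just one possible choice.

```python
def map_and_plot(atlas_dir, adata, key_var=None, save=None):
    """Load an atlas, map a query dataset onto it, and plot the mapping density.

    Hypothetical wrapper for illustration; not part of scalign.
    """
    # Imported lazily so the sketch can be read without scalign installed.
    from scalign.reference import reference

    # The documentation requires unique observation and variable names.
    if not (adata.obs_names.is_unique and adata.var_names.is_unique):
        raise ValueError("obs_names and var_names must be unique")

    atlas = reference(atlas_dir, key_atlas_var=".ensembl")
    # retrain=False uses the embedder purely as a prediction model.
    mapped = atlas.query(adata, key_var=key_var, retrain=False)
    # Color the query layer by the 'batch' column that query() sets in .obs.
    figure = atlas.density(
        mapped, stratification="query", query_hue="batch", save=save
    )
    return mapped, figure
```

Passing a file name as save writes the image to disk, in which case the second returned value is None as described for density().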

property reproduceable

Whether the model was trained with a deterministic random state. Setting the random state makes the training process reproducible, but restricts neighbor finding to a single CPU core. The random seed cannot be changed after the model is built, so this field is read-only. You can ask the atlas distributor for a reproducible model, or train a model from counts yourself.

However, since the prediction process does not alter the model, you will still get reproducible results if you map the query dataset with retrain set to False. Minor differences only occur if you retrain the model, and they should be small because an additional loss is applied to keep the atlas in the same shape.

training_loss()

Get the training loss vector that was recorded during the model’s training and retraining process. The original vector

property use_expression

Whether the atlas mapper loads the expression matrix.

property use_gpu

Whether the atlas mapper uses the GPU.

property use_parametric

Whether the atlas mapper uses the parametric UMAP model. You may set the use_parametric_if_available switch when creating the reference to indicate that you prefer the parametric model. However, if the keras and tensorflow packages are not configured correctly, the program automatically falls back to the non-parametric model. The model actually in use might therefore not be the one you expected; check this property to see which model was loaded.
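The fallback behavior can be pictured as a simple availability check. The following sketch is an assumption about the general logic, not the library's code: pick_embedder and its has_module argument are hypothetical, and the check ignores package versions and whether the atlas directory actually ships a parametric model.

```python
import importlib.util


def pick_embedder(prefer_parametric=True, has_module=None):
    """Return 'parametric' or 'non-parametric' following the documented fallback.

    Hypothetical helper; `has_module` can be injected to simulate an environment.
    """
    if has_module is None:
        # find_spec returns None when a top-level module cannot be located.
        def has_module(name):
            return importlib.util.find_spec(name) is not None

    if prefer_parametric and has_module("keras") and has_module("tensorflow"):
        return "parametric"
    return "non-parametric"


# Simulate a machine where tensorflow is missing.
print(pick_embedder(True, has_module=lambda name: name != "tensorflow"))
# prints: non-parametric
```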

property variables

Read-only variable metadata of the atlas.