scalign package¶
- class scalign.reference.reference(path, key_atlas_var='.ensembl', use_parametric_if_available=True, use_expression_if_available=False, use_gpu_if_available=True)¶
Reference atlas
This is the core class exported by scalign. It loads the reference atlas from a directory. At present, this package does not contain methods to build a reference dump automatically; this will be added in later versions.
- path¶
The directory of the reference atlas. This should always contain a metadata.h5ad file and a scvi directory, and contain either or both of embedder.pkl and the parametric directory. These two store the non-parametric and parametric UMAP embedders respectively. The parametric UMAP embedder requires keras >= 3.1 and tensorflow >= 2.0 as additional dependencies, and can run much faster if you have configured valid GPUs. The non-parametric UMAP embedder serves as a fallback and runs faster than the parametric model when no GPU is installed.
- Type:
- key_atlas_var¶
The gene metadata column in the atlas used to match genes in the query set. You may pick an identifier that your query set contains. It is .ensembl by default, indicating a column of ENSEMBL IDs; in that case, set query(key_var = '...') to the column of your query set that holds the corresponding ENSEMBL IDs.
- Type:
- use_parametric_if_available¶
If set to True, this will use the parametric model if tensorflow is installed. If set to False, the aligner is forced to use the non-parametric one.
- Type:
- use_expression_if_available¶
If the expression data of the atlas is available, try to load it into the model. One can build a reference atlas that includes gene expression quantification by supplying a log-normalized matrix. This enables more analysis and visualization capabilities in the atlas mapper. However, if the atlas is relatively large, this may take much longer to load and require more disk space (as well as working memory). A lite distribution of the atlas does not need the expression data, and the mapping program takes an average of 5 GB of memory to perform its job for an atlas of 1,250,000 cells (1.25 M cells). This is already considered a large atlas, but it can still be analyzed on a single laptop computer. However, the expression matrix of an atlas of this size may take at least ~150 GB. A full distribution that contains such data should take about 160 GB of disk space, and nearly 200 GB of memory to load successfully. Users should therefore check the configuration of their machine before turning this switch on; otherwise the program will crash.
- Type:
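The directory layout described for path above can be checked before constructing a reference. The following is a minimal sketch that uses only the standard library. The helper name check_reference_dir is hypothetical and not part of scalign; the file and directory names are taken from the description of path, and the "usable" rule is an interpretation of that description.

```python
from pathlib import Path
import tempfile


def check_reference_dir(path):
    """Report which documented atlas components exist under `path`.

    Hypothetical helper for illustration; it is not part of scalign.
    """
    root = Path(path)
    found = {
        "metadata": (root / "metadata.h5ad").is_file(),
        "scvi": (root / "scvi").is_dir(),
        "non_parametric": (root / "embedder.pkl").is_file(),
        "parametric": (root / "parametric").is_dir(),
    }
    # Assumed rule: metadata and scvi are required, plus at least one embedder.
    found["usable"] = (
        found["metadata"]
        and found["scvi"]
        and (found["non_parametric"] or found["parametric"])
    )
    return found


# Demonstration on a throwaway directory that mimics a lite atlas
# shipping only the non-parametric embedder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "metadata.h5ad").touch()
    (root / "scvi").mkdir()
    (root / "embedder.pkl").touch()
    report = check_reference_dir(root)
    print(report)
```

A check like this only confirms that the expected entries exist; it does not validate their contents.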
- property converter¶
A dictionary that converts the values of the variable metadata column specified by key_atlas_var to the corresponding atlas variable keys.
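To illustrate the shape of such a converter, the sketch below builds a plain dictionary from a small, made-up variable table. The table, the atlas variable keys (g0, g1) and the function build_converter are hypothetical; the real mapping is derived from the atlas metadata and may be structured differently.

```python
# Hypothetical variable metadata: atlas variable key -> metadata columns.
atlas_variables = {
    "g0": {".ensembl": "ENSG00000141510", ".name": "TP53"},
    "g1": {".ensembl": "ENSG00000012048", ".name": "BRCA1"},
}


def build_converter(variables, key_atlas_var):
    """Map each value of the chosen metadata column to its atlas variable key."""
    return {row[key_atlas_var]: key for key, row in variables.items()}


# With key_atlas_var = '.ensembl', ENSEMBL IDs are translated to atlas keys.
converter = build_converter(atlas_variables, ".ensembl")
print(converter["ENSG00000141510"])
```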
- density(query, stratification='query', atlas_ptsize=2, atlas_embedding=None, atlas_color_mode='categorical', key_atlas_var='.name', atlas_gene=None, atlas_hue=None, atlas_hue_order=None, atlas_default_color='#e0e0e0', atlas_alpha=1.0, atlas_palette='hls', atlas_rasterize=True, atlas_annotate=True, atlas_annotate_style='index', atlas_annotate_foreground='black', atlas_annotate_stroke='white', atlas_legend=True, key_query_embeddings='umap', query_plot=True, query_ptsize=8, query_hue=None, query_hue_order=None, query_default_color='black', query_alpha=0.5, query_palette='hls', query_rasterize=True, query_annotate=True, query_annotate_style='index', query_annotate_foreground='black', query_annotate_stroke='white', query_legend=True, contour_plot=True, contour_fill=False, contour_hue=None, contour_hue_order=None, contour_linewidth=0.8, contour_default_color='black', contour_palette='hls', contour_alpha=1, contour_levels=10, contour_bw=0.5, legend_col=1, add_outline=False, outline_color='black', width=5, height=5, dpi=100, elegant=False, title='Embeddings', save=None)¶
Plot mapping density
This function is a helper to plot alignment density. The plot is either shown in the interactive console or saved to disk.
- Parameters:
query (anndata.AnnData) – The mapped query set. reference.query() must be run beforehand, since this function requires the data to contain .uns['.align'] and .obsm['umap'].
stratification (Literal['query', 'atlas'] = 'query') – The plot colors only one of the two layers: either a categorical metadata from the atlas, or a metadata from the query set. The legend is shown automatically for the selected layer.
add_outline (bool = False) – Whether to add an outline to the atlas embedding region. This may emphasize the atlas boundary.
outline_color (str = 'black') – A named matplotlib color (or hex code) for the outline.
query_plot (bool = True) – Whether to plot the scatter points from the query dataset. Note that this does not affect the plotting of query labels or query legends if they are set to be plotted.
contour_plot (bool = True) – Whether to plot the isoheight contours.
legend_col (int = 1) – Number of columns used to display legend markers. Set this to a suitable number for aesthetics when the groupings have many possible values.
atlas_color_mode (Literal['categorical', 'expression'] = 'categorical') – How to color the atlas. If set to categorical, this requires atlas_hue to be set to a categorical metadata name. If set to expression, this plots the expression levels of a specified gene (given by atlas_gene) on the base UMAP. This requires an expression matrix to be loaded into the atlas when creating it (by supplying the use_expression_if_available argument).
atlas_gene (str = None) – The gene to plot. Must be a valid name present in .variables[key_atlas_var].
ptsize (float = (atlas: 2, query: 8)) – The point size of the atlas basis plot and the query scatter. Typically the query data points should be plotted larger than the atlas points, since the atlas contains more cells.
hue (str = None) – The categorical variable used to group the atlas or the query. Note that only the layer selected by stratification will be colored, since coloring both layers would obscure the graph. This variable must exist within the obs slot of the corresponding anndata. If set to None, the data points are plotted in the single color specified by atlas_default_color or query_default_color.
order (list[str] = None) – Specify the order of the hue variable. This is useful in combination with a manually specified palette to determine the exact colors used.
default_color (str = ('#e0e0e0', 'black')) – A named matplotlib color (or hex code) for the atlas scatter and the query scatter if not colored by category. If atlas_hue or query_hue is not None, the value of this parameter is ignored, and the coloring is then specified by atlas_palette and query_palette.
alpha (float = (1, 0.5)) – The opacity (alpha value) of the data points.
palette (str | list[str] = 'hls') – The color palette. Either a string naming a palette (following the palette name syntax of seaborn), or a list of color strings specifying exact colors (and their order). If the number of colors does not match the number of categorical values, the automatic palette cycling rule of matplotlib is applied.
rasterize (bool = True) – Whether to rasterize the scatter plot. We strongly recommend setting these values to True, because a large atlas will blow up the graphic object, resulting in extremely large vector files and slow performance.
annotate (bool = True) – If a hue is specified, whether to mark the categories on the map.
annotate_style (Literal['index', 'label'] = 'index') – The markers of categories on the map. index marks a circled index matching the legend marker, and label marks the category text.
annotate_foreground (str = 'black') – A named matplotlib color (or hex code). Foreground color of the annotated text.
annotate_stroke (str = 'white') – A named matplotlib color (or hex code). Stroke color of the annotated text.
legend (bool = True) – Whether to show the categorical legend.
contour_fill (bool = False) – Whether to fill the isoheight contours with a color gradient. If this is set to True, the value of contour_linewidth is ignored.
contour_linewidth (float = 0.8) – The line width of the non-filled isoheight contours.
contour_levels (int | list[float] = 10) – The levels of the contours. If a single integer is provided, the whole range is split evenly into that many levels (e.g. setting it to 5 has the same effect as [0.2, 0.4, 0.6, 0.8]); alternatively, specify a list of levels to plot the contours manually.
contour_bw (float = 0.5) – The larger the parameter is, the smoother the contours will be.
width (int = 5) – Width of figure
height (int = 5) – Height of figure
dpi (int = 100) – DPI. If saving to vector graphics (e.g. PDF, SVG etc.), note that some parts of the graphic are rasterized by default to reduce object size. The resolution of such rasterized objects is still affected by DPI.
elegant (bool = False) – Show no boundary.
title (str = 'Embeddings') – Title of the plot, or None to hide the title.
save (str = None) – If set to None, the plot will be displayed using matplotlib.pyplot.show(). Otherwise, set the parameter to a valid file name to save the image to disk.
- Returns:
If save is None, returns the figure as a matplotlib object. If save is set, writes the image to disk and returns None.
- Return type:
None | Figure
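The documented behavior of contour_levels (an integer splits the range evenly, so 5 corresponds to [0.2, 0.4, 0.6, 0.8]) can be sketched as follows. The function expand_contour_levels is hypothetical and only mirrors the example given in the parameter description; how scalign expands the levels internally is not shown in this reference and may differ.

```python
def expand_contour_levels(levels):
    """Return explicit contour levels for an integer or a list of floats.

    Hypothetical helper that reproduces the documented example only.
    """
    if isinstance(levels, int):
        if levels < 2:
            raise ValueError("an integer level count must be at least 2")
        # Evenly split the (0, 1) range, excluding both end points.
        return [i / levels for i in range(1, levels)]
    # A list is taken as the manual specification and passed through.
    return [float(v) for v in levels]


print(expand_contour_levels(5))
print(expand_contour_levels([0.1, 0.5, 0.9]))
```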
- property epoch¶
Number of epochs each sample has gone through during training.
- property expression¶
Get the expression matrix in log normalized counts.
- network_summary()¶
Print the network summary for a parametric model. This function prints an error message if the non-parametric model is loaded.
- property observations¶
Readonly observation metadata of the atlas
- query(input, batch_key=None, key_var=None, key_query_latent='scvi', key_query_embeddings='umap', scvi_epoch_reduction=3, retrain=False, landmark_reduction=60, landmark_loss_weight=0.01, n_jobs=1, n_epochs=10)¶
Query the reference atlas with a dataset
- Parameters:
input (anndata.AnnData) –
The query dataset to be aligned. The variable identifiers will be mapped to the reference atlas through the variable metadata column specified in reference(key_atlas_var = ...). This column in the atlas gene metadata is matched against the query dataset’s metadata column specified by key_var. If key_var is not specified, the query dataset’s variable names will be used as identifiers.
The query dataset must have unique variable names and observation names; otherwise the program will raise an error. You can use index.is_unique to check this.
batch_key (str) – The observation metadata key specifying sample batches. This will be used to correct batch effects using the scvi model. If not specified, the program will generate an obs slot named batch and assign all samples to the same batch. Note that if you have an observation metadata column named batch, it will be overwritten.
key_var (str) – The variable metadata key specifying the gene names. This should match the key selected in the atlas (by default, a list of ENSEMBL IDs). If not specified, the program will use the variable names. You should make sure that the contents of this column are unique. After the alignment, the variable names will be changed to match those of the atlas. The original variable names will be stored in the .index slot. You should keep a copy of them if you need them afterwards.
key_query_latent (str) – The obsm key to store the scVI latent space. If there is already a key with the same name, the calculation of scVI components is skipped, and the data inside the slot is used directly as the scVI latent space.
key_query_embeddings (str) –
The obsm key to store UMAP embeddings. These embeddings will mostly share the same structure as the reference atlas, since the exact UMAP model is used to transform the latent space. If retrain is set to False, the UMAP will just serve as a prediction model to transform between dimensions without training on the new data. This is rather fast, but may introduce errors in the predicted embeddings (since the model has not seen the data during its training). The non-parametric model does not support retraining, and can only be used as a prediction model.
Parametric UMAP models can be retrained with new data. This helps the new data points integrate better into the atlas and yields a more accurate alignment. However, the atlas embedding is somewhat affected by the new points. Though we use landmark points to help preserve the original structure, there may be some small differences between the new atlas and the original one.
If there is already a key with the same name, UMAP embedding calculation will be skipped.
scvi_epoch_reduction (int) – Since the scVI model has already been trained, only a few epochs are needed to adapt it to the new data. The number of epochs may be lower than what scVI would normally use. This saves a lot of time when running on CPU machines without reducing the performance too much. By default, the reduction ratio is set to 3.
retrain (bool) – Whether to retrain the model. This is only supported for the parametric model.
landmark_reduction (int) – The ratio used to randomly select landmark points. The trainer selects 1 out of N points from the original atlas to keep the overall space from changing dramatically. The smaller the reduction ratio, the more samples from the original atlas are used in retraining. Set to 60 by default.
landmark_loss_weight (float) – The weight of the landmark loss. By default 0.01.
n_jobs (int) – Number of threads to use when running UMAP embedding.
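The "1 out of N" rule described for landmark_reduction can be illustrated with a short sketch. The function pick_landmarks and its seed argument are hypothetical, and uniform sampling without replacement is an assumption made for illustration; the actual selection scheme used by scalign is not specified here.

```python
import random


def pick_landmarks(n_atlas_cells, landmark_reduction=60, seed=0):
    """Randomly pick roughly 1 out of `landmark_reduction` atlas cells.

    Hypothetical helper; returns sorted indices of the selected cells.
    """
    n_landmarks = max(1, n_atlas_cells // landmark_reduction)
    rng = random.Random(seed)
    # Sample indices without replacement so no cell is picked twice.
    return sorted(rng.sample(range(n_atlas_cells), n_landmarks))


# For an atlas of 1,250,000 cells and the default ratio of 60,
# about 20,833 cells would serve as landmarks under this assumption.
landmarks = pick_landmarks(1_250_000, landmark_reduction=60)
print(len(landmarks))
```

Under this reading, halving landmark_reduction doubles the number of atlas cells that take part in retraining, which preserves more of the original structure at the cost of a longer retraining run.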
- Returns:
The modified anndata object, with the following slots set:
.obs: batch
.var: index, var_names
.obsm: key_query_latent, key_query_embeddings
.uns: .align
These changes are made in place; however, the modified object is still returned for convenience.
- Return type:
anndata.AnnData
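Before calling query(), it can help to confirm that the identifiers are unique and that they overlap with the atlas. For an AnnData object, index.is_unique (as mentioned above) is the direct check; the sketch below uses plain lists to stay dependency-free. The helper names and the example identifiers are hypothetical.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted list of names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def gene_overlap(query_ids, atlas_ids):
    """Return the fraction of atlas identifiers that also occur in the query."""
    atlas = set(atlas_ids)
    if not atlas:
        return 0.0
    return len(set(query_ids) & atlas) / len(atlas)


# Made-up identifiers standing in for the key_var / key_atlas_var columns.
query_ids = ["GENE_A", "GENE_B", "GENE_B", "GENE_C"]
atlas_ids = ["GENE_A", "GENE_B", "GENE_D", "GENE_E"]

duplicates = find_duplicates(query_ids)
overlap = gene_overlap(query_ids, atlas_ids)
print(duplicates, overlap)
```

A non-empty duplicates list means the dataset would be rejected as described above, and a very low overlap usually suggests that key_var points at a different kind of identifier than key_atlas_var.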
- property reproduceable¶
Whether the model was trained with a deterministic random state. Setting the random state makes the training process reproducible, but restricts neighbor finding to a single CPU core. The random seed cannot be changed after the model is built, so this field is read-only. You can ask the atlas distributor for a reproducible model, or train a model from counts yourself.
However, since the prediction process does not alter the model, you will still get reproducible results if you map the query dataset with retrain set to False. Minor differences will only occur if you retrain the model, and these should be small since we apply an additional loss that tries to keep the atlas in the same shape.
- training_loss()¶
Get the training loss vector recorded during the model’s training and retraining process. The original vector
- property use_expression¶
Whether the atlas mapper loads the expression matrix.
- property use_gpu¶
Whether the atlas mapper uses the GPU.
- property use_parametric¶
Whether the atlas mapper uses the parametric UMAP model. You may set the use_parametric_if_available switch when creating the reference to indicate that you prefer a parametric model. However, if the keras and tensorflow packages are not configured correctly, the program automatically falls back to the non-parametric model. The model actually in use might therefore not be the one you expected; check this property to see which model is loaded and used.
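Whether the fallback is likely to trigger can be estimated ahead of time by testing if the two optional packages are importable. This is a minimal sketch using importlib from the standard library; the function parametric_dependencies_present is hypothetical, it does not check the version requirements, and it is not the detection logic scalign itself uses.

```python
import importlib.util


def parametric_dependencies_present(modules=("keras", "tensorflow")):
    """Return True if every listed module can be located for import.

    Hypothetical helper; find_spec locates top-level modules without
    importing them, so the check is cheap.
    """
    return all(importlib.util.find_spec(name) is not None for name in modules)


available = parametric_dependencies_present()
print("parametric model dependencies found:", available)
```

After the reference is created, the use_parametric property remains the authoritative answer to which model was loaded.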
- property variables¶
Readonly variable metadata of the atlas