In this vignette, we demonstrate how to use Canek to correct batch effects from SingleCellExperiment objects following the recommendations from Bioconductor multi-sample analysis. The toy data sets used in this vignette are included in the Canek R package.

Init

library(Canek)
library(SingleCellExperiment)
library(scater)
library(batchelor)
library(scran)
library(ggplot2)
library(patchwork)

Create SingleCellExperiment objects

#Create independent SingleCellExperiment objects
sce <- lapply(names(Canek::SimBatches$batches), function(batch) {
  counts <- SimBatches$batches[[batch]]     #get the counts
  SingleCellExperiment(list(counts=counts),      #create the sce object
                       mainExpName = "example_sce")
})

#Add batch labels
names_batch <- c("B1", "B2")
names(sce) <- names_batch
sce$B1[["batch"]] <- "B1"
sce$B2[["batch"]] <- "B2"

#Add cell-type labels (included in Canek's package)
celltypes <- Canek::SimBatches$cell_types
ct_B1_idx <- 1:ncol(sce$B1)
ct_B2_idx <- (ncol(sce$B1)+1):length(celltypes)
sce$B1[["celltype"]] <- celltypes[ct_B1_idx]
sce$B2[["celltype"]] <- celltypes[ct_B2_idx]

Let’s check the cells distribution across batches.

table(sce$B1$batch)
#> 
#>  B1 
#> 631
table(sce$B2$batch)
#> 
#>  B2 
#> 948

Pre-processing

In the pre-processing steps, we perform the normalization of counts and the variance analysis of genes independently for each batch. Then, we find common variable genes to use in downstream analyses.

Normalization

We perform scaling normalization within each batch using the multiBatchNorm function from batchelor Bioconductor package.

sce <- batchelor::multiBatchNorm(sce[["B1"]], sce[["B2"]])
names(sce) <- names_batch

Feature selection

We use the combineVar function from scran Bioconductor package. This function receives independent variance models of genes for each of the batches. This allows us to select highly variable genes while preserving within-batch differences.

We found 273 combined variable features.

#Get independent features' variant models
sce_var <- lapply(X = sce, FUN = scran::modelGeneVar) 
#Combine the independent variables
combined_features <- scran::combineVar(sce_var[["B1"]], sce_var[["B2"]])
#Select features using the `bio` statistics (see combineVar documentation for further details)
combined_features <- combined_features$bio > 0
sum(combined_features)
#> [1] 273

Visualization before batch effects correction

Then, we merge the two datasets and use the combined variable features for the Principal Components Analysis.

uncorrected <- cbind(sce[["B1"]], sce[["B2"]])
uncorrected <- scater::runPCA(x = uncorrected[combined_features,], scale = TRUE)

Let’s check out the elbow plot of the PCs.

pca_uncorrected <- SingleCellExperiment::reducedDim(x = uncorrected, type = "PCA")
variance <- apply(X = pca_uncorrected, MARGIN = 2, FUN = var)
plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")

The variance stabilizes after the first seven PCs. In this test, we will use 10 PCs to calculate the UMAP visualization, but feel free to try other numbers.

set.seed(777)
uncorrected <- scater::runUMAP(x = uncorrected, dimred = "PCA", pca = 10)

We can now visualize the uncorrected data by batch and cell-type labels.

p1 <- scater::plotUMAP(object = uncorrected, colour_by="batch") 
p2 <- scater::plotUMAP(object = uncorrected, colour_by="celltype") 
p1 + p2

We can observe that the cells labeled as Cell Type 1 got divided into two groups that correlated with batch labels. Let’s minimize this batch differences with Canek.

Correcting batch effects with Canek

Canek accepts a SingleCellExperiments object with a batch identifier. We can pass a vector of variable features to use in the integration.

features <- rownames(uncorrected)[combined_features]
#RunCanek in a list of SingleCellExperiment objects
#corrected <- Canek::RunCanek(x = sce, features = features)
#RunCanek in a single object with a batch identifier
corrected <- Canek::RunCanek(x = uncorrected, batches = "batch", features = features)

Visualization after batch effects correction with Canek

To perform PCA it’s important to use the corrected log counts. These are saved in the assay Canek in the SingleCellExperiment object and can be specified by changing the exprs_values parameter.

corrected <- scater::runPCA(x = corrected, scale = TRUE, exprs_values = "Canek")

Let’s check out the elbow plot after correction.

pca_corrected <- SingleCellExperiment::reducedDim(x = corrected, type = "PCA")
variance <- apply(X = pca_corrected, MARGIN = 2, FUN = var)
plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")

After correction, the elbow plot has changed and 5 PCs will be enough to calculate the UMAP visualization.

set.seed(777)
corrected <- scater::runUMAP(x = corrected, dimred = "PCA", pca = 5)

We can now visualize the corrected data by batch and cell-type labels. We can observe that after batch-correction with Canek, we minimize the batch difference in cells from Cell Type 1 while preserving the other cell types.

p1 <- scater::plotUMAP(object = corrected, colour_by="batch")
p2 <- scater::plotUMAP(object = corrected, colour_by="celltype")
p1 + p2

## Session info

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] patchwork_1.1.3             scran_1.28.2               
#>  [3] batchelor_1.16.0            scater_1.28.0              
#>  [5] ggplot2_3.4.4               scuttle_1.10.3             
#>  [7] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
#>  [9] Biobase_2.60.0              GenomicRanges_1.52.1       
#> [11] GenomeInfoDb_1.36.4         IRanges_2.34.1             
#> [13] S4Vectors_0.38.2            BiocGenerics_0.46.0        
#> [15] MatrixGenerics_1.12.3       matrixStats_1.0.0          
#> [17] Canek_0.2.4                
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7              gridExtra_2.3            
#>   [3] rlang_1.1.1               magrittr_2.0.3           
#>   [5] compiler_4.3.2            flexmix_2.3-19           
#>   [7] DelayedMatrixStats_1.22.6 systemfonts_1.0.5        
#>   [9] vctrs_0.6.4               stringr_1.5.0            
#>  [11] pkgconfig_2.0.3           crayon_1.5.2             
#>  [13] fastmap_1.1.1             XVector_0.40.0           
#>  [15] labeling_0.4.3            utf8_1.2.4               
#>  [17] rmarkdown_2.25            ggbeeswarm_0.7.2         
#>  [19] ragg_1.2.6                numbers_0.8-5            
#>  [21] purrr_1.0.2               xfun_0.41                
#>  [23] modeltools_0.2-23         bluster_1.10.0           
#>  [25] zlibbioc_1.46.0           cachem_1.0.8             
#>  [27] beachmat_2.16.0           jsonlite_1.8.7           
#>  [29] highr_0.10                DelayedArray_0.26.7      
#>  [31] fpc_2.2-10                BiocParallel_1.34.2      
#>  [33] irlba_2.3.5.1             parallel_4.3.2           
#>  [35] prabclus_2.3-3            cluster_2.1.4            
#>  [37] R6_2.5.1                  bslib_0.5.1              
#>  [39] stringi_1.7.12            limma_3.56.2             
#>  [41] jquerylib_0.1.4           diptest_0.76-0           
#>  [43] Rcpp_1.0.11               knitr_1.45               
#>  [45] FNN_1.1.3.2               Matrix_1.6-1.1           
#>  [47] nnet_7.3-19               igraph_1.5.1             
#>  [49] tidyselect_1.2.0          viridis_0.6.4            
#>  [51] abind_1.4-5               yaml_2.3.7               
#>  [53] codetools_0.2-19          lattice_0.21-9           
#>  [55] tibble_3.2.1              withr_2.5.2              
#>  [57] evaluate_0.23             desc_1.4.2               
#>  [59] mclust_6.0.0              kernlab_0.9-32           
#>  [61] pillar_1.9.0              generics_0.1.3           
#>  [63] rprojroot_2.0.3           RCurl_1.98-1.13          
#>  [65] sparseMatrixStats_1.12.2  munsell_0.5.0            
#>  [67] scales_1.2.1              class_7.3-22             
#>  [69] glue_1.6.2                metapod_1.8.0            
#>  [71] tools_4.3.2               BiocNeighbors_1.18.0     
#>  [73] robustbase_0.99-0         ScaledMatrix_1.8.1       
#>  [75] locfit_1.5-9.8            fs_1.6.3                 
#>  [77] cowplot_1.1.1             grid_4.3.2               
#>  [79] edgeR_3.42.4              colorspace_2.1-0         
#>  [81] GenomeInfoDbData_1.2.10   beeswarm_0.4.0           
#>  [83] BiocSingular_1.16.0       vipor_0.4.5              
#>  [85] cli_3.6.1                 rsvd_1.0.5               
#>  [87] textshaping_0.3.7         fansi_1.0.5              
#>  [89] viridisLite_0.4.2         S4Arrays_1.0.6           
#>  [91] dplyr_1.1.3               uwot_0.1.16              
#>  [93] ResidualMatrix_1.10.0     gtable_0.3.4             
#>  [95] DEoptimR_1.1-3            sass_0.4.7               
#>  [97] digest_0.6.33             dqrng_0.3.1              
#>  [99] ggrepel_0.9.4             farver_2.1.1             
#> [101] memoise_2.0.1             htmltools_0.5.6.1        
#> [103] pkgdown_2.0.7             lifecycle_1.0.3          
#> [105] statmod_1.5.0             MASS_7.3-60