4 Cell QC

Compiled: 2025-05-08 Written by Jiaying Zeng and Daihan Ji

library(SingleCellMQC)
#Load a pre-existing Seurat object or process raw data using SingleCellMQC.
pbmc <- readRDS("/data/pbmc.rds")

The cell-level QC in SingleCellMQC includes the identification of (i) empty droplets, (ii) doublets, and (iii) low-quality cells. By integrating a variety of publicly available tools and in-house developed QC strategies, the cell QC module provides a systematic framework to detect and remove problematic droplets or cells across multiple omics layers.

Where can you view the results of cell-level QC?

You can explore them in pbmc@meta.data or pbmc@misc[["SingleCellMQC"]][["perQCMetrics"]][[perCell]],pbmc@misc[["SingleCellMQC"]][["LowQuality"]],pbmc@misc[["SingleCellMQC"]][["Doublet"]].

4.1 Empty droplets detection (If Required)

Generally, count matrix files generated from single-cell preprocessing tools such as Cell Ranger and dnbc4tools have already undergone empty droplet removal. However, if users wish to perform this step themselves, SingleCellMQC implements empty droplet detection based on scRNA-seq data using the dropletUtils package.

# Empty droplets detection (DropletUtils)
out <- RunEmptyDroplet(dir_name="/data/raw/")

4.2 Low quality cells detection

Before you conduct low quality cells detection, ensure that you have run CalculateMetrics.

4.2.1 Fixed cutoff

In this method, you can filter out low-quality cells by setting a fixed cutoff for certain cell QC metrics. Before choosing a suitable cutoff, you can use GetMetricsRange to obtain the commonly used cutoff for the corresponding tissue. The cutoffs for each QC parameter was obtained by 240 single-cell studies spanning 11 human tissues. Here, we take “blood” as an example.

cutoff_plot <- GetMetricsRange(tissue = "blood")
cutoff_plot$MT

cutoff_plot$nCount_min

cutoff_plot$nCount_max

Alternatively, you can visualize cell QC metrics using regular violin plots, and manually select appropriate cutoffs by yourself.

# View the cell QC metrics, such as percentage of mitochondria. 
PlotCellMetrics(pbmc, metrics ="percent.mt" , log.y = F )

Then you can select your cutoff to set a fixed threshold to mark low-quality cells.

pbmc <- RunLQ_fixed(pbmc, percent.mt = 10, min.nCount_RNA=500, min.nFeature_RNA= 200)
# Show the number of low-quality cells marked.
table(pbmc$lq_fixed, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  401  936 1803     473
##   Pass 9735 8127 1894    5129

4.2.2 MAD method

You can use SingleCellMQC to automatically identify low-quality cells by calculating outliers for RNA metrics (nCount_RNA, nFeature_RNA, percent.mt) and ADT metrics (nCount_ADT, nFeature_ADT, percent.isotype) using the Median Absolute Deviation (MAD) method.

# `split.by` can be set to sample or batch information, etc., to explore the number of low-quality cells detected in different groups.
# The `nmads` parameter is usually set to 3.
pbmc <- RunLQ_MAD(pbmc, split.by = "orig.ident", nmads = 3)

table(pbmc$lq_RNA_gene_3mad, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  621  449   53     481
##   Pass 9515 8614 3644    5121
table(pbmc$lq_RNA_mt_3mad, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  787 1218  565     808
##   Pass 9349 7845 3132    4794

4.2.3 miQC method

Other publicly available tools for low-quality cells detection are integrated in this part. The miQC method is one of them. You can turn to reference and github link for more detailed information.

# `split.by` can be set to sample or batch information, etc.
pbmc <- RunLQ_miQC(pbmc, split.by = "orig.ident")

table(pbmc$lq_miQC, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  753 1255  869     977
##   Pass 9383 7808 2828    4625

4.2.4 ddqc method

The ddqc method is another publicly available tool. You can turn to reference and github link for more detailed information.

# `split.by` can be set to sample or batch information, etc.
pbmc <- RunLQ_ddqc(pbmc, split.by = "orig.ident")

table(pbmc$lq_ddqc, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  298  500  230     223
##   Pass 9838 8563 3467    5379

4.2.5 Run multiple methods

Use RunLQ and list the methods you want in the methods parameter.

pbmc <- RunLQ(pbmc, split.by = "orig.ident", methods=c("MAD", "miQC") )

4.3 Check low-quality cells (optional)

To minimize false positives, SingleCellMQC allows re-clustering of flagged cells based on their expression profiles before removal. This step helps identify biologically relevant cell populations that may exhibit outlier QC metrics due to molecular expression variability rather than poor quality, and these cells are rescued and retained for downstream analysis.

# pbmc <- RunLQ_fixed(pbmc, min.nFeature_RNA = 200, min.nCount_RNA = 500, percent.mt = 10)
# For `method_column` parameter, enter the colnames of the result of previous methods that you want to re-cluster, such as "lq_Fixed".
out <- RunLQReClustering(pbmc, method_column = c("lq_fixed"),cluster_resolution = 1)

After the re-clustering is completed, the following three parts can be used to confirm the cell identity and check whether there are any missed cell groups.

Use COSG method to find markers in each low-quality cluster, assisting in cell identity determination.

COSG_marker <- FindAllMarkerCOSG(LQ_result, cluster.by = "seurat_clusters")
head(COSG_marker)

##        gene cluster  cosg avg_logFC meanlog_1 meanlog_2  pct_1  pct_2
## GNLY   GNLY  Fail_4 0.321     1.826    3.1322    1.3065 0.6639 0.1033
## MATK   MATK  Fail_4 0.219     1.028    1.2990    0.2708 0.2389 0.0384
## SPON2 SPON2  Fail_4 0.198     1.202    1.7013    0.4992 0.3444 0.0621
## SYNE1 SYNE1  Fail_4 0.197     1.177    2.1275    0.9501 0.5333 0.1491
## CCL5   CCL5  Fail_4 0.187     1.301    2.7265    1.4259 0.6583 0.1589
## SYTL2 SYTL2  Fail_4 0.182     0.956    1.3290    0.3731 0.2500 0.0504

Call ClusterProfiler package to perform pathway enrichment analysis to view the pathways of each low-quality cluster, assisting in cell identity determination.

PlotMarkerEnrichKegg(cosg_marker = COSG_marker)

Perform ScType annotation to assist in determining cell identities. Cells labeled as “Unknown” are more likely to represent low-quality cells.

LQ_result <- RunScType(LQ_result, data_source ="ScType")
PlotReducedDim(LQ_result, group.by = "seurat_clusters")+
PlotReducedDim(LQ_result, group.by = "ScType")

4.4 Doublets detection

For doublets detection, SingleCellMQC integrates five widely used RNA expression-based methods, including DoubletFinder, scDbFinder, cxds, bcds, hybrid, one V(D)J chain based method on TCR/BCR sequencing data, along with one in-house developed multi-omics strategy based on surface protein expression (ADT-based).

4.4.1 scDblFinder method

The scDblFinder method is one of the publicly available tools for doublets detection. You can turn to reference and github link for more detailed information.

# The `split.by` parameter must refer to your sample. 
# The `add.Seurat` parameter means whether you want to add results in the Seurat object.
pbmc<- RunDbt_scDblFinder(pbmc, split.by = "orig.ident", add.Seurat=T, do.topscore=T)

table(pbmc$db_scDblFinder, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  821  657  109     251
##   Pass 9315 8406 3588    5351
head(pbmc$db_scDblFinder_score)
## TP1_AAACCTGAGAATGTTG-1 TP1_AAACCTGAGCCCTAAT-1 TP1_AAACCTGAGGACTGGT-1 TP1_AAACCTGCAAGCGATG-1 TP1_AAACCTGCAATGGAGC-1 TP1_AAACCTGCACAGTCGC-1 
##           0.0060459864           0.9996028543           0.0016912580           0.0008436530           0.9318320155           0.0001187744

4.4.2 hybrid method

The hybrid method is one of the publicly available tools for doublets detection. You can turn to reference and github link for more detailed information.

# The `split.by` parameter must refer to your sample. 
# The `add.Seurat` parameter means whether you want to add results in the Seurat object.
pbmc<- RunDbt_hybrid(pbmc, split.by = "orig.ident", add.Seurat=T)

table(pbmc$db_hybrid, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  821  657  109     251
##   Pass 9315 8406 3588    5351
head(pbmc$db_hybrid_score)
## TP1_AAACCTGAGAATGTTG-1 TP1_AAACCTGAGCCCTAAT-1 TP1_AAACCTGAGGACTGGT-1 TP1_AAACCTGCAAGCGATG-1 TP1_AAACCTGCAATGGAGC-1 TP1_AAACCTGCACAGTCGC-1 
##              0.3104272              1.7186222              0.2577161              0.2417090              1.0813173              0.0843216

4.4.3 bcds method

The bcds method is one of the publicly available tools for doublets detection. You can turn to reference and github link for more detailed information.

# The `split.by` parameter must refer to your sample. 
# The `add.Seurat` parameter means whether you want to add results in the Seurat object.
pbmc<- RunDbt_bcds(pbmc, split.by = "orig.ident", add.Seurat=T)

table(pbmc$db_bcds, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  821  657  109     251
##   Pass 9315 8406 3588    5351
head(pbmc$db_bcds_score)
## TP1_AAACCTGAGAATGTTG-1 TP1_AAACCTGAGCCCTAAT-1 TP1_AAACCTGAGGACTGGT-1 TP1_AAACCTGCAAGCGATG-1 TP1_AAACCTGCAATGGAGC-1 TP1_AAACCTGCACAGTCGC-1 
##             0.01293355             0.93673164             0.02887737             0.02212005             0.80253756             0.01673022

4.4.4 cxds method

The cxds method is one of the publicly available tools for doublets detection. You can turn to reference and github link for more detailed information.

# The `split.by` parameter must refer to your sample. 
# The `add.Seurat` parameter means whether you want to add results in the Seurat object.
pbmc<- RunDbt_cxds(pbmc, split.by = "orig.ident", add.Seurat=T)

table(pbmc$db_cxds, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  821  657  109     251
##   Pass 9315 8406 3588    5351
head(pbmc$db_cxds_score)
## TP1_AAACCTGAGAATGTTG-1 TP1_AAACCTGAGCCCTAAT-1 TP1_AAACCTGAGGACTGGT-1 TP1_AAACCTGCAAGCGATG-1 TP1_AAACCTGCAATGGAGC-1 TP1_AAACCTGCACAGTCGC-1 
##              160102.42              497734.06              147398.61              145724.55              212804.72               46557.39

4.4.5 DoubletFinder method

The DoubletFinder method is one of the publicly available tools for doublets detection. You can turn to reference and github link for more detailed information.

# The `split.by` parameter must refer to your sample. 
# The `add.Seurat` parameter means whether you want to add results in the Seurat object.
pbmc<- RunDbt_DoubletFinder(pbmc, split.by = "orig.ident", add.Seurat=T)

table(pbmc$db_DoubletFinder, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  821  657  109     251
##   Pass 9315 8406 3588    5351
head(pbmc$db_DoubletFinder_score)
## TP1_AAACCTGAGAATGTTG-1 TP1_AAACCTGAGCCCTAAT-1 TP1_AAACCTGAGGACTGGT-1 TP1_AAACCTGCAAGCGATG-1 TP1_AAACCTGCAATGGAGC-1 TP1_AAACCTGCACAGTCGC-1 
##              0.1684285              0.3484848              0.1758280              0.1370684              0.2152925              0.2026075

4.4.6 VDJ method

The VDJ-based strategy can identify homotypic doublets by detecting multi-chain cells. In immune repertoire sequencing, cells with multiple TCR or BCR chains (or both) are flagged as potential doublets.

# The premise is that you have VDJ data and it has been processed by the `runMetrics` function.
pbmc<- RunDbt_VDJ(pbmc, add.Seurat=T)

table(pbmc$db_VDJ, pbmc$orig.ident)
##       
##         TP1  TP2  TP3 TP3-rep
##   Fail  306  259   63     136
##   Pass 2956 2981 1107    2163

4.4.7 ADT method

The ADT-based strategy can identify heterotypic doublets by analyzing mutually exclusive surface markers in ADT sequencing. This ADT-based method calculates positive expression thresholds for each surface protein. Cells with both expressions of exclusive paired surface markers were considered as doublets (Details see our manuscript). Users can customize exclusive surface marker pairs to target specific cell types based on their experiment needs, such as CD19 and CD3, which differentiate B and T cells. Here we take CD20/CD3 and CD19/CD3 as examples.

You can use the ADT method to detect doublets. In default mode, ADT-based clustering is applied.

pbmc <- RunDbt_ADT(pbmc, split.by = "orig.ident", add.Seurat =T, 
                   feature1 = c("Hu.CD20-2H7","Hu.CD19"), 
                   feature2 = c("Hu.CD3-UCHT1","Hu.CD3-UCHT1"))

table(pbmc$db_ADT, pbmc$orig.ident)
##       
##          TP1   TP2   TP3 TP3-rep
##   Fail    87   165   153      75
##   Pass 10049  8898  3544    5527

If you are concerned about the impact of protein background noise, you can use the do.correct parameter to call the scCDC package to denoise the ADT data.

pbmc<- RunDbt_ADT(pbmc, feature1 = c("Hu.CD20-2H7","Hu.CD19"), 
                   feature2 = c("Hu.CD3-UCHT1","Hu.CD3-UCHT1"), 
                  split.by = "orig.ident", preprocess = "rna.umap", do.correct = T)

What if you want to return the threshold of expression levels used to distinguish between negative and positive cell populations?

# Calculate the cutoff of doublets
cutoff <- RunDbt_ADT(pbmc, feature1 = c("Hu.CD20-2H7","Hu.CD19"), 
                   feature2 = c("Hu.CD3-UCHT1","Hu.CD3-UCHT1"), 
                   return.cutoff = T, split.by = "orig.ident")

head(cutoff)
##   Sample cluster_num resolution    feature1   cutoff1     feature2  cutoff2
## 1    TP1          24        1.5 Hu.CD20-2H7 0.7860469 Hu.CD3-UCHT1 1.178921
## 2    TP1          24        1.5     Hu.CD19 0.7283664 Hu.CD3-UCHT1 1.178921
## 3    TP2          24        1.5 Hu.CD20-2H7 0.8705569 Hu.CD3-UCHT1 1.029872
## 4    TP2          24        1.5     Hu.CD19 0.5773097 Hu.CD3-UCHT1 1.029872
## 5    TP3          21        1.5 Hu.CD20-2H7 0.8418260 Hu.CD3-UCHT1 1.227219
## 6    TP3          21        1.5     Hu.CD19 0.5099738 Hu.CD3-UCHT1 1.227219
# Check out the results of TP1
PlotADTCutoff(pbmc, cutoff, feature1 = "Hu.CD20-2H7", feature2 = "Hu.CD3-UCHT1",  
              Sample="TP1", group.by = "chain_pair")

4.4.8 Run multiple methods

Use RunDbt and list the methods you want in the methods parameter.

pbmc<- RunDbt(pbmc, split.by = "orig.ident", add.Seurat=T, methods= c("scDblFinder", "bcds"),do.topscore=T )

4.5 Visualization of cell QC results

4.5.1 Visualization of the percentage of cells filtered by different methods

# Choose the detection session you want in `type.detection` parameter, including "lq" for low-quality cells and "db" for doublets.
# Plotting the percentage of cells filtered by low-quality cell detection methods
PlotCellMethodFiltration(pbmc, type.detection = "lq", return.type = "plot")

# Plotting the percentage of cells filtered by doublet detection methods
PlotCellMethodFiltration(pbmc, type.detection = "db", return.type = "plot")

# Plotting interactive table 
interactive_table <- PlotCellMethodFiltration(pbmc, type.detection = "lq", return.type = "interactive_table")
interactive_table$pct ##Show filtered percentages

Low quality cell filtered percentage

interactive_table$count ##Show filtered counts

Low quality cell filtered number

4.5.2 Overlap between different methods (UpSet plot)

Low-quality cell results as an example, the same applies to doublet methods.

# Plotting overlap of low-quality cell detection methods (all samples).
PlotCellMethodUpset(pbmc, type.detection = "lq")

# Plotting overlap of low-quality cell detection methods (each sample).
# Choose the column representing your sample for the `split.by` parameter.  
out <- PlotCellMethodUpset(pbmc, type.detection = "lq", split.by = "orig.ident")
names(out)

## [1] "TP1"     "TP2"     "TP3"     "TP3-rep"

out$TP1

4.5.3 Comparison of metrics between Pass and Fail cells identified by methods

Taking percentage of percent.mt of low-quality cells as an example, the same applies to other metrics and doublets.

# Plotting comparison of low-quality cell detection methods (all samples).
PlotCellMethodVln(pbmc, metrics.by = "percent.mt", type.detection = "lq")

# Plotting comparison of low-quality cell detection methods (each sample).
# Choose the column representing your sample for the `split.by` parameter. 
out <- PlotCellMethodVln(pbmc, metrics.by = "percent.mt", type.detection = "lq", split.by = "orig.ident")
names(out)

## [1] "TP1"     "TP2"     "TP3"     "TP3-rep"

out$TP1

4.5.4 Visualization of scatter plots for cell metrics

You can view two metrics simultaneously in 2D mode.

# Overall visualization of cell metrics
PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA") , log.x = T, log.y = T)

# Visualization of cell metrics for certain detection method, like `lq_fixed`
PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA") , log.x = T, log.y = T, group.by = "lq_fixed")

# Visualization of cell metrics for certain detection method in each sample
# PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA") , log.x = T, log.y = T, group.by = "lq_fixed", split.by="orig.ident")

You can view three metrics simultaneously in 3D mode.

# Overall visualization of cell metrics
PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA","nCount_RNA") , log.x = T, log.y = T)

# Visualization of cell metrics for certain detection method, like `lq_fixed`
PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA","nCount_RNA") , log.x = T, log.y = T, group.by = "lq_fixed")

# Visualization of cell metrics for certain detection method in each sample
# PlotCellMetricsScatter(pbmc, metrics.by = c("percent.mt", "nFeature_RNA","nCount_RNA") , log.x = T, log.y = T, group.by = "lq_fixed", split.by="orig.ident")

4.6 Filter out problematic droplets or cells

After executing the above steps, you can choose the filtering strategy you need. For example, you can select the results from lq_RNA_mt_3mad and db_scDblFinder to filter out the cells marked as Fail.

pbmc <- FilterCells(pbmc, filter_columns = c("lq_RNA_mt_3mad", "db_scDblFinder"), 
                    filter_logic = "or")

If you want to filter out the Fail cells that are in the intersection of multiple methods, such as db_scDblFinder and db_hybrid, you can change the filter_logic to “and”.

pbmc <- FilterCells(pbmc, filter_columns = c("db_scDblFinder", "db_hybrid"), 
                    filter_logic = "and")

If you want to return the specific count values of each sample before and after filtering, take lq_RNA_mt_3mad for example.

out <- FilterCells(pbmc,  filter_columns = c("lq_RNA_mt_3mad", "db_scDblFinder"), 
                   filter_logic = "or", return.table = T, split.by = "orig.ident")
## Total cells before filtering: 28498 
## Total cells after filtering: 23344 
## >>>> Sample: TP1 
##   Cells before filtering: 10136 
##   Cells after filtering: 8549 
## >>>> Sample: TP2 
##   Cells before filtering: 9063 
##   Cells after filtering: 7220 
## >>>> Sample: TP3 
##   Cells before filtering: 3697 
##   Cells after filtering: 3023 
## >>>> Sample: TP3-rep 
##   Cells before filtering: 5602 
##   Cells after filtering: 4552
out
##    Sample Before_Filtering After_Filtering
## 1   Total            28498           23344
## 2     TP1            10136            8549
## 3     TP2             9063            7220
## 4     TP3             3697            3023
## 5 TP3-rep             5602            4552