2 Load and add data
Compiled: 2025-05-08 Written by Jiaying Zeng and Daihan Ji
SingleCellMQC provides a number of functions that simplify loading single-cell multi-omics data and clinical information. Supported data types are scRNA-seq, CITE-seq (RNA+ADT), and scTCR/BCR-seq. In general, GEX data is assembled into a Seurat object automatically after reading. To reduce memory usage, we encourage reading and saving data with the BPCells package, which our read functions support.
2.1 Loading data of scRNA-seq or CITE-seq
2.1.1 Data import without BPCells
A standard Cell Ranger run produces many output files. For scRNA-seq or CITE-seq, each input directory should contain barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz.
sample_filtered_feature_bc_matrix/
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
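Before reading, it can help to confirm that each input directory really holds the three files listed above. The following is a minimal sketch in base R; the helper name check_10x_dir is hypothetical and is not part of SingleCellMQC.

```r
# Hypothetical helper (not part of SingleCellMQC): return the names of the
# required Cell Ranger files that are missing from a directory.
check_10x_dir <- function(dir) {
  required <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
  required[!file.exists(file.path(dir, required))]
}

# Demonstration on a temporary directory that deliberately lacks matrix.mtx.gz
demo_dir <- file.path(tempdir(), "sample_filtered_feature_bc_matrix")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("barcodes.tsv.gz", "features.tsv.gz"))))

missing_files <- check_10x_dir(demo_dir)
print(missing_files)
```

With real data, the same helper could be applied to every input directory, for example with lapply(dir_GEX, check_10x_dir); an empty result means the directory is complete.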
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_GEX <- c(
"/data/SingleCellMQC/CellRanger/TP1/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP2/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3-rep/sample_filtered_feature_bc_matrix/"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc_GEX <- Read10XData(dir_GEX = dir_GEX, sample = sample_name)
pbmc_GEX
## An object of class Seurat
## 36738 features across 28498 samples within 2 assays
## Active assay: RNA (36601 features, 0 variable features)
## 1 layer present: counts
## 1 other assay present: ADT
# Show the size of the object
print(object.size(pbmc_GEX), units = "auto")
## 575.9 Mb
If your cell × gene expression matrix is stored in HDF5 format, use Read10XH5Data() instead. Each input path should point to a sample_filtered_feature_bc_matrix.h5 file.
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_GEX <- c(
"/data/SingleCellMQC/CellRanger/TP1/sample_filtered_feature_bc_matrix.h5",
"/data/SingleCellMQC/CellRanger/TP2/sample_filtered_feature_bc_matrix.h5",
"/data/SingleCellMQC/CellRanger/TP3/sample_filtered_feature_bc_matrix.h5",
"/data/SingleCellMQC/CellRanger/TP3-rep/sample_filtered_feature_bc_matrix.h5"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc_GEX <- Read10XH5Data(dir_GEX = dir_GEX, sample = sample_name)
2.1.2 Data import with BPCells
BPCells is a tool designed for efficient handling of large-scale single-cell datasets. This section shows how to import single-cell data with BPCells, which relies on optimized storage and memory management. Whether you are working with 10x Genomics directories, HDF5 files, or other formats, BPCells offers a scalable, more memory-efficient way to import data. To enable BPCells-based reading, set saveBPCells = TRUE and specify dir_BPCells, the path where the processed data is saved. Both Read10XData() and Read10XH5Data() support BPCells-based reading.
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_GEX <- c(
"/data/SingleCellMQC/CellRanger/TP1/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP2/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3-rep/sample_filtered_feature_bc_matrix/"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc <- Read10XData(dir_GEX = dir_GEX,
                    sample = sample_name,
                    saveBPCells = TRUE,
                    dir_BPCells = "./BPCellData")
pbmc
## An object of class Seurat
## 36738 features across 28498 samples within 2 assays
## Active assay: RNA (36601 features, 0 variable features)
## 1 layer present: counts
## 1 other assay present: ADT
# Show the size of the object
print(object.size(pbmc), units = "auto")
## 27.7 Mb
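The two object.size() results reported in this section (575.9 Mb without BPCells, 27.7 Mb with BPCells) can be compared directly. The short calculation below is only a worked illustration of those two numbers. Note that object.size() measures the R object held in memory, so it does not count data that BPCells keeps on disk under dir_BPCells.

```r
# Sizes reported by object.size() in the two examples of this section (in Mb)
size_in_memory <- 575.9  # Seurat object read without BPCells
size_bpcells   <- 27.7   # Seurat object read with saveBPCells = TRUE

# How many times smaller the BPCells-backed object is, and the share saved
fold_reduction <- size_in_memory / size_bpcells
percent_saved  <- 100 * (1 - size_bpcells / size_in_memory)

cat(sprintf("Fold reduction: %.1f\n", fold_reduction))
cat(sprintf("Memory saved: %.1f%%\n", percent_saved))
```

For this dataset the in-memory footprint drops roughly 20.8-fold, a saving of about 95.2%. The exact ratio will differ between datasets.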
2.2 Loading data of scV(D)J-seq
A standard Cell Ranger run produces many output files. For scV(D)J-seq, the required input file is filtered_contig_annotations.csv.
vdj_out/
├── filtered_contig_annotations.csv <-- This contains the contig annotations we want!
├── clonotypes.csv
└── ...
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_TCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_t/filtered_contig_annotations.csv"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc_tcr <- Read10XData(dir_TCR = dir_TCR, sample = sample_name)
str(pbmc_tcr, list.len = 4)
## List of 4
## $ TP1 :'data.frame': 5360 obs. of 32 variables:
## ..$ barcode : chr [1:5360] "AAACCTGAGAATGTTG-1" "AAACCTGAGAATGTTG-1" "AAACCTGCAAGCGATG-1" "AAACCTGCAAGCGATG-1" ...
## ..$ is_cell : logi [1:5360] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:5360] "AAACCTGAGAATGTTG-1_contig_1" "AAACCTGAGAATGTTG-1_contig_2" "AAACCTGCAAGCGATG-1_contig_1" "AAACCTGCAAGCGATG-1_contig_2" ...
## ..$ high_confidence : logi [1:5360] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP2 :'data.frame': 4271 obs. of 32 variables:
## ..$ barcode : chr [1:4271] "AAACCTGAGTGAAGTT-1" "AAACCTGGTCACACGC-1" "AAACCTGGTCACACGC-1" "AAACCTGTCTGCAAGT-1" ...
## ..$ is_cell : logi [1:4271] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:4271] "AAACCTGAGTGAAGTT-1_contig_1" "AAACCTGGTCACACGC-1_contig_1" "AAACCTGGTCACACGC-1_contig_2" "AAACCTGTCTGCAAGT-1_contig_1" ...
## ..$ high_confidence : logi [1:4271] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP3 :'data.frame': 695 obs. of 32 variables:
## ..$ barcode : chr [1:695] "AAACCTGTCGTTGCCT-1" "AAAGATGCATGGATGG-1" "AAAGATGCATGGATGG-1" "AACACGTAGAGTAAGG-1" ...
## ..$ is_cell : logi [1:695] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:695] "AAACCTGTCGTTGCCT-1_contig_1" "AAAGATGCATGGATGG-1_contig_1" "AAAGATGCATGGATGG-1_contig_2" "AACACGTAGAGTAAGG-1_contig_1" ...
## ..$ high_confidence : logi [1:695] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP3-rep:'data.frame': 2883 obs. of 32 variables:
## ..$ barcode : chr [1:2883] "AAACCTGAGTGGGATC-1" "AAACCTGCACTTAACG-1" "AAACCTGCACTTAACG-1" "AAACCTGGTGAAGGCT-1" ...
## ..$ is_cell : logi [1:2883] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:2883] "AAACCTGAGTGGGATC-1_contig_1" "AAACCTGCACTTAACG-1_contig_1" "AAACCTGCACTTAACG-1_contig_2" "AAACCTGGTGAAGGCT-1_contig_1" ...
## ..$ high_confidence : logi [1:2883] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## - attr(*, "class")= chr [1:2] "list" "VDJ"
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_BCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_b/filtered_contig_annotations.csv"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc_bcr <- Read10XData(dir_BCR = dir_BCR, sample = sample_name)
str(pbmc_bcr, list.len = 4)
## List of 4
## $ TP1 :'data.frame': 1110 obs. of 32 variables:
## ..$ barcode : chr [1:1110] "AAACCTGGTAGCCTAT-1" "AAACCTGGTAGCCTAT-1" "AAACCTGGTTCGAATC-1" "AAACCTGGTTCGAATC-1" ...
## ..$ is_cell : logi [1:1110] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:1110] "AAACCTGGTAGCCTAT-1_contig_1" "AAACCTGGTAGCCTAT-1_contig_2" "AAACCTGGTTCGAATC-1_contig_1" "AAACCTGGTTCGAATC-1_contig_2" ...
## ..$ high_confidence : logi [1:1110] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP2 :'data.frame': 2061 obs. of 32 variables:
## ..$ barcode : chr [1:2061] "AAACCTGCATCGGGTC-1" "AAACCTGCATCGGGTC-1" "AAACCTGCATCGGGTC-1" "AAACCTGGTAGAGTGC-1" ...
## ..$ is_cell : logi [1:2061] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:2061] "AAACCTGCATCGGGTC-1_contig_1" "AAACCTGCATCGGGTC-1_contig_2" "AAACCTGCATCGGGTC-1_contig_3" "AAACCTGGTAGAGTGC-1_contig_1" ...
## ..$ high_confidence : logi [1:2061] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP3 :'data.frame': 1538 obs. of 32 variables:
## ..$ barcode : chr [1:1538] "AAACGGGCAGCAGTTT-1" "AAACGGGCAGCAGTTT-1" "AAACGGGGTTCAGTAC-1" "AAACGGGGTTCAGTAC-1" ...
## ..$ is_cell : logi [1:1538] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:1538] "AAACGGGCAGCAGTTT-1_contig_1" "AAACGGGCAGCAGTTT-1_contig_2" "AAACGGGGTTCAGTAC-1_contig_1" "AAACGGGGTTCAGTAC-1_contig_2" ...
## ..$ high_confidence : logi [1:1538] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## $ TP3-rep:'data.frame': 1371 obs. of 32 variables:
## ..$ barcode : chr [1:1371] "AAACCTGGTTGTTTGG-1" "AAACCTGGTTGTTTGG-1" "AAACGGGCAAGTACCT-1" "AAACGGGCAAGTACCT-1" ...
## ..$ is_cell : logi [1:1371] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ contig_id : chr [1:1371] "AAACCTGGTTGTTTGG-1_contig_1" "AAACCTGGTTGTTTGG-1_contig_2" "AAACGGGCAAGTACCT-1_contig_1" "AAACGGGCAAGTACCT-1_contig_2" ...
## ..$ high_confidence : logi [1:1371] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. [list output truncated]
## - attr(*, "class")= chr [1:2] "list" "VDJ"
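As the str() output shows, the result is a list with one data frame of contigs per sample. A quick per-sample summary can be computed with base R. The sketch below uses a small made-up list (toy_vdj) that only mimics the column names seen above; it is not real data.

```r
# Toy stand-in for the list returned by the reader: one data frame of contigs
# per sample. Column names follow the str() output; all values are made up.
toy_vdj <- list(
  TP1 = data.frame(
    barcode = c("AAAC-1", "AAAC-1", "AAAG-1"),
    is_cell = c(TRUE, TRUE, TRUE),
    contig_id = c("AAAC-1_contig_1", "AAAC-1_contig_2", "AAAG-1_contig_1"),
    high_confidence = c(TRUE, TRUE, FALSE)
  ),
  TP2 = data.frame(
    barcode = c("CCCA-1", "CCCG-1"),
    is_cell = c(TRUE, TRUE),
    contig_id = c("CCCA-1_contig_1", "CCCG-1_contig_1"),
    high_confidence = c(TRUE, TRUE)
  )
)

# Per-sample summary: number of contigs, number of distinct cell barcodes,
# and number of high-confidence contigs.
vdj_summary <- data.frame(
  sample = names(toy_vdj),
  n_contigs = vapply(toy_vdj, nrow, integer(1)),
  n_cells = vapply(toy_vdj, function(x) length(unique(x$barcode)), integer(1)),
  n_high_conf = vapply(toy_vdj, function(x) sum(x$high_confidence), integer(1)),
  row.names = NULL
)
print(vdj_summary)
```

Replacing toy_vdj with pbmc_tcr or pbmc_bcr should give the same kind of table for the real samples, assuming those objects behave like ordinary lists of data frames.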
2.3 Loading data of multi-omics (scRNA-seq/CITE-seq + scV(D)J-seq)
A standard Cell Ranger run produces many output files. For scRNA-seq or CITE-seq, each input directory should contain barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz; if the cell × gene expression matrix is stored in HDF5 format, use Read10XH5Data() instead. For scV(D)J-seq, the file filtered_contig_annotations.csv is also required.
sample_filtered_feature_bc_matrix/
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
vdj_out/
├── filtered_contig_annotations.csv <-- This contains the contig annotations we want!
├── clonotypes.csv
└── …
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_GEX <- c(
"/data/SingleCellMQC/CellRanger/TP1/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP2/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3-rep/sample_filtered_feature_bc_matrix/"
)
dir_TCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_t/filtered_contig_annotations.csv"
)
dir_BCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_b/filtered_contig_annotations.csv"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc <- Read10XData(dir_GEX = dir_GEX, dir_TCR = dir_TCR, dir_BCR = dir_BCR, sample = sample_name)
pbmc
# Show the size of the object
print(object.size(pbmc), units = "auto")
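Because the three path vectors follow the same pattern for every sample, they can also be generated from sample_name instead of being typed out. This is a minimal sketch that assumes the directory layout used in the example paths of this section; base_dir is a name introduced here for illustration.

```r
# Build the three path vectors from the sample names.
# base_dir mirrors the example paths used in this section.
base_dir <- "/data/SingleCellMQC/CellRanger"
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")

# paste0() restores the trailing slash used in the hand-written GEX paths
dir_GEX <- paste0(file.path(base_dir, sample_name, "sample_filtered_feature_bc_matrix"), "/")
dir_TCR <- file.path(base_dir, sample_name, "vdj_t", "filtered_contig_annotations.csv")
dir_BCR <- file.path(base_dir, sample_name, "vdj_b", "filtered_contig_annotations.csv")

# Every vector must line up with sample_name, element by element
stopifnot(
  length(dir_GEX) == length(sample_name),
  length(dir_TCR) == length(sample_name),
  length(dir_BCR) == length(sample_name)
)
print(dir_TCR)
```

Generating the paths this way keeps the order of the inputs and of sample_name in sync, which matters because the samples are passed as parallel vectors.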
For a more memory-efficient workflow, the same data can be read through BPCells. Note that this is only possible with Seurat v5:
# A vector of outputs of the Cell Ranger pipeline from 10X
dir_GEX <- c(
"/data/SingleCellMQC/CellRanger/TP1/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP2/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3/sample_filtered_feature_bc_matrix/",
"/data/SingleCellMQC/CellRanger/TP3-rep/sample_filtered_feature_bc_matrix/"
)
dir_TCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_t/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_t/filtered_contig_annotations.csv"
)
dir_BCR <- c(
"/data/SingleCellMQC/CellRanger/TP1/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP2/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3/vdj_b/filtered_contig_annotations.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/vdj_b/filtered_contig_annotations.csv"
)
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
pbmc <- Read10XData(dir_GEX = dir_GEX, dir_TCR = dir_TCR, dir_BCR = dir_BCR,
                    sample = sample_name, saveBPCells = TRUE)
pbmc
# Show the size of the object
print(object.size(pbmc), units = "auto")
2.4 Loading and adding 10X metrics
A standard Cell Ranger run also produces a summary of key sequencing and analysis metrics for each single-cell experiment. To integrate it into your Seurat object, follow the steps below (make sure the data has already been turned into a Seurat object, for example using CalculateMetrics). The required input file is metrics_summary.csv.
seq_list <- c("/data/SingleCellMQC/CellRanger/TP1/metrics_summary.csv",
"/data/SingleCellMQC/CellRanger/TP2/metrics_summary.csv",
"/data/SingleCellMQC/CellRanger/TP3/metrics_summary.csv",
"/data/SingleCellMQC/CellRanger/TP3-rep/metrics_summary.csv")
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")
names(seq_list) <- sample_name
seq_metrics <- Read10XMetrics(file_path = seq_list)
pbmc <- Add10XMetrics(pbmc, seq_metrics)
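The metrics files are passed as a named vector, presumably so that each file can be matched to its sample. A quick consistency check between those names and the sample labels of the cells can catch mismatches early. The sketch below is an illustration only: cell_samples is a made-up stand-in for the per-cell sample labels, and how Add10XMetrics matches names internally is an assumption, not something documented here.

```r
sample_name <- c("TP1", "TP2", "TP3", "TP3-rep")

# Named vector of metrics files, equivalent to assigning names() afterwards
seq_list <- setNames(
  file.path("/data/SingleCellMQC/CellRanger", sample_name, "metrics_summary.csv"),
  sample_name
)

# Toy stand-in for the per-cell sample labels of the Seurat object
# (with a real object this would be something like pbmc$orig.ident)
cell_samples <- c("TP1", "TP1", "TP2", "TP3", "TP3-rep", "TP3-rep")

# Samples present in the object but lacking a metrics file, and vice versa
no_metrics <- setdiff(unique(cell_samples), names(seq_list))
no_cells   <- setdiff(names(seq_list), unique(cell_samples))
print(no_metrics)
print(no_cells)
```

Both results are empty when every sample has exactly one metrics file and every metrics file belongs to a sample in the object.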
2.5 Adding information of samples
You can add further sample-level information with AddSampleMeta(). The information is stored in pbmc@meta.data.
sample_information <- data.frame(
Sample = c("TP1", "TP2", "TP3", "TP3-rep"),
Batch_EXP = c("EXP-1", "EXP-2", "EXP-3", "EXP-4"),
Batch_TotalSeq_C_Antibodies = c("AB-1", "AB-1", "AB-1", "AB-2"),
Sex = c("Male", "Male", "Male", "Male"),
Time_point = c("TP-1", "TP-2", "TP-3", "TP-3")
)
pbmc <- AddSampleMeta(pbmc, merge_by_seurat = "orig.ident", SampleMeta = sample_information,
                      merge_by_meta = "Sample")
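Conceptually, this step joins a sample-level table onto per-cell metadata through a shared key (orig.ident on the cell side, Sample on the sample side). The base R sketch below illustrates that idea on a toy cell table; it is not the implementation of AddSampleMeta, and cell_meta is a made-up example.

```r
# Toy per-cell metadata; orig.ident plays the role of the Seurat column
cell_meta <- data.frame(
  orig.ident = c("TP1", "TP1", "TP2", "TP3-rep"),
  row.names = c("cell_1", "cell_2", "cell_3", "cell_4")
)

# Sample-level table (a subset of the columns used above)
sample_information <- data.frame(
  Sample = c("TP1", "TP2", "TP3", "TP3-rep"),
  Batch_EXP = c("EXP-1", "EXP-2", "EXP-3", "EXP-4"),
  Time_point = c("TP-1", "TP-2", "TP-3", "TP-3")
)

# match() returns, for every cell, the row of its sample in the sample table
idx <- match(cell_meta$orig.ident, sample_information$Sample)
stopifnot(!anyNA(idx))  # every cell must belong to a described sample

# Copy the sample-level columns onto the cells, keeping cell order and row names
extra_cols <- setdiff(names(sample_information), "Sample")
cell_meta[extra_cols] <- sample_information[idx, extra_cols]
print(cell_meta)
```

match() is used here rather than merge() because merge() may reorder rows and drops row names, and per-cell metadata must stay aligned with the cell barcodes.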