Exploring the Cellular Landscape of the Human Peripheral Blood Mononuclear Cells Using Single-Cell RNA-sequencing
- Sep 2
Updated: Sep 26
Single-cell RNA-sequencing, also referred to as scRNA-seq, measures the messenger RNA (mRNA) molecules present in individual cells, making it possible to quantify gene expression across the transcriptome one cell at a time. This technique in genomics research has helped in studying cellular responses, cellular heterogeneity, and cell types.
This article explores the use of scRNA-seq to identify the cell types present in human peripheral blood mononuclear cells (PBMCs). The presence of these cell types is confirmed by checking the expression of known marker genes, several of which encode proteins found on the surface of these cells.
The dataset used for this exploration was 3k PBMCs (accessed from 10x Genomics), and the data analysis was conducted using Scanpy in Python.
The analysis consisted of four basic steps: filtering, selecting highly variable genes, dimensionality reduction, and cell type identification.
The first step in the data analysis - filtering - involved removing low-quality cells, which typically show low gene counts or a high proportion of mitochondrial reads. Cells with low gene counts are usually removed because they can skew the analysis of gene expression variability, and cells with high mitochondrial content are usually damaged, dying, or under stress. Both kinds of cells can distort the results, and thus need to be removed from the data set.
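The filtering step can be sketched on a toy count matrix. This is only an illustration: the counts, the thresholds (`min_genes`, `max_mito_fraction`), and the helper name `qc_mask` are assumptions made up for the example, not the values used in the analysis. Scanpy provides this kind of filtering through functions such as `sc.pp.filter_cells` and `sc.pp.calculate_qc_metrics`.

```python
import numpy as np

# Toy count matrix: 6 cells (rows) x 5 genes (columns). All values are made up.
gene_names = np.array(["MT-CO1", "MT-ND1", "LYZ", "CD4", "MS4A1"])
counts = np.array([
    [1, 0, 5, 3, 2],   # looks healthy
    [9, 8, 1, 0, 0],   # mostly mitochondrial reads
    [0, 0, 1, 0, 0],   # very few genes detected
    [2, 1, 4, 6, 3],   # looks healthy
    [0, 1, 7, 2, 5],   # looks healthy
    [7, 9, 0, 1, 0],   # mostly mitochondrial reads
])

def qc_mask(counts, gene_names, min_genes=3, max_mito_fraction=0.5):
    """Return a boolean mask of cells that pass both QC checks (hypothetical helper)."""
    # Number of genes with at least one count in each cell.
    genes_per_cell = (counts > 0).sum(axis=1)
    # Human mitochondrial genes conventionally carry the "MT-" prefix.
    is_mito = np.char.startswith(gene_names, "MT-")
    total = counts.sum(axis=1)
    mito_fraction = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    return (genes_per_cell >= min_genes) & (mito_fraction <= max_mito_fraction)

keep = qc_mask(counts, gene_names)
filtered = counts[keep]
print(keep)
print(filtered.shape)
```

In this toy example, three of the six cells are dropped: two because most of their counts are mitochondrial, and one because only a single gene was detected.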
After the low-quality cells are removed, the next step involves selecting highly variable genes (HVGs). These are genes which show significant variation in expression levels across different cells. Identifying such genes is crucial as it reduces both the computational burden and the noise in the data: genes whose expression barely changes between cells carry little information for telling cell types apart, so restricting the analysis to HVGs keeps the informative signal while speeding up later steps.
To conduct this step, Scanpy was used. Scanpy detects HVGs using each gene’s dispersion (the ratio of the variance of its expression to its mean expression), normalised against genes with a similar mean expression. To see which genes were chosen as HVGs from this data set, the results of the calculation were plotted in Python.
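The idea behind dispersion-based selection can be sketched with NumPy. This is a simplified illustration, not Scanpy's implementation: it ranks genes by raw dispersion (variance divided by mean) and skips the normalisation against genes of similar mean expression. The matrix and the helper name `top_dispersion_genes` are made up for the example.

```python
import numpy as np

# Toy expression matrix: 4 cells (rows) x 4 genes (columns). Values are made up.
expr = np.array([
    [1.0,  0.0, 5.0, 2.0],
    [1.0, 10.0, 5.0, 2.0],
    [1.0,  0.0, 6.0, 2.0],
    [1.0, 10.0, 4.0, 2.0],
])

def top_dispersion_genes(expr, n_top=2):
    """Rank genes by dispersion = variance / mean (hypothetical helper)."""
    mean = expr.mean(axis=0)
    var = expr.var(axis=0, ddof=1)
    # Guard against division by zero for genes that are never expressed.
    safe_mean = np.where(mean > 0, mean, 1.0)
    dispersion = np.where(mean > 0, var / safe_mean, 0.0)
    # Indices of genes sorted from most to least dispersed.
    order = np.argsort(dispersion)[::-1]
    return order[:n_top], dispersion

top, dispersion = top_dispersion_genes(expr)
print(top)
print(dispersion)
```

Genes 0 and 3 are constant across cells, so their dispersion is zero and they would not be selected. Gene 1, which switches between 0 and 10, is ranked first.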
With the HVGs identified, the next step was dimensionality reduction, which represents the 3k PBMC dataset in a coordinate system with as few dimensions as possible so that it can be drawn on a two- or three-dimensional scatter plot. Three methods can be used for this step: principal component analysis (PCA), t-distributed Stochastic Neighbour Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP).
As t-SNE is generally slower and less computationally efficient, it was not used for dimensionality reduction. Instead, PCA and UMAP were used, which kept the analysis fast and reduced the computational burden.
Of the three techniques, the first - PCA - is a linear unsupervised technique that converts high-dimensional data to low-dimensional data for visualisation. The first step is to find the direction along which the data varies the most, which can be pictured as a line of best fit through the data. This direction is called the first principal component (PC1), and corresponds to the x-axis of the lower-dimensional graph. The second principal component (PC2) - corresponding to the y-axis of the lower-dimensional graph - is the direction perpendicular to PC1 that captures the most remaining variation. For the purpose of this project, PCA was used to reduce the dimensions of the data before visualising the clusters of cell types present in the 3k PBMC dataset.
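A minimal sketch of PCA, computed with a singular value decomposition in NumPy, may make the roles of PC1 and PC2 more concrete. The random data and the function name `pca` are assumptions for illustration. In the actual analysis this step would be handled by Scanpy (for example via `sc.tl.pca`).

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (samples x features) onto its top principal components."""
    # Centre each feature so the components describe variation around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    return scores, components, explained_variance[:n_components]

# Synthetic data: 100 "cells" x 5 "genes" with very different spreads per gene.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

scores, components, explained = pca(X)
print(scores.shape)
print(explained)
```

The two columns of `scores` are the PC1 and PC2 coordinates that would be drawn on the x- and y-axes of a PCA scatter plot. The two directions in `components` are perpendicular to each other, and PC1 always explains at least as much variance as PC2.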
The second technique - t-SNE - is a nonlinear unsupervised dimensionality reduction technique that allows for visualisation of higher-dimensional data. The algorithm plots higher-dimensional data in lower dimensions by preserving local relationships between data points: points that are close together in the original data stay close together in the plot, thereby preserving the local structure of the dataset. However, as interpreting the results of t-SNE is difficult, this technique was not used in the project to analyse the 3k PBMC dataset.
The final technique - UMAP - is also a nonlinear unsupervised dimensionality reduction technique, but it is typically faster than t-SNE and tends to preserve more of the overall structure of the dataset. UMAP first builds a graph that links each cell to its nearest neighbours in the high-dimensional data, and then arranges the points in a 2D/3D plane so that linked cells stay close together. Cells with similar expression profiles (such as cells of the same type) therefore end up in the same cluster. Because UMAP is unsupervised, a characteristic such as cell type is not used to build the plot; it is used afterwards to colour the points. For the purpose of this project, the characteristic chosen was cell type.
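UMAP itself is too involved for a short example, but the neighbourhood search it starts from can be sketched. The code below is a brute-force k-nearest-neighbour search in NumPy on made-up 2D points. It is not UMAP, only the first ingredient, and the helper name `knn_indices` is hypothetical. In Scanpy, the neighbourhood graph is typically computed with `sc.pp.neighbors` before calling `sc.tl.umap`.

```python
import numpy as np

def knn_indices(points, k):
    """Return the indices of the k nearest neighbours of every point (brute force)."""
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point should not be counted as its own neighbour.
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

# Two well-separated toy groups of three points each (values are made up).
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

neighbours = knn_indices(points, k=2)
print(neighbours)
```

Every point's two nearest neighbours lie within its own group, which is the kind of local structure a layout method such as UMAP then tries to keep intact in two dimensions.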
To analyse the 3k PBMC dataset, the primary dimensionality reduction (and visualisation) technique used was UMAP. To classify the different cell types, the annotated data was plotted in UMAP to identify which cluster corresponded to which cell type. The following cell types were obtained:
T cells
NK cells and a monocyte subset
B cells
Monocytes and a monocyte subset
NK cells
CD8 T cells
To confirm the presence of the specific cell type(s) in a cluster, the expression of known marker genes (several of which encode cell-surface proteins) was plotted in UMAP to identify which marker was prominently expressed in each of the clusters. The following marker genes were used:
CD4 for T cells
CD14 and LYZ for monocytes
MS4A1 for B cells
CD8A for CD8 T cells
GNLY and NKG7 for NK cells
FCGR3A for NK cells and monocyte subsets
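The marker-based reasoning above can be sketched as a simple lookup. The marker lists come from the article (FCGR3A is left out because it is shared between two cell types), but the per-cluster mean expression values, the cluster names, and the scoring rule (average expression of a cell type's markers) are assumptions made up for illustration.

```python
# Marker genes per cell type, taken from the list above.
markers = {
    "CD4 T cells": ["CD4"],
    "Monocytes": ["CD14", "LYZ"],
    "B cells": ["MS4A1"],
    "CD8 T cells": ["CD8A"],
    "NK cells": ["GNLY", "NKG7"],
}

# Hypothetical mean expression of each marker gene in three clusters.
cluster_means = {
    "cluster_0": {"CD4": 2.1, "CD14": 0.1, "LYZ": 0.3, "MS4A1": 0.0,
                  "CD8A": 0.1, "GNLY": 0.1, "NKG7": 0.2},
    "cluster_1": {"CD4": 0.2, "CD14": 2.5, "LYZ": 4.0, "MS4A1": 0.1,
                  "CD8A": 0.0, "GNLY": 0.1, "NKG7": 0.1},
    "cluster_2": {"CD4": 0.1, "CD14": 0.0, "LYZ": 0.2, "MS4A1": 2.8,
                  "CD8A": 0.0, "GNLY": 0.0, "NKG7": 0.1},
}

def annotate(cluster_means, markers):
    """Label each cluster with the cell type whose markers are most expressed."""
    labels = {}
    for cluster, means in cluster_means.items():
        # Score a cell type by the average expression of its marker genes.
        scores = {
            cell_type: sum(means[g] for g in genes) / len(genes)
            for cell_type, genes in markers.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels

labels = annotate(cluster_means, markers)
print(labels)
```

In practice the assignment is usually made by eye from the UMAP plots, as described in this article, and a rule this simple would struggle with markers that are shared between cell types.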
Finally, the analysis involved differential gene expression analysis to identify marker genes for each cluster of cells - that is, the genes which are most differentially expressed in that cluster compared to the others. This process involves picking each cluster in turn and comparing it with the remaining clusters. The results of the differential expression analysis were then visualised, giving the following graph:

The heat map visualisation above serves an important purpose: it helps identify the cell type of each cluster in the UMAP from the expression of the corresponding gene. The colour legend next to the plot shows how to read it - if a cluster’s shade is on the yellow end of the spectrum, the given gene is highly expressed in that cluster, and if the shade is on the purple end of the spectrum, expression of that gene in the cluster is very low.
To understand this further, consider the following example: the bottom left cluster can be classified as monocytes because LYZ is highly expressed in that cluster. Similarly, the remaining clusters in the UMAP can be classified by picking a specific marker gene and observing its expression level in each cluster.
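The one-versus-rest comparison used in the differential expression step can be sketched in NumPy. This is a simplified illustration under stated assumptions: the expression values and cluster labels are made up, the helper name `rank_genes_one_vs_rest` is hypothetical, and genes are ranked by a Welch-style t statistic without any multiple-testing correction. Scanpy offers this kind of analysis through `sc.tl.rank_genes_groups`, which supports several statistical tests.

```python
import numpy as np

gene_names = np.array(["LYZ", "MS4A1", "NKG7"])
# Toy expression matrix: 6 cells (rows) x 3 genes (columns). Values are made up.
expr = np.array([
    [5.0, 0.1, 0.2],
    [6.0, 0.2, 0.1],
    [0.2, 4.0, 0.1],
    [0.1, 5.0, 0.3],
    [0.3, 0.2, 6.0],
    [0.1, 0.1, 7.0],
])
labels = np.array(["a", "a", "b", "b", "c", "c"])

def rank_genes_one_vs_rest(expr, labels, gene_names, n_top=1):
    """For each cluster, rank genes by a Welch t statistic against all other cells."""
    results = {}
    for cluster in np.unique(labels):
        inside = expr[labels == cluster]
        rest = expr[labels != cluster]
        diff = inside.mean(axis=0) - rest.mean(axis=0)
        # Standard error of the difference in means (unequal variances).
        se = np.sqrt(inside.var(axis=0, ddof=1) / len(inside)
                     + rest.var(axis=0, ddof=1) / len(rest))
        t_stat = diff / np.maximum(se, 1e-12)
        # Highest t statistic first: genes most up-regulated in this cluster.
        order = np.argsort(t_stat)[::-1][:n_top]
        results[str(cluster)] = [str(gene_names[i]) for i in order]
    return results

top_genes = rank_genes_one_vs_rest(expr, labels, gene_names)
print(top_genes)
```

Each toy cluster recovers the gene that was made high in it, mirroring how a cluster with high LYZ expression would be read as monocytes.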
In conclusion, scRNA-seq is a technique used for analysing a wide range of cell types, from blood cells to neural cells. It is used in fields such as developmental biology, oncology, and clinical research, and it is increasingly relevant to precision medicine. As the potential of scRNA-seq is more fully harnessed, its application could help improve healthcare and quality of life.
References:
Awan, Abid Ali. “Introduction to t-SNE.” datacamp, 9 December 2024, https://www.datacamp.com/tutorial/introduction-t-sne#.
Luecken, Malte D., and Fabian J. Theis. "Current best practices in single‐cell RNA‐seq analysis: a tutorial." Molecular systems biology 15.6 (2019): e8746.
Sanbomics. Sanbomics. YouTube, www.youtube.com/@Sanbomics
Wolf, Fabian A., Philipp Angerer, and Fabian J. Theis. "SCANPY: Large-scale single-cell gene expression data analysis." Genome Biology, vol. 19, no. 1, 2018, p. 15.
“Introduction to Single‑Cell RNA‑seq.” Analysis of single cell RNA‑seq data, SingleCellCourse, www.singlecellcourse.org/introduction-to-single-cell-rna-seq.html
“Single cell RNA‑seq analysis using Seurat.” Analysis of single cell RNA‑seq data, SingleCellCourse, www.singlecellcourse.org/single-cell-rna-seq-analysis-using-seurat.html
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction — umap 0.5.8 documentation, 8 June 2018, https://umap-learn.readthedocs.io/en/latest/.
“What Is Principal Component Analysis (PCA)?” IBM, 8 December 2023, https://www.ibm.com/think/topics/principal-component-analysis.