INTRODUCTION
The marked improvements in massively parallel sequencing, coupled with single-cell sample preparations and data deconvolution, have allowed single-cell RNA sequencing (scRNA-seq) to become a powerful approach to characterize the gene expression profile in single cells (1, 2). The international collaborative effort Human Cell Atlas (www.humancellatlas.org) takes advantage of this new technology platform to study the distinctive gene expression profiles at the RNA level across diverse cell and tissue types and to connect this information with classical cellular descriptions, such as location and morphology (3). In parallel, the development of many millions of publicly available antibodies toward human proteins has enabled single-cell analysis of the corresponding proteins in tissues and organs using immunohistochemistry (4) and fluorescent-based bioimaging (1, 5–7), allowing single-cell spatial mapping in the context of neighboring cells. The objective of the Human Protein Atlas (HPA) (www.proteinatlas.org) effort is to take advantage of these bioimaging approaches to map the expression of all human protein-coding genes across all major human cells, tissues, and organs. More than 10 million bioimages from 37 tissues showing the native protein location in intact tissue samples are publicly available in the HPA, each annotated by a certified pathologist (4). Together, these two platforms thus have the potential to create comprehensive body-wide maps of gene expression at the RNA and protein levels, with the ultimate goal of providing publicly available genome-wide knowledge of protein-coding genes in single cell types across tissues and organs in the human body.
Here, we describe an effort to combine the information from these two efforts to create a publicly available HPA Single Cell Type Atlas with genome-wide expression data from scRNA-seq experiments integrated with the spatial antibody-based bioimaging data. We use an approach outlined in Fig. 1A in which the single–cell type transcriptomics from the scRNA-seq data from a particular cluster of cells is pooled and the average normalized protein-coding transcripts per million (pTPM) as well as a normalized expression are calculated across protein-coding genes. In this manner, the problem with technical noise involving genes having zero counts (so-called dropouts) can be minimized and even genes with very low expression levels can be detected (8). This approach allows the expression profiles for each gene in each cluster to be visualized on a genome-wide and single–cell type level taking advantage of the added information by cumulative counts from hundreds or thousands of cells.
RESULTS
The tissues included in the study
To make this possible, a survey of scRNA-seq data from nondiseased human tissues and organs was performed. We used three main criteria to include data into the pipeline: (i) publicly available raw data from human tissues with good technical quality with at least 4000 cells analyzed and at least 20 million read counts by the sequencing for each tissue; (ii) high correlation between pseudo-bulk transcriptomics profile from the scRNA-seq data and bulk RNA-seq generated as part of the HPA Tissue Atlas; and (iii) high correlation between the cluster-specific expression and the expected expression pattern of an extensive selection of marker genes representing well-known tissue- and cell type–specific markers, including both markers from the original publications and additional markers used in pathology diagnostics (data S2). Here, we present a dataset containing 13 different human tissues covering most major organs in the human body including ileum (9), colon (10), rectum (9), kidney (11), liver (12), pancreas (13), heart (14), lung (15), prostate (16), testis (17), placenta (18), skin (19), and eye (20), as well as an analysis of human blood (21) (see data S1). No brain samples were included because only single-nuclei data were available, which showed lower correlation to bulk data in comparison to single-cell data (see fig. S1) (22, 23). All raw datasets were gathered into a common cluster analysis, resulting in a total of 192 single–cell type clusters across the datasets (see data S1 for all cluster annotations). In total, the data correspond to 1.47 billion read counts and the average read count per single–cell type cluster was approximately 7.7 million.
Correlation of expression profiles across the 192 cell types
The correlations between bulk RNA-seq and pseudo-bulk single-cell transcriptomics profiles were high for all tissues, ranging from 0.76 to 0.88 (fig. S2). All clusters were manually annotated on the basis of known tissue- and cell type–specific markers and their expected expression in the corresponding clusters (data S2 and fig. S3). Three genes are shown as examples in Fig. 1A with their cluster expression profiles in prostate. Kallikrein-related peptidase 3 (KLK3), also known as prostate-specific antigen (PSA), was shown to be expressed in two neighboring clusters in prostate, both annotated as glandular epithelial cells. Vimentin (VIM), a well-known marker for mesenchymal cells, was instead expressed in five different clusters, all annotated as mesenchymal-related cell types, including smooth muscle cells and immune cells. CD34, a well-known marker for endothelial cells, was localized to one of these clusters that has been annotated as endothelial cells. A UMAP (uniform manifold approximation and projection) of all clusters (Fig. 1B) revealed, as expected, that profiles of cell types responsible for unique tissue-specific functions have a close association, here shown as distinct tissue-specific groups, e.g., intestinal, hepatic, renal, placental, pulmonary, and neuronal cells. Some cell populations carry out similar cell type–specific functions, and as expected, these clusters from different tissues show high similarity in gene expression, e.g., immune cells (nine tissues), endothelial cells (nine tissues), and fibroblasts (five tissues). Altogether, the 192 single–cell type clusters could be summarized into 51 main cell types belonging to 12 different functional groups of cells (Fig. 1, B and C).
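The comparison between pseudo-bulk and bulk profiles can be illustrated with a small sketch. The text does not state which correlation coefficient was used, so the Spearman rank correlation below is an assumption, and the expression values are invented for illustration only.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for five genes in one tissue.
pseudo_bulk = [120.0, 0.0, 35.5, 900.0, 4.2]
bulk = [100.0, 1.0, 50.0, 700.0, 0.5]
rho = spearman(pseudo_bulk, bulk)
```

A rank-based coefficient is used here because it is insensitive to the different dynamic ranges of the two platforms; a Pearson correlation on log-transformed values would be an equally plausible choice.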
Creation of a Single Cell Type Atlas
On the basis of these new data, a Single Cell Type Atlas has been launched (www.proteinatlas.org/celltype) with data for all protein-coding genes. More than 250,000 interactive UMAP plots are presented in this open access resource showing the primary data for every analyzed cell for all protein-coding genes and all annotated cell types (defined as annotated clusters). Similarly, by pooling the data for every cell in a cluster, we have been able to generate more than 250,000 bar plots showing the calculated transcripts per million (TPM) for each gene and cell type across the entire protein-coding genome. The integration with tissue imaging (Fig. 2A) allows validation of the cell type–specific expression on the protein level by the in situ antibody-based profiling, as exemplified by the immunohistochemical staining of phosphodiesterase 6A (PDE6A) shown by the scRNA-seq analysis to be localized to rod photoreceptor cells in eye (clusters 0, 2, 3, and 4). Similarly, the protein insulin (INS) was shown to be localized to endocrine cells in pancreas (cluster 6), surfactant protein C (SFTPC) was localized to alveolar cells type 2 (AT2) in lung (clusters 1 and 6), and uromodulin (UMOD) was localized to distal tubular cells in kidney (clusters 11 and 12).
Classification of protein-coding genes based on expression profiles
A classification to map the gene expression profile of all protein-coding genes across the different cell types was performed as described earlier (4) to determine the number of genes elevated in particular single cell types and thus showing high or low cell type specificity (table S1). In total, across all cell types, 2005 genes are cell type–enriched, meaning that the expression of a particular gene defined as adjusted TPM (see Materials and Methods) is at least fourfold higher in one cell type as compared to all other cell types analyzed here (Fig. 2C). Similarly, 2893 genes are defined as group-enriched, thus enriched in a group of up to 10 cell types, and 9062 genes are defined as cell type–enhanced, where the expression is at least fourfold higher in one cell type as compared to the mean of all other cell types. A group of genes is also classified as having low cell type specificity (n = 4257), suggesting that they are present at roughly similar levels across all the cell types. Only 11% of the genes were detected in all analyzed cell types, supporting previous estimates of the number of “housekeeping” genes needed in all cells (24, 25). In Fig. 2B, the number of elevated genes (cell type–enriched, group-enriched, or cell type–enhanced) is visualized for all the 51 different cell types. In agreement with previous observations based on bulk transcriptomics (4), testis constitutes the tissue with the highest number of cell type elevated genes, but many elevated genes were also found in the eye (photoreceptor cells, bipolar cells, and horizontal cells) as well as in ciliated cells in lung (see data S3 for a complete list of the classification results). As mentioned above, the integration of multiple analysis platforms allows the validation of the single-cell data with antibody-based image profiling in tissue.
Immunohistochemistry shows not only the localization at a single-cell level but also the exact spatial pattern, cell-to-cell variation, and subcellular localization. In Fig. 2D, some examples of this validation are shown, including proteins specifically expressed in rare structures, such as, e.g., renal collecting ducts, retinal photoreceptor cells, early spermatids, intercalated discs in cardiomyocytes, and hepatic Kupffer cells.
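The fold-change rules used for the classification in this section can be sketched as a small function. The full criteria are given in table S1, which is not reproduced here, so the order in which the categories are tested, the exact form of the group-enriched rule (group mean at least fourfold above every cell type outside the group), and the omission of a detection cutoff are all assumptions; the function name and example values are hypothetical.

```python
def classify_gene(expr, fold=4.0, max_group=10):
    """Classify one gene from a dict mapping cell type -> expression (e.g., adjusted TPM)."""
    values = sorted(expr.values(), reverse=True)
    n = len(values)

    # Cell type-enriched: the top cell type is at least `fold` times every other cell type.
    if n > 1 and values[0] > 0 and values[0] >= fold * values[1]:
        return "cell type-enriched"

    # Group-enriched (assumed rule): the mean of the top k cell types, 2 <= k <= max_group,
    # is at least `fold` times the highest value outside that group.
    for k in range(2, min(max_group, n - 1) + 1):
        group_mean = sum(values[:k]) / k
        if group_mean > 0 and group_mean >= fold * values[k]:
            return "group-enriched"

    # Cell type-enhanced: the top cell type is at least `fold` times the mean of all others.
    if n > 1:
        mean_others = sum(values[1:]) / (n - 1)
        if values[0] > 0 and values[0] >= fold * mean_others:
            return "cell type-enhanced"

    return "low cell type specificity"


# Hypothetical expression profiles, one per category.
examples = {
    "enriched": classify_gene({"A": 100, "B": 5, "C": 3, "D": 1}),
    "group": classify_gene({"A": 100, "B": 80, "C": 5, "D": 2}),
    "enhanced": classify_gene({"A": 100, "B": 30, "C": 20, "D": 20, "E": 20, "F": 20}),
    "low": classify_gene({"A": 10, "B": 10, "C": 10}),
}
```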
The cell type expression landscape
The cell type–specific expression landscape was summarized in a network plot (Fig. 3A), illustrating the number of cell type–enriched and group-enriched genes and their relationships. The analysis highlights distinct expression clusters corresponding to cell types sharing similar functions, both within the same organs and between organs. As expected, many genes are simultaneously enriched in various immune cell lineages and the different stages of germ cells in testis. It is also evident that although many organs contain epithelial cells, these cells still have large numbers of genes enriched in only one particular tissue. Cell types of the same origin residing in different tissues, e.g., macrophages, defined as Kupffer cells in liver and Hofbauer cells in placenta, also have genes enriched in only one cell type. Figure 3B shows immunohistochemical examples of proteins that shared elevated expression in cell types present in two different organs but with similar function, such as motility or immune-related functions.
Comparison of bulk and single-cell transcriptomics
The new single–cell type classification of protein-coding genes allowed us to perform a genome-wide comparison between this classification and the previous classification based on bulk transcriptomics. In Fig. 4A, the relationship of classification of all 19,670 protein-coding genes is shown, with most genes overlapping in classification, including a vast majority of the 2005 cell type–enriched genes that are found to be either enriched, group-enriched, or enhanced based on the tissue classification. Relatively few genes have low cell type specificity (n = 4257), but almost all of these have low tissue specificity based on the bulk transcriptomics data. It is noteworthy that the number of elevated genes is higher for our single–cell type classification as compared to the tissue-level classification, supporting the view that many genes that are found across all tissues still have cell type–specific expression profiles. Note that relatively many genes are not detected in the cell type analysis, which is not unexpected because single-cell data are lacking for many important tissues, such as the brain.
An investigation of the overlap between the tissue “bulk” expression and the single–cell type expression is shown in Fig. 4B. The analysis showed that most genes with enriched expression in a certain tissue were enriched also based on the single-cell analysis. Tissue-specific expression can thus be attributed to individual cell types present in a particular tissue, exemplified by the many liver-enriched genes that were found to be hepatocyte-enriched. Likewise, all genes enriched in heart muscle by bulk transcriptomics analysis are enriched in cardiomyocytes in the single-cell analysis. The overlap of genes that are enriched at both single–cell type and tissue level is visualized in a network in fig. S4. This highlights the usefulness of scRNA-seq to disentangle the cell type variance across the different tissues in the human body.
Correlation to tissues and blood cells
A hypergeometric test was conducted to show the statistical significance of the overlap between genes that are enriched in the single cell types and genes enriched in tissues, flow-sorted blood cells, and cell lines (Fig. 4C). As noted above, it is reassuring that the cell type–enriched genes generally show a high degree of overlap with the enriched genes defined by bulk transcriptomics from their corresponding tissues. For example, the enriched genes from the liver bulk transcriptomics show overlap with genes elevated in hepatocytes and cholangiocytes based on scRNA-seq. Similarly, the alveolar cells, ciliated cells, and club cells from the single-cell analysis share enriched genes with lung tissue. The scRNA-seq data for the immune cell clusters were also compared with transcriptomics data of flow-sorted single blood cells (Fig. 4C) (26). The macrophages, not present in the HPA Blood Atlas data (26), show as expected overlap with the flow-sorted monocytes. It is also reassuring that the T cells identified by scRNA-seq show overlap with the flow-sorted T cells and natural killer (NK) cells published in the HPA Blood Atlas (26), and similarly, enriched genes in scRNA-seq B lymphocytes show overlap with the flow-sorted B cells.
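The hypergeometric overlap test can be illustrated with a minimal sketch. It computes the upper-tail probability of seeing at least the observed overlap between two gene sets drawn from a common gene universe. The function name, argument names, and numbers are hypothetical, and the choice of gene universe (for example, all 19,670 protein-coding genes) is an assumption rather than a detail given in the text.

```python
from math import comb


def hypergeom_overlap_pvalue(n_universe, n_set_a, n_set_b, n_overlap):
    """P(X >= n_overlap) when n_set_b genes are drawn without replacement
    from n_universe genes, of which n_set_a belong to the first gene set."""
    upper = min(n_set_a, n_set_b)
    total = comb(n_universe, n_set_b)
    # Sum the hypergeometric probability mass from the observed overlap upward.
    tail = sum(
        comb(n_set_a, k) * comb(n_universe - n_set_a, n_set_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: universe of 10 genes, sets of 4 and 3 genes sharing at least 2.
p_small = hypergeom_overlap_pvalue(10, 4, 3, 2)
```

Exact integer arithmetic is used so that the result stays accurate for genome-sized universes; in practice a statistics library would typically be used instead.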
Correlation with human cell lines
Last, we analyzed the overlap of the scRNA-seq analysis with transcriptomics data of in vitro cultivated human cell lines. In Fig. 4C, some examples are shown, with an additional 60 cell lines visualized in fig. S5. Overall, there is a high degree of overlap of cell line–enriched genes with the corresponding cell type of origin from the scRNA-seq analysis. For example, the cell line HepG2 shows, as expected, the highest degree of overlap with hepatocyte-enriched genes. Similarly, the B cell–derived U-698 cell line mostly overlaps with single-cell clusters annotated to be B cells but also shows some overlap with T cells. These examples suggest that these in vitro cultivated cell lines may serve as representative models for the corresponding in vivo cell types, while many other cell lines (fig. S5) show less overlap with the expected in vivo cell types, suggesting that caution should be taken when using these cell lines as models for the corresponding cell type.
MATERIALS AND METHODS
scRNA-seq dataset selection
The scRNA-seq datasets were retrieved from published studies based on healthy human tissues. We performed a meta-analysis of the scRNA-seq literature and searched single-cell databases, including the Single Cell Expression Atlas (https://ebi.ac.uk/gxa/sc/home), the Human Cell Atlas (https://humancellatlas.org), the Gene Expression Omnibus (https://ncbi.nlm.nih.gov/geo/), and the European Genome-phenome Archive (https://ebi.ac.uk/ega/). To avoid technical bias and to ensure that each single-cell dataset can best represent the corresponding tissue, we applied the following criteria for data selection: (i) we limited the single-cell transcriptomic datasets to those based on the Chromium single-cell gene expression platform from 10X Genomics (version 2 or 3); (ii) scRNA-seq was performed on single-cell suspensions from tissues without pre-enrichment of cell types; (iii) only studies with >4000 cells and 20 million read counts were included; and (iv) only datasets whose pseudo-bulk gene expression profile is highly correlated with the expression profile of the corresponding HPA tissue bulk sample were included. Note that, for the tissue eye, we do not have the corresponding bulk transcriptome. In addition, the dataset for lung had fewer reads (~7.3 million), while the datasets for pancreas (3719 cells) and rectum (3898 cells) had fewer than 4000 cells analyzed. However, these datasets were still included because the data provided important insights into cell type–enriched genes in these tissues. In total, we included datasets for 13 tissues plus peripheral blood mononuclear cells (PBMCs) (data S1).
Quantifying transcriptomic expression of clusters and pseudo-bulk
Quantified raw sequencing data were downloaded from the corresponding depository database based on the accession number provided by the study (data S1) in the available format (total cells, read, and feature counts, or count tables). Unfiltered data were used as input for downstream analysis with an in-house pipeline using Single-Cell Analysis in Python (Scanpy, version 1.4.4.post1) in Python (version 3.7.3). In the pipeline, the data were filtered using two criteria: A cell is considered as valid if it has at least 200 genes, and a gene is considered as valid if it is expressed in at least 10% of the cells. Subsequently, the cell counts were normalized to have a total count per cell of 10,000. The valid cells were then clustered using the Louvain clustering function within Scanpy, and gene rankings for each cluster were calculated with the “rank_genes_groups” function. The total read counts for all genes in each cluster were calculated by adding up the read counts of each gene in all cells belonging to the corresponding cluster. Last, the read counts were normalized to pTPM for each of the single-cell clusters. To calculate the expression profile for pseudo-bulk samples based on single-cell transcriptomics, we added the read counts for all genes from all cells of the sample and normalized them to pTPM in the same way as for the clusters.
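The pooling and pTPM normalization steps described above can be sketched in plain Python. This is a simplified illustration, not the in-house pipeline: it assumes that pTPM is obtained by scaling the pooled counts of protein-coding genes in a cluster so that they sum to one million, with no gene-length correction. The function name, data layout, and counts are hypothetical.

```python
from collections import defaultdict


def cluster_ptpm(counts, cluster_of, protein_coding):
    """Pool raw counts per cluster and scale protein-coding genes to sum to 1e6.

    counts: dict cell_id -> dict gene -> raw read count
    cluster_of: dict cell_id -> cluster label
    protein_coding: set of gene names included in the pTPM total
    """
    pooled = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        cluster = cluster_of[cell]
        for gene, n in gene_counts.items():
            if gene in protein_coding:  # non-coding genes are left out of the total
                pooled[cluster][gene] += n

    ptpm = {}
    for cluster, gene_counts in pooled.items():
        total = sum(gene_counts.values())
        ptpm[cluster] = {
            gene: (n / total * 1e6 if total else 0.0)
            for gene, n in gene_counts.items()
        }
    return ptpm


# Hypothetical counts for three cells in two clusters; MALAT1 is a non-coding gene.
counts = {
    "cell1": {"KLK3": 3, "VIM": 0, "MALAT1": 50},
    "cell2": {"KLK3": 5, "VIM": 2},
    "cell3": {"VIM": 9, "CD34": 1},
}
cluster_of = {"cell1": "c0", "cell2": "c0", "cell3": "c1"}
protein_coding = {"KLK3", "VIM", "CD34"}
result = cluster_ptpm(counts, cluster_of, protein_coding)
```

Under the same assumptions, a pseudo-bulk profile follows from assigning every cell of a sample to a single cluster label.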
Defining cell types
Each of the 192 different cell type clusters was manually annotated based on an extensive survey of >500 well-known tissue- and cell type–specific markers, including both markers from the original publications and additional markers used in pathology diagnostics. For most cell types, three marker genes were used, and for each cluster, one main cell type was chosen based on the overall expression pattern of all the marker genes. For four clusters, no main cell type could be selected, and these clusters were not used for classification. The most relevant markers (data S2) are presented in a heatmap on the Cell Type Atlas on each organ- and gene-specific page to clarify cluster annotation to visitors.
Bulk RNA-seq analysis and antibody-based protein profiling
Human tissue samples for analysis of bulk RNA-seq and gene expression at protein level in the HPA datasets were collected and handled in accordance with Swedish laws and regulations. Tissues were obtained from the Clinical Pathology Department, Uppsala University Hospital, Sweden and collected within the Uppsala Biobank organization. All samples were anonymized for personal identity by following the approval and advisory report from the Uppsala Ethical Review Board (reference nos. 2002-577, 2005-388, 2007-159, and 2011-473). The RNA extraction and RNA-seq procedure have been described previously (28). For immunohistochemistry, formalin-fixed, paraffin-embedded (FFPE) tissue blocks were collected from the pathology archives based on normal histology using a hematoxylin and eosin–stained tissue section evaluated by a pathologist. For generation of tissue microarrays (TMAs), representative 1-mm-diameter cores were sampled from FFPE blocks and assembled into TMAs. TMA blocks were cut into 4-μm-thick sections using waterfall microtomes (Microm HM 355S, Thermo Fisher Scientific, Fremont, CA, USA), placed on SuperFrost Plus slides (Thermo Fisher Scientific), dried at room temperature overnight, and baked at 50°C for 12 to 24 hours before immunohistochemical staining. Automated immunohistochemistry was performed using the Autostainer 480S Module (Thermo Fisher Scientific).
Immunohistochemical staining and high-resolution digitization of stained TMA slides were performed essentially as previously described (29). Primary antibodies were diluted and optimized based on IWGAV (International Working Group for Antibody Validation) criteria for antibody validation (30). Antibodies used for the immunohistochemical example images are listed in data S4. Protocol optimization was performed on a test TMA containing 20 different normal tissues. The stained slides were digitized with Scanscope AT2 (Leica Aperio, Vista, CA, USA). All images were manually evaluated by two independent observers, comprising a total of 576 images per antibody, covering 15,320 different human genes and publicly available on v20.proteinatlas.org.
Gene classification
Clusters were normalized using the trimmed mean of M values (TMM) method, using the tmm function from NOISeq (31) with a median column as reference and the parameters doWeighting = T and logratioTrim = 0.3. Clusters were aggregated per cell type by using the median expression of each gene. Genes were then classified as per standard HPA procedure as described in table S1.
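The aggregation of clusters into cell types by median expression can be sketched as follows. TMM normalization is not reproduced; the input is assumed to be already normalized, and the cluster labels, annotations, and values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def aggregate_by_cell_type(cluster_expr, cluster_to_cell_type):
    """Collapse clusters annotated as the same cell type using the per-gene median.

    cluster_expr: dict cluster -> dict gene -> normalized expression
    cluster_to_cell_type: dict cluster -> annotated cell type
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for cluster, gene_values in cluster_expr.items():
        cell_type = cluster_to_cell_type[cluster]
        for gene, value in gene_values.items():
            grouped[cell_type][gene].append(value)
    return {
        cell_type: {gene: median(vals) for gene, vals in genes.items()}
        for cell_type, genes in grouped.items()
    }


# Hypothetical prostate clusters: two glandular clusters and one endothelial cluster.
cluster_expr = {
    "prostate_c0": {"KLK3": 900.0, "VIM": 20.0},
    "prostate_c1": {"KLK3": 700.0, "VIM": 40.0},
    "prostate_c5": {"KLK3": 1.0, "VIM": 300.0},
}
cluster_to_cell_type = {
    "prostate_c0": "glandular epithelial cells",
    "prostate_c1": "glandular epithelial cells",
    "prostate_c5": "endothelial cells",
}
cell_type_expr = aggregate_by_cell_type(cluster_expr, cluster_to_cell_type)
```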
Generation of network plot
The network plot was generated in Cytoscape 3.6.1 (32), and nodes were filtered to remove complexity such that nodes were displayed if they (i) contained cell type–enriched genes, (ii) contained at least five genes, or (iii) ranked top two largest nodes for any connected cell type and contained at least three genes.
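The node filter can be expressed as a short predicate over the nodes. This sketch is one reading of the three rules, not the procedure used in Cytoscape: a node is modeled as the set of cell types sharing a group of enriched genes, a node with a single cell type is taken to hold cell type–enriched genes, and all nodes containing a cell type count as connected to it. The function name and gene counts are hypothetical.

```python
def nodes_to_display(nodes):
    """Return the nodes that pass the display filter.

    nodes: dict mapping frozenset of cell types -> number of genes in that node
    """
    # Rule (iii) needs the two largest nodes connected to each cell type.
    top_two = set()
    cell_types = {cell_type for group in nodes for cell_type in group}
    for cell_type in cell_types:
        connected = sorted(
            (group for group in nodes if cell_type in group),
            key=lambda group: nodes[group],
            reverse=True,
        )
        top_two.update(connected[:2])

    keep = set()
    for group, n_genes in nodes.items():
        is_cell_type_enriched = len(group) == 1        # rule (i)
        is_large = n_genes >= 5                        # rule (ii)
        is_top_ranked = group in top_two and n_genes >= 3  # rule (iii)
        if is_cell_type_enriched or is_large or is_top_ranked:
            keep.add(group)
    return keep


# Hypothetical nodes with gene counts.
nodes = {
    frozenset({"T cells"}): 12,
    frozenset({"B cells"}): 9,
    frozenset({"NK cells"}): 7,
    frozenset({"T cells", "NK cells"}): 8,
    frozenset({"T cells", "B cells"}): 4,
    frozenset({"T cells", "NK cells", "B cells"}): 3,
    frozenset({"B cells", "NK cells"}): 2,
}
kept = nodes_to_display(nodes)
```

In this example, the T cell and B cell node passes only through rule (iii), while the three-cell-type node and the B cell and NK cell node are filtered out.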
A single–cell type transcriptomics map of human tissues - Science Advances