Documentation


Cell Type Proportion Analysis


Introduction

This function allows users to keep samples in the sub-datasets (tissues from GTEx or cancer types from TCGA) they selected, and filter the cell types of interest to visualize and compare their proportions in each bulk sample. Up to five sub-datasets are supported.

After clicking the "plot" button, interactive boxplots grouped by cell type and by sub-dataset become available. A one-way ANOVA is computed dynamically for each sub-group in each boxplot, giving users a quantitative comparison of the cell proportions.

Parameters

  • TCGA Tumor/TCGA Normal/GTEx/Used Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also accepted. The cell type proportion analysis is based on this dataset list.
  • Cell Types: Select cell types of interest for analysis.
  • Normalization: For each sample, normalize the proportions of selected cell types to make the sum equal to 1.
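
The Normalization option can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical, and returning an all-zero sample unchanged is an assumption rather than documented GEPIA behavior.

```python
def normalize_selected(proportions, selected):
    """Rescale the selected cell types of one sample so that they sum to 1."""
    subset = {cell: proportions[cell] for cell in selected}
    total = sum(subset.values())
    if total == 0:
        # Assumption: leave an all-zero sample untouched to avoid dividing by zero.
        return subset
    return {cell: value / total for cell, value in subset.items()}


# Hypothetical deconvolution output for a single bulk sample.
sample = {"B cells": 0.10, "T cells CD8": 0.30, "Macrophages": 0.20, "Other": 0.40}
normalized = normalize_selected(sample, ["B cells", "T cells CD8", "Macrophages"])
print(normalized)
```

After normalization, the three selected cell types sum to 1 for this sample, so their relative shares can be compared across samples regardless of the unselected content.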

Results

Click the “Plot” button: GEPIA will generate interactive box plots based on input parameters with two different grouping methods.

Grouped by tissue

For samples in each sub-dataset (e.g. LIHC Tumor), this plot allows users to quantitatively compare the proportion distributions of different cell types.

Grouped by cell type

For each cell type, this plot allows users to quantitatively compare its proportion distribution across samples in different sub-datasets.



Cell Type Correlation Analysis


Introduction

This function allows users to keep samples in the sub-datasets (tissues from GTEx or cancer types from TCGA) they selected, and select two cell types of interest to perform the correlation analysis across bulk samples. Up to five sub-datasets are supported.

After clicking the "plot" button, the interactive scatter plot colored by sub-dataset will be available. The Pearson Correlation Coefficient will be dynamically calculated, allowing users to have a quantitative metric for the proportion correlation between the two cell types selected.
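
As a rough illustration of the metric, the Pearson Correlation Coefficient between two cell type proportions can be computed as in the sketch below. The function name and the proportion values are hypothetical, and returning NaN for a constant input is an assumption about edge-case handling.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: the coefficient is undefined when one variable is constant.
        return math.nan
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical proportions of two cell types across four bulk samples.
cd8_t_cells = [0.10, 0.20, 0.30, 0.40]
nk_cells = [0.05, 0.09, 0.16, 0.20]
r = pearson_r(cd8_t_cells, nk_cells)
print(round(r, 3))
```

A value close to 1 indicates that the two proportions rise together across samples, a value close to -1 that one falls as the other rises.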

Parameters

  • TCGA Tumor/TCGA Normal/GTEx/Used Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also accepted. The correlation analysis is based on this dataset list.
  • Select cell types for comparison: Select two cell types of interest for the correlation analysis.

Results

Click the “Plot” button: GEPIA will generate an interactive scatter plot based on input parameters.

Correlation scatter plot

For samples in sub-datasets selected, this plot allows users to quantitatively analyze the correlation between the proportions of two cell types. The data points (samples) are colored by sub-dataset.



Cell Type-level Expression Analysis


Introduction

This function allows users to keep samples in the sub-datasets (tissues from GTEx or cancer types from TCGA) they selected, and to perform differential expression analysis at the cell type level. Up to five sub-datasets are supported.

After clicking the "plot" button, the interactive boxplots grouped by cell type and sub-dataset will be available. The one-way ANOVA function has been dynamically integrated into each sub-group in each boxplot, allowing users to simultaneously visualize and compare the gene expression in given cell types.

Parameters

  • TCGA Tumor/TCGA Normal/GTEx/Used Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also accepted. The differential expression analysis is based on this dataset list.
  • Cell Types: Select cell types of interest for analysis.
  • Input Gene(s): Input the gene symbol for analysis. You can look up the gene in GEPIA to verify the gene symbol, or to convert a gene ID or gene alias into the standard gene symbol.

Results

Click the “Plot” button: GEPIA will generate interactive box plots based on input parameters with two different grouping methods.

Grouped by tissue

For samples in each sub-dataset (e.g. LIHC Tumor), this plot allows users to quantitatively compare the gene expression distributions contributed by different cell types.

Grouped by cell type

For each cell type, this plot allows users to quantitatively compare the gene expression distribution across samples in different sub-datasets.



Cell Type-level Survival Analysis


Introduction

Users can first choose the range of sub-datasets, and then separate the samples into two groups according to the total proportion of the cell types selected. We also apply the log-rank test to check whether the two K-M curves are statistically different. The p-value of the log-rank test will be displayed in the title of the survival plot. Up to five sub-datasets are supported.

Parameters

  • TCGA Tumor/Used Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also accepted. The survival analysis is based on this dataset list.
  • Cell types: Select cell type(s) of interest for analysis. If more than one cell type is selected, the sum of their proportions is used.
  • Cutoff-high: Samples whose summed proportion of the selected cell type(s) is above this percentile threshold form the high-proportion cohort (red lines in the survival plot).
  • Cutoff-low: Samples whose summed proportion of the selected cell type(s) is below this percentile threshold form the low-proportion cohort (blue lines in the survival plot).
  • Type of survival periods: Select "Overall Survival" or "Relapse-free Survival" data of TCGA tumor samples for analysis.
  • Show 95% confidence interval: Select whether to show the 95% confidence interval of each K-M curve.
  • Show right censoring mark: Select whether to show the + mark on the K-M curve for each right-censored sample.

Results

Click the “Plot” button: GEPIA will generate interactive Kaplan-Meier survival plots based on input parameters.

Kaplan-Meier plots

This plot allows users to quantitatively visualize and compare the survival curves of two sample groups. For example, here we first filter samples from the LIHC sub-dataset, rank the samples according to the B cell proportion, and assign the top 30% and bottom 30% of samples as the "high-group" and "low-group", respectively (by setting "Cutoff-high %" to 70 and "Cutoff-low %" to 30). At any time point, users can hover over the plot to see the exact survival percentage, its 95% confidence interval, or the right-censored observations. A smaller p-value indicates stronger evidence that the two K-M curves differ.
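
The cohort assignment in the example above can be sketched in Python as follows. This is only an illustration: the function names, the sample identifiers and the linear-interpolation percentile are assumptions, and the exact percentile method used by GEPIA is not stated here.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def split_cohorts(scores, cutoff_high=50, cutoff_low=50):
    """Split samples into high and low groups by percentile cutoffs.

    scores maps a sample ID to the summed proportion of the selected cell types.
    """
    values = list(scores.values())
    high_threshold = percentile(values, cutoff_high)
    low_threshold = percentile(values, cutoff_low)
    high = [sample for sample, value in scores.items() if value > high_threshold]
    low = [sample for sample, value in scores.items() if value < low_threshold]
    return high, low


# Hypothetical B cell proportions for ten samples (S1 lowest, S10 highest).
scores = {f"S{i}": i / 100 for i in range(1, 11)}
high, low = split_cohorts(scores, cutoff_high=70, cutoff_low=30)
print(high)
print(low)
```

With these ten hypothetical samples, the top three and bottom three form the two groups. The survival times of the two groups would then be passed to a Kaplan-Meier estimator and a log-rank test from a survival analysis library.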



TCGA/GTEx Data Information

We downloaded the TCGA (version 2016-09-01) and the GTEx (version 2016-04-19) expression data from the UCSC XENA Data Hubs.

Transcripts Per Million (TPM) matrices of the two datasets were re-calculated on the approved protein-coding genes.
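
One plausible reading of this re-calculation is that each sample is restricted to the retained genes and rescaled so that its values again sum to one million. The sketch below illustrates that reading only; the function name, gene list and values are hypothetical, and the actual pipeline may differ.

```python
def recompute_tpm(sample_tpm, kept_genes):
    """Restrict one sample to kept_genes and rescale the values to sum to 1e6."""
    kept = {gene: value for gene, value in sample_tpm.items() if gene in kept_genes}
    total = sum(kept.values())
    if total == 0:
        # Assumption: nothing to rescale when no retained gene is expressed.
        return kept
    return {gene: value * 1_000_000 / total for gene, value in kept.items()}


# Hypothetical TPM values for one sample; MALAT1 is a non-coding gene and is dropped.
sample_tpm = {"TP53": 200_000.0, "GAPDH": 300_000.0, "MALAT1": 500_000.0}
protein_coding = {"TP53", "GAPDH"}
rescaled = recompute_tpm(sample_tpm, protein_coding)
print(rescaled)
```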

Here is the information table on the abbreviations of TCGA cancer types, the sample numbers of each dataset and the tumor-normal pairs of TCGA/GTEx sub-datasets as proposed in GEPIA1/2.

TCGA abbreviation | Cancer type | TCGA tumor samples | TCGA normal samples | Paired GTEx tissue | GTEx samples
ACC | Adrenocortical cancer | 77 | - | Adrenal Gland | 128
BLCA | Bladder Urothelial Carcinoma | 407 | 19 | Bladder | 9
BRCA | Breast invasive carcinoma | 1099 | 113 | Breast | 179
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 306 | 3 | Cervix Uteri | 10
CHOL | Cholangiocarcinoma | 36 | 9 | - | -
COAD | Colon adenocarcinoma | 290 | 41 | Colon | 308
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 47 | - | Blood | 444
ESCA | Esophageal carcinoma | 182 | 13 | Esophagus | 653
GBM | Glioblastoma multiforme | 166 | 5 | Brain | 1152
HNSC | Head and Neck squamous cell carcinoma | 520 | 44 | - | -
KICH | Kidney Chromophobe | 66 | 25 | Kidney | 28
KIRC | Kidney renal clear cell carcinoma | 531 | 72 | Kidney | 28
KIRP | Kidney renal papillary cell carcinoma | 289 | 32 | Kidney | 28
LAML | Acute Myeloid Leukemia | 173 | - | Bone Marrow | 70
LGG | Brain Lower Grade Glioma | 523 | - | Brain | 1152
LIHC | Liver hepatocellular carcinoma | 371 | 50 | Liver | 110
LUAD | Lung adenocarcinoma | 515 | 59 | Lung | 288
LUSC | Lung squamous cell carcinoma | 498 | 50 | Lung | 288
MESO | Mesothelioma | 87 | - | - | -
OV | Ovarian serous cystadenocarcinoma | 427 | - | Ovary | 88
PAAD | Pancreatic adenocarcinoma | 179 | 4 | Pancreas | 167
PCPG | Pheochromocytoma and Paraganglioma | 182 | 3 | - | -
PRAD | Prostate adenocarcinoma | 496 | 52 | Prostate | 100
READ | Rectum adenocarcinoma | 93 | 10 | Colon | 308
SARC | Sarcoma | 262 | 2 | - | -
SKCM | Skin Cutaneous Melanoma | 469 | 1 | Skin | 812
STAD | Stomach adenocarcinoma | 414 | 36 | Stomach | 174
TGCT | Testicular Germ Cell Tumors | 154 | - | Testis | 165
THCA | Thyroid carcinoma | 512 | 59 | Thyroid | 279
THYM | Thymoma | 119 | 2 | Blood | 444
UCEC | Uterine Corpus Endometrial Carcinoma | 181 | 23 | Uterus | 78
UCS | Uterine Carcinosarcoma | 57 | - | Uterus | 78
UVM | Uveal Melanoma | 79 | - | - | -
- | - | - | - | Adipose Tissue | 515
- | - | - | - | Blood Vessel | 606
- | - | - | - | Fallopian Tube | 5
- | - | - | - | Heart | 377
- | - | - | - | Muscle | 396
- | - | - | - | Nerve | 278
- | - | - | - | Pituitary | 107
- | - | - | - | Salivary Gland | 55
- | - | - | - | Small Intestine | 92
- | - | - | - | Spleen | 100
- | - | - | - | Vagina | 85

Frequently Asked Questions

Q1: What is the motivation to develop GEPIA2021?

A1: We launched the GEPIA project in 2017 to facilitate widely used analyses of the TCGA and GTEx expression datasets, providing biologists and clinicians with a handy tool to perform comprehensive and complex data mining tasks. By the end of 2020, GEPIA and GEPIA2 had been cited ~2,600 times in total and had processed ~1,300,000 analysis requests for ~300,000 users worldwide. Based on the feedback, we decided to develop GEPIA2021, an extended version of GEPIA with multiple cell type-level analyses based on bulk sample deconvolution results.


Q2: How should I select a deconvolution tool among CIBERSORT, EPIC and quanTIseq?

A2: CIBERSORT is recommended when users want to investigate immune cell types at high resolution (it provides cell sub-types such as T.cells.CD4.memory.activated). EPIC's reference includes two non-immune cell types but the fewest immune cell types. All three tools use broadly similar strategies to estimate absolute proportions, so the cell proportions of different samples are comparable. For example, CIBERSORT first estimates the proportion of each cell type relative to the total immune cell content, and then estimates the immune cell content in the mixture from the median gene expression of the s signature genes (in formula ii) against the median of all genes, on the assumption that no cell type outside the reference contributes to the expression of the signature genes. Notably, absolute proportion output is natively supported by EPIC and quanTIseq, while the absolute_mode in CIBERSORT is still a beta version. The number of genes available for sub-expression analysis differs among the three tools and depends on the reference each tool provides. According to the publications of the three tools, CIBERSORT was designed for multiple types of tissue, while EPIC and quanTIseq were originally designed for tumor samples. Therefore, we recommend CIBERSORT as the first choice for tumor-normal comparison. However, as all three tools were extensively validated on normal blood and tumor samples, EPIC and quanTIseq may still be applicable to blood cell deconvolution on GTEx samples. In addition, EPIC/quanTIseq also achieved high performance in this benchmark article.
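
The scaling idea described for CIBERSORT can be sketched as below. This is a simplified illustration of the description in this answer, not CIBERSORT's actual implementation; the function name, gene names and values are hypothetical.

```python
from statistics import median


def absolute_proportions(relative, expression, signature_genes):
    """Scale relative immune fractions by an estimate of total immune content.

    The scale is the median expression of the signature genes divided by the
    median expression of all genes in the mixture.
    """
    signature_values = [expression[g] for g in signature_genes if g in expression]
    scale = median(signature_values) / median(list(expression.values()))
    return {cell: fraction * scale for cell, fraction in relative.items()}


# Hypothetical expression values of one bulk sample and its relative fractions.
expression = {"CD19": 4.0, "CD3E": 6.0, "ALB": 10.0, "GAPDH": 20.0, "ACTB": 30.0}
relative = {"B cells": 0.4, "T cells": 0.6}
absolute = absolute_proportions(relative, expression, ["CD19", "CD3E"])
print(absolute)
```

Because the scale differs between samples, the scaled values no longer have to sum to 1 within a sample, which is what makes them comparable across samples.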


Q3: Why do you build a standalone extension rather than integrating the functionalities into the original version of GEPIA?

A3: Please let us first introduce our data processing workflow. We downloaded the TCGA and GTEx data from Xena. We keep only the intersection of genes between the downloaded data and the reference, and perform the log-normalization and downstream analysis on these genes. Therefore, the gene expression values used in GEPIA2021 differ from those in GEPIA/GEPIA2.
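
A minimal sketch of the gene filtering step is shown below. The function name and gene values are hypothetical, and the log2(x + 1) transform is an assumption, since the exact form of the log-normalization is not specified here.

```python
import math


def restrict_and_log(expression, reference_genes):
    """Keep only genes shared with the reference, then log-transform them."""
    shared = sorted(set(expression) & set(reference_genes))
    # Assumption: log2(x + 1) stands in for the unspecified log-normalization.
    return {gene: math.log2(expression[gene] + 1) for gene in shared}


# Hypothetical expression values for one sample and a hypothetical reference gene set.
expression = {"CD8A": 7.0, "CD19": 3.0, "ALB": 15.0}
reference_genes = {"CD8A", "CD19", "MS4A1"}
filtered = restrict_and_log(expression, reference_genes)
print(filtered)
```

Genes present in only one of the two sets (ALB and MS4A1 in this example) are dropped, which is why expression values here are not directly interchangeable with those computed on the full gene set.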


Q4: How do you explain the ANOVA results in the Proportion and the Sub-expression modules?

A4: Assuming that the variance is equal across all groups, we perform the F-test (one-way ANOVA) to quantify the degree of difference, along with the p-value indicating its significance. If the p-value is too small (around 1e-16, comparable with floating-point precision), it will be shown as zero. If all samples in the compared groups have zero proportion or expression, the F-value and p-value will be NaN.
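
For readers who want to reproduce the statistic, the F-value of a one-way ANOVA can be computed as in the sketch below. The function name and the group values are hypothetical, and this is not the GEPIA2021 source code. The p-value would additionally require the F distribution with (k - 1, n - k) degrees of freedom; scipy.stats.f_oneway, for example, returns both values.

```python
import math


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic, or NaN when it is undefined.

    groups is a list of non-empty lists, one list of values per group.
    """
    k = len(groups)
    n = sum(len(group) for group in groups)
    if k < 2 or n <= k:
        return math.nan
    means = [sum(group) / len(group) for group in groups]
    grand_mean = sum(sum(group) for group in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        # No within-group variation: 0/0 is undefined, anything else is infinite.
        return math.nan if ss_between == 0 else math.inf
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical proportions of one cell type in three sub-datasets.
groups = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
f_value = one_way_anova_f(groups)
print(round(f_value, 3))  # about 13.0

# Groups in which every sample is zero give an undefined statistic.
print(one_way_anova_f([[0.0, 0.0], [0.0, 0.0]]))  # nan
```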

Code Availability

The code for the bulk data deconvolution can be accessed here.