实战环节

上一期对所拥有的数据做了个表达定量并生成了表达矩阵,现在进入差异分析的环节

差异分析

本次做差异分析所使用的工具是基于python的omicverse库

该模块的其安装方法也很简单,但需注意的是omicverse库必须在linux环境或windows系统的WSL环境下使用。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

# 使用conda
conda create -n omicverse python=3.10
conda activate omicverse

# 安装pytorch-gpu版 (二选一)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# 安装putorch-cpu版 (二选一)
conda install pytorch torchvision torchaudio cpuonly -c pytorch

# 安装pyg
conda install pyg -c pyg

# 安装omicverse
conda install omicverse -c conda-forge

本次分析选用的数据集是GSE118916。从GEO数据库中进行下载:

使用omicverse进行常规的差异基因表达分析

1
2
import pandas as pd
import os
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 数据预处理
expressionData = pd.read_csv("expression_data.csv", encoding='UTF-8')

# 将没有基因名的数据索引拿出来
indexNum = 0
indexList = []
for i in list(expressionData.iloc[:,0]):
if '---' == i:
# print(indexNum)
indexList.append(indexNum)
indexNum += 1

expressionData_Drop = expressionData.drop(indexList)
expressionData_Drop

输出:

Gene SymbolGSM3351221GSM3351222GSM3351223GSM3351224GSM3351225GSM3351226GSM3351227GSM3351228GSM3351240GSM3351241GSM3351242GSM3351243GSM3351244GSM3351245GSM3351246GSM3351247GSM3351248GSM3351249
0HIST1H3G4.3331623.3934133.2022983.2739343.1496463.3719943.6274983.0468713.9695943.6238503.6647053.3409043.6020203.3316573.3927653.4938413.4872113.391482
1HIST1H3G5.5091345.4057155.1843024.8831174.8961765.1138275.1674495.1102045.5569054.8512875.3972185.1536985.2012804.8512875.3747345.2766535.1886805.140900
2HIST1H3G4.6869043.8714143.7242243.3977203.6152773.8413204.3252713.5602104.3414793.9923633.9876193.6874513.8550433.6949763.7667413.9630584.1011573.854747
3TNFAIP8L14.3617574.2859164.3777575.3672876.0188054.8887544.7435034.4301554.4980283.7402843.7911654.2065123.7951243.8530143.7825953.7570013.6634553.869301
4OTOP24.6924424.4676244.4307274.1323814.2651754.2376384.2875814.4281474.3524474.8342374.8733934.5217204.8993395.1798134.8661555.2358784.8290414.766776
49297GAPDH2.2504152.1963362.1884532.1409652.2056142.1720102.2166022.1806262.1975282.3111982.3191242.2432912.3541542.3972062.3480412.3176982.3213112.272872
49298STAT12.3495882.3229622.2775232.2174242.2935002.2429452.2833752.2788332.2991612.4390902.4650332.3485602.4726262.5073502.4521862.4618882.4742032.423064
49299STAT12.6718962.6117222.6125912.4683042.5850892.5351292.6060422.5426262.5848982.8112272.8054262.6759892.8817902.8427712.8285542.8407722.8156302.776199
49300STAT13.0410213.0002182.9565122.8363782.9253152.8651082.9605982.9303772.9479403.2216863.2641263.0723353.2822863.3090833.2387343.2691093.2020323.186877
49301STAT13.5211693.5190293.4859203.2995823.4026913.3470423.4277323.4056213.4089843.7579833.8372903.6113173.8415993.8279583.8307613.7523283.7498423.681698
1
2
3
4
5
6
7
8
# 将第一列设置为表格的index
expressionData_Drop.set_index(["Gene Symbol"], inplace=True)

# 导入omicverse库
import omicverse as ov
import scanpy as sc
import matplotlib.pyplot as plt
ov.ov_plot_set()

输出:

1
2
3
4
🔬 Starting plot initialization...
🧬 Detecting CUDA devices…
🚫 No CUDA devices found
✅ plot_set complete.
1
2
3
4
5
6
7
8
9
10
# 差异分析
dds = ov.bulk.pyDEG(expressionData_Drop)

# 该步骤是用于清洗数据,将重复的基因去除,保留表达量高的基因
dds.drop_duplicates_index()
print('... drop_duplicates_index success')

# 数据标准化,去除表中可能的批次效应
dds.normalize()
print('... estimateSizeFactors and normalize success')

输出:

1
2
... drop_duplicates_index success
... estimateSizeFactors and normalize success
1
2
3
4
5
6
7
# 从使用t检验方法计算矩阵中的差异表达基因,需要输入处理组和对照组。
# 处理组
treatment_groups=['GSM3351220','GSM3351221','GSM3351222','GSM3351223','GSM3351224','GSM3351225','GSM3351226','GSM3351227','GSM3351228','GSM3351229','GSM3351230','GSM3351231','GSM3351232','GSM3351233','GSM3351234']
# 对照组
control_groups=['GSM3351235','GSM3351236','GSM3351237','GSM3351238','GSM3351239','GSM3351240','GSM3351241','GSM3351242','GSM3351243','GSM3351244','GSM3351245','GSM3351246','GSM3351247','GSM3351248','GSM3351249']
result_ttest=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
result_ttest.sort_values('qvalue').head()

输出:

1
2
3
⚙️ You are using ttest method for differential expression analysis.
⏰ Start to calculate qvalue...
✅ Differential expression analysis completed.
Gene SymbolpvalueqvalueFoldChangeMaxBaseMeanBaseMeanlog2(BaseMean)log2FCabs(log2FC)size-log(pvalue)-log(qvalue)sig
DMRTA13.372637e-176.773267e-130.7159175.4980994.4536532.154989-0.4821360.4821360.07159216.47203012.169202sig
CAPN131.526491e-151.532826e-110.8497196.6943916.0519852.597408-0.2349430.2349430.08497214.81630610.814507sig
LNX13.725194e-152.493769e-110.7630736.4691545.4830452.454977-0.3901080.3901080.07630714.42885110.603144sig
CAPN95.643624e-152.833522e-110.6679397.7921006.1903862.630029-0.5822120.5822120.06679414.24844210.547673sig
CLDN127.299049e-152.931736e-110.9030286.1117055.7254302.517384-0.1471580.1471580.09030314.13673410.532875sig
1
2
3
4
5
6
7
8
# 一个重要事项是,在处理差异表达基因(DEGs)时并未过滤低表达基因,在后续版本中会加入相应的处理功能。
print(dds.result.shape)
dds.result=dds.result.loc[dds.result['log2(BaseMean)']>1]
print(dds.result.shape)

输出:
(14028, 12)
(14028, 12)
1
2
3
4
5
6
# 下一步还需要设置Foldchange(倍数变化)的阈值,需要准备一个名为foldchange_set的方法来完成这一任务。
# 该功能会根据log2FC(以2为底的对数倍数变化)的分布自动计算合适的阈值,但也可以手动输入。
# -1 means automatically calculates
dds.foldchange_set(fc_threshold=-1,
pval_threshold=0.05,
logp_max=10)

输出:

1
... Fold change threshold: 0.16936119706053465
1
2
3
4
5
6
7
8
# 可视化差异表达基因(DEG)结果与特定基因
# 为可视化DEG分析结果,使用plot_volcano函数进行绘制。该函数可重点展示目标基因或高差异表达基因,需要输入以下参数:
# title:火山图标题
# figsize:图像尺寸
# plot_genes:需重点标注的目标基因列表
# plot_genes_num:若无特定目标基因,可设置自动标注的基因数量
dds.plot_volcano(title='DEG Analysis',figsize=(10,10),
plot_genes_num=0,plot_genes_fontsize=0,)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🌋 Volcano Plot Analysis:
Total genes: 14028
↗️ Upregulated genes: 798
↘️ Downregulated genes: 643
➡️ Non-significant genes: 12587
🎯 Total significant genes: 1441
log2FC range: -0.99 to 0.79
qvalue range: 6.77e-13 to 1.00e+00

⚙️ Current Function Parameters:
Data columns: pval_name='qvalue', fc_name='log2FC'
Thresholds: pval_threshold=0.05, fc_max=0.16936119706053465, fc_min=-0.16936119706053465
Plot size: figsize=(10, 10)
Gene labels: plot_genes_num=0, plot_genes_fontsize=0
Custom genes: None (auto-select top genes)

💡 Parameter Optimization Suggestions:
✅ Current parameters are optimal for your data!
────────────────────────────────────────────────────────────
1
<AxesSubplot: title={'center': 'DEG Analysis'}, xlabel='$log_{2}FC$', ylabel='$-log_{10}(qvalue)$'>

image.png

使用 DEseq2 进行差异表达分析

借助 pyDEseq2,可以像 R 语言那样分析差异表达基因

1
2
3
4
5
6
7
8
# 将表达矩阵全部转换为整数
expressionData_int = expressionData_Drop.astype(int)
# expressionData_int.head()

dds_int = ov.bulk.pyDEG(expressionData_int)

dds_int.drop_duplicates_index()
print('... drop_duplicates_index success')

输出:

1
... drop_duplicates_index success
1
2
3
4
5
6
7
# 从使用DEseq2方法计算矩阵中的差异表达基因,需要输入处理组和对照组。
# 处理组
treatment_groups=['GSM3351220','GSM3351221','GSM3351222','GSM3351223','GSM3351224','GSM3351225','GSM3351226','GSM3351227','GSM3351228','GSM3351229','GSM3351230','GSM3351231','GSM3351232','GSM3351233','GSM3351234']
# 对照组
control_groups=['GSM3351235','GSM3351236','GSM3351237','GSM3351238','GSM3351239','GSM3351240','GSM3351241','GSM3351242','GSM3351243','GSM3351244','GSM3351245','GSM3351246','GSM3351247','GSM3351248','GSM3351249']
result = dds_int.deg_analysis(treatment_groups,control_groups,method='DEseq2')
# result_ttest.sort_values('qvalue').head()

输出:

1
2
3
⚙️ You are using DEseq2 method for differential expression analysis.
⏰ Start to create DeseqDataSet...
Using None as control genes, passed at DeseqDataSet initialization
1
2
3
4
5
6
7
8
9
10
Fitting size factors...
... done in 0.05 seconds.

Fitting dispersions...
... done in 0.65 seconds.

Fitting dispersion trend curve...
... done in 0.22 seconds.

Fitting MAP dispersions...
1
logres_prior=1.5254760032591015, sigma_prior=1.4514357345950912
1
2
3
4
5
6
7
8
9
10
11
... done in 1.82 seconds.

Fitting LFCs...
... done in 0.77 seconds.

Calculating cook's distance...
... done in 0.02 seconds.

Replacing 0 outlier genes.

Running Wald tests...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Log2 fold change & Wald test p-value: condition Treatment vs Control
baseMean log2FoldChange lfcSE stat \
Gene Symbol
COX3 /// MYBPC3 /// SLC12A1 13.074116 0.001168 0.154571 0.007559
TPT1 13.007369 0.001171 0.154922 0.007558
ND4 13.007369 0.001171 0.154922 0.007558
ATP13A5 /// HARS 12.974036 -0.006246 0.155129 -0.040261
B2M 12.840421 -0.021317 0.155953 -0.136689
... ... ... ... ...
SSX1 1.300795 -0.072834 0.473870 -0.153699
C2orf83 1.300624 -0.221223 0.475017 -0.465715
CLCA3P 1.233980 -0.551375 0.494168 -1.115764
UGT2B11 1.200886 -0.484293 0.501002 -0.966649
RAD21L1 1.233900 -0.076840 0.486072 -0.158084

pvalue padj
Gene Symbol
COX3 /// MYBPC3 /// SLC12A1 0.993969 0.997498
TPT1 0.993969 0.997498
ND4 0.993969 0.997498
ATP13A5 /// HARS 0.967885 0.997498
B2M 0.891277 0.997498
... ... ...
SSX1 0.877847 NaN
C2orf83 0.641419 NaN
CLCA3P 0.264523 NaN
UGT2B11 0.333719 NaN
RAD21L1 0.874391 NaN

[20083 rows x 6 columns]
✅ Differential expression analysis completed.
1
... done in 1.17 seconds.
1
2
3
print(result.shape)
result=result.loc[result['log2(BaseMean)']>1]
print(result.shape)

输出:

1
2
(20083, 14)
(20083, 14)
1
2
3
4
# 需要设置Foldchange(倍数变化)的阈值
dds_int.foldchange_set(fc_threshold=-1,
pval_threshold=0.05,
logp_max=10)

输出:

1
... Fold change threshold: 0.2874723264386634
1
2
3
# 差异基因的可视化
dds.plot_volcano(title='DEG Analysis',figsize=(10,10),
plot_genes_num=8,plot_genes_fontsize=12,)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🌋 Volcano Plot Analysis:
Total genes: 14028
↗️ Upregulated genes: 798
↘️ Downregulated genes: 643
➡️ Non-significant genes: 12587
🎯 Total significant genes: 1441
log2FC range: -0.99 to 0.79
qvalue range: 6.77e-13 to 1.00e+00

⚙️ Current Function Parameters:
Data columns: pval_name='qvalue', fc_name='log2FC'
Thresholds: pval_threshold=0.05, fc_max=0.16936119706053465, fc_min=-0.16936119706053465
Plot size: figsize=(10, 10)
Gene labels: plot_genes_num=8, plot_genes_fontsize=12
Custom genes: None (auto-select top genes)

💡 Parameter Optimization Suggestions:
✅ Current parameters are optimal for your data!
────────────────────────────────────────────────────────────

image.png

1
2
3
4
# 特定基因表达量分析
dds_int.plot_boxplot(genes=['CAPN9','LNX1','PFDN2','CDK4'],treatment_groups=treatment_groups,
control_groups=control_groups,figsize=(4,6),fontsize=12,
legend_bbox=(1,0.55))

输出:
image.png

1
2
3
4
# 单个基因的差异分析
dds_int.plot_boxplot(genes=['CAPN9'],treatment_groups=treatment_groups,
control_groups=control_groups,figsize=(2,3),fontsize=12,
legend_bbox=(2,0.55))

输出:

image.png