【原】單細胞CopyKAT包推斷CNV變化

健明 2025-08-29 發布于廣東

展開全文

最近在學習單細胞CNV的各種分析，先學習一下CopyKAT包的操作使用，后續將學習其他類型分析。

在人類腫瘤的單細胞RNA測序（scRNA-seq）研究中，準確區分惡性細胞與非惡性細胞，并識別可能存在的腫瘤亞克隆，是一項關鍵挑戰。CopyKAT（Copynumber Karyotyping of Tumors）作為一種計算工具，通過整合貝葉斯方法，能夠在5MB分辨率下推斷單細胞的全基因組拷貝數變異，從而有效區分腫瘤細胞與正常細胞，并進一步識別腫瘤亞克隆。該工具基于scRNA-seq數據中相鄰基因表達量的協同變化推測基因組拷貝數狀態，其預測結果與全基因組DNA測序數據的一致性可達80%。本文通過具體示例展示如何使用R包copykat從10X Genomics單細胞RNA數據中計算拷貝數譜、鑒別腫瘤與正常細胞，并推斷腫瘤亞克隆結構。

scRNA-seq為解析腫瘤異質性提供了強大工具，但由于腫瘤微環境中多種細胞類型的共存，以及腫瘤細胞本身可能具有復雜的基因組變異，準確區分惡性與非惡性細胞仍非易事。CopyKAT利用癌細胞中普遍存在的非整倍體特征（約見于90%的人類癌癥），通過推斷基因組區域的拷貝數變異來識別惡性細胞。正常基質細胞和免疫細胞通常保持二倍體或近二倍體拷貝數狀態，從而為區分提供了基礎。

實操

加載單細胞直腸癌數據

library(COSG)
library(harmony)
library(ggsci)
library(dplyr) 
library(future)
library(Seurat)
library(cowplot)
library(data.table)
library(dplyr)
library(ggplot2)
library(patchwork)
library(stringr)

sce=CreateSeuratObject(
  counts = sce@assays$RNA$counts,
  meta.data = sce@meta.data)

sce <- Seurat::NormalizeData(sce,
                             normalization.method = "LogNormalize",
                             scale.factor = 1e4)
sce <- FindVariableFeatures(sce, selection.method = "vst", nfeatures = 2000)
sce <- ScaleData(sce)
sce <- RunPCA(sce, npcs = 50)

sce = sce %>% RunHarmony("orig.ident", plot_convergence = TRUE)
sce2 <- sce %>%
  RunUMAP(reduction = "harmony", dims = 1:30) %>%
  FindNeighbors(reduction = "harmony", dims = 1:30) %>%
  FindClusters(resolution = 0.3) %>%
  identity()


# 查看校正后的結果
p2 <- DimPlot(sce2, reduction = "umap",group.by = "celltype", pt.size=0.5, label = T,repel = TRUE)

p2

構建CopyKAT對象

CopyKAT對象和infCNV一致需要三個文件分別是，表達矩陣，基因類型，細胞屬性三種，先構建這三種矩陣并保存

setwd('copyKAT')
Idents(sce2 )=sce2$celltype

dat=GetAssayData(sce2,
                 slot='counts',assay='RNA')
dat[1:4,1:4]
dim(dat)
#sce <- SetIdent(sce, value = sce@meta.data$compare)
groupinfo=data.frame(v1=colnames(dat),
                     v2= Idents(sce2 ) )
# Idents(sce )=sce$celltype
# gtf 文件 
library(AnnoProbe)
geneInfor=annoGene(rownames(dat),
                   "SYMBOL",'human')
colnames(geneInfor)
head(geneInfor)
geneInfor=geneInfor[with(geneInfor,
                         order(chr, start)),c(1,4:6)]
geneInfor=geneInfor[!duplicated(geneInfor[,1]),]
length(unique(geneInfor[,1]))
head(geneInfor)
## 這里可以去除性染色體
# 也可以把染色體排序方式改變
dat=dat[rownames(dat) %in% geneInfor[,1],]
dat=dat[match( geneInfor[,1], rownames(dat) ),] 
dim(dat)
head(groupinfo)
dat[1:4,1:4]
table(groupinfo$v2)
head(groupinfo)

expFile='expFile.txt'
write.table(dat,file = expFile,sep = '\t',quote = F)
groupFiles='groupFiles.txt'
write.table(groupinfo,file = groupFiles,sep = '\t',
            quote = F,col.names = F,row.names = F)
head(groupinfo)
geneFile='geneFile.txt'
write.table(geneInfor,file = geneFile,sep = '\t',
            quote = F,col.names = F,row.names = F)
head(geneInfor)

copykat對象讀入

設置參考細胞，這里由于都是腫瘤樣本設置T細胞作為參考

library(copykat)
exp.File <- read.delim( expFile , row.names=1)
exp.File[1:4,1:4]
groupFiles <- read.delim( groupFiles , header=FALSE)
head(groupFiles)
colnames(exp.File)=gsub('^X','', colnames(exp.File))
colnames(exp.File)=gsub('[.]','-', colnames(exp.File))
head( colnames(exp.File))

table(groupFiles$V2)
normalCells <-  colnames(exp.File)[grepl("Tc",groupFiles$V2)]
head(normalCells)
res <- copykat(rawmat=exp.File, 
               id.type="S", 
               ngene.chr=5, 
               win.size=25, 
               KS.cut=0.1, 
               sam.name="test", 
               distance="euclidean", 
               norm.cell.names=normalCells, 
               n.cores=5,#核心數記得要適量
               output.seg="FLASE")
save(res,groupFiles,file = 'ref-of-copykat.Rdata')

CNV類型映射到UMAP上

load(file = 'ref-of-copykat.Rdata')
pred.res <- data.frame(res$prediction)
plot(res$hclustering,labels = F)
head(pred.res)
table(pred.res$copykat.pred)
dim(pred.res)

phe=sce2@meta.data
phe$cellid=rownames(phe)

phe=merge(phe,pred.res,by.x="cellid",by.y="cell.names")
rownames(phe)=phe$cellid
phe=phe[rownames(sce2@meta.data),]

sce2@meta.data=phe
table(sce2$copykat.pred)

DimPlot(sce2, 
        group.by = "copykat.pred",
        pt.size = 1,
        label = F,raster=FALSE) 
ggsave('copykat.pred.pdf',height = 4,width =5)

映射顯示基本上都在EPI類型中，證明CopyKAT效果還可以，

結合熱圖結果觀察，部分在拷貝數上出現部分異常變化的細胞仍被歸類為“正常”類型。然而，根據細胞類型注釋（例如基于標記基因的映射），發生突變更多的細胞還是在上皮細胞（EPI）類型中。這表明當前的分類結果中可能存在一定的突變類型較弱的細胞歸為正常細胞，可能一部分腫瘤細胞被CopyKAT誤判為了正常細胞。

為進一步提高細胞類型判斷的準確性，后續將繼續研究這些被預測為“正常”類型進行更深入的二次分析。通過重新檢查其拷貝數變異模式、表達譜特征，或整合其他推斷方法（如基于突變等位基因表達）推測這些細胞的類型。

參考 Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. https://github.com/navinlabcode/copykat https://mp.weixin.qq.com/s/7Qz_LSW_cYFR2BL4pS5_NA