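I want to create the variable ori.same.maf.barcodes, holding those strings of ori.maf.barcode whose substring before the fourth "-" character matches a string in sub.same.barcodes.
Here is how sub.same.barcodes and ori.maf.barcode were produced. sub.maf.barcode is a shortened version of ori.maf.barcode$Tumor_Sample_Barcode. sub.same.barcodes is the intersection of sub.maf.barcode and sub.met.barcode. Now I want to match sub.same.barcodes against ori.maf.barcode.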
ori.maf.barcode <- maf@clinical.data
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).*", "\\1", ori.maf.barcode$Tumor_Sample_Barcode) # Keep only the first four dash-separated fields
sub.same.barcodes <- intersect(sub.maf.barcode, sub.met.barcode)
Attempt:
ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes
But my code returns "FALSE" instead of a character vector.
dput(ori.maf.barcode[1:20])
structure(list(Tumor_Sample_Barcode = c("TCGA-2K-A9WE-01A-11D-A382-10",
"TCGA-2Z-A9J1-01A-11D-A382-10", "TCGA-2Z-A9J2-01A-11D-A382-10",
"TCGA-2Z-A9J3-01A-12D-A382-10", "TCGA-2Z-A9J5-01A-21D-A382-10",
"TCGA-2Z-A9J6-01A-11D-A382-10", "TCGA-2Z-A9J7-01A-11D-A382-10",
"TCGA-2Z-A9J8-01A-11D-A42J-10", "TCGA-2Z-A9JD-01A-11D-A42J-10",
"TCGA-2Z-A9JG-01A-11D-A42J-10", "TCGA-2Z-A9JI-01A-11D-A42J-10",
"TCGA-2Z-A9JJ-01A-11D-A42J-10", "TCGA-2Z-A9JK-01A-11D-A42J-10",
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-2Z-A9JN-01A-21D-A42J-10",
"TCGA-2Z-A9JO-01A-11D-A42J-10", "TCGA-2Z-A9JQ-01A-11D-A42J-10",
"TCGA-2Z-A9JR-01A-12D-A42J-10", "TCGA-2Z-A9JS-01A-21D-A42J-10",
"TCGA-3Z-A93Z-01A-11D-A36X-10")), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x0000025e377005d0>)
dput(sub.met.barcode[1:20])
c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-UZ-A9PZ-01A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-G7-7502-01A", "TCGA-B1-A47M-11A",
"TCGA-SX-A7SO-01A", "TCGA-HE-A5NJ-01A", "TCGA-MH-A856-01A", "TCGA-A4-8312-01A",
"TCGA-BQ-5892-01A", "TCGA-A4-7732-11A", "TCGA-5P-A9K9-01A", "TCGA-UZ-A9PX-01A",
"TCGA-BQ-7061-01A", "TCGA-BQ-5876-01A", "TCGA-DZ-6134-01A", "TCGA-BQ-5884-01A",
"TCGA-BQ-5889-11A")
Posted on 2022-10-22 16:53:29
We can use sub to extract the substring up to the fourth -, then use the logical vector returned by %in% to subset.
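# Keep the first four dash-separated fields, strip the trailing "-",
# and test membership in the shortened barcodes
i1 <- trimws(sub("^(([^-]+-){4}).*", "\\1", ori.maf.barcode),
             whitespace = "-") %in%
  sub("^(([^-]+-){4}).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]
Output: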
> ori.same.maf.barcodes
[1] "TCGA-BQ-7058-01A-11D-1963-05"
[2] "TCGA-2Z-A9JQ-01A-11D-A42K-05"
[3] "TCGA-BQ-5887-11A-01D-1963-05"
Update
With the new dput in the OP's post, 'ori.maf.barcode' is a data.table with a column named 'Tumor_Sample_Barcode'. Extract the column with $ or [[ in base R, or subset directly with data.table methods.
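The base R route through $ might look like the sketch below. It is an illustration only: the two objects are abridged from the dput output in the question, a plain data.frame stands in for the data.table, and the regular expression is a variant that captures four fields without the trailing dash, so no trimws step is needed.

```r
# Abridged stand-ins for the OP's objects (a few rows from the dput output)
ori.maf.barcode <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-2Z-A9JO-01A-11D-A42J-10",
                           "TCGA-2Z-A9JQ-01A-11D-A42J-10",
                           "TCGA-3Z-A93Z-01A-11D-A36X-10"),
  stringsAsFactors = FALSE
)
sub.met.barcode <- c("TCGA-BQ-7058-01A", "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A")

# Capture the first four dash-separated fields of every full barcode
prefix <- sub("^(([^-]+-){3}[^-]+).*", "\\1",
              ori.maf.barcode$Tumor_Sample_Barcode)

# Keep the full barcodes whose four-field prefix occurs in sub.met.barcode
ori.same.maf.barcodes <-
  ori.maf.barcode$Tumor_Sample_Barcode[prefix %in% sub.met.barcode]
print(ori.same.maf.barcodes)
```

The data.table form of the same subset follows.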
library(data.table)
ori.maf.barcode[trimws(sub("^(([^-]+-){4}).*", "\\1", Tumor_Sample_Barcode),
                       whitespace = "-") %in%
                  sub("^(([^-]+-){4}).*", "\\1", sub.met.barcode)]
           Tumor_Sample_Barcode
                         <char>
1: TCGA-2Z-A9JQ-01A-11D-A42J-10
Data
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
"TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06"
)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A",
"TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
Posted on 2022-10-22 16:53:17
Note that with the sample data you provided, the value TCGA-G7-7502-01A-12D-A43K-06 cannot appear in the output.
library(stringr)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
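# str_extract() returns a character vector (str_extract_all() returns a list);
# the leading ^ anchors the pattern to the start of the barcode
idx <- which(str_extract(ori.maf.barcode, '^.{4}-.{2}-.{4}-.{3}') %in% sub.same.barcodes)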
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]
print(ori.same.maf.barcodes)
Output:
[1] "TCGA-BQ-7058-01A-11D-1963-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05" "TCGA-BQ-5887-11A-01D-1963-05"
Posted on 2022-10-22 16:47:11
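You are close, but your code ori.maf.barcode %in% sub.same.barcodes creates a logical test that returns TRUE and FALSE, which is what you are seeing. To get the values for which the test is TRUE, pass that expression to a subsetting operation.
ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]
If ori.maf.barcode is a vector, this returns another vector containing only the entries for which the logical test is TRUE.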
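A likely reason the question shows a single FALSE rather than one value per barcode is that ori.maf.barcode is a one-column table, not a character vector. The sketch below illustrates this with a hypothetical two-row data.frame (barcodes.df and wanted are names made up for the example); it also shows that even on the column itself an exact %in% test fails, because full barcodes never equal their four-field prefixes.

```r
# Hypothetical miniature of the OP's objects
barcodes.df <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-2Z-A9JQ-01A-11D-A42J-10",
                           "TCGA-3Z-A93Z-01A-11D-A36X-10"),
  stringsAsFactors = FALSE
)
wanted <- "TCGA-2Z-A9JQ-01A"

# A data.frame is a list of columns, so %in% tests one list element: length 1
whole.table <- barcodes.df %in% wanted

# On the column: one value per barcode, but exact equality never holds
one.column <- barcodes.df$Tumor_Sample_Barcode %in% wanted

# A prefix test gives the intended answer
prefix.match <- startsWith(barcodes.df$Tumor_Sample_Barcode, wanted)

print(whole.table)
print(one.column)
print(prefix.match)
```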
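You need string matching to select entries by their leading part, as follows.
This is a loop that picks out the matches for one prefix at a time and appends them to a new vector.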
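new.barcodes <- c()
# unique() avoids duplicated results when a prefix is listed twice;
# the loop variable is named prefix so it does not shadow base::sub
for (prefix in unique(sub.same.barcodes)) {
  new <- ori.maf.barcode[which(startsWith(ori.maf.barcode, prefix))]
  new.barcodes <- c(new.barcodes, new)
}
This iterates over the prefixes and extracts the desired entries into a new vector.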
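The same prefix match can also be sketched without growing a vector inside a loop. This variant is an assumption about what is wanted rather than part of the answer above: it tests each full barcode against all prefixes at once, so the result keeps the order of ori.maf.barcode and contains no duplicates even if a prefix is repeated. The data are the sample vectors from this thread.

```r
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

# For each full barcode, check whether any prefix matches its start
keep <- vapply(ori.maf.barcode,
               function(b) any(startsWith(b, sub.same.barcodes)),
               logical(1), USE.NAMES = FALSE)

new.barcodes <- ori.maf.barcode[keep]
print(new.barcodes)
```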
https://stackoverflow.com/questions/74165313