Q: Filter a data frame based on whether samples are present in a separate list

Stack Overflow user

Asked 2019-10-27 21:30:29

Answers: 1, Views: 77, Followers: 0, Votes: 2

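I want to filter a data frame with 1212 samples so that it contains only the samples listed in a separate list. The list has multiple values, and I can't figure out how to do this.

The data frame below is called RNASeq2: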

RNASeq2Norm_samples Substrng_RNASeq2Norm
   1    TCGA-3C-AAAU-01A-11R-A41B-07    TCGA.3C.AAAU
   2    TCGA-3C-AALI-01A-11R-A41B-07    TCGA.3C.AALI
   3    TCGA-3C-AALJ-01A-31R-A41B-07    TCGA.3C.AALJ
   4    TCGA-3C-AALK-01A-11R-A41B-07    TCGA.3C.AALK
   5    TCGA-4H-AAAK-01A-12R-A41B-07    TCGA.4H.AAAK
   6    TCGA-5L-AAT0-01A-12R-A41B-07    TCGA.5L.AAT0
   7    TCGA-5L-AAT1-01A-12R-A41B-07    TCGA.5L.AAT1
   8    TCGA-5T-A9QA-01A-11R-A41B-07    TCGA.5T.A9QA
   .
   .
   .
   1212

list = intersect_samples

intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097

I tried this code, but it returns all of the original 1212 samples:

RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]

However, if I try

RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]

it returns the correct row.

str(RNASeq2)
'data.frame':   1212 obs. of  2 variables:
 $ RNASeq2             : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...

str(intersect_samples)
 chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...

Tags: filter, rows, string-matching, string

1 Answer

Stack Overflow user

Answered 2019-10-29 06:45:10

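AFAIK, R does not offer a convenience function for finding a vector of search strings in a vector of strings using partial matching ("substrings").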

If you want to find substrings within strings, %in% is not the right function, because it only compares whole strings.
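To illustrate the whole-string behaviour, here is a minimal sketch on toy values (the vectors `string` and `pattern` are made up for illustration and are not the asker's data):

```r
# Toy data (assumed for illustration only)
string  <- c("123", "234", "345", "xyz")
pattern <- c("23", "45", "999")

# %in% asks: is each element of `string` exactly equal to some element of `pattern`?
whole_match <- string %in% pattern
whole_match
# [1] FALSE FALSE FALSE FALSE

# An exact value, by contrast, is found
exact_match <- c("23", "xyz") %in% pattern
exact_match
# [1]  TRUE FALSE
```

None of the elements of `string` equals a pattern exactly, so every comparison is FALSE even though "23" and "45" occur inside some of them.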

Instead, use base R's grepl or the possibly faster stri_detect_fixed function from the excellent stringi package.
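As a rough base R sketch of the grepl route, assuming the same kind of toy data (the names `string`, `pattern`, `hits` and `found` are illustrative only): grepl takes a single pattern per call, so the patterns are looped over, and `fixed = TRUE` treats each one as a literal string rather than a regular expression.

```r
# Toy data (assumed for illustration only)
string  <- c("123", "234", "345", "xyz")
pattern <- c("23", "45", "999")

# One logical vector per pattern: TRUE where the pattern occurs inside the string
hits <- lapply(pattern, function(p) grepl(p, string, fixed = TRUE))

# Combine element-wise with "or": TRUE if any pattern was found in that string
found <- Reduce(`|`, hits)
found
# [1]  TRUE  TRUE  TRUE FALSE
```

Literal matching matters for identifiers that contain dots, since an unescaped "." in a regular expression matches any character.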

Note that I have abstracted the code and data (instead of using yours) to make it easier to understand.

library(stringi)

pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
                   string  = c("123", "234", "345", "xyz"),
                   stringsAsFactors = FALSE)
# row_num string
# 1       1    123
# 2       2    234
# 3       3    345
# 4       4    xyz

string <- data$string  # the column that contains the values to be filtered

# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)

selected    # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1]  TRUE  TRUE FALSE FALSE
# 
# [[2]]
# [1] FALSE FALSE  TRUE FALSE
# 
# [[3]]
# [1] FALSE FALSE FALSE FALSE

# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: Does not handle `NA`s fully (FALSE | NA yields NA, although TRUE | NA is still TRUE)
selected.rows <- Reduce("|", selected)
# [1]  TRUE  TRUE  TRUE FALSE

# To handle NAs correctly (if you have NAs) you can use this (slower) code:
selected.rows <- rowSums(as.data.frame(selected), na.rm = TRUE) > 0

# Use the logical vector as row selector (TRUE returns the element, FALSE ignores it):
string[selected.rows]
# [1] "123" "234" "345"
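The code above filters the string vector; the same logical vector can also select rows of the data frame itself. Below is a hedged, self-contained sketch that uses base R's grepl in place of stri_detect_fixed so that it runs without extra packages (`hit_matrix`, `keep` and `filtered` are illustrative names, not part of the original answer):

```r
# Toy data mirroring the abstracted example
data <- data.frame(row_num = 1:4,
                   string  = c("123", "234", "345", "xyz"),
                   stringsAsFactors = FALSE)
pattern <- c("23", "45", "999")

# One logical column per pattern (rows = data rows, columns = patterns)
hit_matrix <- vapply(pattern,
                     function(p) grepl(p, data$string, fixed = TRUE),
                     logical(nrow(data)))

# Keep a row if at least one pattern was found in it
keep <- rowSums(hit_matrix) > 0

# Subset the rows; drop = FALSE keeps the result a data frame
filtered <- data[keep, , drop = FALSE]
filtered
#   row_num string
# 1       1    123
# 2       2    234
# 3       3    345
```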

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/58579865
