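I have tried to solve this with gsub, but it is too hard for me. I don't know how to tell the function to return only certain numbers and ignore the others.
My problem: I have a large data frame with a test.comments column for each test performed. It is a big chunk of text in which only certain numbers are of interest to me.
Example:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions
What I would like to do is put the value 18,900,000,000 in a separate column (but not the phone number or the other random numbers).
Sometimes the number is surrounded by underscores (_______):
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED
In some cases the number is also small:
A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
or
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen.
I would like a robust command that returns: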
18,900,000,000
33,400,000
900
<250
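It would also help me to have a command that returns only the numbers > 1,000; I could then edit the other cases manually.
But surely there is a more elegant solution?
EDIT: Thanks for your help, everyone. Sven's solution worked best for me!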
Answered 2014-11-20 19:34:35
Here is one possible solution using sub:
sub(".*?([<>]?[,0-9]+)[ _]+BK.*", "\\1", vec)
# [1] "18,900,000,000" "33,400,000" "900" "<250"
where vec is a vector containing the 4 examples.
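The question asks for the value in a separate column and for a way to pick out the numbers above 1,000. A minimal sketch of how the sub call above could be applied to a data frame follows. The data frame df, the columns bkv.raw and bkv.num, and the shortened comment strings are all hypothetical; only the column name test.comments and the pattern come from the thread. Note that sub returns its input unchanged when the pattern does not match, so the sketch checks for a match first.

```r
# Hypothetical data frame with abbreviated versions of the question's examples,
# plus one comment that contains no value at all.
df <- data.frame(
  test.comments = c(
    "A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected. Call 555-122-634 with Questions",
    "A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
    "A calculated 900 BK virus (BKV) genome equivalents per ml were detected",
    "A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected",
    "No viral load reported"
  ),
  stringsAsFactors = FALSE
)

# Pattern taken from the answer above
pattern <- ".*?([<>]?[,0-9]+)[ _]+BK.*"

# sub() leaves non-matching strings untouched, so flag those rows as NA
# instead of copying the whole comment into the new column.
has_value <- grepl(pattern, df$test.comments)
df$bkv.raw <- ifelse(has_value, sub(pattern, "\\1", df$test.comments), NA_character_)

# Numeric copy: strip the commas; censored values such as "<250" stay NA
# so that they can be reviewed by hand.
df$bkv.num <- NA_real_
plain <- !is.na(df$bkv.raw) & !grepl("^[<>]", df$bkv.raw)
df$bkv.num[plain] <- as.numeric(gsub(",", "", df$bkv.raw[plain]))

# Rows whose value is above 1,000
df[which(df$bkv.num > 1000), c("bkv.raw", "bkv.num")]
```

Keeping both the raw string and the numeric value is a design choice here: the raw column preserves "<250" as reported, while the numeric column can be filtered directly.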
Answered 2014-11-20 18:35:24
This gets the targets in these examples (with the fourth case added):
dput(test)
c("** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
"A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
)
Better examples will be needed if this does not work well:
> gsub("(^[^>_0-9]+)([0-9,]{14}|[_]+[<0-9,]+[_]+|[,0-9]+ BK)(.+$)",
"\\2", test)
[1] "18,900,000,000 BK" "__33,400,000____" "900 BK"
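[4] "__<250__________"
Then you just need to remove the underscores and commas (and the trailing " BK" in the first and third results). The logic is that the report appears to have a preset number of spaces for the data: either 14 characters of digits and commas or, when the value does not fill the field, a value padded with underscores.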
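The cleanup step described above could look like the following sketch. The variable names raw, clean, and digits are hypothetical; the input values are copied from the output printed in this answer.

```r
# Raw matches as printed by the gsub() call above
raw <- c("18,900,000,000 BK", "__33,400,000____", "900 BK", "__<250__________")

# Remove the padding underscores and the trailing " BK" that the
# third alternative of the pattern leaves behind.
clean <- gsub("_| BK$", "", raw)
clean

# Optionally remove the thousands separators as well
digits <- gsub(",", "", clean)
digits
```

The "<" in "<250" is kept on purpose, since it marks a value below the reporting limit and not an exact count.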
Answered 2014-11-20 18:39:52
Neither of the two approaches so far is completely robust, and I'm not sure how to fix them since I'm not a good regexer.
p1 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- "A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
The first pattern does not pick up the 900 in the third example string:
pattern <- '(?:\\s+)*[\\d<>]((?:[\\d,])*(?![\\s-\\d]))'
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] " 18,900,000"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "<250"
The second pattern picks up extra digit strings in the first example, but it does get the 900 in the third:
pattern <- "[\\d<>]((?:[\\d,])*)"
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] "18,900,000,000" "1" "10" "555"
# [5] "122" "634"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "900" "<250"
https://stackoverflow.com/questions/27046301
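One way the second regmatches pattern from the third answer might be tightened is to accept a number only when it is followed by spaces or underscores and then "BK", which is the same anchor the sub-based answer relies on. This is a sketch using a lookahead, not something proposed in the thread; the helper extract_bkv and the abbreviated test strings are hypothetical, and it assumes every value of interest is directly followed by "BK".

```r
# Abbreviated versions of the three test strings used in the answer above
p1 <- "A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected ... 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- paste(
  "A calculated 900 BK virus (BKV) genome equivalents per ml were detected",
  "A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected",
  sep = "\n"
)

# Optional < or >, one digit, then any run of digits and commas.
# The lookahead (?=[ _]+BK) requires spaces/underscores followed by "BK"
# right after the number without including them in the match.
pattern <- "[<>]?\\d[\\d,]*(?=[ _]+BK)"

# Hypothetical helper: all matches found in a single string
extract_bkv <- function(x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

extract_bkv(p1)
extract_bkv(p2)
extract_bkv(p3)
```

Because the lookahead rejects numbers that are not followed by "BK", the sample count and the phone number in the first string are skipped, while both values in the two-line third string are returned.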