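I have unstructured text that contains many dates, and I want to extract the date that appears before the word "Message". The data I have looks like this: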
21 March 2017 23:10:45 text1
21 March 2017 23:10:45 More text…..
21 March 2017 23:10:45 And more text …..
21 March 2017 23:10:45 some more text **Message:** more text
22 March 2017 23:10:45 text1
22 March 2017 23:10:45 More text…..
22 March 2017 23:10:45 And more text …..
22 March 2017 23:10:45 some more text **Message:** more text
23 March 2017 23:10:45 text1
23 March 2017 23:10:45 More text…..
23 March 2017 23:10:45 And more text …..
23 March 2017 23:10:45 some more text **Message:** more text
24 March 2017 23:10:45 text1
24 March 2017 23:10:45 More text…..
24 March 2017 23:10:45 And more text …..
24 March 2017 23:10:45 some more text **Message:** more text
The output should be a new data format with a single column holding the dates:
21 March 2017
22 March 2017
23 March 2017
24 March 2017
Posted on 2017-03-25 17:11:30
How about:
sub("(?<=\\d{4}).*", "", grep("Message", txt, value=TRUE), perl=TRUE)
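# [1] "21 March 2017" "22 March 2017" "23 March 2017" "24 March 2017"
We first use grep() to reduce txt to only the values that contain "Message", then use sub() to remove all text after the first occurrence of four consecutive digits (the year).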
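Since the question asks for the dates as a single column, the same grep()/sub() idea could be wrapped in a data frame. This is only a minimal sketch: the shortened sample vector and the column name `date` are assumptions made for illustration, not part of the original answer.

```r
# Shortened sample in the same shape as the question's data (illustrative only)
txt <- c(
  "21 March 2017 23:10:45 text1",
  "21 March 2017 23:10:45 some more text **Message:** more text",
  "22 March 2017 23:10:45 And more text",
  "22 March 2017 23:10:45 some more text **Message:** more text"
)

# Keep only the lines that mention "Message"
msg_lines <- grep("Message", txt, value = TRUE)

# Drop everything that follows the first run of four digits (the year);
# the lookbehind requires perl = TRUE
dates <- sub("(?<=\\d{4}).*", "", msg_lines, perl = TRUE)

# One-column data frame; the column name `date` is an assumption
result <- data.frame(date = dates, stringsAsFactors = FALSE)
print(result)
```

The column holds character strings. If real Date values are needed, as.Date(result$date, format = "%d %B %Y") may work, but %B matches month names in the current locale, so it assumes an English locale.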
Data:
txt <- readLines(textConnection("21 March 2017 23:10:45 text1
21 March 2017 23:10:45 More text…..
21 March 2017 23:10:45 And more text …..
21 March 2017 23:10:45 some more text **Message:** more text
22 March 2017 23:10:45 text1
22 March 2017 23:10:45 More text…..
22 March 2017 23:10:45 And more text …..
22 March 2017 23:10:45 some more text **Message:** more text
23 March 2017 23:10:45 text1
23 March 2017 23:10:45 More text…..
23 March 2017 23:10:45 And more text …..
23 March 2017 23:10:45 some more text **Message:** more text
24 March 2017 23:10:45 text1
24 March 2017 23:10:45 More text…..
24 March 2017 23:10:45 And more text …..
24 March 2017 23:10:45 some more text **Message:** more text
"))
https://stackoverflow.com/questions/43019274