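I have a text that I want to split into three parts, harper_presenter, harper_time and harper_text, using regular expressions in R.
The text is: HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was used to describe the work of brilliant students who explored and expanded the uses to which this new technology might be employed. There was even talk of a "hacker ethic." Somehow, in the succeeding years, the word has taken on dark connotations, suggestion the actions of a criminal. What is the hacker ethic, and does it survive?
HARPER'S is harper_presenter, Day 1, 9:00 A.M. is harper_time, and the rest is harper_text.
Ideally the solution would not filter on the exact words.
The actual result should be a list.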
Posted on 2019-04-05 10:21:05
If you want to do this with a regular expression, you can use stringr::str_extract_all:
text <- "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was used to describe the work of brilliant students who explored and expanded the uses to which this new technology might be employed. There was even talk of a \"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark connotations, suggestion the actions of a criminal. What is the hacker ethic, and does it survive?"
stringr::str_extract_all(text, "^([A-Z]+'*[A-Z]*)|(\\[.*\\])|(:.*)")
[[1]]
[1] "HARPER'S"
[2] "[Day 1, 9:00 A.M.]"
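[3] ": When the computer was young, the word hacking was used to describe the work of brilliant students who explored and expanded the uses to which this new technology might be employed. There was even talk of a \"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark connotations, suggestion the actions of a criminal. What is the hacker ethic, and does it survive?"
The pattern ^([A-Z]+'*[A-Z]*)|(\\[.*\\])|(:.*) splits into three parts, separated by the "or" operator, |. Note that the extracted pieces keep their delimiters: the second element includes the square brackets and the third starts with the colon.
The first part, ([A-Z]+'*[A-Z]*), looks for one or more capital letters, followed by zero or more apostrophes ('), followed by zero or more capital letters. The ^ anchors this part to the start of the string; it applies only to the first alternative.
The second part, (\\[.*\\]), looks for a pair of square brackets with zero or more of any character (.) between them. The backslashes are doubled because the pattern is written as an R string; the regex engine sees \[ and \], which match literal brackets.
The third part, (:.*), looks for a : followed by zero or more of any character (.).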
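Since the question asks for a list with the three named parts, here is a minimal sketch of one way to get there. It is a variation on the answer above, not part of it: it uses base R's regexec and regmatches with capture groups instead of stringr, and it assumes every input has the shape PRESENTER [TIME]: TEXT with a presenter that contains no spaces. The helper name split_harper is hypothetical, and the sample string is a shortened version of the text from the question.

```r
# Hypothetical helper: split "PRESENTER [TIME]: TEXT" into a named list.
# Assumes the presenter is a single token without spaces and that the
# time is the first [...] block, immediately followed by a colon.
split_harper <- function(text) {
  # Group 1: leading run of non-space characters (the presenter)
  # Group 2: everything inside the square brackets (the time)
  # Group 3: everything after the colon and any spaces (the text)
  pattern <- "^(\\S+)\\s*\\[([^\\]]*)\\]:\\s*(.*)$"
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) {
    stop("text does not match the assumed 'PRESENTER [TIME]: TEXT' layout")
  }
  # m[1] is the full match; m[2], m[3], m[4] are the capture groups
  list(
    harper_presenter = m[2],
    harper_time      = m[3],
    harper_text      = m[4]
  )
}

# Shortened sample of the text from the question
text <- "HARPER'S [Day 1, 9:00 A.M.]: What is the hacker ethic, and does it survive?"
result <- split_harper(text)
str(result)
```

Because the groups capture only what sits between the delimiters, the brackets and the leading colon are dropped, and the pattern does not depend on the exact words in the text.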
https://stackoverflow.com/questions/55527043