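I need to convert all of the arguments in a character vector of text into a format that is easy to reference: a list with three columns (presenter, time and text).
For example, the presenter should be
HARPER'S
the time should be
[Day 1, 9:00 A.M.]
and the text should be the rest of the argument.
I need to count the number of arguments in the text (each passage that starts with a header such as
HARPER'S [Day 1, 9:00 A.M.]
is one argument). I want to create a new list object named "arguments", where each element of the list is a sub-list containing three elements ("presenter", "time" and "text").
Then I want to extract the presenter's name and the time as two character vectors (also removing the indentation) and keep the 'presenter' element and the 'time' element in the sub-list for that argument.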
This is the text:
[1] "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was"
[2] "used to describe the work of brilliant students who explored and expanded the"
[3] "uses to which this new technology might be employed. There was even talk of a"
[4] "\"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark"
[5] "connotations, suggestion the actions of a criminal. What is the hacker ethic,"
[6] "and does it survive?"
[7] ""
[8] "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It"
[9] "survives in anyone excited by technology's power to turn many small,"
[10] "insignificant things into one vast, beautiful thing. It is a fraud because"
[11] "there is nothing magical about computers that causes a user to undergo"
[12] "religious conversion and devote himself to the public good. Early automobile"
[13] "inventors were hackers too. At first the elite drove in luxury. Later"
[14] "practically everyone had a car. Now we have traffic jams, drunk drivers, air"
[15] "pollution, and suburban sprawl. The old magic of an automobile occasionally"
[16] "surfaces, but we possess no delusions that it automatically invades the"
[17] "consciousness of anyone who sits behind the wheel. Computers are power, and"
[18] "direct contact with power can bring out the best or worst in a person. It's"
[19] "tempting to think that everyone exposed to the technology will be grandly"
[20] "inspired, but, alas, it just ain't so."
[21] ""
[22] "BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things. One is"
[23] "avoiding waste; insisting on using idle computer power -- often hacking into a"
[24] "system to do so, while taking the greatest precautions not to damage the"
[25] "system. A second goal of many hackers is the free exchange of technical"
[26] "information. These hackers feel that patent and copyright restrictions slow"
[27] "down technological advances. A third goal is the advancement of human"
[28] "knowledge for its own sake. Often this approach is unconventional. People we"
[29] "call crackers often explore systems and do mischief. The are called hackers by"
[30] "the press, which doesn't understand the issues."
[31] ""
[32] "KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on because the"
[33] "explorations of basement tinkerers were very local. Once we all became"
[34] "connected, the work of these investigations rippled through the world. today"
[35] "the hacking spirit is alive and kicking in video, satellite TV, and radio. In"
[36] "some fields they are called chippers, because the modify and peddle altered"
[37] "chips. Everything that was once said about \"phone phreaks\" can be said about"
[38] "them too."
I tried to count the number of arguments:
length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text = regmatches(paste(unlist(text_data), collapse =" ")), regexpr("(:\\s.*)", regmatches(paste(unlist(text_data), collapse =" "))))
text_data
The length of the "arguments" list should be 55.
Sample output for the first argument looks like this:
$presenter
[1] "HARPER'S"
$time
[1] "[Day 1, 9:00 A.M.]"
$text
[1] ": When the computer was young, the word hacking was used to describe the work of brilliant students who explored and expanded the uses to which this new technology might be employed. There was even talk of a \"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark connotations, suggestion the actions of a criminal. What is the hacker ethic, and does it survive?"
Thank you very much for your help.
Posted on 2019-04-06 15:27:30
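I suggest
library(stringr)
# str_match_all() returns a list holding one match matrix per input string,
# which is why the columns are read from data[[1]]
data <- str_match_all(paste(lines, collapse="\n"), "(?sm)^([A-Z]+(?:'[A-Z]+)?)\\s+(\\[[^\\]\\[]*\\]):\\s*(.*?)(?=\n{2}|\\z)")
presenterCol <- data[[1]][,2]
timeCol <- data[[1]][,3]
textCol <- data[[1]][,4]
The main point here is that the lines are joined with newlines using paste(lines, collapse="\n"), so that the regex can run on a single multiline string and capture 1) the presenter at the start of a line, 2) the date inside square brackets, and 3) the rest of the text up to an empty line or the end of the whole string.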
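To get from the three capture groups to the list of sub-lists the question asks for, something like the following could work. This is only a sketch in base R rather than stringr (so it swaps in gregexpr/regexec with perl = TRUE), and the two-entry `lines` vector is a made-up stand-in for the real data, not the full transcript.

```r
# Hypothetical two-entry sample standing in for the question's text vector
lines <- c(
  "HARPER'S [Day 1, 9:00 A.M.]: What is the hacker ethic,",
  "and does it survive?",
  "",
  "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives,",
  "and it is a fraud."
)

# Same pattern as in the answer, written for PCRE (perl = TRUE)
pattern <- "(?sm)^([A-Z]+(?:'[A-Z]+)?)\\s+(\\[[^\\]\\[]*\\]):\\s*(.*?)(?=\\n{2}|\\z)"

# Join the lines so that a blank element becomes two consecutive newlines
joined <- paste(lines, collapse = "\n")

# Step 1: pull out every whole entry (header plus body)
entries <- regmatches(joined, gregexpr(pattern, joined, perl = TRUE))[[1]]

# Step 2: re-match each entry to recover its capture groups
parts <- regmatches(entries, regexec(pattern, entries, perl = TRUE))

# Step 3: build one sub-list per argument; wrapped lines are rejoined with spaces
arguments <- lapply(parts, function(p) {
  list(
    presenter = p[2],
    time      = p[3],
    text      = gsub("\\s*\\n\\s*", " ", p[4])
  )
})

length(arguments)   # number of arguments found in the sample
arguments[[1]]
```

Note that this pattern drops the colon after the time stamp, so `text` starts at the first word of the argument; if the leading ": " from the sample output is wanted, it can be pasted back on.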
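See the regex demo.
Regex details
- (?sm) - the s modifier makes . match line break characters, and m makes ^ match the start of a line
- ^ - start of a line
- ([A-Z]+(?:'[A-Z]+)?) - Group 1: 1+ uppercase letters, then an optional sequence of ' and 1+ uppercase letters
- \\s+ - 1+ whitespace characters
- (\\[[^\\]\\[]*\\]) - Group 2: a [, then 0 or more characters other than [ and ], and then a ]
- : - a colon
- \\s* - 0+ whitespace characters
- (.*?) - Group 3: any 0+ characters, as few as possible, up to the first position where the lookahead that follows succeeds
- (?=\n{2}|\\z) - a positive lookahead that requires two newlines, or the end of the whole string, immediately to the right of the current location.
https://stackoverflow.com/questions/55549051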