
How do I use regular expressions with R to build a list from the matches found in a text file?

Stack Overflow user
Asked on 2019-04-05 14:00:22
Answers: 4 | Views: 81 | Followers: 0 | Votes: 1

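I have to convert all the arguments in a character vector of text into a format that is easy to reference, using R (sorry, I should have been clearer): a list with 3 columns (presenter, time, and text).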

For example, the presenter should be

# HARPER'S

The time should be

# [Day 1, 9:00 A.M.]

The text should be the rest of the argument.

I need to count the number of arguments in the text (each occurrence of a line starting with

# HARPER'S [Day 1, 9:00 A.M.] 

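is one argument). I want to create a new list object called "arguments", where each element of the list is a sub-list containing three elements ("presenter", "time", and "text").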

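Then I want to extract the presenter's name and the time as two character vectors (also removing the indentation), and keep the 'presenter' and 'time' elements in that argument's sub-list.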

This is the text: 
 [1] "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was"  
  [2] "used to describe the work of brilliant students who explored and expanded the"    
  [3] "uses to which this new technology might be employed.  There was even talk of a"   
  [4] "\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark"  
  [5] "connotations, suggestion the actions of a criminal.  What is the hacker ethic,"   
  [6] "and does it survive?"                                                             
  [7] ""                                                                                 
  [8] "ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It"  
  [9] "survives in anyone excited by technology's power to turn many small,"             
 [10] "insignificant things into one vast, beautiful thing.  It is a fraud because"      
 [11] "there is nothing magical about computers that causes a user to undergo"           
 [12] "religious conversion and devote himself to the public good.  Early automobile"    
 [13] "inventors were hackers too.  At first the elite drove in luxury.  Later"          
 [14] "practically everyone had a car.  Now we have traffic jams, drunk drivers, air"    
 [15] "pollution, and suburban sprawl.  The old magic of an automobile occasionally"     
 [16] "surfaces, but we possess no delusions that it automatically invades the"          
 [17] "consciousness of anyone who sits behind the wheel.  Computers are power, and"     
 [18] "direct contact with power can bring out the best or worst in a person.  It's"     
 [19] "tempting to think that everyone exposed to the technology will be grandly"        
 [20] "inspired, but, alas, it just ain't so."                                           
 [21] ""                                                                                 
 [22] "BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is"     
 [23] "avoiding waste; insisting on using idle computer power -- often hacking into a"   
 [24] "system to do so, while taking the greatest precautions not to damage the"         
 [25] "system.  A second goal of many hackers is the free exchange of  technical"        
 [26] "information.  These hackers feel that patent and copyright restrictions slow"     
 [27] "down technological advances.  A third goal is the advancement of human"           
 [28] "knowledge for its own sake.  Often this approach is unconventional.  People we"   
 [29] "call crackers often explore systems and do mischief.  The are called hackers by"  
 [30] "the press, which doesn't understand the issues."                                  
 [31] ""                                                                                 
 [32] "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the"    
 [33] "explorations of basement tinkerers were very local.  Once we all became"          
 [34] "connected, the work of these investigations rippled through the world.  today"    
 [35] "the hacking spirit is alive and kicking in video, satellite TV, and radio.  In"   
 [36] "some fields they are called chippers, because the modify and peddle altered"      
 [37] "chips.  Everything that was once said about \"phone phreaks\" can be said about"  
 [38] "them too."

I tried to count the number of arguments:

length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text =  regmatches(paste(unlist(text_data), collapse =" ")), regexpr("(:\\s.*)", regmatches(paste(unlist(text_data), collapse =" "))))
text_data

The length of the "arguments" list should be 55.

An example of the expected output is in the linked sample data output format.

Thank you very much for your help.

4 Answers

Stack Overflow user

Accepted answer

Posted on 2019-04-05 15:15:56

Here is your input:

text_data = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

Use a regex to extract your three variables:

import re
# Raw string avoids invalid-escape warnings; findall returns (presenter, time, text) tuples.
argument = re.findall(r"(?P<presenter>[A-Z']+).\[(?P<time>\w.+)\].\s+(?P<text>[\w\W]*?)(?=\n\n|\Z)", text_data)
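`re.findall` returns plain tuples, so group names such as `presenter`, `time`, and `text` do not appear in its result. The sketch below is one possible way to keep them, using `re.finditer` and `groupdict()` to get one dictionary per argument. It is not part of the original answer: the shortened `sample` text and the slightly stricter pattern are assumptions made for illustration.

```python
import re

# Shortened sample in the same layout as the transcript (assumption: blank
# lines separate arguments, and each argument starts with "NAME [time]:").
sample = (
    "HARPER'S [Day 1, 9:00 A.M.]:  What is the hacker ethic,\n"
    "and does it survive?\n"
    "\n"
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on."
)

pattern = re.compile(
    r"(?P<presenter>[A-Z']+) \[(?P<time>[^\]]+)\]:\s+"
    r"(?P<text>[\w\W]*?)(?=\n\n|\Z)"
)

# finditer yields match objects, so the group names become dictionary keys.
arguments = [m.groupdict() for m in pattern.finditer(sample)]

print(len(arguments))              # number of arguments found
print(arguments[0]["presenter"])
print(arguments[1]["time"])
```

Each element of `arguments` is then a small record with the three keys the question asks for, and `len(arguments)` gives the argument count.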

In case you want them as a dictionary:

mydict = {'presenter':[],'time':[],'text':[]}
for i in argument:
    mydict['presenter'].append(i[0])
    mydict['time'].append(i[1])
    mydict['text'].append(i[2])
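The same dictionary of lists can probably also be built without an explicit loop. The sketch below assumes `argument` is a non-empty list of 3-tuples in the shape `re.findall` returns; the two sample tuples are made up for illustration.

```python
# Hypothetical sample in the shape re.findall returns: a list of
# (presenter, time, text) tuples.
argument = [
    ("HARPER'S", "Day 1, 9:00 A.M.", "What is the hacker ethic?"),
    ("KK", "Day 1, 11:19 A.M.", "The hacker ethic went unnoticed early on."),
]

keys = ("presenter", "time", "text")

# zip(*argument) transposes rows into columns; pairing each column with its
# key gives the same dictionary of lists as an explicit append loop.
mydict = {key: list(column) for key, column in zip(keys, zip(*argument))}

print(len(argument))         # number of arguments
print(mydict["presenter"])
```

One caveat: with an empty `argument` list this produces an empty dictionary rather than three empty lists, so the loop version may be preferable when no matches are possible.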

Or if you want to save them in a CSV file:

import csv
with open("filename.csv", "w", newline="") as mycsv:  # newline="" lets csv control line endings
    writers = csv.writer(mycsv)
    header = ['presenter','time','text']
    writers.writerow(header)
    for item in argument:
        writers.writerow(item)

To load your CSV file:

import pandas as pd
df = pd.read_csv("filename.csv")
df

Output:

   presenter |  time              | text
--------------------------------------------------------------------------------------
0   HARPER'S |  Day 1, 9:00 A.M.  | When the computer was young, the word hacking ...
1   ADELAIDE |  Day 1, 9:25 A.M.  | the hacker ethic survives, and it is a fraud. ...
2   BRAND    |  Day 1, 9:54 A.M.  | The hacker ethic involves several things. One...
3   KK       |  Day 1, 11:19 A.M. | The hacker ethic went unnoticed early on becau...
Votes: 1

Stack Overflow user

Posted on 2019-04-05 14:36:55

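Given the way you want to capture the text, this regex should do the job. It captures the presenter, time, and text as three groups; `re.findall` then finds every match and puts them in a list, where each match is a tuple holding those three pieces of information as separate elements. Check out this regex demo,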

(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)

Demo
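To make the pieces of this pattern easier to follow, here is the same regex written with `re.VERBOSE` so each part carries a comment. This is only an annotated sketch; the name `ARGUMENT_RE` and the shortened `sample` text are assumptions introduced for illustration.

```python
import re

# The same pattern as above, written with re.VERBOSE so each part can carry
# a comment. In verbose mode unescaped whitespace in the pattern is ignored.
ARGUMENT_RE = re.compile(
    r"""
    (.*?)              # group 1: presenter, as few characters as possible
    \s+                # whitespace between the name and the time stamp
    (\[[^[\]]*\])      # group 2: time stamp, brackets included
    :\s*               # colon and the spaces after it
    ([\w\W]*?)         # group 3: text, may span lines, as short as possible
    (?=\n\n|\Z)        # stop before a blank line or at the end of the input
    """,
    re.VERBOSE,
)

# Shortened sample in the same layout as the transcript.
sample = (
    "BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.\n"
    "One is avoiding waste.\n"
    "\n"
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on."
)

matches = ARGUMENT_RE.findall(sample)
print(len(matches))
print(matches[0][0], matches[1][1])
```

Because `.` does not match a newline by default, group 1 cannot swallow the blank line between arguments, which is why the presenter of the second match starts at `KK`.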

Sample Python code:

import re

s = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

argument = re.findall(r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)', s)
print(argument)

This prints a list of tuples, each containing the three items presenter, time, and text:

[("HARPER'S", '[Day 1, 9:00 A.M.]', 'When the computer was young, the word hacking was\nused to describe the work of brilliant students who explored and expanded the\nuses to which this new technology might be employed.  There was even talk of a\n"hacker ethic."  Somehow, in the succeeding years, the word has taken on dark\nconnotations, suggestion the actions of a criminal.  What is the hacker ethic,\nand does it survive? '), ('ADELAIDE', '[Day 1, 9:25 A.M.]', "the hacker ethic survives, and it is a fraud.  It\nsurvives in anyone excited by technology's power to turn many small,\ninsignificant things into one vast, beautiful thing.  It is a fraud because\nthere is nothing magical about computers that causes a user to undergo\nreligious conversion and devote himself to the public good.  Early automobile\ninventors were hackers too.  At first the elite drove in luxury.  Later\npractically everyone had a car.  Now we have traffic jams, drunk drivers, air\npollution, and suburban sprawl.  The old magic of an automobile occasionally\nsurfaces, but we possess no delusions that it automatically invades the\nconsciousness of anyone who sits behind the wheel.  Computers are power, and\ndirect contact with power can bring out the best or worst in a person.  It's\ntempting to think that everyone exposed to the technology will be grandly\ninspired, but, alas, it just ain't so."), ('BRAND', '[Day 1, 9:54 A.M.]', "The hacker ethic involves several things.  One is\navoiding waste; insisting on using idle computer power -- often hacking into a\nsystem to do so, while taking the greatest precautions not to damage the\nsystem.  A second goal of many hackers is the free exchange of  technical\ninformation.  These hackers feel that patent and copyright restrictions slow\ndown technological advances.  A third goal is the advancement of human\nknowledge for its own sake.  Often this approach is unconventional.  People we\ncall crackers often explore systems and do mischief.  The are called hackers by\nthe press, which doesn't understand the issues."), ('KK', '[Day 1, 11:19 A.M.]', 'The hacker ethic went unnoticed early on because the\nexplorations of basement tinkerers were very local.  Once we all became\nconnected, the work of these investigations rippled through the world.  today\nthe hacking spirit is alive and kicking in video, satellite TV, and radio.  In\nsome fields they are called chippers, because the modify and peddle altered\nchips.  Everything that was once said about "phone phreaks" can be said about\nthem too.')]
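The tuples printed above still carry the brackets around the time and the line breaks of the wrapped source text. If a cleaner record per argument is wanted, a small post-processing step might look like the sketch below. The sample tuples are shortened stand-ins, and squeezing every run of whitespace into one space is an assumption about what "clean" means here (it also collapses the double spaces after full stops).

```python
import re

# Hypothetical sample in the shape printed above: (presenter, time, text).
argument = [
    ("HARPER'S", "[Day 1, 9:00 A.M.]",
     "When the computer was young, the word hacking was\nused to describe the work"),
    ("KK", "[Day 1, 11:19 A.M.]",
     "Everything that was once said about \"phone phreaks\" can be said about\nthem too."),
]

arguments = [
    {
        "presenter": presenter,
        # Drop the surrounding square brackets from the time stamp.
        "time": time.strip("[]"),
        # Join wrapped lines and squeeze repeated whitespace into one space.
        "text": re.sub(r"\s+", " ", text).strip(),
    }
    for presenter, time, text in argument
]

print(arguments[0]["time"])
print(arguments[0]["text"])
```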
Votes: 1

Stack Overflow user

Posted on 2019-04-05 15:01:49

library(magrittr)
library(data.table)

text2df <- function(text) {
    idx <- c(1, which(text == ""), length(text))
    apply(matrix(c(idx[-length(idx)], idx[-1]), ncol = 2), 1, function(id1_id2) {
        presenter_text <- text[id1_id2[1]:id1_id2[2]]
        first_row <- paste(presenter_text[1:2], collapse = "") # presenter_text[1] can be ''
        presenter_name <- strsplit(first_row, split = " [", fixed = T)[[1]][1]
        presentation_time <- strsplit(first_row, split = "]: ", fixed = T)[[1]][1] %>% 
            gsub(paste0(presenter_name, " ["), "", ., fixed = T)
        presentation_text <- paste(c(
            gsub(paste0(presenter_name, " [", presentation_time, "]:"), "", first_row, fixed = T) %>% 
                stringi::stri_trim_left() # remove leading spaces
            , presenter_text[3:length(presenter_text)] %>% .[!is.na(.)] # filter NA if only one row of text
        ), collapse = "")
        data.table(presenter = presenter_name, time = presentation_time, text = presentation_text)
    }) %>% rbindlist
}
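For readers following the Python answers, the line-based idea used in the R function above (split the vector at the empty strings, then take the header of each block apart with fixed-string splits) can be sketched in Python as well. This is a loose analogue, not a translation of the R code: the function name `lines_to_arguments` and the shortened `lines` sample are hypothetical, and the lines of a block are joined with a single space here.

```python
from itertools import groupby

# Shortened stand-in for the character vector in the question: one string per
# line, with "" separating arguments.
lines = [
    "HARPER'S [Day 1, 9:00 A.M.]:  What is the hacker ethic,",
    "and does it survive?",
    "",
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the",
    "explorations of basement tinkerers were very local.",
]


def lines_to_arguments(lines):
    """Group lines into blocks at blank lines and split each block's header."""
    arguments = []
    for is_blank, block in groupby(lines, key=lambda line: line == ""):
        if is_blank:
            continue  # skip the separator lines themselves
        joined = " ".join(block)
        # Header layout assumed: 'NAME [time]:  text'
        presenter, rest = joined.split(" [", 1)
        time, text = rest.split("]:", 1)
        arguments.append(
            {"presenter": presenter, "time": time, "text": text.strip()}
        )
    return arguments


arguments = lines_to_arguments(lines)
print(len(arguments))
print(arguments[1]["presenter"], arguments[1]["time"])
```

This version relies on every block starting with a well-formed header; a block without `" ["` or `"]:"` would raise a `ValueError` when the split result is unpacked.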
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/55537094
