
Extracting specific lines of text in R

Stack Overflow user
Asked 2021-01-06 16:11:47
4 answers, 772 views, 0 followers, 1 vote

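I have a .txt file with thousands of lines. The file contains meta-information about research articles. Each paper has a publication year (PY), title (TI), DOI (DI), publication type (PT), and abstract (AB). The file holds this information for nearly 300 papers. The entries for the first two articles look like this: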

PT J
AU Filieri, Raffaele
   Acikgoz, Fulya
   Ndou, Valentina
   Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
   review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
   of TripAdvisor, the leading user-generated content (UGC) platform in the
   tourism sector. Hence, it is relevant to study the factors that
   influence travelers' continued use of TripAdvisor.
   Design/methodology/approach - The authors have integrated constructs
   from the technology acceptance model, information systems (IS)
   continuance model and electronic word of mouth literature. They used
   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
   users of TripAdvisor recruited through Prolific.
   Findings - Findings reveal that perceived ease of use, online consumer
   review (OCR) credibility and OCR usefulness have a positive impact on
   customer satisfaction, which ultimately leads to continuance intention
   of UGC platforms. Customer satisfaction mediates the effect of the
   independent variables on continuance intention.
   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
   can benefit from the findings of this study. Specifically, they should
   improve the ease of use of their platforms by facilitating travelers'
   information searches. Moreover, they should use signals to make credible
   and helpful content stand out from the crowd of reviews.
   Originality/value - This is the first study that adopts the IS
   continuance model in the travel and tourism literature to research the
   factors influencing consumers' continued use of travel-based UGC
   platforms. Moreover, the authors have extended this model by including
   new constructs that are particularly relevant to UGC platforms, such as
   performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER

PT J
AU Li, Yelin
   Bu, Hui
   Li, Jiahong
   Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
   prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
   long-standing interest for economists. We conduct a comprehensive study
   of the predictability of investor sentiment, which is measured directly
   by extracting expectations from online user-generated content (UGC) on
   the stock message board of Eastmoney.com in the Chinese stock market. We
   consider the influential factors in prediction, including the selections
   of different text classification algorithms, price forecasting models,
   time horizons, and information update schemes. Using comparisons of the
   long short-term memory (LSTM) model, logistic regression, support vector
   machine, and Naive Bayes model, the results show that daily investor
   sentiment contains predictive information only for open prices, while
   the hourly sentiment has two hours of leading predictability for closing
   prices. Investors do update their expectations during trading hours.
   Moreover, our results reveal that advanced models, such as LSTM, can
   provide more predictive power with investor sentiment only if the inputs
   of a model contain predictive information. (C) 2020 International
   Institute of Forecasters. Published by Elsevier B.V. All rights
   reserved.
CT 14th International Conference on Services Systems and Services
   Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
   R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER

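Now I want to extract the abstract of each article and store the abstracts in a data frame. To extract an abstract I have the code below, which returns only the first matching abstract.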

library(NLP)  # provides as.String(), used below to collapse the lines into one string
f <- readLines("sample.txt")
# extract the first match only
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n   of TripAdvisor, the leading user-generated content (UGC) platform in the\n   tourism sector. Hence, it is relevant to study the factors that\n   influence travelers' continued use of TripAdvisor.\n   Design/methodology/approach - The authors have integrated constructs\n   from the technology acceptance model, information systems (IS)\n   continuance model and electronic word of mouth literature. They used\n   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n   users of TripAdvisor recruited through Prolific.\n   Findings - Findings reveal that perceived ease of use, online consumer\n   review (OCR) credibility and OCR usefulness have a positive impact on\n   customer satisfaction, which ultimately leads to continuance intention\n   of UGC platforms. Customer satisfaction mediates the effect of the\n   independent variables on continuance intention.\n   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n   can benefit from the findings of this study. Specifically, they should\n   improve the ease of use of their platforms by facilitating travelers'\n   information searches. Moreover, they should use signals to make credible\n   and helpful content stand out from the crowd of reviews.\n   Originality/value - This is the first study that adopts the IS\n   continuance model in the travel and tourism literature to research the\n   factors influencing consumers' continued use of travel-based UGC\n   platforms. Moreover, the authors have extended this model by including\n   new constructs that are particularly relevant to UGC platforms, such as\n   performance heuristics and OCR credibility."

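The problem is that I want to extract all of the abstracts, but the field that follows the abstract differs from record to record (ZR in the first article, CT in the second), so this pattern does not work for most of them. The rule that holds for every abstract is: take the text starting at the AB line, plus every following line that begins with whitespace. Can anyone help me with this?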

4 Answers

Stack Overflow user

Posted 2021-01-06 16:45:44

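You can group the lines first: every time a line does not start with a space character, the group counter goes up by one.

You can then aggregate f by group and select the abstracts from the aggregated vector: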

group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]

f2[grepl("^AB ", f2)]
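
A minimal sketch of how the grouped result could be turned into the data frame the question asks for. The sample lines in `f` and the column name `abstract` are made up for illustration; with the real file, `f` would come from `readLines()`.

```r
# Hypothetical sample standing in for readLines("sample.txt")
f <- c("PT J", "TI Some title", "AB First abstract line", "   second line",
       "ZR 0", "ER", "", "PT J", "AB Another abstract", "CT conf", "ER")

# Same grouping as above: a new group starts at every line without a leading space
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]

# Keep the AB fields, drop the tag, and collapse runs of whitespace
ab <- f2[grepl("^AB ", f2)]
abstracts <- data.frame(
  abstract = gsub("\\s+", " ", sub("^AB ", "", ab)),
  stringsAsFactors = FALSE
)
```

Collapsing whitespace is optional; it only removes the indentation that the continuation lines carry into the pasted string.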
Votes: 2

Stack Overflow user

Posted 2021-01-06 16:44:56

Try this regular expression:

^AB (?:(?!^[A-Z]{2} )([\s\S]))*

PCRE demo (requires perl=TRUE in R)

If you want to drop the prefix from the match, add \K after ^AB so that the pattern starts with ^AB \K
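
A sketch of how this expression might be applied in R, under a few assumptions that are not part of the answer: the lines are first joined into one string, the `(?m)` flag is added so that `^` matches at the start of every line, and the `AB ` prefix is stripped afterwards with `sub()` instead of `\K`. The sample lines are invented for illustration.

```r
# Hypothetical sample standing in for readLines("sample.txt")
f <- c("PT J", "TI Some title", "AB First abstract line", "   second line",
       "ZR 0", "ER", "", "PT J", "AB Another abstract", "CT conf", "ER")

# Join the lines so a single match can span several of them
txt <- paste(f, collapse = "\n")

# (?m) makes ^ match at each line start; the match stops before the next
# line that begins with a two-letter tag followed by a space
pattern <- "(?m)^AB (?:(?!^[A-Z]{2} )[\\s\\S])*"
matches <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Strip the tag, trim the trailing newline, collapse line breaks and indentation
abstracts <- gsub("\\s+", " ", trimws(sub("^AB ", "", matches)))
```

Note that `[A-Z]{2}` only recognises tags made of two letters. If a tag containing a digit (such as Z8 or Z9) can directly follow an abstract in the real file, the class would need to be widened, for example to `[A-Z][A-Z0-9]`.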

Votes: 1

Stack Overflow user

Posted 2021-01-06 17:09:28

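A completely different approach. If the text file has the layout shown, you can also read everything in with readr::read_fwf. That way you get all of the information for each article. You can then use tidyr::fill to fill in the missing meta-information tags.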

library(dplyr)
library(readr)
articles <- read_fwf("tests/SO text.txt", fwf_empty("tests/SO text.txt", col_names = c("mi", "text")))

articles <- articles %>% 
  filter(!(is.na(mi) & is.na(text))) # removes empty lines between articles.

articles

# A tibble: 98 x 2
   mi    text                                                                  
   <chr> <chr>                                                                 
 1 PT    J                                                                     
 2 AU    Filieri, Raffaele                                                     
 3 NA    Acikgoz, Fulya                                                        
 4 NA    Ndou, Valentina                                                       
 5 NA    Dwivedi, Yogesh                                                       
 6 TI    Is TripAdvisor still relevant? The influence of review credibility,   
 7 NA    review usefulness, and ease of use on consumers' continuance intention
 8 SO    INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT          
 9 DI    10.1108/IJCHM-05-2020-0402                                            
10 EA    NOV 2020                                                              
# ... with 88 more rows
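
The `tidyr::fill` step is mentioned but not shown. A minimal sketch of what it could look like, assuming a two-column table shaped like `articles` above; the small tibble here and the helper column `article` are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the table returned by read_fwf()
articles <- tibble(
  mi   = c("PT", "TI", "AB", NA, "ZR", "ER", "PT", "AB", "CT", "ER"),
  text = c("J", "Some title", "First abstract line", "second line", "0", NA,
           "J", "Another abstract", "conf", NA)
)

abstracts <- articles %>%
  mutate(article = cumsum(mi %in% "PT")) %>%  # each PT row starts a new article
  fill(mi) %>%                                # carry the tag down to continuation rows
  filter(mi == "AB") %>%
  group_by(article) %>%
  summarise(abstract = paste(text, collapse = " "))
```

The result has one row per article, with the continuation lines of each abstract pasted into a single string.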
Votes: 1
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/65599307
