首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何在R中拆分字符串并将字符串放置到基于特定列的类型中?

如何在R中拆分字符串并将字符串放置到基于特定列的类型中?
EN

Stack Overflow用户
提问于 2019-05-23 23:45:23
回答 5查看 474关注 0票数 1

我有一个编程挑战,我无法克服,请帮助!我将历史时间序列、速率和日期作为字符串存储在一个单元格中,用于每个安全性(10K+安全性)。如何拆分字符串并将信息存储在各自的列中?我通常在R中工作,但是如果更容易的话,我很乐意用Python来尝试它!

以下是我的数据,“安全性”和“系列”是我的列名:

代码语言:javascript
复制
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| Security |              Series                                         | 
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| 567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41"               |
| 165456C | "1/05/2018 2.45 1/28/2018 2.46"                              |
| 123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47"|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+

我想让它看起来像这样:

代码语言:javascript
复制
+----+-------+-------+---------+------+
| Security   | date          |  rate  | 
+----+-------+-------+---------+------+
|  567895B   |   3/15/2019   |  2.51  |
|  567895B   |   3/17/2019   |  2.30  |
|  567895B   |   4/08/2019   |  2.41  |
|  165456C   |   1/05/2018   |  2.45  |      
|  165456C   |   1/28/2018   |  2.46  |
|  123456A   |   1/05/2016   |  2.45  |
|  123456A   |   2/05/2016   |  2.46  |   
|  123456A   |   3/05/2016   |  2.45  |   
|  123456A   |   5/05/2016   |  2.47  |   
+----+-------+-------+---------+------+
EN

回答 5

Stack Overflow用户

回答已采纳

发布于 2019-05-24 00:13:46

在R中,您可以使用tidyverse进行此操作。我们首先将Series中的每个值划分为不同的列。为此,我们在Series中计算空白空间的数量,并选择最大值,以便知道将添加多少列。然后使用gather将其转换为长格式,为Daterate字段创建组标识符,并使用spread将其转换为宽格式。

代码语言:javascript
复制
library(tidyverse)
n <- max(str_count(df$Series, "\\s+")) + 1

df %>% 
  separate(Series, into = paste0("col", 1:n), sep = "\\s+", fill = "right") %>%
  gather(key, value, -Security, na.rm = TRUE) %>%
  mutate(key = ceiling(as.integer(sub("col", "", key))/2)) %>%
  group_by(Security, key) %>%
  mutate(row = row_number()) %>%
  spread(row, value) %>%
  ungroup() %>%
  select(-key) %>%
  rename_at(2:3, ~c("Date", "rate"))

# A tibble: 9 x 3
#  Security Date      rate
#  <chr>    <chr>     <chr>
#1 123456A  1/05/2016 2.45 
#2 123456A  2/05/2016 2.46 
#3 123456A  3/05/2016 2.45 
#4 123456A  5/05/2016 2.47 
#5 165456C  1/05/2018 2.45 
#6 165456C  1/28/2018 2.46 
#7 567895B  3/15/2019 2.51 
#8 567895B  3/17/2019 2.30 
#9 567895B  4/08/2019 2.41 

数据

代码语言:javascript
复制
df <- structure(list(Security = c("567895B", "165456C", "123456A"), 
Series = c("3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41", 
"1/05/2018 2.45 1/28/2018 2.46", "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 
2.45 5/05/2016 2.47"
)), row.names = c(NA, -3L), class = "data.frame")
票数 2
EN

Stack Overflow用户

发布于 2019-05-24 01:12:35

代码语言:javascript
复制
 library(tidyverse)
 df%>%
   mutate(Series=strsplit(Series," (?=\\d+/)",perl = T))%>%
   unnest()%>%
   separate(Series,c('Date','Rate'),' ',convert = T)

  Security      Date Rate
1  567895B 3/15/2019 2.51
2  567895B 3/17/2019 2.30
3  567895B 4/08/2019 2.41
4  165456C 1/05/2018 2.45
5  165456C 1/28/2018 2.46
6  123456A 1/05/2016 2.45
7  123456A 2/05/2016 2.46
8  123456A 3/05/2016 2.45
9  123456A 5/05/2016 2.47

要使用基数R,您可以:

代码语言:javascript
复制
m = gregexpr("(^|\\s)(?=\\d+/)",df$Series,perl = T)
read.table(text=`regmatches<-`(df$Series, m ,val=paste("\n",df$Security,' ')))
       V1        V2   V3
1 567895B 3/15/2019 2.51
2 567895B 3/17/2019 2.30
3 567895B 4/08/2019 2.41
4 165456C 1/05/2018 2.45
5 165456C 1/28/2018 2.46
6 123456A 1/05/2016 2.45
7 123456A 2/05/2016 2.46
8 123456A 3/05/2016 2.45
9 123456A 5/05/2016 2.47
票数 2
EN

Stack Overflow用户

发布于 2019-05-24 00:57:21

使用尾注中可重复显示的数据框架secs,在string的每个日期之前插入换行符和安全名称。替换字符向量定义为repl。它被插入到Series中,返回修改后的Series的字符向量ch。然后在使用ch时读取read.table,最后将日期更改为Date类。不使用包装。

代码语言:javascript
复制
repl <- sprintf("\n%s \\1", secs$Security)
ch <- mapply(gsub, "(\\d+/\\d+/\\d+)", repl, secs$Series)
DF <- read.table(text = ch, col.names = c("Security", "Date", "Value"), as.is = TRUE)
DF$Date <- as.Date(DF$Date, format = "%m/%d/%Y")

给予:

代码语言:javascript
复制
> DF
  Security       Date Value
1  567895B 2019-03-15  2.51
2  567895B 2019-03-17  2.30
3  567895B 2019-04-08  2.41
4  165456C 2018-01-05  2.45
5  165456C 2018-01-28  2.46
6  123456A 2016-01-05  2.45
7  123456A 2016-02-05  2.46
8  123456A 2016-03-05  2.45
9  123456A 2016-05-05  2.47

相同的代码可以交替编写为以下管道:

代码语言:javascript
复制
library(dplyr)

secs %>%
  rowwise() %>%
  mutate(ch = gsub("(\\d+/\\d+/\\d+)", sprintf("\n%s \\1", Security), Series)) %>%
  ungroup %>%
  { read.table(text = .$ch, col.names = c("Security", "Date", "Value"), as.is = TRUE) } %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

备注

代码语言:javascript
复制
Lines <- '
 Security |  Series
 567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41"               
 165456C | "1/05/2018 2.45 1/28/2018 2.46"                              
 123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47" '
secs <- read.table(text = Lines, header = TRUE, sep = "|", na.strings = "+", 
  as.is = TRUE, strip.white = TRUE)
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/56284185

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档