我有一个编程挑战,我无法克服,请帮助!我将历史时间序列、速率和日期作为字符串存储在一个单元格中,用于每个安全性(10K+安全性)。如何拆分字符串并将信息存储在各自的列中?我通常在R中工作,但是如果更容易的话,我很乐意用Python来尝试它!
以下是我的数据,“安全性”和“系列”是我的列名:
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| Security | Series |
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| 567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41" |
| 165456C | "1/05/2018 2.45 1/28/2018 2.46" |
| 123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47"|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+我想让它看起来像这样:
+----+-------+-------+---------+------+
| Security | date | rate |
+----+-------+-------+---------+------+
| 567895B | 3/15/2019 | 2.51 |
| 567895B | 3/17/2019 | 2.30 |
| 567895B | 4/08/2019 | 2.41 |
| 165456C | 1/05/2018 | 2.45 |
| 165456C | 1/28/2018 | 2.46 |
| 123456A | 1/05/2016 | 2.45 |
| 123456A | 2/05/2016 | 2.46 |
| 123456A | 3/05/2016 | 2.45 |
| 123456A | 5/05/2016 | 2.47 |
+----+-------+-------+---------+------+发布于 2019-05-24 00:13:46
在R中,您可以使用tidyverse进行此操作。我们首先将Series中的每个值划分为不同的列。为此,我们在Series中计算空白空间的数量,并选择最大值,以便知道将添加多少列。然后使用gather将其转换为长格式,为Date和rate字段创建组标识符,并使用spread将其转换为宽格式。
library(tidyverse)
n <- max(str_count(df$Series, "\\s+")) + 1
df %>%
separate(Series, into = paste0("col", 1:n), sep = "\\s+", fill = "right") %>%
gather(key, value, -Security, na.rm = TRUE) %>%
mutate(key = ceiling(as.integer(sub("col", "", key))/2)) %>%
group_by(Security, key) %>%
mutate(row = row_number()) %>%
spread(row, value) %>%
ungroup() %>%
select(-key) %>%
rename_at(2:3, ~c("Date", "rate"))
# A tibble: 9 x 3
# Security Date rate
# <chr> <chr> <chr>
#1 123456A 1/05/2016 2.45
#2 123456A 2/05/2016 2.46
#3 123456A 3/05/2016 2.45
#4 123456A 5/05/2016 2.47
#5 165456C 1/05/2018 2.45
#6 165456C 1/28/2018 2.46
#7 567895B 3/15/2019 2.51
#8 567895B 3/17/2019 2.30
#9 567895B 4/08/2019 2.41 数据
df <- structure(list(Security = c("567895B", "165456C", "123456A"),
Series = c("3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41",
"1/05/2018 2.45 1/28/2018 2.46", "1/05/2016 2.45 2/05/2016 2.46 3/05/2016
2.45 5/05/2016 2.47"
)), row.names = c(NA, -3L), class = "data.frame")发布于 2019-05-24 01:12:35
library(tidyverse)
df%>%
mutate(Series=strsplit(Series," (?=\\d+/)",perl = T))%>%
unnest()%>%
separate(Series,c('Date','Rate'),' ',convert = T)
Security Date Rate
1 567895B 3/15/2019 2.51
2 567895B 3/17/2019 2.30
3 567895B 4/08/2019 2.41
4 165456C 1/05/2018 2.45
5 165456C 1/28/2018 2.46
6 123456A 1/05/2016 2.45
7 123456A 2/05/2016 2.46
8 123456A 3/05/2016 2.45
9 123456A 5/05/2016 2.47要使用基数R,您可以:
m = gregexpr("(^|\\s)(?=\\d+/)",df$Series,perl = T)
read.table(text=`regmatches<-`(df$Series, m ,val=paste("\n",df$Security,' ')))
V1 V2 V3
1 567895B 3/15/2019 2.51
2 567895B 3/17/2019 2.30
3 567895B 4/08/2019 2.41
4 165456C 1/05/2018 2.45
5 165456C 1/28/2018 2.46
6 123456A 1/05/2016 2.45
7 123456A 2/05/2016 2.46
8 123456A 3/05/2016 2.45
9 123456A 5/05/2016 2.47发布于 2019-05-24 00:57:21
使用尾注中可重复显示的数据框架secs,在string的每个日期之前插入换行符和安全名称。替换字符向量定义为repl。它被插入到Series中,返回修改后的Series的字符向量ch。然后在使用ch时读取read.table,最后将日期更改为Date类。不使用包装。
repl <- sprintf("\n%s \\1", secs$Security)
ch <- mapply(gsub, "(\\d+/\\d+/\\d+)", repl, secs$Series)
DF <- read.table(text = ch, col.names = c("Security", "Date", "Value"), as.is = TRUE)
DF$Date <- as.Date(DF$Date, format = "%m/%d/%Y")给予:
> DF
Security Date Value
1 567895B 2019-03-15 2.51
2 567895B 2019-03-17 2.30
3 567895B 2019-04-08 2.41
4 165456C 2018-01-05 2.45
5 165456C 2018-01-28 2.46
6 123456A 2016-01-05 2.45
7 123456A 2016-02-05 2.46
8 123456A 2016-03-05 2.45
9 123456A 2016-05-05 2.47相同的代码可以交替编写为以下管道:
library(dplyr)
secs %>%
rowwise() %>%
mutate(ch = gsub("(\\d+/\\d+/\\d+)", sprintf("\n%s \\1", Security), Series)) %>%
ungroup %>%
{ read.table(text = .$ch, col.names = c("Security", "Date", "Value"), as.is = TRUE) } %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))备注
Lines <- '
Security | Series
567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41"
165456C | "1/05/2018 2.45 1/28/2018 2.46"
123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47" '
secs <- read.table(text = Lines, header = TRUE, sep = "|", na.strings = "+",
as.is = TRUE, strip.white = TRUE)https://stackoverflow.com/questions/56284185
复制相似问题