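I have time-course data from a conversation between three people. Besides other things such as pitch and intensity, I have the time information and who is speaking at that moment: person L, person R, person B, or nobody ("0"). Here is a short example, where t is the time in seconds and s is the speaker information: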
> t = 1:10
> s = c("L", "0", "L", "0", "R", "B", "R", "0", "0", "L")
> data.frame(t,s)
t s
1 1 L
2 2 0
3 3 L
4 4 0
5 5 R
6 6 B
7 7 R
8 8 0
9 9 0
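10 10 L
I would like to add turn-taking information to the data. A turn is one person speaking, including pauses, until someone else starts speaking. For the example above, the goal is: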
t s goal
1 1 L L1
2 2 0 L1
3 3 L L1
4 4 0 L1
5 5 R R1
6 6 B B1
7 7 R R2
8 8 0 R2
9 9 0 R2
10 10 L L2
I know how to do this with a for loop, but my data has 600000 rows, so that would be very slow. Does anyone know how this could be done?
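For reference, the turn definition above can be pinned down with an explicit loop. This is only a sketch of the baseline the question alludes to, not the asker's actual code; the helper name `label_turns_loop` and its `pause` argument are hypothetical, and leading pauses (before anyone has spoken) are assumed to stay NA.

```r
# Loop-based reference for the turn definition (hypothetical helper).
label_turns_loop <- function(s, pause = "0") {
  goal    <- character(length(s))  # preallocated result
  counts  <- integer(0)            # named counter: turns seen per speaker
  current <- NA_character_         # speaker holding the current turn
  for (i in seq_along(s)) {
    if (s[i] != pause && !identical(s[i], current)) {
      # a different person starts speaking: open a new turn for them
      current <- s[i]
      counts[current] <- if (is.na(counts[current])) 1L else counts[current] + 1L
    }
    # pauses inherit the current turn; assumption: leading pauses stay NA
    goal[i] <- if (is.na(current)) NA_character_ else paste0(current, counts[current])
  }
  goal
}

s <- c("L", "0", "L", "0", "R", "B", "R", "0", "0", "L")
label_turns_loop(s)
# [1] "L1" "L1" "L1" "L1" "R1" "B1" "R2" "R2" "R2" "L2"
```

A vectorized answer can be checked against this loop on a small sample before running it on the full data.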
Answered 2022-08-31 10:19:55
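There are quite a few functions involved, but it is fairly simple:
na_if replaces "0" with NA, and row_number records each row's position.
fill replaces NAs with the nearest non-NA value above.
group_by groups the rows by the filled speaker.
cumsum over the gaps between consecutive row numbers creates the turn groups (this is the main step!).
paste0 combines snew and gp into the desired values.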
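The grouping step (`cumsum(c(TRUE, diff(rown) > 1))` in this answer's code) may be easier to follow in isolation. The sketch below uses base R only and takes the row numbers that belong to speaker L after filling in the example (rows 1 to 4 and row 10); the variable `new_turn` is illustrative and does not appear in the answer.

```r
# Row numbers where the filled speaker column equals "L" in the example
rown <- c(1, 2, 3, 4, 10)

# diff() gives the gap to the previous row of the same speaker;
# a gap > 1 means someone else spoke in between, so a new turn starts.
new_turn <- c(TRUE, diff(rown) > 1)  # the first row always opens a turn

# The running count of turn starts is the per-speaker turn number
gp <- cumsum(new_turn)
gp
# [1] 1 1 1 1 2
```

Because the computation runs inside each speaker group, the counter restarts at 1 for every speaker, which is what yields L1, L2, R1, R2 and B1.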
library(tidyverse)
data.frame(t, s) %>%
mutate(snew = na_if(s, "0"),
rown = row_number()) %>%
fill(snew) %>%
group_by(snew) %>%
mutate(gp = cumsum(c(TRUE, diff(rown) > 1)), .keep = "unused") %>%
ungroup() %>%
mutate(goal = paste0(snew, gp), .keep = "unused")
t s goal
1 1 L L1
2 2 0 L1
3 3 L L1
4 4 0 L1
5 5 R R1
6 6 B B1
7 7 R R2
8 8 0 R2
9 9 0 R2
10 10 L L2
Answered 2022-08-31 10:49:36
The key field you are interested in:
s <- c("L", "0", "L", "0", "R", "B", "R", "0", "0", "L")
A base R solution:
## fill "0" using vectorized "last observation carried forward"
zero <- which(s == "0")
logi <- c(TRUE, diff(zero) > 1)
s[zero] <- rep(s[zero[logi] - 1], tabulate(cumsum(logi)))
## generate numeric ID
ID <- with(rle(s), rep(ave(values, values, FUN = seq_along), lengths))
## final `paste0`
paste0(s, ID)
#[1] "L1" "L1" "L1" "L1" "R1" "B1" "R2" "R2" "R2" "L2"
Answered 2022-08-31 10:59:25
Using data.table
library(data.table)
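## LOCF for character vectors: nafill() fills the positions, which then index x
chnalocf = \(x) x[nafill(replace(seq_along(x), is.na(x), NA), "locf")]
df <- data.frame(t, s)  # assumes t and s as defined in the question
setDT(df)
df[, s2 := chnalocf(replace(s, s == "0", NA))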
][, tmp := rleid(s2)
][, goal := paste0(s2, rleid(tmp)), by = s2
][, !c("s2", "tmp")]
# t s goal
# <int> <char> <char>
# 1: 1 L L1
# 2: 2 0 L1
# 3: 3 L L1
# 4: 4 0 L1
# 5: 5 R R1
# 6: 6 B B1
# 7: 7 R R2
# 8: 8 0 R2
# 9: 9 0 R2
# 10: 10 L L2
https://stackoverflow.com/questions/73554280