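I am working with a large time series in which one column holds four different sensors and another column holds the measured values. I need to assign an id to measurements that belong to the same point in time. The problem is that each device measures at a slightly different time, so I cannot simply group them by timestamp. In the data frame, sorted by time, the measurements that belong together can be identified as a sequence of unique device ids. The difficulty is that sometimes 4 devices record a value and sometimes only 3. My data looks like this: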
timestamp device measurement
1 2019-08-27 07:29:20.671313 sdr_03 49.868820
2 2019-08-27 07:29:20.932043 sdr_02 54.160831
3 2019-08-27 07:29:21.839312 sdr_03 48.974476
4 2019-08-27 07:29:21.850454 sdr_02 50.808674
5 2019-08-27 08:57:01.990833 sdr_03 50.533058
6 2019-08-27 08:57:02.022798 sdr_04 51.143322
7 2019-08-27 09:16:56.454308 sdr_02 57.447151
8 2019-08-27 09:16:56.482433 sdr_04 50.012745
9 2019-08-27 09:16:56.761776 sdr_01 71.500305
10 2019-08-27 09:16:57.305510 sdr_02 56.851177
11 2019-08-27 09:16:57.333628 sdr_04 60.390141
12 2019-08-27 09:16:57.612972 sdr_01 73.470345
You can reproduce it with the following code:
my_data<-data.frame(timestamp = c("2019-08-27 07:29:20.671313","2019-08-27 07:29:20.932043","2019-08-27 07:29:21.839312",
"2019-08-27 07:29:21.850454", "2019-08-27 08:57:01.990833","2019-08-27 08:57:02.022798",
"2019-08-27 09:16:56.454308", "2019-08-27 09:16:56.482433", "2019-08-27 09:16:56.761776",
"2019-08-27 09:16:57.305510" ,"2019-08-27 09:16:57.333628", "2019-08-27 09:16:57.612972"),
device=c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03" ,"sdr_04", "sdr_02", "sdr_04" ,"sdr_01", "sdr_02" ,"sdr_04",
"sdr_01"),
measurement=c(49.868820, 54.160831, 48.974476, 50.808674, 50.533058, 51.143322,57.447151,50.012745, 71.500305,56.851177,
60.390141, 73.470345)
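)
I need to assign the same value to consecutive rows for as long as no element of the column device from the preceding rows of that run appears again: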
timestamp device measurement match_id
1 2019-08-27 07:29:20.671313 sdr_03 49.868820 1
2 2019-08-27 07:29:20.932043 sdr_02 54.160831 1
3 2019-08-27 07:29:21.839312 sdr_03 48.974476 2
4 2019-08-27 07:29:21.850454 sdr_02 50.808674 2
5 2019-08-27 08:57:01.990833 sdr_03 50.533058 3
6 2019-08-27 08:57:02.022798 sdr_04 51.143322 3
7 2019-08-27 09:16:56.454308 sdr_02 57.447151 3
8 2019-08-27 09:16:56.482433 sdr_04 50.012745 4
9 2019-08-27 09:16:56.761776 sdr_01 71.500305 4
10 2019-08-27 09:16:57.305510 sdr_02 56.851177 4
11 2019-08-27 09:16:57.333628 sdr_04 60.390141 5
12 2019-08-27 09:16:57.612972 sdr_01 73.470345 5
You can obtain this from:
my_data<-data.frame(timestamp = c("2019-08-27 07:29:20.671313","2019-08-27 07:29:20.932043","2019-08-27 07:29:21.839312",
"2019-08-27 07:29:21.850454", "2019-08-27 08:57:01.990833","2019-08-27 08:57:02.022798",
"2019-08-27 09:16:56.454308", "2019-08-27 09:16:56.482433", "2019-08-27 09:16:56.761776",
"2019-08-27 09:16:57.305510" ,"2019-08-27 09:16:57.333628", "2019-08-27 09:16:57.612972"),
device=c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03" ,"sdr_04", "sdr_02", "sdr_04" ,"sdr_01", "sdr_02" ,"sdr_04",
"sdr_01"),
measurement=c(49.868820, 54.160831, 48.974476, 50.808674, 50.533058, 51.143322,57.447151,50.012745, 71.500305,56.851177,
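60.390141, 73.470345),match_id=c(1,1,2,2,3,3,3,4,4,4,5,5) )
I have been looking for an answer for three days now. Any help is much appreciated.
Allan Cameron's dplyr solution produces match ids that reappear later in the data frame; see rows 1, 2, 6 and 9. Fewer than 4 devices may record at a given time, so a solution that always expects the same number of recording devices per measurement will not work.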
# A tibble: 12 x 4
# Groups: device [4]
timestamp device measurement new_id
<dttm> <fct> <dbl> <int>
1 2019-08-27 07:29:20.671313 sdr_03 49.9 1
2 2019-08-27 07:29:20.932043 sdr_02 54.2 1
3 2019-08-27 07:29:21.839312 sdr_03 49.0 2
4 2019-08-27 07:29:21.850454 sdr_02 50.8 2
5 2019-08-27 08:57:01.990833 sdr_03 50.5 3
6 2019-08-27 08:57:02.022798 sdr_04 51.1 1
7 2019-08-27 09:16:56.454308 sdr_02 57.4 3
8 2019-08-27 09:16:56.482433 sdr_04 50.0 2
9 2019-08-27 09:16:56.761775 sdr_01 71.5 1
10 2019-08-27 09:16:57.305510 sdr_02 56.9 4
11 2019-08-27 09:16:57.333627 sdr_04 60.4 3
12 2019-08-27 09:16:57.612972 sdr_01 73.5 2
Sotos's solution, on the other hand, produces longer runs of the same match id than there are unique devices, for example in rows 5-9:
# A tibble: 12 x 4
timestamp device measurement new_id
<chr> <fct> <dbl> <int>
1 2019-08-27 07:29:20 sdr_03 49.9 1
2 2019-08-27 07:29:20 sdr_02 54.2 1
3 2019-08-27 07:29:21 sdr_03 49.0 2
4 2019-08-27 07:29:21 sdr_02 50.8 2
5 2019-08-27 08:57:01 sdr_03 50.5 3
6 2019-08-27 08:57:02 sdr_04 51.1 3
7 2019-08-27 09:16:56 sdr_02 57.4 3
8 2019-08-27 09:16:56 sdr_04 50.0 3
9 2019-08-27 09:16:56 sdr_01 71.5 3
10 2019-08-27 09:16:57 sdr_02 56.9 4
11 2019-08-27 09:16:57 sdr_04 60.4 4
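12 2019-08-27 09:16:57 sdr_01 73.5 4
Both solutions work well (thank you!) if the time difference between two measurements is greater than 0.7 seconds or if 4 devices record at the same time. Unfortunately, most of the time that is not the case. I think a solution that ignores the timestamp and instead checks for duplicates in consecutive rows might work better. I found many solutions for runs of repeated values using rle() or data.table, but none that identifies sequences of unique values. Please help me!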
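To make the rule concrete, it can be written down as a small base R check. This is only a sketch: `is_valid_grouping` is a hypothetical helper name that does not appear anywhere in this thread, and it tests a candidate id vector against the rule rather than computing the ids. It assumes the rows are already sorted by time and that ids start at 1.

```r
# Hypothetical helper (an assumption for illustration, not from the thread):
# returns TRUE if `id` groups `device` into maximal runs of unique devices.
is_valid_grouping <- function(device, id) {
  device <- as.character(device)
  # (a) ids must start at 1 and may only stay the same or go up by one
  if (length(id) == 0 || id[1] != 1 || !all(diff(id) %in% c(0, 1))) return(FALSE)
  groups <- split(device, id)
  # (b) no device may occur twice within one group
  if (any(vapply(groups, anyDuplicated, integer(1)) > 0)) return(FALSE)
  # (c) a new group may only start at a device already seen in the previous group
  for (k in seq_along(groups)[-1]) {
    if (!(groups[[k]][1] %in% groups[[k - 1]])) return(FALSE)
  }
  TRUE
}

device <- c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03", "sdr_04",
            "sdr_02", "sdr_04", "sdr_01", "sdr_02", "sdr_04", "sdr_01")
wanted     <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)  # match_id shown above
per_device <- c(1, 1, 2, 2, 3, 1, 3, 2, 1, 4, 3, 2)  # ids that reappear later
by_second  <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4)  # rows 5-9 merged into one id

is_valid_grouping(device, wanted)      # TRUE
is_valid_grouping(device, per_device)  # FALSE, fails check (a)
is_valid_grouping(device, by_second)   # FALSE, fails check (b)
```

The two rejected vectors are the new_id columns from the two outputs shown above, retyped by hand.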
Answered 2020-01-22 23:41:32
I am pretty sure I have overthought this, but here is a working solution,
library(dplyr)
data %>%
mutate(timestamp = format(timestamp, '%Y-%m-%d %H:%M:%S')) %>%
group_by(timestamp) %>%
mutate(new = data.table::rleid(duplicated(device))) %>%
group_by(timestamp, new) %>%
mutate(new1 = row_number() + new) %>%
ungroup() %>%
mutate(new_id = cumsum(c(TRUE, diff(new1) < 0))) %>%
select(-c(new, new1))
which gives us
# A tibble: 12 x 4
timestamp device measurement new_id
1 2019-08-27 09:48:54 sdr_02 80.2 1
2 2019-08-27 09:48:54 sdr_01 71.7 1
3 2019-08-27 09:48:54 sdr_04 74.2 1
4 2019-08-27 09:48:54 sdr_03 62.6 1
5 2019-08-27 09:48:55 sdr_02 77.1 2
6 2019-08-27 09:48:55 sdr_01 69.2 2
7 2019-08-27 09:48:55 sdr_03 62.1 2
8 2019-08-27 09:48:55 sdr_02 77.1 3
9 2019-08-27 09:48:55 sdr_01 54.6 3
10 2019-08-27 09:48:55 sdr_03 64.3 3
11 2019-08-27 09:48:56 sdr_02 66.5 4
12 2019-08-27 09:48:56 sdr_01 71.7 4
Answered 2020-01-22 23:52:32
Can't this be done more simply?
library(dplyr)
df %>%
group_by(device) %>%
mutate(new_id = seq_len(length(device)), timestamp = as.POSIXct(timestamp))
#> # A tibble: 12 x 4
#> # Groups: device [4]
#> timestamp device measurement new_id
#> <dttm> <fct> <dbl> <int>
#> 1 2019-08-27 09:48:54 sdr_02 80.2 1
#> 2 2019-08-27 09:48:54 sdr_01 71.7 1
#> 3 2019-08-27 09:48:54 sdr_04 74.2 1
#> 4 2019-08-27 09:48:54 sdr_03 62.6 1
#> 5 2019-08-27 09:48:55 sdr_02 77.1 2
#> 6 2019-08-27 09:48:55 sdr_01 69.2 2
#> 7 2019-08-27 09:48:55 sdr_03 62.1 2
#> 8 2019-08-27 09:48:55 sdr_02 77.1 3
#> 9 2019-08-27 09:48:55 sdr_01 54.6 3
#> 10 2019-08-27 09:48:55 sdr_03 64.3 3
#> 11 2019-08-27 09:48:56 sdr_02 66.5 4
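#> 12 2019-08-27 09:48:56 sdr_01 71.7 4
Update
Based on the OP's comments, it seems the best approach is to define a function that keeps a running tally of the devices it has encountered and increments the id whenever it reaches a device that is already in the tally.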
# Code # Pseudocode
# ======================================= # ===================================
group_instances <- function(my_labels) #
{ #
my_labels <- as.character(my_labels) # (Ensure we use a character vector)
#
result <- numeric(length(my_labels)) # Create a numeric result vector
matches <- as.character(my_labels[1]) # Create tally of encountered devices
#
for(i in seq_along(my_labels)[-1]) # For each device record after the first
{ #
if(my_labels[i] %in% matches) # If we have this device in our tally
{ #
matches <- my_labels[i] # Reset our tally of devices
result[i] <- result[i - 1] + 1 # and increment our ID
} #
else # Otherwise
{ #
matches <- c(matches, my_labels[i]) # Add it to our tally of devices
result[i] <- result[i - 1] # and copy the ID from the row above
} #
} #
return(result + 1) # Our IDs started at zero, so add one
}
Now we can do
my_data %>% mutate(ID = as.factor(group_instances(device)))
#> timestamp device measurement ID
#> 1 2019-08-27 07:29:20.671313 sdr_03 49.86882 1
#> 2 2019-08-27 07:29:20.932043 sdr_02 54.16083 1
#> 3 2019-08-27 07:29:21.839312 sdr_03 48.97448 2
#> 4 2019-08-27 07:29:21.850454 sdr_02 50.80867 2
#> 5 2019-08-27 08:57:01.990833 sdr_03 50.53306 3
#> 6 2019-08-27 08:57:02.022798 sdr_04 51.14332 3
#> 7 2019-08-27 09:16:56.454308 sdr_02 57.44715 3
#> 8 2019-08-27 09:16:56.482433 sdr_04 50.01275 4
#> 9 2019-08-27 09:16:56.761776 sdr_01 71.50030 4
#> 10 2019-08-27 09:16:57.305510 sdr_02 56.85118 4
#> 11 2019-08-27 09:16:57.333628 sdr_04 60.39014 5
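#> 12 2019-08-27 09:16:57.612972 sdr_01 73.47034 5
Answered 2020-01-23 10:04:33
I think a recursive function is required, in the sense that each row's group depends on the rows before it. Basically, you need to start a new group whenever a device is found that is already in the current group. Here is an implementation in Rcpp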
library(Rcpp)
cppFunction("
IntegerVector dev_not_in_prev_grp(IntegerVector device, int ndev) {
int i, j, k, sz = device.size();
std::vector<bool> exists(ndev);
IntegerVector res(sz);
for (k=0; k<ndev; k++)
exists[k] = false;
for (i=0; i<sz; i++) {
if (exists[device[i]-1]) {
res[i] = 1;
for (k=0; k<ndev; k++)
exists[k] = false;
}
exists[device[i]-1] = true;
}
return(res);
}
")
Usage:
ndev <- 4L
devmap <- setNames(1L:ndev, sprintf("sdr_%02d", 1L:ndev))
# as.character() so that devmap is looked up by name even if device is a factor
cumsum(dev_not_in_prev_grp(devmap[as.character(my_data$device)], ndev)) + 1L
Output:
[1] 1 1 2 2 3 3 3 4 4 4 5 5
https://stackoverflow.com/questions/59862624