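I am using the ff and ffbase packages to combine two ffdf objects, but when I call merge, the number of rows in the resulting ffdf grows from 1 million to 8 million.
ffdf1 is 1 million rows by 6 columns: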
> summary(ffdf1)
Length Class Mode
userid 1000000 ff_vector list
V2 1000000 ff_vector list
V3 1000000 ff_vector list
V4 1000000 ff_vector list
V5 1000000 ff_vector list
V6 1000000 ff_vector list

ffdf2 is roughly 20 million rows by 3 columns and looks like this:
userid gender age
1 1 3
2 1 2
3 2 5
4 0 4
5 2 3
... ... ...

I merge the two with the following code:
ffdf3 <- merge(ffdf1, ffdf2, by.x="userid",by.y="userid",all.x=T,sort=F)

The result looks like this:
> summary(ffdf3)
Length Class Mode
userid 8000000 ff_vector list
V2 8000000 ff_vector list
V3 8000000 ff_vector list
V4 8000000 ff_vector list
V5 8000000 ff_vector list
V6 8000000 ff_vector list
gender 8000000 ff_vector list
age 8000000 ff_vector list

Any idea why the length goes from 1 million to 8 million?
Edit:
When I try this instead:
ffdf3 <- merge(ffdf1, ffdf2, by.x="userid",by.y="userid",all.x=F,sort=F)

I get:
> summary(ffdf3)
Length Class Mode
userid 740383 ff_vector list
V2 740383 ff_vector list
V3 740383 ff_vector list
V4 740383 ff_vector list
V5 740383 ff_vector list
V6 740383 ff_vector list
gender 740383 ff_vector list
age 740383 ff_vector list

Here is the output from running the merge:
2012-05-13 14:49:06, x has 2 chunks, y has 8 chunks
2012-05-13 14:49:06, working on x chunk 1:500000
2012-05-13 14:49:07, working on y chunk 1:2958661
2012-05-13 14:49:16, working on y chunk 2958662:5917322
2012-05-13 14:49:32, working on y chunk 5917323:8875983
2012-05-13 14:49:45, working on y chunk 8875984:11834644
2012-05-13 14:49:57, working on y chunk 11834645:14793305
2012-05-13 14:50:09, working on y chunk 14793306:17751966
2012-05-13 14:50:20, working on y chunk 17751967:20710627
2012-05-13 14:50:30, working on y chunk 20710628:23669283
2012-05-13 14:50:40, working on x chunk 500001:1000000
2012-05-13 14:50:41, working on y chunk 1:2958661
2012-05-13 14:50:52, working on y chunk 2958662:5917322
2012-05-13 14:51:03, working on y chunk 5917323:8875983
2012-05-13 14:51:14, working on y chunk 8875984:11834644
2012-05-13 14:51:24, working on y chunk 11834645:14793305
2012-05-13 14:51:36, working on y chunk 14793306:17751966
2012-05-13 14:51:47, working on y chunk 17751967:20710627
2012-05-13 14:51:58, working on y chunk 20710628:23669283

In addition, ffdf1 contains 677,840 unique userids, so there are some duplicates among its 1 million rows.
Answered 2012-05-16 22:32:35
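merge.ffdf contains a bug: it currently only performs inner joins correctly and does not handle all.x=TRUE with all.y=FALSE. The function is being developed at http://code.google.com/p/fffunctions/. The problem is that in a left outer join, records without a match need the vmode to be changed so that it can hold NA. This is being worked on.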
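To illustrate how a chunked left join can inflate the row count, here is a minimal sketch in base R on ordinary data frames. It is not the code of merge.ffdf: the helper `naive_chunked_left_join` and the toy data are hypothetical. It assumes a failure mode in which every x-chunk/y-chunk pairing emits all rows of the x chunk when all.x=TRUE. That would be consistent with the log in the question (2 x chunks × 8 y chunks × 500,000 rows per x chunk = 8,000,000), but the answer above does not confirm that this is what happens internally.

```r
# Toy data (made up): x has 6 rows, y has 8 rows with unique userid values.
x <- data.frame(userid = 1:6, V2 = letters[1:6])
y <- data.frame(userid = 1:8, gender = c(1, 1, 2, 0, 2, 1, 0, 2))

# Hypothetical helper: left-join every x chunk against every y chunk and
# stack the pieces, without reconciling unmatched rows across y chunks.
naive_chunked_left_join <- function(x, y, x_chunks, y_chunks) {
  pieces <- list()
  for (xi in x_chunks) {
    for (yi in y_chunks) {
      pieces[[length(pieces) + 1]] <-
        merge(x[xi, , drop = FALSE], y[yi, , drop = FALSE],
              by = "userid", all.x = TRUE, sort = FALSE)
    }
  }
  do.call(rbind, pieces)
}

x_chunks <- list(1:3, 4:6)             # 2 chunks of x, 3 rows each
y_chunks <- list(1:2, 3:4, 5:6, 7:8)   # 4 chunks of y

inflated <- naive_chunked_left_join(x, y, x_chunks, y_chunks)
correct  <- merge(x, y, by = "userid", all.x = TRUE, sort = FALSE)

nrow(inflated)  # 2 x chunks * 4 y chunks * 3 rows per x chunk = 24
nrow(correct)   # 6, one row per row of x
```

With all.x=FALSE the same loop yields only the matching rows from each pairing, so an inner join is not inflated this way.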
FYI: this issue has now been fixed in the development version at http://code.google.com/p/fffunctions/ and will be uploaded to CRAN in the coming weeks.
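When the lookup table has at most one row per userid, a left join can also be written as a plain index lookup, which by construction keeps the row count of the left table. The following is a minimal sketch in base R on ordinary data frames, with made-up toy data. Applying the same idea to ffdf objects would need ff-aware equivalents, which are not shown here.

```r
# Toy data (made up): x may repeat userids, y has one row per userid.
x <- data.frame(userid = c(3L, 1L, 3L, 9L), V2 = c("a", "b", "c", "d"))
y <- data.frame(userid = 1:5,
                gender = c(1L, 1L, 2L, 0L, 2L),
                age    = c(3L, 2L, 5L, 4L, 3L))

# The lookup is only equivalent to a left join if y has no duplicated keys.
stopifnot(anyDuplicated(y$userid) == 0)

# Position of each x key in y; NA where x has no partner in y.
idx <- match(x$userid, y$userid)

# Indexing with NA yields NA rows, so unmatched x rows keep NA in y's columns.
joined <- cbind(x, y[idx, c("gender", "age")])
rownames(joined) <- NULL

nrow(joined) == nrow(x)  # the row count of x is preserved
```

If y did contain repeated keys, match would silently pick the first occurrence only, which is why the sketch checks for duplicates first.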
https://stackoverflow.com/questions/10573060