Q: R read.table too many items

Stack Overflow user

Asked 2017-03-21 18:03:06

Answers: 1 | Views: 591 | Followers: 0 | Votes: 3

I have a 53 GB file whose head looks like this:

1   10  2873
1   100 22246
1   1000    28474
1   10000   35663
1   10001   35755
1   10002   35944
1   10003   36387
1   10004   36453
1   10005   36758
1   10006   37240

I am running R 3.3.2 on a 64-bit CentOS 7 server with 128 GB of RAM. I have already read 4098 similar files into R, but I cannot read the largest one.

df <- read.table(f, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items

It returns an error saying "too many items". I then followed this tip:

df5rows <- read.table(f, nrows=5, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
classes <- sapply(df5rows, class)
df <- read.table(f, nrows=3231959401, colClass=classes, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')

It still reports "too many items", now together with "NAs introduced". I also tried without colClasses, with the same result:

df <- read.table(f, nrows=3231959401, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  NAs introduced by coercion to integer range
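
The warning above mentions coercion to integer range. As a small sketch of the arithmetic only (an assumption about what the warning refers to, not a confirmed diagnosis of the error), the requested nrows value can be compared with the largest integer R can represent. The variable names are illustrative.

```r
# The nrows value passed to read.table in the question.
requested_rows <- 3231959401

# Largest value representable by R's integer type.
int_max <- .Machine$integer.max

# TRUE when the requested row count does not fit in an integer.
exceeds_int <- requested_rows > int_max

# Coercing an out-of-range double to integer yields NA (with a warning,
# suppressed here so the sketch runs quietly).
coerced <- suppressWarnings(as.integer(requested_rows))

print(exceeds_int)
print(coerced)
```

If the comparison is TRUE, the row count itself is larger than what an integer argument can hold, which would be consistent with the warning text.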

Memory usage never exceeds 90 GB when nrows and colClasses are omitted, and never exceeds 60 GB when they are supplied. I don't understand why R cannot read the file.

I have also checked that no line contains 4 or more columns.
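
The question does not say how the column check was done. One way to sketch such a check in R is with `count.fields`, shown here on a tiny temporary file that stands in for the real data; the sample contents and the names `n_fields` and `bad_lines` are made up for illustration. On a file with billions of lines this approach may hit its own size limits, so treat it as a sketch for a sample or a slice of the file.

```r
# Build a small tab-separated sample; the third line has an extra field.
f <- tempfile(fileext = ".tsv")
writeLines(c("1\t10\t2873",
             "1\t100\t22246",
             "1\t1000\t28474\textra"), f)

# Count fields per line with the same sep/quote/comment settings
# that the question passes to read.table.
n_fields <- count.fields(f, sep = "\t", quote = "", comment.char = "")

# Line numbers whose field count differs from the expected 3 columns.
bad_lines <- which(n_fields != 3L)
print(bad_lines)

unlink(f)
```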

read.table

1 Answer

Stack Overflow user

Posted 2017-03-21 18:27:10

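Have you tried cutting the file in two with a lightweight editor such as sed or vi? You would then only have to merge the two datasets. I ran into the same problem with large files on a very similar machine. In my case it was a garbage line; given the size of the file, this type of error can happen.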
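
The answer suggests splitting the file outside R. A related idea, sketched here as an assumption rather than as the answerer's method, is to read the file piece by piece from an open connection inside R. The helper `read_in_chunks`, the chunk size, the integer column classes and the sample file are all hypothetical. Whether the pieces can afterwards be combined into one data frame of this size is a separate question that this sketch does not settle.

```r
# Small tab-separated sample standing in for the real 53 GB file.
f <- tempfile(fileext = ".tsv")
sample_rows <- data.frame(a = rep(1L, 10), b = 1:10,
                          dist = seq(100L, 1000L, by = 100L))
write.table(sample_rows, f, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Hypothetical helper: read `path` in pieces of at most `chunk_rows` rows.
read_in_chunks <- function(path, chunk_rows) {
  con <- file(path, open = "r")   # an open connection keeps its position
  on.exit(close(con))
  chunks <- list()
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, header = FALSE,
                 col.names = c("a", "b", "dist"),
                 colClasses = c("integer", "integer", "integer"),
                 sep = "\t", quote = "", comment.char = ""),
      error = function(e) NULL    # read.table errors once no lines remain
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
    if (nrow(chunk) < chunk_rows) break   # short chunk: end of file reached
  }
  chunks
}

chunks <- read_in_chunks(f, chunk_rows = 4L)
chunk_sizes <- sapply(chunks, nrow)
print(chunk_sizes)

unlink(f)
```

Each chunk can be summarised or written out before the next one is read, so no single call has to handle every row at once.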

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/42923816
