首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >data.table fread无法为具有200 RAM内存的300 RAM文件分配内存

data.table fread无法为具有200 RAM内存的300 RAM文件分配内存
EN

Stack Overflow用户
提问于 2019-02-05 05:56:52
回答 2查看 1.5K关注 0票数 4

fread无法读取200 an空闲RAM的300 an .csv文件,并出现错误。

代码语言:javascript
复制
Error: cannot allocate vector of size 5.6 Mb

任务管理器截图:

该文件包含373522行和401列,其中1列(标识符)为字符,400列为数字列。

UPD:这个问题似乎与内存不足无关,而是与释放分配机制有关,因为如上所述,我有200 of的空闲RAM,并且希望只读取300 of csv文件和数字列。

UPD2:添加详细输出

如何读取文件:

代码语言:javascript
复制
lang-r
data <- fread(
    file = fn,
    sep = ",",
    stringsAsFactors = FALSE,
    data.table = FALSE,
    nrows = 1
)

col_classes <- c(
    "character",
    rep("numeric", ncol(data) - 1)
) 

data <- fread(
    file = fn,
    sep = ",",
    na.strings = c("NA", "na", "NULL", "null", ""),
    stringsAsFactors = FALSE,
    colClasses = col_classes,
    showProgress = TRUE,
    data.table = FALSE
)


lang-r
> file.size(fn)
[1] 331201365

会议信息:

代码语言:javascript
复制
lang-r
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.3.1     purrr_0.2.5       dplyr_0.7.8       data.table_1.12.0 crayon_1.3.4     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       assertthat_0.2.0 R6_2.3.0         magrittr_1.5     pillar_1.3.1     stringi_1.2.4   
 [7] rlang_0.3.1      rstudioapi_0.9.0 bindrcpp_0.2.2   tools_3.5.2      glue_1.3.0       yaml_2.2.0      
[13] compiler_3.5.2   pkgconfig_2.0.2  tidyselect_0.2.5 bindr_0.1.1      tibble_2.0.1

如果需要详细的输出,请告诉我。

代码语言:javascript
复制
lang-r
omp_get_max_threads() = 64
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 64 threads (omp_get_max_threads()=64, nth=64)
  NAstrings = [<<NA>>, <<na>>, <<NULL>>, <<null>>, <<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file I:/secret_file_name.csv
  File opened, size = 315.9MB (331201365 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<id,column_1>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ','
  sep=','  with 100 lines of 301 fields using quote rule 0
  Detected 301 columns on line 1. This line is either column names or first data row. Line starts as: <<id,column_1>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 301
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (331201363 bytes from row 1 to eof) / (2 * 163458 jump0size) == 1013
  Type codes (jump 000)    : A7777777777777777777777557777777557777775577777777755777777755777555777555777777...7777777777  Quote rule 0
  Type codes (jump 002)    : A7777777777777777777777557777777557777775577777777755777777755777777777557777777...7777777777  Quote rule 0
  Type codes (jump 020)    : A7777777777777777777777557777777557777775577777777755777777755777777777557777777...7777777777  Quote rule 0
  Type codes (jump 027)    : A7777777777777777777777777777777557777775577777777755777777777777777777557777777...7777777777  Quote rule 0
  Type codes (jump 058)    : A7777777777777777777777777777777777777777777777777777777777777777777777557777777...7777777777  Quote rule 0
  Type codes (jump 100)    : A7777777777777777777777777777777777777777777777777777777777777777777777557777777...7777777777  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 10059 sample rows
  =====
  Sampled 10059 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 331159811
  Line length: mean=903.71 sd=756.62 min=326 max=4068
  Estimated number of rows: 331159811 / 903.71 = 366444
  Initial alloc = 732888 rows (366444 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 6 type and 0 drop user overrides : A7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777
[10] Allocate memory for the datatable
  Allocating 301 column slots (301 - 0 dropped) with 732888 rows
Error: cannot allocate vector of size 5.6 Mb
EN

回答 2

Stack Overflow用户

发布于 2019-02-05 06:01:52

R在RAM上处理数据。因此,您的全局环境的大小最多可以是分配给R的RAM的大小。

这里有一些小把戏。

1-使用gc()强制垃圾收集

2-删除不必要的数据

3-使用较小的数据类型,如整数,而不是数字。

看看我之前的答案,here

票数 0
EN

Stack Overflow用户

发布于 2019-02-10 15:22:36

内存泄漏似乎与data.table::fread()无关。当我试图总共读取74个82 in的csv文件并全部读取bind_cols时,内存泄漏了。为此,我使用以下(伪)代码:

代码语言:javascript
复制
files <- list.files(some_dir)

result <- fread("the_main_file.csv", ...)

# to ensure all ids of all files are equal to avoid left_join
check_ids <- c()

for (fn in files) {
    data <- fread(fn, other_important_parameters)
    check_ids <- check_ids %>% c(sum(result$id == data$id))
    data$id <- NULL
    result <- result %>% bind_cols(data)
    # this probably is useless
    remove(data)
}

在这种情况下,我得到一个错误“无法分配.”使用512 Gb内存中的290 Gb和512 Gb中的512 Gb(类似于Q体内的屏幕截图)

但是,当我试图绑定数据子集(n-1行而不是n行)时,内存没有任何问题!任务管理器显示290 GB的使用和提交。

下面是更新的伪代码,它在没有内存泄漏的情况下工作。

代码语言:javascript
复制
files <- list.files(some_dir)

result <- fread("the_main_file.csv", ...)

ix <- rep(TRUE, nrow(result))
ix[1] <- FALSE

result <- result[ix, ]


# to ensure all ids of all files are equal to avoid left_join
check_ids <- c()

for (fn in files) {
    data <- fread(fn, other_important_parameters)
    check_ids <- check_ids %>% c(sum(result$id == data$id))
    data$id <- NULL
    result <- result %>% bind_cols(data[ix, ])
    # this probably is useless
    remove(data)
}

因此,fread的工作似乎是正确的,data.frames的绑定也有问题。

有人能解释一下为什么会这样吗?

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/54528429

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档