
Suggestions for speeding up `data.table::fread`

Stack Overflow user
Asked on 2019-11-28 00:30:45
1 answer · 524 views · 0 followers · score 1

When I try to read a fairly large file into R with fread, I run into read speeds that are much slower than expected.

The file is roughly 60M rows x 147 columns, of which I select only 27, directly via the `select` argument in the `fread` call; only 23 of those 27 columns are actually found in the file. (I probably mistyped some of the strings, but I assume that matters little here.)

data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
                     verbose = TRUE,
                     select = cols2Select)

The system in use is an Azure VM with a 16-core Intel Xeon and 114 GB of memory, running Windows 10. I am using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0.

I should also add that the file is a csv which I have transferred onto the VM's local drive, so no network/Ethernet is involved. I am not sure how Azure VMs work or what kind of drives they use, but I would assume it is the equivalent of an SSD. Nothing else is running or being processed on the VM at the same time.

Please find fread's verbose output below:

omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row.
  Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000)    : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555 Quote rule 0
  Type codes (jump 001)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
  Type codes (jump 002)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
  Type codes (jump 003)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 010)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 031)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 098)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 100)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
       124 : drop      '0'
         3 : int32     '5'
         7 : float64   '7'
        13 : string    'A'
=============================
   0.000s (  0%) Memory map 48.996GB file
   0.035s (  0%) sep=',' ncol=147 and header detection
   0.001s (  0%) Column type detection using 10045 sample rows
   6.000s (  0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14

Basically, I would like to know whether this is just normal or whether there is anything I can do to improve the read speed. Based on the various benchmarks I have seen, plus my own experience and intuition with fread on smaller files, I would have expected this to be read much faster.

I was also wondering whether the multi-core capability is being fully used, as I have heard that under Windows this might not always be straightforward. My knowledge of the topic is quite limited, unfortunately, but from the verbose output it does appear that fread detects the 16 cores.
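For what it's worth, data.table exposes its thread settings directly from R, so the core detection seen in the verbose log can be cross-checked before the read. A minimal sketch using data.table's own `getDTthreads`/`setDTthreads`:

```r
library(data.table)

# Report how many threads data.table will actually use,
# plus the OpenMP limits it sees (same numbers as fread's verbose header)
getDTthreads(verbose = TRUE)

# Pin the thread count explicitly if the default looks wrong
setDTthreads(16)
```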


1 Answer

Stack Overflow user
Answered on 2019-11-28 11:01:18

Thoughts:

(1) If you are on Windows, use Microsoft Open R, more so if the cloud is Azure. Actually there may be coordination between Open R and the Azure client. I have found Microsoft Open R to run faster on Windows, due to Intel's MKL and Microsoft's built-in enhancements.

(2) I suspect 'select' and 'drop' take effect after a full file read. It may work out to read all of the file and subset or filter afterwards.
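The answerer's suspicion can be tested directly by timing the two approaches against each other; a sketch, reusing the question's file path and assuming `cols2Select` from the question (the `..cols` prefix is data.table's syntax for "the column names stored in this variable"):

```r
library(data.table)

cols <- cols2Select  # the 27 column names from the question

# (a) restrict columns during the read itself
t_select <- system.time(
  dt1 <- fread("..\\TOI\\TOI_RAW_APextracted.csv", select = cols)
)

# (b) read everything, then subset afterwards
t_subset <- system.time({
  dt2 <- fread("..\\TOI\\TOI_RAW_APextracted.csv")
  dt2 <- dt2[, ..cols]
})

rbind(select_at_read = t_select, subset_after = t_subset)
```

Note that the question's verbose log ("After 0 type and 124 drop user overrides", "Allocating 23 column slots") suggests `select` is applied during the read, so (a) would be expected to also save memory.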

(3) I think a restart is overkill. I often run gc three times in a row, like this: `gc();gc();gc();`. I have heard others say it does nothing, but at least it makes me feel better. Actually, I have noticed it helping me on Windows.

(4) The latest versions of data.table's fread implement 'YAML'. This looks promising.
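Assuming this refers to the `yaml` argument of `fwrite`/`fread` (added after the data.table 1.12.0 used in the question), the idea is that `fwrite` can emit a YAML header describing the column types, which a later `fread` can trust instead of sampling the file. A hedged sketch using the built-in `iris` data:

```r
library(data.table)  # the yaml argument requires a newer data.table release

# Write once with an embedded YAML schema header...
fwrite(iris, "iris.csv", yaml = TRUE)

# ...then later reads can take column types from the header
dt <- fread("iris.csv", yaml = TRUE)
str(dt)
```

This would not help the parsing time itself, but it skips type detection and the kind of out-of-sample type bump ("ProdType" int32 -> string) seen at the end of the verbose log.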

(5) `setDTthreads(0)` uses all cores. Too much parallelization can work against you. Try halving your cores.
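Halving the cores, as suggested, could look like the following sketch with data.table's `setDTthreads`/`getDTthreads`:

```r
library(data.table)

setDTthreads(0)                      # 0 = use all logical cores
full <- getDTthreads()               # e.g. 16 on the VM in the question

setDTthreads(max(1L, full %/% 2L))   # try half the cores
getDTthreads()
```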

Score 0
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/59074647
