当我尝试用fread读取R中相当大的文件时,我遇到了读取速度比预期慢得多的情况。
这个文件大约60M行x 147列,我只选择了27列,直接在使用select的fread调用中;27列中只有23列是在实际文件中找到的。(可能是我输入了一些错误的字符串,但我猜这没那么重要。)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)正在使用的系统是一个Azure VM,具有16核Intel Xeon和114 GB内存,运行Windows10。我还使用了R 3.5.2、RStudio 1.2.1335和data.table 1.12.0
我还应该补充说,该文件是一个csv文件,我已将其传输到虚拟机的本地驱动器上,因此不涉及网络/以太网。我不确定Azure虚拟机是如何工作的,也不确定它们使用的是什么驱动器,但我会假设它等同于SSD。VM上同时未运行/正在处理任何其他内容。
请在下面找到fread的verbose输出
omp_get_max_threads() = 16 omp_get_thread_limit() = 2147483647 DTthreads = 0 RestoreAfterFork = true Input contains no \n. Taking this to be a filename to open [01] Check arguments Using 16 threads (omp_get_max_threads()=16, nth=16) NAstrings = [<<NA>>] None of the NAstrings look like numbers. show progress = 1 0/1 column will be read as integer [02] Opening the file Opening file ..\TOI\TOI_RAW_APextracted.csv File opened, size = 49.00GB (52608776250 bytes). Memory mapped ok [03] Detect and skip BOM [04] Arrange mmap to be \0 terminated \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal. [05] Skipping initial rows if needed Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>> [06] Detect separator, quoting rule, and ncolumns Detecting sep automatically ... sep=',' with 100 lines of 147 fields using quote rule 0 Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>> Quote rule picked = 0 fill=false and the most number of columns found is 147 [07] Detect column types, good nrow estimate and whether first row is column names Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216 Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555 Quote rule 0 Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0 Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0 Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows ===== Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points Bytes from first data row on line 2 to the end of last row: 52608774311 Line length: mean=956.51 sd=35.58 min=823 max=1063 Estimated number of rows: 52608774311 /
956.51 = 55000757 Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
===== [08] Assign column names [09] Apply user overrides on column types After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000 [10] Allocate memory for the datatable Allocating 23 column slots (147 - 124 dropped) with 60500832 rows [11] Read the data jumps=[0..50176), chunk_size=1048484, total_size=52608774311 |--------------------------------------------------| |==================================================| jumps=[0..50176), chunk_size=1048484, total_size=52608774311 |--------------------------------------------------| |==================================================| Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time [12] Finalizing the datatable Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times) + 22.774s ( 1%) Transpose +
144.273s ( 8%) Waiting
24.545s ( 1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s Total Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14基本上,我想知道这是不是很正常,或者我能做些什么来提高阅读速度。基于我所见过的各种基准测试,以及我自己对使用较小文件的fread的经验和直觉,我原本期望它能更快地被读取。
我还想知道多核功能是否得到了充分利用,因为我听说在Windows下这可能并不总是那么简单。不幸的是,我对这个主题的了解非常有限,但从verbose输出来看,fread确实检测到了16个内核。
发布于 2019-11-28 11:01:18
想法:
(1)如果您使用的是Windows,请使用Microsoft Open R;如果云是Azure,则更是如此。实际上,Open R和Azure客户端之间可能存在协调。由于Intel的MKL和微软内置的增强功能,我发现Microsoft Open R在Windows上运行速度更快。
(2)我怀疑'Select‘和'Drop’在完整文件读取后会起作用。可能会在之后读取所有的文件、子集或过滤器。
(3)我认为重启有点过头了。我经常像这样运行gc三次:‘gc();’‘我听别人说这没什么用。但至少这让我感觉好多了。实际上,我注意到它在Windows上对我很有帮助。
(4)最新版本的data.table fread实现了'YAML‘。这看起来很有希望。
(5) setDTthread(0)使用所有内核。太多的并行化可能会对你不利。试着把你的内核减半。
https://stackoverflow.com/questions/59074647
复制相似问题