Q: Is there a faster way than fread() to read big data?

Stack Overflow user

Asked 2019-05-31 14:15:50

Answers: 2 | Views: 4.4K | Followers: 0 | Votes: 8

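Hi. First of all, I have already searched Stack Overflow and Google and found posts such as this one: "Quickly reading very large tables as dataframes". While those are helpful and well answered, I'm looking for more information.

I am looking for the best way to read/import "big" data that can go up to 50-60 GB. I am currently using the fread() function from data.table, which is the fastest function I know of at the moment. The PC/server I work on has a good CPU (workstation) and 32 GB of RAM, but data over 10 GB, sometimes close to billions of observations, still takes a long time to read.

We already have SQL databases, but for some reasons we have to work with big data in R. Is there a way to speed up R, or an even better option than fread(), when it comes to huge files like this?

Thank you.

Edit: fread("data.txt", verbose = TRUE)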

omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=2, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C://somefolder/data.txt
  File opened, size = 1.083GB (1163081280 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 5 fields using quote rule 0
  Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
  Type codes (jump 000)    : 5A5AA  Quote rule 0
  Type codes (jump 100)    : 5A5AA  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1163081249
  Line length: mean=56.72 sd=20.65 min=25 max=128
  Estimated number of rows: 1163081249 / 56.72 = 20506811
  Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
  Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
  jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         3 : string    'A'
=============================
   0.007s (  0%) Memory map 1.083GB file
   0.739s (  2%) sep=',' ncol=5 and header detection
   0.001s (  0%) Column type detection using 10054 sample rows
   1.809s (  6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
  28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
   +   26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
   +    2.639s (  8%) Transpose
   +    0.035s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  31.484s        Total

Tags: fread, data.table, bigdata

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-05-31 14:42:10

You can use select = columns to load only the relevant columns without saturating your memory. For example:

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))
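If you do not know the column names up front, one possible approach is to read only the header first and then pick the columns. The sketch below assumes a small temporary file whose header mimics the one in the question's verbose log (ID, Dat, No, MX, NOM_TX); the file contents and the choice of selected columns are made up for illustration.

```r
library(data.table)

# Build a tiny stand-in file; the real data.txt is not available here.
tmp <- tempfile(fileext = ".csv")
fwrite(
  data.table(ID = 1:5, Dat = "2019-05-31", No = 1:5, MX = "a", NOM_TX = "b"),
  tmp
)

# nrows = 0 reads only the header, so this is cheap even for a huge file.
hdr <- names(fread(tmp, nrows = 0L))

# Load just the columns that are actually needed.
dt <- fread(tmp, select = c("ID", "No"))

unlink(tmp)
```

Skipping columns this way reduces both parsing work and memory, which matters most when the dropped columns are character columns.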

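I have used read.delim() to read a file that fread() could not load completely. So you could convert your data to .txt and use read.delim().

However, why not open a connection to the SQL server you are pulling your data from? You can use library(odbc) to open a connection to the SQL server and write your query as you normally would. That way you can optimize your memory usage.

Check out this short introduction to odbc.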
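A minimal sketch of that idea, assuming the DBI and odbc packages are installed and a data source is already configured on the machine. The helper name fetch_from_sql, the DSN "my_dsn", and the table and column names in the commented usage are all hypothetical placeholders, not something given in the question.

```r
# Hypothetical helper: run a query on the SQL server and return the result
# as a data.table, so that only the requested rows/columns reach R.
fetch_from_sql <- function(dsn, query) {
  # Open the connection through the odbc driver; the DSN must exist locally.
  con <- DBI::dbConnect(odbc::odbc(), dsn = dsn)
  # Always close the connection, even if the query fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  data.table::as.data.table(DBI::dbGetQuery(con, query))
}

# Example call (not run; DSN, table, and columns are placeholders):
# dt <- fetch_from_sql("my_dsn", "SELECT ID, No FROM some_table WHERE No > 100")
```

Letting the database do the filtering and aggregation means R only has to hold the reduced result in memory.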

Votes: 3

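Stack Overflow user

Posted 2019-06-01 05:36:41

Assuming you want the file read fully into R, using a database or selecting a subset of columns/rows won't help much.

What can help in that case is to:

Make sure you are using a recent version of data.table.
Make sure the optimal number of threads is set. Use setDTthreads(0L) to use all available threads; by default data.table uses 50% of the available threads.
Check the output of fread(..., verbose=TRUE), and possibly add it to your question.
Put your file on a fast disk or a RAM disk and read it from there.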
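The version and thread checks above can be sketched as follows. The temporary file is a made-up stand-in for the real data, and passing nThread explicitly is optional since fread() picks up the setDTthreads() setting by default. Note that the verbose log in the question reports omp_get_max_threads() = 2, so on that machine at most 2 threads were available in any case.

```r
library(data.table)

# Check which data.table version is installed before comparing timings.
dt_version <- packageVersion("data.table")

# Remember the current setting, then ask for all logical CPUs (0 = all).
old_threads <- getDTthreads()
setDTthreads(0L)
n_threads <- getDTthreads()

# Small stand-in file; replace with the real path (ideally on a fast disk).
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(ID = 1:1000, No = 1000:1), tmp)

# fread() uses getDTthreads() by default; nThread makes that explicit.
dt <- fread(tmp, nThread = n_threads)

# Restore the previous thread setting and clean up.
setDTthreads(old_threads)
unlink(tmp)
```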

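If your data has many distinct character values, you may not be able to get great speed, because populating R's internal global character cache is single-threaded. Parsing can be fast, but creating the character vector(s) will be the bottleneck.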
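A rough way to see whether this applies to your data is to time fread() on an integer-only file against a file of the same shape filled with distinct strings. The sketch below uses synthetic data (the row count and the string patterns are arbitrary assumptions), and the timings will vary by machine, so treat it as an experiment rather than a benchmark.

```r
library(data.table)

n <- 200000L

# Same number of rows and columns; only the column types differ.
ints <- data.table(a = seq_len(n), b = seq_len(n))
strs <- data.table(
  a = sprintf("id_%07d", seq_len(n)),  # every value is a distinct string
  b = sprintf("tx_%07d", seq_len(n))
)

f_int <- tempfile(fileext = ".csv")
f_str <- tempfile(fileext = ".csv")
fwrite(ints, f_int)
fwrite(strs, f_str)

# Elapsed wall-clock seconds for each read.
t_int <- system.time(d_int <- fread(f_int))[["elapsed"]]
t_str <- system.time(d_str <- fread(f_str))[["elapsed"]]

unlink(c(f_int, f_str))
```

If the character file is much slower even with more threads, the character columns are a likely bottleneck, and columns that are really numeric codes or dates may be worth storing in a non-character form upstream.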

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56396770
