Incomplete import of a large CSV with fread in R

Stack Overflow user
Asked 2019-06-17 21:18:09
Answers: 1 · Views: 260 · Followers: 0 · Votes: 0

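I am using fread in R to import some large CSV files in order to build a summary file for further analysis. When I plot the summary data, a few time points (each CSV holds one month of data) look low and off-trend. Looking into it further, it appears that fread stops before it has imported the complete file.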

I have tried fread, countLines and read_csv, all without success. The data in the example below come from https://digital.nhs.uk/data-and-information/publications/statistical/practice-level-prescribing-data/presentation-level-march-2013 and https://digital.nhs.uk/data-and-information/publications/statistical/practice-level-prescribing-data/gp-practice-prescribing-presentation-level-data-april-2013; the files are roughly 1.4 GB each.

Of the two files below, T201303PDPI+BNFT.CSV is the incomplete one and T201304PDPI+BNFT.CSV is a complete one.

I have included the error messages I received as comments in the following code:

Code language: R
library(data.table)
library(R.utils)
library(readr)

prescribing = fread("T201303PDPI+BNFT.CSV")
# Discarded single-line footer: <<Q34,5PF,M87013,21020001190    ,Coloplast SpeediCath Compact Fle Size 8-,0000001,000>>
prescribing2 = fread("T201304PDPI+BNFT.CSV")

countLines("T201303PDPI+BNFT.CSV")
# [1] 4427688
# attr(,"lastLineHasNewline")
# [1] FALSE
countLines("T201304PDPI+BNFT.CSV")
# [1] 10024499
# attr(,"lastLineHasNewline")
# [1] TRUE

prescribing = read_csv("T201303PDPI+BNFT.CSV")
# row col   expected        actual                   file
# 4427687 NIC            embedded null 'T201303PDPI+BNFT.CSV'
prescribing2 = read_csv("T201304PDPI+BNFT.CSV")

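I would like to keep using fread to import the data (it is fast), but I cannot work out how to import the whole file and ignore the embedded null. Any help would be greatly appreciated.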

Edit:

Code language: R
prescribing = fread("T201304PDPI+BNFT.CSV", verbose = TRUE)
omp_get_num_procs()==4
R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
R_DATATABLE_NUM_THREADS==""
omp_get_thread_limit()==2147483647
omp_get_max_threads()==4
OMP_THREAD_LIMIT==""
OMP_NUM_THREADS==""
data.table is using 2 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
RestoreAfterFork==true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file T201304PDPI+BNFT.CSV
  File opened, size = 1.298GB (1393405361 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: << SHA,PCT,PRACTICE,BNF CODE,BNF>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 11 fields using quote rule 0
  Detected 11 columns on line 1. This line is either column names or first data row. Line starts as: << SHA,PCT,PRACTICE,BNF CODE,BNF>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 11
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1393405359 bytes from row 1 to eof) / (2 * 13900 jump0size) == 50122
  Type codes (jump 000)    : AAAAA577552  Quote rule 0
  Type codes (jump 100)    : AAAAA577552  Quote rule 0
  'header' determined to be true due to column 6 containing a string on row 1 and a lower type (int32) in the rest of the 10050 sample rows
  =====
  Sampled 10050 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1393405220
  Line length: mean=139.00 sd=0.02 min=137 max=139
  Estimated number of rows: 1393405220 / 139.00 = 10024513
  Initial alloc = 11026964 rows (10024513 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AAAAA577552
[10] Allocate memory for the datatable
  Allocating 11 column slots (11 - 0 dropped) with 11026964 rows
[11] Read the data
  jumps=[0..1328), chunk_size=1049250, total_size=1393405220
|--------------------------------------------------|
|==================================================|
Read 10024498 rows x 11 columns from 1.298GB (1393405361 bytes) file in 01:23.504 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '2'
         3 : int32     '5'
         2 : float64   '7'
         5 : string    'A'
=============================
   0.018s (  0%) Memory map 1.298GB file
   0.224s (  0%) sep=',' ncol=11 and header detection
   0.000s (  0%) Column type detection using 10050 sample rows
   0.901s (  1%) Allocation of 11026964 rows x 11 cols (0.739GB) of which 10024498 ( 91%) rows used
  82.362s ( 99%) Reading 1328 chunks (0 swept) of 1.001MB (each chunk 7548 rows) using 2 threads
   +   79.434s ( 95%) Parse to row-major thread buffers (grown 0 times)
   +    2.817s (  3%) Transpose
   +    0.111s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  83.504s        Total

prescribing = fread("T201303PDPI+BNFT.CSV", verbose = TRUE)
omp_get_num_procs()==4
R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
R_DATATABLE_NUM_THREADS==""
omp_get_thread_limit()==2147483647
omp_get_max_threads()==4
OMP_THREAD_LIMIT==""
OMP_NUM_THREADS==""
data.table is using 2 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
RestoreAfterFork==true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file T201303PDPI+BNFT.CSV
  File opened, size = 1.295GB (1390238802 bytes).
  Memory mapped ok
[03] Detect and skip BOM
  Last byte(s) of input found to be 0x00 (NUL) and removed.
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
  File ends abruptly with '0'. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: << SHA,PCT,PRACTICE,BNF CODE,BNF>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 11 fields using quote rule 0
  Detected 11 columns on line 1. This line is either column names or first data row. Line starts as: << SHA,PCT,PRACTICE,BNF CODE,BNF>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 11
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (615448576 bytes from row 1 to eof) / (2 * 13900 jump0size) == 22138
  Type codes (jump 000)    : AAAAA577552  Quote rule 0
  A line with too-few fields (7/11) was found on line 50 of sample jump 100. Most likely this jump landed awkwardly so type bumps here will be skipped.
  Type codes (jump 100)    : AAAAA577552  Quote rule 0
  'header' determined to be true due to column 6 containing a string on row 1 and a lower type (int32) in the rest of the 10049 sample rows
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 615448437
  Line length: mean=139.00 sd=0.00 min=139 max=139
  Estimated number of rows: 615448437 / 139.00 = 4427687
  Initial alloc = 4870455 rows (4427687 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AAAAA577552
[10] Allocate memory for the datatable
  Allocating 11 column slots (11 - 0 dropped) with 4870455 rows
[11] Read the data
  jumps=[0..586), chunk_size=1050253, total_size=615448437
  Restarting team from jump 585. nSwept==0 quoteRule==1
  jumps=[585..586), chunk_size=1050253, total_size=615448437
  Restarting team from jump 585. nSwept==0 quoteRule==2
  jumps=[585..586), chunk_size=1050253, total_size=615448437
  Restarting team from jump 585. nSwept==0 quoteRule==3
  jumps=[585..586), chunk_size=1050253, total_size=615448437
Read 4427686 rows x 11 columns from 1.295GB (1390238802 bytes) file in 00:03.479 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '2'
         3 : int32     '5'
         2 : float64   '7'
         5 : string    'A'
=============================
   0.079s (  2%) Memory map 1.295GB file
   0.614s ( 18%) sep=',' ncol=11 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   0.526s ( 15%) Allocation of 4870455 rows x 11 cols (0.327GB) of which 4427686 ( 91%) rows used
   2.260s ( 65%) Reading 586 chunks (0 swept) of 1.002MB (each chunk 7555 rows) using 2 threads
   +    0.661s ( 19%) Parse to row-major thread buffers (grown 0 times)
   +    1.576s ( 45%) Transpose
   +    0.023s (  1%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   3.479s        Total
Warning message:
In fread("T201303PDPI+BNFT.CSV", verbose = TRUE) :
  Discarded single-line footer: <<Q34,5PF,M87013,21020001190    ,Coloplast SpeediCath Compact Fle Size 8-,0000001,000>>
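
The March log reports a file size of 1390238802 bytes but only 615448576 bytes from row 1 to the end of the input, together with "Last byte(s) of input found to be 0x00 (NUL) and removed". One possible reading is that the file ends in a long run of NUL bytes; the source does not confirm this. A minimal base-R sketch for checking it is shown here. The helper name `scan_nul`, the chunk size, and the small temporary file standing in for the real CSV are assumptions for illustration only.

```r
# Hypothetical helper: count NUL (0x00) bytes in a file and report the
# 1-based byte offset of the first one. Reads in chunks to limit memory use.
scan_nul <- function(path, chunk_bytes = 16 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con), add = TRUE)
  offset <- 0          # bytes consumed so far (double, safe beyond 2^31)
  n_nul <- 0           # running count of NUL bytes
  first_nul <- NA_real_
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(bytes) == 0L) break
    is_nul <- bytes == as.raw(0)
    if (any(is_nul)) {
      # which.max() on a logical vector gives the index of the first TRUE
      if (is.na(first_nul)) first_nul <- offset + which.max(is_nul)
      n_nul <- n_nul + sum(is_nul)
    }
    offset <- offset + length(bytes)
  }
  list(size = offset, n_nul = n_nul, first_nul = first_nul)
}

# Tiny stand-in file: 8 bytes of CSV text followed by 3 NUL bytes.
demo <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,2\n"), raw(3)), demo)
res <- scan_nul(demo)
str(res)
```

If `first_nul` on the real file falls close to where the last row was read and `n_nul` accounts for most of the remaining bytes, the missing rows were probably never written to the file, and downloading it again may be more useful than any parser setting.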

1 Answer

Stack Overflow user

Answered 2019-06-17 22:02:34

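If the problem is getting rid of the NUL characters, try something like the following, assuming tr is on your PATH. Depending on your shell, you may need to modify it slightly.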

Code language: R
fread(cmd = "tr -d '\\000' < inputfile")
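
Where tr is not available (for example on Windows without Unix tools), a similar effect can probably be had in base R by copying the file and dropping every NUL byte before calling fread. This is only a sketch of that idea, not part of the original answer: `strip_nul`, the chunk size, and the temporary files are hypothetical names and values chosen for illustration.

```r
library(data.table)

# Hypothetical helper: copy `infile` to `outfile`, dropping every NUL (0x00)
# byte, and return how many bytes were removed. Works chunk by chunk.
strip_nul <- function(infile, outfile, chunk_bytes = 16 * 1024^2) {
  con_in <- file(infile, "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, "wb")
  on.exit(close(con_out), add = TRUE)
  removed <- 0
  repeat {
    bytes <- readBin(con_in, what = "raw", n = chunk_bytes)
    if (length(bytes) == 0L) break
    is_nul <- bytes == as.raw(0)
    removed <- removed + sum(is_nul)
    writeBin(bytes[!is_nul], con_out)
  }
  removed
}

# Tiny stand-in for the real CSV: one NUL byte embedded between two rows.
tmp_in <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,2\n"), as.raw(0), charToRaw("3,4\n")), tmp_in)

n_removed <- strip_nul(tmp_in, tmp_out)
dt <- fread(tmp_out)   # reads the cleaned copy
```

Like the tr approach, this only removes NUL bytes that are present. If the file was truncated before it reached disk, the missing rows are not recoverable this way.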
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/56632183
