Q: CHICKEN Scheme read-line takes too long

Stack Overflow user

Asked 2020-07-10 09:35:09

2 answers, 156 views, 0 followers, 0 votes

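Is there a fast way to read and tokenize a large corpus? I am trying to read a moderately sized text file, and the compiled CHICKEN program seems to just hang (I killed the process after about 2 minutes), whereas Racket, for example, performs acceptably (about 20 seconds). Is there anything I can do to get the same performance from CHICKEN? This is the code I am using to read the file. All suggestions are welcome.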

(define *corpus*
  (call-with-input-file "largeish_file.txt"
    (lambda (input-file)
      (let loop ([line (read-line input-file)]
                 [tokens '()])
        (if (eof-object? line)
            tokens
            (loop (read-line input-file)
                  (append tokens (string-split line))))))))

racket

chicken-scheme

2 Answers

Stack Overflow user

Answered 2020-07-10 14:56:32

Try running it with a larger initial heap:

./prog -:hi100M

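The program allocates a lot, which means the heap has to be resized many times, and each resize triggers a major GC (major GCs are expensive).

You can watch the heap being resized by enabling debug output: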

./prog -:d

If you want to see the GC output, try:

./prog -:g
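
As a complement to the runtime flags above, GC cost can also be inspected from inside the program. The following is only a minimal sketch, assuming the `(chicken gc)` module of CHICKEN 5 with its `current-gc-milliseconds` and `memory-statistics` procedures; `with-gc-report` is a hypothetical helper name introduced here for illustration, not part of CHICKEN.

```scheme
(import (chicken gc)
        (chicken format))

;; Hypothetical helper: run THUNK, then print the time spent in major GCs
;; while it ran, along with the heap size afterwards. Returns THUNK's result.
(define (with-gc-report thunk)
  ;; Per the CHICKEN 5 manual, this returns the milliseconds spent in major
  ;; GCs since the previous call, so calling it first resets the counter.
  (current-gc-milliseconds)
  (let* ((result (thunk))
         (gc-ms  (current-gc-milliseconds))
         ;; memory-statistics itself forces a major GC and returns
         ;; #(total-heap-bytes used-bytes nursery-bytes).
         (stats  (memory-statistics)))
    (printf "major GC: ~a ms, heap: ~a bytes, used: ~a bytes~%"
            gc-ms (vector-ref stats 0) (vector-ref stats 1))
    result))
```

Wrapping the tokenizing loop in `(with-gc-report (lambda () ...))` with and without `-:hi100M` should give a rough idea of how much of the run time goes to major collections.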

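Votes: 1

Stack Overflow user

Answered 2020-08-02 19:22:15

If you can afford to read the whole file into memory at once, you can use something like the following code, which should be faster: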

(let loop ((lines (with-input-from-file "largeish_file.txt"
                    read-lines)))
  (if (null? lines)
      '()
      (append (string-split (car lines))
              (loop (cdr lines)))))

Here is some quick benchmark code:

(import (chicken io)
        (chicken string))

;; Warm-up
(with-input-from-file "largeish_file.txt" read-lines)

(time
 (with-output-to-file "a.out"
   (lambda ()
     (display
      (call-with-input-file "largeish_file.txt"
        (lambda (input-file)
          (let loop ([line (read-line input-file)]
                     [tokens '()])
            (if (eof-object? line)
                tokens
                (loop (read-line input-file)
                      (append tokens (string-split line)))))))))))

(time
 (with-output-to-file "b.out"
   (lambda ()
     (display
      (let loop ((lines (with-input-from-file "largeish_file.txt"
                          read-lines)))
        (if (null? lines)
            '()
            (append (string-split (car lines))
                    (loop (cdr lines)))))))))

These are the results on my system:

$ csc bench.scm && ./bench
28.629s CPU time, 13.759s GC time (major), 68772/275 mutations (total/tracked), 4402/14196 GCs (major/minor), maximum live heap: 4.63 MiB
0.077s CPU time, 0.033s GC time (major), 68778/292 mutations (total/tracked), 10/356 GCs (major/minor), maximum live heap: 3.23 MiB

Just to make sure both snippets produce the same result:

$ cmp a.out b.out && echo They contain the same data
They contain the same data

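largeish_file.txt was generated by repeatedly cat-ing a syslog file of roughly 100 KB until it reached about 10000 lines (I mention this to give you a rough idea of the input file's profile):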

$ ls -l largeish_file.txt
-rw-r--r-- 1 mario mario 587340 Aug  2 11:55 largeish_file.txt

$ wc -l largeish_file.txt
5790 largeish_file.txt

I got these results on a Debian system with CHICKEN 5.2.0.
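
A likely reason for the gap between the two timings is that `append` copies its first argument: the loop in the question calls `(append tokens ...)` with the ever-growing accumulator first, so every line copies all tokens read so far. If loading all lines up front is not desirable, the line-by-line loop can probably be kept by accumulating tokens in reverse with `cons` and reversing once at the end. This is a sketch under that assumption, not a benchmarked result; `push-tokens`, `read-tokens`, and `file-tokens` are hypothetical names introduced here.

```scheme
(import (chicken io)
        (chicken string))

;; Hypothetical helper: push the tokens of LINE onto the front of ACC,
;; so ACC holds every token seen so far in reverse order.
(define (push-tokens line acc)
  (let loop ((toks (string-split line))
             (acc acc))
    (if (null? toks)
        acc
        (loop (cdr toks) (cons (car toks) acc)))))

;; Hypothetical helper: read PORT line by line and return all tokens in
;; their original order. Each token is consed once and reversed once.
(define (read-tokens port)
  (let loop ((line (read-line port))
             (acc '()))
    (if (eof-object? line)
        (reverse acc)
        (loop (read-line port) (push-tokens line acc)))))

;; Hypothetical helper: tokenize the file at PATH.
(define (file-tokens path)
  (call-with-input-file path read-tokens))
```

With these definitions, `(define *corpus* (file-tokens "largeish_file.txt"))` would play the role of the definition in the question while only ever holding one line plus the token list in memory.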

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/62826083
