Efficient text file reading in Scheme

Code Review user

Asked 2023-01-12 20:35:17

1 answer · 51 views · 0 followers · 2 votes

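I have been experimenting with ways to read a large text file into a Scheme program as a single string. Development is in R7RS Scheme, specifically Chibi Scheme.

After trying many approaches, some of them fairly complicated and involved, the best performers turned out to be quite simple. Below are three such examples, along with some sample usage and timing results.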

(import (scheme base)
        (scheme file)   ; with-input-from-file, call-with-input-file
        (scheme time)
        (scheme write)) ; display

;; Use with-input-from-file.
;; Return the contents of a text file as a single string,
;; #\newlines included.
(define (file->string1 path)
  (let* ((start-time (current-second)))
    (let ((lst (with-input-from-file path
                 (lambda ()
                   (let ((p (current-input-port)))
                     (let loop ((ch (read-char p))
                                (acc '()))
                       (if (eof-object? ch)
                           acc
                           (loop (read-char p) (cons ch acc)))))))))
      (let ((result (list->string (reverse lst)))
            (duration (- (current-second) start-time)))
        (display "file->string1 complete in ")
        (display duration)
        (display " seconds.\n")
        result))))

;; Use call-with-input-file.
;; Return the contents of a text file as a single string,
;; #\newlines included.
(define (file->string2 path)
  (let ((start-time (current-second))
        (lst (call-with-input-file path
               (lambda (p)
                 (let loop ((ch (read-char p))
                            (acc '()))
                   (if (eof-object? ch)
                       acc
                       (loop (read-char p) (cons ch acc))))))))
    (let ((result (list->string (reverse lst)))
          (duration (- (current-second) start-time)))
      (display "file->string2 complete in ")
      (display duration)
      (display " seconds.\n")
      result)))

;; Use an output string to collect the data read.
;; Return the contents of a text file as a single string,
;; #\newlines included.
(define (file->string3 path)
  (let* ((start-time (current-second))
         (result (call-with-input-file path
                   (lambda (p)
                     (let ((out (open-output-string)))
                       (let loop ()
                         (cond
                          ((eof-object? (peek-char p))
                           (get-output-string out))
                          (else
                           (write-char (read-char p) out)
                           (loop))))))))
         (duration (- (current-second) start-time)))
    (display "file->string3 complete in ")
    (display duration)
    (display " seconds.\n")
    result))

;; Example usage and results loading "War and Peace" from the
;; Project Gutenberg.
;; https://www.gutenberg.org/cache/epub/2600/pg2600.txt
;;
;; ➜  david in schemacs on branch: (main) ! rlwrap chibi-scheme
;; > (load "file-ops.scm")
;; > (define wap1 (file->string1 "war-and-peace.txt"))
;; file->string1 complete in 0.6330661773681641 seconds.
;; > (define wap2 (file->string2 "war-and-peace.txt"))
;; file->string2 complete in 5.0067901611328125e-06 seconds.
;; > (define wap3 (file->string3 "war-and-peace.txt"))
;; file->string3 complete in 0.48236799240112305 seconds.
;; > (string-length wap1)
;; 3227709
;; > (string-length wap2)
;; 3227709
;; > (string-length wap3)
;; 3227709
;; > (string=? wap1 wap2 wap3)
;; #t

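As a test case, I used a copy of "War and Peace" obtained from Project Gutenberg as the file to read. It consists of 3,227,709 characters, including some Project Gutenberg preamble and closing remarks.

All three procedures produce the same result, but the second consistently reports a much faster execution time. On repeated runs, timings close to the values above were reported. Using a copy of "Moby Dick" as the input file produces a similar timing relationship among the procedures.

I can't see what would account for the timing difference in the second procedure. Is the result bogus?

Is one approach preferable to another for stylistic reasons?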

scheme

1 Answer

Code Review user

Accepted answer

Posted 2023-01-13 06:16:47

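Thank you for the timing investigation. With respect, I don't regard those figures as performant at all. More on that later.

file->string1 uses let* to bind start-time and then (with confusing indentation) computes the read result.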

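file->string2 uses a single let for both bindings. Scheme leaves the evaluation order of let initializers unspecified, so the implementation is permitted to do them in either order. An optimizer might notice that one of them has more dependencies to resolve and should be scheduled first, perhaps to take advantage of overlapping with the other expression's computation. If the file is read before start-time is bound, the timer covers almost none of the work, which would explain the reported five microseconds. The same applies to the inner let that binds result and duration.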

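tl;dr: Benchmarking is hard to get right. Sweat the details.

Let's step back and critique this at a higher level. I confess I'm not happy with reading 3+ million characters and examining each one to ask "is this EOF?", "is this EOF?".

Surely Scheme offers a POSIX-style bulk read primitive? Something like with-input-from-file and then read-string with char-set:full, or perhaps :designated, or even :standard or :printing.

If not, I recommend using the FFI (foreign function interface) to call the C read() function from section 2 of the manual directly.

Let's switch to Python for a moment, a language no one would accuse of being "fast".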

from pathlib import Path
from time import time
import requests

temp = Path("/tmp")

def read_war_and_peace(url="https://www.gutenberg.org/ebooks/2600.txt.utf-8"):
    cache = temp / Path(url).name
    if not cache.exists():
        resp = requests.get(url)
        cache.write_text(resp.text)
    print(cache.stat().st_size)
    with open(cache) as fin:
        return fin.read()


if __name__ == "__main__":
    t0 = time()
    print(len(read_war_and_peace()), time() - t0)

Two things happen here. We read 3_359_372 binary bytes, and then decode them as 3_227_489 code points (characters) of UTF-8 text. Using fin.read() makes this very simple.
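
The gap between the byte count and the code point count can be illustrated with a minimal sketch. The helper name `byte_and_codepoint_counts` and the sample strings are hypothetical, chosen only for illustration. Multi-byte UTF-8 sequences are one contributor to such a gap. Newline translation in text mode may be another, though that is an assumption about this particular file.

```python
def byte_and_codepoint_counts(text):
    """Return (UTF-8 byte count, code point count) for a str."""
    data = text.encode("utf-8")          # the bytes as they would sit on disk
    assert data.decode("utf-8") == text  # decoding round-trips to the same str
    return len(data), len(text)


if __name__ == "__main__":
    # ASCII characters take one byte each; accented letters and curly
    # quotes take two or three bytes, so the byte count grows faster.
    for sample in ("Prince Andrew", "na\u00efve", "\u201cWell, Prince\u201d"):
        print(repr(sample), byte_and_codepoint_counts(sample))
```

For pure ASCII text the two numbers are equal; they diverge as soon as the text contains characters outside ASCII.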

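The observed performance? On a 2.2 GHz (2015) MacBook Air I observed 16 ms of elapsed time, i.e. ~202 MiB/s. Nearly all of that time is spent in C code rather than in bytecode evaluation. Scheme compilers are quite sophisticated, so Scheme applications can be competitive with Rust or C if they use appropriate data structures that let the compiler shine.

I claim that the three functions we see here are fine but not performant. A 30x haircut seems like more than we want to accept. If we want to finish in under 482 ms, we should operate on big chunks rather than one character at a time.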
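
To reproduce a figure like this, the clock should be read immediately before and after the read itself, so that a first-run download is not counted. The following is a rough sketch under that assumption. The names `mib_per_second` and `time_read` are hypothetical, and `perf_counter` is swapped in for `time` because it is better suited to short intervals.

```python
from pathlib import Path
from time import perf_counter


def mib_per_second(num_bytes, seconds):
    """Convert a byte count and an elapsed time into MiB per second."""
    return num_bytes / (1024 * 1024) / seconds


def time_read(path, repeats=5):
    """Return (file size in bytes, best elapsed seconds) for reading path."""
    path = Path(path)
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()            # clock read right before the work
        path.read_text(encoding="utf-8")  # bulk read plus UTF-8 decode
        best = min(best, perf_counter() - start)
    return path.stat().st_size, best


if __name__ == "__main__":
    # Figures quoted above: 3_359_372 bytes in roughly 16 ms.
    print(round(mib_per_second(3_359_372, 0.016), 1))
```

With exactly 16 ms the formula gives about 200 MiB/s, in line with the ~202 MiB/s quoted if the 16 ms figure is rounded.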
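
The difference between the two styles can be sketched in Python. `read_by_char` mirrors the per-character EOF test used by the three Scheme procedures, while `read_by_chunk` asks for a block at a time. Both function names and the 64 KiB chunk size are assumptions made for illustration, not measured recommendations. In R7RS, `(read-string k port)` plays a role similar to `fin.read(n)` here; whether it is fast in a given implementation would need to be measured.

```python
import os
import tempfile


def read_by_char(path):
    """Read a text file one character at a time, testing for EOF each time."""
    chars = []
    with open(path, encoding="utf-8") as fin:
        while True:
            ch = fin.read(1)
            if ch == "":  # read() returns an empty string at end of file
                break
            chars.append(ch)
    return "".join(chars)


def read_by_chunk(path, chunk_size=64 * 1024):
    """Read a text file in blocks of up to chunk_size characters."""
    chunks = []
    with open(path, encoding="utf-8") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # an empty chunk also signals end of file
                break
            chunks.append(chunk)
    return "".join(chunks)


if __name__ == "__main__":
    # Build a small throwaway file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                     delete=False) as tmp:
        tmp.write("a sample line of text\n" * 10_000)
    try:
        assert read_by_char(tmp.name) == read_by_chunk(tmp.name)
    finally:
        os.remove(tmp.name)
```

Both functions return the same string. The chunked version simply makes far fewer calls into the I/O layer, which is the point of the advice above.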

4 votes

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/282549
