首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >R7RS方案中的Trie数据结构

R7RS方案中的Trie数据结构
EN

Code Review用户
提问于 2023-03-30 15:13:12
回答 2查看 51关注 0票数 0

从R6RS计划过渡到R7RS是非常有教育意义和乐趣的。赤壁计划是我最喜欢的,因为它密切符合标准。

在启动一个新项目时,我希望重用我在Chez (R6RS)中开发的Trie数据过程库。一开始,这种转变是无痛的。然而,当开始在一些单元测试上工作时,性能并不好。在一本大型(约172,000本)单词词典中阅读并对它们进行计数的测试花费了数百倍的时间。这基本上是即时在Chez,增加了一个几乎不明显的延迟开始的测试。

在方便的R7RS发行版中运行,执行时间要短得多。

下面是一个小例子。这是一个大大简化的原始库版本。它加载一本词典,并对字典中的单词进行计数。(用于构建字典的文件来自:https://www.wordgamedictionary.com/enable/)

代码语言:javascript
复制
;;;
;;; trie-problem.scm -- illustrate performance problem with trie library.
;;;

(import (scheme base)
        (scheme file)
        (scheme write)
        ;; Depending on Scheme implementation used, you may have to 
        ;; un-comment one of the following.
        ;;(chibi time)
        ;;(gauche time)
        )

(define alphabet-size 256)

;; In a trie, each node consists of an "end-of-word" marker and
;; and a vector of pointers to child nodes.
(define-record-type trie
  (mk-trie c end-of-word? children)
  trie?
  (c trie-c trie-c-set!)
  (end-of-word? trie-end-of-word? trie-end-of-word-set!)
  (children trie-children trie-children-set!))

;; Return a new, initialized trie node.
(define (trie-new-node)
  (let ((node (mk-trie #\null #f (make-vector alphabet-size '()))))
    node))

;; Return the child node associated with the index into the child vector.
(define (child-for-index node idx)
  (vector-ref (trie-children node) idx))

;; Return the child node associated with the character in the child vector.
(define (child-for-char node c)
  (child-for-index node (char->integer c)))

;; Add a word to the trie node.
(define (trie-add-word! node word)
  (let loop ((crawler node)
             (lst  (string->list word)))
    (if (null? lst)
        (trie-end-of-word-set! crawler #t)
        (let* ((c (car lst))
               (idx (char->integer c)))
          (when (null? (child-for-index crawler idx))
            (let ((new-node (trie-new-node)))
              (trie-c-set! new-node c)
              (vector-set! (trie-children crawler) idx new-node)))
          (loop (child-for-index crawler idx) (cdr lst))))))

;; Convenience function to convert a vector index to the corresponding
;; letter and append it to the string.
(define (append-char-for-index s i)
  (string-append s (string (integer->char i))))

;; Do an in-order traversal of the tree and, for each complete
;; word, invoke the callback function with the word.
(define (trie-process-helper node s callback)
  (when (trie-end-of-word? node)
    (callback s))
  (let loop ((idx 0))
    (when (< idx alphabet-size)
      (let ((child (child-for-index node idx)))
        (when (not (null? child))
          (trie-process-helper child (append-char-for-index s idx) callback))
        (loop (+ 1 idx))))))

;; Return the number of words stored in the trie.
(define (trie-count-words node)
  (let* ((cntr 0)
         (count-fn (lambda (s) (set! cntr (+ 1 cntr)))))
    (trie-process-helper node "" count-fn)
    cntr))

;; A dedicated word-counting procedure. Uses no "helper" function
;; and does not contain code to generalize to additional abilities.
(define (faster-counter node)
  (letrec ((word-count 0)
           (cf (lambda (nd)
                 (when (trie-end-of-word? nd)
                   (set! word-count (+ 1 word-count)))
                 (let loop ((idx 0))
                   (when (< idx alphabet-size)
                     (let ((child (child-for-index nd idx)))
                       (when (not (null? child))
                         (cf child))
                       (loop (+ 1 idx))))))))
    (cf node)
    word-count))

;; Return an ordered list of lines in the named file.
(define (readlines filename)
  (call-with-input-file filename
    (lambda (p)
      (let loop ((line (read-line p))
                 (result '()))
        (if (eof-object? line)
            result
            ;;(reverse result)
            (loop (read-line p) (cons line result)))))))

(let ((root (trie-new-node)))

  ;; Open the dictionary file, containing one word per line, and
  ;; build a trie from the contents.
  (define (trie-build-dictionary! word-list-file-name root-node)
    (display "Enter 'trie-build-dictionary!") (newline)
    (with-input-from-file word-list-file-name
      ;; Process the file by adding words from the
      ;; file, one word per line, to the trie root-node.
      (lambda ()
        (let loop ((line (read-line (current-input-port))))
          (when (not (eof-object? line))
            (trie-add-word! root-node line)
            (loop (read-line (current-input-port)))))))
    (display "Exit 'trie-build-dictionary!") (newline))

  (time "Build dictionary"
        ;; Use the small dictionary file for debugging.
        ;;(trie-build-dictionary! "./word-lists/subset_list.txt" root))
        (trie-build-dictionary! "./word-lists/ENABLE_word_list.txt" root))

  (display "Done loading dictionary. Counting...") (newline)
  (time "Counting words"
        (begin
          (display "Words in trie: ") (display (trie-count-words root))
          (newline)))

    (time "Fast counting words"
        (begin
          (display "Fast count of words: ")
          (display (faster-counter root)) (newline))))

正如我提到的,运行这段代码几乎是瞬间的,它是用Chez编写的程序的启动例程的一部分。但在赤壁我得到了:

代码语言:javascript
复制
;; Run with chibi
➜  david in trees chibi-scheme trie-problem.scm
Enter 'trie-build-dictionary!
Exit 'trie-build-dictionary!
Build dictionary: 356168 ms
Done loading dictionary. Counting...
Words in trie: 172820
Counting words: 238451 ms
Fast count of words: 172820
Fast counting words: 30125 ms

(2017年iMac Pro运行macOS 13.3 x86)

我已经看到,Chibi的开发人员不像开发工作的其他方面那样关心性能,但我仍然对这种差异感到惊讶。六分钟来加载文件并构建基于trie的字典。用四分钟数字典中的字数。

此代码使用trie-count-words过程进行计数。与完整库中的其他几个过程一样,trie-count-words将真正的工作推送给了“助手”trie-process-helper。这个过程需要两个参数来指定它的行为。在这种情况下,只使用一个。当找到一个单词时,它只增加一个计数器。

考虑到广义处理例程可能不是最优的,我添加了第二个专门用于计数的过程,faster-counter。结果显示速度显著提高,但仍需30秒进行计数。

也许赤壁不是最好的选择?因此,我尝试使用其他几个R7RS实现,Gauche方案和小鸡方案。以下是一些结果。

代码语言:javascript
复制
;; Run with the Gauche interpreter
➜  david in trees gosh -r7 trie-problem.scm
Enter 'trie-build-dictionary!
Exit 'trie-build-dictionary!
;(time "Build dictionary" (trie-build-dictionary! "./word-lists/ENABLE_w"...
; real   1.554
; user   2.080
; sys    0.460
Done loading dictionary. Counting...
Words in trie: 172820
;(time "Counting words" (begin (display "Words in trie: ") (display (tri ...
; real   8.902
; user   8.880
; sys    0.030
Fast count of words: 172820
;(time "Fast counting words" (begin (display "Fast count of words: ") (d ...
; real   8.750
; user   8.750
; sys    0.000
代码语言:javascript
复制
;; Run with Chicken interpreter
➜  david in trees csi -R r7rs -q -batch trie-problem.scm
Enter 'trie-build-dictionary!
Exit 'trie-build-dictionary!
5.032s CPU time, 2.217s GC time (major), 1866429/163502 mutations (total/tracked), 12/29010 GCs (major/minor), maximum live heap: 775.96 MiB
Done loading dictionary. Counting...
Words in trie: 172820
57.799s CPU time, 0.182s GC time (major), 16193729/2250046 mutations (total/tracked), 1/786642 GCs (major/minor), maximum live heap: 775.95 MiB
Fast count of words: 172820
56.73s CPU time, 0.184s GC time (major), 15785834/2246632 mutations (total/tracked), 1/784933 GCs (major/minor), maximum live heap: 775.95 MiB
代码语言:javascript
复制
;; Run with Chicken compiler
➜  david in trees csc -R r7rs trie-problem.scm
➜  david in trees ./trie-problem
Enter 'trie-build-dictionary!
Exit 'trie-build-dictionary!
2.584s CPU time, 1.872s GC time (major), 387893/25139 mutations (total/tracked), 10/3388 GCs (major/minor), maximum live heap: 775.67 MiB
Done loading dictionary. Counting...
Words in trie: 172820
4.948s CPU time, 387884/135 mutations (total/tracked), 0/73428 GCs (major/minor), maximum live heap: 775.67 MiB
Fast count of words: 172820
4.773s CPU time, 4/0 mutations (total/tracked), 0/68512 GCs (major/minor), maximum live heap: 337.52 KiB

再次,好得多,但仍然不是很好的Chez。

不过也有一些奇怪的地方。“更快”的计票程序几乎没有什么区别。(也许更好的优化者?)此外,计数过程比加载和构建字典花费更长的时间。这似乎不对。

问题是,是否有更好的方法来做到这一点?我没有想到的更快的方法?

谢谢你的帮助。

EN

回答 2

Code Review用户

发布于 2023-03-30 16:08:30

你想知道慢下来是怎么来的。

代码语言:javascript
复制
                     (let ((child (child-for-index nd idx)))
                       (when (not (null? child))
                         (cf child))
                       (loop (+ 1 idx))))))))

我们能不能把那个(incf idx)移开,这样它就不会在尽头了?也许通过使idx成为递归cf调用的一个参数?

我推测您的R6编译器巧妙地找到了一些代码动作来启用TCO,而R7编译器却没有跳过太多的循环。

检查生成的程序集代码将确认或拒绝这一假设。

发布每行CPU分析的比较也会有帮助。

tl;dr:“慢”代码似乎是递归的,而不是尾递归的。

票数 1
EN

Code Review用户

发布于 2023-04-06 15:58:16

在审查了这个问题并提出了更多的问题后,我想到Chez实现的时间并没有显示出来。我试着用Chez写一个问题的版本来收集这些时间。

代码语言:javascript
复制
;; Run with Chez
➜  david in trees chez --script trie-problem.ss
Building dictionary...
(time (trie-build-dictionary! "./word-lists/ENABLE_word_list.txt" ...))
    101 collections
    1.337951594s elapsed cpu time, including 1.218552223s collecting
    1.337968000s elapsed real time, including 1.218811000s collecting
    847145312 bytes allocated, including 40059392 bytes reclaimed

Counting words...
Words in trie: 172820
(time (begin (display "Words in trie: ") ...))
    3 collections
    0.561774181s elapsed cpu time, including 0.008546659s collecting
    0.561770000s elapsed real time, including 0.008564000s collecting
    24809104 bytes allocated, including 23145280 bytes reclaimed

这比预期的要慢一些。部分原因是用于示例程序的alphabet-size是最初用Chez编写的程序的两倍大。所以,有一些表演。

最初实现的Trie数据结构的一个问题是,它占用了大量内存,其中很多内存通常是空的。而指定字母表大小的需要是丑陋的。

因此,我使用SRFI 69中的哈希表重新实现了完整的库(以及这里显示的示例程序)来存储子库。因此,Trie中使用的字母表没有硬连线限制,所使用的内存取决于所存储的字符,而不是用于保存指向孩子的指针的固定长度向量。

这里是原始问题中提供的示例的一个版本,但是使用哈希表而不是向量来存储指向当前节点的子节点的指针。

代码语言:javascript
复制
;;;
;;; try-hash-problem.scm -- illustrate performance problem with trie library.
;;; This version is re-worked to use a hash-table as the basis for the
;;; trie children rather than a vector. Also eliminates the need to
;;; hard-wire an alphabet size into the implementation.
;;;

(import (scheme base)
        (scheme file)
        (scheme write)
        (srfi 69))

(cond-expand
 (gauche (import (gauche time)))
 (chibi (import (chibi time)))
 (else))

;; In a trie, each node consists of an "end-of-word" marker and
;; a hash-table of pointers to child nodes.
(define-record-type trie
  (mk-trie c end-of-word? children)
  trie?
  (c trie-c trie-c-set!)
  (end-of-word? trie-end-of-word? trie-end-of-word-set!)
  (children trie-children trie-children-set!))

;; Return a new, initialized trie node.
(define (trie-new-node)
  (let ((node (mk-trie #\null #f (make-hash-table))))
    node))

;; Add a word to the trie.
(define (trie-add-word! trie word)
  (let loop ((node trie)
             (chars (string->list word)))
    (cond ((null? chars)
           (trie-end-of-word-set! node #t))
          (else
           (let* ((c (car chars))
                  (children (trie-children node))
                  (child (hash-table-ref/default children c #f)))
             (unless child
               (set! child (trie-new-node))
               (hash-table-set! children c child))
             (loop child (cdr chars)))))))

;; Return the number of words stored in the trie.
(define (trie-count-words trie)
  (let ((children (trie-children trie)))
    (hash-table-fold
      children
      (lambda (k v count)
        (+ count (trie-count-words v)))
      (if (trie-end-of-word? trie) 1 0))))

;; Open the dictionary file, containing one word per line, and
;; build a trie from the contents.
(define (trie-build-dictionary! word-list-file-name root-node)
  (with-input-from-file word-list-file-name
    ;; Process the file by adding words from the
    ;; file, one word per line, to the trie root-node.
    (lambda ()
      (let loop ((line (read-line (current-input-port))))
        (when (not (eof-object? line))
          (trie-add-word! root-node line)
          (loop (read-line (current-input-port))))))))

;; Do some timing on building a trie from a file of words, then
;; counting the words in the trie.
(let ((root (trie-new-node)))

  (display "Building dictionary...") (newline)
  (time
   ;; Use the small dictionary file for debugging.
   ;;(trie-build-dictionary! "./word-lists/subset_list.txt" root))
   (trie-build-dictionary! "./word-lists/ENABLE_word_list.txt" root))

  (display "\nCounting words...") (newline)
  (time
   (begin
     (display "Words in trie: ")
     (display (trie-count-words root))
     (newline))))

所需的更改几乎是微不足道的,甚至在更完整的库版本(完整库中更多的过程)中也是如此。

运行这一计划显示,所收集的时间有了很大的改善。在测试的实现中,Gauche解释器报告了最佳性能:

代码语言:javascript
复制
;; Run with Gauche
➜  david in trees gosh -r7 try-hash-problem.scm
Building dictionary...
;(time (trie-build-dictionary! "./word-lists/ENABLE_word_list.txt" root))
; real   0.705
; user   1.890
; sys    0.060

Counting words...
Words in trie: 172820
;(time (begin (display "Words in trie: ") (display (trie-count-words roo ...
; real   0.258
; user   0.590
; sys    0.010

例子中的问题也是行数减少。

还请注意,trie-count-words过程已被重新处理以使用一种类型的折叠,即SRFI中的hash-table-fold

有了这些更改,一些实现的时间就进入了范围,从而使库再次有用。

票数 0
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/284236

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档