首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Haskell:在tree+zipper构造中避免堆溢出

Haskell:在tree+zipper构造中避免堆溢出
EN

Stack Overflow用户
提问于 2014-10-29 22:37:06
回答 1查看 139关注 0票数 0

我正在尝试实现一个简单的词典压缩算法,它使用确定性有限自动机作为数据结构(实际上它是确定性非循环有限状态自动机,请参阅Wikipedia entry)。当我对一个大型词典数据库运行该程序时(我有两个数据集--一个包含大约900.000个唯一的单词,另一个包含大约4.000.000个唯一的单词),我得到一个堆溢出:

代码语言:javascript
复制
mindfa.exe: Heap exhausted;
Current maximum heap size is 1073741824 bytes (1024 MB);
use `+RTS -M<size>' to increase it.
   6,881,239,544 bytes allocated in the heap
   4,106,345,528 bytes copied during GC
   1,056,362,696 bytes maximum residency (96 sample(s))
       6,884,200 bytes maximum slop
            1047 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     13140 colls,     0 par    2.14s    2.13s     0.0002s    0.0019s
  Gen  1        96 colls,     0 par   197.37s   199.06s     2.0736s    3.3260s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.54s  ( 12.23s elapsed)
  GC      time  190.09s  (191.68s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    9.42s  (  9.51s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  202.05s  (203.91s elapsed)

  %GC     time      94.1%  (94.0% elapsed)

  Alloc rate    2,706,148,904 bytes per MUT second

  Productivity   1.3% of total user, 1.2% of total elapsed  

我猜测其中一个问题是addWordaddWords函数中的懒惰。

代码语言:javascript
复制
-- | Update the tree structure, starting from the current location.
addWord :: Zipper TnLabel -> B.ByteString -> Zipper TnLabel
addWord z s | B.null s = z
addWord (Zipper (DFA ts) parents) s = addWord z rest 
    where
        ch = B.head s
        rest = B.tail s
        pack defaultFlag = packTransitionLabel ch (if B.null rest then bit bitWordStop else defaultFlag)
        z = case break (\(w,_) -> getCh w == ch) ts of
            (_, []) -> Zipper
                { _focus = DFA []
                , _parents = (pack 0, [], ts) : parents
                }
            (left, (w, newFocus):right) -> Zipper
                { _focus = newFocus
                , _parents = ((pack w), left, right) : parents
                }

-- | Add a list of words to the DFA tree.
addWords :: Zipper TnLabel -> [B.ByteString] -> Zipper TnLabel
addWords z [] = z
addWords z (s:ss) = addWords z' ss
    where
        z' = addWord (root z) s

我读过关于seq$!!的文章,但仍然不明白如何在我的示例中使用它们。如何使代码严格(Er)?另一方面,也许我使用了错误的数据结构(树+拉链)?

下面是我正在做的一个简短的、自包含的、正确的(可编译的)示例。当你运行它时,它应该打印出状态的数量,转换的数量和整个DFA树,如下所示:

代码语言:javascript
复制
Lexicon
        State# 16
        Transition# 21
*
|
b--*
   |
   e--*
   |  |
   |  d!-*
   |     |
   |     s!-*
   |     |
   |     d--*
   |        |
   |        i--*
   |        |  |
   |        |  n--*
   |        |     |
   |        |     g!-*
   |        |
   |        e--*
   |           |
   |           d!-*
   |
   a--*
      |
      d!-*
         |
         n--*
         |  |
         |  e--*
         |     |
         |     s--*
         |        |
         |        s!-*
         |
         l--*
         |  |
         |  y!-*
         |
         a--*
            |
            s--*
               |
               s!-*

代码:

代码语言:javascript
复制
{-# LANGUAGE FlexibleInstances, OverloadedStrings #-}
module Main (main) where 

import Prelude hiding (print)
import qualified Data.ByteString.Lazy as B hiding (unpack)
import qualified Data.ByteString.Lazy.Char8 as B (unpack) 
import Data.Word (Word8, Word16)
import Data.Bits ((.|.), (.&.), bit, complement, testBit)
import Foreign.Storable (sizeOf)
import Text.Printf hiding (fromChar, toChar)

--------------------------------------------- Deterministic finite automaton
type TnLabel = Word16

bitsInWord :: Int
bitsInWord = sizeOf (0::TnLabel) * 8

bitWordStop :: Int
bitWordStop = bitsInWord-1      -- ^ marks the end of a word

packTransitionLabel :: Word8 -> TnLabel -> TnLabel
packTransitionLabel ch flags = (flags .&. complement 0xFF) .|. fromIntegral ch

getCh :: TnLabel -> Word8
getCh w = fromIntegral $ w .&. 0xFF

type Transition e = (e, DFA e)

data DFA e = DFA [Transition e]
  deriving (Show, Eq)

-- DFA Zipper ----------------------------------------------------------------- 
data Zipper e = Zipper 
    { _focus :: DFA e
    , _parents :: [(e, [Transition e], [Transition e])]
    }
    deriving (Show)

-- Moving around ---------------------------------------------------------------
-- | The parent of the given location.
parent :: Zipper TnLabel -> Maybe (Zipper TnLabel)
parent (Zipper _ []) = Nothing
parent (Zipper focus ((event, left, right):parents)) = Just Zipper
    { _focus = DFA $ left++((event,focus):right)
    , _parents = parents
    }

-- | The top-most parent of the given location.
root :: Zipper TnLabel -> Zipper TnLabel
root z@(Zipper _ []) = z
root z = case parent z of
    Nothing -> z
    Just z2 -> root z2

-- Modification -----------------------------------------------------------------
-- | Update the tree structure, starting from the current location.
addWord :: Zipper TnLabel -> B.ByteString -> Zipper TnLabel
addWord z s | B.null s = z
addWord (Zipper (DFA ts) parents) s = addWord z rest 
    where
        ch = B.head s
        rest = B.tail s
        pack defaultFlag = packTransitionLabel ch (if B.null rest then bit bitWordStop else defaultFlag)
        z = case break (\(w,_) -> getCh w == ch) ts of
            (_, []) -> Zipper
                { _focus = DFA []
                , _parents = (pack 0, [], ts) : parents
                }
            (left, (w, newFocus):right) -> Zipper
                { _focus = newFocus
                , _parents = ((pack w), left, right) : parents
                }

-- | Add a list of words to the DFA tree.
addWords :: Zipper TnLabel -> [B.ByteString] -> Zipper TnLabel
addWords z [] = z
addWords z (s:ss) = addWords z' ss
    where
        z' = addWord (root z) s

-- Conversion ------------------------------------------------------------
empty :: Zipper TnLabel
empty = Zipper 
    { _focus = DFA []
    , _parents = []
    }

toDFA :: Zipper TnLabel -> DFA TnLabel
toDFA (Zipper dfa _) = dfa

fromList :: [B.ByteString] -> DFA TnLabel
fromList = toDFA . root . addWords empty

-- Stats ------------------------------------------------------------------
-- | Number of states in the whole DFA tree.
stateCount :: DFA TnLabel -> Int
stateCount = go 0
    where
        go acc (DFA []) = acc
        go acc (DFA ts) = go' (acc+1) ts
        go' acc [] = acc
        go' acc ((_,dfa):ts) = go 0 dfa + go' acc ts

-- | Number of transitions in the whole DFA tree.
transitionCount :: DFA TnLabel -> Int
transitionCount = go 0
    where
        go acc (DFA []) = acc
        go acc (DFA ts) = go' acc ts
        go' acc [] = acc
        go' acc ((_,dfa):ts) = go 1 dfa + go' acc ts

-- DFA drawing --------------------------------------------------------- 
draw' :: DFA TnLabel -> [String]
draw' (DFA ts) = "*" : drawSubTrees ts
    where 
        drawSubTrees [] = []
        drawSubTrees [(w, node)] = "|" : shift (toChar w : flagCh w : "-") "   " (draw' node)
        drawSubTrees ((w, node):xs) = "|" : shift (toChar w : flagCh w : "-") "|  " (draw' node) ++ drawSubTrees xs
        shift first other = zipWith (++) (first : repeat other)
        flagCh flags = if testBit flags bitWordStop then '!' else '-'
        toChar w = head . B.unpack . B.singleton $ getCh w

draw :: DFA TnLabel -> String
draw = unlines . draw'

print :: DFA TnLabel -> IO ()
print = putStr . draw

-- Main -----------------------------------------------------------------
main :: IO ()
main = do
    let dfa = fromList ["bad", "badass", "badly", "badness", "bed", "bedded", "bedding", "beds"]
    printf "Lexicon\n"
    printf "\tState# %d\n" (stateCount dfa)
    printf "\tTransition# %d\n" (transitionCount dfa)
    print dfa
EN

回答 1

Stack Overflow用户

发布于 2014-10-30 06:43:53

您可能不需要执行以下任何操作;您尝试过-O-O2吗?GHC优化包括一个“严格性分析器”,它通常可以为您筛选其中的一些东西。

无论如何,懒惰看起来肯定是一个可能的罪魁祸首,您首先要做的就是通过用!前缀注释字段来严格数据结构。例如,Data.IntMap中的IntMap类型就是:

代码语言:javascript
复制
type Prefix = Int
type Mask   = Int
type Key    = Int
data IntMap a = Bin {-# UNPACK #-} !Prefix
                    {-# UNPACK #-} !Mask
                    !(IntMap a)
                    !(IntMap a)
              | Tip {-# UNPACK #-} !Key a
              | Nil

"unpack“编译指示告诉GHC将整数直接存储在BinTip中,而不是作为指向堆上对象的指针;!告诉GHC立即执行将它们转换为真正整数的数学操作;而操作前缀、掩码和键的函数最终是{-# INLINE ... #-}编译指示的主题,它会说,”嘿,这不是递归的“,从而将这些操作简化为前置数学函数。

您可能会惊讶地发现,这段代码实际上是在懒惰和严格的IntMap用例之间共享的。!(IntMap a)只保证树的结构(以及它的键、前缀和掩码)是严格的,但是它仍然包含计算它的叶元素的承诺,如果它们还没有被计算的话。在您的示例中,这样做是不必要的(因为您没有在节点中存储任何信息),但是在Data.IntMap.Strict中是通过使用seq填充操作节点的函数来完成的

代码语言:javascript
复制
insert :: Key -> a -> IntMap a -> IntMap a
insert k x t = k `seq` x `seq`
case t of
    ...

阅读更多关于strictness on the wiki的内容。

票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/26633143

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档