首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何将大型数据块解析为Haskell中的内存?

如何将大型数据块解析为Haskell中的内存?
EN

Stack Overflow用户
提问于 2014-03-20 06:37:27
回答 2查看 676关注 0票数 9

经过思考,这整个问题可以归结为更简洁的东西。我在找一个Haskell数据结构

  • 看上去像一张清单
  • 有O(1)查找
  • 具有O(1)元素替换或O(1)元素附加(或预先.如果是这样的话,我可以反向查找索引)。我总是可以用一个或另一个来编写我以后的算法。
  • 内存开销很小

我正在尝试构建一个图像文件解析器。文件格式是您的基本8位颜色ppm文件,虽然我打算支持16位彩色文件和PNG和JPEG文件。现有的Netpbm库尽管有大量的取消装箱注释,但在尝试加载我使用的文件时实际上消耗了所有可用内存:

照片3-10张,最小为45 10,最大为110 10。

现在,我无法理解Netpbm代码中的优化,所以我决定自己尝试一下。这是一种简单的文件格式。

首先,我决定,无论文件格式是什么,我都要以这种格式存储最终未压缩的图像:

代码语言:javascript
复制
import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

然后我编写了一个解析器,它工作在三个向量上,如下所示:

代码语言:javascript
复制
import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

在解析器的末尾,我像这样构造了RGB8:

代码语言:javascript
复制
Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

像这样编写的程序,加载这45 5GB图像中的一个,将消耗5GB或更多的内存。如果我更改了Progress的定义,使redCgreenCblueC都是!(Vector Word8),那么程序将保持在合理的内存范围内,但是加载一个文件所花费的时间太长,以至于我还没有允许它真正完成。最后,如果我用标准列表替换这里的向量,我的内存使用量将恢复到每个文件5GB (我假设.实际上,在我达到这个目标之前,我已经耗尽了空间),加载时间大约是6秒。Ubuntu的预览应用程序一旦启动,就会立即加载和呈现文件。

根据每一次对V/的调用都完全复制向量的理论,我尝试切换到Data.Vector.Unboxed.Mutable,但是.我连打字机都拿不到。文档是不存在的,理解数据类型所做的工作也需要与多个其他库进行斗争。我甚至不知道它是否能解决问题,所以我甚至不愿意尝试。

根本的问题其实是非常简单的:

如何在不使用大量的内存的情况下,快速地读取、保留和操作一个非常大的数据结构?我发现的所有例子都是关于产生暂时的巨大数据,然后尽快删除它。

在原则上,我希望最终的表示是不可变的,但是如果我必须使用可变的结构来达到这个目的,我并不太在意。

为了完整起见,完整的代码(BSD3 3-许可的)在https://bitbucket.org/savannidgerinel/photo-tools的bitbucket上。performance分支包含解析器的严格版本,通过快速更改Codec.Image.NetpbmProgress数据结构可以使解析器不严格。

运行性能测试

代码语言:javascript
复制
ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2014-03-20 15:00:30

首先,我认为只要简单地阅读整块字节串,然后将内容解压缩到未装箱的向量中就足够了。实际上,即使没有神秘的空间泄漏,您发布的解析代码也会相当糟糕:在输入的每一个字节上都复制所有三个向量的全部内容!讨论二次复杂性。

所以我写了如下:

代码语言:javascript
复制
chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)

然后用随机字节的45 Mb文件对其进行测试。我承认,这段代码导致内存使用量达到千兆字节,这让我感到惊讶。我很好奇空间到底是从哪里漏出来的。

不过,可变矢量工作得很好。下面的代码使用133 Mb和Criterion,将其基准为60毫秒的文件读取包括在内。我在评论中加入了一些解释。ST monad上也有大量的资料,所以和其他地方都有可变的向量(不过,我同意库文档对初学者是不友好的)。

代码语言:javascript
复制
import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]
票数 4
EN

Stack Overflow用户

发布于 2014-03-20 15:19:40

下面是一个版本,它直接从磁盘解析文件,而不将任何中间文件加载到内存中:

代码语言:javascript
复制
import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

我在50 MB文件上没有头逻辑进行了测试,它在大致相同的空间中运行。

票数 3
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/22525250

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档