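I'm new to Haskell (and to FP and lazy evaluation as well). I'm trying to write a log analyzer, but at the moment it allocates 4 GB of memory, so it crashes even on logs as small as 90 MB.
I stripped the program down so that it only collects the referring pages, which are usually just a handful. I also store them in a ternary trie (since most URLs share common prefixes), so they shouldn't take up that much memory.
So I would expect the program to need only a few MB of memory to store those referrers, not this much.
I think the culprit is the readStats function in the main file below: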
    -- main.hs
    import Record
    import Output
    import Stats
    import System.Environment
    import Data.List
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as C8

    readStats :: String -> IO Stats
    readStats p = do
        f <- B.readFile p
        return $ foldl'
            (\t l -> applyEither t (parseLogLine l))
            emptyStats
            (C8.lines f)
      where applyEither t (Right rec) = applyRecord t rec
            applyEither t (Left err) = applyError t err

    main :: IO ()
    main = do
        args <- getArgs
        stats <- readStats $ head args
        putStrLn $ page stats

I was thinking that, because I bind the result of B.readFile to f, the whole file is kept in memory as [Char], which I guess takes even more memory because of the pointers.
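How can I make the GC collect lines from f as soon as parseLogLine has parsed them?
Also, I would really appreciate any suggestions about structure and coding style, since I'm new to Haskell.
Thanks.
Edit: here are the other functions/structures: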
Trie:
    data Trie a = Node Char (Trie a) (Trie a) (Trie a) (Maybe a)
                | Empty deriving (Show, Eq)

    sanify :: Trie a -> Trie a
    sanify (Node _ Empty Empty Empty Nothing) = Empty
    sanify (Node _ Empty lo Empty Nothing) = lo
    sanify (Node _ Empty Empty hi Nothing) = hi
    sanify t = t

    update :: Trie a -> String -> (Maybe a -> Maybe a) -> Trie a
    update _ [] _ = error "Can not insert an empty string to a Trie"
    update Empty (x:[]) f = sanify $ Node x Empty Empty Empty (f Nothing)
    update Empty (x:xs) f = sanify $ Node x (update Empty xs f) Empty Empty Nothing
    update (Node c eq lo hi val) xss@(x:xs) f =
        case x `compare` c of
            LT -> sanify $ Node c eq (update lo xss f) hi val
            GT -> sanify $ Node c eq lo (update hi xss f) val
            EQ -> case xs of
                [] -> sanify $ Node c eq lo hi (f val)
                _ -> sanify $ Node c (update eq xs f) lo hi val

Record:
    import Network.URL

    data Record = Record {
        ip :: IP,
        date :: UTCTime,
        method :: Method,
        path :: URL,
        referer :: Maybe URL,
        status :: Integer,
        userAgent :: String
        } deriving (Show, Eq)

    parseRecord :: Parser Record
    parseRecord = do
        ip <- parseIP
        P8.skipWhile (/= '[')
        date <- parseDate
        P.string (B8.pack " \"")
        method <- P8.takeWhile (/= ' ')
        .....

    data LogError = LogError {msg :: String, line :: B8.ByteString}

    parseLogLine :: B8.ByteString -> Either LogError Record
    parseLogLine line = case parseOnly parseRecord line of
        Right a -> Right a
        Left msg -> Left $ LogError msg line

Stats:
    type StringCounter = T.Trie Int

    increment :: StringCounter -> String -> StringCounter
    increment t s = T.update t s incNode
      where incNode n = case n of
                Nothing -> Just 1
                Just i -> Just (i+1)

    sortCounter :: StringCounter -> [(String, Int)]
    sortCounter = sortWith (negate.snd) . T.toList

    data Stats = Stats {
        paths :: StringCounter,
        referers :: StringCounter,
        errors :: [LogError]
        }

    emptyStats :: Stats
    emptyStats = Stats T.Empty T.Empty []

    buildStats :: [Record] -> Stats
    buildStats = foldl' applyRecord emptyStats

    applyRecord :: Stats -> Record -> Stats
    applyRecord env rec = env {
        paths = increment (paths env) (exportURL $ path rec),
        referers = case referer rec of
            Nothing -> referers env
            Just ref -> increment (referers env) (exportURL $ stripParams ref)
        }

    applyError :: Stats -> LogError -> Stats
    applyError env err = env { errors = err : errors env }

Posted on 2014-02-03 03:53:03
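I haven't really looked at your code, but here is one piece of general advice: use pipes, Luke. For processing streams of data, such as log streams, they are really great. Most importantly, they let you run in O(1) space. Don't mess with lazy IO such as readFile; it is meant for throwaway code.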
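As a rough illustration of that advice, here is a minimal sketch of a streaming count, assuming the pipes and containers packages are available. It is not the asker's program: Counter, countLine, countKeys, countFile and demoCounts are hypothetical names, Data.Map.Strict stands in for the Trie, and taking the first space-delimited field of each line stands in for parseLogLine. Only P.fromHandle and P.fold come from Pipes.Prelude.

```haskell
import qualified Data.Map.Strict as M
import Pipes (Producer, each)
import qualified Pipes.Prelude as P
import System.IO (IOMode (ReadMode), withFile)

-- Hypothetical accumulator: how often each key was seen.
-- Data.Map.Strict evaluates the counts as they are inserted,
-- so no chain of (+1) thunks builds up.
type Counter = M.Map String Int

-- Hypothetical stand-in for parsing a log line: the key is the
-- text up to the first space.
countLine :: Counter -> String -> Counter
countLine acc l = M.insertWith (+) key 1 acc
  where
    key = takeWhile (/= ' ') l

-- Strict left fold over any producer of lines. P.fold forces the
-- accumulator at every step, so each line can be collected once
-- it has been counted.
countKeys :: Monad m => Producer String m () -> m Counter
countKeys = P.fold countLine M.empty id

-- Stream a file line by line instead of loading it whole.
-- The fold finishes before withFile closes the handle.
countFile :: FilePath -> IO Counter
countFile path = withFile path ReadMode $ \h -> countKeys (P.fromHandle h)

-- Small in-memory demo that needs no file.
demoCounts :: IO Counter
demoCounts = countKeys (each ["/index.html 200", "/about.html 200", "/index.html 304"])
```

Loaded into GHCi, `demoCounts >>= print` shows the counts for the three sample lines, and `countFile "access.log"` (a made-up file name) does the same for a file on disk. The same shape should carry over to the real Stats type, provided the fold step leaves no unevaluated thunks in the record fields.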
https://stackoverflow.com/questions/21513869