文章/答案/技术大牛

发布

问优化Haskell文本处理
EN

Stack Overflow用户

提问于 2012-09-12 17:02:28

回答 3查看 270关注 0票数 2

我正在用Haskell编写一些简单的字符计数例程，将统计数据存储在一个新的数据类型中：

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
}

我在数百或数千个纯文本文件上运行，每个文件大约50K-100 K。

tabulateFile :: FilePath -> IO Stat
tabulateFile path = do
  putStrLn path
  contents <- L.readFile path
  return $! tabulateText ' ' contents defaultStat

我没有使用折叠左，而是使用原始递归，这样我就可以保留前面的字符。

tabulateText :: Char -> L.ByteString -> Stat -> Stat
tabulateText lastChr bs stat =
  case U.uncons bs of
    Nothing -> stat
    Just (chr, newBs) ->
      tabulateText lchr newBs (countChar lastChr lchr stat)
        where lchr = toLower chr

{-# INLINE countChar #-}
countChar :: Char -> Char -> Stat -> Stat
countChar !lastChr !chr !(Stat stChars stVowels stPairEL stWords) =
  Stat
    (stChars  + 1)
    (stVowels + (countIf $ isVowel chr))
    (stPairEL + (countIf (lastChr == 'e' && chr == 'l')))
    (stWords  + (countIf ((not $ isLetter lastChr) && isLetter chr)))

isVowel :: Char -> Bool
isVowel c = Set.member c vowels

vowels = Set.fromAscList ['a', 'e', 'i', 'o', 'u', ...] -- rest of vowels elided

现在，它的速度是运行cat * | wc的两倍多，但是我的直觉告诉我，文件I/O应该比CPU所需的时间大得多。简单地使用带有热缓存的cat * | wc进程大约20 my /s，但是使用我的Haskell程序(用-O编译)运行的速度还不到10 my/s，即使经过了一些基本的优化。分析告诉我，大部分时间都花在tabulateText和countChar上。

有什么东西我可以在这里优化吗？

编辑:粘贴到http://hpaste.org/74638的完整文件

optimization

haskell

nlp

回答 3

Stack Overflow用户

发布于 2012-09-12 17:13:52

您应该提供导入，以便有人可以编译代码。然而，这里有几件事情看起来很有可能：

使用-O2 -funbox-strict-fields编译(以获得严格字段的好处)
tabulateText在lastChr和stat中应该严格
Set.member似乎是进行等式比较的一种非常昂贵的方法。用跳台。

例如。

isSpaceChar8 :: Char -> Bool
isSpaceChar8 c =
    c == ' '     ||
    c == '\t'    ||
    c == '\n'    ||
    c == '\r'    ||
    c == '\f'    ||
    c == '\v'    ||
    c == '\xa0'

这将很好地内联和优化。

不确定countIf做什么，但它失去了坏。我怀疑这是一个if，而您返回0？不如：

Stat
   (a + 1)
   (if isVowel c then a + 1 else a)
   ...

然后看看核心。

票数 10

Stack Overflow用户

发布于 2012-09-12 18:28:52

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as U
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Char
import Control.Applicative 

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
} deriving Show
defaultStat = Stat 0 0 0 0

{-# INLINE  tabulateFile #-}
tabulateFile :: FilePath -> IO Stat
tabulateFile path = newTabulate <$> L.readFile path

{-#  INLINE newTabulate #-}
newTabulate :: L.ByteString -> Stat 
newTabulate = snd . U.foldl' countIt (' ',defaultStat) 
    where 
        {-#  INLINE countIt #-}
        countIt :: (Char,Stat) -> Char -> (Char,Stat)
        countIt (!lastChr,!Stat stChars stVowels stPairEL stWords) !chr = 
            (chr,Stat
                (stChars  + 1)
                (if isVowel chr then stVowels + 1 else stVowels)
                (if (lastChr == 'e' && chr == 'l') then stPairEL + 1 else stPairEL)
                (if ((isSpace lastChr) && isLetter chr) then stWords+1 else stWords))

{-# INLINE isVowel #-}
isVowel :: Char -> Bool
isVowel c = 
    c == 'e' ||
    c == 'a' ||
    c == 'i' ||
    c == 'o' ||
    c == 'u' 



main:: IO ()
main = do 
    stat <- tabulateFile "./test.txt"
    print stat

Don建议的大多数优化都包括使用有效的foldl‘。它的性能比cat + wc稍慢，但是由于您正在做更多的计算，所以它是可以的。我还没有在很大的文件上测试过它，但是它应该可以与cat + wc相媲美。

使用-O2 -funbox-strict-fields编译以获得优化的代码。

在查看内核并查看是否有可能进行更多优化之后，我将对其进行更多的更改。优化的一个可能点是在构造stat时在构造函数中设置if条件，例如，如果chr是元音，那么它已经是一个letter，所以您不需要在stWords等中使用另一个条件，但是这会使您的代码崩溃，但是您可以尝试看看它是否对大型文件确实有帮助。

票数 4

Stack Overflow用户

发布于 2012-09-16 19:42:09

在测试了其他替代方案之后，似乎高CPU使用率主要是因为我使用的是Data.ByteString.Lazy.UTF8。我修改了tabulateText中的数据结构，使foldl在UTF8 ByteString上使用，从而使运行时减少了一点点。

因此，我在文件上并行化了程序，有时在我的机器上可以得到7倍的加速比。

我第一次用一个tabulateFile包unsafePerformIO

unsafeTabulateFile :: FilePath -> Stat
unsafeTabulateFile f =
  unsafePerformIO $ tabulateFile f

然后用Control.Parallel.Strategies做parMap rseq unsafeTabulateFile files。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/12393088

复制

相似问题

问优化Haskell文本处理
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问优化Haskell文本处理EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问优化Haskell文本处理
EN