文章/答案/技术大牛

发布

问并行文件匹配，Python
EN

Stack Overflow用户

提问于 2011-10-02 05:56:43

回答 4查看 5.1K关注 0票数 4

我正在尝试改进一个扫描恶意代码文件的脚本。我们在一个文件中有一个正则表达式模式列表，每行一个模式。这些正则表达式是针对grep的，因为我们当前的实现基本上是一个bash script find\grep组合。bash脚本在我的基准目录上花费了358秒。我能够写一个python脚本，在72秒内完成这项工作，但我想要改进得更多。首先，我将发布基础代码，然后对我尝试过的代码进行调整：

import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
    self.queue = queue
    self.dir = dir
    threading.Thread.__init__(self)

    def run(self):
    self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self,  rootDir):
      for root, subFolders, files in os.walk(rootDir):
    for file in files:
       self.queue.put(os.path.join(root,file))
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
    self.queue = queue
    self.patterns = patterns
    threading.Thread.__init__(self)

    def run(self):
    nextFile = self.queue.get()
    while nextFile is not -1:
       #print "Trying " + nextFile
       self.scanFile(nextFile)
       nextFile = self.queue.get()


    #HELPER FUNCTION FOR INTERNAL UES ONLY
    def scanFile(self, file):
       fp = open(file)
       contents = fp.read()
       i=0
       #for patt in self.patterns:
       if self.patterns.search(contents):
      print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################


fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
   #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
   giantRE = giantRE + line.rstrip() + '|'

giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()

print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
   scanner = Scanner(fileQueue, giantRE)
   scanner.start()

这显然是调试\丑陋的代码，不要介意百万queue.put(-1)，我稍后会清理它。有些缩进没有正确显示，特别是在scanFile中。

不管怎样，我注意到了一些事情。使用1、4甚至8个线程(对于xrange(0，？？)：中的scanner )都没有区别。不管怎样，我仍然有72秒的时间。我认为这要归功于python的GIL。

与创建一个巨大的正则表达式相反，我尝试将每一行(模式)作为一个编译表达式RE放在一个列表中，并在我的scanfile函数中迭代这个列表。这导致了更长的执行时间。

为了避免python的GIL，我尝试让每个线程分支指向grep，如下所示：

#HELPER FUNCTION FOR INTERNAL UES ONLY
def scanFile(self, file):
      s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout = subprocess.PIPE)
      output = s.communicate()[0]
      if output != '':
         print 'Matchfound in ' + file

这导致了更长的执行时间。

关于提高性能的任何建议。

编辑：

我还不能发布我自己的问题的答案，但这里是对提出的几个问题的答案：

@David Nehme -只是为了让人们知道我知道我有一百万个queue.put(-1)

@Blender -标记队列的底部。我的扫描器线程一直在等待，直到它们到达底部的-1 (而nextFile不是-1:)。然而，由于GIL使用1个线程、4个线程或8个线程，所以处理器核心是8。派生8个子进程导致代码明显变慢(142秒vs 72秒)

@ed -是的，它和find\grep组合一样慢，实际上更慢，因为它不加区别地抓取不需要的文件

@Ron -无法升级，这必须是通用的。你认为这会加速72秒以上吗？bash grepper执行358秒。我的python巨型RE方法使用1-8个线程执行72秒。popen方法有8个thrads (8个子进程)，运行时间为142秒。到目前为止，巨大的RE python only方法显然是赢家

@直觉

这是我们当前find\grep组合的要点(不是我的脚本)。这相当简单。这里有一些额外的东西，比如ls，但没有任何东西会导致5倍的速度减慢。即使grep -r的效率稍微高一些，5倍的速度也是一个巨大的减速。

 find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"

python代码效率更高，我不知道为什么，但我在实验中测试了它。我更喜欢用python来做这件事。我已经用python实现了5倍的加速，我想要得到更多的加速。

：WINNER：

看起来我们有赢家了。

intued的shell脚本以34秒位居第二，而@steveha脚本以24秒位居第一。由于我们的许多机器都没有python2.6，所以我不得不cx_freeze它。我可以编写一个shell脚本包装器来wget tar并解压缩它。然而，为了简单起见，我确实喜欢intued。

感谢你们的帮助，我现在有了一个有效的sysadmining工具。

python

multithreading

string

parallel-processing

gil

回答 4

Stack Overflow用户

回答已采纳

发布于 2011-10-02 14:28:16

我认为，您应该为您的threading解决方案使用multiprocessing模块，而不是使用Python模块。Python线程可能会与GIL冲突；如果您只是运行多个Python进程，则GIL不是问题。

我认为对于您正在做的事情来说，一个工作进程池正是您想要的。默认情况下，对于系统处理器中的每个核心，该池将默认为一个进程。只需使用要检查的文件名列表和执行检查的函数调用.map()方法即可。

http://docs.python.org/library/multiprocessing.html

如果这不比你的threading实现更快，那么我不认为GIL是你的问题。

编辑:好的，我添加了一个可以工作的Python程序。这使用一个工作进程池来打开每个文件并在每个文件中搜索模式。当worker找到匹配的文件名时，它只需将其打印(到标准输出)，这样您就可以将此脚本的输出重定向到一个文件中，您就拥有了文件列表。

编辑:我认为这是一个更容易阅读，更容易理解的版本。

我对此进行了计时，搜索了我计算机上/usr/include中的文件。它在大约半秒内完成搜索。使用通过xargs管道传输的find来运行尽可能少的grep进程，大约需要0.05秒，大约是10倍的加速。但我讨厌让find正常工作所必须使用的巴洛克怪异语言，我喜欢Python版本。也许在非常大的目录上，差异会更小，因为Python的半秒时间必须是启动时间的一部分。也许半秒对于大多数目的来说已经足够快了！

import multiprocessing as mp
import os
import re
import sys

from stat import S_ISREG


# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')

target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]


# build s_pat as string like:  (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

# pre-compile pattern for speed
pat = re.compile(s_pat)


def walk_files(topdir):
    """yield up full pathname for each file in tree under topdir"""
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            yield pathname

def files_to_search(topdir):
    """yield up full pathname for only files we want to search"""
    for fname in walk_files(topdir):
        try:
            # if it is a regular file and big enough, we want to search it
            sr = os.stat(fname)
            if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                yield fname
        except OSError:
            pass

def worker_search_fn(fname):
    with open(fname, 'rt') as f:
        # read one line at a time from file
        for line in f:
            if re.search(pat, line):
                # found a match! print filename to stdout
                print(fname)
                # stop reading file; just return
                return

mp.Pool().map(worker_search_fn, files_to_search(target))

票数 5

Stack Overflow用户

发布于 2011-10-02 06:59:43

我对Python脚本为什么会比find/grep组合更快感到有点困惑。如果您想以一种类似于Ron Smith在他的回答中建议的方式使用grep，您可以这样做

find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns

启动grep进程，该进程将在退出前处理100个文件，同时保持多达8个此类进程处于活动状态。让它们处理100个文件应该会使每个文件的进程启动开销时间可以忽略不计。

注意：xargs的-d \\n选项是一个GNU扩展，它不能在所有POSIX系统上工作。它指定文件名之间的*d*分隔符是换行符。尽管从技术上讲，文件名可以包含换行符，但实际上没有人会这样做并保住自己的工作。为了与非GNU xargs兼容，您需要向find添加-print0选项，并在xargs中使用-0而不是-d \\n。这将安排find和xargs将空字节\0 (十六进制0x00)用作分隔符。

您也可以采取首先计算要抓取的文件数的方法

NUMFILES="$(find -type f | wc -l)";

然后使用这个数字在8个进程中平均分配(假设bash是外壳程序)

find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns

我认为这可能会更有效，因为find的磁盘I/O不会干扰各种grep的磁盘I/O。我认为这在一定程度上取决于文件的大小，以及它们是否连续存储-对于小文件，磁盘无论如何都会占用大量空间，所以这并不重要。还要注意，特别是如果您有相当大的RAM，这样的命令的后续运行将更快，因为一些文件将保存在您的内存缓存中。

当然，您可以对8进行参数化，以便更容易地试验不同数量的并发进程。

作为ed。在评论中提到，这种方法的性能仍然不会像单进程grep -r那样令人印象深刻。我猜这取决于磁盘阵列的相对速度、系统中处理器的数量等。

票数 5

Stack Overflow用户

发布于 2011-10-02 06:27:12

如果您愿意升级到3.2版或更高版本，则可以利用concurrent.futures.ProcessPoolExecutor。我认为与您尝试的popen方法相比，它会提高性能，因为它会预先创建一个进程池，每次popen方法都会创建一个新进程。如果由于某种原因不能迁移到3.2版本，那么您可以编写自己的代码为早期版本做同样的事情。

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/7623211

复制

相似问题

问并行文件匹配，Python
EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问并行文件匹配，PythonEN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问并行文件匹配，Python
EN