Question: A solution that works faster than a for loop

Stack Overflow user

Asked 2020-03-11 19:21:33

Answers: 3 | Views: 95 | Followers: 0 | Votes: 0

I have a directory containing 1.5M text files whose contents look like this:

TDB_TCFE9:
POLY version  3.32
POLY:POLY:POLY:POLY:POLY: Using global minimization procedure
Calculated         71400 grid points in             1 s
Found the set of lowest grid points in              1 s
 Calculated POLY solution      11 s, total time    13  s
POLY: VPV(LIQUID)=0
POLY:POLY:POLY:POLY:POLY:POLY: Using global minimization procedure
Calculated         71400 grid points in             1 s
Found the set of lowest grid points in              0 s
 Calculated POLY solution       5 s, total time     6  s
POLY: *** STATUS FOR ALL PHASES

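The files are named like id1.TCM.log. Only the number after id changes.

What I want to do is grep the value after VPV(LIQUID)=, assign it to x1, and then format it into x2. If x2 is greater than 0.0001, move the corresponding file to the with_liquid directory. Otherwise, do nothing.

The code I am using is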

for j in `seq 1500000`
do
echo ${j}
  x1=`grep -a VPV\(LIQUID\) id${j}.TCM.log |sed s/POLY_3://g|awk 'BEGIN{FS="="}{print $2}'|tail -1`
    x2=$(printf "%.14f" $x1)
    if [ $(echo "$x2 > 0.0001"| bc -l)  -eq 1 ]; then
    mv id${j}.TCM.log with_liquid
    fi
done

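It works fine. The only problem is that it takes too long. How can I make it faster? I could also use Python code or any other solution.

Thank you very much.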

python

linux

grep

3 Answers

Stack Overflow user

Accepted answer

Posted 2020-03-11 19:43:32

Simply improving the performance of the script itself:

for file in id{1..1500000}.TCM.log; do  # Use a brace expansion
                                        # instead of the subshell to seq
    printf 'Processing %s\n' "$file"
    # Last match, e.g. VPV(LIQUID)=0.0; plain grep has no \d, so use [0-9]
    x1=$(grep -ao 'VPV(LIQUID)=[0-9]*\.\?[0-9]*' "$file" | tail -n 1)
    printf -v x2 '%.14f' "${x1#*=}"  # Parameter expansion to strip the part before the =
    # bc prints 1 or 0 and always exits with status 0, so test its output
    if (( $(bc -l <<< "$x2 > 0.0001") )); then
        mv "$file" with_liquid/
    fi
done

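Given how many files there are, how you implement this matters, but not as much as parallelizing it. A simple way to do that is to use a tool such as xargs with its -P flag. A more powerful implementation is GNU parallel. Suppose you put the middle part of your script (everything inside the for loop, with file renamed to $1) in an executable called xsplit.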

# -n 1 hands each invocation exactly one file name, which the script reads as $1
printf '%s\n' id{1..1500000}.TCM.log | xargs -n 1 -P 50 ./xsplit

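For your purposes you will need to tune maxprocs carefully; this is the maximum number of processes that xargs will spawn. I arbitrarily picked 50 here.

Another consideration is that, going by the sample file you gave, the files are all tiny, under 600 bytes each. So all 1.5 million files together come to under 1 GB, and you could load all of them into memory at once on ordinary hardware. That boils the whole script down to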

[[ $(<"$file") =~ VPV\(LIQUID\)=([[:digit:]]+\.?[[:digit:]]*|\.[[:digit:]]+) ]] &&
    (( $(bc -l <<< "${BASH_REMATCH[1]} > 0.0001") )) &&
    mv "$file" with_liquid/
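
Since the question allows Python, the same match-and-compare step can be sketched there as well. This is only an illustration under assumptions: the names `VPV_RE`, `THRESHOLD`, `last_liquid_value` and `has_liquid` are made up here, and the pattern assumes the value is a plain decimal, optionally in E-notation, which the sample log does not confirm. Unlike the Bash regex above, it takes the last occurrence in the text, mirroring the `tail -1` in the question.

```python
import re

# Assumed value format: optional sign, plain decimal, optional exponent (e.g. 2.5E-03).
VPV_RE = re.compile(r'VPV\(LIQUID\)=([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)')

THRESHOLD = 0.0001  # cutoff taken from the question


def last_liquid_value(text):
    """Return the last VPV(LIQUID) value found in text as a float, or None."""
    matches = VPV_RE.findall(text)
    return float(matches[-1]) if matches else None


def has_liquid(text):
    """True when the last VPV(LIQUID) value in text exceeds THRESHOLD."""
    value = last_liquid_value(text)
    return value is not None and value > THRESHOLD
```

Doing the comparison with `float` avoids starting a `bc` process for every file.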

Votes: 2

Stack Overflow user

Posted 2020-03-11 20:00:37

The main reason your code is slow is that it needs about nine forks per file.

It will be much faster if you avoid forking:

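# printf is a shell builtin, so the 1.5M-name glob never hits the exec argument-length limit
printf '%s\0' id*.TCM.log |
    xargs -0 awk -F= '/VPV\(LIQUID\)/ && $2 > 0.0001 { print FILENAME; nextfile }' |
    xargs -r mv -t with_liquid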
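
For comparison, the same no-fork idea can be sketched as a single Python process. Everything named here (`move_files_with_liquid`, `src_dir`, `dest_dir`, `threshold`) is hypothetical, and the value pattern is an assumption about the log format. Unlike the awk one-liner, it uses the last VPV(LIQUID) value in each file, as the `tail -1` in the question does.

```python
import os
import re
import shutil

# Assumed value format: digits, dot, sign and exponent characters after the '='.
VPV_RE = re.compile(rb'VPV\(LIQUID\)=([0-9.eE+-]+)')


def move_files_with_liquid(src_dir, dest_dir, threshold=0.0001):
    """Move id*.TCM.log files whose last VPV(LIQUID) value exceeds threshold."""
    os.makedirs(dest_dir, exist_ok=True)
    # Snapshot the names first so that moving files does not disturb the scan.
    with os.scandir(src_dir) as entries:
        names = [e.name for e in entries
                 if e.is_file()
                 and e.name.startswith('id')
                 and e.name.endswith('.TCM.log')]
    moved = []
    for name in names:
        path = os.path.join(src_dir, name)
        with open(path, 'rb') as f:  # read bytes, like grep -a in the question
            values = VPV_RE.findall(f.read())
        if not values:
            continue  # no VPV(LIQUID) line: leave the file alone
        try:
            liquid = float(values[-1].decode('ascii'))
        except ValueError:
            continue  # unparseable value: leave the file alone
        if liquid > threshold:
            shutil.move(path, os.path.join(dest_dir, name))
            moved.append(name)
    return moved
```

Called as `move_files_with_liquid('.', 'with_liquid')`, it reads each file once and starts no child processes at all.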

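Votes: 2

Stack Overflow user

Posted 2020-03-11 21:41:32

Here is a possible Python implementation. You can choose between lightweight threads (suited to I/O-bound jobs) and processes (which handle CPU-intensive work better). At the moment the code that uses processes is commented out with #. You should try both and see which works better. You can also play around with the number of threads; the number of processes should not exceed the number of processors you actually have.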

#!/usr/bin/env python3

from concurrent.futures import ThreadPoolExecutor
#from concurrent.futures import ProcessPoolExecutor
import re, glob, os

NUMBER_OF_THREADS = 20
NUMBER_OF_PROCESSES = 5

x1_regex = re.compile(r'VPV\(LIQUID\)=(\d+(?:\.\d*)?|\.\d+)')

def process_file(i):
    fn = f'id{i}.TCM.log'
    try:
        with open(fn) as f:
            text = f.read()
        m = x1_regex.search(text)
        if m:
            x1 = float(m[1])
            if x1 > .0001:
                os.rename(fn, f'with_liquid/{fn}')
        return None # no error
    except Exception as e:
        return (e, fn)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=NUMBER_OF_THREADS) as executor:
    #with ProcessPoolExecutor(max_workers=NUMBER_OF_PROCESSES) as executor:
        # ids run from 1 to 1500000, matching `seq 1500000` in the question
        for result in executor.map(process_file, range(1, 1500001)):
            if result:
                print(f'Exception {result[0]} processing file {result[1]}.')
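
One possible refinement, offered as a sketch rather than as part of the answer above: `executor.map` submits one task per id up front, so 1.5 million ids mean 1.5 million queued tasks. Grouping ids into chunks keeps that number small. The helpers `id_chunks`, `process_chunk` and `run_chunked` and the chunk size of 1000 are assumptions made for illustration; `process_one` stands for a function like `process_file` above that returns None on success and an error tuple otherwise.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def id_chunks(first, last, size):
    """Yield ranges covering first..last inclusive, each at most size ids long."""
    for start in range(first, last + 1, size):
        yield range(start, min(start + size, last + 1))


def process_chunk(ids, process_one):
    """Run process_one on every id in the chunk; keep only the error results."""
    return [result for result in map(process_one, ids) if result]


def run_chunked(process_one, first, last, size=1000, workers=20):
    """Process ids first..last in chunks on a thread pool; return all errors."""
    errors = []
    worker = partial(process_chunk, process_one=process_one)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # map yields results in submission order, one list of errors per chunk.
        for chunk_errors in executor.map(worker, id_chunks(first, last, size)):
            errors.extend(chunk_errors)
    return errors
```

With the function from the answer this would be called as `run_chunked(process_file, 1, 1500000)`; whether it actually helps depends on the machine and the filesystem, so it is worth timing both variants.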

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60643144
