Is there a way to speed up the shell script below? It takes me about 40 minutes every day to update roughly 150,000 files. Granted, given the number of files being created and updated, that may simply be acceptable; I don't deny that. But if there is a more efficient way to write this, or a way to rewrite the logic entirely, I'm open to it. Please, I'm looking for help.
```bash
#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"

for fname in $(ls -1 "${DATA_FILE_SOURCE}")
do
    for line in $(cat "${DATA_FILE_SOURCE}/${fname}")
    do
        FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
        CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
        if [[ ! -f "${DATA_FILE_DEST}/${FILE_TO_WRITE_TO}" ]]
        then
            echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}/${FILE_TO_WRITE_TO}"
        else
            if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}/${FILE_TO_WRITE_TO}"
            then
                sed -i "/${1}/d" "${DATA_FILE_DEST}/${FILE_TO_WRITE_TO}"
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}/${FILE_TO_WRITE_TO}"
            fi
        fi
    done
done
```

Posted on 2021-09-16 14:03:59
There are still some unclear parts in the script you posted, such as the `sed` command. Nevertheless, I rewrote it with a more sensible approach and fewer external calls, which should indeed speed it up.
```sh
#!/usr/bin/env sh
DATA_FILE_SOURCE="<path_to_source_data>/$1"
DATA_FILE_DEST="<path_to_dest>"

for fname in "$DATA_FILE_SOURCE/"*; do
    while IFS=, read -r a b content || [ "$a" ]; do
        destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
        if grep -Fxq "$content" "$destfile"; then
            sed -i "/$1/d" "$destfile"
        fi
        printf '%s\n' "$content" >>"$destfile"
    done < "$fname"
done
```

Posted on 2021-09-16 13:59:31
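A quick aside on the rewrite above (my illustration, with a made-up sample line): `IFS=, read -r a b content` works because `read` assigns all leftover fields to the last variable, so `content` keeps its embedded commas and replaces the per-line `awk` + `cut` pair in one builtin call:

```sh
#!/bin/sh
# read splits on IFS and dumps every remaining field into the last
# variable, so "content" keeps its embedded commas.
line='AAA,XX,2021-09-16,1,2'
IFS=, read -r a b content <<EOF
$line
EOF
printf '%s\n' "$a" "$b" "$content"
# → AAA
#   XX
#   2021-09-16,1,2
```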
```bash
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

# ...

for fname in "${DATA_FILE_SOURCE}/"*; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n \
            || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
    done < "${fname}" &  # fork here

    pids[$!]="${fname}"
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" \
        || echo "${pids[pid]} failed with ${?}" 1>&2
done
```
Here is a directly runnable skeleton showing how the scaffolding above works (it processes 36 items with at most 20 parallel processes):
```bash
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

do_something_and_maybe_fail() {
    sleep $((RANDOM % 10))
    return $((RANDOM % 2 * 5))
}

for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n \
            || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    do_something_and_maybe_fail &  # forking
    pids[$!]="${fname}"
    echo "${#pids[@]} running" 1>&2
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" \
        || echo "${pids[pid]} failed with ${?}" 1>&2
done
```
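One caveat worth adding (my note, not part of the answer): `wait -p`, used above to learn which PID finished, requires bash 5.1 or newer. On older bash a cruder fallback is plain `wait -n` (bash 4.3+) with a simple counter; a minimal sketch:

```bash
#!/bin/bash
# Hedged sketch (my illustration): bounded parallelism with plain
# `wait -n`, without the `wait -p` PID bookkeeping from the answer.
MAX=3
running=0
for i in 1 2 3 4 5 6; do
    if (( running >= MAX )); then
        wait -n                      # block until any one background job exits
        running=$((running - 1))
    fi
    sleep 0.1 &                      # stand-in for the real per-file work
    running=$((running + 1))
done
wait                                 # drain the remaining jobs
echo "all done"
```

The trade-off: without `-p` you cannot tell *which* job exited, so per-job error reporting is lost, but the concurrency cap still holds.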
Your inner loop also forks several external processes for every single input line (`awk`, `grep` and `cut`). `fork()`ing is extremely inefficient compared to:
- Running one single `awk` / `grep` / `cut` process on an entire input file (to preprocess all lines at once for easier processing in `bash`) and feeding the whole output into (e.g.) a `bash` loop.
- Using Bash expansions instead, where feasible, e.g. `"${line/,/.}"` and other tricks from the `EXPANSION` section of the `man bash` page, without `fork()`ing any further processes.
- `ls -1` is unnecessary. First, `ls` won’t write multiple columns unless the output is a terminal, so a plain `ls` would do. Second, `bash` expansions are usually a cleaner and more efficient choice. (You can use `nullglob` to correctly handle empty directories / “no match” cases.)
- Looping over the output from `cat` is a (less common) [useless use of `cat`](https://blog.sanctum.geek.nz/useless-use-of-cat/) case. Feed the file into a loop in `bash` instead and read it line by line. (This also gives you more line format flexibility.)

https://stackoverflow.com/questions/69208364
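As a hedged sketch of the first point above (file names and sample data are mine, assuming the question's `key1,key2,rest...` line format), a single `awk` process can split a whole source file into the per-destination `$1.$2.daily.csv` files in one pass, instead of forking `awk` and `cut` for every line:

```sh
#!/bin/sh
# Sketch only: demo data below is made up; real input would come from
# the question's source directory.
set -e

tmp=$(mktemp -d)
printf '%s\n' 'AAA,XX,2021-09-16,1,2' 'BBB,YY,2021-09-16,3,4' 'AAA,XX,2021-09-16,5,6' \
    > "$tmp/input.csv"

# One awk process handles the whole file: fields 1 and 2 pick the
# destination, fields 3..NF are re-joined as the content to append
# (the cut -d, -f3- part of the original script).
awk -F',' -v dest="$tmp" '
{
    out = dest "/" $1 "." $2 ".daily.csv"
    content = $3
    for (i = 4; i <= NF; i++) content = content "," $i
    print content >> out
    close(out)   # stay below the open-file-descriptor limit
}' "$tmp/input.csv"

cat "$tmp/AAA.XX.daily.csv"
# → 2021-09-16,1,2
#   2021-09-16,5,6
```

The dedup/replace logic from the question is deliberately left out here; the point is only the process-count difference (one `fork()` total versus several per line).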