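I'm trying to get rid of old asset versions. The files are strictly named as follows:
<timestamp>_<constant><version><assetID>.zip<.extra>
For example, 202201012359_FOOBAR0101234567.zip.done.
<timestamp> is the date the file was added to the folder.
<constant> does not change within the folder being processed.
<version> is a two-digit number starting at 00; it denotes the version of the asset identified by <assetID>.
<extra> is optional, so the extension can be .zip, .zip.done, or .zip.somethingelse.
However, an asset can have all three extensions and can exist multiple times with different timestamps. This means an asset may have several additional files with the same ID and version number but different timestamps.
The goal is to find the latest version of each asset with the same ID and delete the older versions. What matters is the version number, not the timestamp.
All assets are located in a single folder with no subfolders.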
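To make the naming scheme concrete, here is a minimal sketch that splits one file name into its fields using bash parameter expansion. The helper name parse_asset_name is hypothetical, and the sketch assumes a 2-digit version followed by an 8-digit asset ID, as in the example names above.

```shell
#!/bin/bash
# Hypothetical helper: split an asset file name into its fields.
# Assumes <timestamp>_<constant><version><assetID>.zip[.extra] with a
# 2-digit version and an 8-digit asset ID (taken from the example names).
parse_asset_name() {
    local name=${1##*/}          # drop any leading directory
    local stem=${name%%.zip*}    # everything before the first ".zip"
    local extra=${name#"$stem"}  # ".zip", ".zip.done", ...
    local id=${stem: -8}         # last 8 characters: asset ID
    local ver=${stem: -10:2}     # the 2 characters before the ID: version
    local ts=${stem%%_*}         # text before the first underscore
    printf 'timestamp=%s version=%s id=%s ext=%s\n' "$ts" "$ver" "$id" "$extra"
}

parse_asset_name "202201012359_FOOBAR0101234567.zip.done"
# prints: timestamp=202201012359 version=01 id=01234567 ext=.zip.done
```

Counting from the end of the stem keeps the sketch independent of the length of the timestamp and of the constant.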
Current solution
So far, this has been achieved as follows:
#!/bin/bash
location="/home/user/FOOBAR"
echo "Deleting older files..."
# Declare variable to print the outcome of removed asset ID's
declare -A assetsRemoved
# The main loop which finds all the files in the folder
find $location -maxdepth 1 -type f -name "*.zip*" -a -name "*FOOBAR*" | while read line; do
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20201229104919_FOOBAR0300040682.zip.done
    # Separate assetId
    rest=${line#*'.zip'}
    # .done
    pos=$(( ${#line} - ${#rest} - 4 ))
    # 20201229104919_FOOBAR0300040682<^>.zip.done
    assetId=${line:pos-8:8}
    # 20201229104919_FOOBAR03<00040682>.zip.done
    # Find all files with same assetId
    assets="$(find ~+ $location -maxdepth 1 -type f -name "*$assetId.zip*" -a -name "*FOOBAR*")"
    # Init loop variables
    max=-1
    mostRecent=""
    cleanedOld=0
    # Loop all files with same assetId
    for file in $assets
    do
        # Separate basename without extension
        basenameNoExt="${file%%.*}"
        # <20201229104919_FOOBAR0300040682>.zip.done
        # Separate iterator, 2 numbers
        iter=${basenameNoExt:${#basenameNoExt}-10:2}
        # 20201229104919_FOOBAR<03>00040682.zip.done
        if [[ $iter -gt $max ]]
        then
            max=$iter
            if [[ -n $mostRecent ]]
            then
                rm $mostRecent*
                cleanedOld=1
            fi
            mostRecent=$basenameNoExt
        elif [[ $iter -lt $max ]]
        then
            [ -f $file ] && rm $basenameNoExt*
            cleanedOld=1
        fi
        # $iter == $max -> same asset with different file extension, leave to be
    done
    if [[ $max -gt 0 && cleanedOld -gt 0 ]]
    then
        assetsRemoved[$assetId]=$max
    fi
done
for a in "${!assetsRemoved[@]}"; do
    echo "Cleaned asset $a from versions lower than ${assetsRemoved[$a]}"
done
The problem
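This solution has one serious problem: it is slow. It first finds all files, then takes one file and works out the maximum version while deleting the older versions. As a result, the next iteration of the outermost find loop runs the same find-and-remove commands on assets that may already have been processed or deleted.
The question
Is there a way to run a command on each result of find before all results have been collected? Or is there some other, more efficient way to loop over the results? There are more than 100k files to process, and I assume the wildcard rm loops over all of them when searching for the related files to delete. That adds up to more than 100,000^2 iterations over the files. Is there any way to prevent this?
Example
Consider a folder containing the following files: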
20191229104919_FOOBAR0001234567.zip
20191229104919_FOOBAR0001234567.zip.done
20191229104919_FOOBAR0001234567.zip.somethingelse
20191229104919_FOOBAR0087654321.zip
20191129104919_FOOBAR0087654321.zip.done
20191129104919_FOOBAR0087654321.zip.somethingelse
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20201229104919_FOOBAR0101234567.zip
20201229104919_FOOBAR0101234567.zip.done
20201229104919_FOOBAR0101234567.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
After cleanup, the following files remain:
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
Note:
Only the latest version matters. Files with the same asset version and ID but different timestamps and extensions must all be kept.
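For trying out any cleanup approach without touching real data, the example can be recreated in a throwaway directory. The following is only a sketch: make_example_dir and check_example_result are hypothetical helper names, and the file lists are copied from the example above.

```shell
#!/bin/bash
# Sketch: recreate the example folder as empty files in a temporary directory.
make_example_dir() {
    local dir name
    dir=$(mktemp -d) || return 1
    for name in \
        20191229104919_FOOBAR0001234567.zip \
        20191229104919_FOOBAR0001234567.zip.done \
        20191229104919_FOOBAR0001234567.zip.somethingelse \
        20191229104919_FOOBAR0087654321.zip \
        20191129104919_FOOBAR0087654321.zip.done \
        20191129104919_FOOBAR0087654321.zip.somethingelse \
        20191129100000_FOOBAR0187654321.zip \
        20191229100000_FOOBAR0187654321.zip.done \
        20191229100000_FOOBAR0187654321.zip.somethingelse \
        20201229104919_FOOBAR0101234567.zip \
        20201229104919_FOOBAR0101234567.zip.done \
        20201229104919_FOOBAR0101234567.zip.somethingelse \
        20211229104919_FOOBAR0201234567.zip \
        20211229104919_FOOBAR0201234567.zip.done \
        20211229104919_FOOBAR0201234567.zip.somethingelse \
        20221229104919_FOOBAR0201234567.zip \
        20221229104919_FOOBAR0201234567.zip.done \
        20221229104919_FOOBAR0201234567.zip.somethingelse
    do
        : > "$dir/$name"          # create an empty file
    done
    printf '%s\n' "$dir"          # report where the folder was created
}

# Succeeds only if DIR contains exactly the nine files expected after cleanup.
check_example_result() {
    local expected actual
    expected=$(printf '%s\n' \
        20191129100000_FOOBAR0187654321.zip \
        20191229100000_FOOBAR0187654321.zip.done \
        20191229100000_FOOBAR0187654321.zip.somethingelse \
        20211229104919_FOOBAR0201234567.zip \
        20211229104919_FOOBAR0201234567.zip.done \
        20211229104919_FOOBAR0201234567.zip.somethingelse \
        20221229104919_FOOBAR0201234567.zip \
        20221229104919_FOOBAR0201234567.zip.done \
        20221229104919_FOOBAR0201234567.zip.somethingelse | sort)
    actual=$(cd "$1" && printf '%s\n' * | sort)
    [[ $actual == "$expected" ]]
}
```

A possible workflow: dir=$(make_example_dir), point the cleanup script at "$dir", then run check_example_result "$dir" && echo OK.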
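Posted on 2022-09-09 15:22:38
Thanks to @ogus for the effort and the clean solution!
For documentation's sake, I'll add the solution I ended up using and clarify the use of xargs in this case.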
#!/bin/bash
# Takes optional argument to delete found assets while running.
removeFound=${1:-n}
location="/home/user/bashtest"
if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
then
    echo "Deleting older assets from $location"
else
    echo "Searching old assets from $location"
fi
# Find all .zip and .zip.somethingelse -files, pipe lines to awk, save to variable
assetsToDelete=`printf '%s\n' $location/*.zip* | awk '{
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20191229104919_FOOBAR0387654321.zip.done
    # Extension position
    extPos = index($0, ".zip")
    # 20191229104919_FOOBAR0387654321<^>.zip.done
    # Separate asset ID
    assetId = substr($0, extPos - 8, 8)
    # 20191229104919_FOOBAR03<87654321>.zip.done
    # Separate iterator, 2 numbers
    assetVer = substr($0, extPos - 10, 2)
    # 20191229104919_FOOBAR<03>87654321.zip.done
    # List variables used below:
    # assetList -> [assetId][asset file(s)] -> keys: list of asset IDs encountered, values: one or more asset file paths, absolute, separated by ORS (newline)
    # maxAssetV -> [assetId][assetMaxVersion] -> keys: list of asset IDs encountered, values: maximum version of the corresponding asset encountered
    # Everything printed out with <print> is the output of the awk-command, thus to be deleted
    # Find if ID has not been recorded yet, or version is greater than the recorded one
    if (!(assetId in assetList) || assetVer > maxAssetV[assetId]) {
        # Asset already recorded with a smaller version: remove the old files by printing their paths
        if (assetId in assetList)
            print assetList[assetId]
        # Record new or newer asset
        assetList[assetId] = $0
        maxAssetV[assetId] = assetVer
    }
    # Find if asset is the same version as current max version
    else if (assetVer == maxAssetV[assetId]) {
        # Record the asset by stacking it on the list, separated with ORS (newline)
        assetList[assetId] = assetList[assetId] ORS $0
    }
    # A newer version is already recorded, so this file is older -> print, thus delete
    else {
        print
    }
}' `
if [ -z "$assetsToDelete" ]; then
    echo "Zero older assets found in the $location"
else
    if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
    then
        echo $assetsToDelete | awk -v OFS="\n" '{$1=$1}1' | xargs -n1 -I {} sh -c 'echo {}; rm {}'
    else
        echo "Moving files to ./remove folder, delete manually from there."
        echo "To delete on the go, run script with parameter <yes>"
        echo $assetsToDelete | awk -v OFS="\n" '{$1=$1}1' | xargs -n1 -I {} sh -c 'echo {}; mv {} $(dirname {})/remove/'
    fi
fi
exit
Posted on 2022-09-06 12:36:26
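Run in the directory containing the files, this lists the files to be deleted:
printf '%s\n' *.zip *.zip.* | awk '{
    i = index($0, ".zip")        # 1-based position of ".zip"
    id = substr($0, i - 8, 8)    # the 8 characters before ".zip": asset ID
    ver = substr($0, i - 10, 2)  # the 2 characters before the ID: version
    if (!(id in keep) || ver > keep_ver[id]) {
        # new ID or newer version: the files kept so far are outdated
        if (id in keep)
            print keep[id]
        keep[id] = $0
        keep_ver[id] = ver
    }
    else if (ver == keep_ver[id]) {
        # same version, different timestamp or extension: keep it too
        keep[id] = keep[id] ORS $0
    }
    else {
        # older than the version already kept
        print
    }
}'
If the output looks good, pipe it to xargs rm to actually delete the files.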
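As a hedged sketch of that last step, the listing and the deletion can be wrapped in two functions. The names list_old_assets and delete_old_assets are hypothetical, the offsets assume a 2-digit version followed by an 8-digit asset ID, and file names are assumed not to contain newlines. Converting newlines to NUL bytes for xargs -0 is a defensive choice, since plain xargs splits its input on whitespace and interprets quotes; with the strict naming scheme here, plain xargs rm would work as well.

```shell
#!/bin/bash
# list_old_assets DIR: print the names of outdated asset files in DIR.
list_old_assets() (
    cd "$1" || exit 1
    shopt -s nullglob                 # an empty folder yields no names
    printf '%s\n' *.zip* | awk '{
        i = index($0, ".zip")         # 1-based position of ".zip"
        id = substr($0, i - 8, 8)     # 8-digit asset ID
        ver = substr($0, i - 10, 2)   # 2-digit version
        if (!(id in keep) || ver > keep_ver[id]) {
            if (id in keep)
                print keep[id]        # previously kept files are now outdated
            keep[id] = $0
            keep_ver[id] = ver
        }
        else if (ver == keep_ver[id])
            keep[id] = keep[id] ORS $0
        else
            print
    }'
)

# delete_old_assets DIR: remove everything list_old_assets reports.
delete_old_assets() (
    cd "$1" || exit 1
    old=$(list_old_assets .) || exit 1
    [[ -n $old ]] || exit 0           # nothing to delete
    printf '%s\n' "$old" | tr '\n' '\0' | xargs -0 rm --
)
```

Because xargs batches many names into each rm invocation, the whole folder is read once and only a handful of rm processes are started, rather than one find and one rm per asset.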
Posted on 2022-09-06 11:40:27
Instead of the double loop, how about a two-pass approach:
#!/bin/bash
location="/home/user/FOOBAR"
declare -A latestver                    # associates the latest version number with assetID
# pass 1: extract the latest version number for the assetID
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}                      # remove the suffix ".zip*"
    ver=${tmp: -10:2}                   # extract the version number
    id=${tmp: -8:8}                     # extract the assetID
    # record the first version seen for an assetID (so an asset that only has
    # version 00 is kept), then update it whenever a higher version appears
    if [[ -z ${latestver[$id]} ]] || (( 10#$ver > 10#${latestver[$id]} )); then
        latestver[$id]="$ver"           # update the latest version number associated with the assetID
    fi
done
# pass 2: if the associated version number of the assetID does not match, remove the file
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}                      # remove the suffix ".zip*"
    id=${tmp: -8:8}                     # extract the assetID
    ver=${latestver[$id]}               # expected latest version number
    if [[ $f != *FOOBAR$ver$id.zip* ]]; then
        # filename does not match, meaning the version number is older
        echo rm -- "$f"                 # then remove the file
    fi
done
If the output looks good, remove the "echo".
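The 10# prefix in pass 1 forces base 10. Without it, bash arithmetic reads numbers with a leading zero as octal, so versions 08 and 09 would raise an error. A small sketch of the comparison on its own, where newer_version is a hypothetical helper that also treats "no version recorded yet" as older than anything:

```shell
#!/bin/bash
# newer_version A B: succeed if two-digit version A is greater than B.
# An empty B (no version recorded yet) counts as older than any version.
newer_version() {
    local a=$1 b=${2:-}
    [[ -z $b ]] && return 0
    (( 10#$a > 10#$b ))               # 10# avoids octal parsing of 08 and 09
}

newer_version 09 08 && echo "09 is newer than 08"
newer_version 00 "" && echo "00 is newer than nothing"
newer_version 07 10 || echo "07 is not newer than 10"
```

Since the versions are always two zero-padded digits, a plain string comparison such as [[ $a > $b ]] would order them the same way; the arithmetic form just makes the numeric intent explicit.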
https://stackoverflow.com/questions/73618466