New shell script...
I have a huge CSV file with a variable-length field 11 (f11), such as
"000000aad0000bhb200000uwwed"
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
...
After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk. Then I have to merge the extracted pieces with the delimiter '|' (pieces like 0aba, bbrb, 0wwq, caba, 0bhb, 0qwe)
and join the processed f11 back with the other fields.
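For a single value, the split-and-extract step can be sketched with standard coreutils; the sample string here is taken from the example input further down, and '|' is the delimiter used in the processing code:

```shell
# Split one f11 value into 10-character chunks, keep characters 6-9 of
# each chunk, and join the pieces with '|'.
printf '%s\n' '000000aaad000000bhb200000uwwed' |
  fold -w10 |      # one 10-character chunk per line
  cut -c6-9 |      # keep characters 6-9 of each chunk
  paste -sd'|' -   # join all lines with a '|' delimiter
# prints: 0aaa|0bhb|uwwe
```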
Time taken to process 10K records:
real 4m43.506s
user 0m12.366s
sys 0m12.131s
20K records:
real 5m20.244s
user 2m21.591s
sys 3m20.042s
80K records (roughly 3.7 million f11 splits and merges with '|'):
real 21m18.854s
user 9m41.944s
sys 13m29.019s
The expected time for processing 650K records is 30 minutes (roughly 56 million f11 splits and merges). Is there an optimized way to do this?
while read -r line1; do
f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
echo $f10 >> $path/other_fields
f11=$( echo $line1 | cut -d',' -f11 )
f11_trim=$(echo "$f11" | tr -d '"')
echo $f11_trim | fold -w10 > $path/f11_extract
cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
arr=($(cat $path/str_list_trim))
printf "%s|" ${arr[@]} >> $path/str_list_serialized
printf '\n' >> $path/str_list_serialized
arr=()
rm $path/f11_extract
rm $path/str_list_trim
done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized
paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

Posted on 2021-10-26 06:11:03
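For comparison, the f11 handling inside the loop can be done without forking a single external command, using only bash parameter expansion. This is a sketch of the idea (the function name is mine, not part of the original script):

```shell
#!/usr/bin/env bash
# Extract characters 6-9 of every 10-character chunk of a string using
# only bash parameter expansion -- no subprocesses are forked.
extract_chunks() {
    local f11=${1//\"/}           # drop double quotes (replaces tr -d '"')
    local out="" sep="" i
    for ((i = 0; i + 10 <= ${#f11}; i += 10)); do
        out+=$sep${f11:i+5:4}     # chars 6-9 of the chunk (0-based offset 5)
        sep="|"
    done
    printf '%s\n' "$out"
}

extract_chunks '000000aaad000000bhb200000uwwed'   # -> 0aaa|0bhb|uwwe
```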
Your code is not time-efficient because it forks several external commands (echo, cut, tr, fold, awk, cat, rm) for every single input line inside the loop.
You can do the whole job with a single awk program:
awk -F, -v OFS="," ' # assign input/output field separator to a comma
{
len = length($11) # length of the 11th field
s = ""; d = "" # clear output string and the delimiter
for (i = 1; i <= len / 10; i++) { # iterate over the 11th field
s = s d substr($11, (i - 1) * 10 + 6, 4) # append characters 6-9 of the i-th 10-character chunk
d = "|" # set the delimiter to a pipe character
}
$11 = "\"" s "\"" # assign the 11th field to the generated string
} 1' "$input" # the final "1" tells awk to print every (modified) record

Example input:
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:
1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"

https://stackoverflow.com/questions/69717681
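A minimal end-to-end run of the awk program above (the file name sample.csv is assumed; redirect stdout to a file to get the equivalent of the question's final_out):

```shell
#!/usr/bin/env bash
# Write the answer's two sample records to a file and run the awk program.
input=sample.csv
cat > "$input" <<'EOF'
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew
EOF

awk -F, -v OFS="," '{
    len = length($11)                              # length of the 11th field
    s = ""; d = ""                                 # output string and delimiter
    for (i = 1; i <= len / 10; i++) {              # one iteration per 10-char chunk
        s = s d substr($11, (i - 1) * 10 + 6, 4)   # chars 6-9 of chunk i
        d = "|"
    }
    $11 = "\"" s "\""                              # replace f11 with the result
} 1' "$input"                                      # "1" prints every record
# prints:
# 1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
# 1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"
```

Because every line is handled in one awk process, the per-line fork overhead of the original loop disappears entirely.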