New shell script...
I have a huge CSV file with a variable-length field 11 (f11), such as
"000000aad0000bhb200000uwwed"
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
...
After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk. Then I have to merge the extracted pieces with the delimiter '|' (pieces like 0aba, bbrb, 0wwq, caba, 0bhb, 0qwe)
and join the processed f11 back with the other fields.
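For a single value, the split-and-extract step can be sketched with standard coreutils; the sample string here is taken from the example input further down, and '|' is the delimiter used in the processing code:

```shell
# Split one f11 value into 10-character chunks, keep characters 6-9 of
# each chunk, and join the pieces with '|'.
printf '%s\n' '000000aaad000000bhb200000uwwed' |
  fold -w10 |      # one 10-character chunk per line
  cut -c6-9 |      # keep characters 6-9 of each chunk
  paste -sd'|' -   # join all lines with a '|' delimiter
# prints: 0aaa|0bhb|uwwe
```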
Time taken to process 10K records:
real 4m43.506s
user 0m12.366s
sys 0m12.131s
20K records:
real 5m20.244s
user 2m21.591s
sys 3m20.042s
80K records (roughly 3.7 million f11 splits and merges with '|'):
real 21m18.854s
user 9m41.944s
sys 13m29.019s
The expected time for processing 650K records is 30 minutes (roughly 56 million f11 splits and merges). Is there an optimized way to do this?
while read -r line1; do
f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
echo $f10 >> $path/other_fields
f11=$( echo $line1 | cut -d',' -f11 )
f11_trim=$(echo "$f11" | tr -d '"')
echo $f11_trim | fold -w10 > $path/f11_extract
cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
arr=($(cat $path/str_list_trim))
printf "%s|" ${arr[@]} >> $path/str_list_serialized
printf '\n' >> $path/str_list_serialized
arr=()
rm $path/f11_extract
rm $path/str_list_trim
done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized
paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

Posted on 2021-10-26 06:11:03
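For comparison, the f11 handling inside the loop can be done without forking a single external command, using only bash parameter expansion. This is a sketch of the idea (the function name is mine, not part of the original script):

```shell
#!/usr/bin/env bash
# Extract characters 6-9 of every 10-character chunk of a string using
# only bash parameter expansion -- no subprocesses are forked.
extract_chunks() {
    local f11=${1//\"/}           # drop double quotes (replaces tr -d '"')
    local out="" sep="" i
    for ((i = 0; i + 10 <= ${#f11}; i += 10)); do
        out+=$sep${f11:i+5:4}     # chars 6-9 of the chunk (0-based offset 5)
        sep="|"
    done
    printf '%s\n' "$out"
}

extract_chunks '000000aaad000000bhb200000uwwed'   # -> 0aaa|0bhb|uwwe
```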
Your code is not time-efficient because it forks several external commands (echo, cut, tr, fold, awk, cat, rm) for every single input line inside the loop.
You can do the whole job with a single awk program:
awk -F, -v OFS="," ' # assign input/output field separator to a comma
{
len = length($11) # length of the 11th field
s = ""; d = "" # clear output string and the delimiter
for (i = 1; i <= len / 10; i++) { # iterate over the 11th field
s = s d substr($11, (i - 1) * 10 + 6, 4) # append characters 6-9 of the i-th 10-character chunk
d = "|" # set the delimiter to a pipe character
}
$11 = "\"" s "\"" # assign the 11th field to the generated string
} 1' "$input" # the final "1" tells awk to print every (modified) record

Example input:
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:
1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"

https://stackoverflow.com/questions/69717681
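A minimal end-to-end run of the awk program above (the file name sample.csv is assumed; redirect stdout to a file to get the equivalent of the question's final_out):

```shell
#!/usr/bin/env bash
# Write the answer's two sample records to a file and run the awk program.
input=sample.csv
cat > "$input" <<'EOF'
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew
EOF

awk -F, -v OFS="," '{
    len = length($11)                              # length of the 11th field
    s = ""; d = ""                                 # output string and delimiter
    for (i = 1; i <= len / 10; i++) {              # one iteration per 10-char chunk
        s = s d substr($11, (i - 1) * 10 + 6, 4)   # chars 6-9 of chunk i
        d = "|"
    }
    $11 = "\"" s "\""                              # replace f11 with the result
} 1' "$input"                                      # "1" prints every record
# prints:
# 1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
# 1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"
```

Because every line is handled in one awk process, the per-line fork overhead of the original loop disappears entirely.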