Q: Hashing using sha1sum with awk

Stack Overflow user

Asked 2015-01-13 20:10:33

2 answers, 2.2K views, 0 followers, 4 votes

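I have a pipe-delimited file with about 20 columns. I want to hash only the first column, which is a number like an account number, using sha1sum, and return the rest of the columns unchanged.

What is the best way to do this using awk or sed?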

Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...

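Above is an example of the text file showing just 3 columns. Only the first column should have the hash function applied to it. The result should look like: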

Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

Tags: hash, awk, sed, sha1

2 Answers

Stack Overflow user

Accepted answer

Answered 2015-01-13 20:44:12

Here's an executable awk script that does what you want:

#!/usr/bin/awk -f

BEGIN { FS=OFS="|" }

FNR != 1 { $1 = encodeData( $1 ) }

47

function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
    }

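Here's the breakdown:

Set the input and output field separators to |
When the line is not the first (header) line, reassign $1 to the encoded value.
When 47 is true (always), print the whole line.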
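
The bare 47 works because any non-zero number is a true condition in awk, and a pattern with no action defaults to printing the record. A minimal sketch of that behavior, where the function name print_all and the sample input are made up for illustration:

```shell
# A pattern with no action block prints the record whenever the pattern is true.
# Any non-zero number is true, so 47 behaves like the more commonly seen 1.
print_all() {
    awk '47'
}

# Illustrative input only; both lines come back unchanged.
printf 'a|b\nc|d\n' | print_all
```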

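Here's the breakdown of the encodeData function:

Create a cmd that feeds the data to sha1sum
Pipe it to getline
Close the cmd
On my system sha1sum prints extra information after the hash, so I discard it by splitting the output with split
Return the first field of the sha1sum output.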
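
To see why the split step is needed, it may help to run the same pipeline by hand. The sketch below assumes GNU coreutils sha1sum, which prints the digest followed by a file-name column (- for standard input). Note that echo appends a newline, so the digest covers the account number plus that newline. The helper name hash_field is made up for this illustration.

```shell
# Hypothetical helper mirroring what encodeData asks the shell to do.
hash_field() {
    # echo adds a trailing newline, which becomes part of the hashed data.
    # sha1sum prints "<digest>  -"; awk keeps only the first field.
    echo "$1" | sha1sum | awk '{ print $1 }'
}

# Raw sha1sum output, including the trailing "  -" that split() discards.
echo 8238438 | sha1sum

# Digest only.
hash_field 8238438
```

If the consumer of the file expects digests of the bare number without the newline, printf '%s' would be the usual substitute for echo, but that would change every hash shown in this thread.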

With your data I get the following:

Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

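Run it by calling awk.script data (or ./awk.script data if you use bash). The script needs execute permission for this to work.

EDIT by EdMorton: sorry for the edit, but your script above is the right approach and just needs some tweaks to make it more robust. Showing them here is much easier than trying to describe them in a comment: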

$ cat tst.awk
BEGIN { FS=OFS="|" }

NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }

function encodeData( fld,       cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

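The f[] array decouples the script from hard-coding the number of the field that needs to be hashed. The additional arguments to the function make cmd and output local, so they are always empty/zero on each invocation. The if around getline means that a failed call will not return the value from a previous successful one (see http://awk.info/?tip/getline). The rest is mostly style/preference, with a small performance improvement.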
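
The getline point is easy to miss, so here is a small sketch of the failure mode. It assumes a shell where the true command produces no output; the function and variable names are illustrative. When the command yields nothing, getline returns 0 and leaves the target variable untouched, so an unchecked call would hand back the previous record's value.

```shell
stale_getline_demo() {
    awk 'BEGIN {
        # First call succeeds and fills "out".
        "echo first" | getline out
        close("echo first")

        # "true" prints nothing, so getline reaches end-of-input and returns 0.
        # "out" still holds the value from the earlier call.
        status = ("true" | getline out)
        close("true")

        print status, out
    }'
}

stale_getline_demo
```

Checking for a return value greater than 0, as the edited script does, is what prevents the stale value from being written into the next record.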

Votes: 2

Stack Overflow user

Answered 2015-01-13 20:39:00

What the Best Way™ would be is up for debate. One way to do it with awk is

awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\  -f 1"); command | getline hash; close(command); $1 = hash; print }' filename

That is:

BEGIN {
  OFS = FS          # set output field separator to field separator; we will use
                    # it because we meddle with the fields.
}
NR == 1 {           # first line: just print headers.
  print
}
NR != 1 {           # from there on do the hash/replace
  # this constructs a shell command (and runs it) that echoes the field
  # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
  # and gets it back into awk with getline (into the variable hash)
  # the gsub bit is to prevent the shell from barfing if there's an apostrophe
  # in one of the fields.
  gsub(/'/, "'\\''", $1);
  command = ("echo '" $1 "' | sha1sum -b | cut -d\\  -f 1")
  command | getline hash
  close(command)

  # then replace the field and print the result.
  $1 = hash
  print
}

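You'll notice differences between the shell command at the top and the awk code at the bottom; they are all due to shell expansion. Because I put the awk code in single quotes in the shell command (double quotes are out of the question in that context, what with $1 and all), and because the code itself contains single quotes, making it work inline leads to a nightmare of backslashes. My advice is therefore to put the awk code into a file, say foo.awk, and run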

awk -F'|' -f foo.awk filename

instead.
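
One way to follow that advice from a shell session is a here-document with a quoted delimiter, which passes the awk code through untouched so no extra backslashes are needed. This is only a sketch: foo.awk is the file name suggested above, while sample.txt and its contents are made up for the example, and sha1sum -b is assumed to behave as in GNU coreutils.

```shell
# Write the awk program verbatim; quoting 'EOF' disables shell expansion.
cat > foo.awk <<'EOF'
BEGIN { OFS = FS }
NR == 1 { print; next }
{
  # Escape apostrophes so the single-quoted shell word stays intact.
  gsub(/'/, "'\\''", $1)
  command = "echo '" $1 "' | sha1sum -b | cut -d\\  -f 1"
  command | getline hash
  close(command)
  $1 = hash
  print
}
EOF

# Made-up sample data, including a first field that contains an apostrophe.
printf '%s\n' 'Accountid|Time' "o'brien|20140101021301" > sample.txt

awk -F'|' -f foo.awk sample.txt
```

The awk code in the file is the same as the readable version above, which is the point: nothing has to be re-escaped for the shell.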

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/27930643
