文章/答案/技术大牛

发布

社区首页 >问答首页 >AWK -用一个键比较两个文件并打印一个摘要(丢失，相同，不同)

问AWK -用一个键比较两个文件并打印一个摘要(丢失，相同，不同)
EN

Stack Overflow用户

提问于 2017-10-04 13:45:11

回答 2查看 163关注 0票数 1

我有两个像这样的文件。两个文件都是在第一、第二字段中排序的。(一个ID可以有多行)

归档a

3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1

档案b

3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1

第一行是一个键；3个字段，管道分开。文件大约是1gb。

我想要做的是返回一个结果集，看起来如下：

3333 rows in File A
4444 rows in File B
1234 rows are identical 
2345 rows are different (aka the 2nd/3rd field are different but the key matches)
111  rows in File A not in File B
222  rows in File B not in File A

这是实现它的SQL代码，这是我的退路。

--CREATE TABLE aws_hash_compare (the_filename VARCHAR(100) NOT NULL, switch_id BIGINT, hash_value CHAR(40),the_count TINYINT)

--CREATE UNIQUE CLUSTERED INDEX ucidx__awshashcompare__the_filename__switch_id ON aws_hash_compare(the_filename, switch_id)

DECLARE @mSsql_filename sysname = 'FileA'
DECLARE @mYsql_filename sysname = 'FileB'


SELECT COUNT(*) AS MSSQL FROM aws_hash_compare 
WHERE the_filename = @mSsql_filename

SELECT COUNT(*) AS MYSQL FROM aws_hash_compare 
WHERE the_filename = @mYsql_filename 


SELECT COUNT(*) AS switch_id_match FROM aws_hash_compare mysql 
INNER JOIN aws_hash_compare mssql 
ON mysql.the_filename = @mYsql_filename 
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id

SELECT COUNT(*) AS complete_match FROM aws_hash_compare mysql 
INNER JOIN aws_hash_compare mssql 
ON mysql.the_filename = @mYsql_filename 
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND mssql.hash_value = mysql.hash_value
AND mssql.the_count = mysql.the_count

SELECT COUNT(*) AS hash_differences FROM aws_hash_compare mysql 
INNER JOIN aws_hash_compare mssql 
ON mysql.the_filename = @mYsql_filename 
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND (mssql.hash_value <> mysql.hash_value OR mssql.the_count <> mysql.the_count)

SELECT COUNT(*) AS missing_from_MSSQL FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename 
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename 
                AND mssql.switch_id = mysql.switch_id)

SELECT COUNT(*) AS missing_from_MYSQL FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename
                AND mssql.switch_id = mysql.switch_id)

compare

awk

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-10-04 14:01:00

跟随awk也能帮助你。

awk -F"|" '
FNR==NR{
  a[$0]=$0;
  b[$1];
  next
}
FNR==1{
  file1_count=(NR-1) " rows in " ARGV[1]
}
($1 in b) && !($0 in a){
  first_field_matching++
}
($0 in a){
  common++;
  delete a[$0];
  next
}
{
  found_in_B_not_in_A++
}
END{
  found_in_A_not_in_B=length(a);
  print file1_count RS FNR " rows in " ARGV[2] RS common " rows are identical" \
RS first_field_matching " rows are different (aka the 2nd/3rd field are\
different but the key matches)" RS found_in_A_not_in_B " rows in File A\
not in File B" RS found_in_B_not_in_A " rows in File B not in File A"
}
' file_A file_B

假设下面是文件A和文件B(我对您提供的Input_files做了一个小更改，以验证$1与其他条件相同的一个条件)。

cat file_A
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|786af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1

cat file_B
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1

现在，当我们运行上面的代码时，下面将是相同的输出。

5 rows in file_A
5 rows in file_B
2 rows are identical
1 rows are different (aka the 2nd/3rd field aredifferent but the key matches)
3 rows in File Anot in File B
3 rows in File B not in File A

票数 2

Stack Overflow用户

发布于 2017-10-04 15:16:29

下面是一个使用comm比较文件的版本，然后使用awk生成结果。它可能较慢，但可能使用较少的内存。comm要求对其输入文件进行排序。

我假设每个文件都有一个密钥出现一次。

comm filea fileb | awk -F'\t' '
    BEGIN { na = nb = identical = common = 0 }
    $1 {
        split($1, f, /[|]/)
        if (f[1] in b) {common++; delete b[f[1]]} else {a[f[1]]}
        na++
    }
    $2 {
        split($2, f, /[|]/)
        if (f[1] in a) {common++; delete a[f[1]]} else {b[f[1]]}
        nb++
    }
    $3 {
        identical++
        na++
        nb++
    }
    END {
        printf "%d rows in file A\n", na
        printf "%d rows in file B\n", nb
        printf "%d rows are identical\n", identical
        printf "%d rows are different but share a key\n", common
        printf "%d rows in file A only\n", length(a)
        printf "%d rows in file B only\n", length(b)
    }
'

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/46566530

复制

相似问题

问AWK -用一个键比较两个文件并打印一个摘要(丢失，相同，不同)
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问AWK -用一个键比较两个文件并打印一个摘要(丢失，相同，不同)EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问AWK -用一个键比较两个文件并打印一个摘要(丢失，相同，不同)
EN