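I was given two unquoted, single-column TSV files (exported from a database) containing the names of a few thousand people, and I need to find the names that appear in both files. Both files are UTF-8, CRLF-terminated, and start with the BOM 0xEF 0xBB 0xBF.
A simple join or comm command would do the job, but the names differ in several ways: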
# cat file1.tsv
A. Einstein
Louis Pasteur
Diego Armando Maradona
Isaac Newton
Frava D’onä
D Rüge
Françoise Barré-Sinoussi

# cat file2.tsv
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Marie-Louise Von FRANZ
Dimitri Rüge

The expected matches from file2.tsv would be:
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Dimitri Rüge

I wrote this bash/sed/awk/grep script, which dynamically generates regular expressions that match the last names:
#!/bin/bash

# UTF-8 encodings of the combining characters, read as 16-bit integers:
# U+0300 = 0xCC80 = 52352
# U+033F = 0xCCBF = 52415
# U+0340 = 0xCD80 = 52608
# U+036E = 0xCDAE = 52654
_COMBINING_CHARS_=()
for i in {52352..52415} {52608..52654}
do
    hex=$(printf %04X "$i")
    _COMBINING_CHARS_+=( "$(printf '\x'"${hex:0:2}"'\x'"${hex:2:2}")" )
done
# Join the byte sequences with "|" to form an ERE alternation
_COMBINING_CHARS_ERE_=$(IFS='|'; printf %s "${_COMBINING_CHARS_[*]}")

# Function that removes the BOM, CRLF, and COMBINING characters:
sanitize() {
    LANG=C sed -E \
        -e $'1s/^\xEF\xBB\xBF//' \
        -e $'s/\r$//' \
        -e "s/$_COMBINING_CHARS_ERE_//g" \
        -- "$@"
}

# Function that generates a regex for the _lastname_:
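toERE() {
    awk '
        {
            # Keep only the last name: the text after the last comma,
            # or else the last whitespace-separated field
            if ( $0 ~ /,/ ) {
                n = split($0, a, ",");
                $0 = a[n];
            } else {
                $0 = $NF
            }
            sub("^[[:space:]]+","");
            sub("[[:space:]]+$","");
            gsub("[[:space:]-]+"," ");
        }
        {
            ere = ""
            sep = "";
            for ( nf = 1; nf <= NF; nf++ ) {
                # Put the separator between words, not in front of the whole regex
                ere = ere sep
                n = split($nf, c, "");
                for ( i = 1; i <= n; i++ ) {
                    # One equivalence class per character
                    ere = ere "[[=" c[i] "=]]"
                }
                sep = "[[:space:]-]+"
            }
            print ere "[[:space:]]*$"
        }
    ' < <(sanitize "$@")
}

grep -E -f <(toERE "$1") <(sanitize "$2")

Unfortunately, the result for the given input is: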
grep: illegal byte sequence

The UTF-8 multibyte characters seem to be the problem, but I can't figure out a way to handle them with awk.
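One possible direction, sketched here under stated assumptions rather than as a definitive fix: skip the generated regexes and compare a normalized last-name key instead, with awk running in the C locale so that it only ever sees bytes. The names `match_names` and `key` are hypothetical, and the fold table is an assumption that covers only the accented characters of the sample data, in precomposed form; real data would need a longer table or a proper Unicode normalization step.

```shell
#!/bin/sh
# Sketch: print the lines of file2 whose normalized last name also occurs in file1.
# Assumption: the fold table handles only the characters seen in the sample data.

match_names() {   # usage: match_names file1.tsv file2.tsv
    LC_ALL=C awk -v bom="$(printf '\357\273\277')" '
    # Build the comparison key: last name, accents folded, lower-cased.
    function key(s,    n, a) {
        sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s)
        if (index(s, ",")) {
            n = split(s, a, ",")          # "First, LAST": text after the last comma
        } else {
            n = split(s, a, /[ \t]+/)     # otherwise: the last word
        }
        s = a[n]
        # Explicit fold table (sample characters only)
        gsub(/ä/, "a", s); gsub(/ü/, "u", s)
        gsub(/ç/, "c", s); gsub(/é/, "e", s)
        gsub(/’/, "\047", s)              # typographic apostrophe -> ASCII quote
        gsub(/[ \t-]+/, " ", s)           # hyphens and runs of blanks -> one space
        sub(/^ /, "", s); sub(/ $/, "", s)
        return tolower(s)                 # ASCII-only lowering in the C locale
    }
    { sub(/\r$/, "") }                    # strip the CR of CRLF
    FNR == 1 { sub("^" bom, "") }         # strip the UTF-8 BOM
    NR == FNR { seen[key($0)] = 1; next } # first file: remember the keys
    key($0) in seen                       # second file: print matching lines
    ' "$1" "$2"
}
```

Called as `match_names file1.tsv file2.tsv` on the sample files above, this should print the six expected lines. Matching on the last name alone is deliberately loose, so unrelated people who share a surname would also be reported.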
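Answered on 2022-02-14 13:10:14
What about agrep? From man agrep: search a file for a string or regular expression, with approximate matching capabilities. It isn't perfect, as we'll see: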
$ while IFS= read -r line
do
    printf '%s: ' "$line"
    agrep -B -y "$line" file1
done < file2

Output:
Diego A. Maradona: agrep: 1 word matches within 6 errors
Maradona, Diego Armando
Albert Einstein: agrep: 1 word matches within 5 errors
A. Einstein
Louis Pasteur: Louis Pasteur
frava dona: agrep: 2 words match within 4 errors
Maradona, Diego Armando
Fräva Dona

It's a good example, because the last three lines already show a problem.
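Answered on 2022-02-15 10:03:30
I suggest the following trick:
cat file1.csv file2.csv | sort | uniq -d

Explanation
cat file1.csv file2.csv concatenates both files, one after the other
sort puts identical lines next to each other
uniq -d prints only the lines that have duplicates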
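As a small sketch of what this pipeline can and cannot do: it reports only lines that are byte-for-byte identical in both files, so among the expected matches in the question's sample it would find only Louis Pasteur. The helper name `exact_common` is hypothetical, and the per-file `sort -u` is an added assumption so that a name repeated inside a single file is not mistaken for a match.

```shell
#!/bin/sh
# Sketch: print the lines that occur, exactly, in both files.
exact_common() {   # usage: exact_common file1 file2
    # De-duplicate each file first, then look for lines seen twice overall.
    # LC_ALL=C makes sort and uniq compare raw bytes.
    { LC_ALL=C sort -u "$1"; LC_ALL=C sort -u "$2"; } | LC_ALL=C sort | LC_ALL=C uniq -d
}
```

Variants such as A. Einstein and Albert Einstein are different lines, so they never show up in the output.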
https://stackoverflow.com/questions/71111940