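I have two tab-delimited files, and I need to match the text in the first column of file 1 against any position in the lines of file 2. On a match, I want to append the second column of the matching file 1 line to the end of the matching line in file 2 (see the example below).
I know this can almost certainly be done with awk, but I am not good with awk or sed. I searched here for related questions and tried to adapt their scripts, without success. Any input would be appreciated.
File 1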
protein_1.p1 note "PJD5F7, match to databaseID=64575, (species X)";
protein_1.p2 note "PJD5F7, match to databaseID=64575, (species X)";
protein_3.p1 note "PA5F9H, match to databaseID=93689, (species W)";
protein_4.p1 note "Q7GT5J, match to databaseID=89045, (species Y)";
protein_4.p3 note "YE6G3L, match to databaseID=44968, (species Z)";
File 2
chromosome_1 programID transcript_id "protein_1.p1"; parent "protein_1";
chromosome_1 programID transcript_id "protein_1.p2"; parent "protein_1";
chromosome_1 programID transcript_id "protein_2.p1"; parent "protein_2";
chromosome_1 programID transcript_id "protein_2.p2"; parent "protein_2";
chromosome_1 programID transcript_id "protein_3.p1"; parent "protein_3";
chromosome_1 programID transcript_id "protein_4.p1"; parent "protein_4";
chromosome_1 programID transcript_id "protein_4.p2"; parent "protein_4";
chromosome_1 programID transcript_id "protein_4.p3"; parent "protein_4";
Desired output
chromosome_1 programID transcript_id "protein_1.p1"; parent "protein_1"; note "PJD5F7, match to databaseID=64575, (species X)";
chromosome_1 programID transcript_id "protein_1.p2"; parent "protein_1"; note "PJD5F7, match to databaseID=64575, (species X)";
chromosome_1 programID transcript_id "protein_2.p1"; parent "protein_2";
chromosome_1 programID transcript_id "protein_2.p2"; parent "protein_2";
chromosome_1 programID transcript_id "protein_3.p1"; parent "protein_3"; note "PA5F9H, match to databaseID=93689, (species W)";
chromosome_1 programID transcript_id "protein_4.p1"; parent "protein_4"; note "Q7GT5J, match to databaseID=89045, (species Y)";
chromosome_1 programID transcript_id "protein_4.p2"; parent "protein_4";
chromosome_1 programID transcript_id "protein_4.p3"; parent "protein_4"; note "YE6G3L, match to databaseID=44968, (species Z)";
Posted on 2020-11-25 06:48:57
We can parse file1, mapping values ($2) to keys ($1), then parse file2 and append the value to a line whenever part of that line ($3) matches any key.
BEGIN {OFS = FS = "\t"}
FNR == NR {arr[$1] = $2; next}
{for (x in arr) if ($3 ~ x) {$0 = $0 " " arr[x]; break}}
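{print}
This prints the correct result for your example, but it is not really what you want, for several reasons. First, it can fail in various cases, such as protein_1.p1 versus protein_1.p11: $3 ~ x treats each key as a regular expression matched anywhere in the field, so a shorter ID also matches a longer one. Second, performance: the time spent per line of file2 is not constant but grows with the size of file1.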
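The first failure mode can be reproduced with a minimal sketch. The one-line inputs below are hypothetical (they are not the files from the question), and the awk program is the first script unchanged: file1 knows only protein_1.p1, while file2 contains only protein_1.p11.

```shell
# Hypothetical one-line inputs, tab-separated like the real files.
tmp=$(mktemp -d)
printf 'protein_1.p1\tnote "A";\n' > "$tmp/file1"
printf 'chromosome_1\tprogramID\ttranscript_id "protein_1.p11"; parent "protein_1";\n' > "$tmp/file2"

# Same logic as the first script: every key is tried as a regex against field 3.
out=$(awk '
BEGIN {OFS = FS = "\t"}
FNR == NR {arr[$1] = $2; next}
{for (x in arr) if ($3 ~ x) {$0 = $0 " " arr[x]; break}}
{print}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -r "$tmp"
```

The printed line ends in note "A"; even though protein_1.p11 never appears in file1, because the key protein_1.p1 matches as a substring of the longer ID.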
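So we have to modify the script above. You probably want to define a regular expression for the protein string to be matched. That makes the match strict enough, and on the second pass the time depends on one regex match against the field rather than on the array size.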
BEGIN {OFS = FS = "\t"; re = "\\<protein_[0-9]+\\.p[0-9]+\\>"}
FNR == NR {if ($1 ~ re) arr[$1] = $2; next}
match($3, re) {key = substr($3, RSTART, RLENGTH); if (key in arr) $0 = $0 " " arr[key]}
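{print}
Notes:
re: "protein_" followed by one or more digits, ".p", and one or more digits, all enclosed in word boundaries (a GNU awk regex feature). The dot is literal. Word characters are [:alnum:] and _, so anything else counts as a boundary. The first field of file1 is also checked against this pattern.
match(): when it succeeds, the built-in variables RSTART and RLENGTH hold the index and length of the matched string, and this substring is what we use as the hash key.
Usage: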
> awk -f tst.awk file1 file2
chromosome_1 programID transcript_id "protein_1.p1"; parent "protein_1"; note "PJD5F7, match to databaseID=64575, (species X)";
chromosome_1 programID transcript_id "protein_1.p2"; parent "protein_1"; note "PJD5F7, match to databaseID=64575, (species X)";
chromosome_1 programID transcript_id "protein_2.p1"; parent "protein_2";
chromosome_1 programID transcript_id "protein_2.p2"; parent "protein_2";
chromosome_1 programID transcript_id "protein_3.p1"; parent "protein_3"; note "PA5F9H, match to databaseID=93689, (species W)";
chromosome_1 programID transcript_id "protein_4.p1"; parent "protein_4"; note "Q7GT5J, match to databaseID=89045, (species Y)";
chromosome_1 programID transcript_id "protein_4.p2"; parent "protein_4";
chromosome_1 programID transcript_id "protein_4.p3"; parent "protein_4"; note "YE6G3L, match to databaseID=44968, (species Z)";
https://unix.stackexchange.com/questions/621367