我有一个带有36列(6+30)的PLINK ped文件,如下所示:
FID IID PID MID SEX PHENO SNP_1a SNP_1b SNP_2a SNP_2b SNP_3a SNP_3b SNP_4a SNP_4b SNP_5a SNP_5b SNP_6a SNP_6b SNP_7a SNP_7b SNP_8a SNP_8b SNP_9a SNP_9b SNP_10a SNP_10b SNP_11a SNP_11b SNP_12a SNP_12b SNP_13a SNP_13b SNP_14a SNP_14b SNP_15a SNP_15b
A1 A1 0 0 1 1 0 0 0 0 2 2 1 2 1 2 1 2 2 1 2 1 1 1 1 2 0 0 0 0 0 0 2 1 2 2
A2 A2 0 0 1 1 1 1 1 1 0 0 1 2 2 2 2 2 1 1 0 0 2 1 2 2 0 0 0 0 0 0 1 1 0 0
A3 A3 0 0 1 2 1 1 1 1 0 0 2 2 2 2 2 2 1 1 0 0 1 1 2 2 0 0 0 0 0 0 1 1 0 0我有兴趣修改基因型栏(第7栏),以便:
因此,上面示例的输出将包含21列(6+15),如下所示:
FID IID PID MID SEX PHENO SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10 SNP_11 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
A1 A1 0 0 1 1 NA NA 2 2 2 2 2 2 1 2 NA NA NA 2 2
A2 A2 0 0 1 1 1 1 NA 2 2 2 1 NA 2 2 NA NA NA 1 NA
A3 A3 0 0 1 2 1 1 NA 2 2 2 1 NA 1 2 NA NA NA 1 NA我希望有人能帮我,谢谢你提前!
发布于 2016-08-18 14:12:21
$ cat > test.awk
NR>1{
for(i=j=7; i<NF; i+=2) # for fields 7-(NF-1)
$(j++) = ($i$(i+1)~/2/) ? "2" : (($i$(i+1)=="11") ? "1" : "NA") # see below *)
for (i=1; i<=21; i++) # reduced to 21 fields
printf "%-2s%s", $i,(i<21?OFS:ORS) # print
}
$ awk -f test.awk test.in
A1 A1 0 0 1 1 NA NA 2 2 2 2 2 2 1 2 NA NA NA 2 2
A2 A2 0 0 1 1 1 1 NA 2 2 2 1 NA 2 2 NA NA NA 1 NA
A3 A3 0 0 1 2 1 1 NA 2 2 2 1 NA 1 2 NA NA NA 1 NA如果规则1 (2或2)或2 (1和1)失败,则返回NA。
*) Catenate a和b字段($i$(i+1),在每次迭代中将2添加到i )并检查它们2或11,并将结果写入已经处理过的cols (即。字段7和8的结果存储到字段7、9和10到8等,每次迭代时j增长1)。
发布于 2016-08-18 15:09:20
下面是另一个带有头的神秘awk
$ awk '{for(i=1;i<7;i++) printf "%s ",$i;
for(i=7;i<=NF;i+=2)
{c=$i^2+$(i+1)^2;
r=(c>1)+(c>3);
printf "%2s ",(NR>1)?(r?r:"NA"):substr($i,1,length($i)-1)};
print ""}' file
FID IID PID MID SEX PHENO SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
A1 A1 0 0 1 1 NA NA 2 2 2 2 2 2 1 2 NA NA NA 2 2
A2 A2 0 0 1 1 1 1 NA 2 2 2 1 NA 2 2 NA NA NA 1 NA
A3 A3 0 0 1 2 1 1 NA 2 2 2 1 NA 1 2 NA NA NA 1 NA发布于 2016-08-18 14:04:46
你试过什么?我们不是来给你写代码的。
但是,由于我感觉很慷慨,这里有一个非常基本的Perl解决方案,它似乎适用于您的数据。我不打算解释,因为我认为你应该努力:-)
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Skip header
<DATA>;
while (<DATA>) {
chomp;
my @data = split;
my @fixed = @data[0..5];
my @snps = @data[6..$#data];
my @new_snps;
while (my ($s1, $s2) = splice @snps,0, 2) {
push @new_snps, summarise($s1, $s2);
}
say join ' ', @fixed, @new_snps;
}
sub summarise {
my ($s1, $s2) = @_;
return 2 if $s1 == 2 or $s2 == 2;
return 1 if $s1 == 1 and $s2 == 1;
return 'NA' if $s1 == 0 and $s2 == 0;
return '?';
}
__DATA__
FID IID PID MID SEX PHENO SNP_1a SNP_1b SNP_2a SNP_2b SNP_3a SNP_3b SNP_4a SNP_4b SNP_5a SNP_5b SNP_6a SNP_6b SNP_7a SNP_7b SNP_8a SNP_8b SNP_9a SNP_9b SNP_10a SNP_10b SNP_11a SNP_11b SNP_12a SNP_12b SNP_13a SNP_13b SNP_14a SNP_14b SNP_15a SNP_15b
A1 A1 0 0 1 1 0 0 0 0 2 2 1 2 1 2 1 2 2 1 2 1 1 1 1 2 0 0 0 0 0 0 2 1 2 2
A2 A2 0 0 1 1 1 1 1 1 0 0 1 2 2 2 2 2 1 1 0 0 2 1 2 2 0 0 0 0 0 0 1 1 0 0
A3 A3 0 0 1 2 1 1 1 1 0 0 2 2 2 2 2 2 1 1 0 0 1 1 2 2 0 0 0 0 0 0 1 1 0 0https://stackoverflow.com/questions/39018943
复制相似问题