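I am trying to parse a MedLine file into a 0/1 table for downstream statistical analysis: PCA, GWAS and so on. I formatted it with the Biopython module Bio.Medline plus a few extra shell commands. Now I am stuck.
I need to convert File 1 (a key-value file with one paper per line and tab-separated keywords) into a file with the keywords collapsed into columns and the presence or absence of each keyword shown as 1 or 0.
I would like to do this in Perl, but other solutions are welcome.
Thanks, Bernardo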
File 1
19801464	Animals Biodiversity	Computational Biology/methods	DNA
19696045	Environmental Microbiology	Computational Biology/methods	Software

Expected output:
          Animals Biodiversity  Computational Biology/methods  DNA  Environmental Microbiology  Software
19801464  1                     1                              1    0                           0
19696045  0                     1                              0    1                           1

Posted on 2014-07-16 05:13:41
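This Perl script builds a hash that you should be able to work with. For convenience I use List::MoreUtils for uniq and Data::Printer (loaded through its DDP alias) for dumping the data structure: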
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use DDP;    # short alias for Data::Printer; exports p()

my %paper;         # paper ID => { category => 1 or 0 }
my @categories;    # every category seen, in input order

while (<DATA>) {
    chomp;
    my @record = split /\t/;    # ID first, then the tab-separated keywords
    $paper{ $record[0] } = { map { $_ => 1 } @record[ 1 .. $#record ] };
    push @categories, @record[ 1 .. $#record ];
}
@categories = uniq @categories;

# Fill in 0 for every category a paper does not have (//= needs perl 5.10+)
foreach (keys %paper) {
    foreach my $category (@categories) {
        $paper{$_}{$category} //= 0;
    }
}

p %paper;

__DATA__
19801464	Animals Biodiversity	Computational Biology/methods	DNA
19696045	Environmental Microbiology	Computational Biology/methods	Software

Output:
{
    19696045   {
        'Animals Biodiversity'            0,
        'Computational Biology/methods'   1,
        DNA                               0,
        'Environmental Microbiology'      1,
        Software                          1
    },
    19801464   {
        'Animals Biodiversity'            1,
        'Computational Biology/methods'   1,
        DNA                               1,
        'Environmental Microbiology'      0,
        Software                          0
    }
}

Getting from there to the output you want may require printf to format the lines properly. The following may be sufficient for your purposes:
print "\t", (join " ", @categories);
for (keys %paper) {
    print "\n", $_, "\t\t";
    for my $category (@categories) {
        print $paper{$_}{$category}, " " x 17;
    }
}

Edit
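A few alternatives for formatting the output (in each one, x repeats a format field once per element of the @categories array, so the number of fields matches the number of columns):

Using format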
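# Build the format definition as a string: one numeric field per category
my $format_line = 'format STDOUT =' . "\n"
    . '@# ' x scalar(@categories) . "\n"
    . '@{ $paper{$num} }{@categories}' . "\n"    # hash slice keeps @categories order
    . '.' . "\n";
for my $num (keys %paper) {
    print $num;
    no warnings 'redefine';
    eval $format_line;    # (re)define the format while $num is in scope
    write;
}

Using printf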
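print " " x 9, join(" ", @categories), "\n";
for my $num (keys %paper) {
    print $num;
    # The hash slice returns values in @categories order, matching the header;
    # plain values() would return them in arbitrary hash order.
    printf "%19d", $_ for @{ $paper{$num} }{@categories};
    print "\n";
}

Using form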
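use Perl6::Form;
for my $num (keys %paper) {
    # One left-justified field for the ID, then one field per category
    print form
        "{<<<<<<<<}" . "{>}" x scalar(@categories),
        $num, @{ $paper{$num} }{@categories};
}

Depending on what you plan to do with the data, you may be able to do the rest of the analysis in Perl, so the exact print format may not be a priority until a later stage of the workflow. See BioPerl for ideas.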
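
The fixed widths used above (17 spaces, %19d) only line up when every category name happens to be about that wide. As a language-neutral illustration of sizing each column from its header instead, here is a rough Python sketch. The function name format_table and the gap parameter are made up for this example, and the rows are the two papers from the question.

```python
def format_table(categories, rows, gap=2):
    """Return an aligned text table where each column is as wide as its header.

    categories: list of column names
    rows: list of (paper_id, [0/1 flag per category])
    gap: number of spaces between columns (hypothetical parameter)
    """
    id_width = max([len(pid) for pid, _ in rows] + [0])
    # Header: blank space over the ID column, then each category name
    lines = [" " * id_width + "".join(" " * gap + c for c in categories)]
    for pid, flags in rows:
        # Right-justify each flag to the width of its own header
        cells = "".join(
            " " * gap + str(flag).rjust(len(cat))
            for cat, flag in zip(categories, flags)
        )
        lines.append(pid.ljust(id_width) + cells)
    return "\n".join(lines)


categories = [
    "Animals Biodiversity",
    "Computational Biology/methods",
    "DNA",
    "Environmental Microbiology",
    "Software",
]
rows = [
    ("19801464", [1, 1, 1, 0, 0]),
    ("19696045", [0, 1, 0, 1, 1]),
]
table = format_table(categories, rows)
print(table)
```

The same idea carries over to Perl by building the printf width for each column from length($category).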
Posted on 2014-07-16 08:51:30
You can do this with Python and Pandas:
# assumes pandas has already been imported with: import pandas as pd
In [1]: df = pd.read_table("file", header=None, sep="\t", names=["A", "B", "C", "D"], index_col=0)

In [2]: df
Out[2]:
                                   B                              C  \
A
19801464        Animals Biodiversity  Computational Biology/methods
19696045  Environmental Microbiology  Computational Biology/methods

                 D
A
19801464       DNA
19696045  Software

In [3]: b = pd.get_dummies(df.B)

In [4]: c = pd.get_dummies(df.C)

In [5]: d = pd.get_dummies(df.D)

In [6]: presence_absence = b.merge(c, right_index=True, left_index=True).merge(d, right_index=True, left_index=True)

In [7]: presence_absence
Out[7]:
          Animals Biodiversity  Environmental Microbiology  \
A
19801464                     1                           0
19696045                     0                           1

          Computational Biology/methods  DNA  Software
A
19801464                              1    1         0
19696045                              1    0         1

Hope this helps.
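
The get_dummies session above names exactly three keyword columns (B, C, D), so it assumes every paper has three keywords. If the number of keywords varies from line to line, something like the following standard-library sketch might be a starting point. The function name presence_absence and the "id" header label are my own choices, and the sample string assumes the fields in the question are tab-separated.

```python
import io


def presence_absence(lines):
    """Build a 0/1 table from 'id<TAB>keyword<TAB>keyword...' lines.

    Returns (categories, rows): categories in order of first appearance,
    rows as a list of (paper_id, [0/1 flag per category]).
    """
    papers = []        # (paper_id, set of keywords), in input order
    categories = {}    # dict used as an insertion-ordered set
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue   # skip blank lines
        keywords = [kw for kw in fields[1:] if kw]
        papers.append((fields[0], set(keywords)))
        for kw in keywords:
            categories.setdefault(kw, None)
    cats = list(categories)
    rows = [(pid, [int(c in kws) for c in cats]) for pid, kws in papers]
    return cats, rows


# Sample data from the question (assumed tab-separated)
sample = (
    "19801464\tAnimals Biodiversity\tComputational Biology/methods\tDNA\n"
    "19696045\tEnvironmental Microbiology\tComputational Biology/methods\tSoftware\n"
)

cats, rows = presence_absence(io.StringIO(sample))
print("\t".join(["id"] + cats))
for pid, flags in rows:
    print("\t".join([pid] + [str(flag) for flag in flags]))
```

To run it on a real file, pass an open file handle instead of the StringIO object. The tab-separated result can then be loaded into pandas or R for the PCA step.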
https://stackoverflow.com/questions/24770909