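I have a dataframe in pandas with five columns: contig, length, identity, percent, and hit. The data is parsed from BLAST output and sorted by contig length and match percentage. My goal is to write only one row per unique contig to the output. Example of the current output: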
contig length identity percent hit
contig-100_0 5485 [1341/1341] [100.%] ['hit1']
contig-100_0 5485 [5445/5445] [100.%] ['hit2']
contig-100_0 5485 [59/59] [100.%] ['hit3']
contig-100_1 2865 [2865/2865] [100.%] ['hit1']
contig-100_2 2800 [2472/2746] [90.0%] ['hit1']
contig-100_3 2417 [2332/2342] [99.5%] ['hit1']
contig-100_4 2204 [2107/2107] [100.%] ['hit1']
contig-100_4 2000 [1935/1959] [98.7%] ['hit2']
I would like the output above to look like this:
contig length identity percent hit
contig-100_0 5485 [1341/1341] [100.%] ['hit1']
contig-100_1 2865 [2865/2865] [100.%] ['hit1']
contig-100_2 2800 [2472/2746] [90.0%] ['hit1']
contig-100_3 2417 [2332/2342] [99.5%] ['hit1']
contig-100_4 2204 [2107/2107] [100.%] ['hit1']
Below is the code I used to generate the output above:
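import pandas as pd

# `path` and `i` are not defined in this snippet; they come from the surrounding script
df = pd.read_csv(path + i, sep='\t', header=None, engine='python',
                 names=['contig', 'length', 'identity', 'percent', 'hit'])
df = df.sort_values(['length', 'percent'], ascending=[False, False])
top_hits = df.to_string(justify='left', index=False)
with open('sorted_contigs', 'a') as sortedfile:
    sortedfile.write(top_hits + "\n")
I know about the unique() method in pandas and think the syntax I need is df.contig.unique(), but I'm not sure where to put it in my code. I'm still learning pandas, so any help is much appreciated! Thanks.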
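Posted 2019-03-01 01:09:25
You can do this with DataFrame.groupby(<colname>).head(<num_of_rows>). Because the dataframe is already sorted, head(1) keeps the first (top-ranked) row for each contig:
df.groupby('contig').head(1)
Output: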
contig length identity percent hit
0 contig-100_0 5485 [1341/1341] [100.%] ['hit1']
3 contig-100_1 2865 [2865/2865] [100.%] ['hit1']
4 contig-100_2 2800 [2472/2746] [90.0%] ['hit1']
5 contig-100_3 2417 [2332/2342] [99.5%] ['hit1']
6 contig-100_4 2204 [2107/2107] [100.%] ['hit1']
https://stackoverflow.com/questions/54930819
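For anyone who wants to try this end to end, here is a minimal sketch of how the groupby/head step might slot into the question's script, right after the sort and before to_string. It assumes the sample rows from the question, read from an in-memory string split on whitespace instead of the real tab-separated file; the names raw, top and alt are made up for illustration. drop_duplicates is shown only as an alternative that should select the same rows; it is not part of the answer above.

```python
import io

import pandas as pd

# Sample rows copied from the question. They are whitespace-separated here
# for convenience; the real file in the question is tab-separated.
raw = """\
contig-100_0 5485 [1341/1341] [100.%] ['hit1']
contig-100_0 5485 [5445/5445] [100.%] ['hit2']
contig-100_0 5485 [59/59] [100.%] ['hit3']
contig-100_1 2865 [2865/2865] [100.%] ['hit1']
contig-100_2 2800 [2472/2746] [90.0%] ['hit1']
contig-100_3 2417 [2332/2342] [99.5%] ['hit1']
contig-100_4 2204 [2107/2107] [100.%] ['hit1']
contig-100_4 2000 [1935/1959] [98.7%] ['hit2']
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["contig", "length", "identity", "percent", "hit"])

# Same sort as in the question, so the preferred hit comes first per contig.
df = df.sort_values(["length", "percent"], ascending=[False, False])

# Keep the first row of each contig group; row order of df is preserved.
top = df.groupby("contig").head(1)

# Alternative: drop later rows that repeat an already-seen contig.
alt = df.drop_duplicates(subset="contig", keep="first")

# Text that would be written to the output file in the question's script.
top_hits = top.to_string(justify="left", index=False)
print(top_hits)
```

Either top or alt can then be passed to to_string in place of df. df.contig.unique() would return only the contig names, not the full rows, which is why it does not fit here on its own.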