I'm working with a data file that looks like this:
A B C D E F G H
ctg.s1.000000F_arrow CDS gene 21215 22825 0 + . DAFEIOHN_00017
ctg.s1.000000F_arrow CDS gene 21215 22825 0 + . DAFEIOHN_00017
ctg.s1.000000F_arrow CDS gene 64501 66033 0 - . DAFEIOHN_00049
ctg.s1.000000F_arrow CDS gene 70234 78846 0 + . DAFEIOHN_00053
ctg.s1.000000F_arrow CDS gene 103455 106526 0 + . DAFEIOHN_00074
ctg.s1.000000F_arrow CDS gene 161029 161712 0 + . DAFEIOHN_00132
ctg.s1.000000F_arrow CDS gene 170711 171520 0 + . DAFEIOHN_00142
ctg.s1.000000F_arrow CDS gene 203959 204450 0 - . DAFEIOHN_00174
ctg.s1.000000F_arrow CDS gene 211381 212196 0 + . DAFEIOHN_00184
ctg.s1.000000F_arrow CDS gene 236673 238499 0 + . DAFEIOHN_00209
ctg.s1.000000F_arrow CDS gene 533077 533850 0 + . DAFEIOHN_00475
ctg.s1.000000F_arrow CDS gene 533995 535194 0 + . DAFEIOHN_00572
ctg.s1.000000F_arrow CDS gene 641146 643083 0 + . DAFEIOHN_00572
As you can see, column H contains repeated elements, such as DAFEIOHN_00017 or DAFEIOHN_00572. I would like to modify this dataframe to get the following:
A B C D E F G H I
ctg.s1.000000F_arrow CDS gene 21215 22825 0 + . DAFEIOHN_00017 2
ctg.s1.000000F_arrow CDS gene 64501 66033 0 - . DAFEIOHN_00049 1
ctg.s1.000000F_arrow CDS gene 70234 78846 0 + . DAFEIOHN_00053 1
ctg.s1.000000F_arrow CDS gene 103455 106526 0 + . DAFEIOHN_00074 1
ctg.s1.000000F_arrow CDS gene 161029 161712 0 + . DAFEIOHN_00132 1
ctg.s1.000000F_arrow CDS gene 170711 171520 0 + . DAFEIOHN_00142 1
ctg.s1.000000F_arrow CDS gene 203959 204450 0 - . DAFEIOHN_00174 1
ctg.s1.000000F_arrow CDS gene 211381 212196 0 + . DAFEIOHN_00184 1
ctg.s1.000000F_arrow CDS gene 236673 238499 0 + . DAFEIOHN_00209 1
ctg.s1.000000F_arrow CDS gene 533077 533850 0 + . DAFEIOHN_00475 1
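ctg.s1.000000F_arrow CDS gene 533995 535194 0 + . DAFEIOHN_00572 2
In the second dataframe, each repeated element appears only once, and a new column I gives the number of occurrences of each element of column H.
How can I do this?
Thanks.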
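Posted 2021-12-25 00:13:08
You can use drop_duplicates to remove rows that are duplicated in a given column, and assign to create a new column. The new column holds the values returned by combining groupby('H') with transform('count'), which gives the number of rows for each unique value of H.
df = df.drop_duplicates(subset='H').assign(I=df.groupby('H')['H'].transform('count'))
Output: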
>>> df
A B C D E F G H I
0 ctg.s1.000000F_arrow CDS-gene 21215 22825 0 + . DAFEIOHN_00017 2
2 ctg.s1.000000F_arrow CDS-gene 64501 66033 0 - . DAFEIOHN_00049 1
3 ctg.s1.000000F_arrow CDS-gene 70234 78846 0 + . DAFEIOHN_00053 1
4 ctg.s1.000000F_arrow CDS-gene 103455 106526 0 + . DAFEIOHN_00074 1
5 ctg.s1.000000F_arrow CDS-gene 161029 161712 0 + . DAFEIOHN_00132 1
6 ctg.s1.000000F_arrow CDS-gene 170711 171520 0 + . DAFEIOHN_00142 1
7 ctg.s1.000000F_arrow CDS-gene 203959 204450 0 - . DAFEIOHN_00174 1
8 ctg.s1.000000F_arrow CDS-gene 211381 212196 0 + . DAFEIOHN_00184 1
9 ctg.s1.000000F_arrow CDS-gene 236673 238499 0 + . DAFEIOHN_00209 1
10 ctg.s1.000000F_arrow CDS-gene 533077 533850 0 + . DAFEIOHN_00475 1
11 ctg.s1.000000F_arrow CDS-gene 533995 535194 0 + . DAFEIOHN_00572 2
Posted 2021-12-25 00:09:23
We can use a groupby and count the elements as follows:
df.groupby('H').count()
https://stackoverflow.com/questions/70477243
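For anyone who wants to try both suggestions side by side, here is a minimal, self-contained sketch. It assumes pandas is installed and uses a cut-down frame with only columns A, C and H (the values are taken from the question; treating C as the start coordinate is my assumption based on the printed output). The names counts, result and per_value are just illustrative.

```python
import pandas as pd

# Cut-down version of the question's data: only A, C and H are kept for brevity.
df = pd.DataFrame({
    "A": ["ctg.s1.000000F_arrow"] * 5,
    "C": [21215, 21215, 64501, 533995, 641146],
    "H": ["DAFEIOHN_00017", "DAFEIOHN_00017", "DAFEIOHN_00049",
          "DAFEIOHN_00572", "DAFEIOHN_00572"],
})

# Number of rows sharing each H value, broadcast back onto every row.
counts = df.groupby("H")["H"].transform("count")

# Attach the count as column I, then keep only the first row per H value.
result = (
    df.assign(I=counts)
      .drop_duplicates(subset="H")
      .reset_index(drop=True)
)

# The groupby idea narrowed to a single count per H value (a Series indexed by H).
per_value = df.groupby("H").size()

print(result)
print(per_value)
```

Note that drop_duplicates keeps the first row of each group by default, so for DAFEIOHN_00572 the row starting at 533995 is the one that survives, which matches the desired output in the question. Also, df.groupby('H').count() on its own returns the non-null count for every remaining column rather than one column I, so it would still need to be joined back onto the deduplicated rows.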