This is a follow-up to this question about merging two files with protein data.
When I import a DataFrame using the biopandas package, I cannot get duplicated/drop_duplicates to drop my duplicates. My DataFrame is fairly large:
# df:
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 ATOM N SER 15 17.203 0.286 72.985 4pxz
1 ATOM CA SER 15 16.713 1.342 73.869 4pxz
2 ATOM C SER 15 17.885 2.188 74.412 4pxz
3 ATOM O SER 15 18.028 3.351 74.013 4pxz
4 ATOM CB SER 15 15.889 0.750 75.014 4pxz
... ... ... ... ... ... ... ... ...
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
3148 rows × 8 columns
I want to check it for duplicates with the following command:
df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type (col3), abbreviation (col4) and number (col5)
I get:
col1 col2 col3 col4 col5 col6 col7 col8
2132 ATOM CA HIS 1063 38.442 -16.479 -5.209 4pxz
2136 ATOM CB HIS 1063 37.502 -15.555 -6.008 4pxz
2138 ATOM CG HIS 1063 38.007 -15.211 -7.378 4pxz
2140 ATOM ND1 HIS 1063 38.342 -16.194 -8.293 4pxz
2142 ATOM CD2 HIS 1063 38.213 -14.000 -7.943 4pxz
2144 ATOM CE1 HIS 1063 38.749 -15.553 -9.375 4pxz
2146 ATOM NE2 HIS 1063 38.688 -14.231 -9.213 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
Expected output:
col1 col2 col3 col4 col5 col6 col7 col8 col9
606 ATOM CA ARG 93 11.357 9.429 58.493 4pxz
609 ATOM CB ARG 93 12.236 9.564 59.757 4pxz
610 ATOM CG ARG 93 13.088 8.333 60.120 4pxz
611 ATOM CD ARG 93 13.985 7.822 58.995 4pxz
612 ATOM NE ARG 93 14.503 6.485 59.295 4pxz
613 ATOM CZ ARG 93 15.012 5.642 58.400 4pxz
614 ATOM NH1 ARG 93 15.074 5.979 57.116 4pxz
615 ATOM NH2 ARG 93 15.455 4.453 58.780 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
As you can see, it does not follow the instruction given to duplicated() (drop_duplicates behaves in exactly the same way). I would have to use:
df2 = df[df['col5'] == 93]
What is going wrong?
Posted on 2019-10-16 16:08:38
Isn't the command df.duplicated?
Also make sure to pass the option keep=False.
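To illustrate why keep=False matters here: with the default keep='first', duplicated() flags every repeat except the first occurrence, which would explain why the ARG 93 rows from 4pxz are missing from the output above while the hatp ones show up. The following is only a minimal sketch on made-up toy data (the `structure` column name and the five rows are assumptions for illustration, not the original DataFrame):

```python
import pandas as pd

# Hypothetical toy data: two ARG 93 atoms present in both structures,
# plus one SER 15 atom that occurs only once.
df = pd.DataFrame({
    "col3": ["CA", "CB", "N", "CA", "CB"],
    "col4": ["ARG", "ARG", "SER", "ARG", "ARG"],
    "col5": [93, 93, 15, 93, 93],
    "structure": ["4pxz", "4pxz", "4pxz", "hatp", "hatp"],  # assumed column name
})
key = ["col3", "col4", "col5"]

# Default keep='first': the first occurrence of each key is NOT flagged.
default_mask = df.duplicated(subset=key)

# keep=False: every row whose key occurs more than once is flagged.
all_mask = df.duplicated(subset=key, keep=False)

print(df[default_mask])  # only the later (hatp) copies
print(df[all_mask])      # both the 4pxz and the hatp copies
```

With the default, only the second copy of each atom is selected; with keep=False both copies come back, which matches the expected output in the question.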
Posted on 2019-10-16 16:51:53
The correct answer is:
df2 = df[df.duplicated(subset = ['col3','col4','col5'], keep = False)]
Thank you very much, guys!
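Since the question notes that drop_duplicates behaves the same way, here is a rough sketch of how the keep argument plays out there too. The toy frame and the `structure` column name are assumptions made up for illustration; only `subset` and `keep` are the actual pandas parameters:

```python
import pandas as pd

# Hypothetical toy data, not the original biopandas DataFrame.
atoms = pd.DataFrame({
    "col3": ["CA", "CB", "N", "CA", "CB"],
    "col4": ["ARG", "ARG", "SER", "ARG", "ARG"],
    "col5": [93, 93, 15, 93, 93],
    "structure": ["4pxz", "4pxz", "4pxz", "hatp", "hatp"],  # assumed column name
})
key = ["col3", "col4", "col5"]

# Default keep='first': one representative of each key survives.
first_kept = atoms.drop_duplicates(subset=key)

# keep=False: every row that has a duplicate is dropped,
# leaving only keys that occur exactly once.
unique_only = atoms.drop_duplicates(subset=key, keep=False)

print(first_kept)
print(unique_only)
```

So drop_duplicates(keep=False) returns the complement of the rows selected by df[df.duplicated(..., keep=False)].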
https://stackoverflow.com/questions/58408151